PyTorch channels-last + AMP training loop
python · 2026-05-25
Memory-format + mixed-precision combo that speeds up conv nets on tensor cores.
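For conv-heavy models on NVIDIA tensor cores, two cheap changes compound: move the model and inputs to `channels_last` memory format, and wrap the forward pass in `autocast` with a `GradScaler`. `channels_last` stores activations in NHWC order, the layout the tensor-core convolution kernels prefer. `autocast` runs eligible ops in fp16 (or bf16) while the parameters stay in fp32; the `GradScaler` scales the loss so that small fp16 gradients do not underflow, and it is generally unnecessary with bf16.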
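As a rough illustration of what `channels_last` changes, the stride arithmetic can be sketched in plain Python. The helpers `contiguous_strides` and `channels_last_strides` are hypothetical names used only for this sketch, not PyTorch APIs; the values they return are the element strides one would expect `Tensor.stride()` to report for a 4-D tensor in each layout.

```python
def contiguous_strides(n, c, h, w):
    """Element strides in (N, C, H, W) order for the default NCHW layout."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides in (N, C, H, W) order when memory is laid out N, H, W, C."""
    return (h * w * c, 1, w * c, c)


shape = (8, 3, 32, 32)  # hypothetical batch of 8 RGB 32x32 images
print(contiguous_strides(*shape))     # (3072, 1024, 32, 1)
print(channels_last_strides(*shape))  # (3072, 1, 96, 3)
```

The logical shape stays `(N, C, H, W)` in both cases; only the order in memory differs. In `channels_last` the channel stride is 1, so all channels of one pixel sit next to each other.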
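```python
import torch

# Assumes model, loader, criterion, optimizer, and device are defined elsewhere.
model = model.to(device, memory_format=torch.channels_last)
scaler = torch.amp.GradScaler("cuda")

for x, y in loader:
    # Keep inputs in the same memory format as the model.
    x = x.to(device, memory_format=torch.channels_last, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    # Only the forward pass and the loss run under autocast.
    with torch.autocast("cuda", dtype=torch.float16):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration
```

CUDA grid-stride loop for vector add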
cuda · 2026-05-22
The grid-stride pattern that lets one kernel handle any input size.
A grid-stride loop decouples the launch configuration from the problem size: each thread processes multiple elements, striding by the total number of threads. This keeps the kernel correct for any `n`, plays nicely with occupancy tuning, and lets you reuse the same launch geometry across input sizes.
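```cuda
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    // Total number of threads in the grid; each thread advances by this much.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Launch (host side): this sizes the grid to one thread per element.
// Because of the loop, the grid can instead be capped to suit the device
// without touching the kernel.
int block = 256;
int grid = (n + block - 1) / block;  // cap this for very large n
vadd<<<grid, block>>>(d_a, d_b, d_c, n);
```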
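To see why the kernel stays correct when the grid is capped, the index arithmetic can be simulated on the host. This is a plain-Python sketch, not CUDA; `grid_stride_indices` and `covered` are hypothetical helpers that mirror the loop `for (i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)`.

```python
def grid_stride_indices(n, grid, block):
    """Map each global thread id to the list of element indices it visits."""
    stride = grid * block  # total threads in the launch
    visits = {}
    for block_idx in range(grid):
        for thread_idx in range(block):
            start = block_idx * block + thread_idx
            visits[start] = list(range(start, n, stride))
    return visits


def covered(n, grid, block):
    """All indices visited by any thread, sorted."""
    per_thread = grid_stride_indices(n, grid, block).values()
    return sorted(i for idxs in per_thread for i in idxs)


n = 1000
# Capped grid: 2 blocks of 64 threads still touch every element exactly once.
assert covered(n, grid=2, block=64) == list(range(n))
# Oversized grid: the surplus threads fail the i < n check and do nothing.
assert covered(n, grid=16, block=128) == list(range(n))

print(grid_stride_indices(10, grid=1, block=4)[1])  # [1, 5, 9]
```

Each index is visited by exactly one thread whatever `grid` and `block` are, which is what lets the launch geometry be tuned independently of `n`.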