Snippets

Small, correct, copy-paste-able pieces of code.

PyTorch channels-last + AMP training loop
python · 2026-05-25
Memory-format + mixed-precision combo that speeds up conv nets on tensor cores.
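For conv-heavy models on NVIDIA tensor cores, two cheap changes compound: move the model and inputs to `channels_last` memory format, and wrap the forward pass in `autocast`, paired with a `GradScaler` when the autocast dtype is fp16. The `channels_last` (NHWC) layout keeps activations in the order the tensor-core convolution kernels prefer, which avoids layout conversions around each conv. AMP runs eligible ops in fp16 or bf16 while the parameters themselves stay in fp32. The `GradScaler` multiplies the loss before `backward()` so small fp16 gradients do not underflow to zero; bf16 has the same exponent range as fp32 and usually does not need it.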
```python
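import torch

# Assumes model, loader, optimizer, criterion, and device are already defined.
model = model.to(device, memory_format=torch.channels_last)
scaler = torch.amp.GradScaler("cuda")  # loss scaling for fp16 autocast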

for x, y in loader:
    x = x.to(device, memory_format=torch.channels_last, non_blocking=True)
    y = y.to(device, non_blocking=True)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = criterion(model(x), y)

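    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/NaN
    scaler.update()                # adjusts the scale for the next iteration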
```
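To see why the `GradScaler` matters for fp16, here is a minimal pure-Python sketch of the underflow it guards against. It does not use PyTorch: `to_fp16` is a hypothetical helper that rounds through the standard library's half-precision `struct` format, `tiny_grad` is an illustrative value, and `65536.0` is assumed to match the scaler's default initial scale.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8   # illustrative gradient, below fp16's smallest subnormal (~6e-8)
scale = 65536.0    # 2**16, assumed to match GradScaler's default initial scale

unscaled = to_fp16(tiny_grad)          # stored directly in fp16: flushes to zero
scaled = to_fp16(tiny_grad * scale)    # scaled first: survives the fp16 round trip
recovered = scaled / scale             # unscale in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8, up to fp16 rounding error
```

The real scaler also watches for inf/NaN gradients and adjusts the scale over time, which this sketch leaves out.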
CUDA grid-stride loop for vector add
cuda · 2026-05-22
The grid-stride pattern that lets one kernel handle any input size.
A grid-stride loop decouples the launch configuration from the problem size: each thread processes multiple elements, striding by the total number of threads. This keeps the kernel correct for any `n`, plays nicely with occupancy tuning, and lets you reuse the same launch geometry across input sizes.
```cuda
__global__ void vadd(const float *a, const float *b, float *c, int n) {
  int stride = blockDim.x * gridDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    c[i] = a[i] + b[i];
  }
}

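// Launch: size the grid to the device, not to n.
int block = 256;
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);  // device 0 assumed
int needed = (n + block - 1) / block;      // blocks for one element per thread
int cap = 32 * num_sms;                    // heuristic cap; tune per kernel
int grid = needed < cap ? needed : cap;    // small n: fewer blocks; large n: threads loop
vadd<<<grid, block>>>(d_a, d_b, d_c, n);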
```
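The index arithmetic can be checked on the host without a GPU. The following is a small Python simulation, not CUDA: `grid_stride_indices` is a hypothetical helper that replays the kernel's loop for every (block, thread) pair and records which elements each one would touch, using a deliberately tiny launch of 2 blocks of 2 threads.

```python
def grid_stride_indices(n: int, grid: int, block: int) -> list[list[int]]:
    """Return, per simulated thread, the element indices it would process."""
    stride = grid * block  # total threads in the grid, as in the kernel
    per_thread = []
    for block_idx in range(grid):
        for thread_idx in range(block):
            start = block_idx * block + thread_idx
            # Same loop shape as the kernel: start at the global id, step by stride.
            per_thread.append(list(range(start, n, stride)))
    return per_thread


work = grid_stride_indices(n=10, grid=2, block=2)
covered = sorted(i for thread in work for i in thread)

print(work)     # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(covered)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Four threads cover ten elements, each element exactly once, which is why the grid can be capped independently of `n`.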
