[{"title":"PyTorch channels-last + AMP training loop","description":"Memory-format + mixed-precision combo that speeds up conv nets on tensor cores.","date":"2026-05-25","lang":"python","tags":["pytorch","performance","training"],"kind":"snippets","slug":"pytorch-channels-last-amp","body":"\nFor conv-heavy models on NVIDIA tensor cores, two cheap changes compound: move the\nmodel and inputs to the `channels_last` memory format, and run the forward pass under\n`autocast` while a `GradScaler` scales the loss. The memory format keeps activations in a layout the\ntensor cores prefer; AMP runs eligible ops in fp16 (or bf16) while the weights stay in fp32,\nand the scaler guards against fp16 gradient underflow.\n\n```python\nimport torch\n\n# Assumes model, loader, optimizer, criterion, and device are already defined.\nmodel = model.to(device, memory_format=torch.channels_last)\nscaler = torch.amp.GradScaler(\"cuda\")\n\nfor x, y in loader:\n    x = x.to(device, memory_format=torch.channels_last, non_blocking=True)\n    y = y.to(device, non_blocking=True)\n\n    optimizer.zero_grad(set_to_none=True)\n    with torch.autocast(\"cuda\", dtype=torch.float16):\n        loss = criterion(model(x), y)\n\n    # Scale the loss before backward; step() unscales and skips on inf/NaN grads.\n    scaler.scale(loss).backward()\n    scaler.step(optimizer)\n    scaler.update()\n```\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/snippets/pytorch-channels-last-amp"},{"title":"CUDA grid-stride loop for vector add","description":"The grid-stride pattern that lets one kernel handle any input size.","date":"2026-05-22","lang":"cuda","tags":["cuda","kernels","performance"],"kind":"snippets","slug":"cuda-vector-add-grid-stride","body":"\nA grid-stride loop decouples the launch configuration from the problem size: each\nthread processes multiple elements, striding by the total number of threads. 
This\nkeeps the kernel correct for any `n`, plays nicely with occupancy tuning, and lets\nyou reuse the same launch geometry across input sizes.\n\n```cuda\n__global__ void vadd(const float *a, const float *b, float *c, int n) {\n  // Total number of threads in the grid; each thread advances by this much.\n  int stride = blockDim.x * gridDim.x;\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {\n    c[i] = a[i] + b[i];\n  }\n}\n\n// Launch: this grid covers n with one element per thread. To size the grid to the\n// device instead, cap it (for example at a few blocks per SM); the loop handles the rest.\nint block = 256;\nint grid = (n + block - 1) / block;        // cap this for very large n\nvadd<<<grid, block>>>(d_a, d_b, d_c, n);\n```\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/snippets/cuda-vector-add-grid-stride"}]