PyTorch Cheat Sheet#

A comprehensive quick-reference covering tensor operations, autograd, model building, training loops, data loading, and debugging patterns. Designed to complement Chapters 29–31 of this course.

How to Use This Page

This is a pure reference document – none of the code needs to be run in sequence; it is a collection of patterns you can copy and adapt. Use Ctrl+F / Cmd+F to search for specific topics.

1. Tensor Basics#

Creation#

import torch

# From Python data
x = torch.tensor([1, 2, 3])                # from list
x = torch.tensor([[1, 2], [3, 4]])          # 2D from nested list

# Standard constructors
x = torch.zeros(3, 4)                       # all zeros
x = torch.ones(2, 3)                        # all ones
x = torch.full((2, 3), 7.0)                 # filled with 7.0
x = torch.empty(3, 4)                       # uninitialized (fast)
x = torch.eye(3)                            # 3x3 identity matrix

# Random tensors
x = torch.randn(3, 4)                       # standard normal N(0,1)
x = torch.rand(3, 4)                        # uniform [0, 1)
x = torch.randint(0, 10, (3, 4))            # random integers in [0, 10)

# Sequences
x = torch.arange(0, 10, 2)                  # [0, 2, 4, 6, 8]
x = torch.linspace(0, 1, 100)               # 100 points in [0, 1]

# Like-constructors (match shape/dtype/device of existing tensor)
y = torch.zeros_like(x)
y = torch.randn_like(x)

NumPy Interop#

# NumPy -> PyTorch (shared memory -- no copy!)
x = torch.from_numpy(np_array)

# PyTorch -> NumPy (shared memory on CPU)
np_array = x.numpy()              # CPU tensor only
np_array = x.cpu().numpy()        # safe for GPU tensors
np_array = x.detach().numpy()     # safe if requires_grad=True
np_array = x.detach().cpu().numpy()  # safest -- works in all cases

Shared Memory Warning

torch.from_numpy() and .numpy() share the underlying memory buffer. Modifying one will modify the other. Use .clone() if you need an independent copy.
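
A minimal demonstration of the shared buffer and the .clone() escape hatch:

import numpy as np

a = np.zeros(3)
t = torch.from_numpy(a)                # t and a share the same memory
t.add_(1)                              # in-place change to the tensor...
print(a)                               # ...shows up in the NumPy array: [1. 1. 1.]

t_copy = torch.from_numpy(a).clone()   # independent copy
t_copy.add_(1)
print(a)                               # unchanged: [1. 1. 1.]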

Properties#

x.shape          # torch.Size([3, 4]) -- dimensions
x.dtype          # torch.float32 -- data type
x.device         # device(type='cpu') or device(type='cuda', index=0)
x.requires_grad  # True/False -- gradient tracking
x.ndim           # number of dimensions (same as len(x.shape))
x.numel()        # total number of elements
x.is_contiguous()  # memory layout check

Data Types#

| PyTorch dtype | Alias | Notes |
|---|---|---|
| torch.float32 | torch.float | Default for floats. Use this for training. |
| torch.float64 | torch.double | Double precision. Rarely needed. |
| torch.float16 | torch.half | Half precision. Used for mixed-precision training. |
| torch.bfloat16 | | Brain floating point. Better range than float16. |
| torch.int64 | torch.long | Default for integers. Required for class labels. |
| torch.int32 | torch.int | 32-bit integer. |
| torch.bool | | Boolean tensor. |

# Type casting
x = x.float()                 # -> float32
x = x.long()                  # -> int64
x = x.to(torch.float16)       # explicit dtype

2. Tensor Operations#

NumPy vs. PyTorch Equivalents#

| Operation | NumPy | PyTorch | Notes |
|---|---|---|---|
| Reshape | np.reshape(x, shape) | x.view(shape) or x.reshape(shape) | view requires contiguous memory |
| Flatten | x.flatten() | x.view(-1) or x.flatten() | |
| Concatenate | np.concatenate([a, b]) | torch.cat([a, b], dim=0) | Along existing dim |
| Stack | np.stack([a, b]) | torch.stack([a, b], dim=0) | Creates new dim |
| Split | np.split(x, n) | torch.chunk(x, n, dim=0) | |
| Transpose | x.T | x.T or x.permute(...) | |
| Squeeze | np.squeeze(x) | x.squeeze() | Remove dims of size 1 |
| Unsqueeze | np.expand_dims(x, 0) | x.unsqueeze(0) | Add dim of size 1 |
| Matrix multiply | a @ b | a @ b or torch.mm(a, b) | |
| Batch matmul | np.matmul(a, b) | torch.bmm(a, b) | For 3D tensors |
| Element-wise | a * b, a + b | a * b, a + b | Same syntax |
| Sum | np.sum(x, axis=0) | x.sum(dim=0) | axis vs dim |
| Mean | np.mean(x, axis=0) | x.mean(dim=0) | |
| Argmax | np.argmax(x, axis=0) | x.argmax(dim=0) | |
| Clamp | np.clip(x, a, b) | x.clamp(min=a, max=b) | |
| Where | np.where(cond, a, b) | torch.where(cond, a, b) | |

Indexing and Slicing#

# Same syntax as NumPy
x[0]              # first row
x[:, 1]           # second column
x[0:3, :]         # first three rows
x[x > 0]          # boolean indexing
x[[0, 2, 4]]      # fancy indexing

# Useful for batches
x[..., -1]        # last element along final dim (Ellipsis)

Broadcasting Rules#

Same rules as NumPy:

  1. Dimensions are compared from the right (trailing dimensions).

  2. Two dimensions are compatible if they are equal or one of them is 1.

  3. Missing dimensions on the left are treated as size 1.

# Example: (4, 3) + (3,) -> (4, 3)
# Example: (4, 1) + (1, 3) -> (4, 3)
# Example: (2, 1, 3) + (4, 1) -> (2, 4, 3)
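
A quick runnable check of the rules above:

a = torch.randn(4, 1)
b = torch.randn(1, 3)
print((a + b).shape)            # torch.Size([4, 3])

a = torch.randn(2, 1, 3)
b = torch.randn(4, 1)
print((a + b).shape)            # torch.Size([2, 4, 3])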

In-Place Operations#

# Trailing underscore = in-place
x.add_(1)          # x = x + 1
x.mul_(2)          # x = x * 2
x.zero_()          # x = 0
x.fill_(5)         # x = 5
x.clamp_(0, 1)     # clamp in-place

Avoid In-Place on Grad Tensors

In-place operations on tensors that require gradients can cause errors during backpropagation. PyTorch needs the original values to compute gradients, and in-place ops destroy them.
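
A sketch of how this bites in practice -- sigmoid saves its output for the backward pass, so modifying that output in place breaks gradient computation:

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)
y.mul_(2)                       # in-place op on a tensor autograd still needs
# y.sum().backward()            # raises: "... modified by an inplace operation"

y = torch.sigmoid(x)            # safe version: out-of-place op instead
z = y * 2
z.sum().backward()              # works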

3. Device Management#

# Detect available device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move tensors
x = x.to(device)                  # generic
x = x.cuda()                      # explicit GPU
x = x.cpu()                       # explicit CPU

# Move model (moves ALL parameters and buffers)
model = model.to(device)

# Create tensor directly on device
x = torch.randn(3, 4, device=device)

# Check device
x.device                           # device(type='cuda', index=0)
x.is_cuda                          # True / False

Common Pitfall

All tensors in an operation must be on the same device. You cannot add a CPU tensor to a CUDA tensor. If you get a “RuntimeError: expected all tensors to be on the same device” error, check that both your data and model are on the same device.
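
A quick way to diagnose the mismatch before calling the model (a small sketch, assuming model and x already exist):

print(next(model.parameters()).device)       # where the model lives
print(x.device)                              # where the data lives
x = x.to(next(model.parameters()).device)    # move the data to match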

# Apple Silicon (MPS backend)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

# Multi-GPU: specify GPU index
device = torch.device('cuda:0')    # first GPU
device = torch.device('cuda:1')    # second GPU

4. Autograd#

PyTorch’s automatic differentiation engine. Every operation on tensors with requires_grad=True is recorded on a dynamic computation graph. Calling .backward() traverses this graph in reverse to compute gradients.

Basic Gradient Computation#

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward()            # compute dy/dx
print(x.grad)           # tensor(8.)  -- dy/dx = 2x + 2 = 8

Gradient Control#

# Disable gradient tracking (inference, evaluation)
with torch.no_grad():
    pred = model(x)           # no graph built, saves memory

# Alternative: inference mode (even more memory efficient)
with torch.inference_mode():
    pred = model(x)

# Detach tensor from computation graph
x_detached = x.detach()       # shares data, no grad tracking

# Prevent gradient for specific parameters
for param in model.encoder.parameters():
    param.requires_grad = False   # freeze encoder

Zeroing Gradients#

# CRITICAL: PyTorch accumulates gradients by default!
optimizer.zero_grad()          # preferred in training loop

# Manual alternatives
param.grad = None              # modern PyTorch preferred
param.grad.zero_()             # in-place zeroing
model.zero_grad()              # zero all model params

Why Gradients Accumulate

Gradient accumulation is by design – it enables computing effective gradients over multiple mini-batches when GPU memory is too small for a single large batch. But if you forget optimizer.zero_grad(), gradients from previous iterations contaminate the current update. This is one of the most common PyTorch bugs.
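
A tiny demonstration of the accumulation:

w = torch.tensor(1.0, requires_grad=True)
(w * 2).backward()
print(w.grad)                   # tensor(2.)
(w * 2).backward()
print(w.grad)                   # tensor(4.) -- both calls added together
w.grad = None                   # reset before the next update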

Inspecting the Computation Graph#

x = torch.tensor(2.0, requires_grad=True)
y = x * 3
z = y + 1

z.grad_fn                      # <AddBackward0> -- last operation
z.grad_fn.next_functions       # links to previous ops
y.is_leaf                      # False (created by an op)
x.is_leaf                      # True (user-created)

5. nn.Module Pattern#

The fundamental building block for all neural networks in PyTorch.

Custom Module Template#

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()             # ALWAYS call super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.layer2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.dropout(x)
        return self.layer2(x)

Inspecting Parameters#

model = MyModel()

# Iterate over all parameters
model.parameters()                          # iterator
list(model.parameters())                    # list of Parameter tensors

# Named parameters (for debugging, freezing)
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Total parameter count
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total}, Trainable: {trainable}")

# List sub-modules
model.children()                            # immediate children
model.modules()                             # all modules recursively
model.named_modules()                       # with names

Module State#

model.train()    # training mode: dropout active, batchnorm uses batch stats
model.eval()     # evaluation mode: dropout off, batchnorm uses running stats

model.training   # True / False -- check current mode

Always Set the Mode

Forgetting model.eval() before inference leads to non-deterministic predictions (dropout still drops) and incorrect batch normalization (uses batch stats instead of learned running statistics). Forgetting model.train() before training means dropout and batchnorm behave incorrectly.
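
A small check that the flag really changes behaviour, using dropout:

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))                  # roughly half the entries zeroed, the rest scaled by 2

drop.eval()
print(drop(x))                  # identity: all ones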

6. nn.Sequential Shortcut#

For simple feed-forward architectures where data flows linearly through layers.

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10),
)

With Named Layers#

from collections import OrderedDict

model = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 256)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(256, 10)),
]))

# Access by name
model.fc1.weight.shape   # torch.Size([256, 784])

When NOT to Use Sequential

Write a custom nn.Module when you need skip connections (ResNet), multiple inputs/outputs, conditional logic, or shared weights.
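
For example, a residual block needs the original input again after the layers, which a plain Sequential cannot express. A minimal sketch:

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = self.fc2(h)
        return torch.relu(x + h)      # skip connection: needs access to x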

7. Common Layers#

Core Layers#

| Layer | Code | Parameters | Input Shape | Output Shape |
|---|---|---|---|---|
| Fully connected | nn.Linear(in_f, out_f) | \(W\): (out, in), \(b\): (out,) | (*, in_f) | (*, out_f) |
| 1D convolution | nn.Conv1d(in_ch, out_ch, k) | \(W\): (out, in, k) | (N, C_in, L) | (N, C_out, L') |
| 2D convolution | nn.Conv2d(in_ch, out_ch, k) | \(W\): (out, in, k, k) | (N, C_in, H, W) | (N, C_out, H', W') |
| Max pool | nn.MaxPool2d(k) | None | (N, C, H, W) | (N, C, H/k, W/k) |
| Avg pool | nn.AvgPool2d(k) | None | (N, C, H, W) | (N, C, H/k, W/k) |
| Adaptive avg pool | nn.AdaptiveAvgPool2d((1,1)) | None | (N, C, H, W) | (N, C, 1, 1) |
| Batch norm (1D) | nn.BatchNorm1d(features) | \(\gamma\), \(\beta\) | (N, features) | (N, features) |
| Batch norm (2D) | nn.BatchNorm2d(channels) | \(\gamma\), \(\beta\) | (N, C, H, W) | (N, C, H, W) |
| Layer norm | nn.LayerNorm(shape) | \(\gamma\), \(\beta\) | (*, shape) | (*, shape) |
| Dropout | nn.Dropout(p=0.5) | None | any | same |
| Embedding | nn.Embedding(vocab, dim) | (vocab, dim) | (*) int | (*, dim) |

Recurrent Layers#

| Layer | Code | Notes |
|---|---|---|
| Simple RNN | nn.RNN(input_size, hidden_size, num_layers=1) | Vanilla recurrence |
| LSTM | nn.LSTM(input_size, hidden_size, num_layers=1) | Long short-term memory |
| GRU | nn.GRU(input_size, hidden_size, num_layers=1) | Gated recurrent unit |

# RNN usage pattern
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 10)    # (batch, seq_len, features)
output, (h_n, c_n) = rnn(x)    # output: (32, 50, 20), h_n: (2, 32, 20)

Activation Functions#

| Activation | Module | Functional | Formula |
|---|---|---|---|
| ReLU | nn.ReLU() | F.relu(x) | \(\max(0, x)\) |
| LeakyReLU | nn.LeakyReLU(0.01) | F.leaky_relu(x) | \(\max(0.01x, x)\) |
| Sigmoid | nn.Sigmoid() | torch.sigmoid(x) | \(\frac{1}{1+e^{-x}}\) |
| Tanh | nn.Tanh() | torch.tanh(x) | \(\frac{e^x - e^{-x}}{e^x + e^{-x}}\) |
| GELU | nn.GELU() | F.gelu(x) | \(x \cdot \Phi(x)\) |
| Softmax | nn.Softmax(dim=-1) | F.softmax(x, dim=-1) | \(\frac{e^{x_i}}{\sum_j e^{x_j}}\) |
| LogSoftmax | nn.LogSoftmax(dim=-1) | F.log_softmax(x, dim=-1) | \(\log\text{softmax}(x)\) |

Module vs. Functional

Use nn.ReLU() as a module attribute when you want it visible in print(model). Use F.relu(x) (from torch.nn.functional) inside forward() for a lighter touch. Both produce identical results – it is purely a style choice.
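
The two styles side by side (assuming the standard imports from the import cheat sheet):

fc = nn.Linear(4, 4)
x = torch.randn(2, 4)

relu = nn.ReLU()                # module style
out1 = relu(fc(x))

out2 = F.relu(fc(x))            # functional style (torch.nn.functional)

print(torch.equal(out1, out2))  # True -- identical results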

8. Loss Functions#

| Loss | Code | Use Case | Input | Target |
|---|---|---|---|---|
| Mean squared error | nn.MSELoss() | Regression | any shape | same shape |
| Mean absolute error | nn.L1Loss() | Robust regression | any shape | same shape |
| Cross-entropy | nn.CrossEntropyLoss() | Multi-class classification | (N, C) raw logits | (N,) class indices |
| Binary cross-entropy | nn.BCEWithLogitsLoss() | Binary / multi-label | (N, *) raw logits | (N, *) floats in [0,1] |
| Negative log-likelihood | nn.NLLLoss() | After log_softmax | (N, C) log-probs | (N,) class indices |
| Huber (smooth L1) | nn.SmoothL1Loss() | Robust regression | any shape | same shape |
| KL divergence | nn.KLDivLoss(reduction='batchmean') | Distribution matching | log-probs | probs |
| Cosine embedding | nn.CosineEmbeddingLoss() | Similarity learning | (N, D) | (N,) in {-1, 1} |

CrossEntropyLoss Includes Softmax

nn.CrossEntropyLoss() internally applies log_softmax before NLLLoss. Do not apply softmax to your model output when using this loss – you would be applying softmax twice, which is a common bug that leads to poor training.

# Classification example
criterion = nn.CrossEntropyLoss()
logits = model(x)                    # shape: (batch, num_classes) -- RAW
loss = criterion(logits, labels)     # labels: (batch,) of ints

# With class weights (for imbalanced data)
weights = torch.tensor([1.0, 2.0, 0.5])  # one per class
criterion = nn.CrossEntropyLoss(weight=weights)

# Ignoring padding tokens (NLP)
criterion = nn.CrossEntropyLoss(ignore_index=-100)
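
For binary or multi-label problems, nn.BCEWithLogitsLoss follows the same raw-logits convention but expects float targets of the same shape (a minimal sketch):

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 3)                       # (batch, num_labels) -- RAW, no sigmoid
targets = torch.randint(0, 2, (8, 3)).float()    # 0./1. floats, same shape as logits
loss = criterion(logits, targets)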

9. Optimizers#

Common Optimizers#

import torch.optim as optim

# Stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam (adaptive learning rate)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (decoupled weight decay -- generally preferred over Adam)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)

Per-Parameter Options#

# Different learning rates for different parts of the model
optimizer = optim.Adam([
    {'params': model.encoder.parameters(), 'lr': 1e-5},   # fine-tune slowly
    {'params': model.decoder.parameters(), 'lr': 1e-3},   # train faster
])

Learning Rate Schedulers#

from torch.optim.lr_scheduler import (
    StepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
)

# Step decay: multiply lr by gamma every step_size epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Exponential decay
scheduler = ExponentialLR(optimizer, gamma=0.95)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Reduce on plateau (watches a metric)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)

# One-cycle policy (best for super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# Usage in training loop
for epoch in range(num_epochs):
    train(...)  
    scheduler.step()                    # most schedulers
    # scheduler.step(val_loss)          # for ReduceLROnPlateau

| Optimizer | When to Use |
|---|---|
| SGD + Momentum | Classic choice; often best final accuracy with tuning |
| Adam | Good default; fast convergence; less sensitive to lr |
| AdamW | Preferred over Adam when using weight decay (most modern work) |
| RMSprop | RNNs, reinforcement learning |

10. Training Loop Template#

Standard Training Loop#

model = MyModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    # --- Training phase ---
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0
    
    for X_batch, y_batch in train_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        pred = model(X_batch)             # 1. Forward pass
        loss = criterion(pred, y_batch)   # 2. Compute loss
        
        optimizer.zero_grad()             # 3. Zero gradients
        loss.backward()                   # 4. Backward pass
        optimizer.step()                  # 5. Update weights
        
        train_loss += loss.item() * X_batch.size(0)
        correct += (pred.argmax(dim=1) == y_batch).sum().item()
        total += y_batch.size(0)
    
    train_loss /= total
    train_acc = correct / total
    
    # --- Validation phase ---
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            
            pred = model(X_batch)
            loss = criterion(pred, y_batch)
            
            val_loss += loss.item() * X_batch.size(0)
            val_correct += (pred.argmax(dim=1) == y_batch).sum().item()
            val_total += y_batch.size(0)
    
    val_loss /= val_total
    val_acc = val_correct / val_total
    
    print(f"Epoch {epoch+1}/{num_epochs}: "
          f"train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, "
          f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")

The 5 Sacred Steps

Every training iteration follows the same 5-step pattern:

  1. Forward – compute predictions

  2. Loss – measure error

  3. Zero – clear old gradients

  4. Backward – compute new gradients

  5. Step – update parameters

Step 3 can move earlier in the iteration (zero_grad may come before the forward pass), but the ordering constraint must hold: zero before backward, backward before step.

Gradient Clipping (for RNNs)#

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Gradient Accumulation (simulate larger batch)#

accumulation_steps = 4
optimizer.zero_grad()

for i, (X_batch, y_batch) in enumerate(train_loader):
    loss = criterion(model(X_batch.to(device)), y_batch.to(device))
    loss = loss / accumulation_steps    # normalize
    loss.backward()                     # accumulate gradients
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

11. Data Loading#

Custom Dataset#

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y, transform=None):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)
        self.transform = transform
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        sample = self.X[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample, self.y[idx]

DataLoader#

dataset = MyDataset(X_train, y_train)
loader = DataLoader(
    dataset,
    batch_size=64,           # samples per batch
    shuffle=True,            # randomize order each epoch
    num_workers=4,           # parallel data loading
    pin_memory=True,         # faster GPU transfer
    drop_last=True,          # drop incomplete final batch
)

# Iterate
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)
    break

Built-in Datasets (torchvision)#

from torchvision import datasets, transforms

# Standard transform pipeline
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),                      # PIL -> tensor, scale to [0,1]
    transforms.Normalize((0.5,), (0.5,)),       # normalize to [-1, 1]
])

# MNIST
train_data = datasets.MNIST(
    root='data/', train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
    root='data/', train=False, download=True, transform=transform
)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256, shuffle=False)

Common Datasets#

| Dataset | Code | Shape | Classes |
|---|---|---|---|
| MNIST | datasets.MNIST(...) | 1x28x28 | 10 digits |
| FashionMNIST | datasets.FashionMNIST(...) | 1x28x28 | 10 clothing |
| CIFAR-10 | datasets.CIFAR10(...) | 3x32x32 | 10 objects |
| CIFAR-100 | datasets.CIFAR100(...) | 3x32x32 | 100 objects |
| ImageNet | datasets.ImageNet(...) | 3x224x224 | 1000 objects |

Train/Validation Split#

from torch.utils.data import random_split

full_dataset = MyDataset(X, y)
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_set, val_set = random_split(full_dataset, [train_size, val_size])

12. Save and Load#

Full Checkpoint (model + optimizer + epoch)#

# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),
}, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth', weights_only=True)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

Avoid Saving the Entire Model

torch.save(model, 'model.pth') uses Python pickle, which ties the saved file to the exact class definition and directory structure. Saving state_dict() is portable and robust.
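
The recommended weights-only pattern (the file name here is just illustrative):

# Save only the weights
torch.save(model.state_dict(), 'weights.pth')

# Load: rebuild the architecture first, then fill in the weights
model = MyModel()
model.load_state_dict(torch.load('weights.pth', weights_only=True))
model.eval()                     # set the mode before inference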

13. Common Patterns and Idioms#

Flattening for Fully-Connected Layers#

# After conv layers, before FC layers
x = x.view(x.size(0), -1)        # flatten keeping batch dim
x = x.flatten(1)                  # equivalent, more explicit
x = nn.Flatten()(x)               # as a module (use in Sequential)

Weight Initialization#

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

model.apply(init_weights)         # applies recursively to all modules

Freezing Layers (Transfer Learning)#

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the classifier head
for param in model.classifier.parameters():
    param.requires_grad = True

# Only pass trainable params to optimizer
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-3
)

Extracting Scalar from Tensor#

loss_value = loss.item()          # Python float from 0-dim tensor
count = correct.item()            # Python int
# .item() only works on tensors with exactly one element

Reproducibility#

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)

# For full determinism (may slow things down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Mixed Precision Training#

from torch.amp import autocast, GradScaler

scaler = GradScaler()

for X_batch, y_batch in train_loader:
    optimizer.zero_grad()
    
    with autocast(device_type='cuda'):      # forward in float16
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
    
    scaler.scale(loss).backward()           # scaled backward
    scaler.step(optimizer)                  # unscale + step
    scaler.update()

14. CNN Architecture Patterns#

Basic CNN for Image Classification#

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),      # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1),      # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

Conv2d Output Size Formula#

\[H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1\]

| Config | Input 28x28 | Input 32x32 |
|---|---|---|
| Conv2d(*, *, 3, padding=1) | 28x28 | 32x32 |
| Conv2d(*, *, 3, padding=0) | 26x26 | 30x30 |
| Conv2d(*, *, 5, padding=2) | 28x28 | 32x32 |
| MaxPool2d(2) | 14x14 | 16x16 |
| Conv2d(*, *, 3, stride=2, padding=1) | 14x14 | 16x16 |
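
The formula can be checked against a layer's actual output. A small sketch (the helper name conv_out is just for illustration):

def conv_out(h_in, kernel_size, stride=1, padding=0):
    # matches the formula above (dilation assumed to be 1)
    return (h_in + 2 * padding - kernel_size) // stride + 1

conv = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 1, 28, 28)
print(conv(x).shape)                          # torch.Size([1, 8, 14, 14])
print(conv_out(28, 3, stride=2, padding=1))   # 14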

15. Debugging Tips#

Shape Debugging#

# Print shapes at each step in forward()
def forward(self, x):
    print(f"Input:      {x.shape}")
    x = self.conv1(x)
    print(f"After conv1: {x.shape}")
    x = self.pool(x)
    print(f"After pool:  {x.shape}")
    x = x.flatten(1)
    print(f"Flattened:   {x.shape}")
    return self.fc(x)

Common Errors and Fixes#

| Error | Likely Cause | Fix |
|---|---|---|
| RuntimeError: mat1 and mat2 shapes cannot be multiplied | Wrong Linear input size | Print shape before the Linear layer |
| RuntimeError: expected all tensors to be on the same device | Mixed CPU/CUDA tensors | .to(device) for both model and data |
| RuntimeError: element 0 of tensors does not require grad | Forgot requires_grad=True | Check input tensor settings |
| RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation | In-place op on grad tensor | Replace x += 1 with x = x + 1 |
| ValueError: expected target size (N, C) got (N,) | Wrong loss function target format | Check: nn.CrossEntropyLoss expects (N,) indices |
| Loss is nan | Exploding gradients, bad lr | Reduce lr, add gradient clipping |
| Loss stuck / not decreasing | Learning rate too low, or bug | Overfit on 1 batch first to verify model works |

NaN and Gradient Checks#

# Check for NaN in loss
assert not torch.isnan(loss), "Loss is NaN!"

# Check for NaN in gradients
for name, param in model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print(f"NaN gradient in {name}")

# Numerical gradient verification (gradcheck expects double-precision inputs with requires_grad=True)
torch.autograd.gradcheck(func, inputs, eps=1e-6, atol=1e-4)

# Check gradient magnitudes
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm():.4f}")

The “Overfit One Batch” Test#

# Sanity check: can the model memorize a single batch?
X_batch, y_batch = next(iter(train_loader))
X_batch, y_batch = X_batch.to(device), y_batch.to(device)

model.train()
for i in range(200):
    pred = model(X_batch)
    loss = criterion(pred, y_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % 50 == 0:
        print(f"Step {i}: loss={loss.item():.4f}")
# Loss should drop to ~0. If not, your model or data pipeline has a bug.

16. Quick Reference Card#

The 25 most-used PyTorch functions and patterns at a glance.

| # | Function / Pattern | What It Does |
|---|---|---|
| 1 | torch.tensor(data) | Create a tensor from Python data |
| 2 | torch.randn(shape) | Random tensor from standard normal |
| 3 | x.to(device) | Move tensor to CPU/GPU |
| 4 | x.shape | Tensor dimensions |
| 5 | x.view(shape) / x.reshape(shape) | Reshape tensor |
| 6 | x.requires_grad_(True) | Enable gradient tracking (in-place) |
| 7 | loss.backward() | Compute all gradients via backprop |
| 8 | x.grad | Access computed gradient |
| 9 | torch.no_grad() | Context manager to disable gradients |
| 10 | x.detach() | Detach tensor from computation graph |
| 11 | nn.Linear(in, out) | Fully connected layer |
| 12 | nn.Conv2d(in, out, k) | 2D convolution layer |
| 13 | nn.ReLU() | ReLU activation |
| 14 | nn.Sequential(...) | Chain layers into a model |
| 15 | nn.CrossEntropyLoss() | Classification loss (includes softmax) |
| 16 | nn.MSELoss() | Regression loss |
| 17 | optim.Adam(params, lr) | Adam optimizer |
| 18 | optimizer.zero_grad() | Zero all parameter gradients |
| 19 | optimizer.step() | Update parameters using gradients |
| 20 | model.train() | Set model to training mode |
| 21 | model.eval() | Set model to evaluation mode |
| 22 | model.parameters() | Iterator over model parameters |
| 23 | DataLoader(dataset, batch_size) | Batched data iterator |
| 24 | torch.save(state_dict, path) | Save model weights |
| 25 | loss.item() | Extract Python scalar from loss tensor |

17. Import Cheat Sheet#

import torch                              # core library
import torch.nn as nn                     # neural network modules
import torch.nn.functional as F           # functional API (relu, softmax, etc.)
import torch.optim as optim               # optimizers
from torch.utils.data import Dataset, DataLoader  # data utilities

import torchvision                        # vision datasets and transforms
from torchvision import datasets, transforms

import numpy as np                        # interop
import matplotlib.pyplot as plt           # plotting

References

  • Paszke, A., Gross, S., Massa, F., et al. (2019). “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” Advances in Neural Information Processing Systems 32.

  • PyTorch Documentation – official API reference.

  • PyTorch Tutorials – official learning resources.