PyTorch Cheat Sheet#
A comprehensive quick-reference covering tensor operations, autograd, model building, training loops, data loading, and debugging patterns. Designed to complement Chapters 29–31 of this course.
How to Use This Page
This is a pure reference document – no executable code, just patterns you can
copy and adapt. Use Ctrl+F / Cmd+F to search for specific topics.
1. Tensor Basics#
Creation#
import torch
# From Python data
x = torch.tensor([1, 2, 3]) # from list
x = torch.tensor([[1, 2], [3, 4]]) # 2D from nested list
# Standard constructors
x = torch.zeros(3, 4) # all zeros
x = torch.ones(2, 3) # all ones
x = torch.full((2, 3), 7.0) # filled with 7.0
x = torch.empty(3, 4) # uninitialized (fast)
x = torch.eye(3) # 3x3 identity matrix
# Random tensors
x = torch.randn(3, 4) # standard normal N(0,1)
x = torch.rand(3, 4) # uniform [0, 1)
x = torch.randint(0, 10, (3, 4)) # random integers in [0, 10)
# Sequences
x = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
x = torch.linspace(0, 1, 100) # 100 points in [0, 1]
# Like-constructors (match shape/dtype/device of existing tensor)
y = torch.zeros_like(x)
y = torch.randn_like(x)
NumPy Interop#
# NumPy -> PyTorch (shared memory -- no copy!)
x = torch.from_numpy(np_array)
# PyTorch -> NumPy (shared memory on CPU)
np_array = x.numpy() # CPU tensor only
np_array = x.cpu().numpy() # safe for GPU tensors
np_array = x.detach().numpy() # safe if requires_grad=True
np_array = x.detach().cpu().numpy() # safest -- works in all cases
Shared Memory Warning
torch.from_numpy() and .numpy() share the underlying memory buffer.
Modifying one will modify the other. Use .clone() if you need an independent copy.
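A minimal sketch of the difference, using a throwaway array for illustration:
import numpy as np
shared = np.ones(3)
t = torch.from_numpy(shared)               # shares memory with `shared`
t_copy = torch.from_numpy(shared).clone()  # independent copy
shared[0] = 99.0
print(t[0])       # tensor(99., dtype=torch.float64) -- change is visible
print(t_copy[0])  # tensor(1., dtype=torch.float64)  -- unaffected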
Properties#
x.shape # torch.Size([3, 4]) -- dimensions
x.dtype # torch.float32 -- data type
x.device # device(type='cpu') or device(type='cuda', index=0)
x.requires_grad # True/False -- gradient tracking
x.ndim # number of dimensions (same as len(x.shape))
x.numel() # total number of elements
x.is_contiguous() # memory layout check
Data Types#
| PyTorch dtype | Alias | Notes |
|---|---|---|
| `torch.float32` | `torch.float` | Default for floats. Use this for training. |
| `torch.float64` | `torch.double` | Double precision. Rarely needed. |
| `torch.float16` | `torch.half` | Half precision. Used for mixed-precision training. |
| `torch.bfloat16` | – | Brain floating point. Better range than float16. |
| `torch.int64` | `torch.long` | Default for integers. Required for class labels. |
| `torch.int32` | `torch.int` | 32-bit integer. |
| `torch.bool` | – | Boolean tensor. |
# Type casting
x = x.float() # -> float32
x = x.long() # -> int64
x = x.to(torch.float16) # explicit dtype
2. Tensor Operations#
NumPy vs. PyTorch Equivalents#
| Operation | NumPy | PyTorch | Notes |
|---|---|---|---|
| Reshape | `x.reshape(3, 4)` | `x.view(3, 4)` or `x.reshape(3, 4)` | |
| Flatten | `x.ravel()` | `x.flatten()` | |
| Concatenate | `np.concatenate([a, b], axis=0)` | `torch.cat([a, b], dim=0)` | Along existing dim |
| Stack | `np.stack([a, b])` | `torch.stack([a, b])` | Creates new dim |
| Split | `np.split(x, 2)` | `torch.chunk(x, 2)` | |
| Transpose | `x.T` | `x.T` or `x.transpose(0, 1)` | |
| Squeeze | `np.squeeze(x)` | `x.squeeze()` | Remove dims of size 1 |
| Unsqueeze | `np.expand_dims(x, 0)` | `x.unsqueeze(0)` | Add dim of size 1 |
| Matrix multiply | `a @ b` | `a @ b` or `torch.matmul(a, b)` | |
| Batch matmul | `np.matmul(a, b)` | `torch.bmm(a, b)` | For 3D tensors |
| Element-wise | `a + b`, `a * b` | `a + b`, `a * b` | Same syntax |
| Sum | `x.sum(axis=0)` | `x.sum(dim=0)` | |
| Mean | `x.mean(axis=0)` | `x.mean(dim=0)` | |
| Argmax | `np.argmax(x, axis=1)` | `x.argmax(dim=1)` | |
| Clamp | `np.clip(x, 0, 1)` | `x.clamp(0, 1)` | |
| Where | `np.where(cond, a, b)` | `torch.where(cond, a, b)` | |
Indexing and Slicing#
# Same syntax as NumPy
x[0] # first row
x[:, 1] # second column
x[0:3, :] # first three rows
x[x > 0] # boolean indexing
x[[0, 2, 4]] # fancy indexing
# Useful for batches
x[..., -1] # last element along final dim (Ellipsis)
Broadcasting Rules#
Same rules as NumPy:

- Dimensions are compared from the right (trailing dimensions).
- Two dimensions are compatible if they are equal or one of them is 1.
- Missing dimensions on the left are treated as size 1.
# Example: (4, 3) + (3,) -> (4, 3)
# Example: (4, 1) + (1, 3) -> (4, 3)
# Example: (2, 1, 3) + (4, 1) -> (2, 4, 3)
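A quick sanity check of these rules (shapes only -- values are arbitrary):
a = torch.randn(2, 1, 3)
b = torch.randn(4, 1)
print((a + b).shape)  # torch.Size([2, 4, 3])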
In-Place Operations#
# Trailing underscore = in-place
x.add_(1) # x = x + 1
x.mul_(2) # x = x * 2
x.zero_() # x = 0
x.fill_(5) # x = 5
x.clamp_(0, 1) # clamp in-place
Avoid In-Place on Grad Tensors
In-place operations on tensors that require gradients can cause errors during backpropagation. PyTorch needs the original values to compute gradients, and in-place ops destroy them.
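A minimal sketch of the failure mode:
x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)  # sigmoid's backward pass needs y's values
y.mul_(2)             # in-place op overwrites them
y.sum().backward()    # RuntimeError: ... modified by an inplace operation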
3. Device Management#
# Detect available device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move tensors
x = x.to(device) # generic
x = x.cuda() # explicit GPU
x = x.cpu() # explicit CPU
# Move model (moves ALL parameters and buffers)
model = model.to(device)
# Create tensor directly on device
x = torch.randn(3, 4, device=device)
# Check device
x.device # device(type='cuda', index=0)
x.is_cuda # True / False
Common Pitfall
All tensors in an operation must be on the same device. You cannot add a CPU tensor to a CUDA tensor. If you get a “RuntimeError: expected all tensors to be on the same device” error, check that both your data and model are on the same device.
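A short sketch of the fix, reusing the device variable from above:
a = torch.randn(3, device=device)
b = torch.randn(3)    # CPU tensor
c = a + b.to(device)  # OK -- plain `a + b` raises when device is a GPU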
# Apple Silicon (MPS backend)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# Multi-GPU: specify GPU index
device = torch.device('cuda:0') # first GPU
device = torch.device('cuda:1') # second GPU
4. Autograd#
PyTorch’s automatic differentiation engine. Every operation on tensors with
requires_grad=True is recorded on a dynamic computation graph. Calling
.backward() traverses this graph in reverse to compute gradients.
Basic Gradient Computation#
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward() # compute dy/dx
print(x.grad) # tensor(8.) -- dy/dx = 2x + 2 = 8
Gradient Control#
# Disable gradient tracking (inference, evaluation)
with torch.no_grad():
pred = model(x) # no graph built, saves memory
# Alternative: inference mode (even more memory efficient)
with torch.inference_mode():
pred = model(x)
# Detach tensor from computation graph
x_detached = x.detach() # shares data, no grad tracking
# Prevent gradient for specific parameters
for param in model.encoder.parameters():
param.requires_grad = False # freeze encoder
Zeroing Gradients#
# CRITICAL: PyTorch accumulates gradients by default!
optimizer.zero_grad() # preferred in training loop
# Manual alternatives
param.grad = None # modern PyTorch preferred
param.grad.zero_() # in-place zeroing
model.zero_grad() # zero all model params
Why Gradients Accumulate
Gradient accumulation is by design – it enables computing effective gradients
over multiple mini-batches when GPU memory is too small for a single large batch.
But if you forget optimizer.zero_grad(), gradients from previous iterations
contaminate the current update. This is one of the most common PyTorch bugs.
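Accumulation in two lines: for y = x**2, each backward() adds 2x to x.grad:
x = torch.tensor(3.0, requires_grad=True)
(x ** 2).backward()
print(x.grad)  # tensor(6.)
(x ** 2).backward()
print(x.grad)  # tensor(12.) -- added to, not replaced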
Inspecting the Computation Graph#
x = torch.tensor(2.0, requires_grad=True)
y = x * 3
z = y + 1
z.grad_fn # <AddBackward0> -- last operation
z.grad_fn.next_functions # links to previous ops
y.is_leaf # False (created by an op)
x.is_leaf # True (user-created)
5. nn.Module Pattern#
The fundamental building block for all neural networks in PyTorch.
Custom Module Template#
import torch.nn as nn
class MyModel(nn.Module):
def __init__(self):
super().__init__() # ALWAYS call super().__init__()
self.layer1 = nn.Linear(784, 128)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(0.2)
self.layer2 = nn.Linear(128, 10)
def forward(self, x):
x = self.relu(self.layer1(x))
x = self.dropout(x)
return self.layer2(x)
Inspecting Parameters#
model = MyModel()
# Iterate over all parameters
model.parameters() # iterator
list(model.parameters()) # list of Parameter tensors
# Named parameters (for debugging, freezing)
for name, param in model.named_parameters():
print(f"{name}: {param.shape}")
# Total parameter count
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total}, Trainable: {trainable}")
# List sub-modules
model.children() # immediate children
model.modules() # all modules recursively
model.named_modules() # with names
Module State#
model.train() # training mode: dropout active, batchnorm uses batch stats
model.eval() # evaluation mode: dropout off, batchnorm uses running stats
model.training # True / False -- check current mode
Always Set the Mode
Forgetting model.eval() before inference leads to non-deterministic predictions
(dropout still drops) and incorrect batch normalization (uses batch stats instead
of learned running statistics). Forgetting model.train() before training means
dropout and batchnorm behave incorrectly.
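Dropout makes the difference between the two modes easy to see:
drop = nn.Dropout(p=0.5)
x = torch.ones(6)
drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0
drop.eval()
print(drop(x))  # identity: tensor([1., 1., 1., 1., 1., 1.])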
6. nn.Sequential Shortcut#
For simple feed-forward architectures where data flows linearly through layers.
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 10),
)
With Named Layers#
from collections import OrderedDict
model = nn.Sequential(OrderedDict([
('fc1', nn.Linear(784, 256)),
('relu1', nn.ReLU()),
('fc2', nn.Linear(256, 10)),
]))
# Access by name
model.fc1.weight.shape # torch.Size([256, 784])
When NOT to Use Sequential
Write a custom nn.Module when you need skip connections (ResNet),
multiple inputs/outputs, conditional logic, or shared weights.
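For instance, a minimal residual block sketch -- the skip connection is exactly what nn.Sequential cannot express:
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return torch.relu(x + self.fc2(h))  # add the input back in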
7. Common Layers#
Core Layers#
| Layer | Code | Parameters | Input Shape | Output Shape |
|---|---|---|---|---|
| Fully connected | `nn.Linear(in_f, out_f)` | \(W\): `(out_f, in_f)` | `(N, in_f)` | `(N, out_f)` |
| 1D convolution | `nn.Conv1d(C_in, C_out, K)` | \(W\): `(C_out, C_in, K)` | `(N, C_in, L)` | `(N, C_out, L_out)` |
| 2D convolution | `nn.Conv2d(C_in, C_out, K)` | \(W\): `(C_out, C_in, K, K)` | `(N, C_in, H, W)` | `(N, C_out, H_out, W_out)` |
| Max pool | `nn.MaxPool2d(2)` | None | `(N, C, H, W)` | `(N, C, H/2, W/2)` |
| Avg pool | `nn.AvgPool2d(2)` | None | `(N, C, H, W)` | `(N, C, H/2, W/2)` |
| Adaptive avg pool | `nn.AdaptiveAvgPool2d(1)` | None | `(N, C, H, W)` | `(N, C, 1, 1)` |
| Batch norm (1D) | `nn.BatchNorm1d(C)` | \(\gamma\), \(\beta\) | `(N, C)` | same shape |
| Batch norm (2D) | `nn.BatchNorm2d(C)` | \(\gamma\), \(\beta\) | `(N, C, H, W)` | same shape |
| Layer norm | `nn.LayerNorm(d)` | \(\gamma\), \(\beta\) | `(..., d)` | same shape |
| Dropout | `nn.Dropout(p)` | None | any | same |
| Embedding | `nn.Embedding(V, d)` | \(W\): `(V, d)` | `(N, L)` int indices | `(N, L, d)` |
Recurrent Layers#
| Layer | Code | Notes |
|---|---|---|
| Simple RNN | `nn.RNN(input_size, hidden_size)` | Vanilla recurrence |
| LSTM | `nn.LSTM(input_size, hidden_size)` | Long short-term memory |
| GRU | `nn.GRU(input_size, hidden_size)` | Gated recurrent unit |
# RNN usage pattern
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 10) # (batch, seq_len, features)
output, (h_n, c_n) = rnn(x) # output: (32, 50, 20), h_n: (2, 32, 20)
Activation Functions#
| Activation | Module | Functional | Formula |
|---|---|---|---|
| ReLU | `nn.ReLU()` | `F.relu(x)` | \(\max(0, x)\) |
| LeakyReLU | `nn.LeakyReLU(0.01)` | `F.leaky_relu(x, 0.01)` | \(\max(0.01x, x)\) |
| Sigmoid | `nn.Sigmoid()` | `torch.sigmoid(x)` | \(\frac{1}{1+e^{-x}}\) |
| Tanh | `nn.Tanh()` | `torch.tanh(x)` | \(\frac{e^x - e^{-x}}{e^x + e^{-x}}\) |
| GELU | `nn.GELU()` | `F.gelu(x)` | \(x \cdot \Phi(x)\) |
| Softmax | `nn.Softmax(dim=-1)` | `F.softmax(x, dim=-1)` | \(\frac{e^{x_i}}{\sum_j e^{x_j}}\) |
| LogSoftmax | `nn.LogSoftmax(dim=-1)` | `F.log_softmax(x, dim=-1)` | \(\log\text{softmax}(x)\) |
Module vs. Functional
Use nn.ReLU() as a module attribute when you want it visible in print(model).
Use F.relu(x) (from torch.nn.functional) inside forward() for a lighter
touch. Both produce identical results – it is purely a style choice.
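The same tiny forward pass written both ways (illustrative sizes):
import torch.nn.functional as F
class WithModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)
        self.relu = nn.ReLU()  # shows up in print(model)
    def forward(self, x):
        return self.relu(self.fc(x))
class WithFunctional(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)
    def forward(self, x):
        return F.relu(self.fc(x))  # not listed in print(model)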
8. Loss Functions#
| Loss | Code | Use Case | Input | Target |
|---|---|---|---|---|
| Mean squared error | `nn.MSELoss()` | Regression | any shape | same shape |
| Mean absolute error | `nn.L1Loss()` | Robust regression | any shape | same shape |
| Cross-entropy | `nn.CrossEntropyLoss()` | Multi-class classification | logits `(N, C)` | class indices `(N,)` |
| Binary cross-entropy | `nn.BCEWithLogitsLoss()` | Binary / multi-label | logits, any shape | same shape, floats in `[0, 1]` |
| Negative log-likelihood | `nn.NLLLoss()` | After `log_softmax` | log-probs `(N, C)` | class indices `(N,)` |
| Huber (smooth L1) | `nn.HuberLoss()` | Robust regression | any shape | same shape |
| KL divergence | `nn.KLDivLoss()` | Distribution matching | log-probs | probs |
| Cosine embedding | `nn.CosineEmbeddingLoss()` | Similarity learning | two tensors `(N, D)` | `(N,)` of ±1 |
CrossEntropyLoss Includes Softmax
nn.CrossEntropyLoss() internally applies log_softmax before NLLLoss.
Do not apply softmax to your model output when using this loss – you would
be applying softmax twice, which is a common bug that leads to poor training.
# Classification example
criterion = nn.CrossEntropyLoss()
logits = model(x) # shape: (batch, num_classes) -- RAW
loss = criterion(logits, labels) # labels: (batch,) of ints
# With class weights (for imbalanced data)
weights = torch.tensor([1.0, 2.0, 0.5]) # one per class
criterion = nn.CrossEntropyLoss(weight=weights)
# Ignoring padding tokens (NLP)
criterion = nn.CrossEntropyLoss(ignore_index=-100)
9. Optimizers#
Common Optimizers#
import torch.optim as optim
# Stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# Adam (adaptive learning rate)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# AdamW (decoupled weight decay -- generally preferred over Adam)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)
Per-Parameter Options#
# Different learning rates for different parts of the model
optimizer = optim.Adam([
{'params': model.encoder.parameters(), 'lr': 1e-5}, # fine-tune slowly
{'params': model.decoder.parameters(), 'lr': 1e-3}, # train faster
])
Learning Rate Schedulers#
from torch.optim.lr_scheduler import (
StepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
)
# Step decay: multiply lr by gamma every step_size epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Exponential decay
scheduler = ExponentialLR(optimizer, gamma=0.95)
# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# Reduce on plateau (watches a metric)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)
# One-cycle policy (best for super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)
# Usage in training loop
for epoch in range(num_epochs):
train(...)
scheduler.step() # most schedulers
# scheduler.step(val_loss) # for ReduceLROnPlateau
| Optimizer | When to Use |
|---|---|
| SGD + Momentum | Classic choice; often best final accuracy with tuning |
| Adam | Good default; fast convergence; less sensitive to lr |
| AdamW | Preferred over Adam when using weight decay (most modern work) |
| RMSprop | RNNs, reinforcement learning |
10. Training Loop Template#
Standard Training Loop#
model = MyModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(num_epochs):
# --- Training phase ---
model.train()
train_loss = 0.0
correct = 0
total = 0
for X_batch, y_batch in train_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
pred = model(X_batch) # 1. Forward pass
loss = criterion(pred, y_batch) # 2. Compute loss
optimizer.zero_grad() # 3. Zero gradients
loss.backward() # 4. Backward pass
optimizer.step() # 5. Update weights
train_loss += loss.item() * X_batch.size(0)
correct += (pred.argmax(dim=1) == y_batch).sum().item()
total += y_batch.size(0)
train_loss /= total
train_acc = correct / total
# --- Validation phase ---
model.eval()
val_loss = 0.0
val_correct = 0
val_total = 0
with torch.no_grad():
for X_batch, y_batch in val_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
pred = model(X_batch)
loss = criterion(pred, y_batch)
val_loss += loss.item() * X_batch.size(0)
val_correct += (pred.argmax(dim=1) == y_batch).sum().item()
val_total += y_batch.size(0)
val_loss /= val_total
val_acc = val_correct / val_total
print(f"Epoch {epoch+1}/{num_epochs}: "
f"train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, "
f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")
The 5 Sacred Steps
Every training iteration follows the same 5-step pattern:

1. Forward – compute predictions
2. Loss – measure error
3. Zero – clear old gradients
4. Backward – compute new gradients
5. Step – update parameters
Steps 3-4-5 can swap order slightly (zero_grad can come before forward), but the logic must remain: zero before backward, backward before step.
Gradient Clipping (for RNNs)#
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Gradient Accumulation (simulate larger batch)#
accumulation_steps = 4
optimizer.zero_grad()
for i, (X_batch, y_batch) in enumerate(train_loader):
loss = criterion(model(X_batch.to(device)), y_batch.to(device))
loss = loss / accumulation_steps # normalize
loss.backward() # accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
11. Data Loading#
Custom Dataset#
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
def __init__(self, X, y, transform=None):
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.long)
self.transform = transform
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
sample = self.X[idx]
if self.transform:
sample = self.transform(sample)
return sample, self.y[idx]
DataLoader#
dataset = MyDataset(X_train, y_train)
loader = DataLoader(
dataset,
batch_size=64, # samples per batch
shuffle=True, # randomize order each epoch
num_workers=4, # parallel data loading
pin_memory=True, # faster GPU transfer
drop_last=True, # drop incomplete final batch
)
# Iterate
for X_batch, y_batch in loader:
print(X_batch.shape, y_batch.shape)
break
Built-in Datasets (torchvision)#
from torchvision import datasets, transforms
# Standard transform pipeline
transform = transforms.Compose([
transforms.Resize(32),
transforms.ToTensor(), # PIL -> tensor, scale to [0,1]
transforms.Normalize((0.5,), (0.5,)), # normalize to [-1, 1]
])
# MNIST
train_data = datasets.MNIST(
root='data/', train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
root='data/', train=False, download=True, transform=transform
)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256, shuffle=False)
Common Datasets#
| Dataset | Code | Shape | Classes |
|---|---|---|---|
| MNIST | `datasets.MNIST` | 1x28x28 | 10 digits |
| FashionMNIST | `datasets.FashionMNIST` | 1x28x28 | 10 clothing |
| CIFAR-10 | `datasets.CIFAR10` | 3x32x32 | 10 objects |
| CIFAR-100 | `datasets.CIFAR100` | 3x32x32 | 100 objects |
| ImageNet | `datasets.ImageNet` | 3x224x224 | 1000 objects |
Train/Validation Split#
from torch.utils.data import random_split
full_dataset = MyDataset(X, y)
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_set, val_set = random_split(full_dataset, [train_size, val_size])
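To make the split reproducible across runs, random_split accepts a seeded generator (optional sketch, seed 42 as elsewhere in this sheet):
train_set, val_set = random_split(
    full_dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(42),
)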
12. Save and Load#
Model Weights (Recommended)#
# Save weights only
torch.save(model.state_dict(), 'model_weights.pth')
# Load weights
model = MyModel() # create model first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval() # set to eval mode
Full Checkpoint (model + optimizer + epoch)#
# Save checkpoint
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss.item(),
}, 'checkpoint.pth')
# Load checkpoint
checkpoint = torch.load('checkpoint.pth', weights_only=True)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
Avoid Saving the Entire Model
torch.save(model, 'model.pth') uses Python pickle, which ties the saved file
to the exact class definition and directory structure. Saving state_dict() is
portable and robust.
13. Common Patterns and Idioms#
Flattening for Fully-Connected Layers#
# After conv layers, before FC layers
x = x.view(x.size(0), -1) # flatten keeping batch dim
x = x.flatten(1) # equivalent, more explicit
x = nn.Flatten()(x) # as a module (use in Sequential)
Weight Initialization#
def init_weights(m):
if isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
nn.init.zeros_(m.bias)
elif isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
model.apply(init_weights) # applies recursively to all modules
Freezing Layers (Transfer Learning)#
# Freeze all layers
for param in model.parameters():
param.requires_grad = False
# Unfreeze only the classifier head
for param in model.classifier.parameters():
param.requires_grad = True
# Only pass trainable params to optimizer
optimizer = optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr=1e-3
)
Extracting Scalar from Tensor#
loss_value = loss.item() # Python float from 0-dim tensor
count = correct.item() # Python int
# .item() only works on tensors with exactly one element
Reproducibility#
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
# For full determinism (may slow things down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Mixed Precision Training#
from torch.amp import autocast, GradScaler
scaler = GradScaler()
for X_batch, y_batch in train_loader:
optimizer.zero_grad()
with autocast(device_type='cuda'): # forward in float16
pred = model(X_batch)
loss = criterion(pred, y_batch)
scaler.scale(loss).backward() # scaled backward
scaler.step(optimizer) # unscale + step
scaler.update()
14. CNN Architecture Patterns#
Basic CNN for Image Classification#
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1), # 28x28 -> 28x28
nn.ReLU(),
nn.MaxPool2d(2), # 28x28 -> 14x14
nn.Conv2d(32, 64, 3, padding=1), # 14x14 -> 14x14
nn.ReLU(),
nn.MaxPool2d(2), # 14x14 -> 7x7
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 7 * 7, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, num_classes),
)
def forward(self, x):
x = self.features(x)
return self.classifier(x)
Conv2d Output Size Formula#
Output spatial size: \(\text{out} = \left\lfloor \frac{\text{in} + 2p - k}{s} \right\rfloor + 1\) for kernel size \(k\), padding \(p\), stride \(s\). Example configurations consistent with the sizes below (stride 1 unless noted):

| Config | Input 28x28 | Input 32x32 |
|---|---|---|
| `kernel_size=3, padding=1` | 28x28 | 32x32 |
| `kernel_size=3, padding=0` | 26x26 | 30x30 |
| `kernel_size=5, padding=2` | 28x28 | 32x32 |
| `kernel_size=3, stride=2, padding=1` | 14x14 | 16x16 |
| `nn.MaxPool2d(2)` | 14x14 | 16x16 |
15. Debugging Tips#
Shape Debugging#
# Print shapes at each step in forward()
def forward(self, x):
print(f"Input: {x.shape}")
x = self.conv1(x)
print(f"After conv1: {x.shape}")
x = self.pool(x)
print(f"After pool: {x.shape}")
x = x.flatten(1)
print(f"Flattened: {x.shape}")
return self.fc(x)
Common Errors and Fixes#
| Error | Likely Cause | Fix |
|---|---|---|
| `mat1 and mat2 shapes cannot be multiplied` | Wrong Linear input size | Print shape before the Linear layer |
| `Expected all tensors to be on the same device` | Mixed CPU/CUDA tensors | Move data and model with `.to(device)` |
| `element 0 of tensors does not require grad` | Forgot `requires_grad=True` | Check input tensor settings |
| `... modified by an inplace operation` | In-place op on grad tensor | Replace `x.op_()` with `x = x.op()` |
| `expected scalar type Long but found Float` | Wrong loss function target format | Check target `dtype` and shape |
| Loss is `nan` | Exploding gradients, bad lr | Reduce lr, add gradient clipping |
| Loss stuck / not decreasing | Learning rate too low, or bug | Overfit on 1 batch first to verify model works |
NaN and Gradient Checks#
# Check for NaN in loss
assert not torch.isnan(loss), "Loss is NaN!"
# Check for NaN in gradients
for name, param in model.named_parameters():
if param.grad is not None and torch.isnan(param.grad).any():
print(f"NaN gradient in {name}")
# Numerical gradient verification
torch.autograd.gradcheck(func, inputs, eps=1e-6, atol=1e-4)
# Check gradient magnitudes
for name, param in model.named_parameters():
if param.grad is not None:
print(f"{name}: grad_norm={param.grad.norm():.4f}")
The “Overfit One Batch” Test#
# Sanity check: can the model memorize a single batch?
X_batch, y_batch = next(iter(train_loader))
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
model.train()
for i in range(200):
pred = model(X_batch)
loss = criterion(pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i % 50 == 0:
print(f"Step {i}: loss={loss.item():.4f}")
# Loss should drop to ~0. If not, your model or data pipeline has a bug.
16. Quick Reference Card#
The 25 most-used PyTorch functions and patterns at a glance.
| # | Function / Pattern | What It Does |
|---|---|---|
| 1 | `torch.tensor(data)` | Create a tensor from Python data |
| 2 | `torch.randn(3, 4)` | Random tensor from standard normal |
| 3 | `x.to(device)` | Move tensor to CPU/GPU |
| 4 | `x.shape` | Tensor dimensions |
| 5 | `x.view(3, 4)` / `x.reshape(3, 4)` | Reshape tensor |
| 6 | `x.requires_grad_(True)` | Enable gradient tracking (in-place) |
| 7 | `loss.backward()` | Compute all gradients via backprop |
| 8 | `x.grad` | Access computed gradient |
| 9 | `torch.no_grad()` | Context manager to disable gradients |
| 10 | `x.detach()` | Detach tensor from computation graph |
| 11 | `nn.Linear(in_f, out_f)` | Fully connected layer |
| 12 | `nn.Conv2d(C_in, C_out, K)` | 2D convolution layer |
| 13 | `nn.ReLU()` / `F.relu(x)` | ReLU activation |
| 14 | `nn.Sequential(...)` | Chain layers into a model |
| 15 | `nn.CrossEntropyLoss()` | Classification loss (includes softmax) |
| 16 | `nn.MSELoss()` | Regression loss |
| 17 | `optim.Adam(model.parameters(), lr=1e-3)` | Adam optimizer |
| 18 | `optimizer.zero_grad()` | Zero all parameter gradients |
| 19 | `optimizer.step()` | Update parameters using gradients |
| 20 | `model.train()` | Set model to training mode |
| 21 | `model.eval()` | Set model to evaluation mode |
| 22 | `model.parameters()` | Iterator over model parameters |
| 23 | `DataLoader(dataset, batch_size=64, shuffle=True)` | Batched data iterator |
| 24 | `torch.save(model.state_dict(), path)` | Save model weights |
| 25 | `loss.item()` | Extract Python scalar from loss tensor |
17. Import Cheat Sheet#
import torch # core library
import torch.nn as nn # neural network modules
import torch.nn.functional as F # functional API (relu, softmax, etc.)
import torch.optim as optim # optimizers
from torch.utils.data import Dataset, DataLoader # data utilities
import torchvision # vision datasets and transforms
from torchvision import datasets, transforms
import numpy as np # interop
import matplotlib.pyplot as plt # plotting