PyTorch Cheat Sheet#
A comprehensive quick-reference covering tensor operations, autograd, model building, training loops, data loading, and debugging patterns. Designed to complement Chapters 29–31 of this course.
How to Use This Page
This is a pure reference document – no executable code, just patterns you can
copy and adapt. Use Ctrl+F / Cmd+F to search for specific topics.
1. Tensor Basics#
Creation#
import torch
# From Python data
x = torch.tensor([1, 2, 3]) # from list
x = torch.tensor([[1, 2], [3, 4]]) # 2D from nested list
# Standard constructors
x = torch.zeros(3, 4) # all zeros
x = torch.ones(2, 3) # all ones
x = torch.full((2, 3), 7.0) # filled with 7.0
x = torch.empty(3, 4) # uninitialized (fast)
x = torch.eye(3) # 3x3 identity matrix
# Random tensors
x = torch.randn(3, 4) # standard normal N(0,1)
x = torch.rand(3, 4) # uniform [0, 1)
x = torch.randint(0, 10, (3, 4)) # random integers in [0, 10)
# Sequences
x = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
x = torch.linspace(0, 1, 100) # 100 points in [0, 1]
# Like-constructors (match shape/dtype/device of existing tensor)
y = torch.zeros_like(x)
y = torch.randn_like(x)
NumPy Interop#
# NumPy -> PyTorch (shared memory -- no copy!)
x = torch.from_numpy(np_array)
# PyTorch -> NumPy (shared memory on CPU)
np_array = x.numpy() # CPU tensor only
np_array = x.cpu().numpy() # safe for GPU tensors
np_array = x.detach().numpy() # safe if requires_grad=True
np_array = x.detach().cpu().numpy() # safest -- works in all cases
Shared Memory Warning
torch.from_numpy() and .numpy() share the underlying memory buffer.
Modifying one will modify the other. Use .clone() if you need an independent copy.
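A minimal sketch of the difference, using a throwaway array for illustration:
import numpy as np
shared = np.ones(3)
t = torch.from_numpy(shared)               # shares memory with `shared`
t_copy = torch.from_numpy(shared).clone()  # independent copy
shared[0] = 99.0
print(t[0])       # tensor(99., dtype=torch.float64) -- change is visible
print(t_copy[0])  # tensor(1., dtype=torch.float64)  -- unaffected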
Properties#
x.shape # torch.Size([3, 4]) -- dimensions
x.dtype # torch.float32 -- data type
x.device # device(type='cpu') or device(type='cuda', index=0)
x.requires_grad # True/False -- gradient tracking
x.ndim # number of dimensions (same as len(x.shape))
x.numel() # total number of elements
x.is_contiguous() # memory layout check
Data Types#
| PyTorch dtype | Alias | Notes |
|---|---|---|
| `torch.float32` | `torch.float` | Default for floats. Use this for training. |
| `torch.float64` | `torch.double` | Double precision. Rarely needed. |
| `torch.float16` | `torch.half` | Half precision. Used for mixed-precision training. |
| `torch.bfloat16` | – | Brain floating point. Better range than float16. |
| `torch.int64` | `torch.long` | Default for integers. Required for class labels. |
| `torch.int32` | `torch.int` | 32-bit integer. |
| `torch.bool` | – | Boolean tensor. |
# Type casting
x = x.float() # -> float32
x = x.long() # -> int64
x = x.to(torch.float16) # explicit dtype
2. Tensor Operations#
NumPy vs. PyTorch Equivalents#
| Operation | NumPy | PyTorch | Notes |
|---|---|---|---|
| Reshape | `x.reshape(3, 4)` | `x.view(3, 4)` or `x.reshape(3, 4)` | |
| Flatten | `x.ravel()` | `x.flatten()` | |
| Concatenate | `np.concatenate([a, b], axis=0)` | `torch.cat([a, b], dim=0)` | Along existing dim |
| Stack | `np.stack([a, b])` | `torch.stack([a, b])` | Creates new dim |
| Split | `np.split(x, 2)` | `torch.chunk(x, 2)` | |
| Transpose | `x.T` | `x.T` or `x.transpose(0, 1)` | |
| Squeeze | `np.squeeze(x)` | `x.squeeze()` | Remove dims of size 1 |
| Unsqueeze | `np.expand_dims(x, 0)` | `x.unsqueeze(0)` | Add dim of size 1 |
| Matrix multiply | `a @ b` | `a @ b` or `torch.matmul(a, b)` | |
| Batch matmul | `np.matmul(a, b)` | `torch.bmm(a, b)` | For 3D tensors |
| Element-wise | `a + b`, `a * b` | `a + b`, `a * b` | Same syntax |
| Sum | `x.sum(axis=0)` | `x.sum(dim=0)` | |
| Mean | `x.mean(axis=0)` | `x.mean(dim=0)` | |
| Argmax | `np.argmax(x, axis=1)` | `x.argmax(dim=1)` | |
| Clamp | `np.clip(x, 0, 1)` | `x.clamp(0, 1)` | |
| Where | `np.where(cond, a, b)` | `torch.where(cond, a, b)` | |
Indexing and Slicing#
# Same syntax as NumPy
x[0] # first row
x[:, 1] # second column
x[0:3, :] # first three rows
x[x > 0] # boolean indexing
x[[0, 2, 4]] # fancy indexing
# Useful for batches
x[..., -1] # last element along final dim (Ellipsis)
Broadcasting Rules#
Same rules as NumPy:

- Dimensions are compared from the right (trailing dimensions).
- Two dimensions are compatible if they are equal or one of them is 1.
- Missing dimensions on the left are treated as size 1.
# Example: (4, 3) + (3,) -> (4, 3)
# Example: (4, 1) + (1, 3) -> (4, 3)
# Example: (2, 1, 3) + (4, 1) -> (2, 4, 3)
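A quick sanity check of these rules (shapes only -- values are arbitrary):
a = torch.randn(2, 1, 3)
b = torch.randn(4, 1)
print((a + b).shape)  # torch.Size([2, 4, 3])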
In-Place Operations#
# Trailing underscore = in-place
x.add_(1) # x = x + 1
x.mul_(2) # x = x * 2
x.zero_() # x = 0
x.fill_(5) # x = 5
x.clamp_(0, 1) # clamp in-place
Avoid In-Place on Grad Tensors
In-place operations on tensors that require gradients can cause errors during backpropagation. PyTorch needs the original values to compute gradients, and in-place ops destroy them.
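A minimal sketch of the failure mode:
x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)  # sigmoid's backward pass needs y's values
y.mul_(2)             # in-place op overwrites them
y.sum().backward()    # RuntimeError: ... modified by an inplace operation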
3. Device Management#
# Detect available device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move tensors
x = x.to(device) # generic
x = x.cuda() # explicit GPU
x = x.cpu() # explicit CPU
# Move model (moves ALL parameters and buffers)
model = model.to(device)
# Create tensor directly on device
x = torch.randn(3, 4, device=device)
# Check device
x.device # device(type='cuda', index=0)
x.is_cuda # True / False
Common Pitfall
All tensors in an operation must be on the same device. You cannot add a CPU tensor to a CUDA tensor. If you get a “RuntimeError: expected all tensors to be on the same device” error, check that both your data and model are on the same device.
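A short sketch of the fix, reusing the device variable from above:
a = torch.randn(3, device=device)
b = torch.randn(3)    # CPU tensor
c = a + b.to(device)  # OK -- plain `a + b` raises when device is a GPU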
# Apple Silicon (MPS backend)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# Multi-GPU: specify GPU index
device = torch.device('cuda:0') # first GPU
device = torch.device('cuda:1') # second GPU
4. Autograd#
PyTorch’s automatic differentiation engine. Every operation on tensors with
requires_grad=True is recorded on a dynamic computation graph. Calling
.backward() traverses this graph in reverse to compute gradients.
Basic Gradient Computation#
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward() # compute dy/dx
print(x.grad) # tensor(8.) -- dy/dx = 2x + 2 = 8
Gradient Control#
# Disable gradient tracking (inference, evaluation)
with torch.no_grad():
pred = model(x) # no graph built, saves memory
# Alternative: inference mode (even more memory efficient)
with torch.inference_mode():
pred = model(x)
# Detach tensor from computation graph
x_detached = x.detach() # shares data, no grad tracking
# Prevent gradient for specific parameters
for param in model.encoder.parameters():
param.requires_grad = False # freeze encoder
Zeroing Gradients#
# CRITICAL: PyTorch accumulates gradients by default!
optimizer.zero_grad() # preferred in training loop
# Manual alternatives
param.grad = None # modern PyTorch preferred
param.grad.zero_() # in-place zeroing
model.zero_grad() # zero all model params
Why Gradients Accumulate
Gradient accumulation is by design – it enables computing effective gradients
over multiple mini-batches when GPU memory is too small for a single large batch.
But if you forget optimizer.zero_grad(), gradients from previous iterations
contaminate the current update. This is one of the most common PyTorch bugs.
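Accumulation in two lines: for y = x**2, each backward() adds 2x to x.grad:
x = torch.tensor(3.0, requires_grad=True)
(x ** 2).backward()
print(x.grad)  # tensor(6.)
(x ** 2).backward()
print(x.grad)  # tensor(12.) -- added to, not replaced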
Inspecting the Computation Graph#
x = torch.tensor(2.0, requires_grad=True)
y = x * 3
z = y + 1
z.grad_fn # <AddBackward0> -- last operation
z.grad_fn.next_functions # links to previous ops
y.is_leaf # False (created by an op)
x.is_leaf # True (user-created)
5. nn.Module Pattern#
The fundamental building block for all neural networks in PyTorch.
Custom Module Template#
import torch.nn as nn
class MyModel(nn.Module):
def __init__(self):
super().__init__() # ALWAYS call super().__init__()
self.layer1 = nn.Linear(784, 128)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(0.2)
self.layer2 = nn.Linear(128, 10)
def forward(self, x):
x = self.relu(self.layer1(x))
x = self.dropout(x)
return self.layer2(x)
Inspecting Parameters#
model = MyModel()
# Iterate over all parameters
model.parameters() # iterator
list(model.parameters()) # list of Parameter tensors
# Named parameters (for debugging, freezing)
for name, param in model.named_parameters():
print(f"{name}: {param.shape}")
# Total parameter count
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total}, Trainable: {trainable}")
# List sub-modules
model.children() # immediate children
model.modules() # all modules recursively
model.named_modules() # with names
Module State#
model.train() # training mode: dropout active, batchnorm uses batch stats
model.eval() # evaluation mode: dropout off, batchnorm uses running stats
model.training # True / False -- check current mode
Always Set the Mode
Forgetting model.eval() before inference leads to non-deterministic predictions
(dropout still drops) and incorrect batch normalization (uses batch stats instead
of learned running statistics). Forgetting model.train() before training means
dropout and batchnorm behave incorrectly.
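Dropout makes the difference between the two modes easy to see:
drop = nn.Dropout(p=0.5)
x = torch.ones(6)
drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0
drop.eval()
print(drop(x))  # identity: tensor([1., 1., 1., 1., 1., 1.])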
6. nn.Sequential Shortcut#
For simple feed-forward architectures where data flows linearly through layers.
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 10),
)
With Named Layers#
from collections import OrderedDict
model = nn.Sequential(OrderedDict([
('fc1', nn.Linear(784, 256)),
('relu1', nn.ReLU()),
('fc2', nn.Linear(256, 10)),
]))
# Access by name
model.fc1.weight.shape # torch.Size([256, 784])
When NOT to Use Sequential
Write a custom nn.Module when you need skip connections (ResNet),
multiple inputs/outputs, conditional logic, or shared weights.
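For instance, a minimal residual block sketch -- the skip connection is exactly what nn.Sequential cannot express:
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return torch.relu(x + self.fc2(h))  # add the input back in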
7. Common Layers#
Core Layers#
| Layer | Code | Parameters | Input Shape | Output Shape |
|---|---|---|---|---|
| Fully connected | `nn.Linear(in_f, out_f)` | \(W\): `(out_f, in_f)` | `(N, in_f)` | `(N, out_f)` |
| 1D convolution | `nn.Conv1d(C_in, C_out, K)` | \(W\): `(C_out, C_in, K)` | `(N, C_in, L)` | `(N, C_out, L_out)` |
| 2D convolution | `nn.Conv2d(C_in, C_out, K)` | \(W\): `(C_out, C_in, K, K)` | `(N, C_in, H, W)` | `(N, C_out, H_out, W_out)` |
| Max pool | `nn.MaxPool2d(2)` | None | `(N, C, H, W)` | `(N, C, H/2, W/2)` |
| Avg pool | `nn.AvgPool2d(2)` | None | `(N, C, H, W)` | `(N, C, H/2, W/2)` |
| Adaptive avg pool | `nn.AdaptiveAvgPool2d(1)` | None | `(N, C, H, W)` | `(N, C, 1, 1)` |
| Batch norm (1D) | `nn.BatchNorm1d(C)` | \(\gamma\), \(\beta\) | `(N, C)` | same shape |
| Batch norm (2D) | `nn.BatchNorm2d(C)` | \(\gamma\), \(\beta\) | `(N, C, H, W)` | same shape |
| Layer norm | `nn.LayerNorm(d)` | \(\gamma\), \(\beta\) | `(..., d)` | same shape |
| Dropout | `nn.Dropout(p)` | None | any | same |
| Embedding | `nn.Embedding(V, d)` | \(W\): `(V, d)` | `(N, L)` int indices | `(N, L, d)` |
Recurrent Layers#
| Layer | Code | Notes |
|---|---|---|
| Simple RNN | `nn.RNN(input_size, hidden_size)` | Vanilla recurrence |
| LSTM | `nn.LSTM(input_size, hidden_size)` | Long short-term memory |
| GRU | `nn.GRU(input_size, hidden_size)` | Gated recurrent unit |
# RNN usage pattern
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 10) # (batch, seq_len, features)
output, (h_n, c_n) = rnn(x) # output: (32, 50, 20), h_n: (2, 32, 20)
Activation Functions#
| Activation | Module | Functional | Formula |
|---|---|---|---|
| ReLU | `nn.ReLU()` | `F.relu(x)` | \(\max(0, x)\) |
| LeakyReLU | `nn.LeakyReLU(0.01)` | `F.leaky_relu(x, 0.01)` | \(\max(0.01x, x)\) |
| Sigmoid | `nn.Sigmoid()` | `torch.sigmoid(x)` | \(\frac{1}{1+e^{-x}}\) |
| Tanh | `nn.Tanh()` | `torch.tanh(x)` | \(\frac{e^x - e^{-x}}{e^x + e^{-x}}\) |
| GELU | `nn.GELU()` | `F.gelu(x)` | \(x \cdot \Phi(x)\) |
| Softmax | `nn.Softmax(dim=-1)` | `F.softmax(x, dim=-1)` | \(\frac{e^{x_i}}{\sum_j e^{x_j}}\) |
| LogSoftmax | `nn.LogSoftmax(dim=-1)` | `F.log_softmax(x, dim=-1)` | \(\log\text{softmax}(x)\) |
Module vs. Functional
Use nn.ReLU() as a module attribute when you want it visible in print(model).
Use F.relu(x) (from torch.nn.functional) inside forward() for a lighter
touch. Both produce identical results – it is purely a style choice.
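The same tiny forward pass written both ways (illustrative sizes):
import torch.nn.functional as F
class WithModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)
        self.relu = nn.ReLU()  # shows up in print(model)
    def forward(self, x):
        return self.relu(self.fc(x))
class WithFunctional(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)
    def forward(self, x):
        return F.relu(self.fc(x))  # not listed in print(model)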
8. Loss Functions#
| Loss | Code | Use Case | Input | Target |
|---|---|---|---|---|
| Mean squared error | `nn.MSELoss()` | Regression | any shape | same shape |
| Mean absolute error | `nn.L1Loss()` | Robust regression | any shape | same shape |
| Cross-entropy | `nn.CrossEntropyLoss()` | Multi-class classification | logits `(N, C)` | class indices `(N,)` |
| Binary cross-entropy | `nn.BCEWithLogitsLoss()` | Binary / multi-label | logits, any shape | same shape, floats in `[0, 1]` |
| Negative log-likelihood | `nn.NLLLoss()` | After `log_softmax` | log-probs `(N, C)` | class indices `(N,)` |
| Huber (smooth L1) | `nn.HuberLoss()` | Robust regression | any shape | same shape |
| KL divergence | `nn.KLDivLoss()` | Distribution matching | log-probs | probs |
| Cosine embedding | `nn.CosineEmbeddingLoss()` | Similarity learning | two tensors `(N, D)` | `(N,)` of ±1 |
CrossEntropyLoss Includes Softmax
nn.CrossEntropyLoss() internally applies log_softmax before NLLLoss.
Do not apply softmax to your model output when using this loss – you would
be applying softmax twice, which is a common bug that leads to poor training.
# Classification example
criterion = nn.CrossEntropyLoss()
logits = model(x) # shape: (batch, num_classes) -- RAW
loss = criterion(logits, labels) # labels: (batch,) of ints
# With class weights (for imbalanced data)
weights = torch.tensor([1.0, 2.0, 0.5]) # one per class
criterion = nn.CrossEntropyLoss(weight=weights)
# Ignoring padding tokens (NLP)
criterion = nn.CrossEntropyLoss(ignore_index=-100)
9. Optimizers#
Common Optimizers#
import torch.optim as optim
# Stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# Adam (adaptive learning rate)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# AdamW (decoupled weight decay -- generally preferred over Adam)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)
Per-Parameter Options#
# Different learning rates for different parts of the model
optimizer = optim.Adam([
{'params': model.encoder.parameters(), 'lr': 1e-5}, # fine-tune slowly
{'params': model.decoder.parameters(), 'lr': 1e-3}, # train faster
])
Learning Rate Schedulers#
from torch.optim.lr_scheduler import (
StepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
)
# Step decay: multiply lr by gamma every step_size epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Exponential decay
scheduler = ExponentialLR(optimizer, gamma=0.95)
# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# Reduce on plateau (watches a metric)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)
# One-cycle policy (best for super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)
# Usage in training loop
for epoch in range(num_epochs):
train(...)
scheduler.step() # most schedulers
# scheduler.step(val_loss) # for ReduceLROnPlateau
| Optimizer | When to Use |
|---|---|
| SGD + Momentum | Classic choice; often best final accuracy with tuning |
| Adam | Good default; fast convergence; less sensitive to lr |
| AdamW | Preferred over Adam when using weight decay (most modern work) |
| RMSprop | RNNs, reinforcement learning |
10. Training Loop Template#
Standard Training Loop#
model = MyModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(num_epochs):
# --- Training phase ---
model.train()
train_loss = 0.0
correct = 0
total = 0
for X_batch, y_batch in train_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
pred = model(X_batch) # 1. Forward pass
loss = criterion(pred, y_batch) # 2. Compute loss
optimizer.zero_grad() # 3. Zero gradients
loss.backward() # 4. Backward pass
optimizer.step() # 5. Update weights
train_loss += loss.item() * X_batch.size(0)
correct += (pred.argmax(dim=1) == y_batch).sum().item()
total += y_batch.size(0)
train_loss /= total
train_acc = correct / total
# --- Validation phase ---
model.eval()
val_loss = 0.0
val_correct = 0
val_total = 0
with torch.no_grad():
for X_batch, y_batch in val_loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
pred = model(X_batch)
loss = criterion(pred, y_batch)
val_loss += loss.item() * X_batch.size(0)
val_correct += (pred.argmax(dim=1) == y_batch).sum().item()
val_total += y_batch.size(0)
val_loss /= val_total
val_acc = val_correct / val_total
print(f"Epoch {epoch+1}/{num_epochs}: "
f"train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, "
f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")
The 5 Sacred Steps
Every training iteration follows the same 5-step pattern:

1. Forward – compute predictions
2. Loss – measure error
3. Zero – clear old gradients
4. Backward – compute new gradients
5. Step – update parameters
Steps 3-4-5 can swap order slightly (zero_grad can come before forward), but the logic must remain: zero before backward, backward before step.
Gradient Clipping (for RNNs)#
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Gradient Accumulation (simulate larger batch)#
accumulation_steps = 4
optimizer.zero_grad()
for i, (X_batch, y_batch) in enumerate(train_loader):
loss = criterion(model(X_batch.to(device)), y_batch.to(device))
loss = loss / accumulation_steps # normalize
loss.backward() # accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
11. Data Loading#
Custom Dataset#
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
def __init__(self, X, y, transform=None):
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.long)
self.transform = transform
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
sample = self.X[idx]
if self.transform:
sample = self.transform(sample)
return sample, self.y[idx]
DataLoader#
dataset = MyDataset(X_train, y_train)
loader = DataLoader(
dataset,
batch_size=64, # samples per batch
shuffle=True, # randomize order each epoch
num_workers=4, # parallel data loading
pin_memory=True, # faster GPU transfer
drop_last=True, # drop incomplete final batch
)
# Iterate
for X_batch, y_batch in loader:
print(X_batch.shape, y_batch.shape)
break
Built-in Datasets (torchvision)#
from torchvision import datasets, transforms
# Standard transform pipeline
transform = transforms.Compose([
transforms.Resize(32),
transforms.ToTensor(), # PIL -> tensor, scale to [0,1]
transforms.Normalize((0.5,), (0.5,)), # normalize to [-1, 1]
])
# MNIST
train_data = datasets.MNIST(
root='data/', train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
root='data/', train=False, download=True, transform=transform
)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256, shuffle=False)
Common Datasets#
| Dataset | Code | Shape | Classes |
|---|---|---|---|
| MNIST | `datasets.MNIST` | 1x28x28 | 10 digits |
| FashionMNIST | `datasets.FashionMNIST` | 1x28x28 | 10 clothing |
| CIFAR-10 | `datasets.CIFAR10` | 3x32x32 | 10 objects |
| CIFAR-100 | `datasets.CIFAR100` | 3x32x32 | 100 objects |
| ImageNet | `datasets.ImageNet` | 3x224x224 | 1000 objects |
Train/Validation Split#
from torch.utils.data import random_split
full_dataset = MyDataset(X, y)
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_set, val_set = random_split(full_dataset, [train_size, val_size])
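To make the split reproducible across runs, random_split accepts a seeded generator (optional sketch, seed 42 as elsewhere in this sheet):
train_set, val_set = random_split(
    full_dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(42),
)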
12. Save and Load#
Model Weights (Recommended)#
# Save weights only
torch.save(model.state_dict(), 'model_weights.pth')
# Load weights
model = MyModel() # create model first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval() # set to eval mode
Full Checkpoint (model + optimizer + epoch)#
# Save checkpoint
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss.item(),
}, 'checkpoint.pth')
# Load checkpoint
checkpoint = torch.load('checkpoint.pth', weights_only=True)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
Avoid Saving the Entire Model
torch.save(model, 'model.pth') uses Python pickle, which ties the saved file
to the exact class definition and directory structure. Saving state_dict() is
portable and robust.
13. Common Patterns and Idioms#
Flattening for Fully-Connected Layers#
# After conv layers, before FC layers
x = x.view(x.size(0), -1) # flatten keeping batch dim
x = x.flatten(1) # equivalent, more explicit
x = nn.Flatten()(x) # as a module (use in Sequential)
Weight Initialization#
def init_weights(m):
if isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
nn.init.zeros_(m.bias)
elif isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
model.apply(init_weights) # applies recursively to all modules
Freezing Layers (Transfer Learning)#
# Freeze all layers
for param in model.parameters():
param.requires_grad = False
# Unfreeze only the classifier head
for param in model.classifier.parameters():
param.requires_grad = True
# Only pass trainable params to optimizer
optimizer = optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr=1e-3
)
Extracting Scalar from Tensor#
loss_value = loss.item() # Python float from 0-dim tensor
count = correct.item() # Python int
# .item() only works on tensors with exactly one element
Reproducibility#
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
# For full determinism (may slow things down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Mixed Precision Training#
from torch.amp import autocast, GradScaler
scaler = GradScaler()
for X_batch, y_batch in train_loader:
optimizer.zero_grad()
with autocast(device_type='cuda'): # forward in float16
pred = model(X_batch)
loss = criterion(pred, y_batch)
scaler.scale(loss).backward() # scaled backward
scaler.step(optimizer) # unscale + step
scaler.update()
14. CNN Architecture Patterns#
Basic CNN for Image Classification#
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1), # 28x28 -> 28x28
nn.ReLU(),
nn.MaxPool2d(2), # 28x28 -> 14x14
nn.Conv2d(32, 64, 3, padding=1), # 14x14 -> 14x14
nn.ReLU(),
nn.MaxPool2d(2), # 14x14 -> 7x7
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 7 * 7, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, num_classes),
)
def forward(self, x):
x = self.features(x)
return self.classifier(x)
Conv2d Output Size Formula#
Output spatial size: \(\text{out} = \left\lfloor \frac{\text{in} + 2p - k}{s} \right\rfloor + 1\) for kernel size \(k\), padding \(p\), stride \(s\). Example configurations consistent with the sizes below (stride 1 unless noted):

| Config | Input 28x28 | Input 32x32 |
|---|---|---|
| `kernel_size=3, padding=1` | 28x28 | 32x32 |
| `kernel_size=3, padding=0` | 26x26 | 30x30 |
| `kernel_size=5, padding=2` | 28x28 | 32x32 |
| `kernel_size=3, stride=2, padding=1` | 14x14 | 16x16 |
| `nn.MaxPool2d(2)` | 14x14 | 16x16 |
15. Debugging Tips#
Shape Debugging#
# Print shapes at each step in forward()
def forward(self, x):
print(f"Input: {x.shape}")
x = self.conv1(x)
print(f"After conv1: {x.shape}")
x = self.pool(x)
print(f"After pool: {x.shape}")
x = x.flatten(1)
print(f"Flattened: {x.shape}")
return self.fc(x)
Common Errors and Fixes#
| Error | Likely Cause | Fix |
|---|---|---|
| `mat1 and mat2 shapes cannot be multiplied` | Wrong Linear input size | Print shape before the Linear layer |
| `Expected all tensors to be on the same device` | Mixed CPU/CUDA tensors | Move data and model with `.to(device)` |
| `element 0 of tensors does not require grad` | Forgot `requires_grad=True` | Check input tensor settings |
| `... modified by an inplace operation` | In-place op on grad tensor | Replace `x.op_()` with `x = x.op()` |
| `expected scalar type Long but found Float` | Wrong loss function target format | Check target `dtype` and shape |
| Loss is `nan` | Exploding gradients, bad lr | Reduce lr, add gradient clipping |
| Loss stuck / not decreasing | Learning rate too low, or bug | Overfit on 1 batch first to verify model works |
NaN and Gradient Checks#
# Check for NaN in loss
assert not torch.isnan(loss), "Loss is NaN!"
# Check for NaN in gradients
for name, param in model.named_parameters():
if param.grad is not None and torch.isnan(param.grad).any():
print(f"NaN gradient in {name}")
# Numerical gradient verification
torch.autograd.gradcheck(func, inputs, eps=1e-6, atol=1e-4)
# Check gradient magnitudes
for name, param in model.named_parameters():
if param.grad is not None:
print(f"{name}: grad_norm={param.grad.norm():.4f}")
The “Overfit One Batch” Test#
# Sanity check: can the model memorize a single batch?
X_batch, y_batch = next(iter(train_loader))
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
model.train()
for i in range(200):
pred = model(X_batch)
loss = criterion(pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i % 50 == 0:
print(f"Step {i}: loss={loss.item():.4f}")
# Loss should drop to ~0. If not, your model or data pipeline has a bug.
16. Quick Reference Card#
The 25 most-used PyTorch functions and patterns at a glance.
| # | Function / Pattern | What It Does |
|---|---|---|
| 1 | `torch.tensor(data)` | Create a tensor from Python data |
| 2 | `torch.randn(3, 4)` | Random tensor from standard normal |
| 3 | `x.to(device)` | Move tensor to CPU/GPU |
| 4 | `x.shape` | Tensor dimensions |
| 5 | `x.view(3, 4)` / `x.reshape(3, 4)` | Reshape tensor |
| 6 | `x.requires_grad_(True)` | Enable gradient tracking (in-place) |
| 7 | `loss.backward()` | Compute all gradients via backprop |
| 8 | `x.grad` | Access computed gradient |
| 9 | `torch.no_grad()` | Context manager to disable gradients |
| 10 | `x.detach()` | Detach tensor from computation graph |
| 11 | `nn.Linear(in_f, out_f)` | Fully connected layer |
| 12 | `nn.Conv2d(C_in, C_out, K)` | 2D convolution layer |
| 13 | `nn.ReLU()` / `F.relu(x)` | ReLU activation |
| 14 | `nn.Sequential(...)` | Chain layers into a model |
| 15 | `nn.CrossEntropyLoss()` | Classification loss (includes softmax) |
| 16 | `nn.MSELoss()` | Regression loss |
| 17 | `optim.Adam(model.parameters(), lr=1e-3)` | Adam optimizer |
| 18 | `optimizer.zero_grad()` | Zero all parameter gradients |
| 19 | `optimizer.step()` | Update parameters using gradients |
| 20 | `model.train()` | Set model to training mode |
| 21 | `model.eval()` | Set model to evaluation mode |
| 22 | `model.parameters()` | Iterator over model parameters |
| 23 | `DataLoader(dataset, batch_size=64, shuffle=True)` | Batched data iterator |
| 24 | `torch.save(model.state_dict(), path)` | Save model weights |
| 25 | `loss.item()` | Extract Python scalar from loss tensor |
17. Import Cheat Sheet#
import torch # core library
import torch.nn as nn # neural network modules
import torch.nn.functional as F # functional API (relu, softmax, etc.)
import torch.optim as optim # optimizers
from torch.utils.data import Dataset, DataLoader # data utilities
import torchvision # vision datasets and transforms
from torchvision import datasets, transforms
import numpy as np # interop
import matplotlib.pyplot as plt # plotting