Chapter 28: We built an autograd engine in 80 lines — the Value class with forward ops and backward propagation. PyTorch does the same thing — on tensors, on GPUs, in C++.
Our micrograd (ch28)
x = Value(2.0)
y = Value(3.0)
z = x * y + (x ** 2).relu()
z.backward()
print(x.grad) # 7.0
PyTorch equivalent
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)
z.backward()
print(x.grad) # tensor(7.)
Same computation graph, same gradients. The only difference: torch.Tensor operates on n-dimensional arrays at GPU speed, not single scalars.
Framework Timeline
Key innovation of PyTorch: define-by-run (dynamic graph) — the computation graph is built on the fly, just like our Value class. Debug with print(), use if/else in forward pass.
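A minimal sketch of what define-by-run means in practice (standard torch calls only; the values here are illustrative): ordinary Python branching decides which operations enter the graph on each run.
import torch

x = torch.tensor(2.0, requires_grad=True)
if x > 1:                     # plain Python control flow selects which ops join the graph
    y = x * 3
else:
    y = x ** 2
print(y)                      # debug with print(), like any Python value
y.backward()
print(x.grad)                 # tensor(3.) -- only the branch that actually ran contributes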
Definition Tensors
A tensor is an n-dimensional array with autograd support. It tracks the computation graph and can compute gradients via .backward().
Operation | NumPy | PyTorch
Create array | np.array([1,2,3]) | torch.tensor([1,2,3])
Zeros | np.zeros((3,4)) | torch.zeros(3,4)
Random | np.random.randn(3,4) | torch.randn(3,4)
Matrix multiply | A @ B | A @ B
Element-wise | A * B | A * B
Shape | A.shape | A.shape
Gradients | — | requires_grad=True
GPU | — | .to('cuda')
The APIs are nearly identical; PyTorch adds autograd and GPU acceleration. Convert between the two with torch.from_numpy(arr) and tensor.numpy().
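A short round-trip sketch of those conversion calls:
import numpy as np
import torch

arr = np.random.randn(3, 4)
t = torch.from_numpy(arr)    # zero-copy: shares memory with arr
back = t.numpy()             # back to NumPy, also sharing memory
t2 = torch.tensor(arr)       # makes an independent copy instead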
Autograd on Tensors
Micrograd (scalars)
x = Value(2.0)
y = Value(3.0)
z = x * y + (x**2).relu()
z.backward()
# x.grad = 7.0 (dz/dx)
# One scalar at a time
PyTorch (tensors)
W = torch.randn(784, 128, requires_grad=True)
x = torch.randn(64, 784)
h = (x @ W).clamp(min=0)
loss = h.sum()
loss.backward()
# W.grad.shape = (784, 128)
Same algorithm, different scale: Micrograd computes $\partial z / \partial x$ for one scalar. PyTorch computes $\partial L / \partial W$ for a 784×128 weight matrix — 100,352 gradients in one call.
Under the hood: The same topological sort + reverse-mode AD we implemented in ch28. PyTorch just uses optimized C++/CUDA kernels for each operation.
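You can peek at that graph yourself: every non-leaf tensor carries a grad_fn node whose next_functions point to its parents. A rough sketch (exact node class names vary across PyTorch versions):
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)

print(z.grad_fn)                  # e.g. <AddBackward0 ...>, the last op in the graph
print(z.grad_fn.next_functions)   # its parents: the mul node and the clamp node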
Definition nn.Module
nn.Module is the base class for all neural network components — parameter management, device transfer, serialization, composability.
Compare with ch19: Our MultiLayerNetwork had the same structure — __init__ stored weights, forward() computed output. PyTorch adds .parameters(), .to(device), .state_dict().
Key pattern: Define layers in __init__, define computation in forward. PyTorch handles gradients, parameter tracking, and GPU transfer automatically.
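A minimal sketch of the pattern (TwoLayerNet is a hypothetical example, not the chapter's exact model):
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)       # layers defined in __init__
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):                    # computation defined in forward
        return self.fc2(F.relu(self.fc1(x)))

model = TwoLayerNet()
print(sum(p.numel() for p in model.parameters()))   # parameters tracked automatically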
Pause & Think
What does loss.backward() do?
Map each step to what our Value class did in Chapter 28:
Value.backward() (ch28)
1. Topological sort of graph
2. Set output.grad = 1.0
3. Reverse iterate: each node calls _backward()
4. Chain rule accumulates gradients
loss.backward() (PyTorch)
1. Topological sort of tensor graph
2. Seed the output gradient with 1.0
3. Reverse iterate: each op calls grad_fn
4. Chain rule accumulates .grad tensors
Important Training Loop Anatomy
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        y_pred = model(X_batch)            # 1. Forward pass (ch15-19)
        loss = criterion(y_pred, y_batch)  # 2. Compute loss (ch26)
        optimizer.zero_grad()              # 3. Clear gradients (ch28)
        loss.backward()                    # 4. Backward pass (ch16, ch28)
        optimizer.step()                   # 5. Update weights (ch27)
Line 1: Forward pass, same as our network.forward(x)
Line 2: Loss, cross-entropy from ch26
Line 3: Zero grads, which we previously did manually
Line 4: Backward pass, our Value.backward()
Line 5: Step, SGD/Adam from ch27
Five lines capture the entire training algorithm.
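The loop assumes a model, loss, optimizer, and dataloader already exist. A hypothetical setup matching the names in the snippet (hyperparameters are illustrative, not the chapter's exact values):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(),
                      nn.Linear(784, 128), nn.ReLU(),
                      nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss (ch26)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer (ch27)
num_epochs = 5
# dataloader: any torch.utils.data.DataLoader yielding (X_batch, y_batch) pairs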
Why zero_grad()? PyTorch accumulates gradients by default (useful for gradient accumulation with large batches). Without zeroing, gradients from previous iterations corrupt the update.
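A tiny demonstration of the accumulation behavior:
import torch

x = torch.tensor(2.0, requires_grad=True)
(x * 3).backward()
print(x.grad)      # tensor(3.)
(x * 3).backward()
print(x.grad)      # tensor(6.) -- the second backward adds to the first
x.grad.zero_()     # roughly what optimizer.zero_grad() does for each parameter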
Definition Dataset & DataLoader
Dataset: Stores samples and labels. Implements __len__ and __getitem__. DataLoader: Wraps a Dataset to provide batching, shuffling, and parallel loading.
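A minimal sketch of both pieces (ToyDigits is a made-up dataset with random tensors standing in for MNIST):
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDigits(Dataset):
    def __init__(self, images, labels):
        self.images, self.labels = images, labels
    def __len__(self):
        return len(self.images)
    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

ds = ToyDigits(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))
loader = DataLoader(ds, batch_size=64, shuffle=True)
X_batch, y_batch = next(iter(loader))          # batches ready for the training loop
print(X_batch.shape, y_batch.shape)            # torch.Size([64, 784]) torch.Size([64])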
Trained this way, the PyTorch MLP reaches 97% accuracy on the 10,000 MNIST test digits. Our from-scratch MLP in ch19 achieved similar results on synthetic data; now we get them on real handwritten digits.
Warning From TinyCNN to PyTorch
Aspect | NumPy Conv2D (ch23) | nn.Conv2d (PyTorch)
Forward pass | Nested for loops in Python | Optimized C++/CUDA kernel
Backward pass | Manual gradient derivation | Automatic via autograd
Parameters | self.kernels numpy array | self.weight + self.bias
Batching | One image at a time | Batch dimension built-in
GPU support | None | .to('cuda')
Speed (MNIST) | ~15 min (CPU) | ~30 sec (CPU)
Our Conv2D (ch23)
for i in range(out_h):
    for j in range(out_w):
        for f in range(n_filters):
            patch = x[:, i:i+K, j:j+K]
            out[f, i, j] = np.sum(patch * self.kernels[f]) + self.biases[f]
PyTorch nn.Conv2d
self.conv1 = nn.Conv2d(
    in_channels=1,
    out_channels=32,
    kernel_size=3,
    padding=1
)
out = self.conv1(x)  # That's it
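A quick standalone shape check of that layer (a sketch assuming the imports shown; the batch size is illustrative):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(64, 1, 28, 28)              # a batch of 64 single-channel 28x28 images
print(conv(x).shape)                        # torch.Size([64, 32, 28, 28]); padding=1 keeps 28x28
print(conv.weight.shape, conv.bias.shape)   # torch.Size([32, 1, 3, 3]) torch.Size([32])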
Versus the MLP: 98.5% vs 97.0% accuracy, with fewer parameters and faster convergence. The convolutional inductive bias pays off.
Learned Filters
First-layer 3×3 filters learned on MNIST digits — compare with ch25’s synthetic patterns:
Typical learned filters (conceptual heatmaps)
Same edge detectors emerge! In ch25, our TinyCNN learned horizontal/vertical/diagonal filters on synthetic 8×8 patterns. On real MNIST digits, the same types of features appear — edges, corners, curves.
Hubel & Wiesel (1962): Biological simple cells in V1 are also oriented edge detectors. The CNN rediscovers these features through gradient descent alone.
Layer 2 filters combine Layer 1 edges into more complex patterns: loops, junctions, strokes — building a feature hierarchy.
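To reproduce this kind of visualization yourself, a sketch assuming a trained model whose first layer is named conv1, as in the snippet above:
import matplotlib.pyplot as plt

filters = model.conv1.weight.detach().cpu()    # shape (32, 1, 3, 3)
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for kernel, ax in zip(filters, axes.flat):
    ax.imshow(kernel[0], cmap='gray')          # each 3x3 kernel drawn as a tiny heatmap
    ax.axis('off')
plt.show()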
Scaling Up
The journey so far: From TinyCNN (114 params, 8×8 patterns, minutes in NumPy) to PyTorch CNN (25K params, 60K images, 30 seconds). Same principles, industrial scale.
Model | Chapter | Parameters | Data | Time | Accuracy
Perceptron | Ch 4 | 3 | Synthetic 2D | instant | 100% (linear)
MLP (from scratch) | Ch 19 | ~1,200 | Synthetic 2D | seconds | ~95%
TinyCNN (from scratch) | Ch 23-25 | 114 | 8×8 synthetic | minutes | ~92%
MLP (PyTorch) | Ch 30 | 109K | MNIST (60K) | ~60 sec | 97.0%
CNN (PyTorch) | Ch 31 | 25K | MNIST (60K) | ~30 sec | 98.5%
The pattern: Each step up the ladder uses the same building blocks (linear transform, nonlinearity, gradient descent) at greater scale. Understanding from scratch enables mastery of industrial tools.
Framework Corner
The same CNN architecture in two frameworks shows that the ideas transcend syntax:
PyTorch
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.fc = nn.Linear(32*13*13, 10)
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 32*13*13)
        return self.fc(x)
JAX (Flax)
class Net(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Conv(32, (3, 3))(x)
        x = nn.relu(x)
        x = nn.max_pool(x, (2, 2), strides=(2, 2))
        x = x.reshape((x.shape[0], -1))
        x = nn.Dense(10)(x)
        return x
Architecture ideas transcend frameworks. Conv2D, ReLU, MaxPool, Dense — these are universal building blocks. Learn the concepts (Parts I–VIII), then any framework becomes syntactic sugar.