Part IX

Introduction
to PyTorch

From Micrograd to Industrial Scale

Chapters 29–31

The Bridge

Chapter 28: We built an autograd engine in 80 lines — the Value class with forward ops and backward propagation. PyTorch does the same thing — on tensors, on GPUs, in C++.

Our micrograd (ch28)

x = Value(2.0)
y = Value(3.0)
z = x * y + (x ** 2).relu()
z.backward()
print(x.grad)  # 7.0

PyTorch equivalent

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)
z.backward()
print(x.grad)  # tensor(7.)
Same computation graph, same gradients. The only difference: torch.Tensor operates on n-dimensional arrays at GPU speed, not single scalars.
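To make that difference concrete, here is a minimal sketch (assuming only the standard torch API shown above) of the same expression evaluated on a whole vector of inputs at once; the values are arbitrary:

import torch

# Same formula as above, but x and y now hold three values each.
x = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
y = torch.tensor([3.0, 4.0, -2.0], requires_grad=True)
z = (x * y + (x ** 2).clamp(min=0)).sum()   # sum() gives a scalar to backprop from
z.backward()
print(x.grad)   # dz/dx_i = y_i + 2*x_i  ->  tensor([ 7.,  2., -1.])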

Framework Timeline

2010  Theano: symbolic graphs (Bergstra et al.)
2014  Caffe: config-based CNNs (Jia et al.)
2015  TensorFlow: static graph (Google Brain)
2017  PyTorch: dynamic computation graph (Paszke et al., 2019)
2018  JAX: functional + JIT (Bradbury et al.)
Key innovation of PyTorch: define-by-run (dynamic graph) — the computation graph is built on the fly, just like our Value class. Debug with print(), use if/else in forward pass.
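A small sketch of what define-by-run buys you: ordinary Python control flow inside the computation, with autograd tracking whichever branch actually ran (the threshold 1.0 is purely illustrative):

import torch

x = torch.tensor(3.0, requires_grad=True)

# The graph is built as this code executes, so if/else and print() just work.
if x > 1.0:
    y = x * x        # this branch is recorded for x = 3.0
else:
    y = -x

print(y)             # tensor(9., grad_fn=<MulBackward0>)
y.backward()
print(x.grad)        # tensor(6.) = dy/dx for the branch that ran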

Tensors

A tensor is an n-dimensional array with autograd support. It tracks the computation graph and can compute gradients via .backward().
Operation | NumPy | PyTorch
Create array | np.array([1,2,3]) | torch.tensor([1,2,3])
Zeros | np.zeros((3,4)) | torch.zeros(3,4)
Random | np.random.randn(3,4) | torch.randn(3,4)
Matrix multiply | A @ B | A @ B
Element-wise | A * B | A * B
Shape | A.shape | A.shape
Gradients | n/a | requires_grad=True
GPU | n/a | .to('cuda')
Nearly identical API — but PyTorch adds autograd and GPU acceleration. Converting: torch.from_numpy(arr) and tensor.numpy().
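A brief sketch of the round trip; one caveat worth knowing is that torch.from_numpy shares memory with the source array rather than copying it:

import numpy as np
import torch

arr = np.zeros((3, 4), dtype=np.float32)
t = torch.from_numpy(arr)    # shares memory with arr; no copy
t[0, 0] = 1.0
print(arr[0, 0])             # 1.0, the NumPy array sees the change

back = t.numpy()             # also a view of the same memory (CPU tensors)
print(back.shape)            # (3, 4)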

Autograd on Tensors

Micrograd (scalars)

x = Value(2.0)
y = Value(3.0)
z = x * y + (x**2).relu()
z.backward()
# x.grad = 7.0 (dz/dx)
# One scalar at a time

PyTorch (tensors)

W = torch.randn(784, 128, requires_grad=True)
x = torch.randn(64, 784)
h = (x @ W).clamp(min=0)
loss = h.sum()
loss.backward()
# W.grad.shape = (784, 128)
Same algorithm, different scale: Micrograd computes $\partial z / \partial x$ for one scalar. PyTorch computes $\partial L / \partial W$ for a 784×128 weight matrix — 100,352 gradients in one call.
Under the hood: The same topological sort + reverse-mode AD we implemented in ch28. PyTorch just uses optimized C++/CUDA kernels for each operation.
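One way to convince yourself it really is the same math: compare autograd's result with the gradient derived by hand, exactly as we would have in ch28. A minimal sketch with small sizes so it runs instantly:

import torch

W = torch.randn(5, 4, requires_grad=True)
x = torch.randn(3, 5)

h = (x @ W).clamp(min=0)     # ReLU(xW)
loss = h.sum()
loss.backward()

# Hand-derived chain rule: dL/dW = x^T @ 1[xW > 0]
manual = x.t() @ (x @ W > 0).float()
print(torch.allclose(W.grad, manual))   # True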

nn.Module

nn.Module is the base class for all neural network components — parameter management, device transfer, serialization, composability.
class TwoLayerMLP(nn.Module):
    def __init__(self, in_dim, h, out_dim):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, h)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(h, out_dim)

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))
Compare with ch19: Our MultiLayerNetwork had the same structure — __init__ stored weights, forward() computed output. PyTorch adds .parameters(), .to(device), .state_dict().
Key pattern: Define layers in __init__, define computation in forward. PyTorch handles gradients, parameter tracking, and GPU transfer automatically.
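A short usage sketch of the class above, showing what nn.Module gives us for free; the 784/128/10 sizes are illustrative, not fixed by the class:

import torch
import torch.nn as nn

model = TwoLayerMLP(in_dim=784, h=128, out_dim=10)

# Parameter tracking: every nn.Linear registered in __init__ is found automatically.
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                          # 784*128 + 128 + 128*10 + 10 = 101,770

out = model(torch.randn(32, 784))        # calling the module runs forward()
print(out.shape)                         # torch.Size([32, 10])

print(list(model.state_dict().keys()))   # layer1.weight, layer1.bias, layer2.weight, layer2.bias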

 

Pause & Think

What does loss.backward() do?

Map each step to what our Value class did in Chapter 28:

Value.backward() (ch28)
1. Topological sort of graph
2. Set output.grad = 1.0
3. Reverse iterate: each node calls _backward()
4. Chain rule accumulates gradients

loss.backward() (PyTorch)
1. Topological sort of the tensor graph
2. Seed the output gradient with 1.0
3. Reverse iterate: each op calls its grad_fn
4. Chain rule accumulates .grad tensors
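You can actually see the tensor-side graph that step 3 walks. A small sketch poking at grad_fn (the exact node class names are internal details and may vary between PyTorch versions):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)

print(z.grad_fn)                  # e.g. <AddBackward0 ...>
print(z.grad_fn.next_functions)   # parents in the graph: the mul and clamp nodes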

Training Loop Anatomy

for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        y_pred = model(X_batch)            # 1. Forward pass (ch15-19)
        loss = criterion(y_pred, y_batch)  # 2. Compute loss (ch26)
        optimizer.zero_grad()              # 3. Clear gradients (ch28)
        loss.backward()                    # 4. Backward pass (ch16, ch28)
        optimizer.step()                   # 5. Update weights (ch27)
Line 1: Forward — same as our network.forward(x)
Line 2: Loss — cross-entropy from ch26
Line 3: Zero grads — we did this manually
Line 4: Backward — our Value.backward()
Line 5: Step — SGD/Adam from ch27
Five lines = entire training algorithm
Why zero_grad()? PyTorch accumulates gradients by default (useful for gradient accumulation with large batches). Without zeroing, gradients from previous iterations corrupt the update.
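A tiny sketch of what accumulation means in practice: calling backward() twice without zeroing simply adds the two gradients together.

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # no zero_grad in between
print(w.grad)        # tensor(6.), gradients accumulated rather than replaced

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor(0.)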

Dataset & DataLoader

Dataset: Stores samples and labels. Implements __len__ and __getitem__.
DataLoader: Wraps a Dataset to provide batching, shuffling, and parallel loading.
train_set = datasets.MNIST('data/', train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set,
                    batch_size=64,   # ch27 mini-batch
                    shuffle=True)

for X_batch, y_batch in loader:
    # X_batch: (64, 1, 28, 28)
    # y_batch: (64,)
    ...
From ch27: We split data into mini-batches with X[i:i+B]. DataLoader automates this + shuffling + parallel loading.
Key idea: Dataset = how to access one sample. DataLoader = how to batch, shuffle, and prefetch many samples efficiently.
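A minimal sketch of a custom Dataset to make the two roles concrete; the toy tensors here are made up purely for illustration:

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.X = torch.randn(n, 4)            # 100 samples, 4 features
        self.y = torch.randint(0, 2, (n,))    # binary labels

    def __len__(self):
        return len(self.X)                    # how many samples exist

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]       # how to fetch ONE sample

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)       # torch.Size([16, 4]) torch.Size([16])
    break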

Loss Functions & Optimizers

Loss Functions

PyTorch | From chapter
nn.MSELoss() | Ch 15 (GD)
nn.CrossEntropyLoss() | Ch 26 (softmax + CE)
nn.BCELoss() | Ch 17 (sigmoid)
Note: CrossEntropyLoss applies log-softmax internally and expects raw logits; don't apply softmax in your model!
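A short sketch of that equivalence: nn.CrossEntropyLoss on raw logits matches log-softmax followed by negative log-likelihood.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)             # raw model outputs, no softmax
targets = torch.tensor([3, 0, 7, 1])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))          # True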

Optimizers

PyTorch | From chapter
optim.SGD() | Ch 15 (vanilla GD)
optim.SGD(momentum=...) | Ch 27 (momentum)
optim.Adam() | Ch 27 (adaptive)
Same math, wrapped in PyTorch. Adam from ch27: $m_t, v_t$ with bias correction.
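For reference, these are the standard Adam update equations from ch27 that optim.Adam implements, with gradient $g_t$, learning rate $\alpha$, and defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad \theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$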
Putting it together:

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

MNIST Results: MLP

Training curves

[Plot: loss decreasing and test accuracy rising to ~97% over 10 epochs of training]

Architecture

Model: 784 → 128 → 64 → 10
Activation: ReLU
Optimizer: Adam (lr=0.001)
Batch size: 64
Parameters: 109,386
Training time: ~60 seconds (CPU)
97% accuracy on 10,000 test digits. Our from-scratch MLP in ch19 achieved similar results on synthetic data — now on real handwritten digits.
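A hedged sketch of a model matching the architecture listed above; the layer sizes and optimizer settings come from this page, everything else is a plausible filling-in rather than the book's exact script:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),                # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),           # logits; CrossEntropyLoss adds the softmax
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(sum(p.numel() for p in model.parameters()))   # 109,386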

From TinyCNN to PyTorch

Aspect | NumPy Conv2D (ch23) | nn.Conv2d (PyTorch)
Forward pass | Nested for loops in Python | Optimized C++/CUDA kernel
Backward pass | Manual gradient derivation | Automatic via autograd
Parameters | self.kernels numpy array | self.weight + self.bias
Batching | One image at a time | Batch dimension built-in
GPU support | None | .to('cuda')
Speed (MNIST) | ~15 min (CPU) | ~30 sec (CPU)

Our Conv2D (ch23)

for i in range(out_h):
    for j in range(out_w):
        for f in range(n_filters):
            patch = x[:, i:i+K, j:j+K]
            out[f, i, j] = np.sum(
                patch * self.kernels[f]) + self.biases[f]

PyTorch nn.Conv2d

self.conv1 = nn.Conv2d(
    in_channels=1,
    out_channels=32,
    kernel_size=3,
    padding=1,
)

out = self.conv1(x)  # That's it

CNN on Full MNIST

Architecture

Input (1×28×28)
→ Conv2d 32 @ 3×3 → ReLU → MaxPool 2×2 → (32×14×14)
→ Conv2d 64 @ 3×3 → ReLU → MaxPool 2×2 → (64×7×7)
→ Flatten (3136) → Linear 128 → Linear 10 → Softmax
Total: ~25,000 parameters. 98.5% accuracy, 5 epochs, ~30 sec.

PyTorch code

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Sequential(
            nn.Linear(64*7*7, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        x = self.features(x)           # (N, 64, 7, 7)
        return self.fc(x.flatten(1))   # (N, 10) logits
vs MLP: 98.5% vs 97%, fewer parameters, faster convergence. The convolutional inductive bias pays off.
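A quick sanity-check sketch of the model above, assuming the forward pass shown; the batch size 64 is arbitrary:

import torch

net = MNISTNet()
x = torch.randn(64, 1, 28, 28)     # a fake MNIST-shaped batch
print(net(x).shape)                # torch.Size([64, 10]), one logit per class
print(net.features(x).shape)       # torch.Size([64, 64, 7, 7]) before the flatten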

Learned Filters

First-layer 3×3 filters learned on MNIST digits — compare with ch25’s synthetic patterns:

[Conceptual heatmaps of typical learned 3×3 filters: horizontal, vertical, diagonal, and center detectors. Blue = positive weights, red = negative weights.]
Same edge detectors emerge! In ch25, our TinyCNN learned horizontal/vertical/diagonal filters on synthetic 8×8 patterns. On real MNIST digits, the same types of features appear — edges, corners, curves.
Hubel & Wiesel (1962): Biological simple cells in V1 are also oriented edge detectors. The CNN rediscovers these features through gradient descent alone.
Layer 2 filters combine Layer 1 edges into more complex patterns: loops, junctions, strokes — building a feature hierarchy.
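If you want to look at these yourself, the first-layer kernels of MNISTNet are just a weight tensor. A small sketch (actual visualization left to matplotlib or similar):

net = MNISTNet()   # or load a trained checkpoint via net.load_state_dict(...)
filters = net.features[0].weight.detach()   # shape (32, 1, 3, 3): 32 filters, 1 input channel
print(filters.shape)
print(filters[0, 0])   # one 3x3 kernel; positive vs negative weights map to the blue/red above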

Scaling Up

The journey so far: From TinyCNN (114 params, 8×8 patterns, minutes in NumPy) to PyTorch CNN (25K params, 60K images, 30 seconds). Same principles, industrial scale.
Model | Chapter | Parameters | Data | Time | Accuracy
Perceptron | Ch 4 | 3 | Synthetic 2D | instant | 100% (linear)
MLP (from scratch) | Ch 19 | ~1,200 | Synthetic 2D | seconds | ~95%
TinyCNN (from scratch) | Ch 23-25 | 114 | 8×8 synthetic | minutes | ~92%
MLP (PyTorch) | Ch 30 | 109K | MNIST (60K) | ~60 sec | 97.0%
CNN (PyTorch) | Ch 31 | 25K | MNIST (60K) | ~30 sec | 98.5%
The pattern: Each step up the ladder uses the same building blocks (linear transform, nonlinearity, gradient descent) at greater scale. Understanding from scratch enables mastery of industrial tools.

Framework Corner

The same CNN architecture in three frameworks — ideas transcend syntax:

PyTorch

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.fc = nn.Linear(32*13*13, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 32*13*13)
        return self.fc(x)

TensorFlow / Keras

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy')

JAX / Flax

class Net(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Conv(32, (3, 3))(x)
        x = nn.relu(x)
        x = nn.max_pool(x, (2, 2), strides=(2, 2))
        x = x.reshape((x.shape[0], -1))
        x = nn.Dense(10)(x)
        return x
Architecture ideas transcend frameworks. Conv2D, ReLU, MaxPool, Dense — these are universal building blocks. Learn the concepts (Parts I–VIII), then any framework becomes syntactic sugar.

Part IX Recap

Tensors PyTorch tensors = NumPy arrays + autograd + GPU acceleration
Autograd Same reverse-mode AD as our Value class (ch28) — but on tensors with optimized C++ kernels
nn.Module Base class for all models: __init__ defines layers, forward() defines computation
Training loop 5 lines: forward → loss → zero_grad → backward → step — maps to ch15–28
DataLoader Automated mini-batch SGD (ch27): batching, shuffling, parallel loading
MNIST CNN 98.5% accuracy, 25K params, 30 sec — same building blocks as TinyCNN (ch23–25), industrial scale
Next: Part X — Recurrent Neural Networks and LSTM. From processing images to processing sequences.