Part IX

Introduction
to PyTorch

From Micrograd to Industrial Scale

Chapters 29–31

The Bridge

Chapter 28: We built an autograd engine in 80 lines — the Value class with forward ops and backward propagation. PyTorch does the same thing — on tensors, on GPUs, in C++.

Our micrograd (ch28)

x = Value(2.0)
y = Value(3.0)
z = x * y + (x ** 2).relu()
z.backward()
print(x.grad)  # 7.0

PyTorch equivalent

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)
z.backward()
print(x.grad)  # tensor(7.)
Same computation graph, same gradients. The only difference: torch.Tensor operates on n-dimensional arrays at GPU speed, not single scalars.
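To make that difference concrete, here is a minimal sketch (assuming only the standard torch API shown above) of the same expression evaluated on a whole vector of inputs at once; the values are arbitrary:

import torch

# Same formula as above, but x and y now hold three values each.
x = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
y = torch.tensor([3.0, 4.0, -2.0], requires_grad=True)
z = (x * y + (x ** 2).clamp(min=0)).sum()   # sum() gives a scalar to backprop from
z.backward()
print(x.grad)   # dz/dx_i = y_i + 2*x_i  ->  tensor([ 7.,  2., -1.])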

Framework Timeline

2010  Theano: symbolic graphs (Bergstra et al.)
2014  Caffe: config-based CNNs (Jia et al.)
2015  TensorFlow: static graph (Google Brain)
2017  PyTorch: dynamic computation graph (Paszke et al., 2019)
2018  JAX: functional + JIT (Bradbury et al.)
Key innovation of PyTorch: define-by-run (dynamic graph) — the computation graph is built on the fly, just like our Value class. Debug with print(), use if/else in forward pass.
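A small sketch of what define-by-run buys you: ordinary Python control flow inside the computation, with autograd tracking whichever branch actually ran (the threshold 1.0 is purely illustrative):

import torch

x = torch.tensor(3.0, requires_grad=True)

# The graph is built as this code executes, so if/else and print() just work.
if x > 1.0:
    y = x * x        # this branch is recorded for x = 3.0
else:
    y = -x

print(y)             # tensor(9., grad_fn=<MulBackward0>)
y.backward()
print(x.grad)        # tensor(6.) = dy/dx for the branch that ran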

Tensors

A tensor is an n-dimensional array with autograd support. It tracks the computation graph and can compute gradients via .backward().
Operation | NumPy | PyTorch
Create array | np.array([1,2,3]) | torch.tensor([1,2,3])
Zeros | np.zeros((3,4)) | torch.zeros(3,4)
Random | np.random.randn(3,4) | torch.randn(3,4)
Matrix multiply | A @ B | A @ B
Element-wise | A * B | A * B
Shape | A.shape | A.shape
Gradients | n/a | requires_grad=True
GPU | n/a | .to('cuda')
Nearly identical API — but PyTorch adds autograd and GPU acceleration. Converting: torch.from_numpy(arr) and tensor.numpy().
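A brief sketch of the round trip; one caveat worth knowing is that torch.from_numpy shares memory with the source array rather than copying it:

import numpy as np
import torch

arr = np.zeros((3, 4), dtype=np.float32)
t = torch.from_numpy(arr)    # shares memory with arr; no copy
t[0, 0] = 1.0
print(arr[0, 0])             # 1.0, the NumPy array sees the change

back = t.numpy()             # also a view of the same memory (CPU tensors)
print(back.shape)            # (3, 4)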

Autograd on Tensors

Micrograd (scalars)

x = Value(2.0)
y = Value(3.0)
z = x * y + (x**2).relu()
z.backward()
# x.grad = 7.0 (dz/dx)
# One scalar at a time

PyTorch (tensors)

W = torch.randn(784, 128, requires_grad=True)
x = torch.randn(64, 784)
h = (x @ W).clamp(min=0)
loss = h.sum()
loss.backward()
# W.grad.shape = (784, 128)
Same algorithm, different scale: Micrograd computes $\partial z / \partial x$ for one scalar. PyTorch computes $\partial L / \partial W$ for a 784×128 weight matrix — 100,352 gradients in one call.
Under the hood: The same topological sort + reverse-mode AD we implemented in ch28. PyTorch just uses optimized C++/CUDA kernels for each operation.
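One way to convince yourself it really is the same math: compare autograd's result with the gradient derived by hand, exactly as we would have in ch28. A minimal sketch with small sizes so it runs instantly:

import torch

W = torch.randn(5, 4, requires_grad=True)
x = torch.randn(3, 5)

h = (x @ W).clamp(min=0)     # ReLU(xW)
loss = h.sum()
loss.backward()

# Hand-derived chain rule: dL/dW = x^T @ 1[xW > 0]
manual = x.t() @ (x @ W > 0).float()
print(torch.allclose(W.grad, manual))   # True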

nn.Module

nn.Module is the base class for all neural network components — parameter management, device transfer, serialization, composability.
class TwoLayerMLP(nn.Module):
    def __init__(self, in_dim, h, out_dim):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, h)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(h, out_dim)

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))
Compare with ch19: Our MultiLayerNetwork had the same structure — __init__ stored weights, forward() computed output. PyTorch adds .parameters(), .to(device), .state_dict().
Key pattern: Define layers in __init__, define computation in forward. PyTorch handles gradients, parameter tracking, and GPU transfer automatically.
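A short usage sketch of the class above, showing what nn.Module gives us for free; the 784/128/10 sizes are illustrative, not fixed by the class:

import torch
import torch.nn as nn

model = TwoLayerMLP(in_dim=784, h=128, out_dim=10)

# Parameter tracking: every nn.Linear registered in __init__ is found automatically.
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                          # 784*128 + 128 + 128*10 + 10 = 101,770

out = model(torch.randn(32, 784))        # calling the module runs forward()
print(out.shape)                         # torch.Size([32, 10])

print(list(model.state_dict().keys()))   # layer1.weight, layer1.bias, layer2.weight, layer2.bias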

 

Pause & Think

What does loss.backward() do?

Map each step to what our Value class did in Chapter 28:

Value.backward() (ch28)
1. Topological sort of graph
2. Set output.grad = 1.0
3. Reverse iterate: each node calls _backward()
4. Chain rule accumulates gradients

loss.backward() (PyTorch)
1. Topological sort of the tensor graph
2. Seed the output gradient with 1.0
3. Reverse iterate: each op calls its grad_fn
4. Chain rule accumulates .grad tensors
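You can actually see the tensor-side graph that step 3 walks. A small sketch poking at grad_fn (the exact node class names are internal details and may vary between PyTorch versions):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + (x ** 2).clamp(min=0)

print(z.grad_fn)                  # e.g. <AddBackward0 ...>
print(z.grad_fn.next_functions)   # parents in the graph: the mul and clamp nodes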

Training Loop Anatomy

for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        y_pred = model(X_batch)            # 1. Forward pass (ch15-19)
        loss = criterion(y_pred, y_batch)  # 2. Compute loss (ch26)
        optimizer.zero_grad()              # 3. Clear gradients (ch28)
        loss.backward()                    # 4. Backward pass (ch16, ch28)
        optimizer.step()                   # 5. Update weights (ch27)
Line 1: Forward — same as our network.forward(x)
Line 2: Loss — cross-entropy from ch26
Line 3: Zero grads — we did this manually
Line 4: Backward — our Value.backward()
Line 5: Step — SGD/Adam from ch27
Five lines = entire training algorithm
Why zero_grad()? PyTorch accumulates gradients by default (useful for gradient accumulation with large batches). Without zeroing, gradients from previous iterations corrupt the update.
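A tiny sketch of what accumulation means in practice: calling backward() twice without zeroing simply adds the two gradients together.

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # no zero_grad in between
print(w.grad)        # tensor(6.), gradients accumulated rather than replaced

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor(0.)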

Dataset & DataLoader

Dataset: Stores samples and labels. Implements __len__ and __getitem__.
DataLoader: Wraps a Dataset to provide batching, shuffling, and parallel loading.
train_set = datasets.MNIST('data/', train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set,
                    batch_size=64,   # ch27 mini-batch
                    shuffle=True)

for X_batch, y_batch in loader:
    # X_batch: (64, 1, 28, 28)
    # y_batch: (64,)
    ...
From ch27: We split data into mini-batches with X[i:i+B]. DataLoader automates this + shuffling + parallel loading.
Key idea: Dataset = how to access one sample. DataLoader = how to batch, shuffle, and prefetch many samples efficiently.
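A minimal sketch of a custom Dataset to make the two roles concrete; the toy tensors here are made up purely for illustration:

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.X = torch.randn(n, 4)            # 100 samples, 4 features
        self.y = torch.randint(0, 2, (n,))    # binary labels

    def __len__(self):
        return len(self.X)                    # how many samples exist

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]       # how to fetch ONE sample

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)       # torch.Size([16, 4]) torch.Size([16])
    break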

Loss Functions & Optimizers

Loss Functions

PyTorch | From chapter
nn.MSELoss() | Ch 15 (GD)
nn.CrossEntropyLoss() | Ch 26 (softmax + CE)
nn.BCELoss() | Ch 17 (sigmoid)
Note: CrossEntropyLoss applies log-softmax internally and expects raw logits; don't apply softmax in your model!
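A short sketch of that equivalence: nn.CrossEntropyLoss on raw logits matches log-softmax followed by negative log-likelihood.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)             # raw model outputs, no softmax
targets = torch.tensor([3, 0, 7, 1])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))          # True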

Optimizers

PyTorch | From chapter
optim.SGD() | Ch 15 (vanilla GD)
optim.SGD(momentum=...) | Ch 27 (momentum)
optim.Adam() | Ch 27 (adaptive)
Same math, wrapped in PyTorch. Adam from ch27: $m_t, v_t$ with bias correction.
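For reference, these are the standard Adam update equations from ch27 that optim.Adam implements, with gradient $g_t$, learning rate $\alpha$, and defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad \theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$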
Putting it together:

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

MNIST Results: MLP

Training curves

[Plot: loss decreasing and test accuracy rising to ~97% over 10 epochs of training]

Architecture

Model: 784 → 128 → 64 → 10
Activation: ReLU
Optimizer: Adam (lr=0.001)
Batch size: 64
Parameters: 109,386
Training time: ~60 seconds (CPU)
97% accuracy on 10,000 test digits. Our from-scratch MLP in ch19 achieved similar results on synthetic data — now on real handwritten digits.
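A hedged sketch of a model matching the architecture listed above; the layer sizes and optimizer settings come from this page, everything else is a plausible filling-in rather than the book's exact script:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),                # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),           # logits; CrossEntropyLoss adds the softmax
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(sum(p.numel() for p in model.parameters()))   # 109,386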

From TinyCNN to PyTorch

Aspect | NumPy Conv2D (ch23) | nn.Conv2d (PyTorch)
Forward pass | Nested for loops in Python | Optimized C++/CUDA kernel
Backward pass | Manual gradient derivation | Automatic via autograd
Parameters | self.kernels numpy array | self.weight + self.bias
Batching | One image at a time | Batch dimension built-in
GPU support | None | .to('cuda')
Speed (MNIST) | ~15 min (CPU) | ~30 sec (CPU)

Our Conv2D (ch23)

for i in range(out_h):
    for j in range(out_w):
        for f in range(n_filters):
            patch = x[:, i:i+K, j:j+K]
            out[f, i, j] = np.sum(
                patch * self.kernels[f]) + self.biases[f]

PyTorch nn.Conv2d

self.conv1 = nn.Conv2d(
    in_channels=1,
    out_channels=32,
    kernel_size=3,
    padding=1,
)

out = self.conv1(x)  # That's it

CNN on Full MNIST

Architecture

Input (1×28×28)
→ Conv2d 32 @ 3×3 → ReLU → MaxPool 2×2 → (32×14×14)
→ Conv2d 64 @ 3×3 → ReLU → MaxPool 2×2 → (64×7×7)
→ Flatten (3136) → Linear 128 → Linear 10 → Softmax
Total: ~25,000 parameters. 98.5% accuracy, 5 epochs, ~30 sec.

PyTorch code

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Sequential(
            nn.Linear(64*7*7, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        x = self.features(x)           # (N, 64, 7, 7)
        return self.fc(x.flatten(1))   # (N, 10) logits
vs MLP: 98.5% vs 97%, fewer parameters, faster convergence. The convolutional inductive bias pays off.
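A quick sanity-check sketch of the model above, assuming the forward pass shown; the batch size 64 is arbitrary:

import torch

net = MNISTNet()
x = torch.randn(64, 1, 28, 28)     # a fake MNIST-shaped batch
print(net(x).shape)                # torch.Size([64, 10]), one logit per class
print(net.features(x).shape)       # torch.Size([64, 64, 7, 7]) before the flatten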

Learned Filters

First-layer 3×3 filters learned on MNIST digits — compare with ch25’s synthetic patterns:

[Conceptual heatmaps of typical learned 3×3 filters: horizontal, vertical, diagonal, and center detectors. Blue = positive weights, red = negative weights.]
Same edge detectors emerge! In ch25, our TinyCNN learned horizontal/vertical/diagonal filters on synthetic 8×8 patterns. On real MNIST digits, the same types of features appear — edges, corners, curves.
Hubel & Wiesel (1962): Biological simple cells in V1 are also oriented edge detectors. The CNN rediscovers these features through gradient descent alone.
Layer 2 filters combine Layer 1 edges into more complex patterns: loops, junctions, strokes — building a feature hierarchy.
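If you want to look at these yourself, the first-layer kernels of MNISTNet are just a weight tensor. A small sketch (actual visualization left to matplotlib or similar):

net = MNISTNet()   # or load a trained checkpoint via net.load_state_dict(...)
filters = net.features[0].weight.detach()   # shape (32, 1, 3, 3): 32 filters, 1 input channel
print(filters.shape)
print(filters[0, 0])   # one 3x3 kernel; positive vs negative weights map to the blue/red above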

Scaling Up

The journey so far: From TinyCNN (114 params, 8×8 patterns, minutes in NumPy) to PyTorch CNN (25K params, 60K images, 30 seconds). Same principles, industrial scale.
Model | Chapter | Parameters | Data | Time | Accuracy
Perceptron | Ch 4 | 3 | Synthetic 2D | instant | 100% (linear)
MLP (from scratch) | Ch 19 | ~1,200 | Synthetic 2D | seconds | ~95%
TinyCNN (from scratch) | Ch 23-25 | 114 | 8×8 synthetic | minutes | ~92%
MLP (PyTorch) | Ch 30 | 109K | MNIST (60K) | ~60 sec | 97.0%
CNN (PyTorch) | Ch 31 | 25K | MNIST (60K) | ~30 sec | 98.5%
The pattern: Each step up the ladder uses the same building blocks (linear transform, nonlinearity, gradient descent) at greater scale. Understanding from scratch enables mastery of industrial tools.

Framework Corner

The same CNN architecture in three frameworks — ideas transcend syntax:

PyTorch

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.fc = nn.Linear(32*13*13, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 32*13*13)
        return self.fc(x)

TensorFlow / Keras

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy')

JAX / Flax

class Net(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Conv(32, (3, 3))(x)
        x = nn.relu(x)
        x = nn.max_pool(x, (2, 2), strides=(2, 2))
        x = x.reshape((x.shape[0], -1))
        x = nn.Dense(10)(x)
        return x
Architecture ideas transcend frameworks. Conv2D, ReLU, MaxPool, Dense — these are universal building blocks. Learn the concepts (Parts I–VIII), then any framework becomes syntactic sugar.

Part IX Recap

Tensors PyTorch tensors = NumPy arrays + autograd + GPU acceleration
Autograd Same reverse-mode AD as our Value class (ch28) — but on tensors with optimized C++ kernels
nn.Module Base class for all models: __init__ defines layers, forward() defines computation
Training loop 5 lines: forward → loss → zero_grad → backward → step — maps to ch15–28
DataLoader Automated mini-batch SGD (ch27): batching, shuffling, parallel loading
MNIST CNN 98.5% accuracy, 25K params, 30 sec — same building blocks as TinyCNN (ch23–25), industrial scale
Next: Part X — Recurrent Neural Networks and LSTM. From processing images to processing sequences.