Part VII

Convolutional
Neural Networks

Learning to See

Chapters 21–25

Warning The Curse of Full Connectivity

  • A 28×28 grayscale image with 100 hidden neurons: 78,500 weights
  • A 224×224 RGB image with 1000 hidden neurons: 150 million weights
  • Most weights are redundant — nearby pixels share local structure
| Image Size | FC Weights (100 hidden) | Conv Weights (3×3, 32 filters) |
|---|---|---|
| 28 × 28 × 1 | 78,500 | 320 |
| 64 × 64 × 3 | 1,228,900 | 896 |
| 224 × 224 × 3 | 15,052,900 | 896 |
Key insight: Conv weights depend only on filter size and count — not on image resolution.
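To make the arithmetic concrete, here is a small Python sketch reproducing the table's counts (weights plus biases; the function names are ours, not from any library):

```python
# Parameter counts behind the table above (weights + biases).
def fc_weights(h, w, c, hidden=100):
    """Fully connected layer: every input pixel connects to every hidden neuron."""
    return h * w * c * hidden + hidden                 # weights + biases

def conv_weights(k=3, in_channels=1, filters=32):
    """Convolutional layer: parameters depend only on kernel size and count."""
    return filters * (k * k * in_channels) + filters   # weights + biases

print(fc_weights(28, 28, 1))              # 78,500
print(conv_weights(k=3, in_channels=1))   # 320
print(fc_weights(224, 224, 3))            # 15,052,900
print(conv_weights(k=3, in_channels=3))   # 896
```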

Definition Three Key Ideas

1. Local receptive fields: Each neuron connects to only a small spatial patch of the input, not the entire image. A 3×3 filter sees 9 pixels instead of all 784.
2. Weight sharing: The same filter (same weights) is applied at every spatial position. One set of parameters scans the entire image.
3. Translation equivariance: If the input shifts, the feature map shifts by the same amount. A vertical edge detector finds vertical edges regardless of where they appear.
Together: These three ideas form the inductive bias of CNNs — the assumption that spatial structure matters and patterns can appear anywhere.
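Translation equivariance can be checked numerically. A minimal sketch, assuming circular shifts and wrap-around padding so the property holds exactly (SciPy's `ndimage.correlate` does the filtering):

```python
import numpy as np
from scipy.ndimage import correlate

# Shift-then-filter equals filter-then-shift for circular shifts + wrap padding.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy "image"
k = rng.standard_normal((3, 3))   # toy filter

shift_then_filter = correlate(np.roll(x, 2, axis=1), k, mode='wrap')
filter_then_shift = np.roll(correlate(x, k, mode='wrap'), 2, axis=1)

assert np.allclose(shift_then_filter, filter_then_shift)  # feature map shifts too
```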

Historical Timeline

  • 1962 Hubel & Wiesel: simple and complex cells in the visual cortex
  • 1980 Fukushima: Neocognitron, hierarchical feature detection
  • 1989 LeCun: LeNet, convolutional layers trained with backprop
  • 2012 Krizhevsky et al.: AlexNet, deep CNNs at scale with GPUs and data
50 years from neuroscience to engineering: Hubel & Wiesel's discovery of oriented edge detectors in cat visual cortex inspired architectures that now power computer vision.

The CNN Pipeline

Input Image → Conv (feature extraction) → ReLU (nonlinearity) → Pool (downsample) → Conv (higher-level features) → ReLU (nonlinearity) → Pool (downsample) → Flatten + Dense → Softmax (classes). The convolutional stages form the learned feature extractor; the dense layers and softmax form the classifier.

Definition Cross-Correlation

Discrete cross-correlation (called "convolution" in DL): $$y_{i,j} = \sum_{u=0}^{K-1}\sum_{v=0}^{K-1} x_{i+u,\,j+v} \cdot k_{u,v} + b$$
  • The kernel slides across the input
  • At each position: element-wise multiply, then sum
  • One kernel produces one feature map
  • $F$ kernels produce $F$ feature maps (output channels)
Example: take the 3×3 patch of a 5×5 input equal to [[1, 0, 1], [0, 1, 0], [1, 0, 1]] and the 3×3 kernel [[1, 0, 1], [0, 1, 0], [1, 0, 1]]. Elementwise products: 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 = 5, so the output at that position is 5.
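A minimal NumPy sketch of this operation, assuming stride 1 and no padding; `correlate2d` is our name, not a library function:

```python
import numpy as np

def correlate2d(x, k, b=0.0):
    """Slide kernel k over input x; elementwise multiply and sum at each position."""
    K = k.shape[0]
    H, W = x.shape
    out = np.empty((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+K, j:j+K] * k) + b
    return out
```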

Convolution by Hand

5×5 Input

2 0 1 3 0
1 1 2 0 1
0 3 1 1 2
1 0 2 3 1
2 1 0 1 0

3×3 Kernel

1 0 -1
1 0 -1
1 0 -1
Output at position (1, 1) (0-indexed: the patch spans rows 1–3 and columns 1–3):
1·1 + 2·0 + 0·(−1)
+ 3·1 + 1·0 + 1·(−1)
+ 0·1 + 2·0 + 3·(−1)
= 1 + 0 + 0 + 3 + 0 − 1 + 0 + 0 − 3 = 0
This kernel is a vertical edge detector: it responds strongly when there's a brightness difference between the left and right sides of the patch.
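The same arithmetic, checked with NumPy:

```python
import numpy as np

# The 5x5 input and 3x3 kernel from the worked example above.
x = np.array([[2, 0, 1, 3, 0],
              [1, 1, 2, 0, 1],
              [0, 3, 1, 1, 2],
              [1, 0, 2, 3, 1],
              [2, 1, 0, 1, 0]])
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

# Patch at 0-indexed position (1, 1): rows 1-3, columns 1-3.
print(np.sum(x[1:4, 1:4] * k))   # 0, matching the hand computation
```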

Theorem Output Size Formula

For an input of width $W$, kernel size $K$, padding $P$, and stride $S$: $$\text{output size} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

No padding

$W = 5, K = 3, P = 0, S = 1$

$\lfloor(5 - 3 + 0)/1\rfloor + 1 = \mathbf{3}$

Output shrinks by $K-1$

Same padding

$W = 5, K = 3, P = 1, S = 1$

$\lfloor(5 - 3 + 2)/1\rfloor + 1 = \mathbf{5}$

Output = Input size

Stride = 2

$W = 6, K = 3, P = 0, S = 2$

$\lfloor(6 - 3 + 0)/2\rfloor + 1 = \mathbf{2}$

Stride halves output

Common trap: Forgetting the floor operation when stride > 1. If the kernel doesn't fit evenly, the last position is skipped.
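A small helper (our naming) makes the floor explicit via integer division and reproduces the three cases above:

```python
# Output size formula, including the floor the "common trap" warns about.
def conv_output_size(w, k, p=0, s=1):
    return (w - k + 2 * p) // s + 1   # integer division applies the floor

print(conv_output_size(5, 3))         # 3  (no padding)
print(conv_output_size(5, 3, p=1))    # 5  (same padding)
print(conv_output_size(6, 3, s=2))    # 2  (stride skips the last position)
```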

Pause & Think

Input is 8×8, kernel is 3×3, padding = 0, stride = 1.

What is the output size?

Now: what if we add padding = 1?

Answer 1: $\lfloor(8 - 3 + 0)/1\rfloor + 1 = 6 \times 6$

Answer 2: $\lfloor(8 - 3 + 2)/1\rfloor + 1 = 8 \times 8$ (same padding!)

ReLU and Max Pooling

ReLU: $\max(0, x)$

Before ReLU:
3 -1 2 -4
-2 5 1 -3
4 -1 6 2
-5 3 -2 1
After ReLU:
3 0 2 0
0 5 1 0
4 0 6 2
0 3 0 1
Negatives → 0, positives unchanged.

Max Pooling (2×2, stride 2)

4×4 Input:
3 0 2 0
0 5 1 0
4 0 0 2
0 3 0 1
2×2 Output (max of each 2×2 block):
5 2
4 2
4×4 → 2×2 (4× smaller).
ReLU introduces nonlinearity (stacking linear convolutions without it would still be linear). Max pooling reduces spatial size and provides a degree of translation invariance.
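Minimal NumPy sketches of both operations, assuming 2×2 pooling with stride 2 and even input sizes:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                    # negatives -> 0, positives unchanged

def max_pool_2x2(x):
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)   # group into 2x2 tiles
    return blocks.max(axis=(1, 3))             # max of each tile

x = np.array([[3, 0, 2, 0],
              [0, 5, 1, 0],
              [4, 0, 0, 2],
              [0, 3, 0, 1]])
print(max_pool_2x2(x))   # [[5 2], [4 2]]
```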

TinyCNN Architecture

Input (1×8×8, 64 pixels) → Conv2D (3 filters, 3×3; output 3×6×6; 30 params) → ReLU (3×6×6; 0 params) → MaxPool 2×2 (3×3×3; 0 params) → Flatten (27; 0 params) → Dense 27→3 (3 classes; 84 params) → Softmax (probabilities). Total: 114 parameters.

Layer Summary

| Layer | Output Shape | Parameters | Role |
|---|---|---|---|
| Input | 1 × 8 × 8 | 0 | Grayscale pixels |
| Conv2D (3 filters, 3×3) | 3 × 6 × 6 | 30 | Feature detectors |
| ReLU | 3 × 6 × 6 | 0 | Nonlinearity |
| MaxPool (2×2) | 3 × 3 × 3 | 0 | Spatial reduction |
| Flatten | 27 | 0 | Reshape for dense |
| Dense + bias | 3 | 84 | Classification |
| Total | | 114 | |
Parameter breakdown: Conv layer has $3 \times (1 \times 3 \times 3) + 3 = 30$ params (3 filters × 9 kernel weights + 3 biases). Dense layer has $27 \times 3 + 3 = 84$ params. The dense layer uses 73% of all parameters despite being "just" a classifier.
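A forward-pass sketch matching the table, with random placeholder weights (the shapes and parameter counts come from the text; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.standard_normal((3, 3, 3)) * 0.1   # 3 filters, 3x3: 27 weights
conv_b  = np.zeros(3)                            # +3 biases -> 30 conv params
W       = rng.standard_normal((3, 27)) * 0.1     # dense 27 -> 3: 81 weights
dense_b = np.zeros(3)                            # +3 biases -> 84 dense params

def forward(x):                                  # x: (8, 8) grayscale image
    # Conv2D, 3 filters, no padding: (1, 8, 8) -> (3, 6, 6)
    fmap = np.empty((3, 6, 6))
    for f in range(3):
        for i in range(6):
            for j in range(6):
                fmap[f, i, j] = np.sum(x[i:i+3, j:j+3] * filters[f]) + conv_b[f]
    fmap = np.maximum(0, fmap)                   # ReLU: (3, 6, 6)
    pooled = fmap.reshape(3, 3, 2, 3, 2).max(axis=(2, 4))  # MaxPool: (3, 3, 3)
    z = W @ pooled.reshape(27) + dense_b         # Flatten + Dense: 27 -> 3
    e = np.exp(z - z.max())                      # numerically stable softmax
    return e / e.sum()

print(forward(rng.standard_normal((8, 8))))     # 3 class probabilities
```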

Synthetic Dataset: Three Pattern Classes

Three pattern classes: vertical stripes (class 0), horizontal stripes (class 1), and diagonal stripes (class 2).

Dataset design: Each 8×8 image is generated with random stripe position/width and Gaussian noise. Simple enough for a 114-parameter CNN, but non-trivial for a linear classifier.
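A plausible generator in the spirit of this description; the exact stripe positions, widths, and noise level are our assumptions, not the book's:

```python
import numpy as np

def make_sample(label, rng, noise=0.3):
    """One 8x8 image: a stripe of the given orientation plus Gaussian noise."""
    img = np.zeros((8, 8))
    pos = rng.integers(1, 7)       # random stripe position (assumed range)
    width = rng.integers(1, 3)     # random stripe width (assumed range)
    if label == 0:                               # vertical stripe
        img[:, pos:pos + width] = 1.0
    elif label == 1:                             # horizontal stripe
        img[pos:pos + width, :] = 1.0
    else:                                        # diagonal stripe
        for off in range(width):
            idx = np.arange(8 - off)
            img[idx, idx + off] = 1.0
    return img + noise * rng.standard_normal((8, 8))

rng = np.random.default_rng(0)
X = [make_sample(c, rng) for c in (0, 1, 2)]    # one sample per class
```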

Theorem Backprop Through Convolution

Kernel gradient: Correlate input with upstream gradient: $$\frac{\partial \mathcal{L}}{\partial k_{u,v}} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y_{i,j}} \cdot x_{i+u,\, j+v}$$
Input gradient: Full convolution with flipped kernel: $$\frac{\partial \mathcal{L}}{\partial x_{m,n}} = \sum_{u,v} \frac{\partial \mathcal{L}}{\partial y_{m-u,\,n-v}} \cdot k_{u,v}$$
Key insight: The backward pass through a convolution layer is itself a convolution! CNNs fit naturally into the backpropagation framework — every operation is differentiable and composable via the chain rule.
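A direct transcription of the two formulas into NumPy (stride 1, no padding; names are ours):

```python
import numpy as np

def conv_backward(x, k, dL_dy):
    """dL_dy is the upstream gradient, shape (H-K+1, W-K+1)."""
    K = k.shape[0]
    H_out, W_out = dL_dy.shape
    dL_dk = np.zeros_like(k, dtype=float)
    dL_dx = np.zeros_like(x, dtype=float)
    # Kernel gradient: correlate the input with the upstream gradient.
    for u in range(K):
        for v in range(K):
            dL_dk[u, v] = np.sum(dL_dy * x[u:u+H_out, v:v+W_out])
    # Input gradient: scatter each output gradient back through the kernel
    # (equivalent to a full convolution with the flipped kernel).
    for i in range(H_out):
        for j in range(W_out):
            dL_dx[i:i+K, j:j+K] += dL_dy[i, j] * k
    return dL_dk, dL_dx
```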

Pooling Backward: Gradient Routing

Forward pass: the 4×4 input
3 1 5 0
7 2 4 1
8 0 3 2
1 6 0 9
max-pools to [[7, 5], [8, 9]]. Backward pass: the upstream gradient [[g1, g2], [g3, g4]] is routed to the input so that only the winners receive gradient:
0 0 g2 0
g1 0 0 0
g3 0 0 0
0 0 0 g4
Max pooling is not differentiable at ties, but in practice ties are rare. The gradient is routed entirely to the position that held the maximum value. All non-winner positions receive zero gradient.
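A sketch of the routing rule for 2×2 pooling with stride 2 (note: on a tie, this version sends the gradient to every tied position, whereas the text describes a single winner):

```python
import numpy as np

def max_pool_backward(x, dL_dy):
    """Route each upstream gradient entry to the max position of its 2x2 block."""
    dL_dx = np.zeros_like(x, dtype=float)
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            block = x[2*i:2*i+2, 2*j:2*j+2]
            mask = (block == block.max())          # winner(s) of this block
            dL_dx[2*i:2*i+2, 2*j:2*j+2] = mask * dL_dy[i, j]
    return dL_dx
```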

Critical Always Gradient Check!

ALWAYS verify CNN gradients numerically. Conv backward passes involve indexing and flipping — bugs are easy to introduce and hard to spot.
Two-sided finite difference: $$\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\theta_i + \epsilon) - \mathcal{L}(\theta_i - \epsilon)}{2\epsilon}$$ with $\epsilon \approx 10^{-5}$.
Relative error: $$e_{\mathrm{rel}} = \frac{\|\nabla_{\mathrm{bp}} - \nabla_{\mathrm{num}}\|}{\|\nabla_{\mathrm{bp}}\| + \|\nabla_{\mathrm{num}}\|} < 10^{-5}$$
Interpretation:
  • $e_{\mathrm{rel}} < 10^{-7}$: perfect
  • $e_{\mathrm{rel}}$ between $10^{-7}$ and $10^{-5}$: acceptable
  • $e_{\mathrm{rel}} > 10^{-3}$: bug in your backprop
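A generic checker implementing both formulas; `loss_fn` and the calling convention are our assumptions:

```python
import numpy as np

def grad_check(loss_fn, theta, analytic_grad, eps=1e-5):
    """Two-sided finite differences vs. analytic gradient; returns relative error."""
    numeric = np.zeros_like(theta, dtype=float)
    it = np.nditer(theta, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        old = theta[idx]
        theta[idx] = old + eps; plus = loss_fn(theta)
        theta[idx] = old - eps; minus = loss_fn(theta)
        theta[idx] = old                           # restore the parameter
        numeric[idx] = (plus - minus) / (2 * eps)
    num = np.linalg.norm(analytic_grad - numeric)
    den = np.linalg.norm(analytic_grad) + np.linalg.norm(numeric)
    return num / den                               # < 1e-5 is acceptable
```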

Training Curves

Cross-entropy loss and classification accuracy over 80 training epochs, plotted for both the train and validation splits.
Convergence: TinyCNN reaches >99% accuracy within ~40 epochs. Train and validation curves track closely — no overfitting, because the model is small (114 params) relative to the task structure.

Filter Evolution During Training

3 learned kernels at different epochs (random init → specialized edge detectors)

Snapshots at epochs 0, 5, 15, 40, and 80 show F1 converging to a vertical detector, F2 to horizontal, and F3 to diagonal.

Comparison CNN vs MLP

| Property | CNN (TinyCNN) | MLP (1 hidden layer) |
|---|---|---|
| Architecture | Conv + Pool + Dense | Dense only |
| Parameters | 114 | 1,242 |
| Accuracy (3 classes) | > 99% | > 99% |
| Weight sharing | Yes (filters) | No |
| Translation equivariance | Yes | No |

Parameter ratio: the MLP uses 10.9× more parameters for the same accuracy.
Same accuracy, very different efficiency. The CNN's inductive bias (locality + weight sharing) lets it solve this spatial task with far fewer parameters. But both models achieve >99% — on this toy task, the MLP has enough capacity to memorize the patterns.

MLP Capacity Sweep

Accuracy vs. hidden-layer width (1, 4, 8, 12, 18), showing min/max over 5 seeds against the 114-parameter CNN baseline.
Width 1 fails (only 1 hidden neuron cannot separate 3 classes). Width 4 starts working but is seed-dependent. Width 8+ reliably matches CNN accuracy — but uses 6–12× more parameters.

Four-Class Extension: Adding "Dot"

Adding a fourth class (centered dot pattern) tests whether architectures generalize.

| Model | Filters / Width | Params | Accuracy |
|---|---|---|---|
| CNN | 3 filters | 117 | > 99% |
| MLP (narrow) | 8 hidden | 555 | ~92% |
| MLP (wide) | 18 hidden | 1,221 | > 99% |
CNN's feature extractor does not need to grow. The same 3 filters suffice — only the dense layer gains one more output neuron. The MLP needs substantially more capacity to handle a new class.

Inference Trace: One Sample Through the Pipeline

Input (1×8×8) → Conv maps (3×6×6): filter 1 strong, filter 2 weak, filter 3 medium → ReLU maps (3×6×6): the strong activation survives, the rest are sparse → Pooled (3×3×3) → Flatten (27) → Dense → Softmax: 0.94 vertical, 0.03 horizontal, 0.03 diagonal.
Filter 1 (the vertical detector) fires strongly; its large activation survives ReLU and pooling, and the dense layer maps it to the "vertical" class. The other filters contribute only weak signals: the network is confident this is a vertical stripe pattern.

What Filters Learn

Hubel & Wiesel (1962): Each CNN filter functions like a simple cell in the visual cortex — responding to oriented edges. Pooling corresponds to complex cells, providing position tolerance.

Biological Visual Cortex

  • Simple cells: edges at specific orientation & position
  • Complex cells: orientation-selective, position-tolerant
  • Hierarchy: V1 → V2 → V4 → IT

Convolutional Neural Network

  • Conv filters: detect edges everywhere (weight sharing)
  • Pooling: spatial tolerance via local summaries
  • Hierarchy: edges → textures → parts → objects
Convergence of biology and engineering: Fukushima and LeCun drew only loose inspiration from cortical circuits rather than attempting to replicate them, yet the learned representations converge with biological ones, suggesting these principles are near-optimal for visual processing.

Scaling Up: From TinyCNN to Modern CNNs

Same principles, different scale. Every architecture below uses the same building blocks: convolution, nonlinearity, pooling, and dense classification — exactly what TinyCNN implements from scratch.
| Architecture | Year | Parameters | Depth | Key Innovation |
|---|---|---|---|---|
| TinyCNN (ours) | 2025 | 114 | 2 | Educational from-scratch impl. |
| LeNet-5 | 1998 | 60 K | 5 | First practical CNN (digits) |
| AlexNet | 2012 | 60 M | 8 | GPU training, ReLU, dropout |
| VGG-16 | 2014 | 138 M | 16 | Uniform 3×3 filters |
| ResNet-50 | 2015 | 25 M | 50 | Skip connections |
Note: ResNet-50 has fewer parameters than VGG-16 despite being 3× deeper. Skip connections enable depth without parameter explosion. Architecture design matters as much as scale.

Part VII Recap

Spatial bias CNNs exploit image structure through weight sharing and local receptive fields
Convolution Sliding dot product between input patch and learned kernel — output size: $\lfloor(W - K + 2P)/S\rfloor + 1$
ReLU + Pool ReLU introduces nonlinearity; max pooling compresses spatially and adds translation tolerance
Backprop Backward pass through conv = convolution with flipped kernel; pooling routes gradient to winners only
Filters Learned filters become interpretable edge detectors — mirroring Hubel & Wiesel's simple cells
Efficiency TinyCNN: 114 params vs MLP: 1,242 params for same accuracy — inductive bias is parameter-efficient, not magic
From 114 parameters to 25 million: LeNet (1998) → AlexNet (2012) → ResNet (2015). Same building blocks — Conv, ReLU, Pool, Dense — at every scale.