Part VII

Convolutional
Neural Networks

Learning to See

Chapters 21–25

Warning The Curse of Full Connectivity

  • A 28×28 grayscale image with 100 hidden neurons: 78,500 weights
  • A 224×224 RGB image with 1000 hidden neurons: 150 million weights
  • Most weights are redundant — nearby pixels share local structure
| Image Size | FC Weights (100 hidden) | Conv Weights (3×3, 32 filters) |
|---|---|---|
| 28 × 28 × 1 | 78,500 | 320 |
| 64 × 64 × 3 | 1,228,900 | 896 |
| 224 × 224 × 3 | 15,052,900 | 896 |
Key insight: Conv weights depend only on filter size and count — not on image resolution.
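To make the arithmetic concrete, here is a small Python sketch reproducing the table's counts (weights plus biases; the function names are ours, not from any library):

```python
# Parameter counts behind the table above (weights + biases).
def fc_weights(h, w, c, hidden=100):
    """Fully connected layer: every input pixel connects to every hidden neuron."""
    return h * w * c * hidden + hidden                 # weights + biases

def conv_weights(k=3, in_channels=1, filters=32):
    """Convolutional layer: parameters depend only on kernel size and count."""
    return filters * (k * k * in_channels) + filters   # weights + biases

print(fc_weights(28, 28, 1))              # 78,500
print(conv_weights(k=3, in_channels=1))   # 320
print(fc_weights(224, 224, 3))            # 15,052,900
print(conv_weights(k=3, in_channels=3))   # 896
```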

Definition Three Key Ideas

1. Local receptive fields: Each neuron connects to only a small spatial patch of the input, not the entire image. A 3×3 filter sees 9 pixels instead of all 784.
2. Weight sharing: The same filter (same weights) is applied at every spatial position. One set of parameters scans the entire image.
3. Translation equivariance: If the input shifts, the feature map shifts by the same amount. A vertical edge detector finds vertical edges regardless of where they appear.
Together: These three ideas form the inductive bias of CNNs — the assumption that spatial structure matters and patterns can appear anywhere.
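Translation equivariance can be checked numerically. A minimal sketch, assuming circular shifts and wrap-around padding so the property holds exactly (SciPy's `ndimage.correlate` does the filtering):

```python
import numpy as np
from scipy.ndimage import correlate

# Shift-then-filter equals filter-then-shift for circular shifts + wrap padding.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy "image"
k = rng.standard_normal((3, 3))   # toy filter

shift_then_filter = correlate(np.roll(x, 2, axis=1), k, mode='wrap')
filter_then_shift = np.roll(correlate(x, k, mode='wrap'), 2, axis=1)

assert np.allclose(shift_then_filter, filter_then_shift)  # feature map shifts too
```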

Historical Timeline

  • 1962 Hubel & Wiesel: simple and complex cells in the visual cortex
  • 1980 Fukushima: Neocognitron, hierarchical feature detection
  • 1989 LeCun: LeNet, convolutional layers trained with backprop
  • 2012 Krizhevsky et al.: AlexNet, deep CNNs at scale with GPUs and data
50 years from neuroscience to engineering: Hubel & Wiesel's discovery of oriented edge detectors in cat visual cortex inspired architectures that now power computer vision.

The CNN Pipeline

Input Image → Conv (feature extraction) → ReLU (nonlinearity) → Pool (downsample) → Conv (higher-level features) → ReLU (nonlinearity) → Pool (downsample) → Flatten + Dense → Softmax (classes). The convolutional stages form the learned feature extractor; the dense layers and softmax form the classifier.

Definition Cross-Correlation

Discrete cross-correlation (called "convolution" in DL): $$y_{i,j} = \sum_{u=0}^{K-1}\sum_{v=0}^{K-1} x_{i+u,\,j+v} \cdot k_{u,v} + b$$
  • The kernel slides across the input
  • At each position: element-wise multiply, then sum
  • One kernel produces one feature map
  • $F$ kernels produce $F$ feature maps (output channels)
Example: take the 3×3 patch of a 5×5 input equal to [[1, 0, 1], [0, 1, 0], [1, 0, 1]] and the 3×3 kernel [[1, 0, 1], [0, 1, 0], [1, 0, 1]]. Elementwise products: 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 = 5, so the output at that position is 5.
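A minimal NumPy sketch of this operation, assuming stride 1 and no padding; `correlate2d` is our name, not a library function:

```python
import numpy as np

def correlate2d(x, k, b=0.0):
    """Slide kernel k over input x; elementwise multiply and sum at each position."""
    K = k.shape[0]
    H, W = x.shape
    out = np.empty((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+K, j:j+K] * k) + b
    return out
```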

Convolution by Hand

5×5 Input

2 0 1 3 0
1 1 2 0 1
0 3 1 1 2
1 0 2 3 1
2 1 0 1 0

3×3 Kernel

1 0 -1
1 0 -1
1 0 -1
Output at position (1, 1) (0-indexed: the patch spans rows 1–3 and columns 1–3):
1·1 + 2·0 + 0·(−1)
+ 3·1 + 1·0 + 1·(−1)
+ 0·1 + 2·0 + 3·(−1)
= 1 + 0 + 0 + 3 + 0 − 1 + 0 + 0 − 3 = 0
This kernel is a vertical edge detector: it responds strongly when there's a brightness difference between the left and right sides of the patch.
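The same arithmetic, checked with NumPy:

```python
import numpy as np

# The 5x5 input and 3x3 kernel from the worked example above.
x = np.array([[2, 0, 1, 3, 0],
              [1, 1, 2, 0, 1],
              [0, 3, 1, 1, 2],
              [1, 0, 2, 3, 1],
              [2, 1, 0, 1, 0]])
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

# Patch at 0-indexed position (1, 1): rows 1-3, columns 1-3.
print(np.sum(x[1:4, 1:4] * k))   # 0, matching the hand computation
```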

Theorem Output Size Formula

For an input of width $W$, kernel size $K$, padding $P$, and stride $S$: $$\text{output size} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

No padding

$W = 5, K = 3, P = 0, S = 1$

$\lfloor(5 - 3 + 0)/1\rfloor + 1 = \mathbf{3}$

Output shrinks by $K-1$

Same padding

$W = 5, K = 3, P = 1, S = 1$

$\lfloor(5 - 3 + 2)/1\rfloor + 1 = \mathbf{5}$

Output = Input size

Stride = 2

$W = 6, K = 3, P = 0, S = 2$

$\lfloor(6 - 3 + 0)/2\rfloor + 1 = \mathbf{2}$

Stride halves output

Common trap: Forgetting the floor operation when stride > 1. If the kernel doesn't fit evenly, the last position is skipped.
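A small helper (our naming) makes the floor explicit via integer division and reproduces the three cases above:

```python
# Output size formula, including the floor the "common trap" warns about.
def conv_output_size(w, k, p=0, s=1):
    return (w - k + 2 * p) // s + 1   # integer division applies the floor

print(conv_output_size(5, 3))         # 3  (no padding)
print(conv_output_size(5, 3, p=1))    # 5  (same padding)
print(conv_output_size(6, 3, s=2))    # 2  (stride skips the last position)
```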

Pause & Think

Input is 8×8, kernel is 3×3, padding = 0, stride = 1.

What is the output size?

Now: what if we add padding = 1?

Answer 1: $\lfloor(8 - 3 + 0)/1\rfloor + 1 = 6 \times 6$

Answer 2: $\lfloor(8 - 3 + 2)/1\rfloor + 1 = 8 \times 8$ (same padding!)

ReLU and Max Pooling

ReLU: $\max(0, x)$

Before ReLU:
3 -1 2 -4
-2 5 1 -3
4 -1 6 2
-5 3 -2 1
After ReLU:
3 0 2 0
0 5 1 0
4 0 6 2
0 3 0 1
Negatives → 0, positives unchanged.

Max Pooling (2×2, stride 2)

4×4 Input:
3 0 2 0
0 5 1 0
4 0 0 2
0 3 0 1
2×2 Output (max of each 2×2 block):
5 2
4 2
4×4 → 2×2 (4× smaller).
ReLU introduces nonlinearity (stacking linear convolutions without it would still be linear). Max pooling reduces spatial size and provides a degree of translation invariance.
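Minimal NumPy sketches of both operations, assuming 2×2 pooling with stride 2 and even input sizes:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                    # negatives -> 0, positives unchanged

def max_pool_2x2(x):
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)   # group into 2x2 tiles
    return blocks.max(axis=(1, 3))             # max of each tile

x = np.array([[3, 0, 2, 0],
              [0, 5, 1, 0],
              [4, 0, 0, 2],
              [0, 3, 0, 1]])
print(max_pool_2x2(x))   # [[5 2], [4 2]]
```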

TinyCNN Architecture

Input (1×8×8, 64 pixels) → Conv2D (3 filters, 3×3; output 3×6×6; 30 params) → ReLU (3×6×6; 0 params) → MaxPool 2×2 (3×3×3; 0 params) → Flatten (27; 0 params) → Dense 27→3 (3 classes; 84 params) → Softmax (probabilities). Total: 114 parameters.

Layer Summary

| Layer | Output Shape | Parameters | Role |
|---|---|---|---|
| Input | 1 × 8 × 8 | 0 | Grayscale pixels |
| Conv2D (3 filters, 3×3) | 3 × 6 × 6 | 30 | Feature detectors |
| ReLU | 3 × 6 × 6 | 0 | Nonlinearity |
| MaxPool (2×2) | 3 × 3 × 3 | 0 | Spatial reduction |
| Flatten | 27 | 0 | Reshape for dense |
| Dense + bias | 3 | 84 | Classification |
| Total | | 114 | |
Parameter breakdown: Conv layer has $3 \times (1 \times 3 \times 3) + 3 = 30$ params (3 filters × 9 kernel weights + 3 biases). Dense layer has $27 \times 3 + 3 = 84$ params. The dense layer uses 73% of all parameters despite being "just" a classifier.
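A forward-pass sketch matching the table, with random placeholder weights (the shapes and parameter counts come from the text; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.standard_normal((3, 3, 3)) * 0.1   # 3 filters, 3x3: 27 weights
conv_b  = np.zeros(3)                            # +3 biases -> 30 conv params
W       = rng.standard_normal((3, 27)) * 0.1     # dense 27 -> 3: 81 weights
dense_b = np.zeros(3)                            # +3 biases -> 84 dense params

def forward(x):                                  # x: (8, 8) grayscale image
    # Conv2D, 3 filters, no padding: (1, 8, 8) -> (3, 6, 6)
    fmap = np.empty((3, 6, 6))
    for f in range(3):
        for i in range(6):
            for j in range(6):
                fmap[f, i, j] = np.sum(x[i:i+3, j:j+3] * filters[f]) + conv_b[f]
    fmap = np.maximum(0, fmap)                   # ReLU: (3, 6, 6)
    pooled = fmap.reshape(3, 3, 2, 3, 2).max(axis=(2, 4))  # MaxPool: (3, 3, 3)
    z = W @ pooled.reshape(27) + dense_b         # Flatten + Dense: 27 -> 3
    e = np.exp(z - z.max())                      # numerically stable softmax
    return e / e.sum()

print(forward(rng.standard_normal((8, 8))))     # 3 class probabilities
```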

Synthetic Dataset: Three Pattern Classes

Three pattern classes: vertical stripes (class 0), horizontal stripes (class 1), and diagonal stripes (class 2).

Dataset design: Each 8×8 image is generated with random stripe position/width and Gaussian noise. Simple enough for a 114-parameter CNN, but non-trivial for a linear classifier.
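A plausible generator in the spirit of this description; the exact stripe positions, widths, and noise level are our assumptions, not the book's:

```python
import numpy as np

def make_sample(label, rng, noise=0.3):
    """One 8x8 image: a stripe of the given orientation plus Gaussian noise."""
    img = np.zeros((8, 8))
    pos = rng.integers(1, 7)       # random stripe position (assumed range)
    width = rng.integers(1, 3)     # random stripe width (assumed range)
    if label == 0:                               # vertical stripe
        img[:, pos:pos + width] = 1.0
    elif label == 1:                             # horizontal stripe
        img[pos:pos + width, :] = 1.0
    else:                                        # diagonal stripe
        for off in range(width):
            idx = np.arange(8 - off)
            img[idx, idx + off] = 1.0
    return img + noise * rng.standard_normal((8, 8))

rng = np.random.default_rng(0)
X = [make_sample(c, rng) for c in (0, 1, 2)]    # one sample per class
```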

Theorem Backprop Through Convolution

Kernel gradient: Correlate input with upstream gradient: $$\frac{\partial \mathcal{L}}{\partial k_{u,v}} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y_{i,j}} \cdot x_{i+u,\, j+v}$$
Input gradient: Full convolution with flipped kernel: $$\frac{\partial \mathcal{L}}{\partial x_{m,n}} = \sum_{u,v} \frac{\partial \mathcal{L}}{\partial y_{m-u,\,n-v}} \cdot k_{u,v}$$
Key insight: The backward pass through a convolution layer is itself a convolution! CNNs fit naturally into the backpropagation framework — every operation is differentiable and composable via the chain rule.
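A direct transcription of the two formulas into NumPy (stride 1, no padding; names are ours):

```python
import numpy as np

def conv_backward(x, k, dL_dy):
    """dL_dy is the upstream gradient, shape (H-K+1, W-K+1)."""
    K = k.shape[0]
    H_out, W_out = dL_dy.shape
    dL_dk = np.zeros_like(k, dtype=float)
    dL_dx = np.zeros_like(x, dtype=float)
    # Kernel gradient: correlate the input with the upstream gradient.
    for u in range(K):
        for v in range(K):
            dL_dk[u, v] = np.sum(dL_dy * x[u:u+H_out, v:v+W_out])
    # Input gradient: scatter each output gradient back through the kernel
    # (equivalent to a full convolution with the flipped kernel).
    for i in range(H_out):
        for j in range(W_out):
            dL_dx[i:i+K, j:j+K] += dL_dy[i, j] * k
    return dL_dk, dL_dx
```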

Pooling Backward: Gradient Routing

Forward pass: the 4×4 input
3 1 5 0
7 2 4 1
8 0 3 2
1 6 0 9
max-pools to [[7, 5], [8, 9]]. Backward pass: the upstream gradient [[g1, g2], [g3, g4]] is routed to the input so that only the winners receive gradient:
0 0 g2 0
g1 0 0 0
g3 0 0 0
0 0 0 g4
Max pooling is not differentiable at ties, but in practice ties are rare. The gradient is routed entirely to the position that held the maximum value. All non-winner positions receive zero gradient.
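A sketch of the routing rule for 2×2 pooling with stride 2 (note: on a tie, this version sends the gradient to every tied position, whereas the text describes a single winner):

```python
import numpy as np

def max_pool_backward(x, dL_dy):
    """Route each upstream gradient entry to the max position of its 2x2 block."""
    dL_dx = np.zeros_like(x, dtype=float)
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            block = x[2*i:2*i+2, 2*j:2*j+2]
            mask = (block == block.max())          # winner(s) of this block
            dL_dx[2*i:2*i+2, 2*j:2*j+2] = mask * dL_dy[i, j]
    return dL_dx
```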

Critical Always Gradient Check!

ALWAYS verify CNN gradients numerically. Conv backward passes involve indexing and flipping — bugs are easy to introduce and hard to spot.
Two-sided finite difference: $$\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\theta_i + \epsilon) - \mathcal{L}(\theta_i - \epsilon)}{2\epsilon}$$ with $\epsilon \approx 10^{-5}$.
Relative error: $$e_{\mathrm{rel}} = \frac{\|\nabla_{\mathrm{bp}} - \nabla_{\mathrm{num}}\|}{\|\nabla_{\mathrm{bp}}\| + \|\nabla_{\mathrm{num}}\|} < 10^{-5}$$
Interpretation:
  • $e_{\mathrm{rel}} < 10^{-7}$: perfect
  • $e_{\mathrm{rel}}$ between $10^{-7}$ and $10^{-5}$: acceptable
  • $e_{\mathrm{rel}} > 10^{-3}$: bug in your backprop
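A generic checker implementing both formulas; `loss_fn` and the calling convention are our assumptions:

```python
import numpy as np

def grad_check(loss_fn, theta, analytic_grad, eps=1e-5):
    """Two-sided finite differences vs. analytic gradient; returns relative error."""
    numeric = np.zeros_like(theta, dtype=float)
    it = np.nditer(theta, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        old = theta[idx]
        theta[idx] = old + eps; plus = loss_fn(theta)
        theta[idx] = old - eps; minus = loss_fn(theta)
        theta[idx] = old                           # restore the parameter
        numeric[idx] = (plus - minus) / (2 * eps)
    num = np.linalg.norm(analytic_grad - numeric)
    den = np.linalg.norm(analytic_grad) + np.linalg.norm(numeric)
    return num / den                               # < 1e-5 is acceptable
```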

Training Curves

Cross-entropy loss and classification accuracy over 80 training epochs, plotted for both the train and validation splits.
Convergence: TinyCNN reaches >99% accuracy within ~40 epochs. Train and validation curves track closely — no overfitting, because the model is small (114 params) relative to the task structure.

Filter Evolution During Training

3 learned kernels at different epochs (random init → specialized edge detectors)

Snapshots at epochs 0, 5, 15, 40, and 80 show F1 converging to a vertical detector, F2 to horizontal, and F3 to diagonal.

Comparison CNN vs MLP

| Property | CNN (TinyCNN) | MLP (1 hidden layer) |
|---|---|---|
| Architecture | Conv + Pool + Dense | Dense only |
| Parameters | 114 | 1,242 |
| Accuracy (3 classes) | > 99% | > 99% |
| Weight sharing | Yes (filters) | No |
| Translation equivariance | Yes | No |

Parameter ratio: the MLP uses 10.9× more parameters for the same accuracy.
Same accuracy, very different efficiency. The CNN's inductive bias (locality + weight sharing) lets it solve this spatial task with far fewer parameters. But both models achieve >99% — on this toy task, the MLP has enough capacity to memorize the patterns.

MLP Capacity Sweep

Accuracy vs. hidden-layer width (1, 4, 8, 12, 18), showing min/max over 5 seeds against the 114-parameter CNN baseline.
Width 1 fails (only 1 hidden neuron cannot separate 3 classes). Width 4 starts working but is seed-dependent. Width 8+ reliably matches CNN accuracy — but uses 6–12× more parameters.

Four-Class Extension: Adding "Dot"

Adding a fourth class (centered dot pattern) tests whether architectures generalize.

| Model | Filters / Width | Params | Accuracy |
|---|---|---|---|
| CNN | 3 filters | 117 | > 99% |
| MLP (narrow) | 8 hidden | 555 | ~92% |
| MLP (wide) | 18 hidden | 1,221 | > 99% |
CNN's feature extractor does not need to grow. The same 3 filters suffice — only the dense layer gains one more output neuron. The MLP needs substantially more capacity to handle a new class.

Inference Trace: One Sample Through the Pipeline

Input (1×8×8) → Conv maps (3×6×6): filter 1 strong, filter 2 weak, filter 3 medium → ReLU maps (3×6×6): the strong activation survives, the rest are sparse → Pooled (3×3×3) → Flatten (27) → Dense → Softmax: 0.94 vertical, 0.03 horizontal, 0.03 diagonal.
Filter 1 (the vertical detector) fires strongly; its large activation survives ReLU and pooling, and the dense layer maps it to the "vertical" class. The other filters contribute only weak signals: the network is confident this is a vertical stripe pattern.

What Filters Learn

Hubel & Wiesel (1962): Each CNN filter functions like a simple cell in the visual cortex — responding to oriented edges. Pooling corresponds to complex cells, providing position tolerance.

Biological Visual Cortex

  • Simple cells: edges at specific orientation & position
  • Complex cells: orientation-selective, position-tolerant
  • Hierarchy: V1 → V2 → V4 → IT

Convolutional Neural Network

  • Conv filters: detect edges everywhere (weight sharing)
  • Pooling: spatial tolerance via local summaries
  • Hierarchy: edges → textures → parts → objects
Convergence of biology and engineering: Fukushima and LeCun drew only loose inspiration from cortical circuits rather than attempting to replicate them, yet the learned representations converge with biological ones, suggesting these principles are near-optimal for visual processing.

Scaling Up: From TinyCNN to Modern CNNs

Same principles, different scale. Every architecture below uses the same building blocks: convolution, nonlinearity, pooling, and dense classification — exactly what TinyCNN implements from scratch.
| Architecture | Year | Parameters | Depth | Key Innovation |
|---|---|---|---|---|
| TinyCNN (ours) | 2025 | 114 | 2 | Educational from-scratch impl. |
| LeNet-5 | 1998 | 60 K | 5 | First practical CNN (digits) |
| AlexNet | 2012 | 60 M | 8 | GPU training, ReLU, dropout |
| VGG-16 | 2014 | 138 M | 16 | Uniform 3×3 filters |
| ResNet-50 | 2015 | 25 M | 50 | Skip connections |
Note: ResNet-50 has fewer parameters than VGG-16 despite being 3× deeper. Skip connections enable depth without parameter explosion. Architecture design matters as much as scale.

Part VII Recap

Spatial bias CNNs exploit image structure through weight sharing and local receptive fields
Convolution Sliding dot product between input patch and learned kernel — output size: $\lfloor(W - K + 2P)/S\rfloor + 1$
ReLU + Pool ReLU introduces nonlinearity; max pooling compresses spatially and adds translation tolerance
Backprop Backward pass through conv = convolution with flipped kernel; pooling routes gradient to winners only
Filters Learned filters become interpretable edge detectors — mirroring Hubel & Wiesel's simple cells
Efficiency TinyCNN: 114 params vs MLP: 1,242 params for same accuracy — inductive bias is parameter-efficient, not magic
From 114 parameters to 25 million: LeNet (1998) → AlexNet (2012) → ResNet (2015). Same building blocks — Conv, ReLU, Pool, Dense — at every scale.