A 28×28 grayscale image with 100 hidden neurons: 78,500 weights
A 224×224 RGB image with 1000 hidden neurons: 150 million weights
Most weights are redundant — nearby pixels share local structure
| Image Size | FC Weights (100 hidden) | Conv Weights (3×3, 32 filters) |
|---|---|---|
| 28 × 28 × 1 | 78,500 | 320 |
| 64 × 64 × 3 | 1,228,900 | 896 |
| 224 × 224 × 3 | 15,052,900 | 896 |
Key insight: Conv weights depend only on filter size, filter count, and input channels, not on image resolution.
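A quick way to reproduce the counts in the table above (a minimal sketch; the helper names are illustrative):

```python
def fc_params(h, w, c, n_hidden=100):
    # One weight per (input value, hidden unit) pair, plus one bias per hidden unit.
    return h * w * c * n_hidden + n_hidden

def conv_params(c_in, k=3, n_filters=32):
    # Weights depend only on kernel size, input channels, and filter count.
    return n_filters * (c_in * k * k) + n_filters

for h, w, c in [(28, 28, 1), (64, 64, 3), (224, 224, 3)]:
    print(f"{h}x{w}x{c}: FC={fc_params(h, w, c):,}  Conv={conv_params(c):,}")
# 28x28x1: FC=78,500  Conv=320
# 64x64x3: FC=1,228,900  Conv=896
# 224x224x3: FC=15,052,900  Conv=896
```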
Definition: Three Key Ideas
1. Local receptive fields: Each neuron connects to only a small spatial patch of the input, not the entire image. A 3×3 filter sees 9 pixels instead of all 784.
2. Weight sharing: The same filter (same weights) is applied at every spatial position. One set of parameters scans the entire image.
3. Translation equivariance: If the input shifts, the feature map shifts by the same amount. A vertical edge detector finds vertical edges regardless of where they appear.
Together: These three ideas form the inductive bias of CNNs — the assumption that spatial structure matters and patterns can appear anywhere.
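A minimal NumPy sketch of these three ideas (the function and the edge kernel below are illustrative, not taken from the TinyCNN source):

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid convolution: every output position applies the SAME K x K weights
    (weight sharing) to a small local patch of x (local receptive field)."""
    H, W = x.shape
    K = k.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * k)
    return out

# Translation equivariance: shifting the input shifts the feature map by the same amount.
edge_k = np.array([[-1.0, 0.0, 1.0]] * 3)   # crude vertical-edge detector
img = np.zeros((8, 8)); img[:, 3] = 1.0     # vertical stripe at column 3
shifted = np.roll(img, 2, axis=1)           # same stripe, two columns to the right
assert np.allclose(np.roll(conv2d_valid(img, edge_k), 2, axis=1),
                   conv2d_valid(shifted, edge_k))
# (the wrap-around columns introduced by np.roll are all zero here, so the check is exact)
```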
Historical Timeline
50 years from neuroscience to engineering: Hubel & Wiesel's discovery of oriented edge detectors in cat visual cortex inspired architectures that now power computer vision.
ReLU introduces nonlinearity (stacking linear convolutions without it would still be linear). Max pooling reduces spatial size and provides a degree of translation invariance.
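A sketch of both operations, assuming the channels-first (C, H, W) layout used in the layer table below:

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): the nonlinearity between convolution and pooling.
    return np.maximum(0.0, x)

def maxpool2x2(x):
    # Non-overlapping 2x2 max pooling on a (C, H, W) tensor; halves H and W.
    C, H, W = x.shape
    x = x[:, :H // 2 * 2, :W // 2 * 2]   # drop a trailing row/column if H or W is odd
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

# TinyCNN's shapes: (3, 6, 6) feature maps -> (3, 3, 3) after pooling.
print(maxpool2x2(relu(np.random.randn(3, 6, 6))).shape)   # (3, 3, 3)
```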
TinyCNN Architecture
Layer Summary
| Layer | Output Shape | Parameters | Role |
|---|---|---|---|
| Input | 1 × 8 × 8 | 0 | Grayscale pixels |
| Conv2D (3 filters, 3×3) | 3 × 6 × 6 | 30 | Feature detectors |
| ReLU | 3 × 6 × 6 | 0 | Nonlinearity |
| MaxPool (2×2) | 3 × 3 × 3 | 0 | Spatial reduction |
| Flatten | 27 | 0 | Reshape for dense |
| Dense + bias | 3 | 84 | Classification |
| Total | | 114 | |
Parameter breakdown: Conv layer has $3 \times (1 \times 3 \times 3) + 3 = 30$ params (3 filters × 9 kernel weights + 3 biases). Dense layer has $27 \times 3 + 3 = 84$ params. The dense layer uses 73% of all parameters despite being "just" a classifier.
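The same arithmetic as code (a sketch; variable names are illustrative):

```python
n_filters, c_in, k, n_classes = 3, 1, 3, 3
conv_p  = n_filters * (c_in * k * k) + n_filters   # 3 filters x 9 weights + 3 biases = 30
flat    = n_filters * 3 * 3                        # 3 pooled 3x3 maps -> 27 features
dense_p = flat * n_classes + n_classes             # 27 x 3 weights + 3 biases = 84
print(conv_p, dense_p, conv_p + dense_p)           # 30 84 114
```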
Synthetic Dataset: Three Pattern Classes
(Figure: example images of the three classes. Class 0: vertical stripes; Class 1: horizontal stripes; Class 2: diagonal stripes.)
Dataset design: Each 8×8 image is generated with random stripe position/width and Gaussian noise. Simple enough for a 114-parameter CNN, but non-trivial for a linear classifier.
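A sketch of how such samples might be generated (the exact stripe placement, widths, and noise level used for the figures are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(label, size=8, noise=0.1):
    """One 8x8 image: 0 = vertical stripe, 1 = horizontal stripe, 2 = diagonal stripe."""
    img = np.zeros((size, size))
    pos = int(rng.integers(1, size - 2))    # random stripe position
    width = int(rng.integers(1, 3))         # random stripe width (1 or 2 pixels)
    if label == 0:
        img[:, pos:pos + width] = 1.0       # vertical
    elif label == 1:
        img[pos:pos + width, :] = 1.0       # horizontal
    else:
        for i in range(size):               # diagonal band of the same width
            img[i, max(0, i - width + 1):i + 1] = 1.0
    return img + noise * rng.standard_normal((size, size))

X = np.stack([make_sample(i % 3) for i in range(300)])   # 100 samples per class
y = np.array([i % 3 for i in range(300)])
```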
Input gradient: Full convolution with flipped kernel:
$$\frac{\partial \mathcal{L}}{\partial x_{m,n}} = \sum_{u,v} \frac{\partial \mathcal{L}}{\partial y_{m-u,\,n-v}} \cdot k_{u,v}$$
Key insight: The backward pass through a convolution layer is itself a convolution! CNNs fit naturally into the backpropagation framework — every operation is differentiable and composable via the chain rule.
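A sketch verifying that this formula is what the chain rule produces, assuming the forward pass is a valid convolution y = x ∗ k (conv2d_valid is repeated from the earlier sketch so this block runs on its own):

```python
import numpy as np

def conv2d_valid(x, k):
    H, W = x.shape; K = k.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * k)
    return out

def conv_backward_input(dL_dy, k):
    """dL/dx = 'full' convolution of dL/dy with the 180-degree-flipped kernel."""
    K = k.shape[0]
    return conv2d_valid(np.pad(dL_dy, K - 1), k[::-1, ::-1])

# Reference: accumulate each upstream gradient back through the patch it came from.
rng = np.random.default_rng(1)
x, k = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
dL_dy = rng.standard_normal((6, 6))          # upstream gradient, one value per output
dx_ref = np.zeros_like(x)
for i in range(6):
    for j in range(6):
        dx_ref[i:i + 3, j:j + 3] += dL_dy[i, j] * k

assert np.allclose(conv_backward_input(dL_dy, k), dx_ref)
```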
Pooling Backward: Gradient Routing
Max pooling is not differentiable at ties, but in practice ties are rare. The gradient is routed entirely to the position that held the maximum value. All non-winner positions receive zero gradient.
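A sketch of that routing for 2×2 windows (channels-first layout assumed; argmax picks the first maximum, which is one standard way to break the rare ties):

```python
import numpy as np

def maxpool2x2_backward(x, dL_dout):
    """x: (C, H, W) input to the pooling layer; dL_dout: (C, H//2, W//2) upstream gradient."""
    C, H, W = x.shape
    dL_dx = np.zeros_like(x)
    for c in range(C):
        for i in range(0, H // 2 * 2, 2):
            for j in range(0, W // 2 * 2, 2):
                window = x[c, i:i + 2, j:j + 2]
                m, n = np.unravel_index(np.argmax(window), (2, 2))
                # Only the winning position receives the gradient; the rest stay zero.
                dL_dx[c, i + m, j + n] = dL_dout[c, i // 2, j // 2]
    return dL_dx
```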
Critical: Always Gradient Check!
ALWAYS verify CNN gradients numerically. Conv backward passes involve indexing and flipping, so bugs are easy to introduce and hard to spot. A minimal checker follows the thresholds below.
Interpretation:
$e_{\mathrm{rel}} < 10^{-7}$: perfect
$10^{-7}$ to $10^{-5}$: acceptable
$e_{\mathrm{rel}} > 10^{-3}$: bug in your backprop
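A minimal checker along these lines (the relative-error formula is one common choice and an assumption here, as is the toy loss used for the demo):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of dL/dx for a scalar-valued f."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps; f_plus = f(x)
        x[idx] = orig - eps; f_minus = f(x)
        x[idx] = orig
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

def rel_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(1e-12, np.abs(a) + np.abs(b)))

# Demo on a toy loss L = sum(relu(x)^2), whose analytic gradient is 2*relu(x).
x = np.random.default_rng(0).standard_normal((4, 4))
analytic = 2 * np.maximum(x, 0.0)
numeric = numerical_grad(lambda v: float(np.sum(np.maximum(v, 0.0) ** 2)), x)
print(rel_error(analytic, numeric))   # should fall in the "perfect" range above
```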
Training Curves
(Figure: cross-entropy loss and accuracy curves for training and validation, per epoch.)
Convergence: TinyCNN reaches >99% accuracy within ~40 epochs. Train and validation curves track closely — no overfitting, because the model is small (114 params) relative to the task structure.
Filter Evolution During Training
3 learned kernels at different epochs (random init → specialized edge detectors)
Comparison: CNN vs MLP
| Property | CNN (TinyCNN) | MLP (1 hidden layer) |
|---|---|---|
| Architecture | Conv + Pool + Dense | Dense only |
| Parameters | 114 | 1,242 |
| Accuracy (3 classes) | > 99% | > 99% |
| Weight sharing | Yes (filters) | No |
| Translation equivariance | Yes | No |

Parameter ratio: the MLP uses 10.9× more parameters for the same accuracy.
Same accuracy, very different efficiency. The CNN's inductive bias (locality + weight sharing) lets it solve this spatial task with far fewer parameters. But both models achieve >99% — on this toy task, the MLP has enough capacity to memorize the patterns.
MLP Capacity Sweep
Width 1 fails (only 1 hidden neuron cannot separate 3 classes). Width 4 starts working but is seed-dependent. Width 8+ reliably matches CNN accuracy — but uses 6–12× more parameters.
Four-Class Extension: Adding "Dot"
Adding a fourth class (centered dot pattern) tests whether architectures generalize.
| Model | Filters / Width | Params | Accuracy |
|---|---|---|---|
| CNN | 3 filters | 117 | > 99% |
| MLP (narrow) | 8 hidden | 555 | ~92% |
| MLP (wide) | 18 hidden | 1,221 | > 99% |
Class 3: Dot
CNN's feature extractor does not need to grow. The same 3 filters suffice — only the dense layer gains one more output neuron. The MLP needs substantially more capacity to handle a new class.
Inference Trace: One Sample Through the Pipeline
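A sketch of such a trace with randomly initialized weights (in the real model the weights come from training; the shapes and parameter layout follow the layer table above, and the softmax at the end is implied by the cross-entropy loss):

```python
import numpy as np

rng = np.random.default_rng(0)
W_conv = rng.standard_normal((3, 1, 3, 3)) * 0.1   # (filters, in_channels, K, K) -> 27 weights
b_conv = np.zeros(3)                               # + 3 biases = 30 conv params
W_fc   = rng.standard_normal((3, 27)) * 0.1        # (classes, flattened features)
b_fc   = np.zeros(3)                               # + 3 biases = 84 dense params

def forward_trace(x):                               # x: (1, 8, 8) grayscale image
    conv = np.zeros((3, 6, 6))                      # Conv2D, valid: (1, 8, 8) -> (3, 6, 6)
    for f in range(3):
        for i in range(6):
            for j in range(6):
                conv[f, i, j] = np.sum(x[:, i:i + 3, j:j + 3] * W_conv[f]) + b_conv[f]
    act  = np.maximum(conv, 0.0)                    # ReLU: (3, 6, 6)
    pool = act.reshape(3, 3, 2, 3, 2).max(axis=(2, 4))   # MaxPool 2x2: (3, 3, 3)
    flat = pool.reshape(-1)                         # Flatten: (27,)
    logits = W_fc @ flat + b_fc                     # Dense: (3,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the 3 classes
    for name, t in [("conv", conv), ("relu", act), ("pool", pool),
                    ("flatten", flat), ("logits", logits), ("probs", probs)]:
        print(f"{name:8s} {t.shape}")
    return probs

forward_trace(rng.standard_normal((1, 8, 8)))
```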
What Filters Learn
Hubel & Wiesel (1962): Each CNN filter functions like a simple cell in the visual cortex — responding to oriented edges. Pooling corresponds to complex cells, providing position tolerance.
Biological visual cortex: simple cells respond to edges at a specific orientation and position.
Convergence of biology and engineering: CNN filters are learned from data rather than copied from cortical measurements, yet the learned representations converge on the same oriented edge detectors, suggesting these principles are near-optimal for visual processing.
Scaling Up: From TinyCNN to Modern CNNs
Same principles, different scale. Every architecture below uses the same building blocks: convolution, nonlinearity, pooling, and dense classification — exactly what TinyCNN implements from scratch.
| Architecture | Year | Parameters | Depth | Key Innovation |
|---|---|---|---|---|
| TinyCNN (ours) | 2025 | 114 | 2 | Educational from-scratch impl. |
| LeNet-5 | 1998 | 60 K | 5 | First practical CNN (digits) |
| AlexNet | 2012 | 60 M | 8 | GPU training, ReLU, dropout |
| VGG-16 | 2014 | 138 M | 16 | Uniform 3×3 filters |
| ResNet-50 | 2015 | 25 M | 50 | Skip connections |
Note: ResNet-50 has fewer parameters than VGG-16 despite being 3× deeper. Skip connections enable depth without parameter explosion. Architecture design matters as much as scale.
Part VII Recap
Spatial bias: CNNs exploit image structure through weight sharing and local receptive fields
Convolution: Sliding dot product between input patch and learned kernel; output size $\lfloor(W - K + 2P)/S\rfloor + 1$ (worked check below)
ReLU + Pool: ReLU introduces nonlinearity; max pooling compresses spatially and adds translation tolerance
Backprop: Backward pass through conv = convolution with flipped kernel; pooling routes gradient to winners only
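And the worked check of the output-size formula promised above, using TinyCNN's own shapes:

```python
def out_size(W, K, P=0, S=1):
    return (W - K + 2 * P) // S + 1     # floor((W - K + 2P) / S) + 1

assert out_size(8, 3) == 6              # conv: 8x8 input, 3x3 kernel, no padding, stride 1
assert out_size(6, 2, S=2) == 3         # pooling uses the same formula: 2x2 window, stride 2
```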