{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4",
   "metadata": {},
   "source": [
    "# Chapter 21: From Pixels to Features\n",
    "\n",
    "In the previous chapters, we built fully connected multilayer networks capable of approximating any continuous function. These networks are powerful in theory, but when we turn to image data, their architecture reveals a critical weakness: **every input pixel connects to every hidden neuron**, creating an explosion of parameters that makes learning slow, memory-hungry, and prone to overfitting.\n",
    "\n",
    "This chapter motivates the transition from fully connected networks to **convolutional neural networks (CNNs)** by identifying the structural assumptions that make image processing fundamentally different from generic function approximation. We will see how three elegant ideas\u2014weight sharing, local receptive fields, and translation equivariance\u2014reduce the parameter count by orders of magnitude while encoding the spatial structure of visual data directly into the network architecture."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5",
   "metadata": {},
   "source": [
    "## 1. The Curse of Full Connectivity\n",
    "\n",
    "Consider the simplest image classification task: recognizing handwritten digits from $28 \\times 28$ grayscale images (the MNIST dataset). Each image has $28 \\times 28 = 784$ pixels, so the input layer has 784 neurons.\n",
    "\n",
    "In a fully connected network with a single hidden layer of 100 neurons, the first layer alone requires:\n",
    "\n",
    "$$784 \\times 100 + 100 = 78{,}500 \\text{ parameters}$$\n",
    "\n",
    "This is already substantial for a tiny $28 \\times 28$ image. What happens with realistic image sizes?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d4e5f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Parameter counts for fully connected first layer (input -> 100 hidden neurons)\n",
    "image_sizes = {\n",
    "    'MNIST (28x28)':          (28, 28, 1),\n",
    "    'CIFAR-10 (32x32x3)':     (32, 32, 3),\n",
    "    'ImageNet (224x224x3)':   (224, 224, 3),\n",
    "    'HD photo (1920x1080x3)': (1920, 1080, 3),\n",
    "}\n",
    "\n",
    "hidden_neurons = 100\n",
    "\n",
    "print(\"Fully connected layer: input -> 100 hidden neurons\")\n",
    "print(\"=\" * 60)\n",
    "for name, (h, w, c) in image_sizes.items():\n",
    "    input_dim = h * w * c\n",
    "    fc_params = input_dim * hidden_neurons + hidden_neurons  # weights + biases\n",
    "    print(f\"{name:30s}  input_dim = {input_dim:>10,d}  params = {fc_params:>15,d}\")\n",
    "\n",
    "# Compare with a CNN: 16 filters of size 3x3 on a single-channel input\n",
    "print(\"\\nConvolutional layer: 16 filters of 3x3 on 1-channel input\")\n",
    "print(\"=\" * 60)\n",
    "num_filters = 16\n",
    "kernel_size = 3\n",
    "in_channels = 1\n",
    "conv_params = num_filters * (in_channels * kernel_size * kernel_size + 1)  # weights + biases\n",
    "print(f\"Parameters (independent of image size): {conv_params}\")\n",
    "mnist_params = 28 * 28 * 1 * hidden_neurons + hidden_neurons  # 78,500, matching the table above\n",
    "print(f\"\\nRatio for MNIST: {mnist_params / conv_params:.0f}x fewer parameters with convolution\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4e5f6a7",
   "metadata": {},
   "source": [
    "The numbers are striking. A single fully connected layer on a modest $224 \\times 224 \\times 3$ colour image requires over **15 million parameters**\u2014before we even add a second layer. Meanwhile, a convolutional layer with 16 filters of size $3 \\times 3$ uses only **160 parameters**, regardless of the input image size.\n",
    "\n",
    "```{admonition} The Core Problem\n",
    ":class: warning\n",
    "\n",
    "Fully connected layers treat every pixel as an independent feature with no spatial relationship to its neighbours. This wastes capacity on learning redundant patterns and demands enormous amounts of training data to generalize.\n",
    "```\n",
    "\n",
    "The solution is to design an architecture that respects the **spatial structure** of images. This is precisely what convolutional neural networks do."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5f6a7b8",
   "metadata": {},
   "source": [
    "## 2. Three Key Ideas\n",
    "\n",
    "Convolutional neural networks are built on three principles that exploit the structure of visual data.\n",
    "\n",
    "### 2.1 Local Receptive Fields\n",
    "\n",
    "```{admonition} Definition (Local Receptive Field)\n",
    ":class: note\n",
    "\n",
    "A **local receptive field** is a small, contiguous region of the input (e.g., a $3 \\times 3$ or $5 \\times 5$ patch) to which a single hidden neuron is connected. Instead of seeing the entire image, each neuron receives input only from a spatially localized neighbourhood.\n",
    "```\n",
    "\n",
    "This reflects a fundamental property of natural images: nearby pixels are strongly correlated (they tend to belong to the same object, edge, or texture), while distant pixels carry largely independent information. A neuron responsible for detecting a vertical edge needs only a small local patch to do its job.\n",
    "\n",
    "### 2.2 Weight Sharing\n",
    "\n",
    "```{admonition} Definition (Weight Sharing)\n",
    ":class: note\n",
    "\n",
    "**Weight sharing** means that the same set of weights (a **filter** or **kernel**) is applied at every spatial position of the input. All neurons in a given **feature map** share identical parameters.\n",
    "```\n",
    "\n",
    "If a $3 \\times 3$ filter is useful for detecting horizontal edges in the top-left corner of an image, the same filter should be equally useful in the bottom-right corner. Weight sharing encodes this assumption directly: a single filter is **slid** across the entire image, producing one feature map.\n",
    "\n",
    "This is the main reason CNNs have so few parameters compared to fully connected networks. Instead of learning separate weights for each spatial position, the network learns a small number of filters and reuses them everywhere.\n",
    "\n",
    "### 2.3 Translation Equivariance\n",
    "\n",
    "```{admonition} Definition (Translation Equivariance)\n",
    ":class: note\n",
    "\n",
    "A function $f$ is **translation equivariant** if shifting the input causes an identical shift in the output:\n",
    "\n",
    "$$f(\\text{shift}(\\mathbf{x})) = \\text{shift}(f(\\mathbf{x}))$$\n",
    "\n",
    "Convolution is inherently translation equivariant: if a cat's ear appears 10 pixels to the right, the corresponding feature map activation also shifts 10 pixels to the right.\n",
    "```\n",
    "\n",
    "This is distinct from **translation invariance** (the output does not change at all when the input is shifted). Pure convolution is equivariant, not invariant. Invariance is typically achieved later in the pipeline through **pooling** operations.\n",
    "\n",
    "```{admonition} Equivariance vs. Invariance\n",
    ":class: tip\n",
    "\n",
    "- **Equivariance**: \"If the input moves, the output moves with it.\" (Convolution)\n",
    "- **Invariance**: \"If the input moves, the output stays the same.\" (Global average pooling, classification head)\n",
    "\n",
    "A CNN typically starts with equivariant layers (convolutions) and progressively introduces invariance (pooling, global averaging) so that the final class prediction is insensitive to the object's position.\n",
    "```"
   ]
  },
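  {
   "cell_type": "markdown",
   "id": "c5d6e7f8",
   "metadata": {},
   "source": [
    "These definitions can be checked numerically. The sketch below is a minimal NumPy implementation of our own (not the optimized convolution we build in Chapter 22): it slides a single $3 \\times 3$ filter across an image using one shared set of weights, then verifies translation equivariance by shifting the input two pixels to the right and observing the same shift in the feature map. The helper name `cross_correlate2d` is ours for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6e7f8a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cross_correlate2d(image, kernel):\n",
    "    \"\"\"Slide one kernel over the image (valid mode); every position reuses the same weights.\"\"\"\n",
    "    kh, kw = kernel.shape\n",
    "    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1\n",
    "    out = np.zeros((oh, ow))\n",
    "    for i in range(oh):\n",
    "        for j in range(ow):\n",
    "            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)\n",
    "    return out\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "image = rng.random((10, 10))\n",
    "kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # a simple vertical-edge detector\n",
    "\n",
    "fmap = cross_correlate2d(image, kernel)\n",
    "\n",
    "# Shift the image 2 pixels to the right and recompute the feature map\n",
    "shifted = np.roll(image, 2, axis=1)\n",
    "fmap_shifted = cross_correlate2d(shifted, kernel)\n",
    "\n",
    "# Away from the wrapped border columns, every activation shifts by the same 2 pixels\n",
    "print(np.allclose(fmap_shifted[:, 2:], fmap[:, :-2]))"
   ]
  },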
  {
   "cell_type": "markdown",
   "id": "f6a7b8c9",
   "metadata": {},
   "source": [
    "## 3. Historical Context\n",
    "\n",
    "The ideas behind convolutional networks did not appear in a vacuum. They emerged from a rich interaction between neuroscience and computer science spanning several decades.\n",
    "\n",
    "### Hubel & Wiesel (1962): The Neuroscience Foundation\n",
    "\n",
    "David Hubel and Torsten Wiesel performed groundbreaking experiments on the cat visual cortex, discovering that individual neurons respond to **oriented edges** at specific positions in the visual field. They identified two types of cells:\n",
    "\n",
    "- **Simple cells**: respond to edges of a particular orientation at a specific location (analogous to convolutional filters).\n",
    "- **Complex cells**: respond to edges of a particular orientation regardless of exact position (analogous to pooling).\n",
    "\n",
    "Their work earned the **Nobel Prize in Physiology or Medicine (1981)** and directly inspired the hierarchical feature extraction architecture of CNNs.\n",
    "\n",
    "### The Neocognitron (Fukushima, 1980)\n",
    "\n",
    "Kunihiko Fukushima built the **Neocognitron**, a hierarchical neural network explicitly modeled on Hubel and Wiesel's findings. It introduced:\n",
    "\n",
    "- Alternating layers of **S-cells** (simple, feature-extracting) and **C-cells** (complex, position-invariant).\n",
    "- Local receptive fields and weight sharing.\n",
    "- Training via a self-organizing (unsupervised) rule rather than backpropagation.\n",
    "\n",
    "The Neocognitron was the first neural architecture to successfully recognize handwritten characters with some positional invariance.\n",
    "\n",
    "### LeNet (LeCun et al., 1989\u20131998)\n",
    "\n",
    "Yann LeCun combined the architectural ideas of the Neocognitron with **backpropagation** (Chapter 16) to create **LeNet**, the first CNN trained end-to-end with gradient descent. Key milestones:\n",
    "\n",
    "- **1989**: LeCun et al. applied backpropagation to a convolutional network for handwritten zip code recognition.\n",
    "- **1998**: LeNet-5, the mature architecture, was described in the landmark paper of LeCun et al. and deployed commercially in automated check-reading systems, which at their peak processed millions of checks per day.\n",
    "\n",
    "LeNet-5 established the modern CNN template: convolutional layers $\\to$ pooling layers $\\to$ fully connected layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7b8c9d0",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "import numpy as np\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "\n",
    "# Timeline data for CNN history\n",
    "events = [\n",
    "    (1962, 'Hubel & Wiesel\\nOriented edge detectors\\nin cat visual cortex', '#3b82f6'),\n",
    "    (1980, 'Neocognitron\\n(Fukushima)\\nS-cells + C-cells', '#059669'),\n",
    "    (1989, 'LeNet\\n(LeCun et al.)\\nBackprop + convolution', '#d97706'),\n",
    "    (1998, 'LeNet-5\\nDeployed for automated\\ncheck reading', '#dc2626'),\n",
    "]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 3.5))\n",
    "\n",
    "# Draw timeline axis\n",
    "years = [e[0] for e in events]\n",
    "y_min, y_max = min(years) - 5, max(years) + 5\n",
    "ax.plot([y_min, y_max], [0, 0], color='#64748b', linewidth=2, zorder=1)\n",
    "\n",
    "# Plot events\n",
    "for i, (year, label, color) in enumerate(events):\n",
    "    side = 1 if i % 2 == 0 else -1\n",
    "    y_offset = side * 1.2\n",
    "    \n",
    "    # Vertical connector\n",
    "    ax.plot([year, year], [0, y_offset * 0.6], color=color, linewidth=2, zorder=2)\n",
    "    \n",
    "    # Dot on timeline\n",
    "    ax.scatter(year, 0, s=120, color=color, zorder=3, edgecolors='white', linewidth=1.5)\n",
    "    \n",
    "    # Label\n",
    "    ax.text(year, y_offset, f'{year}\\n{label}', ha='center',\n",
    "            va='bottom' if side > 0 else 'top',\n",
    "            fontsize=9, fontweight='bold', color=color,\n",
    "            bbox=dict(boxstyle='round,pad=0.3', facecolor='white',\n",
    "                      edgecolor=color, alpha=0.9))\n",
    "\n",
    "ax.set_xlim(1955, 2005)\n",
    "ax.set_ylim(-2.5, 2.5)\n",
    "ax.set_xlabel('Year', fontsize=11)\n",
    "ax.set_title('Timeline: From Neuroscience to Convolutional Neural Networks',\n",
    "             fontsize=13, fontweight='bold', pad=15)\n",
    "ax.set_yticks([])\n",
    "ax.spines['left'].set_visible(False)\n",
    "ax.spines['right'].set_visible(False)\n",
    "ax.spines['top'].set_visible(False)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c9d0e1",
   "metadata": {},
   "source": [
    "## 4. The CNN Pipeline\n",
    "\n",
    "A typical convolutional neural network processes an image through a sequence of distinct stages:\n",
    "\n",
    "1. **Convolutional layers** extract local features (edges, textures, shapes) by sliding learned filters across the input.\n",
    "2. **Activation functions** (typically ReLU) introduce nonlinearity after each convolution.\n",
    "3. **Pooling layers** reduce spatial dimensions, introducing a degree of translation invariance.\n",
    "4. These three operations are repeated in multiple stages, with each stage extracting increasingly abstract features.\n",
    "5. **Flattening** converts the final 2D feature maps into a 1D vector.\n",
    "6. **Fully connected (dense) layers** combine the extracted features for classification.\n",
    "7. **Softmax** produces a probability distribution over classes.\n",
    "\n",
    "The following diagram illustrates this pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9d0e1f2",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "import numpy as np\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(14, 4))\n",
    "\n",
    "# Pipeline stages\n",
    "stages = [\n",
    "    ('Input\\nImage',       '#64748b', (0.8, 1.6)),\n",
    "    ('Conv',               '#3b82f6', (0.7, 1.4)),\n",
    "    ('ReLU',               '#2563eb', (0.6, 1.2)),\n",
    "    ('Pool',               '#059669', (0.5, 1.0)),\n",
    "    ('Conv',               '#3b82f6', (0.5, 1.0)),\n",
    "    ('ReLU',               '#2563eb', (0.45, 0.9)),\n",
    "    ('Pool',               '#059669', (0.4, 0.8)),\n",
    "    ('Flatten',            '#d97706', (0.15, 1.4)),\n",
    "    ('Dense',              '#dc2626', (0.15, 1.2)),\n",
    "    ('Softmax',            '#7c3aed', (0.15, 1.0)),\n",
    "]\n",
    "\n",
    "x_pos = 0.5\n",
    "x_positions = []\n",
    "gap = 0.15\n",
    "\n",
    "for i, (label, color, (w, h)) in enumerate(stages):\n",
    "    x_positions.append(x_pos)\n",
    "    rect = mpatches.FancyBboxPatch(\n",
    "        (x_pos - w/2, -h/2), w, h,\n",
    "        boxstyle=mpatches.BoxStyle('Round', pad=0.05),\n",
    "        facecolor=color, edgecolor='white', linewidth=2, alpha=0.85\n",
    "    )\n",
    "    ax.add_patch(rect)\n",
    "    ax.text(x_pos, 0, label, ha='center', va='center',\n",
    "            fontsize=9, fontweight='bold', color='white')\n",
    "    \n",
    "    # Arrow to next stage\n",
    "    if i < len(stages) - 1:\n",
    "        next_w = stages[i+1][2][0]\n",
    "        arrow_start = x_pos + w/2 + 0.02\n",
    "        next_x = x_pos + w/2 + gap + next_w/2\n",
    "        arrow_end = next_x - next_w/2 - 0.02\n",
    "        ax.annotate('', xy=(arrow_end, 0), xytext=(arrow_start, 0),\n",
    "                    arrowprops=dict(arrowstyle='->', color='#94a3b8',\n",
    "                                   lw=1.5, connectionstyle='arc3,rad=0'))\n",
    "    \n",
    "    x_pos += w/2 + gap + (stages[i+1][2][0]/2 if i < len(stages)-1 else 0)\n",
    "\n",
    "# Bracket labels for feature extraction vs classification\n",
    "ax.annotate('', xy=(x_positions[0] - 0.4, -1.15), xytext=(x_positions[6] + 0.4, -1.15),\n",
    "            arrowprops=dict(arrowstyle='-', color='#3b82f6', lw=1.5))\n",
    "ax.text((x_positions[0] + x_positions[6])/2, -1.35,\n",
    "        'Feature Extraction', ha='center', fontsize=10, color='#3b82f6', fontstyle='italic')\n",
    "\n",
    "ax.annotate('', xy=(x_positions[7] - 0.1, -1.15), xytext=(x_positions[9] + 0.1, -1.15),\n",
    "            arrowprops=dict(arrowstyle='-', color='#dc2626', lw=1.5))\n",
    "ax.text((x_positions[7] + x_positions[9])/2, -1.35,\n",
    "        'Classification', ha='center', fontsize=10, color='#dc2626', fontstyle='italic')\n",
    "\n",
    "ax.set_xlim(-0.2, x_pos + 0.5)\n",
    "ax.set_ylim(-1.8, 1.3)\n",
    "ax.set_aspect('equal')\n",
    "ax.axis('off')\n",
    "ax.set_title('The CNN Pipeline', fontsize=14, fontweight='bold', pad=15)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
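  {
   "cell_type": "markdown",
   "id": "e7f8a9b0",
   "metadata": {},
   "source": [
    "The stages above can be sanity-checked by tracking tensor shapes. The sketch below is our own bookkeeping under an assumed LeNet-style configuration ($3 \\times 3$ valid convolutions, $2 \\times 2$ non-overlapping pooling, and hypothetical filter counts of 8 and 16) applied to a $28 \\times 28$ grayscale input; the shape formulas themselves are derived in Chapters 22 and 23."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8a9b0c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def conv_shape(h, w, k):\n",
    "    \"\"\"Valid (no-padding) convolution: each spatial dimension shrinks by k - 1.\"\"\"\n",
    "    return h - k + 1, w - k + 1\n",
    "\n",
    "def pool_shape(h, w, p=2):\n",
    "    \"\"\"Non-overlapping p x p pooling divides each spatial dimension by p.\"\"\"\n",
    "    return h // p, w // p\n",
    "\n",
    "h, w, channels = 28, 28, 1\n",
    "print(f\"Input:   {h} x {w} x {channels}\")\n",
    "\n",
    "for stage, filters in [(1, 8), (2, 16)]:\n",
    "    h, w = conv_shape(h, w, 3)\n",
    "    channels = filters\n",
    "    print(f\"Conv {stage}:  {h} x {w} x {channels}  (3x3, valid)\")\n",
    "    h, w = pool_shape(h, w)\n",
    "    print(f\"Pool {stage}:  {h} x {w} x {channels}  (2x2)\")\n",
    "\n",
    "flat = h * w * channels\n",
    "print(f\"Flatten: {flat}\")\n",
    "print(f\"Dense:   {flat} -> 10 logits, then softmax over 10 classes\")"
   ]
  },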
  {
   "cell_type": "markdown",
   "id": "d0e1f2a3",
   "metadata": {},
   "source": [
    "**Key insight:** The early convolutional layers detect simple, local features (edges, corners). As we go deeper, the receptive field of each neuron grows, and the network learns to combine low-level features into higher-level representations (textures $\\to$ parts $\\to$ objects). This hierarchical feature extraction mirrors the organization of the mammalian visual cortex described by Hubel and Wiesel.\n",
    "\n",
    "In the next chapters, we will build each of these components from scratch:\n",
    "\n",
    "| Chapter | Component | Key Concept |\n",
    "|:-------:|:---------:|:------------|\n",
    "| 22 | Convolution | Sliding filter, cross-correlation, feature maps |\n",
    "| 23 | Pooling & Architecture | Max pooling, stride, padding, LeNet-5 |\n",
    "| 24 | CNN Backpropagation | Gradients through conv and pool layers |"
   ]
  },
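  {
   "cell_type": "markdown",
   "id": "a9b0c1d2",
   "metadata": {},
   "source": [
    "This growth of the receptive field is easy to verify numerically. The sketch below uses a small helper of our own that grows the field layer by layer for stride-1 convolutions with no padding; Exercise 21.3 asks you to derive the pattern in closed form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0c1d2e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def receptive_field(num_layers, kernel_size):\n",
    "    \"\"\"Effective receptive field of stacked stride-1 convolutions, grown layer by layer.\"\"\"\n",
    "    rf = 1  # a single output neuron starts from one input pixel\n",
    "    for _ in range(num_layers):\n",
    "        rf += kernel_size - 1  # each valid convolution widens the field\n",
    "    return rf\n",
    "\n",
    "for layers in range(1, 6):\n",
    "    rf = receptive_field(layers, 3)\n",
    "    print(f\"{layers} x (3x3 conv): receptive field {rf} x {rf}\")"
   ]
  },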
  {
   "cell_type": "markdown",
   "id": "e1f2a3b4",
   "metadata": {},
   "source": [
    "## 5. Exercises\n",
    "\n",
    "### Exercise 21.1: Parameter Counting\n",
    "\n",
    "A colour image of size $64 \\times 64 \\times 3$ is fed into a fully connected layer with 256 hidden neurons.\n",
    "\n",
    "**(a)** How many parameters (weights + biases) does this layer have?\n",
    "\n",
    "**(b)** Now consider a convolutional layer with 32 filters of size $5 \\times 5$, applied to the same $64 \\times 64 \\times 3$ input. How many parameters does this layer have?\n",
    "\n",
    "**(c)** What is the ratio of parameters between the fully connected and convolutional layers?\n",
    "\n",
    "### Exercise 21.2: Equivariance vs. Invariance\n",
    "\n",
    "Let $f$ be a function mapping images to feature maps, and let $T_\\delta$ denote a spatial translation by $\\delta$ pixels.\n",
    "\n",
    "**(a)** Write the mathematical condition for $f$ to be **translation equivariant**.\n",
    "\n",
    "**(b)** Write the mathematical condition for $f$ to be **translation invariant**.\n",
    "\n",
    "**(c)** Is the convolution operation equivariant, invariant, or neither? Justify your answer.\n",
    "\n",
    "**(d)** Give an example of a function that is translation invariant. Why is pure invariance undesirable in the early layers of a vision network?\n",
    "\n",
    "### Exercise 21.3: Receptive Field Growth\n",
    "\n",
    "Consider a network with two consecutive $3 \\times 3$ convolutional layers (stride 1, no padding).\n",
    "\n",
    "**(a)** What is the effective receptive field of a single neuron in the second convolutional layer's output? That is, how large a region of the original input does it depend on?\n",
    "\n",
    "**(b)** Generalize: if we stack $L$ convolutional layers, each with kernel size $K \\times K$ (stride 1, no padding), what is the effective receptive field size?\n",
    "\n",
    "**(c)** How many $3 \\times 3$ layers would we need to achieve the same receptive field as a single $7 \\times 7$ filter? Which approach uses fewer parameters (assuming the same number of input and output channels)?\n",
    "\n",
    "### Exercise 21.4: Why Not Fully Connected?\n",
    "\n",
    "Suppose you train a fully connected network on $28 \\times 28$ images of the digit \"7\", and it learns to recognize the digit when it appears in the centre of the image.\n",
    "\n",
    "**(a)** Will the network correctly recognize a \"7\" shifted 5 pixels to the right? Explain why or why not, referring to the structure of the weight matrix.\n",
    "\n",
    "**(b)** How does a convolutional network handle this same scenario differently?\n",
    "\n",
    "### Exercise 21.5: Biological Analogy\n",
    "\n",
    "Hubel and Wiesel identified **simple cells** and **complex cells** in the cat visual cortex.\n",
    "\n",
    "**(a)** Which CNN component is analogous to simple cells? Which is analogous to complex cells?\n",
    "\n",
    "**(b)** In what important ways does the analogy break down? Consider at least two differences between biological vision and CNNs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a3b4c5",
   "metadata": {},
   "source": [
    "## 6. Summary and Key Takeaways\n",
    "\n",
    "- Fully connected networks suffer from a **parameter explosion** on image data, requiring millions of weights even for modest image sizes.\n",
    "- Convolutional neural networks solve this through three key ideas:\n",
    "  - **Local receptive fields**: each neuron sees only a small patch.\n",
    "  - **Weight sharing**: the same filter is applied everywhere.\n",
    "  - **Translation equivariance**: shifting the input shifts the output.\n",
    "- The CNN architecture was inspired by **Hubel & Wiesel's** neuroscience discoveries (1962) and developed through **Fukushima's Neocognitron** (1980) and **LeCun's LeNet** (1989\u20131998).\n",
    "- The standard CNN pipeline consists of: **Convolution $\\to$ ReLU $\\to$ Pooling** (repeated), followed by **Flatten $\\to$ Dense $\\to$ Softmax**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3b4c5d6",
   "metadata": {},
   "source": [
    "## 7. References\n",
    "\n",
    "1. D. H. Hubel and T. N. Wiesel, \"Receptive fields, binocular interaction and functional architecture in the cat's visual cortex,\" *The Journal of Physiology*, vol. 160, no. 1, pp. 106\u2013154, 1962.\n",
    "\n",
    "2. K. Fukushima, \"Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,\" *Biological Cybernetics*, vol. 36, no. 4, pp. 193\u2013202, 1980.\n",
    "\n",
    "3. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, \"Backpropagation applied to handwritten zip code recognition,\" *Neural Computation*, vol. 1, no. 4, pp. 541\u2013551, 1989.\n",
    "\n",
    "4. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \"Gradient-based learning applied to document recognition,\" *Proceedings of the IEEE*, vol. 86, no. 11, pp. 2278\u20132324, 1998."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}