Test: Review and Self-Assessment#

This chapter summarizes the most fundamental facts, definitions, theorems, and examples from the course. Use it to review before the self-assessment quiz. Each section mirrors a part of the book and distills the essential material you should know.

Interactive Knowledge Test

Test your understanding with our 50-question interactive quiz. Each question provides immediate feedback with explanations and links back to the relevant chapter.


Part I: Origins (Chapters 1–2)#

Key Definitions#

Definition: McCulloch–Pitts Neuron (1943)

A binary threshold unit governed by five axioms:

  1. Binary activity — a neuron is either firing (1) or quiescent (0).

  2. Fixed threshold — each neuron has a fixed threshold \(\theta_i \in \mathbb{Z}_{>0}\).

  3. Excitatory inputs — firing requires the sum of excitatory inputs to meet or exceed the threshold.

  4. Inhibitory veto — any active inhibitory input unconditionally prevents firing.

  5. Synchronous update — all neurons update in discrete, lockstep time steps.

There is no learning rule — all weights are fixed at \(\pm 1\) and the threshold is set by design.

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(x_i(t+1) = 1 \iff \sum_{j \in E_i} x_j(t) \ge \theta_i\) and no inhibitory input active | MP firing rule |
| AND gate: \(\theta = 2\), two excitatory inputs | Fires only when both inputs are 1 |
| OR gate: \(\theta = 1\), two excitatory inputs | Fires when at least one input is 1 |
| NOT gate: \(\theta = 0\), one inhibitory input | Fires when the single input is 0 (the neuron fires by default and is silenced by inhibition) |
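The firing rule and the three gates can be sketched in a few lines of Python (a minimal illustration; the helper names are ours, not from the original paper):

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts firing rule: fire iff the excitatory sum meets the
    threshold and no inhibitory input is active (the inhibitory veto)."""
    if any(inhibitory):          # axiom 4: any active inhibitory input vetoes firing
        return 0
    return 1 if sum(excitatory) >= theta else 0

AND = lambda x1, x2: mp_neuron([x1, x2], [], theta=2)
OR  = lambda x1, x2: mp_neuron([x1, x2], [], theta=1)
NOT = lambda x: mp_neuron([], [x], theta=0)  # fires by default unless inhibited

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
```

Note that the NOT unit uses threshold 0: with no excitatory inputs, the empty sum 0 meets the threshold, so the neuron fires unless its inhibitory input silences it.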

Key Theorems#

Theorem: Universality of MP Networks (McCulloch & Pitts, 1943)

Any Boolean function \(f: \{0,1\}^n \to \{0,1\}\) can be computed by a feed-forward network of McCulloch–Pitts neurons (Theorem I in the original paper). This follows from the fact that AND, OR, and NOT form a functionally complete set, and each can be realized by a single MP neuron.

Key Limitation

A single MP neuron cannot compute XOR. The XOR function requires at least a two-layer network.

Quick Self-Check#

  • What are the five axioms of the McCulloch–Pitts neuron model?

  • Why can a single MP neuron compute AND and OR but not XOR?

  • How does the inhibitory veto mechanism work, and why does it differ from a negative weight?

  • How many MP neurons are needed to compute any Boolean function of \(n\) inputs?


Part II: The Perceptron (Chapters 4–7)#

Key Definitions#

Definition: Perceptron (Rosenblatt, 1958)

A single-layer classifier with real-valued, learnable weights \(\bw \in \R^n\) and bias \(b \in \R\):

\[\begin{split} f(\bx) = \text{step}(\bw \cdot \bx + b) = \begin{cases} 1 & \text{if } \bw \cdot \bx + b \ge 0, \\ 0 & \text{otherwise.} \end{cases} \end{split}\]

The decision boundary is the hyperplane \(\bw \cdot \bx + b = 0\).

Definition: Perceptron Learning Rule

Given a misclassified sample \((\bx, y)\) with prediction \(\hat{y}\):

\[ \bw \leftarrow \bw + \eta\,(y - \hat{y})\,\bx, \qquad b \leftarrow b + \eta\,(y - \hat{y}) \]

where \(\eta > 0\) is the learning rate. The rule adds \(\bx\) on false negatives and subtracts \(\bx\) on false positives.

Definition: Linear Separability

Two sets \(A, B \subset \R^n\) are linearly separable if and only if there exists a hyperplane \(\bw \cdot \bx + b = 0\) that places all points of \(A\) on one side and all points of \(B\) on the other. Equivalently:

\[ A \text{ and } B \text{ are linearly separable} \iff \operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset \]

where \(\operatorname{conv}(\cdot)\) denotes the convex hull.

Key Theorems#

Theorem: Perceptron Convergence (Novikoff, 1962)

If the training set is linearly separable with margin \(\gamma > 0\) and all data points satisfy \(\|\bx\| \le R\), then the perceptron learning algorithm makes at most

\[ k \le \frac{R^2}{\gamma^2} \]

weight updates before converging to a correct solution. The bound is independent of the dimension \(n\) and the number of training samples.
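The theorem can be seen in action with a minimal trainer (a sketch of ours) run on the AND function, which is linearly separable, so the mistake count is guaranteed finite:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: on a mistake, w += eta*(y - y_hat)*x,
    b += eta*(y - y_hat). Returns weights, bias, and total update count."""
    w = np.zeros(X.shape[1])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, y):
            y_hat = 1 if w @ x + b >= 0 else 0
            if y_hat != t:
                w += eta * (t - y_hat) * x
                b += eta * (t - y_hat)
                updates += 1
                mistakes += 1
        if mistakes == 0:            # converged: a full pass with no errors
            break
    return w, b, updates

# AND gate: linearly separable, so Novikoff guarantees convergence
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b, k = train_perceptron(X, y)
print(w, b, k)
```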

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(f(\bx) = \text{step}(\bw \cdot \bx + b)\) | Perceptron output |
| \(\bw \leftarrow \bw + \eta(y - \hat{y})\bx\) | Perceptron weight update |
| \(k \le R^2 / \gamma^2\) | Convergence bound (Novikoff) |
| \(\operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset\) | Convex hull separability criterion |
| VC dimension of a perceptron in \(\R^n\) is \(n + 1\) | Shattering capacity |
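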

Boolean Functions#

Fact: Boolean Functions of Two Inputs

There are \(2^{2^2} = 16\) Boolean functions \(f: \{0,1\}^2 \to \{0,1\}\). Of these, 14 are linearly separable and can be computed by a single perceptron. The two that are not linearly separable are XOR and XNOR.
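The count of 14 can be verified by brute force: enumerate all 16 truth tables and search a small integer weight grid for a separating perceptron (an exhaustive sketch of ours; for two Boolean inputs a grid of \(\{-3, \dots, 3\}\) suffices):

```python
from itertools import product

inputs = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)

def separable(truth):
    """Search integer (w1, w2, b) such that step(w1*x1 + w2*x2 + b)
    reproduces the given truth table."""
    grid = range(-3, 4)
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b >= 0) == bool(t)
               for (x1, x2), t in zip(inputs, truth)):
            return True
    return False

tables = list(product((0, 1), repeat=4))   # all 16 truth tables
hard = [t for t in tables if not separable(t)]
print(len(tables) - len(hard))  # 14 linearly separable functions
print(hard)                     # the 2 exceptions: XOR and XNOR
```

The two truth tables printed as non-separable are exactly \((0,1,1,0)\) (XOR) and \((1,0,0,1)\) (XNOR), listed over the inputs \((0,0), (0,1), (1,0), (1,1)\).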

Quick Self-Check#

  • State the perceptron convergence theorem, including all assumptions.

  • What is the geometric interpretation of the margin \(\gamma\)?

  • How does increasing \(R\) (data spread) affect the convergence bound?

  • Why are there exactly 16 Boolean functions of 2 inputs, and which 2 are not linearly separable?

  • What is the VC dimension of a perceptron in \(\R^3\)?


Part III: XOR and Multilayer Networks (Chapters 8, 10, 11)#

Key Definitions#

Definition: XOR Function

The exclusive-or function \(\text{XOR}: \{0,1\}^2 \to \{0,1\}\) is defined by:

| \(x_1\) | \(x_2\) | \(\text{XOR}(x_1, x_2)\) |
|---------|---------|--------------------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Definition: Credit Assignment Problem

In a multilayer network, the credit assignment problem asks: given an error at the output, how should blame (or credit) be distributed among the hidden neurons whose individual contributions to the error are not directly observable? This problem motivated the development of backpropagation.

Key Theorems#

Theorem: XOR Is Not Linearly Separable

Algebraic proof. Suppose \(w_1 x_1 + w_2 x_2 + b \ge 0\) iff \(\text{XOR}(x_1, x_2) = 1\). The four constraints yield:

\[ (0,0) \Rightarrow b < 0, \quad (1,1) \Rightarrow w_1 + w_2 + b < 0, \quad (0,1) \Rightarrow w_2 + b \ge 0, \quad (1,0) \Rightarrow w_1 + b \ge 0. \]

Adding the last two: \(w_1 + w_2 + 2b \ge 0\), i.e., \(w_1 + w_2 \ge -2b\). The second constraint gives \(w_1 + w_2 < -b\), so \(-2b \le w_1 + w_2 < -b\), which forces \(-2b < -b\), i.e., \(b > 0\). This contradicts the first constraint \(b < 0\).

Geometric proof. The points \((0,1)\) and \((1,0)\) (class 1) and \((0,0)\) and \((1,1)\) (class 0) form a pattern where no single line can separate the two classes — the convex hulls intersect.

The Multilayer Solution#

XOR via a Hidden Layer

XOR can be decomposed as:

\[ \text{XOR}(x_1, x_2) = \text{AND}\bigl(\text{OR}(x_1, x_2),\; \text{NAND}(x_1, x_2)\bigr) \]

This requires 2 hidden neurons (one for OR, one for NAND) and 1 output neuron (AND). The hidden layer performs a coordinate transformation that maps the four XOR input points into a linearly separable configuration.
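The decomposition translates directly into a two-layer network of step units; the specific weights below are one standard choice, not the only one:

```python
def step(z):
    return 1 if z >= 0 else 0

def xor(x1, x2):
    """XOR = AND(OR(x1,x2), NAND(x1,x2)) as a network of step units."""
    h1 = step(x1 + x2 - 1)       # hidden unit 1: OR   (theta = 1)
    h2 = step(-x1 - x2 + 1.5)    # hidden unit 2: NAND (fires unless both inputs are 1)
    return step(h1 + h2 - 2)     # output unit:   AND  (theta = 2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```

The hidden activations \((h_1, h_2)\) reproduce the mapping \((0,0) \to (0,1)\), \((0,1) \to (1,1)\), \((1,0) \to (1,1)\), \((1,1) \to (1,0)\), which the output AND unit can separate with a single line.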

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(\text{XOR}(x_1,x_2) = \text{AND}(\text{OR}(x_1,x_2), \text{NAND}(x_1,x_2))\) | XOR decomposition |
| Hidden layer maps \((0,0) \to (0,1)\), \((0,1) \to (1,1)\), \((1,0) \to (1,1)\), \((1,1) \to (1,0)\) | Coordinate transformation |

Quick Self-Check#

  • Give both the algebraic and geometric proofs that XOR is not linearly separable.

  • How many hidden neurons are needed to solve XOR? Can you do it with one?

  • What is the credit assignment problem and why does it arise in multilayer networks?

  • How does the hidden layer transform the XOR inputs into a linearly separable representation?


Part V: Backpropagation (Chapters 15–19)#

Key Definitions#

Definition: Gradient Descent

An iterative optimization algorithm that updates parameters in the direction of steepest descent of the loss function:

\[ \bw_{t+1} = \bw_t - \eta\, \nabla \mathcal{L}(\bw_t) \]

where \(\eta > 0\) is the learning rate and \(\mathcal{L}\) is the loss (cost) function.

Variants: batch (full dataset), stochastic / SGD (single sample), mini-batch (subset of samples).

Definition: Backpropagation Algorithm

An efficient method for computing \(\nabla \mathcal{L}\) in a multilayer network by applying the chain rule layer by layer, from output to input. The four fundamental equations are:

BP1 — Output error:

\[ \bdelta^L = \nabla_a \mathcal{L} \odot \sigma'(\bz^L) \]

BP2 — Error backpropagation:

\[ \bdelta^l = \bigl((\bW^{l+1})^\top \bdelta^{l+1}\bigr) \odot \sigma'(\bz^l) \]

BP3 — Bias gradient:

\[ \frac{\partial \mathcal{L}}{\partial b^l_j} = \delta^l_j \]

BP4 — Weight gradient:

\[ \frac{\partial \mathcal{L}}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j \]

where \(\odot\) is the Hadamard (element-wise) product, \(\sigma'\) is the activation derivative, \(\bz^l\) is the pre-activation, and \(\ba^{l-1}\) is the previous layer’s activation.
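Translated into code, the four equations give a complete gradient computation for a two-layer sigmoid network. This is a minimal sketch of ours, assuming squared-error loss \(\mathcal{L} = \tfrac{1}{2}\|\ba^L - \by\|^2\), so that \(\nabla_a \mathcal{L} = \ba^L - \by\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """One forward/backward pass implementing BP1-BP4 for a two-layer
    sigmoid network with squared-error loss."""
    # Forward pass (sigma'(z) = a(1-a), so caching activations suffices)
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    # BP1: output error, delta^L = (a^L - y) * sigma'(z^L)
    d2 = (a2 - y) * a2 * (1 - a2)
    # BP2: propagate the error backward through W2
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    # BP3 (bias gradients = deltas) and BP4 (weight gradients = outer products)
    return {"W1": np.outer(d1, x), "b1": d1,
            "W2": np.outer(d2, a1), "b2": d2}
```

Note how BP4 appears as an outer product: each weight gradient is the downstream error \(\delta^l_j\) times the upstream activation \(a^{l-1}_k\).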

Key Theorems#

Theorem: Universal Approximation (Cybenko, 1989; Hornik, 1989)

Let \(\sigma\) be any continuous sigmoidal function. Then for any continuous function \(f: [0,1]^n \to \R\) and any \(\varepsilon > 0\), there exists a single-hidden-layer network

\[ F(\bx) = \sum_{j=1}^{N} \alpha_j \, \sigma(\bw_j \cdot \bx + b_j) \]

such that \(\|F - f\|_\infty < \varepsilon\). That is, one hidden layer with sufficiently many neurons can approximate any continuous function on a compact set to arbitrary precision.

Important caveat: The theorem guarantees existence but says nothing about how to find the weights, nor does it bound the number of hidden neurons needed.
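The existence claim can be made concrete without any training: with steep sigmoids, \(F\) can be built as a staircase that tracks \(f\), one sigmoid per step. This is an illustrative construction of ours, not the proof technique of the theorem:

```python
import numpy as np

def sigma(z):
    # Clip to avoid overflow warnings for very steep sigmoids
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def staircase_approx(f, n_steps=100, k=1000.0):
    """Return F(x) = f(0) + sum_j alpha_j * sigma(k*(x - c_j)): a sum of
    steep sigmoids whose jump heights follow f on a grid over [0, 1]."""
    grid = np.linspace(0.0, 1.0, n_steps + 1)
    centers = 0.5 * (grid[:-1] + grid[1:])   # one transition per cell
    alphas = np.diff(f(grid))                # jump heights f(x_{j+1}) - f(x_j)
    f0 = f(grid[0])
    def F(x):
        x = np.asarray(x, dtype=float)
        return f0 + np.sum(alphas[:, None] * sigma(k * (x[None, :] - centers[:, None])),
                           axis=0)
    return F

f = lambda x: np.sin(2 * np.pi * np.asarray(x))
F = staircase_approx(f)
xs = np.linspace(0, 1, 501)
print(np.max(np.abs(F(xs) - f(xs))))  # max error; shrinks as n_steps grows
```

Increasing `n_steps` drives the error toward zero, mirroring the theorem's "sufficiently many neurons" clause, while also illustrating the caveat: the construction says nothing about how *few* neurons would suffice.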

Activation Functions#

| Function | Formula | Derivative | Notes |
|----------|---------|------------|-------|
| Sigmoid | \(\sigma(x) = \frac{1}{1 + e^{-x}}\) | \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) | Max derivative \(= 1/4\) at \(x = 0\) |
| Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | \(1 - \tanh^2(x)\) | Zero-centered, max derivative \(= 1\) |
| ReLU | \(\max(0, x)\) | \(1\) if \(x > 0\), else \(0\) | No saturation for \(x > 0\); dead neurons for \(x < 0\) |

The Vanishing Gradient Problem#

Vanishing Gradients

With sigmoid activations, each backward step through a layer multiplies the error signal by \(\sigma'(\bz^l)\), and since \(\max_x \sigma'(x) = 1/4\), this factor is at most \(1/4\) per layer (ignoring the weight matrices). The gradient reaching a layer \(k\) levels below the output therefore scales as

\[ \|\text{gradient } k \text{ layers below the output}\| \propto \left(\frac{1}{4}\right)^k \]

For \(k = 10\) layers, the gradient is attenuated by a factor of \(\sim 10^{-6}\). This makes training deep sigmoid networks extremely difficult.
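The numbers are easy to confirm: the sigmoid derivative peaks at \(1/4\), and ten such factors shrink a signal below \(10^{-6}\) (a quick numerical check):

```python
import numpy as np

xs = np.linspace(-10, 10, 10001)
s = 1.0 / (1.0 + np.exp(-xs))
d = s * (1 - s)          # sigmoid derivative sigma(x) * (1 - sigma(x))
print(d.max())           # 0.25, attained at x = 0
print(0.25 ** 10)        # ~9.5e-7: attenuation across 10 sigmoid layers
```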

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(\bw_{t+1} = \bw_t - \eta \nabla \mathcal{L}(\bw_t)\) | Gradient descent update |
| \(\bdelta^L = \nabla_a \mathcal{L} \odot \sigma'(\bz^L)\) | BP1: output layer error |
| \(\bdelta^l = ((\bW^{l+1})^\top \bdelta^{l+1}) \odot \sigma'(\bz^l)\) | BP2: hidden layer error propagation |
| \(\frac{\partial \mathcal{L}}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j\) | BP4: weight gradient |
| \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) | Sigmoid derivative |
| \(\left(\frac{1}{4}\right)^k\) | Gradient attenuation through \(k\) sigmoid layers |
| \(\frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon}\) | Two-sided gradient check (\(\varepsilon \approx 10^{-7}\), relative error \(< 10^{-5}\)) |
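The gradient-checking recipe can be sketched as follows (our helper; the test function is a toy quadratic \(\mathcal{L}(\bw) = \tfrac{1}{2}\|\bw\|^2\), whose gradient is \(\bw\) in closed form):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-7):
    """Two-sided finite differences: g_i = (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps
        fp = f(w)
        w[i] -= 2 * eps
        fm = f(w)
        w[i] += eps          # restore the original value
        g[i] = (fp - fm) / (2 * eps)
    return g

w = np.array([0.3, -1.2, 2.0])
num = numerical_grad(lambda v: 0.5 * float(v @ v), w)
# Analytic gradient of 0.5*||w||^2 is w itself; compare via relative error
rel_err = np.linalg.norm(num - w) / (np.linalg.norm(num) + np.linalg.norm(w))
print(rel_err)  # should be well below 1e-5
```

The same comparison, applied component by component to a backpropagation implementation, is the standard way to catch indexing and sign bugs before training.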

Quick Self-Check#

  • Write down the four backpropagation equations (BP1–BP4) and explain each term.

  • Why is the sigmoid derivative at most \(1/4\)? How does this cause vanishing gradients?

  • How does ReLU mitigate the vanishing gradient problem? What new issue does it introduce?

  • State the Universal Approximation Theorem. What does it guarantee and what does it not guarantee?

  • How would you verify a backpropagation implementation using gradient checking?

  • What is the difference between batch, stochastic, and mini-batch gradient descent?


Master Formula Reference#

The table below collects the single most important formula from each chapter.

| Chapter | Topic | Key Formula |
|---------|-------|-------------|
| Ch 1 | MP neuron firing rule | \(x_i(t+1) = 1 \iff \sum_{j \in E_i} x_j(t) \ge \theta_i\) (no inhibitory veto) |
| Ch 2 | Boolean gates | AND: \(\theta=2\); OR: \(\theta=1\); NOT: inhibitory input |
| Ch 4 | Perceptron model | \(f(\bx) = \text{step}(\bw \cdot \bx + b)\) |
| Ch 5 | Perceptron learning | \(\bw \leftarrow \bw + \eta(y - \hat{y})\bx\) |
| Ch 6 | Convergence bound | \(k \le R^2 / \gamma^2\) |
| Ch 7 | Boolean functions | 14 of 16 functions of 2 inputs are linearly separable |
| Ch 8 | XOR impossibility | No \(\bw, b\) satisfy all four XOR constraints simultaneously |
| Ch 10 | Separability criterion | \(\operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset\) |
| Ch 11 | XOR solution | \(\text{XOR} = \text{AND}(\text{OR}, \text{NAND})\) |
| Ch 15 | Gradient descent | \(\bw_{t+1} = \bw_t - \eta \nabla \mathcal{L}(\bw_t)\) |
| Ch 16 | Backpropagation | \(\bdelta^l = ((\bW^{l+1})^\top \bdelta^{l+1}) \odot \sigma'(\bz^l)\) |
| Ch 17 | Sigmoid derivative | \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), max \(= 1/4\) |
| Ch 18 | Gradient checking | \(\frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon}\), relative error \(< 10^{-5}\) |
| Ch 19 | UAT | \(F(\bx) = \sum_j \alpha_j \sigma(\bw_j \cdot \bx + b_j)\) approximates any \(f \in C([0,1]^n)\) |


Key People#

| Person(s) | Contribution | Year |
|-----------|--------------|------|
| Warren McCulloch & Walter Pitts | First mathematical model of a neuron | 1943 |
| Frank Rosenblatt | Perceptron model and learning algorithm | 1958 |
| Albert Novikoff | Perceptron convergence theorem | 1962 |
| Marvin Minsky & Seymour Papert | *Perceptrons* — proved limitations of single-layer networks | 1969 |
| David Rumelhart, Geoffrey Hinton & Ronald Williams | Popularized backpropagation for multilayer networks | 1986 |
| George Cybenko | Universal Approximation Theorem (sigmoid) | 1989 |
| Kurt Hornik | Universal Approximation Theorem (general) | 1989 |


Ready to Test Yourself?#

Take the Quiz!

Now that you have reviewed the material, test your knowledge with our 50-question interactive quiz. The quiz covers all parts of the course and provides immediate feedback with explanations.

Target: Aim for at least 80% (40/50) to confirm solid understanding of the fundamentals.