Test: Review and Self-Assessment#

This chapter summarizes the most fundamental facts, definitions, theorems, and examples from the course. Use it to review before the self-assessment quiz. Each section mirrors a part of the book and distills the essential material you should know.

Interactive Knowledge Test

Test your understanding with our 50-question interactive quiz. Each question provides immediate feedback with explanations and links back to the relevant chapter.


Part I: Origins (Chapters 1–2)#

Key Definitions#

Definition: McCulloch–Pitts Neuron (1943)

A binary threshold unit governed by five axioms:

  1. Binary activity — a neuron is either firing (1) or quiescent (0).

  2. Fixed threshold — each neuron has a fixed threshold \(\theta_i \in \mathbb{Z}_{>0}\).

  3. Excitatory inputs — firing requires the sum of excitatory inputs to meet or exceed the threshold.

  4. Inhibitory veto — any active inhibitory input unconditionally prevents firing.

  5. Synchronous update — all neurons update in discrete, lockstep time steps.

There is no learning rule — all weights are fixed at \(\pm 1\) and the threshold is set by design.

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(x_i(t+1) = 1 \iff \sum_{j \in E_i} x_j(t) \ge \theta_i\) and no inhibitory input active | MP firing rule |
| AND gate: \(\theta = 2\), two excitatory inputs | Fires only when both inputs are 1 |
| OR gate: \(\theta = 1\), two excitatory inputs | Fires when at least one input is 1 |
| NOT gate: \(\theta = 0\), one inhibitory input | Fires when the single input is 0 (the neuron fires by default and is silenced by inhibition) |
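The firing rule and the three gates can be sketched in a few lines of Python (a minimal illustration; the helper names are ours, not from the original paper):

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts firing rule: fire iff the excitatory sum meets the
    threshold and no inhibitory input is active (the inhibitory veto)."""
    if any(inhibitory):          # axiom 4: any active inhibitory input vetoes firing
        return 0
    return 1 if sum(excitatory) >= theta else 0

AND = lambda x1, x2: mp_neuron([x1, x2], [], theta=2)
OR  = lambda x1, x2: mp_neuron([x1, x2], [], theta=1)
NOT = lambda x: mp_neuron([], [x], theta=0)  # fires by default unless inhibited

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
```

Note that the NOT unit uses threshold 0: with no excitatory inputs, the empty sum 0 meets the threshold, so the neuron fires unless its inhibitory input silences it.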

Key Theorems#

Theorem: Universality of MP Networks (McCulloch & Pitts, 1943)

Any Boolean function \(f: \{0,1\}^n \to \{0,1\}\) can be computed by a feed-forward network of McCulloch–Pitts neurons (Theorem I in the original paper). This follows from the fact that AND, OR, and NOT form a functionally complete set, and each can be realized by a single MP neuron.

Key Limitation

A single MP neuron cannot compute XOR. The XOR function requires at least a two-layer network.

Quick Self-Check#

  • What are the five axioms of the McCulloch–Pitts neuron model?

  • Why can a single MP neuron compute AND and OR but not XOR?

  • How does the inhibitory veto mechanism work, and why does it differ from a negative weight?

  • How many MP neurons are needed to compute any Boolean function of \(n\) inputs?


Part II: The Perceptron (Chapters 4–7)#

Key Definitions#

Definition: Perceptron (Rosenblatt, 1958)

A single-layer classifier with real-valued, learnable weights \(\bw \in \R^n\) and bias \(b \in \R\):

\[\begin{split} f(\bx) = \text{step}(\bw \cdot \bx + b) = \begin{cases} 1 & \text{if } \bw \cdot \bx + b \ge 0, \\ 0 & \text{otherwise.} \end{cases} \end{split}\]

The decision boundary is the hyperplane \(\bw \cdot \bx + b = 0\).

Definition: Perceptron Learning Rule

Given a misclassified sample \((\bx, y)\) with prediction \(\hat{y}\):

\[ \bw \leftarrow \bw + \eta\,(y - \hat{y})\,\bx, \qquad b \leftarrow b + \eta\,(y - \hat{y}) \]

where \(\eta > 0\) is the learning rate. The rule adds \(\bx\) on false negatives and subtracts \(\bx\) on false positives.

Definition: Linear Separability

Two sets \(A, B \subset \R^n\) are linearly separable if and only if there exists a hyperplane \(\bw \cdot \bx + b = 0\) that places all points of \(A\) on one side and all points of \(B\) on the other. Equivalently:

\[ A \text{ and } B \text{ are linearly separable} \iff \operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset \]

where \(\operatorname{conv}(\cdot)\) denotes the convex hull.

Key Theorems#

Theorem: Perceptron Convergence (Novikoff, 1962)

If the training set is linearly separable with margin \(\gamma > 0\) and all data points satisfy \(\|\bx\| \le R\), then the perceptron learning algorithm makes at most

\[ k \le \frac{R^2}{\gamma^2} \]

weight updates before converging to a correct solution. The bound is independent of the dimension \(n\) and the number of training samples.
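The theorem can be seen in action with a minimal trainer (a sketch of ours) run on the AND function, which is linearly separable, so the mistake count is guaranteed finite:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: on a mistake, w += eta*(y - y_hat)*x,
    b += eta*(y - y_hat). Returns weights, bias, and total update count."""
    w = np.zeros(X.shape[1])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, y):
            y_hat = 1 if w @ x + b >= 0 else 0
            if y_hat != t:
                w += eta * (t - y_hat) * x
                b += eta * (t - y_hat)
                updates += 1
                mistakes += 1
        if mistakes == 0:            # converged: a full pass with no errors
            break
    return w, b, updates

# AND gate: linearly separable, so Novikoff guarantees convergence
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b, k = train_perceptron(X, y)
print(w, b, k)
```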

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(f(\bx) = \text{step}(\bw \cdot \bx + b)\) | Perceptron output |
| \(\bw \leftarrow \bw + \eta(y - \hat{y})\bx\) | Perceptron weight update |
| \(k \le R^2 / \gamma^2\) | Convergence bound (Novikoff) |
| \(\operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset\) | Convex hull separability criterion |
| VC dimension of a perceptron in \(\R^n\) is \(n + 1\) | Shattering capacity |
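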

Boolean Functions#

Fact: Boolean Functions of Two Inputs

There are \(2^{2^2} = 16\) Boolean functions \(f: \{0,1\}^2 \to \{0,1\}\). Of these, 14 are linearly separable and can be computed by a single perceptron. The two that are not linearly separable are XOR and XNOR.
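The count of 14 can be verified by brute force: enumerate all 16 truth tables and search a small integer weight grid for a separating perceptron (an exhaustive sketch of ours; for two Boolean inputs a grid of \(\{-3, \dots, 3\}\) suffices):

```python
from itertools import product

inputs = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)

def separable(truth):
    """Search integer (w1, w2, b) such that step(w1*x1 + w2*x2 + b)
    reproduces the given truth table."""
    grid = range(-3, 4)
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b >= 0) == bool(t)
               for (x1, x2), t in zip(inputs, truth)):
            return True
    return False

tables = list(product((0, 1), repeat=4))   # all 16 truth tables
hard = [t for t in tables if not separable(t)]
print(len(tables) - len(hard))  # 14 linearly separable functions
print(hard)                     # the 2 exceptions: XOR and XNOR
```

The two truth tables printed as non-separable are exactly \((0,1,1,0)\) (XOR) and \((1,0,0,1)\) (XNOR), listed over the inputs \((0,0), (0,1), (1,0), (1,1)\).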

Quick Self-Check#

  • State the perceptron convergence theorem, including all assumptions.

  • What is the geometric interpretation of the margin \(\gamma\)?

  • How does increasing \(R\) (data spread) affect the convergence bound?

  • Why are there exactly 16 Boolean functions of 2 inputs, and which 2 are not linearly separable?

  • What is the VC dimension of a perceptron in \(\R^3\)?


Part III: XOR and Multilayer Networks (Chapters 8, 10, 11)#

Key Definitions#

Definition: XOR Function

The exclusive-or function \(\text{XOR}: \{0,1\}^2 \to \{0,1\}\) is defined by:

| \(x_1\) | \(x_2\) | \(\text{XOR}(x_1, x_2)\) |
|---------|---------|--------------------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Definition: Credit Assignment Problem

In a multilayer network, the credit assignment problem asks: given an error at the output, how should blame (or credit) be distributed among the hidden neurons whose individual contributions to the error are not directly observable? This problem motivated the development of backpropagation.

Key Theorems#

Theorem: XOR Is Not Linearly Separable

Algebraic proof. Suppose \(w_1 x_1 + w_2 x_2 + b \ge 0\) iff \(\text{XOR}(x_1, x_2) = 1\). The four constraints yield:

\[ (0,0) \Rightarrow b < 0, \quad (1,1) \Rightarrow w_1 + w_2 + b < 0, \quad (0,1) \Rightarrow w_2 + b \ge 0, \quad (1,0) \Rightarrow w_1 + b \ge 0. \]

Adding the last two: \(w_1 + w_2 + 2b \ge 0\), i.e., \(w_1 + w_2 \ge -2b\). The second constraint gives \(w_1 + w_2 < -b\), so \(-2b \le w_1 + w_2 < -b\), which forces \(-2b < -b\), i.e., \(b > 0\). This contradicts the first constraint \(b < 0\).

Geometric proof. The points \((0,1)\) and \((1,0)\) (class 1) and \((0,0)\) and \((1,1)\) (class 0) form a pattern where no single line can separate the two classes — the convex hulls intersect.

The Multilayer Solution#

XOR via a Hidden Layer

XOR can be decomposed as:

\[ \text{XOR}(x_1, x_2) = \text{AND}\bigl(\text{OR}(x_1, x_2),\; \text{NAND}(x_1, x_2)\bigr) \]

This requires 2 hidden neurons (one for OR, one for NAND) and 1 output neuron (AND). The hidden layer performs a coordinate transformation that maps the four XOR input points into a linearly separable configuration.
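The decomposition translates directly into a two-layer network of step units; the specific weights below are one standard choice, not the only one:

```python
def step(z):
    return 1 if z >= 0 else 0

def xor(x1, x2):
    """XOR = AND(OR(x1,x2), NAND(x1,x2)) as a network of step units."""
    h1 = step(x1 + x2 - 1)       # hidden unit 1: OR   (theta = 1)
    h2 = step(-x1 - x2 + 1.5)    # hidden unit 2: NAND (fires unless both inputs are 1)
    return step(h1 + h2 - 2)     # output unit:   AND  (theta = 2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```

The hidden activations \((h_1, h_2)\) reproduce the mapping \((0,0) \to (0,1)\), \((0,1) \to (1,1)\), \((1,0) \to (1,1)\), \((1,1) \to (1,0)\), which the output AND unit can separate with a single line.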

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(\text{XOR}(x_1,x_2) = \text{AND}(\text{OR}(x_1,x_2), \text{NAND}(x_1,x_2))\) | XOR decomposition |
| Hidden layer maps \((0,0) \to (0,1)\), \((0,1) \to (1,1)\), \((1,0) \to (1,1)\), \((1,1) \to (1,0)\) | Coordinate transformation |

Quick Self-Check#

  • Give both the algebraic and geometric proofs that XOR is not linearly separable.

  • How many hidden neurons are needed to solve XOR? Can you do it with one?

  • What is the credit assignment problem and why does it arise in multilayer networks?

  • How does the hidden layer transform the XOR inputs into a linearly separable representation?


Part V: Backpropagation (Chapters 15–19)#

Key Definitions#

Definition: Gradient Descent

An iterative optimization algorithm that updates parameters in the direction of steepest descent of the loss function:

\[ \bw_{t+1} = \bw_t - \eta\, \nabla \mathcal{L}(\bw_t) \]

where \(\eta > 0\) is the learning rate and \(\mathcal{L}\) is the loss (cost) function.

Variants: batch (full dataset), stochastic / SGD (single sample), mini-batch (subset of samples).

Definition: Backpropagation Algorithm

An efficient method for computing \(\nabla \mathcal{L}\) in a multilayer network by applying the chain rule layer by layer, from output to input. The four fundamental equations are:

BP1 — Output error:

\[ \bdelta^L = \nabla_a \mathcal{L} \odot \sigma'(\bz^L) \]

BP2 — Error backpropagation:

\[ \bdelta^l = \bigl((\bW^{l+1})^\top \bdelta^{l+1}\bigr) \odot \sigma'(\bz^l) \]

BP3 — Bias gradient:

\[ \frac{\partial \mathcal{L}}{\partial b^l_j} = \delta^l_j \]

BP4 — Weight gradient:

\[ \frac{\partial \mathcal{L}}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j \]

where \(\odot\) is the Hadamard (element-wise) product, \(\sigma'\) is the activation derivative, \(\bz^l\) is the pre-activation, and \(\ba^{l-1}\) is the previous layer’s activation.
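Translated into code, the four equations give a complete gradient computation for a two-layer sigmoid network. This is a minimal sketch of ours, assuming squared-error loss \(\mathcal{L} = \tfrac{1}{2}\|\ba^L - \by\|^2\), so that \(\nabla_a \mathcal{L} = \ba^L - \by\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """One forward/backward pass implementing BP1-BP4 for a two-layer
    sigmoid network with squared-error loss."""
    # Forward pass (sigma'(z) = a(1-a), so caching activations suffices)
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    # BP1: output error, delta^L = (a^L - y) * sigma'(z^L)
    d2 = (a2 - y) * a2 * (1 - a2)
    # BP2: propagate the error backward through W2
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    # BP3 (bias gradients = deltas) and BP4 (weight gradients = outer products)
    return {"W1": np.outer(d1, x), "b1": d1,
            "W2": np.outer(d2, a1), "b2": d2}
```

Note how BP4 appears as an outer product: each weight gradient is the downstream error \(\delta^l_j\) times the upstream activation \(a^{l-1}_k\).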

Key Theorems#

Theorem: Universal Approximation (Cybenko, 1989; Hornik, 1989)

Let \(\sigma\) be any continuous sigmoidal function. Then for any continuous function \(f: [0,1]^n \to \R\) and any \(\varepsilon > 0\), there exists a single-hidden-layer network

\[ F(\bx) = \sum_{j=1}^{N} \alpha_j \, \sigma(\bw_j \cdot \bx + b_j) \]

such that \(\|F - f\|_\infty < \varepsilon\). That is, one hidden layer with sufficiently many neurons can approximate any continuous function on a compact set to arbitrary precision.

Important caveat: The theorem guarantees existence but says nothing about how to find the weights, nor does it bound the number of hidden neurons needed.
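The existence claim can be made concrete without any training: with steep sigmoids, \(F\) can be built as a staircase that tracks \(f\), one sigmoid per step. This is an illustrative construction of ours, not the proof technique of the theorem:

```python
import numpy as np

def sigma(z):
    # Clip to avoid overflow warnings for very steep sigmoids
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def staircase_approx(f, n_steps=100, k=1000.0):
    """Return F(x) = f(0) + sum_j alpha_j * sigma(k*(x - c_j)): a sum of
    steep sigmoids whose jump heights follow f on a grid over [0, 1]."""
    grid = np.linspace(0.0, 1.0, n_steps + 1)
    centers = 0.5 * (grid[:-1] + grid[1:])   # one transition per cell
    alphas = np.diff(f(grid))                # jump heights f(x_{j+1}) - f(x_j)
    f0 = f(grid[0])
    def F(x):
        x = np.asarray(x, dtype=float)
        return f0 + np.sum(alphas[:, None] * sigma(k * (x[None, :] - centers[:, None])),
                           axis=0)
    return F

f = lambda x: np.sin(2 * np.pi * np.asarray(x))
F = staircase_approx(f)
xs = np.linspace(0, 1, 501)
print(np.max(np.abs(F(xs) - f(xs))))  # max error; shrinks as n_steps grows
```

Increasing `n_steps` drives the error toward zero, mirroring the theorem's "sufficiently many neurons" clause, while also illustrating the caveat: the construction says nothing about how *few* neurons would suffice.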

Activation Functions#

| Function | Formula | Derivative | Notes |
|----------|---------|------------|-------|
| Sigmoid | \(\sigma(x) = \frac{1}{1 + e^{-x}}\) | \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) | Max derivative \(= 1/4\) at \(x = 0\) |
| Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | \(1 - \tanh^2(x)\) | Zero-centered, max derivative \(= 1\) |
| ReLU | \(\max(0, x)\) | \(1\) if \(x > 0\), else \(0\) | No saturation for \(x > 0\); dead neurons for \(x < 0\) |

The Vanishing Gradient Problem#

Vanishing Gradients

With sigmoid activations, each backward step through a layer multiplies the error signal by \(\sigma'(\bz^l)\), and since \(\max_x \sigma'(x) = 1/4\), this factor is at most \(1/4\) per layer (ignoring the weight matrices). The gradient reaching a layer \(k\) levels below the output therefore scales as

\[ \|\text{gradient } k \text{ layers below the output}\| \propto \left(\frac{1}{4}\right)^k \]

For \(k = 10\) layers, the gradient is attenuated by a factor of \(\sim 10^{-6}\). This makes training deep sigmoid networks extremely difficult.
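The numbers are easy to confirm: the sigmoid derivative peaks at \(1/4\), and ten such factors shrink a signal below \(10^{-6}\) (a quick numerical check):

```python
import numpy as np

xs = np.linspace(-10, 10, 10001)
s = 1.0 / (1.0 + np.exp(-xs))
d = s * (1 - s)          # sigmoid derivative sigma(x) * (1 - sigma(x))
print(d.max())           # 0.25, attained at x = 0
print(0.25 ** 10)        # ~9.5e-7: attenuation across 10 sigmoid layers
```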

Essential Formulas#

| Formula | Meaning |
|---------|---------|
| \(\bw_{t+1} = \bw_t - \eta \nabla \mathcal{L}(\bw_t)\) | Gradient descent update |
| \(\bdelta^L = \nabla_a \mathcal{L} \odot \sigma'(\bz^L)\) | BP1: output layer error |
| \(\bdelta^l = ((\bW^{l+1})^\top \bdelta^{l+1}) \odot \sigma'(\bz^l)\) | BP2: hidden layer error propagation |
| \(\frac{\partial \mathcal{L}}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j\) | BP4: weight gradient |
| \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) | Sigmoid derivative |
| \(\left(\frac{1}{4}\right)^k\) | Gradient attenuation through \(k\) sigmoid layers |
| \(\frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon}\) | Two-sided gradient check (\(\varepsilon \approx 10^{-7}\), relative error \(< 10^{-5}\)) |
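The gradient-checking recipe can be sketched as follows (our helper; the test function is a toy quadratic \(\mathcal{L}(\bw) = \tfrac{1}{2}\|\bw\|^2\), whose gradient is \(\bw\) in closed form):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-7):
    """Two-sided finite differences: g_i = (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps
        fp = f(w)
        w[i] -= 2 * eps
        fm = f(w)
        w[i] += eps          # restore the original value
        g[i] = (fp - fm) / (2 * eps)
    return g

w = np.array([0.3, -1.2, 2.0])
num = numerical_grad(lambda v: 0.5 * float(v @ v), w)
# Analytic gradient of 0.5*||w||^2 is w itself; compare via relative error
rel_err = np.linalg.norm(num - w) / (np.linalg.norm(num) + np.linalg.norm(w))
print(rel_err)  # should be well below 1e-5
```

The same comparison, applied component by component to a backpropagation implementation, is the standard way to catch indexing and sign bugs before training.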

Quick Self-Check#

  • Write down the four backpropagation equations (BP1–BP4) and explain each term.

  • Why is the sigmoid derivative at most \(1/4\)? How does this cause vanishing gradients?

  • How does ReLU mitigate the vanishing gradient problem? What new issue does it introduce?

  • State the Universal Approximation Theorem. What does it guarantee and what does it not guarantee?

  • How would you verify a backpropagation implementation using gradient checking?

  • What is the difference between batch, stochastic, and mini-batch gradient descent?


Master Formula Reference#

The table below collects the single most important formula from each chapter.

| Chapter | Topic | Key Formula |
|---------|-------|-------------|
| Ch 1 | MP neuron firing rule | \(x_i(t+1) = 1 \iff \sum_{j \in E_i} x_j(t) \ge \theta_i\) (no inhibitory veto) |
| Ch 2 | Boolean gates | AND: \(\theta=2\); OR: \(\theta=1\); NOT: inhibitory input |
| Ch 4 | Perceptron model | \(f(\bx) = \text{step}(\bw \cdot \bx + b)\) |
| Ch 5 | Perceptron learning | \(\bw \leftarrow \bw + \eta(y - \hat{y})\bx\) |
| Ch 6 | Convergence bound | \(k \le R^2 / \gamma^2\) |
| Ch 7 | Boolean functions | 14 of 16 functions of 2 inputs are linearly separable |
| Ch 8 | XOR impossibility | No \(\bw, b\) satisfy all four XOR constraints simultaneously |
| Ch 10 | Separability criterion | \(\operatorname{conv}(A) \cap \operatorname{conv}(B) = \emptyset\) |
| Ch 11 | XOR solution | \(\text{XOR} = \text{AND}(\text{OR}, \text{NAND})\) |
| Ch 15 | Gradient descent | \(\bw_{t+1} = \bw_t - \eta \nabla \mathcal{L}(\bw_t)\) |
| Ch 16 | Backpropagation | \(\bdelta^l = ((\bW^{l+1})^\top \bdelta^{l+1}) \odot \sigma'(\bz^l)\) |
| Ch 17 | Sigmoid derivative | \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), max \(= 1/4\) |
| Ch 18 | Gradient checking | \(\frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon}\), relative error \(< 10^{-5}\) |
| Ch 19 | UAT | \(F(\bx) = \sum_j \alpha_j \sigma(\bw_j \cdot \bx + b_j)\) approximates any \(f \in C([0,1]^n)\) |


Key People#

| Person(s) | Contribution | Year |
|-----------|--------------|------|
| Warren McCulloch & Walter Pitts | First mathematical model of a neuron | 1943 |
| Frank Rosenblatt | Perceptron model and learning algorithm | 1958 |
| Albert Novikoff | Perceptron convergence theorem | 1962 |
| Marvin Minsky & Seymour Papert | *Perceptrons* — proved limitations of single-layer networks | 1969 |
| David Rumelhart, Geoffrey Hinton & Ronald Williams | Popularized backpropagation for multilayer networks | 1986 |
| George Cybenko | Universal Approximation Theorem (sigmoid) | 1989 |
| Kurt Hornik | Universal Approximation Theorem (general) | 1989 |


Ready to Test Yourself?#

Take the Quiz!

Now that you have reviewed the material, test your knowledge with our 50-question interactive quiz. The quiz covers all parts of the course and provides immediate feedback with explanations.

Target: Aim for at least 80% (40/50) to confirm solid understanding of the fundamentals.