Chapter 8: The XOR Problem#

The Most Famous Impossibility Result in Neural Network Theory#

The XOR (exclusive-or) problem stands as one of the most important results in the history of artificial intelligence. It is the simplest Boolean function that a single-layer perceptron cannot compute. This seemingly modest observation had enormous consequences: it helped trigger the first “AI winter,” a period of reduced funding and interest in neural network research that lasted from roughly 1969 to 1986.

In this chapter, we provide two complete proofs that XOR is not linearly separable, visualize the geometric obstruction, watch a perceptron fail to learn XOR, and then show how adding a single hidden layer resolves the problem entirely.

Why XOR Matters#

XOR is not just an abstract curiosity. It is:

  • The simplest non-linearly-separable Boolean function (only 2 inputs, 4 data points).

  • Functionally complete when combined with AND and a constant 1: {XOR, AND, 1} can generate any Boolean function via algebraic normal form.

  • The 2-bit case of the PARITY function, which generalizes to \(n\) bits.

  • A pedagogical gem: the failure is visible in a 2D plot with just 4 points.

Tip

Why XOR is the simplest non-linearly-separable function.

There are exactly \(2^{2^2} = 16\) Boolean functions of two variables. Of these 16, exactly 14 are linearly separable (AND, OR, NAND, NOR, constant 0/1, identity, negation, implications, etc.). Only 2 are not: XOR and XNOR (its complement). XOR uses the absolute minimum — 2 inputs, 4 data points — to demonstrate the failure of linear separability. You literally cannot construct a simpler counterexample.
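The count is easy to check by machine. The brute-force sketch below enumerates all 16 two-input truth tables and searches a small weight grid (the grid is an arbitrary choice of ours; it happens to contain a separating perceptron for every one of the 14 separable functions, while XOR and XNOR provably admit none):

```python
# Brute-force check: which of the 16 two-input Boolean functions can a
# single perceptron step(w1*x1 + w2*x2 + b) compute?
import itertools
import numpy as np

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = np.linspace(-2, 2, 17)  # candidate values for w1, w2, b (step 0.25)

n_separable, not_separable = 0, []
for outputs in itertools.product([0, 1], repeat=4):  # all 16 truth tables
    found = any(
        all(int(w1 * x1 + w2 * x2 + b >= 0) == t
            for (x1, x2), t in zip(inputs, outputs))
        for w1, w2, b in itertools.product(grid, repeat=3)
    )
    if found:
        n_separable += 1
    else:
        not_separable.append(outputs)

print(f"Separable: {n_separable} of 16")
# Expect exactly two misses: XOR (0,1,1,0) and XNOR (1,0,0,1),
# listed as (f(0,0), f(0,1), f(1,0), f(1,1)).
print(f"Not separable: {not_separable}")
```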

Danger

THE critical limitation — XOR impossibility killed perceptron research and triggered the first AI Winter. This is the single most consequential result in early neural network history.

When Minsky and Papert published their proof in 1969, funding agencies interpreted it as evidence that neural networks were a dead end. DARPA and other agencies redirected funding to symbolic AI. Researchers who had devoted their careers to neural networks found themselves unable to publish or obtain grants. The field went into hibernation for nearly 17 years, until the backpropagation revolution of 1986.

Note

The Minsky-Rosenblatt Controversy

The story of the XOR problem is inseparable from the rivalry between Marvin Minsky and Frank Rosenblatt. Both were at Cornell in the late 1950s. Rosenblatt was the optimistic champion of the perceptron, making bold claims about its potential (some justified, some not). Minsky was the skeptic, who saw fundamental limitations that Rosenblatt’s enthusiasm glossed over.

Minsky’s 1969 book Perceptrons, co-authored with Seymour Papert, was in part a response to what Minsky perceived as overblown hype. The XOR impossibility proof was the book’s centerpiece. Rosenblatt died in a boating accident in 1971, at age 43, before he could respond to or outlive the criticism. The field he pioneered would not recover until long after his death.

The tragedy is that Minsky and Papert were mathematically correct but strategically devastating. Their theorems applied only to single-layer perceptrons, yet the broader community interpreted them as condemning all neural networks.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

# Set consistent style
plt.rcParams.update({
    'figure.figsize': (8, 6),
    'font.size': 12,
    'axes.grid': True,
    'grid.alpha': 0.3
})

1. XOR Definition#

The exclusive-or function returns 1 when exactly one of its inputs is 1, and 0 otherwise.

Truth Table#

| \(x_1\) | \(x_2\) | \(x_1 \oplus x_2\) |
|:---:|:---:|:---:|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Algebraic Identities#

XOR can be expressed algebraically in several equivalent ways:

\[x_1 \oplus x_2 = x_1 + x_2 - 2x_1 x_2\]

This is because:

  • When both are 0: \(0 + 0 - 0 = 0\)

  • When one is 1: \(1 + 0 - 0 = 1\) or \(0 + 1 - 0 = 1\)

  • When both are 1: \(1 + 1 - 2 = 0\)

Equivalently:

\[x_1 \oplus x_2 = (x_1 - x_2)^2\]

since \((x_1 - x_2)^2 = x_1^2 - 2x_1 x_2 + x_2^2 = x_1 - 2x_1 x_2 + x_2\) for binary inputs (where \(x^2 = x\)).
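Both identities can be verified exhaustively over the four binary inputs; a minimal check:

```python
# Verify the two algebraic identities for XOR on all four binary inputs
import itertools

for x1, x2 in itertools.product([0, 1], repeat=2):
    xor = x1 ^ x2                    # built-in exclusive-or
    poly = x1 + x2 - 2 * x1 * x2     # polynomial identity
    square = (x1 - x2) ** 2          # squared-difference identity
    assert xor == poly == square
    print(f"x1={x1}, x2={x2}: XOR={xor}, poly={poly}, square={square}")
```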

2. Proof 1: Algebraic Proof of Non-Separability#

Theorem (XOR Impossibility)

No single perceptron (with step activation) can compute XOR. That is, there exist no weights \(w_1, w_2\) and bias \(b\) such that

\[f(x_1, x_2) = \text{step}(w_1 x_1 + w_2 x_2 + b)\]

computes the XOR function.

Proof. Suppose for contradiction that there exist weights \(w_1, w_2\) and bias \(b\) such that

\[f(x_1, x_2) = \text{step}(w_1 x_1 + w_2 x_2 + b)\]

computes XOR, where \(\text{step}(z) = 1\) if \(z \geq 0\) and \(\text{step}(z) = 0\) if \(z < 0\).

From the truth table, we require:

| Input \(\to\) Output | Condition | Simplified |
|:---:|:---:|:---:|
| \((0,0) \to 0\) | \(w_1(0) + w_2(0) + b < 0\) | \(b < 0\) … (i) |
| \((0,1) \to 1\) | \(w_1(0) + w_2(1) + b \geq 0\) | \(w_2 + b \geq 0\) … (ii) |
| \((1,0) \to 1\) | \(w_1(1) + w_2(0) + b \geq 0\) | \(w_1 + b \geq 0\) … (iii) |
| \((1,1) \to 0\) | \(w_1(1) + w_2(1) + b < 0\) | \(w_1 + w_2 + b < 0\) … (iv) |

Adding inequalities (ii) and (iii):

\[(w_2 + b) + (w_1 + b) \geq 0\]
\[w_1 + w_2 + 2b \geq 0\]
\[w_1 + w_2 \geq -2b \quad \text{...(v)}\]

From inequality (iv):

\[w_1 + w_2 < -b \quad \text{...(vi)}\]

Combining (v) and (vi):

\[-2b \leq w_1 + w_2 < -b\]

This requires \(-2b < -b\), which simplifies to:

\[b > 0\]

But this contradicts inequality (i), which requires \(b < 0\).

\[\boxed{\text{Contradiction. Therefore no such } w_1, w_2, b \text{ exist. } \blacksquare}\]
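The same contradiction can be exhibited numerically: encode (i)-(iv) as a linear program, replacing each strict inequality \(< 0\) with \(\leq -\varepsilon\) for a small margin \(\varepsilon > 0\), and ask a solver for any feasible point. A sketch, assuming SciPy is available:

```python
# Feasibility check for the XOR constraints (i)-(iv) via linear programming.
# Variables: x = [w1, w2, b]; strict inequalities use a margin eps > 0.
import numpy as np
from scipy.optimize import linprog

eps = 1e-6
A_ub = np.array([
    [0.0,  0.0,  1.0],   # (i)   b <= -eps
    [0.0, -1.0, -1.0],   # (ii)  -(w2 + b) <= 0
    [-1.0, 0.0, -1.0],   # (iii) -(w1 + b) <= 0
    [1.0,  1.0,  1.0],   # (iv)  w1 + w2 + b <= -eps
])
b_ub = np.array([-eps, 0.0, 0.0, -eps])

res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3, method="highs")
print("Feasible:", res.success)  # expect False: no (w1, w2, b) exists
```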

3. Proof 2: Geometric Proof of Non-Separability#

Proof (Geometric Argument)

Theorem. The XOR dataset is not linearly separable.

Proof. Consider the two classes:

  • Class 0 (output 0): \(C_0 = \{(0,0), (1,1)\}\)

  • Class 1 (output 1): \(C_1 = \{(0,1), (1,0)\}\)

The convex hull of \(C_0\) is the line segment from \((0,0)\) to \((1,1)\):

\[\text{conv}(C_0) = \{\lambda(0,0) + (1-\lambda)(1,1) : \lambda \in [0,1]\} = \{(t, t) : t \in [0,1]\}\]

The convex hull of \(C_1\) is the line segment from \((0,1)\) to \((1,0)\):

\[\text{conv}(C_1) = \{\lambda(0,1) + (1-\lambda)(1,0) : \lambda \in [0,1]\} = \{(1-s, s) : s \in [0,1]\}\]

These two segments intersect at \((0.5, 0.5)\) (set \(t = 0.5\) and \(s = 0.5\)).

By the Separating Hyperplane Theorem, two finite point sets are linearly separable if and only if their convex hulls are disjoint. Since \(\text{conv}(C_0) \cap \text{conv}(C_1) = \{(0.5, 0.5)\} \neq \emptyset\), the XOR dataset is not linearly separable. \(\blacksquare\)
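For completeness, the intersection point follows from equating the two parametrizations, \((t, t) = (1-s, s)\), which gives the linear system \(t + s = 1\), \(t - s = 0\). A quick numeric check (a sketch):

```python
# Solve (t, t) = (1 - s, s): first coordinates give t + s = 1,
# second coordinates give t - s = 0
import numpy as np

A = np.array([[1.0,  1.0],
              [1.0, -1.0]])
rhs = np.array([1.0, 0.0])
t, s = np.linalg.solve(A, rhs)
print(f"t = {t}, s = {s} -> intersection at ({t}, {t})")  # (0.5, 0.5)
```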

Hide code cell source
# Geometric proof: Visualize XOR points and convex hulls

fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# XOR data points
class0_x = np.array([[0, 0], [1, 1]])  # output 0
class1_x = np.array([[0, 1], [1, 0]])  # output 1

# Draw convex hulls (line segments in this case)
# Class 0 hull: line from (0,0) to (1,1) - diagonal
t = np.linspace(0, 1, 100)
ax.plot(t, t, 'b-', linewidth=3, alpha=0.4, label='conv(Class 0)')
ax.fill_between(t, t - 0.02, t + 0.02, alpha=0.15, color='blue')

# Class 1 hull: line from (0,1) to (1,0) - anti-diagonal  
ax.plot(t, 1 - t, 'r-', linewidth=3, alpha=0.4, label='conv(Class 1)')
ax.fill_between(t, 1 - t - 0.02, 1 - t + 0.02, alpha=0.15, color='red')

# Plot the intersection point
ax.plot(0.5, 0.5, 'k*', markersize=20, zorder=10, label='Intersection (0.5, 0.5)')

# Plot data points
ax.scatter(class0_x[:, 0], class0_x[:, 1], c='blue', s=200, zorder=5,
           edgecolors='black', linewidths=2, marker='o', label='Class 0: XOR=0')
ax.scatter(class1_x[:, 0], class1_x[:, 1], c='red', s=200, zorder=5,
           edgecolors='black', linewidths=2, marker='s', label='Class 1: XOR=1')

# Annotate points
offsets = {(0, 0): (-0.15, -0.1), (1, 1): (0.05, 0.05),
           (0, 1): (-0.15, 0.05), (1, 0): (0.05, -0.1)}
for pt, label in [((0, 0), '(0,0)\u21920'), ((1, 1), '(1,1)\u21920'),
                   ((0, 1), '(0,1)\u21921'), ((1, 0), '(1,0)\u21921')]:
    ox, oy = offsets[pt]
    ax.annotate(label, pt, xytext=(pt[0]+ox, pt[1]+oy), fontsize=13, fontweight='bold')

ax.set_xlim(-0.3, 1.3)
ax.set_ylim(-0.3, 1.3)
ax.set_xlabel('$x_1$', fontsize=14)
ax.set_ylabel('$x_2$', fontsize=14)
ax.set_title('XOR: Convex Hulls Intersect \u2192 Not Linearly Separable', fontsize=14)
ax.legend(fontsize=11, loc='upper right')
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3a. AND vs XOR: Side-by-Side Decision Boundary Comparison#

To build intuition for why XOR fails where AND succeeds, let us place the two problems side by side. For AND, a single line cleanly separates the classes. For XOR, no line can do the job.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

for ax, y_vals, title, subtitle in [
    (axes[0], y_and, 'AND Function', 'Linearly Separable'),
    (axes[1], y_xor, 'XOR Function', 'NOT Linearly Separable')
]:
    # Decision region background
    xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 300), np.linspace(-0.5, 1.5, 300))
    
    if 'AND' in title:
        # AND: w1=1, w2=1, b=-1.5 => separating line: x1 + x2 = 1.5
        Z = (xx + yy >= 1.5).astype(float)
        ax.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5], colors=['#BBDDFF', '#FFBBBB'], alpha=0.4)
        ax.contour(xx, yy, Z, levels=[0.5], colors=['green'], linewidths=3)
        ax.text(0.2, 1.35, 'Separating line: $x_1 + x_2 = 1.5$', fontsize=11, color='green', fontweight='bold')
    else:
        # XOR: show several failed attempts
        x_line = np.linspace(-0.5, 1.5, 100)
        for slope, intercept, ls in [(-1, 0.5, '--'), (-1, 1.5, '-.'), (-0.5, 0.8, ':')]:
            ax.plot(x_line, slope * x_line + intercept, color='gray', linestyle=ls,
                    linewidth=2, alpha=0.5)
        ax.text(0.05, 1.35, 'No single line works!', fontsize=12, color='red', fontweight='bold')

    # Plot data points
    for i in range(4):
        color = 'red' if y_vals[i] == 1 else 'blue'
        marker = 's' if y_vals[i] == 1 else 'o'
        ax.scatter(X[i, 0], X[i, 1], c=color, s=300, marker=marker,
                   edgecolors='black', linewidths=2, zorder=5)
        ax.annotate(f'({X[i,0]},{X[i,1]})\u2192{y_vals[i]}',
                    (X[i, 0], X[i, 1]),
                    xytext=(X[i, 0] + 0.08, X[i, 1] - 0.12),
                    fontsize=11, fontweight='bold')
    
    ax.set_xlim(-0.5, 1.5)
    ax.set_ylim(-0.5, 1.5)
    ax.set_xlabel('$x_1$', fontsize=14)
    ax.set_ylabel('$x_2$', fontsize=14)
    ax.set_title(f'{title}\n({subtitle})', fontsize=14)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.suptitle('AND vs XOR: Why One Line Is Enough for AND but Not for XOR',
             fontsize=15, y=1.02)
plt.tight_layout()
plt.show()

3b. Convex Hull Intersection: Why XOR Fails#

The convex hull criterion provides the definitive geometric test. Below we show the convex hulls for AND (disjoint) and XOR (intersecting) side by side, making the obstruction visually obvious.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon as MplPolygon

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

for ax, y_vals, title in [
    (axes[0], y_and, 'AND: Convex Hulls Are DISJOINT'),
    (axes[1], y_xor, 'XOR: Convex Hulls INTERSECT')
]:
    class0 = X[y_vals == 0]
    class1 = X[y_vals == 1]

    # Draw convex hulls
    if len(class0) >= 3:
        # Order hull vertices by angle around the centroid so the
        # polygon outline traces the convex hull without self-crossing
        centroid = class0.mean(axis=0)
        angles = np.arctan2(class0[:, 1] - centroid[1], class0[:, 0] - centroid[0])
        order = np.argsort(angles)
        hull_pts = class0[order]
        poly = MplPolygon(hull_pts, alpha=0.2, facecolor='blue', edgecolor='blue', linewidth=2)
        ax.add_patch(poly)
    elif len(class0) == 2:
        ax.plot(class0[:, 0], class0[:, 1], 'b-', linewidth=3, alpha=0.4)
    
    if len(class1) >= 3:
        centroid = class1.mean(axis=0)
        angles = np.arctan2(class1[:, 1] - centroid[1], class1[:, 0] - centroid[0])
        order = np.argsort(angles)
        hull_pts = class1[order]
        poly = MplPolygon(hull_pts, alpha=0.2, facecolor='red', edgecolor='red', linewidth=2)
        ax.add_patch(poly)
    elif len(class1) == 2:
        ax.plot(class1[:, 0], class1[:, 1], 'r-', linewidth=3, alpha=0.4)
    elif len(class1) == 1:
        pass  # Single point, hull is just the point
    
    # Check intersection for XOR case
    if 'XOR' in title:
        ax.plot(0.5, 0.5, 'k*', markersize=20, zorder=10, label='Intersection')
        ax.annotate('Hulls intersect\nat (0.5, 0.5)!', (0.5, 0.5),
                    xytext=(0.6, 0.2), fontsize=12, color='black', fontweight='bold',
                    arrowprops=dict(arrowstyle='->', color='black', linewidth=2))
    else:
        ax.annotate('Hulls are\ndisjoint!', (0.5, 0.5),
                    xytext=(0.4, 0.5), fontsize=12, color='green', fontweight='bold')
    
    # Plot data points
    ax.scatter(class0[:, 0], class0[:, 1], c='blue', s=250, zorder=5,
               edgecolors='black', linewidths=2, marker='o', label='Class 0')
    ax.scatter(class1[:, 0], class1[:, 1], c='red', s=250, zorder=5,
               edgecolors='black', linewidths=2, marker='s', label='Class 1')
    
    # Annotate points
    for i in range(4):
        ax.annotate(f'({X[i,0]},{X[i,1]})\u2192{y_vals[i]}',
                    (X[i, 0], X[i, 1]),
                    xytext=(X[i, 0] + 0.07, X[i, 1] + 0.07),
                    fontsize=10, fontweight='bold')
    
    ax.set_xlim(-0.3, 1.4)
    ax.set_ylim(-0.3, 1.4)
    ax.set_xlabel('$x_1$', fontsize=14)
    ax.set_ylabel('$x_2$', fontsize=14)
    sep_status = 'Separable' if 'AND' in title else 'NOT Separable'
    ax.set_title(f'{title}\n\u2192 {sep_status}', fontsize=13)
    ax.legend(fontsize=10, loc='upper right')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.suptitle('Convex Hull Criterion: Disjoint Hulls \u21D4 Linearly Separable',
             fontsize=15, y=1.02)
plt.tight_layout()
plt.show()

4. The Perceptron Trying to Learn XOR#

Let us now watch a perceptron attempt to learn XOR using the classical perceptron learning rule. We know from the algebraic proof that convergence is impossible: the perceptron convergence theorem guarantees convergence only when the data is linearly separable. For XOR, the weights cycle endlessly and the per-epoch error count never reaches zero.

Hide code cell source
# Perceptron trying (and failing) to learn XOR

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def step(z):
    return (z >= 0).astype(int)

# Perceptron learning
np.random.seed(42)
w = np.random.randn(2) * 0.5
b = 0.0
lr = 0.1
n_epochs = 100

errors_per_epoch = []
weight_history = [(w.copy(), b)]

for epoch in range(n_epochs):
    errors = 0
    for i in range(len(X)):
        z = np.dot(w, X[i]) + b
        y_pred = step(z)
        error = y[i] - y_pred
        if error != 0:
            w = w + lr * error * X[i]
            b = b + lr * error
            errors += 1
    errors_per_epoch.append(errors)
    weight_history.append((w.copy(), b))

print(f"Final weights: w = {w}, b = {b:.4f}")
print(f"Errors in last 10 epochs: {errors_per_epoch[-10:]}")
print(f"Total epochs with zero errors: {sum(1 for e in errors_per_epoch if e == 0)}")
Final weights: w = [-0.05164292 -0.06913215], b = 0.0000
Errors in last 10 epochs: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
Total epochs with zero errors: 0
Hide code cell source
# Visualize the perceptron's failure

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Errors per epoch (oscillating)
ax1 = axes[0]
ax1.plot(range(1, n_epochs + 1), errors_per_epoch, 'r-o', markersize=3, linewidth=1)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Number of Errors', fontsize=12)
ax1.set_title('Perceptron on XOR: Errors Never Reach Zero', fontsize=13)
ax1.set_ylim(-0.5, max(errors_per_epoch) + 0.5)
ax1.axhline(y=0, color='green', linestyle='--', alpha=0.7, label='Target: 0 errors')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Weight trajectory
ax2 = axes[1]
w1_hist = [wh[0][0] for wh in weight_history]
w2_hist = [wh[0][1] for wh in weight_history]
b_hist = [wh[1] for wh in weight_history]

ax2.plot(w1_hist, w2_hist, 'b-', alpha=0.5, linewidth=0.8)
ax2.plot(w1_hist[0], w2_hist[0], 'go', markersize=12, label='Start', zorder=5)
ax2.plot(w1_hist[-1], w2_hist[-1], 'rs', markersize=12, label='End', zorder=5)
ax2.set_xlabel('$w_1$', fontsize=14)
ax2.set_ylabel('$w_2$', fontsize=14)
ax2.set_title('Weight Trajectory: Cycling, Never Settling', fontsize=13)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

As predicted by the theory, the perceptron never converges. The error count never reaches zero; in this run it settles into a cycle in which all four points are misclassified each epoch and the weights return to the same values. This is not a failure of the learning rate or initialization: it is a fundamental impossibility.
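As a sanity check, the identical learning rule converges quickly on AND, which is linearly separable. The sketch below reuses the same update loop and hyperparameters, confirming that the failure above is a property of the XOR data, not of the algorithm:

```python
# Same perceptron learning rule, but on AND (linearly separable):
# the convergence theorem now applies and the loop terminates.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

np.random.seed(42)
w = np.random.randn(2) * 0.5
b, lr = 0.0, 0.1

for epoch in range(100):
    errors = 0
    for xi, ti in zip(X, y_and):
        error = ti - int(np.dot(w, xi) + b >= 0)
        if error != 0:
            w = w + lr * error * xi
            b = b + lr * error
            errors += 1
    if errors == 0:
        print(f"Converged on AND after {epoch + 1} epochs: w = {w}, b = {b:.2f}")
        break
```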

5. Why XOR Matters#

The XOR problem is far more than an academic exercise. It occupies a central place in the theory and history of neural networks for several interconnected reasons:

Simplest Non-Linearly-Separable Function#

Among all Boolean functions of two variables, XOR is the simplest that cannot be computed by a single perceptron. Of the 16 such functions, 14 (including AND, OR, NAND, and NOR) are linearly separable; only XOR and XNOR are not. XOR uses the minimum number of inputs and data points to exhibit the limitation.

Functional Completeness#

The set \(\{\text{XOR}, \text{AND}, 1\}\) is functionally complete: with a constant 1 available, XOR and AND generate every Boolean function via algebraic normal form. This means that the inability to compute XOR is not a minor gap — it locks out an entire class of computations.

Parity Generalization#

XOR is \(\text{PARITY}(2)\): it outputs 1 when an odd number of inputs are 1. The \(n\)-bit parity function

\[\text{PARITY}(x_1, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n\]

is equally impossible for a single perceptron. Minsky and Papert further showed that parity has order \(n\): any perceptron computing it from local predicates needs predicates that examine all \(n\) input bits at once.
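For experimentation (for example, to generate targets for Exercise 8.3 below), \(n\)-bit parity is a one-liner with the standard library; a minimal sketch:

```python
# n-bit parity as iterated XOR, using only the standard library
from functools import reduce
from operator import xor

def parity(bits):
    """Return 1 if an odd number of bits are 1, else 0."""
    return reduce(xor, bits, 0)

print(parity([1, 0]))        # 1, same as XOR(1, 0)
print(parity([1, 0, 1]))     # 0 (two ones: even)
print(parity([1, 1, 1, 0]))  # 1 (three ones: odd)
```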

Pedagogical Clarity#

With only 4 data points in 2 dimensions, the XOR problem can be fully visualized. The geometric intuition — two classes whose convex hulls intersect — is immediate and memorable.

6. The Solution: Adding a Hidden Layer#

The key insight is that XOR can be decomposed into operations that are individually linearly separable:

\[\text{XOR}(x_1, x_2) = \text{OR}(x_1, x_2) \;\text{AND}\; \text{NAND}(x_1, x_2)\]

This is because:

  • OR returns 1 when at least one input is 1.

  • NAND returns 0 only when both inputs are 1.

  • Their AND returns 1 exactly when at least one input is 1 but not both — which is XOR.

Network Architecture#

Input Layer      Hidden Layer       Output Layer

                 ┌──────┐
  x₁ ────────────│  OR  │──────┐
        ╲        └──────┘      │   ┌───────┐
         ╲                     ├───│  AND  │──── y
          ╲      ┌──────┐      │   └───────┘
  x₂ ─────╲──────│ NAND │──────┘
           ╲     └──────┘
            ╲──────▲

(Each input feeds both hidden neurons; for readability only one of the two crossing connections is drawn.)

Exact Weights#

  • Hidden neuron 1 (OR): \(h_1 = \text{step}(x_1 + x_2 - 0.5)\)

    • Fires when \(x_1 + x_2 \geq 0.5\), i.e., when at least one input is 1.

  • Hidden neuron 2 (NAND): \(h_2 = \text{step}(-x_1 - x_2 + 1.5)\)

    • Fires when \(-x_1 - x_2 + 1.5 \geq 0\), i.e., when \(x_1 + x_2 \leq 1.5\), i.e., NOT both 1.

  • Output neuron (AND): \(y = \text{step}(h_1 + h_2 - 1.5)\)

    • Fires when \(h_1 + h_2 \geq 1.5\), i.e., when both \(h_1 = 1\) and \(h_2 = 1\).

Hide code cell source
# Implementation: XOR via OR + NAND + AND

def step(z):
    return (z >= 0).astype(int)

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_true = np.array([0, 1, 1, 0])

# Construction 1: OR + NAND + AND
print("Construction 1: XOR = OR AND NAND")
print("=" * 55)
print(f"{'x1':>3} {'x2':>3} | {'h1(OR)':>7} {'h2(NAND)':>9} | {'y(AND)':>7} {'target':>7}")
print("-" * 55)

for i in range(4):
    x1, x2 = X[i]
    # Hidden layer
    h1 = step(np.array([x1 + x2 - 0.5]))  # OR
    h2 = step(np.array([-x1 - x2 + 1.5]))  # NAND
    # Output layer
    y_pred = step(np.array([h1[0] + h2[0] - 1.5]))  # AND
    
    match = '  OK' if y_pred[0] == y_true[i] else '  FAIL'
    print(f"{x1:>3.0f} {x2:>3.0f} | {h1[0]:>7} {h2[0]:>9} | {y_pred[0]:>7} {y_true[i]:>7}{match}")

print("\nAll outputs match! The 2-layer network correctly computes XOR.")
Construction 1: XOR = OR AND NAND
=======================================================
 x1  x2 |  h1(OR)  h2(NAND) |  y(AND)  target
-------------------------------------------------------
  0   0 |       0         1 |       0       0  OK
  0   1 |       1         1 |       1       1  OK
  1   0 |       1         1 |       1       1  OK
  1   1 |       1         0 |       0       0  OK

All outputs match! The 2-layer network correctly computes XOR.
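The same construction can be written compactly as two matrix layers acting on all four inputs at once. This vectorized sketch is our own restatement of the network above (the names `W1`, `b1`, `W2`, `b2` are ours), not a different construction:

```python
# The OR/NAND/AND network in matrix form: one weight matrix per layer
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Layer 1: rows are the OR and NAND neurons
W1 = np.array([[ 1.0,  1.0],   # OR
               [-1.0, -1.0]])  # NAND
b1 = np.array([-0.5, 1.5])

# Layer 2: a single AND neuron over the hidden activations
W2 = np.array([1.0, 1.0])
b2 = -1.5

H = (X @ W1.T + b1 >= 0).astype(int)  # hidden activations, shape (4, 2)
y = (H @ W2 + b2 >= 0).astype(int)    # network output, shape (4,)
print(y)  # [0 1 1 0], i.e. XOR of each input row
```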

7. Hidden Space Transformation#

The hidden layer performs a nonlinear coordinate transformation that maps the original 2D input space to a new 2D hidden space where the classes become linearly separable.

The mapping \((x_1, x_2) \mapsto (h_1, h_2)\) is:

| \((x_1, x_2)\) | \(h_1 = \text{OR}\) | \(h_2 = \text{NAND}\) | \((h_1, h_2)\) | XOR |
|:---:|:---:|:---:|:---:|:---:|
| \((0, 0)\) | 0 | 1 | \((0, 1)\) | 0 |
| \((0, 1)\) | 1 | 1 | \((1, 1)\) | 1 |
| \((1, 0)\) | 1 | 1 | \((1, 1)\) | 1 |
| \((1, 1)\) | 1 | 0 | \((1, 0)\) | 0 |

Notice that in hidden space:

  • Class 0 maps to \(\{(0,1), (1,0)\}\).

  • Class 1 maps to the single point \((1,1)\) (both class-1 inputs collapse onto it!).

These images are trivially linearly separable, for example by the line \(h_1 + h_2 = 1.5\).

Hide code cell source
# Side-by-side: original space vs. hidden space

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Original space
ax1 = axes[0]
X_class0 = X[y_true == 0]
X_class1 = X[y_true == 1]

ax1.scatter(X_class0[:, 0], X_class0[:, 1], c='blue', s=250, zorder=5,
            edgecolors='black', linewidths=2, marker='o', label='XOR = 0')
ax1.scatter(X_class1[:, 0], X_class1[:, 1], c='red', s=250, zorder=5,
            edgecolors='black', linewidths=2, marker='s', label='XOR = 1')

# Draw convex hulls
ax1.plot([0, 1], [0, 1], 'b--', alpha=0.4, linewidth=2)
ax1.plot([0, 1], [1, 0], 'r--', alpha=0.4, linewidth=2)
ax1.plot(0.5, 0.5, 'k*', markersize=15, zorder=10)

for pt, label in zip(X, ['(0,0)\u21920', '(0,1)\u21921', '(1,0)\u21921', '(1,1)\u21920']):
    ax1.annotate(label, pt, xytext=(pt[0]+0.05, pt[1]+0.07), fontsize=11)

ax1.set_xlim(-0.3, 1.4)
ax1.set_ylim(-0.3, 1.4)
ax1.set_xlabel('$x_1$', fontsize=14)
ax1.set_ylabel('$x_2$', fontsize=14)
ax1.set_title('Original Space: NOT Separable', fontsize=14)
ax1.legend(fontsize=11)
ax1.set_aspect('equal')
ax1.grid(True, alpha=0.3)

# Hidden space
ax2 = axes[1]

# Compute hidden representations
H = np.zeros((4, 2))
for i in range(4):
    x1, x2 = X[i]
    H[i, 0] = int(x1 + x2 - 0.5 >= 0)   # OR
    H[i, 1] = int(-x1 - x2 + 1.5 >= 0)   # NAND

H_class0 = H[y_true == 0]
H_class1 = H[y_true == 1]

ax2.scatter(H_class0[:, 0], H_class0[:, 1], c='blue', s=250, zorder=5,
            edgecolors='black', linewidths=2, marker='o', label='XOR = 0')
ax2.scatter(H_class1[:, 0], H_class1[:, 1], c='red', s=350, zorder=5,
            edgecolors='black', linewidths=2, marker='s', label='XOR = 1 (both map here)')

# Draw a separating line in hidden space: h1 + h2 = 1.5
h1_line = np.linspace(-0.3, 1.5, 100)
h2_line = 1.5 - h1_line
ax2.plot(h1_line, h2_line, 'g-', linewidth=3, label='Separating line: $h_1+h_2=1.5$')

# Annotate hidden points
ax2.annotate('(0,0)\u2192(0,1)', (0, 1), xytext=(-0.35, 1.1), fontsize=10)
ax2.annotate('(1,1)\u2192(1,0)', (1, 0), xytext=(1.05, 0.1), fontsize=10)
ax2.annotate('(0,1),(1,0)\u2192(1,1)', (1, 1), xytext=(0.55, 1.1), fontsize=10)

ax2.set_xlim(-0.5, 1.6)
ax2.set_ylim(-0.5, 1.6)
ax2.set_xlabel('$h_1$ (OR)', fontsize=14)
ax2.set_ylabel('$h_2$ (NAND)', fontsize=14)
ax2.set_title('Hidden Space: Separable!', fontsize=14)
ax2.legend(fontsize=10, loc='lower left')
ax2.set_aspect('equal')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

8. Alternative Construction: The “Band” Approach#

There is a second elegant way to solve XOR with 2 hidden neurons, using a different geometric idea.

Construction 2: Dual Threshold#

  • Hidden neuron 1: \(h_1 = \text{step}(x_1 + x_2 - 0.5)\)

    • Fires when \(x_1 + x_2 \geq 0.5\) (i.e., when at least one input is 1)

  • Hidden neuron 2: \(h_2 = \text{step}(x_1 + x_2 - 1.5)\)

    • Fires when \(x_1 + x_2 \geq 1.5\) (i.e., when both inputs are 1)

  • Output neuron: \(y = \text{step}(h_1 - h_2 - 0.5)\)

    • Fires when \(h_1 = 1\) and \(h_2 = 0\) (exactly in the “band” between the two thresholds)

Geometric Interpretation#

The two hidden neurons define two parallel lines in input space:

  • Line 1: \(x_1 + x_2 = 0.5\)

  • Line 2: \(x_1 + x_2 = 1.5\)

The output neuron selects the band between these two lines: it fires for points where \(0.5 \leq x_1 + x_2 < 1.5\). This band contains exactly the XOR=1 points.

Hide code cell source
# Alternative Construction: Band approach

print("Construction 2: Dual Threshold (Band)")
print("=" * 60)
print(f"{'x1':>3} {'x2':>3} | {'sum':>4} {'h1(>=0.5)':>10} {'h2(>=1.5)':>10} | {'y':>3} {'target':>7}")
print("-" * 60)

for i in range(4):
    x1, x2 = X[i]
    s = x1 + x2
    h1 = int(s >= 0.5)
    h2 = int(s >= 1.5)
    y_pred = int(h1 - h2 - 0.5 >= 0)
    match = 'OK' if y_pred == y_true[i] else 'FAIL'
    print(f"{x1:>3.0f} {x2:>3.0f} | {s:>4.1f} {h1:>10} {h2:>10} | {y_pred:>3} {y_true[i]:>7}  {match}")

print("\nAll outputs match!")
Construction 2: Dual Threshold (Band)
============================================================
 x1  x2 |  sum  h1(>=0.5)  h2(>=1.5) |   y  target
------------------------------------------------------------
  0   0 |  0.0          0          0 |   0       0  OK
  0   1 |  1.0          1          0 |   1       1  OK
  1   0 |  1.0          1          0 |   1       1  OK
  1   1 |  2.0          1          1 |   0       0  OK

All outputs match!
Hide code cell source
# Visualize the "band" construction

fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# Draw the two parallel decision boundaries
x_range = np.linspace(-0.3, 1.5, 200)

# Line 1: x1 + x2 = 0.5
ax.plot(x_range, 0.5 - x_range, 'g-', linewidth=2.5, label='$x_1 + x_2 = 0.5$')
# Line 2: x1 + x2 = 1.5
ax.plot(x_range, 1.5 - x_range, 'purple', linewidth=2.5, linestyle='-', label='$x_1 + x_2 = 1.5$')

# Shade the band between the two lines
x_fill = np.linspace(-0.3, 1.5, 200)
y_lower = np.clip(0.5 - x_fill, -0.5, 1.5)
y_upper = np.clip(1.5 - x_fill, -0.5, 1.5)
ax.fill_between(x_fill, y_lower, y_upper, alpha=0.15, color='orange', label='Band: XOR = 1 region')

# Plot data points
X_class0 = X[y_true == 0]
X_class1 = X[y_true == 1]
ax.scatter(X_class0[:, 0], X_class0[:, 1], c='blue', s=250, zorder=5,
           edgecolors='black', linewidths=2, marker='o', label='XOR = 0')
ax.scatter(X_class1[:, 0], X_class1[:, 1], c='red', s=250, zorder=5,
           edgecolors='black', linewidths=2, marker='s', label='XOR = 1')

# Annotations
ax.annotate('sum < 0.5', (0, 0), xytext=(-0.25, -0.2), fontsize=12, color='blue')
ax.annotate('0.5 \u2264 sum < 1.5', (0.5, 0.5), xytext=(0.15, 0.3), fontsize=12, color='red')
ax.annotate('sum \u2265 1.5', (1, 1), xytext=(1.05, 1.1), fontsize=12, color='blue')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('$x_1$', fontsize=14)
ax.set_ylabel('$x_2$', fontsize=14)
ax.set_title('XOR via Band: Two Parallel Decision Boundaries', fontsize=14)
ax.legend(fontsize=11, loc='upper right')
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

9. Exercises#

Exercise 8.1: XNOR with a 2-Layer Network#

The XNOR function is the complement of XOR:

| \(x_1\) | \(x_2\) | XNOR |
|:---:|:---:|:---:|
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Design a 2-layer network (2 hidden neurons + 1 output neuron) that computes XNOR. Give the exact weights and biases, and verify with a truth table.

Hint: XNOR = NOT(XOR). How can you negate the output of the XOR network?

Exercise 8.2: Can You Solve XOR with 1 Hidden Neuron?#

Consider a network with 2 inputs, 1 hidden neuron (with step activation), and 1 output neuron (with step activation). Can this network compute XOR?

Prove that it cannot.

Hint: A single hidden neuron maps \(\{0,1\}^2\) to \(\{0,1\}\), producing only a 1-dimensional hidden representation. The output neuron must then separate two classes in 1D. What does the hidden neuron compute?
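If you want numerical evidence before attempting the proof, the sketch below brute-forces a coarse weight grid over the stated architecture (hidden \(h = \text{step}(w_1 x_1 + w_2 x_2 + b)\), output \(y = \text{step}(v h + c)\), no skip connections). A finite grid cannot prove impossibility, but the search comes up empty, as the proof predicts:

```python
# Brute-force search: can any 1-hidden-neuron network compute XOR?
import itertools
import numpy as np

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 1, 1, 0]  # XOR
grid = np.linspace(-2, 2, 9)  # candidate weights, step 0.5

found = any(
    all(int(v * int(w1 * x1 + w2 * x2 + b >= 0) + c >= 0) == t
        for (x1, x2), t in zip(inputs, target))
    for w1, w2, b, v, c in itertools.product(grid, repeat=5)
)
print("1-hidden-neuron XOR network found:", found)  # expect False
```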

Exercise 8.3: 3-Bit Parity Network#

Design a network that computes \(\text{PARITY}(x_1, x_2, x_3) = x_1 \oplus x_2 \oplus x_3\).

The truth table is:

| \(x_1\) | \(x_2\) | \(x_3\) | Parity |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |

How many hidden neurons do you need? Can you organize them into layers?

Hint: Decompose as \(\text{XOR}(\text{XOR}(x_1, x_2), x_3)\).

Hide code cell source
# Skeleton for Exercise 8.3: 3-bit parity
# Students should fill in the weights

X3 = np.array([[0,0,0], [0,0,1], [0,1,0], [0,1,1],
               [1,0,0], [1,0,1], [1,1,0], [1,1,1]])
y3 = np.array([0, 1, 1, 0, 1, 0, 0, 1])

def parity_network(x):
    """Compute 3-bit parity using a multi-layer network.
    
    TODO: Fill in the weights and biases.
    Hint: First compute XOR(x1, x2), then XOR(result, x3).
    """
    x1, x2, x3 = x
    
    # Layer 1: compute intermediate values for XOR(x1, x2)
    # h1 = step(???)
    # h2 = step(???)
    
    # Layer 2: combine to get XOR(x1, x2), then prepare for XOR with x3
    # ...
    
    # Output layer
    # y = step(???)
    
    pass  # Replace with implementation

print("Exercise 8.3: Implement the parity_network function above.")
print("Then verify:")
print("for i in range(8):")
print("    assert parity_network(X3[i]) == y3[i]")
Exercise 8.3: Implement the parity_network function above.
Then verify:
for i in range(8):
    assert parity_network(X3[i]) == y3[i]