Chapter 4: Rosenblatt’s Perceptron#

Part 2: The Perceptron#

4.1 Introduction: From Logic to Learning#

In 1957, Frank Rosenblatt, a psychologist at the Cornell Aeronautical Laboratory, proposed a model that would mark a watershed moment in the history of artificial intelligence: the Perceptron. While McCulloch and Pitts (1943) had demonstrated that networks of simple binary neurons could, in principle, compute any logical function, their model had a critical limitation: the weights (connections) had to be designed by hand for each task. There was no mechanism for the network to learn from data.

Rosenblatt’s key insight was to introduce adjustable weights and a learning rule that could automatically discover the correct weight settings from labeled examples. This was nothing less than the birth of supervised machine learning.

Note

Historical Context: Rosenblatt and the Mark I Perceptron

Frank Rosenblatt (1928–1971) was a remarkable polymath – a psychologist, neuroscientist, and engineer at the Cornell Aeronautical Laboratory. In 1958, he unveiled the Mark I Perceptron, a physical machine that could learn to recognize simple visual patterns.

The Mark I was not a simulation on a digital computer; it was a custom-built electromechanical device. Its weights were implemented as motor-driven potentiometers – physical resistors whose values were adjusted by small electric motors during learning. The New York Times reported it as a machine that could “perceive, recognize, and identify its surroundings without any human training or control.”

Tragically, Rosenblatt died in a boating accident on his 43rd birthday in 1971, just as the field he helped create was entering its first “AI winter.” His contributions were only fully appreciated decades later, when neural networks experienced a renaissance in the 1980s and beyond.

The Mark I Perceptron Hardware#

Rosenblatt did not merely propose a mathematical model; he built a physical machine. The Mark I Perceptron (1958) was a remarkable piece of hardware:

  • Input layer: An array of 400 cadmium sulfide (CdS) photocells arranged in a 20x20 grid, serving as a primitive retina that could “see” simple images.

  • Association layer: 512 association units (“A-units”), each randomly connected to a subset of the photocells. These connections were fixed and random, inspired by the apparent randomness of neural connectivity in the brain.

  • Output layer: A set of response units (“R-units”) that produced the final classification.

  • Adjustable weights: Implemented as motor-driven potentiometers – physical resistors whose values could be changed by small electric motors. When the learning algorithm called for a weight increase, a motor would turn the corresponding potentiometer.

The machine could learn to classify simple shapes (such as distinguishing triangles from squares) purely from examples, without explicit programming. The New York Times famously reported it as a machine that could “perceive, recognize, and identify its surroundings without any human training or control.”

The Intellectual Leap#

The transition from McCulloch-Pitts to Rosenblatt can be summarized as:

| Aspect | McCulloch-Pitts (1943) | Rosenblatt’s Perceptron (1957) |
|---|---|---|
| Weights | Fixed, hand-designed | Adjustable, learned from data |
| Purpose | Model of neural computation | Pattern recognition machine |
| Design | Logical analysis | Statistical learning |
| Key question | What can neurons compute? | How can neurons learn? |

4.2 The Mathematical Model#

Definition (Rosenblatt Perceptron)

The Rosenblatt Perceptron is a binary classifier that computes, for an input vector \(\mathbf{x} \in \mathbb{R}^n\):

\[f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\]

where the Heaviside step function is:

\[\begin{split}\text{step}(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}\end{split}\]

The model is parameterized by a weight vector \(\mathbf{w} = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n\) and a bias \(b \in \mathbb{R}\). The key innovation over the McCulloch-Pitts neuron is that \(\mathbf{w}\) and \(b\) are learned from data via a supervised learning rule, rather than being hand-designed.

Components#

Let us define each component precisely:

  1. Input vector: \(\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n\). This is the feature representation of a single data point. For the Mark I Perceptron, \(n = 512\) (the outputs of the association units).

  2. Weight vector: \(\mathbf{w} = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n\). Each weight \(w_i\) represents the strength of the connection from input \(x_i\) to the output neuron. Positive weights are excitatory; negative weights are inhibitory.

  3. Bias: \(b \in \mathbb{R}\). The bias (equivalently, a threshold \(\theta = -b\)) determines how easy it is for the neuron to fire. A positive bias makes the neuron more likely to output 1; a negative bias requires the weighted input \(\mathbf{w} \cdot \mathbf{x}\) to reach at least \(-b\) before the neuron fires.

  4. Dot product (pre-activation): The quantity \(z = \mathbf{w} \cdot \mathbf{x} + b = \sum_{i=1}^n w_i x_i + b\) is the pre-activation or net input. It measures how much the input aligns with the weight vector.

  5. Step function: The step function produces a binary decision:

    • If \(z \geq 0\): the perceptron outputs 1 (“fires”, “active”, “positive class”)

    • If \(z < 0\): the perceptron outputs 0 (“silent”, “inactive”, “negative class”)
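
For a concrete worked example: with \(\mathbf{w} = (2, -1)\), \(b = 0.5\), and input \(\mathbf{x} = (1, 3)\), the pre-activation is \(z = 2 \cdot 1 + (-1) \cdot 3 + 0.5 = -0.5 < 0\), so the perceptron outputs 0. (These numbers are chosen purely for illustration.)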

Alternative Conventions#

Some texts use the \(\{-1, +1\}\) convention for labels and outputs:

\[\begin{split}f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b < 0 \end{cases}\end{split}\]

This convention is common in the convergence theorem analysis and in support vector machines. We will use both conventions as appropriate.
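
The two conventions are related by a simple relabeling. As a small illustrative sketch (plain NumPy, independent of the code later in this chapter), the snippet below computes both outputs from the same pre-activation values; note that np.sign(0.0) returns 0, so np.where is used to enforce the convention that the output is +1 when \(z = 0\).

import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.7])   # example pre-activation values

# {0, 1} convention: Heaviside step with step(0) = 1
y01 = (z >= 0).astype(int)

# {-1, +1} convention. np.sign(0.0) returns 0, which would violate the
# rule f(x) = +1 when z = 0, so we use np.where instead of np.sign.
y_pm = np.where(z >= 0, 1, -1)

# The two label sets are related by y_pm = 2*y01 - 1
assert np.array_equal(y_pm, 2 * y01 - 1)
print(y01)   # [0 0 1 1]
print(y_pm)  # [-1 -1  1  1]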

4.3 The Bias Trick#

It is often convenient to absorb the bias into the weight vector by augmenting the input with a constant feature \(x_0 = 1\). Define:

\[\tilde{\mathbf{x}} = (1, x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n+1}\]
\[\tilde{\mathbf{w}} = (b, w_1, w_2, \ldots, w_n) \in \mathbb{R}^{n+1}\]

Then:

\[\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}} = b \cdot 1 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w} \cdot \mathbf{x} + b\]

So the perceptron model simplifies to:

\[f(\tilde{\mathbf{x}}) = \text{step}(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}})\]

This “bias trick” has several advantages:

  • Notational simplicity: We deal with a single vector \(\tilde{\mathbf{w}}\) instead of \(\mathbf{w}\) and \(b\) separately.

  • Algorithmic uniformity: The learning rule updates all components of \(\tilde{\mathbf{w}}\) identically.

  • Geometric clarity: The decision boundary passes through the origin in the augmented \((n+1)\)-dimensional space.

In practice, we often keep the bias separate for implementation clarity, but the bias trick is indispensable for theoretical analysis.
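
As a quick numerical sanity check of the identity above, the following minimal sketch (plain NumPy, independent of the Perceptron class defined in the next section) confirms that the augmented dot product reproduces \(\mathbf{w} \cdot \mathbf{x} + b\).

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                 # weight vector w
b = -0.7                               # bias b
x = rng.normal(size=3)                 # an arbitrary input

# Bias trick: prepend a constant feature 1 to x and prepend b to w
x_aug = np.concatenate(([1.0], x))     # x~ = (1, x1, ..., xn)
w_aug = np.concatenate(([b], w))       # w~ = (b, w1, ..., wn)

z_separate = w @ x + b                 # w . x + b
z_augmented = w_aug @ x_aug            # w~ . x~
assert np.isclose(z_separate, z_augmented)
print(z_separate, z_augmented)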

4.4 Python Implementation#

import numpy as np
import matplotlib.pyplot as plt

# Set global plotting style
plt.rcParams.update({
    'figure.figsize': (8, 6),
    'font.size': 12,
    'axes.grid': True,
    'grid.alpha': 0.3
})
class Perceptron:
    """Rosenblatt's Perceptron model.
    
    A single-layer binary classifier that computes:
        f(x) = step(w . x + b)
    
    Parameters
    ----------
    n_features : int
        Number of input features (dimensionality of input space).
    
    Attributes
    ----------
    weights : np.ndarray of shape (n_features,)
        Weight vector w.
    bias : float
        Bias term b.
    """
    
    def __init__(self, n_features):
        self.n_features = n_features
        self.weights = np.zeros(n_features)
        self.bias = 0.0
    
    def decision_function(self, X):
        """Compute the raw pre-activation value w . x + b.
        
        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features) or (n_features,)
            Input data.
        
        Returns
        -------
        np.ndarray
            Pre-activation values z = w . x + b.
        """
        X = np.atleast_2d(X)
        return X @ self.weights + self.bias
    
    def predict(self, X):
        """Predict binary class labels using the step function.
        
        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features) or (n_features,)
            Input data.
        
        Returns
        -------
        np.ndarray of int
            Predicted labels (0 or 1).
        """
        z = self.decision_function(X)
        return (z >= 0).astype(int)
    
    def set_weights(self, weights, bias):
        """Manually set weights and bias.
        
        Parameters
        ----------
        weights : array-like of shape (n_features,)
            Weight vector.
        bias : float
            Bias term.
        """
        self.weights = np.array(weights, dtype=float)
        self.bias = float(bias)
    
    def __repr__(self):
        return (f"Perceptron(weights={self.weights}, bias={self.bias})")


# Demonstration
print("=== Perceptron Demonstration ===")
print()

# Create a perceptron for 2D inputs
p = Perceptron(n_features=2)

# Set weights to implement AND-like behavior: w = [1, 1], b = -1.5
p.set_weights([1.0, 1.0], -1.5)
print(f"Model: {p}")
print()

# Test on all binary inputs
test_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print("Input     | w.x + b  | Output")
print("-" * 35)
for x in test_inputs:
    z = p.decision_function(x)
    y = p.predict(x)
    print(f"  {x}    | {z[0]:+.1f}    |   {y[0]}")
=== Perceptron Demonstration ===

Model: Perceptron(weights=[1. 1.], bias=-1.5)

Input     | w.x + b  | Output
-----------------------------------
  [0 0]    | -1.5    |   0
  [0 1]    | -0.5    |   0
  [1 0]    | -0.5    |   0
  [1 1]    | +0.5    |   1

The perceptron with \(\mathbf{w} = (1, 1)\) and \(b = -1.5\) computes the AND function:

  • Only when both inputs are 1 does \(w \cdot x + b = 1 + 1 - 1.5 = 0.5 \geq 0\).

  • All other inputs give \(z < 0\).

The pre-activation values in the table above confirm that this is indeed the AND function.

4.5 Geometric Interpretation#

Tip

Geometric Interpretation of the Perceptron

The perceptron partitions the input space \(\mathbb{R}^n\) into two half-spaces separated by a hyperplane. The key geometric facts are:

  1. The weight vector \(\mathbf{w}\) is the normal (perpendicular) to the decision boundary.

  2. \(\mathbf{w}\) points toward the positive region (output = 1).

  3. The bias \(b\) controls the offset of the hyperplane from the origin.

  4. The distance from the origin to the hyperplane is \(|b|/\lVert\mathbf{w}\rVert\).

Changing \(\mathbf{w}\) rotates the boundary; changing \(b\) shifts it.

The perceptron’s decision is determined by the sign of the pre-activation \(z = \mathbf{w} \cdot \mathbf{x} + b\). The decision boundary is the set of points where \(z = 0\):

\[\mathcal{H} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{w} \cdot \mathbf{x} + b = 0\}\]

This is a hyperplane in \(\mathbb{R}^n\):

  • In \(\mathbb{R}^2\): a line

  • In \(\mathbb{R}^3\): a plane

  • In \(\mathbb{R}^n\) (\(n > 3\)): a hyperplane (an \((n-1)\)-dimensional affine subspace)

Properties of the Decision Hyperplane#

Property 1: \(\mathbf{w}\) is the normal vector. The weight vector \(\mathbf{w}\) is perpendicular (normal) to the decision hyperplane. To see why: if \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are both on the boundary, then

\[\mathbf{w} \cdot \mathbf{x}_1 + b = 0 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{x}_2 + b = 0\]

Subtracting gives \(\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 0\). Any vector along the hyperplane is orthogonal to \(\mathbf{w}\), confirming that \(\mathbf{w}\) is normal to the hyperplane.

Property 2: \(\mathbf{w}\) points toward the positive region. The positive class (\(f(\mathbf{x}) = 1\)) lies on the side of the hyperplane toward which \(\mathbf{w}\) points.

Property 3: Distance from the origin. The signed distance from the origin to the hyperplane is

\[d = \frac{b}{\lVert\mathbf{w}\rVert}\]

where \(\lVert\mathbf{w}\rVert = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}\) is the Euclidean norm of the weight vector. The absolute distance from the origin is \(|b| / \lVert\mathbf{w}\rVert\). The sign of \(d\) tells us which side of the hyperplane the origin lies on: \(d > 0\) means the origin is in the positive region (this is Property 4 below, evaluated at \(\mathbf{x}_0 = \mathbf{0}\)).

Property 4: Distance of any point to the hyperplane. For a point \(\mathbf{x}_0\), the signed distance to the hyperplane is

\[d(\mathbf{x}_0) = \frac{\mathbf{w} \cdot \mathbf{x}_0 + b}{\lVert\mathbf{w}\rVert}\]
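
As a small numerical check of Properties 3 and 4, the sketch below (assuming the AND-gate parameters \(\mathbf{w} = (1, 1)\), \(b = -1.5\) used throughout this chapter) verifies that the foot of the perpendicular from the origin, \(-b\,\mathbf{w}/\lVert\mathbf{w}\rVert^2\), lies on the boundary at distance \(|b|/\lVert\mathbf{w}\rVert\), and evaluates the signed distance of a test point.

import numpy as np

w = np.array([1.0, 1.0])
b = -1.5
w_norm = np.linalg.norm(w)

# Property 3: signed and absolute distance from the origin to the hyperplane
signed_dist_origin = b / w_norm          # ~ -1.0607 (origin on the negative side)
abs_dist_origin = abs(b) / w_norm        # ~  1.0607

# The closest point on the hyperplane to the origin
foot = -b * w / w_norm**2                # = (0.75, 0.75)
assert np.isclose(w @ foot + b, 0.0)     # it lies on the boundary
assert np.isclose(np.linalg.norm(foot), abs_dist_origin)

# Property 4: signed distance of an arbitrary point
x0 = np.array([1.0, 1.0])
signed_dist_x0 = (w @ x0 + b) / w_norm   # ~ +0.3536 (positive side)

print(abs_dist_origin, foot, signed_dist_x0)
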
def plot_perceptron_geometry(weights, bias, data=None, labels=None,
                             xlim=(-3, 3), ylim=(-3, 3), title=None):
    """Visualize the perceptron's decision boundary, weight vector, and regions.
    
    Parameters
    ----------
    weights : array-like of shape (2,)
        Weight vector [w1, w2].
    bias : float
        Bias term.
    data : np.ndarray of shape (n, 2), optional
        Data points to plot.
    labels : np.ndarray of shape (n,), optional
        Labels (0 or 1) for the data points.
    xlim, ylim : tuple
        Axis limits.
    title : str, optional
        Plot title.
    """
    w = np.array(weights, dtype=float)
    b = float(bias)
    
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    
    # Create a mesh for the decision regions
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 300),
                         np.linspace(ylim[0], ylim[1], 300))
    Z = xx * w[0] + yy * w[1] + b
    
    # Color the decision regions
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10],
                colors=['#FFCCCC', '#CCCCFF'], alpha=0.4)
    
    # Draw the decision boundary
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    
    # Draw the weight vector as an arrow from the origin
    # Scale it for visualization
    w_norm = np.linalg.norm(w)
    if w_norm > 0:
        # Arrow from origin in the direction of w
        scale = min(xlim[1], ylim[1]) * 0.4
        w_display = w / w_norm * scale
        ax.annotate('', xy=w_display, xytext=(0, 0),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2.5))
        ax.text(w_display[0] * 1.15, w_display[1] * 1.15, r'$\mathbf{w}$',
                fontsize=16, color='red', fontweight='bold',
                ha='center', va='center')
        
        # Mark the closest point on the boundary to the origin
        closest_point = -b * w / (w_norm ** 2)
        ax.plot(*closest_point, 'ko', markersize=6)
        
        # Draw the distance line from origin to the boundary
        ax.plot([0, closest_point[0]], [0, closest_point[1]],
                'g--', linewidth=1.5, label=f'$|b|/||w|| = {abs(b)/w_norm:.2f}$')
    
    # Plot data points if provided
    if data is not None and labels is not None:
        mask_0 = labels == 0
        mask_1 = labels == 1
        ax.scatter(data[mask_0, 0], data[mask_0, 1], c='red', marker='o',
                   s=100, edgecolors='black', zorder=5, label='Class 0')
        ax.scatter(data[mask_1, 0], data[mask_1, 1], c='blue', marker='s',
                   s=100, edgecolors='black', zorder=5, label='Class 1')
    
    # Mark origin
    ax.plot(0, 0, 'k+', markersize=12, markeredgewidth=2)
    
    # Labels
    ax.set_xlabel('$x_1$', fontsize=14)
    ax.set_ylabel('$x_2$', fontsize=14)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_aspect('equal')
    ax.legend(fontsize=11, loc='upper left')
    
    # Add region labels
    ax.text(xlim[0] + 0.3, ylim[1] - 0.5,
            'Output = 0\n($\\mathbf{w} \\cdot \\mathbf{x} + b < 0$)',
            fontsize=10, color='red', alpha=0.7)
    ax.text(xlim[1] - 2.0, ylim[0] + 0.3,
            'Output = 1\n($\\mathbf{w} \\cdot \\mathbf{x} + b \\geq 0$)',
            fontsize=10, color='blue', alpha=0.7)
    
    if title:
        ax.set_title(title, fontsize=14, fontweight='bold')
    else:
        ax.set_title(f'Perceptron: $\\mathbf{{w}}$={w.tolist()}, $b$={b}',
                     fontsize=14)
    
    plt.tight_layout()
    plt.show()


# Visualize the AND perceptron
AND_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
AND_labels = np.array([0, 0, 0, 1])

plot_perceptron_geometry(
    weights=[1, 1], bias=-1.5,
    data=AND_data, labels=AND_labels,
    xlim=(-1, 2.5), ylim=(-1, 2.5),
    title='AND Gate: $\\mathbf{w} = (1, 1)$, $b = -1.5$'
)
[Figure: AND gate: decision boundary, weight vector, and data points for \(\mathbf{w} = (1, 1)\), \(b = -1.5\)]

In the plot above, observe that:

  • The decision boundary (black line) is the line \(x_1 + x_2 - 1.5 = 0\), i.e., \(x_2 = -x_1 + 1.5\).

  • The weight vector \(\mathbf{w} = (1,1)\) (red arrow) is perpendicular to this line.

  • The positive region (blue shading, output = 1) is on the side that \(\mathbf{w}\) points toward.

  • The distance from the origin to the boundary is \(|b|/\|\mathbf{w}\| = 1.5/\sqrt{2} \approx 1.06\) (green dashed line).

  • Only the point \((1,1)\) is on the positive side, which is exactly the AND function.

3D Weight-Space Visualization#

To deepen our geometric understanding, let us visualize the perceptron in 3D. The weight vector \(\mathbf{w}\) determines the normal direction of the decision hyperplane. Below, we show the weight vector and the decision plane in 3D space, along with the input data points lifted by their pre-activation values.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# 3D Weight-Space Visualization
# We show the decision plane w1*x1 + w2*x2 + b = 0 in (x1, x2, z) space,
# where z = w.x + b is the pre-activation.

w = np.array([1.0, 1.0])
b = -1.5

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Create mesh for the pre-activation surface z = w1*x1 + w2*x2 + b
x1_range = np.linspace(-0.5, 2.0, 40)
x2_range = np.linspace(-0.5, 2.0, 40)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z_surface = w[0] * X1 + w[1] * X2 + b

# Plot the pre-activation surface
ax.plot_surface(X1, X2, Z_surface, alpha=0.25, color='steelblue',
                edgecolor='none', label='Pre-activation plane')

# Plot the decision plane z = 0
Z_zero = np.zeros_like(X1)
ax.plot_surface(X1, X2, Z_zero, alpha=0.15, color='gray',
                edgecolor='none')

# Plot the decision boundary line (intersection of the two planes) on z=0
x1_line = np.linspace(-0.5, 2.0, 100)
x2_line = (-w[0] * x1_line - b) / w[1]
valid = (x2_line >= -0.5) & (x2_line <= 2.0)
ax.plot(x1_line[valid], x2_line[valid], np.zeros(valid.sum()),
        'k-', linewidth=3, label='Decision boundary ($z=0$)')

# Plot the weight vector as an arrow from the origin in the (x1, x2) plane
w_norm = np.linalg.norm(w)
w_scaled = w / w_norm * 0.8
ax.quiver(0, 0, 0, w_scaled[0], w_scaled[1], 0,
          color='red', arrow_length_ratio=0.15, linewidth=3,
          label=r'Weight vector $\mathbf{w}$')

# Plot the normal to the surface (w1, w2, 1) direction (gradient of z)
normal = np.array([w[0], w[1], 1.0])
normal_scaled = normal / np.linalg.norm(normal) * 0.8
ax.quiver(0.5, 0.5, w[0]*0.5+w[1]*0.5+b, normal_scaled[0], normal_scaled[1], normal_scaled[2],
          color='green', arrow_length_ratio=0.15, linewidth=2.5,
          label='Surface normal')

# Plot the data points (AND gate) at their pre-activation values
AND_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
AND_labels = np.array([0, 0, 0, 1])
AND_z = AND_data @ w + b

for i in range(4):
    color = 'blue' if AND_labels[i] == 1 else 'red'
    marker = 's' if AND_labels[i] == 1 else 'o'
    ax.scatter(AND_data[i, 0], AND_data[i, 1], AND_z[i],
              c=color, marker=marker, s=150, edgecolors='black',
              zorder=5, linewidths=1.5)
    # Vertical line from z=0 to the point
    ax.plot([AND_data[i, 0], AND_data[i, 0]],
            [AND_data[i, 1], AND_data[i, 1]],
            [0, AND_z[i]], '--', color=color, alpha=0.5, linewidth=1)
    ax.text(AND_data[i, 0] + 0.08, AND_data[i, 1] + 0.08, AND_z[i] + 0.1,
            f'z={AND_z[i]:.1f}', fontsize=9, color=color)

ax.set_xlabel('$x_1$', fontsize=13, labelpad=10)
ax.set_ylabel('$x_2$', fontsize=13, labelpad=10)
ax.set_zlabel('$z = \\mathbf{w} \\cdot \\mathbf{x} + b$', fontsize=13, labelpad=10)
ax.set_title('3D Weight-Space Visualization\n'
             'AND Gate: $\\mathbf{w}=(1,1)$, $b=-1.5$',
             fontsize=14, fontweight='bold')

# Add text annotation for z=0 plane
ax.text(1.8, 1.8, 0.1, '$z = 0$ plane\n(decision boundary)',
        fontsize=10, color='gray')

ax.view_init(elev=25, azim=-60)
ax.legend(fontsize=10, loc='upper left')
plt.tight_layout()
plt.show()
[Figure: 3D weight-space visualization of the AND gate: pre-activation surface, \(z = 0\) plane, decision boundary, and weight vector]

Comparison: McCulloch-Pitts Neuron vs. Rosenblatt Perceptron#

Let us systematically compare the two foundational neural models we have studied so far.

import numpy as np
import matplotlib.pyplot as plt

# Comparison Table: M-P Neuron vs Perceptron
fig, ax = plt.subplots(figsize=(14, 7))
ax.axis('off')

# Define comparison data
headers = ['Feature', 'McCulloch-Pitts Neuron (1943)', 'Rosenblatt Perceptron (1957)']
rows = [
    ['Inputs', 'Binary $\\{0, 1\\}$', 'Real-valued $\\mathbb{R}^n$'],
    ['Weights', 'Fixed: excitatory (+1)\nor inhibitory (absolute veto)', 'Adjustable: any real value\n(learned from data)'],
    ['Threshold', 'Integer threshold $\\theta$\n(hand-designed)', 'Real-valued bias $b$\n(learned)'],
    ['Learning', 'None -- weights set by\nhuman designer', 'Perceptron learning rule\n(automatic from examples)'],
    ['Activation', 'Step function\nwith inhibitory veto', 'Heaviside step function'],
    ['Purpose', 'Model of neural logic:\nwhat CAN neurons compute?', 'Pattern recognition machine:\nhow can neurons LEARN?'],
    ['Expressiveness', 'Any Boolean function\n(with networks)', 'Only linearly separable\nfunctions (single layer)'],
    ['Convergence\nGuarantee', 'N/A (no learning)', 'Yes, for linearly separable\ndata (Convergence Theorem)'],
    ['Key Limitation', 'No learning mechanism', 'Cannot learn XOR\nor non-separable functions'],
]

# Create the table
table = ax.table(
    cellText=rows,
    colLabels=headers,
    cellLoc='center',
    loc='center',
    colWidths=[0.2, 0.4, 0.4]
)

# Style the table
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1.0, 2.2)

# Header style
for j in range(3):
    cell = table[0, j]
    cell.set_facecolor('#2C3E50')
    cell.set_text_props(color='white', fontweight='bold', fontsize=11)
    cell.set_height(0.06)

# Row styles
for i in range(1, len(rows) + 1):
    color = '#EBF5FB' if i % 2 == 1 else '#FDFEFE'
    for j in range(3):
        cell = table[i, j]
        cell.set_facecolor(color)
        if j == 0:
            cell.set_text_props(fontweight='bold')

ax.set_title('McCulloch-Pitts Neuron vs. Rosenblatt Perceptron',
             fontsize=15, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
[Figure: rendered comparison table, McCulloch-Pitts Neuron vs. Rosenblatt Perceptron]

4.6 Interactive Exploration#

Let us explore how the decision boundary changes as we modify the weight vector \(\mathbf{w}\) and the bias \(b\).

# Generate random 2D data: two clusters
np.random.seed(42)
n_per_class = 30

# Cluster 1 (class 0): centered at (-1, -1)
X0 = np.random.randn(n_per_class, 2) * 0.5 + np.array([-1, -1])
# Cluster 2 (class 1): centered at (1, 1)
X1 = np.random.randn(n_per_class, 2) * 0.5 + np.array([1, 1])

X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

print(f"Generated {len(X)} data points ({n_per_class} per class).")
print(f"Class 0 center: {X0.mean(axis=0).round(2)}")
print(f"Class 1 center: {X1.mean(axis=0).round(2)}")
Generated 60 data points (30 per class).
Class 0 center: [-1.07 -1.08]
Class 1 center: [0.95 1.05]
# Experiment 1: Changing weights rotates the decision boundary
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Different weight vectors (different angles)
weight_configs = [
    ([1, 0], 0, '$\\mathbf{w} = (1, 0)$\nVertical boundary'),
    ([1, 1], 0, '$\\mathbf{w} = (1, 1)$\n$45^\\circ$ boundary'),
    ([0, 1], 0, '$\\mathbf{w} = (0, 1)$\nHorizontal boundary'),
    ([-1, 1], 0, '$\\mathbf{w} = (-1, 1)$\n$135^\\circ$ boundary'),
    ([2, 1], 0, '$\\mathbf{w} = (2, 1)$\nSteep boundary'),
    ([1, 2], 0, '$\\mathbf{w} = (1, 2)$\nShallow boundary'),
]

for ax, (w, b, label) in zip(axes.flat, weight_configs):
    w = np.array(w, dtype=float)
    
    # Create mesh
    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    Z = xx * w[0] + yy * w[1] + b
    
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10],
                colors=['#FFCCCC', '#CCCCFF'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    
    # Plot data
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='red', marker='o',
               s=30, edgecolors='black', alpha=0.7, label='Class 0')
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='blue', marker='s',
               s=30, edgecolors='black', alpha=0.7, label='Class 1')
    
    # Draw w as arrow
    w_norm = np.linalg.norm(w)
    w_disp = w / w_norm * 1.0
    ax.annotate('', xy=w_disp, xytext=(0, 0),
                arrowprops=dict(arrowstyle='->', color='red', lw=2))
    
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect('equal')
    ax.set_title(label, fontsize=11)
    ax.grid(True, alpha=0.3)

fig.suptitle('Effect of Changing Weights: Rotation of the Decision Boundary',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
[Figure: effect of changing weights: rotation of the decision boundary for six different weight vectors]
# Experiment 2: Changing bias shifts the decision boundary
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

w_fixed = np.array([1.0, 1.0])
bias_values = [-3, -1.5, -0.5, 0, 0.5, 1.5]

for ax, b in zip(axes.flat, bias_values):
    # Create mesh
    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    Z = xx * w_fixed[0] + yy * w_fixed[1] + b
    
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10],
                colors=['#FFCCCC', '#CCCCFF'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    
    # Plot data
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='red', marker='o',
               s=30, edgecolors='black', alpha=0.7)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='blue', marker='s',
               s=30, edgecolors='black', alpha=0.7)
    
    # Mark distance from origin
    w_norm = np.linalg.norm(w_fixed)
    dist = abs(b) / w_norm
    
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect('equal')
    ax.set_title(f'$b = {b}$, distance from origin = ${dist:.2f}$', fontsize=11)
    ax.grid(True, alpha=0.3)

fig.suptitle('Effect of Changing Bias: Shifting the Decision Boundary\n'
             '(Fixed $\\mathbf{w} = (1, 1)$)',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
[Figure: effect of changing bias: shifting of the decision boundary for fixed \(\mathbf{w} = (1, 1)\) and six bias values]

Key Observations#

  1. Changing \(\mathbf{w}\) rotates the boundary: The direction of \(\mathbf{w}\) determines the orientation of the decision line. Different weight vectors lead to different angles.

  2. Changing \(b\) shifts the boundary: The bias controls the position (offset) of the decision line. Increasing \(b\) moves the boundary in the direction opposite to \(\mathbf{w}\) (i.e., it enlarges the positive region).

  3. Together, \(\mathbf{w}\) and \(b\) fully specify any hyperplane in \(\mathbb{R}^n\).

4.7 Linear Separability#

A perceptron can only solve classification problems where the two classes can be separated by a hyperplane. This leads to a fundamental concept.

Definition#

Two sets \(S_0, S_1 \subset \mathbb{R}^n\) are linearly separable if there exists a hyperplane \(\mathcal{H} = \{\mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = 0\}\) such that:

\[\mathbf{w} \cdot \mathbf{x} + b > 0 \quad \text{for all } \mathbf{x} \in S_1\]
\[\mathbf{w} \cdot \mathbf{x} + b < 0 \quad \text{for all } \mathbf{x} \in S_0\]

Danger

Fundamental Limitation: The perceptron can ONLY learn linearly separable functions. If the data cannot be separated by a single hyperplane (a line in 2D, a plane in 3D), the perceptron learning algorithm will never converge and no set of weights will produce zero training error.

The most famous example is the XOR function. This single limitation was the focus of Minsky and Papert’s devastating 1969 critique, which contributed to the first “AI winter.”

Equivalence with Convex Hull Disjointness#

An elegant geometric characterization:

Theorem: Two finite sets \(S_0, S_1 \subset \mathbb{R}^n\) are linearly separable if and only if their convex hulls are disjoint:

\[\text{conv}(S_0) \cap \text{conv}(S_1) = \emptyset\]

Recall that the convex hull of a set \(S\) is the smallest convex set containing \(S\):

\[\text{conv}(S) = \left\{ \sum_{i=1}^k \lambda_i \mathbf{x}_i : \mathbf{x}_i \in S, \; \lambda_i \geq 0, \; \sum_{i=1}^k \lambda_i = 1 \right\}\]
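
For a finite dataset, linear separability can also be tested directly as a linear-programming feasibility problem: find any \((\mathbf{w}, b)\) such that \(s_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\) for every point, where \(s_i \in \{-1, +1\}\) is the label. Requiring a margin of at least 1 rather than strictly greater than 0 is harmless, because any strictly separating hyperplane can be rescaled to achieve it. The sketch below assumes SciPy is available; the helper name is_linearly_separable is introduced here purely for illustration.

import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy s_i * (w . x_i + b) >= 1 for all i?

    X : (n_samples, n_features) array; y : labels in {0, 1}.
    """
    s = 2 * np.asarray(y) - 1                       # relabel to {-1, +1}
    n_samples, n_features = X.shape
    # Variables v = [w_1, ..., w_n, b]; one constraint row per sample:
    #   -s_i * [x_i, 1] . v <= -1
    A_ub = -s[:, None] * np.hstack([X, np.ones((n_samples, 1))])
    b_ub = -np.ones(n_samples)
    res = linprog(c=np.zeros(n_features + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n_features + 1), method="highs")
    return res.success                              # feasible <=> separable

X_bool = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(is_linearly_separable(X_bool, np.array([0, 0, 0, 1])))   # AND -> True
print(is_linearly_separable(X_bool, np.array([0, 1, 1, 0])))   # XOR -> False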

Geometric Intuition#

  • In \(\mathbb{R}^2\) (the plane): A dataset is linearly separable if you can draw a straight line between the two classes.

  • In \(\mathbb{R}^3\) (3D space): You need a plane to separate the classes.

  • In general \(\mathbb{R}^n\): An \((n-1)\)-dimensional hyperplane.

Not All Datasets Are Linearly Separable#

The most famous example of a non-linearly-separable function is XOR (exclusive or):

| \(x_1\) | \(x_2\) | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The positive points \((0,1)\) and \((1,0)\) are on opposite corners of the unit square, as are the negative points \((0,0)\) and \((1,1)\). No straight line can separate them. This limitation will be a central topic in our later discussion of Minsky and Papert’s analysis.
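
We can also confirm this computationally. The brute-force sketch below (plain NumPy) sweeps a grid of candidate weights and biases and finds that none of them classifies all four XOR points correctly. The decisive argument, of course, is the convex-hull criterion above: the segment joining \((0,1)\) and \((1,0)\) and the segment joining \((0,0)\) and \((1,1)\) intersect at \((0.5, 0.5)\), so the convex hulls are not disjoint.

import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

grid = np.linspace(-3, 3, 61)          # candidate values for w1, w2, b
found = False
for w1 in grid:
    for w2 in grid:
        for b in grid:
            z = X_xor @ np.array([w1, w2]) + b
            if np.array_equal((z >= 0).astype(int), y_xor):
                found = True

print("Some (w, b) on the grid solves XOR:", found)   # False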

# Visualize linearly separable vs. non-separable datasets
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# --- Panel 1: AND (linearly separable) ---
ax = axes[0]
AND_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
AND_y = np.array([0, 0, 0, 1])

ax.scatter(AND_X[AND_y == 0, 0], AND_X[AND_y == 0, 1], c='red', marker='o',
           s=200, edgecolors='black', zorder=5, label='Class 0')
ax.scatter(AND_X[AND_y == 1, 0], AND_X[AND_y == 1, 1], c='blue', marker='s',
           s=200, edgecolors='black', zorder=5, label='Class 1')

# Draw a separating line
x_line = np.linspace(-0.5, 1.5, 100)
y_line = -x_line + 1.5  # w=[1,1], b=-1.5
ax.plot(x_line, y_line, 'k-', linewidth=2, label='Decision boundary')
ax.fill_between(x_line, y_line, 2, alpha=0.1, color='blue')
ax.fill_between(x_line, y_line, -1, alpha=0.1, color='red')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_aspect('equal')
ax.set_title('AND: Linearly Separable', fontsize=13, fontweight='bold',
             color='green')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# --- Panel 2: OR (linearly separable) ---
ax = axes[1]
OR_y = np.array([0, 1, 1, 1])

ax.scatter(AND_X[OR_y == 0, 0], AND_X[OR_y == 0, 1], c='red', marker='o',
           s=200, edgecolors='black', zorder=5, label='Class 0')
ax.scatter(AND_X[OR_y == 1, 0], AND_X[OR_y == 1, 1], c='blue', marker='s',
           s=200, edgecolors='black', zorder=5, label='Class 1')

# Separating line
y_line = -x_line + 0.5  # w=[1,1], b=-0.5
ax.plot(x_line, y_line, 'k-', linewidth=2, label='Decision boundary')
ax.fill_between(x_line, y_line, 2, alpha=0.1, color='blue')
ax.fill_between(x_line, y_line, -1, alpha=0.1, color='red')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_aspect('equal')
ax.set_title('OR: Linearly Separable', fontsize=13, fontweight='bold',
             color='green')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# --- Panel 3: XOR (NOT linearly separable) ---
ax = axes[2]
XOR_y = np.array([0, 1, 1, 0])

ax.scatter(AND_X[XOR_y == 0, 0], AND_X[XOR_y == 0, 1], c='red', marker='o',
           s=200, edgecolors='black', zorder=5, label='Class 0')
ax.scatter(AND_X[XOR_y == 1, 0], AND_X[XOR_y == 1, 1], c='blue', marker='s',
           s=200, edgecolors='black', zorder=5, label='Class 1')

# Try several lines to show none works
for angle in [30, 60, 90, 120, 150]:
    rad = np.radians(angle)
    slope = -np.cos(rad) / (np.sin(rad) + 1e-10)
    y_line = slope * (x_line - 0.5) + 0.5
    ax.plot(x_line, y_line, '--', linewidth=1, alpha=0.4, color='gray')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_aspect('equal')
ax.set_title('XOR: NOT Linearly Separable', fontsize=13, fontweight='bold',
             color='red')
ax.text(0.5, -0.3, 'No single line can\nseparate the classes!',
        ha='center', fontsize=11, color='red', style='italic')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.suptitle('Linear Separability of Boolean Functions', fontsize=15,
             fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
[Figure: linear separability of Boolean functions: AND and OR are separable, XOR is not]

4.8 Exercises#

Exercise 4.1: Manual Prediction#

Given a perceptron with \(\mathbf{w} = (2, -3)\) and \(b = 1\), compute the output \(f(\mathbf{x})\) for each of the following input vectors:

  1. \(\mathbf{x} = (1, 0)\)

  2. \(\mathbf{x} = (0, 1)\)

  3. \(\mathbf{x} = (1, 1)\)

  4. \(\mathbf{x} = (3, 2)\)

  5. \(\mathbf{x} = (-1, -1)\)

Show your work: write out \(z = \mathbf{w} \cdot \mathbf{x} + b\) for each.

Exercise 4.2: Decision Boundary Sketch#

For a perceptron with \(\mathbf{w} = (1, 2)\) and \(b = -3\):

  1. Write the equation of the decision boundary.

  2. Find the intercepts of this line with the \(x_1\) and \(x_2\) axes.

  3. Sketch the line by hand on a piece of paper. Mark the positive and negative regions.

  4. Compute the distance from the origin to the decision boundary.

Exercise 4.3: Scaling Invariance#

Suppose you have a perceptron with weights \(\mathbf{w}\) and bias \(b\). You now multiply all parameters by a constant \(c > 0\), obtaining \(\mathbf{w}' = c\mathbf{w}\) and \(b' = cb\).

  1. Does the decision boundary change? Prove your answer.

  2. Does the distance from the origin to the boundary change?

  3. What is the practical implication of this observation?

# Solution verification for Exercise 4.1
print("=" * 50)
print("Exercise 4.1: Solution Verification")
print("=" * 50)

p = Perceptron(n_features=2)
p.set_weights([2, -3], 1)

test_vectors = [
    np.array([1, 0]),
    np.array([0, 1]),
    np.array([1, 1]),
    np.array([3, 2]),
    np.array([-1, -1]),
]

print(f"\nPerceptron: w = {p.weights}, b = {p.bias}")
print()
for i, x in enumerate(test_vectors, 1):
    z = p.decision_function(x)[0]
    y_hat = p.predict(x)[0]
    print(f"  {i}. x = {x}")
    print(f"     z = w . x + b = {p.weights[0]}*{x[0]} + ({p.weights[1]})*{x[1]} + {p.bias} = {z}")
    print(f"     f(x) = step({z}) = {y_hat}")
    print()
==================================================
Exercise 4.1: Solution Verification
==================================================

Perceptron: w = [ 2. -3.], b = 1.0

  1. x = [1 0]
     z = w . x + b = 2.0*1 + (-3.0)*0 + 1.0 = 3.0
     f(x) = step(3.0) = 1

  2. x = [0 1]
     z = w . x + b = 2.0*0 + (-3.0)*1 + 1.0 = -2.0
     f(x) = step(-2.0) = 0

  3. x = [1 1]
     z = w . x + b = 2.0*1 + (-3.0)*1 + 1.0 = 0.0
     f(x) = step(0.0) = 1

  4. x = [3 2]
     z = w . x + b = 2.0*3 + (-3.0)*2 + 1.0 = 1.0
     f(x) = step(1.0) = 1

  5. x = [-1 -1]
     z = w . x + b = 2.0*-1 + (-3.0)*-1 + 1.0 = 2.0
     f(x) = step(2.0) = 1
# Solution verification for Exercise 4.2
print("=" * 50)
print("Exercise 4.2: Decision Boundary Analysis")
print("=" * 50)

w = np.array([1, 2])
b = -3

print(f"\nBoundary equation: {w[0]}*x1 + {w[1]}*x2 + ({b}) = 0")
print(f"Simplified: x1 + 2*x2 = 3")
print(f"Slope-intercept: x2 = -0.5*x1 + 1.5")
print(f"\nx1-intercept: x1 = {-b/w[0]} (set x2 = 0)")
print(f"x2-intercept: x2 = {-b/w[1]} (set x1 = 0)")
print(f"\nDistance from origin: |b|/||w|| = {abs(b)}/{np.linalg.norm(w):.4f} = {abs(b)/np.linalg.norm(w):.4f}")

# Visualize
plot_perceptron_geometry(
    weights=w, bias=b,
    xlim=(-1, 5), ylim=(-1, 3),
    title='Exercise 4.2: $\\mathbf{w}=(1,2)$, $b=-3$'
)
==================================================
Exercise 4.2: Decision Boundary Analysis
==================================================

Boundary equation: 1*x1 + 2*x2 + (-3) = 0
Simplified: x1 + 2*x2 = 3
Slope-intercept: x2 = -0.5*x1 + 1.5

x1-intercept: x1 = 3.0 (set x2 = 0)
x2-intercept: x2 = 1.5 (set x1 = 0)

Distance from origin: |b|/||w|| = 3/2.2361 = 1.3416
[Figure: Exercise 4.2: decision boundary for \(\mathbf{w} = (1, 2)\), \(b = -3\)]
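
Following the pattern of the verification cells above, here is a sketch for Exercise 4.3 (a numerical check, not a proof): scaling \(\mathbf{w}\) and \(b\) by the same constant \(c > 0\) leaves every prediction, and hence the decision boundary, unchanged, because the sign of \(c(\mathbf{w} \cdot \mathbf{x} + b)\) equals the sign of \(\mathbf{w} \cdot \mathbf{x} + b\). The distance \(|b|/\lVert\mathbf{w}\rVert\) is likewise unchanged, since numerator and denominator both scale by \(c\).

# Solution sketch for Exercise 4.3 (scaling invariance)
c = 2.5
w = np.array([1.0, 2.0])
b = -3.0

p_original = Perceptron(n_features=2)
p_original.set_weights(w, b)

p_scaled = Perceptron(n_features=2)
p_scaled.set_weights(c * w, c * b)

# Predictions agree on a random cloud of test points
rng = np.random.default_rng(0)
X_test = rng.uniform(-5, 5, size=(1000, 2))
print("Predictions identical:",
      np.array_equal(p_original.predict(X_test), p_scaled.predict(X_test)))

# The distance from the origin to the boundary is also unchanged
print(abs(b) / np.linalg.norm(w), abs(c * b) / np.linalg.norm(c * w))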