{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# Chapter 6: The Perceptron Convergence Theorem\n",
    "\n",
    "\n",
    "## Part 2: The Perceptron"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "## 6.1 Introduction\n",
    "\n",
    "The Perceptron Convergence Theorem is one of the most important results in the history of machine learning. It provides the **first formal guarantee** that a learning algorithm will converge to a correct solution in a finite number of steps, provided a solution exists.\n",
    "\n",
    "### Historical Context\n",
    "\n",
    "- **1957**: Frank Rosenblatt proposes the perceptron and its learning rule.\n",
    "- **1962**: Rosenblatt publishes *Principles of Neurodynamics*, containing a version of the convergence proof.\n",
    "- **1962**: Albert Novikoff provides a clean, elegant proof that became the standard reference.\n",
    "- The theorem was independently discovered by Agmon (1954) and Motzkin & Schoenberg (1954) in the context of solving systems of linear inequalities.\n",
    "\n",
    "The theorem tells us:\n",
    "1. **If** the data is linearly separable, the perceptron learning algorithm **will** find a separating hyperplane.\n",
    "2. The number of mistakes (updates) is **bounded** by a quantity that depends on the geometry of the data.\n",
    "3. The bound does **not** depend on the dimensionality of the data.\n",
    "\n",
    "This last point is remarkable: a perceptron in a million-dimensional space is not inherently harder to train than one in two dimensions, as long as the geometric margin is favorable."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 6.2 Statement of the Theorem\n",
    "\n",
    "We use the $\\{-1, +1\\}$ label convention, which simplifies the proof. The perceptron predicts $\\hat{y} = \\text{sign}(\\mathbf{w} \\cdot \\mathbf{x})$, and the update rule on a misclassification of $(\\mathbf{x}_i, y_i)$ is:\n",
    "\n",
    "$$\\mathbf{w}_{t+1} = \\mathbf{w}_t + y_i \\mathbf{x}_i$$\n",
    "\n",
    "(We absorb the bias into $\\mathbf{w}$ using the bias trick, so $\\mathbf{x}$ is augmented with $x_0 = 1$.)\n",
    "\n",
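     "A minimal sketch of one pass of this rule, assuming NumPy arrays `X` (one augmented row per example) and labels `y` in $\\{-1, +1\\}$; the function name is illustrative:\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def perceptron_epoch(w, X, y):\n",
     "    # One pass over the data, applying w <- w + y_i x_i on each mistake.\n",
     "    for x_i, y_i in zip(X, y):\n",
     "        if y_i * (w @ x_i) <= 0:  # ties count as mistakes\n",
     "            w = w + y_i * x_i\n",
     "    return w\n",
     "```\n",
     "\n",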
    "```{admonition} Theorem (Perceptron Convergence -- Novikoff, 1962)\n",
    ":class: important\n",
    "\n",
    "Let $\\{(\\mathbf{x}_1, y_1), \\ldots, (\\mathbf{x}_m, y_m)\\}$ be a training set with $y_i \\in \\{-1, +1\\}$. Suppose there exists a unit vector $\\mathbf{w}^*$ with $\\|\\mathbf{w}^*\\| = 1$ and a margin $\\gamma > 0$ such that:\n",
    "\n",
    "$$y_i(\\mathbf{w}^* \\cdot \\mathbf{x}_i) \\geq \\gamma \\quad \\text{for all } i = 1, \\ldots, m$$\n",
    "\n",
    "Let $R = \\max_i \\|\\mathbf{x}_i\\|$. Then the perceptron algorithm, starting from $\\mathbf{w}_0 = \\mathbf{0}$, makes at most\n",
    "\n",
    "$$k \\leq \\frac{R^2}{\\gamma^2}$$\n",
    "\n",
    "mistakes (weight updates) before finding a separating hyperplane.\n",
    "```\n",
    "\n",
    "```{tip}\n",
    "**Geometric Meaning of $R$ and $\\gamma$**\n",
    "\n",
    "- **$R$ (data radius)**: The maximum distance of any data point from the origin. It measures how \"spread out\" the data is. Larger $R$ means points are farther away, and the bound gets worse.\n",
    "- **$\\gamma$ (margin)**: The minimum signed distance from any data point to the decision hyperplane. It measures how \"cleanly\" the data can be separated. Larger $\\gamma$ means the classes are well-separated, and the bound gets better.\n",
    "- **$R^2/\\gamma^2$**: This ratio captures the \"difficulty\" of the classification problem. Easy problems (large margin, compact data) have small ratios; hard problems (thin margin, spread-out data) have large ratios.\n",
    "- The ratio is **independent of dimension** $n$: adding irrelevant features does not change $R$ or $\\gamma$.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-3",
   "metadata": {},
   "source": [
    "## 6.3 Definitions\n",
    "\n",
    "Before presenting the proof, let us define the key quantities precisely.\n",
    "\n",
    "### Margin $\\gamma$\n",
    "\n",
    "The **margin** of a separating hyperplane (defined by unit vector $\\mathbf{w}^*$) with respect to a dataset is the minimum signed distance from any point to the hyperplane:\n",
    "\n",
    "$$\\gamma = \\min_{i=1,\\ldots,m} y_i(\\mathbf{w}^* \\cdot \\mathbf{x}_i)$$\n",
    "\n",
    "Since $\\|\\mathbf{w}^*\\| = 1$, the quantity $y_i(\\mathbf{w}^* \\cdot \\mathbf{x}_i)$ is the signed distance from $\\mathbf{x}_i$ to the hyperplane, multiplied by the label. The condition $\\gamma > 0$ means all points are correctly classified with at least $\\gamma$ distance from the boundary.\n",
    "\n",
    "### Radius $R$\n",
    "\n",
    "The **radius** of the dataset is the maximum norm of any data point:\n",
    "\n",
    "$$R = \\max_{i=1,\\ldots,m} \\|\\mathbf{x}_i\\|$$\n",
    "\n",
    "This measures how \"spread out\" the data is. When using the bias trick ($\\mathbf{x}$ is augmented with $x_0=1$), $R$ accounts for the augmented dimension.\n",
    "\n",
    "### Target Weight Vector $\\mathbf{w}^*$\n",
    "\n",
    "The vector $\\mathbf{w}^*$ is any unit vector that separates the data with margin $\\gamma$. Note:\n",
    "- The theorem asserts that **some** separating $\\mathbf{w}^*$ exists (this is the linear separability assumption).\n",
    "- The perceptron does **not** need to find $\\mathbf{w}^*$ itself; $\\mathbf{w}^*$ is used only in the analysis.\n",
    "- The optimal choice of $\\mathbf{w}^*$ (maximizing $\\gamma$) gives the tightest bound. This optimal $\\mathbf{w}^*$ is the **maximum-margin** hyperplane, which is exactly what Support Vector Machines find.\n",
    "\n",
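     "All three quantities are easy to compute once a candidate direction is fixed. A minimal sketch on a toy dataset (the data and `w_star` below are illustrative, not from the text):\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.5], [-1.0, -1.0]])\n",
     "y = np.array([1, 1, -1, -1])\n",
     "\n",
     "w_star = np.array([1.0, 0.0])          # candidate separator, already unit-norm\n",
     "gamma = np.min(y * (X @ w_star))       # margin: min_i y_i (w* . x_i) -> 1.0\n",
     "R = np.max(np.linalg.norm(X, axis=1))  # radius: max_i ||x_i|| -> sqrt(5)\n",
     "bound = R**2 / gamma**2                # mistake bound -> 5.0\n",
     "\n",
     "# Appending an irrelevant all-zero feature changes neither gamma nor R,\n",
     "# illustrating the dimension-independence of the bound.\n",
     "X3 = np.hstack([X, np.zeros((len(X), 1))])\n",
     "w3 = np.append(w_star, 0.0)\n",
     "assert np.isclose(np.min(y * (X3 @ w3)), gamma)\n",
     "assert np.isclose(np.max(np.linalg.norm(X3, axis=1)), R)\n",
     "```\n",
     "\n",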
    "### The Separability Condition\n",
    "\n",
    "The formal condition $y_i(\\mathbf{w}^* \\cdot \\mathbf{x}_i) \\geq \\gamma$ for all $i$ means:\n",
    "- If $y_i = +1$: then $\\mathbf{w}^* \\cdot \\mathbf{x}_i \\geq \\gamma > 0$ (positive class is on the positive side)\n",
    "- If $y_i = -1$: then $\\mathbf{w}^* \\cdot \\mathbf{x}_i \\leq -\\gamma < 0$ (negative class is on the negative side)\n",
    "\n",
    "No point lies within $\\gamma$ of the decision boundary."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 6.4 The Complete Proof\n",
    "\n",
    "The proof proceeds in three steps: a lower bound on $\\mathbf{w}^* \\cdot \\mathbf{w}_k$, an upper bound on $\\|\\mathbf{w}_k\\|^2$, and a squeeze argument combining both.\n",
    "\n",
    "```{admonition} Proof\n",
    ":class: dropdown\n",
    "\n",
    "### Step 1: Lower Bound on $\\mathbf{w}^* \\cdot \\mathbf{w}_k$\n",
    "\n",
    "**Claim**: After $k$ mistakes, $\\mathbf{w}^* \\cdot \\mathbf{w}_k \\geq k\\gamma$.\n",
    "\n",
    "**Proof**: We proceed by induction on the number of mistakes $k$.\n",
    "\n",
    "**Base case** ($k = 0$): $\\mathbf{w}_0 = \\mathbf{0}$, so $\\mathbf{w}^* \\cdot \\mathbf{w}_0 = 0 \\geq 0 = 0 \\cdot \\gamma$. The claim holds.\n",
    "\n",
    "**Inductive step**: Suppose after $k$ mistakes, $\\mathbf{w}^* \\cdot \\mathbf{w}_k \\geq k\\gamma$. The $(k+1)$-th mistake occurs on some example $(\\mathbf{x}_{i_k}, y_{i_k})$, and the update is:\n",
    "\n",
    "$$\\mathbf{w}_{k+1} = \\mathbf{w}_k + y_{i_k} \\mathbf{x}_{i_k}$$\n",
    "\n",
    "Taking the dot product with $\\mathbf{w}^*$:\n",
    "\n",
    "$$\\mathbf{w}^* \\cdot \\mathbf{w}_{k+1} = \\mathbf{w}^* \\cdot \\mathbf{w}_k + y_{i_k}(\\mathbf{w}^* \\cdot \\mathbf{x}_{i_k})$$\n",
    "\n",
    "By the inductive hypothesis, $\\mathbf{w}^* \\cdot \\mathbf{w}_k \\geq k\\gamma$.\n",
    "\n",
    "By the separability condition, $y_{i_k}(\\mathbf{w}^* \\cdot \\mathbf{x}_{i_k}) \\geq \\gamma$.\n",
    "\n",
    "Therefore:\n",
    "\n",
    "$$\\mathbf{w}^* \\cdot \\mathbf{w}_{k+1} \\geq k\\gamma + \\gamma = (k+1)\\gamma \\quad \\blacksquare$$\n",
    "\n",
    "**Consequence**: By the Cauchy-Schwarz inequality:\n",
    "\n",
    "$$k\\gamma \\leq \\mathbf{w}^* \\cdot \\mathbf{w}_k \\leq \\|\\mathbf{w}^*\\| \\cdot \\|\\mathbf{w}_k\\| = \\|\\mathbf{w}_k\\|$$\n",
    "\n",
    "(since $\\|\\mathbf{w}^*\\| = 1$). Thus $\\|\\mathbf{w}_k\\|^2 \\geq k^2\\gamma^2$.\n",
    "\n",
    "\n",
    "### Step 2: Upper Bound on $\\|\\mathbf{w}_k\\|^2$\n",
    "\n",
    "**Claim**: After $k$ mistakes, $\\|\\mathbf{w}_k\\|^2 \\leq kR^2$.\n",
    "\n",
    "**Proof**: Again by induction.\n",
    "\n",
    "**Base case** ($k = 0$): $\\|\\mathbf{w}_0\\|^2 = 0 \\leq 0 = 0 \\cdot R^2$. The claim holds.\n",
    "\n",
    "**Inductive step**: Suppose $\\|\\mathbf{w}_k\\|^2 \\leq kR^2$. The $(k+1)$-th update gives:\n",
    "\n",
    "$$\\|\\mathbf{w}_{k+1}\\|^2 = \\|\\mathbf{w}_k + y_{i_k}\\mathbf{x}_{i_k}\\|^2$$\n",
    "\n",
    "$$= \\|\\mathbf{w}_k\\|^2 + 2y_{i_k}(\\mathbf{w}_k \\cdot \\mathbf{x}_{i_k}) + y_{i_k}^2\\|\\mathbf{x}_{i_k}\\|^2$$\n",
    "\n",
    "$$= \\|\\mathbf{w}_k\\|^2 + 2y_{i_k}(\\mathbf{w}_k \\cdot \\mathbf{x}_{i_k}) + \\|\\mathbf{x}_{i_k}\\|^2$$\n",
    "\n",
    "(since $y_{i_k}^2 = 1$).\n",
    "\n",
    "**Key insight**: the update happens because the perceptron **misclassified** $(\\mathbf{x}_{i_k}, y_{i_k})$. This means:\n",
    "\n",
    "$$y_{i_k}(\\mathbf{w}_k \\cdot \\mathbf{x}_{i_k}) \\leq 0$$\n",
    "\n",
    "Therefore the cross-term $2y_{i_k}(\\mathbf{w}_k \\cdot \\mathbf{x}_{i_k}) \\leq 0$, and we can drop it:\n",
    "\n",
    "$$\\|\\mathbf{w}_{k+1}\\|^2 \\leq \\|\\mathbf{w}_k\\|^2 + \\|\\mathbf{x}_{i_k}\\|^2 \\leq kR^2 + R^2 = (k+1)R^2 \\quad \\blacksquare$$\n",
    "\n",
    "\n",
    "### Step 3: The Squeeze\n",
    "\n",
    "Combining the lower bound from Step 1 and the upper bound from Step 2:\n",
    "\n",
    "From Step 1: $\\|\\mathbf{w}_k\\|^2 \\geq k^2\\gamma^2$\n",
    "\n",
    "From Step 2: $\\|\\mathbf{w}_k\\|^2 \\leq kR^2$\n",
    "\n",
    "Therefore:\n",
    "\n",
    "$$k^2\\gamma^2 \\leq \\|\\mathbf{w}_k\\|^2 \\leq kR^2$$\n",
    "\n",
    "Dividing both sides by $k > 0$:\n",
    "\n",
    "$$k\\gamma^2 \\leq R^2$$\n",
    "\n",
    "$$\\boxed{k \\leq \\frac{R^2}{\\gamma^2}}$$\n",
    "\n",
    "This completes the proof. $\\blacksquare$\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4a",
   "metadata": {},
   "source": [
    "### Summary of the Proof Structure\n",
    "\n",
    "The beauty of this proof lies in the squeeze:\n",
    "\n",
    "- **Lower bound** says $\\|\\mathbf{w}_k\\|$ must grow *at least as fast* as $k\\gamma$ (linear in $k$), because each mistake makes definite progress toward $\\mathbf{w}^*$.\n",
    "- **Upper bound** says $\\|\\mathbf{w}_k\\|$ can grow *at most as fast* as $\\sqrt{k} \\cdot R$ (square root of $k$), because the misclassification condition limits the energy added.\n",
    "- These two growth rates are incompatible for large $k$: a linear function eventually exceeds any square-root function. The squeeze forces $k$ to be finite.\n",
    "\n",
    "| Bound | Growth of $\\|\\mathbf{w}_k\\|$ | Growth of $\\|\\mathbf{w}_k\\|^2$ |\n",
    "|-------|----------------------------|---------------------------|\n",
    "| Lower | $\\geq k\\gamma$ (linear) | $\\geq k^2\\gamma^2$ (quadratic) |\n",
    "| Upper | $\\leq \\sqrt{k} R$ (sub-linear) | $\\leq kR^2$ (linear) |\n",
    "\n",
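     "The incompatibility of the two growth rates can be checked numerically: scanning for the first $k$ at which the quadratic lower bound on $\\|\\mathbf{w}_k\\|^2$ overtakes the linear upper bound recovers the mistake bound (the values of $R$ and $\\gamma$ below are illustrative):\n",
     "\n",
     "```python\n",
     "R, gamma = 2.0, 0.5\n",
     "bound = R**2 / gamma**2  # 16.0: at most 16 mistakes\n",
     "\n",
     "# First k where the lower bound k^2 gamma^2 on ||w_k||^2 exceeds the\n",
     "# upper bound k R^2; past this point no valid ||w_k||^2 can exist.\n",
     "k = 1\n",
     "while (k * gamma)**2 <= k * R**2:\n",
     "    k += 1\n",
     "print(bound, k)  # k is the first integer strictly greater than R^2/gamma^2\n",
     "```\n",
     "\n",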
    "```{danger}\n",
     "**The bound $R^2/\\gamma^2$ is often very loose in practice!**\n",
     "\n",
     "While the theorem guarantees at most $R^2/\\gamma^2$ mistakes, the actual number of mistakes is typically **much smaller** -- often by a factor of 10 to 100. The bound is a **worst-case** guarantee, not an average-case prediction. There exist adversarial data orderings that force the perceptron to approach the bound, but on typical random data the algorithm converges much faster.\n",
    "\n",
    "Moreover, the bound depends on the choice of reference vector $\\mathbf{w}^*$. Using a suboptimal $\\mathbf{w}^*$ (with a smaller margin) gives a looser bound. The tightest bound comes from the **maximum-margin** separating hyperplane, which is what Support Vector Machines compute.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-8",
   "metadata": {},
   "source": [
    "## 6.5 Visualizing the Proof"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-9",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.rcParams.update({\n",
    "    'figure.figsize': (8, 6),\n",
    "    'font.size': 12,\n",
    "    'axes.grid': True,\n",
    "    'grid.alpha': 0.3\n",
    "})\n",
    "\n",
    "try:\n",
    "    plt.style.use('seaborn-v0_8-whitegrid')\n",
    "except OSError:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def perceptron_with_tracking(X, y, max_epochs=100):\n",
     "    \"\"\"Run the perceptron algorithm ({-1,+1} labels) and track ||w|| at each update.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : np.ndarray of shape (n_samples, n_features)\n",
    "        Training data (already augmented with bias column if desired).\n",
    "    y : np.ndarray of shape (n_samples,)\n",
    "        Labels in {-1, +1}.\n",
    "    max_epochs : int\n",
    "        Maximum epochs.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    w : np.ndarray\n",
    "        Final weight vector.\n",
    "    norm_history : list of float\n",
    "        ||w_k|| after each update.\n",
    "    n_mistakes : int\n",
    "        Total number of mistakes.\n",
    "    \"\"\"\n",
    "    n_samples, n_features = X.shape\n",
    "    w = np.zeros(n_features)\n",
    "    \n",
    "    norm_history = [0.0]\n",
    "    n_mistakes = 0\n",
    "    \n",
    "    for epoch in range(max_epochs):\n",
    "        errors = 0\n",
    "        for i in range(n_samples):\n",
    "            if y[i] * (w @ X[i]) <= 0:\n",
    "                w = w + y[i] * X[i]\n",
    "                n_mistakes += 1\n",
    "                norm_history.append(np.linalg.norm(w))\n",
    "                errors += 1\n",
    "        if errors == 0:\n",
    "            break\n",
    "    \n",
    "    return w, norm_history, n_mistakes\n",
    "\n",
    "\n",
    "def compute_margin_and_radius(X, y):\n",
     "    \"\"\"Estimate the margin gamma and radius R for the dataset.\n",
     "    \n",
     "    Uses the mean-difference (class-centroid) direction as a cheap stand-in\n",
     "    for the max-margin direction, so the returned gamma is a lower bound on\n",
     "    the true maximum margin and may be non-positive even for separable data.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : np.ndarray of shape (n_samples, n_features)\n",
    "    y : np.ndarray of shape (n_samples,) with values in {-1, +1}\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    gamma : float\n",
    "        Margin (minimum y_i * (w* . x_i)).\n",
    "    R : float\n",
    "        Maximum ||x_i||.\n",
    "    w_star : np.ndarray\n",
    "        Unit normal vector of the separating hyperplane.\n",
    "    \"\"\"\n",
    "    # Use mean-difference direction as an approximation to the max-margin direction\n",
    "    pos_mean = X[y == 1].mean(axis=0)\n",
    "    neg_mean = X[y == -1].mean(axis=0)\n",
    "    w_star = pos_mean - neg_mean\n",
    "    w_star = w_star / np.linalg.norm(w_star)\n",
    "    \n",
    "    # Compute margin\n",
    "    margins = y * (X @ w_star)\n",
    "    gamma = margins.min()\n",
    "    \n",
    "    # Compute radius\n",
    "    R = np.max(np.linalg.norm(X, axis=1))\n",
    "    \n",
    "    return gamma, R, w_star\n",
    "\n",
    "\n",
    "# Generate separable data\n",
    "np.random.seed(42)\n",
    "n_per_class = 30\n",
    "separation = 3.0\n",
    "\n",
    "X0 = np.random.randn(n_per_class, 2) * 0.5 + np.array([-separation/2, 0])\n",
    "X1 = np.random.randn(n_per_class, 2) * 0.5 + np.array([separation/2, 0])\n",
    "X_raw = np.vstack([X0, X1])\n",
    "y_pm = np.array([-1] * n_per_class + [1] * n_per_class)\n",
    "\n",
    "# Augment with bias column\n",
    "X_aug = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])\n",
    "\n",
    "# Compute geometric quantities\n",
    "gamma, R, w_star = compute_margin_and_radius(X_aug, y_pm)\n",
    "bound = R**2 / gamma**2\n",
    "\n",
    "print(f\"Dataset: {len(y_pm)} points, separation = {separation}\")\n",
    "print(f\"Margin (gamma): {gamma:.4f}\")\n",
    "print(f\"Radius (R): {R:.4f}\")\n",
    "print(f\"Theoretical bound R^2/gamma^2: {bound:.1f}\")\n",
    "print()\n",
    "\n",
    "# Run perceptron\n",
    "w_final, norm_history, n_mistakes = perceptron_with_tracking(X_aug, y_pm)\n",
    "print(f\"Actual mistakes: {n_mistakes}\")\n",
    "print(f\"Bound / Actual: {bound / max(n_mistakes, 1):.1f}x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Visualize the proof: ||w_k|| growth with bounds\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 7))\n",
    "\n",
    "k_vals = np.arange(len(norm_history))\n",
    "\n",
    "# --- Panel 1: ||w_k|| growth ---\n",
    "ax = axes[0]\n",
    "ax.plot(k_vals, norm_history, 'b.-', linewidth=2, markersize=4,\n",
    "        label=r'Actual $\\|\\mathbf{w}_k\\|$', zorder=3)\n",
    "\n",
    "# Lower bound: ||w_k|| >= k * gamma\n",
    "k_cont = np.linspace(0, len(norm_history) - 1, 200)\n",
    "ax.plot(k_cont, k_cont * gamma, 'g--', linewidth=2,\n",
    "        label=r'Lower bound: $k\\gamma$')\n",
    "\n",
    "# Upper bound: ||w_k|| <= sqrt(k) * R\n",
    "ax.plot(k_cont, np.sqrt(k_cont) * R, 'r--', linewidth=2,\n",
    "        label=r'Upper bound: $\\sqrt{k} \\cdot R$')\n",
    "\n",
    "# Shade the feasible region\n",
    "ax.fill_between(k_cont, k_cont * gamma, np.sqrt(k_cont) * R,\n",
    "                alpha=0.1, color='yellow', label='Feasible region')\n",
    "\n",
    "ax.set_xlabel('Number of mistakes $k$', fontsize=13)\n",
    "ax.set_ylabel(r'$\\|\\mathbf{w}_k\\|$', fontsize=13)\n",
    "ax.set_title(r'Growth of $\\|\\mathbf{w}_k\\|$: Actual vs. Bounds', fontsize=14,\n",
    "             fontweight='bold')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "# --- Panel 2: The Squeeze on ||w_k||^2 ---\n",
    "ax = axes[1]\n",
    "norm_sq = np.array(norm_history) ** 2\n",
    "ax.plot(k_vals, norm_sq, 'b.-', linewidth=2, markersize=4,\n",
    "        label=r'Actual $\\|\\mathbf{w}_k\\|^2$', zorder=3)\n",
    "\n",
    "# Lower bound: ||w_k||^2 >= k^2 * gamma^2 (quadratic)\n",
    "ax.plot(k_cont, (k_cont * gamma) ** 2, 'g--', linewidth=2,\n",
    "        label=r'Lower bound: $k^2\\gamma^2$ (quadratic)')\n",
    "\n",
    "# Upper bound: ||w_k||^2 <= k * R^2 (linear)\n",
    "ax.plot(k_cont, k_cont * R**2, 'r--', linewidth=2,\n",
    "        label=r'Upper bound: $kR^2$ (linear)')\n",
    "\n",
    "# Mark the crossing point (the bound k = R^2/gamma^2)\n",
    "ax.axvline(x=bound, color='purple', linestyle=':', linewidth=2,\n",
    "           label=f'$R^2/\\\\gamma^2 = {bound:.0f}$')\n",
    "\n",
    "# Shade the feasible region\n",
    "valid_k = k_cont[k_cont <= bound + 5]\n",
    "ax.fill_between(valid_k, (valid_k * gamma)**2, valid_k * R**2,\n",
    "                alpha=0.1, color='yellow')\n",
    "\n",
    "ax.set_xlabel('Number of mistakes $k$', fontsize=13)\n",
    "ax.set_ylabel(r'$\\|\\mathbf{w}_k\\|^2$', fontsize=13)\n",
    "ax.set_title(r'The Squeeze: $k^2\\gamma^2 \\leq \\|\\mathbf{w}_k\\|^2 \\leq kR^2$',\n",
    "             fontsize=14, fontweight='bold')\n",
    "ax.legend(fontsize=10)\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "The left panel shows $\\|\\mathbf{w}_k\\|$ growing between the linear lower bound $k\\gamma$ and the sub-linear upper bound $\\sqrt{k} \\cdot R$. The right panel shows $\\|\\mathbf{w}_k\\|^2$ squeezed between the quadratic lower bound $k^2\\gamma^2$ and the linear upper bound $kR^2$. The vertical purple line marks the theoretical maximum $k = R^2/\\gamma^2$; the actual number of mistakes is always to its left."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12a",
   "metadata": {},
   "source": [
    "### Relationship Between R, gamma, and the Convergence Bound\n",
    "\n",
    "Let us visualize how the convergence bound $R^2/\\gamma^2$ changes as a function of $R$ and $\\gamma$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12b",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib import cm\n",
    "\n",
    "# Plot the convergence bound R^2 / gamma^2 as a function of R and gamma\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5.5))\n",
    "\n",
    "# --- Panel 1: 3D surface of bound vs (R, gamma) ---\n",
    "ax = fig.add_subplot(131, projection='3d')\n",
    "# Remove the flat axes panel\n",
    "axes[0].set_visible(False)\n",
    "\n",
    "R_range = np.linspace(0.5, 5.0, 50)\n",
    "gamma_range = np.linspace(0.1, 2.0, 50)\n",
    "R_grid, Gamma_grid = np.meshgrid(R_range, gamma_range)\n",
    "Bound_grid = R_grid**2 / Gamma_grid**2\n",
    "\n",
    "surf = ax.plot_surface(R_grid, Gamma_grid, np.log10(Bound_grid),\n",
    "                       cmap=cm.RdYlGn_r, alpha=0.8, edgecolor='none')\n",
    "ax.set_xlabel('$R$ (radius)', fontsize=11, labelpad=8)\n",
    "ax.set_ylabel('$\\\\gamma$ (margin)', fontsize=11, labelpad=8)\n",
    "ax.set_zlabel('$\\\\log_{10}(R^2/\\\\gamma^2)$', fontsize=11, labelpad=8)\n",
    "ax.set_title('Convergence Bound\\n$R^2/\\\\gamma^2$', fontsize=12, fontweight='bold')\n",
    "ax.view_init(elev=25, azim=-130)\n",
    "\n",
    "# --- Panel 2: Bound vs gamma (fixed R) ---\n",
    "ax2 = axes[1]\n",
    "gamma_vals = np.linspace(0.1, 3.0, 200)\n",
    "for R_val in [1.0, 2.0, 3.0, 5.0]:\n",
    "    ax2.plot(gamma_vals, R_val**2 / gamma_vals**2, linewidth=2,\n",
    "             label=f'$R = {R_val}$')\n",
    "ax2.set_xlabel('Margin $\\\\gamma$', fontsize=13)\n",
    "ax2.set_ylabel('Bound $R^2/\\\\gamma^2$', fontsize=13)\n",
    "ax2.set_title('Bound vs. Margin\\n(fixed data radius $R$)', fontsize=13, fontweight='bold')\n",
    "ax2.set_yscale('log')\n",
    "ax2.legend(fontsize=11)\n",
    "ax2.grid(True, alpha=0.3)\n",
    "ax2.set_xlim(0.1, 3.0)\n",
    "\n",
    "# --- Panel 3: Bound vs R (fixed gamma) ---\n",
    "ax3 = axes[2]\n",
    "R_vals = np.linspace(0.5, 10.0, 200)\n",
    "for gamma_val in [0.1, 0.5, 1.0, 2.0]:\n",
    "    ax3.plot(R_vals, R_vals**2 / gamma_val**2, linewidth=2,\n",
    "             label=f'$\\\\gamma = {gamma_val}$')\n",
    "ax3.set_xlabel('Radius $R$', fontsize=13)\n",
    "ax3.set_ylabel('Bound $R^2/\\\\gamma^2$', fontsize=13)\n",
    "ax3.set_title('Bound vs. Data Radius\\n(fixed margin $\\\\gamma$)', fontsize=13, fontweight='bold')\n",
    "ax3.set_yscale('log')\n",
    "ax3.legend(fontsize=11)\n",
    "ax3.grid(True, alpha=0.3)\n",
    "ax3.set_xlim(0.5, 10.0)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12c",
   "metadata": {},
   "source": [
    "### Theoretical Bound vs. Actual Iterations\n",
    "\n",
    "The following visualization demonstrates how the theoretical bound $R^2/\\gamma^2$ compares to the actual number of perceptron mistakes across datasets with different separations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12d",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Compare theoretical bound vs actual iterations for many random datasets\n",
    "np.random.seed(42)\n",
    "\n",
    "separations_test = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]\n",
    "n_trials_per = 15\n",
    "\n",
    "all_bounds = []\n",
    "all_actuals = []\n",
    "all_seps = []\n",
    "\n",
    "for sep in separations_test:\n",
    "    for trial in range(n_trials_per):\n",
    "        rng = np.random.RandomState(trial * 137 + int(sep * 100))\n",
    "        X0_t = rng.randn(25, 2) * 0.5 + np.array([-sep/2, 0])\n",
    "        X1_t = rng.randn(25, 2) * 0.5 + np.array([sep/2, 0])\n",
    "        X_t = np.vstack([X0_t, X1_t])\n",
    "        y_t = np.array([-1]*25 + [1]*25)\n",
    "        X_t_aug = np.hstack([np.ones((50, 1)), X_t])\n",
    "        \n",
    "        g, Rv, ws = compute_margin_and_radius(X_t_aug, y_t)\n",
    "        if g <= 0:\n",
    "            continue\n",
    "        bd = Rv**2 / g**2\n",
    "        _, _, n_mk = perceptron_with_tracking(X_t_aug, y_t, max_epochs=500)\n",
    "        \n",
    "        all_bounds.append(bd)\n",
    "        all_actuals.append(n_mk)\n",
    "        all_seps.append(sep)\n",
    "\n",
    "all_bounds = np.array(all_bounds)\n",
    "all_actuals = np.array(all_actuals)\n",
    "all_seps = np.array(all_seps)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# --- Panel 1: Scatter of actual vs bound, colored by separation ---\n",
    "ax = axes[0]\n",
    "scatter = ax.scatter(all_bounds, all_actuals, c=all_seps, cmap='viridis',\n",
    "                     edgecolors='black', s=60, alpha=0.7, linewidths=0.5)\n",
    "max_val = max(all_bounds.max(), all_actuals.max()) * 1.1\n",
    "ax.plot([0, max_val], [0, max_val], 'r--', linewidth=2, alpha=0.7,\n",
    "        label='Actual = Bound (equality)')\n",
    "ax.set_xlabel('Theoretical Bound $R^2/\\\\gamma^2$', fontsize=13)\n",
    "ax.set_ylabel('Actual Mistakes $k$', fontsize=13)\n",
    "ax.set_title('Theoretical Bound vs. Actual Mistakes', fontsize=13, fontweight='bold')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "cbar = plt.colorbar(scatter, ax=ax)\n",
    "cbar.set_label('Cluster separation', fontsize=11)\n",
    "\n",
    "# --- Panel 2: Bar chart of ratio bound/actual by separation ---\n",
    "ax = axes[1]\n",
    "unique_seps = sorted(set(all_seps))\n",
    "ratios_by_sep = []\n",
    "for s in unique_seps:\n",
    "    mask = all_seps == s\n",
    "    ratios = all_bounds[mask] / np.maximum(all_actuals[mask], 1)\n",
    "    ratios_by_sep.append(ratios)\n",
    "\n",
    "bp = ax.boxplot(ratios_by_sep, positions=range(len(unique_seps)),\n",
    "               widths=0.6, patch_artist=True)\n",
    "\n",
    "cmap = plt.cm.viridis\n",
    "for i, patch in enumerate(bp['boxes']):\n",
    "    color = cmap(i / max(len(unique_seps) - 1, 1))\n",
    "    patch.set_facecolor(color)\n",
    "    patch.set_alpha(0.7)\n",
    "\n",
    "ax.set_xticks(range(len(unique_seps)))\n",
    "ax.set_xticklabels([f'{s:.1f}' for s in unique_seps])\n",
    "ax.set_xlabel('Cluster Separation', fontsize=13)\n",
    "ax.set_ylabel('Ratio: Bound / Actual', fontsize=13)\n",
    "ax.set_title('How Loose is the Bound?', fontsize=13, fontweight='bold')\n",
    "ax.axhline(y=1, color='red', linestyle='--', linewidth=1.5, alpha=0.7,\n",
    "           label='Tight bound (ratio = 1)')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3, axis='y')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"\\nOverall statistics:\")\n",
    "print(f\"  Mean ratio (bound/actual): {np.mean(all_bounds / np.maximum(all_actuals, 1)):.1f}x\")\n",
     "print(f\"  Bound always respected: {np.all(all_actuals <= all_bounds)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## 6.6 Empirical Verification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def run_experiment(n_per_class, separation, seed):\n",
     "    \"\"\"Generate data, run perceptron, return R, gamma, bound, actual mistakes.\"\"\"\n",
    "    rng = np.random.RandomState(seed)\n",
    "    X0 = rng.randn(n_per_class, 2) * 0.5 + np.array([-separation/2, 0])\n",
    "    X1 = rng.randn(n_per_class, 2) * 0.5 + np.array([separation/2, 0])\n",
    "    X_raw = np.vstack([X0, X1])\n",
    "    y_pm = np.array([-1] * n_per_class + [1] * n_per_class)\n",
    "    X_aug = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])\n",
    "    \n",
    "    gamma, R, w_star = compute_margin_and_radius(X_aug, y_pm)\n",
    "    \n",
    "    if gamma <= 0:\n",
    "        return None  # Not separable with this direction\n",
    "    \n",
    "    _, _, n_mistakes = perceptron_with_tracking(X_aug, y_pm, max_epochs=500)\n",
    "    \n",
    "    return {\n",
    "        'R': R,\n",
    "        'gamma': gamma,\n",
    "        'bound': R**2 / gamma**2,\n",
    "        'actual': n_mistakes\n",
    "    }\n",
    "\n",
    "\n",
    "# Run experiments with different separations\n",
    "print(\"=\" * 75)\n",
    "print(\"Empirical Verification of the Convergence Bound\")\n",
    "print(\"=\" * 75)\n",
    "print()\n",
    "print(f\"{'Separation':>12} | {'R':>8} | {'gamma':>8} | {'R^2/gamma^2':>12} | {'Actual k':>10} | {'Ratio':>8}\")\n",
    "print(\"-\" * 75)\n",
    "\n",
    "for sep in [2.0, 3.0, 4.0, 5.0, 6.0]:\n",
    "    result = run_experiment(30, sep, seed=42)\n",
    "    if result:\n",
    "        ratio = result['bound'] / max(result['actual'], 1)\n",
    "        print(f\"{sep:>12.1f} | {result['R']:>8.3f} | {result['gamma']:>8.3f} | \"\n",
    "              f\"{result['bound']:>12.1f} | {result['actual']:>10d} | {ratio:>8.1f}x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Run 100 random trials for a fixed separation and compare distribution\n",
    "n_trials = 100\n",
    "separation = 3.0\n",
    "\n",
    "actual_mistakes = []\n",
    "bounds = []\n",
    "\n",
    "for trial in range(n_trials):\n",
    "    result = run_experiment(30, separation, seed=trial)\n",
    "    if result:\n",
    "        actual_mistakes.append(result['actual'])\n",
    "        bounds.append(result['bound'])\n",
    "\n",
    "actual_mistakes = np.array(actual_mistakes)\n",
    "bounds = np.array(bounds)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# --- Panel 1: Histogram of actual mistakes vs bound ---\n",
    "ax = axes[0]\n",
    "ax.hist(actual_mistakes, bins=20, color='steelblue', edgecolor='black',\n",
    "        alpha=0.7, label='Actual mistakes $k$')\n",
    "ax.axvline(x=np.mean(bounds), color='red', linewidth=2, linestyle='--',\n",
    "           label=f'Mean bound $R^2/\\\\gamma^2 = {np.mean(bounds):.0f}$')\n",
    "ax.set_xlabel('Number of mistakes', fontsize=13)\n",
    "ax.set_ylabel('Count', fontsize=13)\n",
    "ax.set_title(f'Distribution of Mistakes ({n_trials} Trials, sep={separation})',\n",
    "             fontsize=13, fontweight='bold')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "# --- Panel 2: Scatter of actual vs bound ---\n",
    "ax = axes[1]\n",
    "ax.scatter(bounds, actual_mistakes, c='steelblue', edgecolors='black',\n",
    "           alpha=0.6, s=40)\n",
    "max_val = max(bounds.max(), actual_mistakes.max()) * 1.1\n",
    "ax.plot([0, max_val], [0, max_val], 'r--', linewidth=2,\n",
    "        label='$k = R^2/\\\\gamma^2$ (equality)')\n",
    "ax.set_xlabel('Theoretical bound $R^2/\\\\gamma^2$', fontsize=13)\n",
    "ax.set_ylabel('Actual mistakes $k$', fontsize=13)\n",
    "ax.set_title('Actual Mistakes vs. Theoretical Bound', fontsize=13,\n",
    "             fontweight='bold')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "ax.set_aspect('equal')\n",
    "ax.set_xlim(0, max_val)\n",
    "ax.set_ylim(0, max_val)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"\\nSummary over {n_trials} trials (separation={separation}):\")\n",
    "print(f\"  Actual mistakes: mean={actual_mistakes.mean():.1f}, \"\n",
    "      f\"std={actual_mistakes.std():.1f}, \"\n",
    "      f\"max={actual_mistakes.max()}, min={actual_mistakes.min()}\")\n",
    "print(f\"  Bound R^2/gamma^2: mean={bounds.mean():.1f}, \"\n",
    "      f\"std={bounds.std():.1f}\")\n",
    "print(f\"  Fraction of bound used: mean={np.mean(actual_mistakes/bounds):.2%}\")\n",
    "print(f\"  Bound is always respected: {np.all(actual_mistakes <= bounds)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-16",
   "metadata": {},
   "source": [
    "### Observations\n",
    "\n",
    "1. The actual number of mistakes $k$ is always at or below the theoretical bound $R^2/\\gamma^2$.\n",
    "2. The bound is often **loose** -- the actual number of mistakes is typically much smaller than the bound.\n",
    "3. The bound is a **worst-case** guarantee, not an average-case prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 6.7 The Bound is Tight\n",
    "\n",
    "Despite appearing loose in practice, the bound $k \\leq R^2/\\gamma^2$ is **tight** in the worst case. There exist datasets for which the perceptron makes exactly $\\lfloor R^2/\\gamma^2 \\rfloor$ mistakes.\n",
    "\n",
    "### Worst-Case Construction (Sketch)\n",
    "\n",
    "Consider the $n$-dimensional dataset consisting of the $n$ standard basis vectors $\\mathbf{e}_1, \\ldots, \\mathbf{e}_n$, all labeled $+1$. The separating direction is $\\mathbf{w}^* = \\frac{1}{\\sqrt{n}}(1, 1, \\ldots, 1)$.\n",
    "\n",
    "- **Margin**: $\\gamma = \\mathbf{w}^* \\cdot \\mathbf{e}_i = 1/\\sqrt{n}$ for each $i$.\n",
    "- **Radius**: $R = \\|\\mathbf{e}_i\\| = 1$.\n",
    "- **Bound**: $R^2/\\gamma^2 = 1/(1/n) = n$.\n",
    "\n",
    "The perceptron, started from $\\mathbf{w} = \\mathbf{0}$ and cycling through $\\mathbf{e}_1, \\ldots, \\mathbf{e}_n$ in order, updates on each one exactly once: before $\\mathbf{e}_i$ has been added, $\\mathbf{w} \\cdot \\mathbf{e}_i = 0$, which counts as a mistake (the prediction must be strictly positive), and the update adds the new basis direction $\\mathbf{e}_i$. The perceptron therefore makes exactly $n$ mistakes, and the bound is achieved.\n",
    "\n",
    "This shows the bound is **tight**: there is no better general bound of the form $C \\cdot R^2/\\gamma^2$ for any constant $C < 1$."
   ]
  },
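  {
   "cell_type": "markdown",
   "id": "cell-17b",
   "metadata": {},
   "source": [
    "The claim that each update adds a new basis direction can be checked directly. The cell below is a quick sanity check of the construction (not part of the formal proof): it runs the perceptron on the basis vectors for a small $n$ and prints the weight vector after each mistake."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-17c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Worst-case construction for n = 4: each mistake should add\n",
    "# exactly one new basis direction to w.\n",
    "n = 4\n",
    "X_check = np.eye(n)  # the n standard basis vectors, all labeled +1\n",
    "w = np.zeros(n)\n",
    "mistakes = 0\n",
    "\n",
    "for i in range(n):\n",
    "    # Before e_i is added, w . e_i = 0; a non-positive activation\n",
    "    # counts as a mistake (the prediction must be strictly positive).\n",
    "    if w @ X_check[i] <= 0:\n",
    "        w = w + X_check[i]  # y_i = +1, so the update adds e_i\n",
    "        mistakes += 1\n",
    "        print(f\"mistake {mistakes}: w = {w}\")\n",
    "\n",
    "print(f\"total mistakes = {mistakes} (bound R^2/gamma^2 = {n})\")"
   ]
  },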
  {
   "cell_type": "markdown",
   "id": "cell-18",
   "metadata": {},
   "source": [
    "## 6.8 Margin and Generalization\n",
    "\n",
    "The convergence theorem reveals a deep connection between **geometric margin** and **learning complexity**:\n",
    "\n",
    "$$\\text{Number of mistakes} \\leq \\frac{R^2}{\\gamma^2}$$\n",
    "\n",
    "This means:\n",
    "- **Larger margin $\\gamma$** $\\Rightarrow$ **fewer mistakes** $\\Rightarrow$ **faster learning**.\n",
    "- **Smaller margin** $\\Rightarrow$ **more mistakes** $\\Rightarrow$ **harder problem**.\n",
    "\n",
    "This insight foreshadows one of the most important ideas in machine learning: the **maximum margin principle**.\n",
    "\n",
    "### Preview: Support Vector Machines\n",
    "\n",
    "While the perceptron finds *any* separating hyperplane, the **Support Vector Machine** (SVM) finds the one with the **largest margin** $\\gamma^*$. The motivation comes directly from the convergence theorem:\n",
    "\n",
    "- Among all separating hyperplanes, the one with the largest margin gives the tightest bound on the number of perceptron mistakes.\n",
    "- More importantly, larger margins correlate with better **generalization** (performance on unseen data).\n",
    "\n",
    "The SVM grew out of the generalized portrait algorithm of Vapnik and Lerner (1963) and was developed into a practical algorithm by Boser, Guyon, and Vapnik (1992). The margin concept from the perceptron convergence theorem is at its theoretical core.\n",
    "\n",
    "### Connection to PAC Learning\n",
    "\n",
    "In the framework of Probably Approximately Correct (PAC) learning:\n",
    "- The ratio $R/\\gamma$ determines the effective complexity of the classification problem.\n",
    "- This ratio is related to the **fat-shattering dimension**, a generalization of the VC dimension for margin classifiers.\n",
    "- Sample complexity bounds for linear classifiers depend on $R/\\gamma$, not on the ambient dimension $n$."
   ]
  },
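  {
   "cell_type": "markdown",
   "id": "cell-18b",
   "metadata": {},
   "source": [
    "The gap between an arbitrary separating hyperplane and the maximum-margin one can be seen numerically. The cell below is an illustrative sketch, restricted for simplicity to 2-D hyperplanes through the origin: it brute-forces the best achievable margin over a grid of unit directions and compares it with the margin of whatever separator the perceptron happens to find."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Toy separable data in 2-D with {-1, +1} labels (through-origin case).\n",
    "rng = np.random.default_rng(0)\n",
    "X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))\n",
    "X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))\n",
    "X_toy = np.vstack([X_pos, X_neg])\n",
    "y_toy = np.array([1] * 20 + [-1] * 20)\n",
    "\n",
    "def geometric_margin(w, X, y):\n",
    "    \"\"\"Margin of the unit-normalized direction w on (X, y).\"\"\"\n",
    "    w = w / np.linalg.norm(w)\n",
    "    return np.min(y * (X @ w))\n",
    "\n",
    "# Through-origin perceptron with the {-1, +1} convention.\n",
    "w = np.zeros(2)\n",
    "for _ in range(100):\n",
    "    errors = 0\n",
    "    for xi, yi in zip(X_toy, y_toy):\n",
    "        if yi * (w @ xi) <= 0:\n",
    "            w = w + yi * xi\n",
    "            errors += 1\n",
    "    if errors == 0:\n",
    "        break\n",
    "\n",
    "# Brute-force the maximum margin over a grid of directions.\n",
    "thetas = np.linspace(0, np.pi, 2000)\n",
    "directions = np.column_stack([np.cos(thetas), np.sin(thetas)])\n",
    "gamma_star = max(geometric_margin(d, X_toy, y_toy) for d in directions)\n",
    "\n",
    "print(f\"perceptron margin:            {geometric_margin(w, X_toy, y_toy):.3f}\")\n",
    "print(f\"maximum margin (grid search): {gamma_star:.3f}\")"
   ]
  },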
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## 6.9 What Happens Without Separability?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Generate NON-separable data: overlapping Gaussian clusters\n",
    "np.random.seed(42)\n",
    "n_per_class = 40\n",
    "\n",
    "X0 = np.random.randn(n_per_class, 2) * 1.0 + np.array([-0.5, 0])\n",
    "X1 = np.random.randn(n_per_class, 2) * 1.0 + np.array([0.5, 0])\n",
    "X_nonsep = np.vstack([X0, X1])\n",
    "y_nonsep = np.array([0] * n_per_class + [1] * n_per_class)\n",
    "\n",
    "# Run perceptron for many epochs -- it should NOT converge\n",
    "class PerceptronTracker:\n",
    "    \"\"\"Perceptron that tracks detailed history for non-separable data analysis.\"\"\"\n",
    "    \n",
    "    def __init__(self, n_features, max_epochs=100):\n",
    "        self.weights = np.zeros(n_features)\n",
    "        self.bias = 0.0\n",
    "        self.max_epochs = max_epochs\n",
    "        self.errors_per_epoch = []\n",
    "        self.accuracy_per_epoch = []\n",
    "    \n",
    "    def fit(self, X, y):\n",
    "        X = np.atleast_2d(X)\n",
    "        n_samples = X.shape[0]\n",
    "        self.weights = np.zeros(X.shape[1])\n",
    "        self.bias = 0.0\n",
    "        self.errors_per_epoch = []\n",
    "        self.accuracy_per_epoch = []\n",
    "        \n",
    "        for epoch in range(self.max_epochs):\n",
    "            errors = 0\n",
    "            for i in range(n_samples):\n",
    "                y_hat = int(np.dot(self.weights, X[i]) + self.bias >= 0)\n",
    "                if y_hat != y[i]:\n",
    "                    update = (y[i] - y_hat)\n",
    "                    self.weights += update * X[i]\n",
    "                    self.bias += update\n",
    "                    errors += 1\n",
    "            \n",
    "            self.errors_per_epoch.append(errors)\n",
    "            self.accuracy_per_epoch.append(1 - errors / n_samples)\n",
    "            \n",
    "            if errors == 0:\n",
    "                break\n",
    "        \n",
    "        return self\n",
    "\n",
    "\n",
    "p_nonsep = PerceptronTracker(n_features=2, max_epochs=200)\n",
    "p_nonsep.fit(X_nonsep, y_nonsep)\n",
    "\n",
    "print(f\"Epochs run: {len(p_nonsep.errors_per_epoch)}\")\n",
    "print(f\"Converged: {p_nonsep.errors_per_epoch[-1] == 0}\")\n",
    "print(f\"Errors in last 5 epochs: {p_nonsep.errors_per_epoch[-5:]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# --- Panel 1: Errors per epoch (oscillating) ---\n",
    "ax = axes[0]\n",
    "epochs_range = range(1, len(p_nonsep.errors_per_epoch) + 1)\n",
    "ax.plot(list(epochs_range), p_nonsep.errors_per_epoch, 'b-', linewidth=1, alpha=0.7)\n",
    "ax.fill_between(list(epochs_range), p_nonsep.errors_per_epoch, alpha=0.15, color='red')\n",
    "ax.axhline(y=0, color='green', linestyle='--', linewidth=1.5, alpha=0.5,\n",
    "           label='Target: 0 errors')\n",
    "ax.set_xlabel('Epoch', fontsize=13)\n",
    "ax.set_ylabel('Errors per Epoch', fontsize=13)\n",
    "ax.set_title('Non-Separable Data: Errors Oscillate Forever',\n",
    "             fontsize=13, fontweight='bold', color='red')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "# --- Panel 2: Decision boundary at final epoch ---\n",
    "ax = axes[1]\n",
    "\n",
    "x_min, x_max = X_nonsep[:, 0].min() - 1, X_nonsep[:, 0].max() + 1\n",
    "y_min, y_max = X_nonsep[:, 1].min() - 1, X_nonsep[:, 1].max() + 1\n",
    "xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),\n",
    "                     np.linspace(y_min, y_max, 200))\n",
    "Z = xx * p_nonsep.weights[0] + yy * p_nonsep.weights[1] + p_nonsep.bias\n",
    "\n",
    "ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10],\n",
    "            colors=['#FFCCCC', '#CCCCFF'], alpha=0.3)\n",
    "ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)\n",
    "\n",
    "ax.scatter(X_nonsep[y_nonsep == 0, 0], X_nonsep[y_nonsep == 0, 1],\n",
    "           c='red', marker='o', s=40, edgecolors='black', alpha=0.7,\n",
    "           label='Class 0')\n",
    "ax.scatter(X_nonsep[y_nonsep == 1, 0], X_nonsep[y_nonsep == 1, 1],\n",
    "           c='blue', marker='s', s=40, edgecolors='black', alpha=0.7,\n",
    "           label='Class 1')\n",
    "\n",
    "ax.set_xlabel('$x_1$', fontsize=13)\n",
    "ax.set_ylabel('$x_2$', fontsize=13)\n",
    "ax.set_title('Non-Separable Data: No Perfect Boundary Exists',\n",
    "             fontsize=13, fontweight='bold', color='red')\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-22",
   "metadata": {},
   "source": [
    "### The Pocket Algorithm\n",
    "\n",
    "When data is not linearly separable, the standard perceptron never converges. The **Pocket Algorithm** (Gallant, 1990) is a simple modification:\n",
    "\n",
    "1. Run the standard perceptron algorithm.\n",
    "2. After each epoch, evaluate the current weights on the entire training set.\n",
    "3. Keep a \"pocket\" copy of the **best weights seen so far** (the ones with the fewest errors).\n",
    "4. Return the pocket weights when the algorithm terminates.\n",
    "\n",
    "This gives an approximate solution that minimizes training error, even when perfect separation is impossible. It is a precursor to more sophisticated approaches like the **relaxation algorithm** and **gradient descent on smooth loss functions**."
   ]
  },
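  {
   "cell_type": "markdown",
   "id": "cell-22b",
   "metadata": {},
   "source": [
    "The pocket idea is only a few lines on top of the standard loop. The cell below is a minimal sketch of the per-epoch variant described above, using the same 0/1 label convention as the earlier cells; the overlapping clusters are regenerated so the cell is self-contained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def pocket_perceptron(X, y, max_epochs=200):\n",
    "    \"\"\"Standard perceptron updates, but remember the weights with the\n",
    "    fewest training errors seen after any epoch (the 'pocket').\"\"\"\n",
    "    n_samples, n_features = X.shape\n",
    "    w, b = np.zeros(n_features), 0.0\n",
    "    best_w, best_b = w.copy(), b\n",
    "    best_errors = n_samples + 1\n",
    "    \n",
    "    for _ in range(max_epochs):\n",
    "        for i in range(n_samples):\n",
    "            y_hat = int(w @ X[i] + b >= 0)\n",
    "            if y_hat != y[i]:\n",
    "                update = y[i] - y_hat  # +1 or -1 with 0/1 labels\n",
    "                w = w + update * X[i]\n",
    "                b = b + update\n",
    "        # Evaluate the current weights on the whole training set.\n",
    "        preds = (X @ w + b >= 0).astype(int)\n",
    "        errors = int(np.sum(preds != y))\n",
    "        if errors < best_errors:\n",
    "            best_errors = errors\n",
    "            best_w, best_b = w.copy(), b\n",
    "        if errors == 0:\n",
    "            break\n",
    "    return best_w, best_b, best_errors\n",
    "\n",
    "\n",
    "# Overlapping Gaussian clusters, as in the non-separable demo above.\n",
    "rng = np.random.default_rng(42)\n",
    "X0_p = rng.normal(loc=[-0.5, 0.0], scale=1.0, size=(40, 2))\n",
    "X1_p = rng.normal(loc=[0.5, 0.0], scale=1.0, size=(40, 2))\n",
    "X_pocket = np.vstack([X0_p, X1_p])\n",
    "y_pocket = np.array([0] * 40 + [1] * 40)\n",
    "\n",
    "w_p, b_p, best = pocket_perceptron(X_pocket, y_pocket)\n",
    "print(f\"best training errors found: {best} / {len(y_pocket)}\")"
   ]
  },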
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## 6.10 Exercises\n",
    "\n",
    "### Exercise 6.1: Compute R and gamma by Hand\n",
    "\n",
    "Consider the dataset (using $\\{-1, +1\\}$ labels and the bias trick with $x_0 = 1$):\n",
    "\n",
    "| $x_1$ | $x_2$ | $y$ | Augmented: $(1, x_1, x_2)$ |\n",
    "|-------|-------|-----|--------------------------|\n",
    "| 1 | 1 | +1 | (1, 1, 1) |\n",
    "| 2 | 2 | +1 | (1, 2, 2) |\n",
    "| -1 | -1 | -1 | (1, -1, -1) |\n",
    "| -2 | 0 | -1 | (1, -2, 0) |\n",
    "\n",
    "1. Compute $R = \\max_i \\|\\tilde{\\mathbf{x}}_i\\|$ for the augmented vectors.\n",
    "2. Find a separating unit vector $\\mathbf{w}^*$ and compute the margin $\\gamma = \\min_i y_i(\\mathbf{w}^* \\cdot \\tilde{\\mathbf{x}}_i)$.\n",
    "3. What is the convergence bound $R^2/\\gamma^2$?\n",
    "4. Run the perceptron by hand (or in code) and count the actual number of mistakes. Is it within the bound?\n",
    "\n",
    "### Exercise 6.2: Dimension Independence\n",
    "\n",
    "Prove that the bound $k \\leq R^2/\\gamma^2$ does not explicitly depend on the dimensionality $n$. That is, adding irrelevant features (columns of zeros) to the data does not change $R$ or $\\gamma$.\n",
    "\n",
    "*Hint*: If $\\tilde{\\mathbf{x}}_i = (\\mathbf{x}_i, 0, \\ldots, 0)$ and $\\tilde{\\mathbf{w}}^* = (\\mathbf{w}^*, 0, \\ldots, 0)$, compute $\\|\\tilde{\\mathbf{x}}_i\\|$ and $\\tilde{\\mathbf{w}}^* \\cdot \\tilde{\\mathbf{x}}_i$.\n",
    "\n",
    "### Exercise 6.3: Achieving the Bound\n",
    "\n",
    "Consider the dataset in $\\mathbb{R}^n$ consisting of the $n$ standard basis vectors $\\mathbf{e}_1, \\ldots, \\mathbf{e}_n$, all with label $y = +1$.\n",
    "\n",
    "1. What is $R$?\n",
    "2. What is the optimal separating direction $\\mathbf{w}^*$ and the corresponding $\\gamma$?\n",
    "3. Show that $R^2/\\gamma^2 = n$.\n",
    "4. Run the perceptron (presenting $\\mathbf{e}_1, \\mathbf{e}_2, \\ldots, \\mathbf{e}_n$ in order). How many mistakes does it make? Does it match the bound?"
   ]
  },
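  {
   "cell_type": "markdown",
   "id": "cell-23b",
   "metadata": {},
   "source": [
    "The cell below checks the arithmetic for Exercise 6.1 with one candidate separator, $\\mathbf{w}^* \\propto (0, 1, 1)$ -- one valid choice among many, not necessarily the maximum-margin one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Exercise 6.1: the four augmented points with {-1, +1} labels.\n",
    "X_aug = np.array([[1.0, 1.0, 1.0],\n",
    "                  [1.0, 2.0, 2.0],\n",
    "                  [1.0, -1.0, -1.0],\n",
    "                  [1.0, -2.0, 0.0]])\n",
    "y_ex = np.array([1, 1, -1, -1])\n",
    "\n",
    "R = np.max(np.linalg.norm(X_aug, axis=1))  # ||(1, 2, 2)|| = 3\n",
    "\n",
    "# One candidate separating unit vector (not necessarily optimal).\n",
    "w_star = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)\n",
    "gamma = np.min(y_ex * (X_aug @ w_star))\n",
    "print(f\"R = {R:.3f}, gamma = {gamma:.3f}, bound = {R**2 / gamma**2:.1f}\")\n",
    "\n",
    "# Run the perceptron ({-1, +1} convention) and count mistakes.\n",
    "w = np.zeros(3)\n",
    "mistakes = 0\n",
    "for _ in range(100):\n",
    "    errors = 0\n",
    "    for xi, yi in zip(X_aug, y_ex):\n",
    "        if yi * (w @ xi) <= 0:\n",
    "            w = w + yi * xi\n",
    "            mistakes += 1\n",
    "            errors += 1\n",
    "    if errors == 0:\n",
    "        break\n",
    "print(f\"actual mistakes = {mistakes} (within the bound: {mistakes <= R**2 / gamma**2})\")"
   ]
  },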
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Exercise 6.3: Verification for the tight bound construction\n",
    "print(\"=\" * 60)\n",
    "print(\"Exercise 6.3: Achieving the Bound\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "for n in [2, 3, 5, 10, 20, 50]:\n",
    "    # n standard basis vectors, all labeled +1\n",
    "    X_basis = np.eye(n)\n",
    "    y_basis = np.ones(n)\n",
    "    \n",
    "    # Optimal w* = (1/sqrt(n), ..., 1/sqrt(n))\n",
    "    w_star = np.ones(n) / np.sqrt(n)\n",
    "    \n",
    "    R = np.max(np.linalg.norm(X_basis, axis=1))  # = 1\n",
    "    gamma = np.min(y_basis * (X_basis @ w_star))   # = 1/sqrt(n)\n",
    "    bound = R**2 / gamma**2\n",
    "    \n",
    "    # Run perceptron (all labels are +1, so each update adds the point)\n",
    "    w = np.zeros(n)\n",
    "    mistakes = 0\n",
    "    for epoch in range(n + 5):\n",
    "        no_errors = True\n",
    "        for i in range(n):\n",
    "            if w @ X_basis[i] <= 0:  # misclassification (should be > 0)\n",
    "                w = w + X_basis[i]\n",
    "                mistakes += 1\n",
    "                no_errors = False\n",
    "        if no_errors:\n",
    "            break\n",
    "    \n",
    "    print(f\"  n={n:>3d}: R={R:.1f}, gamma=1/sqrt({n})={gamma:.4f}, \"\n",
    "          f\"bound={bound:.0f}, actual={mistakes}, \"\n",
    "          f\"match={'YES' if mistakes == n else 'NO'}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}