Chapter 6: Index of Coincidence#
“The index of coincidence is a number which measures the unevenness of the frequency distribution of a given sample of text.” – William F. Friedman, Riverbank Publication No. 22, 1922
In Chapter 5 we saw how the Kasiski examination exploits repeated n-grams in Vigenère ciphertext to recover the key length. That method relies on the ciphertext being long enough for repeated patterns to appear – a condition that is not always met. In this chapter we introduce a fundamentally different approach: Friedman’s Index of Coincidence (IoC), a statistical measure that works even on shorter texts and requires no repeated n-grams at all.
The Index of Coincidence represents one of the most significant intellectual advances in the history of cryptanalysis. It was the first time that a purely statistical quantity – rather than pattern matching – was used systematically to break ciphers.
6.1 History: William F. Friedman and the Statistical Revolution#
William Frederick Friedman (1891–1969) is widely regarded as the father of American cryptanalysis. Born Wolfe Friedman in Kishinev (modern-day Chisinau, Moldova), he emigrated to the United States as a child. Originally trained in genetics at Cornell University, Friedman was hired in 1915 by George Fabyan to work at the Riverbank Laboratories in Geneva, Illinois – a private research estate that had an active cryptanalysis department.
Between 1920 and 1922, Friedman wrote a series of monographs published as the Riverbank Publications. The most influential of these was Publication No. 22, titled “The Index of Coincidence and Its Applications in Cryptanalysis” (1922). In this work, Friedman introduced a statistical quantity that measures the probability that two randomly chosen letters from a text are the same. This simple idea had profound consequences:
Why the IoC mattered
Before Friedman, breaking polyalphabetic ciphers required identifying repeated patterns (Kasiski, 1863) or extensive trial-and-error. The IoC provided an automatic, quantitative test for key length that:
Works on texts too short for the Kasiski method
Does not require the analyst to find repeated n-grams
Gives a numerical score that can be computed mechanically
Is invariant under monoalphabetic substitution
Friedman went on to lead the U.S. Army’s Signals Intelligence Service (SIS), where he and his team broke the Japanese diplomatic cipher PURPLE in 1940 – one of the great cryptanalytic achievements of World War II. His wife, Elizebeth Smith Friedman, was herself a distinguished cryptanalyst who broke smuggling and espionage ciphers for the U.S. Coast Guard and Treasury Department.
Historical note
The term index of coincidence comes from the probability interpretation: it is the probability that two randomly chosen letters from a text coincide (are the same letter). Friedman denoted this quantity by the Greek letter kappa (\(\kappa\)).
6.2 Formal Definitions#
6.2.1 The Index of Coincidence#
Let \(x = x_1 x_2 \cdots x_N\) be a text of length \(N\) over an alphabet \(\mathcal{A}\) of size \(c\) (for English, \(c = 26\)). Let \(f_i\) denote the number of occurrences of the \(i\)-th letter of the alphabet in \(x\), so that \(\sum_{i=0}^{c-1} f_i = N\).
Definition 6.1 (Index of Coincidence)
The Index of Coincidence of the text \(x\) is defined as:
This is the probability that two letters chosen at random (without replacement) from the text are identical.
6.2.2 Expected IoC for English: \(\kappa_p\)#
For natural English text, the letter frequencies are far from uniform. Let \(p_i\) denote the probability of the \(i\)-th letter in English. The expected IoC for English plaintext is:
This value arises because certain letters (E, T, A, O, …) are much more frequent than others, concentrating the probability mass and increasing the coincidence rate.
6.2.3 Expected IoC for Random Text: \(\kappa_r\)#
If all 26 letters are equally likely (a uniform distribution), the expected IoC is:
Key insight
The contrast between \(\kappa_p \approx 0.0667\) and \(\kappa_r \approx 0.0385\) is the engine of the IoC method. Monoalphabetic substitution preserves the IoC (it merely relabels letters), while polyalphabetic substitution with a key of length \(L > 1\) drives the IoC toward \(\kappa_r\).
6.2.4 Mutual Index of Coincidence#
Given two texts \(x\) of length \(N\) and \(y\) of length \(M\), with letter frequency counts \(f_i\) and \(g_i\) respectively, the Mutual Index of Coincidence is:
This measures the probability that a letter drawn at random from \(x\) and a letter drawn at random from \(y\) are the same. If both texts were enciphered with the same monoalphabetic key, the MIC will be close to \(\kappa_p\). If they were enciphered with different keys, it will be close to \(\kappa_r\).
6.3 Implementation#
We now build an IndexOfCoincidence class that encapsulates all the IoC computations needed for cryptanalysis.
import numpy as np
class IndexOfCoincidence:
"""Tools for computing and applying the Index of Coincidence."""
# Standard English letter frequencies (A-Z)
ENGLISH_FREQ = np.array([
0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, # A-F
0.02015, 0.06094, 0.06966, 0.00153, 0.00772, 0.04025, # G-L
0.02406, 0.06749, 0.07507, 0.01929, 0.00095, 0.05987, # M-R
0.06327, 0.09056, 0.02758, 0.00978, 0.02360, 0.00150, # S-X
0.01974, 0.00074 # Y-Z
])
def __init__(self, alphabet_size=26):
self.c = alphabet_size
@staticmethod
def _clean(text):
"""Remove non-alphabetic characters and convert to lowercase."""
return ''.join(ch.lower() for ch in text if ch.isalpha())
@staticmethod
def _letter_counts(text):
"""Return an array of 26 letter counts for cleaned text."""
cleaned = IndexOfCoincidence._clean(text)
counts = np.zeros(26, dtype=int)
for ch in cleaned:
counts[ord(ch) - ord('a')] += 1
return counts
def compute_ioc(self, text):
"""Compute the Index of Coincidence of a text.
IC(x) = sum_i f_i(f_i - 1) / (N(N-1))
"""
counts = self._letter_counts(text)
N = counts.sum()
if N <= 1:
return 0.0
numerator = np.sum(counts * (counts - 1))
denominator = N * (N - 1)
return float(numerator / denominator)
def expected_ioc_english(self):
"""Return the expected IoC for English text: kappa_p."""
return float(np.sum(self.ENGLISH_FREQ ** 2))
def expected_ioc_random(self):
"""Return the expected IoC for uniformly random text: kappa_r = 1/c."""
return 1.0 / self.c
def mutual_ioc(self, text1, text2):
"""Compute the Mutual Index of Coincidence between two texts.
MIC(x,y) = sum_i f_i * g_i / (N * M)
"""
counts1 = self._letter_counts(text1)
counts2 = self._letter_counts(text2)
N = counts1.sum()
M = counts2.sum()
if N == 0 or M == 0:
return 0.0
return float(np.sum(counts1 * counts2) / (N * M))
def estimate_key_length(self, ciphertext, max_key_length=20):
"""Estimate key length by computing average IoC for columns.
For each candidate key length L, split the ciphertext into L
columns and compute the average IoC across columns. The correct
key length yields an average IoC close to kappa_p (~0.0667).
Returns:
ioc_by_length: dict mapping L -> average IoC for L columns
"""
cleaned = self._clean(ciphertext)
N = len(cleaned)
ioc_by_length = {}
for L in range(1, max_key_length + 1):
# Split into L columns
columns = ['' for _ in range(L)]
for idx, ch in enumerate(cleaned):
columns[idx % L] += ch
# Compute IoC for each column and average
ioc_values = []
for col in columns:
if len(col) > 1:
ioc_values.append(self.compute_ioc(col))
if ioc_values:
ioc_by_length[L] = float(np.mean(ioc_values))
else:
ioc_by_length[L] = 0.0
return ioc_by_length
# Quick verification
ioc_tool = IndexOfCoincidence()
print(f"Expected IoC for English (kappa_p): {float(ioc_tool.expected_ioc_english()):.6f}")
print(f"Expected IoC for random (kappa_r): {float(ioc_tool.expected_ioc_random()):.6f}")
print(f"Ratio kappa_p / kappa_r: {float(ioc_tool.expected_ioc_english() / ioc_tool.expected_ioc_random()):.2f}")
Expected IoC for English (kappa_p): 0.065497
Expected IoC for random (kappa_r): 0.038462
Ratio kappa_p / kappa_r: 1.70
Implementation note
The _letter_counts method converts all text to lowercase and discards non-alphabetic characters before counting. This ensures that spaces, punctuation, and case variations do not affect the IoC computation.
We also need the Vigenère cipher for our experiments.
import numpy as np
def vigenere_encrypt(plaintext, key):
"""Encrypt plaintext with the Vigenere cipher.
Non-alphabetic characters are stripped. Output is lowercase.
"""
cleaned = ''.join(ch.lower() for ch in plaintext if ch.isalpha())
key_lower = key.lower()
key_len = len(key_lower)
result = []
for i, ch in enumerate(cleaned):
p = ord(ch) - ord('a')
k = ord(key_lower[i % key_len]) - ord('a')
c = (p + k) % 26
result.append(chr(c + ord('a')))
return ''.join(result)
def vigenere_decrypt(ciphertext, key):
"""Decrypt ciphertext with the Vigenere cipher."""
cleaned = ''.join(ch.lower() for ch in ciphertext if ch.isalpha())
key_lower = key.lower()
key_len = len(key_lower)
result = []
for i, ch in enumerate(cleaned):
c = ord(ch) - ord('a')
k = ord(key_lower[i % key_len]) - ord('a')
p = (c - k) % 26
result.append(chr(p + ord('a')))
return ''.join(result)
def caesar_encrypt(plaintext, shift):
"""Caesar cipher (monoalphabetic shift)."""
cleaned = ''.join(ch.lower() for ch in plaintext if ch.isalpha())
return ''.join(chr((ord(ch) - ord('a') + shift) % 26 + ord('a'))
for ch in cleaned)
# Test round-trip
test_plain = "thequickbrownfoxjumpsoverthelazydog"
test_key = "secret"
test_cipher = vigenere_encrypt(test_plain, test_key)
test_recovered = vigenere_decrypt(test_cipher, test_key)
print(f"Plaintext: {test_plain}")
print(f"Key: {test_key}")
print(f"Ciphertext: {test_cipher}")
print(f"Recovered: {test_recovered}")
print(f"Round-trip: {'PASS' if test_recovered == test_plain else 'FAIL'}")
Plaintext: thequickbrownfoxjumpsoverthelazydog
Key: secret
Ciphertext: llghybuodispfjqonnetufzxjxjvptrcffk
Recovered: thequickbrownfoxjumpsoverthelazydog
Round-trip: PASS
6.4 Key Results#
6.4.1 Friedman’s Theorem#
The central theoretical result underpinning the IoC method is the following.
Theorem 6.1 (Friedman, 1922)
For a polyalphabetic cipher with key length \(L\) applied to a plaintext drawn from a natural language with \(\kappa_p = \sum p_i^2\), the expected Index of Coincidence of the ciphertext is approximately:
where \(\kappa_r = 1/c\) is the random coincidence rate.
Proof sketch. Consider a ciphertext of length \(N\) produced by a Vigenère cipher with key length \(L\). Position \(j\) in the ciphertext was produced by shifting plaintext letter \(j\) by \(k_{j \bmod L}\). Two randomly chosen ciphertext positions \(j_1\) and \(j_2\) will coincide in letter value if:
They fall in the same coset (i.e., \(j_1 \equiv j_2 \pmod{L}\)): both are shifted by the same key letter, so the ciphertext letters coincide whenever the plaintext letters coincide. This happens with probability \(\kappa_p\).
They fall in different cosets: they are shifted by different (independent) key letters, so coincidence occurs essentially at the random rate \(\kappa_r\).
The probability of falling in the same coset is approximately \(1/L\), and in different cosets approximately \((L-1)/L\). Therefore:
6.4.2 Key Length Estimation#
Corollary 6.2 (Friedman’s formula for key length)
Given the observed IoC of a ciphertext \(\mathrm{IC}_{\text{obs}}\), the key length \(L\) can be estimated by:
For English (\(\kappa_p \approx 0.0667\), \(\kappa_r \approx 0.0385\)):
6.4.3 Column-Based Method#
In practice, a more reliable approach than Friedman’s direct formula is the column method:
For each candidate key length \(L = 1, 2, \ldots, L_{\max}\), split the ciphertext into \(L\) columns (column \(j\) contains every \(L\)-th character starting at position \(j\)).
Compute the IoC of each column.
Average the column IoC values.
The correct \(L\) gives an average IoC close to \(\kappa_p \approx 0.0667\), because each column is enciphered by a single Caesar shift.
6.4.4 Comparison with Kasiski#
Feature |
Kasiski Examination |
Index of Coincidence |
|---|---|---|
Principle |
Repeated n-gram distances |
Statistical frequency measure |
Minimum text length |
Needs repeated patterns (long text) |
Works on shorter texts |
Output |
GCD of distances (exact divisors) |
Numerical estimate (continuous) |
Automation |
Requires pattern search + GCD |
Simple arithmetic |
Robustness |
Can fail if no repeats found |
Always produces a value |
Experiments#
Experiment 1: IoC is Invariant Under Monoalphabetic Substitution#
We compute the IoC for English plaintext, random text, and several Caesar-shifted versions to demonstrate that the IoC is preserved under monoalphabetic substitution.
import numpy as np
# A substantial English text sample
english_text = (
"Ciphers oh ciphers secrets they keep "
"Locked away in code deep and discreet "
"Symbols and numbers jumbled and mixed "
"A message encoded ready to be fixed "
"Through history they have been used to hide "
"Plans of war love letters beside "
"The Rosetta Stone a mystery solved "
"A codebreakers work duly evolved "
"From Caesar to Enigma through time they change "
"Yet the essence remains mysterious and strange "
"A puzzle to solve a game to play "
"A message to decipher in every way "
"Crack the code reveal the truth "
"A challenge to minds sharp and astute "
"For ciphers oh ciphers they hold the key "
"To secrets untold waiting to be set free "
"Deep within the caves so dark and cold "
"The dwarves they sat and stories told "
"Of treasure hoards and ancient lore "
"And secrets kept for evermore "
"But in their hands they held a tool "
"A wondrous thing so sharp and cool "
"A blade of steel that could carve letters "
"And keep their secrets safe forever"
)
# Create the IoC tool
ioc_tool = IndexOfCoincidence()
# IoC of original English text
ioc_english = ioc_tool.compute_ioc(english_text)
print(f"IoC of English plaintext: {float(ioc_english):.6f}")
# IoC of random text (uniformly random letters)
rng = np.random.default_rng(42)
random_text = ''.join(chr(rng.integers(0, 26) + ord('a')) for _ in range(len(english_text)))
ioc_random = ioc_tool.compute_ioc(random_text)
print(f"IoC of random text: {float(ioc_random):.6f}")
# IoC of Caesar-shifted texts
print("\nIoC under Caesar shifts:")
print(f"{'Shift':>6s} {'IoC':>10s}")
print("-" * 20)
for shift in [0, 3, 7, 13, 19, 25]:
shifted = caesar_encrypt(english_text, shift)
ioc_shifted = ioc_tool.compute_ioc(shifted)
print(f"{int(shift):>6d} {float(ioc_shifted):>10.6f}")
print(f"\nExpected kappa_p: {float(ioc_tool.expected_ioc_english()):.6f}")
print(f"Expected kappa_r: {float(ioc_tool.expected_ioc_random()):.6f}")
IoC of English plaintext: 0.071538
IoC of random text: 0.039058
IoC under Caesar shifts:
Shift IoC
--------------------
0 0.071538
3 0.071538
7 0.071538
13 0.071538
19 0.071538
25 0.071538
Expected kappa_p: 0.065497
Expected kappa_r: 0.038462
Observation
The IoC is exactly the same for all Caesar shifts. This is because a Caesar shift is a bijection on the alphabet – it permutes the letter counts but does not change the multiset of counts. Since the IoC formula depends only on the counts \(\{f_i\}\), it is invariant under any permutation of the alphabet.
Experiment 2: IoC vs. Key Length for Vigenère Ciphertext#
We encrypt the English text with a Vigenère key of length 6, then compute the average column IoC for trial key lengths \(L = 1, 2, \ldots, 12\). The correct key length should produce a peak near \(\kappa_p\).
import numpy as np
import matplotlib.pyplot as plt
# Encrypt with a key of length 6
key = "hunter"
ciphertext = vigenere_encrypt(english_text, key)
print(f"Plaintext length: {len(ioc_tool._clean(english_text))} characters")
print(f"Key: '{key}' (length {len(key)})")
print(f"Ciphertext (first 80 chars): {ciphertext[:80]}")
print(f"\nIoC of full ciphertext: {float(ioc_tool.compute_ioc(ciphertext)):.6f}")
# Estimate key length by column IoC
max_L = 12
ioc_by_L = ioc_tool.estimate_key_length(ciphertext, max_key_length=max_L)
print(f"\n{'L':>3s} {'Avg Column IoC':>15s} {'Assessment'}")
print("-" * 45)
for L in range(1, max_L + 1):
avg_ioc = ioc_by_L[L]
marker = " <-- PEAK" if avg_ioc > 0.060 else ""
print(f"{int(L):>3d} {float(avg_ioc):>15.6f} {marker}")
Plaintext length: 735 characters
Key: 'hunter' (length 6)
Ciphertext (first 80 chars): jccaiiziuvmgoyelwvjlrmwkoyldivwfbvovkujtczuwbwiulyctrukcfvvvlnfrqsvfftruuozuiizd
IoC of full ciphertext: 0.044249
L Avg Column IoC Assessment
---------------------------------------------
1 0.044249
2 0.047276
3 0.055169
4 0.047285
5 0.043817
6 0.070684 <-- PEAK
7 0.042595
8 0.046647
9 0.056016
10 0.046104
11 0.042155
12 0.071212 <-- PEAK
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
# Plot IoC vs trial key length
lengths = list(ioc_by_L.keys())
ioc_values = list(ioc_by_L.values())
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(lengths, ioc_values, color='steelblue', edgecolor='navy', alpha=0.8)
ax.axhline(y=0.0667, color='green', linestyle='--', linewidth=1.5,
label=r'$\kappa_p \approx 0.0667$ (English)')
ax.axhline(y=0.0385, color='red', linestyle='--', linewidth=1.5,
label=r'$\kappa_r \approx 0.0385$ (random)')
ax.set_xlabel('Trial key length $L$', fontsize=12)
ax.set_ylabel('Average column IoC', fontsize=12)
ax.set_title('Index of Coincidence vs. Trial Key Length\n'
f'(true key = "{key}", length = {len(key)})', fontsize=13)
ax.set_xticks(lengths)
ax.legend(fontsize=11)
ax.set_ylim(0.02, 0.08)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('ioc_vs_key_length.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: ioc_vs_key_length.png")
Figure saved: ioc_vs_key_length.png
Observation
The bar chart shows a clear peak at \(L = 6\) (the true key length) and also at \(L = 12\) (a multiple of 6). At \(L = 6\), the average column IoC is near \(\kappa_p \approx 0.0667\), confirming that each column is governed by a single Caesar shift. For wrong key lengths, the average column IoC hovers closer to \(\kappa_r \approx 0.0385\).
Experiment 3: Full Attack – IoC for Key Length, Chi-Squared for Key Recovery#
Having determined the key length, we recover each key letter by minimizing the chi-squared statistic between the column’s shifted frequency distribution and the expected English frequencies.
import numpy as np
# English letter frequencies
ENGLISH_FREQ = np.array([
0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228,
0.02015, 0.06094, 0.06966, 0.00153, 0.00772, 0.04025,
0.02406, 0.06749, 0.07507, 0.01929, 0.00095, 0.05987,
0.06327, 0.09056, 0.02758, 0.00978, 0.02360, 0.00150,
0.01974, 0.00074
])
def chi_squared_shift(column_text, shift, expected_freq):
"""Compute chi-squared statistic for a column after applying a shift.
We shift the column letters back by `shift` and compare the resulting
frequency distribution to the expected English frequencies.
"""
cleaned = ''.join(ch.lower() for ch in column_text if ch.isalpha())
N = len(cleaned)
if N == 0:
return float('inf')
# Count letter frequencies after shifting back
counts = np.zeros(26)
for ch in cleaned:
shifted_idx = (ord(ch) - ord('a') - shift) % 26
counts[shifted_idx] += 1
observed_freq = counts / N
# Chi-squared statistic
chi2 = np.sum((observed_freq - expected_freq) ** 2 / expected_freq)
return chi2
def recover_key(ciphertext, key_length):
"""Recover the Vigenere key given the key length.
For each column, try all 26 shifts and pick the one that minimizes
the chi-squared statistic against English letter frequencies.
"""
cleaned = ''.join(ch.lower() for ch in ciphertext if ch.isalpha())
key = []
for col_idx in range(key_length):
# Extract column
column = cleaned[col_idx::key_length]
# Try all 26 shifts, pick the one with smallest chi-squared
best_shift = 0
best_chi2 = float('inf')
for shift in range(26):
chi2 = chi_squared_shift(column, shift, ENGLISH_FREQ)
if chi2 < best_chi2:
best_chi2 = chi2
best_shift = shift
key.append(chr(best_shift + ord('a')))
return ''.join(key)
def select_best_key_length(ioc_results, threshold=0.060):
"""Select the most likely key length from IoC results.
Among all candidate lengths whose average column IoC exceeds the
threshold, prefer the smallest L. Additionally, if a candidate L
is a multiple of a smaller candidate L', prefer L' (divisor
analysis: a doubled key gives equally high IoC).
"""
# Collect candidates that exceed the threshold
candidates = [L for L, ioc in sorted(ioc_results.items())
if ioc >= threshold]
if not candidates:
# Fall back to the length with the highest IoC
return max(ioc_results, key=ioc_results.get)
# Return the smallest candidate -- it will be the fundamental
# period; larger candidates are almost always multiples of it.
return candidates[0]
# Step 1: Determine key length via IoC
print("=" * 55)
print("FULL ATTACK ON VIGENERE CIPHER USING IoC")
print("=" * 55)
ioc_results = ioc_tool.estimate_key_length(ciphertext, max_key_length=12)
best_L = select_best_key_length(ioc_results)
print(f"\nStep 1: Key length estimation via IoC")
print(f" Best candidate key length: L = {best_L}")
print(f" Average column IoC at L={best_L}: {float(ioc_results[best_L]):.6f}")
# Step 2: Recover key via chi-squared
recovered_key = recover_key(ciphertext, best_L)
print(f"\nStep 2: Key recovery via chi-squared")
print(f" Recovered key: '{recovered_key}'")
print(f" True key: '{key}'")
print(f" Match: {'YES' if recovered_key == key else 'NO'}")
# Step 3: Decrypt
decrypted = vigenere_decrypt(ciphertext, recovered_key)
print(f"\nStep 3: Decrypted text (first 120 chars):")
print(f" {decrypted[:120]}")
=======================================================
FULL ATTACK ON VIGENERE CIPHER USING IoC
=======================================================
Step 1: Key length estimation via IoC
Best candidate key length: L = 6
Average column IoC at L=6: 0.070684
Step 2: Key recovery via chi-squared
Recovered key: 'hunter'
True key: 'hunter'
Match: YES
Step 3: Decrypted text (first 120 chars):
ciphersohcipherssecretstheykeeplockedawayincodedeepanddiscreetsymbolsandnumbersjumbledandmixedamessageencodedreadytobefi
The attack pipeline
The complete Vigenère attack consists of two distinct phases:
Key length determination using the Index of Coincidence (this chapter) or the Kasiski examination (Chapter 5).
Key recovery using chi-squared frequency matching on each column. Once the key length \(L\) is known, the problem reduces to \(L\) independent Caesar ciphers.
Experiment 4: Friedman’s Direct Formula#
We test the direct estimation formula \(L \approx (\kappa_p - \kappa_r) / (\mathrm{IC}_{\text{obs}} - \kappa_r)\) on ciphertexts with various key lengths.
import numpy as np
# Friedman's direct key length estimation formula
kappa_p = ioc_tool.expected_ioc_english()
kappa_r = ioc_tool.expected_ioc_random()
print("Friedman's Direct Formula: L ~ (kappa_p - kappa_r) / (IC_obs - kappa_r)")
print(f"kappa_p = {float(kappa_p):.6f}, kappa_r = {float(kappa_r):.6f}")
print()
test_keys = ["ab", "key", "four", "magic", "hunter", "longkey",
"vigenere", "tenletkey", "encryption"]
print(f"{'Key':>12s} {'True L':>7s} {'IC_obs':>8s} {'Friedman L':>11s} {'Column L':>9s}")
print("-" * 55)
for k in test_keys:
ct = vigenere_encrypt(english_text, k)
ic_obs = ioc_tool.compute_ioc(ct)
# Friedman's formula
if ic_obs > kappa_r:
friedman_L = (kappa_p - kappa_r) / (ic_obs - kappa_r)
else:
friedman_L = float('inf')
# Column method (prefer smallest candidate above threshold)
ioc_est = ioc_tool.estimate_key_length(ct, max_key_length=15)
column_L = select_best_key_length(ioc_est)
print(f"{k:>12s} {int(len(k)):>7d} {float(ic_obs):>8.5f} {float(friedman_L):>11.2f} {int(column_L):>9d}")
Friedman's Direct Formula: L ~ (kappa_p - kappa_r) / (IC_obs - kappa_r)
kappa_p = 0.065497, kappa_r = 0.038462
Key True L IC_obs Friedman L Column L
-------------------------------------------------------
ab 2 0.05532 1.60 2
key 3 0.04700 3.16 3
four 4 0.04587 3.65 4
magic 5 0.04418 4.73 5
hunter 6 0.04425 4.67 6
longkey 7 0.04119 9.89 7
vigenere 8 0.04625 3.47 8
tenletkey 9 0.04475 4.30 9
encryption 10 0.04374 5.12 10
Observation
Friedman’s direct formula gives a rough floating-point estimate of \(L\). The column-based method (Experiment 2) is more reliable in practice because it directly tests each candidate \(L\) and does not depend on a single global IoC measurement. However, the direct formula is useful for a quick initial estimate, especially when one wants to narrow the search range.
Experiment 5: Comparing Kasiski vs. IoC on the Same Ciphertext#
We implement a simplified Kasiski examination and compare its output with the IoC column method.
import numpy as np
def kasiski_examination(ciphertext, min_ngram=3, max_ngram=5):
"""Perform Kasiski examination to estimate key length.
Finds repeated n-grams, computes distances between occurrences,
and returns factor counts to estimate key length.
"""
cleaned = ''.join(ch.lower() for ch in ciphertext if ch.isalpha())
distances = []
for n in range(min_ngram, max_ngram + 1):
# Find all n-gram positions
ngram_positions = {}
for i in range(len(cleaned) - n + 1):
ngram = cleaned[i:i + n]
if ngram in ngram_positions:
ngram_positions[ngram].append(i)
else:
ngram_positions[ngram] = [i]
# Compute distances between repeated n-grams
for ngram, positions in ngram_positions.items():
if len(positions) >= 2:
for i in range(len(positions)):
for j in range(i + 1, len(positions)):
distances.append(positions[j] - positions[i])
if not distances:
return {}
# Count factors of all distances
factor_counts = {}
for d in distances:
for f in range(2, min(d + 1, 21)): # check factors up to 20
if d % f == 0:
factor_counts[f] = factor_counts.get(f, 0) + 1
return factor_counts
# Compare on the same ciphertext
print("=" * 60)
print("COMPARISON: Kasiski Examination vs. Index of Coincidence")
print(f"True key: '{key}' (length {len(key)})")
print("=" * 60)
# Kasiski
print("\n--- Kasiski Examination ---")
factor_counts = kasiski_examination(ciphertext)
if factor_counts:
sorted_factors = sorted(factor_counts.items(), key=lambda x: x[1], reverse=True)
print(f"{'Factor':>8s} {'Count':>6s}")
print("-" * 18)
for factor, count in sorted_factors[:10]:
marker = " <--" if factor == len(key) else ""
print(f"{int(factor):>8d} {int(count):>6d}{marker}")
kasiski_best = sorted_factors[0][0]
print(f"\nKasiski best guess: L = {kasiski_best}")
else:
print("No repeated n-grams found (text too short for Kasiski).")
kasiski_best = None
# IoC column method
print("\n--- Index of Coincidence (Column Method) ---")
ioc_results = ioc_tool.estimate_key_length(ciphertext, max_key_length=12)
print(f"{'L':>3s} {'Avg Col IoC':>12s}")
print("-" * 20)
for L in range(1, 13):
marker = " <--" if ioc_results[L] > 0.060 else ""
print(f"{int(L):>3d} {float(ioc_results[L]):>12.6f}{marker}")
ioc_best = select_best_key_length(ioc_results)
print(f"\nIoC best guess: L = {ioc_best}")
============================================================
COMPARISON: Kasiski Examination vs. Index of Coincidence
True key: 'hunter' (length 6)
============================================================
--- Kasiski Examination ---
Factor Count
------------------
2 127
3 120
6 120 <--
4 59
12 54
9 33
18 33
7 17
14 17
8 15
Kasiski best guess: L = 2
--- Index of Coincidence (Column Method) ---
L Avg Col IoC
--------------------
1 0.044249
2 0.047276
3 0.055169
4 0.047285
5 0.043817
6 0.070684 <--
7 0.042595
8 0.046647
9 0.056016
10 0.046104
11 0.042155
12 0.071212 <--
IoC best guess: L = 6
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
# Side-by-side comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left panel: Kasiski factor counts
if factor_counts:
factors = sorted(factor_counts.keys())
factors = [f for f in factors if f <= 12]
counts = [factor_counts.get(f, 0) for f in factors]
colors_k = ['forestgreen' if f == len(key) else 'steelblue' for f in factors]
ax1.bar(factors, counts, color=colors_k, edgecolor='navy', alpha=0.8)
ax1.set_xlabel('Factor', fontsize=12)
ax1.set_ylabel('Number of distances divisible by factor', fontsize=11)
ax1.set_title('Kasiski Examination', fontsize=13)
ax1.set_xticks(factors)
ax1.grid(axis='y', alpha=0.3)
# Right panel: IoC column method
L_vals = list(range(1, 13))
ioc_vals = [ioc_results[L] for L in L_vals]
colors_i = ['forestgreen' if L == len(key) else 'steelblue' for L in L_vals]
ax2.bar(L_vals, ioc_vals, color=colors_i, edgecolor='navy', alpha=0.8)
ax2.axhline(y=0.0667, color='green', linestyle='--', linewidth=1.5,
label=r'$\kappa_p$')
ax2.axhline(y=0.0385, color='red', linestyle='--', linewidth=1.5,
label=r'$\kappa_r$')
ax2.set_xlabel('Trial key length $L$', fontsize=12)
ax2.set_ylabel('Average column IoC', fontsize=11)
ax2.set_title('Index of Coincidence (Column Method)', fontsize=13)
ax2.set_xticks(L_vals)
ax2.set_ylim(0.02, 0.08)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)
fig.suptitle(f'Kasiski vs. IoC -- Key = "{key}" (length {len(key)})',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('kasiski_vs_ioc_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: kasiski_vs_ioc_comparison.png")
Figure saved: kasiski_vs_ioc_comparison.png
Experiment 6: IoC Convergence as Text Length Grows#
We study how the IoC of a text stabilizes as the sample size increases, for both English and random text.
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
# Compute IoC for growing prefixes of the English text and random text
cleaned_english = ''.join(ch.lower() for ch in english_text if ch.isalpha())
prefix_lengths = list(range(10, len(cleaned_english) + 1, 5))
ioc_english_prefix = []
ioc_random_prefix = []
rng = np.random.default_rng(123)
long_random = ''.join(chr(rng.integers(0, 26) + ord('a'))
for _ in range(len(cleaned_english)))
for n in prefix_lengths:
ioc_english_prefix.append(ioc_tool.compute_ioc(cleaned_english[:n]))
ioc_random_prefix.append(ioc_tool.compute_ioc(long_random[:n]))
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(prefix_lengths, ioc_english_prefix, 'b-', linewidth=1.5,
label='English text', alpha=0.8)
ax.plot(prefix_lengths, ioc_random_prefix, 'r-', linewidth=1.5,
label='Random text', alpha=0.8)
ax.axhline(y=0.0667, color='blue', linestyle='--', linewidth=1,
alpha=0.5, label=r'$\kappa_p \approx 0.0667$')
ax.axhline(y=0.0385, color='red', linestyle='--', linewidth=1,
alpha=0.5, label=r'$\kappa_r \approx 0.0385$')
ax.set_xlabel('Text length (characters)', fontsize=12)
ax.set_ylabel('Index of Coincidence', fontsize=12)
ax.set_title('IoC Convergence as Sample Size Increases', fontsize=13)
ax.legend(fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('ioc_convergence.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: ioc_convergence.png")
Figure saved: ioc_convergence.png
Observation
The IoC is noisy for short texts (fewer than about 50 characters) but quickly stabilizes. The English text converges toward \(\kappa_p \approx 0.0667\), while random text converges toward \(\kappa_r \approx 0.0385\). This means the IoC method becomes reliable with as few as 100-200 characters of ciphertext – far less than the Kasiski examination typically requires.
Experiment 7: Visualizing the Theoretical IC vs. Key Length#
We plot Friedman’s theoretical formula alongside empirically measured IoC values.
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
# Friedman's theoretical prediction
kappa_p = ioc_tool.expected_ioc_english()
kappa_r = ioc_tool.expected_ioc_random()
L_range = np.arange(1, 21)
ic_theoretical = (kappa_p - kappa_r) / L_range + kappa_r
# Empirical IoC for ciphertexts with various key lengths
rng = np.random.default_rng(99)
empirical_ic = []
for L in L_range:
# Generate a random key of length L
random_key = ''.join(chr(rng.integers(0, 26) + ord('a')) for _ in range(L))
ct = vigenere_encrypt(english_text, random_key)
empirical_ic.append(ioc_tool.compute_ioc(ct))
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(L_range, ic_theoretical, 'b-o', linewidth=2, markersize=5,
label='Friedman theoretical', zorder=3)
ax.scatter(L_range, empirical_ic, color='red', s=60, zorder=4,
label='Empirical (random keys)', marker='x', linewidths=2)
ax.axhline(y=kappa_p, color='green', linestyle=':', linewidth=1,
label=r'$\kappa_p$ (English)')
ax.axhline(y=kappa_r, color='orange', linestyle=':', linewidth=1,
label=r'$\kappa_r$ (random)')
ax.set_xlabel('Key length $L$', fontsize=12)
ax.set_ylabel('Index of Coincidence', fontsize=12)
ax.set_title('Friedman\'s Theorem: Theoretical vs. Empirical IoC', fontsize=13)
ax.legend(fontsize=10)
ax.set_xticks(L_range)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('friedman_theoretical_vs_empirical.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: friedman_theoretical_vs_empirical.png")
Figure saved: friedman_theoretical_vs_empirical.png
Observation
The empirical IoC values closely follow Friedman’s theoretical curve. For \(L = 1\) (Caesar cipher), the IoC is near \(\kappa_p\) because a single shift preserves the frequency distribution. As \(L\) increases, the IoC drops monotonically toward \(\kappa_r\), because the superposition of multiple shifted alphabets flattens the overall frequency distribution. The slight deviations from theory arise from the finite length of our sample text.
Experiment 8: Mutual Index of Coincidence#
We demonstrate the Mutual IoC by comparing columns of a Vigenère ciphertext that were encrypted with the same key letter versus different key letters.
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
# Split ciphertext into columns using the known key length
cleaned_ct = ''.join(ch.lower() for ch in ciphertext if ch.isalpha())
key_length = len(key) # 6
columns = [cleaned_ct[i::key_length] for i in range(key_length)]
# Compute MIC for all pairs of columns
mic_matrix = np.zeros((key_length, key_length))
for i in range(key_length):
for j in range(key_length):
mic_matrix[i, j] = ioc_tool.mutual_ioc(columns[i], columns[j])
print("Mutual Index of Coincidence between columns")
print(f"Key = '{key}' (length {key_length})")
print(f"Key letters: {[ch for ch in key]}")
print()
# Print the matrix
header = " " + " ".join(f"Col {j}" for j in range(key_length))
print(header)
for i in range(key_length):
row = f"Col {i} "
for j in range(key_length):
row += f" {float(mic_matrix[i, j]):.4f}"
print(row)
# Visualize as heatmap
fig, ax = plt.subplots(figsize=(7, 6))
im = ax.imshow(mic_matrix, cmap='YlOrRd', vmin=0.03, vmax=0.08)
# Add text annotations
for i in range(key_length):
for j in range(key_length):
text_color = 'white' if mic_matrix[i, j] > 0.06 else 'black'
ax.text(j, i, f'{float(mic_matrix[i, j]):.4f}',
ha='center', va='center', fontsize=10, color=text_color)
ax.set_xticks(range(key_length))
ax.set_yticks(range(key_length))
ax.set_xticklabels([f'Col {j}\n(key={key[j]})' for j in range(key_length)])
ax.set_yticklabels([f'Col {i} (key={key[i]})' for i in range(key_length)])
ax.set_title('Mutual Index of Coincidence between Vigenere Columns', fontsize=12)
fig.colorbar(im, ax=ax, label='MIC value')
plt.tight_layout()
plt.savefig('mutual_ioc_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: mutual_ioc_heatmap.png")
Mutual Index of Coincidence between columns
Key = 'hunter' (length 6)
Key letters: ['h', 'u', 'n', 't', 'e', 'r']
Col 0 Col 1 Col 2 Col 3 Col 4 Col 5
Col 0 0.0708 0.0473 0.0286 0.0429 0.0409 0.0454
Col 1 0.0473 0.0881 0.0391 0.0427 0.0401 0.0339
Col 2 0.0286 0.0391 0.0745 0.0281 0.0356 0.0364
Col 3 0.0429 0.0427 0.0281 0.0818 0.0487 0.0323
Col 4 0.0409 0.0401 0.0356 0.0487 0.0783 0.0433
Col 5 0.0454 0.0339 0.0364 0.0323 0.0433 0.0761
Figure saved: mutual_ioc_heatmap.png
Observation
The diagonal entries (each column compared with itself) show the IoC of individual columns, which should be near \(\kappa_p \approx 0.0667\). The off-diagonal entries show the MIC between columns encrypted with different key letters. If two columns happen to use key letters that are close together in the alphabet (small relative shift), their MIC may be higher; otherwise it will be closer to \(\kappa_r\). The MIC can be used to determine the relative shift between two columns, which is useful for key recovery.
Exercises#
Exercise 6.1: IoC for different languages
The letter frequency distributions differ across languages. The following table gives approximate \(\kappa_p\) values for several languages:
Language |
\(\kappa_p\) |
|---|---|
English |
0.0667 |
French |
0.0778 |
German |
0.0762 |
Italian |
0.0738 |
Spanish |
0.0775 |
Russian |
0.0529 |
(a) Explain why Russian has a lower \(\kappa_p\) than the other languages listed. (Hint: consider the alphabet size.)
(b) Using the French frequency distribution below, compute \(\kappa_p\) for French and verify it is approximately 0.0778.
# French letter frequencies (A-Z)
french_freq = [
0.0815, 0.0097, 0.0315, 0.0373, 0.1739, 0.0112,
0.0097, 0.0085, 0.0731, 0.0045, 0.0002, 0.0546,
0.0296, 0.0712, 0.0528, 0.0303, 0.0099, 0.0655,
0.0808, 0.0707, 0.0574, 0.0132, 0.0004, 0.0045,
0.0030, 0.0012
]
(c) How would Friedman’s key length formula change for a French-language ciphertext?
Exercise 6.2: Proving IoC invariance
Prove that the Index of Coincidence is invariant under any monoalphabetic substitution cipher. That is, if \(\sigma: \mathcal{A} \to \mathcal{A}\) is a bijection and \(y = \sigma(x_1)\sigma(x_2)\cdots\sigma(x_N)\), then \(\mathrm{IC}(y) = \mathrm{IC}(x)\).
Hint: A bijection \(\sigma\) merely permutes the frequency counts \(\{f_0, f_1, \ldots, f_{25}\}\), but the IoC formula is a symmetric function of these counts.
Exercise 6.3: Implementing Friedman’s key length formula
Write a function friedman_estimate(ciphertext) that:
Computes the observed IoC of the ciphertext.
Applies the formula \(L \approx (\kappa_p - \kappa_r) / (\mathrm{IC}_{\text{obs}} - \kappa_r)\).
Returns the nearest integer as the estimated key length.
Test your function on the ciphertext produced by encrypting the text "the quick brown fox jumps over the lazy dog" repeated 20 times with the keys "ab", "code", "secret", and "cryptanalysis". Compare the estimates with the true key lengths.
Exercise 6.4: IoC of a running-key cipher
A running-key cipher uses a key that is as long as the plaintext (typically taken from a book or other natural-language text).
(a) What is the expected IoC of ciphertext produced by a running-key cipher? (Hint: The key length \(L = N\), the text length.)
(b) Encrypt a 500-character English plaintext using a 500-character English key (from a different source). Compute the IoC and compare it to \(\kappa_r\).
(c) Explain why the IoC method alone cannot distinguish between a running-key cipher and a one-time pad with truly random key.
Exercise 6.5: Automated Vigenère attack
Write a complete function break_vigenere(ciphertext, max_key_length=20) that:
Uses the IoC column method to determine the key length.
Uses the chi-squared method to recover each key letter.
Returns the recovered key and the decrypted plaintext.
Test your function on the following ciphertext (encrypted with an unknown key of unknown length):
lxfopvefrnhr...
Generate your own test ciphertext by encrypting a paragraph from a book with a key of your choice, then verify that your function recovers the correct key.
Summary#
In this chapter we introduced the Index of Coincidence, a statistical tool invented by William F. Friedman that revolutionized the cryptanalysis of polyalphabetic ciphers.
Key concepts:
Index of Coincidence (IoC): The probability that two randomly chosen letters from a text are the same: \(\mathrm{IC}(x) = \sum f_i(f_i - 1) / N(N-1)\).
Benchmark values: English text has \(\kappa_p \approx 0.0667\); uniformly random text has \(\kappa_r = 1/26 \approx 0.0385\).
Invariance: The IoC is invariant under monoalphabetic substitution (any permutation of the alphabet).
Friedman’s Theorem: For a polyalphabetic cipher with key length \(L\), the expected IoC is \(\mathrm{IC} \approx (\kappa_p - \kappa_r)/L + \kappa_r\).
Key length estimation: The column method splits ciphertext into \(L\) columns and checks whether the average column IoC is close to \(\kappa_p\). Peaks at the correct \(L\) and its multiples.
Mutual IoC: Measures the similarity between two texts and can be used to determine relative shifts between Vigenère columns.
Comparison with Kasiski: The IoC method is more robust for shorter texts and does not require the ciphertext to contain repeated n-grams.
Together with the Kasiski examination (Chapter 5), the IoC gives the cryptanalyst a powerful toolkit for breaking the Vigenère cipher. In practice, the two methods complement each other: Kasiski gives exact divisors while the IoC gives a continuous statistical measure.
References#
Friedman, W. F. (1922). The Index of Coincidence and Its Applications in Cryptanalysis. Riverbank Publication No. 22. Department of Ciphers, Riverbank Laboratories, Geneva, Illinois. [1]
Kasiski, F. W. (1863). Die Geheimschriften und die Dechiffrir-Kunst. Berlin: E. S. Mittler und Sohn. [2]
Sinkov, A. (1966). Elementary Cryptanalysis: A Mathematical Approach. Mathematical Association of America. – Chapters 4-5 provide an accessible introduction to the IoC and Kasiski methods.
Bauer, F. L. (2007). Decrypted Secrets: Methods and Maxims of Cryptology. 4th edition, Springer. – Chapter 3 gives a thorough treatment of polyalphabetic ciphers and the IoC.
Singh, S. (1999). The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. Doubleday. [3] – Chapter 2 tells the story of Friedman and the breaking of polyalphabetic ciphers.
Kahn, D. (1996). The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet. Revised edition, Scribner. – Chapters 12-13 cover Friedman’s contributions in detail.