Chapter 35: Character-Level Language Modeling
In 2015, Andrej Karpathy demonstrated that a character-level LSTM trained on raw text could generate remarkably coherent prose. In this chapter, we build exactly this system on Shakespeare.
Character-level language modeling is the simplest formulation of the problem that underpins modern AI: next-token prediction. Given a sequence of characters \(c_1, c_2, \ldots, c_t\), the model outputs a probability distribution over the next character \(c_{t+1}\). This is the same objective used by GPT and every other autoregressive large language model; the only difference is the granularity of the tokens. By working at the character level, we strip away the complexity of tokenization and see the core mechanism in its purest form.
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import time
plt.style.use('seaborn-v0_8-whitegrid')
BLUE = '#3b82f6'
GREEN = '#059669'
RED = '#dc2626'
AMBER = '#d97706'
INDIGO = '#4f46e5'
torch.manual_seed(42)
np.random.seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'PyTorch version: {torch.__version__}')
print(f'Device: {device}')
PyTorch version: 2.7.0
Device: cpu
35.1 The Character-Level Language Model
A language model assigns a probability to a sequence of tokens:

\[ P(c_1, c_2, \ldots, c_T) = \prod_{t=1}^{T} P(c_t \mid c_1, \ldots, c_{t-1}) \]
In a character-level model, each token \(c_t\) is a single character. The vocabulary \(V\) is simply the set of unique characters in the training corpus—typically 50–80 characters for English text (letters, digits, punctuation, whitespace).
Next-Token Prediction
The training objective is to minimize the cross-entropy between the model’s predicted distribution and the true next character:

\[ \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(c_t \mid c_1, \ldots, c_{t-1}) \]
This is identical to the loss function used by GPT-3, GPT-4, and other autoregressive language models. The only difference is scale: our vocabulary has 61 characters instead of ~50,000 subword tokens, and our model has tens of thousands of parameters instead of billions.
An RNN is a natural fit for this task: at each time step, the hidden state \(h_t\) encodes the context \(c_1, \ldots, c_t\), and a linear layer maps \(h_t\) to logits over the vocabulary.
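Concretely, for the vanilla RNN we implement below (PyTorch’s nn.RNN, which uses a tanh nonlinearity by default), each step computes

\[ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad z_t = W_{hy} h_t + b_z, \]

where \(x_t\) is the embedding of \(c_t\), the bias terms are folded together for brevity, and \(z_t \in \mathbb{R}^{|V|}\) are the logits fed to the softmax.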
35.2 The Shakespeare Dataset
We use a 100,000-character excerpt from Shakespeare’s works as our training corpus, drawn from the Shakespeare corpus used in Karpathy’s influential blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” (2015).
# Load Shakespeare text from local file
with open('shakespeare.txt', 'r') as f:
    text = f.read()
# Basic statistics
chars = sorted(set(text))
vocab_size = len(chars)
print(f'Total characters: {len(text):,}')
print(f'Unique characters (vocab size): {vocab_size}')
print(f'\nCharacter set:')
print(repr(''.join(chars)))
print(f'\n--- Sample passage (first 500 chars) ---')
print(text[:500])
Total characters: 100,000
Unique characters (vocab size): 61
Character set:
"\n !&',-.:;?ABCDEFGHIJKLMNOPQRSTUVWYabcdefghijklmnopqrstuvwxyz"
--- Sample passage (first 500 chars) ---
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
All:
We know't, we know't.
First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?
All:
No more talking on't; let it be done: away, away!
Second Citizen:
One word, good citizens.
First Citizen:
We are accounted poor
# Character frequency histogram
from collections import Counter
char_counts = Counter(text)
# Sort by frequency
sorted_chars = sorted(char_counts.items(), key=lambda x: -x[1])
top_chars = sorted_chars[:30]
labels = [repr(c)[1:-1] if c not in ('\n', ' ', '\t') else
{'\n': '\\n', ' ': 'SP', '\t': '\\t'}[c]
for c, _ in top_chars]
counts = [cnt for _, cnt in top_chars]
fig, ax = plt.subplots(figsize=(12, 4.5))
bars = ax.bar(range(len(labels)), counts, color=BLUE, alpha=0.8, edgecolor='white')
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, fontsize=9, fontfamily='monospace')
ax.set_xlabel('Character', fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title('Top 30 Character Frequencies in Shakespeare Corpus', fontsize=13, fontweight='bold')
# Highlight space and newline
for i, (c, _) in enumerate(top_chars):
    if c in (' ', '\n'):
        bars[i].set_color(AMBER)
plt.tight_layout()
plt.show()
print(f'Most common: {labels[0]} ({counts[0]:,} occurrences, {counts[0]/len(text)*100:.1f}%)')
print(f'Whitespace (space + newline): {char_counts.get(" ", 0) + char_counts.get(chr(10), 0):,} '
f'({(char_counts.get(" ", 0) + char_counts.get(chr(10), 0))/len(text)*100:.1f}%)')
Most common: SP (14,711 occurrences, 14.7%)
Whitespace (space + newline): 18,637 (18.6%)
35.3 Data Preparation
We need three components:
1. A character-to-index mapping (and its inverse) to convert between characters and integers.
2. Sequence windowing: slide a window of length seq_len across the text to create input/target pairs.
3. A PyTorch Dataset and DataLoader for batched training.
# Character-to-index mapping
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
# Encode the entire text as a tensor of indices
encoded = torch.tensor([char_to_idx[ch] for ch in text], dtype=torch.long)
print(f'Encoded tensor shape: {encoded.shape}')
print(f'First 50 indices: {encoded[:50].tolist()}')
print(f'Decoded back: {repr("".join(idx_to_char[i.item()] for i in encoded[:50]))}')
Encoded tensor shape: torch.Size([100000])
First 50 indices: [16, 43, 52, 53, 54, 1, 13, 43, 54, 43, 60, 39, 48, 8, 0, 12, 39, 40, 49, 52, 39, 1, 57, 39, 1, 50, 52, 49, 37, 39, 39, 38, 1, 35, 48, 59, 1, 40, 55, 52, 54, 42, 39, 52, 5, 1, 42, 39, 35, 52]
Decoded back: 'First Citizen:\nBefore we proceed any further, hear'
# Sequence windowing: create input/target pairs
class ShakespeareDataset(Dataset):
    """Character-level language model dataset.

    Each sample is a pair (input_seq, target_seq) where:
      - input_seq  = text[i : i + seq_len]
      - target_seq = text[i+1 : i + seq_len + 1]
    The target is the input shifted by one character.
    """
    def __init__(self, data, seq_len):
        self.data = data
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        end = start + self.seq_len
        x = self.data[start:end]
        y = self.data[start+1:end+1]
        return x, y
# Hyperparameters
SEQ_LEN = 50
BATCH_SIZE = 64
EMBED_SIZE = 32
HIDDEN_SIZE = 128
LR = 0.003
N_EPOCHS = 10
# Create dataset and dataloader
dataset = ShakespeareDataset(encoded, SEQ_LEN)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
print(f'Sequence length: {SEQ_LEN}')
print(f'Batch size: {BATCH_SIZE}')
print(f'Number of sequences: {len(dataset)}')
print(f'Batches per epoch: {len(dataloader)}')
# Peek at one batch
x_batch, y_batch = next(iter(dataloader))
print(f'\nBatch shapes: x={x_batch.shape}, y={y_batch.shape}')
print(f'Example input: {repr("".join(idx_to_char[i.item()] for i in x_batch[0][:30]))}')
print(f'Example target: {repr("".join(idx_to_char[i.item()] for i in y_batch[0][:30]))}')
Sequence length: 50
Batch size: 64
Number of sequences: 1999
Batches per epoch: 31
Batch shapes: x=torch.Size([64, 50]), y=torch.Size([64, 50])
Example input: "lse of\nTarquin seven hurts i' "
Example target: "se of\nTarquin seven hurts i' t"
Notice how the target is simply the input shifted by one position. The model learns to predict each character given all previous characters in the window.
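Before training, it is worth knowing the baseline: a model that guesses uniformly over our 61-character vocabulary incurs a cross-entropy of \(\ln 61 \approx 4.11\) nats per character, so any loss below that indicates the model has learned something. A one-line check:
# Baseline: cross-entropy of a uniform prediction over the vocabulary.
# Any trained model should beat this (~4.11 nats for our 61-char vocab).
import math
print(f'Uniform baseline: {math.log(vocab_size):.3f} nats/char')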
35.4 Vanilla RNN on Shakespeare
Our model architecture is straightforward:
1. Embedding layer: maps each character index to a dense vector of size embed_size.
2. RNN layer: processes the sequence, producing hidden states at each time step.
3. Linear layer: maps each hidden state to logits over the vocabulary.
We start with a vanilla RNN to establish a baseline, then upgrade to LSTM.
class CharRNN(nn.Module):
    """Character-level language model with configurable RNN type."""
    def __init__(self, vocab_size, embed_size, hidden_size, rnn_type='rnn'):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_type = rnn_type
        # Character embedding
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Recurrent layer
        if rnn_type == 'rnn':
            self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        elif rnn_type == 'lstm':
            self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
        elif rnn_type == 'gru':
            self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        # Output projection: hidden state -> vocabulary logits
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        """Forward pass.

        Args:
            x: input indices, shape (batch, seq_len)
            hidden: initial hidden state (optional)

        Returns:
            logits: shape (batch, seq_len, vocab_size)
            hidden: final hidden state
        """
        emb = self.embedding(x)              # (batch, seq_len, embed_size)
        out, hidden = self.rnn(emb, hidden)  # (batch, seq_len, hidden_size)
        logits = self.fc(out)                # (batch, seq_len, vocab_size)
        return logits, hidden

    def generate(self, start_str, length=200, temperature=1.0):
        """Generate text autoregressively."""
        self.eval()
        chars_generated = list(start_str)
        # Encode the start string
        input_idx = torch.tensor([[char_to_idx[ch] for ch in start_str]], dtype=torch.long)
        hidden = None
        with torch.no_grad():
            # Process the seed
            logits, hidden = self.forward(input_idx, hidden)
            # Generate one character at a time
            for _ in range(length):
                # Use the last character's logits
                last_logits = logits[0, -1, :] / temperature
                probs = torch.softmax(last_logits, dim=0)
                next_idx = torch.multinomial(probs, 1).item()
                chars_generated.append(idx_to_char[next_idx])
                # Feed the generated character back
                input_idx = torch.tensor([[next_idx]], dtype=torch.long)
                logits, hidden = self.forward(input_idx, hidden)
        return ''.join(chars_generated)
# Count parameters
def count_params(model):
    return sum(p.numel() for p in model.parameters())
rnn_model = CharRNN(vocab_size, EMBED_SIZE, HIDDEN_SIZE, rnn_type='rnn')
print(f'Vanilla RNN model:')
print(f' Parameters: {count_params(rnn_model):,}')
print(f' Embedding: {vocab_size} x {EMBED_SIZE} = {vocab_size * EMBED_SIZE:,}')
print(f' RNN: {sum(p.numel() for p in rnn_model.rnn.parameters()):,}')
print(f' Output FC: {HIDDEN_SIZE} x {vocab_size} + {vocab_size} = {HIDDEN_SIZE * vocab_size + vocab_size:,}')
Vanilla RNN model:
Parameters: 30,557
Embedding: 61 x 32 = 1,952
RNN: 20,736
Output FC: 128 x 61 + 61 = 7,869
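The 20,736 RNN-layer parameters can be verified by hand: PyTorch’s nn.RNN stores an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, so

\[ H \cdot E + H \cdot H + 2H = 128 \cdot 32 + 128 \cdot 128 + 2 \cdot 128 = 20{,}736. \]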
def train_model(model, dataloader, n_epochs, lr, model_name='Model'):
    """Train a character-level language model."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    losses = []
    samples = {}  # epoch -> generated text
    sample_epochs = {2, 5, 10}
    start_time = time.time()
    for epoch in range(1, n_epochs + 1):
        model.train()
        epoch_loss = 0.0
        n_batches = 0
        for x_batch, y_batch in dataloader:
            logits, _ = model(x_batch)
            # Reshape for cross-entropy: (batch * seq_len, vocab_size) vs (batch * seq_len,)
            loss = criterion(logits.reshape(-1, vocab_size), y_batch.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            optimizer.step()
            epoch_loss += loss.item()
            n_batches += 1
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        elapsed = time.time() - start_time
        print(f' Epoch {epoch:2d}/{n_epochs} loss={avg_loss:.4f} [{elapsed:.1f}s]')
        if epoch in sample_epochs:
            sample = model.generate('KING ', length=150, temperature=0.8)
            samples[epoch] = sample
    total_time = time.time() - start_time
    print(f' Training complete in {total_time:.1f}s')
    return losses, samples
# Train vanilla RNN
print('=== Training Vanilla RNN ===')
torch.manual_seed(42)
rnn_model = CharRNN(vocab_size, EMBED_SIZE, HIDDEN_SIZE, rnn_type='rnn')
rnn_losses, rnn_samples = train_model(rnn_model, dataloader, N_EPOCHS, LR, 'RNN')
=== Training Vanilla RNN ===
Epoch 1/10 loss=3.0799 [0.3s]
Epoch 2/10 loss=2.4150 [0.6s]
Epoch 3/10 loss=2.2001 [1.0s]
Epoch 4/10 loss=2.0785 [1.3s]
Epoch 5/10 loss=1.9908 [1.6s]
Epoch 6/10 loss=1.9207 [1.9s]
Epoch 7/10 loss=1.8691 [2.2s]
Epoch 8/10 loss=1.8252 [2.5s]
Epoch 9/10 loss=1.7893 [2.8s]
Epoch 10/10 loss=1.7572 [3.1s]
Training complete in 3.1s
# Show generated text at different training stages
print('=== Vanilla RNN: Generated Text at Different Epochs ===')
for epoch in sorted(rnn_samples.keys()):
    print(f'\n--- Epoch {epoch} ---')
    print(rnn_samples[epoch])
=== Vanilla RNN: Generated Text at Different Epochs ===
--- Epoch 2 ---
KING yhas t oud surn therhen iatou the ratien sioure
Ond thet shae hivs whos mati nouw,
Whes is the aerton dee'elho our, thuent. Iome youce:
Sow his de sa
--- Epoch 5 ---
KING cinge searies ve chyou sire wore whom the commal we and thed mins and: it ond to be the pcous sell sering.
ANu:
I stor howe veed, theant,
Be stas tay
--- Epoch 10 ---
KING To hag heart on oft the can reated
There with of the way and noble
more fors, has has, the farthon triends of the reserwe an wircine restre.
Firth th
Watch how the generated text improves: early epochs produce near-random characters, while later epochs begin capturing word boundaries, common words, and rudimentary syntax.
35.5 LSTM on Shakespeare
Now we train the same architecture with an LSTM backbone. The model has more parameters due to the four gate matrices, but should learn longer-range dependencies and produce more coherent text.
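Before training, here is a quick back-of-envelope check of where the LSTM’s parameters come from (four gate blocks, each with input-to-hidden and hidden-to-hidden weights plus PyTorch’s two bias vectors):
# Sanity check: LSTM parameter count, computed by hand
E, H, V = EMBED_SIZE, HIDDEN_SIZE, vocab_size
lstm_params = 4 * (H * E + H * H + 2 * H)  # i, f, g, o gates: W_ih, W_hh, b_ih, b_hh
total = V * E + lstm_params + (H * V + V)  # embedding + LSTM + output layer
print(f'{total:,}')                        # 92,765 -- matches count_params below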
# Train LSTM
print('=== Training LSTM ===')
torch.manual_seed(42)
lstm_model = CharRNN(vocab_size, EMBED_SIZE, HIDDEN_SIZE, rnn_type='lstm')
print(f'LSTM parameters: {count_params(lstm_model):,}')
lstm_losses, lstm_samples = train_model(lstm_model, dataloader, N_EPOCHS, LR, 'LSTM')
=== Training LSTM ===
LSTM parameters: 92,765
Epoch 1/10 loss=3.2888 [1.0s]
Epoch 2/10 loss=2.6581 [1.9s]
Epoch 3/10 loss=2.3619 [2.8s]
Epoch 4/10 loss=2.1983 [3.7s]
Epoch 5/10 loss=2.0856 [4.6s]
Epoch 6/10 loss=2.0029 [5.6s]
Epoch 7/10 loss=1.9403 [6.5s]
Epoch 8/10 loss=1.8887 [7.4s]
Epoch 9/10 loss=1.8457 [8.3s]
Epoch 10/10 loss=1.8098 [9.2s]
Training complete in 9.3s
# Side-by-side comparison of generated text
print('=' * 80)
print('COMPARISON: Generated text after 10 epochs')
print('=' * 80)
# Generate fresh samples with the same seed
torch.manual_seed(99)
rnn_text = rnn_model.generate('KING ', length=200, temperature=0.8)
torch.manual_seed(99)
lstm_text = lstm_model.generate('KING ', length=200, temperature=0.8)
print('\n--- Vanilla RNN ---')
print(rnn_text)
print('\n--- LSTM ---')
print(lstm_text)
================================================================================
COMPARISON: Generated text after 10 epochs
================================================================================
--- Vanilla RNN ---
KING Corionones got should, nese some a canselved have tome then us the mad
Not as word, wall be asparceserve menust is as though you my with us to semperves you the fore lewered,
One and wilt wall that a
--- LSTM ---
KING me in one have sust
And the sont all they lound in uster, the might Carmin as wriet wall thearp
Tit have men he lopes, so thou wo their preaitions benesore cis youre frane hald but in his wall that an
The LSTM typically produces text with better structure: more consistent word lengths, more plausible Shakespearean vocabulary, and occasionally coherent phrases. The difference becomes more pronounced with longer training.
35.6 Temperature Sampling
Recall from Chapter 26 that the softmax temperature \(T\) controls the entropy of the output distribution:

\[ P(c_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}, \]

where \(z_i\) are the logits.
\(T < 1\) (low temperature): The distribution becomes peaked—the model becomes more “confident” and repetitive. Generated text is more predictable but less diverse.
\(T = 1\): The unmodified model distribution.
\(T > 1\) (high temperature): The distribution becomes flatter—the model explores more alternatives. Generated text is more creative but may become incoherent.
The Temperature Tradeoff
Low temperature produces safe, repetitive text. High temperature produces diverse, risky text. There is no universally optimal temperature—the best value depends on the application. For creative writing, \(T \approx 0.7\)–\(0.9\) often works well.
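A toy example makes the effect concrete; the logits below are invented purely for illustration:
# How temperature reshapes a fixed distribution (illustrative logits)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=0)
    print(f'T={T}: {[round(p, 3) for p in probs.tolist()]}')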
# Generate text at different temperatures
temperatures = [0.5, 1.0, 1.5]
print('=== LSTM: Temperature Sampling ===')
for temp in temperatures:
    torch.manual_seed(42)
    generated = lstm_model.generate('HAMLET:\n', length=250, temperature=temp)
    print(f'\n{"="*60}')
    print(f'Temperature = {temp}')
    print(f'{"="*60}')
    print(generated)
=== LSTM: Temperature Sampling ===
============================================================
Temperature = 0.5
============================================================
HAMLET:
Not senes for shounds than sones.
CORIOLANUS:
I should the prone have are hid mirds the lity us amble beseres the remerous.
SICINIUS:
With as the my gurse the coust the seak to shall in him you his more the martion, where what he tespers and sone t
============================================================
Temperature = 1.0
============================================================
HAMLET:
Not smest fairs denst than som;
Ald knoolf with as shand me proncil, that fhid mirds the lity., speblicedserber.
WirthENIA:
Notwrech choichees macind
Heir showith poster; be poon beged I heary, hichome' fatiwime nop, wor the well
Thes the mucisel ra
============================================================
Temperature = 1.5
============================================================
HAMLET:
Nruss, staftirandey tak's dismobplick oo
fO's yad your pomed.
Setl: thare'hid mirds Comill! Hues'mble butetesres trounKm, pucktwe
The modcises mmenndyg:
Hos: with pus!e! Ceto to witged I heany, hicbouw',
Shiwh. knfpifly, Je've;
Pllteswers bumiseltr,
Notice the tradeoff:
At \(T = 0.5\), the text may repeat common patterns but maintains consistency.
At \(T = 1.0\), the text is more varied and natural.
At \(T = 1.5\), the text becomes more erratic, with unusual character combinations.
35.7 Gate Activation Visualization
One of the most illuminating analyses of an LSTM is to visualize what the gates are doing on actual text. By feeding a sample passage through the trained LSTM and extracting the gate activations at each time step, we can see which characters trigger forgetting, storage, and output.
def extract_gate_activations(model, text_str, n_units=10):
    """Extract LSTM gate activations for visualization.

    We manually re-run the LSTM recurrence with the trained weights
    so we can capture the gate values at each step.
    """
    model.eval()
    # Encode the text
    indices = torch.tensor([[char_to_idx[ch] for ch in text_str]], dtype=torch.long)
    emb = model.embedding(indices)  # (1, T, embed_size)
    T = len(text_str)
    hidden_size = model.hidden_size
    # Get LSTM weights (PyTorch packs the four gates in i, f, g, o order)
    lstm = model.rnn
    W_ih = lstm.weight_ih_l0  # (4*H, input_size)
    W_hh = lstm.weight_hh_l0  # (4*H, H)
    b_ih = lstm.bias_ih_l0    # (4*H,)
    b_hh = lstm.bias_hh_l0    # (4*H,)
    h = torch.zeros(1, hidden_size)
    c = torch.zeros(1, hidden_size)
    forget_gates = []
    input_gates = []
    output_gates = []
    with torch.no_grad():
        for t in range(T):
            x_t = emb[0, t:t+1, :]  # (1, embed_size)
            gates = x_t @ W_ih.T + b_ih + h @ W_hh.T + b_hh
            i_gate, f_gate, g_gate, o_gate = gates.chunk(4, dim=1)
            i_t = torch.sigmoid(i_gate)
            f_t = torch.sigmoid(f_gate)
            g_t = torch.tanh(g_gate)
            o_t = torch.sigmoid(o_gate)
            c = f_t * c + i_t * g_t
            h = o_t * torch.tanh(c)
            forget_gates.append(f_t[0, :n_units].numpy())
            input_gates.append(i_t[0, :n_units].numpy())
            output_gates.append(o_t[0, :n_units].numpy())
    return {
        'forget': np.array(forget_gates),  # (T, n_units)
        'input': np.array(input_gates),
        'output': np.array(output_gates),
    }
# Extract gates for a sample passage
sample_text = 'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.'
gates = extract_gate_activations(lstm_model, sample_text, n_units=8)
# Plot gate heatmaps
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)
gate_names = ['Forget Gate', 'Input Gate', 'Output Gate']
gate_keys = ['forget', 'input', 'output']
cmaps = ['Reds', 'Greens', 'Blues']
char_labels = [repr(c)[1:-1] if c not in ('\n', ' ') else
{'\n': r'$\hookleftarrow$', ' ': r'$\sqcup$'}[c]
for c in sample_text]
for ax, name, key, cmap in zip(axes, gate_names, gate_keys, cmaps):
    im = ax.imshow(gates[key].T, aspect='auto', cmap=cmap, vmin=0, vmax=1)
    ax.set_ylabel(f'{name}\n(units)', fontsize=10)
    ax.set_yticks(range(8))
    ax.set_yticklabels([f'#{i}' for i in range(8)], fontsize=8)
    plt.colorbar(im, ax=ax, shrink=0.8, label='Activation')
# Character labels on bottom axis
axes[-1].set_xticks(range(len(sample_text)))
axes[-1].set_xticklabels(char_labels, fontsize=7, fontfamily='monospace', rotation=0)
axes[-1].set_xlabel('Character position', fontsize=11)
plt.suptitle('LSTM Gate Activations on Shakespeare Text', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Highlight interesting patterns
print('Gate activation statistics:')
for key in gate_keys:
    g = gates[key]
    print(f' {key:7s}: mean={g.mean():.3f}, std={g.std():.3f}, '
          f'min={g.min():.3f}, max={g.max():.3f}')
Gate activation statistics:
forget : mean=0.620, std=0.331, min=0.001, max=0.998
input : mean=0.941, std=0.103, min=0.262, max=1.000
output : mean=0.702, std=0.338, min=0.000, max=1.000
Look for these patterns in the heatmap:
The forget gate often activates strongly (values near 1) during word interiors, preserving context, and drops at word boundaries or punctuation—signaling the network to update its representation.
The input gate tends to spike at the beginning of new words or after punctuation, indicating that new information is being written into the cell state.
The output gate may show interesting patterns around newlines and colons, which in Shakespeare mark speaker transitions.
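The first pattern above can be checked numerically with a quick sketch; it reuses gates and sample_text from the cell above, and the boundary character set is our own heuristic:
# Mean forget-gate activation at boundary vs. interior characters
boundary = [i for i, ch in enumerate(sample_text) if ch in " \n.,:;?!'"]
interior = [i for i in range(len(sample_text)) if i not in set(boundary)]
f_gate = gates['forget']  # shape (T, n_units)
print(f'Boundary characters: {f_gate[boundary].mean():.3f}')
print(f'Word interiors:      {f_gate[interior].mean():.3f}')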
35.8 Architecture Comparison
We now train a GRU model on the same data and compare all three architectures: Vanilla RNN, LSTM, and GRU.
# Train GRU
print('=== Training GRU ===')
torch.manual_seed(42)
gru_model = CharRNN(vocab_size, EMBED_SIZE, HIDDEN_SIZE, rnn_type='gru')
print(f'GRU parameters: {count_params(gru_model):,}')
gru_losses, gru_samples = train_model(gru_model, dataloader, N_EPOCHS, LR, 'GRU')
=== Training GRU ===
GRU parameters: 72,029
Epoch 1/10 loss=3.1665 [0.8s]
Epoch 2/10 loss=2.4753 [1.5s]
Epoch 3/10 loss=2.2247 [2.3s]
Epoch 4/10 loss=2.0702 [3.1s]
Epoch 5/10 loss=1.9589 [3.8s]
Epoch 6/10 loss=1.8753 [4.6s]
Epoch 7/10 loss=1.8057 [5.4s]
Epoch 8/10 loss=1.7494 [6.2s]
Epoch 9/10 loss=1.7056 [6.9s]
Epoch 10/10 loss=1.6645 [7.7s]
Training complete in 7.7s
# Training curves comparison
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
# Loss curves
ax = axes[0]
ax.plot(range(1, N_EPOCHS + 1), rnn_losses, color=RED, marker='s', linewidth=2,
markersize=6, label=f'Vanilla RNN ({count_params(rnn_model):,} params)')
ax.plot(range(1, N_EPOCHS + 1), lstm_losses, color=GREEN, marker='o', linewidth=2,
markersize=6, label=f'LSTM ({count_params(lstm_model):,} params)')
ax.plot(range(1, N_EPOCHS + 1), gru_losses, color=BLUE, marker='^', linewidth=2,
markersize=6, label=f'GRU ({count_params(gru_model):,} params)')
ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Cross-Entropy Loss', fontsize=11)
ax.set_title('Training Loss Comparison', fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.set_xticks(range(1, N_EPOCHS + 1))
# Final comparison table as bar chart
ax = axes[1]
models_data = {
'RNN': (count_params(rnn_model), rnn_losses[-1]),
'LSTM': (count_params(lstm_model), lstm_losses[-1]),
'GRU': (count_params(gru_model), gru_losses[-1]),
}
x_pos = np.arange(3)
colors = [RED, GREEN, BLUE]
final_losses = [rnn_losses[-1], lstm_losses[-1], gru_losses[-1]]
bars = ax.bar(x_pos, final_losses, color=colors, alpha=0.85, edgecolor='white', width=0.6)
ax.set_xticks(x_pos)
ax.set_xticklabels(['Vanilla RNN', 'LSTM', 'GRU'], fontsize=11)
ax.set_ylabel('Final Loss (epoch 10)', fontsize=11)
ax.set_title('Final Loss Comparison', fontsize=12, fontweight='bold')
for bar, loss in zip(bars, final_losses):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'{loss:.3f}', ha='center', fontsize=10, fontweight='bold')
plt.suptitle('Character-Level Shakespeare: Architecture Comparison',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Print comparison table
print(f'{"":-<60}')
print(f'{"Architecture":<15} {"Parameters":>12} {"Final Loss":>12} {"Loss/Param":>15}')
print(f'{"":-<60}')
for name, (params, loss) in models_data.items():
    print(f'{name:<15} {params:>12,} {loss:>12.4f} {loss/params:>15.2e}')
print(f'{"":-<60}')
------------------------------------------------------------
Architecture Parameters Final Loss Loss/Param
------------------------------------------------------------
RNN 30,557 1.7572 5.75e-05
LSTM 92,765 1.8098 1.95e-05
GRU 72,029 1.6645 2.31e-05
------------------------------------------------------------
# Side-by-side generated samples from all three models
print('=' * 70)
print('Generated Shakespeare: Final Models (T=0.8)')
print('=' * 70)
for name, model in [('Vanilla RNN', rnn_model), ('LSTM', lstm_model), ('GRU', gru_model)]:
    torch.manual_seed(42)
    sample = model.generate('ROMEO:\n', length=200, temperature=0.8)
    print(f'\n--- {name} ---')
    print(sample)
======================================================================
Generated Shakespeare: Final Models (T=0.8)
======================================================================
--- Vanilla RNN ---
ROMEO:
Or ssens furt a denst than some plack oo
fay heads space me proned a have fhed mird for must your mbeing there to trounds, the twar goke.
CORIOLANUS:
He our hand seor the compood beged these you his
--- LSTM ---
ROMEO:
Not senst for shounds than some plack to
chan that shan the proneth, his the diminds the lity., speble but trere the peam, the twercis, wich is macind
Heir shown die the home have beged speat, fuld bo
--- GRU ---
ROMEO:
Not senst fairs do stand nos most the golf with a not we call my what do the diming fure lity. You bear deetes to trith:
Not metwer go med fait make your our hall seother on from wither speak you have
Comparison Summary
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | 0 | 3 (forget, input, output) | 2 (update, reset) |
| State vectors | 1 (\(h_t\)) | 2 (\(h_t\), \(C_t\)) | 1 (\(h_t\)) |
| Relative parameters | 1.0x | ~4x | ~3x |
| Long-range memory | Poor | Excellent | Good |
| Training speed | Fastest | Slowest | Middle |
For this small task and short training run, the GRU achieves the lowest loss, while the LSTM has not yet overtaken the vanilla RNN after 10 epochs. On larger datasets, with longer sequences and longer training, the gated architectures outperform the vanilla RNN dramatically.
35.9 Framework Corner
Same Char-RNN in Other Frameworks
TensorFlow / Keras:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(vocab_size)
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam'
)
# x_train, y_train: integer-encoded input/target windows, as built above
model.fit(x_train, y_train, epochs=10, batch_size=64)
JAX / Flax:
import jax
import jax.numpy as jnp
from flax import linen as fnn

class CharLSTM(fnn.Module):
    vocab_size: int
    hidden_size: int = 128

    @fnn.compact
    def __call__(self, x):
        x = fnn.Embed(self.vocab_size, 32)(x)  # (batch, T, 32)
        # Newer Flax linen API: features at construction,
        # initialize_carry as an instance method
        lstm = fnn.LSTMCell(features=self.hidden_size)
        carry = lstm.initialize_carry(jax.random.PRNGKey(0), x[:, 0].shape)
        outputs = []
        for t in range(x.shape[1]):
            carry, y = lstm(carry, x[:, t])  # y is the hidden state h_t
            outputs.append(y)
        h = jnp.stack(outputs, axis=1)       # (batch, T, hidden)
        return fnn.Dense(self.vocab_size)(h)  # per-step logits, like the PyTorch model
The architecture is identical across frameworks — only the API differs.
Exercises
Exercise 35.1. Compute the perplexity of each model (RNN, LSTM, GRU) on a held-out portion of the Shakespeare text. Recall that perplexity \(= \exp(\mathcal{L})\) where \(\mathcal{L}\) is the cross-entropy loss. Which model achieves the lowest perplexity? How does perplexity relate to the “quality” of generated text?
Exercise 35.2. Modify the CharRNN model to use a 2-layer LSTM (set num_layers=2 in nn.LSTM). Does the additional depth improve the loss or generated text quality? Report the parameter count and training curves.
Exercise 35.3. Implement top-k sampling: instead of sampling from the full vocabulary distribution, restrict sampling to the \(k\) most probable characters. Compare generated text quality for \(k \in \{5, 10, 20, 61\}\) (where 61 = full vocabulary). How does top-k interact with temperature?
Exercise 35.4. Train the LSTM on a different corpus of your choice (e.g., a Python source file, a novel, song lyrics). How does the generated text reflect the structure of the training data? What features does the model learn to reproduce?
Exercise 35.5. The current model processes fixed-length windows independently. Implement stateful training where the hidden state from the end of one batch is passed as the initial state of the next batch (with gradient detaching). Does this improve the loss? Why would maintaining state across batches be beneficial?
Summary
Character-level language modeling is next-token prediction at the character level—the same objective that powers GPT and other large language models, at a much smaller scale.
The Shakespeare dataset (100K characters, 61 unique characters) is sufficient to train a small LSTM that captures word boundaries, common vocabulary, speaker-turn structure, and rudimentary grammar.
Temperature sampling controls the diversity-quality tradeoff: low \(T\) produces safe, repetitive text; high \(T\) produces creative but potentially incoherent text.
Gate activation visualization reveals that LSTM gates can learn interpretable roles: forget gates often drop at word and sentence boundaries, and input gates fire at word onsets.
Gating helps in practice: the GRU achieved the lowest loss in our short run, and with longer training both LSTM and GRU reliably outperform vanilla RNNs in loss and text quality, confirming the practical importance of gating mechanisms.
References
A. Karpathy, “The Unreasonable Effectiveness of Recurrent Neural Networks,” blog post, 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of EMNLP, 2014.
I. Sutskever, J. Martens, and G. Hinton, “Generating text with recurrent neural networks,” in Proceedings of ICML, pp. 1017–1024, 2011.