This is a condensed reference covering the core concepts behind large language models and the transformer architecture, inspired by Andrej Karpathy’s “Neural Networks: Zero to Hero” series – particularly “The spelled-out intro to language modeling: building makemore” and “Let’s build GPT: from scratch, in code, spelled out.”
Language Modeling = Advanced Auto-Complete
The entire magic of large language models comes from one simple task: predict the next token in a sequence.
If you have the sequence ["The", "cat", "is"], the model learns to predict "sleeping" as the most likely next word.
Why this works: the internet gives us infinite free training data. Every webpage is a sequence where the next word is already known. We turn unsupervised text into a supervised learning problem for free.
Words aren’t independent. "I enjoyed reading a ___" – "book" is far more likely than "thermometer". The model learns these relationships by seeing billions of examples.
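The shift-by-one trick is the whole supervision scheme: every prefix of a text is a training example and the following token is its label. A toy sketch (whitespace splitting stands in for real tokenization):

```python
# Turn raw text into (context, next-token) training pairs.
# The naive whitespace split here is purely for illustration.
text = "The cat is sleeping"
tokens = text.split()            # ["The", "cat", "is", "sleeping"]

pairs = []
for i in range(1, len(tokens)):
    context = tokens[:i]         # everything seen so far
    target = tokens[i]           # the token the model must predict
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# ['The'] -> cat
# ['The', 'cat'] -> is
# ['The', 'cat', 'is'] -> sleeping
```

One sentence of text yields several labeled examples, which is why web-scale corpora translate into so much training signal.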
Tokenization: Breaking Text into Pieces
Models don’t understand text – they understand numbers. Tokenization splits text into pieces called tokens, each mapped to an integer ID.
Three approaches, from simple to practical:
| Method | Example: "The cat" | Tradeoff |
|---|---|---|
| Character-level | "T", "h", "e", " ", "c", "a", "t" | Tiny vocabulary, but sequences are long and slow to process |
| Word-level | "The", "cat" | Clean, but can’t handle new/rare words |
| Subword-level (BPE) | "The", " cat" | Best balance – common words stay whole, rare words split into learnable pieces |
Modern LLMs (GPT, LLaMA, Claude) all use subword tokenization. OpenAI’s tiktoken and Google’s SentencePiece are the two dominant implementations. A typical vocabulary is 32,000-100,000 tokens.
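The token ↔ integer mapping is easiest to see at the character level. A toy character-level tokenizer (the kind Karpathy builds in makemore, not how BPE works):

```python
# Toy character-level tokenizer: the vocabulary is the set of unique characters.
text = "The cat"
chars = sorted(set(text))                     # unique characters = vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("The cat")
assert decode(ids) == "The cat"   # round-trip is lossless
```

A real BPE tokenizer works the same way at the interface (encode to IDs, decode back), just with a learned vocabulary of subword pieces instead of single characters.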
Embeddings: Turning Tokens into Vectors
Each token ID maps to a learned vector – a list of numbers (e.g., 768 dimensions for BERT-base, 4,096 for LLaMA-7B) that encodes meaning.
# Conceptual example (dimensions simplified)
"cat" → [0.23, -0.45, 0.12, ..., 0.89] # 768 numbers
"dog" → [0.21, -0.41, 0.14, ..., 0.85] # similar direction (similar meaning)
"car" → [0.67, 0.12, -0.34, ..., -0.21] # very different direction
These vectors are learned during training. The model discovers that words appearing in similar contexts should have similar vectors. Nobody hand-codes these relationships – they emerge from the next-token prediction objective.
The embedding matrix is a simple lookup table: vocab_size × embedding_dim. For GPT-2, that’s 50,257 tokens × 768 dimensions ≈ 38.6M parameters just for the embedding layer.
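“Lookup table” is meant literally: embedding a token is just indexing a row of a matrix. A numpy sketch with tiny made-up sizes (the numbers are illustrative, and real tables are learned, not random):

```python
import numpy as np

vocab_size, embedding_dim = 5, 4          # toy sizes (GPT-2: 50257 x 768)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))  # learned in practice

token_ids = np.array([2, 0, 3])           # a "sequence" of three token IDs
embedded = embedding_table[token_ids]     # row lookup -> (seq_len, embedding_dim)
assert embedded.shape == (3, 4)
```

During training, gradients flow back into exactly the rows that were looked up, which is how the vectors come to encode meaning.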
The Transformer Architecture
The transformer processes tokens through a stack of identical layers. Each layer has two sub-components: self-attention (tokens exchange information) and a feed-forward network (each token processes its information independently). Critical details that are often glossed over: residual connections, layer normalization, causal masking, and multi-head attention.
Positional Encoding
Attention operates on sets, not sequences – it has no built-in sense of order. "cat ate mouse" and "mouse ate cat" would produce identical attention patterns without positional information.
The solution: add a positional signal to each embedding. The original transformer used fixed sine/cosine functions at different frequencies. GPT-2 uses learned positional embeddings – a second embedding table indexed by position, added element-wise to the token embedding. More recent models like LLaMA instead use rotary position embeddings (RoPE), which inject relative position directly into the attention computation.
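The original fixed sine/cosine scheme fits in a few lines of numpy, following the formula from “Attention Is All You Need” (PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos of the same argument):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each dimension pair (2i, 2i+1) oscillates at its own frequency,
    # giving every position a unique, smoothly varying signature.
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
# In the model, this is added element-wise to the token embeddings:
# x = token_embeddings + positional_encoding
assert pe.shape == (8, 16)
```

A learned positional embedding replaces this fixed table with a trainable one of the same shape, indexed by position instead of token ID.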
Self-Attention: The Core Mechanism
Self-attention lets each token look at every other token in the sequence and decide how much information to gather from each.
For each token, three vectors are computed from its embedding:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
All three are produced by multiplying the token’s embedding by learned weight matrices: Q = x @ W_Q, K = x @ W_K, V = x @ W_V.
The attention computation:
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_Q, W_K, W_V, mask=None):
    Q = x @ W_Q                       # (seq_len, d_k)
    K = x @ W_K                       # (seq_len, d_k)
    V = x @ W_V                       # (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T                  # (seq_len, seq_len) -- pairwise similarity
    scores = scores / math.sqrt(d_k)  # scale to prevent extreme softmax values
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # causal mask
    weights = F.softmax(scores, dim=-1)  # normalize each row to probabilities
    output = weights @ V                 # weighted combination of values
    return output
Why divide by sqrt(d_k)? The dot product of two random vectors with d_k dimensions has variance proportional to d_k. Without scaling, large dimensions produce large dot products, which push softmax into regions with near-zero gradients. Dividing by sqrt(d_k) normalizes the variance back to ~1, keeping softmax in a useful range.
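The variance claim is easy to verify empirically. A quick numpy experiment (sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 1024):
    q = rng.normal(size=(10000, d_k))   # 10000 unit-variance random "queries"
    k = rng.normal(size=(10000, d_k))   # 10000 unit-variance random "keys"
    dots = (q * k).sum(axis=1)          # 10000 sample dot products
    scaled = dots / np.sqrt(d_k)
    print(d_k, round(dots.var(), 1), round(scaled.var(), 2))
    # raw variance grows roughly like d_k; scaled variance stays near 1
```

Without the scaling, a single large score dominates the softmax row, the output weights saturate near one-hot, and the gradients through softmax collapse toward zero.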
Causal Masking: The Critical Detail for GPT-Style Models
This is the part most simplified explanations skip, and it’s fundamental to how autoregressive language models work.
In a GPT-style model, when predicting the next token at position i, the model must not see tokens at positions i+1, i+2, ... – those are the future that hasn’t been generated yet.
This is enforced with a causal mask – a lower-triangular matrix that blocks attention to future positions:
Token: "The" "cat" "is" "sleeping"
"The" ✓ ✗ ✗ ✗
"cat" ✓ ✓ ✗ ✗
"is" ✓ ✓ ✓ ✗
"sleeping" ✓ ✓ ✓ ✓
✓ = can attend, ✗ = masked (set to -inf before softmax, so the weight becomes 0).
This means:
- "The" can only attend to itself
- "cat" can attend to "The" and "cat"
- "sleeping" can attend to all previous tokens
Without this mask, the model could “cheat” during training by looking at the answer it’s supposed to predict. The mask forces each position to predict the next token using only the past – which is exactly how generation works at inference time.
Note: BERT does not use causal masking – it uses bidirectional attention (every token sees every other token). This is why BERT is good for understanding tasks but cannot generate text autoregressively. GPT uses causal masking, which is why it can generate text token by token.
Multi-Head Attention
Transformers don’t run one attention operation – they run multiple attention heads in parallel, each with its own Q, K, V weight matrices.
Head 1: might learn syntactic relationships ("subject → verb")
Head 2: might learn coreference ("he" → "John")
Head 3: might learn positional patterns ("next word" proximity)
...
Head 12: might learn semantic similarity
Each head operates on a smaller dimension: if the model dimension is 768 and there are 12 heads, each head works with 768/12 = 64 dimensions. The outputs of all heads are concatenated and projected back to the full dimension:
# Multi-head attention (simplified): each head has its own W_Q, W_K, W_V
heads = [self_attention(x, W_Q[i], W_K[i], W_V[i]) for i in range(n_heads)]
concat = torch.cat(heads, dim=-1)  # (seq_len, n_heads * d_head) = (seq_len, d_model)
output = concat @ W_O              # project back to d_model
GPT-2 uses 12 heads. GPT-3 uses 96 heads. LLaMA-7B uses 32 heads. The total computation is the same as a single large attention, but multi-head allows the model to attend to different types of relationships simultaneously.
Residual Connections and Layer Normalization
Two mechanisms that are often omitted from explanations but are critical for training deep networks:
Residual connections (skip connections): The output of each sub-layer (attention, feed-forward) is added to its input, not used as a replacement:
x = x + self_attention(layer_norm(x)) # attention with residual
x = x + feed_forward(layer_norm(x)) # FFN with residual
Without residual connections, gradients vanish in deep networks (GPT-3 has 96 layers). The skip connection provides a direct gradient path from the output back to early layers.
Layer normalization: Normalizes the activations to have zero mean and unit variance at each layer. Modern transformers use pre-norm (normalize before attention/FFN) rather than the original paper’s post-norm, because pre-norm is more stable during training.
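Layer normalization itself is only a few lines. A numpy sketch, omitting the learned scale and shift parameters (gamma, beta) that real implementations add:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (last axis) to zero mean,
    # unit variance. eps guards against division by zero.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one token, four features
out = layer_norm(x)
assert abs(out.mean()) < 1e-6           # zero mean
```

Note that, unlike batch norm, each token is normalized independently over its own features, so the operation behaves identically in training and inference.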
Feed-Forward Network
After attention blends information across tokens, a feed-forward network processes each token independently:
def feed_forward(x):
    # expand to the hidden dimension, apply the nonlinearity, project back
    return relu(x @ W_1 + b_1) @ W_2 + b_2
The hidden dimension is typically 4x the model dimension (e.g., 768 → 3072 for GPT-2). This is where the model does per-token “thinking” – transforming the attention-blended representation into a richer one. Recent models use SwiGLU or GeGLU activations instead of ReLU.
Output: Predicting the Next Token
The final layer projects each token’s representation to the vocabulary size and applies softmax:
logits = x @ W_vocab # (seq_len, vocab_size) e.g., (1024, 50257)
probs = softmax(logits, dim=-1)
next_token = sample(probs) # greedy (argmax) or random (temperature sampling)
During training, the loss is cross-entropy between the predicted probability distribution and the actual next token at every position in the sequence.
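Temperature controls how the sampling behaves: the logits are divided by the temperature before softmax, so low values sharpen the distribution toward argmax and high values flatten it. A numpy sketch:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:                 # greedy decoding = pure argmax
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled = scaled - scaled.max()       # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.1])       # toy vocabulary of 3 tokens
next_token = sample_next_token(logits, temperature=0)  # -> 0 (the max logit)
```

Treating temperature 0 as greedy is a common convention; real samplers usually also add top-k or top-p (nucleus) truncation before sampling.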
Training vs. Inference
Training
- Feed a sequence of tokens: ["The", "cat", "is", "sleeping"]
- At each position, the model predicts the next token (using causal masking)
- Compare predictions to actual tokens using cross-entropy loss
- Backpropagate gradients through the entire network
- Update all weights (Q, K, V, embeddings, feed-forward, etc.)
- Repeat for trillions of tokens over weeks on GPU clusters
All positions in the sequence are trained simultaneously (this is why transformers are faster than RNNs – parallel training, not sequential).
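Concretely, the targets are just the inputs shifted left by one position, so one sequence supervises every position at once (token IDs here are made up for illustration):

```python
import numpy as np

# One training sequence of 5 token IDs (values are illustrative).
tokens = np.array([15, 7, 42, 3, 99])

inputs  = tokens[:-1]   # [15,  7, 42,  3]  -- what the model sees
targets = tokens[1:]    # [ 7, 42,  3, 99]  -- what it must predict at each position

# Cross-entropy at one position is -log(probability assigned to the true token).
# E.g., if the model's softmax at position 0 puts p = 0.2 on token 7:
loss_at_pos0 = -np.log(0.2)
# The training loss averages this over all positions (and all sequences).
```

This parallel, all-positions-at-once supervision is exactly what the causal mask makes safe: position i computes its prediction without ever seeing targets[i].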
Inference (Generation)
1. Start with a prompt: "The cat is"
2. Run a forward pass through the transformer
3. Sample the next token from the output distribution: "sleeping"
4. Append to the sequence: "The cat is sleeping"
5. Repeat from step 2
This is why ChatGPT appears to type one word at a time – it is literally generating one token per forward pass, iteratively.
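The generation loop can be sketched abstractly. Here `model` is a hypothetical stand-in for a full transformer forward pass, returning next-token logits over a toy vocabulary:

```python
import numpy as np

def model(token_ids):
    # Hypothetical stand-in for a transformer forward pass: it simply
    # "prefers" the token after the last one, modulo the vocabulary size.
    vocab_size = 10
    logits = np.zeros(vocab_size)
    logits[(token_ids[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(prompt_ids, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = model(ids)               # forward pass over the whole sequence
        next_id = int(np.argmax(logits))  # greedy decoding for simplicity
        ids.append(next_id)               # append and repeat
    return ids

print(generate([3], 4))  # [3, 4, 5, 6, 7]
```

Each new token requires a fresh forward pass over the sequence; production systems cache the per-layer K and V tensors (the "KV cache") so earlier positions aren't recomputed every step.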
Scaling: Why Bigger Models Work Better
The transformer architecture scales predictably:
| Dimension | Small (GPT-2) | Medium (LLaMA-7B) | Large (GPT-4-class) |
|---|---|---|---|
| Parameters | 1.5B | 7B | estimated 200B+ |
| Training tokens | 40B | 1-2T | 10T+ |
| Context length | 1,024 | 4,096 | 128,000+ |
| Training compute | ~$50k | ~$1M | ~$100M+ |
As models scale, they develop capabilities that weren’t explicitly programmed – coherent essay writing, code generation, mathematical reasoning, multilingual translation. Whether these represent genuinely “emergent” discontinuities or smooth capability curves that cross usefulness thresholds is an active area of research (Schaeffer et al., 2023 argue the latter).
What’s not debated: the simple “predict next token” objective, when applied at sufficient scale with enough data, produces remarkably capable systems.
Why Transformers Won
Before transformers (2017), the dominant sequence models were RNNs and LSTMs. They processed tokens sequentially – token 1, then token 2, then token 3. This meant:
- Training couldn’t be parallelized across sequence positions
- Long-range dependencies were hard to learn (information had to survive through every intermediate step)
- Training was slow
Transformers process all tokens in parallel via attention. Every token directly attends to every other token (within the causal mask). Long-range dependencies are just as easy to learn as short-range ones. Training parallelizes perfectly across sequence positions and across GPUs.
The result: transformers scale to sequences and model sizes that were completely impractical with RNNs.
Where to Go Deeper
Andrej Karpathy’s “Neural Networks: Zero to Hero” series builds these concepts from scratch in Python:
- micrograd – backpropagation engine from scratch
- makemore (bigram) – character-level language model, the “spelled-out intro to language modeling”
- makemore (MLP) – multi-layer perceptron language model
- makemore (activations/BatchNorm) – training dynamics and normalization
- makemore (backprop) – manual backpropagation through the network
- Let’s build GPT – full transformer from scratch, the “spelled-out” walkthrough
- Let’s build the GPT Tokenizer – BPE tokenization from scratch
The code is available at github.com/karpathy/nanoGPT and github.com/karpathy/minGPT. The best way to learn: code along with him, line by line.