What is a Tensor?
A multi-dimensional array with two superpowers: GPU acceleration and automatic gradient tracking.
| Rank | Name | Shape | Example |
|---|---|---|---|
| 0D | Scalar | () | torch.tensor(3.14) |
| 1D | Vector | (N,) | torch.tensor([1, 2, 3]) |
| 2D | Matrix | (M, N) | torch.randn(3, 4) |
| 3D+ | Tensor | (B, M, N) | torch.randn(8, 3, 224, 224) (batch of images) |
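Rank and shape can be read off any tensor with `.dim()` and `.shape` (standard tensor API):

```python
import torch

s = torch.tensor(3.14)               # rank 0: scalar
v = torch.tensor([1, 2, 3])          # rank 1: vector
m = torch.randn(3, 4)                # rank 2: matrix
imgs = torch.randn(8, 3, 224, 224)   # rank 4: batch of images

print(s.dim(), s.shape)      # 0 torch.Size([])
print(v.dim(), v.shape)      # 1 torch.Size([3])
print(m.dim(), m.shape)      # 2 torch.Size([3, 4])
print(imgs.dim(), imgs.shape)  # 4 torch.Size([8, 3, 224, 224])
```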
GPU Support = CUDA Under the Hood
PyTorch → CUDA kernels → cuBLAS/cuDNN → NVIDIA GPU
```python
x = torch.randn(1000, 1000, device='cuda')  # lives on GPU
y = x @ x                                   # matrix multiply on GPU
```
For large matrices (10,000x10,000+), GPU matmul can be 50-100x faster than CPU. At smaller sizes (1,000x1,000), the speedup is more modest (5-20x) because data transfer overhead is significant relative to the compute.
Never mix devices in one operation:
```python
x.cpu() @ y.cuda()  # RuntimeError: expected all tensors to be on the same device
x.to(y.device) @ y  # correct: move x to y's device first
```
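A common device-agnostic pattern is to pick one device up front and move tensors onto it explicitly. A minimal sketch (falls back to CPU when no GPU is present):

```python
import torch

# Pick the best available device once, then move everything onto it.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(4, 4)                 # created on CPU
y = torch.randn(4, 4, device=device)  # created on the chosen device

# Move x to y's device before combining them.
z = x.to(y.device) @ y
print(z.device)  # matches y.device
```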
Autograd = Automatic Backpropagation
You only write the forward pass. PyTorch computes all gradients automatically using the chain rule.
```python
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b         # forward: 3*2 + 1 = 7
loss = (y - 10) ** 2  # (7 - 10)^2 = 9

loss.backward()       # computes ALL gradients
print(w.grad)         # → tensor(-12.)  dloss/dw = 2(y-10) * x = 2(-3)(2) = -12
print(b.grad)         # → tensor(-6.)   dloss/db = 2(y-10) * 1 = 2(-3)(1) = -6
```
This is automatic differentiation – PyTorch builds a computation graph during the forward pass and walks it backwards to compute gradients.
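The graph is visible in code: every intermediate tensor carries a `grad_fn` recording the operation that produced it. The same toy example, extended to inspect the graph and take one more gradient:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b          # forward: 7
loss = (y - 10) ** 2   # 9

# Each intermediate records the op that produced it.
print(y.grad_fn)       # e.g. <AddBackward0 ...>
print(loss.grad_fn)    # e.g. <PowBackward0 ...>

loss.backward()
print(x.grad)          # → tensor(-18.)  dloss/dx = 2(y-10) * w = 2(-3)(3) = -18
```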
Where Are Intermediate Values Stored?
During training, PyTorch saves intermediate results for the backward pass. Understanding where they live is critical for managing GPU memory:
| What | Where | Size |
|---|---|---|
| Activations (forward-pass outputs saved for backward) | GPU memory (same device as the tensor) | Large – this is the main VRAM consumer |
| Computation graph nodes (grad_fn objects) | CPU memory | Tiny (just metadata and pointers) |
| Saved tensors referenced by graph nodes | GPU memory | Large – these are the activations above |
| Gradients (.grad) | Same device as the parameter | Same size as the parameters |
```python
out = x @ w            # 5000×5000 result saved on GPU for backward
out.mean().backward()  # uses the saved tensor to compute gradients, then frees it
```
This is why large models consume 20-100+ GB VRAM during training – the forward pass must save activations at every layer for the backward pass to use.
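The scale is easy to sanity-check with arithmetic. The shapes below are hypothetical stand-ins, not any specific model's published config, and the count deliberately ignores attention scores and MLP intermediates, which add several multiples more:

```python
# Rough activation-memory estimate for one transformer forward pass.
# All numbers are illustrative assumptions.
batch, seq_len, hidden, layers = 8, 4096, 8192, 80
bytes_per_value = 2  # bf16

# One saved hidden-state tensor per layer (a deliberate underestimate).
activation_bytes = batch * seq_len * hidden * layers * bytes_per_value
print(f"{activation_bytes / 1e9:.1f} GB")  # → 42.9 GB, before attention/MLP extras
```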
One Training Step = One Full Backpropagation
```python
for inputs, target in dataloader:
    optimizer.zero_grad()           # clear previous gradients
    pred = model(inputs)            # forward pass
    loss = criterion(pred, target)  # compute loss
    loss.backward()                 # backward pass (compute all gradients)
    optimizer.step()                # update parameters
```
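The loop runs end to end on a toy problem. The data, model, and hyperparameters below are illustrative choices, not from the text: fitting y = 2x + 1 with a single linear layer, using full-batch steps instead of a dataloader:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 1)
y = 2 * x + 1                     # ground truth: weight 2, bias 1

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):              # 200 training steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(model.weight.item(), model.bias.item())  # approaches 2.0 and 1.0
```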
Real models perform hundreds of thousands to millions of these steps:
| Model | Parameters | Training tokens | Approx. training steps |
|---|---|---|---|
| BERT-base | 110M | 3.3B (BooksCorpus + Wikipedia) | ~1M (900k @ seq_len 128, then 100k @ seq_len 512) |
| LLaMA 7B | 7B | 1T | ~250k |
| LLaMA 2 7B | 7B | 2T | ~500k |
| LLaMA 3.1 405B | 405B | 15.6T | ~500k-1.5M (exact count not published; large batch sizes reduce step count) |
Note: Step count depends on batch size. Larger batch sizes mean fewer steps for the same number of tokens. LLaMA 3.1 405B uses massive batch sizes (up to 16M tokens per step), which is why the step count is lower than you might expect for 15.6T tokens.
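The steps ≈ tokens / (tokens per step) relationship from the note can be checked directly, using the 15.6T and 16M figures quoted above:

```python
# Back-of-envelope step count for LLaMA 3.1 405B, from the figures above.
tokens_total = 15.6e12     # total training tokens
tokens_per_step = 16e6     # max batch size in tokens, per step
steps = tokens_total / tokens_per_step
print(f"{steps:,.0f} steps")  # → 975,000 steps
```

975k lands inside the ~500k-1.5M range quoted in the table, since smaller batch sizes earlier in training push the true count higher.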
Each step involves one complete backpropagation through the entire network – computing gradients for every parameter simultaneously.
Why PyTorch Won
- Dynamic computation graph – write normal Python, debug with normal tools. No graph compilation step (unlike TensorFlow 1.x).
- Autograd computes exact gradients (up to floating-point precision) automatically – no manual derivatives, ever.
- Full GPU acceleration with a clean API.
- Dominant in research – the vast majority of ML papers and large model training runs use PyTorch (with JAX gaining ground at Google/DeepMind).
Quick Memory Demo
Copy-paste this to see GPU memory behavior during forward and backward passes:
```python
import torch

x = torch.randn(5000, 5000, device='cuda')
w = torch.randn(5000, 5000, device='cuda', requires_grad=True)
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # ~0.20 GB (two 5000x5000 tensors)

out = x @ w       # matmul result: 5000x5000 = 100MB, saved for backward
out = out.relu()  # relu output: another ~100MB saved
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # ~0.40 GB (+200MB from intermediates)

out.mean().backward()  # uses saved tensors, then frees them
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # back down (intermediates freed)
```
The ~200MB spike during forward pass is autograd saving the matmul result and relu output for the backward pass. After backward() completes, these intermediates are freed.
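At inference time you can avoid the spike entirely: under `torch.no_grad()` autograd builds no graph and saves no intermediates. A CPU-sized sketch of the same idea:

```python
import torch

x = torch.randn(100, 100, requires_grad=True)

# Under no_grad, autograd records nothing: no graph, no saved activations.
with torch.no_grad():
    out = (x @ x).relu()

print(out.requires_grad, out.grad_fn)  # → False None
```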
Key Takeaways
- Tensors are arrays that can live on GPU and track gradients.
- Autograd builds a computation graph during forward pass and walks it backwards for gradients – you never write derivative math.
- GPU memory during training is dominated by saved activations (intermediates), not model weights.
- Training is millions of forward-backward-update cycles, each computing gradients for every parameter.
- PyTorch won because it lets you write normal Python while handling the hard parts (differentiation, GPU dispatch) automatically.