What is a Tensor?

A multi-dimensional array with two superpowers: GPU acceleration and automatic gradient tracking.

Rank | Name   | Shape     | Example
0-D  | Scalar | ()        | torch.tensor(3.14)
1-D  | Vector | (N,)      | torch.tensor([1, 2, 3])
2-D  | Matrix | (M, N)    | torch.randn(3, 4)
3-D+ | Tensor | (B, M, N) | torch.randn(8, 3, 224, 224) (batch of images)
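The ranks in the table can be checked directly with `.ndim` and `.shape`:

```python
import torch

scalar = torch.tensor(3.14)           # rank 0: no dimensions
vector = torch.tensor([1, 2, 3])      # rank 1: shape (3,)
matrix = torch.randn(3, 4)            # rank 2: shape (3, 4)
batch  = torch.randn(8, 3, 224, 224)  # rank 4: batch of 8 RGB 224x224 images

print(scalar.ndim, vector.ndim, matrix.ndim, batch.ndim)  # → 0 1 2 4
print(batch.shape)                                        # → torch.Size([8, 3, 224, 224])
```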

GPU Support = CUDA Under the Hood

PyTorch → CUDA kernels → cuBLAS/cuDNN → NVIDIA GPU

x = torch.randn(1000, 1000, device='cuda')  # lives on GPU
y = x @ x                                   # matrix multiply on GPU

For large matrices (10,000x10,000 and up), GPU matmul can be 50-100x faster than CPU. At smaller sizes (1,000x1,000), the speedup is more modest (5-20x) because fixed overheads – kernel launches and any CPU-GPU data transfers – are significant relative to the compute, and small matrices cannot fully saturate the GPU.
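A rough way to measure this on your own hardware (a sketch, not a rigorous benchmark – it falls back to CPU when no GPU is present, and CUDA ops are asynchronous, so we synchronize around the timed region):

```python
import time
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
n = 2000  # modest size so this runs anywhere
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

_ = a @ b  # warm-up: first call pays one-time kernel/setup costs
if device == 'cuda':
    torch.cuda.synchronize()

start = time.perf_counter()
c = a @ b
if device == 'cuda':
    torch.cuda.synchronize()  # wait for the GPU to actually finish
print(f"{n}x{n} matmul on {device}: {time.perf_counter() - start:.4f} s")
```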

Never mix devices in one operation:

x.cpu() @ y.cuda()   # → RuntimeError: expected all tensors to be on the same device
x.to(y.device) @ y   # correct way
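A defensive version of the same pattern, runnable whether or not a GPU is present (the CPU fallback is mine, not from the snippet above):

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

x = torch.randn(4, 4)                  # starts on CPU
y = torch.randn(4, 4, device=device)   # may be on GPU

z = x.to(y.device) @ y  # always safe: move x to wherever y lives first
print(z.device)
```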

Autograd = Automatic Backpropagation

You only write the forward pass. PyTorch computes all gradients automatically using the chain rule.

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b          # forward: 3*2 + 1 = 7
loss = (y - 10) ** 2   # (7 - 10)^2 = 9

loss.backward()         # computes ALL gradients

print(w.grad)  # → tensor(-12.)   dloss/dw = 2(y-10) * x = 2(-3)(2) = -12
print(b.grad)  # → tensor(-6.)    dloss/db = 2(y-10) * 1 = 2(-3)(1) = -6

This is automatic differentiation – PyTorch builds a computation graph during the forward pass and walks it backwards to compute gradients.
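You can see that graph directly: every tensor produced by an operation carries a grad_fn object pointing at the operation that created it, and those objects link back to the leaves:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
loss = (y - 10) ** 2

# Each intermediate remembers the op that produced it
print(type(loss.grad_fn).__name__)  # → PowBackward0
print(type(y.grad_fn).__name__)     # → AddBackward0

# grad_fn nodes chain backwards toward the leaf tensors
print(loss.grad_fn.next_functions)
```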

Where Are Intermediate Values Stored?

During training, PyTorch saves intermediate results for the backward pass. Understanding where they live is critical for managing GPU memory:

What                                             | Where                                  | Size
Activations (forward pass outputs saved for backward) | GPU memory (same device as the tensor) | Large – the main VRAM consumer
Computation graph nodes (grad_fn objects)        | CPU memory                             | Tiny (just metadata and pointers)
Saved tensors referenced by graph nodes          | GPU memory                             | Large – these are the activations above
Gradients (.grad)                                | Same device as the parameter           | Same size as the parameters

out = x @ w            # 5000×5000 result saved on GPU for backward
out.mean().backward()  # uses saved tensor to compute gradients, then frees it

This is why large models consume 20-100+ GB VRAM during training – the forward pass must save activations at every layer for the backward pass to use.
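This also explains why inference is so much cheaper than training: wrapping the forward pass in torch.no_grad() tells autograd to build no graph and save no intermediates. A small illustration:

```python
import torch

x = torch.randn(256, 256)
w = torch.randn(256, 256, requires_grad=True)

out_train = (x @ w).relu()       # graph built; matmul result saved for backward
print(out_train.grad_fn)         # a ReluBackward0 node → graph exists

with torch.no_grad():
    out_infer = (x @ w).relu()   # no graph, nothing saved
print(out_infer.grad_fn)         # → None
```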

One Training Step = One Full Backpropagation

for batch in dataloader:
    optimizer.zero_grad()   # clear previous gradients
    pred = model(batch)     # forward pass
    loss = criterion(pred, target)
    loss.backward()         # backward pass (compute all gradients)
    optimizer.step()        # update parameters
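The loop above uses placeholder names (model, criterion, dataloader). A tiny self-contained version – the linear-regression setup here is illustrative, not from the original:

```python
import torch

torch.manual_seed(0)

# Toy data: learn y = 2x + 1
X = torch.randn(64, 1)
Y = 2 * X + 1

model = torch.nn.Linear(1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    optimizer.zero_grad()          # clear previous gradients
    pred = model(X)                # forward pass
    loss = criterion(pred, Y)
    loss.backward()                # backward pass (compute all gradients)
    optimizer.step()               # update parameters

print(f"final loss: {loss.item():.6f}")                    # near 0
print(model.weight.item(), model.bias.item())              # near 2 and 1
```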

Real models perform hundreds of thousands to millions of these steps:

Model          | Parameters | Training tokens                | Approx. training steps
BERT-base      | 110M       | 3.3B (BooksCorpus + Wikipedia) | ~1M (900k @ seq_len 128, then 100k @ seq_len 512)
LLaMA 7B       | 7B         | 1T                             | ~250k
LLaMA 2 7B     | 7B         | 2T                             | ~500k
LLaMA 3.1 405B | 405B       | 15.6T                          | ~500k-1.5M (exact count not published; large batch sizes reduce step count)

Note: Step count depends on batch size. Larger batch sizes mean fewer steps for the same number of tokens. LLaMA 3.1 405B uses massive batch sizes (up to 16M tokens per step), which is why the step count is lower than you might expect for 15.6T tokens.
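The relationship is simple arithmetic: steps = total tokens / tokens per step. The 4M-token batch for LLaMA 2 matches its paper; treat the figures here as rough checks of the table, not official counts:

```python
def training_steps(total_tokens: float, tokens_per_step: float) -> float:
    """Number of optimizer steps needed to consume a token budget."""
    return total_tokens / tokens_per_step

# LLaMA 2 7B: 2T tokens at ~4M tokens per step
print(f"{training_steps(2e12, 4e6):,.0f}")     # → 500,000

# LLaMA 3.1 405B: 15.6T tokens at up to 16M tokens per step
print(f"{training_steps(15.6e12, 16e6):,.0f}")  # → 975,000
```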

Each step involves one complete backpropagation through the entire network – computing gradients for every parameter simultaneously.

Why PyTorch Won

  • Dynamic computation graph – write normal Python, debug with normal tools. No graph compilation step (unlike TensorFlow 1.x).
  • Autograd computes perfect gradients automatically – no manual derivatives, ever.
  • Full GPU acceleration with a clean API.
  • Dominant in research – the vast majority of ML papers and large model training runs use PyTorch (with JAX gaining ground at Google/DeepMind).

Quick Memory Demo

Copy-paste this to see GPU memory behavior during forward and backward passes:

import torch

x = torch.randn(5000, 5000, device='cuda')
w = torch.randn(5000, 5000, device='cuda', requires_grad=True)

print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # ~0.20 GB (two 5000x5000 tensors)

out = x @ w       # matmul result: 5000x5000 = 100MB, saved for backward
out = out.relu()   # relu output: another ~100MB saved

print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # ~0.40 GB (+200MB from intermediates)

out.mean().backward()  # uses saved tensors, then frees them

print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")  # back down (intermediates freed)

The ~200MB spike during forward pass is autograd saving the matmul result and relu output for the backward pass. After backward() completes, these intermediates are freed.

Key Takeaways

  1. Tensors are arrays that can live on GPU and track gradients.
  2. Autograd builds a computation graph during forward pass and walks it backwards for gradients – you never write derivative math.
  3. GPU memory during training is dominated by saved activations (intermediates), not model weights.
  4. Training is millions of forward-backward-update cycles, each computing gradients for every parameter.
  5. PyTorch won because it lets you write normal Python while handling the hard parts (differentiation, GPU dispatch) automatically.