Automatic Differentiation
The Central Question: How Do Computers Compute Exact Derivatives Efficiently?
We know how to differentiate functions by hand using the chain rule. But for a function with millions of parameters and thousands of operations, we need an algorithm. Automatic differentiation (AD) computes exact derivatives (not numerical approximations) by systematically applying the chain rule to a computational graph.
Consider these scenarios:
- A deep neural network with 100 layers computes a scalar loss. Reverse-mode AD (backpropagation) computes the gradient with respect to all parameters in roughly the same time as one forward pass.
- Forward-mode AD computes the derivative with respect to one input efficiently. This is useful when we have few inputs and many outputs (e.g., computing the Jacobian column by column).
- JAX, PyTorch, and TensorFlow all implement AD as their core differentiation engine, enabling `jax.grad`, `loss.backward()`, and `tf.GradientTape`.
AD is the computational engine behind modern deep learning.
Topics to Cover
Forward-Mode AD and Dual Numbers
- Dual numbers: $a + b\varepsilon$, where $\varepsilon^2 = 0$
- Evaluating $f(x + \varepsilon)$ gives $f(x) + f'(x)\varepsilon$ automatically
- Forward-mode propagates derivatives alongside function values
- Computes one column of the Jacobian per pass (Jacobian-vector product, JVP)
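The dual-number idea above can be implemented in a few lines of plain Python. This is a minimal sketch, not a real library API: the names `Dual`, `deriv`, and `sin` are illustrative. Each elementary operation propagates both a value and a tangent, so the product rule and chain rule fall out of the arithmetic.

```python
import math

# Minimal forward-mode AD via dual numbers a + b*eps, with eps^2 = 0.
# All names (Dual, deriv, sin) are illustrative, not a library API.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative (tangent) carried alongside

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps  -- the product rule
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(x):
    # Elementary functions propagate value and derivative together
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def deriv(f, x):
    # Seed the tangent with 1.0; the eps-coefficient of the result is f'(x)
    return f(Dual(x, 1.0)).dot

# d/dx [x * sin(x)] at x = 2 equals sin(2) + 2*cos(2)
print(deriv(lambda x: x * sin(x), 2.0))
```

Seeding the tangent with a basis vector instead of 1.0 is exactly what computes one Jacobian column per pass.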
Reverse-Mode AD and Backpropagation
- The adjoint method: propagate sensitivities backward from outputs to inputs
- Computes one row of the Jacobian per pass (vector-Jacobian product, VJP)
- For $f: \mathbb{R}^n \to \mathbb{R}$, reverse mode computes the full gradient at the cost of $O(1)$ forward passes (independent of $n$)
- Memory cost: must store intermediate values from the forward pass
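The points above can be sketched as a tiny reverse-mode engine: each operation records its parents and local partial derivatives (the stored intermediates that account for the memory cost), and a backward sweep in reverse topological order accumulates sensitivities. The names `Var` and `backward` are illustrative, not a real library API.

```python
# A tiny reverse-mode AD sketch. Names (Var, backward) are illustrative.
class Var:
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents  # (parent_var, local_partial) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # Local partials of a*b are b (w.r.t. a) and a (w.r.t. b)
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    """Propagate sensitivities from the output back to every input."""
    out.grad = 1.0
    # Build a topological order of the graph by depth-first search
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(out)
    # Sweep in reverse topological order, applying the chain rule
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += v.grad * local

x, y = Var(3.0), Var(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
backward(z)
print(x.grad, y.grad)
```

One backward sweep produces the sensitivity of the scalar output with respect to every input at once, which is why this scales to millions of parameters.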
Computational Graphs
- Representing functions as directed acyclic graphs (DAGs)
- Nodes are operations, edges carry values and gradients
- Topological sort for forward evaluation, reverse topological order for backpropagation
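To make the DAG picture concrete, here is a sketch of a computational graph as explicit data, evaluated in topological order. The `(op, inputs)` representation and the function names are illustrative choices, not any framework's internal format.

```python
# A computational graph for z = (x + y) * x as an explicit DAG.
# The (op, inputs) node encoding is illustrative.
graph = {
    "x": ("input", []),
    "y": ("input", []),
    "s": ("add", ["x", "y"]),
    "z": ("mul", ["s", "x"]),
}

def toposort(graph):
    # Depth-first search yields nodes in dependency order
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for m in graph[n][1]:
                visit(m)
            order.append(n)
    for n in graph:
        visit(n)
    return order

def evaluate(graph, inputs):
    # Forward pass: every node's inputs are computed before the node itself
    vals = {}
    for n in toposort(graph):
        op, args = graph[n]
        if op == "input":
            vals[n] = inputs[n]
        elif op == "add":
            vals[n] = vals[args[0]] + vals[args[1]]
        elif op == "mul":
            vals[n] = vals[args[0]] * vals[args[1]]
    return vals

print(evaluate(graph, {"x": 3.0, "y": 4.0})["z"])  # (3 + 4) * 3 = 21
```

Backpropagation simply walks this same order in reverse, which is why frameworks build (or trace) exactly this kind of graph.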
JVP vs VJP
- JVP ($v \mapsto Jv$): efficient for tall Jacobians (few inputs, many outputs)
- VJP ($u \mapsto u^\top J$): efficient for wide Jacobians (many inputs, few outputs)
- Why ML uses reverse-mode: loss functions map many parameters to one scalar
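The row/column distinction can be checked directly on a small function with a hand-written Jacobian. This sketch uses NumPy and an explicitly coded $J$; in practice, AD computes $Jv$ and $u^\top J$ without ever materializing $J$.

```python
import numpy as np

# f: R^2 -> R^3, so the Jacobian J is 3x2 (tall: more outputs than inputs)
def f(x):
    return np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])

def jacobian(x):
    # Hand-derived Jacobian of f, for checking
    return np.array([[x[1],       x[0]],
                     [2.0 * x[0], 0.0],
                     [0.0,        np.cos(x[1])]])

x = np.array([1.0, 2.0])
J = jacobian(x)

# JVP: seeding with an input basis vector extracts one COLUMN of J
v = np.array([1.0, 0.0])
print(J @ v)      # equals J[:, 0]

# VJP: seeding with an output basis vector extracts one ROW of J
u = np.array([1.0, 0.0, 0.0])
print(u @ J)      # equals J[0, :]
```

For a loss $L: \mathbb{R}^n \to \mathbb{R}$, the Jacobian is a single row, so one VJP with $u = 1$ yields the entire gradient; forward mode would need $n$ JVPs.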
Summary
Answering the Central Question: Automatic differentiation computes exact derivatives by decomposing functions into elementary operations and applying the chain rule systematically. Forward-mode (JVP) propagates derivatives forward and is efficient when there are few inputs. Reverse-mode (VJP) propagates sensitivities backward and is efficient when there are few outputs, making it ideal for ML where we differentiate a scalar loss with respect to many parameters. Backpropagation is precisely reverse-mode AD applied to neural network computational graphs.
Applications in Data Science and Machine Learning
- Backpropagation: Reverse-mode AD applied to neural networks, computing $\partial L / \partial \theta$ for all parameters $\theta$
- Higher-order derivatives: AD can be applied recursively to compute Hessians, Hessian-vector products, and higher-order derivatives
- Differentiable programming: AD enables gradient-based optimization of any differentiable program (physics simulators, renderers, ODE solvers)
- Neural ODEs: Forward and adjoint sensitivity methods for continuous-depth networks
- Meta-learning: Computing gradients of gradients (second-order) for MAML and related algorithms
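The "AD applied recursively" idea can be shown with nested forward mode: feeding dual numbers whose components are themselves dual numbers yields second derivatives. The `Dual` class below is a self-contained illustrative sketch, not a library API.

```python
# Second derivatives by applying forward-mode AD to itself
# (forward-over-forward). The Dual class is illustrative.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def deriv(f, x):
    # Works even when x is itself a Dual, so deriv nests cleanly
    return f(Dual(x, 1.0)).dot

def second_deriv(f, x):
    # Differentiate the derivative: d/dx [f'(x)]
    return deriv(lambda t: deriv(f, t), x)

# f(x) = x^3 has f''(x) = 6x, so f''(2) = 12
print(second_deriv(lambda x: x * x * x, 2.0))
```

The same nesting trick (usually forward-over-reverse) is how frameworks compute Hessian-vector products without forming the Hessian.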
Guided Problems
References
- Baydin et al. - Automatic Differentiation in Machine Learning: a Survey (2018)
- Goodfellow, Bengio, and Courville - Deep Learning, Chapter 6.5
- MIT 18.S096 - Matrix Calculus for Machine Learning and Beyond
- Griewank and Walther - Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed.