Matrix Calculus
The Central Question: How Do We Take Derivatives When the Variables Are Matrices?
Machine learning loss functions often involve matrix operations: traces, determinants, inverses, and products. Standard scalar calculus rules do not directly apply. Matrix calculus provides the rules and notation for differentiating scalar, vector, and matrix expressions with respect to vector and matrix variables.
Consider these scenarios:
- The least squares solution $\hat{w} = (X^\top X)^{-1} X^\top y$ comes from differentiating $\|y - Xw\|^2$ with respect to the vector $w$. This requires knowing $\frac{\partial}{\partial w} \|y - Xw\|^2 = 2X^\top(Xw - y)$.
- The Gaussian log-likelihood involves $\log\det\Sigma$ and $\Sigma^{-1}$. Optimizing over $\Sigma$ requires derivatives of traces, determinants, and inverses.
- Layout conventions (numerator vs denominator) cause persistent confusion. Choosing and sticking to a convention is essential.
Matrix calculus is the language that connects ML loss functions to their gradients.
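The least-squares gradient mentioned above can be checked numerically. This is a minimal sketch with randomly generated illustrative data (the shapes and seed are assumptions, not from the text), comparing the closed-form gradient $2X^\top(Xw - y)$ against central finite differences:

```python
import numpy as np

# Verify d/dw ||y - Xw||^2 = 2 X^T (Xw - y) by finite differences.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
y = rng.standard_normal(10)
w = rng.standard_normal(3)

def loss(w):
    r = y - X @ w
    return r @ r

analytic = 2 * X.T @ (X @ w - y)

# Central differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(3):
    e = np.zeros(3)
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```

A check like this is a useful habit whenever a hand-derived matrix gradient is used in code.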
Topics to Cover
Derivatives of Trace Expressions
- Using the trace trick: $\frac{\partial}{\partial X}\operatorname{tr}(AX) = A^\top$ and $\frac{\partial}{\partial X}\operatorname{tr}(X^\top A X) = (A + A^\top)X$, together with the cyclic property $\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$
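The identity $\partial\operatorname{tr}(AX)/\partial X = A^\top$ can be verified entrywise with finite differences. A minimal sketch, using random illustrative matrices (an assumption of this example):

```python
import numpy as np

# Check d tr(AX)/dX = A^T entrywise via central finite differences.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4))

def f(X):
    return np.trace(A @ X)

eps = 1e-6
grad = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = eps  # perturb a single entry of X
        grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad, A.T, atol=1e-8)
```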
Derivatives of Determinant and Inverse
- Jacobi's formula: $d\det X = \det(X)\,\operatorname{tr}(X^{-1}\,dX)$, hence $\frac{\partial}{\partial X}\log\det X = (X^{-1})^\top$
- Differential of the inverse: $d(X^{-1}) = -X^{-1}\,(dX)\,X^{-1}$
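Both differentials can be checked numerically to first order. A sketch with a randomly generated, diagonally shifted matrix (the shift is an assumption to keep $X$ well conditioned and invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
dX = 1e-6 * rng.standard_normal((n, n))          # small perturbation

# d(X^{-1}) = -X^{-1} dX X^{-1}, up to second-order terms in dX
lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)
assert np.allclose(lhs, rhs, atol=1e-9)

# Jacobi's formula: d log det X = tr(X^{-1} dX), up to second order
dlogdet = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
assert np.isclose(dlogdet, np.trace(np.linalg.inv(X) @ dX), atol=1e-9)
```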
Layout Conventions
- Numerator layout (Jacobian convention): for $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, $\frac{\partial y}{\partial x}$ has shape $m \times n$
- Denominator layout: transpose of numerator layout
- The importance of choosing one convention and being consistent
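The shape consequences of the numerator layout are easy to demonstrate for a linear map $y = Ax$, whose Jacobian is simply $A$. A minimal sketch with illustrative random inputs (the sizes are assumptions):

```python
import numpy as np

# Numerator-layout Jacobian of y = A x: row i holds dy_i, column j holds dx_j,
# so the Jacobian has shape (m, n) and equals A itself.
rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

eps = 1e-6
J = np.zeros((m, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * eps)

assert J.shape == (m, n)           # numerator layout: outputs index rows
assert np.allclose(J, A, atol=1e-6)
# Denominator layout would store the transpose, with shape (n, m).
```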
Common Identities Cookbook
- Compilation of the most-used matrix calculus identities
- Strategies for deriving new identities using differentials
- Cross-reference to the Matrix Cookbook
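As an illustration of the differential strategy for deriving new identities, one can recover $\partial\operatorname{tr}(X^\top A X)/\partial X$ by writing the differential and using $\operatorname{tr}(M^\top dX)$ to read off the gradient:

```latex
f(X) = \operatorname{tr}(X^\top A X), \qquad
df = \operatorname{tr}(dX^\top A X) + \operatorname{tr}(X^\top A\, dX)
   = \operatorname{tr}\!\big((A X + A^\top X)^\top dX\big)
\;\Longrightarrow\;
\frac{\partial f}{\partial X} = (A + A^\top)\, X.
```

The pattern is general: manipulate $df$ into the canonical form $\operatorname{tr}(M^\top dX)$, and the gradient is $M$.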
Summary
Answering the Central Question: Matrix calculus extends differentiation to expressions involving matrices and vectors. The key identities ($\frac{\partial}{\partial X}\operatorname{tr}(AX) = A^\top$, $\frac{\partial}{\partial X}\log\det X = (X^{-1})^\top$, $d(X^{-1}) = -X^{-1}(dX)X^{-1}$) are used repeatedly in ML derivations. The differential approach (computing $df$ in terms of $dX$, then reading off the derivative) is often cleaner than direct computation. Consistent use of layout conventions avoids transpose errors.
Applications in Data Science and Machine Learning
- Deriving the normal equations: Differentiating $\|y - Xw\|^2$ and setting the gradient to zero gives $X^\top X w = X^\top y$
- Gaussian MLE: Differentiating the log-likelihood with respect to $\mu$ and $\Sigma$ yields the sample mean and covariance
- Fisher information: Computing the expected Hessian of the log-likelihood requires matrix derivatives
- Attention mechanism gradients: The softmax-weighted matrix products in transformers require matrix calculus to differentiate
- Matrix factorization: NMF and PMF objective gradients involve derivatives of matrix norms and traces
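The first application above can be made concrete: solving the normal equations $X^\top X w = X^\top y$ produces a point where the gradient vanishes, and it agrees with a library least-squares solver. A sketch with random illustrative data (an assumption of this example):

```python
import numpy as np

# Solve the normal equations X^T X w = X^T y for random data.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)

w = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient 2 X^T (Xw - y) vanishes at the solution...
grad = 2 * X.T @ (X @ w - y)
assert np.allclose(grad, 0, atol=1e-10)

# ...and the solution matches NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)
```

In practice one solves via `lstsq` or a QR/Cholesky factorization rather than forming $X^\top X$ explicitly, which squares the condition number.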
Guided Problems
References
- Petersen and Pedersen - The Matrix Cookbook
- MIT 18.S096 - Matrix Calculus for Machine Learning and Beyond
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Sections 5.3-5.4
- Magnus and Neudecker - Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed.