Jacobians and Hessians
The Central Question: How Do We Capture First-Order and Second-Order Behavior of Multivariate Functions?
The gradient tells us the first-order behavior of a scalar function. But what about vector-valued functions, or settings where we need second-order information? The Jacobian generalizes the gradient to vector-valued maps, and the Hessian captures the curvature of scalar functions. Together they give us the tools to analyze transformations and to optimize using more than first-order information.
Consider these scenarios:
- A neural network layer maps $\mathbb{R}^n \to \mathbb{R}^m$. Its Jacobian is the $m \times n$ matrix of all partial derivatives, encoding how each output depends on each input.
- Newton's method uses the Hessian (second derivatives) to take smarter optimization steps than gradient descent, accounting for curvature.
- In normalizing flows, the Jacobian determinant measures how a transformation stretches or compresses probability density.
The Jacobian and Hessian are the workhorses of multivariate analysis in ML.
Topics to Cover
The Jacobian Matrix
- Definition for $f: \mathbb{R}^n \to \mathbb{R}^m$: $(Jf)_{ij} = \dfrac{\partial f_i}{\partial x_j}$, so $Jf \in \mathbb{R}^{m \times n}$
- Relationship to the gradient: for scalar-valued $f: \mathbb{R}^n \to \mathbb{R}$, $Jf = (\nabla f)^\top$
- The Jacobian as a linear approximation: $f(x + \delta) \approx f(x) + Jf(x)\,\delta$ for small $\delta$
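The bullets above can be made concrete numerically. The sketch below (the helper name `jacobian_fd` and the example map are illustrative, not from the text) approximates a Jacobian by central finite differences and checks it against the analytic partial derivatives:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Approximate the Jacobian of f: R^n -> R^m at x via central
    finite differences (illustrative helper, not a library routine)."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        # Column j holds the partials of every output w.r.t. input j
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

# Example map f: R^2 -> R^2, f(x) = (x0 * x1, sin(x0))
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x0 = np.array([1.0, 2.0])
J = jacobian_fd(f, x0)

# Analytic Jacobian: [[x1, x0], [cos(x0), 0]]
J_true = np.array([[2.0, 1.0], [np.cos(1.0), 0.0]])
print(np.allclose(J, J_true, atol=1e-5))  # True
```

In practice an autodiff library (e.g. `jax.jacobian`) would replace the finite-difference loop, but the column-by-column structure is the same.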
The Hessian Matrix
- Definition for $f: \mathbb{R}^n \to \mathbb{R}$: $H_{ij} = \dfrac{\partial^2 f}{\partial x_i \, \partial x_j}$, so $H \in \mathbb{R}^{n \times n}$
- Symmetry of the Hessian (Schwarz's theorem)
- Positive/negative definiteness and its meaning for convexity
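A short numerical sketch of these three bullets (the helper `hessian_fd` and the quadratic example are assumptions for illustration): approximate the Hessian by finite differences, then confirm it is symmetric and positive definite for a convex quadratic.

```python
import numpy as np

def hessian_fd(f, x, eps=1e-4):
    """Approximate the Hessian of a scalar f: R^n -> R at x using the
    standard four-point second-difference formula (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Convex quadratic f(x) = x^T A x with symmetric positive definite A
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: x @ A @ x
H = hessian_fd(f, np.array([0.3, -0.7]))

# Hessian of x^T A x is 2A: symmetric (Schwarz) with positive
# eigenvalues (positive definite), consistent with convexity
print(np.allclose(H, 2 * A, atol=1e-3))          # True
print(np.all(np.linalg.eigvalsh(H) > 0))         # True
```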
Second-Order Conditions
- Necessary conditions: at a local minimum, $\nabla f(x^*) = 0$ and $H(x^*) \succeq 0$
- Sufficient conditions: $\nabla f(x^*) = 0$ and $H(x^*) \succ 0$ together guarantee a strict local minimum
- Saddle points: $\nabla f(x^*) = 0$ with an indefinite Hessian (both positive and negative eigenvalues)
Change of Variables
- The Jacobian determinant in integration: for an invertible substitution $y = g(x)$, $\int f(y)\,dy = \int f(g(x))\,\lvert\det Jg(x)\rvert\,dx$
- Connection to determinants and volume scaling
- Application to probability density transformation
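A one-dimensional sketch of the density-transformation bullet, assuming the standard change-of-variables formula $p_y(y) = p_x(g^{-1}(y))\,\lvert\det J_{g^{-1}}(y)\rvert$ (the variable names and the log-normal example are illustrative): push a standard normal through $g(x) = e^x$ and check the transformed density still integrates to 1.

```python
import numpy as np

# x ~ N(0, 1) pushed through the invertible map g(x) = exp(x)
p_x = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
g_inv = lambda y: np.log(y)       # inverse map g^{-1}
jac_g_inv = lambda y: 1.0 / y     # d(log y)/dy: the 1x1 Jacobian of g^{-1}

# Change of variables: p_y(y) = p_x(g^{-1}(y)) * |Jacobian of g^{-1}|
p_y = lambda y: p_x(g_inv(y)) * np.abs(jac_g_inv(y))  # log-normal density

# Sanity check: the Jacobian factor keeps total probability at 1
y = np.linspace(1e-6, 50.0, 200_000)
total = np.sum(p_y(y)) * (y[1] - y[0])  # Riemann sum over [~0, 50]
print(total)  # close to 1.0
```

Without the $|1/y|$ factor the "density" would not normalize; this is exactly the correction normalizing flows apply via $\log\lvert\det J\rvert$.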
Summary
Answering the Central Question: The Jacobian matrix captures all first-order partial derivatives of a vector-valued function, generalizing the gradient and serving as the best linear approximation. The Hessian matrix captures second-order partial derivatives of a scalar function, encoding curvature. Together they enable Newton's method ($x_{k+1} = x_k - H^{-1}\nabla f(x_k)$), second-order convergence analysis, saddle point detection, and probability density transformation via the Jacobian determinant.
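The Newton update in the summary can be sketched in a few lines. This is a minimal, undamped version (no line search or Hessian regularization, and the function name `newton_minimize` is my own); for a quadratic it converges in a single step:

```python
import numpy as np

def newton_minimize(grad, hess, x0, steps=20):
    """Pure Newton iteration x_{k+1} = x_k - H(x_k)^{-1} grad(x_k).
    Minimal sketch: no line search, damping, or convergence test."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        # Solve H d = grad instead of forming H^{-1} explicitly
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A.
# Newton reaches the minimizer x* = A^{-1} b in exactly one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2), steps=1)
print(np.allclose(A @ x_star, b))  # True
```

For non-quadratic objectives the same update gives the quadratic local convergence mentioned under Applications, provided the Hessian stays positive definite near the optimum.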
Applications in Data Science and Machine Learning
- Newton's method and second-order optimizers: The Hessian enables quadratic convergence near optima (L-BFGS, natural gradient)
- Normalizing flows: The Jacobian determinant corrects probability densities under invertible transformations
- Loss landscape analysis: Hessian eigenvalues reveal whether critical points are minima, maxima, or saddle points
- Fisher information matrix: The expected Hessian of the negative log-likelihood, central to statistical efficiency and natural gradient descent
- Neural tangent kernel: The Jacobian of network outputs with respect to parameters defines the NTK
Guided Problems
References
- Stewart, James - Calculus: Early Transcendentals, 8th ed., Chapter 14.7
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Chapter 5.3
- Boyd and Vandenberghe - Convex Optimization, Section 3.1
- MIT 18.S096 - Matrix Calculus for Machine Learning and Beyond