Optimization and Linear Algebra
The Central Question: How Does Matrix Structure Control Optimization?
Minimizing a function requires understanding its curvature, and curvature is encoded in matrices (the Hessian). The eigenvalues of the Hessian determine whether you are at a minimum, maximum, or saddle point. The condition number controls how fast gradient descent converges. Positive definiteness guarantees a unique global minimum. Every key idea in optimization, from gradient descent to conjugate gradient to Adam, has a linear algebra explanation.
Topics to Cover
Quadratic Functions and Their Geometry
- The general quadratic: $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$
- Gradient: $\nabla f(x) = Ax - b$; setting to zero gives $Ax = b$
- Hessian: $\nabla^2 f(x) = A$; curvature is determined by the matrix $A$
- Classification via eigenvalues $\lambda_i$ of $A$:
- All $\lambda_i > 0$: bowl (strict minimum)
- All $\lambda_i < 0$: dome (strict maximum)
- Mixed signs: saddle point
- Cross-reference to Positive Definite Matrices
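The classification rule above can be checked numerically. A minimal NumPy sketch (our own illustration, not part of the source outline; the function name and tolerance are assumptions):

```python
import numpy as np

def classify_critical_point(A, tol=1e-10):
    """Classify the critical point of f(x) = 0.5 x^T A x - b^T x + c
    via the eigenvalues of the symmetric Hessian A."""
    eigvals = np.linalg.eigvalsh(A)          # real, sorted ascending
    if np.all(eigvals > tol):
        return "minimum"                     # bowl: all eigenvalues positive
    if np.all(eigvals < -tol):
        return "maximum"                     # dome: all eigenvalues negative
    if eigvals[0] < -tol and eigvals[-1] > tol:
        return "saddle"                      # mixed signs
    return "degenerate"                      # some eigenvalue is (near) zero

print(classify_critical_point(np.diag([2.0, 3.0])))    # minimum
print(classify_critical_point(np.diag([2.0, -3.0])))   # saddle
```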
Gradient Descent through a Linear Algebra Lens
- Steepest descent on $f(x) = \tfrac{1}{2} x^T A x - b^T x$: iterate $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
- Convergence rate depends on the condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$
- $\kappa \approx 1$: nearly spherical contours, fast convergence
- $\kappa \gg 1$: elongated elliptical contours, zig-zagging, slow convergence
- Eigenvalues of $A$ control the step sizes along each eigenvector direction
- Preconditioning: replace $A$ with $M^{-1}A$ to improve $\kappa$
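The effect of $\kappa$ on convergence can be seen directly. A small NumPy experiment (our own sketch): steepest descent with exact line search on two diagonal systems, one well conditioned and one badly conditioned:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=10_000):
    """Steepest descent on f(x) = 0.5 x^T A x - b^T x with exact line search.
    For a quadratic, the optimal step is alpha = r^T r / (r^T A r)."""
    x = x0.copy()
    for k in range(max_iter):
        r = b - A @ x                      # residual = negative gradient
        if np.linalg.norm(r) < tol:
            return x, k
        alpha = (r @ r) / (r @ (A @ r))    # exact line search step
        x = x + alpha * r
    return x, max_iter

A_good = np.diag([1.0, 2.0])      # kappa = 2
A_bad  = np.diag([1.0, 100.0])    # kappa = 100
b = np.array([1.0, 1.0])
x0 = np.zeros(2)

x_good, iters_good = steepest_descent(A_good, b, x0)
x_bad,  iters_bad  = steepest_descent(A_bad, b, x0)
print(iters_good, iters_bad)      # the ill-conditioned system needs far more iterations
```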
Conjugate Gradient Method
- Motivation: avoid the zig-zagging of steepest descent
- Key idea: search directions that are $A$-conjugate ($p_i^T A p_j = 0$ for $i \neq j$)
- Converges in at most $n$ steps (in exact arithmetic) for positive definite $A \in \mathbb{R}^{n \times n}$
- In practice converges much faster when eigenvalues are clustered
- Connection to Krylov subspaces: the $k$-th iterate minimizes over $x_0 + \mathcal{K}_k(A, r_0)$, where $\mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, A r_0, \ldots, A^{k-1} r_0\}$
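A bare-bones CG loop makes the $n$-step guarantee concrete. This is our own sketch, assuming a symmetric positive definite $A$:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Plain conjugate gradient for symmetric positive definite A."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x                 # initial residual
    p = r.copy()                  # first search direction
    rs_old = r @ r
    for k in range(n):
        if np.sqrt(rs_old) < tol:
            return x, k
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p   # new direction is A-conjugate to the old ones
        rs_old = rs_new
    return x, n

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)       # symmetric positive definite
b = rng.standard_normal(5)
x, iters = conjugate_gradient(A, b)
print(np.allclose(A @ x, b), iters)   # solved in at most n = 5 steps
```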
Constrained Optimization
- Equality constraints: minimize $f(x)$ subject to $Cx = d$
- Lagrange multipliers: stationarity $\nabla f(x) + C^T \lambda = 0$ at the optimum
- The KKT system as a saddle-point linear system (for $f(x) = \tfrac{1}{2} x^T A x - b^T x$): $\begin{bmatrix} A & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} b \\ d \end{bmatrix}$
- Inequality constraints (brief): active set, complementary slackness
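Solving the saddle-point system directly gives the constrained minimizer in one linear solve. A minimal sketch (our own example, finding the closest point to the origin on the line $x_1 + x_2 = 1$):

```python
import numpy as np

# Minimize 0.5 x^T A x - b^T x subject to C x = d by solving the KKT system
#   [[A, C^T], [C, 0]] [x; lam] = [b; d]
A = np.array([[2.0, 0.0], [0.0, 2.0]])   # Hessian: f(x) = x1^2 + x2^2
b = np.array([0.0, 0.0])                 # unconstrained minimizer is the origin
C = np.array([[1.0, 1.0]])               # constraint: x1 + x2 = 1
d = np.array([1.0])

n, m = A.shape[0], C.shape[0]
K = np.block([[A, C.T], [C, np.zeros((m, m))]])   # saddle-point matrix
rhs = np.concatenate([b, d])
sol = np.linalg.solve(K, rhs)
x, lam = sol[:n], sol[n:]
print(x)   # closest point on x1 + x2 = 1 to the origin: [0.5, 0.5]
```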
Stochastic Gradient Descent (SGD)
- Full gradient is expensive: $\nabla L(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla \ell_i(w)$ requires all $n$ data points
- SGD: approximate the gradient with a random mini-batch of size $m \ll n$
- Convergence: noisy but unbiased; variance decreases with batch size
- Learning rate schedules: constant, decay, warm-up
- Mini-batch SGD as the standard in deep learning
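A mini-batch SGD loop for least squares (our own sketch; the function name, batch size, and learning rate are illustrative choices, not tuned values):

```python
import numpy as np

def sgd_least_squares(X, y, batch_size=8, lr=0.01, epochs=200, seed=0):
    """Mini-batch SGD on the least-squares loss (1/n) sum_i (x_i^T w - y_i)^2.
    Each mini-batch gradient is an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                     # noiseless targets
w = sgd_least_squares(X, y)
print(np.round(w, 2))                              # approaches w_true
```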
Momentum and Accelerated Methods
- SGD zig-zags in elongated valleys (high condition number)
- Classical momentum: accumulate a velocity term $v_{k+1} = \beta v_k - \alpha \nabla f(x_k)$, then update $x_{k+1} = x_k + v_{k+1}$
- Dampens oscillations across the short axis, accelerates along the long axis
- Nesterov momentum: evaluate the gradient at the "lookahead" position $x_k + \beta v_k$
- Achieves the optimal $O(1/k^2)$ convergence rate for smooth convex functions
- Adam: adaptive per-parameter learning rates using first and second moment estimates
- Combines momentum with RMSProp
- Bias correction for early iterations
- Connection to ODEs: momentum methods discretize a second-order ODE (damped oscillator)
- Cross-reference to MIT 18.065 Lecture 23
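The effect of momentum in an elongated valley can be demonstrated in a few lines. A minimal sketch (our own experiment, with illustrative hyperparameters; $\beta = 0$ recovers plain gradient descent):

```python
import numpy as np

def gd_momentum(A, b, lr, beta, steps):
    """Classical (heavy-ball) momentum on f(x) = 0.5 x^T A x - b^T x:
        v <- beta * v - lr * grad;  x <- x + v"""
    x = np.zeros(len(b))
    v = np.zeros(len(b))
    for _ in range(steps):
        grad = A @ x - b
        v = beta * v - lr * grad
        x = x + v
    return x

A = np.diag([1.0, 50.0])              # elongated valley, kappa = 50
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)        # exact minimizer

x_plain = gd_momentum(A, b, lr=0.02, beta=0.0, steps=300)   # plain GD
x_mom   = gd_momentum(A, b, lr=0.02, beta=0.9, steps=300)   # with momentum
print(np.linalg.norm(x_plain - x_star), np.linalg.norm(x_mom - x_star))
```

With the same learning rate and step budget, the momentum run lands far closer to the minimizer, because oscillations across the steep axis are damped while progress along the shallow axis compounds.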
Convexity and Linear Algebra
- A set $S$ is convex iff it is closed under convex combinations: $\theta x + (1 - \theta) y \in S$ for all $x, y \in S$ and $\theta \in [0, 1]$
- A twice-differentiable function is convex iff its Hessian $\nabla^2 f(x)$ is PSD everywhere
- Convex optimization: every local minimum is global
- Why ML loves convexity: linear regression, Ridge, SVM (dual), logistic regression loss
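The Hessian-PSD criterion explains why the logistic regression loss is convex: its Hessian has the form $X^T D X$ with $D$ diagonal and nonnegative. A quick numerical check (our own sketch, not from the source):

```python
import numpy as np

def logistic_loss_hessian(X, w):
    """Hessian of the logistic loss sum_i log(1 + exp(-y_i x_i^T w)).
    It has the form X^T D X with D diagonal and nonnegative (entries
    p_i (1 - p_i) <= 1/4), hence it is PSD at every w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
    D = p * (1 - p)                       # diagonal entries, all in [0, 1/4]
    return X.T @ (D[:, None] * X)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
for _ in range(5):                        # PSD at several random points w
    H = logistic_loss_hessian(X, rng.standard_normal(4))
    print(np.linalg.eigvalsh(H).min() >= -1e-10)
```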
Summary
Answering the Central Question: Matrix structure controls optimization at every level. The Hessian's eigenvalues classify critical points (all positive = minimum, mixed = saddle). The condition number governs convergence speed: gradient descent converges in $O(\kappa \log(1/\varepsilon))$ iterations, conjugate gradient in $O(\sqrt{\kappa} \log(1/\varepsilon))$. Preconditioning improves $\kappa$ to accelerate convergence. SGD and momentum methods address the practical challenges of large-scale optimization while respecting the same underlying linear algebra.
Applications in Data Science and Machine Learning
- Training neural networks: loss landscape curvature (Hessian eigenvalues) determines trainability; saddle points dominate in high dimensions
- Adam / preconditioned methods: approximate second-order information to improve conditioning
- Support Vector Machines: the dual is a constrained quadratic program (QP)
- Convex relaxations: replacing non-convex problems with convex surrogates (e.g., nuclear norm for rank minimization)
- Natural gradient descent: uses the Fisher information matrix (PSD) as a preconditioner
Guided Problems
References
- Strang, Introduction to Linear Algebra, Chapter 6 (6.1, 6.4)
- Boyd & Vandenberghe, Convex Optimization, Chapters 2–5 (supplementary)