
Optimization and Linear Algebra

The Central Question: How Does Matrix Structure Control Optimization?

Minimizing a function requires understanding its curvature, and curvature is encoded in matrices (the Hessian). The eigenvalues of the Hessian determine whether you are at a minimum, maximum, or saddle point. The condition number $\kappa$ controls how fast gradient descent converges. Positive definiteness guarantees a unique global minimum. Every key idea in optimization, from gradient descent to conjugate gradient to Adam, has a linear algebra explanation.

Topics to Cover

Quadratic Functions and Their Geometry

  • The general quadratic: $f(x) = \frac{1}{2}x^T A x - b^T x + c$
  • Gradient: $\nabla f = Ax - b$; setting it to zero gives $Ax = b$
  • Hessian: $\nabla^2 f = A$; curvature is determined by the matrix
  • Classification via eigenvalues of $A$:
    • All $\lambda_i > 0$: bowl (strict minimum)
    • All $\lambda_i < 0$: dome (strict maximum)
    • Mixed signs: saddle point
  • Cross-reference to Positive Definite Matrices
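As a quick numerical check of this classification (the function name `classify_quadratic` is illustrative, not from the text), one can inspect the Hessian's eigenvalues directly:

```python
import numpy as np

def classify_quadratic(A):
    """Classify the critical point of f(x) = 0.5 x^T A x - b^T x + c
    using the eigenvalues of the symmetric Hessian A."""
    eigvals = np.linalg.eigvalsh(A)
    if np.all(eigvals > 0):
        return "minimum"       # bowl: all curvatures positive
    if np.all(eigvals < 0):
        return "maximum"       # dome: all curvatures negative
    return "saddle"            # mixed signs: curves up and down

A_bowl = np.array([[2.0, 0.0], [0.0, 3.0]])     # eigenvalues 2, 3
A_saddle = np.array([[2.0, 0.0], [0.0, -1.0]])  # eigenvalues 2, -1
```

For example, `classify_quadratic(A_bowl)` reports a minimum while `classify_quadratic(A_saddle)` reports a saddle, matching the eigenvalue-sign rule above.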

Gradient Descent through a Linear Algebra Lens

  • Steepest descent on $f(x) = \frac{1}{2}x^T A x - b^T x$
  • Convergence rate depends on the condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$
    • $\kappa \approx 1$: nearly spherical contours, fast convergence
    • $\kappa \gg 1$: elongated elliptical contours, zig-zagging, slow convergence
  • Eigenvalues of $A$ control the step sizes along each eigenvector direction
  • Preconditioning: replace $Ax = b$ with $M^{-1}Ax = M^{-1}b$ to improve $\kappa$
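A minimal sketch of how $\kappa$ shows up in practice: steepest descent with exact line search on two diagonal systems (the matrices and tolerances here are illustrative, not from the text). The iteration count grows sharply with the condition number:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=10000):
    """Steepest descent with exact line search on f(x) = 0.5 x^T A x - b^T x."""
    x = x0.copy()
    for k in range(max_iter):
        r = b - A @ x                     # residual = negative gradient
        if np.linalg.norm(r) < tol:
            return x, k
        alpha = (r @ r) / (r @ (A @ r))   # optimal step along -gradient
        x = x + alpha * r
    return x, max_iter

A_good = np.diag([1.0, 2.0])    # kappa = 2: nearly spherical contours
A_bad  = np.diag([1.0, 100.0])  # kappa = 100: elongated ellipses
b = np.array([1.0, 1.0])
x0 = np.zeros(2)

x_good, k_good = steepest_descent(A_good, b, x0)
x_bad, k_bad = steepest_descent(A_bad, b, x0)
```

With `A_good`, convergence takes a handful of iterations; with `A_bad`, the iterates zig-zag and need hundreds, consistent with the rate $\left(\frac{\kappa-1}{\kappa+1}\right)^2$ per step.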

Conjugate Gradient Method

  • Motivation: avoid the zig-zagging of steepest descent
  • Key idea: search directions that are $A$-conjugate ($d_i^T A d_j = 0$ for $i \neq j$)
  • Converges in at most $n$ steps for an $n \times n$ positive definite $A$
  • In practice converges much faster when eigenvalues are clustered
  • Connection to Krylov subspaces: $\{b, Ab, A^2b, \ldots\}$
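The method above can be sketched in a few lines (a standard textbook CG implementation; the test matrix is an arbitrary SPD example, not from the source). Note the loop is capped at $n$ iterations, matching the finite-termination property:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Plain conjugate gradient for SPD A; search directions are A-conjugate."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x          # initial residual
    d = r.copy()           # first search direction
    iters = 0
    for _ in range(n):     # at most n steps in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # keeps new d A-conjugate to old ones
        d = r_new + beta * d
        r = r_new
        iters += 1
    return x, iters

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)    # symmetric positive definite 5x5
b = rng.standard_normal(5)
x, iters = conjugate_gradient(A, b)
```

Here `iters` is at most 5, and the residual $b - Ax$ is driven to roundoff level.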

Constrained Optimization

  • Equality constraints: minimize $f(x)$ subject to $Ax = b$
  • Lagrange multipliers: $\nabla f = A^T\lambda$ at the optimum
  • The KKT system as a saddle-point linear system: $\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} c \\ b \end{bmatrix}$
  • Inequality constraints (brief): active set, complementary slackness
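A small worked instance of the KKT saddle-point system, assuming the sign convention $Hx + A^T\lambda = c$, $Ax = b$ from the block matrix above (the specific $H$, $A$, $b$ are made up for illustration): minimize $\|x\|^2$ subject to $x_1 + x_2 = 1$, whose solution is $x = (0.5, 0.5)$ by symmetry.

```python
import numpy as np

# Minimize 0.5 x^T H x - c^T x  subject to  A x = b, via the KKT system
H = np.array([[2.0, 0.0], [0.0, 2.0]])   # Hessian of ||x||^2 (SPD)
c = np.array([0.0, 0.0])
A = np.array([[1.0, 1.0]])               # one constraint: x1 + x2 = 1
b = np.array([1.0])

m, n = A.shape
KKT = np.block([[H, A.T],
                [A, np.zeros((m, m))]])  # saddle-point (indefinite) matrix
rhs = np.concatenate([c, b])
sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]                # primal solution and multiplier
```

The KKT matrix is indefinite even when $H$ is positive definite: that is the "saddle-point" structure, and it is why specialized solvers exist for these systems.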

Stochastic Gradient Descent (SGD)

  • The full gradient is expensive: $\nabla f = \frac{1}{N}\sum_{i=1}^N \nabla f_i$ requires all $N$ data points
  • SGD: approximate the gradient with a random mini-batch of size $B \ll N$
  • Convergence: noisy but unbiased; variance decreases with batch size
  • Learning rate schedules: constant, decay, warm-up
  • Mini-batch SGD as the standard in deep learning
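A minimal mini-batch SGD sketch on a noiseless least-squares problem (the function name and hyperparameters are illustrative assumptions, not from the text). Each mini-batch gradient is an unbiased estimate of the full gradient:

```python
import numpy as np

def sgd_least_squares(X, y, batch_size=8, lr=0.05, epochs=200, seed=0):
    """Mini-batch SGD on the least-squares loss (1/2N)||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(N)             # reshuffle each epoch
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # unbiased estimate of the full gradient X^T(Xw - y)/N
            grad = Xb.T @ (Xb @ w - yb) / len(batch)
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                               # noiseless targets
w_hat = sgd_least_squares(X, y)
```

Because the data here are noiseless, every mini-batch loss shares the same minimizer, so `w_hat` converges to `w_true`; with noisy labels the iterates would instead hover in a neighborhood whose size shrinks with the batch size.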

Momentum and Accelerated Methods

  • SGD zig-zags in elongated valleys (high condition number)
  • Classical momentum: accumulate a velocity term $v_{t+1} = \beta v_t + \nabla f(x_t)$
    • Dampens oscillations across the short axis, accelerates along the long axis
  • Nesterov momentum: evaluate the gradient at the "lookahead" position $x_t - \beta v_t$
    • Achieves the optimal $O(1/t^2)$ convergence rate for smooth convex functions
  • Adam: adaptive per-parameter learning rates using first and second moment estimates
    • Combines momentum with RMSProp
    • Bias correction for early iterations
  • Connection to ODEs: momentum methods discretize a second-order ODE (damped oscillator)
  • Cross-reference to MIT 18.065 Lecture 23
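A sketch comparing plain gradient descent with classical (heavy-ball) momentum on an elongated quadratic valley (the matrix and hyperparameters are illustrative). Momentum damps the oscillation across the steep axis while building speed along the shallow one:

```python
import numpy as np

def gd(A, b, lr, steps, beta=0.0):
    """Gradient descent with classical momentum:
    v <- beta * v + grad,  x <- x - lr * v.  beta=0 gives plain GD."""
    x = np.zeros(len(b))
    v = np.zeros(len(b))
    for _ in range(steps):
        grad = A @ x - b
        v = beta * v + grad
        x = x - lr * v
    return x

A = np.diag([1.0, 40.0])        # kappa = 40: an elongated valley
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)  # exact minimizer for comparison

err_plain = np.linalg.norm(gd(A, b, lr=0.02, steps=300) - x_star)
err_mom   = np.linalg.norm(gd(A, b, lr=0.02, steps=300, beta=0.9) - x_star)
```

After the same number of steps, the momentum run is orders of magnitude closer to $x^\star$: plain GD crawls along the shallow eigenvector direction (contraction $1 - \text{lr}\cdot\lambda_{\min}$ per step), while heavy-ball contracts at roughly $\sqrt{\beta}$ in every direction.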

Convexity and Linear Algebra

  • A set is convex iff it's closed under convex combinations
  • A twice-differentiable function is convex iff its Hessian is PSD everywhere
  • Convex optimization: every local minimum is global
  • Why ML loves convexity: linear regression, Ridge, SVM (dual), logistic regression loss
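A tiny illustration of the Hessian test for convexity, using linear regression from the list above: the least-squares loss $f(w) = \frac{1}{2}\|Xw - y\|^2$ has constant Hessian $X^TX$, which is always PSD, so every local minimum is global (the example data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))

H = X.T @ X                          # Hessian of 0.5 ||Xw - y||^2
eigvals = np.linalg.eigvalsh(H)      # H is symmetric, so use eigvalsh
is_convex = bool(np.all(eigvals >= -1e-12))  # PSD up to roundoff
```

Adding a Ridge penalty $\frac{\lambda}{2}\|w\|^2$ shifts the Hessian to $X^TX + \lambda I$, making it positive definite and the minimizer unique.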

Summary

Answering the Central Question: Matrix structure controls optimization at every level. The Hessian's eigenvalues classify critical points (all positive = minimum, mixed = saddle). The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ governs convergence speed: gradient descent needs $O(\kappa)$ iterations to reach a fixed accuracy, conjugate gradient only $O(\sqrt{\kappa})$. Preconditioning improves $\kappa$ to accelerate convergence. SGD and momentum methods address the practical challenges of large-scale optimization while respecting the same underlying linear algebra.

Applications in Data Science and Machine Learning

  • Training neural networks: loss landscape curvature (Hessian eigenvalues) determines trainability; saddle points dominate in high dimensions
  • Adam / preconditioned methods: approximate second-order information to improve conditioning
  • Support Vector Machines: the dual is a constrained quadratic program (QP)
  • Convex relaxations: replacing non-convex problems with convex surrogates (e.g., nuclear norm for rank minimization)
  • Natural gradient descent: uses the Fisher information matrix (PSD) as a preconditioner

Guided Problems

References

  • Strang, Introduction to Linear Algebra, Chapter 6 (6.1, 6.4)
  • Boyd & Vandenberghe, Convex Optimization, Chapters 2–5 (supplementary)