Calculus of Optimization
The Central Question: What Does Calculus Tell Us About Finding Minima?
Machine learning is optimization: we minimize loss functions, maximize likelihoods, and find optimal parameters. Calculus provides the conditions that characterize optima (gradient equals zero, Hessian is positive semidefinite) and the convergence analysis of algorithms that find them.
Consider these scenarios:
- Setting $\nabla f(\theta) = 0$ gives the necessary condition for a minimum. But is it sufficient? The Hessian tells us whether the critical point is a minimum, maximum, or saddle point.
- Gradient descent converges at a rate determined by the condition number of the Hessian. A poorly conditioned problem (elongated contours) causes slow convergence.
- Constrained optimization (e.g., maximizing entropy subject to constraints) uses Lagrange multipliers to convert a constrained problem into an unconstrained one.
Optimization theory explains why algorithms work and when they fail.
Topics to Cover
First-Order and Second-Order Conditions
- Necessary condition: $\nabla f(x^*) = 0$ at a local minimum
- Second-order necessary: $\nabla^2 f(x^*) \succeq 0$
- Second-order sufficient: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ guarantee a strict local minimum
- Saddle points in high dimensions and their prevalence in deep learning
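The second-order tests above reduce to an eigenvalue check on the Hessian at a critical point. A minimal sketch (the helper name `classify_critical_point` is illustrative, not from the source):

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point (where the gradient vanishes) by the
    eigenvalues of the symmetric Hessian evaluated there."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.all(eigvals > tol):
        return "strict local minimum"   # positive definite
    if np.all(eigvals < -tol):
        return "strict local maximum"   # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"           # indefinite
    return "inconclusive (semidefinite)"

# f(x, y) = x^2 - y^2 has a saddle at the origin: Hessian = diag(2, -2)
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([2.0, 4.0])))   # strict local minimum
```

Note the semidefinite branch: when some eigenvalues are exactly zero, the second-order test alone cannot decide, which is precisely why the sufficient condition requires strict positive definiteness.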
Convexity via the Hessian
- $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$ (twice-differentiable case)
- Strong convexity: $\nabla^2 f(x) \succeq \mu I$ for some $\mu > 0$
- Every local minimum of a convex function is a global minimum (no spurious local minima to get stuck in); strict convexity makes the minimizer unique
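For a quadratic $f(x) = \tfrac{1}{2} x^\top A x$ the Hessian is $A$ everywhere, so both checks collapse to the smallest eigenvalue of $A$. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def convexity_of_quadratic(A, tol=1e-10):
    """For f(x) = 0.5 * x^T A x the Hessian is A at every point,
    so convexity reduces to one eigenvalue computation."""
    mu = np.linalg.eigvalsh(A).min()  # smallest eigenvalue of A
    if mu > tol:
        return f"strongly convex with mu = {mu:.3g}"
    if mu >= -tol:
        return "convex (but not strongly convex)"
    return "not convex"

A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
print(convexity_of_quadratic(A))  # strongly convex (smallest eigenvalue is 3 - sqrt(2))
```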
Gradient Descent Convergence Analysis
- For $L$-smooth functions, gradient descent with step size $\eta \le 1/L$ guarantees descent at every step
- Convergence rate: $O(1/k)$ for convex functions, linear $O((1 - \mu/L)^k)$ for $\mu$-strongly convex functions
- The role of the condition number $\kappa = L/\mu$: larger $\kappa$ means slower convergence
Newton's Method
- Update: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
- Quadratic convergence near the optimum
- Computational cost: $O(d^3)$ per step for the Hessian solve in $d$ dimensions
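Quadratic convergence means the error is roughly squared at each iteration. A minimal 1D sketch on $f(x) = x - \log x$ (minimum at $x^* = 1$), where the Hessian solve is just a division; names are illustrative:

```python
def newton_minimize(grad, hess, x0, iters=6):
    """Newton's method: x_{k+1} = x_k - grad(x_k) / hess(x_k) in 1D."""
    x = x0
    history = []
    for _ in range(iters):
        x = x - grad(x) / hess(x)
        history.append(x)
    return history

# f(x) = x - log(x): grad = 1 - 1/x, hess = 1/x^2, minimizer x* = 1
grad = lambda x: 1.0 - 1.0 / x
hess = lambda x: 1.0 / x ** 2

for x in newton_minimize(grad, hess, x0=0.5):
    print(abs(x - 1.0))  # error roughly squares each step: 0.25, 0.0625, ~0.004, ...
```

For this function the update simplifies to $x_{k+1} = 2x_k - x_k^2$, so the error obeys $1 - x_{k+1} = (1 - x_k)^2$ exactly, a clean instance of quadratic convergence.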
Lagrange Multipliers
- Constrained optimization: minimize $f(x)$ subject to $g(x) = 0$
- The Lagrangian: $\mathcal{L}(x, \lambda) = f(x) + \lambda^\top g(x)$
- KKT conditions for inequality constraints
- Cross-reference to Constrained Optimization in the linear algebra section
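For a quadratic objective with a linear equality constraint, stationarity of the Lagrangian is a linear system in $(x, \lambda)$. A minimal sketch on the toy problem "minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$" (a hand-picked example, not from the source):

```python
import numpy as np

# L(x, lam) = x1^2 + x2^2 + lam * (x1 + x2 - 1)
# Setting all partial derivatives of L to zero gives:
#   2*x1 + lam = 0,  2*x2 + lam = 0,  x1 + x2 = 1
KKT = np.array([[2.0, 0.0, 1.0],
                [0.0, 2.0, 1.0],
                [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(KKT, rhs)
print(x1, x2, lam)  # analytic solution: x* = (0.5, 0.5), lambda = -1
```

The multiplier $\lambda = -1$ measures the sensitivity of the optimal value to relaxing the constraint, which is the quantity the dual variables in the KKT conditions generalize.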
Summary
Answering the Central Question: Calculus characterizes optima through first-order conditions ($\nabla f(x^*) = 0$) and second-order conditions ($\nabla^2 f(x^*) \succ 0$ for minima, $\nabla^2 f(x^*) \prec 0$ for maxima). Convexity ($\nabla^2 f \succeq 0$ everywhere) guarantees that local minima are global. Gradient descent convergence depends on smoothness and strong convexity, with rates $O(1/k)$ and $O((1 - \mu/L)^k)$ respectively. Newton's method achieves quadratic convergence by using second-order information. Lagrange multipliers handle equality constraints by introducing dual variables.
Applications in Data Science and Machine Learning
- Understanding optimization algorithms: Convergence guarantees for SGD, Adam, and other optimizers rely on smoothness and convexity assumptions
- Learning rate selection: The Lipschitz constant $L$ of the gradient determines the maximum safe step size (on the order of $1/L$)
- Second-order methods: L-BFGS, natural gradient, and K-FAC approximate the Newton step for faster convergence
- Regularization as constrained optimization: L2 regularization corresponds to constraining the parameter norm (Lagrange multiplier interpretation)
- SVM dual problem: The SVM optimization uses Lagrange multipliers and KKT conditions to derive the support vector solution
Guided Problems
References
- Boyd and Vandenberghe - Convex Optimization, Chapters 3, 5, 9
- Nocedal and Wright - Numerical Optimization, 2nd ed., Chapters 2-3
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Chapter 7
- Nesterov, Yurii - Introductory Lectures on Convex Optimization, Chapter 2