Taylor Approximation
The Central Question: How Do We Approximate Complex Functions with Simple Polynomials?
Many functions in machine learning are too complicated to analyze directly. Taylor approximation replaces a function with a polynomial that matches its behavior near a point. First-order approximations give us gradient descent; second-order approximations give us Newton's method.
Consider these scenarios:
- Gradient descent assumes the loss is approximately linear near the current point: $L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^\top \Delta\theta$. This is a first-order Taylor expansion.
- Newton's method uses a quadratic approximation: $L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^\top \Delta\theta + \tfrac{1}{2}\Delta\theta^\top \nabla^2 L(\theta)\,\Delta\theta$. The optimal step minimizes this quadratic.
- The Laplace approximation in Bayesian inference approximates the posterior with a Gaussian by taking a second-order Taylor expansion of the log-posterior around its mode.
Taylor approximation bridges the gap between exact analysis and practical computation.
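The error behavior behind these scenarios can be checked numerically. A minimal sketch (plain Python; the softplus "loss" and all names are illustrative) showing that the first-order model's error shrinks like $h^2$ and the second-order model's like $h^3$:

```python
import math

# Hypothetical smooth 1-D "loss": softplus, f(x) = log(1 + e^x).
def f(x):
    return math.log(1.0 + math.exp(x))

def df(x):   # f'(x) is the sigmoid
    return 1.0 / (1.0 + math.exp(-x))

def d2f(x):  # f''(x) = sigmoid * (1 - sigmoid)
    s = df(x)
    return s * (1.0 - s)

x0 = 0.5

def taylor_errors(h):
    """Absolute errors of the first- and second-order models at x0 + h."""
    exact = f(x0 + h)
    first = f(x0) + df(x0) * h                 # linear model
    second = first + 0.5 * d2f(x0) * h * h     # quadratic model
    return abs(exact - first), abs(exact - second)

for h in (0.1, 0.01):
    print(h, taylor_errors(h))
```

Shrinking the step by 10x shrinks the first-order error by roughly 100x and the second-order error by roughly 1000x, which is exactly why small steps make these local models trustworthy.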
Topics to Cover
Taylor Series in One Dimension
- Remainder terms and convergence
- Common Taylor series: $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$, $\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$, $\cos x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$
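Partial sums of these series converge quickly near the expansion point, and for alternating series the remainder is bounded by the first omitted term. A small sketch (plain Python, illustrative function names) evaluating truncated series for $e^x$ and $\sin x$:

```python
import math

def exp_taylor(x, n):
    # Partial sum of e^x = sum_{k>=0} x^k / k!
    return sum(x**k / math.factorial(k) for k in range(n + 1))

def sin_taylor(x, n):
    # Partial sum of sin x = sum_{k>=0} (-1)^k x^(2k+1) / (2k+1)!
    return sum((-1)**k * x**(2*k + 1) / math.factorial(2*k + 1)
               for k in range(n + 1))

# A handful of terms already gives near machine precision at x = 1.
print(abs(exp_taylor(1.0, 10) - math.e))
print(abs(sin_taylor(1.0, 5) - math.sin(1.0)))
```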
Multivariate Taylor Expansion
- First-order: $f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x}$
- Second-order: $f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x} + \tfrac{1}{2}\Delta\mathbf{x}^\top \nabla^2 f(\mathbf{x})\,\Delta\mathbf{x}$
- Vector notation and the role of the gradient and Hessian
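The multivariate formulas above can be exercised directly with an analytic gradient and Hessian. A minimal sketch (NumPy; the two-variable function is a hypothetical example chosen only because its derivatives are easy to write down):

```python
import numpy as np

# Illustrative smooth function of two variables: f(x) = x0^2 * x1 + sin(x1)
def f(x):
    return x[0]**2 * x[1] + np.sin(x[1])

def grad(x):   # analytic gradient
    return np.array([2*x[0]*x[1], x[0]**2 + np.cos(x[1])])

def hess(x):   # analytic Hessian
    return np.array([[2*x[1],  2*x[0]],
                     [2*x[0], -np.sin(x[1])]])

x0 = np.array([1.0, 0.5])
d = np.array([0.01, -0.02])              # small perturbation

first = f(x0) + grad(x0) @ d                  # first-order model
second = first + 0.5 * d @ hess(x0) @ d       # second-order model
exact = f(x0 + d)
print(abs(exact - first), abs(exact - second))
```

Adding the Hessian term cuts the approximation error by several orders of magnitude for this step size, which is the payoff Newton-type methods buy with the extra derivative computation.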
Linearization
- Tangent plane approximation
- When linearization is sufficient (small perturbations, sensitivity analysis)
- Linearization of nonlinear systems
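For a nonlinear vector-valued map, linearization replaces the map with its Jacobian acting on the perturbation. A small sketch (NumPy; the state map `F` is a made-up example for illustration) showing that the tangent-plane prediction is accurate to second order in the perturbation:

```python
import numpy as np

# Hypothetical nonlinear state map: F(s) = [tanh(s0 + s1), 0.5*s1 + s0^2]
def F(s):
    return np.array([np.tanh(s[0] + s[1]), 0.5*s[1] + s[0]**2])

def jacobian(s):
    sech2 = 1.0 - np.tanh(s[0] + s[1])**2   # derivative of tanh
    return np.array([[sech2,  sech2],
                     [2*s[0], 0.5]])

s0 = np.array([0.2, -0.1])
delta = np.array([1e-3, -2e-3])   # small input perturbation

# Tangent-plane (linearized) prediction of the perturbed output:
lin = F(s0) + jacobian(s0) @ delta
exact = F(s0 + delta)
err = np.linalg.norm(exact - lin)
print(err)   # residual is O(|delta|^2)
```

This is the computation behind sensitivity analysis: for small perturbations the Jacobian alone tells you how the output moves.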
Quadratic Approximation
- The connection to Newton's method: minimize the quadratic model
- Quadratic form $m(\mathbf{d}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \mathbf{d} + \tfrac{1}{2}\mathbf{d}^\top \nabla^2 f(\mathbf{x})\,\mathbf{d}$ and its minimizer $\mathbf{d}^* = -\nabla^2 f(\mathbf{x})^{-1}\nabla f(\mathbf{x})$ (when the Hessian is positive definite)
- Trust regions: when the quadratic approximation is trustworthy
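Minimizing the quadratic model repeatedly is Newton's method. A one-dimensional sketch (plain Python; the convex objective $f(x) = e^x - 2x$ is a hypothetical example with known minimizer $x^* = \ln 2$):

```python
import math

# Convex 1-D objective f(x) = e^x - 2x, with f'(x*) = 0 at x* = ln 2.
def f(x):   return math.exp(x) - 2.0*x
def df(x):  return math.exp(x) - 2.0
def d2f(x): return math.exp(x)

x = 0.0
for _ in range(10):
    # Minimizing the quadratic model f(x) + f'(x)*d + (1/2)*f''(x)*d^2
    # over d gives the Newton step d = -f'(x)/f''(x).
    x -= df(x) / d2f(x)

print(x)   # converges rapidly toward ln 2
```

In several variables the same derivation gives the step $-\nabla^2 f(\mathbf{x})^{-1}\nabla f(\mathbf{x})$; trust-region methods cap the step length when the quadratic model stops being a faithful local picture of $f$.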
Summary
Answering the Central Question: Taylor approximation replaces a complicated function with a polynomial that matches the function's value, slope, and curvature at a point. The first-order approximation underlies gradient descent. The second-order approximation underlies Newton's method and the Laplace approximation. Higher-order terms are rarely needed in ML because we only need local accuracy, and the cost of computing higher derivatives grows rapidly.
Applications in Data Science and Machine Learning
- Gradient descent: A consequence of the first-order Taylor approximation with a step size constraint
- Newton's method: Minimizing the second-order Taylor model gives the update $\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1}\nabla L(\theta)$
- Laplace approximation: Approximating the log-posterior with a quadratic yields a Gaussian approximation to the posterior distribution
- Activation function approximations: ReLU, sigmoid, and softmax can be analyzed via their Taylor expansions around key points
- Loss function analysis: Taylor expansion around a critical point reveals whether it is a minimum, maximum, or saddle point
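The Laplace approximation item above can be made concrete in one dimension: find the mode of the log-posterior, then read a Gaussian variance off its curvature. A sketch (plain Python; the Beta-shaped log-posterior is an illustrative choice, not from the source):

```python
import math

# Hypothetical unnormalized log-posterior: a Beta(41, 21) density up to
# a constant, log p(t) = 40*log(t) + 20*log(1 - t), for t in (0, 1).
a, b = 40.0, 20.0
def logp(t):   return a*math.log(t) + b*math.log(1.0 - t)
def dlogp(t):  return a/t - b/(1.0 - t)
def d2logp(t): return -a/t**2 - b/(1.0 - t)**2

mode = a / (a + b)                   # stationary point: dlogp(mode) = 0
laplace_var = -1.0 / d2logp(mode)    # Gaussian variance from curvature

# True variance of Beta(41, 21), for comparison
true_var = (41*21) / ((62**2) * 63)
print(mode, laplace_var, true_var)
```

The curvature-based variance lands within a few percent of the exact Beta variance here, illustrating why the Gaussian from a second-order expansion is a reasonable stand-in for a peaked posterior.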
Guided Problems
References
- Stewart, James - Calculus: Early Transcendentals, 8th ed., Sections 11.10-11.11
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Section 5.4
- Boyd and Vandenberghe - Convex Optimization, Section 9.5
- Nocedal and Wright - Numerical Optimization, 2nd ed., Chapter 2