Derivatives and Gradients
The Central Question: How Does a Function Change When We Nudge Its Inputs?
Machine learning is fundamentally about adjusting parameters to reduce a loss function. To know which way to adjust, we need to measure how the loss responds to small changes in each parameter. This is what derivatives and gradients tell us.
Consider these scenarios:
- A neural network has millions of parameters, and we need to know how the loss changes with respect to each one simultaneously.
- In gradient descent, we move in the direction of steepest decrease. But what direction is "steepest" when there are many variables?
- A contour plot of a loss function shows level curves. The gradient tells us which direction is perpendicular to these curves, pointing uphill.
The gradient generalizes the single-variable derivative to functions of many variables, and it is the engine behind every optimization algorithm in ML.
Topics to Cover
Partial Derivatives
- Definition: the derivative with respect to one variable, holding all others fixed
- Notation: $\frac{\partial f}{\partial x_i}$, $\partial_{x_i} f$, or $f_{x_i}$
- Computing partial derivatives for common functions
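The definition above can be checked numerically with a central difference: nudge one coordinate while holding the others fixed. A minimal sketch (the test function and step size `h` are illustrative choices, not from the text):

```python
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Central-difference estimate of df/dx_i at x, holding other coordinates fixed."""
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[i] += h
    x_minus[i] -= h
    return (f(x_plus) - f(x_minus)) / (2 * h)

# Example: f(x, y) = x**2 * y, so df/dx = 2xy and df/dy = x**2
f = lambda v: v[0] ** 2 * v[1]
x = np.array([3.0, 2.0])
print(partial_derivative(f, x, 0))  # close to 12.0, i.e. 2 * 3 * 2
print(partial_derivative(f, x, 1))  # close to 9.0, i.e. 3**2
```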
The Gradient Vector
- Definition: $\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)$
- The gradient as a vector field
- Gradient of linear functions, quadratic forms, and compositions
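For quadratic forms, the gradient of $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$ is $(A + A^\top)\mathbf{x}$. A sketch verifying this against a coordinate-wise finite-difference gradient (the matrix $A$ and point are arbitrary illustrative values):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Estimate the full gradient by central differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad

A = np.array([[2.0, 1.0], [0.0, 3.0]])
f = lambda x: x @ A @ x            # quadratic form x^T A x
x = np.array([1.0, -2.0])
analytic = (A + A.T) @ x           # closed-form gradient of x^T A x
print(np.allclose(numerical_gradient(f, x), analytic, atol=1e-4))  # True
```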
Directional Derivatives
- The rate of change in an arbitrary unit direction $\mathbf{u}$: $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$
- The gradient gives the direction of steepest ascent
- The magnitude $\|\nabla f\|$ is the maximum rate of change
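The steepest-ascent claim can be checked empirically: sample many random unit directions and confirm none of their directional derivatives exceeds $\|\nabla f\|$. A sketch, using an illustrative function $f(x, y) = x^2 + 3y^2$ evaluated at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x, y) = x**2 + 3*y**2 has gradient (2x, 6y); evaluate at (1.0, 0.5)
grad = np.array([2 * 1.0, 6 * 0.5])

# Directional derivative in unit direction u is grad . u
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
rates = dirs @ grad

# By Cauchy-Schwarz, no unit direction beats the gradient direction,
# whose rate of change is exactly ||grad||
print(rates.max() <= np.linalg.norm(grad))  # True
```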
Level Curves and Gradient Perpendicularity
- Level sets: $\{\mathbf{x} : f(\mathbf{x}) = c\}$
- The gradient is perpendicular to level curves
- Geometric interpretation: why gradient descent moves across contours
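Perpendicularity is easy to see concretely for $f(x, y) = x^2 + y^2$, whose level curves are circles: at any point on a circle, the tangent direction is $(-y, x)$ while the gradient is $(2x, 2y)$, and their dot product vanishes. A sketch with an illustrative point on the unit circle:

```python
import numpy as np

# For f(x, y) = x**2 + y**2, level curves are circles f = c.
p = np.array([0.6, 0.8])            # a point on the unit circle
grad = 2 * p                        # gradient of f at p: (2x, 2y)
tangent = np.array([-p[1], p[0]])   # tangent to the level curve at p: (-y, x)
print(np.isclose(grad @ tangent, 0.0))  # True: gradient is perpendicular to the curve
```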
Summary
Answering the Central Question: The partial derivative $\partial f / \partial x_i$ measures sensitivity to a single input, and the gradient $\nabla f$ collects all partial derivatives into a vector that points in the direction of steepest increase. The directional derivative $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$ measures the rate of change in any direction, and is maximized when $\mathbf{u}$ aligns with $\nabla f$. This is why gradient descent works: moving opposite to the gradient decreases the function as fast as possible locally.
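The "move opposite to the gradient" recipe can be sketched in a few lines. This toy example (the objective, starting point, and learning rate are illustrative choices) minimizes $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{x}$, whose gradient is $2\mathbf{x}$:

```python
import numpy as np

# Minimize f(x) = x^T x (minimum at the origin) by stepping
# opposite the gradient 2x with a fixed learning rate.
x = np.array([4.0, -3.0])
lr = 0.1
for _ in range(100):
    x = x - lr * (2 * x)   # gradient step: x <- x - lr * grad f(x)
print(np.linalg.norm(x) < 1e-6)  # True: iterates shrink geometrically toward 0
```

Each step multiplies $\mathbf{x}$ by $(1 - 2 \cdot \text{lr}) = 0.8$, so the iterates contract toward the minimizer at a geometric rate.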
Applications in Data Science and Machine Learning
- Gradient descent: The gradient tells us how to update each parameter to reduce the loss
- Feature sensitivity: $\partial \hat{y} / \partial x_j$ measures how much a model's output $\hat{y}$ depends on feature $x_j$ (saliency maps, feature importance)
- Contour analysis: Understanding loss landscapes via level curves and gradient directions helps diagnose optimization difficulties (saddle points, plateaus)
- Natural gradient: Using the Fisher information metric instead of Euclidean distance changes the notion of "steepest descent" in parameter space
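The feature-sensitivity idea above can be illustrated with a toy model (the linear predictor and its weights are hypothetical, chosen so the answer is known): for a linear model, the sensitivity $\partial \hat{y} / \partial x_j$ recovers the weight $w_j$ exactly, which a finite-difference probe confirms.

```python
import numpy as np

# Toy "model": a fixed linear predictor y_hat = w . x.
# The sensitivity of the output to feature j is d(y_hat)/d(x_j) = w_j.
w = np.array([0.5, -2.0, 1.5])
model = lambda x: w @ x

def sensitivity(f, x, j, h=1e-6):
    """Central-difference estimate of the model's sensitivity to feature j."""
    xp, xm = x.copy(), x.copy()
    xp[j] += h
    xm[j] -= h
    return (f(xp) - f(xm)) / (2 * h)

x = np.array([1.0, 1.0, 1.0])
sens = [sensitivity(model, x, j) for j in range(3)]
print(np.allclose(sens, w, atol=1e-4))  # True: sensitivities recover the weights
```

For nonlinear models the same probe gives a local, input-dependent importance score, which is the idea behind gradient-based saliency maps.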
Guided Problems
References
- Stewart, James - Calculus: Early Transcendentals, 8th ed., Chapters 14.3-14.6
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Sections 5.1-5.2
- MIT 18.S096 - Matrix Calculus for Machine Learning and Beyond
- Boyd and Vandenberghe - Convex Optimization, Appendix A