Expectation and Variance
The Central Question: How Do We Summarize a Distribution with a Few Numbers?
A full distribution contains all the information about a random variable, but we often need concise summaries: the average value (expectation), the spread (variance), and the relationships between variables (covariance and correlation). These summary statistics drive both the theory and practice of ML.
Consider these scenarios:
- The expected loss is what we actually minimize in machine learning. Empirical risk minimization approximates this expectation with a sample average.
- The bias-variance tradeoff decomposes prediction error into squared bias and variance. Understanding variance is essential for model selection.
- The covariance matrix of features determines the shape of data clusters and drives PCA, Gaussian discriminant analysis, and many other methods.
Expectations and variances are the language of statistical learning theory.
Topics to Cover
Expectation
- Discrete: $E[X] = \sum_x x \, P(X = x)$
- Continuous: $E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$
- Linearity: $E[aX + bY] = a\,E[X] + b\,E[Y]$ (always, even without independence)
- Law of the unconscious statistician (LOTUS): $E[g(X)] = \sum_x g(x) \, P(X = x)$, or $\int g(x) f(x) \, dx$ in the continuous case
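As a quick numerical sketch (using a fair six-sided die as an assumed example), the discrete formula, LOTUS, and linearity can all be checked with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete expectation: E[X] = sum_x x * P(X = x) for a fair six-sided die.
faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
e_x = np.sum(faces * pmf)              # 3.5

# LOTUS: E[g(X)] = sum_x g(x) * P(X = x), no need for the pmf of g(X).
e_x_squared = np.sum(faces**2 * pmf)   # 91/6 ≈ 15.1667

# Linearity: E[X + Y] = E[X] + E[Y] even when X and Y are dependent.
x = rng.integers(1, 7, size=100_000)
y = 7 - x                              # Y is fully determined by X
assert abs((x + y).mean() - (x.mean() + y.mean())) < 1e-9
```

Note that the linearity check uses perfectly dependent variables, underscoring that no independence assumption is needed.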
Variance
- Definition: $\operatorname{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$
- Standard deviation: $\sigma_X = \sqrt{\operatorname{Var}(X)}$
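A short sketch confirming that the definitional and shortcut formulas for variance agree (again using a fair die as an assumed example):

```python
import numpy as np

faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

e_x = np.sum(faces * pmf)                    # E[X] = 3.5
var_def = np.sum((faces - e_x) ** 2 * pmf)   # E[(X - E[X])^2]
var_short = np.sum(faces**2 * pmf) - e_x**2  # E[X^2] - (E[X])^2
std = np.sqrt(var_def)                       # standard deviation

# Both formulas give 35/12 ≈ 2.9167 for a fair die.
assert np.isclose(var_def, var_short)
```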
Covariance and Correlation
- Definition: $\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$
- If $X$ and $Y$ are independent then $\operatorname{Cov}(X, Y) = 0$ (but not conversely)
- Correlation: $\rho(X, Y) = \operatorname{Cov}(X, Y) / (\sigma_X \sigma_Y)$, always in $[-1, 1]$
- The covariance matrix $\Sigma$, with entries $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$, for a random vector $X$
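A simulation sketch of these facts, including the classic counterexample (a symmetric $X$ paired with $X^2$) showing that zero covariance does not imply independence; the specific distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Correlated pair: Y = X + noise, so Cov(X, Y) > 0.
x = rng.standard_normal(n)
y = x + 0.5 * rng.standard_normal(n)

cov_xy = np.cov(x, y)[0, 1]    # sample Cov(X, Y), close to 1
rho = np.corrcoef(x, y)[0, 1]  # correlation, always in [-1, 1]

# Dependent but uncorrelated: Cov(X, X^2) = E[X^3] = 0 for symmetric X,
# so "independent => Cov = 0" does not reverse.
z = x**2
assert abs(np.cov(x, z)[0, 1]) < 0.05
assert -1.0 <= rho <= 1.0
```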
Moments and Moment Generating Functions
- $k$-th moment: $E[X^k]$
- Moment generating function: $M_X(t) = E[e^{tX}]$, with $M_X^{(k)}(0) = E[X^k]$
- The MGF uniquely determines the distribution (when it exists in a neighborhood of $0$)
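A sketch of moments-from-derivatives, assuming an Exponential example where the MGF has a closed form, $M(t) = \lambda/(\lambda - t)$ for $t < \lambda$; finite differences at $0$ recover the first two moments:

```python
import numpy as np

# MGF of Exponential(rate=lam): M(t) = lam / (lam - t) for t < lam.
lam = 2.0
M = lambda t: lam / (lam - t)

# k-th moment = k-th derivative of M at 0; estimate via central differences.
h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)          # ≈ E[X]   = 1/lam   = 0.5
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # ≈ E[X^2] = 2/lam^2 = 0.5

assert abs(m1 - 1 / lam) < 1e-6
assert abs(m2 - 2 / lam**2) < 1e-4
```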
Summary
Answering the Central Question: The expectation gives the "center" of a distribution, the variance measures its spread, and the covariance captures linear dependence between two variables. Linearity of expectation is the most useful property in probability, holding universally without independence assumptions. The moment generating function encodes all moments and uniquely determines the distribution, serving as a powerful theoretical tool.
Applications in Data Science and Machine Learning
- Empirical risk minimization: The training loss $\frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$ approximates the expected loss $E[\ell(f(X), Y)]$ via the law of large numbers
- Bias-variance decomposition: $E[(\hat{\theta} - \theta)^2] = \operatorname{Bias}(\hat{\theta})^2 + \operatorname{Var}(\hat{\theta})$
- Covariance matrices: Central to PCA, Gaussian discriminant analysis, Mahalanobis distance, and whitening
- Batch normalization: Uses running estimates of $E[x]$ and $\operatorname{Var}(x)$ to normalize activations
- Law of total variance: $\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(E[Y \mid X])$ decomposes variance into unexplained and explained components
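The law of total variance can be checked by simulation; the hierarchical model below ($X$ standard normal, $Y \mid X \sim \mathcal{N}(X, 0.25)$) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hierarchical model: X ~ N(0, 1), then Y | X ~ N(X, 0.5^2).
x = rng.standard_normal(n)
y = x + 0.5 * rng.standard_normal(n)

# Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) = 0.25 + 1.0 = 1.25
total = y.var()
within = 0.25       # Var(Y | X) = 0.5^2, constant in this model (unexplained)
between = x.var()   # Var(E[Y | X]) = Var(X) ≈ 1 (explained)

assert abs(total - (within + between)) < 0.02
```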
Guided Problems
References
- Blitzstein and Hwang - Introduction to Probability, 2nd ed., Chapters 4, 7
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Chapter 6.4
- Wasserman - All of Statistics, Chapter 3
- CMU 36-700 - Statistical Machine Learning