
Expectation and Variance


The Central Question: How Do We Summarize a Distribution with a Few Numbers?

A full distribution contains all the information about a random variable, but we often need concise summaries: the average value (expectation), the spread (variance), and the relationships between variables (covariance and correlation). These summary statistics drive both the theory and practice of ML.

Consider these scenarios:

  1. The expected loss $E[L(\theta)]$ is what we actually minimize in machine learning. Empirical risk minimization approximates this expectation with a sample average.
  2. The bias-variance tradeoff decomposes prediction error into squared bias and variance. Understanding variance is essential for model selection.
  3. The covariance matrix of features determines the shape of data clusters and drives PCA, Gaussian discriminant analysis, and many other methods.

Expectations and variances are the language of statistical learning theory.
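The first scenario can be made concrete: by the law of large numbers, a sample average of $g(X)$ converges to $E[g(X)]$, which is exactly why empirical risk approximates expected risk. A minimal stdlib-only sketch (the choice $X \sim N(0,1)$ with $g(x) = x^2$ is purely illustrative, standing in for a loss):

```python
import random
import statistics

random.seed(0)

# Monte Carlo: a sample average of g(X) approximates E[g(X)]
# (law of large numbers).  Here X ~ N(0, 1) and g(x) = x^2,
# so the true value is E[X^2] = Var(X) + E[X]^2 = 1.
samples = [random.gauss(0, 1) ** 2 for _ in range(100_000)]
estimate = statistics.fmean(samples)
print(f"sample average of X^2: {estimate:.3f}  (true E[X^2] = 1)")
```

With 100,000 samples the estimate lands within a few hundredths of the true value, mirroring how training loss tracks expected loss as the dataset grows.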


Topics to Cover

Expectation

  • Discrete: $E[X] = \sum_x x \cdot P(X = x)$
  • Continuous: $E[X] = \int x f_X(x)\,dx$
  • Linearity: $E[aX + bY] = aE[X] + bE[Y]$ (always, even without independence)
  • Law of the unconscious statistician (LOTUS): $E[g(X)] = \int g(x) f_X(x)\,dx$
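LOTUS can be checked numerically: compute $E[g(X)]$ once as the integral $\int g(x) f_X(x)\,dx$ and once as a plain sample average, with no need to derive the distribution of $g(X)$. A stdlib-only sketch (standard normal and $g(x) = x^2$ are illustrative choices):

```python
import math
import random
import statistics

random.seed(1)

def pdf(x):
    """Standard normal density f_X(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g(x):
    return x * x  # the transformation; E[g(X)] = 1 for X ~ N(0, 1)

# LOTUS integral via the midpoint rule on [-8, 8] (tails beyond
# that range are negligible for the standard normal).
n, a, b = 4000, -8.0, 8.0
h = (b - a) / n
integral = 0.0
for i in range(n):
    x = a + (i + 0.5) * h
    integral += g(x) * pdf(x) * h

# The same expectation as a sample average.
mc = statistics.fmean(g(random.gauss(0, 1)) for _ in range(100_000))
print(f"LOTUS integral: {integral:.4f}   sample average: {mc:.4f}   true: 1")
```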

Variance

  • $\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$
  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  • Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$
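Both identities are easy to verify on simulated data. A stdlib-only sketch (the distribution $X \sim N(5, 2^2)$ and the constants $a = 3$, $b = 10$ are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(2)

xs = [random.gauss(5, 2) for _ in range(100_000)]  # X ~ N(5, 2^2), so Var(X) = 4

# Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
mean = statistics.fmean(xs)
var_direct = statistics.pvariance(xs, mu=mean)
var_identity = statistics.fmean(x * x for x in xs) - mean ** 2
print(f"Var(X): {var_direct:.3f}   E[X^2] - E[X]^2: {var_identity:.3f}")

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a
# enters squared.
a, b = 3, 10
var_scaled = statistics.pvariance([a * x + b for x in xs])
print(f"Var(3X + 10): {var_scaled:.3f}   9 * Var(X): {9 * var_direct:.3f}")
```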

Covariance and Correlation

  • $\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
  • If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$ (but not conversely)
  • Correlation: $\rho(X, Y) = \text{Cov}(X, Y) / (\sigma_X \sigma_Y)$, always in $[-1, 1]$
  • The covariance matrix $\Sigma$ collects all pairwise covariances of a random vector
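The covariance formula and the bound $\rho \in [-1, 1]$ can be checked on a simple dependent pair. A stdlib-only sketch (the construction $Y = 2X + \text{noise}$ is an illustrative choice, giving $\text{Cov}(X, Y) = 2\,\text{Var}(X) = 2$ and $\rho = 2/\sqrt{5}$):

```python
import math
import random
import statistics

random.seed(3)

# Y = 2X + noise is linearly dependent on X ~ N(0, 1).
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

# Cov(X, Y) = E[XY] - E[X] E[Y]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
cov = statistics.fmean(x * y for x, y in zip(xs, ys)) - mx * my

# rho(X, Y) = Cov(X, Y) / (sigma_X sigma_Y), always in [-1, 1]
rho = cov / (statistics.pstdev(xs) * statistics.pstdev(ys))
print(f"Cov(X, Y) ≈ {cov:.3f}   (true: 2)")
print(f"rho(X, Y) ≈ {rho:.3f}   (true: 2/sqrt(5) ≈ {2 / math.sqrt(5):.3f})")
```

Python 3.10+ also ships `statistics.covariance` and `statistics.correlation`, which compute the sample ($n - 1$) versions directly.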

Moments and Moment Generating Functions

  • $k$-th moment: $E[X^k]$
  • Moment generating function: $M_X(t) = E[e^{tX}]$
  • The MGF uniquely determines the distribution (when it exists in a neighborhood of $t = 0$)
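Since the MGF is itself an expectation, it can be estimated by a sample average of $e^{tX}$. A stdlib-only sketch for the standard normal, whose MGF has the closed form $M_X(t) = e^{t^2/2}$:

```python
import math
import random
import statistics

random.seed(4)

# For X ~ N(0, 1) the MGF is M_X(t) = exp(t^2 / 2); a sample
# average of e^{tX} should match it at each t.
xs = [random.gauss(0, 1) for _ in range(200_000)]
for t in (0.5, 1.0):
    estimate = statistics.fmean(math.exp(t * x) for x in xs)
    print(f"t = {t}: sample E[e^(tX)] ≈ {estimate:.3f}, "
          f"exp(t^2/2) = {math.exp(t * t / 2):.3f}")
```

Note the growing sampling noise as $t$ increases: $e^{tX}$ is heavy-tailed, which is one reason MGF estimates are mostly a theoretical tool rather than a practical one.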

Summary

Answering the Central Question: The expectation $E[X]$ gives the "center" of a distribution, the variance $\text{Var}(X)$ measures its spread, and the covariance $\text{Cov}(X, Y)$ captures linear dependence between two variables. Linearity of expectation is arguably the most useful property in probability, holding universally without independence assumptions. The moment generating function $M_X(t) = E[e^{tX}]$ encodes all moments and uniquely determines the distribution, serving as a powerful theoretical tool.


Applications in Data Science and Machine Learning

  • Empirical risk minimization: The training loss approximates $E[L(\theta)]$ via the law of large numbers
  • Bias-variance decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}$
  • Covariance matrices: Central to PCA, Gaussian discriminant analysis, Mahalanobis distance, and whitening
  • Batch normalization: Uses running estimates of $E[X]$ and $\text{Var}(X)$ to normalize activations
  • Law of total variance: $\text{Var}(Y) = E[\text{Var}(Y \mid X)] + \text{Var}(E[Y \mid X])$ decomposes variance into explained and unexplained components
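The law of total variance can be verified numerically on a mixture. A stdlib-only sketch (the two groups and their parameters are made up for the demo): draw a group $X$, then $Y \mid X \sim N(\mu_X, \sigma_X^2)$, and check that the within-group and between-group pieces add up to the total variance.

```python
import random
import statistics

random.seed(5)

# Two-group mixture: law of total variance says
# Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
groups = {0: (0.0, 1.0), 1: (4.0, 2.0)}  # group -> (mu, sigma)

pairs = []
for _ in range(100_000):
    g = random.choice((0, 1))
    mu, sigma = groups[g]
    pairs.append((g, random.gauss(mu, sigma)))

ys = [y for _, y in pairs]
total_var = statistics.pvariance(ys)

# E[Var(Y|X)]: within-group variances, averaged over the draws of X.
within = {g: statistics.pvariance([y for gg, y in pairs if gg == g])
          for g in groups}
e_var = statistics.fmean(within[g] for g, _ in pairs)

# Var(E[Y|X]): variance of the group means over the draws of X.
means = {g: statistics.fmean(y for gg, y in pairs if gg == g)
         for g in groups}
var_e = statistics.pvariance([means[g] for g, _ in pairs])

print(f"Var(Y) = {total_var:.3f}")
print(f"E[Var(Y|X)] + Var(E[Y|X]) = {e_var + var_e:.3f}")
```

The two sides agree exactly (up to floating point), since the sample version of the decomposition is an algebraic identity when empirical group weights are used.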

Guided Problems


References

  1. Blitzstein and Hwang - Introduction to Probability, 2nd ed., Chapters 4, 7
  2. Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Chapter 6.4
  3. Wasserman - All of Statistics, Chapter 3
  4. CMU 36-700 - Statistical Machine Learning