Limit Theorems and Concentration
The Central Question: Why Do Averages Converge, and How Fast?
Machine learning relies on the idea that training on a finite sample tells us something about the true distribution. The law of large numbers says averages converge to expectations. The central limit theorem describes how fast. Concentration inequalities give finite-sample bounds that are essential for generalization guarantees.
Consider these scenarios:
- SGD approximates the true gradient with a mini-batch average. The LLN says this approximation improves with batch size. The CLT tells us the approximation error is approximately Gaussian.
- A PAC learning bound says: with probability $1 - \delta$, the test error is within $\epsilon$ of the training error, provided $n \geq \frac{1}{2\epsilon^2}\log\frac{2}{\delta}$. This bound comes from Hoeffding's inequality.
- Cross-validation estimates test error by averaging over folds. Concentration inequalities tell us how many folds and samples we need for this estimate to be reliable.
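The first scenario can be checked directly by simulation. The sketch below (using a Uniform(0,1) "gradient coordinate" with true mean 0.5 as an illustrative stand-in for a per-example gradient) estimates the root-mean-square error of a mini-batch average at several batch sizes and compares it to the theoretical $\sigma/\sqrt{B}$ rate; the distribution and batch sizes are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

def batch_mean_error(batch_size, trials=2000):
    """RMS error of a mini-batch average estimating a true mean of 0.5."""
    errs = []
    for _ in range(trials):
        draws = [random.random() for _ in range(batch_size)]  # Uniform(0,1), mean 0.5
        errs.append((sum(draws) / batch_size - 0.5) ** 2)
    return math.sqrt(sum(errs) / trials)

for b in (8, 64, 512):
    # Theory: RMS error ~ sigma/sqrt(b), with sigma = sqrt(1/12) for Uniform(0,1)
    print(b, round(batch_mean_error(b), 4), round(math.sqrt(1 / 12) / math.sqrt(b), 4))
```

Growing the batch by a factor of 64 should shrink the error by roughly a factor of 8, matching the $1/\sqrt{B}$ prediction.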
Limit theorems and concentration inequalities are the mathematical foundation of statistical learning theory.
Topics to Cover
Law of Large Numbers
- Weak LLN: $\bar{X}_n \xrightarrow{P} \mu$ (convergence in probability)
- Strong LLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence)
- Why empirical risk minimization works: training loss converges to expected loss
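A minimal empirical check of the LLN: the running mean of fair-coin flips should approach 0.5 as $n$ grows. The coin and sample sizes here are illustrative choices.

```python
import random

random.seed(1)

def running_mean(n):
    """Empirical mean of n fair-coin flips; the LLN says this -> 0.5."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, running_mean(n))
```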
Central Limit Theorem
- Rate of convergence: standard error $\sigma/\sqrt{n}$, since $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
- Justification for confidence intervals and hypothesis tests
- Berry-Esseen theorem: speed of CLT convergence
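To see the CLT in action, the sketch below standardizes the mean of $n = 50$ Uniform(0,1) draws and checks that the resulting values behave like a standard normal: about 68.3% should fall within one standard deviation. The choice of distribution, $n$, and repetition count are all illustrative.

```python
import math
import random

random.seed(2)

def standardized_mean(n):
    """(Xbar - mu) / (sigma / sqrt(n)) for n Uniform(0,1) draws."""
    mu, sigma = 0.5, math.sqrt(1 / 12)
    xbar = sum(random.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

zs = [standardized_mean(50) for _ in range(5000)]
within_one_sd = sum(abs(z) <= 1 for z in zs) / len(zs)
print(round(within_one_sd, 3))  # CLT predicts roughly 0.683 for a standard normal
```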
Tail Bounds and Concentration Inequalities
- Markov's inequality: $P(X \geq t) \leq \mathbb{E}[X]/t$ for non-negative $X$
- Chebyshev's inequality: $P(|X - \mu| \geq t) \leq \sigma^2/t^2$
- Hoeffding's inequality: $P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)$ for bounded r.v.s $X_i \in [a, b]$
- Chernoff bound: $P(X \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}]$ (exponential tail bound via MGF)
- Sub-Gaussian and sub-exponential random variables
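The hierarchy of bounds above can be compared numerically. This sketch estimates the true tail probability $P(|\bar{X}_n - \mu| \geq t)$ for the mean of $n = 100$ Bernoulli(0.5) draws and checks it against the Chebyshev and Hoeffding bounds; the parameters $n$, $t$, and the Bernoulli distribution are illustrative choices.

```python
import math
import random

random.seed(3)

n, t, trials = 100, 0.1, 20000
p = 0.5  # Bernoulli(0.5): mu = 0.5, Var = 0.25, bounded in [0, 1]

# Monte Carlo estimate of the two-sided tail P(|Xbar - mu| >= t)
hits = 0
for _ in range(trials):
    xbar = sum(random.random() < p for _ in range(n)) / n
    hits += abs(xbar - p) >= t
empirical = hits / trials

chebyshev = (p * (1 - p) / n) / t**2       # Var(Xbar) / t^2
hoeffding = 2 * math.exp(-2 * n * t**2)    # for r.v.s bounded in [0, 1]

print(f"empirical={empirical:.3f} chebyshev={chebyshev:.3f} hoeffding={hoeffding:.3f}")
```

Both bounds hold, but both are conservative here; the true tail is several times smaller, which is typical of worst-case concentration bounds.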
Connection to PAC Learning
- PAC framework: "Probably Approximately Correct" learning
- Sample complexity from Hoeffding: $n \geq \frac{1}{2\epsilon^2}\log\frac{2}{\delta}$ for a single hypothesis
- Union bound over hypothesis class
- VC dimension as a measure of hypothesis complexity
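Combining Hoeffding with a union bound over a finite hypothesis class $\mathcal{H}$ gives the sample complexity $n \geq \frac{1}{2\epsilon^2}\log\frac{2|\mathcal{H}|}{\delta}$. A small calculator (the function name and the example parameter values are illustrative):

```python
import math

def pac_sample_complexity(epsilon, delta, num_hypotheses):
    """Samples needed so that every h in a finite class H has empirical risk
    within epsilon of its true risk, with probability at least 1 - delta
    (Hoeffding + union bound over |H| hypotheses)."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon**2))

print(pac_sample_complexity(0.05, 0.05, 1))       # single hypothesis: 738
print(pac_sample_complexity(0.05, 0.05, 10**6))   # million-hypothesis class: 3501
```

Note that a million-fold larger hypothesis class costs only about 5x more samples: the dependence on $|\mathcal{H}|$ is logarithmic.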
Summary
Answering the Central Question: The LLN guarantees that sample averages converge to population expectations, justifying empirical risk minimization. The CLT quantifies the fluctuations as approximately Gaussian with standard error $\sigma/\sqrt{n}$. Concentration inequalities (Hoeffding, Chernoff) provide non-asymptotic, finite-sample bounds: $P(|\bar{X}_n - \mu| \geq t) \leq 2e^{-2nt^2}$ for random variables bounded in $[0, 1]$. These bounds directly yield PAC learning sample complexity: $n = O\!\left(\frac{1}{\epsilon^2}\log\frac{|\mathcal{H}|}{\delta}\right)$ samples suffice to learn within error $\epsilon$ with probability $1 - \delta$.
Applications in Data Science and Machine Learning
- Generalization bounds: Concentration inequalities + union bounds give PAC-style guarantees on test error
- SGD convergence: LLN and CLT justify stochastic gradient estimates; variance reduction techniques improve the rate
- Bootstrap and cross-validation: CLT justifies approximate confidence intervals for model performance estimates
- Bandit algorithms: Concentration inequalities (Hoeffding, Bernstein) drive UCB-style exploration strategies
- Differential privacy: Concentration bounds on noise addition ensure privacy guarantees hold with high probability
- Random features and sketching: Johnson-Lindenstrauss and related concentration results guarantee quality of random projections
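The bandit application above can be made concrete with the classic UCB1 rule, which pulls the arm maximizing (empirical mean) + (Hoeffding-style confidence radius). The sketch below runs UCB1 on three hypothetical Bernoulli arms; the arm means, horizon, and seed are illustrative choices.

```python
import math
import random

random.seed(4)

def ucb1(means, horizon=5000):
    """UCB1: pull the arm maximizing mean + sqrt(2 ln t / n_pulls),
    a Hoeffding-style upper confidence bound. Returns pull counts."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = float(random.random() < means[arm])  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.5, 0.7])
print(counts)  # the 0.7 arm should dominate the pull counts
```

Because the confidence radius shrinks like $\sqrt{\log t / n_a}$, suboptimal arms are pulled only $O(\log T)$ times, which is exactly the concentration argument at work.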
Guided Problems
References
- Wasserman, Larry - All of Statistics, Chapters 5-6
- Shalev-Shwartz and Ben-David - Understanding Machine Learning, Chapters 2-4
- Wainwright, Martin - High-Dimensional Statistics, Chapter 2
- Vershynin, Roman - High-Dimensional Probability, Chapters 2-3