Information Theory
The Central Question: How Do We Measure Information, Surprise, and the Distance Between Distributions?
Information theory, originally developed for communication, provides the mathematical tools for measuring uncertainty (entropy), comparing distributions (KL divergence), and quantifying shared information (mutual information). These concepts appear throughout ML, from loss functions to generative models.
Consider these scenarios:
- The cross-entropy loss used to train classifiers is literally the cross-entropy between the true label distribution and the model's predicted distribution.
- The KL divergence $D_{\mathrm{KL}}(q \,\|\, p)$ measures how far an approximate posterior $q$ is from the true posterior $p$. Variational inference minimizes this divergence.
- Mutual information $I(X; Y)$ measures how much knowing $X$ reduces uncertainty about $Y$. It is used in feature selection, decision tree splitting, and representation learning.
Information theory gives ML its fundamental notions of "goodness of fit" and "information content."
Topics to Cover
Entropy
- Shannon entropy: $H(X) = -\sum_x p(x) \log p(x)$ (discrete) or $h(X) = -\int p(x) \log p(x)\, dx$ (differential)
- Interpretation: average surprise, average number of bits to encode
- Properties: non-negative for discrete distributions (differential entropy can be negative); maximized by the uniform distribution (discrete) or the Gaussian (among distributions with a given mean and variance)
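A minimal sketch of the discrete formula, checking the key property above: among distributions over the same number of outcomes, the uniform one attains the maximum entropy $\log_2 n$.

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x); bits when base=2.
    Terms with p(x) = 0 contribute nothing (0 log 0 := 0)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform = [0.25] * 4           # uniform over 4 outcomes
skewed = [0.7, 0.1, 0.1, 0.1]  # same support, more concentrated

print(entropy(uniform))  # 2.0 bits = log2(4), the maximum for 4 outcomes
print(entropy(skewed))   # < 2.0: a more predictable variable, less surprise
```

A degenerate distribution (all mass on one outcome) gives entropy 0: no uncertainty at all.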
Cross-Entropy
- Cross-entropy: $H(p, q) = -\sum_x p(x) \log q(x)$
- Interpretation: average bits needed to encode data from $p$ using a code optimized for $q$
- Connection to the cross-entropy loss in classification
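A short sketch of the coding interpretation: encoding data from $p$ with a code built for the wrong distribution $q$ always costs at least $H(p)$ bits, with equality only when $q = p$.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average bits to encode draws
    from p using a code optimized for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # true source distribution
q = [1/3, 1/3, 1/3]      # mismatched coding distribution

print(cross_entropy(p, q))  # log2(3) ~ 1.585 bits with the wrong code
print(cross_entropy(p, p))  # H(p) = 1.5 bits, the optimum
```

The gap between the two values is exactly $D_{\mathrm{KL}}(p \,\|\, q)$, which is why cross-entropy and KL divergence differ only by the constant $H(p)$.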
KL Divergence
- Definition: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
- Not symmetric: $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general
- Forward KL ($D_{\mathrm{KL}}(p \,\|\, q)$): mean-seeking; Reverse KL ($D_{\mathrm{KL}}(q \,\|\, p)$): mode-seeking
- Connection to likelihood: minimizing $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$ over parameters $\theta$ is equivalent to maximizing the expected log-likelihood
Mutual Information
- Measures dependence beyond linear correlation
- Definition: $I(X; Y) = H(X) - H(X \mid Y) = D_{\mathrm{KL}}(p(x, y) \,\|\, p(x)\,p(y))$
- Properties: $I(X; Y) \geq 0$, with $I(X; Y) = 0$ iff $X$ and $Y$ are independent
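A minimal sketch computing $I(X; Y)$ directly from a joint probability table, checking the two extremes: an independent joint gives zero, and a perfectly dependent one gives a full bit.

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ),
    where joint[i][j] = p(X=i, Y=j) and marginals come from row/column sums."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y) everywhere
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X fully determines Y

print(mutual_information(independent))  # 0.0: knowing X says nothing about Y
print(mutual_information(dependent))    # 1.0 bit: knowing X removes all uncertainty in Y
```

Note that the `dependent` table has zero linear structure you could not also capture with correlation here, but mutual information detects arbitrary (including nonlinear) dependence in the general case.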
Summary
Answering the Central Question: Entropy measures the inherent uncertainty in a distribution. Cross-entropy measures the cost of using distribution $q$ to encode data from $p$. KL divergence measures the extra cost, and is always non-negative (Gibbs' inequality). Mutual information measures the total dependence between variables. These quantities appear directly as loss functions, optimization objectives, and information measures throughout ML.
Applications in Data Science and Machine Learning
- Cross-entropy loss: The standard classification loss is $-\sum_c y_c \log \hat{y}_c$, the cross-entropy between the one-hot label distribution $y$ and the predicted probabilities $\hat{y}$
- Variational inference: The ELBO maximizes a lower bound on $\log p(x)$, equivalent to minimizing $D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z \mid x))$
- GANs: Some formulations minimize $f$-divergences related to KL divergence
- Decision trees: Split criteria (information gain) use entropy and mutual information
- InfoNCE and contrastive learning: Maximize a lower bound on mutual information between representations
- Rate-distortion theory: The information bottleneck method trades off compression (rate) and prediction (distortion)
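To connect the first application back to the formulas: with a one-hot label, the cross-entropy sum collapses to the negative log-probability of the true class. The function and example values below are illustrative, not from any particular library.

```python
import math

def cross_entropy_loss(probs, label):
    """Per-example classification loss: -log q(true class).
    With a one-hot label y, -sum_c y_c log q_c reduces to this single term."""
    return -math.log(probs[label])

# Hypothetical 3-class softmax output; the model favors class 0
probs = [0.7, 0.2, 0.1]

print(cross_entropy_loss(probs, 0))  # small loss: confident and correct
print(cross_entropy_loss(probs, 2))  # large loss: true class got low probability
```

This is why training with cross-entropy pushes the model to place probability mass on the observed labels, i.e., maximum likelihood in disguise.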
Guided Problems
References
- Cover, Thomas and Joy Thomas - Elements of Information Theory, 2nd ed., Chapters 2, 8
- Bishop, Christopher - Pattern Recognition and Machine Learning, Section 1.6
- Murphy, Kevin - Machine Learning: A Probabilistic Perspective, Section 2.8
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville - Deep Learning, Section 3.13