Joint, Marginal, and Conditional Distributions
The Central Question: How Do We Work with Multiple Random Variables Together?
Machine learning models deal with collections of variables: features, labels, latent variables, and parameters. Understanding how these variables relate through their joint distribution, and how to extract information via marginalization and conditioning, is fundamental to probabilistic modeling.
Consider these scenarios:
- A generative model defines the joint $p(x, y) = p(y)\,p(x \mid y)$. To predict, we need $p(y \mid x)$, which requires Bayes' theorem and marginalization.
- In a latent variable model, $p(x) = \int p(x \mid z)\,p(z)\,dz$. Computing this marginal likelihood requires integrating out the latent variable $z$.
- The EM algorithm alternates between computing $p(z \mid x, \theta)$ (conditioning) and maximizing $\mathbb{E}_{p(z \mid x, \theta)}[\log p(x, z \mid \theta)]$ (expectation under a conditional distribution).
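The latent-variable scenario above can be made concrete with a small sketch. The mixture below (a hypothetical two-component Gaussian mixture; all numbers are illustrative, not from the text) shows the marginal likelihood $p(x) = \sum_z p(z)\,p(x \mid z)$ obtained by summing out $z$, and the posterior $p(z \mid x)$ obtained by Bayes' theorem:

```python
import numpy as np

# Hypothetical 2-component Gaussian mixture:
# z ~ Categorical(weights), x | z ~ Normal(mus[z], sigmas[z]^2)
weights = np.array([0.3, 0.7])   # p(z)
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_likelihood(x):
    # p(x) = sum_z p(z) p(x|z): the latent variable z is summed out
    return np.sum(weights * normal_pdf(x, mus, sigmas))

def posterior_z(x):
    # Bayes' theorem: p(z|x) = p(z) p(x|z) / p(x)
    joint = weights * normal_pdf(x, mus, sigmas)
    return joint / joint.sum()
```

`posterior_z` is exactly the "responsibility" computation in the E-step of EM for mixture models.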
Joint, marginal, and conditional distributions are the building blocks of probabilistic ML.
Topics to Cover
Joint Distributions
- Joint PMF: $p_{X,Y}(x, y) = P(X = x, Y = y)$
- Joint PDF: $f_{X,Y}(x, y)$, where $P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$
- The product rule: $p(x, y) = p(x)\,p(y \mid x) = p(y)\,p(x \mid y)$
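For discrete variables, a joint PMF is just a nonnegative table that sums to 1, and the product rule can be checked mechanically. A minimal sketch (the table entries below are made-up illustrative numbers):

```python
import numpy as np

# Hypothetical joint PMF over X in {0,1} (rows) and Y in {0,1,2} (columns)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.25, 0.20]])
assert np.isclose(joint.sum(), 1.0)   # a valid PMF sums to 1

p_x = joint.sum(axis=1)               # marginal p(x), summing out y
p_y_given_x = joint / p_x[:, None]    # conditional p(y|x) = p(x,y)/p(x)

# Product rule: p(x, y) = p(x) p(y|x) recovers the joint exactly
reconstructed = p_x[:, None] * p_y_given_x
assert np.allclose(reconstructed, joint)
```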
Marginalization
- Discrete: $p_X(x) = \sum_y p_{X,Y}(x, y)$
- Continuous: $f_X(x) = \int f_{X,Y}(x, y)\,dy$
- The sum rule of probability: $p(x) = \sum_y p(x, y)$
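In the continuous case, marginalization is an integral, which can be sanity-checked numerically. The sketch below uses a standard bivariate normal with correlation $\rho$ (chosen here as an illustrative example because both of its marginals are known to be standard normal) and approximates $f_X(x) = \int f_{X,Y}(x, y)\,dy$ on a grid:

```python
import numpy as np

# Standard bivariate normal density with correlation rho;
# both marginals are N(0, 1), which gives us an exact answer to compare against.
rho = 0.6

def f_xy(x, y):
    norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
    quad = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return norm * np.exp(-0.5 * quad)

# f_X(x) = integral of f(x, y) dy, approximated by a Riemann sum over a wide grid
ys = np.linspace(-8.0, 8.0, 4001)
dy = ys[1] - ys[0]
x0 = 0.7
f_x_numeric = np.sum(f_xy(x0, ys)) * dy
f_x_exact = np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)   # standard normal pdf at x0
```

The grid width and spacing are arbitrary choices; the point is only that integrating out $y$ recovers the known marginal.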
Conditional Distributions
- Conditional distribution: $p_{Y \mid X}(y \mid x) = \dfrac{p_{X,Y}(x, y)}{p_X(x)}$ for $p_X(x) > 0$
- Conditional expectation: $\mathbb{E}[Y \mid X = x] = \sum_y y\, p_{Y \mid X}(y \mid x)$ (an integral in the continuous case)
- Law of total expectation: $\mathbb{E}[Y] = \mathbb{E}\big[\mathbb{E}[Y \mid X]\big]$
- Law of total variance: $\operatorname{Var}(Y) = \mathbb{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\mathbb{E}[Y \mid X])$
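Both laws can be verified exactly on a small discrete joint distribution. A sketch with a hypothetical joint table (same style as above; the numbers are illustrative):

```python
import numpy as np

# Hypothetical joint PMF over X in {0,1} (rows) and Y in {1,2,3} (columns)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.25, 0.20]])
ys = np.array([1.0, 2.0, 3.0])

p_x = joint.sum(axis=1)               # marginal p(x)
p_y = joint.sum(axis=0)               # marginal p(y)
p_y_given_x = joint / p_x[:, None]    # conditional p(y|x)

# Conditional moments E[Y|X=x] and Var(Y|X=x) for each x
e_y_given_x = p_y_given_x @ ys
var_y_given_x = p_y_given_x @ ys**2 - e_y_given_x**2

# Unconditional moments computed directly from the marginal of Y
e_y = p_y @ ys
var_y = p_y @ ys**2 - e_y**2

# Law of total expectation: E[Y] = E[E[Y|X]]
assert np.isclose(e_y, p_x @ e_y_given_x)
# Law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X])
assert np.isclose(var_y, p_x @ var_y_given_x + p_x @ (e_y_given_x - e_y)**2)
```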
Summary
Answering the Central Question: Multiple random variables are described by their joint distribution $p(x, y)$. Marginalization ($p(x) = \sum_y p(x, y)$, or an integral in the continuous case) recovers individual distributions by summing/integrating out other variables. Conditioning ($p(y \mid x) = p(x, y)/p(x)$) gives the distribution of one variable given knowledge of another. The product rule $p(x, y) = p(x)\,p(y \mid x)$ connects these three operations and is the foundation of Bayes' theorem. The laws of total expectation and total variance allow computation by conditioning.
Applications in Data Science and Machine Learning
- Generative vs discriminative models: Generative models define $p(x, y)$ and derive $p(y \mid x)$; discriminative models model $p(y \mid x)$ directly
- Latent variable models: VAEs, GMMs, and HMMs all require marginalizing over latent variables
- EM algorithm: Alternates between conditioning on latent variables (E-step) and maximizing the expected complete-data likelihood (M-step)
- Graphical models: Factor joint distributions using conditional independence: $p(x_1, \dots, x_n) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$
- Missing data: Marginalization over missing features allows prediction with incomplete observations
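The graphical-model factorization above can be sketched on a toy chain $X \to Y \to Z$ with binary variables, where $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$. The conditional probability tables below are hypothetical illustrative numbers; the check at the end confirms the conditional independence the graph encodes, $p(z \mid x, y) = p(z \mid y)$:

```python
import numpy as np

# Toy chain X -> Y -> Z, all variables binary (hypothetical tables)
p_x = np.array([0.6, 0.4])
p_y_given_x = np.array([[0.7, 0.3],    # rows: x, cols: y
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.9, 0.1],    # rows: y, cols: z
                        [0.4, 0.6]])

# Build the full joint from the factorization p(x,y,z) = p(x) p(y|x) p(z|y)
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]
assert np.isclose(joint.sum(), 1.0)

# Conditional independence implied by the graph: p(z | x, y) = p(z | y)
p_xy = joint.sum(axis=2)
p_z_given_xy = joint / p_xy[:, :, None]
for x in range(2):
    assert np.allclose(p_z_given_xy[x], p_z_given_y)
```

Marginalizing this joint over any axis (e.g. `joint.sum(axis=0)` for $p(y, z)$) is exactly the sum rule applied to a factored model, the operation at the heart of inference in graphical models.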
Guided Problems
References
- Bishop, Christopher - Pattern Recognition and Machine Learning, Sections 1.2 and 2.1
- Murphy, Kevin - Machine Learning: A Probabilistic Perspective, Section 2.2
- Blitzstein and Hwang - Introduction to Probability, 2nd ed., Chapters 7-8
- Koller and Friedman - Probabilistic Graphical Models, Chapter 2