Common Distributions
The Central Question: Which Distribution Families Appear Most in Machine Learning?
A handful of distribution families appear again and again in ML: Gaussian for continuous data and noise, Bernoulli for binary outcomes, Categorical for classification, Poisson for counts. Knowing their properties, conjugacies, and relationships is essential for building and understanding probabilistic models.
Consider these scenarios:
- Linear regression assumes Gaussian noise: $y = w^\top x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. This assumption leads to the least squares loss.
- Logistic regression models binary outcomes with a Bernoulli distribution: $p(y = 1 \mid x) = \sigma(w^\top x)$, where $\sigma$ is the logistic sigmoid. The loss function is the negative log-likelihood (binary cross-entropy); both correspondences are sketched in code after this list.
- The Beta distribution is the conjugate prior for the Bernoulli, making Bayesian updating analytically tractable.
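To make the likelihood-to-loss correspondence concrete, here is a minimal NumPy sketch (the helper names `gaussian_nll` and `bernoulli_nll` are ours, purely illustrative): the Gaussian negative log-likelihood is the sum of squared errors up to a scale factor and an additive constant, and the Bernoulli negative log-likelihood is exactly binary cross-entropy.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma=1.0):
    """Negative log-likelihood of y under N(y_hat, sigma^2).
    Up to an additive constant, this is sum((y - y_hat)^2) / (2 sigma^2),
    so minimizing it over y_hat is exactly least squares."""
    const = 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
    return const + np.sum((y - y_hat) ** 2) / (2 * sigma**2)

def bernoulli_nll(y, p_hat):
    """Negative log-likelihood of binary y under Bernoulli(p_hat):
    this is the binary cross-entropy loss, term for term."""
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y_bin = np.array([1.0, 0.0, 1.0])
p_hat = np.array([0.9, 0.2, 0.7])
print(bernoulli_nll(y_bin, p_hat))  # same value as binary cross-entropy
```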
Knowing distributions means knowing which loss functions, priors, and likelihoods to use.
Topics to Cover
Discrete Distributions
- Bernoulli($p$): binary outcome, $P(X = 1) = p$, mean $p$, variance $p(1 - p)$
- Binomial($n, p$): number of successes in $n$ independent trials
- Poisson($\lambda$): count of rare events, $P(X = k) = e^{-\lambda} \lambda^k / k!$, mean and variance both $\lambda$
- Categorical($p_1, \dots, p_K$) and Multinomial: multi-class generalization (the moments above are checked numerically in the sketch after this list)
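As a quick sanity check of the listed moments, a short sketch using `scipy.stats` (assumed available; any distribution library would serve):

```python
from scipy import stats

# Verify the textbook moments numerically.
bern = stats.bernoulli(p=0.3)
print(bern.mean(), bern.var())    # 0.3  0.21   -> p, p(1 - p)

binom = stats.binom(n=10, p=0.3)
print(binom.mean(), binom.var())  # 3.0  2.1    -> np, np(1 - p)

pois = stats.poisson(mu=4.0)
print(pois.mean(), pois.var())    # 4.0  4.0    -> lambda for both
print(pois.pmf(2))                # e^{-4} 4^2 / 2! ~ 0.1465
```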
Continuous Distributions
- Uniform($a, b$): constant density $1/(b - a)$ on $[a, b]$
- Gaussian($\mu, \sigma^2$): the normal distribution, bell curve, central to the CLT
- Exponential($\lambda$): memoryless waiting time (memorylessness is verified numerically in the sketch after this list)
- Gamma($\alpha, \beta$): generalization of the Exponential, conjugate prior for the Poisson rate
- Beta($\alpha, \beta$): distribution on $[0, 1]$, conjugate prior for the Bernoulli parameter
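The memoryless property $P(X > s + t \mid X > s) = P(X > t)$ is easy to confirm numerically. A sketch assuming `scipy.stats`, which parameterizes the Exponential by `scale = 1/λ` rather than by the rate:

```python
from scipy import stats

# Memorylessness of Exponential(lam): P(X > s + t | X > s) = P(X > t).
lam, s, t = 2.0, 1.5, 0.7
X = stats.expon(scale=1 / lam)   # scipy convention: scale = 1/lambda

lhs = X.sf(s + t) / X.sf(s)      # sf(x) = P(X > x), the survival function
rhs = X.sf(t)
print(lhs, rhs)                  # both equal e^{-lam * t} ~ 0.2466
```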
Relationships Between Distributions
- Binomial $\to$ Poisson as $n \to \infty$ with $np \to \lambda$ (rare events limit; see the numerical check after this list)
- Binomial $\to$ Gaussian for large $n$ (CLT)
- Gamma $\to$ Exponential (special case $\alpha = 1$)
- Beta-Binomial conjugacy
- Normal-Normal conjugacy
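These relationships can be checked numerically. A sketch with `scipy.stats` illustrating the rare-events limit and the Gamma-Exponential special case (again, scipy uses `scale = 1/rate`):

```python
from scipy import stats

# Rare-events limit: Binomial(n, lam/n) -> Poisson(lam) as n grows.
lam, k = 3.0, 2
for n in (10, 100, 10_000):
    print(n, stats.binom(n=n, p=lam / n).pmf(k))
print("Poisson:", stats.poisson(mu=lam).pmf(k))  # ~ 0.2240

# Special case alpha = 1: Gamma(1, beta) is Exponential with rate beta.
x = 0.8
print(stats.gamma(a=1.0, scale=1 / lam).pdf(x))  # equals the line below
print(stats.expon(scale=1 / lam).pdf(x))         # lam * e^{-lam * x}
```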
Summary
Answering the Central Question: The core distribution families in ML are: Bernoulli/Binomial (binary/count data, logistic regression), Categorical/Multinomial (classification), Poisson (event counts), Gaussian (continuous data, noise), Exponential/Gamma (rates, waiting times), and Beta (probabilities, priors). Each has known mean, variance, and moment generating function. Relationships between them (limits, conjugacies, special cases) simplify both theoretical analysis and practical computation.
Applications in Data Science and Machine Learning
- Gaussian noise assumption: Leads to MSE loss in regression
- Bernoulli/Categorical likelihood: Leads to cross-entropy loss in classification
- Poisson regression: Modeling count data (click counts, word frequencies)
- Conjugate priors: Beta-Bernoulli, Gamma-Poisson, and Normal-Normal pairs enable closed-form Bayesian updates (a worked Beta-Bernoulli update follows this list)
- Mixture models: Gaussian mixture models (GMMs) combine multiple Gaussians for density estimation and clustering
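To illustrate the conjugate-prior bullet above, here is a minimal Beta-Bernoulli posterior update in plain Python; the prior hyperparameters and data are made up for illustration:

```python
# Closed-form update under Beta-Bernoulli conjugacy:
# prior Beta(alpha, beta) + h successes in n Bernoulli trials
# -> posterior Beta(alpha + h, beta + n - h).
alpha, beta = 2.0, 2.0           # prior pseudo-counts (illustrative choice)
data = [1, 0, 1, 1, 0, 1]        # observed binary outcomes
h, n = sum(data), len(data)

alpha_post, beta_post = alpha + h, beta + (n - h)
post_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, post_mean)  # 6.0  4.0  0.6
```

The posterior mean $(\alpha + h)/(\alpha + \beta + n)$ interpolates between the prior mean and the empirical frequency, which is why the hyperparameters are often read as pseudo-counts.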
Guided Problems
References
- Blitzstein and Hwang - Introduction to Probability, 2nd ed., Chapters 3-8
- Bishop - Pattern Recognition and Machine Learning, Sections 2.3-2.4
- Murphy - Machine Learning: A Probabilistic Perspective, Chapter 2
- Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Section 6.2