Maximum Likelihood Estimation
The Central Question: How Do We Find the Best Parameters for a Probabilistic Model?
Given a model $p(x \mid \theta)$ and observed data $x_1, \dots, x_n$, how do we choose $\theta$? Maximum likelihood estimation (MLE) chooses the parameters that make the observed data most probable. It is the most widely used estimation method in ML and connects directly to common loss functions.
Consider these scenarios:
- Fitting a Gaussian to data: the MLE for $\mu$ is the sample mean and for $\sigma^2$ is the (biased) sample variance. No optimization algorithm needed.
- Logistic regression maximizes the Bernoulli likelihood, which is equivalent to minimizing the binary cross-entropy loss. The "loss function" and "negative log-likelihood" are the same object.
- Adding a prior to the MLE gives MAP estimation, which is equivalent to adding a regularization term to the loss. L2 regularization corresponds to a Gaussian prior.
MLE is the bridge between probability and optimization in ML.
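The closed-form Gaussian fit from the first scenario can be sketched in a few lines. This is a minimal illustration on simulated data; the parameter values are chosen only for the demo:

```python
import random

def gaussian_mle(xs):
    """Closed-form MLE for a Gaussian: sample mean and (biased) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # note: divides by n, not n - 1
    return mu, var

random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(10_000)]
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)  # close to the true mu = 2.0 and sigma^2 = 2.25
```

No gradient descent is involved: setting the derivative of the log-likelihood to zero yields these estimates directly.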
Topics to Cover
Maximum Likelihood Estimation
- Likelihood function: $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
- Log-likelihood: $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
- MLE: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \ell(\theta)$
- Examples: Bernoulli, Gaussian, Poisson, Exponential
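As a concrete instance of the Bernoulli example, the sketch below checks the closed-form MLE (the sample proportion) against a brute-force grid search over the negative log-likelihood; the data is a made-up sequence of ten outcomes:

```python
import math

def bernoulli_nll(p, xs):
    """Average negative log-likelihood of 0/1 outcomes xs under Bernoulli(p)."""
    return -sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs) / len(xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]      # 7 successes out of 10
p_hat = sum(xs) / len(xs)                 # closed-form MLE: sample proportion
grid = [i / 100 for i in range(1, 100)]   # candidate parameters 0.01 .. 0.99
p_grid = min(grid, key=lambda p: bernoulli_nll(p, xs))
print(p_hat, p_grid)  # both 0.7
```

The grid search lands on the same value because the Bernoulli negative log-likelihood is convex with its minimum exactly at the sample proportion.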
MAP Estimation
- MAP: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \ell(\theta) + \log p(\theta) \right]$, i.e., MLE with the log-prior acting as a regularizer
- Gaussian prior $\leftrightarrow$ L2 regularization, Laplace prior $\leftrightarrow$ L1 regularization
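A one-dimensional sketch of MAP as regularization: estimating a Gaussian mean under a $\mathcal{N}(0, \tau^2)$ prior shrinks the sample mean toward zero, exactly as an L2 penalty on $\mu$ would. The noise variance `sigma2` and prior variance `tau2` are assumed known here for simplicity:

```python
def map_gaussian_mean(xs, sigma2=1.0, tau2=1.0):
    """MAP estimate of a Gaussian mean under a N(0, tau2) prior.
    Identical to minimizing squared error plus an L2 penalty on mu."""
    n = len(xs)
    xbar = sum(xs) / n
    return (n / sigma2) * xbar / (n / sigma2 + 1 / tau2)

xs = [1.8, 2.2, 2.0, 1.9, 2.1]
mle = sum(xs) / len(xs)           # sample mean 2.0: no shrinkage
map_est = map_gaussian_mean(xs)   # shrunk toward the prior mean 0
print(mle, map_est)
```

As $n$ grows the data term dominates and the MAP estimate approaches the MLE, which is why the prior matters most in small-sample or high-dimensional settings.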
Properties of Estimators
- Consistency: $\hat{\theta}_n \xrightarrow{p} \theta^*$ as $n \to \infty$
- Efficiency: achieving the minimum possible variance
- Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$
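These properties can be checked by simulation. The sketch below draws many Bernoulli samples and compares the spread of the MLE across replications to the theoretical standard deviation $\sqrt{p(1-p)/n}$, which for this model equals $\sqrt{1/(n I(p))}$:

```python
import math
import random

random.seed(1)
p_true, n, reps = 0.3, 400, 2000
# sampling distribution of the MLE (the sample proportion) over many replications
estimates = [sum(random.random() < p_true for _ in range(n)) / n for _ in range(reps)]
mean_hat = sum(estimates) / reps
sd_hat = math.sqrt(sum((e - mean_hat) ** 2 for e in estimates) / reps)
sd_theory = math.sqrt(p_true * (1 - p_true) / n)  # = sqrt(1 / (n * I(p)))
print(mean_hat, sd_hat, sd_theory)
```

The empirical mean is close to the true parameter (consistency) and the empirical spread matches the asymptotic formula.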
Fisher Information and the Cramér–Rao Bound
- Fisher information: $I(\theta) = \mathbb{E}\left[ \left( \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right)^2 \right] = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right]$
- Cramér–Rao bound: $\operatorname{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}$ for any unbiased estimator $\hat{\theta}$
- The MLE achieves the Cramér–Rao bound asymptotically (it is asymptotically efficient)
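The curvature view of Fisher information can be verified numerically. This sketch differentiates the expected Bernoulli log-likelihood twice by finite differences, recovers the analytic value $I(p) = 1/(p(1-p))$, and forms the Cramér–Rao bound for a sample of size $n$:

```python
import math

def expected_loglik(p, p0):
    """Expected per-sample log-likelihood of Bernoulli(p) when the truth is p0."""
    return p0 * math.log(p) + (1 - p0) * math.log(1 - p)

p0, h = 0.3, 1e-4
# Fisher information = negative second derivative at the true parameter
fisher_numeric = -(expected_loglik(p0 + h, p0) - 2 * expected_loglik(p0, p0)
                   + expected_loglik(p0 - h, p0)) / h**2
fisher_analytic = 1 / (p0 * (1 - p0))   # = 1 / 0.21
n = 400
crb = 1 / (n * fisher_numeric)          # lower bound on Var of any unbiased estimator
print(fisher_numeric, fisher_analytic, crb)
```

Sharper curvature of the log-likelihood means more information per observation and hence a smaller attainable variance.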
Summary
Answering the Central Question: MLE finds parameters by maximizing $\ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$, which is equivalent to minimizing the average negative log-likelihood (the loss function). MAP estimation adds a log-prior term, corresponding to regularization. The Fisher information measures how informative the data is about $\theta$, and the Cramér–Rao bound sets a fundamental limit on estimation accuracy. The MLE is consistent, asymptotically normal, and asymptotically efficient.
Applications in Data Science and Machine Learning
- Loss functions as negative log-likelihoods: MSE (Gaussian), cross-entropy (Bernoulli/Categorical), Poisson regression loss
- Regularization as MAP: L2 = Gaussian prior, L1 = Laplace prior; dropout can be interpreted as approximate variational inference
- Fisher information in optimization: the natural gradient uses $I(\theta)^{-1} \nabla_\theta \ell(\theta)$ instead of the plain gradient $\nabla_\theta \ell(\theta)$
- Model comparison: Likelihood ratio tests, AIC, and BIC all use the likelihood
- EM algorithm: Maximizes a lower bound on the log-likelihood when latent variables are present
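To make the first bullet concrete: binary cross-entropy and the Bernoulli negative log-likelihood are literally the same function, as the pointwise check below confirms on a few hand-picked (label, probability) pairs:

```python
import math

def bce(y, p):
    """Binary cross-entropy, as used to train logistic regression."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bernoulli_nll(y, p):
    """Negative log of the Bernoulli pmf p^y * (1 - p)^(1 - y)."""
    return -math.log(p**y * (1 - p) ** (1 - y))

for y, p in [(1, 0.9), (0, 0.2), (1, 0.35)]:
    assert abs(bce(y, p) - bernoulli_nll(y, p)) < 1e-12
print("binary cross-entropy == Bernoulli negative log-likelihood")
```

The same identification holds for the other pairs in the list: MSE is the Gaussian negative log-likelihood up to scaling, and categorical cross-entropy is the Categorical negative log-likelihood.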
Guided Problems
References
- Wasserman, Larry - All of Statistics, Chapter 9
- Bishop, Christopher - Pattern Recognition and Machine Learning, Chapters 1.2, 2.3
- Murphy, Kevin - Machine Learning: A Probabilistic Perspective, Chapter 6
- CMU 36-700 - Statistical Machine Learning