Random Variables
The Central Question: How Do We Mathematically Describe Uncertain Quantities?
We need a formal way to talk about quantities that take different values with different probabilities: the number of clicks on an ad, the height of a person, the pixel values in an image. Random variables give us this language, and their distributions (PMF, PDF, CDF) fully characterize their behavior.
Consider these scenarios:
- The outcome of a coin flip is a discrete random variable taking values in $\{0, 1\}$ with a Bernoulli distribution. The number of heads in $n$ flips follows a Binomial distribution.
- The waiting time until a radioactive decay is a continuous random variable described by a probability density function (PDF). The probability of the event occurring in an interval is the area under the PDF.
- A neural network's output before softmax is a continuous random variable. After softmax, it becomes a probability distribution over classes, connecting continuous and discrete perspectives.
Random variables are the mathematical objects that represent uncertain data in ML.
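The softmax connection in the last scenario can be sketched in a few lines of plain Python (a minimal illustration, not a library implementation):

```python
from math import exp

def softmax(logits):
    """Map raw network outputs (logits) to a probability distribution."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(sum(probs))  # sums to 1: a valid discrete distribution over classes
```

Subtracting the maximum logit before exponentiating is the standard trick to avoid overflow; it leaves the output unchanged because the common factor cancels in the ratio.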
Topics to Cover
Discrete Random Variables
- Definition: a function from the sample space to a countable set
- Probability mass function (PMF): $p_X(x) = P(X = x)$, with $\sum_x p_X(x) = 1$
- Examples: Bernoulli, Binomial, Poisson, Geometric
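As a minimal sketch of a PMF, here is the Binomial case from the coin-flip scenario in plain Python (`binomial_pmf` is a hypothetical helper name for this example):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): probability of k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(sum(pmf))               # the PMF sums to 1 over all outcomes
print(binomial_pmf(5, n, p))  # the most likely number of heads
```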
Continuous Random Variables
- Probability density function (PDF): $f_X(x)$, where $P(a \le X \le b) = \int_a^b f_X(x)\,dx$
- The PDF is not a probability and can exceed 1; only integrals of the PDF over sets are probabilities
- Examples: Uniform, Gaussian, Exponential
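A quick numerical illustration that a density value is not a probability: the Uniform distribution on $[0, 0.5]$ has density 2 everywhere on its support, yet the total area under it is 1 (a sketch; `uniform_pdf` is a hypothetical helper):

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of Uniform(a, b): constant 1/(b-a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(uniform_pdf(0.25))  # 2.0 -- a density value greater than 1

# Probabilities come from integrating the density (Riemann sum here).
dx = 1e-4
area = sum(uniform_pdf(i * dx) * dx for i in range(int(0.5 / dx)))
print(area)  # ≈ 1.0: total probability
```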
Cumulative Distribution Function (CDF)
- Definition: $F_X(x) = P(X \le x)$; properties: non-decreasing, right-continuous, $\lim_{x \to -\infty} F_X(x) = 0$, $\lim_{x \to \infty} F_X(x) = 1$
- Relationship between CDF, PDF, and PMF
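The CDF-PDF relationship can be checked numerically: for an Exponential distribution, a finite-difference derivative of the CDF recovers the PDF (a sketch, assuming rate $\lambda = 2$):

```python
from math import exp

lam = 2.0  # assumed rate parameter for this example

def F(x):
    """Exponential(lam) CDF."""
    return 1 - exp(-lam * x)

def f(x):
    """Exponential(lam) PDF, the derivative of F."""
    return lam * exp(-lam * x)

# A finite-difference derivative of the CDF approximates the PDF.
x, h = 1.0, 1e-6
print((F(x + h) - F(x)) / h)  # ≈ f(1.0)
print(f(1.0))
```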
Transformations of Random Variables
- If $Y = g(X)$, how to find the distribution of $Y$
- CDF method: $F_Y(y) = P(Y \le y) = P(g(X) \le y)$
- PDF transformation (for monotone $g$): $f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$
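A worked instance of the CDF method (a sketch using only the standard library): for $X \sim \mathrm{Uniform}(0, 1)$ and $Y = X^2$, the CDF method gives $F_Y(y) = P(X \le \sqrt{y}) = \sqrt{y}$ on $[0, 1]$, which we can check against the empirical CDF of simulated samples:

```python
import random

random.seed(0)  # reproducible illustration

# X ~ Uniform(0, 1), Y = X**2.
# CDF method: F_Y(y) = P(X**2 <= y) = P(X <= sqrt(y)) = sqrt(y) on [0, 1].
samples = [random.random() ** 2 for _ in range(100_000)]

y = 0.25
empirical = sum(s <= y for s in samples) / len(samples)
print(empirical)  # ≈ sqrt(0.25) = 0.5
```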
Summary
Answering the Central Question: A random variable is a function that assigns numerical values to outcomes of a random experiment. Its distribution is fully described by the PMF (discrete: $p_X(x) = P(X = x)$), the PDF (continuous: $P(a \le X \le b) = \int_a^b f_X(x)\,dx$), or the CDF ($F_X(x) = P(X \le x)$). Transformations of random variables produce new distributions via the CDF method or the change-of-variables formula. These tools allow us to model any uncertain quantity mathematically.
Applications in Data Science and Machine Learning
- Data modeling: Choosing the right distribution family (Gaussian, Poisson, Bernoulli) to model features or targets
- Generative models: Defining distributions over data (VAEs, GANs, diffusion models)
- Reparameterization trick: Transforming a simple random variable (e.g., standard normal) to sample from a complex distribution
- Quantile functions: The inverse CDF is used for generating samples and quantile regression
- Normalizing flows: Repeated application of the change-of-variables formula to transform simple distributions into complex ones
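The inverse-CDF idea behind the quantile-function bullet can be sketched directly: if $U \sim \mathrm{Uniform}(0, 1)$, then $F^{-1}(U) = -\ln(1 - U)/\lambda$ follows an Exponential($\lambda$) distribution (a sketch with an assumed rate $\lambda = 2$):

```python
import random
from math import log

random.seed(1)  # reproducible illustration
lam = 2.0       # assumed rate parameter

# Inverse-CDF (quantile) sampling: if U ~ Uniform(0, 1), then
# F_inv(U) = -ln(1 - U) / lam has an Exponential(lam) distribution.
samples = [-log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(mean)  # ≈ 1 / lam = 0.5, the mean of Exponential(2)
```

This is the same transformation idea the reparameterization trick relies on: push a simple source of randomness through a deterministic map to obtain samples from the target distribution.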
Guided Problems
References
- Blitzstein, Joseph and Hwang, Jessica - Introduction to Probability, 2nd ed., Chapters 3-5
- Bishop, Christopher - Pattern Recognition and Machine Learning, Section 1.2
- Murphy, Kevin - Machine Learning: A Probabilistic Perspective, Chapter 2
- Wasserman, Larry - All of Statistics, Chapters 2-3