Numerical Linear Algebra
The Central Question: When Can We Trust the Computer's Answer?
Computers use finite-precision arithmetic, so every computation introduces rounding errors. When can we trust the result? The condition number answers this: it measures how much input perturbations are amplified in the output. A backward-stable algorithm (one that gives the exact answer to a slightly perturbed problem) combined with a well-conditioned matrix gives reliable results. Understanding this interplay is essential for diagnosing numerical failures in ML training and large-scale linear solvers.
Topics to Cover
Floating-Point Arithmetic and Error
- Machine epsilon and rounding
- Forward vs backward error
- Why "exact" formulas can give wrong answers on a computer
- Catastrophic cancellation (e.g., computing $x - y$ when $x \approx y$)
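Cancellation is easiest to see in the textbook quadratic-formula root $(-b + \sqrt{b^2 - 4ac})/2a$, which subtracts two nearly equal numbers when $b^2 \gg 4ac$. A minimal demo (the coefficient values are chosen purely for illustration):

```python
import math

a, b, c = 1.0, 1e8, 1.0

# Naive formula: sqrt(b^2 - 4ac) is nearly equal to b, so the
# numerator cancels catastrophically and loses most of its digits.
naive = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# Stable rewrite: multiply by the conjugate so no subtraction of
# nearly equal quantities occurs.  True root is approximately -1e-8.
stable = (2 * c) / (-b - math.sqrt(b * b - 4 * a * c))

print("naive: ", naive)
print("stable:", stable)
```

The conjugate trick is the standard fix: both formulas are algebraically exact, but only the second is numerically stable for this sign pattern.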
Vector and Matrix Norms
- Vector norms: $\ell_1$, $\ell_2$ (Euclidean), $\ell_\infty$
- Matrix norms: induced (operator) norms, Frobenius norm
- Submultiplicativity: $\|AB\| \le \|A\|\,\|B\|$
- Spectral norm (largest singular value)
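A quick numerical check of submultiplicativity for both the spectral and Frobenius norms (random test matrices; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

def spec(M):
    return np.linalg.norm(M, 2)      # spectral norm: largest singular value

def frob(M):
    return np.linalg.norm(M, 'fro')  # Frobenius norm: sqrt of sum of squares

# Submultiplicativity ||AB|| <= ||A|| ||B|| holds for both norms
# (small tolerance added for floating-point roundoff).
assert spec(A @ B) <= spec(A) * spec(B) + 1e-12
assert frob(A @ B) <= frob(A) * frob(B) + 1e-12
```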
Condition Number
- $\kappa(A) = \|A\|\,\|A^{-1}\| = \sigma_{\max}/\sigma_{\min}$ (via SVD)
- Meaning: a relative change of $\epsilon$ in $b$ can cause up to $\kappa(A)\,\epsilon$ relative change in $x$
- Well-conditioned ($\kappa$ small) vs ill-conditioned ($\kappa$ large)
- Rule of thumb: lose $\log_{10} \kappa(A)$ digits of accuracy
- Cross-reference to Matrix Inverse: Condition Number and SVD: Condition Number
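A sketch of computing $\kappa$ directly from the singular values, using the famously ill-conditioned Hilbert matrix as a test case (the size $n = 8$ is arbitrary):

```python
import numpy as np

def cond(A):
    # kappa(A) = sigma_max / sigma_min, computed via the SVD
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

# Hilbert matrix H[i, j] = 1 / (i + j + 1): a classic ill-conditioned example.
n = 8
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

kappa = cond(H)
print(f"kappa(H) = {kappa:.2e}, digits lost ~ {np.log10(kappa):.1f}")

# Solving Hx = b with a known solution shows the accuracy loss in practice.
x_true = np.ones(n)
b = H @ x_true
x = np.linalg.solve(H, b)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

For $n = 8$ the condition number is around $10^{10}$, so roughly ten of double precision's sixteen digits are lost even by a backward-stable solver.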
Stability of Algorithms
- Backward stability: the algorithm gives the exact answer to a slightly perturbed problem
- LU with partial pivoting: backward stable
- Normal equations ($A^T A x = A^T b$): $\kappa(A^T A) = \kappa(A)^2$ squares the condition number; unstable for ill-conditioned $A$
- QR: backward stable for least squares
- Cholesky: backward stable for SPD systems
- Cross-reference to QR vs Normal Equations
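The instability of the normal equations shows up clearly on a Vandermonde design matrix, whose columns are nearly collinear. A sketch comparing both least-squares solvers (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 8
t = np.linspace(0, 1, m)
A = np.vander(t, n)              # Vandermonde matrix: ill-conditioned
x_true = rng.standard_normal(n)
b = A @ x_true

# Normal equations: kappa(A^T A) = kappa(A)^2, so accuracy suffers.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR: backward stable; solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

def err(x):
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

print("normal equations error:", err(x_ne))
print("QR error:              ", err(x_qr))
```

The QR solution is typically accurate to near machine precision here, while the normal-equations solution loses twice as many digits as $\kappa(A)$ alone would predict.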
Eigenvalue Computation
- Power method: repeatedly multiply by $A$ and normalize; converges to the dominant eigenvector
- Convergence rate: $|\lambda_2 / \lambda_1|^k$
- Inverse iteration: apply the power method to $(A - \mu I)^{-1}$ to find the eigenvalue nearest $\mu$
- QR algorithm: the workhorse of eigenvalue computation
- Iterate: $A_k = Q_k R_k$, then $A_{k+1} = R_k Q_k$
- Converges to upper triangular (Schur form); eigenvalues on diagonal
- With shifts: cubic convergence for symmetric matrices
- Lanczos algorithm (brief): for large sparse symmetric matrices, builds a tridiagonal approximation in Krylov subspace
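A minimal power-method sketch, run on a symmetric test matrix built to have known eigenvalues $\{5, 2, 1\}$ (the construction is purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a symmetric matrix with known eigenvalues 5, 2, 1 by rotating
# a diagonal matrix with a random orthogonal Q.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag([5.0, 2.0, 1.0]) @ Q.T

# Power method: x <- A x / ||A x||, converging at rate |2/5|^k.
x = rng.standard_normal(3)
for _ in range(100):
    x = A @ x
    x /= np.linalg.norm(x)

lam = x @ A @ x  # Rayleigh quotient estimate of the dominant eigenvalue
print("dominant eigenvalue estimate:", lam)
```

After 100 iterations the error factor $(2/5)^{100}$ is far below machine precision, so the Rayleigh quotient recovers the dominant eigenvalue $5$ essentially exactly.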
Iterative Methods for $Ax = b$
- When to use iterative vs direct: large, sparse systems where $O(n^3)$ dense factorization is too expensive
- Jacobi iteration: split $A = D + R$ with $D$ the diagonal, iterate $x^{(k+1)} = D^{-1}\left(b - R x^{(k)}\right)$
- Gauss-Seidel: use updated values immediately; faster convergence
- Convergence condition: spectral radius $\rho(M) < 1$, where $M$ is the iteration matrix
- Conjugate gradient (CG): the optimal iterative solver for SPD systems
- Converges in at most $n$ steps (in exact arithmetic)
- Practical convergence in $O(\sqrt{\kappa(A)})$ iterations
- Preconditioning: solve $M^{-1} A x = M^{-1} b$ to reduce the effective $\kappa$
- Cross-reference to Conjugate Gradient
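A bare-bones conjugate gradient sketch, without preconditioning; the SPD test matrix is an arbitrary example, not from any particular application:

```python
import numpy as np

def cg(A, b, tol=1e-10, maxiter=1000):
    """Conjugate gradient for SPD A (minimal sketch, no preconditioner)."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rr / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p    # A-conjugate update of the direction
        rr = rr_new
    return x

rng = np.random.default_rng(3)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # SPD and reasonably well-conditioned
b = rng.standard_normal(50)
x = cg(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))
```

Each iteration costs one matrix-vector product, which is why CG pairs so well with sparse or implicitly defined matrices.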
Sparse Matrix Techniques
- Storage formats: CSR (compressed sparse row), CSC, COO
- Fill-in during LU: why ordering matters (e.g., Cuthill-McKee, AMD)
- Sparse Cholesky: much cheaper than dense when structure is exploited
- Graph interpretation: nonzero pattern of $A$ = adjacency structure of a graph
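The CSR format can be sketched by hand in a few lines: store only the nonzero values, their column indices, and row pointers delimiting each row's slice. The `csr_matvec` helper below is illustrative, not a library routine:

```python
import numpy as np

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])

# Build the three CSR arrays: data (nonzero values), indices (their
# column numbers), indptr (where each row starts/ends in data).
data, indices, indptr = [], [], [0]
for row in A:
    for j, v in enumerate(row):
        if v != 0.0:
            data.append(v)
            indices.append(j)
    indptr.append(len(data))

def csr_matvec(data, indices, indptr, x):
    """Matrix-vector product touching only the stored nonzeros."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

x = np.array([1.0, 2.0, 3.0])
print(csr_matvec(data, indices, indptr, x))   # matches the dense A @ x
```

Production code would use `scipy.sparse.csr_matrix`, which stores exactly these three arrays.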
Randomized Methods
- Randomized matrix multiplication: approximate $AB$ by sampling columns of $A$ and the corresponding rows of $B$
- Randomized SVD: sketch the matrix to a smaller dimension, compute SVD of the sketch
- Algorithm: random projection $Y = A\Omega$, orthogonalize $Q = \mathrm{qr}(Y)$, compute the SVD of $B = Q^T A$
- Cost: $O(mnk)$ vs $O(mn \min(m, n))$ for the full SVD
- Random projections (Johnson-Lindenstrauss lemma): $k$ random directions preserve pairwise distances among $n$ points with high probability if $k = O(\epsilon^{-2} \log n)$
- Why randomization works: concentration of measure in high dimensions
- Cross-reference to MIT 18.065 Lecture 13: Randomized Matrix Multiplication
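The three-step randomized SVD can be sketched directly; the oversampling parameter follows the Halko, Martinsson & Tropp recipe, and the rank-5 test matrix is an arbitrary example:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k SVD sketch: project, orthogonalize, SVD the small matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Y = A @ Omega                  # sample the range of A
    Q, _ = np.linalg.qr(Y)         # orthonormal basis for that sample
    B = Q.T @ A                    # small (k + p) x n matrix
    Uh, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Uh)[:, :k], s[:k], Vt[:k]

# Exactly rank-5 test matrix: the sketch should recover it to roundoff.
rng = np.random.default_rng(4)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))

U, s, Vt = randomized_svd(A, k=5)
print("reconstruction error:", np.linalg.norm(A - U @ np.diag(s) @ Vt))
```

Because the test matrix has exact rank 5, the sampled subspace captures its range and the reconstruction error is at the level of roundoff; for matrices with slowly decaying spectra, oversampling and power iterations control the error instead.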
Summary
Answering the Central Question: You can trust the computer's answer when the algorithm is backward-stable and the problem is well-conditioned ($\kappa(A)$ is small). A relative perturbation of $\epsilon$ in the input causes at most $\kappa(A)\,\epsilon$ relative error in the output, so you lose roughly $\log_{10} \kappa(A)$ digits of accuracy. LU with partial pivoting, QR, and Cholesky are all backward-stable. For large sparse systems where direct methods are too expensive ($O(n^3)$ in the dense case), iterative methods (conjugate gradient, Lanczos) and randomized algorithms provide efficient alternatives.
Applications in Data Science and Machine Learning
- Large-scale PCA: Lanczos / randomized SVD instead of full eigendecomposition; computes the top $k$ singular values of million-dimensional matrices
- Conjugate gradient in GP inference: solving $K\alpha = y$ for Gaussian processes without forming $K^{-1}$
- Preconditioning for training: second-order optimizers (L-BFGS, K-FAC) approximate the Hessian inverse as a preconditioner
- Sparse models: feature matrices in NLP/recommender systems are extremely sparse; sparse solvers are essential
- Numerical stability in deep learning: mixed-precision training, loss scaling, and gradient clipping are all responses to floating-point limitations
- Condition number monitoring: diagnosing training instability by tracking $\kappa$ of weight matrices or Gram matrices
Guided Problems
References
- Strang, Introduction to Linear Algebra, Chapter 7
- Strang, Linear Algebra and Learning from Data, Lecture 13 (Randomized Methods)
- Trefethen & Bau, Numerical Linear Algebra (supplementary)
- Halko, Martinsson & Tropp, Finding Structure with Randomness (2011) — the standard reference for randomized SVD