Numerical Linear Algebra

The Central Question: When Can We Trust the Computer's Answer?

Computers use finite-precision arithmetic, so every computation introduces rounding errors. When can we trust the result? The condition number κ(A) answers this: it measures how much input perturbations are amplified in the output. A backward-stable algorithm (one that gives the exact answer to a slightly perturbed problem) combined with a well-conditioned matrix gives reliable results. Understanding this interplay is essential for diagnosing numerical failures in ML training and large-scale linear solvers.

Topics to Cover

Floating-Point Arithmetic and Error

  • Machine epsilon and rounding
  • Forward vs backward error
  • Why "exact" formulas can give wrong answers on a computer
  • Catastrophic cancellation (e.g., a² − b² when a ≈ b)
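A minimal NumPy sketch of the cancellation in a² − b²; the particular values are just illustrative:

```python
import numpy as np

# When a ≈ b, a*a - b*b subtracts two nearly equal large numbers and the
# leading digits cancel; the algebraically identical (a - b)*(a + b)
# computes the small difference first and keeps far more accuracy.
a = np.float32(10000.001)
b = np.float32(10000.000)

naive = a * a - b * b          # large minus large: leading digits cancel
stable = (a - b) * (a + b)     # small difference computed first

# Double-precision reference value for comparison
exact = float(a) ** 2 - float(b) ** 2
```

In single precision the naive form is off by several units here, while the factored form matches the double-precision reference to many digits.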

Vector and Matrix Norms

  • Vector norms: ‖x‖₁, ‖x‖₂ (Euclidean), ‖x‖∞
  • Matrix norms: induced (operator) norms, Frobenius norm ‖A‖_F = √(Σ aᵢⱼ²)
  • Submultiplicativity: ‖AB‖ ≤ ‖A‖‖B‖
  • Spectral norm ‖A‖₂ = σ₁ (largest singular value)
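These norm facts are easy to verify numerically; a short NumPy sketch (the matrices and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
x = np.array([3.0, -4.0])

# Vector norms
norm1 = np.linalg.norm(x, 1)          # |3| + |-4| = 7
norm2 = np.linalg.norm(x, 2)          # sqrt(9 + 16) = 5
norm_inf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4

# Spectral norm equals the largest singular value
sigma = np.linalg.svd(A, compute_uv=False)
spectral = np.linalg.norm(A, 2)       # same as sigma[0]

# Submultiplicativity in the 2-norm: ||AB|| <= ||A|| ||B||
lhs = np.linalg.norm(A @ B, 2)
rhs = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
```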

Condition Number

  • κ(A) = ‖A‖‖A⁻¹‖ = σ₁/σᵣ (via SVD)
  • Meaning: a relative change ε in b can cause up to κ·ε relative change in x
  • Well-conditioned (κ small) vs ill-conditioned (κ large)
  • Rule of thumb: lose about log₁₀(κ) digits of accuracy
  • Cross-reference to Matrix Inverse: Condition Number and SVD: Condition Number
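A quick sketch computing κ from the singular values, using the Hilbert matrix as the classic ill-conditioned example:

```python
import numpy as np

# kappa(A) = sigma_1 / sigma_n from the SVD. The n x n Hilbert matrix
# H[i, j] = 1 / (i + j + 1) is notoriously ill-conditioned.
n = 8
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

sigma = np.linalg.svd(H, compute_uv=False)
kappa = sigma[0] / sigma[-1]

# Rule of thumb: solving Hx = b loses about log10(kappa) digits.
digits_lost = np.log10(kappa)
```

Already at n = 8 this κ is on the order of 10¹⁰, so roughly ten of the sixteen double-precision digits are at risk.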

Stability of Algorithms

  • Backward stability: the algorithm gives the exact answer to a slightly perturbed problem
  • LU with partial pivoting: backward stable
  • Normal equations (AᵀA x̂ = Aᵀb): square the condition number, so they are unstable for ill-conditioned A
  • QR: backward stable for least squares
  • Cholesky: backward stable for SPD systems
  • Cross-reference to QR vs Normal Equations
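The κ² effect shows up even in small examples; a sketch comparing the normal equations with a QR solve on an ill-conditioned Vandermonde least-squares problem (sizes chosen for illustration):

```python
import numpy as np

# Fit an 11th-degree polynomial: Vandermonde matrices are badly
# conditioned, so forming A^T A squares an already large kappa.
t = np.linspace(0.0, 1.0, 50)
A = np.vander(t, 12, increasing=True)
x_true = np.ones(12)
b = A @ x_true                      # consistent right-hand side

# Normal equations: solve (A^T A) x = A^T b (condition number kappa^2)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Householder QR: A = QR, then solve R x = Q^T b (backward stable)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

err_ne = np.linalg.norm(x_ne - x_true)
err_qr = np.linalg.norm(x_qr - x_true)
```

The QR solution recovers the coefficients to high accuracy, while the normal-equations solution loses many more digits.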

Eigenvalue Computation

  • Power method: repeatedly multiply by A; converges to the dominant eigenvector
    • Convergence rate: |λ₂/λ₁| per iteration
    • Inverse iteration: apply the power method to (A − μI)⁻¹ to find the eigenvalue nearest μ
  • QR algorithm: the workhorse of eigenvalue computation
    • Iterate: Aₖ = QₖRₖ, then Aₖ₊₁ = RₖQₖ
    • Converges to upper triangular (Schur form); eigenvalues on diagonal
    • With shifts: cubic convergence for symmetric matrices
  • Lanczos algorithm (brief): for large sparse symmetric matrices, builds a tridiagonal approximation in Krylov subspace
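Both iterations above fit in a few lines of NumPy; a sketch on a small symmetric matrix built to have known eigenvalues 5, 2, 1:

```python
import numpy as np

# Conjugate a diagonal matrix by a random orthogonal Q to get a
# symmetric A with known eigenvalues 5, 2, 1.
rng = np.random.default_rng(0)
Qo, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Qo @ np.diag([5.0, 2.0, 1.0]) @ Qo.T

# Power method: converges to the dominant eigenvector at rate |lambda2/lambda1|
x = rng.standard_normal(3)
for _ in range(100):
    x = A @ x
    x /= np.linalg.norm(x)
lam_max = x @ A @ x                      # Rayleigh quotient, near 5

# Unshifted QR algorithm: A_k = Q_k R_k, then A_{k+1} = R_k Q_k.
# For this symmetric A it converges to a diagonal matrix of eigenvalues.
Ak = A.copy()
for _ in range(200):
    Qk, Rk = np.linalg.qr(Ak)
    Ak = Rk @ Qk
eigs = np.sort(np.diag(Ak))[::-1]        # approximately [5, 2, 1]
```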

Iterative Methods for Ax = b

  • When to use iterative vs direct: large, sparse systems where O(n³) is too expensive
  • Jacobi iteration: split A = D + (L + U), iterate x⁽ᵏ⁺¹⁾ = D⁻¹(b − (L + U)x⁽ᵏ⁾)
  • Gauss-Seidel: use updated values immediately; typically faster convergence
  • Convergence condition: spectral radius ρ(M) < 1, where M is the iteration matrix
  • Conjugate gradient (CG): the optimal iterative solver for SPD systems
    • Converges in at most n steps (exact arithmetic)
    • Practical convergence in O(√κ) iterations
    • Preconditioning: solve M⁻¹Ax = M⁻¹b to reduce the effective κ
    • Cross-reference to Conjugate Gradient
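A sketch of both Jacobi and CG on a small diagonally dominant SPD system (the matrix is chosen purely for illustration):

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])   # SPD and strictly diagonally dominant
b = np.array([1.0, 2.0, 3.0])

# Jacobi: A = D + R, iterate x <- D^{-1} (b - R x)
D = np.diag(A)                   # diagonal entries
R = A - np.diag(D)               # off-diagonal part (L + U)
x_j = np.zeros(3)
for _ in range(100):
    x_j = (b - R @ x_j) / D

# Conjugate gradient: at most n steps in exact arithmetic
x = np.zeros(3)
r = b - A @ x
p = r.copy()
for _ in range(3):               # n = 3 steps
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
```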

Sparse Matrix Techniques

  • Storage formats: CSR (compressed sparse row), CSC, COO
  • Fill-in during LU: why ordering matters (e.g., Cuthill-McKee, AMD)
  • Sparse Cholesky: much cheaper than dense when structure is exploited
  • Graph interpretation: the nonzero pattern of A is the adjacency structure of a graph
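A pure-NumPy sketch of the CSR layout and the sparse matrix-vector product it enables (the example matrix is arbitrary):

```python
import numpy as np

# CSR stores only the nonzeros of
#   [[10,  0,  0],
#    [ 0, 20, 30],
#    [40,  0, 50]]
# as three arrays: values, their column indices, and row pointers.
data    = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
indices = np.array([0, 1, 2, 0, 2])    # column index of each value
indptr  = np.array([0, 1, 3, 5])       # row i is data[indptr[i]:indptr[i+1]]

def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x touching only the stored nonzeros."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

y = csr_matvec(data, indices, indptr, np.array([1.0, 2.0, 3.0]))
```

Libraries such as scipy.sparse use the same (data, indices, indptr) triple for their CSR format.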

Randomized Methods

  • Randomized matrix multiplication: approximate AB by sampling columns of A and matching rows of B
  • Randomized SVD: sketch the matrix to a smaller dimension, compute SVD of the sketch
    • Algorithm: random projection Y = AΩ, orthonormalize via QR to get Q, compute the SVD of QᵀA
    • Cost: O(mn log k) vs O(mn·min(m, n)) for the full SVD
  • Random projections (Johnson-Lindenstrauss lemma): k random directions preserve pairwise distances with high probability if k = O(log n / ε²)
  • Why randomization works: concentration of measure in high dimensions
  • Cross-reference to MIT 18.065 Lecture 13: Randomized Matrix Multiplication
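The three-step algorithm fits in a few lines of NumPy; a sketch on a matrix of exact rank k so the result can be checked (sizes and oversampling are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 100, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # exact rank k

# 1. Sketch: Y = A @ Omega captures the range of A with high probability.
Omega = rng.standard_normal((n, k + 5))      # small oversampling
Y = A @ Omega

# 2. Orthonormalize the sketch.
Q, _ = np.linalg.qr(Y)

# 3. SVD of the small matrix Q^T A, then lift U back to m dimensions.
U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
U = Q @ U_small

A_k = (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k approximation of A
```

Because A has exact rank k and the sketch dimension exceeds k, the approximation here recovers A essentially to machine precision.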

Summary

Answering the Central Question: You can trust the computer's answer when the algorithm is backward stable and the problem is well-conditioned (κ(A) is small). A relative perturbation ε in the input causes at most κ·ε relative error in the output, so you lose roughly log₁₀(κ) digits of accuracy. LU with partial pivoting, QR, and Cholesky are all backward stable. For large sparse systems where direct methods are too expensive (O(n³)), iterative methods (conjugate gradient, Lanczos) and randomized algorithms provide efficient alternatives.

Applications in Data Science and Machine Learning

  • Large-scale PCA: Lanczos / randomized SVD instead of a full eigendecomposition, computing the top k singular values of million-dimensional matrices
  • Conjugate gradient in GP inference: solving Kα = y for Gaussian processes without forming K⁻¹
  • Preconditioning for training: second-order optimizers (L-BFGS, K-FAC) approximate the Hessian inverse as a preconditioner
  • Sparse models: feature matrices in NLP/recommender systems are extremely sparse; sparse solvers are essential
  • Numerical stability in deep learning: mixed-precision training, loss scaling, and gradient clipping are all responses to floating-point limitations
  • Condition number monitoring: diagnosing training instability by tracking κ of weight matrices or Gram matrices

Guided Problems

References

  • Strang, Introduction to Linear Algebra, Chapter 7
  • Strang, Linear Algebra and Learning from Data, Lecture 13 (Randomized Methods)
  • Trefethen & Bau, Numerical Linear Algebra (supplementary)
  • Halko, Martinsson & Tropp, Finding Structure with Randomness (2011) — the standard reference for randomized SVD