Least Squares and QR Decomposition

The Central Question: What Is the Best Approximate Solution When $Ax = b$ Has No Exact Answer?

When there are more equations than unknowns, $Ax = b$ typically has no solution. The least squares approach finds the $\hat{x}$ that minimizes $\|Ax - b\|^2$; the resulting $A\hat{x}$ is the projection of $b$ onto the column space of $A$. The normal equations, QR decomposition, and the Gram-Schmidt process provide three ways to compute this, with different tradeoffs between speed and numerical stability.

Topics to Cover

The Least Squares Problem

  • When $Ax = b$ has no solution ($b \notin C(A)$): minimize $\|Ax - b\|^2$
  • Geometric view: find the point in $C(A)$ closest to $b$ (i.e., project)
  • Normal equations: $A^TA\hat{x} = A^Tb$
  • The solution $\hat{x} = (A^TA)^{-1}A^Tb$ when $A$ has independent columns
  • Cross-reference to Projections
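The normal equations can be checked numerically. A minimal sketch with made-up data (the matrix and vector values below are purely illustrative):

```python
import numpy as np

# Overdetermined system: 5 equations, 2 unknowns (illustrative data).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

# Normal equations: A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual is orthogonal to the column space: A^T (b - A x_hat) = 0.
residual = b - A @ x_hat
print(np.allclose(A.T @ residual, 0.0))  # True
```

The orthogonality check is the geometric content of the normal equations: the error $b - A\hat{x}$ is perpendicular to every column of $A$.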

Linear Regression as Least Squares

  • Fitting $y = X\beta + \epsilon$: minimize $\|y - X\beta\|^2$
  • Design matrix $X$, coefficient vector $\beta$
  • The hat matrix $H = X(X^TX)^{-1}X^T$: maps $y$ to $\hat{y}$
  • Residual properties: $e \perp C(X)$, $\sum e_i = 0$ (when an intercept is included)
  • Weighted least squares: minimize $\|W^{1/2}(Ax - b)\|^2$ for non-uniform noise
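The hat matrix and residual properties above can be verified directly on simulated data (the data-generating coefficients below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: projects onto C(X)
y_hat = H @ y
e = y - y_hat

print(np.allclose(X.T @ e, 0.0))   # residual orthogonal to C(X)
print(np.isclose(e.sum(), 0.0))    # residuals sum to zero (intercept column)
print(np.allclose(H @ H, H))       # H is idempotent: projecting twice = once
```

The $\sum e_i = 0$ property falls out of orthogonality: the all-ones intercept column lies in $C(X)$, so the residual must be perpendicular to it.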

Gram-Schmidt Process

  • Input: linearly independent vectors $a_1, a_2, \ldots, a_n$
  • Output: orthonormal vectors $q_1, q_2, \ldots, q_n$
  • Algorithm: subtract projections onto previous vectors, then normalize
  • Step-by-step example
  • Classical vs Modified Gram-Schmidt: same math, different numerical stability
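The algorithm above can be sketched as code. This is the modified variant: it subtracts each projection from the remaining columns immediately rather than accumulating all projections at the end, which is mathematically identical to classical Gram-Schmidt but accumulates less rounding error (function name and test matrix are illustrative):

```python
import numpy as np

def modified_gram_schmidt(A):
    """Orthonormalize the columns of A; returns Q (orthonormal) and R (upper triangular)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = A.copy()
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])   # normalize the current column
        Q[:, j] /= R[j, j]
        # Modified GS: remove the q_j component from every later column now.
        for k in range(j + 1, n):
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = modified_gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(2)))  # orthonormal columns
print(np.allclose(Q @ R, A))            # A = QR recovered
```

Note that the bookkeeping of the projections in $R$ is exactly what produces the QR factorization of the next section.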

QR Decomposition

  • $A = QR$: $Q$ has orthonormal columns, $R$ is upper triangular
  • Gram-Schmidt produces $Q$ and $R$ simultaneously
  • Solving least squares via QR: $R\hat{x} = Q^Tb$ (triangular solve, no $A^TA$ needed)
  • Why QR is more stable than the normal equations:
    • The normal equations square the condition number: $\kappa(A^TA) = \kappa(A)^2$
    • QR works with $\kappa(A)$ directly
  • Householder reflections (brief): the numerically preferred way to compute QR
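The QR route to least squares is a factorization followed by one back-substitution; a minimal sketch with illustrative data (NumPy's `qr` uses Householder reflections internally, consistent with the note above):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.9, 2.2, 2.8, 4.1])

# Reduced QR: Q is 4x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# R x_hat = Q^T b: a triangular solve, no A^T A ever formed.
x_hat = solve_triangular(R, Q.T @ b)

# Matches the library least squares solution.
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```

Because $Q^TQ = I$, multiplying by $Q^T$ preserves lengths, so no step of this computation touches the squared condition number $\kappa(A)^2$.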

Regularized Least Squares

  • When $A^TA$ is singular or near-singular (rank-deficient or ill-conditioned)
  • Ridge regression: minimize $\|Ax - b\|^2 + \lambda\|x\|^2$
    • Solution: $\hat{x} = (A^TA + \lambda I)^{-1}A^Tb$
    • Geometric view: shrinks the ellipsoid, rounds the bowl
    • Cross-reference to Ridge Regression
  • Connection to SVD truncation: ridge suppresses small singular values smoothly
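The SVD connection can be made concrete: in the singular basis, ridge multiplies each component by $\sigma_i / (\sigma_i^2 + \lambda)$ instead of $1/\sigma_i$, so small singular values are damped rather than inverted. A sketch with illustrative nearly-dependent columns:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))
A[:, 2] = A[:, 1] + 1e-6 * rng.normal(size=10)  # nearly dependent columns
b = rng.normal(size=10)
lam = 0.1

# Ridge solution from the regularized normal equations.
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)

# Equivalent via SVD: each component shrunk by sigma_i / (sigma_i^2 + lam),
# so tiny singular values contribute almost nothing instead of blowing up.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))

print(np.allclose(x_ridge, x_svd))  # True
```

Setting $\lambda = 0$ in the SVD formula recovers the pseudoinverse solution, which is where the near-singularity would bite.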

Summary

Answering the Central Question: The best approximate solution is $\hat{x} = (A^TA)^{-1}A^Tb$, the least squares solution that minimizes $\|Ax - b\|^2$. Geometrically, $A\hat{x}$ is the projection of $b$ onto $C(A)$. The normal equations give this directly but square the condition number. QR decomposition ($A = QR$, then $R\hat{x} = Q^Tb$) avoids forming $A^TA$ and is numerically stable, making it the default in practice.

Applications in Data Science and Machine Learning

  • Ordinary Least Squares (OLS): the foundation of linear regression
  • Polynomial and basis-function regression: same framework, different design matrix
  • Orthogonal factorizations in practice: library solvers avoid the normal equations — numpy.linalg.lstsq uses an SVD-based LAPACK driver, scipy.linalg.lstsq can use a pivoted-QR driver (lapack_driver='gelsy'), and scipy.linalg.qr exposes the factorization directly
  • Stable training: QR-based solvers avoid catastrophic cancellation in ill-conditioned feature matrices
  • Orthogonal features: after Gram-Schmidt, each feature's contribution is independent — interpretable coefficients
  • Recursive least squares: updating the QR factorization as new data arrives (online learning)
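The polynomial-regression point above is worth seeing once: the framework is unchanged, only the design matrix is. A sketch fitting a noiseless quadratic (coefficients chosen arbitrarily, so the fit recovers them exactly):

```python
import numpy as np

# Fit y ≈ c0 + c1*x + c2*x^2 by least squares on a Vandermonde design matrix.
x = np.linspace(0, 1, 8)
y = 1.0 - 2.0 * x + 3.0 * x**2              # noiseless quadratic

X = np.vander(x, 3, increasing=True)        # columns: 1, x, x^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(coef, [1.0, -2.0, 3.0]))  # True: exact coefficients recovered
```

Any fixed basis (splines, Fourier terms, radial functions) works the same way: nonlinear in $x$, still linear in the coefficients.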

Guided Problems

References

  • Strang, Introduction to Linear Algebra, Chapter 4 (4.3–4.4)