The Central Question: How Do We Compute With Matrices Efficiently?
In practice, we need to compute with matrices: multiply them, solve systems, and decompose them into simpler pieces.
Naive approaches are expensive. Solving $Ax = b$ by computing $A^{-1}$ costs $O(n^3)$ operations and is numerically unstable.
Instead, we factorize A into structured components (triangular, orthogonal, diagonal) that are cheap to work with. This is the core idea behind every numerical linear algebra algorithm used in machine learning.
Non-commutativity matters. If A is 2×3 and B is 3×2, then AB is 2×2 but BA is 3×3—different sizes entirely. Even when both products are defined (square matrices), they're typically different.
Gaussian elimination transforms $A$ into an upper triangular matrix $U$ by applying elementary row operations. Each row operation corresponds to left-multiplying by an elementary matrix $E_k$:

$$E_k \cdots E_2 E_1 A = U$$

Rearranging:

$$A = E_1^{-1} E_2^{-1} \cdots E_k^{-1} U = LU$$
The product of inverse elementary matrices is a lower triangular matrix L.
Definition: LU Decomposition
An LU decomposition of a matrix A is a factorization:
$$A = LU$$
where $L$ is lower triangular (with 1s on the diagonal) and $U$ is upper triangular.
Example:
$$A = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{pmatrix}$$

Step 1: $R_2 \leftarrow R_2 - 2R_1$, $R_3 \leftarrow R_3 - 4R_1$:

$$\begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{pmatrix}$$

Step 2: $R_3 \leftarrow R_3 - 3R_2$:

$$U = \begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix}$$
The multipliers (2, 4, 3) fill $L$ below the diagonal:

$$L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{pmatrix}$$
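The worked example can be verified numerically; a minimal sketch with NumPy, entering $L$ and $U$ by hand:

```python
import numpy as np

# The matrix from the worked example.
A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])

# L holds the multipliers (2, 4, 3) below a unit diagonal;
# U is the upper triangular result of elimination.
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [4., 3., 1.]])
U = np.array([[2., 1., 1.],
              [0., 1., 1.],
              [0., 0., 2.]])

# The factorization reproduces A exactly.
assert np.allclose(L @ U, A)
```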
Solving $Ax = b$ via $A^{-1}b$ costs $O(n^3)$ for the inverse plus $O(n^2)$ for the multiply. With LU:
Factor once: $A = LU$ costs $O(\tfrac{2}{3}n^3)$
Solve twice: $Ly = b$ (forward substitution, $O(n^2)$), then $Ux = y$ (back substitution, $O(n^2)$)
For multiple right-hand sides $b_1, b_2, \ldots, b_k$, the factorization is reused. Solving $k$ systems costs $O(\tfrac{2}{3}n^3 + 2kn^2)$ instead of $O(kn^3)$.
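A sketch of the factor-once, solve-many pattern, assuming SciPy is available (`scipy.linalg.lu_factor` returns packed LU factors plus pivot indices that can be reused across right-hand sides):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))

# Factor once: O(2/3 n^3).
lu, piv = lu_factor(A)

# Reuse the factorization for many right-hand sides: O(n^2) each.
for _ in range(5):
    b = rng.standard_normal(100)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```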
Proof
LU factorization — $O(\tfrac{2}{3}n^3)$
Consider a 4×4 matrix. At each step of elimination, we pick a pivot and use it to zero out everything below it. The work happens in the shaded submatrix — the block below and to the right of the pivot:
Step k=1: Pivot at position (1,1). To zero out the 3 entries below it (rows 2, 3, 4), we subtract a multiple of row 1 from each row. But the subtraction updates every entry to the right of the pivot too — that's 3 entries per row. So the work is a 3×3 block: 9 multiply-subtracts.
At step $k$, the updated block is $(n-k) \times (n-k)$, so the multiplication count over all steps is $\sum_{k=1}^{n-1}(n-k)^2 \approx \tfrac{1}{3}n^3$. Each multiply is paired with a subtract, so the total operation count is $\sim \tfrac{2}{3}n^3$.
Forward/back substitution — $O(n^2)$
Solving $Ly = b$: the $k$-th unknown requires $(k-1)$ multiplications. Total: $\sum_{k=1}^{n}(k-1) = \tfrac{n(n-1)}{2} = O(n^2)$. Same for $Ux = y$.
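The count comes straight from the loop structure; a minimal forward-substitution sketch (the helper name `forward_substitution` is illustrative, not a library routine):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Ly = b for lower triangular L with nonzero diagonal.

    The k-th unknown uses the k-1 already-computed ones, so the
    total work is sum_{k=1}^{n} (k-1) = n(n-1)/2 multiplications.
    """
    n = len(b)
    y = np.zeros(n)
    for k in range(n):
        # Dot product of length k, then one division by the pivot.
        y[k] = (b[k] - L[k, :k] @ y[:k]) / L[k, k]
    return y

L = np.array([[2., 0., 0.],
              [1., 3., 0.],
              [4., 5., 6.]])
b = np.array([2., 5., 21.])
y = forward_substitution(L, b)
assert np.allclose(L @ y, b)
```

Back substitution for $Ux = y$ is the mirror image, iterating from the last row upward.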
Computing $A^{-1}$ — $O(n^3)$
$A^{-1}$ solves $AX = I$, i.e., $n$ separate systems $Ax_j = e_j$. LU factorization costs $\tfrac{2}{3}n^3$ once, then each of the $n$ forward/back solves costs $O(n^2)$, giving an additional $n \cdot O(n^2) = O(n^3)$. Total: $\sim \tfrac{2}{3}n^3 + n^3 = \tfrac{5}{3}n^3$, which is roughly 2.5× more than a single LU solve.
Matrix-vector multiply $A^{-1}b$ — $O(n^2)$
Each of the $n$ entries of the result requires a dot product of length $n$: $n \times n = n^2$ operations.
LU exists without row swaps when all leading principal minors of A are nonzero (i.e., the top-left k×k submatrix is invertible for all k). When a zero pivot is encountered, row swaps are necessary.
When row swaps are needed, we use a permutation matrix $P$:
Definition: PLU Decomposition
Every invertible matrix A has a factorization:
$$PA = LU$$
where P is a permutation matrix, L is lower triangular, and U is upper triangular.
A permutation matrix $P$ is obtained by reordering the rows of the identity matrix. It satisfies $P^TP = I$ (i.e., $P^{-1} = P^T$).
Example: Swapping rows 1 and 2:
$$P = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
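Both properties are easy to check numerically; a minimal sketch:

```python
import numpy as np

# Permutation matrix that swaps rows 1 and 2 (0-indexed rows 0 and 1).
P = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.]])

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])

# Left-multiplying by P reorders the rows of A.
assert np.array_equal(P @ A, A[[1, 0, 2]])

# P^T P = I, i.e. P^{-1} = P^T.
assert np.array_equal(P.T @ P, np.eye(3))
```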
Partial pivoting: In practice, even when row swaps aren't strictly necessary, we swap to place the largest available pivot in position. This controls the growth of entries in L and U, improving numerical stability.
A symmetric matrix $S$ is positive definite (PD) if for all non-zero vectors $x$:

$$x^T S x > 0$$

It is positive semi-definite (PSD) if $x^T S x \geq 0$.
Equivalent conditions for PD:
All eigenvalues are positive: $\lambda_i > 0$
All pivots are positive
All leading principal minors are positive
$S = R^TR$ for some matrix $R$ with independent columns
Where they appear:
$A^TA$ is PSD for any $A$; PD when $A$ has independent columns
Covariance matrices are PSD
Hessians at local minima are PSD
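These conditions can be probed numerically. The sketch below checks that $A^TA$ is positive definite for a random tall matrix with independent columns, using the eigenvalue test and a Cholesky attempt (NumPy's `cholesky` raises an error unless its input is positive definite):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # tall matrix; columns independent almost surely
S = A.T @ A                        # Gram matrix: symmetric, here positive definite

# Condition: all eigenvalues positive (eigvalsh is for symmetric input).
assert np.all(np.linalg.eigvalsh(S) > 0)

# Cholesky succeeds exactly when S is positive definite.
L = np.linalg.cholesky(S)
assert np.allclose(L @ L.T, S)
```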
Cholesky decomposition: Every PD matrix has a unique factorization $S = LL^T$ where $L$ is lower triangular with positive diagonal. This is the "square root" of a matrix, and costs half the operations of LU.
Four views: dot product (entry), column combination, row combination, outer product sum
Not commutative; transpose reverses order: $(AB)^T = B^TA^T$
Block multiplication works when block dimensions are compatible
LU decomposition:
Gaussian elimination encoded as A=LU (lower × upper triangular)
Factor once ($O(n^3)$), solve many times ($O(n^2)$ each)
PLU with partial pivoting handles zero pivots and improves stability
Special matrices:
Diagonal: All operations in $O(n)$; the ideal target form
Triangular: $O(n^2)$ solves via forward/back substitution
Symmetric: Real eigenvalues, orthogonal eigenvectors; spectral theorem gives $S = Q\Lambda Q^T$
Positive definite: All eigenvalues positive; Cholesky $S = LL^T$ at half the cost of LU
Orthogonal: $Q^{-1} = Q^T$; preserves lengths and angles; numerically ideal
Answering the Central Question: We compute with matrices efficiently by factoring them into structured pieces (triangular, orthogonal, diagonal) that are cheap to work with. LU decomposition is the workhorse for solving linear systems at $O(\tfrac{2}{3}n^3)$, Cholesky exploits symmetry and positive definiteness at half the cost, and QR provides numerical stability for least squares problems.
Applications in Data Science and Machine Learning
Linear regression requires solving $(X^TX)\beta = X^Ty$. The Gram matrix $X^TX$ is symmetric and (when $X$ has independent columns) positive definite. Cholesky decomposition is the method of choice:
Compute $G = X^TX$ and $c = X^Ty$
Factor $G = LL^T$ (Cholesky)
Solve $Ly = c$ (forward substitution)
Solve $L^T\beta = y$ (back substitution)
This costs $\tfrac{1}{3}n^3 + n^2$—half the cost of LU—and exploits the symmetry and positive definiteness of $G$. See Matrix Inverse: Linear Regression for when $G$ is singular.
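The four steps can be sketched with SciPy's Cholesky and triangular solvers (the intermediate vector is named `z` here to avoid clashing with the observation vector):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # tall design matrix, independent columns
y_obs = rng.standard_normal(50)

G = X.T @ X                        # Gram matrix: symmetric positive definite
c = X.T @ y_obs

L = cholesky(G, lower=True)                   # Factor G = L L^T
z = solve_triangular(L, c, lower=True)        # Forward solve: L z = c
beta = solve_triangular(L.T, z, lower=False)  # Back solve: L^T beta = z

# Agrees with the library least-squares solver.
assert np.allclose(beta, np.linalg.lstsq(X, y_obs, rcond=None)[0])
```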
When $X^TX$ is ill-conditioned, Cholesky on the normal equations amplifies errors (the condition number squares: $\kappa(X^TX) = \kappa(X)^2$). The QR approach avoids forming $X^TX$ entirely:
Factor $X = QR$ where $Q$ has orthonormal columns and $R$ is upper triangular
The normal equations become $R\beta = Q^Ty$
Solve by back substitution
Since $Q$ preserves lengths, the conditioning of the problem is not squared. Orthogonal-factorization solvers (QR or SVD) underlie numpy.linalg.lstsq and similar library routines.
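A sketch of the QR route with NumPy, checked against the library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

# Thin QR: Q is 50x4 with orthonormal columns, R is 4x4 upper triangular.
Q, R = np.linalg.qr(X)

# Solve R beta = Q^T y (R is small and triangular).
beta = np.linalg.solve(R, Q.T @ y)

assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
```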
A forward pass through a linear layer computes $Y = XW^T + b$ for a batch of inputs $X \in \mathbb{R}^{B \times n}$. This is a single matrix multiplication—$O(Bnm)$ for output dimension $m$.
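A minimal sketch of the batched forward pass in NumPy (the shapes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n, m = 8, 16, 4                 # batch size, input dim, output dim
X = rng.standard_normal((batch, n))    # batch of inputs
W = rng.standard_normal((m, n))        # weight matrix
b = rng.standard_normal(m)             # bias, broadcast across the batch

Y = X @ W.T + b                        # one matmul: O(batch * n * m)
assert Y.shape == (batch, m)
```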
Modern deep learning relies on highly optimized BLAS (Basic Linear Algebra Subprograms) implementations that use:
Block multiplication for cache efficiency
GPU parallelism across the outer product structure
Mixed precision (float16 multiply, float32 accumulate) for speed
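The block-multiplication idea can be sketched in a few lines; real BLAS kernels tile far more carefully, but the structure is the same (`blocked_matmul` is a hypothetical illustration, not a library routine):

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    """Compute A @ B by accumulating bs-by-bs blocks.

    Each output block C[i:i+bs, j:j+bs] accumulates products of
    small blocks of A and B, keeping the working set cache-sized.
    """
    n, k = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((96, 80))
B = rng.standard_normal((80, 64))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The slicing handles ragged edge blocks automatically, since NumPy clamps out-of-range slice bounds.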
Positive definite: solving the characteristic equation gives $\lambda = \frac{7 \pm \sqrt{49 - 32}}{2} = \frac{7 \pm \sqrt{17}}{2}$, both positive since $\sqrt{17} < 7$. More simply, using the trace and determinant: $\operatorname{tr}(S) = 7 > 0$, $\det(S) = 12 - 4 = 8 > 0$. Both eigenvalues are positive. ✓
Alternatively: leading minors are $4 > 0$ and $\det(S) = 8 > 0$. ✓
For each statement, determine if it is True or False. Justify your answer.
If A is symmetric and B is symmetric, then AB is symmetric.
If Q is orthogonal, then det(Q)=±1.
If S is positive definite, then all diagonal entries of S are positive.
The product of two upper triangular matrices is upper triangular.
💡 Solution
1. False.
$(AB)^T = B^TA^T = BA$. This equals $AB$ only if $A$ and $B$ commute ($AB = BA$), which is not true in general.
Counterexample: $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$, $B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$. Both symmetric, but $AB = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} \neq (AB)^T$.
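The counterexample checks out numerically:

```python
import numpy as np

A = np.array([[1, 1],
              [1, 0]])
B = np.array([[0, 1],
              [1, 1]])

# Both factors are symmetric...
assert np.array_equal(A, A.T) and np.array_equal(B, B.T)

# ...but their product is not.
AB = A @ B
assert not np.array_equal(AB, AB.T)
```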
2. True.
From $Q^TQ = I$: $\det(Q^T)\det(Q) = 1$, so $(\det Q)^2 = 1$, giving $\det Q = \pm 1$.
3. True.
The diagonal entry $S_{ii} = e_i^T S e_i$. Since $e_i \neq 0$ and $S$ is positive definite, $e_i^T S e_i > 0$.
4. True.
If $U_1$ and $U_2$ are upper triangular, then $(U_1U_2)_{ij} = \sum_k (U_1)_{ik}(U_2)_{kj}$. For $i > j$: $(U_1)_{ik} = 0$ when $k < i$ and $(U_2)_{kj} = 0$ when $k > j$. Since $i > j$, every term has either $k < i$ or $k > j$ (or both), so the sum is zero.