The Central Question: When Does a Matrix Act Like Simple Scaling?
Most matrices rotate, stretch, and shear vectors in complicated ways. But along certain special directions, a matrix acts like multiplication by a scalar: Av=λv. The vector v goes in, and the same vector (scaled by λ) comes out.
These special directions and scaling factors reveal a matrix's intrinsic behavior:
The eigenvalues of a covariance matrix are the variances along the principal directions of the data.
The dominant eigenvector of the web's link matrix is the PageRank vector.
The eigenvalues of a recurrent neural network's weight matrix determine whether gradients explode or vanish during training.
Finding eigenvalues and eigenvectors is the key to diagonalization, which reduces matrix powers, exponentials, and dynamical systems to simple scalar operations.
Take the matrix $A=\begin{bmatrix}2&1\\1&2\end{bmatrix}$ and the vector $v=\begin{bmatrix}1\\1\end{bmatrix}$:
$$Av=\begin{bmatrix}2&1\\1&2\end{bmatrix}\begin{bmatrix}1\\1\end{bmatrix}=\begin{bmatrix}3\\3\end{bmatrix}=3\begin{bmatrix}1\\1\end{bmatrix}=3v$$
The matrix A simply scales v by a factor of 3, without changing its direction. Now try $w=\begin{bmatrix}1\\-1\end{bmatrix}$:
$$Aw=\begin{bmatrix}2&1\\1&2\end{bmatrix}\begin{bmatrix}1\\-1\end{bmatrix}=\begin{bmatrix}1\\-1\end{bmatrix}=1\cdot w$$
The direction w is also preserved, scaled by 1 (unchanged). A generic vector like $\begin{bmatrix}2\\1\end{bmatrix}$ would be both rotated and stretched, not just scaled. The directions v and w are the special ones.
Geometrically, A stretches space by a factor of 3 along the diagonal direction $\begin{bmatrix}1\\1\end{bmatrix}$ and leaves the perpendicular direction $\begin{bmatrix}1\\-1\end{bmatrix}$ unchanged. Every other vector gets rotated toward the direction with the larger eigenvalue.
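As a quick numerical check of these two products, here is a minimal sketch using NumPy:

```python
import numpy as np

# The running example: A scales [1, 1] by 3 and leaves [1, -1] unchanged.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 1.0])
w = np.array([1.0, -1.0])

print(A @ v)  # [3. 3.] = 3v: v is an eigenvector with eigenvalue 3
print(A @ w)  # [ 1. -1.] = w: w is an eigenvector with eigenvalue 1
```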
Definition: Eigenvalue and Eigenvector
An eigenvalue λ and eigenvector v ≠ 0 of a square matrix A satisfy:
Av=λv
The eigenvector v is a direction that A preserves (up to scaling). The eigenvalue λ is the scaling factor.
The eigenvalues come from the characteristic equation $\det(A-\lambda I)=(2-\lambda)^2-1=(\lambda-3)(\lambda-1)=0$, giving λ1=3 and λ2=1. For each λ, solve (A−λI)v=0 to find the eigenvector:
λ1=3: $\begin{bmatrix}-1&1\\1&-1\end{bmatrix}v=0\implies v_1=\begin{bmatrix}1\\1\end{bmatrix}$
λ2=1: $\begin{bmatrix}1&1\\1&1\end{bmatrix}v=0\implies v_2=\begin{bmatrix}1\\-1\end{bmatrix}$
Practical Insight
For triangular matrices (upper or lower), the eigenvalues are the diagonal entries. No characteristic polynomial computation is needed: A−λI is still triangular, so its determinant is the product of its diagonal entries $(a_{11}-\lambda)\cdots(a_{nn}-\lambda)$, which vanishes exactly when λ equals some diagonal entry $a_{kk}$.
An eigenvalue can be a repeated root of the characteristic polynomial.
Consider $A=\begin{bmatrix}5&1\\0&5\end{bmatrix}$. The characteristic polynomial is (λ−5)²=0, so λ=5 is a root with multiplicity 2. But solving (A−5I)v=0:
$$\begin{bmatrix}0&1\\0&0\end{bmatrix}v=0\implies v=t\begin{bmatrix}1\\0\end{bmatrix}$$
There is only one independent eigenvector, not two.
Definition: Algebraic and Geometric Multiplicity
For an eigenvalue λ:
Algebraic multiplicity a(λ): the number of times λ appears as a root of the characteristic polynomial
Geometric multiplicity g(λ): the dimension of the eigenspace ker(A−λI), i.e., the number of independent eigenvectors for λ
Always: 1 ≤ g(λ) ≤ a(λ).
When g(λ)<a(λ), the matrix is called defective. The matrix $\begin{bmatrix}5&1\\0&5\end{bmatrix}$ has a(5)=2 but g(5)=1, so it is defective. This matters because defective matrices cannot be diagonalized.
In contrast, $A=\begin{bmatrix}5&0\\0&5\end{bmatrix}$ also has λ=5 with a(5)=2, but now g(5)=2 (every nonzero vector is an eigenvector). This matrix is already diagonal.
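The two multiplicities are easy to compare numerically: the geometric multiplicity is the nullity of A−λI. A minimal NumPy sketch (the helper `geometric_multiplicity` is ours, not a library function):

```python
import numpy as np

def geometric_multiplicity(A, lam):
    # dim ker(A - lam*I) = n - rank(A - lam*I)
    n = A.shape[0]
    return n - np.linalg.matrix_rank(A - lam * np.eye(n))

defective = np.array([[5.0, 1.0],
                      [0.0, 5.0]])   # a(5) = 2
diagonal = 5.0 * np.eye(2)           # a(5) = 2

print(geometric_multiplicity(defective, 5.0))  # 1 -> defective
print(geometric_multiplicity(diagonal, 5.0))   # 2 -> diagonalizable
```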
Two fundamental relationships connect eigenvalues to matrix properties:
Theorem: Trace and Determinant
For an n×n matrix A with eigenvalues λ1,…,λn (counted with algebraic multiplicity):
$$\operatorname{tr}(A)=\sum_{i=1}^n\lambda_i,\qquad \det(A)=\prod_{i=1}^n\lambda_i$$
Example. For $A=\begin{bmatrix}2&1\\1&2\end{bmatrix}$ with λ1=3, λ2=1:
tr(A)=2+2=4=3+1 ✓ and det(A)=4−1=3=3⋅1 ✓
These give a quick sanity check on computed eigenvalues. They also show that det(A)=0 if and only if at least one eigenvalue is zero, linking singularity to eigenvalues.
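This sanity check is easy to automate; a minimal NumPy sketch:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals = np.linalg.eigvals(A)  # [3., 1.] (order not guaranteed)

# Sum of eigenvalues equals the trace; product equals the determinant.
print(np.isclose(eigvals.sum(), np.trace(A)))        # True: 3 + 1 = 4
print(np.isclose(eigvals.prod(), np.linalg.det(A)))  # True: 3 * 1 = 3
```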
A matrix A is diagonalizable if it can be written as:
A=PDP−1
where P is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues.
Theorem: Diagonalizability Condition
An n×n matrix A is diagonalizable if and only if it has n linearly independent eigenvectors. Equivalently, the geometric multiplicity equals the algebraic multiplicity for every eigenvalue.
A matrix with n distinct eigenvalues is always diagonalizable (distinct eigenvalues guarantee independent eigenvectors). But distinct eigenvalues are not necessary: 5I is diagonal despite having a single repeated eigenvalue.
Diagonalization makes matrix powers trivial. Since A=PDP−1:
$$A^2=(PDP^{-1})(PDP^{-1})=PD^2P^{-1}$$
$$A^k=PD^kP^{-1}=P\begin{bmatrix}\lambda_1^k&&\\&\lambda_2^k&\\&&\ddots\end{bmatrix}P^{-1}$$
Raising a diagonal matrix to a power just raises each diagonal entry to that power. This converts a matrix problem into n independent scalar problems.
Stability. The powers Ak converge to zero as k→∞ if and only if ∣λi∣<1 for all eigenvalues. If any ∣λi∣>1, the powers blow up. If ∣λi∣=1 for all i and A is diagonalizable, the powers stay bounded but may oscillate.
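The power formula can be checked against direct repeated multiplication; a minimal NumPy sketch (the helper `matrix_power_eig` is illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, P = np.linalg.eig(A)  # columns of P are eigenvectors, A = P D P^{-1}

def matrix_power_eig(lam, P, k):
    # A^k = P D^k P^{-1}: raise each eigenvalue to the k-th power
    return P @ np.diag(lam ** k) @ np.linalg.inv(P)

k = 10
print(np.allclose(matrix_power_eig(lam, P, k),
                  np.linalg.matrix_power(A, k)))  # True
```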
A Markov matrix has nonnegative entries with each column summing to 1. It describes a system that transitions between states with fixed probabilities.
Example. A simplified weather model: if today is sunny, there is a 90% chance tomorrow is sunny and 10% chance of rain. If today is rainy, there is a 50% chance tomorrow is sunny and 50% chance of rain.
$$A=\begin{bmatrix}0.9&0.5\\0.1&0.5\end{bmatrix}$$
Here column 1 gives tomorrow's probabilities when today is sunny, and column 2 when today is rainy.
Starting from state u0, the state after k steps is uk=Aku0. The key properties of Markov matrices:
λ=1 is always an eigenvalue (since columns sum to 1, the row vector [1 1 ⋯ 1] is a left eigenvector for λ=1)
All other eigenvalues satisfy ∣λi∣≤1
The steady-state vector q satisfies Aq=q, meaning it is the eigenvector for λ=1
For our weather example: λ1=1,λ2=0.4. The steady-state eigenvector (normalized to sum to 1):
$$(A-I)q=0:\quad\begin{bmatrix}-0.1&0.5\\0.1&-0.5\end{bmatrix}q=0\implies q=\begin{bmatrix}5/6\\1/6\end{bmatrix}$$
Regardless of the initial weather, the system converges to 5/6 sunny and 1/6 rainy. The second eigenvalue λ2=0.4 controls the convergence rate: ∣0.4∣k→0, so the transient component decays by a factor of 0.4 each step.
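Iterating the weather chain shows the convergence numerically; a minimal NumPy sketch:

```python
import numpy as np

A = np.array([[0.9, 0.5],
              [0.1, 0.5]])  # column-stochastic weather matrix
u = np.array([0.0, 1.0])    # start with certain rain

for _ in range(50):         # the transient decays like 0.4^k, so 50 steps suffice
    u = A @ u

print(u)  # approximately [5/6, 1/6] = [0.8333..., 0.1666...]
```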
Practical Insight
The convergence rate of a Markov chain is governed by the spectral gap 1−∣λ2∣, where λ2 is the second-largest eigenvalue in magnitude. A larger spectral gap means faster convergence to steady state. This is the basis of MCMC convergence diagnostics.
The matrix $A=\begin{bmatrix}2&1\\1&2\end{bmatrix}$ from our running example is symmetric (A=Aᵀ). Its eigenvectors $v_1=\begin{bmatrix}1\\1\end{bmatrix}$ and $v_2=\begin{bmatrix}1\\-1\end{bmatrix}$ are orthogonal: v1⋅v2=1−1=0. This is not a coincidence.
Theorem: The Spectral Theorem
Every real symmetric matrix A has:
Real eigenvalues (no complex numbers)
Orthogonal eigenvectors (eigenvectors for distinct eigenvalues are perpendicular)
A full set of n orthonormal eigenvectors, so A is always diagonalizable as:
A=QΛQT
where Q is orthogonal (QTQ=I) and Λ=diag(λ1,…,λn).
The notation changes from P to Q to emphasize that the eigenvector matrix is now orthogonal, and P−1 simplifies to QT (since Q−1=QT for orthogonal matrices).
Why symmetric matrices have real eigenvalues
Suppose Av=λv where v might be complex. Take the conjugate transpose of both sides:
$$\bar v^{T}A^{T}=\bar\lambda\,\bar v^{T}$$
Since A=Aᵀ: $\bar v^{T}A=\bar\lambda\,\bar v^{T}$. Multiply on the right by v:
$$\bar v^{T}Av=\bar\lambda\,\bar v^{T}v$$
But also $\bar v^{T}(Av)=\bar v^{T}(\lambda v)=\lambda\,\bar v^{T}v$. So:
$$\lambda\,\bar v^{T}v=\bar\lambda\,\bar v^{T}v$$
Since v ≠ 0, $\bar v^{T}v=\|v\|^2>0$. Dividing: $\lambda=\bar\lambda$, which means λ is real.
Why eigenvectors for distinct eigenvalues are orthogonal
Suppose Av1=λ1v1 and Av2=λ2v2 with λ1 ≠ λ2. Then:
$$\lambda_1(v_1\cdot v_2)=(Av_1)^{T}v_2=v_1^{T}A^{T}v_2=v_1^{T}(Av_2)=\lambda_2(v_1\cdot v_2)$$
Since λ1 ≠ λ2, this forces v1⋅v2=0: the eigenvectors are orthogonal.
Expanding A=QΛQT column by column reveals a beautiful structure. Let q1,…,qn be the orthonormal eigenvectors:
$$A=Q\Lambda Q^{T}=\sum_{i=1}^{n}\lambda_i\,q_iq_i^{T}$$
Each term qiqiT is a rank-1 projection matrix onto the direction qi. The symmetric matrix A is a weighted sum of orthogonal projections, with the eigenvalues as weights.
Example. Normalize the eigenvectors of $A=\begin{bmatrix}2&1\\1&2\end{bmatrix}$: $q_1=\frac{1}{\sqrt2}\begin{bmatrix}1\\1\end{bmatrix}$ and $q_2=\frac{1}{\sqrt2}\begin{bmatrix}1\\-1\end{bmatrix}$. Then
$$A=3\,q_1q_1^{T}+1\,q_2q_2^{T}$$
The matrix A projects any vector onto q1, scales that component by 3, projects onto q2, scales by 1, and adds the results.
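This rank-1 picture can be verified directly: summing the weighted projections rebuilds A. A minimal NumPy sketch:

```python
import numpy as np

# Orthonormal eigenvectors of the running 2x2 example
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([1.0, -1.0]) / np.sqrt(2)

# Weighted sum of rank-1 projections: 3 q1 q1^T + 1 q2 q2^T
A = 3 * np.outer(q1, q1) + 1 * np.outer(q2, q2)
print(A)  # recovers [[2, 1], [1, 2]]
```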
Why symmetry matters. Covariance matrices, Hessians, graph Laplacians, and kernel matrices are all symmetric. The spectral theorem guarantees they have real eigenvalues and orthogonal eigenvectors, which is why so many ML algorithms reduce to symmetric eigenvalue problems. See Applications in Data Science and Machine Learning for specific examples.
Two matrices can represent the same linear transformation in different bases. They are connected by a similarity transformation.
Definition: Similar Matrices
Matrices A and B are similar if there exists an invertible matrix M such that:
B=M−1AM
Diagonalization is a special case: D=P−1AP, where P changes from the standard basis to the eigenvector basis. In the eigenvector basis, the transformation is just diagonal scaling.
Theorem: Similarity Preserves Eigenvalues
If B=M−1AM, then A and B have the same eigenvalues (same characteristic polynomial).
The eigenvalues are the same, but the eigenvectors change. If Av=λv, then B(M−1v)=λ(M−1v). The eigenvectors of B are M−1v, the eigenvectors of A expressed in the new basis.
The key invariants under similarity: eigenvalues, trace, determinant, rank, and characteristic polynomial. These are properties of the linear transformation itself, not of the particular basis.
When a matrix is not diagonalizable (defective), the closest we can get is the Jordan normal form:
$$A=PJP^{-1},\qquad J=\begin{bmatrix}J_1&&\\&J_2&\\&&\ddots\end{bmatrix}$$
where each Jordan block $J_k$ looks like:
$$J_k=\begin{bmatrix}\lambda&1&&\\&\lambda&\ddots&\\&&\ddots&1\\&&&\lambda\end{bmatrix}$$
The 1s on the superdiagonal appear precisely when the geometric multiplicity falls short of the algebraic multiplicity. For a diagonalizable matrix, all Jordan blocks are 1×1 (just eigenvalues on the diagonal).
Example. The defective matrix $A=\begin{bmatrix}5&1\\0&5\end{bmatrix}$ is already in Jordan form: one 2×2 Jordan block with λ=5.
Jordan form is more of a theoretical tool than a computational one. In practice, computing it is numerically unstable (small perturbations can change the block structure). The Schur decomposition $A=QTQ^{T}$ (where T is upper triangular and Q is orthogonal) is the stable alternative.
The continuous-time analogue du/dt = Au has the solution:
u(t)=eAtu(0)
where the matrix exponential is defined by:
$$e^{At}=I+At+\frac{(At)^2}{2}+\frac{(At)^3}{6}+\cdots=\sum_{k=0}^{\infty}\frac{(At)^k}{k!}$$
If A=PDP−1, then eAt=PeDtP−1, and:
$$e^{Dt}=\begin{bmatrix}e^{\lambda_1 t}&&\\&e^{\lambda_2 t}&\\&&\ddots\end{bmatrix}$$
The solution decomposes as:
u(t)=c1eλ1tv1+c2eλ2tv2+⋯+cneλntvn
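The matrix exponential can be computed through the eigendecomposition and cross-checked against a truncation of the defining power series; a minimal NumPy sketch:

```python
import numpy as np
from math import factorial

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
t = 0.5

# Via diagonalization: e^{At} = P e^{Dt} P^{-1}
lam, P = np.linalg.eig(A)
expm_eig = P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)

# Via the power series, truncated after 20 terms (plenty for these eigenvalues)
expm_series = sum(np.linalg.matrix_power(A * t, k) / factorial(k)
                  for k in range(20))

print(np.allclose(expm_eig, expm_series))  # True
```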
The stability conditions mirror the discrete case, but use the real parts of eigenvalues (since ∣eλt∣=eRe(λ)t):
All Re(λi)<0: u(t)→0 (stable)
Some Re(λi)>0: u(t) blows up (unstable)
All Re(λi)=0: u(t) stays bounded and oscillates (for diagonalizable A)
Practical Insight
The stability condition is the same idea in both cases. For discrete systems (uk+1=Auk), stability requires eigenvalues inside the unit circle (∣λ∣<1). For continuous systems (du/dt=Au), stability requires eigenvalues in the left half-plane (Re(λ)<0).
An eigenvalue λ and eigenvector v satisfy Av=λv: the matrix acts as simple scaling along v
Eigenvalues are roots of the characteristic polynomial det(A−λI)=0. The trace equals their sum, the determinant equals their product
Algebraic multiplicity counts root repetitions; geometric multiplicity counts independent eigenvectors. A matrix is defective when these differ
Diagonalization A=PDP−1 exists when there are n independent eigenvectors. It reduces powers to Ak=PDkP−1
The Fibonacci sequence is governed by the eigenvalues of $\begin{bmatrix}1&1\\1&0\end{bmatrix}$, producing the golden ratio φ=(1+√5)/2
Markov chains converge to the eigenvector for λ=1 at a rate controlled by ∣λ2∣
The spectral theorem guarantees that symmetric matrices have real eigenvalues, orthogonal eigenvectors, and the decomposition A=QΛQT=∑λiqiqiT
Similar matrices B=M−1AM share eigenvalues, trace, determinant, and rank
For difference equations uk+1=Auk, stability requires ∣λi∣<1. For differential equations du/dt=Au, stability requires Re(λi)<0
Answering the Central Question: A matrix acts like simple scaling along its eigenvector directions. When A has n independent eigenvectors, it diagonalizes as A=PDP−1, reducing matrix powers, exponentials, and dynamical systems to scalar operations on the eigenvalues. For symmetric matrices, the spectral theorem guarantees this decomposition always exists with orthogonal eigenvectors, making A=QΛQT the most important factorization in applied mathematics.
Applications in Data Science and Machine Learning
Eigenvalues and eigenvectors form the mathematical backbone of many ML algorithms, from dimensionality reduction to graph analysis to understanding training dynamics.
Given a centered data matrix X∈Rn×d (n samples, d features), the covariance matrix is:
$$\Sigma=\frac{1}{n-1}X^{T}X$$
This is symmetric and positive semi-definite, so the spectral theorem applies: Σ=QΛQT. The eigenvectors q1,…,qd are the principal components (directions of maximum variance), and the eigenvalues λ1≥λ2≥⋯≥λd≥0 are the variances along those directions.
To reduce from d dimensions to k, project onto the top k eigenvectors:
$$X_{\text{reduced}}=XQ_k,\qquad Q_k=[\,q_1\ q_2\ \cdots\ q_k\,]$$
The fraction of variance retained is ∑i=1kλi/∑i=1dλi. If the eigenvalues decay rapidly, a few components capture most of the information.
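A compact PCA sketch on synthetic data (NumPy only; the data generation with one dominant direction is our illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: strong variance along one direction, weak along the other
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.5])          # scale features
X = X @ (np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2))   # rotate
X -= X.mean(axis=0)                                          # center

cov = X.T @ X / (X.shape[0] - 1)
lam, Q = np.linalg.eigh(cov)     # eigh: ascending order for symmetric matrices
lam, Q = lam[::-1], Q[:, ::-1]   # sort descending: principal components first

explained = lam[0] / lam.sum()   # fraction of variance kept by one component
X_reduced = X @ Q[:, :1]         # project onto the top principal component
print(explained)                 # close to 1: one direction dominates
```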
Given a similarity graph with weight matrix W and degree matrix D=diag(∑jWij), the graph Laplacian is:
L=D−W
The Laplacian is symmetric and positive semi-definite. Its smallest eigenvalue is always 0 (with eigenvector 1). The number of zero eigenvalues equals the number of connected components. Spectral clustering computes the bottom k eigenvectors of L (or the normalized Laplacian D−1/2LD−1/2) and clusters the rows using k-means.
The spectral gap (the gap between the k-th and (k+1)-th eigenvalues) indicates how well-separated the clusters are. A large gap means clean cluster structure.
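The zero-eigenvalue count is easy to check on a toy graph; a minimal NumPy sketch using two disconnected triangles:

```python
import numpy as np

# Two disconnected triangles on 6 nodes: expect exactly two zero eigenvalues
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W  # graph Laplacian L = D - W
lam = np.linalg.eigvalsh(L)     # real, ascending (L is symmetric PSD)
n_components = int(np.sum(lam < 1e-10))
print(n_components)             # 2
```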
Google's original algorithm models the web as a Markov chain. The transition matrix A has entry Aij=1/nj if page j links to page i (where nj is the number of outgoing links from j). PageRank is the steady-state eigenvector for λ=1 of a modified Markov matrix:
$$M=(1-\alpha)A+\frac{\alpha}{N}\mathbf{1}\mathbf{1}^{T}$$
where α≈0.15 is the "teleportation" probability and N is the number of pages. The dominant eigenvector q1 (with Mq1=q1) ranks pages by importance.
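A toy PageRank sketch via power iteration (the 3-page link structure is our illustrative assumption):

```python
import numpy as np

# Toy web (assumed): page 0 -> {1, 2}, page 1 -> {2}, page 2 -> {0}.
# Column j holds 1/n_j for the pages that page j links to.
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
alpha, N = 0.15, 3
M = (1 - alpha) * A + (alpha / N) * np.ones((N, N))

q = np.ones(N) / N      # start uniform
for _ in range(200):    # power iteration toward the lambda = 1 eigenvector
    q = M @ q

print(np.allclose(M @ q, q))  # True: q is the PageRank (steady-state) vector
```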
An RNN processes sequences by iterating ht+1=σ(Wht+Uxt). During backpropagation through time, the gradient involves products of the weight matrix W:
$$\frac{\partial h_T}{\partial h_t}\approx\prod_{k=t}^{T-1}W^{T}\cdot\operatorname{diag}(\sigma')$$
This behaves like a (T−t)-fold power of W for large time gaps. The eigenvalues of W determine the gradient dynamics:
∣λmax∣>1: gradients explode (products blow up)
∣λmax∣<1: gradients vanish (products shrink to zero)
∣λi∣≈1 for all i: gradients remain stable
This is why orthogonal/unitary weight initialization (eigenvalues on the unit circle) helps RNN training. LSTMs and GRUs address this by introducing gating mechanisms that keep the effective eigenvalues near 1.
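A linear surrogate makes the effect visible: repeatedly multiplying by Wᵀ shrinks or blows up depending on the spectral radius. A sketch assuming a symmetric random W (a simplification; a real RNN gradient also carries the diag(σ′) factors):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50  # number of time steps to backpropagate through

def gradient_norm(rho):
    # Symmetric random weight matrix, rescaled so its spectral radius is rho;
    # backpropagation through time multiplies by W^T once per step.
    W = rng.normal(size=(8, 8))
    W = (W + W.T) / 2
    W *= rho / np.max(np.abs(np.linalg.eigvalsh(W)))
    J = np.eye(8)
    for _ in range(T):
        J = W.T @ J
    return np.linalg.norm(J)

g_vanish = gradient_norm(0.9)   # on the order of 0.9^50: vanishing
g_explode = gradient_norm(1.1)  # on the order of 1.1^50: exploding
print(g_vanish, g_explode)
```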
At a critical point θ∗ of a loss function L(θ), the Hessian H=∇2L(θ∗) determines local behavior:
All eigenvalues of H positive: local minimum (the usual goal of training)
All eigenvalues negative: local maximum
Mixed signs: saddle point
In high-dimensional optimization (deep learning), saddle points vastly outnumber local minima. The fraction of negative eigenvalues at a critical point indicates how many directions lead downhill. See Positive Definite Matrices for a deeper treatment.
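The classification rule translates directly into code; a minimal NumPy sketch (`classify_critical_point` is our illustrative helper):

```python
import numpy as np

def classify_critical_point(H):
    # H is the (symmetric) Hessian at a critical point
    lam = np.linalg.eigvalsh(H)  # real eigenvalues, guaranteed by symmetry
    if np.all(lam > 0):
        return "local minimum"
    if np.all(lam < 0):
        return "local maximum"
    return "saddle point"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a saddle
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([2.0, 3.0])))   # local minimum
```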
A mouse moves between three rooms. From any room, it moves to each of the other two rooms with equal probability. The transition matrix is:
$$A=\begin{bmatrix}0&1/2&1/2\\1/2&0&1/2\\1/2&1/2&0\end{bmatrix}$$
1. Verify that A is a Markov matrix (nonnegative entries, columns sum to 1).
2. Show that λ=1 is an eigenvalue and find the steady-state vector.
3. Find the other eigenvalues. (Hint: use the trace and determinant.)
4. Starting from room 1 ($u_0=\begin{bmatrix}1\\0\\0\end{bmatrix}$), what is the long-term probability distribution?
💡 Solution
1. All entries are nonnegative ✓. Each column sums to 0+1/2+1/2=1 ✓.
2. $(A-I)q=\begin{bmatrix}-1&1/2&1/2\\1/2&-1&1/2\\1/2&1/2&-1\end{bmatrix}q=0$. By symmetry, $q=\begin{bmatrix}1\\1\\1\end{bmatrix}$ is a solution. Normalized (probabilities sum to 1): $q=\begin{bmatrix}1/3\\1/3\\1/3\end{bmatrix}$.
3. tr(A)=0=1+λ2+λ3, so λ2+λ3=−1.
det(A)=0+1/8+1/8−0−0−0=1/4. And det(A)=1⋅λ2⋅λ3, so λ2λ3=1/4.
The two unknown eigenvalues satisfy x²+x+1/4=0, giving x=(−1±√(1−1))/2=−1/2.
So λ2=λ3=−1/2 (a repeated eigenvalue).
4. Since ∣λ2∣=∣λ3∣=1/2<1, the transient components decay: (−1/2)^k→0. For large k:
$$u_k=A^ku_0\to\begin{bmatrix}1/3\\1/3\\1/3\end{bmatrix}$$
The mouse spends equal time in all three rooms, regardless of where it starts. The convergence rate is ∣−1/2∣k=(1/2)k, so after k=10 steps the transient is ≈0.001.
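The whole solution can be double-checked numerically; a minimal NumPy sketch:

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

print(np.sort(np.linalg.eigvals(A).real))  # -0.5, -0.5, 1.0 as computed above

u = np.array([1.0, 0.0, 0.0])  # start in room 1
for _ in range(50):
    u = A @ u
print(u)                       # approximately [1/3, 1/3, 1/3]
```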
Consider the matrix $A=\begin{bmatrix}2&1\\0&2\end{bmatrix}$.
1. Find the eigenvalues and their algebraic multiplicities.
2. Find the eigenvectors. What is the geometric multiplicity?
3. Is A diagonalizable? Explain.
4. Compute $A^k$ directly using the binomial theorem on $A=2I+N$, where $N=\begin{bmatrix}0&1\\0&0\end{bmatrix}$.
💡 Solution
1. det(A−λI)=(2−λ)²=0. So λ=2 with algebraic multiplicity 2.
2. $(A-2I)v=\begin{bmatrix}0&1\\0&0\end{bmatrix}v=0\implies v_2=0$, so $v=t\begin{bmatrix}1\\0\end{bmatrix}$.
Geometric multiplicity = 1 (only one independent eigenvector).
3. No. Diagonalization requires n=2 independent eigenvectors, but we only have 1. The matrix is defective.
4. Note $N^2=\begin{bmatrix}0&1\\0&0\end{bmatrix}\begin{bmatrix}0&1\\0&0\end{bmatrix}=\begin{bmatrix}0&0\\0&0\end{bmatrix}$. So $N^k=0$ for all k≥2.
Since 2I and N commute:
$$A^k=(2I+N)^k=\sum_{j=0}^{k}\binom{k}{j}(2I)^{k-j}N^{j}$$
Only j=0 and j=1 survive (since N²=0):
$$A^k=2^kI+k\cdot2^{k-1}N=\begin{bmatrix}2^k&k\cdot2^{k-1}\\0&2^k\end{bmatrix}$$
Verify for k=1: $\begin{bmatrix}2&1\\0&2\end{bmatrix}$ ✓. For k=2 the formula gives $\begin{bmatrix}4&4\\0&4\end{bmatrix}$, which matches direct computation of $A^2$ ✓.
Notice the off-diagonal grows as k⋅2k−1=(k/2)⋅2k, which exceeds the diagonal 2k by a growing factor of k/2. This polynomial-times-exponential growth in the off-diagonal is characteristic of defective matrices and does not happen for diagonalizable ones.
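The closed form can be checked against repeated multiplication; a minimal NumPy sketch (`power_formula` is our illustrative helper):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

def power_formula(k):
    # A^k = 2^k I + k 2^{k-1} N, from the binomial expansion with N^2 = 0
    return np.array([[2.0**k, k * 2.0**(k - 1)],
                     [0.0,    2.0**k]])

for k in range(1, 8):
    assert np.allclose(np.linalg.matrix_power(A, k), power_formula(k))
print("formula matches direct powers for k = 1..7")
```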