
Eigenvalues, Eigenvectors, and Diagonalization


The Central Question: When Does a Matrix Act Like Simple Scaling?

Most matrices rotate, stretch, and shear vectors in complicated ways. But along certain special directions, a matrix acts like multiplication by a scalar: Av=λvAv = \lambda v. The vector vv goes in, and the same vector (scaled by λ\lambda) comes out.

These special directions and scaling factors reveal a matrix's intrinsic behavior:

  1. The eigenvalues of a covariance matrix are the variances along the principal directions of the data.
  2. The dominant eigenvector of the web's link matrix is the PageRank vector.
  3. The eigenvalues of a recurrent neural network's weight matrix determine whether gradients explode or vanish during training.

Finding eigenvalues and eigenvectors is the key to diagonalization, which reduces matrix powers, exponentials, and dynamical systems to simple scalar operations.


The Eigenvalue Equation

Take the matrix A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} and the vector v=[11]v = \begin{bmatrix} 1 \\ 1 \end{bmatrix}:

Av=[2112][11]=[33]=3[11]=3vAv = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix} = 3\begin{bmatrix} 1 \\ 1 \end{bmatrix} = 3v

The matrix AA simply scales vv by a factor of 3, without changing its direction. Now try w=[11]w = \begin{bmatrix} 1 \\ -1 \end{bmatrix}:

Aw=[2112][11]=[11]=1wAw = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} = 1 \cdot w

The direction ww is also preserved, scaled by 1 (unchanged). A generic vector like [21]\begin{bmatrix} 2 \\ 1 \end{bmatrix} would be both rotated and stretched, not just scaled. The directions vv and ww are the special ones.

Geometrically, AA stretches space by a factor of 3 along the diagonal direction [11]\begin{bmatrix} 1 \\ 1 \end{bmatrix} and leaves the perpendicular direction [11]\begin{bmatrix} 1 \\ -1 \end{bmatrix} unchanged. Every other vector gets rotated toward the direction with the larger eigenvalue.
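
These two products are easy to check numerically. A quick sketch with NumPy (the values mirror the example above):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, 1.0])    # eigenvector for lambda = 3
w = np.array([1.0, -1.0])   # eigenvector for lambda = 1

print(A @ v)  # [3. 3.]   -- v scaled by 3, direction unchanged
print(A @ w)  # [ 1. -1.] -- w unchanged

# a generic vector is rotated as well as stretched
u = np.array([2.0, 1.0])
print(A @ u)  # [5. 4.]   -- not a scalar multiple of [2, 1]
```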

Definition: Eigenvalue and Eigenvector

An eigenvalue λ\lambda and eigenvector v0v \neq 0 of a square matrix AA satisfy:

Av=λvAv = \lambda v

The eigenvector vv is a direction that AA preserves (up to scaling). The eigenvalue λ\lambda is the scaling factor.

Finding Eigenvalues

The equation Av=λvAv = \lambda v rewrites as (AλI)v=0(A - \lambda I)v = 0. For a nonzero vv to exist, AλIA - \lambda I must be singular:

det(AλI)=0\det(A - \lambda I) = 0

This is the characteristic polynomial (see Determinants: The Characteristic Polynomial). Its roots are the eigenvalues.

Example. For A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}:

det(AλI)=det[2λ112λ]=(2λ)21=λ24λ+3=(λ3)(λ1)=0\det(A - \lambda I) = \det\begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0

Eigenvalues: λ1=3\lambda_1 = 3, λ2=1\lambda_2 = 1.

For each λ\lambda, solve (AλI)v=0(A - \lambda I)v = 0 to find the eigenvector:

  • λ1=3\lambda_1 = 3: [1111]v=0    v1=[11]\begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}v = 0 \implies v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

  • λ2=1\lambda_2 = 1: [1111]v=0    v2=[11]\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}v = 0 \implies v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
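
NumPy's general eigensolver recovers the same eigenpairs (up to normalization and ordering); a quick cross-check:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are unit eigenvectors

print(np.sort(eigvals))               # [1. 3.]

# each column satisfies A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```

The returned eigenvectors are proportional to the hand-computed ones, scaled to unit length.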

Practical Insight

For triangular matrices (upper or lower), the eigenvalues are the diagonal entries. No characteristic polynomial is needed: A - \lambda I is still triangular, and a triangular determinant is the product of the diagonal entries. Setting \lambda equal to any diagonal entry a_{kk} zeroes the k-th diagonal entry, so \det(A - a_{kk} I) = 0.

Algebraic vs. Geometric Multiplicity

An eigenvalue can be a repeated root of the characteristic polynomial.

Consider A=[5105]A = \begin{bmatrix} 5 & 1 \\ 0 & 5 \end{bmatrix}. The characteristic polynomial is (λ5)2=0(\lambda - 5)^2 = 0, so λ=5\lambda = 5 is a root with multiplicity 2. But solving (A5I)v=0(A - 5I)v = 0:

[0100]v=0    v=t[10]\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}v = 0 \implies v = t\begin{bmatrix} 1 \\ 0 \end{bmatrix}

There is only one independent eigenvector, not two.

Definition: Algebraic and Geometric Multiplicity

For an eigenvalue λ\lambda:

  • Algebraic multiplicity a(λ)a(\lambda): the number of times λ\lambda appears as a root of the characteristic polynomial
  • Geometric multiplicity g(λ)g(\lambda): the dimension of the eigenspace ker(AλI)\ker(A - \lambda I), i.e., the number of independent eigenvectors for λ\lambda

Always: 1g(λ)a(λ)1 \leq g(\lambda) \leq a(\lambda).

When g(λ)<a(λ)g(\lambda) < a(\lambda), the matrix is called defective. The matrix [5105]\begin{bmatrix} 5 & 1 \\ 0 & 5 \end{bmatrix} has a(5)=2a(5) = 2 but g(5)=1g(5) = 1, so it is defective. This matters because defective matrices cannot be diagonalized.

In contrast, A=[5005]A = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} also has λ=5\lambda = 5 with a(5)=2a(5) = 2, but now g(5)=2g(5) = 2 (every nonzero vector is an eigenvector). This matrix is already diagonal.
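
Geometric multiplicity can be computed as the nullity of A - \lambda I. A sketch (the helper name geometric_multiplicity is ours):

```python
import numpy as np

def geometric_multiplicity(A, lam):
    """dim ker(A - lam*I) = n minus the numerical rank of A - lam*I."""
    n = A.shape[0]
    return n - np.linalg.matrix_rank(A - lam * np.eye(n))

defective = np.array([[5.0, 1.0], [0.0, 5.0]])
diagonal  = np.array([[5.0, 0.0], [0.0, 5.0]])

print(geometric_multiplicity(defective, 5.0))  # 1: defective, g(5) < a(5) = 2
print(geometric_multiplicity(diagonal, 5.0))   # 2: g(5) = a(5) = 2
```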

Trace, Determinant, and Eigenvalues

Two fundamental relationships connect eigenvalues to matrix properties:

Theorem: Trace and Determinant

For an n×nn \times n matrix AA with eigenvalues λ1,,λn\lambda_1, \ldots, \lambda_n (counted with algebraic multiplicity):

tr(A)=i=1nλidet(A)=i=1nλi\text{tr}(A) = \sum_{i=1}^n \lambda_i \qquad \det(A) = \prod_{i=1}^n \lambda_i

Example. For A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} with λ1=3,λ2=1\lambda_1 = 3, \lambda_2 = 1:

tr(A)=2+2=4=3+1\text{tr}(A) = 2 + 2 = 4 = 3 + 1 ✓ and det(A)=41=3=31\det(A) = 4 - 1 = 3 = 3 \cdot 1

These give a quick sanity check on computed eigenvalues. They also show that det(A)=0\det(A) = 0 if and only if at least one eigenvalue is zero, linking singularity to eigenvalues.
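
This sanity check is two lines of NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals = np.linalg.eigvals(A)

assert np.isclose(np.trace(A), eigvals.sum())        # 4 = 3 + 1
assert np.isclose(np.linalg.det(A), eigvals.prod())  # 3 = 3 * 1
print("trace and determinant match the eigenvalue sum and product")
```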


Diagonalization

The Key Idea

Suppose AA has nn independent eigenvectors v1,,vnv_1, \ldots, v_n with eigenvalues λ1,,λn\lambda_1, \ldots, \lambda_n. Collect them into matrices:

P=[v1v2vn],D=[λ1λ2λn]P = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}, \quad D = \begin{bmatrix} \lambda_1 & & \\ & \lambda_2 & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}

Then AP=PDAP = PD (each column of APAP is Avi=λiviAv_i = \lambda_i v_i). Since PP has independent columns, it is invertible:

A=PDP1A = PDP^{-1}

Example. For A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} with v1=[11]v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, v2=[11]v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}:

P=[1111],D=[3001]P = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad D = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}

P1=12[1111]=[12121212]P^{-1} = \frac{1}{-2}\begin{bmatrix} -1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{bmatrix}

Verify: PDP1=[1111][3001][12121212]=[2112]=APDP^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} = A

Definition: Diagonalization

A matrix AA is diagonalizable if it can be written as:

A=PDP1A = PDP^{-1}

where PP is the matrix of eigenvectors and DD is the diagonal matrix of eigenvalues.

Theorem: Diagonalizability Condition

An n×nn \times n matrix AA is diagonalizable if and only if it has nn linearly independent eigenvectors. Equivalently, the geometric multiplicity equals the algebraic multiplicity for every eigenvalue.

A matrix with nn distinct eigenvalues is always diagonalizable (distinct eigenvalues guarantee independent eigenvectors). But distinct eigenvalues are not necessary: 5I5I is diagonal despite having a single repeated eigenvalue.

Powers of Matrices

Diagonalization makes matrix powers trivial. Since A=PDP1A = PDP^{-1}:

A2=(PDP1)(PDP1)=PD2P1A^2 = (PDP^{-1})(PDP^{-1}) = PD^2P^{-1}

Ak=PDkP1=P[λ1kλ2k]P1A^k = PD^kP^{-1} = P\begin{bmatrix} \lambda_1^k & & \\ & \lambda_2^k & \\ & & \ddots \end{bmatrix}P^{-1}

Raising a diagonal matrix to a power just raises each diagonal entry to that power. This converts a matrix problem into nn independent scalar problems.

Stability. The powers AkA^k converge to zero as kk \to \infty if and only if λi<1|\lambda_i| < 1 for all eigenvalues. If any λi>1|\lambda_i| > 1, the powers blow up. If λi=1|\lambda_i| = 1 for all ii and AA is diagonalizable, the powers stay bounded but may oscillate; a defective eigenvalue on the unit circle instead produces polynomial growth.
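
A sketch comparing the eigendecomposition route against direct repeated multiplication, plus the stability criterion (the helper name is ours):

```python
import numpy as np

def power_via_eig(A, k):
    """A^k = P D^k P^{-1}: n scalar powers instead of k - 1 matrix products."""
    eigvals, P = np.linalg.eig(A)
    return P @ np.diag(eigvals**k) @ np.linalg.inv(P)

A = np.array([[2.0, 1.0], [1.0, 2.0]])
assert np.allclose(power_via_eig(A, 10), np.linalg.matrix_power(A, 10))

# stability: B = A/4 has eigenvalues 3/4 and 1/4, both inside the unit circle
B = A / 4.0
print(np.abs(np.linalg.matrix_power(B, 50)).max())  # tiny: powers decay to zero
```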

Fibonacci Numbers

The Fibonacci sequence F0=0,F1=1,Fn+1=Fn+Fn1F_0 = 0, F_1 = 1, F_{n+1} = F_n + F_{n-1} can be encoded as a matrix recurrence:

[Fn+1Fn]=[1110][FnFn1]\begin{bmatrix} F_{n+1} \\ F_n \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} F_n \\ F_{n-1} \end{bmatrix}

Let A=[1110]A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}. Then [Fn+1Fn]=An[10]\begin{bmatrix} F_{n+1} \\ F_n \end{bmatrix} = A^n \begin{bmatrix} 1 \\ 0 \end{bmatrix}.

The characteristic polynomial is λ2λ1=0\lambda^2 - \lambda - 1 = 0, giving eigenvalues:

λ1=1+52=φ1.618(the golden ratio),λ2=1520.618\lambda_1 = \frac{1 + \sqrt{5}}{2} = \varphi \approx 1.618 \quad (\text{the golden ratio}), \quad \lambda_2 = \frac{1 - \sqrt{5}}{2} \approx -0.618

Since λ2<1|\lambda_2| < 1, the λ2n\lambda_2^n term vanishes for large nn. The dominant eigenvalue φ\varphi controls the growth, giving Binet's formula:

Fn=φn(1φ)n5F_n = \frac{\varphi^n - (1-\varphi)^n}{\sqrt{5}}

The ratio of consecutive Fibonacci numbers converges to φ\varphi: Fn+1/FnφF_{n+1}/F_n \to \varphi as nn \to \infty. The golden ratio emerges directly from the eigenvalues.
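
Both the matrix-power route and Binet's formula are easy to verify (a sketch; the helper names are ours):

```python
import numpy as np

def fib_matrix(n):
    """F_n as the off-diagonal entry of [[1,1],[1,0]]^n (exact integer arithmetic)."""
    A = np.array([[1, 1], [1, 0]], dtype=np.int64)
    return int(np.linalg.matrix_power(A, n)[0, 1])

phi = (1 + np.sqrt(5)) / 2

def fib_binet(n):
    """Binet's formula, rounded to absorb floating-point error."""
    return round((phi**n - (1 - phi)**n) / np.sqrt(5))

print([fib_matrix(n) for n in range(1, 11)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
assert all(fib_matrix(n) == fib_binet(n) for n in range(1, 40))
print(fib_matrix(30) / fib_matrix(29))        # 1.6180339887... -> golden ratio
```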

Markov Chains

A Markov matrix has nonnegative entries with each column summing to 1. It describes a system that transitions between states with fixed probabilities.

Example. A simplified weather model: if today is sunny, there is a 90% chance tomorrow is sunny and 10% chance of rain. If today is rainy, there is a 50% chance tomorrow is sunny and 50% chance of rain.

A=[0.90.50.10.5]A = \begin{bmatrix} 0.9 & 0.5 \\ 0.1 & 0.5 \end{bmatrix}

Starting from state u0u_0, the state after kk steps is uk=Aku0u_k = A^k u_0. The key properties of Markov matrices:

  1. λ=1\lambda = 1 is always an eigenvalue (since columns sum to 1, the row vector [111]\begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} is a left eigenvector for λ=1\lambda = 1)
  2. All other eigenvalues satisfy λi1|\lambda_i| \leq 1
  3. The steady-state vector qq satisfies Aq=qAq = q, meaning it is the eigenvector for λ=1\lambda = 1

For our weather example: λ1=1,λ2=0.4\lambda_1 = 1, \lambda_2 = 0.4. The steady-state eigenvector (normalized to sum to 1):

(AI)q=0:[0.10.50.10.5]q=0    q=[5/61/6](A - I)q = 0: \quad \begin{bmatrix} -0.1 & 0.5 \\ 0.1 & -0.5 \end{bmatrix}q = 0 \implies q = \begin{bmatrix} 5/6 \\ 1/6 \end{bmatrix}

Regardless of the initial weather, the system converges to 5/6 sunny and 1/6 rainy. The second eigenvalue λ2=0.4\lambda_2 = 0.4 controls the convergence rate: 0.4k0|0.4|^k \to 0, so the transient component decays by a factor of 0.4 each step.
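
Iterating the weather model shows this convergence directly (a sketch, starting from a fully rainy day):

```python
import numpy as np

A = np.array([[0.9, 0.5],
              [0.1, 0.5]])
u = np.array([0.0, 1.0])        # day 0: rainy with probability 1

for _ in range(25):             # transient decays like 0.4^k
    u = A @ u

print(u)                        # ~[0.8333 0.1667], i.e. [5/6, 1/6]

# same answer from the lambda = 1 eigenvector, normalized to sum to 1
eigvals, eigvecs = np.linalg.eig(A)
q = eigvecs[:, np.argmax(eigvals.real)]
q = q / q.sum()
assert np.allclose(u, q)
```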

Practical Insight

The convergence rate of a Markov chain is governed by the spectral gap 1λ21 - |\lambda_2|, where λ2\lambda_2 is the second-largest eigenvalue in magnitude. A larger spectral gap means faster convergence to steady state. This is the basis of MCMC convergence diagnostics.


The Spectral Theorem for Symmetric Matrices

The matrix A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} from our running example is symmetric (A=ATA = A^T). Its eigenvectors v1=[11]v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} and v2=[11]v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} are orthogonal: v1v2=11=0v_1 \cdot v_2 = 1 - 1 = 0. This is not a coincidence.

Theorem: The Spectral Theorem

Every real symmetric matrix AA has:

  1. Real eigenvalues (no complex numbers)
  2. Orthogonal eigenvectors (eigenvectors for distinct eigenvalues are perpendicular)
  3. A full set of nn orthonormal eigenvectors, so AA is always diagonalizable as:

A=QΛQTA = Q\Lambda Q^T

where QQ is orthogonal (QTQ=IQ^T Q = I) and Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n).

The notation changes from PP to QQ to emphasize that the eigenvector matrix is now orthogonal, and P1P^{-1} simplifies to QTQ^T (since Q1=QTQ^{-1} = Q^T for orthogonal matrices).

Why symmetric matrices have real eigenvalues

Suppose Av=λvAv = \lambda v where vv might be complex. Take the conjugate transpose of both sides:

vˉTAT=λˉvˉT\bar{v}^T A^T = \bar{\lambda} \bar{v}^T

Since A=ATA = A^T: vˉTA=λˉvˉT\bar{v}^T A = \bar{\lambda} \bar{v}^T. Multiply on the right by vv:

vˉTAv=λˉvˉTv\bar{v}^T A v = \bar{\lambda} \bar{v}^T v

But also vˉT(Av)=vˉT(λv)=λvˉTv\bar{v}^T (Av) = \bar{v}^T (\lambda v) = \lambda \bar{v}^T v. So:

λvˉTv=λˉvˉTv\lambda \bar{v}^T v = \bar{\lambda} \bar{v}^T v

Since v0v \neq 0, vˉTv=v2>0\bar{v}^T v = \|v\|^2 > 0. Dividing: λ=λˉ\lambda = \bar{\lambda}, which means λ\lambda is real.

Why eigenvectors for distinct eigenvalues are orthogonal

Suppose Av1=λ1v1Av_1 = \lambda_1 v_1 and Av2=λ2v2Av_2 = \lambda_2 v_2 with λ1λ2\lambda_1 \neq \lambda_2. Then:

λ1(v1v2)=(Av1)v2=v1(ATv2)=v1(Av2)=λ2(v1v2)\lambda_1 (v_1 \cdot v_2) = (Av_1) \cdot v_2 = v_1 \cdot (A^T v_2) = v_1 \cdot (Av_2) = \lambda_2 (v_1 \cdot v_2)

The middle step uses A=ATA = A^T. So (λ1λ2)(v1v2)=0(\lambda_1 - \lambda_2)(v_1 \cdot v_2) = 0. Since λ1λ2\lambda_1 \neq \lambda_2, we must have v1v2=0v_1 \cdot v_2 = 0.

The Spectral Decomposition

Expanding A=QΛQTA = Q\Lambda Q^T column by column reveals a beautiful structure. Let q1,,qnq_1, \ldots, q_n be the orthonormal eigenvectors:

A=QΛQT=i=1nλiqiqiTA = Q\Lambda Q^T = \sum_{i=1}^n \lambda_i q_i q_i^T

Each term qiqiTq_i q_i^T is a rank-1 projection matrix onto the direction qiq_i. The symmetric matrix AA is a weighted sum of orthogonal projections, with the eigenvalues as weights.

Example. Normalize the eigenvectors of A=[2112]A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}:

q1=12[11],q2=12[11]q_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad q_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

A=3q1q1T+1q2q2T=312[1111]+112[1111]=[2112]A = 3 \cdot q_1 q_1^T + 1 \cdot q_2 q_2^T = 3 \cdot \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + 1 \cdot \frac{1}{2}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \checkmark

The matrix AA projects any vector onto q1q_1, scales that component by 3, projects onto q2q_2, scales by 1, and adds the results.
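
np.linalg.eigh is NumPy's dedicated symmetric eigensolver; it returns orthonormal eigenvectors directly, so the rank-1 reconstruction is a one-liner:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, Q = np.linalg.eigh(A)          # symmetric solver: real, orthonormal

assert np.allclose(Q.T @ Q, np.eye(2))  # Q is orthogonal

# A as a weighted sum of rank-1 projections q_i q_i^T
A_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
assert np.allclose(A_rebuilt, A)
print("spectral decomposition reconstructs A")
```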

Why symmetry matters. Covariance matrices, Hessians, graph Laplacians, and kernel matrices are all symmetric. The spectral theorem guarantees they have real eigenvalues and orthogonal eigenvectors, which is why so many ML algorithms reduce to symmetric eigenvalue problems. See Applications in Data Science and Machine Learning for specific examples.


Similarity Transformations

Two matrices can represent the same linear transformation in different bases. They are connected by a similarity transformation.

Definition: Similar Matrices

Matrices AA and BB are similar if there exists an invertible matrix MM such that:

B=M1AMB = M^{-1}AM

Diagonalization is a special case: D=P1APD = P^{-1}AP, where PP changes from the standard basis to the eigenvector basis. In the eigenvector basis, the transformation is just diagonal scaling.

Theorem: Similarity Preserves Eigenvalues

If B=M1AMB = M^{-1}AM, then AA and BB have the same eigenvalues (same characteristic polynomial).

Proof

det(BλI)=det(M1AMλI)=det(M1(AλI)M)=det(M1)det(AλI)det(M)\det(B - \lambda I) = \det(M^{-1}AM - \lambda I) = \det(M^{-1}(A - \lambda I)M) = \det(M^{-1})\det(A - \lambda I)\det(M)

Since det(M1)det(M)=1\det(M^{-1})\det(M) = 1: det(BλI)=det(AλI)\det(B - \lambda I) = \det(A - \lambda I).

The eigenvalues are the same, but the eigenvectors change. If Av=λvAv = \lambda v, then B(M1v)=λ(M1v)B(M^{-1}v) = \lambda(M^{-1}v). The eigenvectors of BB are M1vM^{-1}v, the eigenvectors of AA expressed in the new basis.

The key invariants under similarity: eigenvalues, trace, determinant, rank, and characteristic polynomial. These are properties of the linear transformation itself, not of the particular basis.
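
These invariants are easy to confirm on a random example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = rng.standard_normal((4, 4))          # generic, hence invertible
B = np.linalg.inv(M) @ A @ M             # similar to A

assert np.allclose(np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(B)))
assert np.isclose(np.trace(A), np.trace(B))
assert np.isclose(np.linalg.det(A), np.linalg.det(B))
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B)
print("eigenvalues, trace, determinant, and rank all agree")
```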

Jordan Normal Form

When a matrix is not diagonalizable (defective), the closest we can get is the Jordan normal form:

A=PJP1,J=[J1J2]A = PJP^{-1}, \quad J = \begin{bmatrix} J_1 & & \\ & J_2 & \\ & & \ddots \end{bmatrix}

where each Jordan block JkJ_k looks like:

Jk=[λ1λ11λ]J_k = \begin{bmatrix} \lambda & 1 & & \\ & \lambda & 1 & \\ & & \ddots & 1 \\ & & & \lambda \end{bmatrix}

The 1s on the superdiagonal appear precisely when the geometric multiplicity falls short of the algebraic multiplicity. For a diagonalizable matrix, all Jordan blocks are 1×11 \times 1 (just eigenvalues on the diagonal).

Example. The defective matrix A=[5105]A = \begin{bmatrix} 5 & 1 \\ 0 & 5 \end{bmatrix} is already in Jordan form: one 2×22 \times 2 Jordan block with λ=5\lambda = 5.

Jordan form is more of a theoretical tool than a computational one. In practice, computing it is numerically unstable (arbitrarily small perturbations can change the block structure). The Schur decomposition A = QUQ^T, where U is upper triangular and Q is orthogonal, is the numerically stable alternative.


Difference Equations and Differential Equations

Difference Equations

A linear recurrence uk+1=Auku_{k+1} = Au_k with initial condition u0u_0 has the solution uk=Aku0u_k = A^k u_0. Diagonalization makes this explicit.

Write u0u_0 in the eigenvector basis: u0=c1v1+c2v2++cnvnu_0 = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n. Then:

uk=Aku0=c1λ1kv1+c2λ2kv2++cnλnkvnu_k = A^k u_0 = c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n

Each eigenvector component evolves independently, scaled by λik\lambda_i^k. The long-term behavior depends entirely on which eigenvalues dominate:

  • All |\lambda_i| < 1: u_k \to 0 (stable)
  • Some |\lambda_i| > 1: u_k blows up (unstable)
  • |\lambda_i| = 1 for all i: u_k stays bounded
  • One dominant |\lambda_1| > |\lambda_i| for i \geq 2: u_k \approx c_1 \lambda_1^k v_1 for large k

Differential Equations

The continuous-time analogue dudt=Au\frac{du}{dt} = Au has the solution:

u(t)=eAtu(0)u(t) = e^{At} u(0)

where the matrix exponential is defined by:

eAt=I+At+(At)22+(At)36+=k=0(At)kk!e^{At} = I + At + \frac{(At)^2}{2} + \frac{(At)^3}{6} + \cdots = \sum_{k=0}^{\infty} \frac{(At)^k}{k\,!}

If A=PDP1A = PDP^{-1}, then eAt=PeDtP1e^{At} = P e^{Dt} P^{-1}, and:

eDt=[eλ1teλ2t]e^{Dt} = \begin{bmatrix} e^{\lambda_1 t} & & \\ & e^{\lambda_2 t} & \\ & & \ddots \end{bmatrix}

The solution decomposes as:

u(t)=c1eλ1tv1+c2eλ2tv2++cneλntvnu(t) = c_1 e^{\lambda_1 t} v_1 + c_2 e^{\lambda_2 t} v_2 + \cdots + c_n e^{\lambda_n t} v_n

The stability conditions mirror the discrete case, but use the real parts of eigenvalues (since eλt=eRe(λ)t|e^{\lambda t}| = e^{\text{Re}(\lambda) t}):

  • All \text{Re}(\lambda_i) < 0: u(t) \to 0 (stable)
  • Some \text{Re}(\lambda_i) > 0: u(t) blows up (unstable)
  • All \text{Re}(\lambda_i) = 0: u(t) oscillates

Practical Insight

The stability condition is the same idea in both cases. For discrete systems (uk+1=Auku_{k+1} = Au_k), stability requires eigenvalues inside the unit circle (λ<1|\lambda| < 1). For continuous systems (du/dt=Audu/dt = Au), stability requires eigenvalues in the left half-plane (Re(λ)<0\text{Re}(\lambda) < 0).
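
As a numerical check, take A = [[0, 1], [-1, 0]] (eigenvalues ±i, so Re(λ) = 0 for both): e^{At} computed via the eigendecomposition matches the truncated power series, and the solution is pure oscillation. A sketch:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # eigenvalues +i and -i
t = 1.0

# route 1: e^{At} = P e^{Dt} P^{-1}
eigvals, P = np.linalg.eig(A)
expAt = (P @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(P)).real

# route 2: truncated power series, sum over k of (At)^k / k!
series, term = np.zeros((2, 2)), np.eye(2)
for k in range(30):
    series = series + term
    term = term @ (A * t) / (k + 1)

assert np.allclose(expAt, series)
# Re(lambda) = 0 for both eigenvalues: pure oscillation (rotation by t radians)
assert np.allclose(expAt, [[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]])
print(expAt)
```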


Summary

  • An eigenvalue λ\lambda and eigenvector vv satisfy Av=λvAv = \lambda v: the matrix acts as simple scaling along vv
  • Eigenvalues are roots of the characteristic polynomial det(AλI)=0\det(A - \lambda I) = 0. The trace equals their sum, the determinant equals their product
  • Algebraic multiplicity counts root repetitions; geometric multiplicity counts independent eigenvectors. A matrix is defective when these differ
  • Diagonalization A=PDP1A = PDP^{-1} exists when there are nn independent eigenvectors. It reduces powers to Ak=PDkP1A^k = PD^kP^{-1}
  • The Fibonacci sequence is governed by the eigenvalues of [1110]\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, producing the golden ratio φ=(1+5)/2\varphi = (1+\sqrt{5})/2
  • Markov chains converge to the eigenvector for λ=1\lambda = 1 at a rate controlled by λ2|\lambda_2|
  • The spectral theorem guarantees that symmetric matrices have real eigenvalues, orthogonal eigenvectors, and the decomposition A=QΛQT=λiqiqiTA = Q\Lambda Q^T = \sum \lambda_i q_i q_i^T
  • Similar matrices B=M1AMB = M^{-1}AM share eigenvalues, trace, determinant, and rank
  • For difference equations uk+1=Auku_{k+1} = Au_k, stability requires λi<1|\lambda_i| < 1. For differential equations du/dt=Audu/dt = Au, stability requires Re(λi)<0\text{Re}(\lambda_i) < 0

Answering the Central Question: A matrix acts like simple scaling along its eigenvector directions. When AA has nn independent eigenvectors, it diagonalizes as A=PDP1A = PDP^{-1}, reducing matrix powers, exponentials, and dynamical systems to scalar operations on the eigenvalues. For symmetric matrices, the spectral theorem guarantees this decomposition always exists with orthogonal eigenvectors, making A=QΛQTA = Q\Lambda Q^T the most important factorization in applied mathematics.


Applications in Data Science and Machine Learning

Eigenvalues and eigenvectors form the mathematical backbone of many ML algorithms, from dimensionality reduction to graph analysis to understanding training dynamics.

Principal Component Analysis (PCA)

Given a centered data matrix XRn×dX \in \mathbb{R}^{n \times d} (nn samples, dd features), the covariance matrix is:

Σ=1n1XTX\Sigma = \frac{1}{n-1}X^TX

This is symmetric and positive semi-definite, so the spectral theorem applies: Σ=QΛQT\Sigma = Q\Lambda Q^T. The eigenvectors q1,,qdq_1, \ldots, q_d are the principal components (directions of maximum variance), and the eigenvalues λ1λ2λd0\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0 are the variances along those directions.

To reduce from dd dimensions to kk, project onto the top kk eigenvectors:

Xreduced=XQk,Qk=[q1q2qk]X_{\text{reduced}} = X Q_k, \quad Q_k = \begin{bmatrix} q_1 & q_2 & \cdots & q_k \end{bmatrix}

The fraction of variance retained is i=1kλi/i=1dλi\sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i. If the eigenvalues decay rapidly, a few components capture most of the information.
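
A minimal PCA sketch on synthetic data (the feature standard deviations 3, 1, 0.3 are assumptions chosen so two components dominate):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.3])  # feature stds 3, 1, 0.3
X = X - X.mean(axis=0)                       # center

Sigma = X.T @ X / (X.shape[0] - 1)           # covariance matrix
eigvals, Q = np.linalg.eigh(Sigma)           # eigh returns ascending order
eigvals, Q = eigvals[::-1], Q[:, ::-1]       # re-sort descending

k = 2
X_reduced = X @ Q[:, :k]                     # project onto top-k components
explained = eigvals[:k].sum() / eigvals.sum()
print(f"variance retained by {k} of 3 components: {explained:.3f}")  # ~0.99
```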

Spectral Clustering

Given a similarity graph with weight matrix WW and degree matrix D=diag(jWij)D = \text{diag}(\sum_j W_{ij}), the graph Laplacian is:

L=DWL = D - W

The Laplacian is symmetric and positive semi-definite. Its smallest eigenvalue is always 0 (with eigenvector 1\mathbf{1}). The number of zero eigenvalues equals the number of connected components. Spectral clustering computes the bottom kk eigenvectors of LL (or the normalized Laplacian D1/2LD1/2D^{-1/2}LD^{-1/2}) and clusters the rows using kk-means.

The spectral gap (the gap between the kk-th and (k+1)(k+1)-th eigenvalues) indicates how well-separated the clusters are. A large gap means clean cluster structure.
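
The zero-eigenvalue count is easy to see on a toy graph: two disconnected triangles give a Laplacian with exactly two zero eigenvalues. A sketch:

```python
import numpy as np

W = np.zeros((6, 6))                 # two disconnected triangles {0,1,2} and {3,4,5}
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))           # degree matrix
L = D - W                            # graph Laplacian

eigvals = np.linalg.eigvalsh(L)      # symmetric PSD: eigvalsh is appropriate
n_zero = int(np.sum(np.abs(eigvals) < 1e-10))
print(n_zero)                        # 2 zero eigenvalues = 2 connected components
```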

PageRank

Google's original algorithm models the web as a Markov chain. The transition matrix AA has entry Aij=1/njA_{ij} = 1/n_j if page jj links to page ii (where njn_j is the number of outgoing links from jj). PageRank is the steady-state eigenvector for λ=1\lambda = 1 of a modified Markov matrix:

M=(1α)A+αN11TM = (1 - \alpha)A + \frac{\alpha}{N}\mathbf{1}\mathbf{1}^T

where α0.15\alpha \approx 0.15 is the "teleportation" probability and NN is the number of pages. The dominant eigenvector q1q_1 (with Mq1=q1Mq_1 = q_1) ranks pages by importance.
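
Power iteration on a tiny hypothetical 4-page web (the link structure below is made up for illustration):

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page j -> pages it links to
N = 4
A = np.zeros((N, N))
for j, outs in links.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)                # column j sums to 1

alpha = 0.15
M = (1 - alpha) * A + alpha / N                  # adding alpha/N to every entry = teleportation term

q = np.full(N, 1.0 / N)                          # power iteration
for _ in range(100):
    q = M @ q

assert np.allclose(M @ q, q)                     # reached the lambda = 1 eigenvector
print(np.round(q, 3))                            # PageRank scores (sum to 1)
```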

Gradient Flow in Recurrent Neural Networks

An RNN processes sequences by iterating ht+1=σ(Wht+Uxt)h_{t+1} = \sigma(Wh_t + Ux_t). During backpropagation through time, the gradient involves products of the weight matrix WW:

hThtk=tT1WTdiag(σ)\frac{\partial h_T}{\partial h_t} \approx \prod_{k=t}^{T-1} W^T \cdot \text{diag}(\sigma')

This behaves like WTtW^{T-t} for large time gaps. The eigenvalues of WW determine the gradient dynamics:

  • λmax>1|\lambda_{\max}| > 1: gradients explode (products grow exponentially)
  • λmax<1|\lambda_{\max}| < 1: gradients vanish (products shrink to zero)
  • λi1|\lambda_i| \approx 1 for all ii: gradients remain stable

This is why orthogonal/unitary weight initialization (eigenvalues on the unit circle) helps RNN training. LSTMs and GRUs address this by introducing gating mechanisms that keep the effective eigenvalues near 1.
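
An illustrative sketch (not an actual RNN: we ignore the nonlinearity and inputs and just track the norm of the matrix product over time steps):

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_after_k_steps(W, k=50):
    """Frobenius norm of W^k, a proxy for the backprop-through-time product."""
    M = np.eye(W.shape[0])
    for _ in range(k):
        M = W @ M
    return np.linalg.norm(M)

n = 32
G = rng.standard_normal((n, n)) / np.sqrt(n)  # spectral radius roughly 1
Q, _ = np.linalg.qr(G)                        # orthogonal: all |lambda_i| = 1

print(norm_after_k_steps(1.5 * G))   # explodes (spectral radius ~1.5)
print(norm_after_k_steps(0.5 * G))   # vanishes (spectral radius ~0.5)
print(norm_after_k_steps(Q))         # stays at sqrt(32) ~ 5.66
```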

Stability of Fixed Points in Optimization

At a critical point θ\theta^* of a loss function L(θ)\mathcal{L}(\theta), the Hessian H=2L(θ)H = \nabla^2 \mathcal{L}(\theta^*) determines local behavior:

  • All eigenvalues of HH positive: local minimum (the usual goal of training)
  • All eigenvalues negative: local maximum
  • Mixed signs: saddle point

In high-dimensional optimization (deep learning), saddle points vastly outnumber local minima. The fraction of negative eigenvalues at a critical point indicates how many directions lead downhill. See Positive Definite Matrices for a deeper treatment.


Guided Problems

Problem 1: Diagonalization Practice

Let A=[4123]A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}.

  1. Find the eigenvalues of AA.
  2. Find eigenvectors for each eigenvalue.
  3. Write A=PDP1A = PDP^{-1} and verify.
  4. Use the diagonalization to compute A3A^3.
💡 Solution

1. Characteristic polynomial: det(AλI)=(4λ)(3λ)2=λ27λ+10=(λ5)(λ2)\det(A - \lambda I) = (4 - \lambda)(3 - \lambda) - 2 = \lambda^2 - 7\lambda + 10 = (\lambda - 5)(\lambda - 2).

Eigenvalues: λ1=5,λ2=2\lambda_1 = 5, \lambda_2 = 2.

Quick check: tr(A)=7=5+2\text{tr}(A) = 7 = 5 + 2 ✓, det(A)=10=52\det(A) = 10 = 5 \cdot 2 ✓.

2. For λ1=5\lambda_1 = 5: (A5I)v=[1122]v=0    v1=[11](A - 5I)v = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}v = 0 \implies v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.

For λ2=2\lambda_2 = 2: (A2I)v=[2121]v=0    v2=[12](A - 2I)v = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}v = 0 \implies v_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}.

3. P=[1112]P = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}, D=[5002]D = \begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}, P1=13[2111]=[2/31/31/31/3]P^{-1} = \frac{1}{-3}\begin{bmatrix} -2 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 2/3 & 1/3 \\ 1/3 & -1/3 \end{bmatrix}.

Verify: PDP1=[1112][5002][2/31/31/31/3]PDP^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}\begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} 2/3 & 1/3 \\ 1/3 & -1/3 \end{bmatrix}

=[5254][2/31/31/31/3]=[12/33/36/39/3]=[4123]= \begin{bmatrix} 5 & 2 \\ 5 & -4 \end{bmatrix}\begin{bmatrix} 2/3 & 1/3 \\ 1/3 & -1/3 \end{bmatrix} = \begin{bmatrix} 12/3 & 3/3 \\ 6/3 & 9/3 \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}

4. A3=PD3P1A^3 = PD^3P^{-1}. D3=[125008]D^3 = \begin{bmatrix} 125 & 0 \\ 0 & 8 \end{bmatrix}.

A3=[1112][125008][2/31/31/31/3]A^3 = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}\begin{bmatrix} 125 & 0 \\ 0 & 8 \end{bmatrix}\begin{bmatrix} 2/3 & 1/3 \\ 1/3 & -1/3 \end{bmatrix}

=[125812516][2/31/31/31/3]=[258/3117/3234/3141/3]=[86397847]= \begin{bmatrix} 125 & 8 \\ 125 & -16 \end{bmatrix}\begin{bmatrix} 2/3 & 1/3 \\ 1/3 & -1/3 \end{bmatrix} = \begin{bmatrix} 258/3 & 117/3 \\ 234/3 & 141/3 \end{bmatrix} = \begin{bmatrix} 86 & 39 \\ 78 & 47 \end{bmatrix}

Quick check: tr(A3)=133=125+8=53+23\text{tr}(A^3) = 133 = 125 + 8 = 5^3 + 2^3 ✓.
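
The whole computation can be checked in a few lines:

```python
import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
P = np.array([[1.0, 1.0], [1.0, -2.0]])
D = np.diag([5.0, 2.0])

assert np.allclose(P @ D @ np.linalg.inv(P), A)        # A = P D P^{-1}
A3 = P @ np.diag([5.0**3, 2.0**3]) @ np.linalg.inv(P)  # A^3 = P D^3 P^{-1}
assert np.allclose(A3, np.linalg.matrix_power(A, 3))
print(np.round(A3))   # [[86. 39.], [78. 47.]]
```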


Problem 2: Spectral Theorem

Let S=[3113]S = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}.

  1. Find the eigenvalues and eigenvectors of SS.
  2. Verify that the eigenvectors are orthogonal.
  3. Write the spectral decomposition S=λ1q1q1T+λ2q2q2TS = \lambda_1 q_1 q_1^T + \lambda_2 q_2 q_2^T and verify.
  4. Using the spectral decomposition, what is S100S^{100}? (Express in closed form.)
💡 Solution

1. det(SλI)=(3λ)21=λ26λ+8=(λ4)(λ2)\det(S - \lambda I) = (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = (\lambda - 4)(\lambda - 2).

λ1=4\lambda_1 = 4: (S4I)v=[1111]v=0    v1=[11](S - 4I)v = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}v = 0 \implies v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, normalized: q1=12[11]q_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

λ2=2\lambda_2 = 2: (S2I)v=[1111]v=0    v2=[11](S - 2I)v = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}v = 0 \implies v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, normalized: q2=12[11]q_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

2. q1q2=12(11+1(1))=0q_1 \cdot q_2 = \frac{1}{2}(1 \cdot 1 + 1 \cdot (-1)) = 0

3. 4q1q1T=412[1111]=[2222]4 \cdot q_1 q_1^T = 4 \cdot \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}

2q2q2T=212[1111]=[1111]2 \cdot q_2 q_2^T = 2 \cdot \frac{1}{2}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}

Sum: [2222]+[1111]=[3113]=S\begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix} = S

4. S100=4100q1q1T+2100q2q2T=41002[1111]+21002[1111]S^{100} = 4^{100} q_1 q_1^T + 2^{100} q_2 q_2^T = \frac{4^{100}}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + \frac{2^{100}}{2}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}

=12[4100+210041002100410021004100+2100]= \frac{1}{2}\begin{bmatrix} 4^{100} + 2^{100} & 4^{100} - 2^{100} \\ 4^{100} - 2^{100} & 4^{100} + 2^{100} \end{bmatrix}

Since 4100=22004^{100} = 2^{200}: S100=12[2200+210022002100220021002200+2100]S^{100} = \frac{1}{2}\begin{bmatrix} 2^{200} + 2^{100} & 2^{200} - 2^{100} \\ 2^{200} - 2^{100} & 2^{200} + 2^{100} \end{bmatrix}

Note how 410021004^{100} \gg 2^{100}, so S10041002[1111]S^{100} \approx \frac{4^{100}}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, which is dominated by the larger eigenvalue's projection.


Problem 3: Markov Chain Convergence

A mouse moves between three rooms. From any room, it moves to each of the other two rooms with equal probability. The transition matrix is:

A=[01/21/21/201/21/21/20]A = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix}

  1. Verify that AA is a Markov matrix (nonnegative entries, columns sum to 1).
  2. Show that λ=1\lambda = 1 is an eigenvalue and find the steady-state vector.
  3. Find the other eigenvalues. (Hint: use trace and determinant.)
  4. Starting from room 1 (u0=[100]u_0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}), what is the long-term probability distribution?
💡 Solution

1. All entries are nonnegative ✓. Each column sums to 0+1/2+1/2=10 + 1/2 + 1/2 = 1 ✓.

2. (AI)q=[11/21/21/211/21/21/21]q=0(A - I)q = \begin{bmatrix} -1 & 1/2 & 1/2 \\ 1/2 & -1 & 1/2 \\ 1/2 & 1/2 & -1 \end{bmatrix}q = 0. By symmetry, q=[111]q = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} is a solution. Normalized (probabilities sum to 1): q=[1/31/31/3]q = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}.

3. tr(A)=0=1+λ2+λ3\text{tr}(A) = 0 = 1 + \lambda_2 + \lambda_3, so λ2+λ3=1\lambda_2 + \lambda_3 = -1.

det(A)=0+1/8+1/8000=1/4\det(A) = 0 + 1/8 + 1/8 - 0 - 0 - 0 = 1/4. And det(A)=1λ2λ3\det(A) = 1 \cdot \lambda_2 \cdot \lambda_3, so λ2λ3=1/4\lambda_2 \lambda_3 = 1/4.

The two unknown eigenvalues satisfy x2+x+1/4=0x^2 + x + 1/4 = 0, giving x=(1±11)/2=1/2x = (-1 \pm \sqrt{1-1})/2 = -1/2.

So λ2=λ3=1/2\lambda_2 = \lambda_3 = -1/2 (a repeated eigenvalue).

4. Since λ2=λ3=1/2<1|\lambda_2| = |\lambda_3| = 1/2 < 1, the transient components decay: (1/2)k0(-1/2)^k \to 0. For large kk:

uk=Aku0[1/31/31/3]u_k = A^k u_0 \to \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}

The mouse spends equal time in all three rooms, regardless of where it starts. The convergence rate is 1/2k=(1/2)k|{-1/2}|^k = (1/2)^k, so after k=10k = 10 steps the transient is 0.001\approx 0.001.


Problem 4: Defective Matrices

Consider A=[2102]A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}.

  1. Find the eigenvalues and their algebraic multiplicities.
  2. Find the eigenvectors. What is the geometric multiplicity?
  3. Is AA diagonalizable? Explain.
  4. Compute AkA^k directly using the binomial theorem on A=2I+NA = 2I + N where N=[0100]N = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}.
💡 Solution

1. det(AλI)=(2λ)2=0\det(A - \lambda I) = (2 - \lambda)^2 = 0. So λ=2\lambda = 2 with algebraic multiplicity 2.

2. (A2I)v=[0100]v=0    v2=0(A - 2I)v = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}v = 0 \implies v_2 = 0, so v=t[10]v = t\begin{bmatrix} 1 \\ 0 \end{bmatrix}.

Geometric multiplicity = 1 (only one independent eigenvector).

3. No. Diagonalization requires n=2n = 2 independent eigenvectors, but we only have 1. The matrix is defective.

4. Note N2=[0100][0100]=[0000]N^2 = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}. So Nk=0N^k = 0 for all k2k \geq 2.

Since 2I2I and NN commute: Ak=(2I+N)k=j=0k(kj)(2I)kjNjA^k = (2I + N)^k = \sum_{j=0}^{k} \binom{k}{j}(2I)^{k-j}N^j

Only j=0j = 0 and j=1j = 1 survive (since N2=0N^2 = 0):

Ak=2kI+k2k1N=[2kk2k102k]A^k = 2^k I + k \cdot 2^{k-1} N = \begin{bmatrix} 2^k & k \cdot 2^{k-1} \\ 0 & 2^k \end{bmatrix}

Verify for k=1k = 1: [2102]\begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix} ✓. For k=2k = 2: [4404]=A2=[4404]\begin{bmatrix} 4 & 4 \\ 0 & 4 \end{bmatrix} = A^2 = \begin{bmatrix} 4 & 4 \\ 0 & 4 \end{bmatrix} ✓.

Notice the off-diagonal grows as k2k1=(k/2)2kk \cdot 2^{k-1} = (k/2) \cdot 2^k, which exceeds the diagonal 2k2^k by a growing factor of k/2k/2. This polynomial-times-exponential growth in the off-diagonal is characteristic of defective matrices and does not happen for diagonalizable ones.
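
A quick verification of the closed form against direct matrix powers:

```python
import numpy as np

A = np.array([[2, 1], [0, 2]])
for k in range(1, 8):
    closed_form = np.array([[2**k, k * 2**(k - 1)],
                            [0,    2**k]])
    assert np.array_equal(np.linalg.matrix_power(A, k), closed_form)
print("A^k = [[2^k, k*2^(k-1)], [0, 2^k]] holds for k = 1..7")
```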


References

  1. Strang, Gilbert - Introduction to Linear Algebra, 5th ed., Chapter 6 (Sections 6.1-6.6)
  2. MIT OpenCourseWare - 18.06 Linear Algebra - Lecture 21: Eigenvalues and Eigenvectors
  3. MIT OpenCourseWare - 18.06 Linear Algebra - Lecture 22: Diagonalization and Powers of A
  4. MIT OpenCourseWare - 18.06 Linear Algebra - Lecture 24: Markov Matrices
  5. MIT OpenCourseWare - 18.06 Linear Algebra - Lecture 25: Symmetric Matrices and Positive Definiteness
  6. Deisenroth, Faisal, and Ong - Mathematics for Machine Learning, Chapter 4.2-4.4