Basic Linear Algebra Concepts
Why These Concepts Matter
Before diving into vector spaces and linear transformations, we need to establish the fundamental operations of linear algebra. These concepts appear everywhere in machine learning: from computing similarity between word embeddings to understanding how data varies together.
Consider these common scenarios:
Measuring Similarity: How do we quantify how similar two data points are? (Inner product)
Creating Structure: How do we build matrices from vectors? (Outer product)
Understanding Spread: How do we measure how variables change together? (Covariance and Correlation)
Inner Product (Dot Product)
Intuition
The inner product measures how much two vectors "point in the same direction." If two vectors are perpendicular (orthogonal), their inner product is zero. If they point in the same direction, the inner product is positive and large.
Definition: Inner Product (Dot Product)
The inner product (or dot product) of two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is:

$$\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$

The result is a scalar.
Geometric Interpretation
Theorem: Geometric Form of Inner Product
For vectors $\mathbf{u}$ and $\mathbf{v}$, the inner product relates to the angle $\theta$ between them:

$$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$$

where $\|\cdot\|$ denotes the Euclidean norm (length).
This gives us:
$\mathbf{u} \cdot \mathbf{v} > 0$: angle $< 90°$ (the vectors point in roughly the same direction)
$\mathbf{u} \cdot \mathbf{v} = 0$: angle $= 90°$ (the vectors are orthogonal)
$\mathbf{u} \cdot \mathbf{v} < 0$: angle $> 90°$ (the vectors point in roughly opposite directions)
Key Properties
| Property | Formula | Description |
|---|---|---|
| Commutative | $\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}$ | Order doesn't matter |
| Distributive | $\mathbf{u} \cdot (\mathbf{v} + \mathbf{w}) = \mathbf{u} \cdot \mathbf{v} + \mathbf{u} \cdot \mathbf{w}$ | Distributes over addition |
| Scalar multiplication | $(c\mathbf{u}) \cdot \mathbf{v} = c(\mathbf{u} \cdot \mathbf{v})$ | Scalars factor out |
| Self-dot product | $\mathbf{v} \cdot \mathbf{v} = \Vert\mathbf{v}\Vert^2$ | Gives the squared length |
Applications in ML:
Cosine similarity: $\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$ (see the sketch below)
Projection: projecting one vector onto another
Neural network layers: $\mathbf{w}^T \mathbf{x} + b$ is the core computation
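To make these formulas concrete, here is a minimal NumPy sketch (my own illustration, not from the original text) that computes the inner product, recovers the angle from the geometric form, and evaluates the cosine similarity defined above:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Inner product: sum of elementwise products (a scalar)
dot = np.dot(u, v)  # 1*4 + 2*5 + 3*6 = 32.0

# Geometric form: u . v = ||u|| ||v|| cos(theta)
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
theta_deg = np.degrees(np.arccos(cos_theta))

# Cosine similarity is exactly cos(theta)
print(dot, round(cos_theta, 4), round(theta_deg, 2))
```

Because `cos_theta` is positive, the angle is below 90° and the two vectors point in roughly the same direction.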
Outer Product
Intuition
While the inner product takes two vectors and produces a scalar, the outer product takes two vectors and produces a matrix. The key insight is that each row of the resulting matrix is a scaled copy of $\mathbf{v}^T$:

$$\text{Row } i = u_i \cdot \mathbf{v}^T$$

We are essentially stacking scaled versions of the row vector $\mathbf{v}^T$, where each scaling factor comes from the corresponding entry of $\mathbf{u}$.
Definition: Outer Product
The outer product of two vectors $\mathbf{u} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathbb{R}^n$ is the $m \times n$ matrix:

$$\mathbf{u} \mathbf{v}^T = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 \mathbf{v}^T \\ u_2 \mathbf{v}^T \\ \vdots \\ u_m \mathbf{v}^T \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}$$
Example:
$$\mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$$

$$\mathbf{u} \mathbf{v}^T = \begin{bmatrix} 1 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \\ 2 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \\ 3 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}$$

Notice that each row is a scalar multiple of $\begin{bmatrix} 4 & 5 \end{bmatrix}$.
Rank-1 Structure
The outer product always produces a rank-1 matrix (assuming both vectors are non-zero). All rows are scalar multiples of $\mathbf{v}^T$, and all columns are scalar multiples of $\mathbf{u}$. Therefore, the row space and column space are both 1-dimensional: $\text{span}\{\mathbf{v}\}$ and $\text{span}\{\mathbf{u}\}$, respectively.
Theorem: Outer Product is Rank-1
For non-zero vectors $\mathbf{u} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathbb{R}^n$:

$$\text{rank}(\mathbf{u}\mathbf{v}^T) = 1$$
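The example above and the rank-1 claim are easy to verify numerically; the following is a small sketch assuming NumPy (`np.outer` builds the matrix, `np.linalg.matrix_rank` checks its rank):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5])

# Outer product: a 3x2 matrix whose i-th row is u[i] * v
M = np.outer(u, v)
print(M)
# [[ 4  5]
#  [ 8 10]
#  [12 15]]

# Every row is a multiple of v, so the matrix has rank 1
print(np.linalg.matrix_rank(M))  # 1
```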
Key Properties
| Property | Description |
|---|---|
| Rank | $\text{rank}(\mathbf{u}\mathbf{v}^T) = 1$ (if $\mathbf{u}, \mathbf{v} \neq \mathbf{0}$) |
| Column space | $C(\mathbf{u}\mathbf{v}^T) = \text{span}\{\mathbf{u}\}$ |
| Row space | $C((\mathbf{u}\mathbf{v}^T)^T) = \text{span}\{\mathbf{v}\}$ |
| Not commutative | $\mathbf{u}\mathbf{v}^T \neq \mathbf{v}\mathbf{u}^T$ in general (the shapes even differ when $m \neq n$) |
| Trace | $\text{tr}(\mathbf{u}\mathbf{v}^T) = \mathbf{u}^T \mathbf{v}$ (when $m = n$) |
Inner Product vs Outer Product
| Aspect | Inner Product | Outer Product |
|---|---|---|
| Notation | $\mathbf{u}^T \mathbf{v}$ | $\mathbf{u} \mathbf{v}^T$ |
| Result | Scalar | Matrix |
| Dimensions | $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ (same dimension) | $\mathbf{u} \in \mathbb{R}^m$, $\mathbf{v} \in \mathbb{R}^n$ (can differ) |
| Output size | $1 \times 1$ | $m \times n$ |
| Rank of result | N/A (scalar) | 1 |
Applications in ML:
Rank-1 updates: many algorithms update matrices via $A + \mathbf{u}\mathbf{v}^T$
Low-rank approximation: the SVD expresses any matrix as a sum of rank-1 outer products (see the sketch after this list)
Attention mechanisms: query-key outer products
Gradient computation: weight gradients in neural networks are outer products
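To illustrate the low-rank-approximation point, here is a short sketch (my own example matrix, assuming NumPy) that rebuilds a matrix from the rank-1 outer products given by its SVD:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [2.0, 2.0]])

# SVD: A = sum_k s_k * u_k v_k^T, i.e. a weighted sum of rank-1 outer products
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = sum(s[k] * np.outer(U[:, k], Vt[k, :]) for k in range(len(s)))
print(np.allclose(A, A_rebuilt))  # True

# Keeping only the largest singular value gives a rank-1 approximation of A
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.matrix_rank(A_rank1))  # 1
```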
Vector Norms
Intuition
A norm measures the "size" or "length" of a vector. Different norms emphasize different aspects of a vector.
L2 Norm (Euclidean Norm)
The L2 norm (or Euclidean norm) of a vector $\mathbf{v} \in \mathbb{R}^n$ is:

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

This is the standard Euclidean distance from the origin.
L1 Norm (Manhattan Norm)
The L1 norm (or Manhattan norm) of a vector $\mathbf{v} \in \mathbb{R}^n$ is:

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i| = |v_1| + |v_2| + \cdots + |v_n|$$

This measures distance by moving along grid lines (like city blocks).
Comparison of Norms
| Norm | Formula | Geometric Shape (Unit Ball) | Use Case |
|---|---|---|---|
| L1 | $\sum_i \lvert v_i \rvert$ | Diamond | Sparsity, regularization (Lasso) |
| L2 | $\sqrt{\sum_i v_i^2}$ | Circle (sphere) | Standard distance, regularization (Ridge) |
| L∞ | $\max_i \lvert v_i \rvert$ | Square | Largest absolute entry (worst-case deviation) |
Applications in ML:
L2 regularization (Ridge): prevents large weights
L1 regularization (Lasso): encourages sparse weights
Normalization: unit-norm vectors via $\hat{\mathbf{v}} = \mathbf{v} / \|\mathbf{v}\|$ (see the sketch below)
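A minimal sketch (assuming NumPy) of the norms above and of normalizing a vector to unit length:

```python
import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)            # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(v, 1)         # |3| + |-4|      = 7.0
linf = np.linalg.norm(v, np.inf)  # max(|3|, |-4|)  = 4.0

# Unit-norm vector: v_hat = v / ||v||
v_hat = v / l2
print(l2, l1, linf, np.linalg.norm(v_hat))  # 5.0 7.0 4.0 1.0
```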
Transpose
The transpose of an $m \times n$ matrix $A$, denoted $A^T$, is the $n \times m$ matrix obtained by swapping rows and columns:

$$(A^T)_{ij} = A_{ji}$$
Key Properties
| Property | Formula |
|---|---|
| Double transpose | $(A^T)^T = A$ |
| Sum | $(A + B)^T = A^T + B^T$ |
| Scalar | $(cA)^T = cA^T$ |
| Product | $(AB)^T = B^T A^T$ (order reverses!) |
| Inverse | $(A^{-1})^T = (A^T)^{-1}$ |
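A quick numerical check of the product rule $(AB)^T = B^T A^T$ (a sketch, assuming NumPy):

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])   # 3x2
B = np.array([[1, 0, 2], [0, 1, 3]])     # 2x3

# (AB)^T equals B^T A^T -- note that the order reverses
lhs = (A @ B).T
rhs = B.T @ A.T
print(np.array_equal(lhs, rhs))  # True

# (A^T)_{ij} = A_{ji}: swapping rows and columns
print(A.T[0, 2] == A[2, 0])  # True
```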
Trace
The trace of a square $n \times n$ matrix $A$, denoted $\text{tr}(A)$, is the sum of its diagonal entries:

$$\text{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}$$
Example:
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \implies \text{tr}(A) = 1 + 5 + 9 = 15$$
Key Properties
| Property | Formula | Description |
|---|---|---|
| Linearity | $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$ | Trace of a sum = sum of the traces |
| Scalar | $\text{tr}(cA) = c \cdot \text{tr}(A)$ | Scalars factor out |
| Transpose | $\text{tr}(A^T) = \text{tr}(A)$ | Same diagonal |
| Cyclic | $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$ | Cyclic permutations are equal |
| Inner product | $\text{tr}(\mathbf{u}\mathbf{v}^T) = \mathbf{u}^T\mathbf{v}$ | Connects the outer and inner products |
Applications in ML:
Sum of eigenvalues: $\text{tr}(A) = \sum_i \lambda_i$
Frobenius norm: $\|A\|_F^2 = \text{tr}(A^T A)$
Loss functions: the trace appears in matrix derivatives (a few of these identities are checked numerically below)
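A sketch (assuming NumPy) that checks the trace example above and a few of the listed identities:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
print(np.trace(A))  # 1 + 5 + 9 = 15.0

# Cyclic property: tr(ABC) = tr(BCA)
rng = np.random.default_rng(0)
B, C = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True

# tr(u v^T) = u^T v connects the outer and inner products
u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
print(np.trace(np.outer(u, v)), u @ v)  # 32.0 32.0

# Frobenius norm: ||A||_F^2 = tr(A^T A)
print(np.isclose(np.trace(A.T @ A), np.linalg.norm(A, 'fro') ** 2))  # True
```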
Eigenvalues and Eigenvectors
Intuition
When a matrix $A$ multiplies most vectors, it changes both their direction and their magnitude. However, certain special vectors only get scaled: their direction remains unchanged (or is exactly reversed). These are eigenvectors, and the scaling factors are eigenvalues.
Geometrically, eigenvectors represent the "natural axes" of a linear transformation. Along these directions, the matrix acts as simple scaling.
Definition: Eigenvalue and Eigenvector
For a square matrix $A \in \mathbb{R}^{n \times n}$, a non-zero vector $\mathbf{v} \in \mathbb{R}^n$ is an eigenvector of $A$ if:

$$A\mathbf{v} = \lambda \mathbf{v}$$

for some scalar $\lambda \in \mathbb{R}$ (or $\mathbb{C}$). The scalar $\lambda$ is the corresponding eigenvalue.

Interpretation: Multiplying $\mathbf{v}$ by $A$ produces the same result as multiplying $\mathbf{v}$ by the scalar $\lambda$.
Finding Eigenvalues: The Characteristic Equation
Rearranging $A\mathbf{v} = \lambda\mathbf{v}$:

$$(A - \lambda I)\mathbf{v} = \mathbf{0}$$

For a non-zero solution $\mathbf{v}$ to exist, the matrix $(A - \lambda I)$ must be singular:
Definition: Characteristic Equation
The eigenvalues of $A$ are the roots of the characteristic polynomial:

$$\det(A - \lambda I) = 0$$
Example:
$$A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$$
Step 1: Set up characteristic equation
$$\det(A - \lambda I) = \det\begin{bmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{bmatrix} = (4-\lambda)(3-\lambda) - 2 = 0$$

$$\lambda^2 - 7\lambda + 10 = 0$$

$$(\lambda - 5)(\lambda - 2) = 0$$

Eigenvalues: $\lambda_1 = 5$, $\lambda_2 = 2$
Step 2: Find eigenvectors
For $\lambda_1 = 5$:

$$(A - 5I)\mathbf{v} = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}\mathbf{v} = \mathbf{0} \implies \mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

For $\lambda_2 = 2$:

$$(A - 2I)\mathbf{v} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}\mathbf{v} = \mathbf{0} \implies \mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$$
Verification: $A\mathbf{v}_1 = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5\begin{bmatrix} 1 \\ 1 \end{bmatrix} = 5\mathbf{v}_1$ ✓
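The same example can be checked with NumPy (a sketch, not part of the original derivation); note that `np.linalg.eig` returns unit-length eigenvectors, so they may differ from the hand-computed ones by a scalar factor:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)  # [5. 2.] (the order is not guaranteed)

# Each column of eigvecs is an eigenvector: A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))  # True, True

# Sum of eigenvalues = trace, product of eigenvalues = determinant
print(np.isclose(eigvals.sum(), np.trace(A)))        # True
print(np.isclose(eigvals.prod(), np.linalg.det(A)))  # True
```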
Key Properties
| Property | Formula | Description |
|---|---|---|
| Trace | $\text{tr}(A) = \sum_{i=1}^n \lambda_i$ | Sum of eigenvalues equals the trace |
| Determinant | $\det(A) = \prod_{i=1}^n \lambda_i$ | Product of eigenvalues equals the determinant |
| Invertibility | $A$ invertible $\iff$ all $\lambda_i \neq 0$ | A zero eigenvalue means the matrix is singular |
| Powers | $A^k\mathbf{v} = \lambda^k\mathbf{v}$ | Eigenvalues of $A^k$ are $\lambda^k$ |
| Inverse | $A^{-1}\mathbf{v} = \frac{1}{\lambda}\mathbf{v}$ | Eigenvalues of $A^{-1}$ are $1/\lambda$ |
Covariance and Correlation
This section covers how to measure relationships between variables using matrix representations.
Data Matrix Setup
Consider a data matrix $X$ with $n$ observations (rows) and $p$ features (columns):

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Each row $\mathbf{x}_i^T$ is a single data point; each column represents a feature.
Covariance
The covariance between features $j$ and $k$ measures how they vary together:

$$\text{Cov}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$ is the mean of feature $j$.
$\text{Cov} > 0$: the features increase together
$\text{Cov} < 0$: one increases as the other decreases
$\text{Cov} = 0$: no linear relationship
Population vs Sample Covariance
The formula above uses $\frac{1}{n}$ (population covariance), which is appropriate when you have the entire population.

When working with a sample from a larger population, use $\frac{1}{n-1}$ (sample covariance) for an unbiased estimator:

$$\text{Cov}_{\text{sample}}(X_j, X_k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

The $(n-1)$ factor is called Bessel's correction. Most statistical software uses the sample covariance by default.
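A small sketch (assuming NumPy) showing both conventions; `np.cov` applies Bessel's correction by default:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 6.0])
n = len(x)

xc, yc = x - x.mean(), y - y.mean()   # center each variable

cov_pop = (xc * yc).sum() / n         # population: divide by n
cov_samp = (xc * yc).sum() / (n - 1)  # sample: divide by n - 1

# np.cov defaults to the sample convention (ddof=1)
print(cov_pop, cov_samp, np.cov(x, y)[0, 1])  # 8/3 (population) vs 4.0 (sample); np.cov gives 4.0
```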
Covariance Matrix
The covariance matrix packages all pairwise covariances into a single matrix. Entry $(j, k)$ tells us how features $j$ and $k$ co-vary.
Definition: Covariance Matrix
For centered data $\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^T$ (each row has the mean subtracted), the covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ is:

$$\Sigma = \frac{1}{n} \tilde{X}^T \tilde{X} \quad \text{(population)}$$

$$\Sigma = \frac{1}{n-1} \tilde{X}^T \tilde{X} \quad \text{(sample)}$$

Entry $(j, k)$ is:

$$\Sigma_{jk} = \text{Cov}(X_j, X_k)$$
Example:
Consider 3 data points in $\mathbb{R}^2$:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$$
Step 1: Compute means
$$\bar{x}_1 = \frac{1+3+5}{3} = 3, \quad \bar{x}_2 = \frac{2+4+6}{3} = 4$$
Step 2: Center the data
$$\tilde{X} = \begin{bmatrix} 1-3 & 2-4 \\ 3-3 & 4-4 \\ 5-3 & 6-4 \end{bmatrix} = \begin{bmatrix} -2 & -2 \\ 0 & 0 \\ 2 & 2 \end{bmatrix}$$
Step 3: Compute covariance matrix (population)
$$\Sigma = \frac{1}{3} \tilde{X}^T \tilde{X} = \frac{1}{3} \begin{bmatrix} -2 & 0 & 2 \\ -2 & 0 & 2 \end{bmatrix} \begin{bmatrix} -2 & -2 \\ 0 & 0 \\ 2 & 2 \end{bmatrix} = \frac{1}{3} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} = \begin{bmatrix} 8/3 & 8/3 \\ 8/3 & 8/3 \end{bmatrix}$$
Interpretation:
$\Sigma_{11} = 8/3$: variance of feature 1
$\Sigma_{22} = 8/3$: variance of feature 2
$\Sigma_{12} = \Sigma_{21} = 8/3$: positive covariance (the features increase together); the NumPy check below reproduces these numbers
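The worked example, reproduced in a few lines of NumPy (a sketch, not part of the original text):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
n = X.shape[0]

# Center each column (feature) by subtracting its mean
X_tilde = X - X.mean(axis=0)

Sigma_pop = X_tilde.T @ X_tilde / n         # population covariance: every entry is 8/3
Sigma_samp = X_tilde.T @ X_tilde / (n - 1)  # sample covariance: every entry is 4

print(Sigma_pop)  # each entry is 8/3, roughly 2.667
print(np.allclose(Sigma_samp, np.cov(X, rowvar=False)))  # True
```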
Key Properties of Covariance Matrix
| Property | Description |
|---|---|
| Symmetric | $\Sigma = \Sigma^T$ (since $\text{Cov}(X_j, X_k) = \text{Cov}(X_k, X_j)$) |
| Positive semi-definite | $\mathbf{v}^T \Sigma \mathbf{v} \geq 0$ for all $\mathbf{v}$ |
| Diagonal = variances | $\Sigma_{jj} = \text{Var}(X_j)$ |
| Eigenvalues | All eigenvalues are $\geq 0$ |
Correlation
Correlation is a normalized version of covariance. It removes the scale of variables, showing only the strength and direction of linear relationships.
The correlation between features $j$ and $k$ is:

$$\rho_{jk} = \frac{\text{Cov}(X_j, X_k)}{\sigma_j \sigma_k} = \frac{\Sigma_{jk}}{\sqrt{\Sigma_{jj}} \sqrt{\Sigma_{kk}}}$$

where $\sigma_j = \sqrt{\text{Var}(X_j)}$ is the standard deviation of feature $j$.
Correlation Matrix
Definition: Correlation Matrix
The correlation matrix $R \in \mathbb{R}^{p \times p}$ has entries:

$$R_{jk} = \rho_{jk} = \frac{\Sigma_{jk}}{\sigma_j \sigma_k}$$

Equivalently, if $D = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p)$ is the diagonal matrix of standard deviations:

$$R = D^{-1} \Sigma D^{-1}$$
Example (continuing from above):
$$\Sigma = \begin{bmatrix} 8/3 & 8/3 \\ 8/3 & 8/3 \end{bmatrix}$$

Standard deviations: $\sigma_1 = \sqrt{8/3}$, $\sigma_2 = \sqrt{8/3}$

$$\rho_{12} = \frac{8/3}{\sqrt{8/3} \cdot \sqrt{8/3}} = \frac{8/3}{8/3} = 1$$

$$R = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$
The correlation of 1 indicates a perfect positive linear relationship, which makes sense here: feature 2 is exactly feature 1 plus 1.
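Continuing the same example, a sketch (assuming NumPy) that forms $R = D^{-1}\Sigma D^{-1}$ directly and compares it with `np.corrcoef`:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

X_tilde = X - X.mean(axis=0)
Sigma = X_tilde.T @ X_tilde / X.shape[0]   # population covariance matrix

# D^{-1}: reciprocal standard deviations on the diagonal
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = D_inv @ Sigma @ D_inv
print(R)  # [[1. 1.]
          #  [1. 1.]]

# np.corrcoef treats rows as variables by default, so pass rowvar=False
print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```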
Key Properties of Correlation Matrix
| Property | Description |
|---|---|
| Bounded | $-1 \leq \rho_{jk} \leq 1$ |
| Diagonal = 1 | $R_{jj} = 1$ (perfect correlation with itself) |
| Symmetric | $R = R^T$ |
| Scale-invariant | Unaffected by shifting or (positively) rescaling individual variables |
Covariance vs Correlation
| Aspect | Covariance Matrix $\Sigma$ | Correlation Matrix $R$ |
|---|---|---|
| Diagonal entries | Variances | Always 1 |
| Range | $(-\infty, \infty)$ | $[-1, 1]$ |
| Scale | Depends on variable units | Unit-free |
| Use case | PCA, Mahalanobis distance | Comparing relationships |
Summary
| Concept | Definition | Key Formula |
|---|---|---|
| Inner Product | Sum of component products | $\mathbf{u}^T \mathbf{v} = \sum u_i v_i$ |
| Outer Product | Matrix from two vectors (rank-1) | $\mathbf{u}\mathbf{v}^T$, row $i$ = $u_i \mathbf{v}^T$ |
| L2 Norm | Euclidean length | $\Vert\mathbf{v}\Vert_2 = \sqrt{\sum v_i^2}$ |
| Transpose | Swap rows and columns | $(A^T)_{ij} = A_{ji}$ |
| Trace | Sum of diagonal entries | $\text{tr}(A) = \sum a_{ii}$ |
| Eigenvalue/Eigenvector | Scaling directions of a matrix | $A\mathbf{v} = \lambda\mathbf{v}$ |
| Covariance Matrix | Pairwise covariances | $\Sigma = \frac{1}{n}\tilde{X}^T\tilde{X}$ |
| Correlation Matrix | Normalized covariances | $R = D^{-1}\Sigma D^{-1}$ |