
Basic Linear Algebra Concepts


Why These Concepts Matter

Before diving into vector spaces and linear transformations, we need to establish the fundamental operations of linear algebra. These concepts appear everywhere in machine learning: from computing similarity between word embeddings to understanding how data varies together.

Consider these common scenarios:

  1. Measuring Similarity: How do we quantify how similar two data points are? (Inner product)
  2. Creating Structure: How do we build matrices from vectors? (Outer product)
  3. Understanding Spread: How do we measure how variables change together? (Covariance and Correlation)

Inner Product (Dot Product)

Intuition

The inner product measures how much two vectors "point in the same direction." If two vectors are perpendicular (orthogonal), their inner product is zero. If they point in the same direction, the inner product is positive and large.

Formal Definition

Definition: Inner Product (Dot Product)

The inner product (or dot product) of two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is:

$$\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$

The result is a scalar.

Geometric Interpretation

Theorem: Geometric Form of Inner Product

For vectors $\mathbf{u}$ and $\mathbf{v}$, the inner product relates to the angle $\theta$ between them:

$$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$$

where $\|\cdot\|$ denotes the Euclidean norm (length).

This gives us:

  • $\mathbf{u} \cdot \mathbf{v} > 0$: angle $< 90°$ (vectors point in roughly the same direction)
  • $\mathbf{u} \cdot \mathbf{v} = 0$: angle $= 90°$ (vectors are orthogonal)
  • $\mathbf{u} \cdot \mathbf{v} < 0$: angle $> 90°$ (vectors point in roughly opposite directions)

Key Properties

| Property | Formula | Description |
|---|---|---|
| Commutative | $\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}$ | Order doesn't matter |
| Distributive | $\mathbf{u} \cdot (\mathbf{v} + \mathbf{w}) = \mathbf{u} \cdot \mathbf{v} + \mathbf{u} \cdot \mathbf{w}$ | Distributes over addition |
| Scalar multiplication | $(c\mathbf{u}) \cdot \mathbf{v} = c(\mathbf{u} \cdot \mathbf{v})$ | Scalars factor out |
| Self-dot product | $\mathbf{v} \cdot \mathbf{v} = \Vert\mathbf{v}\Vert^2$ | Gives squared length |

Applications in ML:

  • Cosine similarity: $\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$
  • Projection: projecting one vector onto another
  • Neural network layers: $\mathbf{w}^T \mathbf{x} + b$ is the core computation
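
A minimal NumPy sketch (the vector values are illustrative, not from the text) that checks the algebraic and geometric forms of the inner product and computes cosine similarity:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Algebraic form: sum of componentwise products
dot = np.dot(u, v)                                  # 1*4 + 2*5 + 3*6 = 32

# Geometric form: u . v = ||u|| ||v|| cos(theta)
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(dot)        # 32.0
print(cos_theta)  # cosine similarity sim(u, v), about 0.975
print(theta_deg)  # angle between u and v, about 12.9 degrees
```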

Outer Product

Intuition

While the inner product takes two vectors and produces a scalar, the outer product takes two vectors and produces a matrix. The key insight is that each row of the resulting matrix is a scaled copy of $\mathbf{v}^T$:

$$\text{Row } i = u_i \cdot \mathbf{v}^T$$

We are essentially stacking scaled versions of the row vector $\mathbf{v}^T$, where each scaling factor comes from the corresponding entry of $\mathbf{u}$.

Formal Definition

Definition: Outer Product

The outer product of two vectors $\mathbf{u} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathbb{R}^n$ is the $m \times n$ matrix:

$$\mathbf{u} \mathbf{v}^T = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 \mathbf{v}^T \\ u_2 \mathbf{v}^T \\ \vdots \\ u_m \mathbf{v}^T \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}$$

Example:

$$\mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$$

$$\mathbf{u} \mathbf{v}^T = \begin{bmatrix} 1 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \\ 2 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \\ 3 \cdot \begin{bmatrix} 4 & 5 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}$$

Notice that each row is a scalar multiple of $\begin{bmatrix} 4 & 5 \end{bmatrix}$.

Rank-1 Structure

The outer product always produces a rank-1 matrix (assuming both vectors are non-zero). All rows are scalar multiples of $\mathbf{v}^T$, and all columns are scalar multiples of $\mathbf{u}$. Therefore, the row space and column space are both 1-dimensional: $\text{span}\{\mathbf{v}\}$ and $\text{span}\{\mathbf{u}\}$, respectively.

Theorem: Outer Product is Rank-1

For non-zero vectors $\mathbf{u} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathbb{R}^n$:

$$\text{rank}(\mathbf{u}\mathbf{v}^T) = 1$$
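
A short NumPy check of the worked example above, confirming the stacked-row structure and the rank-1 claim:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5])

M = np.outer(u, v)   # same as u[:, None] * v[None, :]
print(M)
# [[ 4  5]
#  [ 8 10]
#  [12 15]]

print(np.allclose(M[1], 2 * v))    # True: row i is u[i] * v
print(np.linalg.matrix_rank(M))    # 1
```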

Key Properties

| Property | Description |
|---|---|
| Rank | $\text{rank}(\mathbf{u}\mathbf{v}^T) = 1$ (if $\mathbf{u}, \mathbf{v} \neq \mathbf{0}$) |
| Column space | $C(\mathbf{u}\mathbf{v}^T) = \text{span}\{\mathbf{u}\}$ |
| Row space | $C((\mathbf{u}\mathbf{v}^T)^T) = \text{span}\{\mathbf{v}\}$ |
| Not commutative | $\mathbf{u}\mathbf{v}^T \neq \mathbf{v}\mathbf{u}^T$ (different dimensions!) |
| Trace | $\text{tr}(\mathbf{u}\mathbf{v}^T) = \mathbf{u}^T \mathbf{v}$ (when $m = n$) |

Inner Product vs Outer Product

| Aspect | Inner Product | Outer Product |
|---|---|---|
| Notation | $\mathbf{u}^T \mathbf{v}$ | $\mathbf{u} \mathbf{v}^T$ |
| Result | Scalar | Matrix |
| Dimensions | $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ (same dimension) | $\mathbf{u} \in \mathbb{R}^m$, $\mathbf{v} \in \mathbb{R}^n$ (can differ) |
| Output size | $1 \times 1$ | $m \times n$ |
| Rank of result | N/A (scalar) | 1 |

Applications in ML:

  • Rank-1 updates: many algorithms update matrices via $A + \mathbf{u}\mathbf{v}^T$
  • Low-rank approximation: SVD expresses any matrix as sum of rank-1 outer products
  • Attention mechanisms: query-key outer products
  • Gradient computation: weight gradients in neural networks are outer products

Vector Norms

Intuition

A norm measures the "size" or "length" of a vector. Different norms emphasize different aspects of a vector.

L2 Norm (Euclidean Norm)

Definition: L2 Norm

The L2 norm (or Euclidean norm) of a vector $\mathbf{v} \in \mathbb{R}^n$ is:

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

This is the standard Euclidean distance from the origin.

L1 Norm (Manhattan Norm)

Definition: L1 Norm

The L1 norm (or Manhattan norm) of a vector $\mathbf{v} \in \mathbb{R}^n$ is:

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i| = |v_1| + |v_2| + \cdots + |v_n|$$

This measures distance by moving along grid lines (like city blocks).

Comparison of Norms

| Norm | Formula | Geometric Shape (Unit Ball) | Use Case |
|---|---|---|---|
| L1 | $\sum_i \lvert v_i \rvert$ | Diamond (cross-polytope) | Sparsity, regularization (Lasso) |
| L2 | $\sqrt{\sum_i v_i^2}$ | Circle (sphere) | Standard distance, regularization (Ridge) |
| L∞ | $\max_i \lvert v_i \rvert$ | Square (hypercube) | Bounding the largest component |

Applications in ML:

  • L2 regularization (Ridge): prevents large weights
  • L1 regularization (Lasso): encourages sparse weights
  • Normalization: unit norm vectors via $\hat{\mathbf{v}} = \mathbf{v} / \|\mathbf{v}\|$
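
A small sketch comparing the three norms on one illustrative vector and normalizing it to unit L2 length:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l1   = np.linalg.norm(v, ord=1)       # |3| + |-4| + |0| = 7
l2   = np.linalg.norm(v)              # sqrt(9 + 16 + 0) = 5
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |4|, |0|) = 4

v_hat = v / l2                        # unit L2-norm version of v
print(l1, l2, linf)                   # 7.0 5.0 4.0
print(np.linalg.norm(v_hat))          # 1.0
```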

Transpose

Definition: Transpose

The transpose of an $m \times n$ matrix $A$, denoted $A^T$, is the $n \times m$ matrix obtained by swapping rows and columns:

$$(A^T)_{ij} = A_{ji}$$

Key Properties

| Property | Formula |
|---|---|
| Double transpose | $(A^T)^T = A$ |
| Sum | $(A + B)^T = A^T + B^T$ |
| Scalar | $(cA)^T = cA^T$ |
| Product | $(AB)^T = B^T A^T$ (order reverses!) |
| Inverse | $(A^{-1})^T = (A^T)^{-1}$ |
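
A quick NumPy check of the product and inverse rules; the matrices and their shapes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(3, 3))

# (AB)^T = B^T A^T  -- note the reversed order
print(np.allclose((A @ B).T, B.T @ A.T))                    # True

# (C^{-1})^T = (C^T)^{-1}  (C must be square and invertible)
print(np.allclose(np.linalg.inv(C).T, np.linalg.inv(C.T)))  # True
```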

Trace

Definition: Trace

The trace of a square $n \times n$ matrix $A$, denoted $\text{tr}(A)$, is the sum of its diagonal entries:

$$\text{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}$$

Example:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \implies \text{tr}(A) = 1 + 5 + 9 = 15$$

Key Properties

| Property | Formula | Description |
|---|---|---|
| Linearity | $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$ | Trace of sum = sum of traces |
| Scalar | $\text{tr}(cA) = c \cdot \text{tr}(A)$ | Scalars factor out |
| Transpose | $\text{tr}(A^T) = \text{tr}(A)$ | Same diagonal |
| Cyclic | $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$ | Cyclic permutations are equal |
| Inner product | $\text{tr}(\mathbf{u}\mathbf{v}^T) = \mathbf{u}^T\mathbf{v}$ | Connects outer and inner products |

Applications in ML:

  • Sum of eigenvalues: $\text{tr}(A) = \sum \lambda_i$
  • Frobenius norm: $\|A\|_F^2 = \text{tr}(A^T A)$
  • Loss functions: trace appears in matrix derivatives
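
A sketch verifying the cyclic property, the trace/eigenvalue identity, and the Frobenius-norm formula on small random matrices (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB)
t = [np.trace(A @ B @ C), np.trace(B @ C @ A), np.trace(C @ A @ B)]
print(np.allclose(t[0], t[1]) and np.allclose(t[1], t[2]))          # True

# Trace equals the sum of eigenvalues
print(np.allclose(np.trace(A), np.sum(np.linalg.eigvals(A))))       # True

# Frobenius norm: ||A||_F^2 = tr(A^T A)
print(np.allclose(np.linalg.norm(A, 'fro')**2, np.trace(A.T @ A)))  # True
```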

Eigenvalues and Eigenvectors

Intuition

When a matrix $A$ multiplies most vectors, it changes both their direction and magnitude. However, certain special vectors only get scaled: their direction remains unchanged (or exactly reversed). These are eigenvectors, and the scaling factors are eigenvalues.

Geometrically, eigenvectors represent the "natural axes" of a linear transformation. Along these directions, the matrix acts as simple scaling.

Formal Definition

Definition: Eigenvalue and Eigenvector

For a square matrix $A \in \mathbb{R}^{n \times n}$, a non-zero vector $\mathbf{v} \in \mathbb{R}^n$ is an eigenvector of $A$ if:

$$A\mathbf{v} = \lambda \mathbf{v}$$

for some scalar $\lambda \in \mathbb{R}$ (or $\mathbb{C}$). The scalar $\lambda$ is the corresponding eigenvalue.

Interpretation: Multiplying $\mathbf{v}$ by $A$ produces the same result as multiplying $\mathbf{v}$ by the scalar $\lambda$.

Finding Eigenvalues: The Characteristic Equation

Rearranging $A\mathbf{v} = \lambda\mathbf{v}$:

$$(A - \lambda I)\mathbf{v} = \mathbf{0}$$

For a non-zero solution $\mathbf{v}$ to exist, the matrix $(A - \lambda I)$ must be singular:

Definition: Characteristic Equation

The eigenvalues of $A$ are the roots of the characteristic polynomial:

$$\det(A - \lambda I) = 0$$

Example:

$$A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$$

Step 1: Set up characteristic equation

$$\det(A - \lambda I) = \det\begin{bmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{bmatrix} = (4-\lambda)(3-\lambda) - 2 = 0$$

$$\lambda^2 - 7\lambda + 10 = 0$$

$$(\lambda - 5)(\lambda - 2) = 0$$

Eigenvalues: $\lambda_1 = 5$, $\lambda_2 = 2$

Step 2: Find eigenvectors

For $\lambda_1 = 5$:

$$(A - 5I)\mathbf{v} = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}\mathbf{v} = \mathbf{0} \implies \mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

For $\lambda_2 = 2$:

$$(A - 2I)\mathbf{v} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}\mathbf{v} = \mathbf{0} \implies \mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$$

Verification: $A\mathbf{v}_1 = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5\begin{bmatrix} 1 \\ 1 \end{bmatrix} = 5\mathbf{v}_1$
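
The same example in NumPy; note that np.linalg.eig returns unit-length eigenvectors, so its columns may differ from $\mathbf{v}_1$ and $\mathbf{v}_2$ above by a scalar factor:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
print(eigvals)                        # 5 and 2 (order not guaranteed)

# Check A v = lambda v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # True, True
```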

Key Properties

| Property | Formula | Description |
|---|---|---|
| Trace | $\text{tr}(A) = \sum_{i=1}^n \lambda_i$ | Sum of eigenvalues equals trace |
| Determinant | $\det(A) = \prod_{i=1}^n \lambda_i$ | Product of eigenvalues equals determinant |
| Invertibility | $A$ invertible $\iff$ all $\lambda_i \neq 0$ | Zero eigenvalue means singular |
| Powers | $A^k\mathbf{v} = \lambda^k\mathbf{v}$ | Eigenvalues of $A^k$ are $\lambda^k$ |
| Inverse | $A^{-1}\mathbf{v} = \frac{1}{\lambda}\mathbf{v}$ | Eigenvalues of $A^{-1}$ are $1/\lambda$ |
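
A quick check of several rows of this table, reusing the $2 \times 2$ matrix from the example:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
lams = np.linalg.eigvals(A)   # eigenvalues 5 and 2

print(np.allclose(np.trace(A), lams.sum()))         # trace = sum of eigenvalues
print(np.allclose(np.linalg.det(A), lams.prod()))   # det = product of eigenvalues

# Eigenvalues of A^2 are lambda^2; eigenvalues of A^{-1} are 1/lambda
print(np.allclose(np.sort(np.linalg.eigvals(A @ A)), np.sort(lams**2)))              # True
print(np.allclose(np.sort(np.linalg.eigvals(np.linalg.inv(A))), np.sort(1 / lams)))  # True
```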

Covariance and Correlation

This section covers how to measure relationships between variables using matrix representations.

Data Matrix Setup

Consider a data matrix $X$ with $n$ observations (rows) and $p$ features (columns):

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Each row $\mathbf{x}_i^T$ is a single data point. Each column represents a feature.

Covariance

Definition: Covariance

The covariance between features $j$ and $k$ measures how they vary together:

$$\text{Cov}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$ is the mean of feature $j$.

  • $\text{Cov} > 0$: features increase together
  • $\text{Cov} < 0$: one increases as the other decreases
  • $\text{Cov} = 0$: no linear relationship

Population vs Sample Covariance

The formula above uses $\frac{1}{n}$ (population covariance), which is appropriate when you have the entire population.

When working with a sample from a larger population, use $\frac{1}{n-1}$ (sample covariance) for an unbiased estimator:

$$\text{Cov}_{\text{sample}}(X_j, X_k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

The $(n-1)$ factor is called Bessel's correction. Most statistical software uses sample covariance by default.
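
For instance, NumPy's np.cov divides by $n-1$ by default and by $n$ when bias=True is passed (the feature values below are illustrative):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])  # feature j
y = np.array([2.0, 4.0, 6.0])  # feature k

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
pop_cov    = np.cov(x, y, bias=True)[0, 1]   # divides by n      -> 8/3
sample_cov = np.cov(x, y)[0, 1]              # divides by n - 1  -> 4.0 (default)
print(pop_cov, sample_cov)
```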

Covariance Matrix

The covariance matrix packages all pairwise covariances into a single matrix. Entry $(j, k)$ tells us how features $j$ and $k$ co-vary.

Definition: Covariance Matrix

For centered data $\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^T$ (each row has the mean subtracted), the covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ is:

$$\Sigma = \frac{1}{n} \tilde{X}^T \tilde{X} \quad \text{(population)}$$

$$\Sigma = \frac{1}{n-1} \tilde{X}^T \tilde{X} \quad \text{(sample)}$$

Entry $(j, k)$ is:

$$\Sigma_{jk} = \text{Cov}(X_j, X_k)$$

Example:

Consider 3 data points in $\mathbb{R}^2$:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$$

Step 1: Compute means

$$\bar{x}_1 = \frac{1+3+5}{3} = 3, \quad \bar{x}_2 = \frac{2+4+6}{3} = 4$$

Step 2: Center the data

$$\tilde{X} = \begin{bmatrix} 1-3 & 2-4 \\ 3-3 & 4-4 \\ 5-3 & 6-4 \end{bmatrix} = \begin{bmatrix} -2 & -2 \\ 0 & 0 \\ 2 & 2 \end{bmatrix}$$

Step 3: Compute covariance matrix (population)

$$\Sigma = \frac{1}{3} \tilde{X}^T \tilde{X} = \frac{1}{3} \begin{bmatrix} -2 & 0 & 2 \\ -2 & 0 & 2 \end{bmatrix} \begin{bmatrix} -2 & -2 \\ 0 & 0 \\ 2 & 2 \end{bmatrix} = \frac{1}{3} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} = \begin{bmatrix} 8/3 & 8/3 \\ 8/3 & 8/3 \end{bmatrix}$$

Interpretation:

  • $\Sigma_{11} = 8/3$: variance of feature 1
  • $\Sigma_{22} = 8/3$: variance of feature 2
  • $\Sigma_{12} = \Sigma_{21} = 8/3$: positive covariance (features increase together)
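
The same computation in NumPy, centering the columns explicitly and cross-checking against np.cov (rowvar=False treats columns as features); the numbers match the worked example:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
n = X.shape[0]

X_tilde = X - X.mean(axis=0)           # center each feature (column)
Sigma = (X_tilde.T @ X_tilde) / n      # population covariance matrix
print(Sigma)                           # every entry is 8/3 ~= 2.667

# np.cov with columns as features and bias=True gives the same matrix
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True
```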

Key Properties of Covariance Matrix

| Property | Description |
|---|---|
| Symmetric | $\Sigma = \Sigma^T$ (since $\text{Cov}(X_j, X_k) = \text{Cov}(X_k, X_j)$) |
| Positive Semi-definite | $\mathbf{v}^T \Sigma \mathbf{v} \geq 0$ for all $\mathbf{v}$ |
| Diagonal = Variances | $\Sigma_{jj} = \text{Var}(X_j)$ |
| Eigenvalues | All eigenvalues $\geq 0$ |

Correlation

Correlation is a normalized version of covariance. It removes the scale of variables, showing only the strength and direction of linear relationships.

Definition: Correlation

The correlation between features $j$ and $k$ is:

$$\rho_{jk} = \frac{\text{Cov}(X_j, X_k)}{\sigma_j \sigma_k} = \frac{\Sigma_{jk}}{\sqrt{\Sigma_{jj}} \sqrt{\Sigma_{kk}}}$$

where $\sigma_j = \sqrt{\text{Var}(X_j)}$ is the standard deviation.

Correlation Matrix

Definition: Correlation Matrix

The correlation matrix $R \in \mathbb{R}^{p \times p}$ has entries:

$$R_{jk} = \rho_{jk} = \frac{\Sigma_{jk}}{\sigma_j \sigma_k}$$

Equivalently, if $D = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p)$ is the diagonal matrix of standard deviations:

$$R = D^{-1} \Sigma D^{-1}$$

Example (continuing from above):

$$\Sigma = \begin{bmatrix} 8/3 & 8/3 \\ 8/3 & 8/3 \end{bmatrix}$$

Standard deviations: $\sigma_1 = \sqrt{8/3}$, $\sigma_2 = \sqrt{8/3}$

$$\rho_{12} = \frac{8/3}{\sqrt{8/3} \cdot \sqrt{8/3}} = \frac{8/3}{8/3} = 1$$

$$R = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

The correlation of 1 indicates a perfect positive linear relationship.
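
The same correlation matrix in NumPy, computed both from the $D^{-1} \Sigma D^{-1}$ formula and with np.corrcoef:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

X_tilde = X - X.mean(axis=0)
Sigma = (X_tilde.T @ X_tilde) / X.shape[0]      # population covariance matrix

# R = D^{-1} Sigma D^{-1}, where D holds the standard deviations
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = D_inv @ Sigma @ D_inv
print(R)                                        # [[1. 1.], [1. 1.]]

# np.corrcoef (columns as variables) gives the same matrix
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```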

Key Properties of Correlation Matrix

| Property | Description |
|---|---|
| Bounded | $-1 \leq \rho_{jk} \leq 1$ |
| Diagonal = 1 | $R_{jj} = 1$ (perfect correlation with itself) |
| Symmetric | $R = R^T$ |
| Scale-invariant | Unaffected by shifting or positively rescaling individual variables |

Covariance vs Correlation

| Aspect | Covariance Matrix $\Sigma$ | Correlation Matrix $R$ |
|---|---|---|
| Diagonal entries | Variances | Always 1 |
| Range | $(-\infty, \infty)$ | $[-1, 1]$ |
| Scale | Depends on variable units | Unit-free |
| Use case | PCA, Mahalanobis distance | Comparing relationships |

Summary

| Concept | Definition | Key Formula |
|---|---|---|
| Inner Product | Sum of component products | $\mathbf{u}^T \mathbf{v} = \sum u_i v_i$ |
| Outer Product | Matrix from two vectors (rank-1) | $\mathbf{u}\mathbf{v}^T$, row $i$ = $u_i \mathbf{v}^T$ |
| L2 Norm | Euclidean length | $\Vert\mathbf{v}\Vert_2 = \sqrt{\sum v_i^2}$ |
| Transpose | Swap rows and columns | $(A^T)_{ij} = A_{ji}$ |
| Trace | Sum of diagonal | $\text{tr}(A) = \sum a_{ii}$ |
| Eigenvalue/Eigenvector | Scaling directions of a matrix | $A\mathbf{v} = \lambda\mathbf{v}$ |
| Covariance Matrix | Pairwise covariances | $\Sigma = \frac{1}{n}\tilde{X}^T\tilde{X}$ |
| Correlation Matrix | Normalized covariances | $R = D^{-1}\Sigma D^{-1}$ |