
Linear Transformations


The Central Question: What Does a Function That Preserves Structure Look Like?

We want to understand functions that act on vectors. In machine learning, data flows through layers of operations: a neural network takes an input vector $x$ and produces an output $y = f(x)$. The simplest such functions, the building blocks of everything from image classifiers to language models, are linear transformations.

Consider these scenarios:

  1. Rotating an image: Each pixel coordinate $(x, y)$ maps to a new location $(x', y')$
  2. Scaling features: A preprocessing step multiplies each feature by a different constant
  3. Neural network layer: An input vector $x \in \mathbb{R}^{784}$ (a flattened $28 \times 28$ image) transforms to $y \in \mathbb{R}^{256}$ (a hidden layer)

What do these operations have in common? They all preserve the fundamental structure of linear combinations. If you know what happens to basic building blocks, you know what happens to everything built from them. Understanding this structure unlocks powerful tools: matrix representation, composition via multiplication, and the Rank-Nullity Theorem that tells us exactly what information a transformation preserves and what it destroys.


What Is a Linear Transformation?

Consider the function $T: \mathbb{R}^2 \to \mathbb{R}^2$ that rotates every vector by 90° counterclockwise.

Take the vector $v = \begin{bmatrix} 3 \\ 1 \end{bmatrix}$:

$$T\left(\begin{bmatrix} 3 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 3 \end{bmatrix}$$

The point $(3, 1)$ rotates to $(-1, 3)$.

Now let's check a crucial property. Take two vectors $u = \begin{bmatrix} 2 \\ 0 \end{bmatrix}$ and $v = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$:

$$T(u) = \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \quad T(v) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

$$T(u + v) = T\left(\begin{bmatrix} 3 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 3 \end{bmatrix}$$

$$T(u) + T(v) = \begin{bmatrix} 0 \\ 2 \end{bmatrix} + \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 3 \end{bmatrix}$$

They're equal: $T(u + v) = T(u) + T(v)$. This is no coincidence. It's the defining property of linearity.
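This check is easy to reproduce numerically. A minimal NumPy sketch (the function name `T` is ours, not from the text):

```python
import numpy as np

# 90° counterclockwise rotation as a function on R^2
def T(v):
    x, y = v
    return np.array([-y, x])

u = np.array([2.0, 0.0])
v = np.array([1.0, 1.0])

# Additivity: T(u + v) == T(u) + T(v)
print(T(u + v))          # [-1.  3.]
print(T(u) + T(v))       # [-1.  3.]

# Homogeneity: T(c v) == c T(v)
c = 2.5
print(np.allclose(T(c * v), c * T(v)))  # True
```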

Geometrically, a linear transformation:

  • Preserves the origin: The zero vector always maps to zero
  • Preserves lines through the origin: A line through the origin maps to another line (or point) through the origin
  • Preserves parallelism: Parallel lines remain parallel after transformation
  • Preserves ratios on lines: If $P$ is the midpoint of segment $AB$, then $T(P)$ is the midpoint of segment $T(A)T(B)$

Think of it as "stretching, rotating, reflecting, or projecting" the entire space while keeping the origin fixed.

Figure: A linear transformation maps the standard grid to a parallelogram grid. Grid lines remain straight and parallel. The origin stays fixed.

Definition: Linear Transformation

A function $T: V \to W$ between vector spaces is a linear transformation (or linear map) if it satisfies two properties for all vectors $u, v \in V$ and all scalars $c$:

  1. Additivity: $T(u + v) = T(u) + T(v)$
  2. Homogeneity: $T(cv) = cT(v)$

Equivalently (combining both): $T(c_1 v_1 + c_2 v_2) = c_1 T(v_1) + c_2 T(v_2)$

Theorem: Zero Preservation

Every linear transformation maps the zero vector to the zero vector: $T(0) = 0$

Proof: $T(0) = T(0 \cdot v) = 0 \cdot T(v) = 0$ for any vector $v$.

Key Properties:

| Property | Description | Example |
| --- | --- | --- |
| Additivity | $T(u + v) = T(u) + T(v)$ | Rotating $u+v$ = rotating $u$ + rotating $v$ |
| Homogeneity | $T(cv) = cT(v)$ | Rotating $2v$ = twice rotating $v$ |
| Origin fixed | $T(0) = 0$ | Rotation keeps the origin in place |

Every Linear Transformation Is a Matrix

Consider the 90° rotation $T: \mathbb{R}^2 \to \mathbb{R}^2$ from before. Let's find what it does to the standard basis vectors:

$$T\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 0 \end{bmatrix}$$

Now, any vector $v = \begin{bmatrix} x \\ y \end{bmatrix} = x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

By linearity: $$T(v) = xT\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + yT\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = x\begin{bmatrix} 0 \\ 1 \end{bmatrix} + y\begin{bmatrix} -1 \\ 0 \end{bmatrix} = \begin{bmatrix} -y \\ x \end{bmatrix}$$

This is exactly matrix multiplication: $$T(v) = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

The matrix columns are the images of the basis vectors.

The geometric insight: The matrix of a linear transformation tells you where the "coordinate axes" go:

  • Column 1: Where $e_1 = (1, 0, \ldots)$ lands
  • Column 2: Where $e_2 = (0, 1, 0, \ldots)$ lands
  • And so on...

Once you know where the basis vectors go, linearity determines everything else.

Theorem: Matrix Representation Theorem

Let $T: \mathbb{R}^n \to \mathbb{R}^m$ be a linear transformation. Then there exists a unique $m \times n$ matrix $A$ such that: $$T(x) = Ax \quad \text{for all } x \in \mathbb{R}^n$$

The matrix $A$ is constructed by placing the images of the standard basis vectors as columns: $$A = \begin{bmatrix} | & | & & | \\ T(e_1) & T(e_2) & \cdots & T(e_n) \\ | & | & & | \end{bmatrix}$$

📌 Proof

Let $\{e_1, \ldots, e_n\}$ be the standard basis for $\mathbb{R}^n$. Any vector $x \in \mathbb{R}^n$ can be written as: $$x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n$$

By linearity of $T$: $$T(x) = x_1 T(e_1) + x_2 T(e_2) + \cdots + x_n T(e_n)$$

Define the matrix $A$ with columns $T(e_1), \ldots, T(e_n)$. Then: $$Ax = x_1 [\text{col}_1(A)] + x_2 [\text{col}_2(A)] + \cdots + x_n [\text{col}_n(A)] = T(x)$$

For uniqueness: if $Bx = T(x)$ for all $x$, then $Be_j = T(e_j)$ = column $j$ of $A$, so $B = A$.

Key Properties:

| Property | Matrix Form | Example |
| --- | --- | --- |
| Column $j$ of $A$ | $T(e_j)$ = image of $j$-th basis vector | For rotation: $T(e_1) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ |
| Applying $T$ | $T(x) = Ax$ | $T\begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \end{bmatrix}$ |
| Size of $A$ | $m \times n$ for $T: \mathbb{R}^n \to \mathbb{R}^m$ | Rotation $\mathbb{R}^2 \to \mathbb{R}^2$: $2 \times 2$ matrix |

Finding the Matrix

To find the matrix of a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$:

  1. Compute $T(e_1), T(e_2), \ldots, T(e_n)$
  2. Place these as columns: $A = [T(e_1) \mid T(e_2) \mid \cdots \mid T(e_n)]$
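The two-step recipe can be sketched in NumPy; here `T` is a stand-in for any linear map implemented as a function (we reuse the 90° rotation):

```python
import numpy as np

def T(v):
    x, y = v
    return np.array([-y, x])   # 90° counterclockwise rotation

n = 2
basis = np.eye(n)                                # columns are e_1, ..., e_n
A = np.column_stack([T(e) for e in basis.T])     # column j = T(e_j)

print(A)
# [[ 0. -1.]
#  [ 1.  0.]]

v = np.array([3.0, 1.0])
print(np.allclose(A @ v, T(v)))  # True: T(v) = Av
```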

The Kernel (Nullspace): What Gets Destroyed

Consider the projection $T: \mathbb{R}^3 \to \mathbb{R}^3$ that projects vectors onto the $xy$-plane:

$$T\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}$$

The matrix is $A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$.

Which vectors get mapped to zero?

$$T(v) = 0 \implies \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \implies x = 0,\; y = 0,\; z = \text{anything}$$

The kernel is the entire $z$-axis: $$\ker(T) = \left\{ \begin{bmatrix} 0 \\ 0 \\ z \end{bmatrix} : z \in \mathbb{R} \right\}$$

Geometrically, the kernel consists of all vectors that get "crushed" to zero by the transformation. For our projection:

  • The $z$-axis gets flattened onto the origin
  • All information about "height" is lost
  • Once you project, you can't recover the original $z$-coordinate
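For a concrete matrix, a kernel basis can be computed from the SVD, a standard numerical approach (the tolerance `1e-10` is an assumption of ours):

```python
import numpy as np

# Projection onto the xy-plane
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])

# Nullspace via SVD: right singular vectors whose singular value is (near) zero
_, s, Vt = np.linalg.svd(A)
K = Vt[s < 1e-10].T            # columns form an orthonormal basis for ker(A)

print(K.shape[1])              # 1: the kernel is one-dimensional (a line)
print(np.allclose(A @ K, 0))   # True: every kernel vector maps to 0
```

Here the single basis column is $\pm(0, 0, 1)$: the $z$-axis, as computed by hand above.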

Figure: The kernel of projection onto the $xy$-plane is the $z$-axis. Every point on the $z$-axis maps to the origin.

Definition: Kernel (Nullspace)

The kernel (or nullspace) of a linear transformation $T: V \to W$ is the set of all vectors that map to zero:

$$\ker(T) = \{v \in V : T(v) = 0\}$$

For a matrix $A$, this is $N(A) = \{x : Ax = 0\}$.

Theorem: The Kernel is a Subspace

The kernel of any linear transformation is a subspace of the domain.

Proof sketch: If $T(u) = 0$ and $T(v) = 0$, then $T(u + v) = T(u) + T(v) = 0 + 0 = 0$. Similarly for scalar multiplication.

Key Properties:

| Property | Description | Our Example |
| --- | --- | --- |
| Subspace | $\ker(T)$ is always a subspace of the domain | The $z$-axis is a subspace of $\mathbb{R}^3$ |
| Contains zero | $0 \in \ker(T)$ always | The origin is on the $z$-axis |
| Information loss | Vectors in the kernel lose all their information | Any $(0, 0, z)$ becomes $(0, 0, 0)$ |

The Image (Range): What Gets Produced

Using the same projection $T\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}$:

What vectors can we reach as outputs?

$$\text{im}(T) = \left\{T(v) : v \in \mathbb{R}^3\right\} = \left\{\begin{bmatrix} x \\ y \\ 0 \end{bmatrix} : x, y \in \mathbb{R}\right\}$$

The image is the entire $xy$-plane.

Geometrically, the image tells you the "range of possibilities" for the output. For our projection:

  • Every point in $\mathbb{R}^3$ lands somewhere on the $xy$-plane
  • The $xy$-plane is the "shadow" or "footprint" of the transformation
  • Points off the $xy$-plane (like $(1, 2, 5)$) can never be outputs

Definition: Image (Range)

The image (or range) of a linear transformation $T: V \to W$ is the set of all possible outputs:

$$\text{im}(T) = \{T(v) : v \in V\} = \{w \in W : w = T(v) \text{ for some } v \in V\}$$

For a matrix $A$, this is the column space $C(A) = \{Ax : x \in \mathbb{R}^n\}$.

Theorem: The Image is a Subspace

The image of any linear transformation is a subspace of the codomain.

Proof sketch: If $w_1 = T(v_1)$ and $w_2 = T(v_2)$ are in the image, then $w_1 + w_2 = T(v_1) + T(v_2) = T(v_1 + v_2)$ is also in the image. Scalar multiples are similar: $cw_1 = cT(v_1) = T(cv_1)$.

Key Properties:

| Property | Description | Our Example |
| --- | --- | --- |
| Subspace | $\text{im}(T)$ is always a subspace of the codomain | The $xy$-plane is a subspace of $\mathbb{R}^3$ |
| Span of columns | $\text{im}(T) = \text{span}\{T(e_1), \ldots, T(e_n)\}$ | Columns span the $xy$-plane |
| Reachability | $b \in \text{im}(T) \iff Ax = b$ has a solution | Only vectors with $z = 0$ are reachable |
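The reachability condition can be phrased as a rank comparison: $b \in C(A)$ exactly when appending $b$ as an extra column does not raise the rank. A small sketch (`reachable` is a helper name of ours):

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])   # projection onto the xy-plane

def reachable(A, b):
    """b is in im(A) iff appending b as a column doesn't raise the rank."""
    return np.linalg.matrix_rank(np.column_stack([A, b])) == np.linalg.matrix_rank(A)

print(reachable(A, np.array([1.0, 2.0, 0.0])))  # True: z = 0
print(reachable(A, np.array([1.0, 2.0, 5.0])))  # False: (1, 2, 5) is off the plane
```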

The Rank-Nullity Theorem: The Conservation Law

Our projection $T: \mathbb{R}^3 \to \mathbb{R}^3$ onto the $xy$-plane:

  • Domain dimension: $\dim(\mathbb{R}^3) = 3$
  • Kernel dimension (nullity): $\dim(\ker(T)) = 1$ (the $z$-axis is a line)
  • Image dimension (rank): $\dim(\text{im}(T)) = 2$ (the $xy$-plane)

Notice: $3 = 1 + 2$. The input dimension equals nullity plus rank.

The key insight: The Rank-Nullity Theorem says dimension is conserved.

Think of the input space as having a certain "budget" of dimensions. A linear transformation either:

  • Preserves a dimension (it goes into the image), or
  • Destroys a dimension (it goes into the kernel)

No dimension can be created or lost. They're redistributed between "preserved" and "destroyed."

Conservation of Information:

| Component | Role | ML Interpretation |
| --- | --- | --- |
| Input dimension $n$ | Total information entering | Feature space dimension |
| Rank $r$ | Information that survives | Dimensions the model uses |
| Nullity $n - r$ | Information destroyed | Dimensions the model ignores |

See Defining the Four Subspaces for detailed proofs and the complete subspace picture.

Figure: The Rank-Nullity Theorem visualized. The domain decomposes into the kernel (destroyed dimensions) and a complement (preserved dimensions). The preserved dimensions map isomorphically onto the image.

Theorem: The Rank-Nullity Theorem

Let $T: V \to W$ be a linear transformation where $V$ is finite-dimensional. Then:

$$\dim(V) = \dim(\ker(T)) + \dim(\text{im}(T))$$

Or equivalently: $\dim(V) = \text{nullity}(T) + \text{rank}(T)$

For an $m \times n$ matrix $A$: $n = \dim(N(A)) + \text{rank}(A)$

📌 Proof Sketch

Let $\{v_1, \ldots, v_k\}$ be a basis for $\ker(T)$.

Extend this to a basis $\{v_1, \ldots, v_k, u_1, \ldots, u_r\}$ for all of $V$.

Claim: $\{T(u_1), \ldots, T(u_r)\}$ is a basis for $\text{im}(T)$.

Spans: Any $T(v) = T(c_1 v_1 + \cdots + c_k v_k + d_1 u_1 + \cdots + d_r u_r) = d_1 T(u_1) + \cdots + d_r T(u_r)$ (since $T(v_i) = 0$).

Independent: If $d_1 T(u_1) + \cdots + d_r T(u_r) = 0$, then $T(d_1 u_1 + \cdots + d_r u_r) = 0$, so $d_1 u_1 + \cdots + d_r u_r \in \ker(T) = \text{span}\{v_1, \ldots, v_k\}$. Linear independence of the extended basis forces all $d_i = 0$.

Therefore: $\dim(V) = k + r = \text{nullity}(T) + \text{rank}(T)$.

Key Properties:

| Property | Formula | Our Example |
| --- | --- | --- |
| Conservation | $\dim(V) = \text{nullity} + \text{rank}$ | $3 = 1 + 2$ |
| Rank from nullity | $\text{rank} = \dim(V) - \text{nullity}$ | $2 = 3 - 1$ |
| Nullity from rank | $\text{nullity} = \dim(V) - \text{rank}$ | $1 = 3 - 2$ |

Using Rank-Nullity

The theorem is powerful for deduction:

  • If you know the matrix is $5 \times 3$ and has rank 2, the nullspace has dimension $3 - 2 = 1$
  • If you know the nullspace is trivial ($\{0\}$), the rank equals the number of columns
  • If rank = number of columns, the transformation is injective (one-to-one)
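A quick numerical sanity check of the theorem, using a randomly constructed rank-2 matrix (the seed and tolerance are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 5x3 matrix of rank 2 (product of a 5x2 and a 2x3 factor)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))

n = A.shape[1]
rank = np.linalg.matrix_rank(A)

# Nullity via SVD: count right singular vectors with (near-)zero singular value
s = np.linalg.svd(A, compute_uv=False)
nullity = int(np.sum(s < 1e-10))

print(rank, nullity)          # 2 1
print(rank + nullity == n)    # True: rank-nullity
```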

Injectivity, Surjectivity, and Invertibility

Injective (One-to-One)

A transformation is injective if different inputs always produce different outputs: $T(u) = T(v) \implies u = v$.

Theorem: Injectivity and Kernel

$T$ is injective if and only if $\ker(T) = \{0\}$.

Why? If $T(u) = T(v)$, then $T(u - v) = 0$, so $u - v \in \ker(T)$. If the kernel is trivial, then $u - v = 0$, so $u = v$.

Surjective (Onto)

A transformation is surjective if every possible output is actually achieved: $\text{im}(T) = W$.

Theorem: Surjectivity and Rank

$T: V \to W$ is surjective if and only if $\text{rank}(T) = \dim(W)$.

Invertibility

A transformation is invertible (bijective) if it's both injective and surjective.

Theorem: Invertibility Conditions

For $T: V \to W$ with $\dim(V) = \dim(W) = n$:

$T$ is invertible $\iff$ $\ker(T) = \{0\}$ $\iff$ $\text{rank}(T) = n$ $\iff$ $\text{im}(T) = W$

Summary Table:

| Property | Condition | Matrix Test |
| --- | --- | --- |
| Injective | $\ker(T) = \{0\}$ | Nullspace is trivial; rank = number of columns |
| Surjective | $\text{im}(T) = W$ | Rank = number of rows |
| Invertible | Both | Square matrix with rank $= n$ |
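The matrix tests in the table translate directly into rank checks; the helper names below are ours:

```python
import numpy as np

def is_injective(A):
    return np.linalg.matrix_rank(A) == A.shape[1]   # trivial nullspace

def is_surjective(A):
    return np.linalg.matrix_rank(A) == A.shape[0]   # image fills the codomain

def is_invertible(A):
    return A.shape[0] == A.shape[1] and is_injective(A)

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # 2x3, rank 2
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])           # 90° rotation

print(is_injective(A), is_surjective(A))   # False True
print(is_invertible(R))                    # True
```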

For the complete list of 14 equivalent conditions (including determinant, eigenvalues, and the four fundamental subspaces), see The Invertible Matrix Theorem.


Rotation (2D)

Rotation by angle $\theta$ counterclockwise:

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

  • Kernel: $\{0\}$ (nothing gets crushed)
  • Image: All of $\mathbb{R}^2$
  • Invertible: Yes, $R_\theta^{-1} = R_{-\theta}$

Projection

Projection onto the $x$-axis in $\mathbb{R}^2$:

$$P = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

  • Kernel: The $y$-axis (dimension 1)
  • Image: The $x$-axis (dimension 1)
  • Rank-Nullity: $2 = 1 + 1$

Reflection

Reflection across the $x$-axis:

$$F = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

  • Kernel: $\{0\}$
  • Image: All of $\mathbb{R}^2$
  • Invertible: Yes, $F^{-1} = F$ (self-inverse)

Scaling

Uniform scaling by factor $k$:

$$S_k = \begin{bmatrix} k & 0 \\ 0 & k \end{bmatrix}$$

  • Kernel: $\{0\}$ if $k \neq 0$; all of $\mathbb{R}^2$ if $k = 0$
  • Invertible: Yes if $k \neq 0$

Shear

Horizontal shear by factor $k$:

$$H_k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$$

  • Kernel: $\{0\}$
  • Image: All of $\mathbb{R}^2$
  • Invertible: Yes, $H_k^{-1} = H_{-k}$
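The inverse claims for these examples can be verified numerically; `theta` and `k` are arbitrary test values of ours:

```python
import numpy as np

theta, k = 0.7, 3.0
I = np.eye(2)

R = lambda t: np.array([[np.cos(t), -np.sin(t)],
                        [np.sin(t),  np.cos(t)]])     # rotation
F = np.array([[1.0, 0.0], [0.0, -1.0]])               # reflection
H = lambda s: np.array([[1.0, s], [0.0, 1.0]])        # horizontal shear

print(np.allclose(R(theta) @ R(-theta), I))  # True: R_theta^{-1} = R_{-theta}
print(np.allclose(F @ F, I))                 # True: F is self-inverse
print(np.allclose(H(k) @ H(-k), I))          # True: H_k^{-1} = H_{-k}
```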

Composition: Transformations Multiply

Let $R_{90}$ be 90° rotation and $S_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$ be scaling by 2.

Apply rotation first, then scaling: $(S_2 \circ R_{90})(v) = S_2(R_{90}(v))$

$$S_2 R_{90} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}$$

Apply scaling first, then rotation: $(R_{90} \circ S_2)(v) = R_{90}(S_2(v))$

$$R_{90} S_2 = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}$$

In this case they're equal: uniform scaling is a scalar multiple of the identity, which commutes with every matrix. But in general, order matters.
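Replacing uniform scaling with a non-uniform stretch (an example of ours, not from the text) shows the order dependence:

```python
import numpy as np

R90 = np.array([[0.0, -1.0],
                [1.0,  0.0]])          # 90° rotation
S   = np.array([[2.0, 0.0],
                [0.0, 1.0]])           # non-uniform scaling: stretch x only

print(S @ R90)
# [[ 0. -2.]
#  [ 1.  0.]]
print(R90 @ S)
# [[ 0. -1.]
#  [ 2.  0.]]
print(np.allclose(S @ R90, R90 @ S))   # False: order matters
```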

Theorem: Composition as Matrix Multiplication

If $T: U \to V$ has matrix $A$ and $S: V \to W$ has matrix $B$, then the composition $S \circ T: U \to W$ has matrix $BA$.

$$(S \circ T)(x) = S(T(x)) = S(Ax) = B(Ax) = (BA)x$$


Summary

Fundamental definitions:

  • A linear transformation $T: V \to W$ preserves addition ($T(u+v) = T(u) + T(v)$) and scalar multiplication ($T(cv) = cT(v)$)
  • Every linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ has a unique $m \times n$ matrix with columns $T(e_1), \ldots, T(e_n)$

Kernel and Image:

  • Kernel $\ker(T) = \{v : T(v) = 0\}$: vectors destroyed (mapped to zero)
  • Image $\text{im}(T) = \{T(v) : v \in V\}$: all possible outputs

The conservation law:

  • Rank-Nullity Theorem: $\dim(V) = \dim(\ker T) + \dim(\text{im } T)$
  • Input dimensions are either preserved (image) or destroyed (kernel); no dimensions are created or lost

Injectivity and surjectivity:

  • Injective (one-to-one) $\iff$ $\ker(T) = \{0\}$ $\iff$ rank = number of columns
  • Surjective (onto) $\iff$ $\text{im}(T) = W$ $\iff$ rank = number of rows
  • Invertible $\iff$ both $\iff$ square matrix with full rank

Composition:

  • $S \circ T$ has matrix $BA$ (order reversed: the last-applied transformation comes first)

Answering the Central Question: A function that preserves the structure of linear algebra (addition and scalar multiplication) is a linear transformation, and every such function between finite-dimensional spaces is uniquely represented by a matrix. Its kernel and image reveal exactly what information is preserved and what is destroyed, governed by the rank-nullity theorem: $\dim(V) = \dim(\ker T) + \dim(\text{im } T)$.


Applications in Data Science and Machine Learning

Linear transformations appear throughout machine learning, often disguised as matrices or "layers."

Neural Network Layers

Each linear layer in a neural network computes $y = Wx + b$, which is affine; without the bias $b$, it's purely linear. The weight matrix $W \in \mathbb{R}^{m \times n}$ transforms $n$-dimensional inputs to $m$-dimensional outputs.

Layer Types by Shape:

| Layer Type | Shape | Behavior |
| --- | --- | --- |
| Expansion ($m > n$) | Skinny | Embeds into higher-dim space; cannot reach all of $\mathbb{R}^m$ |
| Compression ($m < n$) | Fat | Dimensionality reduction; non-trivial nullspace guaranteed |
| Square ($m = n$) | Square | Can be invertible (if full rank) |

The linearity of WW has important consequences:

  • Response to sums: The network's response to a sum of inputs equals the sum of responses. This property is exploited in understanding gradients and backpropagation
  • Rank of $W$: Determines the "effective dimensionality" of the layer's output
  • Kernel of $W$: Input directions that produce zero output (potential dead features)
  • Bottlenecks: If $\text{rank}(W) < \min(m, n)$, the layer compresses information

When we say a neural network layer is "a weight matrix $W$," we're using the Matrix Representation Theorem implicitly. Training the network means finding the right columns: where should each input feature direction map?

A deep neural network (without nonlinearities) computes $W_n W_{n-1} \cdots W_2 W_1 x$. This is just one big linear transformation. Without nonlinear activation functions, depth adds no expressive power; the product of matrices is still a single matrix.
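A quick sketch of that collapse, with arbitrary layer sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "layers" with no nonlinearity between them (sizes are illustrative)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((8, 8))
W3 = rng.standard_normal((2, 8))

x = rng.standard_normal(4)

deep = W3 @ (W2 @ (W1 @ x))        # layer-by-layer forward pass
collapsed = (W3 @ W2 @ W1) @ x     # one equivalent matrix

print(np.allclose(deep, collapsed))  # True: depth without nonlinearity collapses
```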

Principal Component Analysis (PCA)

PCA projects high-dimensional data onto a lower-dimensional subspace via $y = W^T x$, where $W \in \mathbb{R}^{n \times k}$ contains the top $k$ principal components.

  • Rank: $k$ (the number of components kept)
  • Kernel: Dimension $n - k$ (the discarded directions)
  • Rank-Nullity interpretation: The variance explained by the kept components plus the variance discarded equals the total variance

Dimensionality Reduction and Bottleneck Layers

Any linear dimensionality reduction from $\mathbb{R}^n$ to $\mathbb{R}^k$ (where $k < n$) is a rank-$k$ transformation. By Rank-Nullity:

  • The kernel has dimension $n - k$ (information destroyed)
  • The image has dimension $k$ (information preserved)

Autoencoders: The Bottleneck Architecture

Consider a linear autoencoder: Input ($n$) → Encoder → Latent ($k$) → Decoder → Output ($n$).

| Component | Matrix Shape | Type | Role |
| --- | --- | --- | --- |
| Encoder $E$ | $k \times n$ | Fat matrix | Compresses high-dim input to low-dim code |
| Decoder $D$ | $n \times k$ | Skinny matrix | Embeds low-dim code back into high-dim space |
| Full system $DE$ | $n \times n$ | Square | Reconstruction (ideally close to identity on the data manifold) |

Information flow:

  • Encoder (Fat): Massive information destruction; the nullspace has dimension $\geq n - k$. Many inputs map to the same code.
  • Decoder (Skinny): Cannot reach all of $\mathbb{R}^n$; the output lives in a $k$-dimensional subspace.
  • Bottleneck: The rank of the reconstruction $DE$ is bounded by $k$, regardless of $n$.

The bottleneck forces the network to learn a compressed representation. The nullspace of the encoder contains all variations the network "ignores." For a well-trained autoencoder on natural images, this nullspace should contain noise while the image's essential structure survives.
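The rank bound on $DE$ is easy to confirm with random encoder/decoder matrices (sizes and seed are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3

E = rng.standard_normal((k, n))   # encoder: fat, k x n
D = rng.standard_normal((n, k))   # decoder: skinny, n x k

recon = D @ E                     # n x n reconstruction map
print(np.linalg.matrix_rank(recon))          # 3: bounded by the bottleneck k
print(np.linalg.matrix_rank(recon) <= k)     # True
```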

Feature Engineering

Transforming raw features $x$ to engineered features $\phi(x)$ often involves linear transformations:

  • Standardization: $z = \frac{x - \mu}{\sigma}$ (affine; linear after centering)
  • Whitening: $z = \Sigma^{-1/2}(x - \mu)$ (linear after centering)
  • Random projections: $z = Rx$ for a random matrix $R$

Guided Problems

Problem 1: Determining Injectivity from the Matrix

Consider the linear transformation $T: \mathbb{R}^3 \to \mathbb{R}^2$ given by:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$

  1. What is the rank of $T$?
  2. What is the dimension of $\ker(T)$?
  3. Is $T$ injective? Is $T$ surjective?

💡 Solution

Hints:

  • Row reduce to find the rank
  • Use Rank-Nullity: nullity = $n$ - rank
  • Injectivity requires $\ker(T) = \{0\}$

Solution:

  1. Rank: Row reduce $A$: $$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \xrightarrow{R_2 - 4R_1} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \end{bmatrix}$$ Two pivots, so rank = 2.

  2. Nullity: By Rank-Nullity, $3 = \text{nullity} + 2$, so nullity = 1.

  3. Injectivity/Surjectivity:

    • $T$ is not injective because $\ker(T) \neq \{0\}$ (nullity = 1 > 0)
    • $T$ is surjective because rank = 2 = $\dim(\mathbb{R}^2)$ (the codomain)
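The same conclusions follow from a one-line rank computation:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank        # rank-nullity

print(rank, nullity)               # 2 1
print(rank == A.shape[1])          # False -> not injective
print(rank == A.shape[0])          # True  -> surjective
```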

Problem 2: Composition and Rank

Let $T: \mathbb{R}^4 \to \mathbb{R}^3$ be a linear transformation with rank 2, and let $S: \mathbb{R}^3 \to \mathbb{R}^5$ be a linear transformation with rank 3.

  1. What are the possible values for $\text{rank}(S \circ T)$?
  2. Give an example achieving the maximum rank.
  3. Give an example achieving the minimum rank.

💡 Solution

Hints:

  • $\text{rank}(S \circ T) \leq \min(\text{rank}(S), \text{rank}(T))$
  • The composition's image is $S(\text{im}(T))$; check what $\text{rank}(S) = 3$ implies about $\ker(S)$
  • The composition's kernel contains $T$'s kernel

Solution:

  1. Possible values: Since $\text{rank}(S) = 3 = \dim(\mathbb{R}^3)$, the kernel of $S$ is trivial, so $S$ is injective. An injective map preserves the dimension of any subspace, so $\text{rank}(S \circ T) = \dim S(\text{im}(T)) = \text{rank}(T) = 2$. The only possible value is 2, which also attains the general bound $\text{rank}(S \circ T) \leq \min(\text{rank}(S), \text{rank}(T)) = 2$.

  2. Example achieving the maximum (rank 2): Let $T(x) = \begin{bmatrix} x_1 \\ x_2 \\ 0 \end{bmatrix}$ (projects onto the first two coordinates, rank 2) and $S(y) = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ 0 \\ 0 \end{bmatrix}$ (embeds $\mathbb{R}^3$ into $\mathbb{R}^5$, rank 3).

    Then $(S \circ T)(x) = \begin{bmatrix} x_1 \\ x_2 \\ 0 \\ 0 \\ 0 \end{bmatrix}$, which has rank 2.

  3. Minimum rank: Also 2, because the injective $S$ cannot destroy any dimensions of $\text{im}(T)$. A lower composition rank would require $\text{rank}(S) < 3$: for instance, $S(y) = \begin{bmatrix} y_1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ has rank 1 and can force $\text{rank}(S \circ T) \leq 1$, but it violates the hypothesis that $\text{rank}(S) = 3$.
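The fact that an injective (full-column-rank) map preserves the rank of whatever it composes with can be spot-checked with random matrices (our construction, illustrative seed):

```python
import numpy as np

rng = np.random.default_rng(3)

# T: R^4 -> R^3 with rank 2; S: R^3 -> R^5 with rank 3 (full column rank, injective)
T = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 4))
S = rng.standard_normal((5, 3))

print(np.linalg.matrix_rank(T))      # 2
print(np.linalg.matrix_rank(S))      # 3
print(np.linalg.matrix_rank(S @ T))  # 2: injective S preserves rank(T)
```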


Problem 3: Kernel and Image of a Projection

Let $P: \mathbb{R}^3 \to \mathbb{R}^3$ be the orthogonal projection onto the plane $x + y + z = 0$.

  1. Find a basis for $\ker(P)$.
  2. Find a basis for $\text{im}(P)$.
  3. Verify the Rank-Nullity Theorem.
  4. What is $P^2$? What does this tell you about projections?

💡 Solution

Hints:

  • The kernel of an orthogonal projection onto a plane is the line perpendicular to that plane
  • The image of a projection onto a plane is the plane itself
  • For projections, $P^2 = P$ (applying twice is the same as once)

Solution:

  1. Kernel: The kernel consists of vectors perpendicular to the plane $x + y + z = 0$. The normal to this plane is $n = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$.

    Basis for $\ker(P)$: $\left\{ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right\}$

  2. Image: The image is the plane $x + y + z = 0$ itself. Two independent vectors in this plane:

    Basis for $\text{im}(P)$: $\left\{ \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \right\}$

  3. Rank-Nullity:

    • $\dim(\mathbb{R}^3) = 3$
    • nullity = 1, rank = 2
    • $3 = 1 + 2$
  4. $P^2$: For any projection, $P^2 = P$. Once you're on the plane, projecting again does nothing. You stay where you are. This property ($P^2 = P$) is called idempotence and characterizes projections.

