Unsupervised Learning -- Module 2 -- Machine Learning


Index

  1. Clustering: K-Means and Kernel K-Means
  2. Kernel K-Means
  3. Dimensionality Reduction - PCA and Kernel PCA
  4. Back to PCA then, with an example
  5. Kernel PCA
  6. Matrix factorization
  7. Singular Value Decomposition
  8. Matrix Completion
  9. Gradient Descent
  10. Generative Models

Clustering: K-Means and Kernel K-Means

The K-Means Algorithm

At this point we already know what the K-Means algorithm is, but I will include the content again for completeness. If you do not wish to revisit it, skip ahead to Kernel K-Means.

Basic Idea:

Divides n objects into k clusters, where each object belongs to the cluster with the nearest mean (called the centroid).

The K-Means algorithm begins by choosing initial centroids (or, equivalently, an initial random assignment of points to clusters) and then iteratively refines them. Each refinement step computes the mean of the data points assigned to each cluster, moves that cluster's centroid there, and then reassigns every point to its nearest centroid. The iteration stops at convergence, meaning the cluster assignments no longer change. K-means clustering aims to minimize the sum of squared distances between each data point and its assigned cluster centroid.

Before we proceed further, let's understand what a centroid is:


What is a Centroid?

A centroid is the geometric center of a cluster.

Pasted image 20250422125627.png

Or

Pasted image 20250422125857.png

One could say that the centroids of these clusters are the big white circles at the center, and in some cases, the green circles.

It’s like the average position of all the points (data samples) assigned to that cluster.

For example :

Let’s say you have a 2D space where each data point is represented by coordinates:

| Point | X | Y |
| ----- | - | - |
| A     | 2 | 4 |
| B     | 4 | 6 |
| C     | 3 | 5 |

If these three points belong to a single cluster, the centroid's coordinates would be calculated like this:

$$Centroid_x = \frac{2 + 4 + 3}{3} = 3 \qquad Centroid_y = \frac{4 + 6 + 5}{3} = 5$$

So the centroid of all these data points is (3,5)

In higher dimensions (more variables), the centroid is still just the mean of each coordinate (feature) across all points in the cluster, and the centroid formula scales as follows:

$$Centroid_{feature} = \frac{\text{sum of feature values}}{\text{number of points}}$$
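
As a quick sanity check, here is a tiny NumPy sketch (my own addition, with made-up variable names) that computes a centroid as the per-feature mean:

```python
import numpy as np

# Points A, B, C from the example above (rows = points, columns = features)
cluster = np.array([[2, 4],
                    [4, 6],
                    [3, 5]])

# The centroid is just the mean of each column (feature)
centroid = cluster.mean(axis=0)
print(centroid)  # [3. 5.]
```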

Why Is It Important?

  1. The centroid acts like the "anchor" for a cluster.
  2. During the K-Means algorithm, the centroids are repeatedly recomputed and each point is reassigned to its nearest centroid.


Important:

So, getting back to K-Means clustering,


Algorithm of K-Means clustering

  1. Choose the number of clusters k.
  2. Randomly choose k centroids.
  3. Assign each point to the nearest centroid. (This is done by calculating the distance of a point to all the centroids then assigning that point to the centroid with the lowest distance, thus creating a cluster)
  4. Re-compute centroids as the mean of all points in the cluster (using the centroid formula).
  5. Repeat steps 3 & 4 until centroids don’t change (convergence is reached) or a maximum number of iterations is reached.
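
To make these steps concrete, here is a minimal from-scratch sketch (my own, with hypothetical names; in practice you would reach for `sklearn.cluster.KMeans`, and this sketch skips edge cases like empty clusters):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 1], [4, 5], [5, 4]], dtype=float)
print(kmeans(X, k=2))
```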

Pasted image 20250414181534.png


Advantages of K-Means

Here are some advantages of the K-means clustering algorithm -


Disadvantages of K-Means

Here are some disadvantages of the K-means clustering algorithm -


Kernel K-Means

The Problem with Regular K-means

Regular K-means works well when clusters are linearly separable - meaning you can draw straight lines (or hyperplanes) to separate the clusters. However, many real-world datasets have clusters that follow non-linear patterns like:

In these cases, regular K-means fails because it assumes clusters are spherical and separated by linear boundaries.


What is Kernel K-means?

Kernel K-means extends regular K-means to handle non-linear cluster structures by using the kernel trick. Instead of working directly with the original data points, it implicitly maps them to a higher-dimensional feature space where the clusters become linearly separable

The key insight: What looks non-linear in the original space might become linear in a transformed space.

So basically the same kernel trick that happens in SVMs


The Kernel Trick in Clustering

Core Concept

Rather than explicitly transforming data points x to a higher-dimensional space ϕ(x), kernel K-means uses a kernel function K(xi,xj) that computes the inner product in the transformed space:

$$K(x_i, x_j) \ = \ \phi(x_i) \cdot \phi(x_j)$$

This allows us to work in high-dimensional spaces without actually computing the transformation.
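
As a small illustration (my own toy check, not from the notes): for the polynomial kernel $K(x, z) = (x \cdot z)^2$, an explicit feature map for 2-D inputs is $\phi(x) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$, and the kernel value equals the inner product in that mapped space:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (no constant term)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

print(phi(x) @ phi(z))    # 25.0 -> inner product computed in the mapped space
print(poly_kernel(x, z))  # 25.0 -> same value, without ever computing phi
```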


Distance Calculation

In kernel K-means, the distance between a data point xi and cluster center mc in the feature space is computed as:

$$d^2(\phi(x_i), m_c) = \phi(x_i)\cdot\phi(x_i) \ - \ 2\,\phi(x_i)\cdot m_c \ + \ m_c\cdot m_c$$

(Don't really stress on the equation here.)

Using the kernel trick, this becomes:

$$d^2(x_i, C_c) \ = \ K(x_i, x_i) \ - \ \frac{2}{|C_c|}\sum_{x_j \in C_c} K(x_i, x_j) \ + \ \frac{1}{|C_c|^2}\sum_{x_j, x_k \in C_c} K(x_j, x_k)$$

where $C_c$ is cluster $c$ and $|C_c|$ is the number of points in that cluster.

(We might need to know this equation)


Common Kernel Functions

1. Polynomial Kernel (General form)

Formula: $K(x_i, x_j) = (c \ + \ x_i \cdot x_j)^d$


2. Radial Basis Function (RBF/Gaussian) Kernel

Formula: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$


3. Sigmoid Kernel

Formula: $K(x_i, x_j) = \tanh(\alpha \, x_i \cdot x_j + c)$


Kernel K-means Algorithm

Here are the step-by-step algorithmic procedures.

At its base this is the same K-Means algorithm, with the Euclidean distance swapped out for the kernel-based distance above.

Input: Dataset $X = \{x_1, x_2, ..., x_n\}$, number of clusters $k$, kernel function $K$

  1. Initialize: Randomly assign each data point to one of k clusters
  2. Repeat until convergence:
    • For each data point xi:
      • Calculate the kernel distance to each cluster using: $d^2(x_i, C_c) = K(x_i, x_i) - \frac{2}{|C_c|}\sum_{x_j \in C_c} K(x_i, x_j) + \frac{1}{|C_c|^2}\sum_{x_j, x_k \in C_c} K(x_j, x_k)$
      • Assign xi to the cluster with minimum distance
    • Update clusters based on new assignments

Key Differences from Regular K-means

| Aspect          | Regular K-means      | Kernel K-means              |
| --------------- | -------------------- | --------------------------- |
| Distance metric | Euclidean distance   | Kernel-based distance       |
| Cluster shapes  | Spherical only       | Arbitrary non-linear shapes |
| Complexity      | O(ndk) per iteration | O(n²k) per iteration        |
| Memory          | O(nd)                | O(n²) for kernel matrix     |
| Cluster centers | Explicit centroids   | Implicit in feature space   |

Advantages of Kernel K-means

  1. Handles non-linear structures: Can identify clusters with complex shapes
  2. Flexible: Different kernels for different data types
  3. Theoretically sound: Based on solid mathematical foundations
  4. Better separation: Often achieves better clustering than linear methods

Limitations

  1. Computational complexity: O(n²) operations per iteration make it expensive for large datasets
  2. Memory intensive: Must store the full n×n kernel matrix
  3. Parameter tuning: Kernel parameters (like σ for RBF) need careful selection
  4. No explicit centroids: Cannot directly interpret cluster centers
  5. Scalability: Becomes impractical for very large datasets

When to Use Kernel K-means

Use kernel K-means when:

Use regular K-means when:


Kernel K-Means : Example

Let's clear up whatever confusion there may be, using an example.

Given Dataset and Setup

Data Points: $x_1 = (1, 2)$, $x_2 = (2, 1)$, $x_3 = (4, 5)$, $x_4 = (5, 4)$

Kernel Function: RBF (Gaussian) kernel with $\sigma = 1$


Step 1: Choose a k value

Let it be k = 2 for now.


Step 2: Randomly assign the points to k = 2 clusters

So let's say: $C_1^{(0)} = \{x_1, x_3\}$ and $C_2^{(0)} = \{x_2, x_4\}$

The superscript (0) indicates that these clusters are from the zeroth iteration.


Step 3: Calculate Kernel Distances for all points.

For each point, calculate its kernel distance to each cluster using:

$$d^2(x_i, C_c) \ = \ K(x_i, x_i) \ - \ \frac{2}{|C_c|}\sum_{x_j \in C_c} K(x_i, x_j) \ + \ \frac{1}{|C_c|^2}\sum_{x_j, x_k \in C_c} K(x_j, x_k)$$

Distance from $x_1$ to $C_1^{(0)} = \{x_1, x_3\}$:

x1 = (1,2)
x3 = (4,5)

Kernel values needed:

The RBF kernel formula:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

where $\|x_i - x_j\|$ is the Euclidean distance between the two points and $\sigma$ is the kernel width (here $\sigma = 1$).

The formula for the Euclidean distance between two points is:

$$\text{Distance} = \sqrt{(x_2 \ - \ x_1)^2 \ + \ (y_2 \ - \ y_1)^2}$$
  1. For $K(x_1, x_1)$ (the kernel value of $x_1$ with itself):

$$K(x_1, x_1) = \exp\left(-\frac{\|x_1 - x_1\|^2}{2(1^2)}\right) = \exp\left(-\frac{\|(1,2) - (1,2)\|^2}{2}\right) = \exp\left(-\frac{0}{2}\right)$$

(The Euclidean distance between a point and itself is always zero.)

$$K(x_1, x_1) = \exp(0) = e^0 = 1$$

  2. For $K(x_1, x_3)$ (the kernel value between $x_1$ and $x_3$):

$$K(x_1, x_3) = \exp\left(-\frac{\|(1,2) - (4,5)\|^2}{2}\right) = \exp\left(-\frac{(4-1)^2 + (5-2)^2}{2}\right) = \exp\left(-\frac{9 + 9}{2}\right) = \exp(-9) \approx 0.000123$$

  3. For $K(x_3, x_1)$: 0.000123 as well, since the kernel is symmetric (same pair of points, reversed order).

  4. For $K(x_3, x_3)$: we can safely say this will be 1, since it is the kernel of a point with itself (and $\sigma = 1$).

Step 4: Plug these into the distance formula

$$d^2(x_i, C_c) \ = \ K(x_i, x_i) \ - \ \frac{2}{|C_c|}\sum_{x_j \in C_c} K(x_i, x_j) \ + \ \frac{1}{|C_c|^2}\sum_{x_j, x_k \in C_c} K(x_j, x_k)$$

Term 1: $K(x_1, x_1) = 1$

Term 2: $\frac{2}{|C_1|}\sum_{x_j \in C_1} K(x_1, x_j)$

The sum runs over all points in $C_1 = \{x_1, x_3\}$:

$$K(x_1, x_1) + K(x_1, x_3) = 1 + 0.000123 = 1.000123$$

And $|C_1|$ is simply the number of elements in cluster $C_1$. We have two points in this cluster, so $|C_1| = 2$ and $\frac{2}{|C_1|} = \frac{2}{2} = 1$.

So: $\frac{2}{2} \times 1.000123 = 1.000123$

Term 3: $\frac{1}{|C_1|^2}\sum_{x_j, x_k \in C_1} K(x_j, x_k)$

This is a double sum over all pairs $(x_j, x_k)$ where both $x_j, x_k \in C_1$.

The pairs are: $(x_1, x_1)$, $(x_1, x_3)$, $(x_3, x_1)$, $(x_3, x_3)$

So,

$$\sum_{x_j, x_k \in C_1} K(x_j, x_k) = K(x_1, x_1) + K(x_1, x_3) + K(x_3, x_1) + K(x_3, x_3) = 1 + 0.000123 + 0.000123 + 1 = 2.000246$$

And $\frac{1}{|C_1|^2} = \frac{1}{2^2} = \frac{1}{4}$

So,

$$\frac{1}{4} \times 2.000246 = 0.500061$$

So, the final distance of point $x_1$ to cluster $C_1^{(0)}$ is:

$$d^2(x_1, C_1) = 1 \ - \ 1.000123 \ + \ 0.500061 = 0.499938$$

Now as you can see, if we were to continue this sum by hand, it would take a long time. So to provide the rest of the details:

Iteration 1: Distance Calculations

Point $x_1$ distances: $d^2(x_1, C_1) \approx 0.4999$, $d^2(x_1, C_2) \approx 1.1321$ → stays in $C_1$

Point $x_2$ distances: $d^2(x_2, C_1) \approx 1.1321$, $d^2(x_2, C_2) \approx 0.4999$ → stays in $C_2$

Point $x_3$ distances: $d^2(x_3, C_1) \approx 0.4999$, $d^2(x_3, C_2) \approx 1.1321$ → stays in $C_1$

Point $x_4$ distances: $d^2(x_4, C_1) \approx 1.1321$, $d^2(x_4, C_2) \approx 0.4999$ → stays in $C_2$

Final Cluster Assignments

$C_1 = \{x_1, x_3\}$ and $C_2 = \{x_2, x_4\}$, so no assignments changed.


Algorithm Converged in 1 Iteration ✓

The algorithm has converged because no points changed cluster assignments. This makes intuitive sense because:

  1. Natural grouping: Points (1,2) and (2,1) are close to each other, as are (4,5) and (5,4)
  2. Large separation: The two groups are far apart in the original space
  3. RBF kernel effect: The kernel values between distant points are nearly zero (0.0001)

Code

For the curious cat:

```python
import numpy as np

# Define the dataset points
points = np.array([[1, 2], [2, 1], [4, 5], [5, 4]])

# Define the RBF kernel function
sigma = 1

def rbf_kernel(x, y, sigma=sigma):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

# Initial cluster assignments (from the example above): C1 = {x1, x3}, C2 = {x2, x4}
cluster_assignments = {1: [0, 2], 2: [1, 3]}  # values are point indices

# Precompute the kernel matrix
n = len(points)
kernel_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        kernel_matrix[i, j] = rbf_kernel(points[i], points[j])

# Kernel distance from one point to a cluster (cluster_points are indices)
def kernel_distance(point_idx, cluster_points):
    m = len(cluster_points)
    term1 = kernel_matrix[point_idx, point_idx]
    term2 = (2 / m) * np.sum(kernel_matrix[point_idx, cluster_points])
    term3 = 0.0
    for j in cluster_points:
        for k in cluster_points:
            term3 += kernel_matrix[j, k]
    term3 /= m ** 2
    return term1 - term2 + term3

# Iterative kernel K-means clustering until convergence
max_iters = 10
assignments = cluster_assignments
iteration_results = []

for iteration in range(max_iters):
    new_assignments = {1: [], 2: []}

    # Calculate distances and assign each point to the nearer cluster
    for i in range(n):
        dist_to_c1 = kernel_distance(i, assignments[1])
        dist_to_c2 = kernel_distance(i, assignments[2])
        if dist_to_c1 < dist_to_c2:
            new_assignments[1].append(i)
        else:
            new_assignments[2].append(i)

    iteration_results.append({
        "iteration": iteration + 1,
        "assignments": new_assignments,
        "distances": {i: {
            "to_C1": kernel_distance(i, new_assignments[1]),
            "to_C2": kernel_distance(i, new_assignments[2])
        } for i in range(n)}
    })

    if new_assignments == assignments:  # convergence check
        break
    assignments = new_assignments

print(iteration_results[-1])  # last iteration results
```

Dimensionality Reduction - PCA and Kernel PCA

What is Dimensionality Reduction?

Dimensionality reduction addresses the curse of dimensionality - the problem that arises when working with high-dimensional data. As dimensions increase:

PCA helps by projecting high-dimensional data onto a lower-dimensional subspace while retaining the most important variance.


Principal Component Analysis (PCA)

Core Concept

PCA finds the directions (principal components) along which the data varies the most. These components are:

Think of it as finding the "best" coordinate system to represent your data with fewer dimensions.


Why Use PCA?


Warning: The following content cannot be properly understood without knowledge of the following matrix-related operations:

Please go through Module 3 -- Matrices, Module 4 -- Vector Spaces, Module 5 -- Vector Spaces -- Continued and Module 4 -- Numerical Solutions (Matrices from Numerical Methods, the Gaussian elimination, Gauss-Jordan, matrix-inversion and LU factorization methods)

Matrices

Eigenvalues and Eigenvectors

Covariance and Variance


Back to PCA then, with an example

Let's understand Principal Component Analysis with a detailed example.

1. Data

Suppose you’re analyzing students’ math and physics scores:

| Student | Math | Physics |
| ------- | ---- | ------- |
| 1       | 2    | 0       |
| 2       | 0    | 1       |
| 3       | 1    | 3       |

2. Mean-Center the Data (Standardization)

First, compute the mean of each column: mean(Math) = (2 + 0 + 1)/3 = 1 and mean(Physics) = (0 + 1 + 3)/3 ≈ 1.33. Subtracting each column's mean from its values centers the data at the origin.

Subtract these means to get the centered data:

| Student | Math | Physics |
| ------- | ---- | ------- |
| 1       | 1    | -1.33   |
| 2       | -1   | -0.33   |
| 3       | 0    | 1.67    |

3. Compute the Covariance Matrix

The covariance matrix for 2D data will look like:

$$\begin{bmatrix} \text{Var(Math)} & \text{Cov(Math, Physics)} \\ \text{Cov(Math, Physics)} & \text{Var(Physics)} \end{bmatrix}$$

Formulae :

  1. Variance - Population

$$Var(X) = \frac{\sum (x_i \ - \ \mu)^2}{N}$$

where:


  2. Variance - Sample

$$Var(X) = \frac{\sum (x_i \ - \ \bar{x})^2}{n - 1}$$

where:


  3. Covariance - Population

$$Cov(X, Y) = \frac{\sum (x_i \ - \ \mu_x)(y_i \ - \ \mu_y)}{N}$$

where:


  4. Covariance - Sample

$$Cov(X, Y) = \frac{\sum (x_i \ - \ \bar{x})(y_i \ - \ \bar{y})}{n - 1}$$

where:


Calculate the sample variance and covariance from the centered data:

$$Var(\text{Math}) = \frac{1^2 + (-1)^2 + 0^2}{2} = 1 \qquad Var(\text{Physics}) = \frac{(-1.33)^2 + (-0.33)^2 + (1.67)^2}{2} \approx 2.333$$

$$Cov(\text{Math, Physics}) = \frac{(1)(-1.33) + (-1)(-0.33) + (0)(1.67)}{2} = -0.5$$

So, the covariance matrix is:

$$S = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 2.333 \end{bmatrix}$$

4. Find eigenvalues and eigenvectors

From vector spaces, we know that the characteristic equation:

$$\det(S \ - \ \lambda I) = 0 \ \Rightarrow \ \begin{vmatrix} 1 - \lambda & -0.5 \\ -0.5 & 2.333 - \lambda \end{vmatrix} = 0$$

Solving this determinant gives us the equation:

$$\lambda^2 \ - \ 3.333\lambda \ + \ 2.083 = 0$$

Solving the equation yields the eigenvalues. This yields (rounded):

$$\lambda_1 \approx 2.5 \qquad \lambda_2 \approx 0.83$$

Eigenvectors are found by plugging these values back in. Geometrically, these give you the directions of highest and next-highest variance.


5. Choose the Principal Components

The principal component with eigenvalue 2.5 captures more variance than the one with 0.83, so you'd use only the first if you wanted to reduce dimensions to 1D.


6. Project the Data (Optional)

You “project” each student’s vector onto the direction(s) of the principal component(s). This gives new “scores” (principal components), which can replace the two original columns with one (if reducing to 1D).
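
To double-check the numbers above, here is a small NumPy sketch (my own verification, not part of the original notes) that reproduces the covariance matrix, the eigenvalues, and the 1-D projection:

```python
import numpy as np

# Math and Physics scores for the three students
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 3.0]])

# Mean-center the data
Xc = X - X.mean(axis=0)

# Sample covariance matrix (rowvar=False: columns are the features)
S = np.cov(Xc, rowvar=False)
print(S)              # [[ 1.   -0.5 ], [-0.5   2.333]]

# Eigen-decomposition (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(S)
print(eigvals)        # [0.833..., 2.5] in ascending order

# Project onto the top principal component (largest eigenvalue)
pc1 = eigvecs[:, -1]
print(Xc @ pc1)       # one 1-D "score" per student
```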


Advantages of PCA


Limitations of PCA


When to Use PCA

Use PCA when:

Avoid PCA when:


Kernel PCA

(By this point you should know what a Kernel does)

Why Do We Need Kernel PCA?


Key Idea

  1. Map the original data to a higher-dimensional space using a non-linear function (called φ).
  2. Perform standard PCA in this new space.

The catch: We never explicitly compute this mapping (as it can be very high- or infinite-dimensional). Instead, we use a “kernel function” that allows us to do all computations needed for PCA implicitly and efficiently.


How Does It Work? (Step-by-Step)

1. Choose a Kernel Function

A kernel computes the similarity between data points as if you had mapped them into a higher-dimensional space:

The Gaussian/RBF kernel is most popular for non-linear problems.

2. Compute the Kernel Matrix

For a dataset with n points, compute an $n \times n$ matrix $K$ where $K_{ij} = K(x_i, x_j)$. Each entry is the similarity (in the implicitly mapped space) between examples $x_i$ and $x_j$.

3. Center the Kernel Matrix

Make the kernel matrix "zero mean", just as we center data for regular PCA. (There is a formula for this, but most libraries handle it automatically.)

4. Perform Eigen-Decomposition

Compute the eigenvalues and eigenvectors of the centered kernel matrix.

5. Form the New Features

The principal components are the projections of your original data onto the top eigenvectors of the kernel matrix. Choosing the top k components gives you a reduced (but nonlinear) representation of your data.
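
A minimal NumPy sketch of these five steps, assuming an RBF kernel and a generic data matrix `X` (the `gamma` width and the scaling convention are my own choices):

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    n = X.shape[0]

    # Steps 1-2: RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-gamma * sq_dists)

    # Step 3: center the kernel matrix
    one_n = np.ones((n, n)) / n
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Step 4: eigen-decomposition of the centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

    # Step 5: projections onto the top components
    # (scale each eigenvector by sqrt of its eigenvalue, the usual convention)
    return eigvecs[:, :n_components] * np.sqrt(np.maximum(eigvals[:n_components], 0))

X = np.random.default_rng(0).normal(size=(20, 3))
print(kernel_pca(X, n_components=2).shape)  # (20, 2)
```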


Geometric Intuition

Standard PCA fits a straight line or plane to your data, losing information if your data lies on a nonlinear "curve."

Kernel PCA bends this line or plane in a higher-dimensional space, capturing more of the data's true structure.


Advantages of Kernel PCA


Limitations


Summary Table: PCA vs Kernel PCA

| Aspect                 | PCA                         | Kernel PCA              |
| ---------------------- | --------------------------- | ----------------------- |
| Relationships          | Linear                      | Non-linear              |
| Computation            | On covariance matrix        | On kernel (Gram) matrix |
| Interpretability       | Clear (linear combinations) | Usually less clear      |
| Handles non-linearity? | No                          | Yes                     |

Matrix factorization

Matrix factorization is a powerful technique at the core of many machine learning algorithms, especially in the context of unsupervised learning for dimensionality reduction, recommender systems, image processing, and more.

What is Matrix Factorization?

Matrix factorization is the process of expressing a large matrix A as the product of two or more smaller matrices, usually with lower rank.
Mathematically, for a matrix A (say m×n), we factor it as:

$$A \ \approx \ UV^T$$

where :


Why Factorize a Matrix?


Types of Matrix Factorization


Practical Example: Recommender Systems

Suppose you have a user-rating matrix for movies, where rows are users and columns are movies. Most entries are missing (have not rated). Matrix factorization approximates this matrix using:

The dot product $U_i \cdot V_j$ predicts how user i might rate movie j, even if it's missing in the original data.


Applications


Key Points and Strengths


Limitations


Techniques of Matrix Factorization in detail.

Now, we already know L-U factorization, but that doesn't hold much value in ML.

So we need to learn a few more methods.

1. Singular Value Decomposition

One popular use case of this method is within PCA. So it might come handy later on.

https://www.youtube.com/watch?v=vSczTbgc8Rc (must watch the entire video)

This method has quite close ties with symmetric matrices, eigenvalues and eigenvectors.

So, let's say we have this 2×2 matrix.

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}$$

The SVD formula states that the matrix A can be decomposed (or factorized) into the product of 3 matrices:

$$A = U\Sigma V^T$$

Now, let's figure out how to find these individual matrices.

Step 1: Compute $A^TA$ and $AA^T$

The eigenvectors of $AA^T$ give the left singular vectors (the columns of $U$), and the eigenvectors of $A^TA$ give the right singular vectors (the columns of $V$).

So,

$$A^T = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}$$

The transpose is identical to $A$, since $A$ is symmetric.

So,

$$A^TA = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} \times \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} (3\times3 + 2\times2) & (3\times2 + 2\times3) \\ (2\times3 + 3\times2) & (2\times2 + 3\times3) \end{bmatrix} = \begin{bmatrix} 13 & 12 \\ 12 & 13 \end{bmatrix}$$

And since $A$ is symmetric, $AA^T$ gives the same result:

$$AA^T = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} \times \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 13 & 12 \\ 12 & 13 \end{bmatrix}$$

Step 2: Find the eigenvalues and eigenvectors

As we already know by now, eigenvalues are found by this characteristic equation:

$$\det(A \ - \ \lambda I) = 0$$

Here instead of using the original matrix A, we will use either of the matrices from AAT or ATA.

That depends on whether we are calculating U or V first.

For U, we will use the matrix obtained from AAT.

$$\begin{bmatrix} 13 & 12 \\ 12 & 13 \end{bmatrix}$$

as A.

So,

$$\det(A \ - \ \lambda I) = 0 \ \Rightarrow \ \begin{vmatrix} 13 - \lambda & 12 \\ 12 & 13 - \lambda \end{vmatrix} = 0$$

Now,

$$(13 \ - \ \lambda)^2 \ - \ 144 = 0 \ \Rightarrow \ (13 \ - \ \lambda)^2 = 144$$

Taking the square root of both sides:

$$13 \ - \ \lambda = \pm 12$$

Thus, we get two eigenvalues:

$$\lambda_1 = 25 \qquad \lambda_2 = 1$$

These are the squared singular values.

We won't need to find the eigenvalues again using $A^TA$, since in this instance $A^TA$ and $AA^T$ are the same matrix.

Eigenvectors

We have this equation from earlier:

$$13 \ - \ \lambda = \pm 12$$

Now we use this equation to calculate the eigenvectors:

$$(13 \ - \ \lambda)x \ + \ 12y = 0$$

Now you might be asking, why $+12y$ and not $-12y$?

The key insight: the equation comes from the first row of $(M - \lambda I)v = 0$, and the off-diagonal entry in that row is $+12$ in the matrix we used for the characteristic equation:

$$\begin{bmatrix} 13 & 12 \\ 12 & 13 \end{bmatrix}$$

So, for eigenvalue $\lambda_1 = 25$:

$$(13 - \lambda)x + 12y = 0 \ \Rightarrow \ (13 - 25)x + 12y = 0 \ \Rightarrow \ -12x + 12y = 0 \ \Rightarrow \ x = y$$

This leads to eigenvector:

$$\begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

Now, this eigenvector is currently "unnormalized", because its length (or norm) is:

$$\sqrt{1^2 + 1^2} = \sqrt{2}$$

So, to normalize the eigenvector, we divide each element by $\sqrt{2}$:

$$\frac{1}{\sqrt{2}} \times \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

and get:

$$u_1 = \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}$$

Why normalize?

This process—normalizing—makes the eigenvector have length 1. It’s a universal linear algebra convention for consistency, easier comparison, and especially useful in applications like SVD and PCA. While any nonzero multiple of an eigenvector is mathematically valid, the normalized form is preferred in most computations and for clear interpretation, since it always has unit length (magnitude = 1).

So, the trick to normalizing any eigenvector:

  1. Compute its length (norm):
    Calculate the square root of the sum of the squares of its elements.

    • For $\begin{bmatrix} a \\ b \end{bmatrix}$, the length is $\sqrt{a^2 + b^2}$
  2. Divide each element by this length:

    • $\frac{1}{\sqrt{a^2 + b^2}} \times \text{vector}$
    • Example: $\begin{bmatrix} 1 \\ 1 \end{bmatrix} \ \rightarrow \ \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}$

Now, repeating the process for the second eigenvalue $\lambda_2 = 1$:

$$(13 - \lambda)x + 12y = 0 \ \Rightarrow \ (13 - 1)x + 12y = 0 \ \Rightarrow \ 12x + 12y = 0 \ \Rightarrow \ x = -y$$

which gives an eigenvector of:

$$\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

which, when normalized, gives us:

$$u_2 = \begin{bmatrix} 0.7071 \\ -0.7071 \end{bmatrix}$$

Step 3: Compute U

That gives us our U:

$$U = [\,u_1 \ \ u_2\,]$$

which becomes:

$$U = \begin{bmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{bmatrix}$$

Step 4: Compute Σ.

Σ is given by:

$$\Sigma = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}$$

where $\sigma_1$ and $\sigma_2$ are the square roots of the eigenvalues.

So:

$$\sigma_1 = \sqrt{25} = 5 \qquad \sigma_2 = \sqrt{1} = 1$$

So,

$$\Sigma = \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix}$$

Step 5: Compute V and thus VT

For V we would find the eigenvalues and thus eigenvectors all over again by starting off with the matrix of ATA.

But we don't need to do that here since both are the same, so their results would be the same.

So, V would be:

$$V = \begin{bmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{bmatrix}$$

And :

$$V^T = \begin{bmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{bmatrix}$$

No visible change here, since $V$ is itself symmetric.


So, now we have the successful factorization of the matrix:

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}$$

into:

$$A = U\Sigma V^T$$

Step 6: Verify Factorization

We can verify whether the factorization is correct or not, by simply multiplying the 3 matrices.

$$U \times \Sigma = \begin{bmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{bmatrix} \times \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} (3.5355 + 0) & (0 + 0.7071) \\ (3.5355 + 0) & (0 - 0.7071) \end{bmatrix} = \begin{bmatrix} 3.5355 & 0.7071 \\ 3.5355 & -0.7071 \end{bmatrix}$$

Now,

$$U \times \Sigma \times V^T = \begin{bmatrix} 3.5355 & 0.7071 \\ 3.5355 & -0.7071 \end{bmatrix} \times \begin{bmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{bmatrix} = \begin{bmatrix} (2.49995 + 0.49999) & (2.49995 - 0.49999) \\ (2.49995 - 0.49999) & (2.49995 + 0.49999) \end{bmatrix} = \begin{bmatrix} 2.99994 & 1.99996 \\ 1.99996 & 2.99994 \end{bmatrix}$$

And if we round off, we get:

$$\begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}$$

which is the original matrix $A$.
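
You can verify the same decomposition with NumPy (my own check; note that `np.linalg.svd` may return singular vectors with flipped signs, which is equally valid):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 3.0]])

# Full SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)

print(S)                    # [5. 1.], the singular values
print(U @ np.diag(S) @ Vt)  # reconstructs [[3. 2.], [2. 3.]]
```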


Why Is SVD Important for ML?


2. Non-Negative Matrix Factorization(NMF)

Note: This method is an iterative method, so it's more suited to machines than doing by hand.

That's why, I will just provide a brief overview.

What Is NMF?


Why Non-negative?


The Core Idea

Imagine you have data, such as survey results, word counts, or pixel brightness values. NMF attempts to decompose these into hidden “basis parts” (W) and how strongly each original observation expresses each part (H).


Real NMF: How do we find W and H?
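
In practice, $W$ and $H$ are found iteratively, for example with multiplicative updates or alternating minimization, usually by a library. A minimal sketch using scikit-learn's `NMF` (assuming scikit-learn is available; the data below is made up):

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative "observations x features" matrix (made-up counts)
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 4, 4]], dtype=float)

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(X)   # how strongly each row expresses each hidden part
H = model.components_        # the hidden "basis parts"

print(np.round(W @ H, 2))    # approximate reconstruction of X
```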


Applications in Machine Learning


Strengths


Limitations


3. Low-Rank Matrix Factorization (for Recommender Systems)

What Is Low-Rank Matrix Factorization?


Why Is This Useful?


Step-by-Step Example

Imagine you have a simple user-movie rating matrix:

|        | Movie 1 | Movie 2 | Movie 3 |
| ------ | ------- | ------- | ------- |
| User 1 | 5       | ?       | 3       |
| User 2 | 1       | 2       | ?       |
| User 3 | ?       | 5       | 4       |

Here, "?" means unrated; the matrix is incomplete.


Matrix factorization decomposes this into a user-factor matrix $U$ and a movie-factor matrix $V$, so that the full ratings matrix is approximated by $UV^T$.

Optimization is performed (by computer) to minimize the error between known ratings and predicted ones, often using gradient descent or alternating least squares.

Once completed, you can compute $U_i \cdot V_j$ for any user-item pair to predict a rating, even if it was missing.


Real-World Use Case

Netflix:


Strengths


Limitations


Summary

Low-rank matrix factorization is key for collaborative filtering:


Honorable Mention: L-U Factorization

I am just going to dump the content from the numerical analysis notes for this, only up to finding L and U, so that you can write a matrix as a product of the two matrices:

A = LU

LU Factorization Method

https://www.youtube.com/watch?v=dwu2A3-lTGU&list=PLU6SqdYcYsfLrTna7UuaVfGZYkNo0cpVC&index=20

https://www.youtube.com/watch?v=a9S6MMcqxr4 (A better explanation of how to find the L and U matrices in my opinion ).

https://www.youtube.com/watch?v=Tzrg0GgEfZY (continuation, shows how to find the solution of a system of equations using the LU matrices).

The LU factorization method is an iterative method, that is the solving is carried out in iterations.

LU factorization is especially useful when you need to solve multiple systems with the same coefficient matrix A but different right-hand side vectors b, since you factor A only once.

The idea behind LU factorization is to write the coefficient matrix A as the product of a lower triangular matrix L (with 1’s on the diagonal) and an upper triangular matrix U:

A = LU

Once we have this factorization, we solve the system:

Ax = b

by first solving

Ly = b

which is forward substitution

and then solving :

Ux = y

which is backward substitution


So let's work first on understanding how to find :

A = LU

with an example here.

So with this example matrix

Pasted image 20250407115841.png

Pasted image 20250407115900.png

(the starred entries stand for any real number)

or

$$L = \begin{bmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{bmatrix} \quad \text{and} \quad U = \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}$$

Note that the matrix U is in the row-echelon form.


1. Finding U

So we start by finding U.

That can be done by simply using Gauss elimination method.

$$\begin{bmatrix} 2 & 2 & 3 \\ 5 & 9 & 10 \\ 4 & 1 & 2 \end{bmatrix}$$

We set the pivot to R1.

So for R2 :

$$m = \frac{5}{2}$$

So $$R_2 \ \to \ R_2 \ - \ \frac{5}{2}R_1$$

$$\Rightarrow \ R_2 = \left[(5-5), \ (9-5), \ \left(10 - \frac{15}{2}\right)\right] \ \Rightarrow \ R_2 = \left[0, \ 4, \ \frac{5}{2}\right]$$

So the matrix becomes:

$$\begin{bmatrix} 2 & 2 & 3 \\ 0 & 4 & \frac{5}{2} \\ 4 & 1 & 2 \end{bmatrix}$$

Now for R3

m = 2

So $$R_3 \ \to \ R_3 \ - \ 2R_1$$

$$\Rightarrow \ R_3 = [(4-4), \ (1-4), \ (2-6)] \ \Rightarrow \ R_3 = [0, \ -3, \ -4]$$

So the matrix becomes:

$$\begin{bmatrix} 2 & 2 & 3 \\ 0 & 4 & \frac{5}{2} \\ 0 & -3 & -4 \end{bmatrix}$$

Now for the y component of R3, we select 4 as the pivot from R2.

So :

$$m = -\frac{3}{4}$$

Then:

$$R_3 \ \to \ R_3 \ - \ \left(-\frac{3}{4}\right)R_2 = R_3 \ + \ \frac{3}{4}R_2 \ \Rightarrow \ R_3 = \left[0, \ (-3 + 3), \ \left(-4 + \frac{15}{8}\right)\right] \ \Rightarrow \ R_3 = \left[0, \ 0, \ -\frac{17}{8}\right]$$

So the matrix U is :

$$U = \begin{bmatrix} 2 & 2 & 3 \\ 0 & 4 & \frac{5}{2} \\ 0 & 0 & -\frac{17}{8} \end{bmatrix}$$

2. Finding L

Now the matrix L is quite easy to find.

The upper triangular half of L is similar to the identity matrix, where the diagonals are all 1s and the elements above the diagonal line are all zeroes.

Now the lower triangular half is filled with the multipliers m we used while finding U, each placed in the position of the entry it eliminated.

So we currently have L as:

$$\begin{bmatrix} 1 & 0 & 0 \\ \ast & 1 & 0 \\ \ast & \ast & 1 \end{bmatrix}$$

Now we just fill in the gaps using the m values we found for U, in the correct order.

So, for the x component of R2, $m = \frac{5}{2}$:

$$\begin{bmatrix} 1 & 0 & 0 \\ \frac{5}{2} & 1 & 0 \\ \ast & \ast & 1 \end{bmatrix}$$

Next, for the x component of R3, $m = 2$:

$$\begin{bmatrix} 1 & 0 & 0 \\ \frac{5}{2} & 1 & 0 \\ 2 & \ast & 1 \end{bmatrix}$$

And lastly, for the y component of R3, $m = -\frac{3}{4}$:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ \frac{5}{2} & 1 & 0 \\ 2 & -\frac{3}{4} & 1 \end{bmatrix}$$

And this is the L matrix.
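
A quick NumPy check (my own addition) that these two factors really reproduce A:

```python
import numpy as np

L = np.array([[1.0,  0.0,   0.0],
              [2.5,  1.0,   0.0],
              [2.0, -0.75,  1.0]])

U = np.array([[2.0, 2.0,  3.0],
              [0.0, 4.0,  2.5],
              [0.0, 0.0, -2.125]])  # -17/8 = -2.125

print(L @ U)  # gives back [[2, 2, 3], [5, 9, 10], [4, 1, 2]]
```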


Matrix Completion

What is Matrix Completion?

Matrix completion is the process of recovering or “filling in” the missing values of a partially observed matrix.


Why is Matrix Completion Important?


How Does Matrix Completion Work?

  1. Assume a Structure:
    Usually, we assume the underlying matrix is low-rank (it can be approximated by a small number of factors or “patterns”). This is reasonable in real-world settings—user preferences, image backgrounds, etc., often vary along a few “axes.”

  2. Optimization Problem:
    The standard matrix completion task can be stated as:

    Find X that minimizes rank(X) subject to matching the known entries of the original matrix M.

    • But directly minimizing the rank is NP-hard (computationally infeasible).
  3. Convex Relaxation (Nuclear Norm Minimization):

    • Instead, replace rank with nuclear norm (sum of singular values).

    • Solve:

      $$\min_X \ \|X\|_* \quad \text{such that} \quad X_{\text{observed}} = M_{\text{observed}}$$
    • This is convex and efficiently solvable; the filled matrix will have naturally low rank.

  4. Alternating Minimization & Factorization:

    • Alternatively, directly factor the matrix as X=UVT and optimize U and V to best fit the observed values.
  5. Algorithms:

    • Singular Value Thresholding (SVT)
    • Alternating Direction Method of Multipliers (ADMM)
    • Gradient Descent-based methods
    • (And others suited to large and/or sparse data)

Applications


Practical Example

Suppose

$$M = \begin{bmatrix} 5 & ? & 3 \\ ? & 4 & ? \\ 3 & ? & ? \end{bmatrix}$$

Matrix completion tries to estimate the “?” values using the assumption that M has some low-rank structure (e.g., the ratings are mostly explained by a few genres or user types).


The Big Picture


Methods of matrix completion.

1. Gradient Descent

This method is a crucial starting point to get into the deeper echelons of machine learning.

Pre-requisites : Partial differentiation, partial differentiation chain-rule and the formula for calculating the gradient of a given function.

https://www.youtube.com/watch?v=JAf_aSIJryg&list=PLF-vWhgiaXWNi9OuPCbguaPgL67XH7crm&index=2 (Partial derivatives)

https://www.youtube.com/watch?v=XipB_uEexF0&list=PLF-vWhgiaXWNi9OuPCbguaPgL67XH7crm&index=3 (Partial derivatives chain rule)

https://www.youtube.com/watch?v=CnVes9TdnPo&list=PLF-vWhgiaXWNi9OuPCbguaPgL67XH7crm&index=9 (Gradient)

So, let me just go ahead and list the formula for the gradient of a function.

Gradient

The gradient of a scalar field f(x,y,z) is a vector field given by:

$$\nabla f = \left(\frac{\partial f}{\partial x}, \ \frac{\partial f}{\partial y}, \ \frac{\partial f}{\partial z}\right)$$

For a 2D scalar field, that would be: $\nabla f = \left(\frac{\partial f}{\partial x}, \ \frac{\partial f}{\partial y}\right)$

A small example:

Let's say we have a scalar function:

$$\phi = 3x^2y \ - \ y^3z^2$$

We have to find its gradient at the point $(1, -2, -1)$.

Finding each partial derivative:

$$\frac{\partial \phi}{\partial x} = 6xy \qquad \frac{\partial \phi}{\partial y} = 3x^2 - 3y^2z^2 \qquad \frac{\partial \phi}{\partial z} = -2y^3z$$

Final gradient:

$$\nabla\phi = 6xy\,\hat{i} + (3x^2 - 3y^2z^2)\,\hat{j} - 2y^3z\,\hat{k} = (6 \times 1 \times (-2))\,\hat{i} + (3 \times 1^2 - 3 \times (-2)^2 \times (-1)^2)\,\hat{j} - (2 \times (-2)^3 \times (-1))\,\hat{k} = -12\,\hat{i} - 9\,\hat{j} - 16\,\hat{k}$$

What is Gradient Descent?

Gradient Descent is an iterative optimization algorithm widely used in machine learning to minimize loss (cost) functions.

The core idea:

It’s like walking downhill: you keep stepping in the direction that slopes most steeply down, each step being proportional to how steep the surface is beneath your feet.

Now you might be wondering, what is a loss function?

A loss function is like a "scorekeeper" for a machine learning model. It tells us how far off the model’s answers (predictions) are from the real, correct answers.

During training, the model learns by trying to make this loss as small as possible. Think of it as a game where the goal is to get the lowest score!

For example, if you guessed someone’s age as 12 but they’re actually 14, your loss might be 2. If you guess 25 and they're 14, your loss is 11—much worse!

There are many types of loss functions for different tasks, but they all do the same basic job: measure how good or bad the model’s predictions are, and help the model improve by lowering that number.
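
As a tiny illustration (my own snippet), the absolute-error loss used in the age-guessing example looks like this:

```python
def absolute_error(prediction, actual):
    # The "scorekeeper": how far off the guess is from the truth
    return abs(prediction - actual)

print(absolute_error(12, 14))  # 2  (close guess, small loss)
print(absolute_error(25, 14))  # 11 (far guess, much larger loss)
```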


The algorithm of Gradient Descent, with a step-by-step example.

Let's minimize a simple function:

$$f(x) \ = \ (x \ - \ 3)^2$$

You might be wondering how to figure out the minimum of this function.

Simple: it is the value of x that makes the whole function zero, which here is x = 3.

Step 1: Start with an initial guess.

Let's say x0 = 0


Step 2: Calculate gradient (slope)

For our function, the gradient is:

$$\nabla f = \frac{df}{dx} \, \hat{i}$$

which would be:

$$f'(x) = 2(x \ - \ 3) \times (1 \ - \ 0) \ \Rightarrow \ f'(x) = 2(x \ - \ 3)$$

Step 3: Iterate, update the parameter.

The update rule is:

$$x_{\text{new}} \ = \ x_{\text{old}} \ - \ \eta \, \nabla f(x_{\text{old}})$$

where $\eta$ (pronounced "eta") is the learning rate (or step size; let's say 0.1 here).

Here $\nabla f(x_{\text{old}})$ is the gradient of $f$ evaluated at the old value of $x$, and $f$ is the loss function we are trying to minimize.


Step 4: Repeat till you get x very close to its minimum.

Some example iterations would include

Iteration 1: x = 0.6000000000000001
Iteration 2: x = 1.08
Iteration 3: x = 1.464
Iteration 4: x = 1.7711999999999999
Iteration 5: x = 2.01696
...
Iteration 10: x = 2.6778774528
...
Iteration 20: x = 2.9654123548617948
...
Iteration 50: x = 2.9999571825692186
...
Iteration 100: x = 2.9999999993888893
...
Iteration 150: x = 2.9999999999999916
...
Iteration 159: x = 2.999999999999999
Iteration 160: x = 2.999999999999999

I took the liberty to cook up a python script to see exactly in how many iterations we can get as close as possible to the minimum, 3.

It took a total of 159 iterations, to be exact. This is why iterative methods are best left to computers, but the knowledge is still needed, which is why I added this method.

Code (for the curious cat):

```
pip install numpy matplotlib
```

```python
import numpy as np
import matplotlib.pyplot as plt

# Function and its derivative
def f(x):
    return (x - 3) ** 2

def df(x):
    return 2 * (x - 3)

# Gradient descent parameters
x = 0.0  # starting point
learning_rate = 0.1
iterations = 200  # feel free to tweak this parameter

x_history = [x]
f_history = [f(x)]

for i in range(iterations):
    grad = df(x)
    x = x - learning_rate * grad
    x_history.append(x)
    f_history.append(f(x))
    print(f"Iteration {i+1}: x = {x}")

# Plotting
plt.plot(x_history, f_history, marker='o')
plt.title('Gradient Descent on f(x) = (x - 3)^2')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)
plt.show()
```

Step 5 : Plot the iterations.

Again, best left to computers.

Pasted image 20250815134941.png

Or a shortened one:

Pasted image 20250815135554.png


Types of Gradient Descent


How does Gradient-Descent tie to Matrix completion?

  1. Low-Rank Factorization:

    • We approximate the incomplete matrix $M$ as $M \approx UV^T$, where $U$ and $V$ are low-rank factor matrices.
    • These are initialized randomly (or with small values).
  2. Define the Objective (Loss) Function:

    • The goal is to minimize the reconstruction error: the difference between the observed values in $M$ and the corresponding entries in $UV^T$.

    • Only the observed entries are included in the loss calculation.

    $$\text{Loss} \ = \ \sum_{(i,j) \ \in \ \text{observed}} \left(M_{ij} \ - \ [\,UV^T\,]_{ij}\right)^2$$

    (Don't worry about how scary this equation looks. It is a given loss function, not something we need to come up with.)

  3. Gradient Descent Steps:

    • At each iteration, calculate the gradient of the loss with respect to all elements of U and V.
    • Update U and V by moving a small step in the direction that most reduces the loss (as you learned for scalar functions).
  4. Fill in the Blanks:

    • After several iterations, U and V converge to values that make UVT match the observed data well.
    • The missing entries (previously blanks in $M$) are now predicted by the corresponding entries in $UV^T$ (see the sketch below).
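
Here is a minimal NumPy sketch of this procedure under stated assumptions (random initialization, squared error on observed entries only, a fixed learning rate); real systems add regularization and smarter optimizers:

```python
import numpy as np

# Ratings matrix with np.nan marking the unobserved entries
M = np.array([[5.0, np.nan, 3.0],
              [np.nan, 4.0, np.nan],
              [3.0, np.nan, np.nan]])
observed = ~np.isnan(M)

rank, lr, steps = 2, 0.01, 5000
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(M.shape[0], rank))
V = rng.normal(scale=0.1, size=(M.shape[1], rank))

for _ in range(steps):
    # Reconstruction error on the observed entries only
    E = np.where(observed, M - U @ V.T, 0.0)
    # Gradient descent updates for U and V
    U += lr * E @ V
    V += lr * E.T @ U

print(np.round(U @ V.T, 2))  # the "?" positions are now filled with predictions
```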

Key Details


Visualization

This chart shows how the error decreases as the algorithm proceeds—just like the basic gradient descent you experimented with, but applied to the matrix recovery problem:

Gradient Descent iterative convergence in matrix completion

Pasted image 20250815140552.png

See how the error in reconstruction drops over time?


Summary


Generative Models

Now we are slowly getting there.

For people who just want to get this module over with, here's a brief summary:

Latent Factor Models: Readable Notes (Essentials Only)


Mixture Models: Readable Notes (Essentials Only)


Bare Minimum for Exams & ML Progress

Note : This is only to end the module.

A detailed section on Generative Models (latent factor and mixture models, plus a LOT more) will be added in a separate pdf, outside of the syllabus, for those who want to know more in-depth.