Module 5 -- Types of Machine Learning Methods


Index

  Scalable Machine Learning (Online and Distributed Learning)
  1. Online Learning
  2. Distributed Learning
  3. Semi-Supervised Learning (SSL)
  4. Active Learning
  5. Reinforcement Learning (RL)
  6. Bayesian Learning

Scalable Machine Learning (Online and Distributed Learning)

1. Online Learning


Online learning is a learning paradigm where the model updates itself incrementally, processing one example at a time, instead of training on the full dataset at once.
Models are trained sequentially on data that arrives over time, allowing them to update continuously and adapt to new information in real time. This contrasts with batch learning, which trains on the entire dataset at once. Online learning is therefore useful for applications that require dynamic adaptation to evolving data, such as real-time fraud detection or e-commerce recommendations.


1.1 Why Online Learning Exists

Modern systems generate high-velocity data streams such as clickstreams, sensor readings, financial transactions, and recommendation feedback.

Storing all incoming data and retraining from scratch is often impractical.
Online learning solves this by performing training and prediction simultaneously as new examples arrive.


1.2 Core Idea of Online Learning

At each time step $t$:

  1. Receive input $x_t$
  2. Predict $\hat{y}_t$
  3. Observe true label $y_t$
  4. Update the model parameters using the loss on this single example

General weight update rule:

$$w_{t+1} = w_t - \eta \nabla \ell(w_t; x_t, y_t)$$

This is essentially stochastic gradient descent with batch size 1, allowing continuous learning.
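
A minimal sketch of this predict-then-update loop using scikit-learn's SGDClassifier with partial_fit; the synthetic data stream and hyperparameters are illustrative assumptions, not part of the original notes:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Model updated one example at a time (batch size 1).
clf = SGDClassifier(learning_rate="constant", eta0=0.01)
classes = np.array([0, 1])

for t in range(1000):
    # Simulated streaming example (x_t, y_t); replace with a real data stream.
    x_t = rng.normal(size=(1, 5))
    y_t = np.array([int(x_t.sum() > 0)])

    if t > 0:
        y_hat = clf.predict(x_t)                 # predict before the label arrives
    clf.partial_fit(x_t, y_t, classes=classes)   # single-example update
```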


1.3 Online vs Batch Learning

| Feature      | Online Learning      | Batch Learning           |
|--------------|----------------------|--------------------------|
| Data         | Streaming / infinite | Entire dataset available |
| Updates      | Every example        | Full epochs              |
| Memory use   | Very low             | High                     |
| Adaptation   | High (handles drift) | Low                      |
| Update noise | High                 | Low                      |
| Convergence  | Harder               | Easier                   |

1.4 Concept Drift

Concept drift refers to changes in the underlying data distribution over time.

Examples include shifting user preferences in recommendation systems, seasonal buying patterns, and evolving spam or fraud tactics.

Online learning adapts to drift because it updates continuously.


1.5 Types of Online Learning Algorithms

A. Perceptron Algorithm

For binary classification:

$$w_{t+1} = w_t + y_t x_t$$

Uses a "mistake-driven" approach: the weights are updated only when the current example is misclassified.
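
A small numpy sketch of the mistake-driven perceptron update; the simulated stream and true weight vector are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)  # weight vector

for t in range(1000):
    # Simulated streaming example with labels in {+1, -1}.
    x_t = rng.normal(size=5)
    y_t = 1 if x_t @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0 else -1

    y_hat = 1 if w @ x_t >= 0 else -1   # predict
    if y_hat != y_t:                     # update only on a mistake
        w = w + y_t * x_t
```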


B. Online Gradient Descent (OGD)

General update:

$$w_{t+1} = w_t - \eta \nabla_w \ell(w_t; x_t, y_t)$$

Works for regression, classification, and most differentiable models.


C. Winnow Algorithm

Similar in spirit to the perceptron but uses multiplicative weight updates instead of additive ones; it works especially well when only a few of many features are relevant.

D. Passive-Aggressive Algorithms (PA)

These algorithms stay passive (no update) when an example is classified correctly with a sufficient margin, and update aggressively (just enough to correct the loss) when it is not.

Used in online SVM-like methods and streaming classification.
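
A hedged sketch using scikit-learn's PassiveAggressiveClassifier with partial_fit on a simulated stream; the data generator and C value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
clf = PassiveAggressiveClassifier(C=1.0)   # C controls how aggressive updates are
classes = np.array([0, 1])

for t in range(500):
    x_t = rng.normal(size=(1, 10))          # simulated streaming example
    y_t = np.array([int(x_t[0, 0] > 0)])
    clf.partial_fit(x_t, y_t, classes=classes)
```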


1.6 Online Learning in Deep Learning

Streaming SGD / Mini-batch SGD

Although not “pure” online learning, deep networks often use stochastic or small-batch updates, enabling scalable training on large datasets.

Reinforcement Learning (RL)

RL agents learn online by interacting with an environment; updates are made after each step using experience samples.


1.7 Regret in Online Learning

Regret measures how much worse the online algorithm performs compared to the best fixed model in hindsight.

$$R_T = \sum_{t=1}^{T} \ell(w_t; x_t, y_t) \;-\; \min_{w} \sum_{t=1}^{T} \ell(w; x_t, y_t)$$

Good algorithms achieve sub-linear regret, such as $O(\sqrt{T})$.

This means they become competitive over time.


1.8 Advantages of Online Learning


1.9 Limitations of Online Learning


1.10 Real-World Applications


1.11 Online-to-Batch Conversion (Optional but useful)

Online algorithms can be converted to batch models by running them over the training data and then averaging the sequence of weight vectors they produce (or selecting the best iterate on a validation set).

This demonstrates the theoretical power of online learning even in batch settings.


2. Distributed Learning


Distributed learning refers to machine learning techniques where data, computation, or model components are spread across multiple machines (nodes) to enable faster training and to handle very large datasets or models.

It is essential when a single machine cannot store the entire dataset or cannot compute gradients fast enough.


2.1 Why Distributed Learning?

Modern ML workloads involve terabyte-scale datasets, models with billions of parameters, and tight limits on acceptable training time.

Distributed learning enables scalability through parallelism and coordination across machines.


2.2 Types of Parallelism

Distributed ML typically uses two major forms of parallelism: data parallelism and model parallelism.


2.2.1 Data Parallelism

Each worker machine:

  1. Receives a different subset of data
  2. Maintains a local copy of the model
  3. Computes gradients on its local data
  4. Sends gradients to a central coordinator
  5. Parameters are updated and synchronized across all workers

This approach is effective when the model fits on a single machine but the dataset is too large, or gradient computation too slow, for one machine alone.

Gradient Update Concept

If worker $k$ computes gradient $g_k$, the central parameter server or aggregator computes:

$$\bar{g} = \frac{1}{K} \sum_{k=1}^{K} g_k$$

Then updates parameters:

$$w_{t+1} = w_t - \eta \bar{g}$$

This allows multiple machines to train the same model simultaneously.
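
A toy numpy simulation of synchronous data parallelism with K workers averaging their gradients; a real system would use a framework such as PyTorch DistributedDataParallel or Horovod, and the data shards here are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                   # number of workers
w = np.zeros(10)                        # shared model parameters
eta = 0.1

# Each worker holds a different shard of the data.
shards = [(rng.normal(size=(100, 10)), rng.normal(size=100)) for _ in range(K)]

def local_gradient(w, X, y):
    """Gradient of mean squared error on one worker's shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

for step in range(50):
    grads = [local_gradient(w, X, y) for X, y in shards]  # computed in parallel in practice
    g_bar = np.mean(grads, axis=0)                        # aggregation (parameter server / all-reduce)
    w -= eta * g_bar                                      # synchronized update on every worker
```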


2.2.2 Model Parallelism

Used when the model itself is too large to fit on a single machine.

Two main ways to split a model: layer-wise splitting (pipeline parallelism, where different machines hold different layers) and within-layer splitting (tensor parallelism, where a single large layer is partitioned across machines).

Each machine computes part of the forward and backward pass.

Useful for very large neural networks, such as large language models whose parameters do not fit in a single device's memory.


2.3 Key Distributed Architectures

2.3.1 Parameter Server Architecture

A common paradigm in distributed ML: worker nodes compute gradients on their data shards, while one or more parameter servers store the global model and apply updates.

Advantages: conceptually simple, scales to many workers, and supports both synchronous and asynchronous updates.

Disadvantages: the server can become a communication bottleneck and a single point of failure.


2.3.2 All-Reduce Architecture

No central server.
Workers directly communicate with each other using collective operations like all-reduce.


2.4 Synchronous vs Asynchronous Training

2.4.1 Synchronous Distributed Training

All workers compute gradients, and the model is updated only after every worker's gradient has been aggregated.

Pros: updates are consistent and convergence behaves like large-batch training.

Cons: each step is limited by the slowest worker (stragglers), and synchronization adds overhead.


2.4.2 Asynchronous Distributed Training

Workers push gradients and pull parameters independently, without waiting for each other.

Pros: no waiting on stragglers, so hardware is better utilized.

Cons: updates may use stale parameters, making convergence noisier and harder to analyze.


2.5 Communication Challenges

Distributed ML performance is often limited by communication, not computation.

Key issues:

  1. Bandwidth constraints
  2. High latency between machines
  3. Gradient size (very large tensors)
  4. Repeated synchronization overhead

Common solutions include gradient compression and quantization, gradient sparsification, local/periodic updates, and overlapping communication with computation.


2.6 MapReduce for Machine Learning

MapReduce enables large-scale ML by splitting work into two stages:

Map phase: each node independently processes its split of the data and emits partial results (e.g., per-split gradient sums or sufficient statistics).

Reduce phase: partial results are aggregated into a global result (e.g., summed gradients or updated parameters).

Common algorithms adapted to MapReduce include linear regression, logistic regression, k-means, and Naive Bayes, since they decompose into sums over data splits.

Although slower for iterative algorithms, MapReduce pioneered scalable ML and heavily influenced modern systems like Spark.
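
A tiny in-process illustration of the map/reduce decomposition for a gradient computation; this sketch only mimics the pattern, whereas a real job would run on Hadoop or Spark, and the data splits are synthetic assumptions:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
w = np.zeros(5)

# Data partitioned into splits, one per mapper.
splits = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(4)]

def map_phase(split):
    """Mapper: partial gradient (sum form) and example count for one split."""
    X, y = split
    return X.T @ (X @ w - y), len(y)

def reduce_phase(a, b):
    """Reducer: combine partial sums from two mappers."""
    return a[0] + b[0], a[1] + b[1]

grad_sum, n = reduce(reduce_phase, map(map_phase, splits))
full_gradient = 2 * grad_sum / n
```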


2.7 Apache Spark MLlib

Spark MLlib is a distributed ML library built on Resilient Distributed Datasets (RDDs).

Advantages: in-memory computation (much faster than disk-based MapReduce for iterative algorithms), fault tolerance through RDD lineage, and APIs in Python, Scala, Java, and R.

MLlib includes algorithms for classification, regression, clustering, collaborative filtering, and feature-engineering pipelines.

Often seen in industry pipelines.
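
A hedged PySpark sketch of fitting a distributed logistic regression with the DataFrame-based pyspark.ml API; the input path and its LIBSVM format are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("distributed-lr").getOrCreate()

# Assumed: a LIBSVM-formatted file, which loads into 'label' and 'features' columns.
data = spark.read.format("libsvm").load("hdfs:///data/train.libsvm")

lr = LogisticRegression(maxIter=20, regParam=0.01)
model = lr.fit(data)                    # training is distributed across the cluster

predictions = model.transform(data)     # distributed inference
predictions.select("label", "prediction").show(5)

spark.stop()
```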


2.8 Fault Tolerance in Distributed ML

Distributed systems must handle machine failures gracefully.

Common techniques include periodic checkpointing of model parameters, data and parameter replication, lineage-based recomputation (as in Spark), and restarting or reassigning failed workers.

Fault tolerance is crucial because training jobs may run for days.


Exam answers only need the names + 1-line explanation.


2.10 Advantages of Distributed Learning


2.11 Limitations of Distributed Learning


2.12 Real-World Applications


3. Semi-Supervised Learning (SSL)


Semi-supervised learning is a machine learning paradigm where a model is trained using a small amount of labeled data together with a large amount of unlabeled data.

It lies between supervised and unsupervised learning and is extremely useful when labeled data is expensive or difficult to obtain, but unlabeled data is plentiful.


3.1 Why Semi-Supervised Learning?

In real-world scenarios, labels are expensive, slow, or require domain experts to produce, while unlabeled data is abundant and cheap to collect.

Examples include medical images (annotation requires radiologists), web documents, speech recordings, and user-generated content.

Semi-supervised learning uses unlabeled data structure to improve accuracy, reduce overfitting, and learn better decision boundaries.


3.2 Core Intuitions Behind SSL

Semi-supervised learning is built on three major assumptions.
These are crucial exam points.

1. Smoothness Assumption

If two samples $x_1$ and $x_2$ are close in input space, then their labels $y_1$ and $y_2$ are likely to be the same.

2. Cluster Assumption

Data tends to form clusters.
Points in the same cluster probably have the same label.

3. Manifold Assumption

High-dimensional data lies on a lower-dimensional manifold.
Learning the geometry of the manifold helps assign labels.

These assumptions allow unlabeled data to guide supervised learning.


3.3 Categories of Semi-Supervised Learning Methods

SSL techniques fall into several major groups.


3.3.1 Self-Training (Self-Learning)

  1. Train a supervised model on labeled data
  2. Use it to predict labels for unlabeled data
  3. Choose high-confidence predictions
  4. Add them to training data
  5. Retrain model
  6. Repeat until convergence

This is simple but effective for text and image classification.
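
A sketch of the self-training loop using scikit-learn's SelfTrainingClassifier, which wraps a base estimator and iteratively adds high-confidence pseudo-labels; the synthetic dataset, unlabeled fraction, and threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Mark most samples as unlabeled: scikit-learn uses -1 for "no label".
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9
y_partial[unlabeled] = -1

base = LogisticRegression(max_iter=1000)
self_training = SelfTrainingClassifier(base, threshold=0.9)  # keep only confident pseudo-labels
self_training.fit(X, y_partial)

print("Labeled fraction used:", (~unlabeled).mean())
print("Accuracy on true labels:", self_training.score(X, y))
```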


3.3.2 Co-Training

Proposed by Blum & Mitchell.

Idea: split the features into two different "views" of the data, train one classifier per view, and let each classifier label its most confident unlabeled examples for the other to learn from.

Works when each view is by itself sufficient to learn the concept and the two views are (approximately) conditionally independent given the label.

Co-training improves learning when views are independent.


3.3.3 Semi-Supervised SVMs (Transductive SVMs)

Goal:
Find a decision boundary that not only separates labeled data but also avoids cutting through high-density regions of unlabeled points.

This enforces the cluster assumption.

Optimization attempts to maximize margin while respecting unlabeled data structure.


3.3.4 Graph-Based Semi-Supervised Learning

Construct a graph in which every sample (labeled or unlabeled) is a node and edges connect similar samples, weighted by similarity (e.g., a k-nearest-neighbour or RBF-kernel graph).

Labels from labeled nodes are propagated to nearby unlabeled nodes through the graph.

Methods include Label Propagation, Label Spreading, and graph min-cut / harmonic-function approaches.

Graph SSL works well when data has strong cluster/graph structure.
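
A short sketch with scikit-learn's LabelSpreading; the toy dataset, kernel, and gamma value are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

# Hide most labels; -1 marks an unlabeled node in the graph.
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.95] = -1

model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y_partial)            # propagates labels over the similarity graph

print("Accuracy on all points:", (model.transduction_ == y).mean())
```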


3.3.5 Generative Models

Fit a probability model to both labeled and unlabeled data.

Example: Gaussian Mixture Models (GMMs)

  1. Fit GMM on all data
  2. Infer which component likely corresponds to which class
  3. Use EM algorithm to refine parameters

Generative SSL assumes a model like:

$$p(x, y) = p(y)\, p(x \mid y)$$

Works only when generative assumptions are valid.


3.3.6 Consistency Regularization Methods (Modern SSL)

Modern deep learning-based SSL techniques enforce:

A model should output similar predictions for perturbed versions of the same input.

Examples of perturbations: random data augmentation, input noise, and dropout.

Used in modern algorithms like the Π-Model, Mean Teacher, MixMatch, and FixMatch.

These methods dominate state-of-the-art SSL performance on images and text.


3.4 Pseudo-Labeling (Important in exams)

A very popular and simple SSL technique:

  1. Use supervised model to generate pseudo-labels for unlabeled data
  2. Add pseudo-labeled data to training set
  3. Retrain

Works extremely well with deep neural networks when unlabeled data is abundant.


3.5 Loss Functions in Semi-Supervised Learning

Most SSL methods combine:

Supervised Loss (from labeled samples)

$$L_s = \ell(f(x_l), y_l)$$

Unsupervised Loss (from unlabeled samples)

Often based on consistency between predictions on perturbed inputs, entropy minimization, or agreement with pseudo-labels.

General combined objective:

$$L = L_s + \lambda L_u$$

where $\lambda$ balances labeled vs unlabeled contributions.
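
A numpy sketch of the combined objective, using cross-entropy for the supervised part and a consistency (mean squared difference) penalty for the unsupervised part; the probability arrays and the value of λ are illustrative stand-ins for model outputs:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Supervised loss L_s on labeled samples."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def consistency(probs_clean, probs_perturbed):
    """Unsupervised loss L_u: predictions should agree under perturbation."""
    return np.mean((probs_clean - probs_perturbed) ** 2)

# Example predicted class-probability arrays (would come from the model).
p_labeled = np.array([[0.9, 0.1], [0.2, 0.8]])
y_labeled = np.array([0, 1])
p_unlabeled = np.array([[0.7, 0.3], [0.4, 0.6]])
p_unlabeled_aug = np.array([[0.6, 0.4], [0.5, 0.5]])  # same inputs, perturbed

lam = 0.5  # balances labeled vs unlabeled contributions
loss = cross_entropy(p_labeled, y_labeled) + lam * consistency(p_unlabeled, p_unlabeled_aug)
print(loss)
```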


3.6 Benefits of Semi-Supervised Learning


3.7 Limitations of SSL


3.8 Real World Applications

Semi-supervised learning is heavily used in industrial pipelines where labeling is expensive but data availability is massive.


4. Active Learning

Active Learning (AL) is a machine learning approach where the model is allowed to choose which data points should be labeled, with the goal of achieving high accuracy using as few labeled examples as possible.

This is extremely useful when unlabeled data is abundant, but labeling is expensive, requiring human experts.


4.1 Why Active Learning?

Labeling can be costly:

Active learning minimizes labeling cost by selecting only the most informative samples for annotation.


4.2 The Active Learning Loop

The typical AL pipeline:

  1. Start with a small labeled dataset L
  2. Train a model
  3. Use the model to analyze an unlabeled pool U
  4. Select the most informative samples from U
  5. Query a human (oracle) to label them
  6. Add them to L and retrain
  7. Repeat until performance is sufficient

This iterative process concentrates labeling effort where it matters most.


4.3 Query Strategies in Active Learning

The central idea of AL is the query strategy—how the model chooses which samples to label.

4.3.1 Uncertainty Sampling

The model queries samples where it is least confident.

Common uncertainty measures:

A. Least Confidence

Choose sample with lowest predicted probability for its predicted class:

$$x^{*} = \arg\max_{x} \left( 1 - \max_{y} P(y \mid x) \right)$$

B. Margin Sampling

Choose sample with smallest difference between top two class probabilities:

$$x^{*} = \arg\min_{x} \left[ P(y_1 \mid x) - P(y_2 \mid x) \right]$$

C. Entropy-Based Sampling

Choose sample with highest prediction entropy:

$$H(x) = -\sum_{y} P(y \mid x) \log P(y \mid x)$$

Used widely for classification and deep learning.
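
A numpy sketch computing all three uncertainty scores for a pool of unlabeled samples, given predicted class probabilities from any trained model; the probability matrix here is a stand-in:

```python
import numpy as np

# probs[i, c] = model's predicted P(class c | sample i) for the unlabeled pool.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.50, 0.45, 0.05],
])

least_confidence = 1.0 - probs.max(axis=1)                 # A. least confidence
sorted_p = np.sort(probs, axis=1)[:, ::-1]
margin = sorted_p[:, 0] - sorted_p[:, 1]                   # B. margin (smaller = more uncertain)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # C. entropy

query_idx = int(np.argmax(entropy))   # query the most uncertain sample (by entropy)
print(least_confidence, margin, entropy, query_idx)
```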


4.3.2 Query-by-Committee (QBC)

Maintain a committee of models trained on current labeled data.

Process: each committee member predicts a label for every unlabeled sample, and the samples on which members disagree the most are queried.

Disagreement measures include vote entropy and the KL divergence between each member's prediction and the committee consensus.

QBC works well when multiple hypotheses explain the data.


4.3.3 Expected Model Change

Select samples that would cause the largest change in model parameters if labeled and used for training.

Intuition:
Label points that will significantly improve the model.


4.3.4 Expected Error Reduction

Choose samples expected to reduce future generalization error the most.

This method is theoretically strong but computationally expensive.


4.3.5 Density-Weighted Sampling

Uncertainty alone may pick outliers.
Density-weighted AL therefore considers both how uncertain a sample is and how representative it is of the underlying data distribution.

Idea:
Select uncertain samples in high-density regions, avoiding noise/outliers.


4.4 Active Learning Settings

Active learning can be applied in different settings depending on how data is accessed.

4.4.1 Pool-Based Active Learning

Most common setting: a large static pool of unlabeled data is available, the model scores every pool member, and the top-ranked samples are sent for labeling.


4.4.2 Stream-Based Active Learning

Data arrives as a stream, and the learner must decide for each incoming sample whether to query its label or discard it.

The decision is often based on an uncertainty threshold.


4.4.3 Membership Query Synthesis

The model generates synthetic examples and asks for labels.

This resembles adversarial example generation.

Rare in practice because queries must be human-understandable.


4.5 Active Learning with Deep Learning (Deep AL)

Combining deep learning with active learning requires uncertainty estimates and selection strategies that scale to large models and datasets.

Common approaches:

  1. Dropout-based uncertainty estimation
    • Use Monte Carlo dropout to estimate confidence
  2. Embedding-based clustering
    • Select representative & uncertain samples
  3. Consistency-based methods
    • Leverage semi-supervised learning ideas
  4. Adversarial Active Learning
    • Use adversarial perturbations to find uncertain regions

Deep AL is used in vision, NLP, and medical imaging.
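
A hedged PyTorch sketch of Monte Carlo dropout for uncertainty estimation: dropout is kept active at inference and predictions are averaged over several stochastic forward passes. The architecture, number of passes, and pool data are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, n_passes=20):
    """Average softmax predictions over stochastic forward passes with dropout on."""
    model.train()  # keeps dropout active; assumes no batch-norm layers to worry about
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_passes)])
    return probs.mean(dim=0), probs.std(dim=0)   # predictive mean and a simple uncertainty proxy

x_pool = torch.randn(5, 20)                       # unlabeled pool (illustrative)
mean_p, std_p = mc_dropout_predict(model, x_pool)
query_idx = std_p.max(dim=-1).values.argmax()     # query the most uncertain sample
```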


4.6 Stopping Criteria in Active Learning

When to stop querying?

Common criteria: the labeling budget is exhausted, validation performance plateaus, or model uncertainty on the remaining pool drops below a threshold.


4.7 Advantages of Active Learning


4.8 Limitations of Active Learning


4.9 Real-World Applications

Active learning is heavily used in domains where labels are expensive but unlabeled data is plentiful.


5. Reinforcement Learning (RL)


Reinforcement Learning is a learning paradigm where an agent interacts with an environment to achieve a goal. The agent learns by receiving rewards (positive or negative) and tries to maximize the cumulative reward over time by choosing optimal actions.

Unlike supervised learning (has labels) or unsupervised learning (no labels), RL learns from trial and error interactions.


5.1 Core Components of an RL System

Reinforcement Learning problems are formally defined as a Markov Decision Process (MDP).

An MDP consists of a set of states $S$, a set of actions $A$, transition probabilities $P(s' \mid s, a)$, a reward function $R(s, a)$, and a discount factor $\gamma \in [0, 1]$.

Goal:
Learn a policy $\pi(a \mid s)$ that maximizes expected return.


5.2 Return and Value Functions

Return

Total discounted reward from time t:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

State Value Function

Expected return from state s following policy π:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]$$

Action Value Function (Q-value)

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]$$

These values help the agent decide the best actions.

Up to this point, the equations are important; the equations that follow are optional from the exam point of view.


5.3 Categories of RL Methods

1. Model-Based RL

The agent learns (or is given) a model of the environment's transition and reward dynamics and uses it for planning.

Useful for settings where real interactions are expensive and planning ahead pays off, such as board games with known rules or robotics simulators.


2. Model-Free RL

The agent learns directly from experience, without learning transition models.

Two main categories:

A. Value-Based Methods

Learn $V(s)$ or $Q(s, a)$.

Examples: Q-learning, SARSA, and Deep Q-Networks (DQN).

B. Policy-Based Methods

Directly learn the policy $\pi(a \mid s)$.

Examples: REINFORCE and other policy gradient methods (e.g., PPO).


5.4 Temporal Difference Learning (TD)

TD Learning updates values using bootstrapping:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

TD is used in Q-learning, SARSA, and actor-critic methods.

Key advantage: the agent can learn after every step, without waiting for the episode to end and without a model of the environment.


5.5 Q-Learning (Off-Policy)

Q-learning aims to learn the optimal action-value function $Q^{*}(s, a)$.

Update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Properties: off-policy (it learns about the greedy policy while following an exploratory one) and, in the tabular case, converges to $Q^{*}$ under standard conditions.

Used heavily in game playing, grid-world and control tasks, and as the foundation of Deep Q-Networks.
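
A compact tabular Q-learning sketch with ε-greedy exploration; the environment is any Gymnasium-style discrete environment, and the FrozenLake name and hyperparameters are illustrative assumptions:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")                     # assumed example environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated

        # Q-learning update: bootstrap with the max over next-state actions (off-policy)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not terminated) - Q[s, a])
        s = s_next
```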


5.6 SARSA (On-Policy)

Update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

Difference vs Q-learning: SARSA bootstraps with the value of the action actually taken next ($a_{t+1}$, chosen by the current policy), making it on-policy, whereas Q-learning bootstraps with the maximum over next actions (off-policy).


5.7 Deep Q-Learning (DQN)

DQN extends Q-learning using neural networks:

$$Q(s, a; \theta)$$

Key innovations that made deep RL succeed:

  1. Experience Replay

    • Stores transition tuples $(s, a, r, s')$
    • Samples random minibatches
    • Breaks correlation between consecutive samples
  2. Target Network

    • A copy of the Q-network with parameters $\theta^{-}$
    • Updated slowly
    • Stabilizes learning
  3. Exploration via $\epsilon$-greedy

    • Take random actions with probability $\epsilon$
    • Reduce $\epsilon$ over time

DQN achievements include human-level performance on many Atari 2600 games learned directly from pixel input.


5.8 Policy Gradient Methods

Policy gradients optimize the policy directly.

Objective:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ G_t \right]$$

Gradient update:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Advantages: naturally handle continuous action spaces and stochastic policies, and do not require an argmax over actions at decision time.

Examples: REINFORCE, TRPO, and PPO.


5.9 Actor-Critic Methods

Combine the best of value-based and policy-based methods.

Actor: learns the policy $\pi(a \mid s)$
Critic: learns the value function $V(s)$ or $Q(s, a)$

The critic guides the actor's updates, making learning faster and more stable.

Used widely in robotics and continuous control.


5.10 Exploration vs Exploitation

Key challenge in RL: the agent must balance exploring new actions (to discover better rewards) against exploiting actions already known to work well.

Common strategies: $\epsilon$-greedy, softmax/Boltzmann exploration, upper confidence bounds (UCB), and entropy bonuses in policy-gradient methods.


5.11 Reward Engineering

Rewards must be designed carefully to avoid reward hacking (the agent exploiting loopholes in the reward), unintended behaviour, and overly sparse signals that make learning slow.

Good reward design often makes or breaks real RL agents.


5.12 Challenges in RL


5.13 Applications of Reinforcement Learning

RL is used wherever decisions must be learned from interaction rather than labeled examples.


6. Bayesian Learning

Refer to Module 1 -- Supervised Learning -- Machine Learning#3. Naive Bayes Classifier first.

After that:

Variants of Naive Bayes

Gaussian NB (continuous features), Multinomial NB (count features such as word counts), and Bernoulli NB (binary features); they differ only in how $P(x_i \mid y)$ is modeled.

Numerical stability (use logs)

Multiplying many probabilities leads to numerical underflow. Compute log-probabilities instead:

$$\log P(y \mid x) \propto \log P(y) + \sum_i \log P(x_i \mid y)$$

General smoothing (Lidstone)

$$P(w \mid y) = \frac{\text{count}(w, y) + \alpha}{\text{total\_words}_y + \alpha V}$$

where $\alpha = 1$ gives Laplace smoothing and $V$ is the vocabulary size.
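
A minimal sketch of multinomial Naive Bayes scoring with Lidstone smoothing and log-probabilities; the tiny corpus and the α value are illustrative assumptions:

```python
import math
from collections import Counter

# Toy labeled corpus: (tokens, class)
docs = [(["free", "money", "now"], "spam"),
        (["meeting", "at", "noon"], "ham"),
        (["free", "meeting"], "ham")]

alpha = 1.0                                   # alpha = 1 -> Laplace smoothing
vocab = {w for tokens, _ in docs for w in tokens}
V = len(vocab)

class_counts = Counter(c for _, c in docs)
word_counts = {c: Counter() for c in class_counts}
for tokens, c in docs:
    word_counts[c].update(tokens)

def log_posterior(tokens, c):
    """log P(y) + sum_i log P(x_i | y), with Lidstone-smoothed word probabilities."""
    total_words = sum(word_counts[c].values())
    score = math.log(class_counts[c] / len(docs))
    for w in tokens:
        p_w = (word_counts[c][w] + alpha) / (total_words + alpha * V)
        score += math.log(p_w)
    return score

test = ["free", "money"]
pred = max(class_counts, key=lambda c: log_posterior(test, c))
print(pred)
```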

Practical notes & caveats

Evaluation

Report accuracy, precision, recall, and F1; for imbalanced classes, prefer precision/recall/F1 over accuracy. Consider ROC/AUC where applicable.