Module 3 -- ML Model evaluation, Statistical Learning, Ensemble methods


Index

  1. Evaluating Machine Learning algorithms
  2. Key Evaluation Metrics
  3. Introduction to Statistical Learning Theory
  4. The Key Ideas
  5. The relationship between Expected (True) risk and the Empirical Risk
  6. For the even more curious: What is the Law of Large Numbers (LLN) and how is it related to Statistical Learning?
  7. Ensemble Methods
  8. Boosting
  9. Boosting Algorithms: AdaBoost
  10. Bagging
  11. Bagging vs. Boosting: Key Points
  12. Bagging Algorithm: Random Forests

Evaluating Machine Learning algorithms

1. Why Evaluate?

To understand if your ML model:


2. The Train/Test Split


3. Key Evaluation Metrics

This content is taken from Module 4 of the Semester 6 Data Mining notes: Module 4 -- Data Streams -- Data Mining#Class Imbalance Problem

The evaluation metrics are the same since that's practically a part of ML itself.

In class imbalance, you're usually trying to detect a rare but important event, like fraud (positive class). So, accuracy becomes meaningless. We use more targeted metrics.

1. Using a Confusion Matrix

Pasted image 20250501132738.png

A confusion matrix is a table that displays the performance of a classification model by comparing its predictions to the actual labels of the data, especially highlighting the counts of correct and incorrect predictions for each class. This helps analyze where the model is struggling, particularly with under-represented classes, and is crucial for understanding the model's effectiveness in imbalanced datasets.

A confusion matrix summarizes predictions:

| Truth | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |

A confusion matrix is most often used in binary classification problems; under class imbalance, it shows exactly how the model's predictions are distributed across the two categories, which plain accuracy hides.

Example:
Imagine we have 100 test samples:

- 90 are actually normal transactions (negative class)
- 10 are actual frauds (positive class)

Let's assume the model predicts as follows.

So among the 90 actual normal cases:

- 87 are correctly classified as normal (TN = 87)
- 3 are wrongly flagged as fraud (FP = 3)

And among the 10 actual frauds:

- 5 are correctly caught (TP = 5)
- 5 are missed (FN = 5)

So this is our confusion matrix:

| Truth | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP = 5 | FN = 5 |
| Actual Negative | FP = 3 | TN = 87 |

2. Precision (a.k.a Positive Predictive Value)

Precision answers the important question of:

"Out of all the cases I predicted as positive, how many were actually positive?"

$\text{Precision} = \frac{TP}{TP + FP}$

I mean, you wouldn't want someone to be wrongly arrested, or your mail system to wrongly classify an important email as spam, right?


3. Recall (a.k.a Sensitivity / True Positive Rate)

Recall answers this important question:

"Out of all the actual positives, how many did I catch?"

$\text{Recall} = \frac{TP}{TP + FN}$

You most certainly wouldn't want to miss a fraud detection, or a cancer screening, right?


4. F1 Score (harmonic mean of precision and recall)

"Balance between precision and recall"

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

So in our case, from the confusion matrix, we see that:

| Truth | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP = 5 | FN = 5 |
| Actual Negative | FP = 3 | TN = 87 |

The model detected 5 frauds (TP = 5), but:

- it missed 5 actual frauds (FN = 5)
- it wrongly flagged 3 normal cases as fraud (FP = 3)

Then:

$\text{Precision} = \frac{5}{5 + 3} = \frac{5}{8} = 0.625 \text{ or } 62.5\%$

$\text{Recall} = \frac{5}{5 + 5} = 0.5 \text{ or } 50\%$

$F_1 = \frac{2 \times 0.625 \times 0.5}{0.625 + 0.5} \approx 0.555 \text{ or } 55.5\%$

So what do these metrics mean?

✅ Precision = 62.5%

"Out of all the cases the model predicted as fraud, 62.5% were actually fraud."


📉 Recall = 50%

"Out of all the actual frauds in the data, the model only caught half of them."


⚖️ F1 Score = 55.5%

"The overall balance between catching frauds and avoiding false alarms is just slightly above average."


🔍 Summary Table

| Metric | What it measures | High when... |
| --- | --- | --- |
| Precision | Correctness of positive predictions | FP is low |
| Recall | Coverage of actual positives | FN is low |
| F1 Score | Balance of precision and recall | Both FP and FN are low |

| Metric | Value | Interpretation |
| --- | --- | --- |
| Precision | 62.5% | Somewhat accurate when it does say "fraud," but makes mistakes 37.5% of the time |
| Recall | 50% | Misses half of actual fraud cases, which is risky |
| F1 Score | 55.5% | Model needs tuning; it's mediocre at both catching frauds and being accurate |
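To make the arithmetic concrete, here is a minimal Python sketch that recomputes all three metrics from the confusion-matrix counts above:

```python
# Metrics from raw confusion-matrix counts (values from the fraud example).
TP, FN, FP, TN = 5, 5, 3, 87

precision = TP / (TP + FP)                                  # 5/8  = 0.625
recall    = TP / (TP + FN)                                  # 5/10 = 0.5
f1        = 2 * precision * recall / (precision + recall)   # ~0.5556

print(f"Precision: {precision:.1%}")  # 62.5%
print(f"Recall:    {recall:.1%}")     # 50.0%
print(f"F1 Score:  {f1:.1%}")         # 55.6%
```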

ROC Curve & AUC

Shows the trade-off between the true positive rate and the false positive rate at various classification thresholds; the AUC (Area Under the Curve) summarizes the whole curve into a single number.
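A quick sketch of how these are typically computed, assuming scikit-learn is available (the labels and scores below are illustrative placeholders):

```python
# ROC curve and AUC from true labels and predicted probabilities.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]            # actual labels (illustrative)
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # 1.0 = perfect ranking, 0.5 = random guessing
```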


For Regression:

- Mean Squared Error (MSE): $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

- Mean Absolute Error (MAE): $\frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$

- R² Score: the fraction of variance explained by the model.


4. Cross-Validation

- Split data into K parts ("folds").
- Train on K−1 folds, test on the remaining fold.
- Repeat K times, with each fold used as the test set once; average the results.
- Helps ensure results aren't due to a "lucky" or "unlucky" train/test split.
- Gives a more robust, general performance estimate.
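A sketch of the procedure with scikit-learn (the dataset and model are just placeholders; `cv=5` means 5 folds):

```python
# 5-fold cross-validation: train on 4 folds, test on the 5th, rotate, average.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # one score per fold
print(scores)         # per-fold accuracies
print(scores.mean())  # the averaged, more robust estimate
```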


5. Overfitting & Underfitting


6. Key Strategies for Fair Evaluation


Example.

Let's understand this better with the help of an example.

Let's say we have a dataset of house prices:

| Size (sqft) | Price (₹) |
| --- | --- |
| 1000 | 70,000 |
| 1200 | 80,000 |
| 1300 | 85,000 |
| 1500 | 95,000 |
| 1800 | 115,000 |

Let's split this into train and test:

- Training set: the first four rows (1000 to 1500 sqft)
- Test set: the last row (1800 sqft, ₹115,000)


Fitting a regression model (let's say, linear regression)

Assume we use a simple linear regression model:

Price = a × size + b

Make predictions on the test set. Suppose the fitted parameters come out to a = 50 and b = 20,000; then for the test point:

Predicted Price = (50 × 1800) + 20,000 = 110,000

Calculate Evaluation Metrics

a) Mean Squared Error (MSE)

$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

MSE is the average of the squared differences between actual values and predicted values. It is a common metric for measuring the accuracy of regression models in statistics and machine learning—a lower MSE means better model predictions.

Here's what each symbol means:

- $y_i$: the actual value for test point $i$
- $\hat{y}_i$: the model's predicted value for test point $i$
- $N$: the number of test points

For our test point:

- $y_i = 115{,}000$ (or $y_{true}$)
- $\hat{y}_i = 110{,}000$ (or $y_{predicted}$)
- $N = 1$ (only one test point)
- $(y_i - \hat{y}_i)^2 = (115{,}000 - 110{,}000)^2 = 25{,}000{,}000$

So the MSE is 25,000,000.


b) Mean Absolute Error (MAE)

$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| = \frac{1}{1} \times |115{,}000 - 110{,}000| = 5{,}000$

c) Coefficient of Determination (R²)

$R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$

where $\bar{y}$ is the mean of the actual test prices. With only one test point, $\sum (y - \bar{y})^2 = 0$, so R² is undefined here; it needs at least two test points to be meaningful.


Summary Table

| Metric | Formula | Example Value |
| --- | --- | --- |
| MSE | $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ | 25,000,000 |
| MAE | $\frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i \rvert$ | 5,000 |
| R² Score | $1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$ | Not meaningful here |
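A minimal sketch recomputing MSE and MAE for the single test point from the example above:

```python
# MSE and MAE by hand for the worked example (one test point).
y_true = [115_000]
y_pred = [110_000]

n   = len(y_true)
mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n

print(mse)  # 25000000
print(mae)  # 5000
# R² is skipped: its denominator is the spread of y_true around its mean,
# which is zero for a single test point.
```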

Key Takeaways


Introduction to Statistical Learning Theory

Essential Prerequisites (for those who want to dive even deeper)

  1. Basic Probability & Statistics

    • Random variables, probability distributions (especially joint and conditional)
    • Expectation, variance, and basic properties of estimators
  2. Calculus and Linear Algebra

    • Differentiation, integration (especially for defining and minimizing risk/loss)
    • Vector and matrix arithmetic (for handling high-dimensional data, but not deep theory at first)
  3. Core Machine Learning Workflow Concepts

    • Training set vs. test set (i.i.d. sampling from a distribution)
    • Hypothesis space: the set of candidate functions/models under consideration
    • Loss/risk function: how we measure the “cost” of a particular prediction or model

What is Statistical Learning Theory?

Statistical Learning Theory is the theoretical backbone of modern machine learning. It studies and explains:


The Key Ideas

1. Learning from Data Means Function Approximation


2. Risk (Error) and Loss Functions

Loss Function

We already know about the Loss Function from the Module 2 -- Unsupervised Learning -- Machine Learning#What is Gradient Descent? section.

Still I will re-iterate here again.

A loss function is like a "scorekeeper" for a machine learning model. It tells us how far off the model’s answers (predictions) are from the real, correct answers.

During training, the model learns by trying to make this loss as small as possible. Think of it as a game where the goal is to get the lowest score!

We denote this loss function by $V(f(x), y)$.


Risk Functions

Expected Risk (true risk):

It's defined as a multi-variable integral of the product of the loss function and the (unknown) joint probability distribution:

$I[f] = \int V(f(x), y) \, p(x, y) \, dx \, dy$

Empirical Risk:

$I_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x^{(i)}), y^{(i)})$
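To make the empirical risk concrete, here is a tiny sketch, assuming squared loss and an illustrative hypothesis $f(x) = 2x$ on made-up samples:

```python
# Empirical risk = average loss over the n observed samples.
def V(pred, y):                 # squared loss V(f(x), y) = (f(x) - y)^2
    return (pred - y) ** 2

f = lambda x: 2 * x             # an illustrative hypothesis
samples = [(1, 2.1), (2, 3.8), (3, 6.3)]   # made-up (x, y) pairs

I_emp = sum(V(f(x), y) for x, y in samples) / len(samples)
print(I_emp)  # the sample-based stand-in for the (unknowable) expected risk
```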

3. Empirical Risk Minimization (ERM)


4. Generalization: Why Not Just Memorize the Training Data?


5. Bias-Variance Tradeoff


6. Capacity and Hypothesis Space


Example

Let's dive deeper into this and fully explore the terminologies as well as how to calculate the empirical risk with an example.

Main difference between Expected (True) Risk and Empirical Risk

We already know from Risk Functions:

Expected Risk (true risk):

It's defined as a multi-variable integral of the product of the loss function and the (unknown) joint probability distribution:

$I[f] = \int V(f(x), y) \, p(x, y) \, dx \, dy$

Empirical Risk:

$I_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x^{(i)}), y^{(i)})$

The relationship between Expected (True) risk and the Empirical Risk


Calculating and Minimizing the Empirical Risk

Suppose a regression problem with the hypothesis $f(x) = wx$ (a line through the origin) and the dataset:

| x | y |
| --- | --- |
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |

where $w$ is the weight and $x$ is the input.


Step 1: Calculate Empirical Risk

The empirical risk is calculated using the formula:

$I_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x^{(i)}), y^{(i)})$

Where the loss function $V$ is the squared loss: $(y_i - wx_i)^2$


Now a small detour:

Why this specific loss function?

Generally, for exams and standard ML projects, the loss function is chosen primarily based on the type of machine learning problem, not on the specific model (hypothesis) itself. There are de-facto (case-specific) default losses for most tasks:

General Loss Function Selection Rules

1. Regression Problems

- Default: squared loss / Mean Squared Error (exactly what this example uses).

2. Classification Problems

- Default: cross-entropy (log loss) for probabilistic models; hinge loss for SVMs.

3. Other Specialized Losses

- For example, absolute (L1) loss when robustness to outliers matters, or task-specific losses for ranking and the like.

Exam and Coursework Perspective


In Summary

No need to deeply justify or derive the loss unless asked specifically; simply state the common standard and proceed with your calculations or explanations.


Now, coming back to the example:

And n is the total number of samples, which is 3.

So, the empirical risk becomes:

$R_{emp}(w) = \frac{1}{3} \sum_{i=1}^{3} (y_i - wx_i)^2$

Now, we expand:

$\frac{1}{3}\left[(2 - w \cdot 1)^2 + (4 - w \cdot 2)^2 + (6 - w \cdot 3)^2\right]$

Now this is our current empirical risk.


Step 2: Now, we minimize the empirical risk (Empirical Risk Minimization (ERM)).

The goal of empirical risk minimization (ERM) is to find a model that minimizes this sample-based risk, hoping it also performs well on unseen data.

We need to find the weight $w$ that minimizes $R_{emp}(w)$.

This is the same as solving the linear least squares regression method from

Module 3 -- Time Series Mining#a. Linear Trend (Recap)#How to calculate a and b ?

whose proper formula for this purpose (a line through the origin) is:

$w = \frac{\sum x_i y_i}{\sum x_i^2}$

So, with $\sum x_i y_i = (1)(2) + (2)(4) + (3)(6) = 28$ and $\sum x_i^2 = 1 + 4 + 9 = 14$:

$w = \frac{28}{14} = 2$

What does this mean?

This means that the empirical risk minimizer is h(x) = 2x.


Why?

Because at $w = 2$ every term of the empirical risk vanishes: $(2 - 2 \cdot 1)^2 = (4 - 2 \cdot 2)^2 = (6 - 2 \cdot 3)^2 = 0$, so $R_{emp}(2) = 0$, the smallest value a sum of squares can take. The data lies exactly on the line $y = 2x$.
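A quick sketch verifying the closed-form minimizer and the resulting empirical risk on this dataset:

```python
# Least-squares weight for a line through the origin: w = Σ x_i*y_i / Σ x_i².
xs = [1, 2, 3]
ys = [2, 4, 6]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
R_emp = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(w)      # 2.0 -> the empirical risk minimizer h(x) = 2x
print(R_emp)  # 0.0 -> the data lies exactly on y = 2x
```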


For the even more curious: What is the Law of Large Numbers (LLN) and how is it related to Statistical Learning?

https://www.geeksforgeeks.org/maths/law-of-large-numbers/

This will help you understand how Empirical Risk Minimization (ERM) is based on LLN and how it gets very close (converges) as the number of trials and observations (or experiments) (epochs in training) increases.

The Law of Large Numbers (LLN) is a mathematical theorem in probability and statistics that helps us understand how averages work over time. It states that if an experiment is repeated many times, the average result will approach the expected value more closely.

The Law of Large Numbers states that the average gets closer to the expected (theoretical) value as the number of trials or observations increases.

For example: if you flip a fair coin many times, the proportions of heads and tails will each get closer to 50%.

In the graph below, we can see how the proportion of heads approaches 0.5 as the number of trials increases.

Pasted image 20250826160419.png

Example explaining the law of large numbers: imagine a bag containing blue and red balls, 50% red and 50% blue. If you draw only a handful of balls (with replacement), the observed share of red may be far from 50%; as the number of draws grows, the observed proportion settles ever closer to 50%.

So, in summary:

The more trials you perform, the closer your observed average (or proportion) gets to the true theoretical value.
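A small simulation sketch of this effect (fair coin; the seed is just for reproducibility):

```python
# LLN in action: the observed proportion of heads approaches 0.5 as n grows.
import random

random.seed(42)
for n in (10, 100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>7} flips -> proportion of heads = {heads / n:.4f}")
```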

Recall that in statistical learning, the goal of empirical risk minimization (ERM) is to find a model that minimizes this sample-based risk, hoping it also performs well on unseen data.


Convergence of empirical risk towards real risk due to LLN

The convergence of empirical risk to actual risk stems directly from LLN. As the number of training samples increases, the empirical risk $R_n(f)$ approaches the population risk $R_P(f)$. Thus, minimizing empirical risk leads to models that also have low true risk, provided the sample is large enough and representative. This convergence explains why learning algorithms can generalize better as their training data grow: LLN ensures the reliability of sample averages as proxies for population values.


Limitations of the Law of Large Numbers

If you are not interested in this section then you can skip ahead to Ensemble Methods

Various limitations of the Law of Large Numbers are:


Applications of LLN in Computer Science

The Law of Large Numbers (LLN) is widely used in Computer Science, especially in areas involving randomness, estimation, and simulation. Some of the key applications are:


Ensemble Methods

What are Ensemble Methods?

Ensemble methods in machine learning are techniques that combine multiple individual models, known as base learners, to create a single, more accurate and robust predictive model.

The fundamental principle here is that a group of models working together can outperform any single model, leveraging the strengths and mitigating the weaknesses of individual models.


Methods of Ensemble Learning

1. Boosting

Boosting is an ensemble learning technique that combines multiple weak learners (often simple models like shallow decision trees) into a single strong learner that achieves higher accuracy and reduced bias than any individual weak model.


How Does Boosting Work?


General Steps in Boosting

  1. Initialize Weights:
    All training samples start with equal weights.

  2. First Weak Learner:
    Train a simple model (e.g., decision stump) on the data using these weights.

  3. Update Weights:
    Increase weights for samples misclassified by this model, decrease for correctly classified ones.

  4. Next Weak Learner:
    Train a new model that focuses more on the updated (harder) samples.

  5. Repeat:
    Continue steps 3–4 for a fixed number of rounds or until performance plateaus.

  6. Combine:
    Aggregate the weak models’ outputs (using learned weights) to give the final prediction.


Why Use Boosting?


Key Types of Boosting Algorithms


Boosting Algorithms: AdaBoost

What is AdaBoost?


How AdaBoost Works (Step-by-Step)

Pasted image 20250827141415.png

  1. Initialize Sample Weights:

    • Start with all training data points weighted equally.
  2. Train Weak Learner:

    • Fit a simple classifier (stump) to the data using current weights.
  3. Calculate Error:

    • Measure the error rate (weighted sum of misclassified points).
  4. Calculate Learner Weight:

    • Assign a weight (α) to the model based on how accurate it was; higher for better models.

    • $\alpha = \frac{1}{2} \ln\left(\frac{1 - \text{error}}{\text{error}}\right)$

  5. Update Data Weights:

    • Increase weights for misclassified data.
    • Decrease weights for correctly classified data.
    • This forces the next weak learner to focus on the “hard” cases.
  6. Repeat Steps 2–5:

    • Keep adding weak learners, each time updating weights to emphasize previous errors.
  7. Final Prediction (Ensemble):

    • Combine weak learners’ outputs with their weights α.
    • Final class is usually a weighted majority vote of all learners.

Key AdaBoost Equations

$\text{error}_m = \frac{\sum \text{weights of all misclassified points}}{\sum \text{all weights}}$

$\alpha = \frac{1}{2} \ln\left(\frac{1 - \text{error}}{\text{error}}\right)$

$w_i \leftarrow w_i \cdot e^{\pm\alpha}$ (increased for misclassified points, decreased for correct ones)

Example Explanation

| Instance | Feature | True Label |
| --- | --- | --- |
| A | 1 | +1 |
| B | 2 | +1 |
| C | 3 | -1 |
| D | 4 | -1 |

Step 1: Initialize weights

With 4 instances, each starts at $w_i = \frac{1}{4} = 0.25$.

Step 2: Train first weak learner

Suppose the first stump misclassifies two of the four points. Then:

$\text{error}_m = \frac{0.25 + 0.25}{0.25 + 0.25 + 0.25 + 0.25} = 0.5$

$\alpha_1 = \frac{1}{2} \ln\left(\frac{1 - 0.5}{0.5}\right) = 0.5 \times \ln(1) = 0$

(In practice, the example would use a slightly better learner, but this illustrates the update.)

$w_i = w \cdot e^{\pm\alpha} = 0.25 \times e^{0} = 0.25 \times 1 = 0.25$

So the weights stay the same.
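For completeness, a minimal AdaBoost sketch on the toy table above, assuming scikit-learn (its `AdaBoostClassifier` boosts depth-1 decision stumps by default; the number of rounds is an arbitrary choice):

```python
# AdaBoost on the 4-point toy dataset (instances A-D with one feature).
from sklearn.ensemble import AdaBoostClassifier

X = [[1], [2], [3], [4]]   # the single feature
y = [1, 1, -1, -1]         # true labels for A, B, C, D

model = AdaBoostClassifier(n_estimators=10, random_state=0)
model.fit(X, y)
print(model.predict([[1.5], [3.5]]))  # expected: [ 1 -1]
```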


What Does This Mean?


In a Real AdaBoost Sequence


AdaBoost: Usages


AdaBoost: Limitations


2. Bagging

What is Bagging?


How Does Bagging Work?

  1. Bootstrap Sampling:
    Randomly sample the training set with replacement to create multiple different “bootstrap samples.”

  2. Base Learner Training (in Parallel):
    Train a base learner (usually a decision tree) on each bootstrap sample—these models are independent and can be trained in parallel.

  3. Prediction Aggregation:

    • For classification: Use majority voting (the class predicted by most models is the final prediction).

    • For regression: Use averaging (the average prediction of all models is the final output).

  4. Result:
    The ensemble prediction is more robust (less sensitive to training data noise) than any single model.
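A minimal bagging sketch of the steps above, assuming scikit-learn (`BaggingClassifier` bags decision trees by default; the synthetic dataset is purely illustrative):

```python
# Bagging: independent trees on bootstrap samples, aggregated by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

bag = BaggingClassifier(
    n_estimators=50,   # 50 bootstrap samples -> 50 independent trees
    bootstrap=True,    # sample the training set with replacement
    n_jobs=-1,         # models are independent, so they can train in parallel
    random_state=0,
)
bag.fit(X, y)
print(bag.score(X, y))  # accuracy of the majority-vote ensemble
```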


Advantages of Bagging


Disadvantages of Bagging


Bagging vs. Boosting: Key Points

| Aspect | Bagging | Boosting |
| --- | --- | --- |
| Training | Parallel, independent | Sequential, dependent |
| Data Sampling | Bootstrap sampling | Reweights samples |
| Model Focus | Reduces variance | Reduces bias & variance |
| Typical Use | Random Forest | AdaBoost, XGBoost |
| Output Aggregation | Equal vote (avg) | Weighted vote |

Typical Uses


Bagging Algorithm: Random Forests

What is Random Forest?

https://www.ibm.com/think/topics/random-forest

https://www.geeksforgeeks.org/machine-learning/random-forest-algorithm-in-machine-learning/

Pasted image 20250828125759.png


How Does Random Forest Work?

Pasted image 20250828130027.png

  1. Bootstrap Sampling (Bagging step):

    • For each tree, generate a bootstrap sample by randomly sampling the training data with replacement.
  2. Tree Building:

    • Each tree is trained on its sample.
    • When splitting a node, the tree considers only a random subset of features (this is the key “random” step that makes forests better than just bagging).
  3. Prediction:

    • For a new sample, each tree predicts.
    • Classification: Final prediction is the class with most votes.
    • Regression: Final prediction is the mean of all tree predictions.
  4. Out-of-Bag (OOB) Error:

    • Each tree leaves out about 1/3 of the data (“out-of-bag” data), used for internal validation without a test set.
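A minimal Random Forest sketch of the workflow just described, assuming scikit-learn (synthetic data; the hyperparameters are illustrative):

```python
# Random Forest = bagging + a random feature subset at each split,
# with out-of-bag (OOB) samples doubling as a built-in validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # 100 trees, each on its own bootstrap sample
    max_features="sqrt",  # random subset of features considered per split
    oob_score=True,       # validate each tree on its left-out (~1/3) samples
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)      # OOB accuracy, no separate test set needed
```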

Step-by-Step Example (Classification):

Pasted image 20250828125948.png

Suppose you have a dataset with 100 points.


Advantages of Random Forest


Limitations of Random Forest


Random Forest vs. Bagging vs. Boosting

| Aspect | Random Forest | Bagging | Boosting |
| --- | --- | --- | --- |
| Base Learner | Decision trees (with feature randomness) | Any (usually trees) | Any (usually stumps) |
| Aggregation | Majority vote/average | Majority vote/average | Weighted vote/average |
| Tree Diversity | Yes (features + data) | Some (data only) | Learners focus on errors |
| Overfitting | Lower | Lower | Risk (if overdone) |