Machine Learning Important Questions Unit-by-Unit for MAKAUT exams
MAKAUT Machine Learning Exam Revision
Complete Question Bank with Detailed Answers
Units 1-6 | Generated: November 22, 2025 | Exam: November 28, 2025
Unit 1: Supervised Learning & Linear Models
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. Differentiate between "Lazy Learners" and "Eager Learners".
Answer:
- Lazy Learners (e.g., k-NN): Do not build a model during the training phase. They simply store the training data and perform computation (distance calculation) only when a query is made. This makes training fast ($O(1)$) but prediction slow ($O(nd)$, where $n$ is the number of stored examples and $d$ is the dimensionality).
- Eager Learners (e.g., Decision Trees, SVM): Build a generalization model during training before seeing the test data. Training is slower, but prediction is very fast ($O(1)$ or $O(\log n)$).
Key Insight: Lazy learners are instance-based and keep all data in memory, while eager learners extract patterns and discard training data after learning.
Q2. What is the "i.i.d." assumption in Supervised Learning?
Answer:
It stands for Independent and Identically Distributed. It assumes that all data points in the training and test sets are drawn independently from the same underlying probability distribution.
- Independent: $P(x^{(i)}, x^{(j)}) = P(x^{(i)})\,P(x^{(j)})$ for $i \neq j$
- Identically Distributed: every $x^{(i)}$ is drawn from the same distribution $\mathcal{D}$
If this is violated (e.g., in time-series data, distribution shift), standard models often fail because they assume training and test data come from the same source.
Q3. Explain the "Kernel Trick" in SVM in one sentence.
Answer:
The Kernel Trick allows SVM to classify non-linearly separable data by implicitly mapping inputs to a higher-dimensional space using a kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, without ever computing $\phi$ explicitly.
Common Kernels:
- Polynomial: $K(x_i, x_j) = (x_i^T x_j + c)^d$
- RBF: $K(x_i, x_j) = \exp(-\gamma\, ||x_i - x_j||^2)$
Q4. What is Multicollinearity in Linear Regression? Why is it a problem?
Answer:
Multicollinearity occurs when independent variables are highly correlated with each other. This makes the matrix $X^TX$ nearly singular, so the inversion $(X^TX)^{-1}$ in the Normal Equation becomes numerically unstable.
Problems:
- Small changes in data cause huge changes in the estimated coefficients
- Coefficients have large standard errors
- Can't determine which feature truly matters
- Poor generalization
Solutions: Ridge regression (L2), Lasso (L1), or PCA to decorrelate features.
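A minimal scikit-learn sketch (not part of the original notes) showing the effect: with two nearly identical features, ordinary least squares distributes weight erratically between the copies, while Ridge shrinks and stabilizes the coefficients. The synthetic data and `alpha` value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)    # OLS: split depends on tiny differences between x1, x2
print(Ridge(alpha=1.0).fit(X, y).coef_)      # Ridge: weight shared stably between the two copies
```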
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q5. Decision Tree Splitting - Entropy & Information Gain
Question:
Given a dataset $S$ containing positive and negative examples:
- Define Entropy $H(S)$.
- Explain how Information Gain (IG) is calculated for an attribute $A$.
Answer:
- Entropy: A measure of impurity or disorder in the dataset. $$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$ where $p_+$ and $p_-$ are the probabilities of positive and negative classes. If the set is pure (all + or all -), Entropy is 0. If maximally impure (50-50), Entropy is 1.
- Information Gain: Measures the expected reduction in entropy after splitting $S$ on attribute $A$. $$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$ where $S_v$ is the subset of rows where attribute $A$ has value $v$. The algorithm greedily selects the attribute with the highest Information Gain at each node.
Note: C4.5 uses Gain Ratio to avoid bias toward attributes with many values: $$\text{GainRatio}(S, A) = \frac{IG(S, A)}{\text{SplitInfo}(S, A)}, \quad \text{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$
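As a quick illustration, a small NumPy sketch of the two formulas above; the toy Outlook/Play arrays are made up for the example.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast"])
play    = np.array(["no", "no", "yes", "yes", "yes"])
print(information_gain(outlook, play))   # expected reduction in entropy from splitting on Outlook
```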
Q6. Naive Bayes Classifier & Independence Assumption
Question:
Why is the Naive Bayes classifier called "Naive"? Derive the decision rule.
Answer:
- "Naive" Assumption: It assumes that all features are conditionally independent given the class label. $$P(x_1, x_2, \ldots, x_n | y) = P(x_1 | y) \cdot P(x_2 | y) \cdots P(x_n | y)$$
This is rarely true in practice (e.g., words in text are dependent), but it simplifies computation dramatically.
- Decision Rule: Using Bayes' Theorem: $$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$ Because of the independence assumption: $$\hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i \mid y)$$ We predict the class $\hat{y}$ that maximizes this product. In practice, we use log-probabilities to prevent numerical underflow: $$\hat{y} = \arg\max_y \left[\log P(y) + \sum_{i=1}^n \log P(x_i | y)\right]$$
Types: Gaussian NB (continuous features), Multinomial NB (counts), Bernoulli NB (binary features).
Q7. Linear Regression - Least Squares Derivation (Concept)
Question:
Derive the Normal Equation for the Linear Regression parameters $\theta$ by minimizing the least-squares cost.
Answer:
- Hypothesis: $h_\theta(x) = \theta^T x$ (linear model)
- Cost Function (MSE): $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} ||X\theta - y||^2$$
- Minimization: To find the minimum, take the gradient $\nabla_\theta J(\theta)$ and set it to zero. $$\nabla_\theta J(\theta) = X^T(X\theta - y) = 0$$ $$X^TX\theta = X^Ty$$ $$\theta = (X^TX)^{-1}X^Ty$$ This is the closed-form solution (Normal Equation).
Note: For large numbers of features, computing $(X^TX)^{-1}$ costs roughly $O(n^3)$, so Gradient Descent is preferred.
Gradient Descent Update: $$\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)$$
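A short NumPy sketch of the Normal Equation on synthetic data; the data, the bias column, and the `np.linalg.solve` variant are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # prepend a bias column
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.linalg.inv(X.T @ X) @ X.T @ y            # closed-form Normal Equation
theta_stable = np.linalg.solve(X.T @ X, X.T @ y)    # numerically safer: solve instead of invert
print(theta, theta_stable)                           # both close to true_theta
```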
🔴 Section C: The "Overlooked" Topics (High Value)
Q8. Generalized Linear Models (GLM) components
Question:
Standard Linear Regression assumes Gaussian noise. What if the target variable is binary, a count, or otherwise non-Gaussian? Explain the components of a Generalized Linear Model.
Answer:
A GLM generalizes linear regression to different response distributions from the Exponential Family. It consists of three parts:
- Random Component: The probability distribution of the response variable $y$ (from the Exponential Family: Normal, Binomial, Poisson, Gamma, etc.).
- Systematic Component: A linear predictor $\eta = X\beta$ (weighted sum of features).
- Link Function $g(\cdot)$: Connects the mean $\mu = E[y]$ of the distribution to the linear predictor. $$g(\mu) = \eta = X\beta$$ Examples:
  - Linear Regression: Identity link $g(\mu) = \mu$
  - Logistic Regression: Logit link $g(\mu) = \log\frac{\mu}{1-\mu}$
  - Poisson Regression (Counts): Log link $g(\mu) = \log \mu$
Why it matters: GLM unifies many regression models under one framework and allows modeling different types of target variables correctly.
Q9. Multi-class Classification Strategies
Question:
How do you handle multi-class problems using binary classifiers? Compare One-vs-Rest and One-vs-One.
Answer:
- One-vs-Rest (OvR / One-vs-All): Train $K$ binary classifiers. Classifier $i$ distinguishes Class $i$ from "Not Class $i$".
  - Prediction: Choose the class with the highest confidence score.
  - Pros: Fewer classifiers ($K$ total).
  - Cons: Class imbalance (1 positive class vs. $K-1$ negative classes).
- One-vs-One (OvO): Train a classifier for every pair of classes. Total $\frac{K(K-1)}{2}$ classifiers.
  - Prediction: Each classifier votes. Majority wins.
  - Pros: Better for algorithms that scale poorly with data size (like Kernel SVM), as each classifier sees less data. More balanced training sets.
  - Cons: More classifiers to train ($\frac{K(K-1)}{2}$ classifiers for $K$ classes).
Scikit-learn: Uses OvR by default for most algorithms, but OvO for SVM.
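A hedged scikit-learn sketch of both strategies using the `OneVsRestClassifier` and `OneVsOneClassifier` wrappers; the Iris dataset and linear SVM are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                             # 3 classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)     # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)      # K(K-1)/2 classifiers
print(len(ovr.estimators_), len(ovo.estimators_))             # 3 and 3 for K = 3
```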
Q10. Ranking vs. Classification
Question:
In the syllabus "Beyond Binary Classification", what is the Ranking problem?
Answer:
In standard classification, we predict a discrete class label for each instance independently. In Ranking, the goal is to predict a relative ordering over a set of instances.
- Example: A search engine ranking pages. We don't care if a page is "Class 1" or "Class 2"; we care that Page A is more relevant than Page B ($A \succ B$).
- Loss Function: Instead of 0/1 error, we minimize pairwise disagreement (e.g., counting how many pairs are in the wrong order). Common losses include:
  - Pairwise Loss: Penalizes pairs ranked incorrectly
  - Listwise Loss: Optimizes the entire ranking (e.g., NDCG)
- Applications: Information retrieval, recommendation systems, ad placement.
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q11. k-Nearest Neighbors (k-NN) Algorithm
Question: Write the pseudocode for classifying a new point $x_{query}$ using k-NN.
Pseudocode:
Function Predict_KNN(TrainingData, Labels, x_query, k):
Distances = []
# 1. Calculate distances to all training points
For each point x_i in TrainingData:
d = EuclideanDistance(x_i, x_query)
Distances.append((d, Labels[i]))
# 2. Sort and find k Nearest Neighbors
SortedDistances = Sort(Distances, by=distance, ascending)
NearestNeighbors = First k elements of SortedDistances
# 3. Majority Voting
VoteCounts = CountFrequency(Labels in NearestNeighbors)
Prediction = Label with max(VoteCounts)
Return Prediction
Key Points:
- Distance metric: Usually Euclidean, but can be Manhattan, Minkowski
- Choice of $k$: Small $k$ gives high variance, large $k$ gives high bias
- Weighted voting: Can weight neighbours by inverse distance
Q12. Logistic Regression Update Rule (Gradient Descent)
Question: Write the weight update step for Logistic Regression.
Pseudocode:
Initialize weights w = random small values
Initialize learning rate alpha
Repeat until convergence:
For each feature j from 0 to n:
# Calculate gradient for weight w_j
Gradient = (1/m) * Sum over all i: (Sigmoid(w^T * x_i) - y_i) * x_ij
# Update weight
w_j = w_j - alpha * Gradient
# Where Sigmoid(z) = 1 / (1 + exp(-z))
Cost Function: $$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))\right]$$
This is convex, guaranteeing convergence to global minimum.
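A minimal NumPy sketch of this update rule on synthetic data, assuming a bias column has been prepended to X; all values here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

w, alpha = np.zeros(3), 0.1
for _ in range(1000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # (1/m) * X^T (h - y)
    w -= alpha * grad                            # gradient descent step
print(w)
```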
Unit 2: Unsupervised Learning
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. What is the difference between PCA and Linear Discriminant Analysis (LDA)?
Answer:
-
PCA (Unsupervised): Focuses on finding the directions of maximum variance in the dataset, regardless of class labels. It treats the entire dataset as a whole and projects data onto directions that preserve most information.
-
LDA (Supervised): Focuses on finding a feature subspace that maximizes class separability. It tries to push classes as far apart as possible (maximize between-class variance) while keeping points within a class close together (minimize within-class variance).
Objective:
- PCA: Maximize the projected variance $w^T \Sigma w$
- LDA: Maximize the Fisher criterion $\frac{w^T S_B w}{w^T S_W w}$ (between-class scatter over within-class scatter)
Use case: PCA for compression/visualization, LDA for classification preprocessing.
Q2. Define "Latent Factors" in the context of Matrix Factorization.
Answer:
Latent factors are hidden features that are not directly observed but inferred from the data. They represent underlying patterns or concepts.
In a movie recommendation system:
- Latent factors might be "Action level," "Romance level," "Budget," "Actor quality" (not explicitly labeled)
- Users have preferences for these factors (user feature vector $u_i$)
- Items possess these factors (item feature vector $v_j$)
- The predicted rating is the dot product: $\hat{r}_{ij} = u_i \cdot v_j$
Example: User who likes action movies has high weight on "Action" factor; Action movies have high "Action" factor value.
Q3. Why is K-Means sensitive to initialization?
Answer:
K-Means is a greedy algorithm that converges to a local minimum, not necessarily the global minimum. The algorithm uses Lloyd's method which iteratively assigns points and updates centroids.
Problem: If the initial centroids are placed poorly (e.g., two centroids in one natural cluster, none in another), the algorithm may get stuck in a suboptimal solution.
Solution:
- Run K-Means multiple times with different random initializations and pick the best result
- Use K-Means++ initialization: Choose first centroid randomly, then choose subsequent centroids with probability proportional to squared distance from nearest existing centroid
Q4. What is the "Kernel Trick" in Kernel PCA?
Answer:
Standard PCA only finds linear structures (straight principal components). Kernel PCA uses a kernel function $K(x_i, x_j)$ to implicitly map the data into a higher-dimensional feature space and then performs PCA in that space.
This allows it to find non-linear principal components that can capture curved structures (e.g., unrolling a "Swiss roll" dataset, separating concentric circles).
Example: RBF kernel can map 2D circular data to a space where it becomes linearly separable.
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q5. Principal Component Analysis (PCA) Algorithm
Question:
Outline the steps to perform PCA on a dataset $X$ with $n$ samples and $d$ features.
Answer:
-
Standardize the data: Subtract the mean from each feature so the dataset is centered at the origin (zero mean). Optionally scale to unit variance. $$X_{\text{centered}} = X - \mu$$
-
Compute the Covariance Matrix: $$\Sigma = \frac{1}{n-1}X^TX$$
This matrix captures how variables vary together (covariance between all pairs of features).
- Eigen Decomposition: Calculate the eigenvalues ($\lambda$) and eigenvectors ($v$) of the covariance matrix $\Sigma$. $$\Sigma v = \lambda v$$
  - Eigenvectors represent the directions (Principal Components)
  - Eigenvalues represent the magnitude of variance in those directions
- Sort and Select: Sort eigenvalue-eigenvector pairs by decreasing eigenvalue. Select the top $k$ eigenvectors to reduce dimensions from $d$ to $k$.
- Project: Multiply the original centered data by the selected eigenvectors to get the new $k$-dimensional data. $$Z = X_{\text{centered}} \cdot W_k$$ where $W_k$ is the $d \times k$ matrix of top eigenvectors.
Variance Explained: the fraction of total variance retained by the top $k$ components is $\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$.
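A compact NumPy sketch of the steps above on synthetic correlated data; the choice of two retained components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                        # 1. center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]              # 4. sort by decreasing eigenvalue
W = eigvecs[:, order[:2]]                      #    keep the top-2 components
Z = Xc @ W                                     # 5. project to 2 dimensions
print(eigvals[order][:2].sum() / eigvals.sum())  # fraction of variance explained
```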
Q6. Gaussian Mixture Models (GMM) vs. K-Means
Question:
How does GMM differ from K-Means in terms of cluster assignment?
Answer:
-
K-Means (Hard Clustering): Assigns each point to exactly one cluster (the nearest centroid). Assumes clusters are spherical and have equal variance. Uses Euclidean distance.
-
GMM (Soft Clustering): Assumes data is generated by a mixture of Gaussian distributions. It calculates the probability that a point belongs to each cluster.
For point
: $$P(\text{cluster } k | x) = \frac{\pi_k \mathcal{N}(x | \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x | \mu_j, \Sigma_j)}$$ Example: "Point A is 70% Cluster 1 and 30% Cluster 2"
-
Flexibility: GMM can model elliptical clusters and clusters with different sizes/shapes, unlike K-Means which assumes spherical clusters.
When to use: K-Means for speed and simplicity, GMM when clusters have different shapes or you need probabilistic assignments.
Q7. Matrix Completion & Recommender Systems
Question:
Explain the objective function for Matrix Factorization ($R \approx UV^T$) used in recommender systems.
Answer:
We want to approximate the sparse rating matrix $R$ (users × items) as the product of two low-rank matrices: user factors $U$ (rows $u_i$) and item factors $V$ (rows $v_j$), so that $R_{ij} \approx u_i \cdot v_j$.
-
Objective: Minimize the squared error only for observed ratings, plus L2 regularization to prevent overfitting: $$J = \sum_{(i,j) \in \text{Observed}} (R_{ij} - u_i \cdot v_j)^2 + \lambda(||u_i||^2 + ||v_j||^2)$$
The regularization prevents overfitting to the sparse observed ratings.
-
Solution: Since we can't solve this with simple linear algebra (due to missing values), we use iterative methods:
Alternating Least Squares (ALS):
- Fix $V$, solve for $U$ (becomes a linear regression problem)
- Fix $U$, solve for $V$ (becomes a linear regression problem)
- Repeat until convergence
Gradient Descent: $$u_i := u_i - \alpha \frac{\partial J}{\partial u_i}$$ $$v_j := v_j - \alpha \frac{\partial J}{\partial v_j}$$
Prediction: For a missing rating: $\hat{R}_{ij} = u_i \cdot v_j$
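A minimal sketch of the gradient-descent variant above, run on a tiny hand-made rating matrix where 0 marks a missing entry (that encoding, the rank, and the hyperparameters are assumptions for the demo).

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = np.argwhere(R > 0)                    # indices of known ratings

k, lam, alpha = 2, 0.02, 0.01
rng = np.random.default_rng(4)
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

for _ in range(5000):
    i, j = observed[rng.integers(len(observed))]
    ui, vj = U[i].copy(), V[j].copy()
    err = R[i, j] - ui @ vj
    U[i] += alpha * (err * vj - lam * ui)        # gradient step on user vector
    V[j] += alpha * (err * ui - lam * vj)        # gradient step on item vector

print(np.round(U @ V.T, 2))                      # predictions, including the missing cells
```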
🔴 Section C: The "Overlooked" Topics (High Value)
Q8. Expectation-Maximization (EM) Algorithm
Question:
Describe the two steps of the EM algorithm used in GMMs.
Answer:
The EM algorithm is used when we have latent (hidden) variables. In GMM, the hidden variable is "which Gaussian generated this point?"
- E-Step (Expectation): Given the current estimates of cluster parameters (mean $\mu_k$, covariance $\Sigma_k$, and weight $\pi_k$), calculate the probability (responsibility) that each data point belongs to each cluster $k$.
- M-Step (Maximization): Update the cluster parameters ($\mu_k, \Sigma_k, \pi_k$) to maximize the likelihood of the data given the responsibilities calculated in the E-Step.
Analogy: It's like K-Means but with soft assignments. E-step is "Assign points probabilistically", M-step is "Update parameters based on weighted assignments".
Convergence: Guaranteed to converge to a local maximum of the likelihood.
Q9. Non-Negative Matrix Factorization (NMF)
Question:
How is NMF different from SVD/PCA? Where is it used?
Answer:
- Difference: In SVD/PCA, the factors can contain negative numbers, which are hard to interpret physically. In NMF, we enforce that all matrices (Original, Basis, Coefficients) are non-negative: $$V \approx WH$$ where $V \in \mathbb{R}^{m \times n}_{\geq 0}$, $W \in \mathbb{R}^{m \times k}_{\geq 0}$, $H \in \mathbb{R}^{k \times n}_{\geq 0}$.
- Interpretation: This leads to a "parts-based" representation. Since we can only add (no subtraction), the basis vectors represent actual parts.
Examples:
- Face recognition: NMF bases look like "nose," "eyes," "mouth" (additive parts)
- PCA eigenfaces: Look like "ghostly" whole faces (positive and negative cancellations)
- Text: NMF discovers topics where each document is a non-negative combination of topics
-
Use Cases:
- Topic modeling in text (word counts are non-negative)
- Image decomposition
- Audio source separation
- Any data where values are inherently non-negative
Optimization: Solved using multiplicative update rules or gradient descent with non-negativity constraints.
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q10. K-Means Algorithm
Question: Write the pseudocode for the K-Means algorithm.
Function KMeans(Data, k, MaxIterations):
# 1. Initialization (K-Means++)
Centroids = KMeansPlusPlus_Init(Data, k)
# OR: Centroids = Randomly select k points from Data
Iteration = 0
Repeat until Convergence or Iteration == MaxIterations:
# 2. Assignment Step
Clusters = [[] for _ in range(k)] # k empty lists
For each point x in Data:
# Find nearest centroid
MinDist = Infinity
BestCluster = -1
For i from 1 to k:
Dist = EuclideanDistance(x, Centroids[i])
If Dist < MinDist:
MinDist = Dist
BestCluster = i
# Assign point to nearest cluster
Clusters[BestCluster].append(x)
# 3. Update Step
OldCentroids = Copy(Centroids)
For i from 1 to k:
If Clusters[i] is not empty:
Centroids[i] = Mean(Clusters[i]) # Average of all points
# 4. Check Convergence
If ||Centroids - OldCentroids|| < epsilon:
Break
Iteration += 1
Return Centroids, Clusters
Cost Function (Within-cluster Sum of Squares): $$J = \sum_{i=1}^k \sum_{x \in C_i} ||x - \mu_i||^2$$
Q11. Online Learning (Perceptron Update)
Question: Write the update rule for a streaming data point in Online Learning.
Pseudocode:
Initialize weights w = zeros(n_features)
Initialize learning_rate = 0.01
For t = 1, 2, 3... (streaming data):
Receive instance (x_t, y_true)
# 1. Predict
y_pred = sign(dot(w, x_t)) # sign function: +1 or -1
# 2. Update only on mistake
If y_pred != y_true:
w = w + learning_rate * y_true * x_t
# For regression (not classification):
# w = w - learning_rate * (y_pred - y_true) * x_t
Key Difference from Batch: Updates happen immediately after each example, no need to store entire dataset. Good for streaming data or when data doesn't fit in memory.
Unit 3: Model Evaluation & Ensemble Methods
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. What is the "Out-of-Bag" (OOB) Error in Random Forests?
Answer:
In Random Forests, each tree is trained on a bootstrap sample (sampling with replacement, approximately 2/3 of the data). The remaining 1/3 of the data that is not seen by that tree is called the Out-of-Bag (OOB) data.
- We can predict the label of each point using only the trees where that point was OOB
- This acts as a built-in validation set, removing the need for a separate cross-validation step
- OOB error is a good estimate of test error
Advantage: No need to set aside validation data; we get "free" error estimation during training.
Q2. Define "Bias" and "Variance" in the context of Ensemble Learning.
Answer:
-
Bias: Error due to erroneous assumptions (e.g., fitting a straight line to curved data). High bias causes underfitting. Model is too simple.
-
Variance: Error due to sensitivity to small fluctuations in the training set. High variance causes overfitting. Model is too complex and memorizes noise.
Bias-Variance Decomposition: $$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
Ensemble Effects:
- Bagging (e.g., Random Forest) primarily reduces Variance by averaging multiple high-variance models
- Boosting (e.g., AdaBoost, Gradient Boosting) primarily reduces Bias by sequentially correcting errors
Q3. What is the "VC Dimension" (Vapnik-Chervonenkis)?
Answer:
It measures the capacity (complexity) of a hypothesis class. It is defined as the maximum number of points that the model can shatter (perfectly classify for all possible label assignments).
Examples:
- Linear classifier in 2D: VC dimension = 3 (can shatter 3 points in general position, but no set of 4)
- Linear classifier in $d$ dimensions: VC dimension = $d + 1$
Implications:
- Higher VC dimension → more complex model → higher risk of overfitting if data is scarce
- Lower VC dimension → simpler model → risk of underfitting
- Theory: Generalization error is bounded by a term of order $\sqrt{\frac{d_{VC}\log n}{n}}$, where $n$ is the sample size and $d_{VC}$ the VC dimension
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q4. AdaBoost (Adaptive Boosting) Algorithm
Question:
Explain how AdaBoost updates weights. Why does it focus on "hard" examples?
Answer:
AdaBoost combines weak learners (slightly better than random) into a strong ensemble by focusing on mistakes.
Algorithm:
- Initialize: All $m$ training examples get equal weight $w_i = \frac{1}{m}$.
- Iterate for $T$ rounds:
  a. Train a weak learner $h_t$ on the weighted data.
  b. Calculate the weighted error rate: $$\epsilon_t = \frac{\sum_{i: h_t(x_i) \neq y_i} w_i}{\sum_{i} w_i}$$
  c. Calculate learner weight (how much to trust this learner): $$\alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
  Better learners (lower $\epsilon_t$) get higher $\alpha_t$.
  d. Update Sample Weights: $$w_i^{\text{new}} = \begin{cases} w_i^{\text{old}} \cdot e^{\alpha_t} & \text{if misclassified} \\ w_i^{\text{old}} \cdot e^{-\alpha_t} & \text{if correctly classified} \end{cases}$$
  Misclassified ("hard") examples receive exponentially larger weight, forcing the next weak learner to concentrate on them.
- Final Prediction: Weighted vote of all learners, $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.
Q5. Random Forest vs. Bagging
Question:
Random Forest is a type of Bagging. What is the specific "tweak" that makes Random Forest better than standard Bagging with Decision Trees?
Answer:
Standard Bagging with trees suffers because if there is one very strong predictor, all trees in the bag will select it as the top split, making the trees highly correlated. Averaging correlated models doesn't reduce variance much.
- Random Forest Tweak: At each split in the tree, the algorithm is only allowed to consider a random subset of features (typically $m = \sqrt{p}$ features for classification, $m = p/3$ for regression).
- Result: This decorrelates the trees, ensuring they look at different aspects of the data. Different trees will have different strong features at the root, leading to diverse trees.
Variance Reduction: $$\text{Var}(\text{ensemble}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$ where $\sigma^2$ is the variance of a single tree, $\rho$ the correlation between trees, and $B$ the number of trees; lowering $\rho$ lowers the first term.
Benefits: Significantly reduces the variance of the ensemble compared to standard bagging, leading to better generalization.
Q6. Evaluation Metrics: Precision, Recall, and F1-Score
Question:
When would you use Recall over Precision? Give an example. Define F1-Score.
Answer:
Definitions:
- Precision: $\frac{TP}{TP + FP}$. Of all positive predictions, how many were correct?
- Recall (Sensitivity): $\frac{TP}{TP + FN}$. Of all actual positives, how many did we find?
When to use Recall:
Use when the cost of Missing a Positive is high (high FN cost). Examples:
- Cancer Detection: Missing a cancer case is catastrophic. Better to have false alarms (send for more tests) than miss a real case.
- Fraud Detection: Missing fraud is costly. Better to flag some legitimate transactions for review.
When to use Precision:
Use when the cost of a False Alarm is high (high FP cost). Examples:
- Spam Detection: Marking legitimate emails as spam is annoying. Users prefer some spam in inbox over missing important emails.
- Criminal Justice: False conviction is extremely costly.
F1-Score: Harmonic mean of Precision and Recall. $$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
Used when you need a balance between the two or when the class distribution is imbalanced.
F-Beta Score: Generalizes F1 to weight precision/recall differently: $$F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$$ where $\beta > 1$ emphasizes Recall and $\beta < 1$ emphasizes Precision.
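A quick scikit-learn sketch computing these metrics on hypothetical predictions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))       # TP / (TP + FP)
print(recall_score(y_true, y_pred))          # TP / (TP + FN)
print(f1_score(y_true, y_pred))              # harmonic mean of the two
print(fbeta_score(y_true, y_pred, beta=2))   # beta = 2 weights recall more heavily
```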
🔴 Section C: The "Overlooked" Topics (High Value)
Q7. Empirical Risk Minimization (ERM)
Question:
What is the difference between "Empirical Risk" and "Structural Risk"?
Answer:
- Empirical Risk ($R_{\text{emp}}(h) = \frac{1}{n}\sum_{i=1}^n L(h(x_i), y_i)$): The average error on the training dataset. Minimizing this alone leads to overfitting. This is what we can actually compute since we only have training data.
- Structural Risk ($R_{\text{emp}}(h) + \lambda\,\Omega(h)$): Adds a penalty term for model complexity to the Empirical Risk (Regularization). Complexity $\Omega(h)$ can be: number of parameters, norm of weights ($||w||_1$ or $||w||_2^2$), VC dimension, etc.
- True Risk ($R(h) = \mathbb{E}_{(x,y)\sim D}[L(h(x), y)]$): The expected error over the true data distribution $D$. This is what we really want to minimize but can't directly compute.
Goal: We aim to minimize Structural Risk to achieve good generalization (low True Risk). This is the principle behind regularization techniques like Ridge, Lasso, Dropout, etc.
Q8. Stacking (Stacked Generalization)
Question:
How does Stacking differ from Bagging and Boosting?
Answer:
Comparison:
- Bagging/Boosting: Use homogeneous learners (e.g., all Decision Trees)
- Stacking: Uses heterogeneous weak learners (e.g., a Decision Tree, an SVM, a Neural Net, and Logistic Regression)
Key Difference - Meta-Learner:
Instead of simple voting or weighted voting, the predictions of the base learners are used as input features for a "Meta-Model" (usually Logistic Regression or another simple model).
Architecture:
- Level 0 (Base Models): Train diverse models ($h_1, h_2, \ldots, h_M$) on the training data
- Level 1 (Meta-Model): Takes predictions from base models as features
  - Input: $[h_1(x), h_2(x), \ldots, h_M(x)]$
  - Output: Final prediction
Training Process:
- Use cross-validation to generate out-of-fold predictions for training the meta-model (to avoid overfitting)
- Meta-model learns how to best combine the base predictions
Advantage: The meta-model learns which base model to trust for different types of inputs, achieving better performance than simple averaging.
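A hedged scikit-learn sketch of this architecture using `StackingClassifier`; the particular base models, dataset, and `cv=5` setting are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # meta-model trained on base predictions
    cv=5,                                   # out-of-fold predictions feed the meta-model
)
print(cross_val_score(stack, X, y, cv=3).mean())
```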
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q9. Random Forest Algorithm
Question: Write the pseudocode for training a Random Forest.
Pseudocode:
Function TrainRandomForest(Data, NumTrees, m_features):
Forest = []
For t from 1 to NumTrees:
# 1. Bootstrap Sampling (Sample with replacement)
BootstrapData = RandomSampleWithReplacement(Data, size=N)
# 2. Build Tree with Feature Randomization
Tree = BuildTreeWithFeatureRandomness(BootstrapData, m_features)
Add Tree to Forest
Return Forest
Function BuildTreeWithFeatureRandomness(Data, m_features):
Tree = InitializeTree()
Function RecursiveSplit(Node, NodeData):
If StoppingCriterion(NodeData): # Pure or min samples
Node.label = MajorityClass(NodeData)
Return
# Randomly select m features out of total p
SelectedFeatures = RandomSelect(AllFeatures, k=m_features)
# Find best split ONLY among selected features
BestSplit = FindBestSplit(NodeData, SelectedFeatures)
Node.feature = BestSplit.feature
Node.threshold = BestSplit.threshold
LeftData, RightData = Split(NodeData, BestSplit)
Node.left = RecursiveSplit(NewNode, LeftData)
Node.right = RecursiveSplit(NewNode, RightData)
Return Node
Tree.root = RecursiveSplit(RootNode, Data)
Return Tree
Function Predict_RandomForest(Forest, x_new):
Votes = []
For each Tree in Forest:
prediction = Tree.Predict(x_new)
Votes.append(prediction)
Return MajorityVote(Votes) # Or average for regression
Key Parameters:
- NumTrees: Typically 100-500 (more is better but diminishing returns)
- m_features: $\sqrt{p}$ for classification, $p/3$ for regression
Q10. Cross-Validation (K-Fold)
Question: Write the logic for K-Fold Cross-Validation.
Function KFoldCrossValidation(Data, k, Model, Metric):
# Split Data into k equal partitions (folds)
Folds = SplitIntoKFolds(Data, k)
Scores = []
For i from 1 to k:
# Use fold i as test set
TestSet = Folds[i]
# Use all other folds as training set
TrainingSet = Union(Folds[j] for j != i)
# Train model on training set
TrainedModel = Train(Model, TrainingSet)
# Evaluate on test set
Predictions = TrainedModel.Predict(TestSet)
Score = Metric(TestSet.labels, Predictions)
Scores.append(Score)
# Return average performance and standard deviation
Return Mean(Scores), StdDev(Scores)
Variants:
- Stratified K-Fold: Preserves class distribution in each fold (important for imbalanced data)
- Leave-One-Out (LOO): Special case where $k = n$ (each point is a fold)
- Time Series Split: Respects temporal order (no future data in training)
Choosing k:
- Small $k$ (e.g., 5): Faster, higher bias (pessimistic estimate), lower variance
- Large $k$ (e.g., 10): Slower, lower bias, higher variance
- LOO ($k = n$): Lowest bias but highest computational cost and high variance
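A minimal scikit-learn sketch of (stratified) K-Fold evaluation; the model and dataset are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # preserves class ratios per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())                               # average performance +/- spread
```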
Unit 4: Sparse Modeling, Deep Learning & Sequences
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. What is "Sparse Coding"?
Answer:
Sparse coding is an unsupervised method where we learn a dictionary of basis functions (atoms) such that any input signal can be represented as a linear combination of only a few atoms.
Mathematical formulation: $x \approx D\alpha$, where:
- $D$ is the learned dictionary (overcomplete: more atoms than dimensions)
- $\alpha$ is the sparse code (mostly zeros)
Optimization: $$\min_{D, \alpha} ||x - D\alpha||_2^2 + \lambda\,||\alpha||_1$$ where $\lambda$ controls how sparse the code is.
Biological Motivation: Mimics how the V1 area of the human visual cortex represents images using sparse combinations of edge detectors.
Q2. Define "Epoch", "Batch", and "Iteration" in Deep Learning.
Answer:
-
Epoch: One complete pass of the entire training dataset through the neural network. All samples seen once.
-
Batch (Mini-batch): A subset of the training data used to calculate the gradient and update weights once.
-
Iteration: The number of batches needed to complete one epoch.
Example:
- Dataset size: 1000 samples
- Batch size: 100
- Iterations per epoch: $1000 / 100 = 10$
- If training for 50 epochs: $50 \times 10 = 500$ total iterations
Batch Size Trade-offs:
- Small batch (e.g., 32): Noisy gradients, regularization effect, slower but may generalize better
- Large batch (e.g., 1024): Smooth gradients, faster convergence, requires more memory, may overfit
Q3. What is the "Vanishing Gradient" problem in RNNs?
Answer:
In Recurrent Neural Networks, gradients are backpropagated through time (BPTT). When unrolling an RNN for many time steps, we multiply many weight matrices together.
Problem: If the recurrent weight matrix has values (eigenvalues) less than 1, the repeated multiplication shrinks the gradient exponentially toward zero as it flows back through time.
If the values are greater than 1, the gradient instead grows exponentially (exploding gradients).
Consequence: The network stops learning from early time steps, failing to capture long-term dependencies in the sequence (e.g., remembering information from 100 steps ago).
Solutions:
- Use LSTM or GRU units (have gating mechanisms that create gradient highways)
- Gradient clipping
- Better initialization (e.g., Xavier/He)
- Skip connections
Opposite Problem: Exploding gradients (weight values greater than 1), usually handled with gradient clipping.
Q4. Explain "Dropout" as a Regularization technique.
Answer:
Dropout is a technique where, during training, randomly selected neurons are ignored (dropped out) with a probability $p$ (commonly 0.2 to 0.5).
Training:
- For each mini-batch, randomly set activations to 0 with probability $p$
- Backpropagate only through active neurons
- This prevents neurons from co-adapting too much (relying on specific other neurons)
Testing:
- All neurons are used
- Scale outputs by $(1-p)$ to maintain the expected magnitude, OR
- Scale during training by $\frac{1}{1-p}$ (inverted dropout)
Effect:
- Forces the network to learn more robust features
- Acts like training an ensemble of $2^n$ networks (where $n$ is the number of neurons)
- Reduces overfitting
Where to apply: Usually on fully connected layers, less common in convolutional layers.
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q5. Sparse Modeling: L1 vs L2 Regularization (Lasso vs Ridge)
Question:
Why does L1 regularization (Lasso) yield sparse models (feature selection) while L2 (Ridge) does not? Explain geometrically.
Answer:
Optimization Problem:
Both solve: $$\min_w \; \text{Loss}(w) + \lambda \cdot \text{Penalty}(w), \quad \text{Penalty}(w) = ||w||_2^2 \text{ (Ridge) or } ||w||_1 \text{ (Lasso)}$$
This is equivalent to: Minimize $\text{Loss}(w)$ subject to $\text{Penalty}(w) \le t$ (a constrained problem).
Geometric Interpretation:
- L2 (Ridge): Constraint region $||w||_2^2 \le t$ forms a Circle (or hypersphere in higher dimensions)
  - Cost function contours are ellipses
  - Intersection typically occurs at non-zero points
  - All weights shrink but remain non-zero
- L1 (Lasso): Constraint region $||w||_1 \le t$ forms a Diamond (or polyhedron with sharp corners on the axes)
  - The diamond has corners exactly on the axes where some $w_j = 0$
  - Cost contours are likely to touch the diamond at a corner first
  - Result: Some weights become exactly zero
Why corners matter: In high dimensions, polyhedra have exponentially many corners on coordinate axes, making it highly probable that the optimal solution has sparse coefficients.
Update Rules:
- Ridge: $w_j \leftarrow w_j(1 - \alpha\lambda) - \alpha \frac{\partial \text{Loss}}{\partial w_j}$ (proportional shrinkage)
- Lasso: $w_j \leftarrow \text{sign}(w_j)\,\max(|w_j| - \alpha\lambda,\, 0)$ (soft thresholding)
Q6. CNN Architecture: Convolution & Pooling
Question:
Explain the role of the Convolutional Layer and Pooling Layer in a CNN. How do they reduce parameters compared to a Fully Connected Network?
Answer:
1. Convolutional Layer:
Uses Filters (Kernels) that slide over the input to detect local features.
Three Key Properties:
- Local Connectivity: Each neuron only connects to a small region (receptive field), not the entire input
- Parameter Sharing: The same filter is used across the entire image
- Translation Invariance: Detects features regardless of position
Parameter Reduction Example (illustrative sizes):
- Input: $224 \times 224 \times 3$ image
- Fully Connected: $224 \cdot 224 \cdot 3 \cdot 1000 \approx 150$ million parameters for 1000 hidden units
- Convolutional: $64 \cdot (5 \cdot 5 \cdot 3 + 1) \approx 4{,}900$ parameters for 64 filters of size $5 \times 5$
Operation: $$(I * K)(i, j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m, n)$$
2. Pooling Layer (Max/Average):
Downsamples the feature map by taking max (or average) value in a window.
Roles:
- Reduces spatial dimensions (computation and memory)
- Provides translation invariance (small shifts don't change output)
- Controls overfitting by reducing parameters in subsequent layers
- Increases receptive field of higher layers
Example: Max pooling with a $2 \times 2$ window and stride 2:
- Input: $32 \times 32$ feature map
- Output: $16 \times 16$ (75% reduction in size)
Modern Trend: Some architectures (e.g., all-convolutional nets and ResNet-style downsampling blocks) replace pooling with strided convolutions.
Q7. LSTM Networks
Question:
How does an LSTM solve the vanishing gradient problem? Describe the role of the "Forget Gate".
Answer:
Core Idea: LSTM introduces a Cell State ($C_t$), a separate memory pathway that is updated mainly by addition, alongside the usual hidden state $h_t$.
Architecture - Four Components:
- Forget Gate ($f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$): Decides what information to discard from the cell state. Output: 0 = completely forget, 1 = completely keep.
- Input Gate ($i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$): Decides what new candidate information $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ to store.
- Cell State Update: Combines forget and input: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$.
- Output Gate ($o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$): Decides what to output: $h_t = o_t \odot \tanh(C_t)$.
Solution to Vanishing Gradients:
By setting the Forget Gate close to 1, the cell state passes through time almost unchanged (an additive path rather than repeated matrix multiplications), so gradients do not shrink to zero.
Forget Gate Role: Controls what old information to keep vs. discard, allowing the network to "reset" when needed while maintaining important long-term context.
🔴 Section C: The "Overlooked" Topics (High Value)
Q8. Dictionary Learning vs. Fixed Bases (Fourier/Wavelet)
Question:
In Sparse Modeling, how does Dictionary Learning differ from using a fixed basis like Fourier Transform?
Answer:
Fixed Basis (Fourier, Wavelet, DCT):
- Uses a pre-defined mathematical set of functions (e.g., sines/cosines in Fourier)
- Universal: Works well for general signals
- Efficient: Fast transforms available (FFT)
- Limitation: May not be optimal for specific data types (e.g., natural images, speech)
Dictionary Learning:
- The algorithm learns the basis functions (atoms) from the data itself
- Data-adaptive: Optimized for your specific dataset
- Flexible: Can capture domain-specific structures
Optimization: Alternates between two steps:
- Sparse Coding: Fix dictionary $D$, find sparse codes $\alpha_i$ for each sample
- Dictionary Update: Fix codes $\alpha$, update dictionary $D$ subject to $||d_k||_2 = 1$ (normalized atoms)
Result: Often produces much sparser and more meaningful representations.
Examples:
- Images: Learns oriented edges, textures specific to the dataset (vs. generic wavelets)
- Faces: Learns "eye", "nose", "mouth" shapes
- Audio: Learns phonemes or instrument-specific patterns
Algorithm: K-SVD is a popular dictionary learning algorithm.
Q9. Transformers & Self-Attention
Question:
How does the "Self-Attention" mechanism in Transformers handle long-range dependencies better than RNNs?
Answer:
RNN Limitation:
To relate the first word of a sentence to the last, an RNN must pass information step-by-step through hidden states:
- Signal dilutes over many steps (vanishing gradient)
- Path length: $O(n)$ for dependencies $n$ steps apart
- Sequential processing (can't parallelize)
Self-Attention Solution:
Allows every position to look at every other position directly in a single step.
Mechanism:
For each token, compute three vectors:
- Query ($Q$): "What am I looking for?"
- Key ($K$): "What do I contain?"
- Value ($V$): "What information do I carry?"
Attention Score: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Process:
- Compute similarity (dot product) between Query and all Keys
- Normalize with softmax to get attention weights
- Weighted sum of Values
Advantages:
- Path length: $O(1)$ for any distance (direct connection)
- Parallelization: All tokens processed simultaneously
- Global receptive field: Every token can attend to entire sequence from layer 1
- Interpretable: Attention weights show what the model is focusing on
Complexity: $O(n^2 d)$ for sequence length $n$ and dimension $d$; the quadratic cost in sequence length is the main drawback.
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q10. Backpropagation (High-Level Logic)
Question: Write the pseudocode for the Backpropagation algorithm for a simple network.
Pseudocode:
Function Backpropagation(X_batch, Y_batch, Network, LearningRate):
# Network has L layers: Layer[1] to Layer[L]
# ===== FORWARD PASS =====
A[0] = X_batch # Input layer activation
For l from 1 to L:
# Linear transformation
Z[l] = W[l] @ A[l-1] + b[l]
# Activation function
A[l] = Activation(Z[l]) # e.g., ReLU, Sigmoid, Softmax
# Final output
Y_pred = A[L]
# ===== COMPUTE LOSS =====
Loss = LossFunction(Y_pred, Y_batch) # e.g., Cross-Entropy, MSE
# ===== BACKWARD PASS =====
# Initialize gradient for output layer
dA[L] = Gradient_of_Loss_wrt_Output(Y_pred, Y_batch)
For l from L down to 1:
# Gradient through activation function
dZ[l] = dA[l] * Gradient_Activation(Z[l])
# Gradients for parameters
dW[l] = (1/m) * dZ[l] @ A[l-1].T
db[l] = (1/m) * Sum(dZ[l], axis=0)
# Propagate gradient to previous layer
If l > 1:
dA[l-1] = W[l].T @ dZ[l]
# ===== UPDATE PARAMETERS =====
For l from 1 to L:
W[l] = W[l] - LearningRate * dW[l]
b[l] = b[l] - LearningRate * db[l]
Return Loss
Key Points:
- Chain Rule:
- Gradients flow backward from output to input
- Each layer caches forward values for use in backward pass
Q11. CNN Output Size Calculation
Question:
Given an input image of size $W \times W$, a filter of size $F \times F$, stride $S$, and padding $P$, compute the size of the output feature map.
Formula: $$W_{\text{out}} = \frac{W - F + 2P}{S} + 1$$
For Non-Square Inputs: apply the same formula separately to the height and the width.
Examples (illustrative sizes):
Example 1:
- Input: $32 \times 32$
- Filter: $5 \times 5$
- Stride: 1
- Padding: 0
Output is $\frac{32 - 5 + 0}{1} + 1 = 28$, i.e., $28 \times 28$.
Example 2 (Same Padding):
- Input: $32 \times 32$
- Filter: $5 \times 5$
- Stride: 1
- Padding: 2 (to keep the same size)
Output is $\frac{32 - 5 + 4}{1} + 1 = 32$, i.e., $32 \times 32$.
Padding Types:
- Valid: $P = 0$ (no padding, output shrinks)
- Same: $P = \frac{F-1}{2}$ (output same size when $S = 1$)
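The formula above as a small helper function, reproducing the two worked examples.

```python
def conv_output_size(w, f, s=1, p=0):
    """Output width/height for input size w, filter f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 5, s=1, p=0))   # 28: "valid" convolution shrinks the map
print(conv_output_size(32, 5, s=1, p=2))   # 32: "same" padding keeps the size
```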
Unit 5: Scalable & Advanced Machine Learning
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. What is "Online Learning"?
Answer:
Online Learning is a paradigm where the model learns from a stream of data instances one by one (sequentially), updating immediately after each example.
Characteristics:
- Does not require the entire dataset to be available at once
- Updates: $w_{t+1} = w_t - \eta\, \nabla L(w_t;\, x_t, y_t)$ after each incoming instance
- No storage of past data (constant memory)
- Useful for real-time applications or when data is too large to fit in memory
Applications:
- Stock price prediction (continuous stream)
- Spam detection (emails arrive continuously)
- Recommendation systems (user preferences change)
Algorithms: Online Gradient Descent, Perceptron, FTRL (Follow-The-Regularized-Leader)
Vs. Batch Learning: Batch sees all data, trains once. Online sees data sequentially, updates continuously.
Q2. Define "Exploration" vs. "Exploitation" in Reinforcement Learning.
Answer:
-
Exploration: The agent tries new actions to discover their rewards (gathering information). "Trying a new restaurant to see if it's better."
-
Exploitation: The agent chooses the action known to yield the highest reward based on current knowledge. "Going to your favorite restaurant."
Trade-off:
- Pure exploration: Wastes time on bad actions, low immediate rewards
- Pure exploitation: May miss the optimal strategy (gets stuck in local optimum)
Example - Multi-Armed Bandit:
- 10 slot machines, unknown payouts
- Exploration: Try all machines to estimate payouts
- Exploitation: Play the machine with highest observed payout
- Balance: ε-greedy (explore with probability ε, exploit otherwise)
Strategies:
- ε-greedy: Random exploration with probability ε
- UCB (Upper Confidence Bound): Explore actions with high uncertainty
- Thompson Sampling: Bayesian approach
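A small NumPy sketch of the ε-greedy balance on a 3-armed bandit; the payout probabilities are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(5)
true_payouts = np.array([0.2, 0.5, 0.8])         # unknown to the agent
Q = np.zeros(3)                                   # estimated value per arm
counts = np.zeros(3)
epsilon = 0.1

for t in range(10_000):
    # explore with probability epsilon, otherwise exploit the best-known arm
    arm = rng.integers(3) if rng.random() < epsilon else int(np.argmax(Q))
    reward = float(rng.random() < true_payouts[arm])
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]     # incremental mean update

print(Q)   # estimates approach the true payout probabilities
```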
Q3. What is the difference between "Self-Training" and "Co-Training" in Semi-Supervised Learning?
Answer:
Both leverage unlabeled data, but differ in approach:
Self-Training:
- Uses a single model trained on labeled data
- Predicts labels for unlabeled data
- Adds high-confidence predictions to training set (pseudo-labeling)
- Retrains on augmented dataset
- Iterates: Each round, model becomes more confident on more data
- Risk: Can reinforce its own mistakes (confirmation bias)
Co-Training:
- Uses two different views (feature sets) of the data
- Example: Classifying web pages using (1) page text vs. (2) anchor text from links
- Two models train separately on different views
- Each model labels data for the other where it is confident
- Assumption: Views are conditionally independent given the class
- Advantage: Less prone to reinforcing mistakes (views provide independent evidence)
Example:
- Video classification: Model 1 uses video frames, Model 2 uses audio track
- Each model teaches the other on their confident predictions
Q4. What is a "Parameter Server" in Distributed ML?
Answer:
In distributed learning, a Parameter Server is a central node (or cluster) that holds the global model weights.
Architecture:
- Parameter Server: Stores and updates global parameters
- Worker nodes: Each gets a subset of data, calculates gradients
Workflow:
- Workers pull current parameters from server
- Workers compute gradients on their data shard: $g_k = \nabla L(w;\, D_k)$
- Workers push gradients to the server
- Server aggregates updates: $w \leftarrow w - \eta \sum_k g_k$
- Server sends fresh weights back
Advantages:
- Allows training massive models across many machines
- Scales to billions of parameters
Challenges:
- Communication bottleneck (server can be overwhelmed)
- Solutions: Gradient compression, asynchronous updates, multiple parameter servers
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q5. Active Learning: Query Strategies
Question:
The goal of Active Learning is to query the "most informative" samples. Explain Uncertainty Sampling and Query-by-Committee.
Answer:
Active Learning aims to minimize labeling cost by strategically selecting which samples to label.
1. Uncertainty Sampling:
The model queries instances where it is least certain about the class label.
Three Metrics:
a) Least Confidence: query the $x$ with the smallest $\max_y P(y \mid x)$
b) Margin Sampling: query the $x$ with the smallest margin $P(y_1 \mid x) - P(y_2 \mid x)$ between the two most likely classes
c) Entropy: query the $x$ with the largest $-\sum_y P(y \mid x)\log P(y \mid x)$
2. Query-by-Committee (QBC):
- Maintain a committee of diverse models (ensemble with different initializations or architectures)
- Query the instance where committee members disagree the most
Disagreement Metrics:
- Vote Entropy: Entropy of vote distribution
- KL Divergence: Average divergence between member predictions and consensus
Logic: If all members agree, the instance is not informative. Maximum disagreement indicates the decision boundary is uncertain in that region.
Example:
- Committee: 5 decision trees with different splits
- Point A: All 5 vote "Class 1" → Don't query
- Point B: 3 vote "Class 1", 2 vote "Class 2" → Query!
Advantage: More robust than single-model uncertainty, captures model uncertainty.
Q6. Reinforcement Learning: Markov Decision Process (MDP)
Question:
Define the tuple $(S, A, P, R, \gamma)$ that specifies a Markov Decision Process.
Answer:
An MDP is defined by a 5-tuple $(S, A, P, R, \gamma)$:
1. S (States):
The set of all possible situations the agent can be in.
- Example (Grid World): Each cell is a state
- Example (Chess): Each board configuration
2. A (Actions):
The set of all possible moves the agent can make.
- Example (Grid World): {Up, Down, Left, Right}
- Example (Robot): {move forward, turn left, turn right, grasp}
3. P (Transition Probability): $P(s' \mid s, a)$, the probability of landing in state $s'$ after taking action $a$ in state $s$.
- Deterministic: $P(s' \mid s, a) = 1$ for exactly one $s'$, 0 for all others
- Stochastic: probability is distributed over multiple next states
- Markov Property: Next state depends only on current state and action, not history
4. R (Reward Function): $R(s, a, s')$, the immediate reward for the transition.
- Example: +10 for reaching goal, -1 for each step, -100 for hitting obstacle
5. $\gamma$ (Discount Factor):
A value between 0 and 1 that weighs future rewards vs. immediate rewards.
- $\gamma = 0$: Only immediate rewards matter (myopic)
- $\gamma = 1$: All future rewards equally important
- Typical: $\gamma = 0.9$ or $0.99$
Goal: Find a policy $\pi(s) \to a$ that maximizes the expected discounted return $\mathbb{E}\left[\sum_{t} \gamma^t R_t\right]$.
Q7. Bayesian Learning: MAP vs. MLE
Question:
How does Maximum A Posteriori (MAP) estimation differ from Maximum Likelihood Estimation (MLE)? When are they equivalent?
Answer:
Both estimate the parameters $\theta$ of a model from data $D$, but they differ in whether a prior over $\theta$ is used.
MLE (Maximum Likelihood Estimation):
Finds the parameters that maximize the likelihood of the observed data: $$\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta)$$
- Relies solely on observed data
- Treats $\theta$ as a fixed unknown quantity (frequentist view)
- Equivalent to minimizing empirical risk
MAP (Maximum A Posteriori):
Finds the parameters that maximize the posterior: $$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$$
Using Bayes' Theorem: $$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$$
- Includes a Prior $P(\theta)$ (our belief before seeing data)
- Treats $\theta$ as a random variable (Bayesian view)
- Prior acts as regularization
Equivalence:
MAP becomes MLE when we assume a Uniform Prior $P(\theta) = \text{const}$, since then the prior does not change the argmax.
Connection to Regularization:
- MAP with a Gaussian Prior on $\theta$ is equivalent to L2 regularization (Ridge)
- MAP with a Laplace Prior on $\theta$ is equivalent to L1 regularization (Lasso)
In log space: $$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log P(D \mid \theta) + \log P(\theta)\right]$$ where the log-prior supplies the penalty term ($-\lambda||\theta||_2^2$ for Gaussian, $-\lambda||\theta||_1$ for Laplace).
🔴 Section C: The "Overlooked" Topics (High Value)
Q8. Graph-Based Semi-Supervised Learning
Question:
Explain the intuition behind "Label Propagation" in graph-based learning.
Answer:
Core Idea: Use the structure of data (similarity graph) to spread labels from labeled to unlabeled points.
Graph Construction:
- Nodes: Data points (both labeled and unlabeled)
- Edges: Connect similar points (e.g., k-nearest neighbors)
- Edge Weights: $w_{ij} = \exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)$ (similarity)
Assumptions:
- Smoothness: Points connected by an edge likely have the same label
- Cluster: Points in the same cluster likely have the same label
- Manifold: Data lies on a low-dimensional manifold
Label Propagation Algorithm:
-
Initialize: Labeled nodes get their true labels, unlabeled nodes get uniform distribution
-
Propagate: Each node adopts a weighted average of its neighbors' labels: $f_i \leftarrow \frac{\sum_j w_{ij} f_j}{\sum_j w_{ij}}$
-
Clamp: Keep labeled nodes fixed at their true labels
-
Iterate: Repeat until convergence
Effect: Labels "flow" through high-density regions via edges. If unlabeled point has many labeled neighbors of "Class A", it becomes "Class A".
Matrix Form: $F^{(t+1)} = D^{-1} W F^{(t)}$, where $W$ is the edge-weight matrix, $D$ is the diagonal degree matrix ($D_{ii} = \sum_j w_{ij}$), and $F$ holds the label distributions.
Advantages:
- Leverages data structure naturally
- Works well when clusters are well-separated
- Can handle complex decision boundaries
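A minimal NumPy sketch of the propagate-and-clamp loop on a tiny two-cluster toy graph; the RBF width and the two labeled points are assumptions of the demo.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # two clusters
labeled = {0: 0, 10: 1}                        # one labeled point per cluster

W = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0)                         # no self-edges
F = np.full((20, 2), 0.5)                      # unlabeled nodes start at a uniform distribution
for i, c in labeled.items():
    F[i] = np.eye(2)[c]

for _ in range(100):
    F = W @ F / W.sum(axis=1, keepdims=True)   # each node averages its neighbours' labels
    for i, c in labeled.items():               # clamp labeled nodes to their true labels
        F[i] = np.eye(2)[c]

print(F.argmax(axis=1))                        # first 10 -> class 0, last 10 -> class 1
```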
Q9. Q-Learning (Off-Policy vs. On-Policy)
Question:
Why is Q-Learning called an "Off-Policy" algorithm?
Answer:
On-Policy (e.g., SARSA):
The agent learns the value of the policy it is currently executing (including its exploration steps).
SARSA Update: $$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma\, Q(s', a') - Q(s, a)\right]$$ where $a'$ is the action actually taken in $s'$ by the current (ε-greedy) policy.
Off-Policy (Q-Learning):
The agent learns the value of the Optimal Policy (greedy), regardless of the policy it is using to explore.
Q-Learning Update: $$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
Note the $\max_{a'}$: the target uses the best possible next action, not the action the agent actually takes.
Key Difference:
- SARSA: "I took action $a$ in $s$, got reward $r$, then took action $a'$ in $s'$. How good was that sequence?"
- Q-Learning: "I took action $a$ in $s$, got reward $r$, reached $s'$. What's the best I could do from $s'$?"
Why "Off-Policy"?
The policy being evaluated (optimal/greedy) is different from the policy being followed (ε-greedy exploration).
Advantage: Can learn optimal policy while exploring aggressively. Converges faster in practice.
Disadvantage: Higher variance, can be unstable with function approximation.
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q10. Online Gradient Descent (Perceptron)
Question: Write the algorithm for Online Gradient Descent for a linear model.
Pseudocode:
# Initialize
w = zeros(n_features) # Weight vector
learning_rate = 0.01
# Process streaming data
For t = 1, 2, 3, ... (until stream ends):
# Receive new instance
Receive (x_t, y_t)
# 1. Make Prediction
y_pred = sign(dot(w, x_t)) # For classification
# OR y_pred = dot(w, x_t) for regression
# 2. Compute Loss/Error
error = y_t - y_pred
# 3. Update Immediately (no batch accumulation)
# For classification (Perceptron):
If y_pred != y_t:
w = w + learning_rate * y_t * x_t
# For regression (SGD):
# w = w - learning_rate * (y_pred - y_t) * x_t
# Optional: Decay learning rate
learning_rate = learning_rate * decay_factor
Perceptron Convergence Theorem:
If data is linearly separable, the perceptron algorithm converges in finite steps (mistake bound).
Variants:
- Averaged Perceptron: Keep running average of weights (more stable)
- Passive-Aggressive: Adaptive learning rate based on loss magnitude
Q11. Q-Learning Algorithm
Question: Write the update loop for Q-Learning.
Pseudocode:
Function QLearning(Environment, NumEpisodes, alpha, gamma, epsilon):
# Initialize Q-table (or neural network)
Q = zeros(num_states, num_actions)
For episode = 1 to NumEpisodes:
# Initialize environment
s = Environment.reset()
Repeat (for each step of episode):
# 1. Choose action using epsilon-greedy policy
If random() < epsilon:
a = random_action() # Exploration
Else:
a = argmax_a(Q[s, :]) # Exploitation
# 2. Take action, observe reward and next state
s_next, r, done = Environment.step(a)
# 3. Q-Learning Update (Bellman Equation)
# Update towards: Reward + discounted value of best next action
target = r + gamma * max(Q[s_next, :])
Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
# 4. Move to next state
s = s_next
# Optional: Decay epsilon (explore less over time)
epsilon = epsilon * decay_factor
Until s is terminal (done == True)
Return Q
# Extract Policy from Q-table
Function ExtractPolicy(Q):
policy = {}
For each state s:
policy[s] = argmax_a(Q[s, :])
Return policy
Key Equation (Bellman Optimality): $$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$
Hyperparameters:
- alpha (learning rate): How much to update Q-values (0.1 - 0.5)
- gamma (discount factor): Importance of future rewards (0.9 - 0.99)
- epsilon (exploration rate): Probability of random action (start 1.0, decay to 0.1)
Unit 6: Recent Trends & Advanced Architectures
🟢 Section A: Short Answer / Definitions (2-3 Marks)
Q1. What is "Self-Supervised Learning"?
Answer:
It is a learning paradigm where the model generates its own labels from the data itself, without human annotation. It's a form of unsupervised learning that creates supervised tasks.
Key Idea: Design a pretext task where labels are automatically generated, then transfer learned representations to downstream tasks.
Examples:
- NLP (BERT): Mask a word and predict it
- Input: "The cat sat on the
[MASK]" - Label: "mat" (automatically generated)
- Input: "The cat sat on the
- Computer Vision (SimCLR): Predict if two augmented versions of the same image match
- Crop, rotate, color jitter same image → Positive pair
- Different images → Negative pair
- Video: Predict future frames from past frames
Benefit: Allows training on massive unlabeled datasets (billions of images/documents), learning general representations that transfer to many tasks.
Difference from Unsupervised: Creates supervisory signal from data structure, not just finding patterns.
Q2. Define "Contrastive Learning".
Answer:
A technique where the model learns representations by comparing pairs of samples, pulling similar samples together and pushing dissimilar samples apart in embedding space.
Core Principle:
- Positive Pair: Two augmented versions of the same image/text. Model pulls representations close together.
- Negative Pair: Two different images/text. Model pushes representations far apart.
Loss Function (InfoNCE): $$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$ where $z_i, z_j$ are the embeddings of a positive pair, $\text{sim}$ is cosine similarity, and $\tau$ is a temperature hyperparameter.
Objective: Maximize agreement between positive pairs, minimize it for negative pairs.
Popular Methods:
- SimCLR: Simple framework for contrastive learning
- MoCo: Momentum Contrast (uses momentum encoder)
- CLIP: Contrastive Language-Image Pre-training
Why it works: Forces model to learn features that are invariant to augmentations but discriminative between different samples.
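A small NumPy sketch of the InfoNCE idea with random stand-in embeddings; in practice the vectors would come from an encoder network, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # first logit is the positive pair, the rest are negatives
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]            # -log softmax probability of the positive pair

rng = np.random.default_rng(7)
z = rng.normal(size=8)
loss = info_nce(z, z + 0.01 * rng.normal(size=8), rng.normal(size=(5, 8)))
print(loss)                         # small loss: the positive is much closer than the negatives
```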
Q3. What is the key difference between a CNN and a Vision Transformer (ViT)?
Answer:
CNN:
- Processes images using local filters (convolutions) that slide over the image
- Has built-in inductive biases: translation invariance, locality
- Hierarchical: Early layers detect edges, later layers detect objects
- Receptive field grows gradually with depth
- Parameter efficient (due to weight sharing)
Vision Transformer (ViT):
- Splits image into fixed-size patches (e.g., $16 \times 16$ pixels)
- Linearly embeds patches and processes them as a sequence using Self-Attention
- Has global receptive fields from layer 1 (every patch can attend to every other)
- No built-in spatial inductive bias (must learn from data)
- Requires more data to train (less inductive bias = needs more examples)
- More computationally expensive ($O(N^2)$ attention for $N$ patches)
Performance:
- CNNs: Better on small datasets (ImageNet-1K)
- ViTs: Better on large datasets (ImageNet-21K, JFT-300M)
- Hybrid models (e.g., Swin Transformer) combine best of both
Architecture:
Image → Patch Embedding → [Transformer Encoder]×L → Classification Head
🟡 Section B: Core Algorithms & Theory (5-10 Marks)
Q4. Transformer Architecture: Encoder vs. Decoder
Question:
Transformers (like in BERT and GPT) use Encoder and Decoder blocks. Explain the difference in their attention masks and use cases.
Answer:
Both use self-attention but differ in masking strategy:
1. Encoder (e.g., BERT, ViT):
Mechanism: Uses Bidirectional Self-Attention. Every token can attend to every other token (left and right) simultaneously.
Attention Matrix: No masking (full attention)
t1 t2 t3 t4
t1 ✓ ✓ ✓ ✓
t2 ✓ ✓ ✓ ✓
t3 ✓ ✓ ✓ ✓
t4 ✓ ✓ ✓ ✓
Use Cases:
- Text Classification
- Sentiment Analysis
- Named Entity Recognition
- Question Answering (understanding context)
- Image Classification (ViT)
Training: Typically uses Masked Language Modeling (predict masked tokens).
2. Decoder (e.g., GPT, Transformer-XL):
Mechanism: Uses Masked (Causal) Self-Attention. A token can only attend to tokens that came before it (to the left).
Attention Matrix: Upper triangle masked
t1 t2 t3 t4
t1 ✓ ✗ ✗ ✗
t2 ✓ ✓ ✗ ✗
t3 ✓ ✓ ✓ ✗
t4 ✓ ✓ ✓ ✓
Use Cases:
- Text Generation
- Language Modeling
- Next Word Prediction
- Code Generation
- Dialogue Systems
Training: Uses Autoregressive Language Modeling (predict next token given previous).
3. Encoder-Decoder (e.g., T5, BART):
Combines both:
- Encoder: Bidirectional attention on input
- Decoder: Causal attention on output, plus cross-attention to encoder outputs
Use Cases:
- Machine Translation
- Summarization
- Question Answering with generation
Key Difference: Masking determines what information flows where, enabling different capabilities.
Q5. Diffusion Models
Question:
Explain the "Forward" and "Reverse" processes in Diffusion Models.
Answer:
Diffusion Models generate images by learning to reverse a noising process.
Forward Process (Diffusion/Noising):
Gradually add Gaussian noise to an image over $T$ steps until only pure noise remains.
Mathematical formulation: $$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$ where $\beta_t$ is the noise schedule at step $t$.
Closed form (nice property): $$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), \quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$$
This process is fixed (no learning). It's a Markov chain that destroys information.
Reverse Process (Denoising):
The neural network learns to predict and remove the noise added at each step, effectively reconstructing the image from pure noise.
Network learns: $p_\theta(x_{t-1} \mid x_t)$
In practice, the network predicts the noise $\epsilon_\theta(x_t, t)$.
Training Loss: $$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\,||\epsilon - \epsilon_\theta(x_t, t)||^2\,\right]$$
Sampling (Generation):
- Start with random noise $x_T \sim \mathcal{N}(0, I)$
- For $t = T$ down to $1$:
  - Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  - Remove the predicted noise to obtain $x_{t-1}$ (adding a small amount of fresh noise for $t > 1$)
Advantages:
- High-quality samples
- Stable training (no mode collapse like GANs)
- Can condition on text (DALL-E 2, Stable Diffusion)
Q6. Representation Learning with Autoencoders
Question:
Describe the structure of an Autoencoder and the role of the "Bottleneck".
Answer:
Structure:
An Autoencoder consists of two parts that learn to compress and reconstruct data:
- Encoder: Compresses input $x$ into a lower-dimensional latent vector $z$. Example: 784 values (28×28 image) → 32 values (latent code)
- Decoder: Reconstructs the input from the latent vector, $\hat{x} = \text{Decoder}(z)$. Example: 32 values (latent code) → 784 values (reconstructed image)
Training Objective: Minimize the reconstruction error $||x - \hat{x}||^2$.
Bottleneck (Latent Space):
The latent layer $z$ is deliberately much smaller than the input.
Role:
- Forces the model to learn the most salient features (compressed representation)
- Acts as information bottleneck: Only most important info passes through
- Creates a meaningful low-dimensional embedding of data
- If bottleneck is too wide → model memorizes (identity function)
- If bottleneck is too narrow → poor reconstruction
Types:
-
Vanilla Autoencoder: Simple compression
-
Denoising Autoencoder (DAE):
- Input: Corrupted $\tilde{x}$ (noise added to $x$)
- Output: Clean $x$
- Forces robustness to noise
-
Variational Autoencoder (VAE):
- Latent $z \sim \mathcal{N}(\mu, \sigma^2)$ (probabilistic)
- Can generate new samples by sampling $z \sim \mathcal{N}(0, I)$
- Loss: Reconstruction + KL divergence
-
Sparse Autoencoder:
- Adds a sparsity penalty on the latent activations
- Forces most activations to be zero
Applications:
- Dimensionality reduction
- Anomaly detection (high reconstruction error = anomaly)
- Pre-training for supervised tasks
- Generative modeling (VAE)
🔴 Section C: The "Overlooked" Topics (High Value)
Q7. Vision Transformers (ViT): Patch Embedding
Question:
How does a Vision Transformer handle a 2D image as a sequence?
Answer:
ViT transforms an image into a sequence of patch embeddings through several steps:
Step 1: Patching
Divide the image into fixed-size patches (e.g., $16 \times 16$ pixels).
Image: $H \times W \times C$; Patch size: $P \times P$.
Number of patches: $N = \frac{H \cdot W}{P^2}$
Example: a $224 \times 224 \times 3$ image with $16 \times 16$ patches gives $N = 196$ patches.
Step 2: Flattening
Each patch is flattened into a 1D vector.
Patch: $P \times P \times C$ → vector of length $P^2 C$.
Example: $16 \times 16 \times 3 = 768$ values per patch.
Step 3: Linear Projection (Patch Embedding)
Each flattened patch is multiplied by a learnable projection matrix $E \in \mathbb{R}^{(P^2 C) \times D}$, where $D$ is the transformer's embedding dimension.
Step 4: Add [CLS] Token
Prepend a learnable classification token (similar to BERT).
This token's final state is used for classification.
Step 5: Positional Embedding
Since transformers have no notion of order/position, add learnable position embeddings to retain spatial information.
where the position embeddings $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ are learned parameters.
Final Sequence: $z_0 = [x_{\text{cls}};\ x_1 E;\ x_2 E;\ \ldots;\ x_N E] + E_{\text{pos}}$
Result: The image is now a sequence of vectors, just like words in a sentence, ready for transformer processing!
Feed to Transformer: the sequence passes through the stacked transformer encoder blocks.
Classification: Use final state of [CLS] token for prediction.
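A NumPy sketch of the patching and embedding steps, using the sizes from the example above; the projection matrix, [CLS] token, and position embeddings are random stand-ins for what would be learned parameters.

```python
import numpy as np

H = W = 224
P, C, D = 16, 3, 768
image = np.random.rand(H, W, C)

# Split into (H/P) x (W/P) patches and flatten each to a P*P*C vector
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
print(patches.shape)                            # (196, 768): N = 196 patches

E = np.random.randn(P * P * C, D) * 0.02        # learnable projection (random stand-in)
tokens = patches @ E                            # (196, D) patch embeddings
cls = np.zeros((1, D))                          # learnable [CLS] token (stand-in)
pos = np.random.randn(197, D) * 0.02            # learnable position embeddings (stand-in)
sequence = np.vstack([cls, tokens]) + pos
print(sequence.shape)                           # (197, D), ready for the transformer encoder
```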
Q8. Large Language Models (LLMs): Fine-Tuning approaches
Question:
What is "Instruction Tuning" in the context of LLMs?
Answer:
LLMs are trained in multiple stages to become useful assistants:
Stage 1: Pre-training
The model learns to predict the next word on massive text corpora (billions of tokens).
- Objective: Language modeling
- Data: Books, web pages, code (unlabeled)
- Result: Model learns grammar, facts, reasoning, but just "completes text"
Stage 2: Instruction Tuning (Supervised Fine-Tuning)
The model is fine-tuned on a dataset of (Instruction, Output) pairs.
Examples:
Instruction: "Translate this to French: 'Hello world'"
Output: "Bonjour le monde"
Instruction: "Summarize this article: [text]"
Output: [summary]
Instruction: "Write a poem about spring"
Output: [poem]
Goal: Align the model's behavior with user intent. Teach it to follow instructions rather than just complete text.
Training: Standard supervised learning on instruction-following datasets:
- Flan (Google)
- InstructGPT dataset (OpenAI)
- Alpaca (Stanford)
Stage 3: RLHF (Reinforcement Learning from Human Feedback) - Optional
- Collect human preferences: "Output A is better than Output B"
- Train a reward model to predict human preferences
- Use RL (PPO) to optimize LLM to maximize reward model score
Result: Model becomes helpful, harmless, and honest (like ChatGPT).
Key Difference:
- Pre-training: "The capital of France is" → "Paris"
- Instruction-tuned: "What is the capital of France?" → "The capital of France is Paris."
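As a rough illustration of Stage 2, instruction-output pairs are usually rendered into a single training string with a fixed template before standard next-token training. The sketch below uses an Alpaca-style layout; the template text and field names are illustrative, not a specific library's API:

```python
# Minimal sketch of formatting (Instruction, Output) pairs for supervised fine-tuning
instruction_data = [
    {"instruction": "Translate this to French: 'Hello world'", "output": "Bonjour le monde"},
    {"instruction": "Write a poem about spring",               "output": "..."},  # placeholder
]

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def format_example(example):
    # The model is trained with next-token prediction on this text; in practice the loss
    # is usually masked so only the response tokens contribute.
    return PROMPT_TEMPLATE.format(**example)

for ex in instruction_data:
    print(format_example(ex))
```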
🔵 Section D: Algorithmic Pseudocode (Exam Ready)
Q9. Self-Attention Mechanism (Simplified)
Question: Write the core equation/steps for Scaled Dot-Product Attention.
Pseudocode:
Function ScaledDotProductAttention(Query, Key, Value):
"""
Query: (batch, seq_len, d_k)
Key: (batch, seq_len, d_k)
Value: (batch, seq_len, d_v)
"""
# 1. Calculate Similarity Scores (dot product)
# Q * K^T gives pairwise similarity between all positions
Scores = MatMul(Query, Transpose(Key)) # (batch, seq_len, seq_len)
# 2. Scale by sqrt(d_k)
# Prevents scores from becoming too large (softmax saturation)
d_k = Dimension_of_Key_vectors
ScaledScores = Scores / Sqrt(d_k)
# 3. Apply Mask (optional, for decoder)
# Mask future positions by setting to -infinity
If Mask is provided:
ScaledScores = ScaledScores + Mask # Mask = -inf for invalid positions
# 4. Softmax to get Attention Weights (probabilities)
# Each row sums to 1 (probability distribution over positions)
AttentionWeights = Softmax(ScaledScores, axis=-1) # (batch, seq_len, seq_len)
# 5. Weighted Sum of Values
# Each position gets a weighted combination of all Value vectors
Output = MatMul(AttentionWeights, Value) # (batch, seq_len, d_v)
Return Output, AttentionWeights
# Multi-Head Attention
Function MultiHeadAttention(Q, K, V, num_heads):
d_model = Q.shape[-1]
d_k = d_model // num_heads
# Split into multiple heads
# Each head learns different aspects of relationships
For i from 1 to num_heads:
    Q_head_i = Linear_Q_i(Q) # Separate learnable projection to d_k dimensions
    K_head_i = Linear_K_i(K)
    V_head_i = Linear_V_i(V)
    head_i, _ = ScaledDotProductAttention(Q_head_i, K_head_i, V_head_i)
# Concatenate all heads
MultiHead = Concatenate([head_1, head_2, ..., head_num_heads])
# Final linear projection
Output = Linear_output(MultiHead)
Return Output
Key Equations:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Why scaling? Without the $\sqrt{d_k}$ factor, the dot products grow in magnitude with $d_k$, pushing the softmax into saturated regions where gradients become vanishingly small.
Intuition:
- Query: "What am I looking for?"
- Key: "What do I offer?"
- Value: "What information do I carry?"
- Attention score = how relevant each position is to current position
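A runnable NumPy version of the single-head pseudocode above (the batch dimension is omitted for clarity; the toy Q, K, V values are random placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Single head, no batch dimension."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # pairwise similarities, scaled
    if mask is not None:
        scores = scores + mask                     # mask = -inf at disallowed positions
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V, weights                    # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.random((5, 8)); K = rng.random((5, 8)); V = rng.random((5, 16))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))                   # (5, 16), each row of weights sums to 1
```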
Q10. Variational Autoencoder (VAE) - Conceptual Algorithm
Question: Explain the training process and sampling from a VAE.
Answer:
VAE Architecture:
Unlike standard autoencoders, VAE learns a probabilistic latent space that enables generation.
Training Process:
Function TrainVAE(Data, Epochs):
    # The Encoder network maps each input x to the parameters (mu, log_var) of q(z|x)
For each training example x:
# 1. Encode to latent distribution parameters
mu, log_var = Encoder(x)
# 2. Reparameterization Trick (allows backprop)
# Sample from N(mu, sigma^2) in a differentiable way
epsilon = SampleNormal(0, 1) # Standard normal
z = mu + exp(0.5 * log_var) * epsilon
# 3. Decode
x_reconstructed = Decoder(z)
# 4. Compute Loss (two terms)
# a) Reconstruction Loss (how well decoded matches input)
reconstruction_loss = MSE(x, x_reconstructed)
# OR: BinaryCrossEntropy for binary data
# b) KL Divergence (regularization, keeps latent close to N(0,1))
# Forces encoder to output distributions close to standard normal
KL_loss = -0.5 * Sum(1 + log_var - mu^2 - exp(log_var))
# Total Loss
total_loss = reconstruction_loss + beta * KL_loss
# 5. Backpropagate and Update weights
Backpropagate(total_loss)
UpdateWeights()
Return TrainedEncoder, TrainedDecoder
Function GenerateNewSample(Decoder):
# Sample from prior (standard normal)
z = SampleNormal(0, 1, latent_dim)
# Decode to get new sample
x_new = Decoder(z)
Return x_new
Mathematical Formulation:
ELBO (Evidence Lower BOund) - What we maximize:
$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Loss function (what we minimize, the negative ELBO):
$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right]}_{\text{reconstruction}} \;+\; \beta \, D_{KL}\!\left(q_\phi(z|x) \,\|\, \mathcal{N}(0, I)\right)$$
where:
- First term: Reconstruction accuracy
- Second term: KL divergence between learned distribution and prior
- $\beta$: Weight on the KL term (β-VAE uses $\beta > 1$ for disentanglement)
KL Divergence (closed form for Gaussians):
$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
Reparameterization Trick:
Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly (which is not differentiable), compute $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
This moves the randomness to $\epsilon$, so gradients can flow through $\mu$ and $\sigma$ during backpropagation.
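In NumPy, the reparameterization step and the closed-form KL term above look like this (the latent dimension and the example mu/log_var values are illustrative stand-ins for encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 8

# Pretend these came out of the encoder for one example
mu = rng.normal(0, 1, latent_dim)
log_var = rng.normal(0, 0.1, latent_dim)

# Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I)
epsilon = rng.standard_normal(latent_dim)
z = mu + np.exp(0.5 * log_var) * epsilon

# Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z.shape, kl)
```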
Applications:
- Image generation
- Anomaly detection
- Latent space interpolation
- Disentangled representations (β-VAE)
Additional Important Concepts & Tips
Cross-Validation Best Practices
When to use:
- Small to medium datasets
- Model selection and hyperparameter tuning
- Assessing model stability
Stratified K-Fold (Classification):
Function StratifiedKFold(Data, k):
# Ensures each fold has same class distribution as full dataset
For each class c:
class_samples = Data[Data.label == c]
Split class_samples into k equal parts
Distribute parts across k folds
Return k folds with balanced class distribution
Time Series Split:
Function TimeSeriesSplit(Data, k):
# Respects temporal order (no future data in training)
n = len(Data)
For i from 1 to k:
train_end = n * i / (k+1)
test_end = n * (i+1) / (k+1)
TrainSet = Data[0 : train_end]
TestSet = Data[train_end : test_end]
Yield TrainSet, TestSet
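Both splitters exist ready-made in scikit-learn; a brief usage sketch (X and y are placeholder arrays):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.random.rand(100, 5)                       # placeholder features
y = np.random.randint(0, 2, size=100)            # placeholder binary labels

# Stratified K-Fold: each fold keeps the overall class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Time Series Split: the training window always precedes the test window
tss = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()      # no future data leaks into training
```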
Gradient Descent Variants Comparison
1. Batch Gradient Descent:
- Uses entire dataset per update
- Pros: Stable convergence, accurate gradient
- Cons: Slow, memory intensive, stuck in local minima
- Update: $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$, where the gradient is computed over ALL data
2. Stochastic Gradient Descent (SGD):
- Uses single sample per update
- Pros: Fast, can escape local minima (noise helps)
- Cons: Noisy updates, oscillates around minimum
- Update: $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ for a single randomly chosen sample $i$
3. Mini-Batch Gradient Descent:
- Uses small batch (32, 64, 128) per update
- Pros: Balance between speed and stability, leverages vectorization
- Cons: Requires tuning batch size
- Update: $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; X_{i:i+B}, y_{i:i+B})$ for a mini-batch of size $B$
4. Momentum:
- Accumulates velocity from previous gradients
- Helps accelerate in consistent directions: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, then $\theta \leftarrow \theta - v_t$
5. Adam (Adaptive Moment Estimation):
- Combines momentum + adaptive learning rates
- Most popular in deep learning
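A small NumPy sketch of mini-batch gradient descent on linear regression (the data, learning rate, and batch size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(500)        # noisy linear target

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))                       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the mini-batch
        w -= lr * grad                                  # theta := theta - eta * grad
print(w)   # should end up close to [2.0, -1.0, 0.5]
```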
Bias-Variance Trade-off Visualization
Decomposition of Expected Error:
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\text{Bias}^2\left[\hat{f}(x)\right]}_{\text{systematic error}} + \underbrace{\text{Var}\left[\hat{f}(x)\right]}_{\text{sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
Understanding:
- High Bias, Low Variance: Underfitting (too simple, misses patterns)
- Example: Linear model on curved data
- Fix: More complex model, more features
- Low Bias, High Variance: Overfitting (too complex, memorizes noise)
- Example: Deep tree on small dataset
- Fix: Regularization, more data, ensemble
- Sweet Spot: Balanced complexity for optimal generalization
Ensemble Effects:
- Bagging: Reduces variance (averages high-variance models)
- Boosting: Reduces bias (corrects errors sequentially)
- Stacking: Can reduce both if done correctly
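The variance-reduction effect behind bagging can be seen by comparing the spread of predictions from single deep trees with the spread of averaged ensembles. The sketch below is a simplification (true bagging bootstraps one fixed dataset; here each tree just sees an independently drawn dataset), intended only to illustrate the intuition:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_test = np.linspace(0, 1, 50).reshape(-1, 1)

def fit_predict(seed):
    r = np.random.default_rng(seed)
    X = r.random((100, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * r.standard_normal(100)     # noisy nonlinear target
    return DecisionTreeRegressor(random_state=0).fit(X, y).predict(x_test)

single = np.stack([fit_predict(s) for s in range(50)])          # 50 single deep trees
bagged = np.stack([
    np.mean([fit_predict(1000 * s + j) for j in range(10)], axis=0)
    for s in range(50)
])                                                              # 50 averages of 10 trees
print("variance of a single deep tree:", single.var(axis=0).mean())
print("variance of a 10-tree average: ", bagged.var(axis=0).mean())  # markedly smaller
```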
Confusion Matrix Deep Dive
For binary classification with classes {Positive, Negative}:
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |
Metrics:
- Accuracy: $\dfrac{TP + TN}{TP + TN + FP + FN}$
  - Misleading for imbalanced data
- Precision: $\dfrac{TP}{TP + FP}$
  - Of predicted positives, how many are correct?
  - Use when FP is costly
- Recall (Sensitivity): $\dfrac{TP}{TP + FN}$
  - Of actual positives, how many did we find?
  - Use when FN is costly
- Specificity: $\dfrac{TN}{TN + FP}$
  - Of actual negatives, how many did we correctly identify?
- F1-Score: $\dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  - Harmonic mean, good for imbalanced classes
- ROC-AUC: Area under the ROC curve (TPR vs FPR at different thresholds)
  - Measures ranking quality
  - 0.5 = random, 1.0 = perfect
Choosing Threshold:
- Default: 0.5
- High Recall: Lower threshold (classify more as positive)
- High Precision: Higher threshold (only very confident positives)
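These metrics are available directly in scikit-learn; a short sketch with placeholder labels and predicted probabilities (note that sklearn's confusion_matrix lists the negative class first, i.e. [[TN, FP], [FN, TP]] for labels {0, 1}):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # placeholder labels
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.8, 0.7, 0.1])   # predicted probabilities

threshold = 0.5                          # lower it to favour recall, raise it for precision
y_pred = (y_score >= threshold).astype(int)

print(confusion_matrix(y_true, y_pred))                        # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_score))            # uses scores, not hard labels
```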
Regularization Techniques Summary
For Linear Models:
- L1 (Lasso): penalty $\lambda \sum_j |w_j|$ → Sparse solutions
- L2 (Ridge): penalty $\lambda \sum_j w_j^2$ → Small weights
- Elastic Net: penalty $\lambda_1 \sum_j |w_j| + \lambda_2 \sum_j w_j^2$ → Both effects
For Neural Networks:
- Dropout: Randomly drop neurons during training
- Batch Normalization: Normalize layer inputs
- Early Stopping: Stop when validation error increases
- Data Augmentation: Artificially increase training data
- Weight Decay: L2 penalty in optimizer
- Label Smoothing: Soften one-hot labels (0.9 instead of 1.0)
For Trees:
- Max Depth: Limit tree depth
- Min Samples Split: Require minimum samples to split
- Max Features: Random subset at each split (Random Forest)
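For the linear-model penalties above, a brief scikit-learn sketch (the data and regularization strengths are illustrative) showing that L1 tends to zero out irrelevant coefficients while L2 only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)   # only 2 informative features

# alpha is the regularization strength; l1_ratio mixes L1 and L2 for ElasticNet
lasso = Lasso(alpha=0.05).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet  = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)

print("Lasso coefs (irrelevant features usually exactly 0):", np.round(lasso.coef_, 2))
print("Ridge coefs (small but non-zero):                   ", np.round(ridge.coef_, 2))
print("ElasticNet coefs:                                   ", np.round(enet.coef_, 2))
```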
Important Activation Functions
1. Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
- Output: (0, 1)
- Use: Binary classification output, gates in LSTM
- Problem: Vanishing gradients for large |x|
2. Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- Output: (-1, 1)
- Use: Hidden layers in RNNs
- Better than sigmoid (zero-centered)
3. ReLU: $f(x) = \max(0, x)$
- Output: $[0, \infty)$
- Pros: Fast, no vanishing gradient for x>0
- Cons: Dead neurons (x<0 always outputs 0)
4. Leaky ReLU:
- Fixes dead ReLU problem
- Small gradient for negative values
5. ELU:
- Smoother than ReLU, mean closer to zero
6. Softmax:
- Output: Probability distribution (sums to 1)
- Use: Multi-class classification output
7. Swish/SiLU:
- Smooth, non-monotonic
- Used in modern architectures (EfficientNet)
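A compact NumPy reference implementation of the activations listed above (the test vector is arbitrary; the leaky-ReLU and ELU slopes are the usual defaults):

```python
import numpy as np

def sigmoid(x):             return 1.0 / (1.0 + np.exp(-x))
def tanh(x):                return np.tanh(x)
def relu(x):                return np.maximum(0.0, x)
def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)      # small slope for x < 0
def elu(x, a=1.0):          return np.where(x > 0, x, a * (np.exp(x) - 1))
def swish(x):               return x * sigmoid(x)                 # also called SiLU

def softmax(z):
    z = z - np.max(z)                                             # numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                ("leaky_relu", leaky_relu), ("elu", elu), ("swish", swish)]:
    print(f"{name:11s}", np.round(f(x), 3))
print("softmax    ", np.round(softmax(x), 3), "sums to", softmax(x).sum())
```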
Final Exam Tips
Short Questions (2-3 marks):
- Define key terms precisely (1 sentence)
- Give one concrete example
- Mention use case or limitation
- Time: 2-3 minutes per question
Medium Questions (5-10 marks):
- Write algorithm or derive equation
- Explain intuition behind math
- Compare alternatives (pros/cons)
- Draw diagram if helpful
- Time: 8-12 minutes per question
Long Questions (15 marks):
- Complete algorithm with pseudocode
- Explain each step
- Analyze complexity
- Discuss variants and applications
- Time: 20-25 minutes per question
Strategy:
- Read all questions first, plan time allocation
- Answer easiest questions first (build confidence)
- Use bullet points for clarity
- Highlight keywords (e.g., "greedy", "convex", "sparse")
- If stuck, write what you know and move on
- Save 10 minutes to review
Common Mistakes to Avoid:
- Confusing bias/variance
- Wrong regularization effect (L1 vs L2)
- Forgetting activation functions in backprop
- Mixing up on-policy/off-policy RL
- Not normalizing/standardizing data
- Ignoring imbalanced class distribution
Key Formula Sheet - Memorize These:
- Entropy: $H(S) = -\sum_i p_i \log_2 p_i$
- Normal Equation: $\theta = (X^T X)^{-1} X^T y$
- Logistic Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- Cross-Entropy Loss: $L = -\sum_i y_i \log \hat{y}_i$
- Softmax: $\text{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
- Attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
- Conv Output: $O = \left\lfloor \dfrac{W - F + 2P}{S} \right\rfloor + 1$
- F1-Score: $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Bellman: $V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$
- KL Divergence: $D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \dfrac{P(x)}{Q(x)}$