Machine Learning Important Questions Unit-by-Unit for MAKAUT exams

MAKAUT Machine Learning Exam Revision

Complete Question Bank with Detailed Answers
Units 1-6 | Generated: November 22, 2025 | Exam: November 28, 2025


Unit 1: Supervised Learning & Linear Models

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. Differentiate between "Lazy Learners" and "Eager Learners".

Answer:

Key Insight: Lazy learners (e.g., k-NN) are instance-based: they store all training data in memory and defer computation until a query arrives, so training is fast but prediction is slow. Eager learners (e.g., decision trees, SVMs, neural networks) extract patterns into an explicit model during training and discard the training data, so training is slow but prediction is fast.


Q2. What is the "i.i.d." assumption in Supervised Learning?

Answer:
It stands for Independent and Identically Distributed. It assumes that all data points in the training and test sets are drawn independently from the same underlying probability distribution.

If this is violated (e.g., in time-series data, distribution shift), standard models often fail because they assume training and test data come from the same source.


Q3. Explain the "Kernel Trick" in SVM in one sentence.

Answer:
The Kernel Trick allows SVM to classify non-linearly separable data by implicitly mapping inputs to a higher-dimensional space using a kernel function $K(x_i, x_j)$, without ever computing the coordinates in that high-dimensional space explicitly, thus avoiding computational explosion.

Common Kernels: Linear $K(x_i, x_j) = x_i^T x_j$, Polynomial $K(x_i, x_j) = (x_i^T x_j + c)^d$, RBF (Gaussian) $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$, and Sigmoid.


Q4. What is Multicollinearity in Linear Regression? Why is it a problem?

Answer:
Multicollinearity occurs when independent variables are highly correlated with each other. This makes the matrix inversion $(X^TX)^{-1}$ unstable (singular or close to singular), leading to unreliable and highly variable estimates of the regression coefficients.

Problems: coefficient estimates become highly sensitive to small changes in the data, standard errors inflate, and individual coefficients lose interpretability (their signs may even flip).

Solutions: Ridge regression (L2), Lasso (L1), or PCA to decorrelate features.


🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q5. Decision Tree Splitting - Entropy & Information Gain

Question:
Given a dataset S with p positive examples and n negative examples.

  1. Define Entropy H(S).
  2. Explain how Information Gain (IG) is calculated for an attribute A.

Answer:

  1. Entropy: A measure of impurity or disorder in the dataset.

    $$H(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$$

    Where $p_+ = \frac{p}{p+n}$ and $p_- = \frac{n}{p+n}$ are the probabilities of the positive and negative classes.

    If the set is pure (all + or all -), Entropy is 0. If maximally impure (50-50), Entropy is 1.

  2. Information Gain: Measures the expected reduction in entropy after splitting S on attribute A.

    $$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

    Where $S_v$ is the subset of rows where attribute A has value v.

    The algorithm greedily selects the attribute with the highest Information Gain at each node.

Note: C4.5 uses Gain Ratio to avoid bias toward attributes with many values: $$\text{GainRatio}(S, A) = \frac{IG(S, A)}{\text{SplitInfo}(S, A)}$$
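A quick way to verify these formulas is a minimal Python sketch (standard library only; the toy attribute values and labels below are hypothetical):

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum p_i * log2(p_i) over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # IG(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum((len(p) / n) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

rows = [("sunny",), ("sunny",), ("rain",), ("rain",), ("overcast",)]
labels = ["No", "No", "Yes", "Yes", "Yes"]
print(entropy(labels))                    # ~0.971 for a 2-vs-3 class split
print(information_gain(rows, labels, 0))  # 0.971: each partition is pure, so IG = H(S)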


Q6. Naive Bayes Classifier & Independence Assumption

Question:
Why is the Naive Bayes classifier called "Naive"? Derive the decision rule.

Answer:
It is called "Naive" because it assumes all features are conditionally independent given the class, i.e., $P(x_1, \dots, x_n | y) = \prod_{i=1}^n P(x_i | y)$, an assumption that rarely holds exactly in practice.

Decision rule (from Bayes' theorem, dropping the constant denominator $P(x)$): $$\hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y)$$

Types: Gaussian NB (continuous features), Multinomial NB (counts), Bernoulli NB (binary features).


Q7. Linear Regression - Least Squares Derivation (Concept)

Question:
Derive the Normal Equation for Linear Regression parameters θ.

Answer:

  1. Hypothesis: $h_\theta(x) = \theta^T x$ (linear model)

  2. Cost Function (MSE): $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} ||X\theta - y||^2$$

  3. Minimization: To find the minimum, take the gradient $\nabla_\theta J(\theta)$ and set it to zero. $$\nabla_\theta J(\theta) = X^T(X\theta - y) = 0$$ $$X^TX\theta = X^Ty$$ $$\theta = (X^TX)^{-1}X^Ty$$

    This is the closed-form solution (Normal Equation).

Note: For problems with many features (n > 10,000), Gradient Descent is preferred because inverting $(X^TX)$ is $O(n^3)$ in the number of features, which is computationally expensive.

Gradient Descent Update: $$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
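The Normal Equation can be checked on synthetic data; a minimal numpy sketch (the data-generating coefficients below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])  # bias column + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, m)        # y = 2 + 3x + noise

# Normal Equation: theta = (X^T X)^{-1} X^T y
# (np.linalg.solve is numerically safer than forming an explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [2, 3]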


🔴 Section C: The "Overlooked" Topics (High Value)

Q8. Generalized Linear Models (GLM) components

Question:
Standard Linear Regression assumes Gaussian noise. What if the target variable y is a count (Poisson) or binary (Bernoulli)? Explain the three components of a GLM.

Answer:
A GLM generalizes linear regression to different response distributions from the Exponential Family. It consists of three parts:

  1. Random Component: The probability distribution of the response variable y (from the Exponential Family: Normal, Binomial, Poisson, Gamma, etc.).

  2. Systematic Component: A linear predictor $\eta = X\beta$ (weighted sum of features).

  3. Link Function $g(\cdot)$: Connects the mean of the distribution $\mu = E[y]$ to the linear predictor. $$g(\mu) = \eta = X\beta$$

    Examples:

    • Linear Regression: Identity link $g(\mu) = \mu$
    • Logistic Regression: Logit link $g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$
    • Poisson Regression (Counts): Log link $g(\mu) = \log(\mu)$

Why it matters: GLM unifies many regression models under one framework and allows modeling different types of target variables correctly.


Q9. Multi-class Classification Strategies

Question:
How do you handle multi-class problems using binary classifiers? Compare One-vs-Rest and One-vs-One.

Answer:

One-vs-Rest (OvR): Train K binary classifiers, each separating one class from all the others. Prediction: the class whose classifier gives the highest score. Cheap (K models), but each classifier faces imbalanced data.

One-vs-One (OvO): Train $\frac{K(K-1)}{2}$ binary classifiers, one per pair of classes. Prediction: majority vote among all pairwise classifiers. More models, but each is trained on a smaller, balanced subset.

Scikit-learn: Uses OvR by default for most algorithms, but OvO for SVM.


Q10. Ranking vs. Classification

Question:
In the syllabus "Beyond Binary Classification", what is the Ranking problem?

Answer:
In standard classification, we predict a class label ($y \in \{0, 1\}$). In Ranking, the goal is to predict the relative order of items rather than absolute labels (e.g., ordering search results by relevance).


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q11. k-Nearest Neighbors (k-NN) Algorithm

Question: Write the pseudocode for classifying a new point $x_{\text{query}}$ using k-NN.

Pseudocode:

Function Predict_KNN(TrainingData, Labels, x_query, k):
    Distances = []
    
    # 1. Calculate distances to all training points
    For each index i, point x_i in TrainingData:
        d = EuclideanDistance(x_i, x_query)
        Distances.append((d, Labels[i]))
    
    # 2. Sort and find k Nearest Neighbors
    SortedDistances = Sort(Distances, by=distance, ascending)
    NearestNeighbors = First k elements of SortedDistances
    
    # 3. Majority Voting
    VoteCounts = CountFrequency(Labels in NearestNeighbors)
    Prediction = Label with max(VoteCounts)
    
    Return Prediction

Key Points: choose an odd k for binary problems to avoid voting ties; scale features first, since distances are unit-sensitive; there is no training phase, but each prediction costs O(n) distance computations.


Q12. Logistic Regression Update Rule (Gradient Descent)

Question: Write the weight update step for Logistic Regression.

Pseudocode:

Initialize weights w = random small values
Initialize learning rate alpha

Repeat until convergence:
    For each feature j from 0 to n:
        # Calculate gradient for weight w_j
        Gradient = (1/m) * Sum over all i: (Sigmoid(w^T * x_i) - y_i) * x_ij
        
        # Update weight
        w_j = w_j - alpha * Gradient
        
    # Where Sigmoid(z) = 1 / (1 + exp(-z))

Cost Function: $$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))\right]$$

This cost function is convex, so gradient descent (with a suitable learning rate) converges to the global minimum.


Unit 2: Unsupervised Learning

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. What is the difference between PCA and Linear Discriminant Analysis (LDA)?

Answer:

Objective:
PCA is unsupervised: it finds the directions of maximum variance in the data, ignoring class labels. LDA is supervised: it finds the directions that maximize between-class separation relative to within-class scatter.

Use case: PCA for compression/visualization, LDA for classification preprocessing.


Q2. Define "Latent Factors" in the context of Matrix Factorization.

Answer:
Latent factors are hidden features that are not directly observed but inferred from the data. They represent underlying patterns or concepts.

In a movie recommendation system:

  • each user is described by a weight on each latent factor (how much they like "Action", "Romance", etc.)
  • each movie is described by how strongly it exhibits each factor

Example: A user who likes action movies has a high weight on the "Action" factor; action movies have a high "Action" factor value.


Q3. Why is K-Means sensitive to initialization?

Answer:
K-Means is a greedy algorithm that converges to a local minimum, not necessarily the global minimum. The algorithm uses Lloyd's method which iteratively assigns points and updates centroids.

Problem: If the initial centroids are placed poorly (e.g., two centroids in one natural cluster, none in another), the algorithm may get stuck in a suboptimal solution.

Solution: Use K-Means++ initialization (spread the initial centroids far apart), or run the algorithm several times with different random seeds and keep the lowest-cost result.


Q4. What is the "Kernel Trick" in Kernel PCA?

Answer:
Standard PCA only finds linear structures (straight principal components). Kernel PCA uses a kernel function K(x,y) to compute the dot product in a high-dimensional feature space without explicitly transforming the data.

This allows it to find non-linear principal components that can capture curved structures (e.g., unrolling a "Swiss roll" dataset, separating concentric circles).

Example: RBF kernel can map 2D circular data to a space where it becomes linearly separable.


🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q5. Principal Component Analysis (PCA) Algorithm

Question:
Outline the steps to perform PCA on a dataset X.

Answer:

  1. Standardize the data: Subtract the mean from each feature so the dataset is centered at the origin (zero mean). Optionally scale to unit variance. $$X_{\text{centered}} = X - \mu$$

  2. Compute the Covariance Matrix: $$\Sigma = \frac{1}{n-1}X^TX$$

    This matrix captures how variables vary together (covariance between all pairs of features).

  3. Eigen Decomposition: Calculate the eigenvalues (λ) and eigenvectors (v) of the covariance matrix Σ. $$\Sigma v = \lambda v$$

    • Eigenvectors represent the directions (Principal Components)
    • Eigenvalues represent the magnitude of variance in those directions
  4. Sort and Select: Sort eigenvalue-eigenvector pairs by decreasing eigenvalue. Select the top k eigenvectors to reduce dimensions from p to k.

  5. Project: Multiply the original centered data by the selected eigenvectors to get the new k-dimensional data. $$Z = X_{\text{centered}} \cdot W_k$$

    where Wk is the matrix of top k eigenvectors.

Variance Explained: $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}$ tells us how much information is retained.
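The five steps map directly onto a few numpy calls; a minimal sketch on synthetic correlated data (the data and the choice k = 1 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated 2-D data

X_centered = X - X.mean(axis=0)                   # 1. center
cov = (X_centered.T @ X_centered) / (len(X) - 1)  # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # 3. eigen decomposition (symmetric)

order = np.argsort(eigvals)[::-1]                 # 4. sort by decreasing eigenvalue
k = 1
W = eigvecs[:, order[:k]]

Z = X_centered @ W                                # 5. project to k dimensions
print("variance explained:", eigvals[order[:k]].sum() / eigvals.sum())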


Q6. Gaussian Mixture Models (GMM) vs. K-Means

Question:
How does GMM differ from K-Means in terms of cluster assignment?

Answer:

K-Means makes hard assignments: each point belongs to exactly one cluster (its nearest centroid), and clusters are implicitly spherical with equal variance. GMM makes soft assignments: each point receives a probability of belonging to each Gaussian component, and components can have different covariances (elliptical shapes) and weights.

When to use: K-Means for speed and simplicity, GMM when clusters have different shapes or you need probabilistic assignments.


Q7. Matrix Completion & Recommender Systems

Question:
Explain the objective function for Matrix Factorization $R \approx U V^T$. How is it solved?

Answer:
We want to approximate the sparse rating matrix R (users × items) by two low-rank matrices U (users × factors) and V (items × factors).

Objective: minimize the squared error over the observed entries only, with regularization: $$\min_{U,V} \sum_{(i,j) \in \text{observed}} (R_{ij} - u_i^T v_j)^2 + \lambda\left(||U||^2 + ||V||^2\right)$$

Solved by Stochastic Gradient Descent (update $u_i, v_j$ for each observed rating) or Alternating Least Squares (fix U and solve for V, then fix V and solve for U).

Prediction: For a missing rating: $\hat{R}_{ij} = u_i^T v_j$
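A minimal SGD sketch of this factorization on a tiny, made-up rating matrix (the rank, learning rate, and regularization values are illustrative assumptions):

import numpy as np

R = np.array([[5, 3, 0], [4, 0, 1], [0, 1, 5]], dtype=float)  # 0 marks a missing rating
observed = [(i, j) for i in range(3) for j in range(3) if R[i, j] > 0]

rng = np.random.default_rng(0)
k, lr, reg = 2, 0.01, 0.02
U = rng.normal(scale=0.1, size=(3, k))  # user factors
V = rng.normal(scale=0.1, size=(3, k))  # item factors

for epoch in range(2000):
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]
        u_old = U[i].copy()
        # gradient steps on the regularized squared error of this entry
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])

print(U @ V.T)  # observed cells are approximately recovered; zeros become predictions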


🔴 Section C: The "Overlooked" Topics (High Value)

Q8. Expectation-Maximization (EM) Algorithm

Question:
Describe the two steps of the EM algorithm used in GMMs.

Answer:

The EM algorithm is used when we have latent (hidden) variables. In GMM, the hidden variable is "which Gaussian generated this point?"

  1. E-Step (Expectation): Given the current estimates of cluster parameters (mean μk, covariance Σk, and weight πk), calculate the probability (responsibility) that each data point xi belongs to each cluster k.

    $$\gamma_{ik} = P(\text{cluster } k \mid x_i) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
  2. M-Step (Maximization): Update the cluster parameters (μk,Σk,πk) to maximize the likelihood of the data given the responsibilities calculated in the E-Step.

    $$\mu_k = \frac{\sum_{i=1}^n \gamma_{ik} x_i}{\sum_{i=1}^n \gamma_{ik}}$$ $$\Sigma_k = \frac{\sum_{i=1}^n \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^n \gamma_{ik}}$$ $$\pi_k = \frac{1}{n} \sum_{i=1}^n \gamma_{ik}$$

Analogy: It's like K-Means but with soft assignments. E-step is "Assign points probabilistically", M-step is "Update parameters based on weighted assignments".

Convergence: Guaranteed to converge to a local maximum of the likelihood.
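Both steps fit in a few lines of numpy for a 1-D, two-component mixture; a compact sketch (the initial parameters and toy data are assumptions for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.5, 100)])

mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities gamma_ik
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted parameter updates
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print(mu, sigma, pi)  # approaches means (-2, 3), stds (1, 0.5), weights (0.6, 0.4)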


Q9. Non-Negative Matrix Factorization (NMF)

Question:
How is NMF different from SVD/PCA? Where is it used?

Answer:

NMF factorizes $X \approx WH$ with the constraint that all entries of W and H are non-negative, whereas SVD/PCA allow negative values and enforce orthogonality. Non-negativity forces a parts-based, additive representation (components can only add up, never cancel), which is often more interpretable.

Used in: topic modeling (documents as mixtures of topics), image decomposition (faces as sums of parts), and audio source separation.

Optimization: Solved using multiplicative update rules or gradient descent with non-negativity constraints.


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q10. K-Means Algorithm

Question: Write the pseudocode for the K-Means algorithm.

Function KMeans(Data, k, MaxIterations):
    # 1. Initialization (K-Means++)
    Centroids = KMeansPlusPlus_Init(Data, k)
    # OR: Centroids = Randomly select k points from Data
    
    Iteration = 0
    Repeat until Convergence or Iteration == MaxIterations:
        # 2. Assignment Step
        Clusters = [[] for _ in range(k)]  # k empty lists
        
        For each point x in Data:
            # Find nearest centroid
            MinDist = Infinity
            BestCluster = -1
            
            For i from 0 to k-1:
                Dist = EuclideanDistance(x, Centroids[i])
                If Dist < MinDist:
                    MinDist = Dist
                    BestCluster = i
            
            # Assign point to nearest cluster
            Clusters[BestCluster].append(x)
            
        # 3. Update Step
        OldCentroids = Copy(Centroids)
        For i from 0 to k-1:
            If Clusters[i] is not empty:
                Centroids[i] = Mean(Clusters[i])  # Average of all points
            
        # 4. Check Convergence
        If ||Centroids - OldCentroids|| < epsilon:
            Break
            
        Iteration += 1
            
    Return Centroids, Clusters

Cost Function (Within-cluster Sum of Squares): $$J = \sum_{i=1}^k \sum_{x \in C_i} ||x - \mu_i||^2$$


Q11. Online Learning (Perceptron Update)

Question: Write the update rule for a streaming data point in Online Learning.

Pseudocode:

Initialize weights w = zeros(n_features)
Initialize learning_rate = 0.01

For t = 1, 2, 3... (streaming data):
    Receive instance (x_t, y_true)
    
    # 1. Predict
    y_pred = sign(dot(w, x_t))  # sign function: +1 or -1
    
    # 2. Update only on mistake
    If y_pred != y_true:
        w = w + learning_rate * y_true * x_t
    
    # For regression (not classification):
    # w = w - learning_rate * (y_pred - y_true) * x_t

Key Difference from Batch: Updates happen immediately after each example, no need to store entire dataset. Good for streaming data or when data doesn't fit in memory.


Unit 3: Model Evaluation & Ensemble Methods

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. What is the "Out-of-Bag" (OOB) Error in Random Forests?

Answer:
In Random Forests, each tree is trained on a bootstrap sample (sampling with replacement, covering roughly 2/3 of the unique data points). The remaining ~1/3 of the data not seen by a tree is called its Out-of-Bag (OOB) data. The OOB error is computed by predicting each training point using only the trees that did not see it, then averaging the errors.

Advantage: No need to set aside validation data; we get "free" error estimation during training.


Q2. Define "Bias" and "Variance" in the context of Ensemble Learning.

Answer:

Bias is the error from overly simple assumptions: the model systematically misses the true pattern (underfitting). Variance is the error from sensitivity to the particular training sample: the model changes a lot when the data changes (overfitting).

Bias-Variance Decomposition: $$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Ensemble Effects: Bagging averages many high-variance models to reduce variance; Boosting combines many high-bias weak learners to reduce bias.


Q3. What is the "VC Dimension" (Vapnik-Chervonenkis)?

Answer:
It measures the capacity (complexity) of a hypothesis class. It is defined as the maximum number of points that the model can shatter (perfectly classify for all possible label assignments).

Examples: A linear classifier in 2D can shatter any 3 points in general position but not 4, so its VC dimension is 3. In d dimensions, a linear classifier has VC dimension d+1.

Implications: Higher VC dimension means greater model capacity but a looser generalization bound; more training data is needed before low training error guarantees low test error.


🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q4. AdaBoost (Adaptive Boosting) Algorithm

Question:
Explain how AdaBoost updates weights. Why does it focus on "hard" examples?

Answer:

AdaBoost combines weak learners (slightly better than random) into a strong ensemble by focusing on mistakes.

Algorithm:

  1. Initialize: All N training examples get equal weight $w_i = \frac{1}{N}$

  2. Iterate for T rounds:

    a. Train a weak learner $h_t$ on the weighted data

    b. Calculate the weighted error rate: $$\epsilon_t = \frac{\sum_{i: h_t(x_i) \neq y_i} w_i}{\sum_{i} w_i}$$

    c. Calculate learner weight (how much to trust this learner): $$\alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$

    Better learners (lower ϵt) get higher αt

    d. Update Sample Weights: $$w_i^{\text{new}} = \begin{cases} w_i^{\text{old}} \cdot e^{\alpha_t} & \text{if misclassified} \\ w_i^{\text{old}} \cdot e^{-\alpha_t} & \text{if correctly classified} \end{cases}$$

    then normalize so the weights sum to 1. Misclassified ("hard") examples gain weight, so the next learner is forced to concentrate on them.

  3. Final Classifier: a weighted vote of all learners, $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$
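One round of steps 2b-2d can be traced numerically; a minimal numpy sketch in which the weak learner's predictions are hard-coded for illustration:

import numpy as np

y = np.array([1, 1, 1, -1, -1])        # true labels
h = np.array([1, 1, -1, -1, -1])       # weak learner's predictions (one mistake)
w = np.full(5, 1 / 5)                  # step 1: uniform weights

eps = w[h != y].sum() / w.sum()        # 2b: weighted error = 0.2
alpha = 0.5 * np.log((1 - eps) / eps)  # 2c: learner weight ~ 0.693

# 2d: up-weight mistakes, down-weight correct answers, then renormalize
w = w * np.exp(np.where(h != y, alpha, -alpha))
w = w / w.sum()
print(alpha, w)  # the single misclassified example now carries weight 0.5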



Q5. Random Forest vs. Bagging

Question:
Random Forest is a type of Bagging. What is the specific "tweak" that makes Random Forest better than standard Bagging with Decision Trees?

Answer:
Standard Bagging with trees suffers because if there is one very strong predictor, all trees in the bag will select it as the top split, making the trees highly correlated. Averaging correlated models doesn't reduce variance much.

The tweak: at each split, Random Forest considers only a random subset of $m$ features (typically $m \approx \sqrt{p}$ for classification) instead of all $p$, forcing trees to differ and decorrelating them.

Variance Reduction: $$\text{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$$

where ρ is correlation. Lower correlation → lower variance of ensemble.

Benefits: Significantly reduces the variance of the ensemble compared to standard bagging, leading to better generalization.


Q6. Evaluation Metrics: Precision, Recall, and F1-Score

Question:
When would you use Recall over Precision? Give an example. Define F1-Score.

Answer:

Definitions:

• Precision = $\frac{TP}{TP + FP}$: of all predicted positives, the fraction that are truly positive
• Recall = $\frac{TP}{TP + FN}$: of all actual positives, the fraction we found

When to use Recall:
Use when the cost of Missing a Positive is high (high FN cost). Examples: cancer screening, fraud detection, airport security.

When to use Precision:
Use when the cost of a False Alarm is high (high FP cost). Examples: spam filtering (don't lose legitimate mail), recommendation systems, legal document retrieval.

F1-Score: Harmonic mean of Precision and Recall. $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

Used when you need a balance between the two or when the class distribution is imbalanced.

F-Beta Score: Generalizes F1 to weight precision/recall differently: $$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

where β>1 favors recall, β<1 favors precision.
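These definitions are easy to check on assumed confusion counts; a small pure-Python sketch:

TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)                          # 0.8
recall = TP / (TP + FN)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # ~0.727 (= 2TP / (2TP+FP+FN))

beta = 2  # F2: recall weighted more heavily
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(precision, recall, f1, f_beta)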


🔴 Section C: The "Overlooked" Topics (High Value)

Q7. Empirical Risk Minimization (ERM)

Question:
What is the difference between "Empirical Risk" and "Structural Risk"?

Answer:

Empirical Risk is the average loss on the training sample: $R_{emp}(h) = \frac{1}{n}\sum_{i=1}^n L(h(x_i), y_i)$. Minimizing it alone can overfit, because training error underestimates the true (expected) risk. Structural Risk adds a complexity penalty: $R_{struct}(h) = R_{emp}(h) + \lambda \cdot \text{Complexity}(h)$, trading training fit against model capacity.

Goal: We aim to minimize Structural Risk to achieve good generalization (low True Risk). This is the principle behind regularization techniques like Ridge, Lasso, Dropout, etc.


Q8. Stacking (Stacked Generalization)

Question:
How does Stacking differ from Bagging and Boosting?

Answer:

Comparison: Bagging trains the same base learner on bootstrap samples in parallel and averages (reduces variance). Boosting trains learners sequentially, re-weighting hard examples (reduces bias). Stacking trains diverse, typically heterogeneous models and learns how to combine their outputs.

Key Difference - Meta-Learner:
Instead of simple voting or weighted voting, the predictions of the base learners are used as input features for a "Meta-Model" (usually Logistic Regression or another simple model).

Architecture:

  1. Level 0 (Base Models): Train diverse models (M1,M2,,Mk) on training data
  2. Level 1 (Meta-Model): Takes predictions from base models as features
    • Input: [M1(x),M2(x),,Mk(x)]
    • Output: Final prediction

Training Process: To avoid leakage, the meta-model is trained on out-of-fold predictions: each base model predicts on data it did not see (via cross-validation), and those predictions become the meta-model's training features.

Advantage: The meta-model learns which base model to trust for different types of inputs, achieving better performance than simple averaging.
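A minimal scikit-learn sketch of this architecture (the choice of base models and the synthetic dataset are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),   # Level 0
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),                         # Level 1 meta-model
    cv=5,  # out-of-fold predictions train the meta-model (no leakage)
)
print(stack.fit(X_tr, y_tr).score(X_te, y_te))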


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q9. Random Forest Algorithm

Question: Write the pseudocode for training a Random Forest.

Pseudocode:

Function TrainRandomForest(Data, NumTrees, m_features):
    Forest = []
    
    For t from 1 to NumTrees:
        # 1. Bootstrap Sampling (Sample with replacement)
        BootstrapData = RandomSampleWithReplacement(Data, size=N)
        
        # 2. Build Tree with Feature Randomization
        Tree = BuildTreeWithFeatureRandomness(BootstrapData, m_features)
        
        Add Tree to Forest
        
    Return Forest

Function BuildTreeWithFeatureRandomness(Data, m_features):
    Tree = InitializeTree()
    
    Function RecursiveSplit(Node, NodeData):
        If StoppingCriterion(NodeData):  # Pure or min samples
            Node.label = MajorityClass(NodeData)
            Return Node
            
        # Randomly select m features out of total p
        SelectedFeatures = RandomSelect(AllFeatures, k=m_features)
        
        # Find best split ONLY among selected features
        BestSplit = FindBestSplit(NodeData, SelectedFeatures)
        
        Node.feature = BestSplit.feature
        Node.threshold = BestSplit.threshold
        
        LeftData, RightData = Split(NodeData, BestSplit)
        Node.left = RecursiveSplit(NewNode(), LeftData)
        Node.right = RecursiveSplit(NewNode(), RightData)
        
        Return Node
    
    Tree.root = RecursiveSplit(RootNode, Data)
    Return Tree

Function Predict_RandomForest(Forest, x_new):
    Votes = []
    For each Tree in Forest:
        prediction = Tree.Predict(x_new)
        Votes.append(prediction)
    
    Return MajorityVote(Votes)  # Or average for regression

Key Parameters: NumTrees (more trees give a more stable ensemble, with diminishing returns), m_features (typically $\sqrt{p}$ for classification, $p/3$ for regression), and tree depth / minimum samples per leaf.


Q10. Cross-Validation (K-Fold)

Question: Write the logic for K-Fold Cross-Validation.

Function KFoldCrossValidation(Data, k, Model, Metric):
    # Split Data into k equal partitions (folds)
    Folds = SplitIntoKFolds(Data, k)
    Scores = []
    
    For i from 1 to k:
        # Use fold i as test set
        TestSet = Folds[i]
        
        # Use all other folds as training set
        TrainingSet = Union(Folds[j] for j != i)
        
        # Train model on training set
        TrainedModel = Train(Model, TrainingSet)
        
        # Evaluate on test set
        Predictions = TrainedModel.Predict(TestSet)
        Score = Metric(TestSet.labels, Predictions)
        Scores.append(Score)
    
    # Return average performance and standard deviation
    Return Mean(Scores), StdDev(Scores)

Variants: Stratified K-Fold (preserves class proportions per fold), Leave-One-Out (k = n), Repeated K-Fold, and Time Series Split (respects temporal order).

Choosing k: k = 5 or 10 is standard. Larger k gives a less biased estimate but costs more training runs; k = n (LOOCV) is nearly unbiased but expensive.


Unit 4: Sparse Modeling, Deep Learning & Sequences

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. What is "Sparse Coding"?

Answer:
Sparse coding is an unsupervised method where we learn a dictionary of basis functions (atoms) such that any input signal x can be reconstructed as a linear combination of these atoms, using as few active coefficients as possible.

Mathematical formulation: $x \approx D\alpha$

where:

  • $D$ is the dictionary (a matrix whose columns are the atoms)
  • $\alpha$ is the sparse coefficient vector (mostly zeros)

Optimization: $$\min_{D, \alpha} ||x - D\alpha||^2 + \lambda ||\alpha||_0$$

where $||\alpha||_0$ counts non-zero elements (sparsity constraint).

Biological Motivation: Mimics how the V1 area of the human visual cortex represents images using sparse combinations of edge detectors.


Q2. Define "Epoch", "Batch", and "Iteration" in Deep Learning.

Answer:

  • Epoch: one complete pass through the entire training set.
  • Batch: the subset of examples used for a single weight update.
  • Iteration: one weight update (one batch processed).

Example: 1,000 training samples with batch size 100 → 10 iterations per epoch.

Batch Size Trade-offs: large batches give smoother, more accurate gradients but cost more memory and can generalize worse; small batches are noisy (which can act as a regularizer) and update more often.


Q3. What is the "Vanishing Gradient" problem in RNNs?

Answer:
In Recurrent Neural Networks, gradients are backpropagated through time (BPTT). When unrolling an RNN for many time steps, we multiply many weight matrices together.

Problem: If the recurrent weight matrices have norm less than 1, repeated multiplication causes the gradients to shrink exponentially towards zero:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

If $\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| < 1$ for each factor, this product vanishes.

Consequence: The network stops learning from early time steps, failing to capture long-term dependencies in the sequence (e.g., remembering information from 100 steps ago).

Solutions: gated architectures (LSTM/GRU), ReLU activations, careful weight initialization, and skip/residual connections.

Opposite Problem: Exploding gradients (norms > 1), solved by gradient clipping.
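The exponential shrink/growth is easy to see numerically; a numpy sketch multiplying a gradient by the same Jacobian repeatedly (the 0.9 and 1.1 factors are illustrative):

import numpy as np

grad = np.ones(4)
W_small = 0.9 * np.eye(4)  # ||dh_t/dh_{t-1}|| < 1  ->  vanishing
W_large = 1.1 * np.eye(4)  # ||dh_t/dh_{t-1}|| > 1  ->  exploding

for T in (10, 50, 100):
    print(T,
          np.linalg.norm(np.linalg.matrix_power(W_small, T) @ grad),  # -> 0
          np.linalg.norm(np.linalg.matrix_power(W_large, T) @ grad))  # -> very large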


Q4. Explain "Dropout" as a Regularization technique.

Answer:
Dropout is a technique where, during training, randomly selected neurons are ignored (dropped out) with a probability p (e.g., 0.5).

Training: in each forward pass, every neuron is zeroed out independently with probability p, so every batch trains a different "thinned" sub-network.

Testing: all neurons are active; activations are scaled by (1-p) to match training-time expectations (with "inverted dropout", the scaling is instead done during training, so testing needs no change).

Effect: prevents co-adaptation of neurons and acts like training an implicit ensemble of sub-networks, reducing overfitting.

Where to apply: Usually on fully connected layers, less common in convolutional layers.


🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q5. Sparse Modeling: L1 vs L2 Regularization (Lasso vs Ridge)

Question:
Why does L1 regularization (Lasso) yield sparse models (feature selection) while L2 (Ridge) does not? Explain geometrically.

Answer:

Optimization Problem:
Both solve: $$\min_w ||y - Xw||^2 + \lambda \cdot \text{Penalty}(w)$$

This is equivalent to: Minimize $||y - Xw||^2$ subject to $\text{Penalty}(w) \le C$

Geometric Interpretation: The L1 constraint region $\sum |w_i| \le C$ is a diamond (polyhedron) with corners on the coordinate axes, while the L2 region $\sum w_i^2 \le C$ is a sphere. The elliptical loss contours typically first touch the diamond at a corner, where some coordinates are exactly zero, but touch the sphere at a generic point, where no coordinate is exactly zero.

Why corners matter: In high dimensions, polyhedra have exponentially many corners on coordinate axes, making it highly probable that the optimal solution has sparse coefficients.

Update Rules: Lasso has no closed form; coordinate descent applies soft-thresholding, $w_j \leftarrow \text{sign}(w_j) \max(|w_j| - \lambda, 0)$, which snaps small weights to exactly zero. Ridge has the closed form $w = (X^TX + \lambda I)^{-1} X^T y$, which shrinks weights but never zeroes them.
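The sparsity contrast shows up immediately in scikit-learn; a sketch with many irrelevant features (the synthetic data and alpha values are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # only a handful
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # all 50 (shrunk, not zeroed)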


Q6. CNN Architecture: Convolution & Pooling

Question:
Explain the role of the Convolutional Layer and Pooling Layer in a CNN. How do they reduce parameters compared to a Fully Connected Network?

Answer:

1. Convolutional Layer:
Uses Filters (Kernels) that slide over the input to detect local features.

Three Key Properties: local connectivity (each unit sees only a small receptive field), weight sharing (the same filter slides over the whole image), and translation equivariance (a shifted input produces a shifted feature map).

Parameter Reduction Example: fully connecting a 28×28 input to 100 hidden units costs 78,400 weights; a single 5×5 convolutional filter costs 25 weights (plus a bias), shared across all positions.

Operation: $$Y[i,j] = \sum_m \sum_n X[i+m, j+n] \cdot W[m,n] + b$$

2. Pooling Layer (Max/Average):
Downsamples the feature map by taking max (or average) value in a window.

Roles: reduces the spatial size of feature maps (less computation and memory downstream), provides a degree of local translation invariance, and helps control overfitting.

Example: Max pooling with a 2×2 window and stride 2 halves the width and height, keeping only the strongest activation in each window (e.g., a 4×4 map becomes 2×2).

Modern Trend: Some architectures replace pooling with strided convolutions (e.g., all-convolutional networks; ResNet does most of its downsampling with stride-2 convolutions).


Q7. LSTM Networks

Question:
How does an LSTM solve the vanishing gradient problem? Describe the role of the "Forget Gate".

Answer:

Core Idea: LSTM introduces a Cell State (Ct) that acts as a "highway" for information to flow through time with minimal transformation.

Architecture - Four Components:

  1. Forget Gate ($f_t$): Decides what information to discard from the cell state $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

    Output: 0 = completely forget, 1 = completely keep

  2. Input Gate ($i_t$): Decides what new information to store $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

  3. Cell State Update: Combines forget and input $$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

  4. Output Gate ($o_t$): Decides what to output $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad h_t = o_t * \tanh(C_t)$$

Solution to Vanishing Gradients:

By setting the Forget Gate to 1, the gradient can flow through the cell state without repeated matrix multiplication:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx f_t \approx 1$$

This creates a "gradient highway" that preserves long-term information (can remember things from 100s of steps ago).

Forget Gate Role: Controls what old information to keep vs. discard, allowing the network to "reset" when needed while maintaining important long-term context.


🔴 Section C: The "Overlooked" Topics (High Value)

Q8. Dictionary Learning vs. Fixed Bases (Fourier/Wavelet)

Question:
In Sparse Modeling, how does Dictionary Learning differ from using a fixed basis like Fourier Transform?

Answer:

Fixed Basis (Fourier, Wavelet, DCT): the atoms are fixed mathematical functions, chosen once for all signals. Fast transforms exist and no training is needed, but the basis is not adapted to the data, so representations may not be very sparse.

Dictionary Learning: the atoms are learned from the data itself, so they capture the structures actually present in the signal class (e.g., edges in natural images), yielding sparser codes at the cost of an expensive learning phase.

Optimization: Alternates between two steps:

  1. Sparse Coding: Fix dictionary D, find sparse codes $\alpha$ for each sample $$\min_\alpha ||x - D\alpha||^2 + \lambda ||\alpha||_1$$

  2. Dictionary Update: Fix codes $\alpha$, update dictionary D $$\min_D ||X - DA||^2$$

    subject to $||d_k||_2 = 1$ (normalized atoms)

Result: Often produces much sparser and more meaningful representations.

Examples: image denoising and inpainting, compressed sensing, and learned atoms that resemble edge-like filters on natural images.

Algorithm: K-SVD is a popular dictionary learning algorithm.


Q9. Transformers & Self-Attention

Question:
How does the "Self-Attention" mechanism in Transformers handle long-range dependencies better than RNNs?

Answer:

RNN Limitation:
To relate the first word of a sentence to the last, an RNN must pass information step-by-step through hidden states, a path of length O(n) along which information degrades:

$$h_1 \to h_2 \to \dots \to h_n$$

Self-Attention Solution:
Allows every position to look at every other position directly in a single step.

Mechanism:

For each token, compute three vectors:

  • Query (Q): what this token is looking for
  • Key (K): what this token offers for matching
  • Value (V): the content this token contributes

Attention Score: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Process:

  1. Compute similarity (dot product) between Query and all Keys
  2. Normalize with softmax to get attention weights
  3. Weighted sum of Values

Advantages: constant (O(1)) path length between any two positions, full parallelization across the sequence during training, and direct modeling of long-range dependencies.

Complexity: $O(n^2 d)$ where n is sequence length and d is dimension (quadratic growth in sequence length is the main limitation).


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q10. Backpropagation (High-Level Logic)

Question: Write the pseudocode for the Backpropagation algorithm for a simple network.

Pseudocode:

Function Backpropagation(X_batch, Y_batch, Network, LearningRate):
    # Network has L layers: Layer[1] to Layer[L]
    
    # ===== FORWARD PASS =====
    A[0] = X_batch  # Input layer activation
    
    For l from 1 to L:
        # Linear transformation
        Z[l] = W[l] @ A[l-1] + b[l]
        
        # Activation function
        A[l] = Activation(Z[l])  # e.g., ReLU, Sigmoid, Softmax
    
    # Final output
    Y_pred = A[L]
    
    # ===== COMPUTE LOSS =====
    Loss = LossFunction(Y_pred, Y_batch)  # e.g., Cross-Entropy, MSE
    
    # ===== BACKWARD PASS =====
    # Initialize gradient for output layer
    dA[L] = Gradient_of_Loss_wrt_Output(Y_pred, Y_batch)
    
    For l from L down to 1:
        # Gradient through activation function
        dZ[l] = dA[l] * Gradient_Activation(Z[l])
        
        # Gradients for parameters
        dW[l] = (1/m) * dZ[l] @ A[l-1].T
        db[l] = (1/m) * Sum(dZ[l], axis=0)
        
        # Propagate gradient to previous layer
        If l > 1:
            dA[l-1] = W[l].T @ dZ[l]
    
    # ===== UPDATE PARAMETERS =====
    For l from 1 to L:
        W[l] = W[l] - LearningRate * dW[l]
        b[l] = b[l] - LearningRate * db[l]
    
    Return Loss

Key Points: backpropagation is the chain rule applied layer by layer; the forward pass must cache Z[l] and A[l] for reuse in the backward pass; the backward pass costs roughly the same as the forward pass.


Q11. CNN Output Size Calculation

Question:
Given an input image of size W×W, a Filter size F×F, Padding P, and Stride S. Write the formula/pseudocode to calculate the output size.

Formula: $$\text{OutputSize} = \frac{W - F + 2P}{S} + 1$$

For Non-Square Inputs ($W_{out}, H_{out}$): $$W_{out} = \frac{W_{in} - F_W + 2P_W}{S_W} + 1 \qquad H_{out} = \frac{H_{in} - F_H + 2P_H}{S_H} + 1$$

Examples:

Example 1: Input 32×32, Filter 5×5, Padding 0, Stride 1:

$$\text{Output} = \frac{32 - 5 + 2 \cdot 0}{1} + 1 = 27 + 1 = 28$$

Output is 28×28

Example 2 (Same Padding): Input 32×32, Filter 5×5, Padding 2, Stride 1:

$$\text{Output} = \frac{32 - 5 + 2 \cdot 2}{1} + 1 = 31 + 1 = 32$$

Output is 32×32 (same as input)

Padding Types: "Valid" padding uses P = 0 (the output shrinks); "Same" padding uses P = (F-1)/2 for stride 1 (the output size equals the input size).
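A small helper implementing the formula (and the two worked examples) directly:

def conv_output_size(w, f, p, s):
    # (W - F + 2P) / S + 1; complain if the filter does not tile the input evenly
    out, rem = divmod(w - f + 2 * p, s)
    if rem != 0:
        raise ValueError("filter does not tile the input evenly")
    return out + 1

print(conv_output_size(32, 5, 0, 1))  # 28  (valid padding, Example 1)
print(conv_output_size(32, 5, 2, 1))  # 32  (same padding,  Example 2)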


Unit 5: Scalable & Advanced Machine Learning

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. What is "Online Learning"?

Answer:
Online Learning is a paradigm where the model learns from a stream of data instances one by one (sequentially), updating immediately after each example.

Characteristics: each example is processed once and discarded (constant memory, no stored dataset), updates are cheap, and the model can adapt to concept drift in the stream.

Applications: spam filtering, ad click-through prediction, financial forecasting, and any setting where data arrives continuously or is too large to fit in memory.

Algorithms: Online Gradient Descent, Perceptron, FTRL (Follow-The-Regularized-Leader)

Vs. Batch Learning: Batch sees all data, trains once. Online sees data sequentially, updates continuously.


Q2. Define "Exploration" vs. "Exploitation" in Reinforcement Learning.

Answer:

Trade-off:

Example - Multi-Armed Bandit:

Strategies:


Q3. What is the difference between "Self-Training" and "Co-Training" in Semi-Supervised Learning?

Answer:

Both leverage unlabeled data, but differ in approach:

Self-Training: a single model is trained on the labeled data, predicts labels for the unlabeled data, adds its most confident predictions (pseudo-labels) to the training set, and retrains. Simple, but its own errors can reinforce themselves.

Co-Training: two models are trained on two different, ideally independent feature views of the same data; each model labels unlabeled examples for the other. The redundancy between views limits error propagation.

Example: classifying web pages using the page text as one view and the anchor text of incoming links as the second view.


Q4. What is a "Parameter Server" in Distributed ML?

Answer:
In distributed learning, a Parameter Server is a central node (or cluster) that holds the global model weights.

Architecture: one or more server nodes store and serve the global parameters; many worker nodes each hold a shard of the data and compute gradients locally.

Workflow:

  1. Workers pull current parameters from server
  2. Workers compute gradients on their data shard: $\nabla_i = \nabla L(x_i, \theta)$
  3. Workers push gradients to server
  4. Server aggregates updates: $\theta := \theta - \eta \sum_i \nabla_i$
  5. Server sends fresh weights back

Advantages: scales training to datasets and models too large for one machine; workers run in parallel; the system can tolerate slow or failed workers (especially with asynchronous updates).

Challenges: network communication can become the bottleneck, and asynchronous training suffers from stale gradients (workers updating against outdated parameters).


🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q5. Active Learning: Query Strategies

Question:
The goal of Active Learning is to query the "most informative" samples. Explain Uncertainty Sampling and Query-by-Committee.

Answer:

Active Learning aims to minimize labeling cost by strategically selecting which samples to label.

1. Uncertainty Sampling:
The model queries instances where it is least certain about the class label.

Three Metrics:

a) Least Confidence: $x^* = \arg\max_x \left(1 - P(\hat{y}|x)\right)$ where $\hat{y}$ is the most probable class

b) Margin Sampling: $x^* = \arg\min_x \left(P(y_1|x) - P(y_2|x)\right)$ where $y_1, y_2$ are the top two classes. Small margin = high uncertainty

c) Entropy: $x^* = \arg\max_x \left(-\sum_y P(y|x) \log P(y|x)\right)$. High entropy = high uncertainty

2. Query-by-Committee (QBC):
Train a committee of diverse models (e.g., via bagging or different hypothesis classes) and query the instances on which the members disagree most.

Disagreement Metrics: vote entropy (entropy of the committee's vote distribution) and the average KL divergence of each member's predictive distribution from the committee consensus.

Logic: If all members agree, the instance is not informative. Maximum disagreement indicates the decision boundary is uncertain in that region.

Example: with a committee of 5 classifiers, an instance voted 3-vs-2 is far more informative than one voted 5-vs-0.

Advantage: More robust than single-model uncertainty, captures model uncertainty.


Q6. Reinforcement Learning: Markov Decision Process (MDP)

Question:
Define the tuple (S,A,P,R,γ) that characterizes an MDP.

Answer:

An MDP is defined by a 5-tuple (S,A,P,R,γ):

1. S (States):
The set of all possible situations the agent can be in.

2. A (Actions):
The set of all possible moves the agent can make.

3. P (Transition Probability):
$P(s'|s,a)$ is the probability of transitioning to state $s'$ after taking action a in state s.

4. R (Reward Function):
$R(s,a)$ or $R(s,a,s')$ is the immediate reward received after taking action a in state s.

5. γ (Discount Factor):
A value between 0 and 1 that weighs future rewards vs. immediate rewards.

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$

Goal: Find policy π(s) that maximizes expected cumulative discounted reward.


Q7. Bayesian Learning: MAP vs. MLE

Question:
How does Maximum A Posteriori (MAP) estimation differ from Maximum Likelihood Estimation (MLE)? When are they equivalent?

Answer:

Both estimate parameters θ, but differ in what they optimize:

MLE (Maximum Likelihood Estimation):
Finds parameters θ that maximize the likelihood of the observed data:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta)$$

MAP (Maximum A Posteriori):
Finds parameters θ that maximize the posterior probability:

$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|D)$$

Using Bayes' Theorem: $$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \propto P(D|\theta)P(\theta)$$

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[P(D|\theta)P(\theta)\right]$$

Equivalence:
MAP becomes MLE when we assume a Uniform Prior $P(\theta) = \text{const}$:

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[P(D|\theta) \cdot \text{const}\right] = \arg\max_\theta P(D|\theta) = \hat{\theta}_{MLE}$$

Connection to Regularization: a Gaussian prior on the weights makes MAP equivalent to L2 (Ridge) regularization; a Laplace prior makes it equivalent to L1 (Lasso).

In log space: $$\log P(\theta|D) = \log P(D|\theta) + \log P(\theta) - \log P(D)$$
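A one-line numeric contrast for a coin with a Beta(a, b) prior (the counts and prior strength below are illustrative assumptions):

heads, tails = 7, 3
a, b = 5, 5  # Beta prior pulling the estimate toward 0.5

theta_mle = heads / (heads + tails)                        # argmax P(D|theta) = 0.70
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode ~ 0.61

print(theta_mle, theta_map)  # with a uniform prior (a = b = 1), MAP equals MLE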


🔴 Section C: The "Overlooked" Topics (High Value)

Q8. Graph-Based Semi-Supervised Learning

Question:
Explain the intuition behind "Label Propagation" in graph-based learning.

Answer:

Core Idea: Use the structure of data (similarity graph) to spread labels from labeled to unlabeled points.

Graph Construction: every data point (labeled or not) is a node; edges connect similar points (k-nearest-neighbor or ε-radius graph), weighted by similarity, e.g. $w_{ij} = \exp(-||x_i - x_j||^2 / 2\sigma^2)$.

Assumptions:

  1. Smoothness: Points connected by an edge likely have the same label
  2. Cluster: Points in the same cluster likely have the same label
  3. Manifold: Data lies on a low-dimensional manifold

Label Propagation Algorithm:

  1. Initialize: Labeled nodes get their true labels, unlabeled nodes get uniform distribution

  2. Propagate: Each node adopts a weighted average of its neighbors' labels: $$Y_i^{(t+1)} = \frac{\sum_{j \in N(i)} w_{ij} Y_j^{(t)}}{\sum_{j \in N(i)} w_{ij}}$$

  3. Clamp: Keep labeled nodes fixed at their true labels

  4. Iterate: Repeat until convergence

Effect: Labels "flow" through high-density regions via edges. If unlabeled point has many labeled neighbors of "Class A", it becomes "Class A".

Matrix Form: $$Y^{(t+1)} = \alpha S Y^{(t)} + (1-\alpha) Y^{(0)}$$

where S is the normalized similarity matrix.
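The iteration above fits in a few numpy lines; a sketch on a tiny 4-node chain graph, using row normalization for simplicity (the adjacency weights and α are illustrative assumptions):

import numpy as np

# chain: node 0 (labeled A) - node 1 - node 2 - node 3 (labeled B)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.diag(1 / W.sum(axis=1)) @ W  # row-normalized similarity matrix

Y0 = np.array([[1, 0], [0.5, 0.5], [0.5, 0.5], [0, 1]])  # labeled rows one-hot
Y, alpha = Y0.copy(), 0.9

for _ in range(100):
    Y = alpha * S @ Y + (1 - alpha) * Y0  # propagate, then pull labeled nodes back

print(Y.argmax(axis=1))  # nodes 0,1 -> class 0; nodes 2,3 -> class 1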

Advantages: exploits the geometry of the unlabeled data, works well with very few labels, and naturally handles the transductive setting (labels for exactly the given unlabeled points).


Q9. Q-Learning (Off-Policy vs. On-Policy)

Question:
Why is Q-Learning called an "Off-Policy" algorithm?

Answer:

On-Policy (e.g., SARSA):
The agent learns the value of the policy it is currently executing (including its exploration steps).

SARSA Update: $$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma Q(s', a') - Q(s,a)\right]$$

where $a'$ is the action actually taken in state $s'$ (could be random due to ε-greedy).

Off-Policy (Q-Learning):
The agent learns the value of the Optimal Policy (greedy), regardless of the policy it is using to explore.

Q-Learning Update: $$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{a'} Q(s', a') - Q(s,a)\right]$$

Note the $\max_{a'}$ term: it assumes we take the best action in the next state, even if we actually explore randomly.

Key Difference: SARSA evaluates the policy it actually follows (exploration included), learning "safer" behavior; Q-Learning evaluates the greedy policy regardless of how it explores.

Why "Off-Policy"?
The policy being evaluated (optimal/greedy) is different from the policy being followed (ε-greedy exploration).

Advantage: Can learn optimal policy while exploring aggressively. Converges faster in practice.

Disadvantage: Higher variance, can be unstable with function approximation.


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q10. Online Gradient Descent (Perceptron)

Question: Write the algorithm for Online Gradient Descent for a linear model.

Pseudocode:

# Initialize
w = zeros(n_features)  # Weight vector
learning_rate = 0.01

# Process streaming data
For t = 1, 2, 3, ... (until stream ends):
    # Receive new instance
    Receive (x_t, y_t)
    
    # 1. Make Prediction
    y_pred = sign(dot(w, x_t))  # For classification
    # OR y_pred = dot(w, x_t) for regression
    
    # 2. Compute Loss/Error
    error = y_t - y_pred
    
    # 3. Update Immediately (no batch accumulation)
    # For classification (Perceptron):
    If y_pred != y_t:
        w = w + learning_rate * y_t * x_t
    
    # For regression (SGD):
    # w = w - learning_rate * (y_pred - y_t) * x_t
    
    # Optional: Decay learning rate
    learning_rate = learning_rate * decay_factor

Perceptron Convergence Theorem:
If data is linearly separable, the perceptron algorithm converges in finite steps (mistake bound).

Variants: averaged perceptron (averages the weights over time for stability), kernel perceptron, and passive-aggressive updates.


Q11. Q-Learning Algorithm

Question: Write the update loop for Q-Learning.

Pseudocode:

Function QLearning(Environment, NumEpisodes, alpha, gamma, epsilon):
    # Initialize Q-table (or neural network)
    Q = zeros(num_states, num_actions)
    
    For episode = 1 to NumEpisodes:
        # Initialize environment
        s = Environment.reset()
        
        Repeat (for each step of episode):
            # 1. Choose action using epsilon-greedy policy
            If random() < epsilon:
                a = random_action()  # Exploration
            Else:
                a = argmax_a(Q[s, :])  # Exploitation
            
            # 2. Take action, observe reward and next state
            s_next, r, done = Environment.step(a)
            
            # 3. Q-Learning Update (Bellman Equation)
            # Update towards: Reward + discounted value of best next action
            target = r + gamma * max(Q[s_next, :])
            Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
            
            # 4. Move to next state
            s = s_next
            
            # Optional: Decay epsilon (explore less over time)
            epsilon = epsilon * decay_factor
            
        Until s is terminal (done == True)
    
    Return Q

# Extract Policy from Q-table
Function ExtractPolicy(Q):
    policy = {}
    For each state s:
        policy[s] = argmax_a(Q[s, :])
    Return policy

Key Equation (Bellman Optimality):

$$Q^*(s,a) = \mathbb{E}\left[R + \gamma \max_{a'} Q^*(s', a')\right]$$

Hyperparameters: α (learning rate, e.g., 0.1), γ (discount factor, e.g., 0.9-0.99), ε (exploration rate, usually decayed over time).


Unit 6: Recent Trends & Advanced Architectures

🟢 Section A: Short Answer / Definitions (2-3 Marks)

Q1. What is "Self-Supervised Learning"?

Answer:
It is a learning paradigm where the model generates its own labels from the data itself, without human annotation. It's a form of unsupervised learning that creates supervised tasks.

Key Idea: Design a pretext task where labels are automatically generated, then transfer learned representations to downstream tasks.

Examples:

  1. NLP (BERT): Mask a word and predict it
    • Input: "The cat sat on the [MASK]"
    • Label: "mat" (automatically generated)
  2. Computer Vision (SimCLR): Predict if two augmented versions of the same image match
    • Crop, rotate, color jitter same image → Positive pair
    • Different images → Negative pair
  3. Video: Predict future frames from past frames

Benefit: Allows training on massive unlabeled datasets (billions of images/documents), learning general representations that transfer to many tasks.

Difference from Unsupervised: Creates supervisory signal from data structure, not just finding patterns.


Q2. Define "Contrastive Learning".

Answer:
A technique where the model learns representations by comparing pairs of samples, pulling similar samples together and pushing dissimilar samples apart in embedding space.

Core Principle: embeddings of positive pairs (e.g., two augmentations of the same sample) are pulled together; embeddings of negative pairs (different samples) are pushed apart.

Loss Function (InfoNCE): $$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$$

where $z_i, z_j^+$ are a positive pair and $\tau$ is the temperature.

Objective: Maximize agreement between positive pairs, minimize it for negative pairs.

Popular Methods: SimCLR, MoCo, and (negative-free variants) BYOL and SimSiam.

Why it works: Forces model to learn features that are invariant to augmentations but discriminative between different samples.


Q3. What is the key difference between a CNN and a Vision Transformer (ViT)?

Answer:

CNN: processes images with convolutions, building in strong inductive biases: locality, weight sharing, and translation equivariance. Works well even on modest datasets.

Vision Transformer (ViT): splits the image into patches, embeds them as a sequence, and applies global self-attention. It has far weaker built-in biases, so it must learn spatial structure from data.

Performance: CNNs tend to win on small and medium datasets; ViTs match or surpass them when pre-trained on very large datasets, helped by their global receptive field from the first layer.

Architecture:

Image → Patch Embedding → [Transformer Encoder]×L → Classification Head

🟡 Section B: Core Algorithms & Theory (5-10 Marks)

Q4. Transformer Architecture: Encoder vs. Decoder

Question:
Transformers (like in BERT and GPT) use Encoder and Decoder blocks. Explain the difference in their attention masks and use cases.

Answer:

Both use self-attention but differ in masking strategy:

1. Encoder (e.g., BERT, ViT):

Mechanism: Uses Bidirectional Self-Attention. Every token can attend to every other token (left and right) simultaneously.

Attention Matrix: No masking (full attention)

       t1  t2  t3  t4
t1     ✓   ✓   ✓   ✓
t2     ✓   ✓   ✓   ✓
t3     ✓   ✓   ✓   ✓
t4     ✓   ✓   ✓   ✓

Use Cases: understanding tasks such as text classification, named-entity recognition, and sentence embeddings.

Training: Typically uses Masked Language Modeling (predict masked tokens).

2. Decoder (e.g., GPT, Transformer-XL):

Mechanism: Uses Masked (Causal) Self-Attention. A token can only attend to tokens that came before it (to the left).

Attention Matrix: Upper triangle masked

       t1  t2  t3  t4
t1     ✓   ✗   ✗   ✗
t2     ✓   ✓   ✗   ✗
t3     ✓   ✓   ✓   ✗
t4     ✓   ✓   ✓   ✓

Use Cases: generation tasks such as text completion, dialogue, and code generation.

Training: Uses Autoregressive Language Modeling (predict next token given previous).

3. Encoder-Decoder (e.g., T5, BART):

Combines both: the encoder reads the full input bidirectionally; the decoder generates the output autoregressively while cross-attending to the encoder's states.

Use Cases: sequence-to-sequence tasks such as machine translation and summarization.

Key Difference: Masking determines what information flows where, enabling different capabilities.


Q5. Diffusion Models

Question:
Explain the "Forward" and "Reverse" processes in Diffusion Models.

Answer:

Diffusion Models generate images by learning to reverse a noising process.

Forward Process (Diffusion/Noising):
Gradually add Gaussian noise to an image over T steps until it becomes pure noise.

$$x_0 \to x_1 \to x_2 \to \dots \to x_T \ (\text{Pure Noise})$$

Mathematical formulation: $$q(x_t|x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)$$

where βt is the noise schedule (increases with t).

Closed form (nice property): $$q(x_t|x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\right)$$

This process is fixed (no learning). It's a Markov chain that destroys information.

Reverse Process (Denoising):
The neural network learns to predict and remove the noise added at each step, effectively reconstructing the image from pure noise.

$$x_T \to x_{T-1} \to \dots \to x_1 \to x_0 \ (\text{Generated Image})$$

Network learns: $$p_\theta(x_{t-1}|x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$$

In practice, the network predicts the added noise $\epsilon_\theta(x_t, t)$:

Training Loss: $$L = \mathbb{E}_{t, x_0, \epsilon}\left[||\epsilon - \epsilon_\theta(x_t, t)||^2\right]$$

Sampling (Generation):

  1. Start with random noise $x_T \sim \mathcal{N}(0, I)$
  2. For $t = T$ down to 1:
    • Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
    • Remove predicted noise: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}\right) + \sigma_t z$

Advantages: very high sample quality and diversity, and stable training compared to GANs; the main drawback is slow sampling (many denoising steps).


Q6. Representation Learning with Autoencoders

Question:
Describe the structure of an Autoencoder and the role of the "Bottleneck".

Answer:

Structure:

An Autoencoder consists of two parts that learn to compress and reconstruct data:

  1. Encoder (f): Compresses input x into a lower-dimensional latent vector z: $z = f_\phi(x)$

    Example: 784 (28×28 image) → 32 (latent code)

  2. Decoder (g): Reconstructs input from the latent vector: $\hat{x} = g_\theta(z)$

    Example: 32 (latent code) → 784 (reconstructed image)

Training Objective: Minimize reconstruction error $$L = ||x - \hat{x}||^2 = ||x - g_\theta(f_\phi(x))||^2$$

Bottleneck (Latent Space):
The latent layer z has fewer dimensions than input x.

Role: forces the network to compress, so it must keep only the most salient factors of variation; without the bottleneck (or some other constraint), the autoencoder could learn the trivial identity mapping.

Types:

  1. Vanilla Autoencoder: Simple compression

  2. Denoising Autoencoder (DAE):

    • Input: Corrupted x~ (add noise)
    • Output: Clean x
    • Forces robustness to noise
  3. Variational Autoencoder (VAE):

    • Latent $z \sim \mathcal{N}(\mu, \sigma^2)$ (probabilistic)
    • Can generate new samples by sampling z
    • Loss: Reconstruction + KL divergence
  4. Sparse Autoencoder:

    • Adds sparsity penalty on z
    • Forces most activations to be zero

Applications: dimensionality reduction, denoising, anomaly detection (high reconstruction error flags outliers), and representation pretraining.


🔴 Section C: The "Overlooked" Topics (High Value)

Q7. Vision Transformers (ViT): Patch Embedding

Question:
How does a Vision Transformer handle a 2D image as a sequence?

Answer:

ViT transforms an image into a sequence of patch embeddings through several steps:

Step 1: Patching
Divide image into fixed-size patches (e.g., 16×16 pixels).

Image: $H \times W \times C$ → $N$ patches of size $P \times P \times C$

Number of patches: $N = \frac{H \times W}{P^2}$

Example: a 224×224×3 image with 16×16 patches → 196 patches

Step 2: Flattening
Each patch is flattened into a 1D vector.

Patch: $P \times P \times C$ → vector of length $P^2 C$

Example: $16 \times 16 \times 3 = 768$ dimensions

Step 3: Linear Projection (Patch Embedding)
Each flattened patch is multiplied by a learnable projection matrix E to create embeddings.

$$z_i = E \cdot \text{flatten}(\text{patch}_i)$$

where $E \in \mathbb{R}^{D \times (P^2 C)}$ and D is the embedding dimension (e.g., 768).

Step 4: Add [CLS] Token
Prepend a learnable classification token (similar to BERT).

$$z_0 = [\text{CLS}]$$

This token's final state is used for classification.

Step 5: Positional Embedding
Since transformers have no notion of order/position, add learnable position embeddings to retain spatial information.

$$z_i^{\text{final}} = z_i + E_{\text{pos},i}$$

where $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ is learned during training.

Final Sequence: $[[\text{CLS}], \text{Patch}_1, \text{Patch}_2, \dots, \text{Patch}_N] + \text{Positional Embeddings}$

Result: The image is now a sequence of vectors, just like words in a sentence, ready for transformer processing!

Feed to Transformer: $\text{TransformerEncoder}([z_0, z_1, \dots, z_N])$

Classification: Use final state of [CLS] token for prediction.


Q8. Large Language Models (LLMs): Fine-Tuning approaches

Question:
What is "Instruction Tuning" in the context of LLMs?

Answer:

LLMs are trained in multiple stages to become useful assistants:

Stage 1: Pre-training
The model learns to predict the next word on massive text corpora (billions of tokens).

Stage 2: Instruction Tuning (Supervised Fine-Tuning)
The model is fine-tuned on a dataset of (Instruction, Output) pairs.

Examples:

Instruction: "Translate this to French: 'Hello world'"
Output: "Bonjour le monde"

Instruction: "Summarize this article: [text]"
Output: [summary]

Instruction: "Write a poem about spring"
Output: [poem]

Goal: Align the model's behavior with user intent. Teach it to follow instructions rather than just complete text.

Training: Standard supervised learning (next-token prediction on the target outputs) over instruction-following datasets.

Stage 3: RLHF (Reinforcement Learning from Human Feedback) - Optional

  1. Collect human preferences: "Output A is better than Output B"
  2. Train a reward model to predict human preferences
  3. Use RL (PPO) to optimize LLM to maximize reward model score

Result: Model becomes helpful, harmless, and honest (like ChatGPT).

Key Difference: pre-training teaches the model language and world knowledge; instruction tuning teaches it to follow instructions; RLHF further aligns its outputs with human preferences.


🔵 Section D: Algorithmic Pseudocode (Exam Ready)

Q9. Self-Attention Mechanism (Simplified)

Question: Write the core equation/steps for Scaled Dot-Product Attention.

Pseudocode:

Function ScaledDotProductAttention(Query, Key, Value):
    """
    Query: (batch, seq_len, d_k)
    Key: (batch, seq_len, d_k)
    Value: (batch, seq_len, d_v)
    """
    
    # 1. Calculate Similarity Scores (dot product)
    # Q * K^T gives pairwise similarity between all positions
    Scores = MatMul(Query, Transpose(Key))  # (batch, seq_len, seq_len)
    
    # 2. Scale by sqrt(d_k)
    # Prevents scores from becoming too large (softmax saturation)
    d_k = Dimension_of_Key_vectors
    ScaledScores = Scores / Sqrt(d_k)
    
    # 3. Apply Mask (optional, for decoder)
    # Mask future positions by setting to -infinity
    If Mask is provided:
        ScaledScores = ScaledScores + Mask  # Mask = -inf for invalid positions
    
    # 4. Softmax to get Attention Weights (probabilities)
    # Each row sums to 1 (probability distribution over positions)
    AttentionWeights = Softmax(ScaledScores, axis=-1)  # (batch, seq_len, seq_len)
    
    # 5. Weighted Sum of Values
    # Each position gets a weighted combination of all Value vectors
    Output = MatMul(AttentionWeights, Value)  # (batch, seq_len, d_v)
    
    Return Output, AttentionWeights

# Multi-Head Attention
Function MultiHeadAttention(Q, K, V, num_heads):
    d_model = Q.shape[-1]
    d_k = d_model // num_heads
    
    # Split into multiple heads
    # Each head learns different aspects of relationships
    For i from 1 to num_heads:
        Q_head_i = Linear_i(Q)  # Project to d_k dimensions
        K_head_i = Linear_i(K)
        V_head_i = Linear_i(V)
        
        head_i = ScaledDotProductAttention(Q_head_i, K_head_i, V_head_i)
    
    # Concatenate all heads
    MultiHead = Concatenate([head_1, head_2, ..., head_num_heads])
    
    # Final linear projection
    Output = Linear_output(MultiHead)
    
    Return Output

Key Equations:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Why scaling? Without $\sqrt{d_k}$, dot products grow large for high dimensions, pushing softmax into regions with vanishing gradients.

Intuition: the Query asks "what am I looking for?", the Key advertises "what do I contain?", and the Value carries "what do I pass along?"; the attention weights measure how relevant each position is to each query.


Q10. Variational Autoencoder (VAE) - Conceptual Algorithm

Question: Explain the training process and sampling from a VAE.

Answer:

VAE Architecture:

Unlike standard autoencoders, VAE learns a probabilistic latent space that enables generation.

Training Process:

Function TrainVAE(Data, Epochs):
    # Encoder outputs mean and log-variance
    Encoder_mean = NeuralNet_mu(x)
    Encoder_logvar = NeuralNet_sigma(x)
    
    For each training example x:
        # 1. Encode to latent distribution parameters
        mu, log_var = Encoder(x)
        
        # 2. Reparameterization Trick (allows backprop)
        # Sample from N(mu, sigma^2) in a differentiable way
        epsilon = SampleNormal(0, 1)  # Standard normal
        z = mu + exp(0.5 * log_var) * epsilon
        
        # 3. Decode
        x_reconstructed = Decoder(z)
        
        # 4. Compute Loss (two terms)
        # a) Reconstruction Loss (how well decoded matches input)
        reconstruction_loss = MSE(x, x_reconstructed)
        # OR: BinaryCrossEntropy for binary data
        
        # b) KL Divergence (regularization, keeps latent close to N(0,1))
        # Forces encoder to output distributions close to standard normal
        KL_loss = -0.5 * Sum(1 + log_var - mu^2 - exp(log_var))
        
        # Total Loss
        total_loss = reconstruction_loss + beta * KL_loss
        
        # 5. Backpropagate and Update weights
        Backpropagate(total_loss)
        UpdateWeights()
    
    Return TrainedEncoder, TrainedDecoder

Function GenerateNewSample(Decoder):
    # Sample from prior (standard normal)
    z = SampleNormal(0, 1, latent_dim)
    
    # Decode to get new sample
    x_new = Decoder(z)
    
    Return x_new

Mathematical Formulation:

ELBO (Evidence Lower BOund) - What we maximize: $$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\left(q(z|x) \,\|\, p(z)\right)$$

Loss function (what we minimize): $$L_{VAE} = ||x - \hat{x}||^2 + \beta \cdot D_{KL}\left(q(z|x) \,\|\, \mathcal{N}(0, I)\right)$$

where the first term is the reconstruction error and β weights the KL regularizer (β = 1 is the standard VAE).

KL Divergence (closed form for Gaussians): $$D_{KL} = -\frac{1}{2} \sum_{j=1}^J \left(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\right)$$

Reparameterization Trick:
Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly (non-differentiable), we sample: $$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This moves the randomness into $\epsilon$, making z differentiable w.r.t. $\mu, \sigma$.

Applications: image generation, latent-space interpolation between samples, and anomaly detection.


Additional Important Concepts & Tips

Cross-Validation Best Practices

When to use: whenever data is limited and a single train/test split would give a noisy estimate, and for hyperparameter tuning / model selection.

Stratified K-Fold (Classification):

Function StratifiedKFold(Data, k):
    # Ensures each fold has same class distribution as full dataset
    For each class c:
        class_samples = Data[Data.label == c]
        Split class_samples into k equal parts
        Distribute parts across k folds
    
    Return k folds with balanced class distribution

Time Series Split:

Function TimeSeriesSplit(Data, k):
    # Respects temporal order (no future data in training)
    n = len(Data)
    For i from 1 to k:
        train_end = n * i / (k+1)
        test_end = n * (i+1) / (k+1)
        
        TrainSet = Data[0 : train_end]
        TestSet = Data[train_end : test_end]
        
        Yield TrainSet, TestSet

Gradient Descent Variants Comparison

1. Batch Gradient Descent: uses the entire dataset for each update. Stable, exact gradient, but slow and memory-hungry.

2. Stochastic Gradient Descent (SGD): updates on one example at a time. Fast, and the noise can help escape shallow local minima, but the loss fluctuates.

3. Mini-Batch Gradient Descent: updates on small batches (e.g., 32-256). The standard compromise: vectorizable and reasonably stable.

4. Momentum: accumulates a velocity $v := \beta v + \nabla J(\theta)$ and updates $\theta := \theta - \alpha v$; damps oscillations and accelerates along consistent directions.

5. Adam (Adaptive Moment Estimation): keeps exponentially decaying averages of the gradient (first moment) and its square (second moment), giving each parameter its own adaptive step size; a robust default optimizer.


Bias-Variance Trade-off Visualization

Decomposition of Expected Error: $$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Understanding: high bias means the model is too simple and underfits; high variance means the model is too flexible and overfits; increasing model complexity trades bias for variance.

Ensemble Effects: Bagging and Random Forests mainly reduce variance; Boosting mainly reduces bias.


Confusion Matrix Deep Dive

For binary classification with classes {Positive, Negative}:

                Predicted
              Pos    Neg
Actual  Pos   TP     FN
        Neg   FP     TN

Metrics:

  1. Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$

    • Misleading for imbalanced data
  2. Precision: $\frac{TP}{TP + FP}$

    • Of predicted positives, how many are correct?
    • Use when FP is costly
  3. Recall (Sensitivity): $\frac{TP}{TP + FN}$

    • Of actual positives, how many did we find?
    • Use when FN is costly
  4. Specificity: $\frac{TN}{TN + FP}$

    • Of actual negatives, how many did we correctly identify?
  5. F1-Score: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

    • Harmonic mean, good for imbalanced classes
  6. ROC-AUC: Area under ROC curve (TPR vs FPR at different thresholds)

    • Measures ranking quality
    • 0.5 = random, 1.0 = perfect

Choosing Threshold: lowering the decision threshold raises recall at the cost of precision, and vice versa; pick the operating point from the ROC or precision-recall curve according to the relative costs of FP and FN.


Regularization Techniques Summary

For Linear Models:

  1. L1 (Lasso): $\lambda \sum |w_i|$ → Sparse solutions
  2. L2 (Ridge): $\lambda \sum w_i^2$ → Small weights
  3. Elastic Net: $\lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2$ → Both effects

For Neural Networks:

  1. Dropout: Randomly drop neurons during training
  2. Batch Normalization: Normalize layer inputs
  3. Early Stopping: Stop when validation error increases
  4. Data Augmentation: Artificially increase training data
  5. Weight Decay: L2 penalty in optimizer
  6. Label Smoothing: Soften one-hot labels (0.9 instead of 1.0)

For Trees:

  1. Max Depth: Limit tree depth
  2. Min Samples Split: Require minimum samples to split
  3. Max Features: Random subset at each split (Random Forest)

Important Activation Functions

1. Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

2. Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

3. ReLU: $\text{ReLU}(x) = \max(0, x)$

4. Leaky ReLU: $\text{LeakyReLU}(x) = \max(\alpha x, x)$, $\alpha = 0.01$

5. ELU: $\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases}$

6. Softmax: $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

7. Swish/SiLU: $\text{Swish}(x) = x \cdot \sigma(x)$
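All of these map one-to-one onto numpy; a compact sketch matching the formulas above:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))
relu = lambda x: np.maximum(0, x)
leaky_relu = lambda x, a=0.01: np.maximum(a * x, x)
elu = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1))
swish = lambda x: x * sigmoid(x)  # tanh is np.tanh

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), relu(x), softmax(x))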


Final Exam Tips

Short Questions (2-3 marks): give the definition, one formula if relevant, and a one-line example (mirrors Section A above).

Medium Questions (5-10 marks): state the setup, derive or outline the algorithm step by step, and note one limitation or comparison (mirrors Section B).

Long Questions (15 marks): full treatment: definitions, derivation, pseudocode, a worked example, and advantages/disadvantages (mirrors Sections C-D).

Strategy:

  1. Read all questions first, plan time allocation
  2. Answer easiest questions first (build confidence)
  3. Use bullet points for clarity
  4. Highlight keywords (e.g., "greedy", "convex", "sparse")
  5. If stuck, write what you know and move on
  6. Save 10 minutes to review

Common Mistakes to Avoid: confusing precision with recall, forgetting to scale features for distance-based methods (k-NN, SVM, K-Means), mixing up which penalty (L1 vs L2) produces sparsity, and flipping the sign in gradient descent update rules.

Key Formula Sheet - Memorize These:

  1. Entropy: $H(S) = -\sum p_i \log_2(p_i)$
  2. Normal Equation: $\theta = (X^TX)^{-1}X^Ty$
  3. Logistic Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  4. Cross-Entropy Loss: $L = -\sum y \log(\hat{y})$
  5. Softmax: $\frac{e^{z_i}}{\sum_j e^{z_j}}$
  6. Attention: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  7. Conv Output: $\frac{W - F + 2P}{S} + 1$
  8. F1-Score: $\frac{2PR}{P + R}$
  9. Bellman: $Q(s,a) = R + \gamma \max_{a'} Q(s', a')$
  10. KL Divergence: $-\frac{1}{2}\sum\left(1 + \log\sigma^2 - \mu^2 - \sigma^2\right)$