Time Series Mining and Periodicity Analysis -- Data Mining -- Module 3


Index

  1. Trend Analysis
  2. Periodicity Analysis
  3. 🚩 What Is Periodicity?
  4. Lag in Time Series
  5. Autocorrelation Function (ACF)
  6. 🌡️ What Autocorrelation Values Mean
  7. Fourier Transform
  8. Now how on earth would one interpret this data?
  9. 💡 What Fourier Transform actually does
  10. Trend Analysis
  11. 1. Free-Hand Method ✍️ (Explanation Only)
  12. 2. Semi-Average Method
  13. 3. Moving Average Method
  14. Weighted Moving Averages
  15. 4. Fitting Mathematical Curves
  16. a. Linear Trend (Recap)
  17. b. Quadratic Trend
  18. Similarity Search
  19. 1. Euclidean Similarity Search
  20. 2. Dynamic Time Warping (DTW)
  21. 3. Cosine Similarity Search

Time Series Analysis

🕒 Time Series Analysis: Basics

🔹 What is a Time Series?

A time series is a sequence of data points indexed in time order. Typically, the data is collected at consistent intervals (e.g., daily stock prices, hourly temperature, monthly sales).

Example:

Time       Temperature (°C)
---------------------------
01:00      22.1
02:00      22.3
03:00      21.8
04:00      21.2

🔹 Why Analyze Time Series?

Time series analysis is used for:


🧱 Components of a Time Series

Time series data is usually decomposed into the following key components:

1. Trend (Tt)

A long-term increase or decrease in the data. It doesn't have to be linear.

Example:
Gradual increase in global temperatures over years.

2. Seasonality (St)

A repeating pattern at regular intervals (hourly, daily, monthly, yearly).
It’s caused by seasonal factors like weather, holidays, habits, etc.

Example:
Higher ice cream sales in summer months every year.

3. Cyclic Patterns (Ct)

These are long-term oscillations not fixed to a calendar.
Cycles are influenced by economic conditions, business cycles,
etc.

Difference from seasonality:
Seasonality is fixed and periodic; cycles are irregular and non-fixed.

4. Noise/Irregular (Et)

Random variations or residuals left after removing other components.
Unpredictable and caused by unexpected or rare events
(e.g., pandemic).


📊 Types of Time Series Models

  1. Additive Model:
    Assumes the components add together:
    Yt = Tt + St + Ct + Et

  2. Multiplicative Model:
    Assumes the components multiply together:
    Yt = Tt × St × Ct × Et

Use additive if the seasonal fluctuations remain constant in magnitude.
Use multiplicative if fluctuations increase with the level of the series.
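
To make the additive model concrete, here is a minimal Python sketch (an assumption: pandas and statsmodels are available; the monthly series is made up for illustration) that splits a series into its trend, seasonal, and residual parts:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# hypothetical monthly series: upward trend + yearly seasonality + noise
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
y = pd.Series(10 + 0.2 * np.arange(36)                      # trend (Tt)
              + 2 * np.sin(2 * np.pi * np.arange(36) / 12)  # seasonality (St)
              + np.random.normal(0, 0.3, 36),               # noise (Et)
              index=idx)

result = seasonal_decompose(y, model="additive", period=12)
print(result.trend.dropna().head())    # estimated Tt
print(result.seasonal.head(12))        # estimated St (repeats every 12 months)
print(result.resid.dropna().head())    # estimated Et
```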


🧠 Other Key Concepts

✅ Stationarity

A stationary time series has constant statistical properties over time (mean, variance, autocorrelation).
Stationarity is often required for many forecasting models (like ARIMA).

✅ Lag

How many steps back in time you're comparing data.
Lag helps in autocorrelation and feature engineering.

✅ Autocorrelation

How related current values are with past values in the series.
Helpful for modeling dependencies over time.


🧩 Real-World Applications


Periodicity Analysis

🚩 What Is Periodicity?

Periodicity is when a time series repeats a pattern at regular intervals. These intervals are called the period. Common examples:

If trends show an upward or downward movement, periodicity shows cyclic up-and-down movements at fixed durations.


🧠 Intuition

Imagine you're tracking the number of coffee sales at a cafe every hour. You’ll likely see spikes in the morning and late afternoon — repeating every day. That’s a daily periodicity.


🔍 Identifying Periodicity

There are three key ways to detect periodicity:

  1. Visual Inspection: Plot the time series. Look for repeating patterns.
  2. Autocorrelation Function (ACF):
    • Measures how similar the series is with itself at different lags.
    • High ACF at a lag = possible periodicity at that interval.
  3. Fourier Transform:
    • Converts time series from time-domain to frequency-domain.
    • Helps us spot dominant frequencies (or periods).
    • Output: Frequencies with high amplitudes indicate repeating cycles.

Lag in Time Series

A lag just means "how far back in time you look".

If you have a time series like this (let’s say it's daily temperature):

Day Temp
1 21°C
2 23°C
3 22°C
4 24°C
5 23°C
6 25°C
7 26°C

Then:


📊 Lag in Action

Let’s create Lag 1 version of this data:

Day Temp Lag-1 Temp
1 21°C
2 23°C 21°C
3 22°C 23°C
4 24°C 22°C
5 23°C 24°C
6 25°C 23°C
7 26°C 25°C

What Exactly Did We Do in Lag 1?

Let’s say you’re on Day 4, and you want to know:

What was the temperature yesterday (i.e., Day 3)?

That’s what Lag 1 does — it shifts the original data downward by 1 step so that for each day, you can compare today's value to the one from 1 day before.

So Lag-k would shift the data downwards by k steps

So here’s what it looks like in action:

Original Data:

Day Temp
1 21
2 23
3 22
4 24
5 23
6 25
7 26

Create Lag-1 (Shift all Temps down by 1 row):

Day Temp Lag-1 Temp
1 21
2 23 21
3 22 23
4 24 22
5 23 24
6 25 23
7 26 25

Why the blank? Day 1 has no previous day to look back to, so its Lag-1 value simply doesn't exist and is left blank.


🧠 What are we doing with this?

We're preparing to measure:

"How similar is today’s temp to yesterday’s temp?"

If there’s a strong correlation between Temp and Lag-1 Temp, that means the series has memory — it depends on the past.

This concept becomes core when we do:


So, technically this is how a Lag-2 dataset would look like if we were comparing today's temp with that of 2 days ago:

Day Temp Lag-2 Temp
1 21°C
2 23°C
3 22°C 21°C
4 24°C 23°C
5 23°C 22°C
6 25°C 24°C
7 26°C 23°C

Key rule: the number of blanks at the top grows as N increases in a Lag-N dataset.
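
If you want to build these lag columns programmatically, a small pandas sketch (hypothetical column names) does exactly the "shift downwards by k steps" described above:

```python
import pandas as pd

df = pd.DataFrame({"Day": range(1, 8),
                   "Temp": [21, 23, 22, 24, 23, 25, 26]})
df["Lag-1 Temp"] = df["Temp"].shift(1)   # look back 1 day
df["Lag-2 Temp"] = df["Temp"].shift(2)   # look back 2 days
print(df)   # the first k rows of a Lag-k column are NaN -- the "blanks" above
```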


🧠 What's going on?


⬇️ More Progressive Lag = More Missing Entries

Yep, and here's a key point:

Because you can’t look back beyond what data exists — so the higher the lag, the more initial data you lose from the top.


📘 Examples of What Lags Mean


⚠️ Important Note:

You lose data points with each lag.
If your dataset has 100 entries, then for Lag 1 you get 99 usable pairs, for Lag 2 you get 98, and in general for Lag k you get 100 − k.


Autocorrelation Function (ACF)

Autocorrelation (ACF) tells you how related a time series is with a lagged version of itself.

Imagine sliding the entire time series over itself by some number of time steps (called a lag) and checking how similar the series is to the original.
It’s like asking:

"Is today’s value similar to the value from 1 day ago? 2 days ago? 7 days ago?"

Formula:

Given a time series :

x1, x2, ....., xn

The autocorrelation at lag k is :

$$\text{ACF}(k) = r_k = \frac{\sum_{i=1}^{n-k} (X_i - \bar{X})(X_{i+k} - \bar{X})}{\sqrt{\sum_{i=1}^{n-k} (X_i - \bar{X})^2 \;\cdot\; \sum_{i=1}^{n-k} (X_{i+k} - \bar{X})^2}}$$

Scary looking formula, I know. But things will clear up with a few examples.

🧩 What Do the Variables Mean?

Symbol Meaning
Xi Value at time step i (current value in the original time series)
Xi+k Value at time step i+k (value after lag k)
X¯ Mean of the entire time series (or just the part used in the sum)
rk Autocorrelation at lag k
n Total number of observations in the time series

Note, to simplify the formula we sometimes re-write Xi+k as Yi


Example 1

To fully clear this out, let's work on an example step by step, in detail :

🌡️ Time Series Data (Temperature)

Day Temp (X)
1 21
2 23
3 22
4 24
5 23
6 25
7 26

Let's say we want to find out the ACF at lag 1 (r1).

🔁 Step 1: Create lagged pairs for lag 1

We shift the temperature values by one day to get Lag-1 Temps:

Note that here for simplicity purposes we set Xi+k = Yi

Day Temp (Xi) Lag-1 Temp (Yi or Xi+k )
2 23 21
3 22 23
4 24 22
5 23 24
6 25 23
7 26 25

Note that the table above would be the same as :

Day Temp (Xi) Lag-1 Temp (Yi or Xi+k)
1 21 __
2 23 21
3 22 23
4 24 22
5 23 24
6 25 23
7 26 25

However in the previous table we just skipped the first row and started from Day 2.

So we now have 6 pairs:


🧮 Step 2: Calculate the mean (X¯ and Y¯)

This is a one time calculation

$$\bar{X} = \frac{23 + 22 + 24 + 23 + 25 + 26}{6} \approx 23.83 \qquad \bar{Y} = \frac{21 + 23 + 22 + 24 + 23 + 25}{6} = 23.0$$

Step 3: Calculate each part of the formula

Numerator:

$$\sum_{i=1}^{n-k} (X_i - \bar{X})(Y_i - \bar{Y})$$

Now hold on a minute.

Why was X¯ replaced with Y¯ ? Why feel the need to do that?

Without confusing you too much: when we renamed X_{i+k} as Y_i, we started treating the lagged values as their own series Y, so that series gets its own mean Ȳ. The term

$$(X_{i+k} - \bar{X})$$

in the numerator, and

$$\sum_{i=1}^{n-k} (X_{i+k} - \bar{X})^2$$

in the denominator, are therefore replaced with the corresponding Y versions. In effect we are just computing the ordinary Pearson correlation between the series and its lagged copy.

Otherwise if you need more detailed explanation you can ask any AI but the answer you will get, will most certainly fry your brain (trust me, I tried it, didn't understand).

So let's stick to just knowing the formulae here:

Denominator:

$$\sqrt{\sum(X_i - \bar{X})^2 \;\cdot\; \sum(Y_i - \bar{Y})^2}$$

Let's calculate everything in a table.

Remember we have :

$$\bar{X} \approx 23.83 \qquad \bar{Y} = 23.0$$

And the table as:

i  Xi  Yi  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²
1 23 21
2 22 23
3 24 22
4 23 24
5 25 23
6 26 25

So now we start filling the values :

So for the (Xi − X̄) column, each cell is that row's Xi minus 23.83.

And for the (Yi − Ȳ) column, each cell is that row's Yi minus 23.0.

i  Xi  Yi  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²
1 23 21 -0.83 -2.0
2 22 23 -1.83 0.0
3 24 22 0.17 -1.0
4 23 24 -0.83 1.0
5 25 23 1.17 0.0
6 26 25 2.17 2.0

Now the remaining columns are easier.

(XiX¯)(YiY¯) is just the product of the values of each cell of the two columns we just calculated.

(XiX¯)2 is just the square of the values of the first column we calculated.

(YiY¯)2 is just the square of the values of the second column we calculated.

So, the table is now:

i  Xi  Yi  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²
1 23 21 -0.83 -2.0 1.66 0.6889 4.0
2 22 23 -1.83 0.0 0.0 3.3489 0.0
3 24 22 0.17 -1.0 -0.17 0.0289 1.0
4 23 24 -0.83 1.0 -0.83 0.6889 1.0
5 25 23 1.17 0.0 0.0 1.3689 0.0
6 26 25 2.17 2.0 4.34 4.7089 4.0

Now for the numerator:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y})$$

It's just the sum of the values in the (Xi − X̄)(Yi − Ȳ) column.

So:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 1.66 + 0.0 - 0.17 - 0.83 + 0.0 + 4.34 \approx 5.0$$

And the denominator:

$$\sqrt{\sum(X_i - \bar{X})^2 \;\cdot\; \sum(Y_i - \bar{Y})^2}$$

It's just the square root of the product of the sums of the values in the (Xi − X̄)² and (Yi − Ȳ)² columns respectively.

So:

$$\sum(X_i - \bar{X})^2 \approx 10.8334$$

$$\sum(Y_i - \bar{Y})^2 = 10$$

So the denominator is:

$$\sqrt{10.8334 \times 10} \approx 10.41$$

✅ Step 4 : Final ACF at lag 1:

$$r_1 = \frac{5.0}{10.41} \approx 0.48$$

🧠 Intuition

An ACF of about 0.48 at lag 1 is a moderate positive correlation: today's temperature is somewhat similar to yesterday's, but not strongly so.
Let's try for lag 2.

So our lag 2 dataset will look like this:

Day Temp (Xi) Lag-2 Temp (Yi or Xi−2)
3 22 21
4 24 23
5 23 22
6 25 24
7 26 23

Now we have 5 pairs

Let's find the mean

$$\bar{X} = \frac{22 + 24 + 23 + 25 + 26}{5} = 24 \qquad \bar{Y} = \frac{21 + 23 + 22 + 24 + 23}{5} = 22.6$$

Now let's calculate everything in the formula :

i  Xi  Yi  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²
1 22 21 -2 -1.6 3.2 4 2.56
2 24 23 0 0.4 0 0 0.16
3 23 22 -1 -0.6 0.6 1 0.36
4 25 24 1 1.4 1.4 1 1.96
5 26 23 2 0.4 0.8 4 0.16

Now for the numerator:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 3.2 + 0 + 0.6 + 1.4 + 0.8 = 6.0$$

And for the denominator :

$$\sqrt{\sum(X_i - \bar{X})^2 \;\cdot\; \sum(Y_i - \bar{Y})^2} = \sqrt{10 \times 5.2} = \sqrt{52} \approx 7.211$$

So, $$r_2 \ = \ \frac{6.0}{7.211} \ = \ 0.832 $$


✅ Intuition:

That’s a strong positive correlation at lag 2 — the temperature values from 2 days ago are quite predictive of today’s temperature. Makes sense for slow-changing weather.
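
Here is a short Python sketch that reproduces the two hand calculations above. Note it follows this document's formula (a plain Pearson correlation between the series and its lagged copy, each with its own mean); library functions such as statsmodels' `acf` use the full-series mean and variance, so their numbers can differ slightly.

```python
import numpy as np

def acf_at_lag(x, k):
    """ACF at lag k, computed exactly like the hand calculation above:
    Pearson correlation between the series and its k-step lagged copy."""
    x = np.asarray(x, dtype=float)
    xi, yi = x[k:], x[:-k]                  # today's values vs. values k steps back
    xd, yd = xi - xi.mean(), yi - yi.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

temps = [21, 23, 22, 24, 23, 25, 26]
print(round(acf_at_lag(temps, 1), 3))   # ~0.48
print(round(acf_at_lag(temps, 2), 3))   # ~0.832
```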

How are we interpreting this?

🌡️ What Autocorrelation Values Mean

Autocorrelation (at lag k) measures how similar the time series is to a shifted version of itself by k steps.

Correlation Value Interpretation
0.9 to 1.0 Strong correlation (very similar)
0.7 to 0.9 Moderate to strong
0.4 to 0.7 Moderate
0.2 to 0.4 Weak
0.0 to 0.2 Very weak or no correlation
< 0 Inverse correlation

Fourier Transform

At its heart, Fourier Transform (FT) answers this question:

"If this signal is made of waves, what waves is it made of?"

It takes a time-based signal (like temperature across days) and tells you:


🎛️ Analogy: Music and a Piano

Imagine a song is playing. You don’t just hear random noise — you hear notes.
Each note is a pure frequency (like 440 Hz for A4).

What the Fourier Transform does is like:


Given a discrete signal (e.g., daily temperatures):

$$x = [\,x_0,\ x_1,\ \ldots,\ x_{N-1}\,]$$

The Discrete Fourier Transform (DFT) is defined as

$$X_k = \sum_{n=0}^{N-1} x_n \cdot e^{2\pi i\, k n / N}$$

(Many texts write the exponent with a minus sign, $e^{-2\pi i k n / N}$. For periodicity analysis we only read the magnitudes $|X_k|$, which are the same either way; here we keep the plus sign so that $\omega^1 = i$, matching the unit-circle table below.)

where:

  • xₙ = the n-th data point (e.g., the temperature on day n)
  • N = the total number of data points
  • k = the frequency index (how many complete cycles fit into the N points)
  • Xₖ = a complex number whose magnitude tells you how strong that k-cycle pattern is
  • i = the imaginary unit

Don't worry if that looks scary. We'll break it down visually and practically.

So, a few points before we proceed :

1. What does k do?

In the formula $$X_k = \sum_{n=0}^{N-1} x_n \cdot e^{2\pi i\, k n / N}$$ the index k picks which "rhythm" (how many cycles across the N points) we are testing the data against.

Quick recap over what's frequency, oscillations, cycles :

Pasted image 20250424184833.png

The frequency of a signal or wave is the number of complete cycles or oscillations that occur within a given time period. It is measured in Hertz (Hz), where 1 Hz represents one cycle per second. Think of it as how many times a wave crest passes a point in a second.

For example, with our 4-point signal:

Day (n) Temp (xn)
0 21
1 23
2 22
3 24

Now, let's see what frequency means when we are performing DFT in periodicity analysis

In DFT, we’re not talking about physical Hz (like "times per second") unless you're sampling over time. Instead, the frequency index k corresponds to how many complete cycles (oscillations) fit into your data length N.

So k can take the values 0, 1, 2, …, N − 1.

For N = 4, that means k = 0, 1, 2, 3.

So k is basically the number of oscillations of a complex wave over the total N data points.

2. Why do we write $e^{2\pi i / N}$ as ω? (pronounced "omega")

This is a shorthand used in Fourier theory.

It's done so that the formula becomes simple:

$$X_k = \sum_{n=0}^{N-1} x_n \cdot \omega^{kn}$$

Instead of re-computing the exponential every time, we reuse ω and just raise it to kn powers.

It’s like a complex unit circle step — taking omega, and rotating it kn times around the unit circle.

3. What are the powers of ω and why do they rotate?

Since we use ω instead of calculating an exponent and rotating it kn times, ω depends on a complex unit circle

This is because:

$$\omega^n = \cos\!\left(\frac{2\pi n}{N}\right) + i\,\sin\!\left(\frac{2\pi n}{N}\right)$$

For N = 4 and n = 1, this gives ω¹ = cos(π/2) + i·sin(π/2) = i.

These are called the roots of unity — they loop around the circle and repeat every N steps.

Pasted image 20250424185907.png

So for N = 4, the powers are:

Power Value On Complex Plane
ω⁰ 1 Right (real axis)
ω¹ i Top (imaginary axis)
ω² −1 Left
ω³ −i Bottom
ω⁴ 1 again! Full circle

And so on, for increasing N values.


🔍 What You'll See

Fourier Transform gives us a spectrum:

High peaks? → Strong repeating patterns at that frequency.
Flat? → No strong repeating patterns.


Example

🔢 Sample Time Series

Let’s take 4 data points from a temperature time series:

Day (n) Temp (xn)
0 21
1 23
2 22
3 24

Let x = [ 21,23,22,24 ], so N = 4.

Now we define the shared variables

Power Value On Complex Plane
ω⁰ 1 Right (real axis)
ω¹ i Top (imaginary axis)
ω² −1 Left
ω³ −i Bottom
ω⁴ 1 again! Full circle
ω⁵ i Top (imaginary axis)
ω⁶ −1 Left
ω⁷ −i Bottom
ω⁸ 1 Full circle
ω⁹ i Top
ω¹⁰ −1 Left
ω¹¹ −i Bottom
ω¹² 1 Full circle

and k = 0, 1, 2, 3

So we start calculating Xk terms one by one

$$X_k = \sum_{n=0}^{3} x_n \cdot \omega^{kn}$$

Now we have k = 0, so

$$X_0 = \sum_n x_n \cdot \omega^{0} = x_0 \cdot 1 + x_1 \cdot 1 + x_2 \cdot 1 + x_3 \cdot 1 = 21 + 23 + 22 + 24 = 90$$

Now for k = 1

$$X_1 = \sum_n x_n \cdot \omega^{n}$$

So,

$$X_1 = x_0\,\omega^0 + x_1\,\omega^1 + x_2\,\omega^2 + x_3\,\omega^3 = 21 + 23(i) + 22(-1) + 24(-i) = 21 + 23i - 22 - 24i = -1 - i$$

Similarly for k = 2

$$X_2 = \sum_n x_n \cdot \omega^{2n} = 21\,\omega^0 + 23\,\omega^2 + 22\,\omega^4 + 24\,\omega^6 = 21 - 23 + 22 - 24 = -4$$

Similarly for k = 3

$$X_3 = \sum_n x_n \cdot \omega^{3n} = 21\,\omega^0 + 23\,\omega^3 + 22\,\omega^6 + 24\,\omega^9 = 21 - 23i - 22 + 24i = -1 + i$$
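
You can check the hand-worked values with NumPy's FFT. One caveat (a note on conventions): `np.fft.fft` uses the $e^{-2\pi i k n / N}$ sign convention, so X₁ and X₃ come out as the complex conjugates of the values above; the magnitudes, which are what we read for periodicity, are identical.

```python
import numpy as np

x = np.array([21, 23, 22, 24], dtype=float)
X = np.fft.fft(x)       # NumPy's sign convention conjugates X1 and X3
print(X)                # [90.+0.j  -1.+1.j  -4.+0.j  -1.-1.j]
print(np.abs(X))        # magnitudes: [90.  1.414  4.  1.414]
```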

Now how on earth would one interpret this data?

🧠 First: What are you even trying to do with Fourier?

You're saying:

“I have some time-based data, like temperature each day. I want to know: Does it seem to repeat every few days?

Like maybe it tends to warm up every 3rd day? Or every 7 days?


💡 What Fourier Transform actually does

Think of Fourier Transform as a smart assistant that asks your data:

“If I imagine the temperature was caused by some regular up-and-down patterns... which patterns could explain it?”

Then it tries many different rhythms (called “frequencies”) and checks:

Each Xₖ term tells you:


🧃 Let’s break it with a juice example (non-physics)

Imagine you're tasting a fruit juice that’s made from:

You take 6 sips.

Fourier Transform says:

You didn’t need to know anything about waves. You just know: if the numbers in X₂ and X₃ are big, then those patterns are strong.


🔢 Apply this to the temperature data

Let’s say you collect 7 days of temperature. You do a Fourier Transform, and the result shows:

X index Meaning Value
X₀ Average temp 21.3
X₁ Every 7 days 1.2
X₂ Every 3.5 days 3.8
X₃ Every ~2.3 days 0.4
X₄ Every ~1.75 days 0.1

From this, you see:


✅ Summary: How Fourier helps find repeating behavior

You don’t need to think in frequency or waves.

All you’re doing is:

🧍 “Hey data, do you have patterns that repeat every k days?”
📊 Fourier replies: “Yeah, a bit every 3 days, but not much every 2 or 7.”

That’s it. That’s what you read from the magnitudes of the X terms (ignoring imaginary parts for now).


🧠 Use-Cases


🧩 Tip

Periodicity is different from seasonality:


Trend Analysis

Now that we’ve looked at repeating patterns (periodicity) using tools like autocorrelation and Fourier, trend analysis helps us answer this:

🚶‍♂️ Is the data generally moving up, down, or staying flat over time?

Think of trend like the overall direction the data is heading — regardless of small fluctuations or cycles.


🧭 What Trend Actually Means

Imagine temperature data for 30 days. If it’s slowly increasing over time (like winter turning to spring), that’s an upward trend.

If it’s slowly decreasing (like summer ending), that’s a downward trend.

If it wobbles around a flat average (like no seasonal change), there’s no trend.


Generally, we have 4 traditional methods of measuring trends.

Pasted image 20250425121303.png

1. Free-Hand Method ✍️ (Explanation Only)

Since this is a visual method, imagine plotting the points and then drawing a smooth curve through them. It would show a gentle upward trend.

🧠 Used mostly for quick inspection. No exact math involved. No plotting needed unless you want to practice curve-drawing by hand.

Pasted image 20250425121549.png


2. Semi-Average Method

Let's walk through with a sample dataset

Day Temp (°C)
1 20
2 21
3 23
4 24
5 26
6 27
7 29
8 30

Step 1: Divide the data into 2 halves

Since we have 8 data points (even number), we divide them into two equal parts:

  • First half: Days 1 to 4 → 20, 21, 23, 24
  • Second half: Days 5 to 8 → 26, 27, 29, 30


Step 2: Calculate the average of each half

First half average:

$$\frac{20 + 21 + 23 + 24}{4} = \frac{88}{4} = 22$$

Second half average:

$$\frac{26 + 27 + 29 + 30}{4} = \frac{112}{4} = 28$$

Step 3: Assign the averages to the mid-points of their respective halves.

First half ranges from days 1 to 4.

So the mid-point of this half would be: (1 + 4) / 2 = 2.5

Second half ranges from days 5 to 8.

So the mid-point of this half would be: (5 + 8) / 2 = 6.5


Step 4: Equation of the trend line

We have this equation of a trend line:

y = a + bt
🧠 What do the variables mean?
Symbol Meaning
y Temperature (or the value we want to estimate)
t Time (usually in days or time units)
a Intercept of the line (value when time = 0)
b Slope of the line (how much temp or the value we are estimating, increases per unit of time)

What do you mean by an "intercept of a line"?

The intercept of a line, in this equation, to be specific, a y-intercept is basically:

📌 What is the intercept of a line, mathematically?

In a linear equation of the form:

y = a + bt
The intercept is:

The value of y when t = 0
That is:

a = y

when

 t = 0

It tells you where the line cuts the vertical (y) axis — the starting value before any changes due to the slope.


Why does this matter?

The intercept gives a reference point for your trend. It's like saying:

“If this upward/downward trend had always existed, what would the value be at time zero?”

You don’t always need to trust this value (especially when Day 0 isn’t in your dataset), but it’s crucial for defining a full line equation.


We now fit a straight line between these two points:
Point 1: (2.5, 22)
Point 2: (6.5, 28)

Use the slope formula:

$$b = \frac{y_2 - y_1}{x_2 - x_1} = \frac{28 - 22}{6.5 - 2.5} = \frac{6}{4} = 1.5$$

And plug the value of b into this equation :

y = a + bt

Use (2.5, 22) to find a .

$$22 = a + 1.5(2.5) \;\Rightarrow\; a = 18.25$$

✅ Final Trend Line Equation:

y = 18.25 + 1.5t

Now I believe you might have a few questions, such as:

🎯 Why do we plug in y = 22 and t = 2.5?

You calculated the first average as 22, and that average came from the first half, centered at day 2.5. So this gives us a real point on the trend line:

(2.5, 22)

We already know b = 1.5, so to find a, we just plug this into the formula:

$$22 = a + 1.5(2.5) \;\Rightarrow\; 22 = a + 3.75 \;\Rightarrow\; a = 18.25$$

So the final trend line equation becomes :

y = 18.25 + 1.5t

🤔 Why do we keep t in the final equation?

Because the trend line isn’t just for one point—it's a formula you can use to predict or visualize the trend over time. You plug in different t values (days) to see how the temperature is expected to behave according to the trend.

For example, at t = 1 the trend gives y = 18.25 + 1.5(1) = 19.75 °C, and at t = 8 it gives y = 18.25 + 1.5(8) = 30.25 °C.

So we see a steadily increasing trend over time, which says that the temperature is increasing as the days pass.

Here's a visualization for better understanding of the increasing trend.

Pasted image 20250425134432.png
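
A tiny Python sketch of the semi-average recipe above (split, average, take midpoints, fit the line through the two points):

```python
temps = [20, 21, 23, 24, 26, 27, 29, 30]            # days 1..8
n = len(temps)
first, second = temps[: n // 2], temps[n // 2 :]
avg1, avg2 = sum(first) / len(first), sum(second) / len(second)   # 22.0, 28.0
t1, t2 = (1 + n // 2) / 2, (n // 2 + 1 + n) / 2                   # 2.5, 6.5
b = (avg2 - avg1) / (t2 - t1)                                     # 1.5
a = avg1 - b * t1                                                 # 18.25
print(f"y = {a} + {b}t")
```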


3. Moving Average Method

This method smooths out short-term fluctuations to reveal longer-term trends in the dataset. It works by averaging a fixed number of consecutive data points and sliding the window forward.

In short: This method is used to detect long term trends in the dataset, but short-term trends or "local trends" are not so well depicted in this method.

Step 1 : Pick a window size (e.g., 3-point moving average)

Before we proceed however, here's some pre-requisites to know about this method:

💡 What is "Window Size"?

The window size (also called "period") is the number of data points used to calculate each average.

So when we say a "3-point moving average", we mean each average is computed from 3 consecutive data points.

It's called a "moving" average because you slide this "window" across the data one step at a time.

So let's say we are working with our temperature dataset again:

Day Temp
1 20
2 22
3 19
4 21
5 24

If we use a 3-point moving average, then:

  • First average = (20 + 22 + 19) / 3 = 20.33 (Days 1 to 3)
  • Second average = (22 + 19 + 21) / 3 = 20.67 (Days 2 to 4)
  • Third average = (19 + 21 + 24) / 3 = 21.33 (Days 3 to 5)

Now some of you might be confused, thinking: Why are we doing this? Why choose Day 1 to Day 3, then 2 to 4, then 3 to 5, and heck, what are we even going to do with these values??

🌊 Why slide the window like that?

When we use a 3-point moving average, the window shifts one step forward each time. That’s the key idea:

This overlapping window captures how the trend is changing over time, not just what it is in isolated chunks.

So it’s not that we’re jumping to random sets of days—it’s that we’re sliding the window one step at a time, like scanning a timeline gradually.

So the shifting can be demonstrated like this:


1 2 3
  2 3 4
    3 4 5
      4 5 ...

Notice how we are "sliding" a window across the data, the window in question spanning only 3 points, that is, we pick 3 points at a time.


🤔 Why pick "3" as the window size?

There’s no fixed rule, but here are some guidelines:

Window Size What It Does When to Use
Small (e.g., 3) Reacts quickly to recent changes Short-term trend, sensitive to noise
Medium (e.g., 5 or 7) Balances noise reduction & responsiveness General trend observation
Large (e.g., 10+) Smoothes out a lot of noise Long-term trend, may miss short fluctuations

📉 What do we do with these moving average values?

Each average you compute becomes a smoothed version of your original dataset.

You can:


Here's a quick comparison:
Day Range Average Temp
Day 1 to 3 20.33
Day 2 to 4 20.67
Day 3 to 5 21.33

If you plot those average temps at the middle day (e.g., Day 2, Day 3, Day 4), you start seeing a smoother line than the original, spiky temperature values.

Now, back to where we were:

Step 2: Assign the averages to a central point

Just like we did in semi-averages, we take the mid-point of the range of days behind each average, and assign each average temp to that central day of its range.

So,

Day Range Average Temp Central Point
Day 1 to 3 20.33 2 (Day 2)
Day 2 to 4 20.67 3 (Day 3)
Day 3 to 5 21.33 4 (Day 4)
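
In pandas, Steps 1 and 2 are one line: `rolling(window=3, center=True)` computes each 3-point average and assigns it to the central day, matching the table above (a minimal sketch):

```python
import pandas as pd

temps = pd.Series([20, 22, 19, 21, 24], index=range(1, 6), name="Temp")
ma3 = temps.rolling(window=3, center=True).mean()   # averages land on the middle day
print(ma3)
# Day 2 -> 20.33, Day 3 -> 20.67, Day 4 -> 21.33 (Days 1 and 5 are NaN)
```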

Step 3: Plot the data on a graph to observe the trend.

There can be two ways of plotting this data.

Option A: If you're just interested in seeing the trend visually, no need to fit a line — just plot the moving average points. This already smooths out short-term fluctuations and shows the trend clearly.

This means just get a graph paper and directly plot the data in the above table.

which would look like this:

Pasted image 20250425160754.png

This smoothed red line represents the general trend by reducing the "noise" from smaller fluctuations. It helps visualize whether the overall direction is increasing, decreasing, or stable.

Here's how a 4-point and 5-point moving average would look like

Pasted image 20250425160926.png

You can see how increasing the window size smooths the curve even more, revealing a clearer long-term trend but at the cost of local detail.

So one can question: What do we get by increasing the window size?

🧠 What You Get by Increasing Window Size:
  1. More Smoothing:

    • Larger windows reduce short-term fluctuations or "noise".
    • You get a cleaner view of the overall trend — like removing the tiny wiggles to see the mountain slope.
  2. Slower Responsiveness:

    • The trend reacts more slowly to new changes in data.
    • Sudden spikes or drops are dampened — which might be good or bad depending on your goal.
  3. Less Detail:

    • Local variations, cycles, or patterns get smoothed out and possibly lost.
    • If you're analyzing short-term patterns, a large window might hide them.

Option B: If you're curious about quantifying the trend (e.g., how fast temperature is increasing on average), then you can fit a linear model:

y = a + bt

Like how we did in semi-averages.

Where:


🔍 Moving Averages — Trade-off Summary:

Aspect Small Window Large Window
📈 Reacts to changes Quickly Slowly
🔊 Shows noise Yes (more visible) No (smoothed out)
🔍 Detects short trends Better Poorly
🌄 Detects long trends Less clearly Clearly

Weighted Moving Averages

In a weighted moving average (WMA), each data point in the window is multiplied by a weight before averaging, so some points count more than others (for example, weights 1, 2, 1 make the middle point count twice as much).

Key: the weighted sum is divided by the sum of the weights, not by the number of points.

🧠 3. Why divide by sum of weights and not just number of points?

👉 Because after weighting, the total "amount" you are summing is not evenly spread.
Some points have more pull and some less.

If you divided by just 3, the number of points, you would completely ignore the fact that the middle point was counted twice as heavily as the others!

Instead, by dividing by 1+2+1 = 4,
you normalize the weighted sum back into a "proper average."


🛠️ Simple Weighted Moving Average example:

Let's say we have this dataset

Pasted image 20250426123214.png

We are asked to find the trend using a 3-year WMA, of weights 1, 2, 1

So we have a window size of 3 points.

So let's say for the first 3 years :

2 (Year 1), 4 (Year 2), 5 (Year 3)

And weights are 1, 2, 1.

Steps:

  1. Multiply each value by its weight: 2×1 = 2, 4×2 = 8, 5×1 = 5
  2. Add them up: 2 + 8 + 5 = 15
  3. Divide by the sum of the weights: 15 / (1 + 2 + 1) = 15 / 4 = 3.75

✅ Notice: If you just divided by 3, the higher importance you gave to 4 would be lost. That's why we divide by 4, the total weights.
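
The same first window as a quick NumPy sketch, showing the divide-by-sum-of-weights step:

```python
import numpy as np

values  = np.array([2, 4, 5], dtype=float)    # first 3-year window from above
weights = np.array([1, 2, 1], dtype=float)
wma = (values * weights).sum() / weights.sum()   # divide by 4, not by 3
print(wma)   # 3.75
```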


Example 2

Let's try out a sample real-world example

Imagine you're a teacher.
A student has taken three tests this semester:

Test Score Importance (Weight)
Test 1 70 1 (not very important)
Test 2 80 2 (midterm, more important)
Test 3 90 1 (normal test again)

Now, how should you calculate the student's final average?

A simple average would treat all three tests equally, even though the midterm clearly deserves more weight. So, it appears we are at a conundrum (pronounced: "ko-none-drum") now.

Luckily, we have weighted moving averages to work with!


🛠️ Using Weighted Moving Average (WMA):

Step 1: Multiply each score by its weight: 70×1 = 70, 80×2 = 160, 90×1 = 90

Step 2: Add them up: 70 + 160 + 90 = 320

Step 3: Add the weights: 1 + 2 + 1 = 4

Step 4: Weighted average: 320 / 4 = 80

Final Weighted Average = 80

✅ In this case, luckily it came out the same as simple average.
But if the midterm score were lower or higher, the weighted average would reflect the "importance" properly!

For example, if the midterm was 60 instead of 80, the weighted average would be (70×1 + 60×2 + 90×1) / 4 = 280 / 4 = 70, while the simple average would be (70 + 60 + 90) / 3 ≈ 73.3.

👉 You see, now the low midterm score pulled the average down more than it would in a simple average!

Here's a more clear difference using a plot:

Pasted image 20250426131405.png

That's why the green line (WMA) is leaning slightly higher on the second test since it had more importance


Why do they look close but not identical?


👉 Summary:


🧠 Conclusion

Weighted Moving Average is super useful when:


4. Fitting Mathematical Curves

This method is about fitting a curve (not just a straight line) to the data when trends are non-linear — that is, when the data doesn’t increase or decrease at a constant rate.

🔸 When Do We Use This?

When your data follows a pattern that’s:


🔸 The Basic Idea

Instead of fitting a line like y = a + bt, we fit equations like:

We try to find the best-fitting coefficients (a, b, c, etc.) so the curve goes through or near the data points.


🛠️ How we usually do Curve Fitting:

  1. Choose the type of curve based on how your data looks.
  2. Use formulas (or regression) to calculate the parameters (like a, b, c).
  3. Plot the curve along with the original data points.
  4. Check the fit — does it match the trend well?

a. Linear Trend (Recap)

The equation for a linear trend is:

y = a + bt
🧠 What do the variables mean?
Symbol Meaning
y Temperature (or the value we want to estimate)
t Time (usually in days or time units)
a Intercept of the line (value when time = 0)
b Slope of the line (how much temp or the value we are estimating, increases per unit of time)

To understand what an intercept means just head back to What do you mean by an "intercept of a line"?


How to calculate a and b ?

We already did this once in the Semi-Average method where we had two points:

We now fit a straight line between these two points:
Point 1: (2.5, 22)
Point 2: (6.5, 28)

Use the slope formula:

$$b = \frac{y_2 - y_1}{x_2 - x_1} = \frac{28 - 22}{6.5 - 2.5} = \frac{6}{4} = 1.5$$

And plug the value of b into this equation :

y = a + bt

Use (2.5, 22) to find a .

$$22 = a + 1.5(2.5) \;\Rightarrow\; a = 18.25 \;\Rightarrow\; y = 18.25 + 1.5t$$

However, for the curious ones, there is another, more general way to find a and b

Using these formulae:

$$b = \frac{n\sum(ty) - \sum t \,\sum y}{n\sum t^2 - (\sum t)^2} \qquad a = \frac{\sum y - b\sum t}{n}$$

where n = number of data points.

This method is called the least squares regression method.

Now the question one can ask is:

Which method to choose?

Well the previous method works better for semi-average only when you have two points to deal with.

In the case when you are working with a more broad dataset and want to plot a linear trend to see how it looks, you will need to use the least squares regression method.

So, a few more points about this method.

While calculating b:

$$b = \frac{n\sum(ty) - \sum t \,\sum y}{n\sum t^2 - (\sum t)^2}$$

Where:

  • n = the number of data points
  • t = the time values (1, 2, 3, …)
  • y = the observed values at each time
  • Σty = the sum of each t multiplied by its y

What is y here?

Let's say we have this dataset again:

Day Temp (°C)
1 20
2 21
3 23
4 24
5 26
6 27
7 29
8 30

y is just each temperature value, corresponding to each t value on the dataset.

And for a:

a = Σ y  b Σ tn

It's the sum of all y values minus b times the sum of all t values, all divided by n, the total number of table entries.
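
A short sketch applying these two formulas to the 8-day dataset above. Notice the result (roughly y = 18.36 + 1.48t) is close to, but not the same as, the semi-average line y = 18.25 + 1.5t, which previews the question below:

```python
t = list(range(1, 9))
y = [20, 21, 23, 24, 26, 27, 29, 30]
n = len(t)
sum_t, sum_y = sum(t), sum(y)
sum_ty = sum(ti * yi for ti, yi in zip(t, y))
sum_t2 = sum(ti ** 2 for ti in t)

b = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)   # ~1.476
a = (sum_y - b * sum_t) / n                                    # ~18.36
print(f"y = {a:.2f} + {b:.2f}t")
```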

Now, one might have this question:

Are the a and b values calculated using the method for semi-average plotting, the same as the a and b values calculated using the least squares regression method?

And the answer is: no, they are not the same.

🔵 1. Slope between two points (basic method you mentioned):

$$b = \frac{y_2 - y_1}{t_2 - t_1}$$

And then you can substitute back into y = a + bt to find a.

👉 But this method uses only two points to define the line.

🔵 2. Slope via Least Squares (the one we are doing now):

  • Here we are fitting the best line through ALL the points.
  • We are minimizing the total squared error (the vertical distances from the line).
  • This method balances all the points, not just two!
  • That's why the formulas look more "global," involving sums like Σt, Σy, Σty, Σt².

🔵 Therefore:

  • YES, the a and b values will usually be different between the two methods.
  • The two-points method might only match the least-squares method if all points happen to lie exactly on one perfect straight line (very rare in real-world data).
  • Otherwise, least squares gives a much better, more fair trend line for multiple data points.

For a quick comparison, let's see a visual plot using both methods.

![Pasted image 20250426140839.png](/img/user/media/Pasted%20image%2020250426140839.png)

Now what does this image imply? See how the red line is a bit off at the start and the end?

Red Solid Line = The Least Squares Line (Linear Regression):

  • It uses all the data points.
  • It minimizes the overall error (the total of all the vertical distances between the points and the line).
  • It's more reliable for making predictions across the whole range.

🔹 Blue Dashed Line = The Two-Point Line:

  • It only uses the first and last points (Day 1 and Day 8) to create a straight line.
  • It's simple and quick, but less accurate because it ignores the middle points.

What this plot implies:

  • Both lines look similar because the data is fairly linear and clean.
  • But the red line (Least Squares) is slightly better because it fits all points and doesn't just pass through the first and last points.
  • If your data was noisy, the two-point line would be way off, while Least Squares would still balance out the fluctuations nicely.

b. Quadratic Trend

The Quadratic Trend Model is used when the data shows curvature, meaning the trend is not a straight line but curves upward or downward like a parabola.

The general form of the quadratic trend equation is:

$$y = a + bt + ct^2$$

where a is the intercept, b is the linear (slope) component, and c controls the curvature: c > 0 curves upward, c < 0 curves downward.


How does one fit a Quadratic Trend?

We generally use the Least squares regression method to find the best values of a, b and c, such that the sum of squared differences between actual and predicted y is minimized.

You set up and solve the following system of normal equations:

$$\sum y = na + b\sum t + c\sum t^2$$
$$\sum ty = a\sum t + b\sum t^2 + c\sum t^3$$
$$\sum t^2 y = a\sum t^2 + b\sum t^3 + c\sum t^4$$

where n is the number of data points and all sums run over the observed (t, y) pairs.

Firstly, compute all the sum values, and construct the three equations.

Solve for a, b and c using linear algebra techniques (substitution, matrices, etc.)
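
In practice you rarely solve the normal equations by hand; NumPy's `polyfit` minimizes the same sum of squared errors. A minimal sketch with a small made-up series (not the example that follows):

```python
import numpy as np

t = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 7.2, 11.8, 18.1])       # hypothetical values with a visible curve

c2, c1, c0 = np.polyfit(t, y, deg=2)             # coefficients, highest power first
print(f"y = {c0:.2f} + {c1:.2f}t + {c2:.2f}t^2")
```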


Intuition Check

This type of trend is helpful in:


Example 1

Suppose, we have:

t (Time) y (Value)
1 4
2 7
3 12
4 19
5 28

So,

$$n = 5, \qquad \sum y = 62, \qquad \sum ty = (1\times4) + (2\times7) + (3\times12) + (4\times19) + (5\times28) = 270$$
$$\sum t^2 y = (1\times4) + (4\times7) + (9\times12) + (16\times19) + (25\times28) = 1144$$
$$\sum t = 15, \qquad \sum t^2 = 55, \qquad \sum t^3 = 225, \qquad \sum t^4 = 979$$

Now we reconstruct the equations

First equation is just the same as the original one :

$$y = a + bt + ct^2 \;\longrightarrow\; \sum y = na + b\sum t + c\sum t^2$$

So we have equation 1 as:

62 = 5a + 15b + 55c

Second equation is just a progression of equation 1.

$$\sum ty = a\sum t + b\sum t^2 + c\sum t^3$$

So we have equation 2 as:

270 = 15a + 55b + 225c

Third equation is just a progression of equation 2.

$$\sum t^2 y = a\sum t^2 + b\sum t^3 + c\sum t^4$$

So we have equation 3 as:

1144 = 55a + 225b + 979c

So we have the system of equations as:

$$5a + 15b + 55c = 62$$
$$15a + 55b + 225c = 270$$
$$55a + 225b + 979c = 1144$$

Now this can be a good exercise for us to solve this system using matrices, the matrix inversion method to be specific

So let's write this system of equations as matrices:

$$\begin{bmatrix} 5 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix}$$

as the matrix A

and

$$\begin{bmatrix} 62 \\ 270 \\ 1144 \end{bmatrix}$$

as the matrix b

and

$$\begin{bmatrix} a \\ b \\ c \end{bmatrix}$$

as the matrix x

We need to find :

$$x = A^{-1} \cdot b$$

Firstly we find the matrix of minors:

We have our original matrix as:

$$\begin{bmatrix} 5 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix}$$

Firstly we compute the determinant of this matrix:

$$\det(A) = (5\times3220) - (15\times2310) + (55\times350) = 700$$

So our matrix of minors will be:

$$\begin{bmatrix} \begin{vmatrix}55&225\\225&979\end{vmatrix} & \begin{vmatrix}15&225\\55&979\end{vmatrix} & \begin{vmatrix}15&55\\55&225\end{vmatrix} \\[2mm] \begin{vmatrix}15&55\\225&979\end{vmatrix} & \begin{vmatrix}5&55\\55&979\end{vmatrix} & \begin{vmatrix}5&15\\55&225\end{vmatrix} \\[2mm] \begin{vmatrix}15&55\\55&225\end{vmatrix} & \begin{vmatrix}5&55\\15&979\end{vmatrix} & \begin{vmatrix}5&15\\15&55\end{vmatrix} \end{bmatrix}$$

(That was quite painful to type and render, but atleast faster than drawing with a mouse).

Now we solve the individual determinants :

$$\begin{bmatrix} 3220 & 2310 & 350 \\ 2310 & 1870 & 300 \\ 350 & 4070 & 50 \end{bmatrix}$$

So this is the matrix of minors.

Now we apply the sign scheme to get the co-factor matrix:

$$\begin{bmatrix} 3220 & -2310 & 350 \\ -2310 & 1870 & -300 \\ 350 & -4070 & 50 \end{bmatrix}$$

Now we find the adjugate matrix by transposing the co-factor matrix :

$$\begin{bmatrix} 3220 & -2310 & 350 \\ -2310 & 1870 & -4070 \\ 350 & -300 & 50 \end{bmatrix}$$

Now we compute the inverse of the matrix :

$$A^{-1} = \frac{1}{700} \times \begin{bmatrix} 3220 & -2310 & 350 \\ -2310 & 1870 & -4070 \\ 350 & -300 & 50 \end{bmatrix} \approx \begin{bmatrix} 4.6 & -3.3 & 0.5 \\ -3.3 & 2.6 & -5.8 \\ 0.5 & -0.4 & 0.07 \end{bmatrix}$$

(There might be some rounding errors here but I purposely did that to keep the numbers small)

Now using the formula:

$$x = A^{-1}b \;\Rightarrow\; \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 4.6 & -3.3 & 0.5 \\ -3.3 & 2.6 & -5.8 \\ 0.5 & -0.4 & 0.07 \end{bmatrix} \times \begin{bmatrix} 62 \\ 270 \\ 1144 \end{bmatrix}$$

To get each component of x, multiply the corresponding row of A⁻¹ element-wise with b and add:

$$a = (62\times4.6) - (270\times3.3) + (1144\times0.5) = -33.8$$
$$b = -(62\times3.3) + (270\times2.6) - (1144\times5.8) = -6137.8$$
$$c = (62\times0.5) - (270\times0.4) + (1144\times0.07) = 3.08$$

(corrected, was previously c = 723.8, since I multiplied 1144× 0.7, lol)
(This mistake however, will prove useful below, as you will see)

So we have the solution to the system of equations as:

a = −33.8, b = −6137.8, c = 3.08

So, the quadratic equation for this dataset:

y = a + bt + ct2

will be:

$$y = -33.8 - 6137.8t + 3.08t^2$$

Since we have our c>0 , we should expect an upward "U" shape in the visualization.


Here's a visualization of the quadratic trend for c = 723.8 my previously mistaken calculated c value.

Pasted image 20250426165413.png

As you can see, the quadratic trend follows the general upward curve of the data, capturing the parabolic nature of the data points.

I had already generated this plot earlier thinking this one was the correct, however:


This is the plot for the correct c value of 3.08

Pasted image 20250426170939.png

Notice how it just looks like a straight line now instead of a curve? Even though our c >0 ?

This is because of the strength of the c value.

Now, with the correct equation:

This explains why the graph looks like a straight line:

🔵 The equation is technically quadratic,
🔵 But practically, the parabola is so "flat" that the t2 term is negligible,
🔵 So it behaves almost exactly like a straight line with slope −6137.8.

Now previously with c = 723.8, even small values of t made ct2 become big enough to show a curve.

But now, with c=3.08, it takes absolutely massive values of t for 3.08t2 to matter compared to 6137.8t.

That's why visually this one looks straight, while the old one looked curved.

🧠 Little intuition tip:

For this section, linear and quadratic trend should do it, otherwise we will need another brain to carry this much information.


Similarity Search

What is Similarity Search in Data Mining?

Given a time series (or a piece of it, called a query),
find the part(s) of another time series (or within the same series) that look similar to it.


✏️ Two types:

  1. Whole matching
    → Compare entire time series vs. entire time series.

  2. Subsequence matching
    → Compare a small query vs. sliding windows inside a large time series.
    (Like searching for a pattern inside a huge signal.)


✏️ How is similarity measured?

Main methods:

  1. Euclidean distance
  2. Dynamic Time Warping (DTW)
  3. Cosine similarity


1. Euclidean Similarity Search

The core idea is extremely simple:
Take two sequences (say, arrays of numbers),
and compute the straight-line distance between them.

Steps:

  1. Given: a query sequence Q and a (usually much longer) time series T.

  2. What we do: slide a window S of length |Q| across T, one step at a time, and compute the distance between Q and each window.

  3. Euclidean Distance Formula:

$$d(Q, S) = \sqrt{\sum_{i=1}^{|Q|} (q_i - s_i)^2}$$

where S is the current window in T.

  4. Result:

    • A list of distances
    • The smallest distance is the best matching sequence.

Example

Suppose:

  • Query Q = [1, 2, 3]
  • Target series T = [0, 1, 2, 3, 4, 5]

Now, sliding windows of size 3 in T would be:

Subsequence (S) Indices Values
1 0–2 [0, 1, 2]
2 1–3 [1, 2, 3]
3 2–4 [2, 3, 4]
4 3–5 [3, 4, 5]

Just like how we had sliding windows in moving averages method :

0 1 2
  1 2 3 
    2 3 4
      3 4 5

Now, we compute the distances between our Query and our target value in each sliding window

$$\sqrt{(1-0)^2 + (2-1)^2 + (3-2)^2} = \sqrt{3} \approx 1.732 \qquad \sqrt{(1-1)^2 + (2-2)^2 + (3-3)^2} = 0$$

Note:

Since Euclidean distance is always non-negative (it can never be less than 0), you can absolutely terminate early if you ever find a distance of 0 while searching.

So we can already stop the process now and declare that we have found an exact match of our query? Yes, sure,

However:

BUT — if you are working on clean synthetic data (like in toy examples or simulations), this early-stopping optimization can save a lot of time!

So for the sake of practice however, we will still continue to calculate the remaining distance values.

$$\sqrt{(1-2)^2 + (2-3)^2 + (3-4)^2} = \sqrt{1 + 1 + 1} = \sqrt{3} \approx 1.732 \qquad \sqrt{(1-3)^2 + (2-4)^2 + (3-5)^2} = \sqrt{4 + 4 + 4} = \sqrt{12} = 2\sqrt{3} \approx 3.464$$

So now we have an array of distances as :

[ 1.732, 0, 1.732, 3.464 ]

A min() of this array would result in 0

So we have found the best match as window 2!

Which is exactly the same as our query, [ 1, 2, 3 ].
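
The whole sliding-window search above fits in a few lines of Python (a sketch using NumPy):

```python
import numpy as np

T = np.array([0, 1, 2, 3, 4, 5], dtype=float)   # target series
Q = np.array([1, 2, 3], dtype=float)            # query

dists = [np.linalg.norm(Q - T[i:i + len(Q)])    # Euclidean distance per window
         for i in range(len(T) - len(Q) + 1)]
best = int(np.argmin(dists))
print(dists)    # [1.732..., 0.0, 1.732..., 3.464...]
print(best)     # 1 -> the window starting at index 1, i.e. [1, 2, 3]
```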


Conclusion


2. Dynamic Time Warping (DTW)

While Euclidean distance compares two sequences point by point,
DTW allows for flexible alignment — it can stretch and compress parts of the time axis to match sequences better.

Think of DTW as "warping time" so that similar shapes match, even if they happen at slightly different speeds.

Pasted image 20250427125619.png

So this diagram is depicting the core idea of Dynamic Time Warping (DTW)
specifically, aligning two time series (or sequences) that might be "out of sync" in time, but are otherwise similar.

In short:
🔵 Goal: Find the optimal path through this grid that minimizes the total distance between the sequences, even allowing for stretching and compressing in time.


Example

Let's say we have two time series:

  • Time Series A = (1, 2, 3)
  • Time Series B = (2, 3, 4, 3)

Step 1. Create a Cost Grid Matrix

The number of rows in this grid should be equal to the number of data points in Time Series A (which is 3). The number of columns should be equal to the number of data points in Time Series B (which is 4).

A B B B B
__ 2 3 4 3
1
2
3

For each cell in this grid, you need to calculate the "local cost" of aligning the corresponding points from Time Series A and Time Series B. A common way to calculate this local cost is by taking the absolute difference between the two values.

For the cell at row i and column j, you will compare the i-th value of A (aᵢ) with the j-th value of B (bⱼ) and calculate |aᵢ − bⱼ|.

So the local costs will be:

A B B B B
__ 2 3 4 3
1 |1−2| |1−3| |1−4| |1−3|
2 |2−2| |2−3| |2−4| |2−3|
3 |3−2| |3−3| |3−4| |3−3|

Thus,

A B B B B
__ 2 3 4 3
1 1 2 3 2
2 0 1 2 1
3 1 0 1 0

Let call this matrix D.

Step 2: Create the Cumulative Cost Matrix - Initialize the First Cell

What to do:

  1. Create a new grid (a matrix) with the same dimensions as your Cost Matrix (in this case, 3×4). This will be our Cumulative Cost Matrix, let's call it C.

  2. The very first cell of the Cumulative Cost Matrix, C(1,1) (top-left corner), is simply equal to the value of the very first cell of your Cost Matrix, D(1,1).

In this case, D(1,1) = 1

So, C(1,1) = 1

A B B B B
__ 2 3 4 3
1 1
2
3
Step 3: Fill the First Row and First Column of the Cumulative Cost Matrix

What to do:

  1. First Column (except the first cell): For each cell in the first column of C (from the second row downwards), its value is the sum of the local cost in the corresponding cell of D and the value in the cell directly above it in C.

    C(i,1) = D(i,1) + C(i−1,1)   for i > 1

  2. First Row (except the first cell): similarly, each cell is the sum of the local cost in D and the value in the cell directly to its left in C.

    C(1,j) = D(1,j) + C(1,j−1)   for j > 1

  3. Every other cell: the local cost plus the minimum of the three neighbours above, to the left, and diagonally above-left:

    C(i,j) = D(i,j) + min(C(i−1,j), C(i,j−1), C(i−1,j−1))

Let's understand these better with calculations:

Matrix D again:

A B B B B
__ 2 3 4 3
1 1 2 3 2
2 0 1 2 1
3 1 0 1 0

So currently we have C(1,1) = 1

Now if we want to find out let's say, C(2,1), the cell directly below.

The value of this cell will be the sum of the value of the cell above it and the corresponding value of this cell in matrix D.

So: C(2,1) = D(2,1) + C(1,1) = 0 + 1 = 1, and likewise C(3,1) = D(3,1) + C(2,1) = 1 + 1 = 2.

So, C becomes:

A B B B B
__ 2 3 4 3
1 1
2 1
3 2

Now, for the remaining cells of the first row: C(1,2) = D(1,2) + C(1,1) = 2 + 1 = 3, C(1,3) = D(1,3) + C(1,2) = 3 + 3 = 6, and C(1,4) = D(1,4) + C(1,3) = 2 + 6 = 8.

So, C becomes now:

A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1
3 2

Step 4: Fill the Remaining Cells

Now we apply C(i,j) = D(i,j) + min(C(i−1,j), C(i,j−1), C(i−1,j−1)) to each remaining cell, moving left to right, top to bottom (for each current cell, I will denote it with an x, for easy visualization):

A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2
3 2 x
A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 x
3 2 1
A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 4
3 2 1 x
A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 4 x
3 2 1 2
A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 4 5
3 2 1 2 x

So, our final cumulative cost matrix becomes:

A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 4 5
3 2 1 2 2
Step 5: Calculate the DTW Distance

What to do:

The Dynamic Time Warping (DTW) distance between your two time series A and B is simply the value located in the bottom-right cell of your Cumulative Cost Matrix C.

So in our Cumulative Cost Matrix C:

A B B B B
__ 2 3 4 3
1 1 3 6 8
2 1 2 4 5
3 2 1 2 2

The bottom-right cell is at the 3rd row and 4th column, C(3,4), and its value is: 2.

Therefore, the DTW distance between Time Series A=(1,2,3) and Time Series B=(2,3,4,3) is 2.
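
Here is a compact Python sketch that builds the cost matrix D and the cumulative matrix C exactly as in Steps 1 to 4 and reads off the DTW distance:

```python
import numpy as np

A = [1, 2, 3]
B = [2, 3, 4, 3]
n, m = len(A), len(B)

# local cost matrix D and cumulative cost matrix C, as in the steps above
D = np.array([[abs(a - b) for b in B] for a in A], dtype=float)
C = np.zeros_like(D)
C[0, 0] = D[0, 0]
for i in range(1, n):                       # first column
    C[i, 0] = D[i, 0] + C[i - 1, 0]
for j in range(1, m):                       # first row
    C[0, j] = D[0, j] + C[0, j - 1]
for i in range(1, n):                       # remaining cells
    for j in range(1, m):
        C[i, j] = D[i, j] + min(C[i - 1, j], C[i, j - 1], C[i - 1, j - 1])

print(C)
print("DTW distance:", C[-1, -1])   # 2.0
```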

Step 6: Traceback for the Warping Path

What to do:

  1. Start at the bottom-right cell of C.
  2. Look at its neighbouring cells (top, left and diagonal). Move to the one with the smallest value of the three.
  3. Repeat till you reach the top-left cell of C, i.e. C(1,1).

So the traced path would be:

[
 [1, 3, 6, 8],
  ^
  |
 [1, 2, 4, 5],
  ^
  |
   \
    \
 [2, 1 <-- 2 <-- 2]
]

And the points in the traversed order would be:

(3,4) → (3,3) → (3,2) → (2,1) → (1,1)

Reverse this, and we would get the optimal warping path as:

(1,1) → (2,1) → (3,2) → (3,3) → (3,4)

What does this warping path tell us?

Remember that we had the time series as A = (1, 2, 3) and B = (2, 3, 4, 3).

This path tells us the alignment:

  • (1,1): A's 1st point (1) ↔ B's 1st point (2)
  • (2,1): A's 2nd point (2) ↔ B's 1st point (2)
  • (3,2): A's 3rd point (3) ↔ B's 2nd point (3)
  • (3,3): A's 3rd point (3) ↔ B's 3rd point (4)
  • (3,4): A's 3rd point (3) ↔ B's 4th point (3)

Notice how some points in one time series can be aligned with multiple points in the other, which is the "warping" effect.

The warping effect: the fact that the 3rd point of Time Series A (value 3) is aligned with the 2nd (value 3), 3rd (value 4), and 4th (value 3) points of Time Series B illustrates how DTW stretches or compresses the time axis of one or both series to find the best possible match. It's like saying, "To see the similarity, we need to consider that the last point of A corresponds to this whole segment in B."

One beautiful analogy for this is time dilation. A black hole bends and compresses time around it, so time passes much more slowly near its event horizon and vicinity, while for distant observers, in their own frames of reference, it passes "normally".

DTW behaves similarly: it "stretches or compresses the time axis of one time series" to find the similarity with another one.


3. Cosine Similarity Search

Cosine similarity is primarily a mathematical measure of how "close" or how "similar" two vectors are. It returns the cosine of the angle between them; for non-negative data this value ranges from 0 to 1 (in general it can range from −1 to 1). If the value is very close to 1, the vectors are considered similar; the closer it is to 0, the more dissimilar they are.

There are specific thresholds one can set for this:

Angle Value Interpretation
1.0 100% similar
0.9 and above Highly Similar
0.7 to 0.9 Very Similar
0.5 to 0.7 Moderately Similar
0.3 to 0.5 Weak, vague similarity
0.0 to 0.3 Very weak or negligible similarity
0.0 100% dissimilar

The formula for cosine similarity for two vectors a and b is:

$$\text{Cosine Similarity}(a, b) = \frac{a \cdot b}{|a| \, |b|}$$

Here a · b is the dot product of the two vectors, and |a| · |b| is the product of their magnitudes.

However, since we are working with numerical arrays here, this expands to:

$$\text{Cosine Similarity}(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \;\cdot\; \sqrt{\sum_{i=1}^{n} b_i^2}}$$

And a very important constraint : The dimensionality of both the vectors should be same, i.e. both the arrays should have the same number of elements, or this method will fail.


Example 1

Let's say we have two same dimensional arrays:

Time Series A: [1.0, 2.0, 3.0, 4.0, 5.0]
Time Series B: [1.5, 2.5, 3.5, 4.5, 5.5]

Let's check if these two time series are similar to each other or not.

$$\sum_{i=1}^{5} a_i b_i = (1.0\times1.5) + (2.0\times2.5) + (3.0\times3.5) + (4.0\times4.5) + (5.0\times5.5) = 62.5$$

$$\sqrt{\sum a_i^2} = \sqrt{1 + 4 + 9 + 16 + 25} = \sqrt{55} \approx 7.416$$

$$\sqrt{\sum b_i^2} = \sqrt{2.25 + 6.25 + 12.25 + 20.25 + 30.25} = \sqrt{71.25} \approx 8.440$$

$$\sqrt{\sum a_i^2} \cdot \sqrt{\sum b_i^2} \approx 7.416 \times 8.440 \approx 62.59$$

Now,

$$\text{Cosine Similarity}(a, b) = \frac{62.5}{62.59} \approx 0.9985$$

which is a very very high similarity.

So now, you must be wondering, how might one infer this answer with the context of the two time series?

Well since we have established that the two time series:

Time Series A: [1.0, 2.0, 3.0, 4.0, 5.0]
Time Series B: [1.5, 2.5, 3.5, 4.5, 5.5]

have a very high similarity, we can observe a few traits in both the time series:

Pasted image 20250427152114.png

Here we can see the plot of these two time series. Note how both are parallel lines, indicating high similarity.
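
The same calculation as a short NumPy sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 4))   # ~0.998, a very high similarity
```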

In contrast, if the cosine similarity was: