Introduction to Data Warehousing and Data Mining -- Module 1


Index

  1. What is Data Warehousing?
  2. ETL (Extract, Transform, Load) -- Data Aggregation in Data Warehousing
  3. 4. OLTP vs OLAP (Database vs Data Warehouse
  4. 5. Data Warehouse Architecture
  5. Star Schema vs Snowflake Schema
  6. Data Mining
  7. Data Mining Techniques
  8. Mining Frequent Patterns, Associations, and Correlations
  9. Association Rules — More Than Co-Occurrence
  10. Correlation — Relationship Beyond Co-Occurrence
  11. Sequential Pattern Mining

What is Data Warehousing?

A Data Warehouse (DW) is a centralized, integrated system designed for querying and analyzing data, not for daily operations.

It collects data from multiple sources — like databases, flat files, logs — and stores it in a consistent, cleaned, and structured form, optimized for analysis and reporting.

Pasted image 20250414125346.png


2. Why Do We Need Data Warehousing?

Imagine you run a large company:

That’s the job of the Data Warehouse
A single source of truth for decision-makers.


3. Data Warehousing Components:


ETL (Extract, Transform, Load) -- Data Aggregation in Data Warehousing

https://www.youtube.com/watch?v=Tq8oCFjP6kQ&list=PLxCzCOWd7aiHexyDzYgry0YZN29e7HdNB&index=3

What is ETL?

ETL stands for:

The speaker emphasizes that ETL isn't just a concept; it’s a real-world process executed by companies, especially when they work with data warehousing.


Step 1: Extract

What It Means:

The extraction step is about collecting raw data from multiple sources. These could be:


Example 1: Reliance Retail Stores


Example 2: Bank Mergers (e.g., SBI)


Step 2: Transform

What It Means:

This is the cleaning and organizing stage. Once the data is extracted and integrated, we:


Key Concepts:


Step 3: Load

What It Means:

Finally, the clean and transformed data is loaded into the data warehouse. This is where it becomes available for:

You shouldn't load dirty or raw data, because:

By ensuring quality data at this step, future decisions are based on accurate, meaningful information.


Tools Used

ETL tools used in the industry:


Summary

Step Description Real-life Examples
Extract Pull raw data from different sources Reliance stores, merged banks
Transform Clean, format, validate, structure the data Remove duplicates, fix currencies
Load Store into a central warehouse Clean data goes into data warehouse for reporting

Why ETL is Important?


4. OLTP vs OLAP (Database vs Data Warehouse

https://www.youtube.com/watch?v=DVVIRw7GG5c&list=PLxCzCOWd7aiHexyDzYgry0YZN29e7HdNB&index=7

https://www.youtube.com/watch?v=DeL3dgiy6vk&list=PLxCzCOWd7aiHexyDzYgry0YZN29e7HdNB&index=8

Feature OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Purpose Day-to-day operations Decision-making and analytics
Data Type Highly detailed, real-time data Summarized, historical data
Examples Banking transactions, e-commerce checkout Sales trends, customer segmentation
Operations Insert, Update, Delete Complex Queries, Aggregations
Storage Design Normalized Denormalized (Star/Snowflake schema)

Difference between Database and Data Warehouse

OLTP Database

A database is designed to capture and record data from a variety of enterprise systems i.e. OLTP - Online Transaction Processing. OLTP includes resource planning, customer relationships, transaction management etc. OLTP database is an operational database which contains highly detailed  live, real-time data.

These databases should operate consistently for the success of a business and there should be no error or ambiguity in the data that is being stored. The transactions stored in the OLTP databases are simple and direct which can be queried with speed and ease. However, databases work slowly for querying large amounts of data and can slowdown the transactional process which negatively affects the business.

OLAP Data Warehouse

On the other hand, data warehouses are designed to store summarized data from various databases or sources after transforming the data according to the business necessity for analysis, reporting and decision making, i.e. OLAP - Online Analytical Processing. Since data from various sources is stored in a warehouse, the structure of a warehouse should be thoroughly pre-planned.

Querying huge amounts of data is easy with data warehouses, and it does not generally affect business as fresh data is generally not dynamically stored in the data warehouse, rather data is updated periodically. Data stored is secure, reliable and can be easily managed and retrieved.

A simply analogy would be:

Data Warehouses are built for OLAP.


5. Data Warehouse Architecture

https://forum.guvi.in/posts/30498/data-warehousing

Data warehouse architecture comes in three types:

### Two-tier architecture:

Two-tier architecture is basically a client-server architecture. The end users are directly communicating with data sources. It is easy to maintain, and the communication is fast as there is no mediator in between.

In this type of architecture, the performance of the warehouse decreases when the number of end users increases, and it is also cost-ineffective.
         
Pasted image 20250414132224.png

Three-tier architecture (Similar to the 3-schema architecture from DBMS)

Three-tier architecture has three layers. The top tier, middle tier and bottom tier. ==The top tier is the User layer, which allows users to retrieve data from the data warehouse ==and analyze, generate visualizations and share insights with the team. The middle tier is the OLAP layer, which acts as an interface between the User and Data layer, enhances the communication speed and processes large volumes of data. The bottom tier is the data layer that has the actual databases, and all RDBMS actions are performed in this layer.

The three-tier architecture improves performance as load is reduced on the User and Data layer. This architecture provides improved security as users do not have direct access to the database.

Pasted image 20250414132356.png

Four-tier architecture:

In four-tier architecture, a Presentation layer is included in addition to the layers present in the three-tier architecture. When users want to retrieve certain data from the warehouse, requests are sent from the user layer to the presentation layer which interacts with the business layer/OLAP layer which in turn accesses the data warehouse and the retrieved data is sent as a response to the user. The advantage of having a presentation layer is that it enhances the way the data is presented to the end user.

Pasted image 20250414134145.png


Star Schema vs Snowflake Schema

1. Star Schema

https://www.datacamp.com/blog/star-schema-vs-snowflake-schema

https://www.youtube.com/watch?v=ot-2GXJXYMM&list=PLxCzCOWd7aiHexyDzYgry0YZN29e7HdNB&index=5

A star schema is a way to organize data in a database, especially in data warehouses, to make it easier and faster to analyze. At the center, there's a main table called the fact table, which holds measurable data like sales or revenue. Around it are dimension tables, which add details like product names, customer info, or dates. This layout forms a star-like shape.

Pasted image 20250414135111.png

Let’s look at the key features of the star schema:

Let's understand this further with an example :

Pasted image 20250414135354.png

The fact table Sales is in the center. It holds the numeric data you want to analyze, like sales or profits. Connected to it are dimension tables with descriptive details, such as product names, customer location, or dates:

Advantages and limitations of a star schema

Now that you know what star schema is, let’s look at why it stands out:


2. Snowflake Schema

https://www.youtube.com/watch?v=3LcjuLb3Vy8&list=PLxCzCOWd7aiHexyDzYgry0YZN29e7HdNB&index=6

A snowflake schema is another way of organizing data. In this schema, dimension tables are split into smaller sub-dimensions to keep data more organized and detailed — just like snowflakes in a large lake.

Pasted image 20250414135801.png

Let’s look at the key features of the snowflake schema that make it different from other schemas:

Let’s understand this with a simple snowflake schema diagram. At the center is the fact table, which contains measurable data. It connects to dimension tables that describe the facts, and these dimension tables further branch out into sub-dimension tables, forming a snowflake-like structure.

For example, here I split the Product into Manufacturer and Category tables and the Customer table into Transaction and Location tables:

Pasted image 20250414135918.png

Advantages and limitations of a snowflake schema

Like star schema, snowflake schema also has its own advantages. Let’s see what they are:

However, despite its advantages, there are a few limitations too. For example, queries can be slower because there are more joins between tables. Apart from this, the multi-level structure is more challenging to design and maintain than simpler schemas like star schema.


When to Use a Star Schema

If you primarily want to organize your data simply and quickly, the star schema would be perfect. Here’s when you can use it:

When to Use a Snowflake Schema

Snowflake schema is more suitable for handling frequent updates or organizing detailed hierarchies. Here’s when you can use it:


Summary Table: Star Schema vs. Snowflake Schema

Feature Star schema Snowflake schema
Structure Central fact table linked to denormalized dimensions Central fact table linked to normalized dimensions
Complexity Simple, with fewer joins Complex, with more joins
Data redundancy Higher redundancy due to denormalized dimensions Lower redundancy due to normalized dimensions
Query performance Faster queries due to simpler structure Slower queries because of additional joins
Storage Requires more storage because of redundancy Requires less storage due to normalization
Ease of maintenance Easier to design and maintain More complex to design and maintain
Best suited for Small to medium-sized datasets Large and complex datasets

6. Use Cases of Data Warehousing:


Data Mining

Data Mining is the computational process of exploring and analyzing large volumes of data to uncover meaningful patterns, trends, correlations, or anomalies that can help in decision-making.

It's sometimes called Knowledge Discovery in Databases (KDD) — because the goal is to extract knowledge from raw, unstructured, or massive datasets.


💡 Difference: Data Mining vs KDD

KDD is the entire process from data collection to knowledge interpretation.
Data Mining is one step inside KDD — the core analysis part.

Stage Purpose
Data Selection Choose relevant data
Data Preprocessing Cleanse and handle missing values
Data Transformation Convert into correct formats
Data Mining Discover patterns using algorithms
Interpretation/Evaluation Make sense of results

Key Characteristics of Data Mining:


💡 Example to Lock It In:

Supermarket Scenario:
Imagine you own a large chain of supermarkets. Your databases log millions of transactions daily.

You might use Data Mining to find patterns like:

This insight lets you:


Core Functions of Data Mining:

Function Description Example
Classification Assign items to predefined categories. Email → Spam or Not Spam.
Clustering Group similar items together without labels. Grouping customers by purchase behavior.
Association Rule Mining Find rules about co-occurrence of items. Bread → Jam (Market Basket Analysis).
Prediction Estimate future outcomes based on patterns. Predict stock prices.
Outlier Detection Identify anomalies or unusual patterns. Fraud detection in banking.
Regression Predict continuous numerical values. Forecast next month’s sales.

Data Mining Techniques:


📊 Data Mining vs Traditional Data Analysis:

Traditional Analysis Data Mining
Often manual, human-driven Algorithm-based, automated pattern detection.
Pre-defined hypothesis testing Discover unknown relationships.
Limited to small/medium datasets Scales to large datasets (big data).
Focus on reporting Focus on prediction, pattern recognition.

Why is Data Mining so important today?

It helps in:


Data Mining in Real Life:

Industry Use-Case
Retail Recommend products based on user behavior.
Banking Detect credit card fraud patterns.
Healthcare Predict diseases from symptoms and history.
E-Commerce Dynamic pricing and product placement.
Social Media Analyze trends and suggest friends or content.

Common Misconceptions:


Mining Frequent Patterns, Associations, and Correlations

What Are Frequent Patterns?

https://www.geeksforgeeks.org/frequent-pattern-mining-in-data-mining/

A frequent pattern is a set of items, subsequences, or structures that appear together in a dataset with a frequency higher than a given threshold.


Everyday Example:

At a grocery store:

{Milk, Bread, Butter}

together in a single transaction.

If this pattern appears in 40% of all transactions, it's considered a frequent pattern.


Formal Terms:

Formula:

Support(X) = (Number of transactions containing X) / (Total number of transactions)

More Practical Examples:

Dataset Type Frequent Pattern Example
Online Shopping {Laptop, Mouse} bought together 2000 times.
Healthcare Records {Cough, Fever, Chest Pain} appearing in diagnosis.
Music Streaming {Hip-Hop, Pop} genres often played together.
Social Media Behavior {Likes, Comments} happening on similar posts.

Association Rules — More Than Co-Occurrence

Association Rule Mining builds a rule from these frequent patterns:

If X happens, Y likely happens too.

For example :

{Milk} → {Bread}

means: If a person buys milk, there’s a high chance they’ll also buy bread.

3 Metrics to Evaluate the Rule

  1. Support:
    How common is this combination in the dataset?
    e.g.,
    Support of {Milk → Bread} = 30% if both appear in 30% of all transactions.

  2. Confidence:
    When Milk is bought, how often is Bread also bought?
    Formula:
    Confidence = Support(Milk & Bread) / Support(Milk)

  3. Lift:
    Measures if Milk and Bread appear together more than random chance.
    Formula:
    Lift = Confidence / Support(Bread)

    • Lift > 1: Positive association.
    • Lift = 1: Independent.
    • Lift < 1: Negative association.

Real-World Association Rule Examples:

Domain Rule Example Use Case
E-commerce {Phone} → {Phone Case} Recommend phone cases after buying a phone.
Supermarket {Chips} → {Soda} Place chips and soda together in-store.
Healthcare {High BP} → {Heart Disease} Early diagnosis assistance.
Education {Watching Lectures} → {Higher Exam Scores} Suggest more video content to struggling students.

Correlation — Relationship Beyond Co-Occurrence

While association rules show that two things happen together,
Correlation answers:
"Are these things really linked, or is it just a coincidence?"

Example 1: Ice Cream & Sunglasses

{Ice Cream} → {Sunglasses}

But correlation might reveal:


Example 2: Student Grades

{Study Group Hours} ↔ {Exam Scores}

Summary Table:

Concept Purpose Example
Frequent Pattern Find combinations that occur together often. {Milk, Bread} appears in 40% of sales.
Association Rule Discover if the occurrence of one item implies another. {Laptop} → {Mouse} with 80% confidence.
Correlation Determine if the relationship is strong or coincidental. Ice Cream & Sunglasses — weather-driven.

So, all in all:

Frequent Patterns reveal what's popular.
Association Rules predict behaviors.
Correlation explains whether the relation is genuine or accidental.


Sequential Pattern Mining

https://www.geeksforgeeks.org/generalized-sequential-pattern-gsp-mining-in-data-mining/

https://www.youtube.com/watch?v=vWuVbQ_NKcY

Basic of Sequential Pattern (GSP) Mining:


Difference between Transactional and Sequential data

So let's we have this transactional data of a supermarket

Pasted image 20250417153146.png

Now storing such transactions in a transaction table would become incredibly complex to store, that's where sequential patterns come into play

Pasted image 20250417155216.png

We store all the transactions done by grouping them by CID in a Sequences field which is denoted by the angle brackets <>


Some of the algorithms used in Sequential Pattern Mining

Pasted image 20250417155401.png


Sequential Pattern (GSP) Mining uses:

Sequential pattern mining, also known as GSP (Generalized Sequential Pattern) mining, is a technique used to identify patterns in sequential data. The goal of GSP mining is to discover patterns in data that occur over time, such as customer buying habits, website navigation patterns, or sensor data.

Some of the main uses of GSP mining include:

Market basket analysis: GSP mining can be used to analyze customer buying habits and identify products that are frequently purchased together. This can help businesses to optimize their product placement and marketing strategies.

  1. Fraud detection: GSP mining can be used to identify patterns of behavior that are indicative of fraud, such as unusual patterns of transactions or access to sensitive data.
  2. Website navigation: GSP mining can be used to analyze website navigation patterns, such as the sequence of pages visited by users, and identify areas of the website that are frequently accessed or ignored.
  3. Sensor data analysis: GSP mining can be used to analyze sensor data, such as data from IoT devices, and identify patterns in the data that are indicative of certain conditions or states.
  4. Social media analysis:   GSP mining can be used to analyze social media data, such as posts and comments, and identify patterns in the data that indicate trends, sentiment, or other insights.
  5. Medical data analysis: GSP mining can be used to analyze medical data, such as patient records, and identify patterns in the data that are indicative of certain health conditions or trends.

Challenges in  Generalized Sequential Pattern Data Mining

The database is passed many times to the algorithm recursively. The computational efforts are more to mine the frequent pattern. When the sequence database is very large and patterns to be mined are long then GSP encounters the problem in doing so effectively.