Module 3 -- Storage Strategies -- DBMS


Index

  1. Indices
  2. B - Tree
  3. B+ Trees
  4. Hashing

Indices

๐Ÿ”น What are Indices in DBMS?

https://www.youtube.com/watch?v=E--yzX05_k8&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=94

An index in a database is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage to maintain the index.

In simple terms, indexing reduces the number of blocks to be called from storage during the processing of a query, thus reducing the time taken for the query to execute.

Think of it like the index in a bookโ€”instead of flipping through every page, you can directly go to the page number associated with a topic. Similarly, an index in a database lets the DBMS quickly locate data without scanning the entire table.


๐Ÿ”น Why Use Indexing?


๐Ÿ”น How Do Indices Work?

Internally, an index creates a mapping between key values and the location (address or row ID) of the data in the table.

For example:

Employee Table
+----+--------+-------+
| ID | Name   | Dept  |
+----+--------+-------+
| 1  | Alice  | HR    |
| 2  | Bob    | IT    |
| 3  | Carol  | IT    |
+----+--------+-------+

Index on Dept:
{ "HR" -> row 1, "IT" -> rows 2 and 3 }


๐Ÿ”น Types of Indices

https://www.youtube.com/watch?v=vjrHiaIfOl8&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=97

https://www.youtube.com/watch?v=4E-MGnjMhRw&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=98

https://www.youtube.com/watch?v=UpJ9ICmzaAM&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=99

https://www.youtube.com/watch?v=Ua08uVgsk4k&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=100

  1. Primary Index

    • Built on a primary key.
    • One per table.
    • Unique and sorted.
    • Example: Index on Employee.ID.
  2. Secondary Index

    • Built on non-primary attributes (e.g., Dept).
    • Can have duplicates.
    • Helps in searching non-key attributes.
  3. Clustered Index

    • Alters the physical order of the data in the table to match the index.
    • Only one per table.
    • Faster for range queries.
    • Used for sorting large data efficiently.
  4. Non-clustered Index

    • Logical ordering of data.
    • Contains pointers to the actual data location.
    • Can be multiple per table.
  5. Composite Index

    • Index on multiple columns (e.g., Name, Dept).
    • Helps if queries filter by more than one column.

๐Ÿ”น When to Use Indices

Use indexing:

Avoid indexing:


๐Ÿ”น Cost of Indexing


A numerical example to further clear the understanding of Indices.

https://www.youtube.com/watch?v=P24LAhp-ap8&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=95 (part 1)

https://www.youtube.com/watch?v=s_S_MpLoDEM&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=96 (part 2)

Pasted image 20250514145710.png

Pasted image 20250514150114.png

So let's say we have this setup.

Each block in the HDD can hold 1000bytes and each record is of size 250 bytes.

So, number of records per block = 4.

Now if the total number of records were 10,000 (unordered), meaning the records could be scattered on the HDD in a random order.

Total number of blocks required = 2500.

Now let's say we had a query to execute:

SELECT * FROM STUDENTS WHERE ROLL = 345;

How long do you think would that query take to complete, without indices?

If each block takes a total of 4ms to get processed in the ram and

If we are focusing on the average time complexity then this would be :

25002ย ร—ย 4ย =ย 5000ย miliseconds

And number of blocks needed to scan in average time complexity would 1250.

So, if the worst case scenario is scanning all 2500 blocks, that is the data is present on the 2500th block, then it would N.

Best case scenario would be on the first block, so 1.

So average case time complexity would be: O(N2).

If the data was ordered then it would be:

O(log2โกN)ย =ย 12

12 blocks would be needed to scan and time taken would be 48ms.


Using Indices.

Now using Indices we can get that in an even lower amount of time and lesser amount of blocks.

If the question were to modified as to find the time complexity to search a record from index table if the size of an index table entry is equal to 20 bytes.

Now there can be two types of indexing based on whether the data is ordered or unordered.

For unordered data, we have dense indexing. This means that each block in the index table will have the same number of entries as the ones in the HDD.

For ordered data we have sparse indexing. This means that each block will contain only the page number of the corresponding block in the HDD.

The difference can be spotted in this way:

For 1000 records, a single block of 20 bytes can hold 50 records.

In sparse indexing this would get us = 250050ย =ย 50 blocks

Which means that the entirety of 1000 records can be fitted in just 50 blocks, as compared to 2500 blocks on the HDD.

This makes the search time much lesser since the time complexity now becomes :

log2โก50ย =ย 6

And plus 1 since after searching these 6 records we will get a page number pointing to the desired block on HDD, where we need to perform a search for the desired data value.

For dense indexing we can't use the page index so we need to fit all 10,000 records, we would need 200 blocks for that.

That would result in the time complexity of log2200ย +ย 1ย =ย 9 searches, which takes longer than sparse indexing but still an overall quite a lesser number of searches than it would be without indexing.

And that's why indexing is really helpful in reducing query processing time, specifically searches.


B - Tree

https://www.youtube.com/watch?v=KcApkM5WYGw&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=108

๐ŸŒณ What is a B-Tree?

A B-Tree is a self-balancing tree data structure that maintains sorted data and allows searching, insertion, and deletion in logarithmic time.


๐Ÿง  Why do we need B-Trees?

  1. When data is stored in secondary memory, we want fast ways to search through it.

  2. Instead of scanning all records, we build indexes that contain:

    • A key (e.g., roll number, student ID)
    • A pointer (to the full record in memory)
  3. Indexes can get large too, so we make indexes of indexes โ†’ multilevel indexing.

  4. But multilevel indexing can become slow or hard to update (insert/delete = pain).

  5. So, we use a tree structure to organize and balance the index โ†’ Thatโ€™s where B-Trees come in.


๐Ÿ“ B-Tree Terminology

Letโ€™s go over the components the video describes:

1. Nodes

Like all trees, B-trees have:

Important: A B-Tree is always balanced, which means all leaf nodes are at the same level.

2. Keys

3. Record Pointers

4. Block Pointers (Tree Pointers)


๐Ÿ”ข Order of B-Tree (denoted as P)

This is very important, as it defines:

Example:

Letโ€™s say order P = 4 (which is common in questions):

Component Value
Max children 4
Max keys 3
Min children (non-root) 2 (ceil(4/2))
Min keys (non-root) 1 (ceil(4/2) - 1)
Block pointers per node = number of children (up to P)
Record pointers = number of keys (up to P โˆ’ 1)

Finding the Order of a B-Tree

https://www.youtube.com/watch?v=yGSHbjA4w0o&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=103


๐Ÿงฑ Structure of a Node

For order P = 4, a typical node can look like:

| K1 | K2 | K3 |
| R1 | R2 | R3 |  โ† Record Pointers (point to actual data)
| C1 | C2 | C3 | C4 | โ† Block Pointers (to child nodes)

โš™๏ธ Insertion and Search

Insertion in a B-Tree

https://www.youtube.com/watch?v=YUtUNlLNB5c&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=102

๐Ÿง  Quick Refresher: What is a B-Tree of Order m?

For order 4 (m = 4):

๐ŸŽฏ Insertion Process in a B-Tree

Letโ€™s say we are inserting the values:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10 into a B-tree of order 4


๐Ÿ”ข Step-by-Step Insertion Walkthrough

Insert 1, 2, 3


Insert 4

โœ… Resulting structure:

         [2]
        /   \
     [1]   [3,4]

Insert 5


Insert 6

โœ… Now root has two keys:

        [2, 4]
       /  |   \
    [1] [3] [5,6]

Insert 7


Insert 8

Root [2, 4] can accept one more key โ†’ becomes [2, 4, 6]

          [2, 4, 6]
        /    |    | \
     [1]   [3]  [5]  [7,8]

Insert 9


Insert 10

But root [2, 4, 6] is already full, so we need to:

โœ… Final structure:

            [4]
        /      \
     [2]         [6, 8]
   /    \        /   | \
[1]   [3]   [5] [7] [9,10]


๐Ÿงฉ Key Concepts to Remember

Concept Explanation
Order (m) Max children per node
Max keys per node m - 1
Min keys per node ceil(m/2) - 1
Overflow When a node gets more than m - 1 keys
Splitting Happens on overflow โ†’ median key goes up, node splits into two
Promotion When a median goes to the parent (may cause cascading splits if full)
Sorted Insertions B-tree maintains sorted order and binary-search-like path finding
Balanced Tree All leaves are at the same level (ensures balanced height for fast access)

๐Ÿงช Why Balanced?

A balanced tree means:


B+ Trees

This is not directly present in our syllabus but is a good reference point.

๐ŸŒณ What is a B+ Tree?

A B+ Tree is a self-balancing tree data structure that maintains sorted data and allows search, insert, delete, and range queries in logarithmic time.

โœ… Itโ€™s an extension of the B-tree, optimized for disk-based storage systems (like databases and file systems).


๐Ÿ“ฆ Why B+ Trees in Databases?

Databases need:

The B+ Tree handles these perfectly by:


๐Ÿง  Structure of a B+ Tree

  1. Internal Nodes:

    • Store only keys, not actual data.
    • Guide the search path.
  2. Leaf Nodes:

    • Store data records (or pointers to them).
    • Are linked in a singly/doubly linked list for efficient range queries.
  3. Order (d):

    • Determines the maximum number of children a node can have.
    • Each internal node can have at most 2d children and at least d.

๐ŸŒฟ Example Structure (Order d = 2):

        [10 | 20]
       /    |    \
      /     |     \
 [1 5 8] [11 14 18] [21 23 30]   <- leaf nodes (data)


๐Ÿ”Ž Searching in B+ Tree


โž• Insertion in B+ Tree

  1. Insert into correct leaf node.
  2. If the node overflows, split it.
  3. Promote the middle key to the parent node.
  4. Repeat upward if needed (may grow height).

Example:

Before:

Leaf: [10, 20]

Insert 30 โ†’ becomes [10, 20, 30] โ†’ Overflow โ†’ Split:


๐Ÿ” Differences Between B Tree and B+ Tree

Feature B Tree B+ Tree
Data location Internal & leaf nodes Only in leaf nodes
Range queries Inefficient Efficient using linked leaves
Leaf linkage Not required Linked for faster traversal
Storage space Slightly less efficient Slightly more (due to duplication of keys)
Search Faster for single key Slightly more steps (due to leaf-only data)

๐Ÿ“ˆ Advantages of B+ Trees


๐Ÿ“Œ Real Use


Hashing

๐Ÿ” What is Hashing?

Hashing is a technique to map data of any size to fixed-size values using a hash function. The result of this mapping is called a hash value or hash code, and it helps you store and retrieve data quickly.

Think of it like assigning a locker number (hash code) to every student (data) so that you can find their locker instantly, without searching every locker.


๐ŸŽฏ Objective of Hashing


๐Ÿงฎ Hash Table

A hash table is a data structure that uses hashing to store key-value pairs.


๐Ÿ”„ Hash Function

A hash function h(k) maps a key k to an index in the array.

h(k) = k mod m

Where:

So if:

So, store the data at index 3.


โ— Collisions in Hashing

A collision happens when two different keys hash to the same index.

Example:

Now both 123 and 133 want to go into index 3 โ†’ Collision!

We solve this using Collision Resolution Techniques.


๐Ÿ› ๏ธ Collision Resolution Techniques

1. Open Hashing / Chaining

Index 3 โ†’ [123] โ†’ [133]

2. Closed Hashing / Open Addressing

a. Linear Probing

b. Quadratic Probing

c. Double Hashing


๐Ÿ’ก Load Factor (ฮฑ)


๐Ÿ“ฆ Applications of Hashing


โœ… Summary

Term Meaning
Hashing Mapping key to index
Hash Function Converts key to index
Hash Table Array storing data using hash values
Collision Multiple keys map to same index
Chaining Linked list at each slot
Linear/Quadratic/Double Probing Methods to resolve collisions without chaining
Load Factor Measure of how full the table is