Data Structure Unit (3 &4) Notes (SnapED)
Unit 3
Sorting
Sorting is a fundamental concept in computer science and everyday life. It refers to
the process of arranging a collection of data items into a specific order, based on
some chosen criterion. This order can be ascending (e.g., lowest to highest),
descending (highest to lowest), alphabetical, chronological, or any other meaningful
sequence.
● Data analysis and retrieval: Sorted data is easier to analyze, visualize, and
search. Imagine searching for a contact in an unsorted phone book vs. a
sorted one!
● Efficiency: Sorting algorithms can significantly improve the performance of
various tasks, from searching for information to merging datasets.
● Organization and clarity: Sorted data provides a clear structure and allows for
easier identification of patterns and trends.
Types of sorting:
● Selection sort: iterates through the data and finds the minimum/maximum
element, swapping it to the beginning/end. (Simple but slow)
● Bubble sort: repeatedly compares adjacent elements and swaps them if
they're out of order. (Easy to understand but inefficient)
● Insertion sort: inserts each element into its rightful position in a sorted
sub-sequence. (Efficient for small datasets)
● Merge sort: divides the data into halves, sorts each half recursively, and then
merges them back in order. (Fast and efficient for large datasets)
● Quick sort: chooses a pivot element and partitions the data around it,
recursively sorting each partition. (Fast but can be unstable)
Order
Order plays a crucial role in various data structures, impacting their efficiency and
operations. Here's a breakdown of how order manifests in different structures:
The concept of stability in data structures goes beyond just maintaining order. It specifically
refers to how a data structure preserves the relative order of elements with equal values
during operations like insertion, deletion, or sorting.
A stable data structure ensures that if two elements have the same value and are initially in
a specific order, that order will be preserved even after performing operations on the
structure. Imagine two students with the same grade in a class roster. A stable data
structure would keep them listed in the same order even after adding new students or
sorting the roster.
● Binary Search Trees (BSTs): a BST can preserve the insertion order of elements with
equal keys if duplicates are handled consistently during insertion (for example, always
placed in the right subtree), although deletions can disturb this order unless handled
carefully.
● Doubly Linked Lists: These lists maintain both forward and backward pointers,
allowing them to remain stable during insertions and deletions anywhere in the list.
● Merge Sort: This sorting algorithm is inherently stable, as it merges pre-sorted
sub-sequences while maintaining the relative order of elements with equal values.
● Insertion Sort: The standard insertion sort is stable: an element is only shifted past
values strictly greater than its own, so elements with equal values keep their original
relative order. (Selection sort and quicksort, by contrast, are typically not stable.)
● Hash Tables: Hash tables store elements based on their hash values, which can lead
to collisions and unpredictable order for elements with the same hash.
Choosing a stable or unstable structure depends on your specific needs and the operations
you'll be performing. If maintaining the relative order of equal elements is crucial, then opting
for a stable structure like a BST or a stable sorting algorithm like merge sort is
recommended.
Selection Sort
Straight selection sort and heap sort are the two classic "sorting by selection" methods:
both repeatedly select the smallest (or largest) remaining element and place it in its
final position.
Straight selection sort scans the unsorted portion of the list for its minimum (or
maximum) element and swaps it to the boundary of the sorted portion, then repeats on
the remaining elements. It always performs O(n^2) comparisons, regardless of the initial
order, which makes it simple but slow for large datasets.
Heap sort utilizes a data structure called a heap to perform the selection efficiently. A
heap is a tree-based structure where each parent node has a higher (or lower) value
than its children, depending on the type of heap (max-heap or min-heap). Heap sort
builds the heap from the given data, then repeatedly extracts the minimum (or
maximum) element from the root of the heap and restores the heap property for the
remaining elements. This process continues until all elements are extracted and sorted.
Heap sort has a time complexity of O(n log n) and is significantly faster than selection
sort for large datasets.
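To make the contrast concrete, here is a minimal Python sketch of both ideas (the
function names and sample list are illustrative, not part of the notes): straight selection
sort written out directly, and heap sort built on Python's heapq module.

import heapq

def selection_sort(items):
    # Straight selection sort: repeatedly move the minimum of the
    # unsorted suffix to the front of that suffix. O(n^2) comparisons.
    a = list(items)
    for i in range(len(a) - 1):
        min_idx = i
        for j in range(i + 1, len(a)):
            if a[j] < a[min_idx]:
                min_idx = j
        a[i], a[min_idx] = a[min_idx], a[i]   # swap the minimum into place
    return a

def heap_sort(items):
    # Heap sort via a min-heap: build the heap in O(n), then pop the
    # smallest element n times for O(n log n) overall.
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(selection_sort([8, 4, 6, 2, 1]))   # [1, 2, 4, 6, 8]
print(heap_sort([8, 4, 6, 2, 1]))        # [1, 2, 4, 6, 8]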
Insertion Sort:
Concept: Imagine sorting cards in your hand. You compare each card with the ones
before it and slide it into its rightful place if needed. Insertion sort does the same
thing with data elements.
Steps:
1. Treat the first element as a sorted sub-list and take the next element as the
current key.
2. Compare the key with the elements before it, shifting each larger element one
position to the right.
3. Insert the key into the gap, then repeat with the next element until the whole list
is sorted.
Example:
Unsorted list: 8, 4, 6, 2, 1
Insert 4 before 8 → 4, 8, 6, 2, 1; insert 6 between 4 and 8 → 4, 6, 8, 2, 1; insert 2 at
the front → 2, 4, 6, 8, 1; insert 1 at the front → 1, 2, 4, 6, 8.
Complexity: O(n^2) in the worst case, meaning the time it takes to sort grows
quadratically with the data size; for nearly sorted data it approaches O(n).
"Straight" insertion sort is simply the classical textbook name for the standard
insertion sort described above. Variations exist, such as binary insertion sort (which
locates the insertion point with binary search) and the gap-based shell sort described
next, but the basic idea is the same.
Shell Sort:
Concept: Shell sort is a hybrid of insertion sort and gap-based sorting. It starts with a
larger gap between elements, allowing them to "jump" further and partially sort the
data. Then, it gradually reduces the gap and applies insertion sort within smaller
sub-sequences to refine the sorting.
Steps:
1. Choose a sequence of gaps (e.g., half, a quarter, eighth of the data size).
2. Sort elements within sub-sequences based on the chosen gap. This involves
insertion sort-like comparisons and shifts.
3. Reduce the gap and repeat step 2 until the gap is 1.
4. Perform regular insertion sort on the entire data with a gap of 1.
Example:
Unsorted list: 8, 4, 6, 2, 1
1. Pass (gap 2): Sort the interleaved sub-sequences 8, 6, 1 and 4, 2 (elements 2
positions apart), giving 1, 2, 6, 4, 8.
2. Pass (gap 1): Regular insertion sort refines the order, giving 1, 2, 4, 6, 8.
Complexity: Depends on the chosen gap sequence; roughly O(n^1.5) for common
sequences such as Knuth's, and O(n^2) in the worst case for the simple halving
sequence. In practice this makes it noticeably faster than plain insertion sort for larger
datasets.
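A short Python sketch of shell sort using the simple halving gap sequence (a common
textbook choice; other gap sequences work the same way):

def shell_sort(items):
    a = list(items)
    gap = len(a) // 2
    while gap > 0:
        # Gapped insertion sort: each interleaved slice
        # a[k], a[k + gap], a[k + 2*gap], ... is kept sorted.
        for i in range(gap, len(a)):
            key = a[i]
            j = i
            while j >= gap and a[j - gap] > key:
                a[j] = a[j - gap]
                j -= gap
            a[j] = key
        gap //= 2                      # reduce the gap until it reaches 1
    return a

print(shell_sort([8, 4, 6, 2, 1]))  # [1, 2, 4, 6, 8]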
In summary:
● Insertion sort is simple and easy to understand but slow for large datasets.
● Shell sort is a hybrid approach that improves efficiency by combining
gap-based sorting with insertion sort.
● Choosing the best algorithm depends on the data size and specific
requirements.
Exchange Sorts
Both bubble sort and quicksort are classified as "exchange sort" algorithms: they
achieve sorting by repeatedly swapping (exchanging) elements within the data until
the desired order is reached. However, they differ significantly in their approach and
efficiency.
Bubble Sort:
● Concept: Imagine two children swapping marbles in their hands until they are
in the right order. Bubble sort does the same for data elements, repeatedly
swapping adjacent elements if they are in the wrong order.
● Steps:
○ Iterate through the data, comparing each pair of adjacent elements.
○ If the current element is larger than the next, swap them.
○ Repeat the process until no swaps occur in a full iteration, indicating
the data is sorted.
● Example:
○ Unsorted list: 8, 4, 6, 2, 1
○ Pass 1: swap 8 with 4, 6, 2, and 1 in turn, giving 4, 6, 2, 1, 8.
○ Pass 2: swap 6 with 2 and 1, giving 4, 2, 1, 6, 8.
○ Repeat passes until no swaps occur, ending with 1, 2, 4, 6, 8.
● Complexity: O(n^2) in the worst case, making it inefficient for large datasets.
● Strengths: Simple to understand and implement.
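A minimal Python sketch of bubble sort with the early-exit optimisation described
above (stop as soon as a full pass makes no swaps):

def bubble_sort(items):
    a = list(items)
    for end in range(len(a) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if a[i] > a[i + 1]:                    # adjacent pair out of order
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:                            # no swaps: already sorted
            break
    return a

print(bubble_sort([8, 4, 6, 2, 1]))  # [1, 2, 4, 6, 8]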
Quicksort:
● Concept: Imagine dividing a deck of cards into smaller piles based on their
values and then recursively sorting each pile. Quicksort uses a similar
divide-and-conquer approach.
● Steps:
○ Choose a pivot element (often the first or last element).
○ Partition the data into two sub-sets: elements less than the pivot and
elements greater than or equal to the pivot.
○ Recursively sort each sub-set using the same process.
○ Combine the sorted sub-sets back into a single sorted list.
● Example:
○ Unsorted list: 8, 4, 6, 2, 1
○ Choose 4 as the pivot.
○ Partition the data: [2, 1] (less than the pivot) and [8, 6] (greater than or
equal to the pivot).
○ Recursively sort each sub-set: [1, 2] and [6, 8].
○ Combine: [1, 2] + [4] + [6, 8] = [1, 2, 4, 6, 8].
● Complexity: O(n log n) on average, making it significantly faster than bubble
sort for large datasets.
● Strengths: Efficient for large datasets on average, and the classic version sorts in
place with little extra memory.
● Weaknesses: More complex to understand and implement than bubble sort,
degrades to O(n^2) with poor pivot choices (e.g., on already-sorted input), and
typical implementations are not stable.
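A compact Python sketch of the partition-and-recurse idea; this list-building version
trades the in-place partitioning of classic quicksort for readability:

def quick_sort(items):
    if len(items) <= 1:
        return list(items)
    pivot = items[0]                               # naive pivot choice
    smaller = [x for x in items[1:] if x < pivot]  # elements below the pivot
    larger = [x for x in items[1:] if x >= pivot]  # elements at or above it
    return quick_sort(smaller) + [pivot] + quick_sort(larger)

print(quick_sort([8, 4, 6, 2, 1]))  # [1, 2, 4, 6, 8]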
Merge sort is a versatile sorting algorithm that can handle both internal and external
sorting. When dealing with massive datasets that cannot fit in memory, external
sorting becomes necessary. This involves using temporary files on disk to perform
the sorting process. Here's a breakdown of how merge sort adapts to external
sorting, along with some specific variations:
1. Data Chunking: The large dataset is divided into smaller chunks that fit into
memory. These chunks are called runs.
2. Run Creation: Each run is then sorted internally using a sorting algorithm like
quicksort or merge sort itself. This creates multiple sorted files on disk.
3. Merging Runs: The sorted runs are then merged pairwise, combining them
into larger sorted files. This process continues until a single sorted file
remains.
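A toy Python sketch of these three phases, using in-memory lists to stand in for
temporary files on disk and heapq.merge for the k-way merge (the chunk size and
sample data are illustrative assumptions):

import heapq

def external_merge_sort(data, chunk_size=3):
    # 1-2. Data chunking and run creation: sort each chunk that
    #      "fits in memory" (each run would normally be written to disk).
    runs = [sorted(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]
    # 3. Merging runs: heapq.merge performs a k-way merge of sorted inputs.
    return list(heapq.merge(*runs))

print(external_merge_sort([8, 4, 6, 2, 1, 9, 7, 5, 3]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]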
Natural Merge:
● Instead of creating fixed-size runs, natural merge takes advantage of runs that
already exist in the data (maximal ascending sub-sequences) and merges them in
adjacent pairs.
● Efficient when the input is already partially ordered or when the number of
available external storage devices is limited.
● However, for large or poorly ordered datasets it can require many merge passes,
disk seeks, and I/O operations, reducing efficiency.
Balanced Merge:
● Distributes the sorted runs evenly across the available external files or devices
and merges them k at a time (a k-way merge) in repeated passes.
● Reading from several runs at once reduces the number of passes over the data,
leading to faster overall I/O performance.
● Requires keeping track of which runs live on which device and redistributing the
merged output between passes.
● More complex to implement than natural merge but offers significant
performance gains for large datasets.
Polyphase Merge:
● This variation also uses multiple external storage devices (e.g., tapes or disk
files) but distributes the runs unevenly across them, typically in a Fibonacci-like
pattern.
● In each phase, runs from all input devices are merged onto the currently empty
device until one input device is exhausted; that device then becomes the output
for the next phase.
● Because no separate redistribution pass is needed, this can significantly reduce
the number of merge passes compared to balanced merge and natural merge.
● Requires careful planning and synchronization to manage the merging
process across multiple devices.
Searching Algorithms
Searching algorithms are fundamental tools for efficiently finding specific elements
within a collection of data. Here's a breakdown of three common approaches:
1. List Search:
This is the most general term encompassing any technique for finding an element in
a list. It can involve various approaches like:
● Sequential Search: Traversing the list element by element until finding the
target. Simple but slow for large lists.
● Binary Search: Only works with sorted lists. It repeatedly divides the search
space in half based on comparisons, quickly narrowing down the target's
location.
● Hashing: Uses a hash function to map elements to unique keys, allowing for
constant-time lookup in a hash table (efficient for specific data types).
● Interpolation Search: Estimates the target's position based on its value and
the surrounding elements in a sorted list, potentially reducing comparisons.
2. Sequential Search:
Also known as linear search, this algorithm iterates through each element in the list,
comparing it to the target element. If a match is found, its position is returned. If not,
the search continues until the end of the list.
Strengths:
● Simple to understand and implement.
● Works on unsorted data and on structures without random access (e.g., linked
lists).
Weaknesses:
● Time Complexity: O(n), meaning the search time grows linearly with the list
size. This can be slow for large lists.
● Inefficient for sorted lists: Doesn't take advantage of the sorted order.
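A minimal Python sketch of sequential (linear) search:

def sequential_search(items, target):
    # Scan every element in turn; return its index or -1 if absent. O(n).
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

print(sequential_search([8, 4, 6, 2, 1], 2))  # 3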
3. Binary Search:
This powerful algorithm only works with sorted lists. It works by repeatedly dividing
the search space in half based on comparisons with the target element. If the target
is smaller than the middle element, the search continues in the left half. If it's larger,
the right half is considered. This process continues until the target is found or the
search space shrinks to a single element.
Strengths:
● Time Complexity: O(log n), significantly faster than sequential search for large
lists.
● Efficiently utilizes the sorted order of the list.
Weaknesses:
● Requires the list to be sorted (and kept sorted as elements are added or
removed).
● Needs efficient random access to the middle element, so it is poorly suited to
linked lists.
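A minimal Python sketch of binary search on a sorted list:

def binary_search(sorted_items, target):
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1          # target can only be in the right half
        else:
            hi = mid - 1          # target can only be in the left half
    return -1                     # search space exhausted

print(binary_search([1, 2, 4, 6, 8], 6))  # 3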
Hashing
Hashing is a powerful technique for storing and retrieving data efficiently. It works by
mapping data elements to unique, fixed-size values called "hash codes" using a
hash function. These hash codes then act as keys for accessing the data in a hash
table, which is a specially designed data structure optimized for key-value lookups.
Hashing Methods:
● Modulo Division: Divides the data element's key by the table size (usually
chosen to be a prime number) and uses the remainder as the hash code. Simple
but can lead to clustering for certain data sets.
● Bit Extraction: Selects specific bits from the data element's binary
representation to create the hash code. Faster than modulo division but may
not distribute elements evenly.
● Polynomial Hashing: Uses a polynomial function to map the data element to a
hash code. More flexible than modulo division but computationally expensive.
Collision Resolution:
Collisions occur when different data elements map to the same hash code. This can
happen due to limitations in the hash function or the size of the hash table. To
handle collisions, several techniques can be used:
● Open Addressing:
○ Linear Probing: Checks for the next available slot in the hash table.
Simple but can lead to clustering and performance degradation.
○ Quadratic Probing: Uses a quadratic function to probe for an empty
slot. More complex but distributes elements more evenly.
○ Double Hashing: Uses a secondary hash function to probe for an
empty slot. More complex but can be more efficient than linear or
quadratic probing.
● Chaining: Stores colliding elements in a linked list attached to the colliding
hash code slot. Easy to implement but can lead to long chains and
performance problems.
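A small Python sketch of a hash table that combines modulo-division hashing with
chaining (the class and method names are illustrative, not a standard API):

class ChainedHashTable:
    def __init__(self, size=11):                 # small prime table size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)     # modulo-division hash code

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                         # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))              # colliding keys share a chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.put("alice", 91)
table.put("bob", 78)
print(table.get("alice"))  # 91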
Choosing a hashing method and collision-resolution strategy depends on several
factors:
● Data type and distribution: Different hashing methods work better for different
data types and distributions.
● Performance requirements: Some methods are faster than others but may
have higher collision rates.
● Memory limitations: Hashing utilizes additional memory for the hash table.
● Application requirements: Some applications require specific collision
resolution properties, like guaranteed maximum search time.
UNIT-4
Disjoint Sets
Disjoint sets, also known as union-find data structures, represent a collection of sets
that don't overlap or share elements. They are crucial for tasks such as detecting
cycles in undirected graphs, building minimum spanning trees (Kruskal's algorithm),
and tracking connected components or network connectivity. Two common
representations are:
1. Linked List Representation:
● Each set is stored as a linked list whose head element acts as the set
representative; every node keeps a pointer back to the head.
● Finding an element's set is a single pointer lookup, but a union must walk one
list to update all of its representative pointers (the shorter list is usually
appended to the longer one).
2. Forest Representation:
● Each set is represented by a tree, with the root as the set representative.
● Each node also has a rank (height of the subtree rooted at the node).
● Finding the set involves traversing the tree to the root.
● Union by rank merges two sets by attaching the tree with the smaller rank to
the larger one, updating ranks as needed.
Choosing a Representation:
● Frequency of find vs. union operations: If find operations are more frequent, a
forest representation might be better.
● Data distribution: If sets tend to have similar sizes, union by size might be
better.
● Complexity requirements: Linked list representation is simpler to implement,
but forest representation with path compression can be much faster.
The union-find algorithm, also known as the disjoint-set data structure, is a powerful
tool for managing a collection of non-overlapping sets. It efficiently performs two key
operations: find, which determines which set a given element belongs to, and union,
which merges two sets into one. Two common heuristics speed these up:
1. Union by Rank:
● Each set is represented as a tree, with the root representing the set.
● Each node has a rank, which is the height of the subtree rooted at that node.
● Find: Traverse the tree to the root, updating parent pointers for path
compression.
● Union: Attach the tree with the smaller rank to the one with the larger rank.
Update ranks as needed to maintain tree heights.
This approach balances the heights of trees in the union operation, leading to faster
find operations on average.
2. Union by Size:
● Similar to union by rank, but sizes of sets are used instead of ranks.
● Find: Traverse the tree to the root, updating parent pointers for path
compression.
● Union: Attach the smaller set (by size) to the larger one. Update sizes as
needed.
This approach is generally just as simple to implement as union by rank; attaching
the smaller set to the larger one keeps tree heights logarithmic, so find operations
stay fast.
Choosing Between Them:
● Both heuristics, especially when combined with path compression, give nearly
constant amortized time per operation for most cases.
● Union by size can be easier to reason about (and the set sizes are often useful
information in their own right), while union by rank keeps only small rank values
as bookkeeping.
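A compact Python sketch of a union-find forest with path compression and union by
rank (names are illustrative):

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))   # each element starts as its own root
        self.rank = [0] * n            # upper bound on each tree's height

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False               # already in the same set
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx            # attach the shorter tree to the taller
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(3, 4)
print(ds.find(0) == ds.find(1))  # True
print(ds.find(2) == ds.find(3))  # False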
Graphs are powerful tools for representing relationships between objects. They
consist of two main components: vertices (nodes), which represent the objects, and
edges, which represent the relationships or connections between them.
Graph representation refers to the techniques used to store and manipulate graphs
in computer memory. Choosing the appropriate representation is crucial for efficient
algorithms and data analysis. Here are some common options:
1. Adjacency Matrix: a |V| × |V| matrix whose entry (i, j) records whether (or with
what weight) an edge connects vertex i to vertex j.
2. Adjacency List: each vertex stores a list of its neighbours; memory use is
proportional to the number of vertices plus edges.
3. Edge List: a plain list of all edges, each stored as a pair of vertices (plus a
weight, if any).
4. Incidence Matrix: a |V| × |E| matrix in which each column records the vertices
that an edge connects.
5. Object-Oriented Representation: vertices and edges are modelled as objects
holding references to one another, convenient when they carry rich attributes.
Factors to consider when choosing a representation:
● Graph density: Sparse graphs benefit from adjacency lists, while dense
graphs might be better suited for adjacency matrices.
● Operations frequently performed: If checking for connections is crucial, an
adjacency matrix might be efficient. If iterating through neighbors is common,
an adjacency list might be better.
● Memory constraints: Adjacency lists and edge lists are generally more
memory-efficient than adjacency matrices.
● Directed vs. undirected graphs: Representation needs to adapt to the
directionality of edges, if applicable.
● Weighted graphs: Edges can have associated values (weights), requiring
additional storage and handling.
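As a concrete illustration, here is a minimal Python adjacency-list graph (the class
and sample vertices are assumptions for the example):

from collections import defaultdict

class Graph:
    def __init__(self):
        self.adj = defaultdict(list)   # vertex -> list of neighbours

    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.adj[v].append(u)          # drop this line for a directed graph

    def neighbors(self, u):
        return self.adj[u]

g = Graph()
g.add_edge("A", "B")
g.add_edge("A", "C")
print(g.neighbors("A"))  # ['B', 'C']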
Graph Traversals
Graph traversals are fundamental algorithms used to systematically visit all vertices
or edges in a graph. The two classic traversals are breadth-first search (BFS), which
explores the graph level by level from a starting vertex using a queue, and
depth-first search (DFS), which follows each branch as deep as possible before
backtracking, using a stack or recursion. They are essential for various tasks,
including:
● Finding shortest paths: BFS is efficient for finding the shortest path between
two vertices.
● Detecting cycles: DFS can effectively identify cycles (closed loops) within the
graph.
● Topological sorting: Ordering vertices in a directed acyclic graph based on
their dependencies.
● Connected components: Identifying groups of vertices connected to each
other.
Implementation Considerations:
● Both algorithms can be implemented iteratively or recursively.
● Marking visited vertices is crucial to avoid infinite loops.
● Different data structures (queues, stacks) impact efficiency and memory
usage.
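A brief Python sketch of both traversals over an adjacency-list graph (the sample
graph is illustrative):

from collections import deque

def bfs(adj, start):
    # Breadth-first: visit vertices level by level using a queue.
    visited, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            if v not in visited:       # marking visited avoids infinite loops
                visited.add(v)
                queue.append(v)
    return order

def dfs(adj, start, visited=None):
    # Depth-first (recursive): follow each branch as deep as possible.
    if visited is None:
        visited = set()
    visited.add(start)
    order = [start]
    for v in adj.get(start, []):
        if v not in visited:
            order.extend(dfs(adj, v, visited))
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']
print(dfs(graph, "A"))  # ['A', 'B', 'D', 'C']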
Further Exploration:
● There are variations of BFS and DFS (e.g., bidirectional BFS) with specific
advantages.
● Other graph traversal algorithms exist for different purposes (e.g., Dijkstra's
algorithm for finding shortest weighted paths).
Both Minimum Spanning Tree (MST) and Shortest Path algorithms deal with finding
optimal paths in graphs, but they address different objectives and have distinct
approaches:
Key Differences:
● Scope: MST considers all vertices and minimizes the total weight of the tree,
while shortest path focuses on a specific pair of vertices and minimizes the
weight of the path between them.
● Cycles: MST algorithms explicitly avoid cycles while building the tree. A shortest
path never repeats a vertex either (with non-negative weights a cycle can only
add cost); the real difference is that the shortest path between two vertices is
not necessarily made up of MST edges.
● Applications: MSTs are useful for resource allocation and network design,
while shortest paths are crucial for navigation, planning, and optimization
tasks.
● If you need to connect all vertices in a network with minimal total cost and
avoid cycles, use an MST algorithm.
● If you need to find the least-cost path between two specific vertices (for
example, with Dijkstra's algorithm), use a shortest path algorithm.
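To illustrate the shortest-path side, here is a compact Python sketch of Dijkstra's
algorithm for graphs with non-negative edge weights (the adjacency format and
sample graph are assumptions for the example):

import heapq

def dijkstra(adj, source):
    # adj maps each vertex to a list of (neighbour, weight) pairs.
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                   # stale queue entry; skip it
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd           # found a shorter route to v
                heapq.heappush(heap, (nd, v))
    return dist

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3}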