Introduction to Data Warehouse
• A Data Warehouse (DW) is a system used to store large volumes of structured data from
multiple sources for analysis and reporting.
1. Enterprise Data Warehouse (EDW): A centralized warehouse serving the entire organization.
2. Operational Data Store (ODS): Provides real-time data for operational purposes.
3. Data Mart: A smaller, specialized subset of a data warehouse for specific departments like
sales or finance.
• Supply Chain Management (SCM): Monitors and improves supply chain processes.
Development Methodologies
1. Top-Down Approach:
2. Bottom-Up Approach:
• ETL (Extract, Transform, Load) tools: Extract data from sources, clean it, and load it into the
warehouse (e.g., Informatica, Talend).
• OLTP (Online Transaction Processing): Handles daily transactional data (e.g., bank
withdrawals).
• Star Schema: Central fact table linked to dimension tables (e.g., sales data with customer
details).
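As a rough illustration of the star schema described above, the sketch below builds one fact table and one dimension table in an in-memory SQLite database and runs a typical join-and-aggregate query. The table names, column names, and rows are hypothetical, chosen only for this example.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) and one dimension (customer).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount      REAL
    );
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)])

# Analytical query: join the fact table to the dimension, then aggregate.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(s.amount)
    FROM fact_sales AS s
    JOIN dim_customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(sales_by_city)
```

The fact table holds only keys and measures; descriptive attributes such as the city live in the dimension table and are reached through the join.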
Warehouse Server
• Metadata describes the structure, source, and usage of data in the warehouse.
• Types:
OLAP Engine
• Data mining is the process of discovering useful patterns and knowledge from large sets of
data.
• Data Mining:
o A specific step in the KDD process focused on identifying patterns and extracting
knowledge from data.
• Machine Learning: Algorithms that enable computers to learn patterns from data.
1. Classification: Assigns data into predefined categories (e.g., spam email detection).
1. Text Mining: Extracts information from text data (e.g., sentiment analysis).
2. Web Mining: Analyzes web data like browsing patterns and search logs.
Data Preprocessing
Data preprocessing prepares raw data for mining by improving its quality. Steps include:
o Summarizing data using statistical measures like mean, median, and mode.
2. Data Cleaning:
4. Data Reduction:
o Reducing the data volume while maintaining its relevance (e.g., dimensionality
reduction).
o Converting continuous data into discrete intervals or levels (e.g., grouping ages into
ranges like "18-25").
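The discretization step above can be sketched in a few lines. This is a minimal example, assuming hand-picked bin edges; the function name age_group and the edge values are hypothetical.

```python
def age_group(age, edges=(18, 26, 41, 61)):
    """Map a continuous age to a discrete label such as "18-25"."""
    if age < edges[0]:
        return f"<{edges[0]}"
    # Each pair of neighbouring edges defines one half-open interval [low, high).
    for low, high in zip(edges, edges[1:]):
        if low <= age < high:
            return f"{low}-{high - 1}"
    return f"{edges[-1]}+"

ages = [17, 18, 25, 26, 40, 63]          # hypothetical raw values
groups = [age_group(a) for a in ages]
print(groups)
```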
Association Rules
• Commonly used in market basket analysis to find products often bought together.
• Example Rule: If a customer buys bread, they are likely to buy butter.
o Lift: Measures how much more likely two items are bought together compared to
random chance.
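To make the lift measure concrete, here is a small sketch that computes it for the bread → butter rule using the standard definitions of support and confidence. The baskets are hypothetical, and the helper names are chosen only for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; > 1 means positive association."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "jam"},
]
rule_lift = lift(baskets, {"bread"}, {"butter"})
print(round(rule_lift, 2))
```

A lift above 1 suggests the two items appear together more often than they would if they were bought independently; a lift near 1 suggests no association.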
1. Apriori Algorithm:
o Steps:
2. Partition Algorithm:
o Searches for frequent itemsets from both ends (bottom-up and top-down).
o Adapts to the size and structure of the data for better efficiency.
o Uses a compact tree structure to store transactions and discover frequent itemsets
without candidate generation.
o Steps:
• Partition Algorithm: Scales well for large datasets but may miss cross-partition patterns.
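For comparison, the level-wise search used by the Apriori algorithm can be sketched as below. This is a simplified illustration, not an optimized implementation: it rescans all transactions for every candidate, and the baskets and the min_support value are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if sup(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: sup(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "jam"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
print(sorted(sorted(s) for s in frequent_itemsets))
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is discarded without counting if any of its subsets is already known to be infrequent.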
• Example: If beverages → snacks is a rule, the generalized rule might be cold drinks → chips.
• Focuses on finding rules that satisfy specific conditions or constraints, such as:
Classification and Clustering
• Classification:
• Prediction:
o Example: Predicting house prices based on features like area and location.
Classification Techniques
o Example: A decision tree for classifying animals based on features like "Has feathers"
or "Can fly."
2. Bayesian Classification:
o Uses probability theory (Bayes’ Theorem) to predict the class based on prior
knowledge of the data.
3. Rule-Based Classification:
o Example: The rule IF age > 50 THEN high risk, used to predict the likelihood of heart disease.
o Does not build a model beforehand. Instead, it memorizes the training data and
makes decisions based on a comparison of new data with stored instances.
o Example: K-Nearest Neighbors (KNN), where the class of a data point is determined
by its nearest neighbors.
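A minimal sketch of the KNN idea described above, assuming Euclidean distance and simple majority voting. The training points, the labels, and the function name knn_predict are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label); return the majority label of the k nearest."""
    # No model is built: the stored instances are simply ranked by distance to the query.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored training instances.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.1)))
```

All the work happens at prediction time, which is why this family of methods is called lazy learning.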
• Clustering:
o The task of grouping similar data points into clusters, where data points in the same
cluster are more similar to each other than to those in other clusters.
1. Numerical Data:
o Data that can be measured and expressed in numbers (e.g., age, income).
2. Categorical Data:
3. Mixed Data:
1. Partitioning Methods:
o Steps in K-Means:
2. Hierarchical Methods:
o Agglomerative (Bottom-Up): Start with individual points and gradually merge them
into larger clusters.
o Divisive (Top-Down): Start with all data in one cluster and recursively divide it into
smaller clusters.
3. Density-Based Methods:
o Clusters data points that are closely packed together, marking areas of higher density
as clusters.
4. Grid-Based Methods:
o Divides the data space into a grid and clusters the data based on the grid structure.
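For the partitioning methods above, the usual K-Means loop (assign each point to its nearest center, recompute the centers, repeat until nothing changes) can be sketched as follows. To keep the example reproducible it assumes the first k points serve as initial centers, which is a simplification; practical implementations usually pick the initial centers randomly or with a seeding heuristic. The sample points are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    # Assumption for reproducibility: the first k points are the initial centers.
    centers = [points[i] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: centers no longer move
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, k=2)
print(clusters)
```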
Outlier Analysis
• Outliers are data points that differ significantly from other data points.
• Outlier Detection: Identifying and removing outliers can improve the clustering results.
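One simple way to flag outliers is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a threshold of 2, which is a judgment call rather than a fixed rule, and uses hypothetical measurements.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)   # population standard deviation
    if spread == 0:
        return []                        # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

values = [10, 12, 11, 13, 12, 11, 95]    # hypothetical measurements
outliers = find_outliers(values)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this check works best on roughly bell-shaped data; other detection methods may suit skewed data better.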
o A Data Warehouse is a centralized repository that stores large amounts of historical
data for reporting and analysis.
o It collects data from different sources and stores it in a single location to support
decision-making.
o Data Mart: A smaller version of the EDW that stores the data of a specific department
or business unit.
o Integrated: Data from different sources is merged into a common format.
o Time-Variant: Data is historical in nature, and the time period it covers is recorded
with it.
o Non-Volatile: Once stored in the warehouse, data is not overwritten or deleted in
normal operation; new data is added through periodic loads.
o Advantages:
o Applications:
o Top-Down: Start with a high-level plan for the entire data warehouse and then go
down to the details.
o Bottom-Up: Start with smaller data marts and integrate them into a larger
warehouse.
• OLAP Operations:
o Operations include Drill-down, Roll-up, Slice, and Dice to analyze the data.
• Warehouse Schema:
o Common schemas include Star Schema and Snowflake Schema, which organize the
data in a way that's efficient for querying.
o Data warehouse architecture typically consists of the data source layer, ETL layer,
data warehouse layer, and presentation layer.
o Warehouse Server: Stores the data and handles requests for analysis.
o Metadata: Data about the data; it defines the structure, location, and other details
about the data.
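The OLAP operations listed above (Roll-up, Drill-down, Slice, Dice) can be illustrated on a tiny in-memory data set. This is only a sketch of the ideas, not how an OLAP engine is implemented; the fact rows, dimension names, and the aggregate helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts with three dimensions: year, quarter, region.
facts = [
    {"year": 2023, "quarter": "Q1", "region": "North", "sales": 100},
    {"year": 2023, "quarter": "Q2", "region": "North", "sales": 150},
    {"year": 2023, "quarter": "Q1", "region": "South", "sales": 80},
    {"year": 2024, "quarter": "Q1", "region": "North", "sales": 120},
    {"year": 2024, "quarter": "Q1", "region": "South", "sales": 90},
]

def aggregate(rows, dims):
    """Sum sales grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)

# Roll-up: summarize to a coarser level (year only).
by_year = aggregate(facts, ["year"])
# Drill-down: move to a finer level (year and quarter).
by_year_quarter = aggregate(facts, ["year", "quarter"])
# Slice: fix one dimension to a single value.
north_only = [r for r in facts if r["region"] == "North"]
# Dice: restrict two or more dimensions at once.
north_q1 = [r for r in facts if r["region"] == "North" and r["quarter"] == "Q1"]

print(by_year)
print(by_year_quarter)
```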
o Data mining is the process of extracting useful patterns and insights from large
data sets.
o DBMS (Database Management System): Stores and manages the data.
o Data Mining: Analyzes the data held in the DBMS and extracts useful patterns.
1. Classification: Assigns data to predefined categories (e.g., spam vs. non-spam).
2. Clustering: Groups similar data points into a cluster (e.g., customer segmentation).
3. Association Rule Mining: Finds relationships between items (e.g., if a customer buys
bread, they are likely to buy butter).
4. Regression: Predicts numerical values (e.g., predicting the price of a house).
5. Anomaly Detection: Identifies unusual data patterns (e.g., fraud detection).
1. Text Mining: Extracts useful information from text data (e.g., sentiment analysis).
• Data Preprocessing:
o Descriptive Data Summarization: Summarizing data using statistical measures like
mean, median, and mode.
o Data Cleaning: Fixing errors, filling in missing values, and removing duplicates.
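The two preprocessing steps above can be sketched with Python's standard statistics module. The example assumes missing values are marked with None and are filled with the median of the observed values, which is only one of several reasonable strategies; the column values are hypothetical.

```python
import statistics

# Hypothetical column; None marks a missing value.
raw = [23, 25, None, 25, 31, 25, 40, 31]

# Data cleaning: fill missing values with the median of the observed values.
observed = [x for x in raw if x is not None]
fill_value = statistics.median(observed)
cleaned = [fill_value if x is None else x for x in raw]

# Descriptive data summarization on the cleaned column.
summary = {
    "mean": statistics.mean(cleaned),
    "median": statistics.median(cleaned),
    "mode": statistics.mode(cleaned),
}
print(cleaned)
print(summary)
```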
• Classification:
o The process of predicting the categorical label of a data item.
• Prediction:
o Example: Predicting house prices based on features like area and location.
Classification Techniques
o A tree-like structure in which each internal node represents a decision and the leaf
nodes give the predicted class.
o Example: Animal classification based on attributes such as "Has feathers" or
"Can fly."
2. Bayesian Classification:
o Uses probability theory (Bayes’ Theorem) to predict the class.
o Example: A spam email classifier that calculates the probability that an email is
spam based on certain words.
3. Rule-Based Classification:
o Example: The rule IF age > 50 THEN high risk, used to predict heart disease.
4. Lazy Learner (Instance-Based Learning):
o Does not build a model beforehand. Instead, it memorizes the data and makes
decisions by comparing new data with the stored instances.
o Example: The K-Nearest Neighbors (KNN) algorithm, in which data points are
classified based on their nearest neighbors.
• Clustering:
o Clustering is the process of dividing data into groups, or clusters, of similar
data points.
1. Numerical Data:
2. Categorical Data:
3. Mixed Data:
1. Partitioning Methods:
o Example: The K-Means algorithm, in which the number of clusters "K" is specified in advance.
o Steps in K-Means:
3. The cluster centers are updated, and the process repeats until
convergence.
2. Hierarchical Methods:
o Agglomerative (Bottom-Up): Starts with individual points and gradually merges
them into clusters.
o Divisive (Top-Down): Starts with all data in one cluster and then recursively
divides it.
3. Density-Based Methods:
o Groups closely packed data points into clusters, marking high-density areas
as clusters.
4. Grid-Based Methods:
Outlier Analysis
• Outliers: Data points that differ significantly from the other data points.
• Outlier Detection: Identifying and removing outliers can improve the clustering
results.
o Association rules identify patterns that describe relationships between items in
large datasets.
1. Apriori Algorithm:
▪ Steps:
▪ The data is divided into smaller partitions, and frequent itemsets are found in
each partition.
3. Pincer-Search Algorithm:
▪ Finds frequent itemsets by searching both bottom-up and top-down.
• These extend basic association rules to include hierarchical relationships as
well.
• Example: If beverages → snacks is a rule, the generalized rule might be cold drinks → chips.
• These rules satisfy specific conditions, such as a minimum or maximum number of items,
or the presence of particular items.
• Example: Discovering rules only for expensive products or only for a specific brand.