
Introduction to Data Warehouse

Uploaded by

javedgaur57
Copyright
© All Rights Reserved
Introduction to Data Warehouse

• A Data Warehouse (DW) is a system used to store large volumes of structured data from
multiple sources for analysis and reporting.

• It supports decision-making by providing historical, consolidated, and clean data.

Types of Data Warehouses

1. Enterprise Data Warehouse (EDW): A centralized warehouse serving the entire organization.

2. Operational Data Store (ODS): Holds current, frequently refreshed (near-real-time) data for operational purposes.

3. Data Mart: A smaller, specialized subset of a data warehouse for specific departments like
sales or finance.

Characteristics of a Data Warehouse

1. Subject-Oriented: Organized around key business topics (e.g., sales, customers).

2. Integrated: Combines data from various sources in a consistent format.

3. Time-Variant: Stores historical data for analysis.

4. Non-Volatile: Data remains unchanged once added.

Advantages of Data Warehouses

• Better Decision-Making: Provides insights through analytics and reports.

• Consolidated Data: Combines data from various sources.

• Improved Data Quality: Data is cleaned and standardized.

• Fast Access: Optimized for queries and analysis.

Applications of Data Warehouses

• Business Intelligence (BI): Helps in making informed business decisions.

• Customer Relationship Management (CRM): Analyzes customer behavior.

• Supply Chain Management (SCM): Monitors and improves supply chain processes.

• Healthcare: Tracks patient data and medical trends.

Development Methodologies

1. Top-Down Approach:

o Starts by building an Enterprise Data Warehouse first.


o Data marts are created later for specific needs.

o Pros: Integrated and consistent.

o Cons: Time-consuming and expensive.

2. Bottom-Up Approach:

o Builds Data Marts for specific areas first.

o Combines these into a larger warehouse over time.

o Pros: Quick to implement.

o Cons: May lack integration initially.

Tools for Data Warehouse Development

• ETL (Extract, Transform, Load) tools: Extract data from sources, clean it, and load it into the
warehouse (e.g., Informatica, Talend).

• BI Tools: For visualization and reporting (e.g., Tableau, Power BI).
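
The extract, clean, and load steps described above can be sketched with Python's standard library alone. This is only an illustration, not how Informatica or Talend work internally; the CSV text, the `sales` table, and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or an operational database.
RAW_CSV = """order_id,customer,amount
1, Alice ,120.50
2,BOB,80
2,BOB,80
3,carol,
"""

def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, drop duplicate orders and rows with no amount."""
    seen, clean = set(), []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete row
        key = row["order_id"].strip()
        if key in seen:
            continue  # duplicate order
        seen.add(key)
        clean.append((int(key), row["customer"].strip().title(), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
).fetchall()
print(loaded)
```

Only two of the four source rows reach the table: the duplicate order and the row with a missing amount are filtered out during the transform step.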

OLTP Systems and Data Warehouse

• OLTP (Online Transaction Processing): Handles daily transactional data (e.g., bank
withdrawals).

• Differences with DW:

Feature    | OLTP                  | Data Warehouse
Data Type  | Current data          | Historical data
Purpose    | Day-to-day operations | Analysis and reporting
Speed      | Fast transactions     | Optimized for queries
Example    | Banking systems       | Business dashboards

Functionality of Data Warehouse

• Data Storage: Stores data from various sources.

• Data Integration: Cleans and organizes data.

• Data Retrieval: Supports complex queries and reports.

OLAP (Online Analytical Processing) Operations

• Roll-Up: Summarizing data (e.g., daily to monthly sales).


• Drill-Down: Breaking data into finer details (e.g., yearly to daily sales).

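• Slice and Dice: Selecting part of the cube. A slice fixes one dimension to a single value (e.g., sales for 2023 only); a dice restricts two or more dimensions at once (e.g., 2023 sales of tea in the North region).

• Pivot: Rotating the cube to view the same data along different axes (e.g., swapping rows and columns in a report).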
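
A minimal sketch of these operations on a tiny in-memory fact list may help. The data, the dimension names, and the helper functions `aggregate` and `dice` are all made up for illustration; a real OLAP engine would work on pre-aggregated cubes.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, month, region, product, amount)
FACTS = [
    (2023, 1, "North", "Tea", 100),
    (2023, 1, "South", "Tea", 150),
    (2023, 2, "North", "Coffee", 200),
    (2024, 1, "North", "Tea", 120),
    (2024, 1, "South", "Coffee", 80),
]
DIMS = ("year", "month", "region", "product")

def aggregate(facts, group_by):
    """Sum the amount for each combination of the requested dimensions."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def dice(facts, **allowed):
    """Keep only facts whose dimension values fall in the allowed sets."""
    return [r for r in facts
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

monthly = aggregate(FACTS, ["year", "month"])                # finer level (drill-down view)
yearly = aggregate(FACTS, ["year"])                          # roll-up: month -> year
north = dice(FACTS, region={"North"})                        # slice: one dimension fixed
sub_cube = dice(FACTS, region={"North"}, product={"Tea"})    # dice: several dimensions
by_region_product = aggregate(FACTS, ["region", "product"])
by_product_region = aggregate(FACTS, ["product", "region"])  # pivot: axes swapped
```

Roll-up and drill-down are the same aggregation run at a coarser or finer grouping, and the pivot holds the same totals under reordered keys.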

Data Warehouse Schema

• Star Schema: Central fact table linked to dimension tables (e.g., sales data with customer
details).

• Snowflake Schema: Dimensions are normalized into multiple related tables.

• Fact Constellation Schema: Multiple fact tables share dimension tables.
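
As a rough illustration of a star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table: one row per sale, with a foreign key to each dimension.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    amount      REAL
);
INSERT INTO dim_customer VALUES (1, 'Asha', 'Delhi'), (2, 'Ravi', 'Mumbai');
INSERT INTO dim_product  VALUES (10, 'Tea', 'Beverages'), (11, 'Chips', 'Snacks');
INSERT INTO fact_sales   VALUES (100, 1, 10, 50.0), (101, 1, 11, 20.0), (102, 2, 10, 70.0);
""")

# Typical star-schema query: join the fact table to a dimension, then aggregate.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(sales_by_city)
```

In a snowflake schema, `dim_product` would be split further, for example into a separate category table referenced by `dim_product`.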

Data Warehouse Architecture

1. Source Layer: Collects data from multiple sources.

2. Staging Layer: Cleans and prepares data for storage.

3. Data Storage Layer: Stores data in the warehouse.

4. Presentation Layer: Provides data for analysis using BI tools.

Warehouse Server

• Manages the storage and retrieval of data in the warehouse.

Metadata in Data Warehousing

• Metadata describes the structure, source, and usage of data in the warehouse.

• Types:

o Technical Metadata: Information about data structure (tables, columns).

o Business Metadata: Describes the meaning and purpose of data.

OLAP Engine

• Processes and analyzes data for insights.

• Performs OLAP operations like roll-up, drill-down, and pivoting.


NEXT UNIT

What is Data Mining?

• Data mining is the process of discovering useful patterns and knowledge from large sets of
data.

• It’s part of a broader process called Knowledge Discovery in Databases (KDD).

KDD vs. Data Mining

• KDD (Knowledge Discovery in Databases):

o A complete process including data selection, cleaning, transformation, mining, and evaluation.

• Data Mining:

o A specific step in the KDD process focused on identifying patterns and extracting
knowledge from data.

DBMS vs. Data Mining

Feature  | DBMS                        | Data Mining
Purpose  | Stores and manages data     | Analyzes data to extract patterns
Focus    | Data retrieval              | Knowledge discovery
Output   | Query results               | Insights, trends, and patterns
Example  | SQL queries for data access | Predicting customer churn

Other Related Areas

• Machine Learning: Algorithms that enable computers to learn patterns from data.

• Statistics: Provides mathematical tools for analyzing and interpreting data.

• Artificial Intelligence (AI): Uses data mining to make intelligent decisions.

Data Mining Techniques

1. Classification: Assigns data into predefined categories (e.g., spam email detection).

2. Clustering: Groups similar data points (e.g., customer segmentation).


3. Association Rule Mining: Finds relationships between variables (e.g., "If X is bought, Y is also
likely to be bought").

4. Regression: Predicts numerical values (e.g., house price prediction).

5. Anomaly Detection: Identifies unusual data patterns (e.g., fraud detection).

Other Mining Techniques

1. Text Mining: Extracts information from text data (e.g., sentiment analysis).

2. Web Mining: Analyzes web data like browsing patterns and search logs.

3. Image and Video Mining: Extracts patterns from multimedia data.

Issues and Challenges in Data Mining

• Data Quality Issues: Incomplete, noisy, or inconsistent data.

• Scalability: Mining large datasets efficiently.

• Privacy and Security: Ensuring sensitive data isn’t exposed.

• Interpretability: Making results understandable for non-technical users.

• Dynamic Data: Handling changes in data over time.

Applications of Data Mining (Case Studies)

• Healthcare: Predicting disease outbreaks and improving treatment plans.

• Retail: Recommending products and optimizing inventory.

• Banking: Fraud detection and credit scoring.

• Education: Personalizing learning paths for students.

• Telecommunications: Customer retention through usage pattern analysis.

Data Preprocessing

Preparing raw data for mining by improving its quality. Steps include:

1. Descriptive Data Summarization:

o Summarizing data using statistical measures like mean, median, and mode.

2. Data Cleaning:

o Fixing errors, filling missing values, and removing duplicates.

3. Data Integration and Transformation:


o Combining data from multiple sources and converting it into a consistent format.

4. Data Reduction:

o Reducing the data volume while maintaining its relevance (e.g., dimensionality
reduction).

5. Data Discretization and Concept Hierarchy Generation:

o Converting continuous data into discrete intervals or levels (e.g., grouping ages into
ranges like "18-25").
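
Some of these steps (summarization, cleaning, discretization) can be sketched in a few lines of Python. The records, the choice of the median as the fill value, and the age ranges are assumptions made for the example.

```python
from statistics import mean, median, mode

# Hypothetical raw records: (name, age); None marks a missing value.
raw = [("Ann", 22), ("Bob", None), ("Cid", 35), ("Ann", 22), ("Dee", 51), ("Eve", 22)]

# 1. Descriptive summarization over the known ages.
known = [age for _, age in raw if age is not None]
summary = {"mean": mean(known), "median": median(known), "mode": mode(known)}

# 2. Cleaning: drop exact duplicates (keeping order), then fill missing ages
#    with the median of the remaining known ages.
deduped = list(dict.fromkeys(raw))
fill = median(age for _, age in deduped if age is not None)
cleaned = [(name, age if age is not None else fill) for name, age in deduped]

# 3. Discretization: map each age to a labelled range.
def age_group(age):
    if age < 18:
        return "under 18"
    if age <= 25:
        return "18-25"
    if age <= 40:
        return "26-40"
    return "over 40"

discretized = [(name, age_group(age)) for name, age in cleaned]
print(summary)
print(discretized)
```

Filling with the median is only one option; dropping the record or using a model-based estimate are common alternatives.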

NEXT UNIT
Here are simplified notes on Association Rules:

Introduction to Association Rules

• Association rules identify relationships or patterns between items in large datasets.

• Commonly used in market basket analysis to find products often bought together.

• Example Rule: If a customer buys bread, they are likely to buy butter.

o Support: How frequently items appear together.

o Confidence: Likelihood of buying one item given the other is bought.

o Lift: Measures how much more likely two items are bought together compared to
random chance.
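
The three measures can be computed directly from a list of transactions. The sketch below uses a made-up set of five baskets; the function names are illustrative.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
    {"butter", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's own support; above 1 suggests a positive association."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
print(lift({"bread"}, {"butter"}))
```

For the rule bread → butter in this toy data, support is 2/5, confidence is 2/3, and lift is roughly 1.11, so butter is slightly more likely in baskets that contain bread.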

Methods to Discover Association Rules

1. Apriori Algorithm:

o A classic algorithm that generates frequent itemsets based on the "downward closure property" (if an itemset is frequent, all its subsets are also frequent).

o Steps:

1. Generate candidate itemsets.

2. Prune infrequent itemsets.

3. Repeat until no more itemsets can be generated.

2. Partition Algorithm:

o Divides the database into smaller partitions.

o Finds frequent itemsets within each partition and combines results.

o Efficient for large datasets.


3. Pincer-Search Algorithm:

o Searches for frequent itemsets from both ends (bottom-up and top-down).

o Useful when the maximal frequent itemsets are long, because the top-down search can reach them without enumerating all of their subsets.

4. Dynamic Itemset Counting Algorithm:

o Divides the dataset into blocks and dynamically counts itemsets.

o Starts counting new candidate itemsets at block boundaries instead of waiting for a full pass, which reduces the number of database scans.

5. FP-Tree Growth Algorithm (Frequent Pattern Tree):

o Uses a compact tree structure to store transactions and discover frequent itemsets
without candidate generation.

o Steps:

1. Build an FP-Tree from the dataset.

2. Extract frequent itemsets directly from the tree.
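
The three Apriori steps listed under the first method above (generate candidates, prune, repeat) can be sketched as follows. This is a simple teaching version under assumed inputs, with `min_support` given as an absolute count; it is not an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        # Step 1: generate size-k candidates by joining frequent (k-1)-itemsets.
        prev = set(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Downward closure: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Step 2: count support and prune infrequent candidates.
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}
        result.update(frequent)
        k += 1  # Step 3: repeat until no candidates survive.
    return result

# Hypothetical baskets.
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread", "jam"},
           {"milk"}, {"butter", "milk"}]
frequent_itemsets = apriori(baskets, min_support=2)
for itemset, n in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With these baskets the candidate {bread, butter, milk} is never counted, because its subset {bread, milk} is infrequent. That is the saving the downward closure property provides.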

Discussion on Different Algorithms

• Apriori Algorithm: Simple but can be computationally expensive due to candidate generation.

• FP-Tree Algorithm: Faster than Apriori as it avoids candidate generation.

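• Partition Algorithm: Scales well for large datasets because it needs only two database scans. Itemsets that are frequent in a single partition may not be frequent overall, so the second scan is needed to check their global support.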

• Pincer-Search: Effective when the maximal frequent itemsets are long, but maintaining the top-down candidates adds complexity.

• Dynamic Itemset Counting: Needs fewer passes over the data than Apriori because counting of new candidates starts part-way through a scan.

Generalized Association Rules

• Extends basic association rules to include hierarchical relationships.

• Example: If cold drinks → chips is a rule at the item level, the generalized rule at the category level might be beverages → snacks.

• Useful in retail to analyze relationships at multiple levels of abstraction (e.g., category-level patterns).
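
One common way to mine rules at several levels is to add each item's parent categories to every transaction before counting support. The sketch below assumes a small hand-written taxonomy; the item names and helper functions are hypothetical.

```python
# Hypothetical item taxonomy: child -> parent category.
TAXONOMY = {"cola": "cold drinks", "cold drinks": "beverages",
            "chips": "snacks", "cookies": "snacks"}

def ancestors(item):
    """All categories above `item` in the taxonomy."""
    found = []
    while item in TAXONOMY:
        item = TAXONOMY[item]
        found.append(item)
    return found

def extend(transaction):
    """Add every ancestor category so rules can be mined at any level."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

def support(itemset, data):
    """Fraction of transactions containing the whole itemset."""
    return sum(set(itemset) <= t for t in data) / len(data)

baskets = [{"cola", "chips"}, {"cola", "cookies"}, {"chips"}]
extended_baskets = [extend(b) for b in baskets]

item_level = support({"cola", "chips"}, extended_baskets)
category_level = support({"beverages", "snacks"}, extended_baskets)
print(item_level, category_level)
```

The category-level pair is supported by more baskets than the item-level pair, which is why generalized rules can surface patterns that are too sparse at the item level.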

Association Rules with Item Constraints

• Focuses on finding rules that satisfy specific conditions or constraints, such as:

o Minimum/maximum number of items in a rule.

o Rules involving specific items or categories.

• Example: Discovering rules only involving expensive products or a particular brand.


• Useful for targeted marketing and inventory management.
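
A simple way to apply such constraints is to filter rules after they have been mined, as in the sketch below. The rules, the `EXPENSIVE` set, and the `satisfies` helper are invented for illustration; constraint-aware algorithms typically apply the constraints during mining rather than afterwards.

```python
# Hypothetical mined rules: (antecedent, consequent, confidence).
rules = [
    ({"laptop"}, {"mouse"}, 0.70),
    ({"bread"}, {"butter"}, 0.65),
    ({"laptop", "mouse"}, {"laptop bag"}, 0.55),
]
EXPENSIVE = {"laptop"}  # assumed set of items of interest

def satisfies(rule, required=frozenset(), max_items=None):
    """True if the rule mentions a required item and respects the size limit."""
    antecedent, consequent, _ = rule
    items = antecedent | consequent
    if required and not (items & required):
        return False
    return max_items is None or len(items) <= max_items

expensive_rules = [r for r in rules if satisfies(r, required=EXPENSIVE)]
short_expensive_rules = [r for r in rules
                         if satisfies(r, required=EXPENSIVE, max_items=2)]
print(len(expensive_rules), len(short_expensive_rules))
```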

These notes break down the concepts into manageable sections for easier understanding and quick
revision.

NEXT UNIT
Here are simplified notes on Classification and Clustering:

Classification and Prediction – Basic Concepts

• Classification:

o The task of predicting the categorical label of data.

o Example: Classifying emails as spam or not spam.

• Prediction:

o The task of predicting a continuous value based on input data.

o Example: Predicting house prices based on features like area and location.

Classification Techniques

1. Decision Tree Induction:

o A tree-like structure where each node represents a decision based on an attribute, and leaves represent the predicted class.

o Example: A decision tree for classifying animals based on features like "Has feathers"
or "Can fly."

2. Bayesian Classification:

o Uses probability theory (Bayes’ Theorem) to predict the class based on prior
knowledge of the data.

o Example: In a spam classifier, it predicts whether an email is spam or not by calculating the probability of the email being spam given certain words.

3. Rule-Based Classification:

o Classifies data using "IF-THEN" rules.

o Example: IF age > 50 THEN risk = high (a rule for predicting the likelihood of heart disease).

4. Lazy Learner (Instance-Based Learning):

o Does not build a model beforehand. Instead, it memorizes the training data and
makes decisions based on a comparison of new data with stored instances.
o Example: K-Nearest Neighbors (KNN), where the class of a data point is determined
by its nearest neighbors.
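
The lazy-learner idea can be illustrated with a very small K-Nearest Neighbors sketch. The training points, labels, and the `knn_predict` helper are assumptions for the example; it uses Euclidean distance and a simple majority vote.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (features, label). No model is built in advance;
# the stored instances themselves are consulted at prediction time.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

All the work happens at query time, which is why such methods are called lazy: training is cheap, but each prediction scans the stored data.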

Clustering and Applications

• Clustering:

o The task of grouping similar data points into clusters, where data points in the same
cluster are more similar to each other than to those in other clusters.

o Example: Grouping customers based on purchasing behavior.

Types of Data in Cluster Analysis

1. Numerical Data:

o Data that can be measured and expressed in numbers (e.g., age, income).

2. Categorical Data:

o Data that represents categories or groups (e.g., gender, color).

3. Mixed Data:

o A combination of both numerical and categorical data.

Categorization of Major Clustering Methods

1. Partitioning Methods:

o Divide the dataset into a predefined number of clusters.

o Example: K-Means Algorithm, where "K" specifies the number of clusters.

o Steps in K-Means:

1. Randomly choose K cluster centers.

2. Assign each point to the closest cluster.

3. Update cluster centers and repeat until convergence.

2. Hierarchical Methods:

o Builds a tree-like structure of clusters.

o Agglomerative (Bottom-Up): Start with individual points and gradually merge them
into larger clusters.

o Divisive (Top-Down): Start with all data in one cluster and recursively divide it into
smaller clusters.

3. Density-Based Methods:
o Clusters data points that are closely packed together, marking areas of higher density
as clusters.

o Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which can find clusters of arbitrary shapes and handle noise (outliers).

4. Grid-Based Methods:

o Divides the data space into a grid and clusters the data based on the grid structure.

o Example: STING (Statistical Information Grid) uses statistical information to cluster data within each grid cell.
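
The K-Means steps listed under Partitioning Methods above can be sketched in plain Python. The points, the fixed random seed, and the iteration cap are assumptions for the example; library implementations add better initialization and other refinements.

```python
import random
from math import dist

def k_means(points, k, iterations=100, seed=0):
    """Plain K-Means on tuples of floats; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random initial centers
    assignment = None
    for _ in range(iterations):
        # Step 2: assign each point to its closest center.
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:         # converged: no point changed cluster
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different seeds and the best clustering is kept.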

Outlier Analysis

• Outliers are data points that differ significantly from other data points.

• Outlier Detection: Identifying and removing outliers can improve the clustering results.

• Example: In a dataset of people's ages, an age of 200 would be considered an outlier.
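
One simple, widely used rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The notes do not prescribe a specific method, so the IQR rule and the sample ages below are assumptions for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return values lying more than `factor` * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]

# Hypothetical ages, including one implausible entry.
ages = [23, 25, 31, 35, 38, 42, 47, 52, 200]
outliers = iqr_outliers(ages)
print(outliers)
```

Here only the age of 200 is flagged; whether to remove, correct, or keep a flagged value depends on the application.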


Unit 1: Data Warehouse Fundamentals

• Introduction to Data Warehouse:

o A Data Warehouse is a centralized repository that stores large amounts of historical data for reporting and analysis.

o It collects data from different sources and stores it in a single location to support the decision-making process.

• Data Warehouse Types:

o Enterprise Data Warehouse (EDW): Stores the data of the entire organization in one place.

o Data Mart: A smaller version of the EDW that stores the data of a specific department or business unit.

• Characteristics of Data Warehouse:

o Subject-Oriented: The data warehouse stores data organized by subject (such as sales or customer data).

o Integrated: Data from different sources is merged into a common format.

o Time-Variant: Data is historical in nature, and the time period it covers is recorded as well.

o Non-Volatile: Once data is stored in the warehouse it is not modified or deleted; new data is only added through periodic loads.

• Advantages and Applications of Data Warehouse:

o Advantages:

▪ Improved decision-making and data analysis.

▪ Data consistency and historical data access.

▪ Fast query performance.

o Applications:

▪ Business intelligence, reporting, and trend analysis.

▪ Retail analysis, financial analysis, etc.

• Top-Down and Bottom-Up Development Methodology:

o Top-Down: Build the Enterprise Data Warehouse for the whole organization first, then derive data marts from it for specific needs.

o Bottom-Up: Start with smaller data marts and integrate them into a larger
warehouse.

• Tools for Data Warehouse Development:


o Various ETL (Extract, Transform, Load) tools are used to extract data from sources,
transform it, and load it into the warehouse (e.g., Talend, Informatica).

• OLTP Systems vs. Data Warehouse:

o OLTP (Online Transaction Processing): Handles daily operations such as order processing and customer transactions.

o Data Warehouse: Stores historical data that is used for decision-making and analytics.

• Functionality of Data Warehouse:

o Data analysis, reporting, and decision support for business management.

• OLAP Operations:

o OLAP (Online Analytical Processing): Stores data in a multidimensional structure so that complex queries can be executed efficiently.

o Operations include Drill-down, Roll-up, Slice, and Dice to analyze the data.

• Warehouse Schema:

o Common schemas include Star Schema and Snowflake Schema, which organize the
data in a way that's efficient for querying.

• Data Warehouse Architecture:

o Data warehouse architecture typically consists of the data source layer, ETL layer,
data warehouse layer, and presentation layer.

• Warehouse Server, Metadata, OLAP Engine:

o Warehouse Server: Stores the data and handles requests for analysis.

o Metadata: Data about the data; it defines the structure, location, and other details
about the data.

o OLAP Engine: Helps in running OLAP queries and performing multi-dimensional analysis.

Unit 2: Data Mining

• What is Data Mining?

o Data mining is the process of extracting useful patterns and insights from large data sets.

o Example: Analyzing customer purchase patterns in order to recommend products.

• KDD vs. Data Mining:

o KDD (Knowledge Discovery in Databases): Describes the complete process of selecting, cleaning, transforming, mining, and evaluating data.

o Data Mining: One step of KDD, the one that extracts the patterns.

• DBMS vs. Data Mining:

o DBMS (Database Management System): Stores and manages data.

o Data Mining: Analyzes data held in the DBMS and extracts useful patterns.

• Other Related Areas:

o Machine Learning: Algorithms that learn from patterns in data.

o Statistics: Mathematical tools for analyzing data.

o Artificial Intelligence (AI): Turns data mining results into intelligent decisions.

• Data Mining Techniques:

1. Classification: Assigning data to predefined categories (e.g., spam vs. non-spam).

2. Clustering: Grouping similar data points into a cluster (e.g., customer segmentation).

3. Association Rule Mining: Finding relationships between items (e.g., if a customer buys bread, they are likely to buy butter).

4. Regression: Predicting numerical values (e.g., predicting the price of a house).

5. Anomaly Detection: Identifying unusual data patterns (e.g., fraud detection).

• Other Mining Techniques:

1. Text Mining: Extracting useful information from text data (e.g., sentiment analysis).

2. Web Mining: Analyzing web data such as browsing patterns.

3. Image and Video Mining: Extracting patterns from multimedia data.

• Issues and Challenges in Data Mining:

o Data Quality Issues: Missing, noisy, or inconsistent data.

o Scalability: Handling large datasets efficiently.

o Privacy and Security: Keeping sensitive data secure.

o Interpretability: Making results understandable for non-technical users.

• Data Mining Applications (Case Studies):

o Healthcare: Predicting disease and improving treatment plans.

o Retail: Product recommendation systems and inventory optimization.

o Banking: Fraud detection and credit scoring.

o Telecommunications: Customer retention by analyzing usage patterns.

• Data Preprocessing:

o Descriptive Data Summarization: Summarizing data using statistical measures such as mean, median, and mode.

o Data Cleaning: Fixing errors, filling in missing values, and removing duplicates.

o Data Integration and Transformation: Combining data from multiple sources and converting it into a consistent format.

o Data Reduction: Reducing the volume of data without losing significant information.

o Data Discretization and Concept Hierarchy Generation: Converting continuous data into intervals or levels.

Unit 1: Classification and Prediction - Basic Concepts

• Classification:

o The process of predicting the categorical label of data.

o Example: Classifying emails as spam or non-spam.

• Prediction:

o The task of predicting continuous values.

o Example: Predicting house prices based on features like area and location.

Classification Techniques

1. Decision Tree Induction:

o A tree-like structure in which each node represents a decision and the leaf nodes give the predicted class.

o Example: Classifying animals based on attributes such as "Has feathers" or "Can fly."

2. Bayesian Classification:

o Uses probability theory (Bayes’ Theorem) to predict the class.

o Example: A spam email classifier that calculates the probability that an email is spam based on certain words.

3. Rule-Based Classification:

o Data is classified using "IF-THEN" rules.

o Example: IF age > 50 THEN risk = high (a rule for predicting heart disease).

4. Lazy Learner (Instance-Based Learning):

o Does not build a model in advance. Instead, it memorizes the training data and decides by comparing new data with the stored instances.

o Example: The K-Nearest Neighbors (KNN) algorithm, which classifies a data point based on its nearest neighbors.

Unit 2: Clustering and Applications

• Clustering:

o Clustering is the process of dividing similar data points into groups, or clusters.

o Example: Grouping customers based on their purchasing behavior.

Types of Data in Cluster Analysis

1. Numerical Data:

o Data that is measured in numbers (e.g., age, income).

2. Categorical Data:

o Data that is represented as categories (e.g., gender, color).

3. Mixed Data:

o A combination of numerical and categorical data.

Major Clustering Methods

1. Partitioning Methods:

o The data is divided into a predefined number of clusters.

o Example: K-Means Algorithm, where "K" specifies the number of clusters.

o Steps in K-Means:

1. K cluster centers are chosen at random.

2. Each point is assigned to the nearest cluster.

3. The cluster centers are updated, and the process repeats until convergence.

2. Hierarchical Methods:

o Clusters are built into a tree-like structure.

o Agglomerative (Bottom-Up): Starts with individual points and gradually merges them into clusters.

o Divisive (Top-Down): Starts with all data in one cluster and then divides it recursively.

3. Density-Based Methods:

o Groups closely packed data points into clusters, marking high-density areas as clusters.

o Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which can find clusters of arbitrary shapes and handles noise (outliers).

4. Grid-Based Methods:

o Clustering is done by dividing the data space into a grid.

o Example: STING (Statistical Information Grid), which clusters using the statistical information within each grid cell.

Outlier Analysis

• Outliers: Data points that differ significantly from the other data points.

• Outlier Detection: Identifying and removing outliers can improve clustering results.

• Example: If a dataset contains an age of 200, that value is an outlier.

Next Unit: Association Rules

Unit 3: Association Rules

• Introduction to Association Rules:

o Association rules identify patterns that describe relationships between items in large datasets.

o Example: If a customer buys bread, they are likely to buy butter.

• Methods to Discover Association Rules:

1. Apriori Algorithm:

▪ A classic algorithm that generates frequent itemsets.

▪ Steps:

1. Generate candidate itemsets.

2. Prune infrequent itemsets.

3. Repeat until no more itemsets can be generated.


2. Partition Algorithm:

▪ The data is divided into smaller partitions and frequent itemsets are found within each partition.

3. Pincer-Search Algorithm:

▪ Finds frequent itemsets by searching bottom-up and top-down at the same time.

4. FP-Tree Growth Algorithm (Frequent Pattern Tree):

▪ Uses a compact tree structure to discover frequent itemsets without candidate generation.

Generalized Association Rules

• These extend basic association rules to include hierarchical relationships.

• Example: If cold drinks → chips holds at the item level, the generalized rule at the category level could be beverages → snacks.

Association Rules with Item Constraints

• These rules satisfy specific conditions, such as a minimum/maximum number of items or the presence of particular items.

• Example: Discovering only the rules that involve expensive products or a specific brand.
