List Data Warehouse Models With Example
Enterprise Warehouse
-An enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it is cross-functional in scope. It generally
contains detailed information as well as summarized information and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
-It provides access to all data within an organization without compromising the security and
integrity of the data.
Data Mart
-A data mart is a small warehouse designed for the department level.
-A data mart includes a subset of corporate-wide data that is of value to a specific collection
of users. Its scope is confined to particular selected subjects. For example, a marketing data
mart may restrict its subjects to customers, items, and sales. The data contained in
data marts tends to be summarized.
Independent Data Mart: An independent data mart is sourced from data captured from one or
more operational systems or external data providers, or from data generated locally within a
particular department or geographic area. It is built by drawing data from operational or
external data sources, or both.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data
warehouses. They are built by drawing data from a central data warehouse.
Virtual Warehouses
-A set of views over operational databases.
-It gives a quick overview of the data. It contains metadata and connects to several data
sources with the use of middleware.
-A virtual warehouse is simple to build but requires excess capacity on the operational
database servers.
-The processing can be fast and allows users to filter the most important data from
different legacy applications.
Database vs. Data Warehouse:
Fetching methods: In a database, data is fetched from the main repository; in a data warehouse, data is fetched from the proprietary database.
Data arrangement: In a database, data is arranged and saved in the form of tables with rows and columns; in a data warehouse, data is arranged and stored in the form of data cubes.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
For example, IF a customer buys bread, THEN he is likely to buy butter as well. Association
rules are usually represented as: {Bread} -> {Butter}
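As a concrete illustration, here is a minimal Python sketch that computes the support and confidence of the rule {Bread} -> {Butter}; the transaction list is invented for this example, not taken from the text:

transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Jam"},
    {"Milk", "Butter"},
]

antecedent, consequent = {"Bread"}, {"Butter"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # fraction of all transactions containing Bread and Butter
confidence = both / ante  # of the transactions with Bread, how many also contain Butter

print(f"support = {support:.2f}")        # 0.50
print(f"confidence = {confidence:.2f}")  # 0.67

A high confidence for {Bread} -> {Butter} is what justifies reading the rule as "IF a customer buys bread, THEN he is likely to buy butter."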
1.Bitmap Indexing
- It allows quick searching in data cubes.
- The bitmap index is an alternative representation of the record ID (RID) list.
- Each attribute value is represented by a distinct bit vector.
- If an attribute's domain consists of n values, then n bits are needed for each entry
in the bitmap index.
- If the attribute value is present in a row, it is represented by a 1 in the
corresponding row of that value's bit vector, and the bits for all other values in that
row are set to 0.
- For example, given a base table with the dimensions Region and Type, its mapping to
bitmap index tables for each of these dimensions can be built as in the sketch below.
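A minimal sketch of bitmap index construction in Python; the Region/Type rows are invented here, since the original base table figure is not reproduced:

rows = [
    {"RID": 1, "Region": "Asia",   "Type": "Retail"},
    {"RID": 2, "Region": "Europe", "Type": "Online"},
    {"RID": 3, "Region": "Asia",   "Type": "Online"},
    {"RID": 4, "Region": "Europe", "Type": "Retail"},
]

def build_bitmap_index(rows, attribute):
    # One bit vector per distinct attribute value: bit i is 1 if row i
    # holds that value, 0 otherwise.
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows] for v in values}

region_index = build_bitmap_index(rows, "Region")
type_index = build_bitmap_index(rows, "Type")
print(region_index)  # {'Asia': [1, 0, 1, 0], 'Europe': [0, 1, 0, 1]}

# AND-ing bit vectors answers multi-attribute queries quickly,
# e.g. rows where Region = Asia AND Type = Online:
print([a & b for a, b in zip(region_index["Asia"], type_index["Online"])])  # [0, 0, 1, 0] -> RID 3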
Advantages
a. It represents data in single bits.
b. It is useful for saving preprocessing time.
c. It reduces storage space.
Process-
-In this (bottom-up) approach, data marts are first created. Data marts provide reporting and
analytics capability for specific business processes.
Data flows from the extraction of data from various source systems into the data marts,
which are then integrated into the data warehouse.
Advantages
-The positions of the data warehouse and the data marts are reversed in the bottom-up
design approach.
2) Explain major tasks in data preprocessing
1. Data cleaning
Data cleaning operations include the following tasks: filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies.
-Skip (ignore) the tuple that does not contain the data.
-We can fill the missing values manually.
-Use one constant value to fill the missing values.
-By calculating the mean value, we can fill in missing values with the mean (see the sketch
after this list).
-By using an outlier analysis technique, we can replace values that fall outside the expected
boundaries with boundary values.
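A hedged pandas sketch of these filling strategies; the column names and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Pune", "Mumbai", None, "Pune", "Delhi"]})

dropped = df.dropna()                                # skip tuples with missing data
constant = df.fillna({"age": 0, "city": "Unknown"})  # fill with a constant value
mean_filled = df.fillna({"age": df["age"].mean()})   # fill numeric gaps with the mean
print(mean_filled)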
2. Data integration
-In the data integration phase, we combine data from multiple sources.
-We can merge data from different sources.
-If data is redundant, i.e., a copy of the data is available in multiple sources, then we
remove such duplicate data, as in the sketch below.
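A small sketch of merging two sources and removing redundant copies; both source tables are hypothetical:

import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
source_b = pd.DataFrame({"cust_id": [2, 3], "name": ["Ravi", "Meena"]})

# Combine the two sources, then drop the rows duplicated across them.
combined = pd.concat([source_a, source_b]).drop_duplicates(subset="cust_id")
print(combined)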
3. Data reduction
-We can reduce data in three ways:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
-In dimensionality reduction, we reduce the number of attributes under consideration;
identical or redundant attributes can easily be removed by this method.
-By using a numerosity reduction technique, we can represent a large data volume with a
smaller data set.
-Data compression techniques are used to store a large amount of data in a small amount of
memory.
-By using sampling and aggregation methods, we can reduce a large amount of data into small
pieces, as in the sketch below.
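A small sketch of numerosity reduction via sampling and aggregation; the sales table is invented for illustration:

import pandas as pd

sales = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
                      "units": [10, 12, 8, 9, 15, 14]})

sample = sales.sample(frac=0.5, random_state=0)              # keep a random 50% sample
monthly = sales.groupby("month", sort=False)["units"].sum()  # one summarized row per month
print(monthly)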
4. Data transformation
-In data transformation, we can represent our data in multiple forms.
-We can represent our data in different charts so that users can easily understand it.
-We can group related data into clusters to improve readability.
-We can normalize the data into different ranges, as in the sketch below.
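A sketch of min-max normalization, which rescales values into the [0, 1] range; the input values are hypothetical:

values = [200, 400, 800, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.25, 0.75, 1.0]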
5. Data Discretization
-Part of data reduction; it replaces numerical attributes with nominal ones, e.g., by binning
numeric values into labeled ranges, as in the sketch below.
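A sketch of discretization by binning: numeric ages are replaced with nominal labels (the bin edges here are arbitrary choices for illustration):

ages = [5, 17, 23, 41, 58, 72]

def to_bin(age):
    if age < 18:
        return "minor"
    elif age < 60:
        return "adult"
    return "senior"

print([to_bin(a) for a in ages])  # ['minor', 'minor', 'adult', 'adult', 'adult', 'senior']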
The KDD process in data mining is a multi-step process that involves various stages to
extract useful knowledge from large datasets.
The following are the main steps involved in the KDD process -
Data Selection - The first step in the KDD process is identifying and selecting the relevant
data for analysis. This involves choosing the relevant data sources, such as databases, data
warehouses, and data streams, and determining which data is required for the analysis.
Data Preprocessing - After selecting the data, the next step is data preprocessing. This step
involves cleaning the data, removing outliers, and removing missing, inconsistent, or
irrelevant data.
Data Transformation - Once the data is preprocessed, the next step is to transform it into a
format that data mining techniques can analyze.
Data Mining - This is the heart of the KDD process and involves applying various data mining
techniques to the transformed data to discover hidden patterns, trends, relationships, and
insights. A few of the most common data mining techniques include clustering,
classification, association rule mining, and anomaly detection.
Pattern Evaluation - After the data mining, the next step is to evaluate the discovered
patterns to determine their usefulness and relevance.
Knowledge Representation - This step involves representing the knowledge extracted from
the data in a way humans can easily understand and use.
Deployment - The final step in the KDD process is to deploy the knowledge and insights
gained from the data mining process to practical applications.
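A hedged end-to-end sketch of these steps in Python; scikit-learn is assumed to be available, and its built-in iris dataset stands in for whatever data a real project would select:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data                          # selection: pick the relevant data
X_scaled = StandardScaler().fit_transform(X)  # preprocessing / transformation
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)  # mining: clustering
score = silhouette_score(X_scaled, model.labels_)  # pattern evaluation
print(f"silhouette score: {score:.2f}")            # representation / reporting of the result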
Schema:-
-A schema is a logical description of the entire database. It includes the name and description
of records of all record types, including all associated data items and aggregates. Much like
a database, a data warehouse also needs to maintain a schema. A database uses the
relational model, while a data warehouse uses a Star, Snowflake, or Fact Constellation
schema.
Types of Schema
1.Star Schema
2.Snowflake Schema
3.Fact Constellation Schema
Star Schema
-The star schema is the most popular schema design for a data warehouse.
-In a star schema there is one fact table in the middle and a number of associated
dimension tables. This structure resembles a star, and hence it is known as a star schema.
-The primary key present in each dimension table is related to a foreign key present in
the fact table.
-The size of the fact table is large compared to the dimension tables.
Figure – General structure of Star Schema
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
-There is a fact table at the center. It contains the keys to each of four dimensions.
-The fact table also contains the attributes, namely dollars sold and units sold.
1.Simplest DW Schema
2.Easy to understand
3.Most suitable for query processing
4.It is a fully denormalized schema
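A small sketch of how a star-schema query resolves: the fact table's foreign keys join outward to the dimension tables. The tables below are invented, mirroring the time/item example above (only two of the four dimensions are shown for brevity):

import pandas as pd

fact_sales = pd.DataFrame({"time_key": [1, 2], "item_key": [10, 11],
                           "dollars_sold": [500.0, 320.0], "units_sold": [5, 4]})
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["Pen", "Book"]})

# The star join: the fact table joined to each dimension it references.
report = fact_sales.merge(dim_time, on="time_key").merge(dim_item, on="item_key")
print(report[["quarter", "item_name", "dollars_sold", "units_sold"]])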
Snowflake Schema
-In a snowflake schema, the fact table is still located at the center of the schema,
surrounded by the dimension tables. However, each dimension table is further
broken down into multiple related tables.
-For example, the item dimension table of the star schema is normalized and split into two
dimension tables: an item table and a supplier table. The supplier key is linked to the
supplier dimension table. The supplier dimension table contains the attributes
supplier_key and supplier_type.
-Due to normalization in the snowflake schema, redundancy is reduced; therefore, it
becomes easy to maintain and saves storage space.
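A sketch of the normalization the snowflake schema performs: the denormalized item dimension is split so that each supplier's details are stored only once (the rows are invented):

import pandas as pd

# Star-schema item dimension: supplier_type is repeated for every item.
item_star = pd.DataFrame({"item_key": [10, 11, 12],
                          "item_name": ["Pen", "Book", "Ink"],
                          "supplier_key": [1, 1, 2],
                          "supplier_type": ["Wholesale", "Wholesale", "Retail"]})

# Snowflake: split into an item table and a supplier table.
item_dim = item_star[["item_key", "item_name", "supplier_key"]]
supplier_dim = item_star[["supplier_key", "supplier_type"]].drop_duplicates()
print(supplier_dim)  # each supplier stored once, reducing redundancy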
A fact constellation has multiple fact tables. It is also known as galaxy schema.
A Fact constellation means two or more fact tables sharing one or more dimensions.
Advantage: Provides a flexible schema.
Disadvantage: It is much more complex and hence, hard to implement and maintain.
Consider the following example set of transactions:
TID  Items
1    1 3 4
2    2 3 5
3    1 2 3 5
4    2 5
● Step 1 – Find frequent individual items:
○ First, Apriori scans the dataset and counts how often each individual item occurs.
● Step 2 – Generate candidate itemsets:
○ Then, it combines these frequent items to create sets of two or more items.
These sets are called “itemsets”.
● Step 3 – Count support for candidate itemsets:
○ Next, it counts how often each of these itemsets appears in the dataset.
● Step 4 – Eliminate infrequent itemsets:
○ It removes itemsets that don’t meet a certain threshold of frequency, known as
the “support threshold”. This threshold is set by the user.
● Repeat Steps 2–4:
○ The process is repeated, creating larger and larger itemsets, until no more can
be made.
● Find associations:
Finally, Apriori uses the frequent itemsets to find associations. For example, if “bread”
and “milk” are often bought together, it will identify this as an association.
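A compact Python sketch of the Apriori loop described above, run on the example transactions from the table; the support threshold of 2 is an arbitrary choice:

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support = 2  # minimum number of transactions an itemset must appear in

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent individual items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", sorted(sorted(s) for s in frequent))
    # Steps 2-4: combine frequent itemsets, count support, drop infrequent ones.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1

With min_support = 2, this prints {1}, {2}, {3}, {5} as frequent 1-itemsets, then {1,3}, {2,3}, {2,5}, {3,5}, and finally {2,3,5}, matching the classic worked example for this transaction table.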