
1. List Data warehouse models with example

1) Enterprise Warehouse
2) Data Mart
3) Virtual Warehouses

2. Define OLAP with Example


OLAP (Online Analytical Processing) is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data, transformed from raw information to reflect the real dimensionality of the enterprise as understood by the user. For example, the same sales cube can be viewed by product, by region, or by time period.

3. Define Data warehouse and its uses

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process. Its uses include:
1) Data mining
2) Business intelligence
3) Data integration
4) Historical data storage

4. Define Term Data Mining


Data mining is the process of extracting information from huge sets of data to identify patterns, trends, and useful insights that allow a business to make data-driven decisions.

5. Explain Rollup OLAP operation


Roll-up performs aggregation on a data cube in either of the following ways:
-By climbing up a concept hierarchy for a dimension
-By dimension reduction
For example, roll-up can be performed by climbing up the concept hierarchy for the dimension location. Initially, the concept hierarchy is "street < city < province < country".
-On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
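
A minimal sketch of such a roll-up from city to country using pandas (the table and figures are made up for illustration):

import pandas as pd

# Hypothetical sales recorded at the city level
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city": ["Toronto", "Vancouver", "Chicago", "New York"],
    "units_sold": [605, 825, 440, 1560],
})

# Roll-up: aggregate from the city level to the country level
rollup = sales.groupby("country", as_index=False)["units_sold"].sum()
print(rollup)  # one row per country with total units_sold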

6. Define Data Cube


When data is grouped or combined into multidimensional matrices, the result is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
4 marks
1) Explain any 2 data cleaning methods
 Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
 Fill the missing value: This approach is also not very effective or feasible, and it can be time-consuming. In this approach, one has to fill in the missing value. This is usually done manually, but it can also be done using the attribute mean or the most probable value.
 Binning method: This approach is simple to understand. The sorted data is divided into several segments (bins) of equal size, and each value is then smoothed using the values around it, for example by replacing it with its bin's mean (see the sketch after this list).
 Regression: The data is smoothed with the help of a regression function. The regression can be linear or multiple: linear regression has one independent variable, while multiple regression has more than one.
 Clustering: This method operates on groups. Clustering arranges similar values into a "group" or "cluster", and outliers are then detected as values that fall outside the clusters.
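
A minimal sketch of smoothing by bin means with equal-size bins (the values are made up for illustration):

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4
smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]            # one equal-size bin
    mean = sum(bin_vals) / len(bin_vals)       # bin mean
    smoothed.extend([round(mean, 1)] * len(bin_vals))
print(smoothed)  # every value replaced by its bin's mean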

2) Explain need of OLAP


Need of OLAP: OLAP is needed for business data analysis and decision making.
OLAP can be used/needed:
1) To support a multidimensional view of data.
2) To provide fast and steady access to various views of information.
3) To process complex queries.
4) To analyze the information.
5) To pre-calculate and pre-aggregate the data.
6) For all types of business activities, including planning, budgeting, reporting, and analysis.
7) To quickly create and analyze "what if" scenarios.
8) To easily search an OLAP database for broad or specific terms.
9) To provide the building blocks for business modelling tools, data mining tools, and performance reporting tools.
10) To slice and dice cube data by various dimensions, measures, and filters.
11) To analyze time series data.
12) To find clusters and outliers.
13) To provide an online analytical processing system with faster response times.
3) Explain concept of snowflake schema

The snowflake schema acts like an extended version of a star schema.

-In a snowflake schema, the fact table is still located at the center of the schema, surrounded by the dimension tables. However, each dimension table is further broken down into multiple related tables.

Figure – General structure of Snowflake Schema

-Some dimension tables in the snowflake schema are normalized.

-Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Due to normalization in the snowflake schema, redundancy is reduced; therefore, the schema becomes easier to maintain and saves storage space.
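
A minimal sketch of the normalized item/supplier split described above, using in-memory SQLite from Python (table and column names follow the example):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalized dimension: supplier is split out of item
CREATE TABLE supplier (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_type TEXT
);
CREATE TABLE item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
""")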

Advantages of Snowflake Schema

1. It provides structured data, which reduces the problem of data integrity.
2. It uses less disk space because the data are highly structured.

Disadvantages of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

4) Explain Cluster Analysis

A cluster is a group of objects that belong to the same class.
-In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
-Clustering is a data mining technique used to place data elements into related groups without advance knowledge of the group definitions.
-Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
-The set of clusters resulting from a cluster analysis can be referred to as a clustering.
-A cluster contains data objects that have high intra-cluster similarity and low inter-cluster similarity.
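
A minimal sketch of clustering with k-means, assuming scikit-learn is available (the 2-D points are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Two visually separated groups of points
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centroid of each cluster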

5) Describe Business Framework of data warehouse design

-A data warehouse presents relevant information to the user.
-A data warehouse can enhance business productivity.
-It gathers information quickly and describes it accurately to the organization.
-A data warehouse provides a consistent view to the user.
-It saves the time and money of IT industries in their business analysis process.
The business analysis framework has the following views:
A. Top-down view
This view allows the selection of the relevant information needed for a data warehouse.
B. Data source view
This view presents the information being captured, stored, and managed by the operational systems.
C. Data warehouse view
This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
D. Business query view
It is the view of the data from the viewpoint of the end user.

6) Data warehouse models

Enterprise Warehouse

-An enterprise warehouse collects all of the records about subjects spanning the entire organization. It supports corporate-wide data integration, usually from one or more operational systems or external data providers, and it is cross-functional in scope. It generally contains detailed as well as summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

-It supports reporting, analysis, and planning.

-It provides access to all data within an organization without compromising the security and integrity of the data.

Data Mart
-A data mart is a small warehouse designed for the department level.

-A data mart includes a subset of corporate-wide data that is of value to a specific collection of users. The scope is confined to particular selected subjects. For example, a marketing data mart may restrict its subjects to customers, items, and sales. The data contained in data marts tends to be summarized.

Data marts are divided into two types:

Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area. It is built by drawing data from operational or external data sources, or both.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses. They are built by drawing data from a central data warehouse.

Virtual Warehouses
-A virtual warehouse is a set of views over operational databases.
-It gives a quick overview of the data. It holds metadata and connects to several data sources with the use of middleware.
-A virtual warehouse is simple to build but requires excess capacity on the operational database servers.
-The processing can be fast and allows users to filter the most important data from different legacy applications.
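
A minimal sketch of a virtual warehouse as a view over an operational table, using in-memory SQLite from Python (all names and figures are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'East', 120.0), (2, 'West', 80.0),
                          (3, 'East', 45.5);
-- The view aggregates operational data on demand; nothing is copied.
CREATE VIEW sales_summary AS
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region;
""")
print(con.execute("SELECT * FROM sales_summary").fetchall())
# e.g. [('East', 165.5), ('West', 80.0)]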

7) Describe data cleaning process

Data Cleaning:
 Data cleaning is a crucial process in data mining. It plays an important part in the building of a model.
 Data cleaning can be regarded as a necessary process, but it is often neglected.
 Data quality is the main issue in quality information management. Data quality problems occur anywhere in information systems, and these problems are solved by data cleaning.
 Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.
Methods
 Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
 Fill the missing value: This approach is also not very effective or feasible, and it can be time-consuming. In this approach, one has to fill in the missing value. This is usually done manually, but it can also be done using the attribute mean or the most probable value (see the sketch below).
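
A minimal sketch of filling missing values with the attribute mean, assuming pandas is available (the column and values are made up):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40]})
df["age"] = df["age"].fillna(df["age"].mean())  # mean of the known values
print(df)  # both missing ages become 32.0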

8) Compare ROLAP and MOLAP

Basis for comparison | ROLAP | MOLAP
Acronym | Relational online analytical processing | Multi-dimensional online analytical processing
Storage method | Data is stored in the main data warehouse (relational database) | Data is stored in a proprietary multidimensional database (MDDB)
Fetching method | Data is fetched from the main repository | Data is fetched from the proprietary database
Data arrangement | Data is arranged and saved as tables with rows and columns | Data is arranged and stored as data cubes
Volume | Enormous volumes of data can be processed | Only the limited data kept in the proprietary database is processed
Technique | It works with SQL | It works with sparse matrix technology
Designed view | It has dynamic access | It has static access
Response time | It has maximum (slower) response time | It has minimum (faster) response time

9) Explain Market basket analysis with example

Market basket analysis is a modelling technique, also called affinity analysis, that helps identify which items are likely to be purchased together.
-In simple terms, market basket analysis in data mining analyzes the combinations of products that have been bought together.
-This type of analysis helps retailers develop different strategies for their business.
-Business analysts decide strategies around the items frequently purchased together by customers.
For example, in the winter season many customers purchase woollen clothes and body moisturizing creams together. So a business analyst suggests to the owner of a supermarket or shopping mall, such as D'Mart or Big Bazar, to give a discount to customers who purchase both items together. Likewise, if customers buy bread and butter and see a discount or an offer on eggs, they will be encouraged to spend more and buy the eggs. This is what market basket analysis is all about.

Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.

IF means Antecedent: an antecedent is an item found within the data.

THEN means Consequent: a consequent is an item found in combination with the antecedent.

For example, IF a customer buys bread, THEN he is likely to buy butter as well. Association rules are usually represented as: {Bread} -> {Butter}
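
A minimal sketch of evaluating the rule {Bread} -> {Butter} by counting support and confidence over made-up transactions:

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)
print("support =", both / n)         # 2/4 = 0.5
print("confidence =", both / bread)  # 2/3: butter usually follows bread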

10) Explain OLAP Data indexing with types

1. Bitmap Indexing
- It allows quick searching in data cubes.
- The bitmap index is an alternative representation of the record ID (RID) list.
- Each distinct attribute value is represented by a bit vector.
- If an attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index.
- If an attribute value is present in a row, it is represented by 1 in the corresponding row of the bitmap index, and all other bits for that row are set to 0.
- For example, a base table with dimensions Region and Type maps to one bitmap index table per dimension.

Advantages
a. It represents data as single bits.
b. It is useful for saving preprocessing time.
c. It reduces the storage space required.
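
A minimal sketch of building a bitmap index on a Region column (the rows are made up for illustration):

rows = ["Asia", "Europe", "Asia", "America"]
bitmap = {}
for i, value in enumerate(rows):
    bits = bitmap.setdefault(value, [0] * len(rows))
    bits[i] = 1  # bit i set: this value appears in row i
print(bitmap)
# {'Asia': [1, 0, 1, 0], 'Europe': [0, 1, 0, 0], 'America': [0, 0, 0, 1]}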

2. Join Index Method

-It is useful for RDBMS data queries.
-It joins two related relations from an RDBMS.
-It joins foreign keys with their matching primary keys.
-It is useful to derive subcubes from data cubes.
-It maintains the relationship between the attribute values of a dimension and the corresponding rows of the table.
-In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
-For example, for a star schema containing a fact table sales and two dimensions city and product, a join index on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales in that city.
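
A minimal sketch of a join index relating the city dimension to row IDs (RIDs) of a sales fact table (the data is illustrative):

fact_sales = [
    {"rid": "R57", "city": "Mumbai", "amount": 100},
    {"rid": "R238", "city": "Pune", "amount": 50},
    {"rid": "R884", "city": "Mumbai", "amount": 75},
]
join_index = {}
for row in fact_sales:
    join_index.setdefault(row["city"], []).append(row["rid"])
print(join_index)  # {'Mumbai': ['R57', 'R884'], 'Pune': ['R238']}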
6 marks
1) Explain Top down and Bottom up approach

1. "Top-down" approach

-This approach was proposed by Bill Inmon.

-In this approach, the data warehouse is built first, and then data marts are built on top of the data warehouse.

Process-

-First, data is extracted from the various sources or systems.
-The extracted data is loaded and validated in the staging area.
-ETL tools are used to check the accuracy and correctness of the data.
-We can apply various techniques, such as summarization and aggregation of the data, and then it is loaded into the data warehouse.

Advantages

1. It provides consistent dimensional data views.
2. It is a robust approach: we can easily add a new data mart.
3. Data marts are loaded from the data warehouse.
4. Developing a new data mart from the data warehouse is very easy.

Disadvantages

1. It is inflexible: it is hard to adapt to changing departmental needs.
2. The cost of implementing the project is high.
2."bottom-up" approach

Process-

-this approach Data Marts are first created .data marts provide reporting and
analytics capability for specific business approach.
The data flow from extraction of data from various source system into

stage area data marts refreshed the current data

Advantages

1.Documents can be generated quickly.


2.This data warehouse extended easily.
3 It contains consistent data marts
4. Data marts are delivered quickly
Disadvantages

-the locations of the data warehouse and the data marts are reversed in the bottom-
up approach design.
2) Explain major task in data preprocessing

1. Data cleaning

Data cleaning operations involve the following tasks: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
-Skip the tuples that do not contain the required data.
-We can fill in the missing values manually.
-We can use one constant value to fill in the missing values.
-We can fill in missing values with the mean calculated from the known values.
-By using outlier analysis techniques, we can assign boundary values to out-of-range values.

2. Data integration
-In the data integration phase we combine data from multiple sources.
-We can merge data from different sources.
-If data is redundant, that is, if copies of the same data are available in multiple sources, then we remove such duplicates.

3. Data reduction
-We can reduce data in three ways:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
-With dimensionality reduction, we can divide the data into a number of pieces and easily remove the identical or redundant data from those pieces.
-With numerosity reduction, we can represent a large volume of data with a smaller data set.
-Data compression techniques are used to store a large amount of data in a small amount of memory.
-By using sampling and aggregation methods, we can reduce a large amount of data into smaller pieces.

4. Data transformation
-In data transformation we can represent our data in multiple forms.
-We can represent our data in different charts so that users can easily understand it.
-We can group related data into clusters to improve readability.
-We can normalize the data into different ranges (see the sketch after the data discretization section below).

5. Data discretization
-It is part of data reduction: numerical attributes are replaced with nominal ones.

-This involves dividing continuous data into discrete categories or intervals.

Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal-width binning, equal-frequency binning, and clustering.
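
A minimal sketch of two of the transformations above, min-max normalization to [0, 1] and equal-width binning (the values are made up):

values = [12, 35, 47, 58, 90]
lo, hi = min(values), max(values)

# Min-max normalization into the range [0, 1]
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)

# Equal-width discretization into 3 intervals
width = (hi - lo) / 3
bins = [min(int((v - lo) / width), 2) for v in values]
print(bins)  # interval label (0, 1, or 2) for each value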

3) Describe KDD process

The KDD (Knowledge Discovery in Databases) process in data mining is a multi-step process that involves various stages to extract useful knowledge from large datasets.

The following are the main steps involved in the KDD process -

Data Selection - The first step in the KDD process is identifying and selecting the relevant
data for analysis. This involves choosing the relevant data sources, such as databases, data
warehouses, and data streams, and determining which data is required for the analysis.
Data Preprocessing - After selecting the data, the next step is data preprocessing. This step
involves cleaning the data, removing outliers, and removing missing, inconsistent, or
irrelevant data.
Data Transformation - Once the data is preprocessed, the next step is to transform it into a
format that data mining techniques can analyze.
Data Mining - This is the heart of the KDD process and involves applying various data mining
techniques to the transformed data to discover hidden patterns, trends, relationships, and
insights. A few of the most common data mining techniques include clustering,
classification, association rule mining, and anomaly detection.
Pattern Evaluation - After the data mining, the next step is to evaluate the discovered
patterns to determine their usefulness and relevance.
Knowledge Representation - This step involves representing the knowledge extracted from
the data in a way humans can easily understand and use.
Deployment - The final step in the KDD process is to deploy the knowledge and insights gained from the data mining process to practical applications.

4) List and Explain Schemas in DW modelling

Schema:-
-A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses a star, snowflake, or fact constellation schema.

Types of Schema
1.Star Schema
2.Snowflake Schema
3.Fact Constellation Schema

Star Schema
-The star schema is the most popular schema design for a data warehouse.
-In a star schema there is one fact table in the middle and a number of associated dimension tables. The structure resembles a star, and hence it is known as a star schema.
-The primary key in each dimension table is related to a foreign key in the fact table.
-The size of the fact table is large compared to the dimension tables.
Figure – General structure of Star Schema
For example, consider the sales data of a company with respect to four dimensions, namely time, item, branch, and location.

-There is a fact table at the center. It contains the keys to each of the four dimensions.

-The fact table also contains the measures, namely dollars sold and units sold.
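
A minimal sketch of this star schema in SQL, run through in-memory SQLite from Python (only the time and item dimensions are written out; branch and location would follow the same pattern):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- Fact table in the middle: foreign keys to each dimension plus measures
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")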

Advantages of Star Schema-

1. It is the simplest DW schema.
2. It is easy to understand.
3. It is most suitable for query processing.
4. It is a fully denormalized schema.

Disadvantages of Star Schema-

1. Data redundancy: a star schema can result in data redundancy, as the same data may be stored in multiple places in the schema. This can lead to data inconsistencies and difficulties in maintaining data accuracy.
2. Increased costs: adding redundant data increases computing and storage costs. This can be especially troubling when handling large datasets.

Snowflake Schema

-The snowflake schema acts like an extended version of a star schema.

-In a snowflake schema, the fact table is still located at the center of the schema, surrounded by the dimension tables. However, each dimension table is further broken down into multiple related tables.

Figure – General structure of Snowflake Schema

-Some dimension tables in the snowflake schema are normalized.

-Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Due to normalization in the snowflake schema, redundancy is reduced; therefore, the schema becomes easier to maintain and saves storage space.

Advantages of Snowflake Schema

1. It provides structured data, which reduces the problem of data integrity.
2. It uses less disk space because the data are highly structured.

Disadvantages of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

Fact Constellation Schema

A fact constellation has multiple fact tables. It is also known as a galaxy schema.
A fact constellation means two or more fact tables sharing one or more dimension tables.
Advantage: It provides a flexible schema.

Disadvantage: It is much more complex and hence hard to implement and maintain.

5) Explain Apriori Algorithm

R. Agrawal and R. Srikant are the creators of the Apriori algorithm. They created it in 1994 for mining frequent itemsets with Boolean association rules.
The algorithm has found great use in performing market basket analysis, allowing businesses to sell their products more effectively.
This algorithm uses two steps, "join" and "prune", to reduce the search space.
-The use of this algorithm is not just for market basket analysis; various fields, like healthcare and education, also use it.
It uses knowledge of frequent itemset properties.
The main principle of this algorithm (the Apriori property) is: "All subsets of a frequent itemset must be frequent."
It searches for itemsets level-wise.
How does the Apriori algorithm work?
Consider the following example dataset of four transactions:

TID | Items
1   | 1, 3, 4
2   | 2, 3, 5
3   | 1, 2, 3, 5
4   | 2, 5

● Step 1 – Find frequent items:
○ It starts by identifying individual items (like products in a store) that appear frequently in the dataset.
● Step 2 – Generate candidate itemsets:
○ Then, it combines these frequent items to create sets of two or more items. These sets are called "itemsets".
● Step 3 – Count support for candidate itemsets:
○ Next, it counts how often each of these itemsets appears in the dataset.
● Step 4 – Eliminate infrequent itemsets:
○ It removes itemsets that don't meet a certain threshold of frequency, known as the "support threshold". This threshold is set by the user.
● Repeat Steps 2-4:
○ The process is repeated, creating larger and larger itemsets, until no more can be made.
● Find associations:
○ Finally, Apriori uses the frequent itemsets to find associations. For example, if "bread" and "milk" are often bought together, it will identify this as an association.
