
DATA WAREHOUSING CCS341

UNIT WISE QUESTIONS AND ANSWERS


UNIT1
PART A
1. Define data warehouse?
"A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection
of data in support of management's decision making process".
2. What are the benefits of data warehouse?
Tangible benefits (quantified / measurable) include:
 Improvement in product inventory
 Decrement in production cost
 Improvement in selection of target markets
 Enhancement in asset and liability management
Intangible benefits (not easy to quantify) include:
 Improvement in productivity by keeping all data in a single location and eliminating rekeying of data
 Reduced redundant processing
 Enhanced customer relations
 Enabling business process reengineering
3. Define Data heterogeneity? It refers to the different nature of the underlying DBMSs: they may use different data models, different access languages, and different data navigation methods, operations, concurrency, integrity, and recovery processes.
4. Define metadata?
Metadata is data about data. It is used for maintaining, managing and using the data
warehouse.
5. Define snowflake? Snowflake is a data warehouse built for the cloud. It centralizes data from multiple sources, enabling you to derive in-depth business insights that power your teams. At its core, Snowflake is designed to handle structured and semi-structured data from various sources, allowing organizations to integrate and analyze data from diverse systems seamlessly.
6. List the differences between traditional vs modern data warehouse.
7. What are the functions of ETL? Extraction, Transformation, and Loading (ETL) tools perform a triple function: extracting data from various sources, transforming it into an appropriate format, and loading it into the target database.
8. What are gateways in data warehouse? The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
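A minimal sketch (not from the notes) of how a client program might use an ODBC gateway from Python, assuming the pyodbc package is installed and that a DSN named warehouse_dsn, a report_user account, and a sales_fact table exist; all of these names are illustrative.

import pyodbc

# The gateway (ODBC driver) lets the client send SQL text that is executed at the server.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=report_user;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT region, SUM(sales_amount) FROM sales_fact GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
conn.close()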
9. What are OLAP tools? OLAP tools are used to analyze the data in multi-dimensional and complex views. To enable multidimensional properties, they use MDDB and MRDB, where MDDB refers to a multi-dimensional database and MRDB refers to a multi-relational database.
10. Define autonomous data warehouse? Autonomous Data Warehouse uses applied machine learning to self-tune and automatically optimize performance while the database is running. It is built on next-generation Autonomous Database technology, using artificial intelligence to deliver unprecedented reliability, performance, and highly elastic data management, enabling data warehouse deployment in seconds.
11. List the uses of data warehouse? Data warehouse usage includes:
 Locating the right information
 Presentation of information
 Testing of hypothesis
 Discovery of information
 Sharing the analysis

UNIT I
PART -B
1. Explain the 3-tier data warehouse architecture and its various components. Evaluate

Three-tier Data Warehouse Architecture

The three-tier approach is the most widely used architecture for data warehouse systems.

Essentially, it consists of three tiers:

1. The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of the database. It
arranges the data to make it more suitable for analysis. This is done with an OLAP
server, implemented using the ROLAP or MOLAP model.
3. The top tier is where the user accesses and interacts with the data. It represents the
front-end client layer. You can use reporting, query, analysis, or data mining tools.
Data Warehouse Components

From the architectures outlined above, you notice some components overlap, while others are
unique to the number of tiers.

Below you will find some of the most important data warehouse components and their roles
in the system.

ETL Tools

ETL stands for Extract, Transform, and Load. The staging layer uses ETL tools to extract
the needed data from various formats and checks the quality before loading it into the data
warehouse.

The data coming from the data source layer can come in a variety of formats. Before merging
all the data collected from multiple sources into a single database, the system must clean and
organize the information.
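To make the Extract, Transform, and Load steps concrete, here is a minimal Python sketch, assuming a hypothetical orders.csv source file with order_id, customer, and amount columns, and a SQLite database standing in for the warehouse staging table.

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: drop incomplete records and standardize values before loading.
    cleaned = []
    for row in rows:
        if not row["order_id"]:
            continue
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),
            round(float(row["amount"]), 2),
        ))
    return cleaned

def load(rows, db_path="warehouse.db"):
    # Load: write the cleaned rows into the target table.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

load(transform(extract("orders.csv")))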
The Database

The most crucial component and the heart of each architecture is the database. The
warehouse is where the data is stored and accessed.

When creating the data warehouse system, you first need to decide what kind of database you
want to use.

There are four types of databases you can choose from:

1. Relational databases (row-centered databases).


2. Analytics databases (developed to sustain and manage analytics).
3. Data warehouse applications (software for data management and hardware for
storing data offered by third-party dealers).
4. Cloud-based databases (hosted on the cloud).

Data

Once the system cleans and organizes the data, it stores it in the data warehouse. The data
warehouse represents the central repository that stores metadata, summary data, and raw
data coming from each source.

 Metadata is the information that defines the data. Its primary role is to simplify
working with data instances. It allows data analysts to classify, locate, and direct
queries to the required data.
 Summary data is generated by the warehouse manager. It updates as new data loads
into the warehouse. This component can include lightly or highly summarized data.
Its main role is to speed up query performance.
 Raw data is the actual data loading into the repository, which has not been processed.
Having the data in its raw form makes it accessible for further processing and
analysis.
Access Tools

Users interact with the gathered information through different tools and technologies. They
can analyze the data, gather insight, and create reports.

Some of the tools used include:

 Reporting tools. They play a crucial role in understanding how your business is
doing and what should be done next. Reporting tools include visualizations such as
graphs and charts showing how data changes over time.
 OLAP tools. Online analytical processing tools which allow users to analyze
multidimensional data from multiple perspectives. These tools provide fast processing
and valuable analysis. They extract data from numerous relational data sets and
reorganize it into a multidimensional format.
 Data mining tools. Examine data sets to find patterns within the warehouse and the
correlation between them. Data mining also helps establish relationships when
analyzing multidimensional data.

Data Marts

Data marts allow you to have multiple groups within the system by segmenting the data in the warehouse into categories. It partitions data, producing it for a particular user group. For instance, you can use data marts to categorize information by departments within the company.

Data Warehouse Best Practices

Designing a data warehouse relies on understanding the business logic of your individual use
case.

The requirements vary, but there are data warehouse best practices you should follow:
 Create a data model. Start by identifying the organization’s business logic. Understand what data is vital to the organization and how it will flow through the data warehouse.
 Opt for a well-known data warehouse architecture standard. A data model provides a framework and a set of best practices to follow when designing the architecture or troubleshooting issues. Popular architecture standards include 3NF, Data Vault modeling and star schema (a small star-schema sketch follows this list).
 Create a data flow diagram. Document how data flows through the system. Know how that relates to your requirements and business logic.
 Have a single source of truth. When dealing with so much data, an organization has to have a single source of truth. Consolidate data into a single repository.
 Use automation. Automation tools help when dealing with vast amounts of data.
 Allow metadata sharing. Design an architecture that facilitates metadata sharing between data warehouse components.
 Enforce coding standards. Coding standards ensure system efficiency.
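As referenced in the list above, here is a small star-schema sketch. It uses SQLAlchemy and an in-memory SQLite database purely as assumptions of this example; the fact and dimension names are illustrative.

from sqlalchemy import (MetaData, Table, Column, Integer, String,
                        Numeric, Date, ForeignKey, create_engine)

metadata = MetaData()

# Two dimension tables...
dim_date = Table("dim_date", metadata,
    Column("date_key", Integer, primary_key=True),
    Column("calendar_date", Date),
    Column("quarter", String(2)))

dim_product = Table("dim_product", metadata,
    Column("product_key", Integer, primary_key=True),
    Column("product_name", String(50)),
    Column("category", String(30)))

# ...and one fact table referencing them, which is the star-schema shape.
fact_sales = Table("fact_sales", metadata,
    Column("date_key", Integer, ForeignKey("dim_date.date_key")),
    Column("product_key", Integer, ForeignKey("dim_product.product_key")),
    Column("units_sold", Integer),
    Column("sales_amount", Numeric(12, 2)))

# Generate the physical tables (here in an in-memory SQLite database).
engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)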

2. Differentiate Operational database versus data warehouse. Understand

A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. It collects data from multiple sources, transforms the data using the ETL process, and then loads it into the data warehouse for business use.
An operational database, on the other hand, is a database where the data changes frequently. Operational databases are mainly designed for a high volume of data transactions. They are the source databases for the data warehouse. Operational databases are used for recording online transactions and maintaining integrity in multi-access environments.

What is a Data Warehouse?

A Data Warehouse is a system that is used by the users or knowledge managers for data
analysis and decision-making. It can construct and present the data in a certain structure to
fulfill the diverse requirements of several users. Data warehouses are also known as Online
Analytical Processing (OLAP) Systems.
In a data warehouse or OLAP system, the data is saved in a format that allows the effective creation of data mining documents. The data structure in a data warehouse has a denormalized schema. Performance-wise, data warehouses are quite fast when it comes to analyzing queries.
Data warehouse systems integrate several application systems. They then provide data processing by supporting a solid platform of consolidated historical data for analysis.
What is an Operational Database?

The type of database system that stores information related to operations of an enterprise is
referred to as an operational database. Operational databases are required for functional
lines like marketing, employee relations, customer service etc. Operational databases are
basically the sources of data for the data warehouses because they contain detailed data
required for the normal operations of the business.
In an operational database, the data changes as updates are made, and it shows the latest value of the last transaction. Operational databases are also known as OLTP (Online Transaction Processing) databases. These databases are used to manage dynamic data in real-time.

Difference between Data Warehouse and Operational Database

The following are the important differences between a data warehouse and an operational database:

 Basic: A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose, whereas an operational database is one where the data changes frequently.
 Data Structure: A data warehouse has a de-normalized schema; an operational database has a normalized schema.
 Performance: A data warehouse is fast for analysis queries; an operational database is slow for analytics queries.
 Type of Data: A data warehouse focuses on historical data; an operational database focuses on current transactional data.
 Use Case: A data warehouse is used for OLAP; an operational database is used for OLTP.

3. Illustrate the various data warehouse components. Understand


A data warehouse (DW) is a digital storage system that connects and harmonizes large
amounts of data from many different sources. Its purpose is to feed business intelligence (BI),
reporting, and analytics, and support regulatory requirements – so companies can turn their
data into insight and make smart, data-driven decisions. Data warehouses store current and
historical data in one place and act as the single source of truth for an organization.
A typical data warehouse has four main components: a central database, ETL (extract,
transform, load) tools, metadata, and access tools. All of these components are engineered for
speed so that you can get results quickly and analyze data on the fly.

1. Central database: A database serves as the foundation of your data warehouse.


Traditionally, these have been standard relational databases running on premise or in the
cloud. But because of Big Data, the need for true, real-time performance, and a drastic
reduction in the cost of RAM, in-memory databases are rapidly gaining in popularity.
2. Data integration: Data is pulled from source systems and modified to align the
information for rapid analytical consumption using a variety of data integration
approaches such as ETL (extract, transform, load) and ELT as well as real-time data
replication, bulk-load processing, data transformation, and data quality and enrichment
services.
3. Metadata: Metadata is data about your data. It specifies the source, usage, values, and
other features of the data sets in your data warehouse. There is business metadata, which
adds context to your data, and technical metadata, which describes how to access data –
including where it resides and how it is structured.
4. Data warehouse access tools: Access tools allow users to interact with the data in your
data warehouse. Examples of access tools include: query and reporting tools, application
development tools, data mining tools, and OLAP tools.

Data warehouse architecture

In the past, data warehouses operated in layers that matched the flow of the business data:
 Data layer: Data is extracted from your sources and then transformed and loaded into
the bottom tier using ETL tools. The bottom tier consists of your database server, data
marts, and data lakes. Metadata is created in this tier – and data integration tools, like
data virtualization, are used to seamlessly combine and aggregate data.
 Semantics layer: In the middle tier, online analytical processing (OLAP) and online
transactional processing (OLTP) servers restructure the data for fast, complex queries
and analytics.
 Analytics layer: The top tier is the front-end client layer. It holds the data warehouse
access tools that let users interact with data, create dashboards and reports, monitor
KPIs, mine and analyze data, build apps, and more. This tier often includes a workbench
or sandbox area for data exploration and new data model development.
Data warehouses have been designed to support decision making and have been primarily
built and maintained by IT teams, but over the past few years they have evolved to empower
business users – reducing their reliance on IT to get access to the data and derive actionable
insights. A few key data warehousing capabilities that have empowered business users are:

1. The semantic or business layer that provides natural language phrases and allows
everyone to instantly understand data, define relationships between elements in the data
model, and enrich data fields with new business information.
2. Virtual workspaces allow teams to bring data models and connections into one secured and governed place, supporting better collaboration with colleagues through one common space and one common data set.
3. Cloud has further improved decision making by globally empowering employees with a
rich set of tools and features to easily perform data analysis tasks. They can connect new
apps and data sources without much IT support.
Top seven benefits of a cloud data warehouse
Cloud-based data warehouses are rising in popularity – for good reason. These modern
warehouses offer several advantages over traditional, on-premise versions. Here are the top
seven benefits of a cloud data warehouse:

1. Quick to deploy: With cloud data warehousing, you can purchase nearly unlimited
computing power and data storage in just a few clicks – and you can build your own
data warehouse, data marts, and sandboxes from anywhere, in minutes.
2. Low total cost of ownership (TCO): Data warehouse-as-a-service (DWaaS) pricing
models are set up so you only pay for the resources you need, when you need them. You
don’t have to forecast your long-term needs or pay for more compute throughout the
year than necessary. You can also avoid upfront costs like expensive hardware, server
rooms, and maintenance staff. Separating the storage pricing from the computing pricing
also gives you a way to drive down the costs.
3. Elasticity: With a cloud data warehouse, you can dynamically scale up or down as
needed. Cloud gives us a virtualized, highly distributed environment that can manage
huge volumes of data that can scale up and down.
4. Security and disaster recovery: In many cases, cloud data warehouses actually provide
stronger data security and encryption than on-premise DWs. Data is also automatically
duplicated and backed-up, so you can minimize the risk of lost data.
5. Real-time technologies: Cloud data warehouses built on in-memory database
technology can provide extremely fast data processing speeds to deliver real-time data
for instantaneous situational awareness.
6. New technologies: Cloud data warehouses allow you to easily integrate new
technologies such as machine learning, which can provide a guided experience for
business users and decision support in the form of recommended questions to ask, as an
example.
7. Empower business users: Cloud data warehouses empower employees equally and
globally with a single view of data from numerous sources and a rich set of tools and
features to easily perform data analysis tasks. They can connect new apps and data
sources without IT.

4. Explain about Oracle Autonomous Data Warehouse. Understand

Oracle Autonomous Data Warehouse (ADW) is a fully managed, cloud-based data


warehousing solution provided by Oracle Cloud. It leverages artificial intelligence and
machine learning to automate various administrative tasks, making it self-driving, self-
securing, and self-repairing. ADW is designed to handle large volumes of data and deliver
high-performance analytics for business intelligence and data-driven decision-making.

Top 10 use cases of Oracle Autonomous Data Warehouse:


1. Data Warehousing: Storing and querying large datasets for analytical and reporting
purposes.
2. Business Intelligence (BI): Building and delivering interactive dashboards and
reports for data-driven decision-making.
3. Data Analytics: Running complex analytical queries on large volumes of data.
4. Data Integration: Integrating and consolidating data from different sources for
analysis.
5. Real-time Analytics: Combining real-time data streams with ADW for near real-time
analytics.
6. Customer Analytics: Analyzing customer data to understand behavior and
preferences.
7. Predictive Analytics: Building and training predictive models for forecasting and
data-driven insights.
8. Financial Analytics: Analyzing financial data for budgeting, forecasting, and
performance analysis.
9. IoT Data Analysis: Analyzing data from Internet of Things (IoT) devices to derive
insights.
10. Compliance and Reporting: Storing historical data for compliance and reporting
purposes.
What are the features of Oracle Autonomous Data Warehouse?


 Self-Driving: Automated database tuning and optimization for better performance
and reduced manual tasks.
 Self-Securing: Automated security measures to protect data and prevent unauthorized
access.
 Self-Repairing: Automatic error detection and resolution to ensure high availability.
 Scalability: ADW can scale compute and storage resources independently to match
workload demands.
 In-Memory Processing: Utilizes in-memory columnar processing for faster query
performance.
 Parallel Execution: Queries are processed in parallel across multiple nodes for faster
results.
 Integration with Oracle Ecosystem: Seamless integration with other Oracle Cloud
services and tools.
 Data Encryption: Provides data encryption both at rest and in transit for data
security.
 Easy Data Loading: Supports data loading from various sources, including Oracle
Data Pump, SQL Developer, and SQL*Loader.
 Pay-as-You-Go Pricing: Based on consumption, offering cost-effective pricing.

How does Oracle Autonomous Data Warehouse work, and what is its architecture?


Oracle Autonomous Data Warehouse is built on Oracle Exadata, which is a highly optimized
platform for data warehousing and analytics.

1. Storage Layer: Data is stored in Exadata storage servers using a combination of flash
and disk storage.
2. Compute Layer: The compute nodes are responsible for processing queries and
analyzing data. ADW uses a massively parallel processing (MPP) architecture to
parallelize queries across multiple nodes for faster performance.
3. Autonomous Features: ADW leverages AI and machine learning to automate
various administrative tasks, including performance tuning, security patching,
backups, and fault detection.
How to Install Oracle Autonomous Data Warehouse?
To use Oracle Autonomous Data Warehouse:

1. Sign up for Oracle Cloud: Go to the Oracle Cloud website and sign up for an Oracle
Cloud account.
2. Provision Autonomous Data Warehouse: In the Oracle Cloud Console, provision
an Autonomous Data Warehouse instance.
3. Connect to Autonomous Data Warehouse: Use SQL clients or tools to connect to
ADW and run SQL queries.
4. Load Data into ADW: Load your data into ADW from various sources using Oracle
Data Pump, SQL Developer, or other data loading tools.
5. Run Queries and Analyze Data: Write SQL queries to analyze your data and gain
insights.
6. Monitor Performance: Use the Oracle Cloud Console to monitor query performance
and resource utilization.
Please note that Oracle Autonomous Data Warehouse is a cloud-based service, and you do not need to install it on your local machine. Instead, you access and use ADW through the Oracle Cloud Console or SQL clients from your local environment.
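As a hedged illustration of steps 3 and 5 (connecting and running queries), the sketch below uses the python-oracledb driver with its wallet-based (mutual TLS) connection pattern. The user, password, DSN alias, and wallet paths are placeholders you would replace with values from your own ADW instance.

import oracledb

conn = oracledb.connect(
    user="ADMIN",
    password="<your_password>",
    dsn="myadw_high",                      # TNS alias from the downloaded wallet
    config_dir="/path/to/wallet",          # directory containing tnsnames.ora
    wallet_location="/path/to/wallet",
    wallet_password="<wallet_password>")

cursor = conn.cursor()
cursor.execute("SELECT sysdate FROM dual")   # any SQL query against ADW
for (now,) in cursor:
    print(now)
conn.close()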

UNIT II
PART A
1. What is ETL? ETL functions reshape the relevant data from the source systems into
useful information to be stored in the data warehouse. Without these functions, there would
be no strategic information in the data warehouse. If the source data is not extracted correctly, cleansed, and integrated in the proper formats, then query processing and the delivery of business intelligence, the backbone of the data warehouse, cannot happen.
2. What are the major steps in the ETL process?
 Plan for aggregate fact tables.
 Determine data transformation and cleansing rules.
 Establish comprehensive data extraction rules.
 Prepare data mapping for target data elements from sources.
 Integrate all the data sources, both internal and external.
 Determine all the target data needed in the data warehouse.
3. Difference between ETL vs. ELT
 Process: In ETL, data is transferred to the ETL server and moved back to the DB, so high network bandwidth is required. In ELT, data remains in the DB except for cross-database loads (e.g. source to target).
 Transformation: In ETL, transformations are performed in the ETL server. In ELT, transformations are performed in the source or in the target database.
 Code Usage: ETL is typically used for source-to-target transfer, compute-intensive transformations, and small amounts of data. ELT is typically used for high amounts of data.
 Time / Maintenance: ETL needs high maintenance as you need to select the data to load and transform. ELT needs low maintenance as the data is always available.
 Calculations: In ETL, a calculation overwrites an existing column, or the dataset must be appended and pushed to the target platform. In ELT, a calculated column can easily be added to the existing table.
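To make the contrast concrete, the sketch below shows the ELT pattern in Python: raw rows are loaded first and the transformation is pushed down into the target database as SQL. SQLite stands in for the warehouse, and the table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw, untyped rows go straight into the target database.
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("north", "100.5"), ("south", "200"), ("north", "50")])

# Transform inside the database: cast, clean, and aggregate in one SQL step.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY UPPER(region)
""")
print(conn.execute("SELECT * FROM sales_summary").fetchall())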
4. What are the various data warehouse models?
1. Conceptual model 2. Logical model 3. Physical model
5. List the advantages and disadvantages of top down approach?
Advantages of top-down design
 Data marts are loaded from the data warehouse.
 Developing a new data mart from the data warehouse is very easy.
Disadvantages of top-down design
 This technique is inflexible to changing departmental needs.
 The cost of implementing the project is high.
6. Explain OLAP?
OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology which authorizes analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.
7. List the benefits of OLAP?
 OLAP helps managers in decision-making through the multidimensional record views that it is efficient in providing, thus increasing their productivity.
 OLAP functions are self-sufficient owing to the inherent flexibility support to the organized databases.
 It facilitates simulation of business models and problems, through extensive management of analysis capabilities.
 In conjunction with a data warehouse, OLAP can be used to support a reduction in the application backlog, faster data retrieval, and reduction in query drag.
8. Explain OLTP?
OLTP (On-Line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP operations is on very rapid query processing, maintaining record integrity in multi-access environments, and effectiveness measured by the number of transactions per second.
9. What are the various OLAP operations?
ROLL UP, DRILL DOWN, DICE, SLICE, PIVOT
10. Mention the types of OLAP?
ROLAP stands for Relational OLAP, an application based on relational DBMSs. MOLAP
stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
11. Mention the advantages and disadvantages of Hybrid OLAP?
Advantages of HOLAP
 HOLAP provides the benefits of both MOLAP and ROLAP.
 It provides fast access at all levels of aggregation.
 HOLAP balances the disk space requirement, as it only stores the aggregate information on the OLAP server and the detail records remain in the relational database, so no duplicate copy of the detail records is maintained.
Disadvantages of HOLAP
 The HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
UNIT II
PART B
1. Illustrate data modeling life cycle with neat sketch.

Data Modeling Development Cycle

Following are the important phases in the Data Model Development Life Cycle.

1. Gathering Business Requirements


2. Conceptual Data Modeling
3. Logical Data Modeling
4. Physical Data Modeling
5. Development of the schema or the database
6. Maintenance of the data model as per the changes.

1. Gathering Business Requirements – First Phase:

Data Modelers have to interact with business analysts to get the functional requirements and
with end users to find out the reporting needs.

2. Conceptual Data Modeling (CDM) – Second Phase:

This data model includes all major entities and relationships; it does not contain much detail about attributes and is often used in the INITIAL PLANNING PHASE.
3. Logical Data Modeling (LDM) – Third Phase:

This is the actual implementation of the conceptual model as a logical data model. A logical data model is the version of the model that represents all of the business requirements of an organization.

4. Physical Data Modeling (PDM) – Fourth Phase:

This is a complete model that includes all required tables, columns, relationships, and database properties for the physical implementation of the database.
5. Database – Fifth Phase:
DBAs instruct the data modeling tool to create SQL code from the physical data model. The SQL code is then executed on the server to create the databases.
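A small illustrative sketch of this hand-off from a logical model to physical DDL, assuming a hypothetical Customer entity; in practice a data modeling tool generates this SQL.

# The logical model: entities with attributes and their target data types.
logical_model = {
    "Customer": {
        "customer_id": "INTEGER PRIMARY KEY",
        "name": "VARCHAR(100)",
        "city": "VARCHAR(50)",
    }
}

def to_ddl(model):
    # Turn each entity into a CREATE TABLE statement (the physical model).
    statements = []
    for entity, attributes in model.items():
        cols = ",\n  ".join(f"{col} {dtype}" for col, dtype in attributes.items())
        statements.append(f"CREATE TABLE {entity.lower()} (\n  {cols}\n);")
    return "\n\n".join(statements)

print(to_ddl(logical_model))   # the generated SQL is what the DBA executes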

2. Explain about various OLAP operations in detail.

What is OLAP?

OLAP or Online Analytical Processing is a category of software that allows users to extract
and examine business data from different points of view. It makes use of pre-calculated and
pre-aggregated data from multiple databases to improve the data analysis process. OLAP
databases are divided into several data structures known as OLAP cubes.

OLAP Cube
The OLAP cube or hypercube is a special kind of data structure that is optimized for very quick multidimensional data analysis and storage. It is a snapshot of data at a specific point in time. Using certain OLAP operations, a user can request a specified view of the hypercube. Hence, OLAP cubes allow users to perform multidimensional analytical querying on the data.

Types of OLAP Servers

There exist mainly three types of OLAP systems:

Relational OLAP (ROLAP): These systems work directly with relational databases and use complex SQL queries to retrieve information from the database. ROLAP can handle large volumes of data but provides slower data processing.

Multidimensional OLAP (MOLAP): MOLAP is also known as the classic form of OLAP.
It uses an optimized multi-dimensional array storage system for data storage. It makes use of
positional techniques to access the data physically stored in multidimensional arrays.

Hybrid OLAP (HOLAP): It uses a best-of-both-worlds approach and is a combination of ROLAP and MOLAP. It provides the high scalability feature of ROLAP systems along with the fast computation functionality of MOLAP systems.

OLAP Operations

OLAP provides various operations to gain insights from the data stored in multidimensional
hypercubes. These operations include:

Drill Down

Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is
converted into highly detailed data. It can be implemented by either stepping down a concept
hierarchy for a dimension or adding additional dimensions to the hypercube.
Example: Consider a cube that represents the annual sales (4 quarters: Q1, Q2, Q3, Q4) of various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities (Delhi, Mumbai, Las Vegas, New York).

Here, the drill-down operation is applied on the time dimension and the quarter Q1 is drilled
down to January, February, and March. Hence, by applying the drill-down operation, we can
move down from quarterly sales in a year to monthly or weekly records.

Roll up

It is the opposite of the drill-down operation and is also known as a drill-up or aggregation
operation. It is a dimension-reduction technique that performs aggregation on a data cube. It
makes the data less detailed and it can be performed by combining similar dimensions across
any axis.

Example: Considering the above-mentioned clothing company sales example:

Here, we are performing the Roll-up operation on the given data cube by combining and
categorizing the sales based on the countries instead of cities.
Dice

Dice operation is used to generate a new sub-cube from the existing hypercube. It
selects two or more dimensions from the hypercube to generate a new sub-cube for the given
data.

Example: Considering our clothing company sales example:

Here, we are using the dice operation to retrieve the sales done by the company in the first
half of the year i.e., the sales in the first two quarters.

Slice

Slice operation is used to select a single dimension from the given cube to generate a
new sub-cube. It represents the information from another point of view.

Example: Considering our clothing company sales example:

Here, the sales done by the company during the first quarter are retrieved by performing the
slice operation on the given hypercube.
Pivot

It is used to provide an alternate view of the data available to the users. It is also known as
Rotate operation as it rotates the cube’s orientation to view the data from different
perspectives.

Example:

Here, we are using the Pivot operation to view the sub-cube from a different perspective.
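A compact pandas sketch of the operations described above, using a tiny made-up slice of the clothing-sales cube; the data values and column names are assumptions of this example.

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "New York", "New York"],
    "country": ["India", "India", "India", "India", "USA", "USA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["Shirt", "Pant", "Shirt", "Pant", "Shirt", "Pant"],
    "amount":  [100, 150, 120, 80, 200, 170],
})

# The "cube": amounts cross-tabulated by location and by quarter/item.
cube = sales.pivot_table(index=["country", "city"], columns=["quarter", "item"],
                         values="amount", aggfunc="sum")

roll_up    = sales.groupby(["country", "quarter"])["amount"].sum()        # roll up: city -> country
drill_down = sales.groupby(["city", "quarter", "item"])["amount"].sum()   # drill down: finer grain
slice_q1   = sales[sales["quarter"] == "Q1"]                              # slice: one dimension value
dice       = sales[(sales["quarter"].isin(["Q1", "Q2"])) &
                   (sales["city"].isin(["Delhi", "Mumbai"]))]             # dice: sub-cube on two dims
pivot      = cube.T                                                       # pivot: rotate the view
print(roll_up)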

Conclusion

 OLAP (Online Analytical Processing) is a type of software technology that plays an


important role in data warehousing.
 OLAP is used for analysis as it provides a single source of data for all end-users.
 OLAP makes use of multidimensional array structures known as OLAP cubes.
 There are 3 main types of OLAP systems: ROLAP, MOLAP, and HOLAP.
 ROLAP uses relational databases, MOLAP uses multidimensional arrays, and
HOLAP makes use of both ROLAP and MOLAP systems.
 OLAP cube provides various operations to gain insights on data. These include Drill-
down, Roll-up, Dice, Slice, and Pivot.

3. Discuss the various types of OLAP techniques.


There are several varieties of OLAP, each serving particular requirements and preferences for data analysis. The primary kinds of OLAP are:

1. MOLAP (Multidimensional OLAP): MOLAP (Multidimensional OLAP) systems


store data in a multidimensional cube structure, with aggregated data based on several
dimensions contained in each cell of the cube. MOLAP systems do precalculations
and store aggregations, which results in quick query responses. They work effectively
in situations when performance is crucial and data quantities aren’t very huge.
Microsoft Analysis Services, IBM Cognos TM1, and Essbase are a few MOLAP
system examples.
2. Relational OLAP (ROLAP): Traditional relational databases are used for data
storage by ROLAP systems. They run intricate SQL queries to simulate
multidimensional views of the data. ROLAP systems can manage huge datasets and
complicated data linkages, therefore they can have slightly slower query speed than
MOLAP systems, but they also provide better flexibility and scalability. ROLAP
systems include those from Oracle OLAP, SAP BW (Business Warehouse), and
Pentaho, as examples.
3. Hybrid OLAP (HOLAP): HOLAP systems attempt to combine the benefits of
MOLAP and ROLAP. Similar to MOLAP, they enable the ability to obtain detailed
data from the underlying relational database as necessary while also storing summary
data in cubes. Depending on the type of analysis, this method helps to improve both
performance and flexibility. Users of some MOLAP systems have the option of
retrieving detailed data or pre-aggregated data by using HOLAP capabilities that are
supported by these systems.
4. DOLAP (Desktop OLAP): Desktop OLAP, often known as DOLAP, is a simplified
form of OLAP that operates on individual desktop PCs. It is appropriate for lone
analysts who wish to carry out fundamental data exploration and analysis without
requiring a large IT infrastructure. In-memory processing is frequently used by
DOLAP tools to deliver comparatively quick performance on tiny datasets. The
PivotTable feature in Excel is an illustration of a DOLAP tool.
5. WOLAP (Web OLAP): WOLAP systems bring OLAP capabilities to web browsers,
allowing users to access and analyze data through a web-based interface. This enables
remote access, collaboration, and sharing of analytical insights. WOLAP systems
often use a combination of MOLAP, ROLAP, or HOLAP architectures on the
backend. Web-based BI tools like Tableau, Power BI, and Looker provide WOLAP
features.

4. Discuss various delivery process followed in data warehouse


A data warehouse is never static; it evolves as the business expands. As the business evolves,
its requirements keep changing and therefore a data warehouse must be designed to ride with
these changes. Hence a data warehouse system needs to be flexible.
Ideally there should be a delivery process to deliver a data warehouse. However data
warehouse projects normally suffer from various issues that make it difficult to complete
tasks and deliverables in the strict and ordered fashion demanded by the waterfall method.
Most of the time, the requirements are not understood completely. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted for
the delivery of a data warehouse. We have staged the data warehouse delivery process to
minimize risks. The approach that we will discuss here does not reduce the overall delivery
time-scales but ensures the business benefits are delivered incrementally through the
development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
The following diagram explains the delivery process.
IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping

Organizations experiment with the concept of data analysis and educate themselves on the
value of having a data warehouse before settling for a solution. This is addressed by
prototyping. It helps in understanding the feasibility and benefits of a data warehouse. The
prototyping activity on a small scale can promote educational process as long as −

 The prototype addresses a defined technical objective.


 The prototype can be thrown away after the feasibility concept has been shown.
 The activity addresses a small subset of eventual data content of the data warehouse.
 The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business
benefits.

 Identify the architecture that is capable of evolving.


 Focus on business requirements and technical blueprint phases.
 Limit the scope of the first build phase to the minimum that delivers business benefits.
 Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we should make sure the overall requirements are
understood. If we understand the business requirements for both short-term and medium-
term, then we can design a solution to fulfil short-term requirements. The short-term solution
can then be grown to a full solution.

The following aspects are determined in this stage −

 The business rule to be applied on data.


 The logical model for information within the data warehouse.
 The query profiles for the immediate requirement.
 The source systems that provide this data.

Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following.

 The overall system architecture.


 The data retention policy.
 The backup and recovery strategy.
 The server and data mart architecture.
 The capacity plan for hardware and infrastructure.
 The components of database design.

Building the Version

In this stage, the first production deliverable is produced. This production deliverable is the
smallest component of a data warehouse. This smallest component adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data
warehouse. In this phase, we do not add new entities, but additional physical tables would
probably be created to store increased data volumes.

Let us take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information allows the user to analyze only the recent trends and address short-term issues. The user in this case cannot identify annual and seasonal trends. To help him do so, the last 2 years' sales history could be loaded from the archive. The 40GB of data is then extended to 400GB.
Note − The backup and recovery procedures may become complex, therefore it is
recommended to perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to operate a data warehouse.
These tools can generate the database query.

Note − It is recommended not to use these access tools when the database is being
substantially modified.

Automation

In this phase, operational management processes are fully automated. These would include −

 Transforming the data into a form suitable for analysis.


 Monitoring query profiles and determining appropriate aggregations to maintain
system performance.
 Extracting and loading data from different source systems.
 Generating aggregations from predefined definitions within the data warehouse.
 Backing up, restoring, and archiving the data.

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements.
The scope can be extended in two ways −

 By loading additional data into the data warehouse.


 By introducing new data marts using the existing information.

Note − This phase should be performed separately, since it involves substantial efforts and
complexity.

Requirements Evolution

From the perspective of delivery process, the requirements are always changeable. They are
not static. The delivery process must support this and allow these changes to be reflected
within the system.

This issue is addressed by designing the data warehouse around the use of data within
business processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.
UNIT III META DATA, DATA MART AND PARTITION STRATEGY
PART A
1.What is metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book.
2. What is business metadata? It has the data ownership information, business definition,
and changing policies.
3. What is Technical Metadata? It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
4. What is Operational Metadata? It includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data's migration and the transformations applied to it.
5. What is Data mart? A Data Mart is a subset of a directorial information store, generally
oriented to a specific purpose or primary data subject which may be distributed to provide
business needs. Data Marts are analytical record stores designed to focus on particular
business functions for a specific community within an organization.
6. Why Do We Need a Data Mart?
 To partition data in order to impose access control strategies.
 To speed up the queries by reducing the volume of data to be scanned.
 To segment data into different hardware platforms.
 To structure data in a form suitable for a user access tool.
7. Why is it Necessary to Partition in data warehousing?
Partitioning is important for the following reasons in data warehousing
 For easy management,
 To assist backup/recovery,
 To enhance performance.
8. What is vertical partitioning? How is it done?
Vertical partitioning splits the data vertically. It can be performed in the following two ways −
 Normalization
 Row Splitting
9. State few roles of metadata
 Metadata helps in summarization between lightly detailed data and highly summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.
10. List some cost measures for data mart
The cost measures for data marting are as follows:
 Hardware and Software Cost
 Network Access
 Time Window Constraints
11. What is normalization?
Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space.
12. What is Row Splitting? Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
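A small pandas sketch of row splitting, assuming a hypothetical wide customer table; the frequently used columns stay in one partition and the bulky, rarely used column moves to another, with a one-to-one mapping through the shared key.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Delhi", "Mumbai", "Chennai"],
    "notes": ["long text ...", "long text ...", "long text ..."],
})

# Frequently used columns stay in the main partition ...
customers_core  = customers[["customer_id", "name", "city"]]
# ... rarely used, bulky columns move to a second partition.
customers_extra = customers[["customer_id", "notes"]]

# Access to the main table is faster; the full row can be rebuilt when needed.
rebuilt = customers_core.merge(customers_extra, on="customer_id")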
13. What do you mean by Round Robin Partition? In the round robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tool to refer to the correct table partition. This technique makes it easy to automate table management facilities within the data warehouse.
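A minimal Python sketch of the round robin idea from this answer, with a fixed number of partitions whose names are illustrative; the oldest partition is archived when a new one is needed.

MAX_PARTITIONS = 3
partitions = ["sales_2023_10", "sales_2023_11", "sales_2023_12"]

def add_partition(new_name):
    if len(partitions) >= MAX_PARTITIONS:
        archived = partitions.pop(0)      # archive the oldest partition
        print("archiving", archived)
    partitions.append(new_name)           # metadata is updated to point here

add_partition("sales_2024_01")
print(partitions)                         # ['sales_2023_11', 'sales_2023_12', 'sales_2024_01']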
14. Why partitioning is done in data warehouse? Partitioning is done to enhance
performance and facilitate easy management of data. Partitioning also helps in balancing the
various requirements of the system. It optimizes the hardware performance and simplifies the
management of data warehouse by partitioning each fact table into multiple separate
partitions.
15. State the factors for determining the number of data marts
The determination of how many data marts are possible depends on:
 Network capacity
 Time window available
 Volume of data being transferred
 Mechanisms being used to insert data into a data mart
16. What is system management? State few System managers.
System management is mandatory for the successful implementation of a data warehouse. The most important system managers are −
 System configuration manager
 System scheduling manager
 System event manager
 System database manager
 System backup recovery manager
17. What is metadata repository?
Metadata repository is an integral part of a data warehouse system. It has the following metadata:
 Definition of data warehouse
 Business metadata
 Operational metadata
 Data for mapping from the operational environment to the data warehouse
 Algorithms for summarization
18. State few challenges for Metadata management
 Metadata in a big organization is scattered across the organization. This metadata is spread in spreadsheets, databases, and applications.
 Metadata could be present in text files or multimedia files. To use this data for information management solutions, it has to be correctly defined.
 There are no industry-wide accepted standards. Data management solution vendors have a narrow focus.
 There are no easy and accepted methods of passing metadata.
19. Give a schematic diagram of data
20. What is partitioning dimensions? If a dimension contains large number of entries, then
it is required to partition the dimensions. Here we have to check the size of a dimension.
Consider a large design that changes over time. If we need to store all the variations in order
to apply comparisons, that dimension may be very large. This would definitely affect the
response time.
UNIT III
PART B
1. Discuss the role of Meta data in data warehousing.

Understanding Metadata in Data Warehouse

Serving as "data about data," metadata is the backbone of efficient data processing and
analytics, giving users and systems the contextual information needed to leverage data
effectively. In this article, we will go into detail regarding what metadata is in data
warehouses, its different kinds of classification in their structures, how it works and improves
the performance of a data warehouse, and design concepts for managing metadata, with
specific examples of AI metadata. The article also highlights the role of metadata in
enforcing data governance, data analysis using data mining techniques, and data analytics in
real-time.

What is Metadata in the Database?

In a data warehouse, metadata refers to the data that provides the user with more detailed
elaboration and classification for it to be more manageable. Information such as the origin of
the data, what kind of data it is, what kind of operations were performed on it, timestamps,
and how different data sets relate to each other is provided as metadata. This added layer of
information enhances the usability of data, ensuring that raw data in warehouses is usable,
interpretable, and actionable.

Further, for a given dataset or package, metadata might indicate the type of structure the package has, where it was obtained from, when it was taken, and what operations were applied to it. This context is important to large-scale data engineering and data solutions and services, which rely on transformed and well-categorized data. For any data warehouse, metadata acts as a very basic building block of information that assists users in making sense of the data without getting lost in the vast quantities of information available.
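As an illustration only, the sketch below shows the kind of record a metadata repository might hold for one table; the fields mirror the items mentioned above (origin, type, operations applied, timestamps, related data sets) and the field names are assumptions of this example.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TableMetadata:
    table_name: str
    source_system: str
    data_type: str                                   # e.g. "fact" or "dimension"
    transformations: list = field(default_factory=list)
    loaded_at: datetime = None
    related_tables: list = field(default_factory=list)

meta = TableMetadata(
    table_name="sales_fact",
    source_system="orders_oltp",
    data_type="fact",
    transformations=["currency converted to USD", "nulls removed"],
    loaded_at=datetime(2024, 1, 15, 2, 0),
    related_tables=["dim_date", "dim_product"],
)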

Types of Metadata in Data Warehouse

 Business Metadata

Business metadata provides a business-oriented view of data, including definitions,


descriptions, and rules that make data meaningful to non-technical users. It covers aspects
like the definitions of business terms, calculation rules for key metrics, ownership
information, and relationships between business entities. By offering this high-level context,
business metadata helps users interpret data accurately, bridging the gap between raw data
and business insights.

 Technical Metadata

Technical metadata is tailored for programmers, developers, data engineers, and other IT
people as it gives insight into how data is constructed, where it is kept, and how it is
managed. For instance, it incorporates data lineage information, which links data from its
sources after undergoing several processes and schema information such as tables, primary
key, and foreign key, as well as their linkages. Technical metadata also maintains information
on ETL, which makes it possible to follow the mapping of how data is transformed and
stored by the needs of the warehouse and how problems in the flow of data transfer and
storage in the warehouse can be fixed.

 Operational Metadata

Operational metadata helps in the operational aspect of a data warehouse as it explains the
activities that occur on the system. This includes data on ETL schedules, the volume of data
loads, job success and failure indices, and other system load measures such as memory and
CPU usage. This type of metadata is also important for data center operations because, with
it, teams can manage and avoid data staleness, bottlenecks, and performance degradation

across a data center’s warehouse, thus improving reliability and efficiency.

How Does Metadata Enhance Data Warehouse Efficiency?

Metadata is essential for enhancing the efficiency of data warehouses, improving usability,
and ensuring the system’s ability to handle real-time data analytics demands. Here are a
number of ways of using metadata in data warehouse environments.

 Data Classification and Discovery: Metadata allows users to easily find and identify relevant data. For example, by categorizing data according to type, source, and purpose, metadata simplifies data search and retrieval, which is critical in data warehouses that house vast amounts of information.
 Data Lineage and Auditing: Metadata provides a clear view of data lineage, which
includes the data's origins, transformations, and destination within the data
warehouse. This is very important for compliance issues as it allows a company to
trace history, ensuring accurate and transparent reports.
 Improved Data Quality: The metadata sets the standards and business rules to support
the data quality initiatives. As a means of checking data consistency and ensuring that
data transformations are automated, metadata adds value to the resources. This makes
it possible for the data warehouse to be an integral data source for analyzing and
reporting purposes.
 Improved querying speed: Metadata streamlines the query process as it indicates to
the database what data is and where it is contained. This optimizes the querying
process, especially when retrieving datasets that need to be ready at all times,
especially for real-time data analytics.
 Improving data management: Instead of having several policies that are sometimes
hard to understand, metadata enables using a single effective policy to guarantee data
protection. No one without the required permissions can access, use, or even change
the information in a repository.

Tools for Managing Metadata in Data Warehouse

Managing metadata effectively requires specialized tools that integrate, store, and govern
metadata across various systems within a data warehouse. Here are some popular data
engineering tools for metadata management, each offering unique capabilities to enhance data
warehouse efficiency:

 Apache Atlas
Apache Atlas is a strong open-source metadata management and data governance application
suited for any data warehouse, especially those that are Hadoop-based. It allows businesses to
specify, classify, and monitor their data assets across diverse repositories. Atlas facilitates the
automation of data lineage tracking, allowing users to monitor how data flows from one
source to another and in what manner. This is important in areas of audit and compliance.
Furthermore, Atlas enhances data discovery tools through customizable classifications and
business vocabularies, which both technical and non-technical users can easily understand
and use.

 Informatica Metadata Manager

Informatica Metadata Manager, considered part of the broader Informatica platform


specializing in data integration and data governance, enables strong metadata management
through a unified data registry that comes with enhanced lineage, impact analysis, and
automated data cataloging functions. The Informatica Metadata Manager assists complex
ETL functions by outlining data transformation and dependency mechanisms, particularly
relevant to large-scale data warehouses with numerous data processes. Its powerful search and visualization capabilities assist data teams in data quality maintenance, data source tracing, and governance compliance.

 Collibra

Collibra’s Data Intelligence Cloud is the most comprehensive platform for metadata
management, governance, and stewardship. It attracts enterprises with a data governance and
compliance center. It also has a big data catalog that describes data assets with simple
language, visualizing data lineage, and providing panels for classification. Moreover, Collibra's data workflows promote data collaboration across departments and hence mitigate data silos.
Collibra also sets out extensive data control policies to ensure that throughout the lifecycle of
data, its ownership and accountability are enforced such that relevant descriptions and
classification are provided and maintained.

 Alation

Alation is a leader in data cataloging and metadata management and is known for its AI-
driven approach to data discovery and organization. Alation’s automated indexing and
tracking of user interaction makes metadata management a lot easier by utilizing machine
learning. Because of Alation’s emphasis on collaboration features, users can share insights
and comments and provide contextual information about the data substance, leading to more
effective governance of this data. Furthermore, its data lineage management tools help
organizations know where their data is located within their environment, allowing them to
make sure the company meets all the regulatory requirements.

 Microsoft Azure Data Catalog

Microsoft Azure Data Catalog is a fully managed cloud-based service that allows users to
catalog, annotate, and classify data assets across various sources. It offers a consolidated view
of an organization’s data assets through a central metadata repository, which allows all the
authenticated people in the organization to view the information. Azure Data Catalog can
handle both structured and unstructured data, making it applicable to mixed data warehouse
environments. Because the service is cloud native, it integrates easily with other Azure
services, and features like tagging and search make data easier to find.

 IBM InfoSphere Information Governance Catalog

IBM InfoSphere is an enterprise-grade metadata management tool for organizations with


complex data governance requirements. It offers an enterprise data catalog with automatic
data lineage, impact analysis, and data classification capabilities. The capabilities of
InfoSphere are rich in depth and suitable for large organizations that require stringent
governance and control mechanisms. The lineage visualization offered by the tool provides a
comprehensive view of how data changes and moves throughout the warehouse. It also has
data stewardship features that enable greater privacy, accuracy, and data security so that the
organization can comply with legal and regulatory requirements.

 Talend Data Catalog

Talend Data Catalog is an integrated metadata management solution that combines data
discovery, data quality assessment, and data lineage tracking. Talend automatically
documents metadata by linking it to warehouse business processes, allowing businesses to
visualize data flow and dependencies. With its intuitive interface, it also offers powerful
search functions that allow users to perform data discovery without hassle. With these
features, Talend enables teams to collaborate on data governance activities, improving data
accuracy, reliability, and compliance with enterprise standards. Talend helps maintain a
dynamic and responsive metadata repository by providing real-time metadata updates.

Best Examples of AI Metadata

Here are some of the best examples of AI metadata that demonstrate how AI interacts with
metadata to assist with data understanding and information retrieval:

 Data Provenance and Lineage Tracking

AI systems rely on metadata to maintain data provenance, which records the origin and
history of data through each transformation step. For instance, AI models in the financial or
healthcare industries rely on data lineage metadata to know where the data was obtained, how
it was processed, and how it was validated. This creates robust compliance and effective
auditing, especially for industries with strict data integrity management rules.

 AI Model Metadata (Model Parameter and Training Dataset)

AI model metadata consists of parameters, hyperparameters, details about the training data
used to build an AI model, and the model's performance metrics. For example, machine
learning frameworks such as TensorFlow and PyTorch keep details about the model structure,
version, and training history so that data scientists can optimize performance. This metadata
is essential for reproducibility, allowing teams to reproduce and refine models based on
earlier training runs.
 User Interaction Metadata as a source of AI Recommendations

User interaction metadata involves clicks, search and browsing history, preferences, and
behavior, which enables AI algorithms to produce personalized recommendations. For
example, platforms like Netflix or Spotify benefit from user metadata and recommend
something according to one’s preferences. They also learn from every user’s activity to
enhance their content recommendations.

 Structural Content Enhancement with Text and Image Metadata

Artificial intelligence systems depend on text, image, and video metadata tags for efficient
content classification and retrieval. For instance, metadata tags in digital image libraries or
image banks may include keywords, descriptions, and contextual information. Image-recognition
AI uses these tags for classification, which makes it easier to retrieve and assess large
quantities of visual data. This is especially advantageous in media and e-commerce.

 Predictive Analytics in the Context of Real-Time Metadata

AI and machine-learning analytics rely on real-time metadata attributes such as timestamps,
geolocation, and device characteristics. For example, IoT applications in smart cities collect
real-time traffic and environmental metadata to improve planning and resource allocation.

 Natural Language Processing (NLP) Metadata for Contextual Understanding

NLP systems comprehend text more effectively by using associated metadata such as context,
sentiment score, language, and named entities. To illustrate, conversational agents such as
chatbots or virtual assistants leverage metadata about user intent, previous conversations, or
sentiment level to enhance their responses. This allows AI systems to better interpret the
subtleties of human language and provide more context-appropriate responses.

Conclusion- Metadata Database

To sum up, metadata in data warehouses provides the necessary framework for organization,
context, and control, facilitating ease of management and analysis of the large amounts of
data available. With the adoption of metadata, organizations can enhance data solutions and
services, improve real-time data analytics, and implement proper data governance solutions.

In today's data-driven environment, metadata is a key resource every organization must use.
As organizations target a data economy to improve their data engineering and mining
strategies, understanding metadata will be critical. Tools such as Apache Atlas, Collibra,
Alation, and IBM InfoSphere help achieve this goal, further setting the stage for a structured,
compliant, and efficient data warehouse ecosystem.


2. Explain in detail about horizontal partitioning techniques.

Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It optimizes
hardware performance and simplifies the management of the data warehouse by partitioning
each fact table into multiple separate partitions. The sections below discuss different
partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −

 For easy management
 To assist backup/recovery
 To enhance performance

For Easy Management

The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge
size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery

If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It
reduces the time to load and also enhances the performance of the system.

Note − To cut down on the backup size, all partitions other than the current partition can be
marked as read-only. We can then put these partitions into a state where they cannot be
modified. Then they can be backed up. It means only the current partition is to be backed up.

To Enhance Performance

By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant.
It does not have to scan the whole data.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business. For example, if the
user queries for month to date data then it is appropriate to partition the data into monthly
segments. We can reuse the partitioned tables by removing the data in them.
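
As a hedged sketch of how such time-based partitioning might be declared, the DDL below uses Oracle-style PARTITION BY RANGE syntax; the sales_fact table and its columns are illustrative, not taken from the text:

-- Hypothetical fact table partitioned into monthly segments by sales_date.
CREATE TABLE sales_fact (
    product_id  NUMBER,
    store_id    NUMBER,
    qty         NUMBER,
    value       NUMBER(10,2),
    sales_date  DATE
)
PARTITION BY RANGE (sales_date) (
    PARTITION p_2013_08 VALUES LESS THAN (DATE '2013-09-01'),
    PARTITION p_2013_09 VALUES LESS THAN (DATE '2013-10-01'),
    PARTITION p_2013_10 VALUES LESS THAN (DATE '2013-11-01')
);

An aged monthly partition can later be truncated or dropped and its space reused, which matches the idea of reusing partitioned tables by removing the data in them.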

Partition by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented
as a set of small partitions for relatively current data and a larger partition for inactive data.

Points to Note

 The detailed information remains available online.
 The number of physical tables is kept relatively small, which reduces the operating
cost.
 This technique is suitable where a mix of dipping into recent history and mining
through the entire history is required.
 This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time such as
product group, region, supplier, or any other dimension. Let's have an example.

Suppose a market function has been structured into distinct regional departments, for example
on a state-by-state basis. If each region wants to query only information captured within its
region, it proves more effective to partition the fact table into regional partitions. This speeds
up the queries because they do not need to scan information that is not relevant.

Points to Note
 The query does not have to scan irrelevant data, which speeds up the query process.
 This technique is not appropriate where the chosen dimension is likely to change in
the future. So, it is worth confirming that the dimension will not change before
partitioning on it.
 If the dimension changes, then the entire fact table would have to be repartitioned.

Note − We recommend performing the partition only on the basis of the time dimension,
unless you are certain that the suggested dimension grouping will not change within the life
of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, then we
should partition the fact table on the basis of its size. We can set a predetermined size
as a critical point. When the table exceeds the predetermined size, a new table partition is
created.

Points to Note

 This partitioning is complex to manage.


 It requires metadata to identify what data is stored in each partition.

Partitioning Dimensions

If a dimension contains a large number of entries, then it is required to partition the
dimension. Here we have to check the size of the dimension.

Consider a large design that changes over time. If we need to store all the variations in order
to apply comparisons, that dimension may be very large. This would definitely affect the
response time.

Round Robin Partitions

In the round robin technique, when a new partition is needed, the old one is archived. It uses
metadata to allow the user access tool to refer to the correct table partition.

This technique makes it easy to automate table management facilities within the data
warehouse.

Vertical Partition

Vertical partitioning splits the data vertically, i.e., by columns rather than by rows.
Vertical partitioning can be performed in the following two ways −

 Normalization
 Row Splitting

Normalization

Normalization is the standard relational method of database organization. In this method,
repeated dimension values are collapsed into a single row in a separate table, which reduces
space. Take a look at the following tables that show how normalization is performed.

Table before Normalization

Product_id   Qty   Value   sales_date   Store_id   Store_name   Location    Region
30           5     3.67    3-Aug-13     16         sunny        Bangalore   S
35           4     5.33    3-Sep-13     16         sunny        Bangalore   S
40           5     2.50    3-Sep-13     64         san          Mumbai      W
45           7     5.66    3-Sep-13     16         sunny        Bangalore   S

Table after Normalization

Store table:

Store_id   Store_name   Location    Region
16         sunny        Bangalore   S
64         san          Mumbai      W

Sales table:

Product_id   Quantity   Value   sales_date   Store_id
30           5          3.67    3-Aug-13     16
35           4          5.33    3-Sep-13     16
40           5          2.50    3-Sep-13     64
45           7          5.66    3-Sep-13     16
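
A minimal SQL sketch of the same normalization, using generic DDL with the table and column names from the example above (exact data types are illustrative):

-- Store details move into their own table, stored once per store...
CREATE TABLE store (
    store_id    INT PRIMARY KEY,
    store_name  VARCHAR(50),
    location    VARCHAR(50),
    region      CHAR(1)
);

-- ...and each sales row keeps only the store_id as a reference.
CREATE TABLE sales (
    product_id  INT,
    quantity    INT,
    value       DECIMAL(10,2),
    sales_date  DATE,
    store_id    INT REFERENCES store(store_id)
);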

Row Splitting

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting
is to speed up access to a large table by reducing its size.

Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.

Identify Key to Partition

It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to
reorganizing the fact table. Let's have an example. Suppose we want to partition the
following table.

Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

We can choose to partition on any key. The two possible keys could be

 region
 transaction_date

Suppose the business is organized into 30 geographical regions and each region has a different
number of branches. That will give us 30 partitions, which is reasonable. This partitioning is
good enough because our requirements capture has shown that a vast majority of queries are
restricted to the user's own business region.

If we partition by transaction_date instead of region, then the latest transaction from every
region will be in one partition. Now the user who wants to look at data within his own region
has to query across multiple partitions.

Hence it is worth determining the right partitioning key.
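
As a hedged sketch of the region-based choice, the DDL below uses Oracle-style list partitioning; the region values shown are illustrative placeholders, and in practice one partition would be declared per business region (30 in the example above):

-- Account_Txn_Table partitioned by region, so a user's queries
-- normally touch only the partition for their own business region.
CREATE TABLE account_txn_table (
    transaction_id    NUMBER,
    account_id        NUMBER,
    transaction_type  VARCHAR2(20),
    value             NUMBER(12,2),
    transaction_date  DATE,
    region            VARCHAR2(20),
    branch_name       VARCHAR2(50)
)
PARTITION BY LIST (region) (
    PARTITION p_north  VALUES ('NORTH'),
    PARTITION p_south  VALUES ('SOUTH'),
    PARTITION p_others VALUES (DEFAULT)
);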

3. Illustrate the two ways of vertical partitioning.


Vertical partitioning divides a table into multiple tables by separating columns, rather than
rows. It aims to improve performance, reduce redundancy, or isolate data based on how it's
accessed. The two main approaches are normalization and row splitting, or as some call it, a
columnar approach.

1. Normalization:

 This involves removing redundant columns from a table and placing them in separate
tables linked with foreign keys.
 Example: A customer table could be split into two: one for basic information (name,
address) and another for contact information (phone, email).
 This reduces redundancy and improves data integrity.
2. Row Splitting (Columnar Approach) — a short SQL sketch follows this list:

 This involves splitting the original table into multiple tables, each containing a subset
of the columns.
 Example: A table with user information and a separate table for user preferences.
 This can improve query performance by reducing the amount of data accessed for
specific queries.
 It also allows for different storage methods for different columns, like storing large
BLOBs in a separate table.
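
Below is a short, hedged sketch of row splitting in generic SQL; the user tables and columns are hypothetical, and the exact large-object types vary by DBMS. The wide table is split into two tables sharing the same primary key, preserving the one-to-one mapping:

-- Frequently accessed columns stay in the core table...
CREATE TABLE user_core (
    user_id    INT PRIMARY KEY,
    user_name  VARCHAR(100),
    email      VARCHAR(100)
);

-- ...while rarely accessed, bulky columns move to a 1:1 companion table.
CREATE TABLE user_preferences (
    user_id      INT PRIMARY KEY REFERENCES user_core(user_id),
    preferences  VARCHAR(4000),
    avatar_image BLOB
);

Queries that only need core user information never touch the bulky columns, which is where the performance gain comes from.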

4. Summarize the challenges for metadata management


Metadata management faces challenges like fragmented data, inconsistent standards, and complex
workflows, hindering efficient data sharing, reuse, and analysis. Additionally, maintaining
data quality and accuracy, ensuring regulatory compliance, and managing manual processes
become increasingly difficult as data volumes grow.
Here's a more detailed look at the challenges:
1. Fragmented Data and Siloed Information: Different departments and systems may use
varying metadata standards, making it difficult to integrate and manage data effectively. This
siloed approach hinders collaboration and limits the ability to derive insights from the data.
2. Inconsistent Standards: A lack of standardization in metadata creation, extraction, and
distribution leads to inconsistencies and inefficiencies. This can make it difficult to
understand and use data effectively across the organization.
3. Complex Workflows: Managing the content lifecycle, from creation to distribution,
involves complex workflows that can be challenging to manage.
4. Manual Processes: Traditional metadata management relies heavily on manual processes,
which are time-consuming and error-prone. As data volumes increase, these manual methods
become unsustainable.
5. Data Quality and Accuracy: Maintaining accurate and high-quality metadata is crucial
for informed decision-making. Inaccurate or incomplete metadata can lead to poor decisions
and wasted resources.

UNIT IV DIMENSIONAL MODELING AND SCHEMA


PART A
1. Define dimensional modeling It is a logical design technique to structure the business
dimensions and the metrics that are analyzed along these dimensions. The model has also
proved to provide high performance for queries and analysis.
2. What is multidimensional modeling? The multi-dimensional Data Model is a method
which is used for ordering data in the database along with good arrangement and assembling
of the contents in the database. The Multi-dimensional Data Model allows customers to
interrogate analytical questions associated with market or business trends, unlike relational
databases which allow customers to access data in the form of queries.
3. What do you mean by data cube? It is defined by dimensions and facts. The dimensions
are the entities with respect to which an enterprise preserves the records. When data is
grouped or combined in multidimensional matrices it is called Data Cubes. The data cube
method has a few alternative names or a few variants, such as "Multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."
4. Define schema Schema is a logical description of the entire database. It includes the name
and description of records of all record types including all associated data-items and
aggregates.
5. What are the characteristics of Star Schema?  It creates a denormalized database that
can quickly provide query responses.  It provides a flexible design that can be changed
easily or added to throughout the development cycle, and as the database grows.  It provides
a parallel in design to how end-users typically think of and use the data.  It reduces the
complexity of metadata for both developers and end-users.
6. Define snowflake schema A schema is known as a snowflake if one or more dimension
tables do not connect directly to the fact table but must join through other dimension tables.
7. What is fact constellation schema? Fact Constellation Schema describes a logical
structure of data warehouse or data mart. Fact Constellation Schema can be designed with a
collection of denormalized FACT, Shared, and Conformed Dimension tables. A Fact
constellation means two or more fact tables sharing one or more dimensions. It is also called
Galaxy schema.
8. What do you mean by centralized process architecture? In centralized process
architecture, the data is collected into a single centralized storage and processed
upon completion by a single machine with a huge structure in terms of memory, processor,
and storage.
9. What do you mean by distributed process architecture? In distributed process
architecture, information and its processing are allocated across data centers; processing
of the data is localized, and the results are grouped into centralized storage.
10. What is Intraquery parallelism ? Intraquery parallelism defines the execution of a
single query in parallel on multiple processors and disks. Using intraquery parallelism is
essential for speeding up long-running queries.
11. What is interquery parallelism? In interquery parallelism, different queries or
transactions execute in parallel with one another. This form of parallelism can increase
transaction throughput. The response times of individual transactions are not faster than they
would be if the transactions were run in isolation.
12. State the difference between star schema and snowflake schema
Star Schema:
 Hierarchies for the dimensions are stored in the dimensional table.
 It contains a fact table surrounded by dimension tables.
 Only a single join creates the relationship between the fact table and any dimension table.
 Simple DB design.
 Denormalized data structure; queries run faster.
 High level of data redundancy.
 A single dimension table contains aggregated data.
 Cube processing is faster.
 Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions.
Snowflake Schema:
 Hierarchies are divided into separate tables.
 One fact table is surrounded by dimension tables, which are in turn surrounded by dimension tables.
 Requires many joins to fetch the data.
 Very complex DB design.
 Normalized data structure.
 Very low level of data redundancy.
 Data is split into different dimension tables.
 Cube processing might be slow because of the complex joins.
 Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.
13. State the objectives of Dimensional Modeling The purposes of dimensional modeling
are:  To produce database architecture that is easy for end-clients to understand and write
queries.  To maximize the efficiency of queries. It achieves these goals by minimizing the
number of tables and relationships between them.
14. What are the characteristics of Dimension table? a. Dimension tables contain the
details about the facts. b. The dimension tables include descriptive data about the numerical
values in the fact table. That is, they contain the attributes of the facts. c. Since the record in a
dimension table is denormalized, it usually has a large number of columns. The dimension
tables include significantly fewer rows of information than the fact table. d. The attributes in
a dimension table are used as row and column headings in a document or query results
display.
15. What are the characteristics of Fact table? a. The fact table includes numerical values
of what we measure. For example, a fact value of 20 might means that 20 widgets have been
sold. b. Each fact table includes the keys to associated dimension tables. These are known as
foreign keys in the fact table. c. Fact tables typically include a small number of columns.
When it is compared to dimension tables, fact tables have a large number of rows.
16. Define Fact Fact is a collection of associated data items, consisting of measures and
context data. It typically represents business items or business transactions.
17. Define Dimensions Dimension is a collection of data which describe one business
dimension. Dimensions decide the contextual background for the facts, and they are the
framework over which OLAP is performed.
18. What is Snowflaking? Snowflaking is a method of normalizing the dimension tables in a
STAR schema. When we normalize all the dimension tables entirely, the resultant
structure resembles a snowflake with the fact table in the middle. Snowflaking is used to
improve the performance of specific queries.
19. What is Horizontal Parallelism? Horizontal Parallelism means that the database is
partitioned across multiple disks, and parallel processing occurs within a specific task (i.e.,
table scan) that is performed concurrently on different processors against different sets of
data.
20. What is vertical Parallelism? Vertical Parallelism occurs among various tasks. All
component query operations (i.e., scan, join, and sort) are executed in parallel in a pipelined
fashion
UNIT IV
PART B
1. Design a star-schema, snow-flake schema and Fact-constellation schema for the
following data warehouse that consist of the following four dimensions: (Time, Item,
Branch and Location) .Include the appropriate measures required for the schemas.

Star Schema

 Each dimension in a star schema is represented with only one-dimension table.


 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street,
city, province_or_state,country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" both the cities are in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes province_or_state
and country.
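
A hedged relational sketch of the same star schema in generic SQL DDL (names follow the dimensions and measures above; data types are illustrative), complementing the DMQL definitions given later in this answer:

-- One table per dimension
CREATE TABLE time_dim     (time_key INT PRIMARY KEY, day DATE, day_of_week VARCHAR(10),
                           month VARCHAR(10), quarter VARCHAR(2), year INT);
CREATE TABLE item_dim     (item_key INT PRIMARY KEY, item_name VARCHAR(50),
                           brand VARCHAR(30), type VARCHAR(30), supplier_type VARCHAR(30));
CREATE TABLE branch_dim   (branch_key INT PRIMARY KEY, branch_name VARCHAR(50),
                           branch_type VARCHAR(30));
CREATE TABLE location_dim (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(30),
                           province_or_state VARCHAR(30), country VARCHAR(30));

-- Central fact table referencing every dimension and holding the measures
CREATE TABLE sales_fact (
    time_key     INT REFERENCES time_dim(time_key),
    item_key     INT REFERENCES item_dim(item_key),
    branch_key   INT REFERENCES branch_dim(branch_key),
    location_key INT REFERENCES location_dim(location_key),
    dollars_sold DECIMAL(12,2),
    units_sold   INT
);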

Snowflake Schema

 Some dimension tables in the Snowflake schema are normalized.


 The normalization splits up the data into additional tables.
 Unlike Star schema, the dimensions table in a snowflake schema are normalized. For
example, the item dimension table in star schema is normalized and split into two
dimension tables, namely item and supplier table.
 Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the Snowflake schema, the redundancy is reduced and
therefore it becomes easy to maintain and saves storage space.

Fact Constellation Schema

 A fact constellation has multiple fact tables. It is also known as galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.

 The sales fact table is same as that in the star schema.


 The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
 The shipping fact table also contains two measures, namely dollars cost and units
shipped.
 It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact
table.

Schema Definition

Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.

Syntax for Cube Definition

define cube < cube_name > [ < dimension-list > ]: < measure_list >
Syntax for Dimension Definition

define dimension < dimension_name > as ( < attribute_or_dimension_list > )

Star Schema Definition

The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −

define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition

Snowflake schema can be defined using DMQL as follows −

define cube sales snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))

Fact Constellation Schema Definition

Fact constellation schema can be defined using DMQL as follows −

define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)


define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales


define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
Schema is a logical description of the entire database. It includes the name and description of
records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse also requires a schema to be maintained. A database uses the
relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation
schemas discussed above.
2. Discuss about multidimensional database, data mart and data cube? Explain schemas
for multi-dimensional database.
A multidimensional database (MDB) uses a data cube structure for storing and retrieving
aggregated data, optimized for analytical queries. A data mart is a smaller, focused subset of
a data warehouse designed for specific business units. A data cube, the core element of an
MDB, organizes data across multiple dimensions, allowing for efficient analysis and
reporting. MDBs utilize schemas like star, snowflake, and fact constellation to structure the
data.

Multidimensional Database (MDB):

 Purpose:
Primarily used for Online Analytical Processing (OLAP) and data warehousing,
providing fast access to summarized data for analytical queries.
 Structure:
Organizes data into a data cube, where dimensions represent different perspectives of the
data (e.g., time, location, product), and measures are numerical values representing the
data (e.g., sales).
 Advantages:
Efficient for complex queries, allows for slicing and dicing data, and provides a
structured way to analyze large datasets.

 Example:
Sales data can be viewed by product, region, and time period.
Data Mart:

 Purpose:
A specialized subset of a data warehouse, focused on a specific department or business
unit.
 Structure:
Often uses a simplified schema, such as a star schema, to focus on the data relevant to
that unit.
 Advantages:
Provides faster access to specific data for a department, reducing the need to query the
entire data warehouse.
 Example:
A marketing data mart containing information about customer demographics, purchase
history, and campaign effectiveness.
Data Cube:

 Purpose:
The core storage structure within an MDB, providing a multi-dimensional array of data.
 Structure:
Consists of dimensions, which are the perspectives of the data, and measures, which are
the numerical values being analyzed.
 Function:
Enables users to analyze data from different angles, such as slicing and dicing the data
based on different dimensions.
 Example:
A sales data cube might have dimensions like product, region, and time, and a measure like
sales revenue, allowing users to analyze sales trends by region and time (a SQL sketch follows).
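
To make the slicing-and-dicing idea concrete, here is a hedged SQL sketch; the sales_fact table and its columns are illustrative, and GROUP BY CUBE is assumed to be available (it is supported by Oracle, SQL Server, and PostgreSQL). The query computes sales revenue for every combination of product and region, i.e., the cells of a small two-dimensional data cube:

-- Aggregate revenue for all combinations of product and region,
-- including sub-totals and a grand total.
SELECT product,
       region,
       SUM(sales_revenue) AS total_revenue
FROM   sales_fact
GROUP  BY CUBE (product, region);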
Schemas for Multidimensional Databases:

 Star Schema:
Features a central fact table (containing measures) linked to multiple dimension tables
(containing attributes of the dimensions). It is simple and efficient for querying large
datasets.
 Snowflake Schema:
Extends the star schema by further normalizing dimension tables into multiple
tables. This can improve data integrity but increases complexity.
 Fact Constellation Schema:
Contains multiple fact tables that share dimension tables. This is useful when dealing
with multiple business processes or subject areas.

3. Explain the star schema, snowflake schema and fact constellation schema with
examples.

Star Schema:The star schema is a widely used schema design in data warehousing. It

features a central fact table that holds the primary data or measures, such as sales, revenue,

or quantities. The fact table is connected to multiple dimension tables, each representing

different attributes or characteristics related to the data in the fact table. The dimension

tables are not directly connected to each other, creating a simple and easy-to-understand

structure.

Simplicity: Star schema is the simplest and most straightforward schema design, with fewer

tables and relationships. It provides ease of understanding, querying, and report generation.

Denormalization: Dimension tables in star schema are often denormalized, meaning they

may contain redundant data to optimize query performance.


Example: Consider a retail data warehouse. The fact table might contain sales data with

measures like “Total Sales” and “Quantity Sold.” The dimension tables could include

“Product” with attributes like “Product ID,” “Product Name,” and “Category,” and “Time”

with attributes like “Date,” “Month,” and “Year.” The fact table connects to these dimension

tables through foreign keys, allowing analysts to perform queries like “Total Sales by Product

Category” or “Quantity Sold by Date.”

Snowflake Schema: The snowflake schema is an extension of the star schema, designed
to further reduce data redundancy by normalizing the dimension tables. In a snowflake
schema, dimension tables are broken down into multiple related sub-tables. This
normalization creates a more complex structure with additional levels of relationships,
reducing storage requirements but potentially increasing query complexity due to the
need for additional joins.

Normalization: Snowflake schema normalizes dimension tables, resulting in more tables and

more complex relationships compared to the star schema.

Space Efficiency: Due to normalization, the snowflake schema may require less storage

space for dimension data, but may lead to more complex queries due to additional joins.
Example: Continuing with the retail data warehouse example, in a snowflake schema, the

“Product” dimension may be normalized into sub-tables like “Product_Category,”

“Product_Subcategory,” and “Product_Details,” each holding specific attributes related to the

product. This normalization allows for efficient storage of data, but it may require more

complex queries to navigate through the snowflake structure.
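
A brief, hedged sketch of the snowflaked product dimension described above, in generic SQL (all table and column names, including the assumed retail_sales_fact fact table, are illustrative):

-- Category details are split out of the product dimension...
CREATE TABLE product_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE product_dim (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id  INT REFERENCES product_category(category_id)
);

-- Illustrative fact table for the retail example
CREATE TABLE retail_sales_fact (
    product_id  INT REFERENCES product_dim(product_id),
    total_sales DECIMAL(12,2)
);

-- ...so a query on sales by category now needs an extra join.
SELECT pc.category_name, SUM(f.total_sales) AS total_sales
FROM   retail_sales_fact f
JOIN   product_dim p        ON f.product_id  = p.product_id
JOIN   product_category pc  ON p.category_id = pc.category_id
GROUP  BY pc.category_name;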

Fact Constellation (Galaxy Schema):The fact constellation schema, also known as a


galaxy schema, is a more complex design that involves multiple fact tables sharing
dimension tables. It is used when there are multiple fact tables with different measures
and each fact table is related to several common dimension tables.

Complexity: Fact constellation schema is the most complex among the three designs, as it

involves multiple interconnected star schemas.

Flexibility: This schema design offers more flexibility in modeling complex and diverse

business scenarios, allowing multiple fact tables to coexist and share dimensions.

Example: In a data warehouse for a healthcare organization, there could be multiple fact tables

representing different metrics like patient admissions, medical procedures, and medication

dispensing. These fact tables would share common dimension tables like “Patient,” “Doctor,”
and “Date.” The fact constellation schema allows analysts to analyze different aspects of

healthcare operations while efficiently reusing shared dimension tables.
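
A compact, hedged sketch of such a galaxy in generic SQL DDL (all names are illustrative): two fact tables share the same conformed dimension tables.

-- Shared (conformed) dimensions
CREATE TABLE patient_dim (patient_key INT PRIMARY KEY, patient_name VARCHAR(100));
CREATE TABLE date_dim    (date_key    INT PRIMARY KEY, full_date DATE);

-- First fact table: patient admissions
CREATE TABLE admissions_fact (
    patient_key    INT REFERENCES patient_dim(patient_key),
    date_key       INT REFERENCES date_dim(date_key),
    admission_cost DECIMAL(10,2)
);

-- Second fact table: medical procedures, reusing the same dimensions
CREATE TABLE procedures_fact (
    patient_key     INT REFERENCES patient_dim(patient_key),
    date_key        INT REFERENCES date_dim(date_key),
    procedure_count INT
);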

4. Construct a star schema for hospital management system


Star Schema Design Of Hospitals

Introduction-Star Schema Design Of Hospitals


Hospitals are data-driven organizations. They need important documents at the time of
problems, and documents can clarify important situations when the reputation of the hospital
is at stake. It is essential for hospitals to have proper documentation in place at the time of an
important surgery, and routine operations go smoothly if proper documentation is done on a
daily basis; hospitals will not suffer from major equipment shortages at the time of an
operation if proper documentation is in place. This assessment is developed to identify the
four dimensions of Kimball's model and to develop a star schema model that can be
implemented in the hospital's operational environment to reduce deaths and other routine
operational problems.
1) Explanation of the assumptions
The star schema model has been introduced to produce a data warehouse for the hospital.
The hospital has faced bad publicity due to the deaths of numerous patients, and it has also
faced problems with its routine operations. The star schema model has been developed to
address these challenges. Different sections of data have been collected to make the star
schema model (Zhu et al., 2017). Following are some of the areas that have been selected for
data collection:
 Administrative data: Hospital administrative data is the fundamental aspect of star
schema model. This model has been developed to give a brief overview about
different sections of the hospital. People from different sections of the hospital have
been interviewed for information. The emergency section was prioritized when compiling the
administrative data, general ward employees were interviewed for the same purpose, and
maternity ward authorities were also interviewed as part of the data collection procedure.
 Patient Registers: Patient registries are an area that was under special consideration
while making the star schema model (Gorelik, 2019). It is essential for the hospital
authority to take special precaution while making entries for the admitted patients.
Interviews with different employees were taken into consideration while making the
star schema model
 Insurance and claims related Information: It is essential for every hospital to have
proper documentation in place to help patients with their healthcare cost coverage.
Collection of patient data is essential for this purpose. Employees responsible for
healthcare package documentation were interviewed for gaining insights about the
requirements of the hospital.
 Different electronic data of hospitals: Hospitals use numerous electronic data to have
a better understanding of patient well being. Different doctors and departmental heads
have been interviewed to have a better understanding about the circulation of
electronic data associated with different equipment. Feedback from various doctors
has been considered while making this star schema model. This mode of data
collection was effective as loopholes identified by internal stakeholders could be
integrated into the proposed model.
2) Rationale of the design
Star Schema model has been developed based on Kimball's dimensional model. Kimball’s
dimensional model has four steps for data warehousing. Following are the four steps of
Kimball’s model (Kimball, 1996).

Figure 1: Kimball’s Four Star Schema Steps


(Source: Kimball, 1996)
 Selecting process of business: It is essential for the organization to have a clear idea
about the definite process that would be covered by the star schema model. If the
hospital needs emergency related data, star schema model will provide results for
emergency data.
 Grain declaration: Grain means the level of data that will be stored in the fact table of
Star schema. Data stored within the fact table cannot be divided into further groups.
 Identifying different dimensions: Identification of different dimensions will include
the statistics in table format. Data will be stored under various dimensions to give a
clear idea about the associated fact table [Refer to Fact table and Dimension tables
for each of the fact tables].
 Identification of numerical facts: This section will indicate the numbers associated
with each dimension. This section provides the answer searched by the user. The
facts collected should match the declared grain. Numerous fact tables are used to
define different sets of data.
3) Star Schema diagram
Fact table and dimension tables are mentioned in detail as follows:
Fact Table

Register_document_key

Revenue_document_key

Daily_task_document_key

Administrative_key

Patient_Directory_key

Doctor_dimension_key

Diagnosis_key

Health_Insurance_key

Dimension table for Register_document_key

Patient_name_key

Issue_details_key

Registration_id_key

Guardian_contact_key

Register_document_key indicates the presence of details of patients who seek admission to
the hospital. The presence of the name, issue details, and registration ID will be beneficial to
provide smooth service in the hospital. Further, the presence of guardian contact details will
help in contacting the guardian in case of emergency without any barriers.
Dimension table for Revenue_document_key

Revenue_cost_document

Target_revenue_document

Target_patient_document
The revenue document key is an added measure through which total revenue can be calculated.
Keys such as cost, target, and patient admission details will be helpful to smoothly maintain
the financial details of the hospital.
Dimension table for Daily_task_document_key

Daily_task_details

Daily_task_id

Responsible_person_for_daily_task

The daily task document will be effective in tracking each day's tasks to maintain smooth
service in the hospital. Documentation of the task ID and responsibility will be beneficial in
eliminating bias and dilemmas during service.
Dimension table for Administrative_key

Emergency_document_administration_key

Emergency_policies_document

Emergency_equipment_requirements_document_key

Maternity_ward_administrative_documentation_key

Maternity_ward_policies_document_name

General_ward_administrative_documentation_key

General_ward_policies_document_name
Administrative Department
First, the Administrative key has been included within the fact table. Various administrative
documents and information regarding healthcare policies will be located under the
Administrative key. Various shift timings and documents regarding different policies will be
under the dimension tables of the administrative department. This information will avoid any
confusion regarding hospital policies, and administrative staff will be confident while
describing them. Documentation of policies in the maternity ward and general wards can help
in maintaining effective functionality in the hospital. Further, management of documents
regarding administrative details can help eliminate documentation issues and improve the
serviceability of any hospital.
Dimension table for Patient _Directory_Key

Patient_gender_documentation_key

Male_document_id

Female_ document_id

Transgender_ document_id

Patient_medical_history_ document_key

Previous_symptoms_document

Special_attributes_document

Patient_admission_time_document

Next, the Patient Directory key has been considered within the fact table. Different information
regarding the patient has been considered to make the dimension table of the star schema
model. Patient gender information, the patient's medical history with the hospital, and patient
admission timings will each sit under the corresponding dimension keys. Identification of the
gender of patients based on document history will be beneficial for maintaining smooth
registration services. Further, documentation of a patient's previous symptoms and other
medical history can be helpful during further treatment to easily understand the details and
treat the patient as per requirements.
Doctor dimension key
Doctor_information_key_document

Doctor_Names_ document

Doctor_timings _document

Doctor_shifts_document

Doctor_contact_document

Doctor_information_according_to_ward_ document_Key

The above two tables list the dimension keys associated with hospital doctor information.
Doctor names, timings, shifts, and contact details will be under the Doctor_information_key
section. Doctor information according to different wards, including timings and shift
assignments for each ward, will be provided under the Doctor_information_according_to_ward
key. The presence of doctor details regarding shifts and ward information will be helpful in
eliminating unnecessary doubts during service. Further, it will help in conducting proper
operations, especially during any emergency.
Diagnosis_Key Dimension Table

Diagnosis_history_document_Key

Patient_Requirements_ document_key

Diagnosis_timings_number_document

Diagnosis_result_ document

Date_number

Time_number

The diagnosis key dimension table will capture the necessary requirements and details of
patients. Different patients will require various tests according to their needs, and the
diagnosis key will help the hospital keep track of the different diagnosis reports of each
patient (Rosita, 2021). This will reduce unnecessary confusion for the hospital, allowing it to
complete regular operations in an effective way. Confusion regarding complex operations will
be reduced thanks to the diagnosis directory of the hospital, and recorded delivery timings of
diagnosis reports will clear up confusion regarding patient treatment (Cimpoiasu et al., 2021).
Health_Insurance_key

Patient_personal_information_document_key

Patient_name_document

Patient_contact_details

Patient_Health_Insurance_History _document

Previous_coverage_document

The Health Insurance key will give the patient's personal information, and the patient's
contact details will be provided under this dimension table. The patient's health insurance key
will provide the hospital with valuable information regarding the patient's medical coverage,
helping the hospital make decisions in a more effective manner (Rocha, Capelo and Ciferri,
2020). Doctors will be able to suggest suitable alternatives according to the patient's medical
coverage. Effective health insurance coverage will help the hospital recover its lost glory
within a short period. These details will further help insurance organisations when policies
come up for renewal. This documentation is also beneficial for treating patients at
comparatively lower costs based on the criteria of their insurance policies (a relational sketch
of the overall schema follows).
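
A hedged relational sketch of the proposed hospital star schema in generic SQL DDL; only the fact table and two of the dimension tables are shown, data types are illustrative, and the numeric measure column is an assumption added for completeness:

-- Dimension table for patient registration documents
CREATE TABLE register_document_dim (
    register_document_key INT PRIMARY KEY,
    patient_name          VARCHAR(100),
    issue_details         VARCHAR(255),
    registration_id       VARCHAR(30),
    guardian_contact      VARCHAR(30)
);

-- Dimension table for doctor information
CREATE TABLE doctor_dim (
    doctor_dimension_key INT PRIMARY KEY,
    doctor_name          VARCHAR(100),
    doctor_shift         VARCHAR(30),
    ward                 VARCHAR(50)
);

-- Central fact table referencing the dimension keys listed in the design above
CREATE TABLE hospital_fact (
    register_document_key   INT REFERENCES register_document_dim(register_document_key),
    doctor_dimension_key    INT REFERENCES doctor_dim(doctor_dimension_key),
    administrative_key      INT,
    patient_directory_key   INT,
    diagnosis_key           INT,
    health_insurance_key    INT,
    revenue_document_key    INT,
    daily_task_document_key INT,
    treatment_cost          DECIMAL(12,2)  -- illustrative measure, not part of the original key list
);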
4) Two examples of Star schema
Insurance clear up
Several issues were continually arising due to complications with paperwork. This
organisation did not have any framework for resolving such paperwork-related issues. A
major problem that this organisation was not able to handle was separating relevant data from
its unnecessary counterparts. These aspects were further aggravating problems with client
dissatisfaction. A star schema model was utilised by this organisation to resolve this issue.
Health insurance: Health_Incurance_Key
This key was used by members of that organisation to find information related to any
specific client being treated at the hospital. Using a star schema model helped organize the
information that had been provided by clients (Bacry et al., 2020). Using this model
simplified an otherwise complicated procedure of separating necessary facts from a massive
number of resources. The entire information provided by clients was input into this star
schema, and only the aspects of insurance that could be covered by service providers for
these clients were displayed. Thus, a dimension table and its supporting facts were used and
explained by this organisation to its disgruntled customers.
Solving X-Ray complications
This hospital was facing problems in solving simple issues of daily working. As a result, it
was looking for a system that could solve the problems that were causing major hiccups on a
daily basis. Problems with arranging daily functions were undermining the hospital's ability
to function. Upon introduction of this star schema model, attributes related to specific
patients were keyed and tagged (Zhu et al., 2017). This ensured that there would be no further
problems regarding miscommunication of reports.
Fact table: Diagnosis_Key
These keys were assigned by members of this organisation to pinpoint all aspects that were
necessary for any specific situation. As a result, diagnosis of patients became even simpler
since all information related to them was easily available.
Conclusion
It is concluded that assumptions about different operational services in the hospital
environment were needed due to a lack of operational linkage and of responsibilities taken up
by higher authorities. In this context, the design has identified facts and dimensions covering
administrative, doctor, patient, insurance, diagnosis, and other information. Moreover, the
insurance cover and X-ray diagnosis examples show how the star schema model can be
implemented to address hospital issues.

UNIT V
PART A
1. State the responsibilities of System Configuration Manager  The system configuration
manager is responsible for the management of the setup and configuration of data warehouse.
 The structure of configuration manager varies from one operating system to another.  In
UNIX structure of configuration, the manager varies from vendor to vendor.  Configuration
managers have single user interface.  The interface of configuration manager allows us to
control all aspects of the system.
2. State the responsibilities of System Scheduling Manager System Scheduling Manager
is responsible for the successful implementation of the data warehouse. Its purpose is to
schedule ad hoc queries. Every operating system has its own scheduler with some form of
batch control mechanism.
3. State the features of System Scheduling manager  Work across cluster or MPP
boundaries  Deal with international time differences  Handle job failure  Handle multiple
queries  Support job priorities  Restart or re-queue the failed jobs  Notify the user or a
process when job is completed  Maintain the job schedules across system outages
4. Define Event Events are the actions that are generated by the user or the system itself. It
may be noted that the event is a measurable, observable, occurrence of a defined action.
5. What is the role of process managers? Process managers are responsible for maintaining
the flow of data both into and out of the data warehouse. There are three different types of
process
 Load manager  Warehouse manager  Query manager
6. State the responsibilities of warehouse manager The warehouse manager is responsible
for the warehouse management process. It consists of a third-party system software, C
programs, and shell scripts. The size and complexity of a warehouse manager varies between
specific solutions
7. State the functions of Warehouse Manager A warehouse manager performs the
following functions −  Analyzes the data to perform consistency and referential integrity
checks.  Creates indexes, business views, partition views against the base data.  Generates
new aggregations and updates the existing aggregations.  Generates normalizations. 
Transforms and merges the source data of the temporary store into the published data
warehouse.
8. State the functions of load manager The load manager performs the following functions
 Extract data from the source system.  Fast load the extracted data into temporary data
store.  Perform simple transformations into structure similar to the one in the data
warehouse.
9. State the responsibilities of query manager The query manager is responsible for
directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds
up the query request and response process. In addition, the query manager is responsible for
scheduling the execution of the queries posted by the user
10. What are the components of query manager? A query manager includes the following
components –  Query redirection via C tool or RDBMS  Stored procedures  Query
management tool  Query scheduling via C tool or RDBMS  Query scheduling via third-
party software
11. List the functions of Query Manager  It presents the data to the user in a form they
understand.  It schedules the execution of the queries posted by the end-user.  It stores
query profiles to allow the warehouse manager to determine which indexes and aggregations
are appropriate.
12. Why tuning a data warehouse is difficult? Tuning a data warehouse is a difficult
procedure due to following reasons  Data warehouse is dynamic; it never remains constant. 
It is very difficult to predict what query the user is going to post in the future.  Business
requirements change with time.  Users and their profiles keep changing.  The user can
switch from one group to another.  The data load on the warehouse also changes with time.
13. What are the two kinds of queries in data warehouse? The two kinds of queries in data
warehouse are  Fixed queries  Ad hoc queries
14. What is Unit Testing?  In unit testing, each component is separately tested.  Each
module, i.e., procedure, program, SQL Script, Unix shell is tested.  This test is performed by
the developer.
15. What is Integration Testing?  In integration testing, the various modules of the
application are brought together and then tested against the number of inputs.  It is
performed to test whether the various components do well after integration.
16. What is System Testing?  In system testing, the whole data warehouse application is
tested together.  The purpose of system testing is to check whether the entire system works
correctly together or not.  System testing is performed by the testing team.  Since the size
of the whole data warehouse is very large, it is usually possible to perform minimal system
testing before the test plan can be enacted.
17. List the scenarios for which testing is needed  Media failure  Loss or damage of table
space or data file  Loss or damage of redo log file  Loss or damage of control file  Instance
failure  Loss or damage of archive file  Loss or damage of table  Failure during data
failure
18. List out few criteria that are required for choosing a system and database manager.
 Increase user's quota.  assign and de-assign roles to the users  assign and de-assign the
profiles to the users  perform database space management  monitor and report on space
usage  tidy up fragmented and unused space  add and expand the space  add and remove
users  manage user password
19. List some common events that need to be tracked.  Hardware failure  Running out of
space on certain key disks  A process dying  A process returning an error  CPU usage
exceeding an 80% threshold  Internal contention on database serialization points  Buffer
cache hit ratios exceeding or falling below a threshold  A table reaching the maximum of its
size  Excessive memory swapping
20. What are the aspects to be considered while testing operational environment. 
Security  Scheduler  Disk Configuration.  Management Tools
Unit v
Part B
1. Describe in detail about working of system scheduling manager.
System management is mandatory for the successful implementation of a data warehouse. The most important system managers are:
 System configuration manager
 System scheduling manager
 System event manager
 System database manager
 System backup recovery manager
System Configuration Manager
The system configuration manager is responsible for the management of the setup and configuration of the data warehouse. The structure of the configuration manager varies from one operating system to another; in Unix, the structure of the configuration manager varies from vendor to vendor. Configuration managers have a single user interface, and this interface allows us to control all aspects of the system.
Note: The most important configuration tool is the I/O manager.
System Scheduling Manager
The system scheduling manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism.
The list of features a system scheduling manager must have is as follows:
 Work across cluster or MPP boundaries
 Deal with international time differences
 Handle job failure
 Handle multiple queries
 Support job priorities
 Restart or re-queue failed jobs
 Notify the user or a process when a job is completed
 Maintain the job schedules across system outages
 Re-queue jobs to other queues
 Support the stopping and starting of queues
 Log queued jobs
 Deal with inter-queue processing
Note: The above list can be used as evaluation parameters for a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows:
 Daily and ad hoc query scheduling
 Execution of regular report requirements
 Data load
 Data processing
 Index creation
 Backup
 Aggregation creation
 Data transformation
Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.
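The sketch below shows, in Python, how a minimal batch scheduler with some of these features might look (job priorities, re-queueing a failed job, notifying when a job completes). It is illustrative only; names such as SimpleScheduler, Job and retries_left are assumptions for this example, not part of any real scheduling product.

```python
# Minimal sketch of a batch job scheduler: priority ordering, retry/re-queue
# on failure, and a completion notification. Illustrative assumptions only.
import heapq
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Job:
    priority: int                                  # lower value = higher priority
    name: str = field(compare=False)
    action: Callable[[], None] = field(compare=False)
    retries_left: int = field(default=2, compare=False)

class SimpleScheduler:
    def __init__(self, notify: Callable[[str], None] = print):
        self._queue: list[Job] = []
        self._notify = notify

    def submit(self, job: Job) -> None:
        heapq.heappush(self._queue, job)           # keep jobs ordered by priority

    def run(self) -> None:
        while self._queue:
            job = heapq.heappop(self._queue)
            try:
                job.action()
                self._notify(f"job '{job.name}' completed")
            except Exception as exc:               # handle job failure
                if job.retries_left > 0:
                    job.retries_left -= 1
                    self.submit(job)               # re-queue the failed job
                else:
                    self._notify(f"job '{job.name}' failed permanently: {exc}")

# Example usage: a daily data load scheduled ahead of an aggregation build.
sched = SimpleScheduler()
sched.submit(Job(priority=1, name="daily data load", action=lambda: time.sleep(0.1)))
sched.submit(Job(priority=2, name="aggregation creation", action=lambda: time.sleep(0.1)))
sched.run()
```

A production scheduler would, of course, also persist its queue so that job schedules survive system outages and would coordinate queues across cluster or MPP nodes.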
System Event Manager
The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex; therefore, we need a tool that automatically handles all the events without any intervention of the user.
Note: The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action. Given below is a list of common events that are required to be tracked:
 Hardware failure
 Running out of space on certain key disks
 A process dying
 A process returning an error
 CPU usage exceeding an 80% threshold
 Internal contention on database serialization points
 Buffer cache hit ratios exceeding or falling below a threshold
 A table reaching its maximum size
 Excessive memory swapping
 A table failing to extend due to lack of space
 Disks exhibiting I/O bottlenecks
 Usage of the temporary or sort area reaching a certain threshold
 Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler; this code is executed whenever an event occurs.
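As a concrete illustration of event handlers, the sketch below maps a few of the events listed above to small handler functions and fires them when a metric crosses its threshold. The metric names and threshold values are assumptions made for this example, not values from any particular monitoring tool.

```python
# Illustrative event manager: each predefined event has an event handler that
# runs automatically when the event occurs. Metrics/thresholds are examples.
from typing import Dict

def on_low_disk_space(value: float) -> None:
    print(f"ALERT: key disk only {value:.0f}% free - tidy up or add space")

def on_high_cpu(value: float) -> None:
    print(f"ALERT: CPU usage at {value:.0f}% - exceeds the 80% threshold")

# Each entry: (metric name, condition that detects the event, handler to run).
EVENTS = [
    ("disk_free_pct", lambda v: v < 10, on_low_disk_space),
    ("cpu_usage_pct", lambda v: v > 80, on_high_cpu),
]

def check_events(metrics: Dict[str, float]) -> None:
    """Fire the handler for every event whose condition is met."""
    for metric, condition, handler in EVENTS:
        value = metrics.get(metric)
        if value is not None and condition(value):
            handler(value)

# Example usage with a hypothetical snapshot of system metrics.
check_events({"disk_free_pct": 6.0, "cpu_usage_pct": 91.0})
```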
System and Database Manager
The system manager and the database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and database manager are as follows:
 Increase a user's quota
 Assign and de-assign roles to the users
 Assign and de-assign profiles to the users
 Perform database space management
 Monitor and report on space usage
 Tidy up fragmented and unused space
 Add and expand the space
 Add and remove users
 Manage user passwords
 Manage summary or temporary tables
 Assign or de-assign temporary space to and from the user
 Reclaim the space from old or out-of-date temporary tables
 Manage error and trace logs
 Browse log and trace files
 Redirect error or trace information
 Switch on and off error and trace logging
 Perform system space management
 Monitor and report on space usage
 Clean up old and unused file directories
 Add or expand space
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features required for the management of backups are as follows:
 Scheduling
 Backup data tracking
 Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember:
 The backup software will keep some form of database of where and when each piece of data was backed up.
 The backup recovery manager must have a good front end to that database.
 The backup recovery software should be database aware. Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.
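A minimal sketch of the backup data tracking idea is shown below: the backup manager keeps a small catalog recording where and when each piece of data was backed up, and offers a front end to query it. The catalog layout and file paths are hypothetical.

```python
# Hedged sketch of backup data tracking: a catalog of where and when each
# object was backed up, plus a lookup used during recovery.
import sqlite3
from datetime import datetime, timezone

catalog = sqlite3.connect(":memory:")      # stands in for the backup catalog
catalog.execute(
    "CREATE TABLE backup_catalog ("
    " object_name TEXT, backup_location TEXT, backed_up_at TEXT)"
)

def record_backup(object_name: str, location: str) -> None:
    """Remember where and when this object was backed up."""
    catalog.execute(
        "INSERT INTO backup_catalog VALUES (?, ?, ?)",
        (object_name, location, datetime.now(timezone.utc).isoformat()),
    )

def last_backup(object_name: str):
    """Front end to the catalog: the most recent backup of an object."""
    cur = catalog.execute(
        "SELECT backup_location, backed_up_at FROM backup_catalog "
        "WHERE object_name = ? ORDER BY backed_up_at DESC LIMIT 1",
        (object_name,),
    )
    return cur.fetchone()

# Example usage.
record_backup("sales_fact", "/backups/sales_fact_2024_01_01.dmp")
print(last_backup("sales_fact"))
```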
2. Summarize the role of load manager and warehouse manager.
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers:
 Load manager
 Warehouse manager
 Query manager
Load Manager
The load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions:
 Extracts data from the source system.
 Fast loads the extracted data into a temporary data store.
 Performs simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows the client program to generate SQL to be executed at a server. Open Database Connection (ODBC) and Java Database Connection (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing. It is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology is not suitable, since gateways are inefficient when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. Only after completing the simple transformations can we do complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
 Strip out all the columns that are not required within the warehouse.
 Convert all the values to the required data types.
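The simple transformations described above can be pictured as a small row-level function, as in the sketch below. The EPOS column names and target types are assumptions for the example, not an actual EPOS record layout.

```python
# Hedged sketch of the load manager's simple transformations on EPOS sales
# records: strip unneeded columns and convert values to required data types.
from datetime import datetime

REQUIRED_COLUMNS = {"store_id", "product_id", "sale_date", "quantity", "amount"}

def transform_row(raw: dict) -> dict:
    # Strip out all the columns that are not required within the warehouse.
    row = {k: v for k, v in raw.items() if k in REQUIRED_COLUMNS}
    # Convert all the values to the required data types.
    row["store_id"] = int(row["store_id"])
    row["product_id"] = int(row["product_id"])
    row["quantity"] = int(row["quantity"])
    row["amount"] = float(row["amount"])
    row["sale_date"] = datetime.strptime(row["sale_date"], "%Y-%m-%d").date()
    return row

# Example usage on one raw record: extra till-level columns are dropped.
raw_record = {
    "store_id": "101", "product_id": "5005", "sale_date": "2024-03-01",
    "quantity": "2", "amount": "49.90", "till_id": "7", "cashier": "A12",
}
print(transform_row(raw_record))
```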
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following:
 The controlling process
 Stored procedures or C with SQL
 Backup/Recovery tool
 SQL scripts
Functions of Warehouse Manager
A warehouse manager performs the following functions:
 Analyzes the data to perform consistency and referential integrity checks.
 Creates indexes, business views, and partition views against the base data.
 Generates new aggregations and updates the existing aggregations.
 Generates normalizations.
 Transforms and merges the source data of the temporary store into the published data warehouse.
 Backs up the data in the data warehouse.
 Archives the data that has reached the end of its captured life.
Note: A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
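Two of the warehouse manager's functions, consistency/referential integrity checking and aggregation creation, are sketched below. The table and column names (sales_fact, product_dim, daily_sales_agg) are assumptions used only for this illustration.

```python
# Hedged sketch of warehouse manager tasks: a referential integrity check
# between fact and dimension tables, and rebuilding a daily aggregation.
import sqlite3

def check_referential_integrity(conn: sqlite3.Connection) -> int:
    """Count fact rows whose product_id has no matching dimension row."""
    sql = (
        "SELECT COUNT(*) FROM sales_fact f "
        "LEFT JOIN product_dim p ON f.product_id = p.product_id "
        "WHERE p.product_id IS NULL"
    )
    return conn.execute(sql).fetchone()[0]

def refresh_daily_sales_aggregate(conn: sqlite3.Connection) -> None:
    """Rebuild a daily sales summary table from the base fact data."""
    conn.execute("DROP TABLE IF EXISTS daily_sales_agg")
    conn.execute(
        "CREATE TABLE daily_sales_agg AS "
        "SELECT sale_date, product_id, SUM(amount) AS total_amount "
        "FROM sales_fact GROUP BY sale_date, product_id"
    )

# Example usage on an in-memory database with a couple of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dim (product_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product_id INTEGER, amount REAL)")
conn.execute("INSERT INTO product_dim VALUES (1, 'laptop')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [("2024-03-01", 1, 100.0), ("2024-03-01", 2, 50.0)])
print("orphan fact rows:", check_referential_integrity(conn))
refresh_daily_sales_aggregate(conn)
```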
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user.
Query Manager Architecture
A query manager includes the following components:
 Query redirection via C tool or RDBMS
 Stored procedures
 Query management tool
 Query scheduling via C tool or RDBMS
 Query scheduling via third-party software
Functions of Query Manager
 It presents the data to the user in a form they understand.
 It schedules the execution of the queries posted by the end-user.
 It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
3. Discuss about query manager process in detail.
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user.
It is the process that manages the queries and speeds them up by directing queries to the most effective data source. This process also ensures that all the system resources are used most effectively, usually by scheduling the execution of queries. The query management process monitors the actual query profiles that are used to determine which aggregations to generate.
This process operates at all times that the data warehouse is made available to end users. There are no major consecutive steps within this process; rather, there is a set of facilities that are constantly in operation.
Directing queries − Data warehouses that contain summarised data can provide several distinct data sources to respond to a specific query. These are the detailed information itself, and any number of aggregations that satisfy the query's information need.
For example, in the analysis of a sales data warehouse, if a user asks the system to "Report on sales of computer, Ghaziabad, UP over the past 2 weeks", this query could be satisfied by scanning any of the following tables:
 All the detailed information over the past 2 weeks, filtering in all computer sales for Ghaziabad.
 Two weeks' worth of a weekly summary table of product by store.
 A bi-weekly summary table of product by region (Ghaziabad is an example of a region).
 A bi-weekly summary table of product group by store (a computer is a product group).
Any of these tables can be used to get the result. However, the execution performance will vary between each table because the volumes that have to be read differ substantially. The query management process determines which table delivers the answer most effectively, by calculating which table would satisfy the query in the shortest period.
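The table-selection logic can be sketched as a tiny cost comparison: among the candidate tables that can answer the query, pick the one with the smallest estimated read volume. The table names and row counts below are assumptions for the example, not output of a real optimizer.

```python
# Hedged sketch: choose the data source that answers the query with the
# fewest rows read, as a simple proxy for the shortest execution time.
CANDIDATE_TABLES = {
    "detailed_sales_2_weeks": 5_000_000,        # full detail, filtered at read time
    "weekly_summary_product_by_store": 80_000,
    "biweekly_summary_product_by_region": 2_000,
    "biweekly_summary_group_by_store": 10_000,
}

def choose_table(candidates: dict[str, int]) -> str:
    """Return the candidate expected to satisfy the query most cheaply."""
    return min(candidates, key=candidates.get)

print(choose_table(CANDIDATE_TABLES))    # -> biweekly_summary_product_by_region
```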

Managing system resources − A single large query can use all system resources to execute, affecting the performance of the entire system. These queries tend to be the ones that either scan the entire detailed information or are constructed inappropriately and perform repetitive scans of a large table.
The query management process ensures that no single query can affect the overall system performance.
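This resource-management facility can be pictured as a simple admission check applied before a query is released for execution, as in the sketch below; the row and runtime limits are illustrative assumptions only.

```python
# Hedged sketch: keep any single expensive query from dominating the system
# by deferring it to an off-peak batch window. Thresholds are example values.
MAX_ESTIMATED_ROWS = 1_000_000
MAX_RUNTIME_SECONDS = 600

def admit_query(estimated_rows: int, estimated_runtime_s: int) -> str:
    if estimated_rows > MAX_ESTIMATED_ROWS or estimated_runtime_s > MAX_RUNTIME_SECONDS:
        return "queue for off-peak batch window"
    return "run now"

# Example usage.
print(admit_query(estimated_rows=5_000_000, estimated_runtime_s=1200))  # deferred
print(admit_query(estimated_rows=20_000, estimated_runtime_s=5))        # run now
```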
Query capture − The query profiles change regularly over the life of a data warehouse, and the original user query requirements may be nothing more than a starting point. The summary tables are structured around a defined query profile, and if the profile changes, the summary tables also change.
To accurately monitor and understand what the new query profile is, it can be very effective to capture the physical queries that are being executed. At various points in time, these queries can be analyzed to determine the new query profiles and the resulting impact on summary tables.
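Query capture can be as simple as logging the grain each executed query asked for and counting how often each grain recurs; frequently requested grains become candidates for new summary tables. The log format and the threshold below are assumptions made for this sketch.

```python
# Hedged sketch of query capture and profile analysis: count how often each
# query grain is requested and suggest aggregations for the frequent ones.
from collections import Counter

captured_queries = [
    ("product", "store", "week"),        # each tuple = the grain a query used
    ("product", "region", "week"),
    ("product", "store", "week"),
    ("product_group", "store", "month"),
]

profile = Counter(captured_queries)

SUGGESTION_THRESHOLD = 2                 # example value: grains seen twice or more
for grain, hits in profile.items():
    if hits >= SUGGESTION_THRESHOLD:
        print(f"candidate summary table at grain {grain}: requested {hits} times")
```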

Query Manager Architecture
A query manager includes the following components:
 Query redirection via C tool or RDBMS
 Stored procedures
 Query management tool
 Query scheduling via C tool or RDBMS
 Query scheduling via third-party software
Functions of Query Manager
 It presents the data to the user in a form they understand.
 It schedules the execution of the queries posted by the end-user.
 It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
PART – C
1. Suppose that a data warehouse for Big University consists of the four dimensions
student, course, semester, and instructor, and two measures count and avg grade. At the
lowest conceptual level (e.g., for a given student, course, semester, and instructor
combination), the avg grade measure stores the actual course grade of the student. At
higher conceptual levels, avg grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse. (b) Starting with the
base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g.,
roll-up from semester to year) should you perform in order to list the average grade of
CS courses for each Big University student. (c) If each dimension has five levels
(including all), such as “student < major < status < university < all”, how many cuboids
will this cube contain (including the base and apex cuboids)? Create
a) A snowflake schema is shown in the figure below:
b) The specific OLAP operations to be performed:
1. Roll-up on course from course_id to department.
2. Roll-up on student from student_id to university.
3. Dice on course, student with department = "CS" and university = "Big University".
4. Drill-down on student from university to student_name.
c) With N = 4 dimensions and five levels per dimension (including all), the cube will contain 5^4 = 625 cuboids.
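As a worked check of part (c), the usual cuboid-count formula can be applied, where L_i is the number of levels of dimension i excluding the virtual top level "all":

```latex
\text{Total cuboids} \;=\; \prod_{i=1}^{n} (L_i + 1)
\;=\; (4+1)\times(4+1)\times(4+1)\times(4+1) \;=\; 5^{4} \;=\; 625
```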
Snowflake Schema Diagram
The snowflake schema for the Big University data warehouse can be represented as follows:
Dimension    Hierarchy/Attributes
Student      Student_ID, Name, ...
Course       Course_ID, Name, ...
Semester     Semester_ID, Name, ...
Instructor   Instructor_ID, Name, ...
OLAP Operations for Average Grade of CS Courses
To list the average grade of CS courses for each Big University student, you would perform the following OLAP operations starting with the base cuboid [student, course, semester, instructor]:
1. Drill-down: Drill down on the "course" dimension to the "CS course" level to focus on computer science courses.
2. Roll-up: Roll up on the "semester" dimension to a higher level such as "year" to aggregate the average grade for CS courses across semesters into yearly averages.
3. Slice: Slice the data to filter for CS courses only, focusing on the "CS course" dimension.
4. Dice: Dice the data to further refine the analysis, for example by specific attributes such as "student", to view the average grade of CS courses for each student.
By applying these OLAP operations, you can obtain the average grade of CS courses for each Big University student at different levels of granularity within the data warehouse.
2. A data warehouse can be modeled by either a star schema or a snowflake schema.
Briefly describe the similarities and the differences of the two models, and then analyze
their advantages and disadvantages with regard to one another with example.
Understand
Both star and snowflake schemas are data warehouse models, but they differ in how they
structure dimension tables. Star schemas use a central fact table and denormalized dimension
tables, leading to faster queries but more storage and data redundancy. Snowflake schemas
normalize dimension tables into multiple, smaller tables, reducing redundancy and improving
storage, but potentially slowing down query performance due to more joins.
Similarities:
 Both schemas are designed for data warehousing and analytics, utilizing a fact table to
store numerical data and dimension tables to store descriptive information.
 Both schemas aim to improve query performance and facilitate complex reporting by
organizing data in a structured manner.
Differences:
Feature              Star Schema      Snowflake Schema
Data Redundancy      High             Low
Query Performance    Fast             Slower
Storage              More             Less
Normalization        Denormalized     Normalized
Complexity           Simple           Complex
Advantages and Disadvantages:
Star Schema:
 Advantages:
o Faster query performance: Fewer joins required for queries due to
denormalization.
o Simpler design and maintenance: Easier to understand and manage compared
to snowflake schemas.
 Disadvantages:
o Higher storage requirements: Data redundancy can lead to larger dimension
tables, requiring more storage space.
o Potential for data integrity issues: Denormalization can make it harder to
maintain data consistency.
Snowflake Schema:
 Advantages:
o Lower storage requirements: Normalization reduces data redundancy, leading
to more efficient storage.
o Improved data integrity: Normalization helps maintain data consistency and
reduces the risk of errors.
 Disadvantages:
o Slower query performance: More joins are required, potentially slowing down
query execution.
o More complex design and maintenance: Normalization can make the schema
more difficult to understand and manage.
Example: Consider a data warehouse for online sales.
 Star Schema: A fact table stores sales transactions, and dimension tables (e.g., Customers, Products, Date) contain descriptive attributes like customer ID, product ID, and date. Queries to find the total sales for a specific product would require a single join between the fact table and the Products dimension table.
 Snowflake Schema: The dimension tables might be further normalized. For example, the Products table could be split into Product Categories, Product Subcategories, and Product Details tables. Finding the total sales for a specific product would require more joins (e.g., fact table to Product Details, then Product Details to Product Subcategories, and so on).
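To make the join difference concrete, the sketch below builds a tiny star-style and snowflake-style version of the Products dimension in SQLite and answers the same "total sales for one product" question against each; all table and column names are hypothetical and chosen only for this illustration.

```python
# Hedged sketch: one join in the star schema vs. a chain of joins in the
# snowflake schema for the same question. Schema/data are invented examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: denormalized Products dimension.
    CREATE TABLE products (product_id INTEGER, name TEXT, category TEXT);
    CREATE TABLE sales_fact (product_id INTEGER, amount REAL);

    -- Snowflake schema: Products normalized into details and categories.
    CREATE TABLE product_categories (category_id INTEGER, category TEXT);
    CREATE TABLE product_details (product_id INTEGER, name TEXT, category_id INTEGER);

    INSERT INTO products VALUES (1, 'laptop', 'computers');
    INSERT INTO sales_fact VALUES (1, 999.0), (1, 899.0);
    INSERT INTO product_categories VALUES (10, 'computers');
    INSERT INTO product_details VALUES (1, 'laptop', 10);
""")

# Star schema: a single join between the fact table and the dimension.
star_sql = """
    SELECT p.name, SUM(f.amount)
    FROM sales_fact f JOIN products p ON f.product_id = p.product_id
    WHERE p.name = 'laptop' GROUP BY p.name
"""

# Snowflake schema: the fact table joins to product_details, which joins to
# product_categories - more joins for the same answer.
snowflake_sql = """
    SELECT d.name, SUM(f.amount)
    FROM sales_fact f
    JOIN product_details d ON f.product_id = d.product_id
    JOIN product_categories c ON d.category_id = c.category_id
    WHERE d.name = 'laptop' GROUP BY d.name
"""

print(conn.execute(star_sql).fetchall())       # [('laptop', 1898.0)]
print(conn.execute(snowflake_sql).fetchall())  # [('laptop', 1898.0)]
```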