DW PART A PART B NOTES
UNIT I
PART -B
1. Explain the 3-tier data warehouse architecture and its various components. (Evaluate)
The three-tier approach is the most widely used architecture for data warehouse systems.
1. The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of the database. It
arranges the data to make it more suitable for analysis. This is done with an OLAP
server, implemented using the ROLAP or MOLAP model.
3. The top-tier is where the user accesses and interacts with the data. It represents the
front-end client layer. You can use reporting tools, query, analysis or data mining
tools.
Data Warehouse Components
From the architectures outlined above, you notice some components overlap, while others are
unique to the number of tiers.
Below you will find some of the most important data warehouse components and their roles
in the system.
ETL Tools
ETL stands for Extract, Transform, and Load. The staging layer uses ETL tools to extract
the needed data from various formats and check the quality before loading it into the data
warehouse.
The data coming from the data source layer can come in a variety of formats. Before merging
all the data collected from multiple sources into a single database, the system must clean and
organize the information.
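To make the idea concrete, here is a minimal Python sketch of an ETL flow, assuming a hypothetical sales.csv source file and a local SQLite database standing in for the warehouse; the column names are invented for illustration only.

import csv
import sqlite3

def extract(path):
    # Extract: read raw records from a source file (hypothetical sales.csv)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and standardize the records before loading
    cleaned = []
    for row in rows:
        if not row.get("amount"):          # drop records with a missing measure
            continue
        cleaned.append({
            "sale_date": row["date"].strip(),
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return cleaned

def load(rows, db="warehouse.db"):
    # Load: write the cleansed rows into a warehouse fact table
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (sale_date TEXT, region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:sale_date, :region, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))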
The Database
The most crucial component and the heart of each architecture is the database. The
warehouse is where the data is stored and accessed.
When creating the data warehouse system, you first need to decide what kind of database you
want to use.
Data
Once the system cleans and organizes the data, it stores it in the data warehouse. The data
warehouse represents the central repository that stores metadata, summary data, and raw
data coming from each source.
Metadata is the information that defines the data. Its primary role is to simplify
working with data instances. It allows data analysts to classify, locate, and direct
queries to the required data.
Summary data is generated by the warehouse manager. It updates as new data loads
into the warehouse. This component can include lightly or highly summarized data.
Its main role is to speed up query performance.
Raw data is the actual data loading into the repository, which has not been processed.
Having the data in its raw form makes it accessible for further processing and
analysis.
Access Tools
Users interact with the gathered information through different tools and technologies. They
can analyze the data, gather insight, and create reports.
Reporting tools. They play a crucial role in understanding how your business is
doing and what should be done next. Reporting tools include visualizations such as
graphs and charts showing how data changes over time.
OLAP tools. Online analytical processing tools which allow users to analyze
multidimensional data from multiple perspectives. These tools provide fast processing
and valuable analysis. They extract data from numerous relational data sets and
reorganize it into a multidimensional format.
Data mining tools. Examine data sets to find patterns within the warehouse and the
correlation between them. Data mining also helps establish relationships when
analyzing multidimensional data.
Data Marts
Data marts allow you to have multiple groups within the system by segmenting the data in
the warehouse into categories. It partitions data, producing it for a particular user group.
For instance, you can use data marts to categorize information by departments within the
company.
Designing a data warehouse relies on understanding the business logic of your individual use
case.
The requirements vary, but there are data warehouse best practices you should follow:
Create a data model. Start by identifying the organization’s business logic.
Understand what data is vital to the organization and how it will flow through the data
warehouse.
Opt for a well-known data warehouse architecture standard. An established standard
provides a framework and a set of best practices to follow when designing the
architecture or troubleshooting issues. Popular architecture standards
include 3NF, Data Vault modeling and star schema.
Create a data flow diagram. Document how data flows through the system. Know
how that relates to your requirements and business logic.
Have a single source of truth. When dealing with so much data, an organization has
to have a single source of truth. Consolidate data into a single repository.
Use automation. Automation tools help when dealing with vast amounts of data.
Allow metadata sharing. Design an architecture that facilitates metadata sharing
between data warehouse components.
Enforce coding standards. Coding standards ensure system efficiency.
A data warehouse is a repository for structured, filtered data that has already been processed
for a specific purpose. It collects the data from multiple sources and transforms the data using
ETL process, then loads it to the Data Warehouse for business purpose.
An operational database, on the other hand, is a database where the data changes frequently.
They are mainly designed for high volume of data transaction. They are the source database
for the data warehouse. Operational databases are used for recording online transactions and
maintaining integrity in multi-access environments.
Read this article to learn more about data warehouses and operational databases and how they
are different from each other.
A Data Warehouse is a system that is used by the users or knowledge managers for data
analysis and decision-making. It can construct and present the data in a certain structure to
fulfill the diverse requirements of several users. Data warehouses are also known as Online
Analytical Processing (OLAP) Systems.
In a data warehouse or OLAP system, the data is saved in a format that allows the effective
creation of data mining documents. The data structure in a data warehouse has a
denormalized schema. Performance-wise, data warehouses are quite fast when it comes to
analyzing queries.
Data warehouse systems do the integration of several application systems. These systems
then provide data processing by supporting a solid platform of consolidated historical data for
analysis
What is an Operational Database?
The type of database system that stores information related to operations of an enterprise is
referred to as an operational database. Operational databases are required for functional
lines like marketing, employee relations, customer service etc. Operational databases are
basically the sources of data for the data warehouses because they contain detailed data
required for the normal operations of the business.
In an operational database, the data changes when updates are created and shows the latest
value of the final transaction. They are also known as OLTP (Online Transactions
Processing Databases). These databases are used to manage dynamic data in real-time.
The following is an important difference between a data warehouse and an operational
database −
Type of Data: A data warehouse focuses on historical data, whereas an operational database
focuses on current transactional data.
In the past, data warehouses operated in layers that matched the flow of the business data:
Data layer: Data is extracted from your sources and then transformed and loaded into
the bottom tier using ETL tools. The bottom tier consists of your database server, data
marts, and data lakes. Metadata is created in this tier – and data integration tools, like
data virtualization, are used to seamlessly combine and aggregate data.
Semantics layer: In the middle tier, online analytical processing (OLAP) and online
transactional processing (OLTP) servers restructure the data for fast, complex queries
and analytics.
Analytics layer: The top tier is the front-end client layer. It holds the data warehouse
access tools that let users interact with data, create dashboards and reports, monitor
KPIs, mine and analyze data, build apps, and more. This tier often includes a workbench
or sandbox area for data exploration and new data model development.
Data warehouses have been designed to support decision making and have been primarily
built and maintained by IT teams, but over the past few years they have evolved to empower
business users – reducing their reliance on IT to get access to the data and derive actionable
insights. A few key data warehousing capabilities that have empowered business users are:
1. The semantic or business layer that provides natural language phrases and allows
everyone to instantly understand data, define relationships between elements in the data
model, and enrich data fields with new business information.
2. Virtual workspaces allow teams to bring data models and connections into one secured
and governed place, supporting better collaboration with colleagues through one
common space and one common data set.
3. Cloud has further improved decision making by globally empowering employees with a
rich set of tools and features to easily perform data analysis tasks. They can connect new
apps and data sources without much IT support.
Top seven benefits of a cloud data warehouse
Cloud-based data warehouses are rising in popularity – for good reason. These modern
warehouses offer several advantages over traditional, on-premise versions. Here are the top
seven benefits of a cloud data warehouse:
1. Quick to deploy: With cloud data warehousing, you can purchase nearly unlimited
computing power and data storage in just a few clicks – and you can build your own
data warehouse, data marts, and sandboxes from anywhere, in minutes.
2. Low total cost of ownership (TCO): Data warehouse-as-a-service (DWaaS) pricing
models are set up so you only pay for the resources you need, when you need them. You
don’t have to forecast your long-term needs or pay for more compute throughout the
year than necessary. You can also avoid upfront costs like expensive hardware, server
rooms, and maintenance staff. Separating the storage pricing from the computing pricing
also gives you a way to drive down the costs.
3. Elasticity: With a cloud data warehouse, you can dynamically scale up or down as
needed. Cloud gives us a virtualized, highly distributed environment that can manage
huge volumes of data that can scale up and down.
4. Security and disaster recovery: In many cases, cloud data warehouses actually provide
stronger data security and encryption than on-premise DWs. Data is also automatically
duplicated and backed-up, so you can minimize the risk of lost data.
5. Real-time technologies: Cloud data warehouses built on in-memory database
technology can provide extremely fast data processing speeds to deliver real-time data
for instantaneous situational awareness.
6. New technologies: Cloud data warehouses allow you to easily integrate new
technologies such as machine learning, which can provide a guided experience for
business users and decision support in the form of recommended questions to ask, as an
example.
7. Empower business users: Cloud data warehouses empower employees equally and
globally with a single view of data from numerous sources and a rich set of tools and
features to easily perform data analysis tasks. They can connect new apps and data
sources without IT.
Oracle Autonomous Data Warehouse (ADW) Architecture
1. Storage Layer: Data is stored in Exadata storage servers using a combination of flash
and disk storage.
2. Compute Layer: The compute nodes are responsible for processing queries and
analyzing data. ADW uses a massively parallel processing (MPP) architecture to
parallelize queries across multiple nodes for faster performance.
3. Autonomous Features: ADW leverages AI and machine learning to automate
various administrative tasks, including performance tuning, security patching,
backups, and fault detection.
How to Install Oracle Autonomous Data Warehouse?
To use Oracle Autonomous Data Warehouse:
1. Sign up for Oracle Cloud: Go to the Oracle Cloud website and sign up for an Oracle
Cloud account.
2. Provision Autonomous Data Warehouse: In the Oracle Cloud Console, provision
an Autonomous Data Warehouse instance.
3. Connect to Autonomous Data Warehouse: Use SQL clients or tools to connect to
ADW and run SQL queries.
4. Load Data into ADW: Load your data into ADW from various sources using Oracle
Data Pump, SQL Developer, or other data loading tools.
5. Run Queries and Analyze Data: Write SQL queries to analyze your data and gain
insights.
6. Monitor Performance: Use the Oracle Cloud Console to monitor query performance
and resource utilization.
Please note that Oracle Autonomous Data Warehouse is a cloud-based service, and you do
not need to install it on your local machine. Instead, you access and use ADW through the
Oracle Cloud Console or SQL clients from your local environment.
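As a rough illustration of steps 3 and 5 (connecting with a SQL client and running a query), the python-oracledb driver can be used from a local machine. The credentials, wallet paths, service name, and the sales_fact table below are placeholders, not values from any real instance.

import oracledb   # Oracle's python-oracledb driver (thin mode, no Oracle Client install needed)

# Placeholder credentials and connection details for an ADW instance
conn = oracledb.connect(
    user="ADMIN",
    password="your_password_here",
    dsn="myadw_high",                      # TNS alias from the downloaded wallet
    config_dir="/path/to/wallet",          # directory holding tnsnames.ora
    wallet_location="/path/to/wallet",     # wallet used for the mTLS connection
    wallet_password="wallet_password_here",
)

with conn.cursor() as cur:
    # Hypothetical fact table loaded earlier via Data Pump or SQL Developer
    cur.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
    for region, total in cur:
        print(region, total)

conn.close()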
UNIT II
PART A
1. What is ETL? ETL functions reshape the relevant data from the source systems into
useful information to be stored in the data warehouse. Without these functions, there would
be no strategic information in the data warehouse. If the source data is not extracted correctly,
cleansed, and integrated in the proper formats, query processing and delivery of business
intelligence, the backbone of the data warehouse, could not happen.
2. What are the major steps in the ETL process? Plan for aggregate fact tables.
Determine data transformation and cleansing rules. Establish comprehensive data extraction
rules. Prepare data mapping for target data elements from sources. Integrate all the data
sources, both internal and external. Determine all the target data needed in the data
warehouse.
3. Difference between ETL vs. ELT
Process − ETL: Data is transferred to the ETL server and moved back to the database; high
network bandwidth is required. ELT: Data remains in the database except for cross-database
loads (e.g. source to target).
Transformation − ETL: Transformations are performed in the ETL server. ELT:
Transformations are performed in the target (or in the source).
Code Usage − ETL: Typically used for source-to-target transfer, compute-intensive
transformations, and small amounts of data. ELT: Typically used for high amounts of data.
Time/Maintenance − ETL: Needs high maintenance as you need to select the data to load and
transform. ELT: Low maintenance as the data is always available.
Calculations − ETL: Overwrites the existing column, or the dataset must be appended and
pushed to the target platform. ELT: Easily adds the calculated column to the existing table.
4. What are various data warehouse models?
1. Conceptual model 2. Logical model 3. Physical model
5. List the advantages and disadvantages of the top-down approach.
Advantages of top-down design: Data marts are loaded from the data warehouse.
Developing a new data mart from the data warehouse is very easy.
Disadvantages of top-down design: This technique is inflexible to changing departmental
needs. The cost of implementing the project is high.
6. Explain OLAP?
OLAP stands for On-Line Analytical Processing. OLAP is a classification of software
technology which authorizes analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access in a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by the clients.
7. List the benefits of OLAP?
OLAP helps managers in decision-making through the multidimensional record views that it
is efficient in providing, thus increasing their productivity. OLAP functions are self-
sufficient owing to the inherent flexibility they provide to the organized databases. It facilitates
simulation of business models and problems, through extensive management of analysis-
capabilities. In conjunction with data warehouse, OLAP can be used to support a reduction
in the application backlog, faster data retrieval, and reduction in query drag.
8. Explain OLTP?
OLTP (On-Line Transaction Processing) is featured by a large number of short on-line
transactions (INSERT, UPDATE, and DELETE). The primary significance of OLTP
operations is put on very rapid query processing, maintaining record integrity in multi-access
environments, and effectiveness measured by the number of transactions per second.
9. What are the various OLAP operations?
ROLL UP, DRILL DOWN, DICE, SLICE, PIVOT
10. Mention the types of OLAP?
ROLAP stands for Relational OLAP, an application based on relational DBMSs. MOLAP
stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
11. Mention the advantages and disadvantages of Hybrid OLAP?
Advantages of HOLAP: HOLAP provides the benefits of both MOLAP and ROLAP. It
provides fast access at all levels of aggregation. HOLAP balances the disk space
requirement, as it only stores the aggregate information on the OLAP server while the detail
records remain in the relational database, so no duplicate copy of the detail records is
maintained. Disadvantages of HOLAP: The HOLAP architecture is very complicated because
it supports both MOLAP and ROLAP servers.
UNIT II
PART B
1. Illustrate data modeling life cycle with neat sketch.
Following are the important phases in the Data Model Development Life Cycle.
1. Gathering Business Requirements – First Phase:
Data modelers have to interact with business analysts to get the functional requirements and
with end users to find out the reporting needs.
2. Conceptual Data Modeling (CDM) – Second Phase:
This data model includes all major entities and relationships; it does not contain much detail
about attributes and is often used in the initial planning phase.
3. Logical Data Modeling (LDM) – Third Phase:
This is the actual implementation of the conceptual model as a logical data model. A logical
data model is the version of the model that represents all of the business requirements of an
organization.
4. Physical Data Modeling (PDM) – Fourth Phase:
This is a complete model that includes all required tables, columns, relationships, and
database properties for the physical implementation of the database.
5. Database – Fifth Phase:
DBAs instruct the data modeling tool to create SQL code from the physical data model. The
SQL code is then executed on the server to create the databases.
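A small sketch of this fifth phase, assuming the modeling tool has already generated simple DDL and using SQLite to stand in for the database server; the table and column names are purely illustrative.

import sqlite3

# DDL generated from the physical data model (illustrative statements only)
ddl_statements = [
    """CREATE TABLE customer_dim (
           customer_key INTEGER PRIMARY KEY,
           customer_name TEXT,
           city TEXT
       )""",
    """CREATE TABLE sales_fact (
           sale_id INTEGER PRIMARY KEY,
           customer_key INTEGER REFERENCES customer_dim(customer_key),
           sale_date TEXT,
           amount REAL
       )""",
]

con = sqlite3.connect("modeled_db.sqlite")
for ddl in ddl_statements:
    con.execute(ddl)       # the DBA runs the generated SQL on the server
con.commit()
con.close()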
What is OLAP?
OLAP or Online Analytical Processing is a category of software that allows users to extract
and examine business data from different points of view. It makes use of pre-calculated and
pre-aggregated data from multiple databases to improve the data analysis process. OLAP
databases are divided into several data structures known as OLAP cubes.
OLAP Cube
The OLAP cube or Hypercube is a special kind of data structure that is optimized for very
quick multidimensional data analysis and storage. It is a snapshot of data at a specific point
in time. Using certain OLAP operations, a user can request a specified
view of the hypercube. Hence, OLAP cubes allow users to perform multidimensional
analytical querying on the data.
There exist mainly three types of OLAP systems:
Relational OLAP (ROLAP): These systems work directly with relational databases and use
complex SQL queries to retrieve information from the database. ROLAP can handle large
volumes of data but provides slower data processing.
Multidimensional OLAP (MOLAP): MOLAP is also known as the classic form of OLAP.
It uses an optimized multidimensional array storage system for data storage and makes use of
positional techniques to access the data physically stored in multidimensional arrays.
Hybrid OLAP (HOLAP): HOLAP combines the ROLAP and MOLAP approaches, typically
keeping detailed data in relational tables while storing pre-computed aggregations in a
multidimensional store.
OLAP Operations
OLAP provides various operations to gain insights from the data stored in multidimensional
hypercubes. These operations include:
Drill Down
Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is
converted into highly detailed data. It can be implemented by either stepping down a concept
hierarchy for a dimension or adding additional dimensions to the hypercube.
Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2, Q3, Q4) of
various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities (Delhi, Mumbai,
Las Vegas, New York) as shown below:
Here, the drill-down operation is applied on the time dimension and the quarter Q1 is drilled
down to January, February, and March. Hence, by applying the drill-down operation, we can
move down from quarterly sales in a year to monthly or weekly records.
Roll up
It is the opposite of the drill-down operation and is also known as a drill-up or aggregation
operation. It is a dimension-reduction technique that performs aggregation on a data cube. It
makes the data less detailed and it can be performed by combining similar dimensions across
any axis.
Here, we are performing the Roll-up operation on the given data cube by combining and
categorizing the sales based on the countries instead of cities.
Dice
Dice operation is used to generate a new sub-cube from the existing hypercube. It
selects two or more dimensions from the hypercube to generate a new sub-cube for the given
data.
Here, we are using the dice operation to retrieve the sales done by the company in the first
half of the year i.e., the sales in the first two quarters.
Slice
Slice operation is used to select a single dimension from the given cube to generate a
new sub-cube. It represents the information from another point of view.
Here, the sales done by the company during the first quarter are retrieved by performing the
slice operation on the given hypercube.
Pivot
It is used to provide an alternate view of the data available to the users. It is also known as
Rotate operation as it rotates the cube’s orientation to view the data from different
perspectives.
Example:
Here, we are using the Pivot operation to view the sub-cube from a different perspective.
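These operations can be imitated on a small pandas DataFrame. The figures below are invented purely for illustration; they loosely mirror the clothing-sales cube used in the examples above.

import pandas as pd

# Toy sales data: quarters x cities x item types (illustrative values only)
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai", "New York", "New York"],
    "country": ["India", "India", "India", "India", "USA", "USA"],
    "item":    ["Shirt", "Pant", "Shirt", "Tees", "Shorts", "Shirt"],
    "amount":  [120, 80, 150, 60, 90, 110],
})

# Roll up: aggregate cities up to countries (climb the location hierarchy)
rollup = sales.groupby(["quarter", "country"])["amount"].sum()

# Slice: fix a single dimension value (only quarter Q1)
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions (first two quarters, two cities)
dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Delhi", "Mumbai"])]

# Pivot: rotate the view, with items as rows and quarters as columns
pivot = sales.pivot_table(index="item", columns="quarter", values="amount", aggfunc="sum")

print(rollup, pivot, sep="\n\n")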
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits.
An IT strategy is required to procure and retain funding for the project.
Business Case
The objective of business case is to estimate business benefits that should be derived from
using a data warehouse. These benefits may not be quantifiable but the projected benefits
need to be clearly stated. If a data warehouse does not have a clear business case, then the
business tends to suffer from credibility problems at some stage during the delivery process.
Therefore in data warehouse projects, we need to understand the business case for
investment.
Education and Prototyping
Organizations experiment with the concept of data analysis and educate themselves on the
value of having a data warehouse before settling for a solution. This is addressed by
prototyping, which helps in understanding the feasibility and benefits of a data warehouse.
Prototyping on a small scale can promote this educational process, and the following sections
describe what must be in place to produce an early release and deliver business benefits.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are
understood. If we understand the business requirements for both short-term and medium-
term, then we can design a solution to fulfil short-term requirements. The short-term solution
can then be grown to a full solution.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. It
also delivers the components that must be implemented in the short term to derive any
business benefit; the blueprint must identify these components and how they fit into the
overall architecture.
Building the Version
In this stage, the first production deliverable is produced. This production deliverable is the
smallest component of the data warehouse that adds business benefit.
History Load
This is the phase where the remainder of the required history is loaded into the data
warehouse. In this phase, we do not add new entities, but additional physical tables would
probably be created to store increased data volumes.
Let us take an example. Suppose the build version phase has delivered a retail sales analysis
data warehouse with 2 months worth of history. This information will allow the user to
analyze only the recent trends and address the short-term issues. The user in this case cannot
identify annual and seasonal trends. To help him do so, last 2 years sales history could be
loaded from the archive. Now the 40GB data is extended to 400GB.
Note − The backup and recovery procedures may become complex, therefore it is
recommended to perform this activity within a separate phase.
Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate a data warehouse.
These tools can generate the database query.
Note − It is recommended not to use these access tools when the database is being
substantially modified.
Automation
In this phase, the operational management processes (such as extracting, transforming, and
loading the data, and backing up and recovering the warehouse) are fully automated.
Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements.
The scope can be extended in two ways − by loading new data into the data warehouse, or by
introducing new data marts using the existing information.
Note − This phase should be performed separately, since it involves substantial efforts and
complexity.
Requirements Evolution
From the perspective of delivery process, the requirements are always changeable. They are
not static. The delivery process must support this and allow these changes to be reflected
within the system.
This issue is addressed by designing the data warehouse around the use of data within
business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs, the process
operates as a pseudo-application development process, where the new requirements are
continually fed into the development activities and the partial deliverables are produced.
These partial deliverables are fed back to the users and then reworked ensuring that the
overall system is continually updated to meet the business needs
UNIT III META DATA, DATA MART AND PARTITION STRATEGY
PART A
1.What is metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book.
2. What is business metadata? It has the data ownership information, business definition,
and changing policies.
3. What is Technical Metadata? It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
4. What is Operational Metadata? It includes currency of data and data lineage. Currency
of data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
5. What is Data mart? A Data Mart is a subset of a directorial information store, generally
oriented to a specific purpose or primary data subject which may be distributed to provide
business needs. Data Marts are analytical record stores designed to focus on particular
business functions for a specific community within an organization.
6. Why Do We Need a Data Mart? To partition data in order to impose access control
strategies. To speed up the queries by reducing the volume of data to be scanned. To
segment data into different hardware platforms. To structure data in a form suitable for a
user access tool.
7. Why is it Necessary to Partition in data warehousing?
Partitioning is important for the following reasons in data warehousing
For easy management,
To assist backup/recovery,
To enhance performance.
8. What is vertical partitioning? How is it done?
Vertical partitioning splits the data vertically. Vertical partitioning can be performed in the
following two ways − Normalization Row Splitting
9. State few roles of metadata Metadata also helps in summarization between lightly
detailed data and highly summarized data. Metadata is used for query tools. Metadata is
used in extraction and cleansing tools. Metadata is used in reporting tools. Metadata is
used in transformation tools. Metadata plays an important role in loading functions.
10. List some cost measures for data mart. The cost measures for data marting are as
follows: hardware and software cost, network access, and time window constraints.
11. What is normalization?
Normalization is the standard relational method of database organization. In this method, the
rows are collapsed into a single row, hence it reduces space.
12. What is Row Splitting? Row splitting tends to leave a one-to-one map between
partitions. The motive of row splitting is to speed up the access to large table by reducing its
size.
13. What do you mean by Round Robin Partition? In the round robin technique, when a
new partition is needed, the old one is archived. It uses metadata to allow user access tool to
refer to the correct table partition. This technique makes it easy to automate table
management facilities within the data warehouse.
14. Why partitioning is done in data warehouse? Partitioning is done to enhance
performance and facilitate easy management of data. Partitioning also helps in balancing the
various requirements of the system. It optimizes the hardware performance and simplifies the
management of data warehouse by partitioning each fact table into multiple separate
partitions.
15. State the factors for determining the number of data marts. The determination of how
many data marts are possible depends on: network capacity, the time window available, the
volume of data being transferred, and the mechanisms being used to insert data into a data mart.
16. What is system management? State few system managers. System management is
mandatory for the successful implementation of a data warehouse. The most important
system managers are the system configuration manager, system scheduling manager,
system event manager, system database manager, and system backup recovery manager.
17. What is metadata repository? The metadata repository is an integral part of a data
warehouse system. It holds the following metadata: the definition of the data warehouse,
business metadata, operational metadata, data for mapping from the operational environment
to the data warehouse, and algorithms for summarization.
18. State few challenges for Metadata management Metadata in a big organization is
scattered across the organization. This metadata is spread in spreadsheets, databases, and
applications. Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined. There are no industry-
wide accepted standards. Data management solution vendors have narrow focus. There are
no easy and accepted methods of passing metadata.
19. Give a schematic diagram of data
20. What is partitioning dimensions? If a dimension contains large number of entries, then
it is required to partition the dimensions. Here we have to check the size of a dimension.
Consider a large design that changes over time. If we need to store all the variations in order
to apply comparisons, that dimension may be very large. This would definitely affect the
response time
UNIT III
PART B
1. Discuss the role of Meta data in data warehousing.
Serving as "data about data," metadata is the backbone of efficient data processing and
analytics, giving users and systems the contextual information needed to leverage data
effectively. In this article, we will go into detail regarding what metadata is in data
warehouses, its different kinds of classification in their structures, how it works and improves
the performance of a data warehouse, and design concepts for managing metadata, with
specific examples of AI metadata. The article also highlights the role of metadata in
enforcing data governance, data analysis using data mining techniques, and data analytics in
real-time.
In a data warehouse, metadata refers to the data that provides the user with more detailed
elaboration and classification for it to be more manageable. Information such as the origin of
the data, what kind of data it is, what kind of operations were performed on it, timestamps,
and how different data sets relate to each other is provided as metadata. This added layer of
information enhances the usability of data, ensuring that raw data in warehouses is usable,
interpretable, and actionable.
Further, for a given dataset/package, metadata might indicate the type of structure a given
package will have, where the package was obtained from, when it was taken, and what
operations were used on it. This context is important to large-scale data engineering and data
solutions and services, with the assistance of transformed and well-categorized data. For any
data warehouse, metadata acts as a very basic building block of information that assists users
in making sense of the data without getting lost in the vast quantities of information
available.
Business Metadata
Business metadata is aimed at business users rather than IT staff. It covers data ownership
information, business definitions and terminology, and changing policies, so that
non-technical users can understand what the data represents and how it may be used.
Technical Metadata
Technical metadata is tailored for programmers, developers, data engineers, and other IT
people as it gives insight into how data is constructed, where it is kept, and how it is
managed. For instance, it incorporates data lineage information, which links data from its
sources after undergoing several processes and schema information such as tables, primary
key, and foreign key, as well as their linkages. Technical metadata also maintains information
on ETL, which makes it possible to follow the mapping of how data is transformed and
stored by the needs of the warehouse and how problems in the flow of data transfer and
storage in the warehouse can be fixed.
Operational Metadata
Operational metadata helps in the operational aspect of a data warehouse as it explains the
activities that occur on the system. This includes data on ETL schedules, the volume of data
loads, job success and failure indices, and other system load measures such as memory and
CPU usage. This type of metadata is also important for data center operations because, with
it, teams can manage and avoid data staleness, bottlenecks, and performance degradation
Metadata is essential for enhancing the efficiency of data warehouses, improving usability,
and ensuring the system’s ability to handle real-time data analytics demands. Here are a
number of ways of using metadata in data warehouse environments.
Data Classification and Discovery: Metadata allows users to easily find and identify
relevant data. For example, by categorizing data according to type, source, and
purpose, metadata simplifies data search and retrieval, which is critical in data
warehouses that house vast amounts of information.
Data Lineage and Auditing: Metadata provides a clear view of data lineage, which
includes the data's origins, transformations, and destination within the data
warehouse. This is very important for compliance issues as it allows a company to
trace history, ensuring accurate and transparent reports.
Improved Data Quality: The metadata sets the standards and business rules to support
the data quality initiatives. As a means of checking data consistency and ensuring that
data transformations are automated, metadata adds value to the resources. This makes
it possible for the data warehouse to be an integral data source for analyzing and
reporting purposes.
Improved querying speed: Metadata streamlines the query process as it indicates to
the database what data is and where it is contained. This optimizes the querying
process, especially when retrieving datasets that need to be ready at all times,
especially for real-time data analytics.
Improving data management: Instead of having several policies that are sometimes
hard to understand, metadata enables using a single effective policy to guarantee data
protection. No one without the required permissions can access, use, or even change
the information in a repository.
Managing metadata effectively requires specialized tools that integrate, store, and govern
metadata across various systems within a data warehouse. Here are some popular data
engineering tools for metadata management, each offering unique capabilities to enhance data
warehouse efficiency:
Apache Atlas
Apache Atlas is a strong open-source metadata management and data governance application
suited for any data warehouse, especially those that are Hadoop-based. It allows businesses to
specify, classify, and monitor their data assets across diverse repositories. Atlas facilitates the
automation of data lineage tracking, allowing users to monitor how data flows from one
source to another and in what manner. This is important in areas of audit and compliance.
Furthermore, Atlas enhances data discovery tools through customizable classifications and
business vocabularies, which both technical and non-technical users can easily understand
and use.
Collibra
Collibra’s Data Intelligence Cloud is the most comprehensive platform for metadata
management, governance, and stewardship. It attracts enterprises with a data governance and
compliance center. It also has a big data catalog that describes data assets with simple
language, visualizing data lineage and panels for classification. Moreover, Collibra’s data
workflows promote data collaboration across departments and hence mitigate data silos.
Collibra also sets out extensive data control policies to ensure that throughout the lifecycle of
data, its ownership and accountability are enforced such that relevant descriptions and
classification are provided and maintained.
Alation
Alation is a leader in data cataloging and metadata management and is known for its AI-
driven approach to data discovery and organization. Alation’s automated indexing and
tracking of user interaction makes metadata management a lot easier by utilizing machine
learning. Because of Alation’s emphasis on collaboration features, users can share insights
and comments and provide contextual information about the data substance, leading to more
effective governance of this data. Furthermore, its data lineage management tools help
organizations know where their data is located within their environment, allowing them to
make sure the company meets all the regulatory requirements.
Microsoft Azure Data Catalog
Microsoft Azure Data Catalog is a fully managed cloud-based service that allows users to
catalog, annotate, and classify data assets across various sources. It offers a consolidated view
of an organization’s data assets through a central metadata repository, which allows all the
authenticated people in the organization to view the information. Azure Data Catalog can
cover both data types at a time: structured and unstructured, making it applicable for mixed
data warehouse environments. With the Azure Data Catalog, restructuring information into
other Azure services becomes easier and more fully operable due to the cloud, and other
features like tagging and search make data easier to find.
Talend Data Catalog
Talend Data Catalog is an integrated solution for metadata management that fits the
definition of data discovery, data quality assessment, and data lineage tracking. Talend
automatically documents metadata by linking it to warehouse business processes, allowing
businesses to visualize data flow and dependence. With its intuitive interface design, it also
features powerful search functions that allow users to perform data discovery without hassle.
With these features, Talend enables teams to collaborate on data governance activities,
improving data accuracy, reliability, and compliance with enterprise standards. Talend helps
maintain a dynamic and responsive metadata repository by providing real-time metadata
updates.
Here are some of the best examples of AI metadata that demonstrate how AI interacts with
metadata to assist with data understanding and information retrieval:
AI systems rely on metadata to maintain data provenance, which records the origin and
history of data through each transformation step. For instance, AI models in the financial or
healthcare industries rely on data lineage metadata to know where the data was obtained, how
it was processed, and how it was validated. This creates robust compliance and effective
auditing, especially for industries with strict data integrity management rules.
User interaction metadata involves clicks, search and browsing history, preferences, and
behavior, which enables AI algorithms to produce personalized recommendations. For
example, platforms like Netflix or Spotify benefit from user metadata and recommend
something according to one’s preferences. They also learn from every user’s activity to
enhance their content recommendations.
Artificial intelligence systems depend on tagging text, images, and video metadata for
efficient content classification and retrieval. For instance, metadata tags in digital image
libraries or image banks may sort keywords, descriptions, and contextual information. These
tags were incorporated into the content and image recognition AI for classification purposes,
which helps to easily retrieve and assess the vast quantity of visual data. This feature is
advantageous in media and e-commerce activities.
The NLP systems effectively comprehend text by understanding the designed metadata such
as context, sentiment score, language, and named entities. To illustrate, conversational agents
like chatbots or virtual assistants leverage metadata about user intent, previous conversations,
or sentiment level to enhance their responses. It allows AI systems to better interpret the
subtleties of human language and respond accordingly by providing a more context-
appropriate response.
To sum up, metadata in data warehouses provides the necessary framework for organization,
context, and control, facilitating ease of management and analysis of the large amounts of
data available. With the adoption of metadata, organizations can enhance data solutions and
services, improve real-time data analytics, and implement proper data governance solutions.
In these current times, which are influenced by data, metadata is one key resource every
organization must use. As organizations target a data economy to improve their data
engineering and mining strategies, it will be critical to understand metadata. Other
applications, such as Apache Atlas, Collibra, Alation, and IBM InfoSphere, help achieve
this goal, further setting the stage for a structured, compliant, and efficient data warehouse
ecosystem.
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge
size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It
reduces the time to load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be
marked as read-only. We can then put these partitions into a state where they cannot be
modified. Then they can be backed up. It means only the current partition is to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant.
It does not have to scan the whole data.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business. For example, if the
user queries for month to date data then it is appropriate to partition the data into monthly
segments. We can reuse the partitioned tables by removing the data in them.
Partitioning by Time into Different-sized Segments
This kind of partition is done where the aged data is accessed infrequently. It is implemented
as a set of small partitions for relatively current data and a larger partition for inactive data.
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time, such as
product group, region, supplier, or any other dimension. Let's have an example.
Suppose a market function has been structured into distinct regional departments, for example
on a state-by-state basis. If each region wants to query on information captured within its
region, it would prove to be more effective to partition the fact table into regional partitions.
This will cause the queries to speed up because it does not require to scan information that is
not relevant.
Points to Note
The query does not have to scan irrelevant data which speeds up the query process.
This technique is not appropriate where the dimensions are unlikely to change in
future. So, it is worth determining that the dimension does not change in future.
If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend to perform the partition only on the basis of time dimension, unless
you are certain that the suggested dimension grouping will not change within the life of the
data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should
partition the fact table on the basis of its size. We can set a predetermined size as a critical
point; when the table exceeds that size, a new table partition is created.
Partitioning Dimensions
Consider a large design that changes over time. If we need to store all the variations in order
to apply comparisons, that dimension may be very large. This would definitely affect the
response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It uses
metadata to allow user access tools to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
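The bookkeeping behind this technique can be sketched in a few lines of Python, assuming monthly partitions and a simple in-memory dictionary standing in for the warehouse's metadata repository; the partition limit and table names are invented for illustration.

from collections import OrderedDict

MAX_ONLINE_PARTITIONS = 12          # assumed policy: keep one year of monthly partitions online

partition_metadata = OrderedDict()  # logical period -> physical table name
archived = []

def add_partition(period):
    # Create a new monthly partition; archive the oldest one when the limit is reached
    if len(partition_metadata) >= MAX_ONLINE_PARTITIONS:
        oldest_period, oldest_table = partition_metadata.popitem(last=False)
        archived.append(oldest_table)           # in practice: moved to archive storage
    partition_metadata[period] = f"sales_fact_{period}"

def table_for(period):
    # User access tools consult the metadata to find the right physical table
    return partition_metadata.get(period)

for month in ["2024_01", "2024_02", "2024_03"]:
    add_partition(month)
print(table_for("2024_02"))   # -> sales_fact_2024_02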
Vertical Partition
Vertical partitioning splits the data vertically. The following tables depict how vertical
partitioning is done.
Vertical partitioning can be performed in the following two ways −
Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, the
rows are collapsed into a single row, hence it reduces space. Take a look at the following
tables that show how normalization is performed.
Table before normalization:
Product_id   Quantity   Value   Sales_date   Store_id   Store_name   Location    Region
30           5          3.67    3-Aug-13     16         sunny        Bangalore   S
35           4          5.33    3-Sep-13     16         sunny        Bangalore   S
40           5          2.50    3-Sep-13     64         san          Mumbai      W
45           7          5.66    3-Sep-13     16         sunny        Bangalore   S

Tables after normalization:
Store table:
Store_id   Store_name   Location    Region
16         sunny        Bangalore   S
64         san          Mumbai      W

Sales table:
Product_id   Quantity   Value   Sales_date   Store_id
30           5          3.67    3-Aug-13     16
35           4          5.33    3-Sep-13     16
40           5          2.50    3-Sep-13     64
45           7          5.66    3-Sep-13     16
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting
is to speed up the access to large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.
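A minimal pandas sketch of the normalization step shown in the tables above, using the same illustrative store data:

import pandas as pd

# Denormalized table (as in the 'before normalization' example above)
sales = pd.DataFrame({
    "product_id": [30, 35, 40, 45],
    "quantity":   [5, 4, 5, 7],
    "value":      [3.67, 5.33, 2.50, 5.66],
    "sales_date": ["3-Aug-13", "3-Sep-13", "3-Sep-13", "3-Sep-13"],
    "store_id":   [16, 16, 64, 16],
    "store_name": ["sunny", "sunny", "san", "sunny"],
    "location":   ["Bangalore", "Bangalore", "Mumbai", "Bangalore"],
    "region":     ["S", "S", "W", "S"],
})

# Normalization: split the store columns into their own table, keyed by store_id
store_table = sales[["store_id", "store_name", "location", "region"]].drop_duplicates()
sales_table = sales[["product_id", "quantity", "value", "sales_date", "store_id"]]

# The two vertical partitions can be re-joined on store_id when required
rejoined = sales_table.merge(store_table, on="store_id")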
It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to
reorganizing the fact table. Let's have an example. Suppose we want to partition the
following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized in 30 geographical regions and each region has a different
number of branches. That will give us 30 partitions, which is reasonable. This partitioning is
good enough because our requirements capture has shown that a vast majority of queries are
restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from every
region will be in one partition. Now the user who wants to look at data within his own region
has to query across multiple partitions.
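A small Python sketch of why the choice matters, assuming the Account_Txn_Table above is split into per-region partitions held as separate DataFrames; the rows are invented for illustration.

import pandas as pd

txns = pd.DataFrame({
    "transaction_id":   [1, 2, 3, 4],
    "account_id":       [101, 102, 103, 104],
    "value":            [250.0, 90.0, 40.0, 300.0],
    "transaction_date": ["2024-03-01", "2024-03-02", "2024-03-02", "2024-03-03"],
    "region":           ["North", "North", "South", "East"],
})

# Partition horizontally by region: one physical table per region
by_region = {region: part for region, part in txns.groupby("region")}

# A query restricted to the user's own region touches exactly one partition
north_only = by_region["North"]

# Had we partitioned by transaction_date instead, the same regional query
# would have to scan every date partition to collect the region's rows
by_date = {d: part for d, part in txns.groupby("transaction_date")}
north_across_dates = pd.concat(p[p["region"] == "North"] for p in by_date.values())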
1. Normalization:
This involves removing redundant columns from a table and placing them in separate
tables linked with foreign keys.
Example: A customer table could be split into two: one for basic information (name,
address) and another for contact information (phone, email).
This reduces redundancy and improves data integrity.
2. Row Splitting (Columnar Approach):
This involves splitting the original table into multiple tables, each containing a subset
of the columns.
Example: A table with user information and a separate table for user preferences.
This can improve query performance by reducing the amount of data accessed for
specific queries.
It also allows for different storage methods for different columns, like storing large
BLOBs in a separate table.
Star Schema
In the star schema, each dimension is represented by a single dimension table that is linked
directly to a central fact table.
Snowflake Schema
In the snowflake schema, some dimension tables are normalized into additional tables, further
splitting the dimension data.
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as galaxy schema. For example,
a fact constellation could contain two fact tables, namely sales and shipping, that share
dimension tables.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Syntax for Cube Definition
define cube < cube_name > [ < dimension-list > ]: < measure_list >
Syntax for Dimension Definition
define dimension < dimension_name > as ( < attribute_or_dimension_list > )
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
The snowflake schema of the same data warehouse can be defined using DMQL as follows −
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))
Multidimensional Database:
Purpose:
Primarily used for Online Analytical Processing (OLAP) and data warehousing,
providing fast access to summarized data for analytical queries.
Structure:
Organizes data into a data cube, where dimensions represent different perspectives of the
data (e.g., time, location, product), and measures are numerical values representing the
data (e.g., sales).
Advantages:
Efficient for complex queries, allows for slicing and dicing data, and provides a
structured way to analyze large datasets.
Example:
Sales data can be viewed by product, region, and time period.
Data Mart:
Purpose:
A specialized subset of a data warehouse, focused on a specific department or business
unit.
Structure:
Often uses a simplified schema, such as a star schema, to focus on the data relevant to
that unit.
Advantages:
Provides faster access to specific data for a department, reducing the need to query the
entire data warehouse.
Example:
A marketing data mart containing information about customer demographics, purchase
history, and campaign effectiveness.
Data Cube:
Purpose:
The core storage structure within an MDB, providing a multi-dimensional array of data.
Structure:
Consists of dimensions, which are the perspectives of the data, and measures, which are
the numerical values being analyzed.
Function:
Enables users to analyze data from different angles, such as slicing and dicing the data
based on different dimensions.
Example:
A sales data cube might have dimensions like product, region, time, and measure like
sales revenue, allowing users to analyze sales trends by region and time.
Schemas for Multidimensional Databases:
Star Schema:
Features a central fact table (containing measures) linked to multiple dimension tables
(containing attributes of the dimensions). It is simple and efficient for querying large
datasets.
Snowflake Schema:
Extends the star schema by further normalizing dimension tables into multiple
tables. This can improve data integrity but increases complexity.
Fact Constellation Schema:
Contains multiple fact tables that share dimension tables. This is useful when dealing
with multiple business processes or subject areas.
3. Explain the star schema, snowflake schema and fact constellation schema with
examples.
Star Schema: The star schema is a widely used schema design in data warehousing. It
features a central fact table that holds the primary data or measures, such as sales, revenue,
or quantities. The fact table is connected to multiple dimension tables, each representing
different attributes or characteristics related to the data in the fact table. The dimension
tables are not directly connected to each other, creating a simple and easy-to-understand
structure.
Simplicity: Star schema is the simplest and most straightforward schema design, with fewer
tables and relationships. It provides ease of understanding, querying, and report generation.
Denormalization: Dimension tables in a star schema are often denormalized, meaning they
hold redundant attribute data in a single table to avoid extra joins and speed up queries.
Example: In a retail data warehouse, the central fact table could contain measures like “Total
Sales” and “Quantity Sold.” The dimension tables could include “Product” with attributes like
“Product ID,” “Product Name,” and “Category,” and “Time” with attributes like “Date,”
“Month,” and “Year.” The fact table connects to these dimension tables through foreign keys,
allowing analysts to perform queries like “Total Sales by Product,” as sketched below.
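As a concrete sketch of such a query, the following pandas example joins an invented fact table to its Product dimension to compute total sales by product; the tables and values are illustrative only.

import pandas as pd

# Fact table with measures and foreign keys to the dimensions
sales_fact = pd.DataFrame({
    "product_id":    [1, 2, 1, 3],
    "time_id":       [10, 10, 11, 11],
    "total_sales":   [500.0, 300.0, 450.0, 200.0],
    "quantity_sold": [5, 3, 4, 2],
})

# Product dimension table (denormalized: category kept on the same row)
product_dim = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "product_name": ["Shirt", "Pant", "Tees"],
    "category":     ["Apparel", "Apparel", "Apparel"],
})

# "Total Sales by Product": join on the foreign key, then aggregate the measure
report = (sales_fact.merge(product_dim, on="product_id")
                    .groupby("product_name")["total_sales"].sum())
print(report)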
Snowflake Schema: The snowflake schema is an extension of the star schema, designed
to further reduce data redundancy by normalizing the dimension tables. In a snowflake
schema, dimension tables are broken down into multiple related sub-tables. This
normalization creates a more complex structure with additional levels of relationships,
reducing storage requirements but potentially increasing query complexity due to the
need for additional joins.
Normalization: Snowflake schema normalizes dimension tables, resulting in more tables and
more relationships between them.
Space Efficiency: Due to normalization, the snowflake schema may require less storage
space for dimension data but may lead to more complex queries due to additional joins.
Example: Continuing with the retail data warehouse example, in a snowflake schema the
“Product” dimension could be further normalized into separate “Product” and “Category”
tables linked to each product. This normalization allows for efficient storage of data, but it
may require more joins when querying.
Fact Constellation Schema: The fact constellation schema contains multiple fact tables that
share common dimension tables.
Complexity: Fact constellation schema is the most complex among the three designs, as it
involves several fact tables linked to shared dimension tables.
Flexibility: This schema design offers more flexibility in modeling complex and diverse
business scenarios, allowing multiple fact tables to coexist and share dimensions.
Example: In a data warehouse for a healthcare organization, there could be multiple fact tables
representing different metrics like patient admissions, medical procedures, and medication
dispensing. These fact tables would share common dimension tables like “Patient,” “Doctor,”
and “Date.” The fact constellation schema allows analysts to analyze different aspects of the
organization's healthcare operations while reusing the shared dimension tables.
Fact table keys (hospital data warehouse example):
Register_document_key
Revenue_document_key
Daily_task_document_key
Administrative_key
Patient_Directory_key
Doctor_dimension_key
Diagnosis_key
Health_Insurance_key
Partient_name_key
Issue_details_key
Registration_id_key
Guardian_contact_key
Dimension table for Revenue_document_key
Revenue_cost_document
Target_revenue_document
Target_patient_document
The Revenue_document_key is an added measure through which total revenue can be
calculated. Keys such as cost, target and patient admission details will be helpful to smoothly
maintain the financial details of the hospital.
Dimension table for Daily_task_document_key
Daily_task_details
Daily_task_id
Responsible_person_for_daily_task
Daily task document will be effective to track every day’s tasks to maintain smooth service in
hospital. Documentation of task id and responsibility will be beneficial to eliminate bias and
dilemma during service.
Dimension table for Administrative_key
Emergency_document_administration_key
Emergency_policies_document
Emergency_equipment_requirements_document_key
Maternity_ward_administrative_documentation_key
Maternity_ward_policies_document_name
General_ward_administrative_documentation_key
General_ward_policies_document_name
Administrative Department
First, the Administrative key has been included within the fact table. Various administrative documents, including information regarding healthcare policies, are located under the Administrative key. The shift timings and the documents describing different policies appear in the various dimension tables of the administrative department. This information avoids any confusion regarding hospital policies, and administrative staff can describe those policies with confidence. Documenting the policies of the maternity ward and the general wards helps maintain effective functioning in the hospital. Further, managing the documents that hold administrative details helps eliminate documentation issues and improves the hospital's ability to deliver its services.
Dimension table for Patient_Directory_key
Patient_gender_documentation_key
Male_document_id
Female_document_id
Transgender_document_id
Patient_medical_history_document_key
Previous_symptoms_document
Special_attributes_document
Patient_admission_time_document
Next, the Patient Directory key has been considered within the fact table. Different information regarding the patient has been used to build this dimension table of the star schema model. Patient gender information sits under the gender documentation key, the patient's medical history with the hospital sits under the medical history key, and patient admission timings sit under the admission time key. Identifying the gender of patients from the document history helps maintain smooth registration services. Further, documenting a patient's previous symptoms and other medical history makes it easy, during further treatment, to understand the details and treat the patient as required.
Doctor dimension key
Doctor_information_key_document
Doctor_Names_document
Doctor_timings_document
Doctor_shifts_document
Doctor_contact_document
Doctor_information_according_to_ward_document_Key
The two keys above form the dimension tables associated with the hospital's doctor information. Doctor names are given under the Doctor information key, along with the timings of the different doctors, and the doctors' shifts are described in this dimension table. Doctor information for the different wards is provided under the Doctor information according to ward key, including the doctor timings for shifts in different wards and the doctors assigned to each shift. Having details of doctors' shifts and ward assignments helps eliminate unnecessary doubts during service and, further, helps in conducting proper operations, especially during an emergency.
Diagnosis_Key Dimension Table
Diagnosis_history_document_Key
Patient_Requirements_document_key
Diagnosis_timings_number_document
Diagnosis_result_document
Date_number
Time_number
The diagnosis key dimension table provides the necessary details of patient requirements. Different patients require various tests according to their needs. The diagnosis key helps the hospital keep track of the different diagnosis reports of each patient (Rosita, 2021). This reduces unnecessary confusion for the hospital, so regular operations can be completed effectively. Confusion regarding complex operations is reduced thanks to the hospital's diagnosis directory, and the delivery timings of diagnosis reports clear up confusion regarding patient treatments (Cimpoiasu et al., 2021).
Health_Insurance_key
Patient_personal_information_document_key
Patient_name_document
Patient_contact_details
Patient_Health_Insurance_History_document
Previous_coverage_document
The Health Insurance key gives the patient's personal information, and the patient's contact details are provided under this dimension table. The health insurance key provides the hospital with valuable information regarding the patient's medical coverage, which helps the hospital make decisions in a more effective manner (Rocha, Capelo and Ciferri, 2020). Doctors can then provide suitable alternatives according to the patient's medical coverage. Effective handling of health insurance coverage can help the hospital recover its reputation within a short period. These details also help insurance organisations identify opportunities for renewing insurance. This documentation further makes it possible to treat patients at comparatively lower costs, based on the criteria of their insurance policies.
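As a rough sketch of how the fact table and one of the dimension tables described above could be declared, the following SQL uses the keys listed earlier with assumed column types and simplified names; it illustrates the star schema layout rather than prescribing a design.

-- Hospital star schema sketch: data types are assumed, names follow the keys listed above
CREATE TABLE doctor_dim (
    doctor_dimension_key INTEGER PRIMARY KEY,
    doctor_name          VARCHAR(100),
    doctor_shift         VARCHAR(20),
    doctor_contact       VARCHAR(30),
    ward                 VARCHAR(50)
);

CREATE TABLE hospital_fact (
    register_document_key   INTEGER,
    revenue_document_key    INTEGER,
    daily_task_document_key INTEGER,
    administrative_key      INTEGER,
    patient_directory_key   INTEGER,
    doctor_dimension_key    INTEGER REFERENCES doctor_dim(doctor_dimension_key),
    diagnosis_key           INTEGER,
    health_insurance_key    INTEGER,
    total_revenue           DECIMAL(12,2)  -- measure derived through the revenue document key
);
-- The other *_key columns would reference their own dimension tables, omitted here for brevity.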
4) Two examples of Star schema
Insurance clear up
Several issues kept arising due to complications with paperwork. This organisation did not have any framework for resolving such paperwork-related issues. A major problem the organisation could not handle was separating relevant data from its unnecessary counterpart, and these aspects further aggravated client dissatisfaction. A star schema model was used by this organisation to resolve the issue.
Health insurance: Health_Insurance_key
This key was used by members of the organisation to find information related to any specific client being treated at the hospital. Using a star schema model helped organise the various pieces of information provided by clients (Bacry et al., 2020). The model simplified an otherwise complicated procedure of separating the necessary facts from a massive combination of sources. All the information provided by clients was entered into the star schema, and only the aspects of insurance that could be covered by the service providers for those clients were displayed. Thus, a dimension table and its supporting facts were used and explained by the organisation to its disgruntled customers.
Solving X-Ray complications
This hospital was facing problems in solving simple issues of daily working. As a result, they
were looking for a system that could solve all these problems that were causing major
hiccups on the daily. Problems related to arranging all their daily functions were becoming
problematic for their hospital’s ability to function. Upon introduction of this star schema
model, attributes that were related to specific patients were keyed and tagged (Zhu et
al., 2017). This ensured that there would be no further problems regarding
miscommunication of reports.
Fact table: Diagnosis_Key
These keys were assigned by members of the organisation to pinpoint all the aspects that were necessary for any specific situation. As a result, diagnosing patients became even simpler, since all the information related to them was easily available.
Conclusion
It is concluded that confusion over different operational services in the hospital environment has occurred due to a lack of operational linkage and of responsibility taken by higher authorities. In this context, the relevant facts have been identified, including administrative, doctor, patient, insurance, diagnosis, and other facts. Moreover, the examples of insurance cover and X-ray diagnosis have been evaluated to show how the star schema model can be implemented to address hospital issues.
UNIT V
PART A
1. State the responsibilities of System Configuration Manager The system configuration manager is responsible for the management of the setup and configuration of the data warehouse. The structure of the configuration manager varies from one operating system to another. On UNIX, the structure of the configuration manager varies from vendor to vendor. Configuration managers have a single user interface, which allows us to control all aspects of the system.
2. State the responsibilities of System Scheduling Manager System Scheduling Manager
is responsible for the successful implementation of the data warehouse. Its purpose is to
schedule ad hoc queries. Every operating system has its own scheduler with some form of
batch control mechanism.
3. State the features of System Scheduling Manager
Work across cluster or MPP boundaries
Deal with international time differences
Handle job failure
Handle multiple queries
Support job priorities
Restart or re-queue the failed jobs
Notify the user or a process when a job is completed
Maintain the job schedules across system outages
4. Define Event Events are the actions that are generated by the user or the system itself. It
may be noted that the event is a measurable, observable, occurrence of a defined action.
5. What is the role of process managers? Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers:
Load manager
Warehouse manager
Query manager
6. State the responsibilities of warehouse manager The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.
7. State the functions of Warehouse Manager A warehouse manager performs the
following functions − Analyzes the data to perform consistency and referential integrity
checks. Creates indexes, business views, partition views against the base data. Generates
new aggregations and updates the existing aggregations. Generates normalizations.
Transforms and merges the source data of the temporary store into the published data
warehouse.
8. State the functions of load manager The load manager performs the following functions
Extract data from the source system. Fast load the extracted data into temporary data
store. Perform simple transformations into structure similar to the one in the data
warehouse.
9. State the responsibilities of query manager The query manager is responsible for
directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds
up the query request and response process. In addition, the query manager is responsible for
scheduling the execution of the queries posted by the user
10. What are the components of query manager? A query manager includes the following components:
Query redirection via C tool or RDBMS
Stored procedures
Query management tool
Query scheduling via C tool or RDBMS
Query scheduling via third-party software
11. List the functions of Query Manager It presents the data to the user in a form they
understand. It schedules the execution of the queries posted by the end-user. It stores
query profiles to allow the warehouse manager to determine which indexes and aggregations
are appropriate.
12. Why tuning a data warehouse is difficult? Tuning a data warehouse is a difficult
procedure due to following reasons Data warehouse is dynamic; it never remains constant.
It is very difficult to predict what query the user is going to post in the future. Business
requirements change with time. Users and their profiles keep changing. The user can
switch from one group to another. The data load on the warehouse also changes with time.
13. What are the two kinds of queries in data warehouse? The two kinds of queries in data
warehouse are Fixed queries Ad hoc queries
14. What is Unit Testing? In unit testing, each component is separately tested. Each
module, i.e., procedure, program, SQL Script, Unix shell is tested. This test is performed by
the developer.
15. What is Integration Testing? In integration testing, the various modules of the
application are brought together and then tested against the number of inputs. It is
performed to test whether the various components do well after integration.
16. What is System Testing? In system testing, the whole data warehouse application is
tested together. The purpose of system testing is to check whether the entire system works
correctly together or not. System testing is performed by the testing team. Since the size
of the whole data warehouse is very large, it is usually possible to perform minimal system
testing before the test plan can be enacted.
17. List the scenarios for which testing is needed
Media failure
Loss or damage of table space or data file
Loss or damage of redo log file
Loss or damage of control file
Instance failure
Loss or damage of archive file
Loss or damage of table
Failure during data movement
18. List out few criteria that are required for choosing a system and database manager.
Increase user's quota
Assign and de-assign roles to the users
Assign and de-assign the profiles to the users
Perform database space management
Monitor and report on space usage
Tidy up fragmented and unused space
Add and expand the space
Add and remove users
Manage user password
19. List some common events that need to be tracked.
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding an 80% threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below threshold
A table reaching the maximum of its size
Excessive memory swapping
20. What are the aspects to be considered while testing the operational environment?
Security
Scheduler
Disk configuration
Management tools
UNIT V
PART B
1. Describe in detail about working of system scheduling manager.
System management is mandatory for the successful implementation of a data warehouse. The most important system managers are:
System configuration manager
System scheduling manager
System event manager
System database manager
System backup recovery manager
System Configuration Manager
The system configuration manager is responsible for the management of the setup and configuration of the data warehouse. The structure of the configuration manager varies from one operating system to another. On UNIX, the structure of the configuration manager varies from vendor to vendor. Configuration managers have a single user interface, which allows us to control all aspects of the system. Note: The most important configuration tool is the I/O manager.
System Scheduling Manager
The system scheduling manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The list of features a system scheduling manager must have is as follows:
Work across cluster or MPP boundaries
Deal with international time differences
Handle job failure
Handle multiple queries
Support job priorities
Restart or re-queue the failed jobs
Notify the user or a process when a job is completed
Maintain the job schedules across system outages
Re-queue jobs to other queues
Support the stopping and starting of queues
Log queued jobs
Deal with inter-queue processing
Note: The above list can be used as evaluation parameters for a good scheduler. Some important jobs that a scheduler must be able to handle are as follows:
Daily and ad hoc query scheduling
Execution of regular report requirements
Data load
Data processing
Index creation
Backup
Aggregation creation
Data transformation
Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.
System Event Manager
The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex; therefore we need a tool that automatically handles all the events without any intervention of the user. Note: The event manager monitors event occurrences and deals with them, and it also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action. Given below is a list of common events that are required to be tracked:
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding an 80% threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below threshold
A table reaching the maximum of its size
Excessive memory swapping
A table failing to extend due to lack of space
Disks exhibiting I/O bottlenecks
Usage of temporary or sort areas reaching certain thresholds
Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.
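To make the idea of an event and its handler concrete, here is a minimal SQL sketch. It assumes a hypothetical monitoring table named disk_space_usage that the event manager polls; the table, its columns, and the 1024 MB threshold are illustrative assumptions rather than features of any particular product.

-- Hypothetical monitoring table polled by the event manager (illustrative only)
-- disk_space_usage(disk_name VARCHAR, free_mb INTEGER, checked_at TIMESTAMP)
SELECT disk_name, free_mb
FROM   disk_space_usage
WHERE  free_mb < 1024;  -- event condition: running out of space on a key disk
-- Each returned row is an event occurrence; the associated event handler (for example,
-- a script that extends the space or notifies an operator) would then be executed.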
System and Database Manager
The system manager and the database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and database manager are as follows:
Increase user's quota
Assign and de-assign roles to the users
Assign and de-assign the profiles to the users
Perform database space management
Monitor and report on space usage
Tidy up fragmented and unused space
Add and expand the space
Add and remove users
Manage user password
Manage summary or temporary tables
Assign or de-assign temporary space to and from the user
Reclaim the space from old or out-of-date temporary tables
Manage error and trace logs
Browse log and trace files
Redirect error or trace information
Switch on and off error and trace logging
Perform system space management
Monitor and report on space usage
Clean up old and unused file directories
Add or expand space
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features required for the management of backups are as follows:
Scheduling
Backup data tracking
Database awareness
Backups are taken only to protect against data loss. The following are the important points to remember:
The backup software will keep some form of database of where and when each piece of data was backed up.
The backup recovery manager must have a good front-end to that database.
The backup recovery software should be database aware. Being aware of the database, the software can then be addressed in database terms and will not perform backups that would not be viable.
2. Summarize the role of load manager and warehouse manager.
Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers:
Load manager
Warehouse manager
Query manager
Load Manager
The load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions, from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions:
Extract data from the source system.
Fast load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology is not suitable here, since gateways are inefficient when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations; the complex checks can be carried out afterwards. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to the required data types.
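The two checks above can be expressed with very simple SQL while loading into the temporary store. The table and column names below (epos_staging, epos_sales_temp) are assumptions made for this sketch, not part of any specific load manager.

-- Simple transformation during the fast load into a temporary data store
INSERT INTO epos_sales_temp (store_id, product_id, sale_date, amount)
SELECT store_id,
       product_id,
       CAST(sale_date AS DATE),           -- convert values to the required data types
       CAST(amount    AS DECIMAL(10,2))
FROM   epos_staging;                       -- columns not required are simply not selected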
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.
Warehouse Manager Architecture
The controlling process
Stored procedures or C with SQL
Backup/recovery tool
SQL scripts
Functions of Warehouse Manager
A warehouse manager performs the following functions:
Analyzes the data to perform consistency and referential integrity checks.
Creates indexes, business views, and partition views against the base data.
Generates new aggregations and updates the existing aggregations.
Generates normalizations.
Transforms and merges the source data of the temporary store into the published data warehouse.
Backs up the data in the data warehouse.
Archives the data that has reached the end of its captured life.
Note: A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
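As a rough illustration of the index and aggregation work listed above, the sketch below uses assumed names (sales_fact, time_dim, sales_monthly_summary); it shows the kind of SQL a warehouse manager might issue, not the implementation of any particular product.

-- Index creation against the base data (illustrative names)
CREATE INDEX idx_sales_fact_time ON sales_fact (time_id);

-- Generating an aggregation (a summary table) from the base fact table
CREATE TABLE sales_monthly_summary AS
SELECT t.year,
       t.month,
       SUM(f.total_sales) AS total_sales
FROM   sales_fact f
JOIN   time_dim   t ON f.time_id = t.time_id
GROUP  BY t.year, t.month;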
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user.
Query Manager Architecture
A query manager includes the following components:
Query redirection via C tool or RDBMS
Stored procedures
Query management tool
Query scheduling via C tool or RDBMS
Query scheduling via third-party software
Functions of Query Manager
It presents the data to the user in a form they understand.
It schedules the execution of the queries posted by the end-user.
It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
The query manager is the process that manages the queries and speeds them up by directing each query to the most effective data source. This process also ensures that all the system resources are used most effectively, usually by scheduling the execution of queries. The query management process monitors the actual query profiles that are used to determine which aggregations to generate.
This process operates at all times that the data warehouse is made available to end-users. There are no major consecutive steps within this process; rather, there is a set of facilities that are constantly in operation.
For example, in the analysis of a sales data warehouse, if a user asks the system to "Report on sales of computers in Ghaziabad, UP over the past 2 weeks", this query could be satisfied by scanning all the detailed information for the past 2 weeks and filtering in all computer sales for Ghaziabad, or by scanning a suitable summary table if one exists. The query management process ensures that no single query can affect the overall system performance.
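A hedged sketch of this redirection is shown below. The detail tables reuse the assumed retail-style names from earlier, and daily_city_category_summary is an assumed aggregation built for this query profile; the date arithmetic shown is dialect-dependent.

-- Answered from the detail fact table (many rows scanned; names assumed)
SELECT SUM(f.amount) AS computer_sales
FROM   sales_fact  f
JOIN   product_dim p ON f.product_id = p.product_id
JOIN   store_dim   s ON f.store_id   = s.store_id
WHERE  p.category  = 'Computer'
  AND  s.city      = 'Ghaziabad'
  AND  f.sale_date >= CURRENT_DATE - INTERVAL '14' DAY;

-- The query manager can redirect the same question to a summary table instead
SELECT SUM(total_amount) AS computer_sales
FROM   daily_city_category_summary
WHERE  category  = 'Computer'
  AND  city      = 'Ghaziabad'
  AND  sale_date >= CURRENT_DATE - INTERVAL '14' DAY;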
Query capture − The query profiles change regularly over the life of a data warehouse, and the original user query requirements may be nothing more than a starting point. The summary tables are structured around a defined query profile, and if the profile changes, the summary tables must also change.
To monitor and understand accurately what the new query profile is, it can be very effective to capture the physical queries that are being executed. At various points in time, these captured queries can be analyzed to determine the new query profiles and the resulting impact on the summary tables.
c) N = 4 dimensions; Instructor dimension (Instructor_ID, Name, ...)
Similarities:
Both schemas are designed for data warehousing and analytics, utilizing a fact table to
store numerical data and dimension tables to store descriptive information.
Both schemas aim to improve query performance and facilitate complex reporting by
organizing data in a structured manner.
Differences:
Feature                      Star Schema                               Snowflake Schema
Dimension tables             Denormalized (one table per dimension)    Normalized into related sub-tables
Joins needed per query       Fewer                                     More
Storage for dimension data   Higher (some redundancy)                  Lower
Design and maintenance       Simpler                                   More complex
Star Schema:
Advantages:
o Faster query performance: Fewer joins required for queries due to
denormalization.
o Simpler design and maintenance: Easier to understand and manage compared
to snowflake schemas.
Disadvantages:
o Higher storage requirements: Data redundancy can lead to larger dimension
tables, requiring more storage space.
o Potential for data integrity issues: Denormalization can make it harder to
maintain data consistency.
Snowflake Schema:
Advantages:
o Lower storage requirements: Normalization reduces data redundancy, leading
to more efficient storage.
o Improved data integrity: Normalization helps maintain data consistency and
reduces the risk of errors.
Disadvantages:
o Slower query performance: More joins are required, potentially slowing down
query execution.
o More complex design and maintenance: Normalization can make the schema
more difficult to understand and manage.
Example: Consider a data warehouse for online sales.
Star Schema:
A fact table stores sales transactions, and dimension tables
(e.g., Customers, Products, Date) contain descriptive attributes like customer ID, product
ID, and date. Queries to find the total sales for a specific product would require a single
join between the fact table and the Products dimension table.
Snowflake Schema:
The dimension tables might be further normalized. For example, the Products table
could be split into Product Categories, Product Subcategories, and Product
Details tables. Finding the total sales for a specific product would require more joins
(e.g., fact table to Product Details, then Product Details to Product Subcategories, and so
on).
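To make the difference in join depth concrete for this example, the sketch below compares the two versions of a "total sales per product category" query; all table and column names are assumptions based on the tables named above.

-- Star schema: category is an attribute of the single Products dimension
SELECT p.category,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   products   p ON f.product_id = p.product_id
GROUP  BY p.category;

-- Snowflake schema: the category is reached through the normalized chain of tables
SELECT c.category_name,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact            f
JOIN   product_details       d  ON f.product_detail_id = d.product_detail_id
JOIN   product_subcategories sc ON d.subcategory_id    = sc.subcategory_id
JOIN   product_categories    c  ON sc.category_id      = c.category_id
GROUP  BY c.category_name;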