Module 3 - Datawarehousing

BI

Uploaded by

ambika venkatesh

Module-3

Data Warehousing
Contents

 Data Warehousing Definitions and Concepts

 Data Warehousing Process Overview

 Data Warehousing Architectures

 Data Integration and the Extraction, Transformation, and Load (ETL) Processes
What Is a Data Warehouse?

 A data warehouse (DW) is a pool of data produced to support decision making.
 It is a repository of current and historical data of potential interest to
managers throughout the organization.
 Data are usually structured to be available in a form ready for analytical
processing activities (i.e., online analytical processing [OLAP], data
mining, querying, reporting, and other decision support applications)
 The data warehouse is a collection of integrated, subject-oriented
databases designed to support DSS functions, where each unit of data is
non-volatile and relevant to some moment in time
A Historical Perspective to Data Warehousing

1970s: Mainframe computers; simple data entry; routine reporting; primitive database structures; Teradata incorporated
1980s: Mini/personal computers (PCs); business applications for PCs; distributed DBMS; relational DBMS; Teradata ships commercial DBs; "Business Data Warehouse" coined
1990s: Centralized data storage; data warehousing was born; Inmon, Building the Data Warehouse; Kimball, The Data Warehouse Toolkit; EDW architecture design
2000s: Exponentially growing Web data; consolidation of DW/BI industry; data warehouse appliances emerged; business intelligence popularized; data mining and predictive modeling; open source software; SaaS, PaaS, cloud computing
2010s: Big Data analytics; social media analytics; text and Web analytics; Hadoop, MapReduce, NoSQL; in-memory, in-database
 The motivations that led to developing data warehousing technologies go
back to the 1970s, when the computing world was dominated by the
mainframes.
 Real business data-processing applications, the ones run on the corporate
mainframes, had complicated file structures using early-generation
databases in which they stored data.
 Although these applications did a decent job of performing routine
transactional data-processing functions, the data created as a result of
these functions was locked away in the depths of the files and databases.
 When aggregated information such as sales trends by region and by
product type was needed, one had to formally request it from the data-
processing department, where it was put on a waiting list with a couple
hundred other report requests
 Later in this decade, commercial hardware and software companies began to emerge
with solutions to this problem. Founders worked to design a database management
system for parallel processing with multiple microprocessors, targeted specifically for
decision support.
 The 1980s were the decade of personal computers and minicomputers.
 Real computer applications were no longer only on mainframes; they were all over the
place, everywhere you looked in an organization. That led to a portentous problem
called islands of data.
 The proposed solution was the distributed database management system, which would pull the
requested data from databases across the organization, bring all the data back to the
same place, and then consolidate it, sort it, and do whatever else was necessary to
answer the user's question.
 Although the concept was a good one and early results from research were promising,
the outcome was plain and simple: these systems just didn't work efficiently in the real world.
• In the 1990s a new approach to solving the islands-of-data problem
surfaced. The 1990s philosophy involved going back to the 1970s
method, in which data from those places was copied to another
location, only doing it right this time; hence, data warehousing was born.
• In 1993, Bill Inmon wrote the seminal book Building the Data
Warehouse. Many people recognize Bill as the father of data
warehousing.
• In the 2000s, in the world of data warehousing, both popularity and the
amount of data continued to grow.
• In the 2010s the big buzz has been Big Data. The technologies that
came with Big Data include Hadoop, MapReduce, NoSQL, Hive, and so
forth
Characteristics of Data Warehousing

1. Subject oriented
2. Integrated
3. Time-variant (time series)
4. Nonvolatile
5. Web based
6. Relational/multidimensional
7. Client/server
8. Real time
9. Include metadata
Characteristics of Data Warehousing

1. Subject oriented
 Data are organized by detailed subject, such as sales, products, or
customers, containing only information relevant for decision support.
 Subject orientation enables users to determine not only how their
business is performing, but why.
 Subject orientation provides a more comprehensive view of the
organization.
2. Integrated
 A data warehouse is developed by integrating data from varied sources
into a consistent format.
 The data must be stored in the warehouse in a consistent and universally
acceptable manner in terms of naming, format, and coding.
3. Time variant
 A warehouse maintains historical data. The data do not necessarily
provide current status (except in real-time systems).
 They detect trends, deviations, and long-term relationships for
forecasting and comparisons, leading to decision making.
 The data stored in a data warehouse is documented with an element of
time, either explicitly or implicitly.
 Data for analysis from multiple sources contains multiple time points
(e.g., daily, weekly, monthly views).
4. Nonvolatile
 Data once entered into a data warehouse must remain unchanged.
 All data is read-only. Previous data is not erased when current data is
entered.
 This helps you to analyze what has happened and when.
5. Web based
 Data warehouses are typically designed to provide an efficient
computing environment for Web-based applications.
6. Relational/multidimensional
 A data warehouse uses either a relational structure or a
multidimensional structure.
 Relational models are flat, i.e., tables are two-dimensional;
multidimensional models can have more than two dimensions.
7. Client/server
 A data warehouse uses the client/ server architecture to provide easy
access for end users.
8. Real time
 Newer data warehouses provide real-time, or active, data-access and
analysis capabilities
9. Include metadata
 A data warehouse contains metadata (data about data) about how
the data are organized and how to effectively use them.
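The time-variant and nonvolatile characteristics above can be illustrated with a minimal sketch. The `PriceFact` and `PriceHistory` names and the sample prices are hypothetical, chosen only for illustration: rows carry an explicit date, new rows are appended, and existing rows are never changed or erased.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored row cannot be modified (nonvolatile)
class PriceFact:
    product: str
    price: float
    as_of: date  # explicit time element (time variant)


class PriceHistory:
    """Hypothetical append-only store: rows are added, never updated or erased."""

    def __init__(self):
        self._rows = []

    def load(self, product, price, as_of):
        # Loading a newer price keeps every earlier row in place.
        self._rows.append(PriceFact(product, price, as_of))

    def price_on(self, product, day):
        """Return the most recent price recorded on or before `day`, else None."""
        candidates = [r for r in self._rows if r.product == product and r.as_of <= day]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.as_of).price


history = PriceHistory()
history.load("widget", 9.99, date(2023, 1, 1))
history.load("widget", 11.49, date(2023, 6, 1))  # new price; the old row is kept

print(history.price_on("widget", date(2023, 3, 15)))  # 9.99
print(history.price_on("widget", date(2023, 7, 1)))   # 11.49
```

Because earlier rows survive each load, the same store can answer both "what is the price now" and "what was the price then".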
Data Marts

 A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g.,
marketing, operations).
 A data mart can be either dependent or independent.
 Dependent data mart
• a subset that is created directly from the data warehouse.
• has the advantages of using a consistent data model and providing quality
data.
• ensures that the end user is viewing the same version of the data that is
accessed by all other data warehouse users.
• The high cost of data warehouses limits their use to large companies.
 Independent data mart
• a lower-cost, scaled-down version of a data warehouse.
• a small warehouse designed for a strategic business unit (SBU) or a department.
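As a rough sketch of a dependent data mart, the example below derives a marketing mart directly from a warehouse table using Python's built-in `sqlite3` module. The table, view, and column names are assumptions made for illustration, and a database view is only one possible way to build a mart; in practice marts are often separate, physically loaded stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table covering several subject areas.
conn.execute("CREATE TABLE warehouse_sales (subject_area TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "east", 120.0),
        ("marketing", "west", 80.0),
        ("operations", "east", 45.0),
    ],
)

# Dependent mart: a single-subject subset created directly from the warehouse,
# so its users see the same version of the data as warehouse users.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE subject_area = 'marketing'"
)

mart_total = conn.execute("SELECT SUM(amount) FROM marketing_mart").fetchone()[0]
warehouse_total = conn.execute(
    "SELECT SUM(amount) FROM warehouse_sales WHERE subject_area = 'marketing'"
).fetchone()[0]

print(mart_total, warehouse_total)  # 200.0 200.0
```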
Operational Data Stores

 An ODS is a type of database often used as an interim area for a data warehouse.
 Unlike the static contents of a data warehouse, the contents of an ODS are updated
throughout the course of business operations.
 An ODS is used for short-term decisions involving mission-critical applications rather
than for the medium- and long-term decisions associated with an EDW.
 An ODS is similar to short-term memory in that it stores only very recent
information. In comparison, a data warehouse is like long-term memory because it
stores permanent information.
 An ODS consolidates data from multiple source systems and provides a near-real-time,
integrated view of volatile, current data.
 Oper marts are created when operational data needs to be analyzed
multidimensionally. The data for an oper mart come from an ODS.

Enterprise Data Warehouses (EDW)

 An EDW is a data warehouse for the enterprise: a large-scale data warehouse
that is used across the enterprise for decision support.
 The large-scale nature provides integration of data from many sources
into a standard format for effective BI and decision support applications.
 EDW are used to provide data for many types of DSS, including CRM,
supply chain management (SCM), business performance management
(BPM), business activity monitoring (BAM), product life-cycle
management (PLM), revenue management, and sometimes even
knowledge management systems (KMS).
Metadata

 Metadata are data about data.


 Metadata describe the structure of and some meaning about data,
thereby contributing to their effective or ineffective use.
 In a data warehouse, metadata describe the contents of a data
warehouse and the manner of its acquisition and use
Data Warehousing Process Overview

 Many organizations need to create data warehouses: massive data
stores of time series data for decision support.
 Data are imported from various external and internal resources
and are cleansed and organized in a manner consistent with the
organization's needs.
 After the data are populated in the data warehouse, data marts
can be loaded for a specific area or department.
 Alternatively, data marts can be created first, as needed, and then
integrated into an EDW.
Fig: A Data Warehouse Framework and Views.
[Figure: data sources (ERP, legacy, POS, other OLTP/Web, external data) feed an ETL process (select, extract, transform, integrate, load) that populates the enterprise data warehouse and its metadata. The warehouse is replicated into data marts (marketing, engineering, finance, ...) or used directly under the no-data-marts option. Access is through API/middleware to applications (visualization): routine business reporting, data/text mining, OLAP, dashboard, Web, and custom-built applications.]


 The following are the major components of the data warehousing process:

1. Data sources
- Data are sourced from multiple independent operational "legacy" systems and
possibly from external data providers (such as the U.S. Census).
- Data may also come from an OLTP or ERP system. Web data in the form of Web
logs may also feed a data warehouse.
2. Data extraction and transformation
- Data are extracted and properly transformed using custom-written or
commercial software called ETL.
3. Data loading
- Data are loaded into a staging area, where they are transformed and cleansed.
- The data are then ready to load into the data warehouse and/or data marts.
4. Comprehensive database
- The EDW supports all decision analysis by providing relevant summarized
and detailed information originating from many different sources.
5. Metadata
- Metadata are maintained so that they can be assessed by IT personnel
and users.
- Metadata include software programs about data and rules for organizing
data summaries that are easy to index and search, especially with Web
tools.
6. Middleware tools
- enable access to the data warehouse
- Power users such as analysts may write their own SQL queries
- Others may employ a managed query environment, such as Business
Objects, to access data.
- There are many front-end applications that business users can use to
interact with data stored in the data repositories, including data mining,
OLAP, reporting tools, and data visualization tools.
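The data sources, extraction and transformation, and loading components listed above can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL tool: the two source extracts, their field names, the gender coding, and the `sales_fact` table are all hypothetical.

```python
import sqlite3

# Hypothetical extracts from two operational sources with inconsistent
# field names, codings, and value types.
erp_rows = [{"cust": "A17", "gender": "M", "sale": "100.50"}]
pos_rows = [{"customer_id": "B02", "sex": "female", "total": 20}]

# One consistent coding for the warehouse (the "integrated" characteristic).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def transform_erp(row):
    """Map an ERP row to the warehouse layout (customer_id, gender, amount)."""
    return (row["cust"], GENDER_CODES[row["gender"].lower()], float(row["sale"]))


def transform_pos(row):
    """Map a POS row to the same warehouse layout."""
    return (row["customer_id"], GENDER_CODES[row["sex"].lower()], float(row["total"]))


def run_etl(conn):
    # Extract and transform into an in-memory staging list.
    staged = [transform_erp(r) for r in erp_rows] + [transform_pos(r) for r in pos_rows]
    # Load the cleansed rows into the warehouse table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (customer_id TEXT, gender TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", staged)
    conn.commit()
    return len(staged)


conn = sqlite3.connect(":memory:")
loaded = run_etl(conn)
rows = conn.execute(
    "SELECT customer_id, gender, amount FROM sales_fact ORDER BY customer_id"
).fetchall()
print(rows)  # [('A17', 'M', 100.5), ('B02', 'F', 20.0)]
```

A power user could then query `sales_fact` with SQL directly, which corresponds to the middleware access described in item 6.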
DATA WAREHOUSING ARCHITECTURES
 client/ server or n-tier architectures
• two-tier architectures
• three-tier architectures
 multi-tiered architectures are known to be capable of serving the needs of large-scale,
performance demanding information systems such as data warehouses.

• Three parts:

1. The data warehouse itself, which contains the data and associated software

2. Data acquisition (back-end) software, which extracts data from legacy systems and
external sources, consolidates and summarizes them, and loads them into the
data warehouse

3. Client (front-end) software, which allows users to access and analyze data from
the warehouse
3-tier architecture

Tier 1: Client workstation
Tier 2: Application server
Tier 3: Database server

 In a three-tier architecture, operational systems contain the data and the
software for data acquisition in one tier (i.e., the server), the data
warehouse is another tier, and the third tier includes the DSS/BI/BA engine (i.e., the
application server) and the client.
 Data from the warehouse are processed twice and deposited in an additional
multidimensional database, organized for easy multidimensional analysis and
presentation, or replicated in data marts.
 advantage : separation of the functions of the data warehouse, which eliminates
resource constraints and makes it possible to easily create data marts.
2-tier architecture

Tier 1: Client workstation
Tier 2: Application & database server

 In a two-tier architecture, the DSS engine physically runs on the same
hardware platform as the data warehouse. Therefore, it is more
economical than the three-tier structure.
 The two-tier architecture can have performance problems for large data
warehouses that work with data-intensive applications for decision
support.
Web-based data warehousing
 Data warehousing and the Internet are two key technologies that offer
important solutions for managing corporate data.
 The integration of these two technologies produces Web-based data
warehousing.
 The architecture is three tiered and includes the PC client, Web server, and
application server.
1. On the client side, the user needs an Internet connection and a Web
browser (preferably Java enabled) through the familiar graphical user
interface (GUI).
2. The Internet/ intranet/ extranet is the communication medium between
client and servers.
3. On the server side, a Web server is used to manage the inflow and
outflow of information between client and server. It is backed by both a
data warehouse and an application server.
 Web-based data warehousing offers several compelling advantages,
including ease of access, platform independence, and lower cost.
 Page-loading speed is an important consideration in designing Web-based
 Several issues must be considered when deciding which architecture to
use. Among them are the following:

1. Which database management system (DBMS) should be used?
2. Will parallel processing and/or partitioning be used?
3. Will data migration tools be used to load the data warehouse?
4. What tools will be used to support data retrieval and analysis?


1. Which database management system (DBMS) should be used?
 Most data warehouses are built using relational database management
systems (RDBMS). Oracle, SQL Server, and DB2 are the ones most
commonly used.
 Each of these products supports both client/server and Web-based
architectures.
2. Will parallel processing and/or partitioning be used?
 Parallel processing enables multiple CPUs to process data warehouse
query requests simultaneously and provides scalability.
 Data warehouse designers need to decide whether the database tables
will be partitioned (i.e., split into smaller tables) for access efficiency and
what the criteria will be.
 This is an important consideration that is necessitated by the large
amounts of data contained in a typical data warehouse.
3. Will data migration tools be used to load the data warehouse?
 Moving data from an existing system into a data warehouse is a tedious
and laborious task.
 Depending on the diversity and the location of the data assets, migration
may be a relatively simple procedure or (in contrast) a months-long
project.
 The results of a thorough assessment of the existing data assets should
be used to determine whether to use migration tools and, if so, what
capabilities to seek in those commercial tools.
4. What tools will be used to support data retrieval and analysis?
 Often it is necessary to use specialized tools to periodically locate,
access, analyze, extract, transform, and load necessary data into a
data warehouse.
 A decision has to be made on
(1) developing the migration tools in-house
(2) purchasing them from a third-party provider, or
(3) using the ones provided with the data warehouse system.
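The partitioning and parallel processing question (issue 2 above) can be sketched as follows. The example assumes a hypothetical sales table partitioned by year; a query for specific years scans only the matching partitions, and each partition is handled by its own worker. Threads are used here only to keep the sketch short; a real warehouse DBMS does this inside the database engine across CPUs or nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows: (year, amount).
sales = [(2021, 100.0), (2021, 50.0), (2022, 75.0), (2023, 20.0), (2023, 30.0)]


def partition_by_year(rows):
    """Split one large table into smaller per-year partitions."""
    partitions = defaultdict(list)
    for year, amount in rows:
        partitions[year].append(amount)
    return dict(partitions)


def partition_total(amounts):
    """Work done independently on a single partition."""
    return sum(amounts)


def total_sales(partitions, years):
    """Scan only the requested partitions, each in its own worker."""
    selected = [partitions[y] for y in years if y in partitions]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(partition_total, selected))


partitions = partition_by_year(sales)
print(total_sales(partitions, [2022, 2023]))  # 125.0
```

The partitioning criterion (year in this sketch) is exactly the design decision the slide refers to: it determines which queries can skip most of the data.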
Alternative Data Warehousing Architectures

 Five architecture alternatives to the basic architectural design types:

1. Independent data marts
2. Data mart bus architecture
3. Hub-and-spoke architecture
4. Centralized data warehouse
5. Federated data warehouse
1. Independent data marts.

 The simplest and least costly architecture alternative.
 The data marts are developed to operate independently of one
another to serve the needs of individual organizational units.
 Because of their independence, they may have inconsistent data
definitions and different dimensions and measures, making it difficult to
analyze data across the data marts
2. Data mart bus architecture

 This architecture is a viable alternative to the independent data marts,
where the individual marts are linked to each other via some kind of
middleware.
 Because the data are linked among the individual marts, there is a better
chance of maintaining data consistency across the enterprise
 Even though it allows for complex data queries across data marts, the
performance of these types of analysis may not be at a satisfactory level.
3. Hub-and-spoke architecture.

 Perhaps the most famous data warehousing architecture today.
 Here the attention is focused on building a scalable and maintainable
infrastructure that includes a centralized data warehouse and several
dependent data marts (each for an organizational unit)
 This architecture allows for easy customization of user interfaces and reports.
 On the negative side, this architecture lacks the holistic enterprise view, and
may lead to data redundancy and data latency.
4. Centralized data warehouse

 Similar to the hub-and-spoke architecture except that there are no dependent
data marts; instead, there is a gigantic enterprise data warehouse that serves the
needs of all organizational units.
 Provides users with access to all data in the data warehouse instead of limiting
them to data marts.
 Reduces the amount of data the technical team has to transfer or change,
therefore simplifying data management and administration.
 If designed and implemented properly, this architecture provides a timely and
holistic view of the enterprise to whomever, whenever, and wherever they may be
within the organization.
5. Federated data warehouse.

 The federated approach is a concession to the natural forces that
undermine the best plans for developing a perfect system.
 It uses all possible means to integrate analytical resources from multiple
sources to meet changing needs or business conditions
 Essentially, the federated approach involves integrating disparate systems
 In a federated architecture, existing decision support structures are left in
place, and data are accessed from those sources as needed
 The federated approach is supported by middleware vendors that propose
distributed query and join capabilities.
 These eXtensible Markup Language (XML)-based tools offer users a global
view of distributed data sources, including data warehouses, data marts,
Web sites, documents, and operational systems.
 When users choose query objects from this view and press the submit
button, the tool automatically queries the distributed sources, joins the
results, and presents them to the user.
 Because of performance and data quality issues, most experts agree that
federated approaches work well to supplement data warehouses, not
replace them
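The federated idea described above (leave existing stores in place, query them as needed, and join the results for the user) can be sketched with two separate in-memory SQLite databases. The `mart` and `crm` stores, their tables, and the sample rows are hypothetical; commercial middleware performs the distributed query and join automatically, whereas this sketch does it by hand.

```python
import sqlite3

# Two independent decision support stores that are left in place.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE sales (customer_id TEXT, amount REAL)")
mart.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("c1", 40.0), ("c2", 15.0), ("c1", 10.0)]
)

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id TEXT, name TEXT)")
crm.executemany(
    "INSERT INTO customers VALUES (?, ?)", [("c1", "Acme"), ("c2", "Globex")]
)


def federated_sales_by_customer():
    """Query each source separately, then join the results in the middle layer."""
    totals = dict(
        mart.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id")
    )
    names = dict(crm.execute("SELECT customer_id, name FROM customers"))
    return sorted((names[cid], total) for cid, total in totals.items() if cid in names)


print(federated_sales_by_customer())  # [('Acme', 50.0), ('Globex', 15.0)]
```

Every call goes back to the live sources, which hints at the performance and data quality concerns noted above.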
 Each architecture has advantages and disadvantages!
 Which architecture is the best?
 Ten factors that potentially affect the architecture selection decision

1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Teradata Corp. DW Architecture
Data Integration and the Extraction, Transformation, and Load Process

Data Integration
 comprises three major processes that, when correctly implemented,
permit data to be accessed and made accessible to an array of ETL and
analysis tools and the data warehousing environment:
- data access (i.e., the ability to access and extract data from any data
source)
- data federation (i.e., the integration of business views across multiple
data stores)
- change capture (based on the identification, capture, and delivery of
the changes made to enterprise data sources)
 Some vendors, such as SAS Institute, Inc., have developed strong data
integration tools.
 The SAS enterprise data integration server includes customer data
integration tools that improve data quality in the integration process.
 The Oracle Business Intelligence Suite assists in integrating data as well.
 A major purpose of a data warehouse is to integrate data from multiple
systems.
 Various integration technologies enable data and metadata integration:
• Enterprise application integration (EAI)
• Service-oriented architecture (SOA)
• Enterprise information integration (EII)
• Extraction, transformation, and load (ETL)
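Of the three data integration processes above, change capture is the easiest to show in code. The sketch below assumes the simplest possible approach, comparing two snapshots of a source table keyed by primary key; the `capture_changes` function and the sample records are hypothetical, and production tools commonly rely on other mechanisms such as database logs or triggers.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, updates, deletes): new rows, rows whose values
    changed, and keys that no longer exist in the source.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes


# Hypothetical customer table at two points in time.
yesterday = {
    1: {"name": "Ann", "city": "Austin"},
    2: {"name": "Bob", "city": "Boston"},
}
today = {
    1: {"name": "Ann", "city": "Dallas"},   # changed city
    3: {"name": "Cy", "city": "Chicago"},   # new customer
}

inserts, updates, deletes = capture_changes(yesterday, today)
print(inserts)  # {3: {'name': 'Cy', 'city': 'Chicago'}}
print(updates)  # {1: {'name': 'Ann', 'city': 'Dallas'}}
print(deletes)  # [2]
```

Only the captured changes, rather than the full table, then need to be delivered to the warehouse.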
Enterprise application integration (EAI)
 provides a vehicle for pushing data from source systems into the data
warehouse.
 It involves integrating application functionality and is focused on sharing
functionality (rather than data) across systems, thereby enabling flexibility and
reuse.
 Traditionally, EAI solutions have focused on enabling application reuse at the
application programming interface (API) level.
 Recently, EAI is accomplished by using SOA coarse-grained services (a collection
of business processes or functions) that are well defined and documented.
• Using Web services is a specialized way of implementing an SOA.
• EAI can be used to facilitate data acquisition directly into a near-real-time data
warehouse or to deliver decisions to the OLTP systems.
• There are many different approaches to and tools for EAI implementation.
Enterprise information integration (EII)
 an evolving tool space that promises real-time data integration from a variety of
sources, such as relational databases, Web services, and multidimensional
databases.
 It is a mechanism for pulling data from source systems to satisfy a request for
information. EII tools use predefined metadata to populate views that make
integrated data appear relational to end users.
 XML may be the most important aspect of EII because XML allows data to be
tagged either at creation time or later.
 These tags can be extended and modified to accommodate almost any area of
knowledge.
 Physical data integration has conventionally been the main
mechanism for creating an integrated view with data warehouses and data marts.
