16 08 2024 Data Virtualization Session2
eMTech
Data Virtualization and Dashboard
Unit – I: Benefits of DV, Database Views, Data Integration –
Models and Approaches, Data Integration vs Data
Virtualization, Master data and Metadata with examples, Data
Virtualization Tools, Data Silos
Benefits of data virtualization
❖ Faster access to data: Users can access data from multiple sources without requiring complex and time-consuming data integration processes. This can help organizations make faster and more informed decisions.
❖ Reduced complexity: Provides a single point of access to data from multiple sources, which can help reduce complexity and improve efficiency.
❖ Reduced cost: Allows organizations to access data without physically moving or copying it. This can help reduce the cost of storing and managing data.
2
Benefits of data virtualization
(continued)
❖ Increased scalability: Allows organizations to scale up their data integration and analysis efforts easily as their needs change, without needing additional hardware or infrastructure.
❖ Reduced data duplication: Data virtualization can help to reduce the need to
physically replicate data, which can save on storage and computing resources. It
can also help to reduce the risk of errors and inconsistencies that can arise from
duplicating data.
3
Heterogeneity of data
4
Heterogeneity of data (continued)
❖ Communication Heterogeneity
➢ Some systems have a web interface, others do not
➢ Some systems allow a direct query language, others offer APIs
❖ Schema Heterogeneity
➢ The structure of the tables storing the data can be different (even if storing the
same data)
5
Heterogeneity problems
6
Reasons for Heterogeneity
7
Data Integration
❖ Many databases and sources of data need to be integrated to work together
❖ Data integration is the process of combining data from multiple sources to provide a single, unified view over all of them
❖ Queries are then answered using the combined information
8
Data Integration (continued)
9
DBMS: it’s all about abstraction
10
Data Integration: A Higher-level
Abstraction
11
Integrity Constraints
● Domain
● Key
● Referential
● Check
● Automatic execution driven by operations - Triggers
12
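The constraint types listed above can be sketched with SQLite, which enforces key, check, and referential constraints and supports triggers. This is only an illustrative toy schema; all table and column names are made up. (Note that SQLite enforces declared column domains like INTEGER or REAL only loosely.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable referential integrity in SQLite

# Domain constraints: declared column types (INTEGER, REAL) describe the domain
conn.execute("""
    CREATE TABLE dept (
        dept_id INTEGER PRIMARY KEY              -- key constraint
    )""")
conn.execute("""
    CREATE TABLE emp (
        emp_id  INTEGER PRIMARY KEY,             -- key constraint
        salary  REAL CHECK (salary > 0),         -- check constraint
        dept_id INTEGER REFERENCES dept(dept_id) -- referential constraint
    )""")

# Trigger: automatic execution driven by an operation (here, INSERT)
conn.execute("CREATE TABLE audit (msg TEXT)")
conn.execute("""
    CREATE TRIGGER log_emp AFTER INSERT ON emp
    BEGIN
        INSERT INTO audit VALUES ('employee ' || NEW.emp_id || ' added');
    END""")

conn.execute("INSERT INTO dept VALUES (10)")
conn.execute("INSERT INTO emp VALUES (1, 5000.0, 10)")   # OK, trigger fires
try:
    conn.execute("INSERT INTO emp VALUES (2, -1.0, 10)".replace("INTO", "INTO"))
except sqlite3.IntegrityError as e:
    print("rejected:", e)                                # CHECK (salary > 0) violated
print(conn.execute("SELECT msg FROM audit").fetchall())
```

The negative-salary row is rejected by the check constraint, while the first insert is logged automatically by the trigger.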
Models of data integration
13
Federated Database
❖ Simplest architecture
❖ Every pair of sources can build its own mapping and transformation
❖ If source X needs to communicate with source Y, a mapping is built between X and Y. Mappings do not have to exist between all pairs of sources (they are built on demand)
14
Data Warehouse
❖ Very common approach
❖ Data from multiple sources are copied and stored in a warehouse
❖ Data is materialized in the warehouse. Users can then query the
warehouse database only
15
Traditional Data Warehouse
Architecture
16
Mediation
❖ Mediator is a virtual view over the data (it does not store any data).
Data is stored only at the sources.
❖ Mediator has a virtual schema that combines all schemas from the sources.
❖ The mapping takes place at query time.
❖ This is unlike warehousing, where mapping takes place at upload time.
17
Mediator Types
1. Global As View (GAV)
❖ Mediator schema acts as a view over the source schemas
❖ Rules that map a mediator query to source queries
❖ Like regular views, what we see through the mediator is a
subset of the available world
18
Mediator Types
2. Local As View (LAV)
❖ Sources are defined in terms of the global schema using
expressions
❖ Every source provides expressions on how it can generate
pieces of the global schema
❖ Mediator can combine these expressions to find all possible
ways to answer a query
19
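A toy GAV-style mediator can make the idea concrete: the mediator exposes one virtual schema and, at query time, rewrites a request into per-source lookups and merges the results; no data is ever copied. All names and data below are invented for illustration.

```python
# Two autonomous sources with heterogeneous schemas holding the same kind of data
source_x = [{"cust": "Ada", "city": "Paris"}]        # schema of source X: cust/city
source_y = [{"name": "Bob", "location": "Rome"}]     # schema of source Y: name/location

def query_mediator(city):
    """Global view customer(name, city), mapped onto both sources at query time."""
    results = []
    for row in source_x:                             # mapping rule for source X
        if row["city"] == city:
            results.append({"name": row["cust"], "city": row["city"]})
    for row in source_y:                             # mapping rule for source Y
        if row["location"] == city:
            results.append({"name": row["name"], "city": row["location"]})
    return results

print(query_mediator("Paris"))  # → [{'name': 'Ada', 'city': 'Paris'}]
```

Because the mapping runs at query time, updating either source list is immediately visible through the mediator, which is exactly the contrast with upload-time mapping in a warehouse.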
Approaches to data integration
1. Centralized data
integration
2. Data warehousing
3. P2P data integration
20
1. Centralized data integration
1. Centralized data integration is the problem of providing a unified and
transparent view of a collection of data stored in multiple, autonomous,
and heterogeneous data sources.
2. The unified view is achieved through a global (or target) schema, linked
to the data sources by means of mappings.
21
2. Data warehousing
❖ Materialization of the global database. A materialized view is a database
object that contains the results of a query. For example, it may be a local
copy of data located remotely, a subset of the rows and/or columns of a
table or join result, or a summary using an aggregate function.
❖ Allows for OLAP without accessing the sources
❖ Similar to data exchange
22
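The materialization idea can be shown in a few lines with SQLite (which has no native materialized views, so the copy is simulated with CREATE TABLE AS; all table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # the "source"
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 70.0)])

# Materialize the result of an aggregate query, as a warehouse would:
# the summary is stored and can be read without touching the source again.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
print(conn.execute("SELECT * FROM sales_summary ORDER BY region").fetchall())
# → [('east', 150.0), ('west', 70.0)]

# The trade-off: the copy goes stale when the source changes and must be refreshed.
conn.execute("INSERT INTO sales VALUES ('east', 25.0)")
# sales_summary still shows the old east total (150.0) until it is rebuilt
```

This staleness is exactly why the slide contrasts warehousing (upload-time mapping) with mediation (query-time mapping).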
3. Peer-to-peer data integration
❖ Peer-to-peer (P2P) data integration is a decentralized, dynamic
coordination of data between autonomous organizations. It involves a set
of autonomous, heterogeneous, independently evolving (peer) sources
whose pairwise schema or data-level mappings induce a peer-to-peer
network.
23
Data Integration vs Data Virtualization
Data Integration:
❖ Physical Data Movement: It involves physically moving and storing data from
various sources into a centralized repository like a data warehouse.
❖ Data Transformation: Data integration usually includes data transformation
processes to ensure that data from different sources is transformed and
standardized into a common format before being stored.
❖ Schema Design: A predefined schema is designed to structure and organize
the integrated data.
❖ Performance: Data integration can be time-consuming and resource-intensive,
especially when dealing with large volumes of data. It requires significant
storage space and computing power.
Data Virtualization:
❖ Virtualized Access: On the other hand, it allows users to access and query
data from various sources without physically moving or storing it in a
centralized location.
❖ Real-Time Access: Users can query the data sources directly, and the
virtualization layer retrieves the data in real time.
❖ Agility and Flexibility: It allows organizations to quickly adapt to changing
data sources and requirements without the need for complex data restructuring
or schema changes.
24
What is master data
25
Popularity of MDM
26
Many organisations are getting started
❖ Many organisations are in the early stage of their master data management
27
Master data overview
❖ Each system has its own functionality and
associated data models
28
What is metadata?
❖ “Metadata is structured information that describes, explains, locates, or
otherwise makes it easier to retrieve, use, or manage an information resource.
Metadata is often called data about data or information about information.” --
National Information Standards Organization
❖ Metadata provides information enabling users to make sense of data (e.g.
documents, images, datasets), concepts (e.g. classification schemes) and
real-world entities (e.g. people, organisations, places, paintings, products).
❖ Types of metadata:
1. Descriptive metadata describes a resource for purposes of discovery and
identification.
2. Structural metadata, e.g. data models and reference data.
3. Administrative metadata provides information to help manage a resource.
29
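The three metadata types can be made concrete with a small illustrative record for a dataset. Every field name and value here is a hypothetical example, not any standard vocabulary:

```python
# Illustrative metadata record for a (hypothetical) sales dataset
dataset_metadata = {
    # Descriptive metadata: supports discovery and identification
    "title": "Monthly Sales Figures",
    "keywords": ["sales", "retail", "monthly"],
    # Structural metadata: the data model of the resource
    "schema": {"region": "TEXT", "amount": "REAL", "month": "DATE"},
    # Administrative metadata: helps manage the resource
    "owner": "finance-team",
    "created": "2024-08-16",
    "retention_years": 7,
}

print(dataset_metadata["title"])  # → Monthly Sales Figures
```

A catalog that stores records like this lets users find the dataset (descriptive), interpret its columns (structural), and know who owns it and how long to keep it (administrative), without opening the data itself.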
Examples of metadata
30
Two approaches for providing metadata on the
Web
31
Metadata management is important
Metadata needs to be managed to ensure ...
❖ Metadata may be created before data is created or captured, e.g. to inform about data
that will be available in the future.
❖ Metadata needs to be kept after data has been removed, e.g. to inform about data that
has been decommissioned or withdrawn.
32
Metadata Management vs Master Data Management
Metadata Management:
❖ Focus: Metadata management involves managing metadata, which is data about
data. Metadata provides information about the content, context, and structure
of data assets.
❖ Purpose: The primary purpose of metadata management is to help users
discover, understand, and effectively use organizational data assets.
❖ Benefits: Metadata management enhances data discoverability, supports data
governance efforts, facilitates compliance, and improves data lineage
tracking. It ensures that users can trust and understand the data they are
working with, leading to better decision-making and data analysis.
Master Data Management:
❖ Focus: MDM is concerned with managing and ensuring the consistency and
accuracy of business data, often referred to as master data.
❖ Purpose: The primary purpose of MDM is to create a single, accurate, and
complete version of master data that can be shared across the organization.
❖ Benefits: MDM helps in improving data accuracy, reducing data duplication,
enhancing data consistency, and ensuring that organizations have a unified
view.
33
34
Data Virtualization tools
❖ There are several data virtualization tools available in the market that enable
organizations to access and integrate data from various sources in real-time
without the need for physical data movement.
1. Denodo
2. TIBCO Data Virtualization
3. Cisco Data Virtualization
4. Informatica Intelligent Data Virtualization
5. Red Hat JBoss Data Virtualization
6. Microsoft SQL Server PolyBase
7. Stone Bond Technologies Enterprise Enabler
8. AWS Glue DataBrew
35
Data virtualization use cases
Data virtualization can be a good alternative to ETL (extract, transform, and
load) in a number of different situations.
❖ Physical data movement is inefficient, difficult, or too expensive.
❖ A flexible environment is needed to prototype, test, and implement new
initiatives.
❖ Data has to be available in real-time or near real-time for a range of
analytics purposes.
❖ Multiple BI tools require access to the same data sources.
36
Some primary use cases of data
virtualization
1. Real-time Business Intelligence and Reporting:
Example: A retail company wants to analyze sales data from its multiple
stores, online platforms, and customer feedback systems in real time. Data
virtualization enables them to create a unified view of this data for instant
analysis, leading to better decision-making and improved customer service.
2. 360-Degree Customer View:
Example: A financial institution aims to provide personalized services to
customers. By integrating customer data from various systems (such as
transaction history, customer support interactions, and social media), the
organization can create a comprehensive customer profile.
37
Some primary use cases of data
virtualization
3. Data Migration and Data Warehousing:
Example: An enterprise is migrating its data from on-premises databases to
cloud-based storage. Data virtualization enables the organization to access
and transform data during the migration process without the need for complex
ETL (Extract, Transform, Load) operations.
4. Operational Data Integration:
Example: A manufacturing company needs to streamline its supply chain
operations. Data virtualization allows them to integrate data from suppliers,
production systems, inventory databases, and shipping partners. This
integrated view helps in optimizing inventory levels, production schedules,
and supplier relationships, leading to cost savings and improved efficiency.
38
Some primary use cases of data
virtualization
5. Data Collaboration and Federated Data Access:
6. Regulatory Compliance and Data Governance:
39
Pfizer: Faster data delivery to the company’s research
projects
❖ One of the world’s leading pharmaceutical and biotechnology corporations,
Pfizer uses data virtualization software by TIBCO (previously Cisco) to
speed up the delivery of data to its researchers. In the past, the
company used the traditional ETL data integration approach that often
resulted in outdated data. With data virtualization, Pfizer managed to
cut the project development time by 50 percent. In addition to the quick
data retrieval and transfer, the company standardized product data to
ensure consistency in product information across all research and
medical units.
40
City Furniture: Online retailer creates
enterprise-wide data fabric to advance
analytics
❖ A large online retail company, City Furniture realized during the pandemic
that it needed to pursue digital transformation, and data virtualization was
the way to facilitate this goal. With the help of
the Denodo Platform, the retailer managed to integrate and deliver all
business-critical data to its supply chain, marketing, operations, sales,
and other departments by virtualizing data sources and creating a
unified, semantic layer. As a result, data virtualization enabled the
company to conduct advanced analytics and data science, contributing
to the growth of the business.
41
Global investment bank: Cost reduction
with more scalable and effective data
management
❖ In 2018, a multinational investment bank partnered with a fintech
company to launch a digital data management platform. With a logical
data layer built, the two organizations got a single source of truth for all
data. The platform sped up customer onboarding as well as product
and service consulting. As a result, they decreased the customer churn rate
and reduced costs by deleting nearly 300 TB of redundant data.
42
Denodo
❖ Denodo is a data virtualization tool that simplifies data integration and unifies
data security and governance management.
❖ Denodo's architecture supports both traditional and modern data sources,
such as databases, enterprise data warehouses, data lakes, applications,
big data files, web services, and the cloud.
❖ Denodo Platform usage patterns can be categorized into three groups or layers:
1. Connect [to any data source or format]
2. Combine [any model with another, with ease]
3. Consume [connect to any consumer, via a wide variety of protocols]
43
The Denodo Architecture
44
Thank you
45