
IIT Patna

eMTech
Data Virtualization and Dashboard
Unit – I: Benefits of DV, Database Views, Data Integration – Models and Approaches, Data Integration vs Data Virtualization, Master data and Metadata with examples, Data Virtualization Tools, Data Silos

Dr. Padmalochan Bera


Associate Professor (CSE)
IIT Bhubaneswar
Benefits of data virtualization
❖ Increased agility: Data virtualization allows organizations to quickly and easily access data from multiple sources without requiring complex and time-consuming data integration processes. This can help organizations make faster and more informed decisions based on a more complete view of their data.
❖ Reduced complexity: Simplifies the process of accessing and integrating data from multiple sources, which can help reduce complexity and improve efficiency.
❖ Enhanced security: Helps improve data security by allowing organizations to access data without physically moving or copying it. This can help reduce the risk of data breaches and unauthorized access to sensitive data.

Benefits of data virtualization (continued)
❖ Increased scalability: Allows organizations to scale up their data integration and analysis efforts easily as their needs change, without needing additional hardware or infrastructure.
❖ Reduced data duplication: Data virtualization can help to reduce the need to physically replicate data, which can save on storage and computing resources. It can also help to reduce the risk of errors and inconsistencies that can arise from duplicating data.

Heterogeneity of data

❖ The main problem is the heterogeneity among the data sources.


❖ Source Type Heterogeneity: Systems storing the data can be different

Heterogeneity of data (continued)
❖ Communication Heterogeneity
➢ Some systems have a web interface; others do not
➢ Some systems allow a direct query language; others offer APIs
❖ Schema Heterogeneity
➢ The structure of the tables storing the data can be different, even if they store the same data (see the sketch below)

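To make schema heterogeneity concrete, here is a minimal Python sketch (the sources, schemas, and field names are invented for illustration): two systems store the same customer data under different structures, and a small per-source mapping reconciles them into one common view.

```python
# Two sources store the same kind of customer data under different schemas.
source_a = [  # flat schema: one record per customer
    {"cust_id": 1, "full_name": "Ada Lovelace", "city": "London"},
]
source_b = [  # different schema: split name, nested address
    {"id": 2, "first": "Grace", "last": "Hopper", "address": {"city": "New York"}},
]

def map_a(rec):
    # Mapping from source A's schema to a common schema.
    return {"id": rec["cust_id"], "name": rec["full_name"], "city": rec["city"]}

def map_b(rec):
    # Mapping from source B's schema to the same common schema.
    return {"id": rec["id"], "name": f"{rec['first']} {rec['last']}",
            "city": rec["address"]["city"]}

# The integrated view applies each source's mapping before combining rows.
unified = [map_a(r) for r in source_a] + [map_b(r) for r in source_b]
print(unified)
```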
Heterogeneity problems

Reasons of Heterogeneity

Data Integration
❖ Many databases and sources of data need to be integrated to work together
❖ Data integration is the process of combining data from multiple sources to provide a single view over all these sources, and answering queries using the combined information

Data Integration (continued)

❖ Data integration also applies within a single organization
❖ For example, integrating data from different departments or sectors

DBMS: it’s all about abstraction

Data Integration: A Higher-level Abstraction

Integrity Constraints
● Domain
● Key
● Referential
● Check
● Automatic execution driven by operations - Triggers (see the sketch below)

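As a concrete illustration of these constraint types, here is a small Python/sqlite3 sketch (tables and values are invented for the example); each statement in the final list violates exactly one constraint, and the trigger fires automatically on the UPDATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces referential integrity only when enabled

conn.executescript("""
CREATE TABLE dept (
    dept_id INTEGER PRIMARY KEY                   -- key constraint
);
CREATE TABLE emp (
    emp_id  INTEGER PRIMARY KEY,                  -- key constraint
    salary  INTEGER NOT NULL CHECK (salary > 0),  -- domain + check constraints
    dept_id INTEGER REFERENCES dept(dept_id)      -- referential constraint
);
-- Trigger: executed automatically when the UPDATE operation occurs.
CREATE TRIGGER no_pay_cut BEFORE UPDATE OF salary ON emp
WHEN NEW.salary < OLD.salary
BEGIN
    SELECT RAISE(ABORT, 'salary decrease not allowed');
END;
""")

conn.execute("INSERT INTO dept VALUES (1)")
conn.execute("INSERT INTO emp VALUES (1, 50000, 1)")

violations = [
    "INSERT INTO emp VALUES (1, 60000, 1)",   # duplicate key
    "INSERT INTO emp VALUES (2, -5, 1)",      # fails the CHECK constraint
    "INSERT INTO emp VALUES (3, 40000, 99)",  # unknown dept_id: referential
    "UPDATE emp SET salary = 10000",          # rejected by the trigger
]
for stmt in violations:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as err:
        print(f"rejected: {stmt!r} -> {err}")
```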
Models of data integration

Federated Database
❖ Simplest architecture
❖ Every pair of sources can build their own mapping and transformation
❖ If source X needs to communicate with source Y, a mapping is built between X and Y; mappings do not have to exist between all pairs of sources (they are built on demand)

Data Warehouse
❖ Very common approach
❖ Data from multiple sources are copied and stored in a warehouse
❖ Data is materialized in the warehouse. Users can then query the warehouse database only

Traditional Data Warehouse Architecture

Mediation
❖ The mediator is a virtual view over the data (it does not store any data); data is stored only at the sources.
❖ The mediator has a virtual schema that combines all schemas from the sources.
❖ The mapping takes place at query time, unlike warehousing, where mapping takes place at upload time.

Mediator Types
1. Global As View (GAV)
❖ Mediator schema acts as a view over the source schemas
❖ Rules that map a mediator query to source queries
❖ Like regular views, what we see through the mediator is a subset of the available world (see the sketch below)

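A toy GAV sketch in Python (all relation and field names invented): the global relation is defined as a view over the sources, so a query against the mediator unfolds through that view definition into source accesses at query time.

```python
# Two autonomous sources, each with its own schema.
store_sales  = [{"sku": "A1", "amount": 100}, {"sku": "B2", "amount": 40}]
online_sales = [{"item": "A1", "total": 75}]

# GAV: the global relation all_sales(sku, amount) is *defined as a view*
# over the sources; the mapping rules live in the mediator.
def all_sales():
    for r in store_sales:
        yield {"sku": r["sku"], "amount": r["amount"]}
    for r in online_sales:  # rename fields to fit the global schema
        yield {"sku": r["item"], "amount": r["total"]}

# A global query is answered by unfolding the view definition at query time.
print(sum(r["amount"] for r in all_sales() if r["sku"] == "A1"))  # 175
```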
Mediator Types
2. Local As View (LAV)
❖ Sources are defined in terms of the global schema using expressions
❖ Every source provides expressions on how it can generate pieces of the global schema
❖ Mediator can combine these expressions to find all possible ways to answer a query (see the sketch below)

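A correspondingly simplified LAV sketch (again with invented names): each source is described as an expression over the global schema (here, a declared restriction of the global relation sales(region, sku, amount)), and the mediator combines the sources whose descriptions can contribute to a query.

```python
# Global schema: sales(region, sku, amount).
# LAV: each source describes ITSELF as a view over the global schema;
# here the description is simply "the sales rows of one region".
sources = {
    "eu_db": {"covers_region": "EU",
              "rows": [{"region": "EU", "sku": "A1", "amount": 10}]},
    "us_db": {"covers_region": "US",
              "rows": [{"region": "US", "sku": "A1", "amount": 30}]},
}

def answer(region=None):
    # Mediator: consult every source whose self-description may hold answers.
    for name, src in sources.items():
        if region is None or src["covers_region"] == region:
            yield from src["rows"]

print(list(answer(region="US")))            # only us_db is consulted
print(sum(r["amount"] for r in answer()))   # 40: all sources combined
```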
Approaches to data integration

1. Centralized data integration
2. Data warehousing
3. P2P data integration

1. Centralized data integration
1. Centralized data integration is the problem of providing a unified and transparent view of a collection of data stored in multiple, autonomous, and heterogeneous data sources.
2. The unified view is achieved through a global (or target) schema, linked to the data sources by means of mappings.

2. Data warehousing
❖ Materialization of the global database. A materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, a subset of the rows and/or columns of a table or join result, or a summary using an aggregate function (see the sketch below).
❖ Allows for OLAP without accessing the sources
❖ Similar to data exchange

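A minimal Python/sqlite3 sketch of materialization (table names invented; SQLite has no native materialized views, so CREATE TABLE AS is used to emulate one): the result of an aggregate query is stored as data and then queried without touching the source table again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);   -- the "source"
INSERT INTO sales VALUES ('EU', 10), ('EU', 20), ('US', 30);

-- Materialized view: the RESULT of an aggregate query, stored as data.
CREATE TABLE sales_by_region AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

# OLAP-style queries now hit the materialized copy, not the source; like a
# warehouse, it goes stale until it is refreshed at the next "upload time".
print(conn.execute("SELECT * FROM sales_by_region").fetchall())
# e.g. [('EU', 30), ('US', 30)]
```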
3. Peer-to-peer data integration
❖ Peer-to-peer (P2P) data integration is a decentralized, dynamic
coordination of data between autonomous organizations. It involves a set
of autonomous, heterogeneous, independently evolving (peer) sources
whose pairwise schema or data-level mappings induce a peer-to-peer
network.

Data Integration vs Data Virtualization

Data Integration:
❖ Physical Data Movement: It involves physically moving and storing data from various sources into a centralized repository like a data warehouse.
❖ Data Transformation: Data integration usually includes data transformation processes to ensure that data from different sources is transformed and standardized into a common format before being stored.
❖ Schema Design: A predefined schema is designed to structure and organize the integrated data.
❖ Performance: Data integration can be time-consuming and resource-intensive, especially when dealing with large volumes of data. It requires significant storage space and computing power.

Data Virtualization:
❖ Virtualized Access: It allows users to access and query data from various sources without physically moving or storing it in a centralized location.
❖ Real-Time Access: Users can query the data sources directly, and the virtualization layer retrieves the data in real-time.
❖ Agility and Flexibility: It allows organizations to quickly adapt to changing data sources and requirements without the need for complex data restructuring or schema changes.
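The contrast can be shown in a few lines of illustrative Python (all data and names invented): the integration path copies and transforms data into a central store ahead of time, while the virtualization path leaves data in place and resolves the view at query time.

```python
crm = {"c1": {"name": "Ada", "spend": 120}}   # source 1 (stays live)
erp = {"c1": {"orders": 3}}                   # source 2 (stays live)

# --- Data integration: move + transform + store up front (ETL-style) ---
warehouse = {}
for cid, rec in crm.items():
    warehouse[cid] = {"name": rec["name"], "spend": rec["spend"],
                      "orders": erp[cid]["orders"]}  # standardized copy

# --- Data virtualization: no copy; fetch from the sources at query time ---
def customer_view(cid):
    return {"name": crm[cid]["name"], "spend": crm[cid]["spend"],
            "orders": erp[cid]["orders"]}

erp["c1"]["orders"] = 4                   # a source changes...
print(warehouse["c1"]["orders"])          # 3: the warehouse copy is stale
print(customer_view("c1")["orders"])      # 4: the virtual view is current
```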
What is master data

Popularity of MDM

Many organisations are getting started
❖ Many organisations are in the early stage of their master data management

Master data overview
❖ Each system has its own functionality and associated data models

What is metadata?
❖ “Metadata is structured information that describes, explains, locates, or
otherwise makes it easier to retrieve, use, or manage an information resource.
Metadata is often called data about data or information about information.” --
National Information Standards Organization
❖ Metadata provides information enabling users to make sense of data (e.g. documents, images, datasets), concepts (e.g. classification schemes) and real-world entities (e.g. people, organisations, places, paintings, products).
❖ Types of metadata (see the example below):
1. Descriptive metadata describes a resource for purposes of discovery and identification.
2. Structural metadata, e.g. data models and reference data.
3. Administrative metadata provides information to help manage a resource.

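A small illustrative metadata record (all field values invented) showing the three types side by side for one dataset:

```python
dataset_metadata = {
    # Descriptive: supports discovery and identification
    "title": "City Air Quality Readings 2023",
    "keywords": ["air quality", "PM2.5", "open data"],
    # Structural: how the resource is organized
    "schema": {"columns": ["station_id", "timestamp", "pm25_ugm3"]},
    "format": "CSV",
    # Administrative: helps manage the resource
    "created": "2024-01-15",
    "license": "CC0-1.0",
    "steward": "city-data-office@example.org",
}
```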
Examples of metadata

Two approaches for providing metadata on the Web

Metadata management is important
Metadata needs to be managed to ensure ...

❖ Availability: metadata needs to be stored where it can be accessed and indexed so it can be found.
❖ Quality: metadata needs to be of consistent quality so users know that it can be trusted.
❖ Persistence: metadata needs to be kept over time.
❖ Open License: metadata should be available under a public domain license to enable its
reuse.

The metadata lifecycle is larger than the data lifecycle:

❖ Metadata may be created before data is created or captured, e.g. to inform about data
that will be available in the future.
❖ Metadata needs to be kept after data has been removed, e.g. to inform about data that
has been decommissioned or withdrawn.

Master Data Management vs Metadata Management

Master Data Management:
❖ Focus: MDM is concerned with managing and ensuring the consistency and accuracy of business data, often referred to as master data.
❖ Purpose: The primary purpose of MDM is to create a single, accurate, and complete version of master data that can be shared across the organization.
❖ Benefits: MDM helps in improving data accuracy, reducing data duplication, enhancing data consistency, and ensuring that organizations have a unified view.

Metadata Management:
❖ Focus: Metadata management involves managing metadata, which is data about data. Metadata provides information about the content, context, and structure of data assets.
❖ Purpose: The primary purpose of metadata management is to help users discover, understand, and effectively use organizational data assets.
❖ Benefits: Metadata management enhances data discoverability, supports data governance efforts, facilitates compliance, and improves data lineage tracking. It ensures that users can trust and understand the data they are working with, leading to better decision-making and data analysis.
Data Virtualization tools
❖ There are several data virtualization tools available in the market that enable
organizations to access and integrate data from various sources in real-time
without the need for physical data movement.
1. Denodo
2. TIBCO Data Virtualization
3. Cisco Data Virtualization
4. Informatica Intelligent Data Virtualization
5. Red Hat JBoss Data Virtualization
6. Microsoft SQL Server PolyBase
7. Stone Bond Technologies Enterprise Enabler
8. AWS Glue DataBrew

Data virtualization use cases
Data virtualization can be a good alternative to ETL (Extract, Transform, and Load) in a number of different situations.
❖ Physical data movement is inefficient, difficult, or too expensive.
❖ A flexible environment is needed to prototype, test, and implement new
initiatives.
❖ Data has to be available in real-time or near real-time for a range of
analytics purposes.
❖ Multiple BI tools require access to the same data sources.

Some primary use cases of data virtualization

1. Real-time Business Intelligence and Reporting:
Example: A retail company wants to analyze sales data from its multiple stores, online platforms, and customer feedback systems in real-time. Data virtualization enables them to create a unified view of this data for instant analysis, leading to better decision-making and improved customer service.

2. 360-Degree Customer View:
Example: A financial institution aims to provide personalized services to customers. By integrating customer data from various systems (such as transaction history, customer support interactions, and social media), the organization can create a comprehensive customer profile.

Some primary use cases of data virtualization (continued)

3. Data Migration and Data Warehousing:
Example: An enterprise is migrating its data from on-premises databases to cloud-based storage. Data virtualization enables the organization to access and transform data during the migration process without the need for complex ETL (Extract, Transform, Load) operations.

4. Operational Data Integration:
Example: A manufacturing company needs to streamline its supply chain operations. Data virtualization allows them to integrate data from suppliers, production systems, inventory databases, and shipping partners. This integrated view helps in optimizing inventory levels, production schedules, and supplier relationships, leading to cost savings and improved efficiency.

Some primary use cases of data virtualization (continued)

5. Data Collaboration and Federated Data Access:
Example: A global organization with multiple subsidiaries wants to facilitate collaboration among its teams. Data virtualization enables secure access to relevant data across geographically dispersed locations.

6. Regulatory Compliance and Data Governance:
Example: A healthcare provider needs to comply with data privacy regulations such as HIPAA in the United States or GDPR in Europe. Data virtualization helps in enforcing data governance policies by providing a centralized view of sensitive patient data. Access controls and audit trails can be implemented, ensuring compliance with regulatory requirements.

Pfizer: Speeding up data delivery to the company's research projects
❖ Pfizer, a world-leading pharmaceutical and biotechnology corporation, uses data virtualization software by TIBCO (previously Cisco) to speed up the delivery of data to its researchers. In the past, the company used the traditional ETL data integration approach, which often resulted in outdated data. With data virtualization, Pfizer managed to cut project development time by 50 percent. In addition to quick data retrieval and transfer, the company standardized product data to ensure consistency in product information across all research and medical units.

City Furniture: Online retailer creates enterprise-wide data fabric to advance analytics
❖ City Furniture, a large online retail company, realized that the realities of the pandemic made digital transformation necessary, and data virtualization was the way to facilitate this goal. With the help of the Denodo Platform, the retailer managed to integrate and deliver all business-critical data to its supply chain, marketing, operations, sales, and other departments by virtualizing data sources and creating a unified semantic layer. As a result, data virtualization enabled the company to conduct advanced analytics and data science, contributing to the growth of the business.

Global investment bank: Cost reduction with more scalable and effective data management
❖ In 2018, a multinational investment bank partnered with a fintech company to deliver a digital data management platform. With a logical data layer built, the two organizations got a single source of truth for all data. The platform sped up customer onboarding as well as product and service consulting. As a result, they decreased the customer churn rate and reduced costs by deleting nearly 300 TB of unneeded data.

Denodo
❖ Denodo is a data virtualization tool that simplifies data integration and unifies data security and governance management.
❖ Denodo's architecture supports both traditional and modern data sources, such as databases, enterprise data warehouses, data lakes, applications, big data files, web services, and the cloud.
❖ The Denodo Platform's usage pattern can be categorized into three groups or layers (see the sketch below):
1. Connect [to any data source or format]
2. Combine [any model with another, with ease]
3. Consume [connect to any consumer, via a wide variety of protocols]

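Denodo publishes virtual views through standard interfaces such as JDBC, ODBC, and REST. As a rough sketch of the "Consume" layer only, the following Python snippet fetches rows from a published view over a RESTful endpoint; note that the host, port, virtual database, view name, credentials, and even the exact URL layout and response shape are placeholders and assumptions, not a verified Denodo API.

```python
import requests

# All values below are placeholders; adjust them to your own deployment.
BASE = "http://denodo.example.com:9090/denodo-restfulws"  # assumed URL layout
url = f"{BASE}/sales_vdb/views/unified_customer"          # hypothetical published view

resp = requests.get(
    url,
    params={"$format": "json"},       # ask for a JSON representation
    auth=("report_user", "secret"),   # basic-auth placeholder credentials
    timeout=30,
)
resp.raise_for_status()

# The JSON shape can vary by version and configuration; this assumes the
# rows sit under an "elements"-style key.
for row in resp.json().get("elements", []):
    print(row)
```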
The Denodo Architecture

Thank you
