
MANAGING DATA ON THE CLOUD

GENOVEVA VARGAS SOLAR


FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
[email protected]
http://www.vargas-solar.com/data-management-services-cloud
DATA MANAGEMENT IN LARGE-SCALE ENVIRONMENTS

http://news.cnet.com/2300-10805_3-10001679-10.html?tag=mncol

• Definition
• Querying and exploiting
• Storage (persistency)
• Manipulation
• Efficient retrieval (indexing, caching)
• Fault tolerance (recovery, replication)
• Maintenance
[Figure: data volume scale (Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24) set against storage hardware (tape, magnetic disk, RAID)]
DATA MANAGEMENT: STATE OF THE ART AND SOME CHALLENGES

[Figure: dimensions of data management
Database services: unbundled services (tailored DBMS), extensible, new functions
Architecture: centralized, client-server, n-tier
Data models: structured, semi-structured, unstructured
Data volume: Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
Application: reactive, real-time, adaptable, ubiquitous computing
Hardware: tape, magnetic disk, RAID]

¡ Use of the memory and computing capacities of all computers and servers distributed around the world and connected by a network (e.g. the Internet)
DATA MANAGEMENT IN SMALL-SCALE ENVIRONMENTS

Nowadays, monolithic software:

¡ Adding/removing functions is difficult!
¡ A full-fledged DBMS can be cumbersome!
CURRENT SCENARIO

CLOUD PRINCIPLE

CLOUD AIMS

¡ Ability to use applications on the Internet that store and protect data while providing a service
¡ Ability to hold application, business and personal data
¡ Ability to use a handful of Web services to integrate photos, maps and GPS information to create a mashup in a customer's Web browser
ON DEMAND, SELF-SERVICE, PAY-AS-YOU-GO MODEL

¡ Clouds work on a pay-as-you-go model where an application may exist to
¡ run a job for a few minutes or hours
¡ provide services to customers on a long-term basis

¡ Billing is based on resource consumption: CPU hours, volumes of data moved, or gigabytes of data stored
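The consumption-based billing described above can be sketched as a sum of three metered charges. This is a minimal illustration only: the `Rates` class, the `estimate_bill` function and every unit price are assumptions made for the example, not figures from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices in dollars; real providers publish their own."""
    per_cpu_hour: float
    per_gb_moved: float
    per_gb_stored: float


def estimate_bill(cpu_hours: float, gb_moved: float, gb_stored: float, rates: Rates) -> float:
    """Sum the three consumption-based charges: CPU time, data moved, data stored."""
    return (cpu_hours * rates.per_cpu_hour
            + gb_moved * rates.per_gb_moved
            + gb_stored * rates.per_gb_stored)


# A short batch job: 2 CPU hours, 10 GB moved, 50 GB stored (made-up figures).
rates = Rates(per_cpu_hour=0.10, per_gb_moved=0.05, per_gb_stored=0.02)
bill = estimate_bill(cpu_hours=2, gb_moved=10, gb_stored=50, rates=rates)
print(f"${bill:.2f}")
```

The same function covers both usage patterns on the slide: a job that runs for minutes accrues mostly CPU charges, while a long-running service is dominated by storage and data movement.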
INFRASTRUCTURE MODELS

PUBLIC CLOUD

¡ Run by third parties; applications from different clients are mixed together on the cloud's servers, storage systems and networks

PRIVATE CLOUD

¡ Used by an exclusive client, providing utmost control over data, security and QoS
¡ The company owns and controls the infrastructure


ARCHITECTURE

[Figure: layered stack with Software as a Service on top, Platform as a Service in the middle and Infrastructure as a Service at the bottom]

¡ SOFTWARE AS A SERVICE: applications accessible through the network (Web services, REST/SOAP)
¡ Salesforce.com (CRM) and Google (Gmail, Google Apps)
¡ PLATFORM AS A SERVICE: provides services for transparently managing hardware resources
¡ Salesforce.com (Force.com), Google (Google App Engine), Microsoft (Windows Azure), Facebook (Facebook Platform)
¡ INFRASTRUCTURE AS A SERVICE: provides data center resources such as CPU, storage and memory
¡ Amazon (EC2/S3) and IBM (Bluehouse)
CLOUD COMPUTING: AZURE POINT OF VIEW

Layer            Packaged software   Infrastructure (as a Service)   Platform (as a Service)   Software (as a Service)
Applications     You manage          You manage                      You manage                Managed by vendor
Data             You manage          You manage                      You manage                Managed by vendor
Runtime          You manage          You manage                      Managed by vendor         Managed by vendor
Middleware       You manage          You manage                      Managed by vendor         Managed by vendor
O/S              You manage          You manage                      Managed by vendor         Managed by vendor
Virtualization   You manage          Managed by vendor               Managed by vendor         Managed by vendor
Servers          You manage          Managed by vendor               Managed by vendor         Managed by vendor
Storage          You manage          Managed by vendor               Managed by vendor         Managed by vendor
Networking       You manage          Managed by vendor               Managed by vendor         Managed by vendor
SOFTWARE AS A SERVICE
SERVICE ON DEMAND
INFRASTRUCTURE AS A SERVICE

¡ Delivers basic storage and computing capabilities as standardized services over the network
¡ Servers, storage systems, switches and routers are pooled and made available to handle workloads that range from application components to high-performance computing applications
¡ e.g., Joyent (http://www.joyent.com/), virtualized servers: on-demand performance infrastructure

¡ The infrastructure is programmable: developers specify how to configure and interconnect virtual components, and how virtual machine and application data are stored in and retrieved from a storage cloud
INFRASTRUCTURE AS A SERVICE

¡ How to deploy components on separate servers to optimize non-functional requirements including scalability, availability, manageability and security
¡ Dynamic datacenters enable the deployment of virtual application architectures
¡ Horizontal scaling
¡ Run several database servers, create several virtual machines with integration and parallelization tools
¡ e.g., Hadoop for MapReduce (hadoop.apache.org)

[Figure: one application on top of several DBMS instances]
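One common way to spread data over several database servers when scaling horizontally is hash partitioning. The sketch below is an assumption chosen for illustration (the slides do not prescribe a partitioning scheme), and the server names are hypothetical.

```python
import hashlib


def shard_for(key: str, servers: list[str]) -> str:
    """Pick a server by hashing the key; the same key always maps to the same server."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def partition(keys: list[str], servers: list[str]) -> dict[str, list[str]]:
    """Group keys by the server that would store them."""
    placement: dict[str, list[str]] = {server: [] for server in servers}
    for key in keys:
        placement[shard_for(key, servers)].append(key)
    return placement


servers = ["dbms-1", "dbms-2", "dbms-3"]  # hypothetical server names
placement = partition([f"customer-{i}" for i in range(12)], servers)
for server, keys in placement.items():
    print(server, len(keys))
```

Plain modulo hashing reshuffles most keys when a server is added or removed; schemes such as consistent hashing reduce that movement, at the price of more bookkeeping.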
VIRTUALIZATION

¡ Virtual machines and virtual appliances become standard deployment objects
¡ Abstract the hardware to the point where software stacks can be deployed and redeployed without being tied to a specific physical server
¡ Servers provide a pool of resources that are harnessed as needed
¡ The relationship of applications to compute, storage, and network resources changes dynamically to meet workload and business demands
¡ Applications can be deployed and scaled rapidly without having to procure physical servers
VIRTUALIZATION

¡ Full virtualization is a technique in which a complete installation of one machine is run on another
¡ A system where all software running on the server is within a virtual machine
¡ Applications and operating systems
¡ Means of accessing services on the cloud

¡ A compute cloud is a self-service proposition where a credit card can purchase compute cycles, and a Web interface or API is used to create virtual machines and establish network relationships between them
PROGRAMMABLE INFRASTRUCTURE

¡ Cloud provider API
¡ To create an application's initial composition on virtual machines
¡ To define how it should scale and evolve to accommodate workload changes
¡ Self-monitoring and self-expanding applications

¡ Applications are built by assembling and configuring appliances and software

¡ Cloud services must be composable so they can be consumed easily
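A scaling rule of the kind a provider API lets developers declare can be sketched as a pure function from observed load to a virtual machine count. The `ScalingPolicy` fields and `desired_vm_count` are hypothetical names for this illustration and do not correspond to any real provider's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy; the field names are not taken from a real provider API."""
    requests_per_vm: int  # load that one virtual machine can absorb
    min_vms: int          # never scale below this floor
    max_vms: int          # never scale above this ceiling


def desired_vm_count(requests_per_second: float, policy: ScalingPolicy) -> int:
    """Return how many VMs the workload needs, clamped to the policy bounds."""
    needed = math.ceil(requests_per_second / policy.requests_per_vm)
    return max(policy.min_vms, min(policy.max_vms, needed))


policy = ScalingPolicy(requests_per_vm=100, min_vms=2, max_vms=10)
for load in (50, 450, 5000):
    print(load, desired_vm_count(load, policy))
```

A self-monitoring application would call such a function periodically with its measured load and ask the provider to add or remove machines until the actual count matches.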
PROGRAMMABLE INFRASTRUCTURE

[Figure: from LIBRARY through CONFIGURE PATTERN to DEPLOYMENT: load balancers in front of Web servers, backed by DBMS instances]
HIGH PERFORMANCE UNDERLYING SUPPORT

¡ Hardware support: clusters of networked parallel computers
¡ Well supported by programming models, languages and tools
¡ Concurrency, parallelism, distribution and availability
¡ Refine existing programming solutions and investigate new approaches for constructing robust, reliable software
HIGH PERFORMANCE UNDERLYING SUPPORT

¡ Concurrency
¡ Inherent concurrency of cloud computing where autonomous processes interact by exchanging messages
¡ Provides control flow to respond to unordered events
¡ Supports processing of independent streams of requests
¡ Parallelism
¡ Cloud computing runs on parallel computers on client and server side
¡ Higher level programming models such as transactional memory and deterministic execution
¡ Message passing
¡ Primary parallel programming model for cloud computing
¡ Inherent performance, isolation with points of interaction
¡ Requires adequate interfaces between asynchronous communication of messages and synchronous control flow of procedure calls
¡ Erlang (www.erlang.org) integrates message-passing constructs into the language
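The message-passing model can be illustrated with workers that share no state and communicate only through queues. This is a single-machine sketch using Python threads as stand-ins for autonomous processes; the `worker` and `run` names are assumptions for the example.

```python
import queue
import threading

STOP = object()  # sentinel message telling a worker to shut down


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Autonomous process: react to each incoming message, share no state."""
    while True:
        message = inbox.get()
        if message is STOP:
            break
        outbox.put((message, message * message))


def run(requests: list[int], n_workers: int = 3) -> dict[int, int]:
    """Send requests asynchronously, then gather the replies after all workers finish."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for request in requests:
        inbox.put(request)
    for _ in threads:
        inbox.put(STOP)  # one sentinel per worker
    for thread in threads:
        thread.join()
    # Replies arrive in no particular order, so index them by request.
    results: dict[int, int] = {}
    while not outbox.empty():
        request, reply = outbox.get()
        results[request] = reply
    return results


squares = run([1, 2, 3, 4])
print(squares)
```

The `run` function is the interface the slide asks for: callers see an ordinary synchronous procedure call, while the work inside is done by asynchronous message exchange.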
HIGH PERFORMANCE UNDERLYING SUPPORT

¡ Distribution
¡ Integrate replication, concurrency, and quorum solutions on a mainstream programming model
¡ Libraries or languages with well suited runtimes
¡ High availability: fault tolerance and efficient exception handling for ensuring services availability at every level
¡ Performance
¡ Shared resources running across a large number of computers and complex networks
¡ Make performance a first class programming abstraction
¡ Application partitioning
¡ Go beyond client-server partitioning, which does not support migration of computations
¡ Virtual machines move an entire image, from the OS up, between computers
¡ http://labs.live.com/volta
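As an illustration of the quorum solutions mentioned above, the sketch below keeps N replicas in memory and requires W replicas per write and R per read, with R + W > N so that every read quorum overlaps the latest write quorum. It is a toy model under simplifying assumptions (a single coordinator, a global version counter, no network); the class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Replica:
    version: int = 0
    value: Optional[str] = None
    up: bool = True


@dataclass
class QuorumStore:
    """N replicas; a write needs W live replicas, a read consults R live replicas."""
    n: int
    w: int
    r: int
    clock: int = 0  # simplification: one coordinator hands out version numbers
    replicas: list = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.r + self.w <= self.n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.replicas = [Replica() for _ in range(self.n)]

    def write(self, value: str) -> bool:
        live = [rep for rep in self.replicas if rep.up]
        if len(live) < self.w:
            return False  # not enough replicas to form a write quorum
        self.clock += 1
        for rep in live:
            rep.version, rep.value = self.clock, value
        return True

    def read(self) -> Optional[str]:
        live = [rep for rep in self.replicas if rep.up]
        if len(live) < self.r:
            raise RuntimeError("not enough replicas to form a read quorum")
        # Any R replicas include at least one that saw the latest write.
        return max(live[: self.r], key=lambda rep: rep.version).value


store = QuorumStore(n=3, w=2, r=2)
store.write("v1")
store.replicas[0].up = False  # replica 0 misses the next write
store.write("v2")
store.replicas[0].up = True   # replica 0 is back, but stale
store.replicas[1].up = False
latest = store.read()         # quorum of replica 0 (stale) and replica 2 (fresh)
print(latest)
```

The version comparison in `read` is what lets the stale replica be outvoted; without R + W > N a read could consult only replicas that missed the last write.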
HIGH PERFORMANCE UNDERLYING SUPPORT

¡ Defect detection
¡ A system is resilient if it can tolerate failures of its components
¡ In the cloud: computers, communication network, other services and the data center in which it runs
¡ Detect failures, respond to them minimizing the effect, restore the service when possible and resume execution
¡ High level abstractions
¡ Google's MapReduce or MS Dryad are higher level programming models that hide the complexity of writing a server-side analytic application
¡ Hide complexity of data distribution, failure detection, notification, communication and scheduling
¡ Optimization
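To show what such a higher level model hides, here is a single-process sketch of the map, shuffle and reduce phases applied to word counting. It illustrates the programming model only; it is not Google's or Hadoop's implementation, and `map_reduce`, `count_words` and `total` are names chosen for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on each record, group emitted values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)  # shuffle: group values by key
    return {key: reducer(key, values) for key, values in intermediate.items()}


def count_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def total(word, counts):
    """Reducer: add up the partial counts of one word."""
    return sum(counts)


lines = ["the cloud stores data", "the cloud scales"]
counts = map_reduce(lines, count_words, total)
print(counts)
```

The application author writes only `count_words` and `total`; in a real framework the body of `map_reduce` is where data distribution, failure detection and scheduling happen.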
PLATFORM AS A SERVICE

¡ Encapsulates a layer of software and provides it as a service for building higher level services
¡ Platform integrating an OS, middleware, application software and development environments
¡ xVM hypervisor virtual machines including NetBeans, the Sun GlassFish Web stack and support for languages like Perl and Ruby

¡ Encapsulated service exporting an API, able to manage and scale itself to provide a given level of service
¡ Google App Engine serving applications on Google's infrastructure
CLOUD IN A NUTSHELL
DELIVERY MODELS
1. Cloud Software as a Service (SaaS)
2. Cloud Platform as a Service (PaaS)
3. Cloud Infrastructure as a Service (IaaS)

DEPLOYMENT MODELS
1. Private cloud
2. Public cloud
3. Hybrid cloud
4. Community cloud

PIVOT ISSUES
1. Virtualization
2. Autonomics (automation)
3. Grid computing (job scheduling)

09/10/2015
MANAGING DATA AS A SERVICE

DATA MANAGEMENT WITH RESOURCES CONSTRAINTS
[Figure: architecture- and resource-aware algorithms and systems, on top of storage support and RAM]

Efficiently manage and exploit data sets according to given specific storage, memory and computation resources
… WITH RESOURCES CONSTRAINTS
Swap memory-disk
Distribution and organization of data on disk
Data transfer
Query and data processing on server
• Efficiency => time cost
• Optimizing memory and computing cost

Q1: Which are the most popular products at Starbucks?
Q2: Which are the consumption rules of Starbucks clients?
STORING DATA

[Figure: persistency support in a data centre, enabled by a virtualisation platform]

¡ Storage services: MobileMe, Live, Amazon
¡ Virtualization of disk space ← transparency (20 GB)
¡ Explicit configuration of the space assigned for files and mail
¡ Distribution, fragmentation and replication of data sets
¡ Decision making for assigning storage space, for migrating data, maintaining stored data
WITHOUT RESOURCES CONSTRAINTS …

Costly => minimizing cost, energy consumption

¡ Query evaluation → How and under which limits?
¡ Is no longer completely constrained by resource availability: computing, RAM, storage, network services
¡ Decision making process determined by resources consumption and consumer requirements
¡ Data involved in the query, particularly in the result, can have different costs: the top 5 results free and the rest available in return for a credit card number
¡ Results storage and exploitation demands more resources
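The idea that parts of a result can carry different costs can be sketched as a delivery function that returns a free quota first and charges for the remainder. The quota of five comes from the slide; the price per extra result, the budget parameter and the `deliver` name are assumptions made for illustration. Amounts are integer cents to avoid floating-point rounding.

```python
def deliver(results, free_quota=5, price_cents=10, budget_cents=0):
    """Return (delivered results, charge in cents).

    The first `free_quota` results are free; each further result costs
    `price_cents` (assumed positive) and is delivered only while the
    budget covers it.
    """
    free = results[:free_quota]
    extras = results[free_quota:]
    affordable = budget_cents // price_cents
    paid = extras[:affordable]
    return free + paid, len(paid) * price_cents


results = [f"r{i}" for i in range(8)]
print(deliver(results))                   # no budget: free results only
print(deliver(results, budget_cents=25))  # budget covers two extra results
```

With such a rule the query evaluator has to decide how many results are worth computing at all, since producing results the consumer cannot pay for wastes resources.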
SCIENTIFIC DATA MANAGEMENT APPLICATIONS

¡ Old model
¡ “Query the world”
¡ data acquisition coupled to a specific hypothesis
¡ New model
¡ “Download the world”
¡ data acquired en masse, in support of many hypotheses
¡ E-science examples
¡ astronomy: high-resolution, high-frequency sky surveys, …
¡ oceanography: high-resolution models, cheap sensors, satellites, …
¡ biology: lab automation, high-throughput sequencing, ...
HOW TO MAP THE ARCHITECTURE TO THE CLOUD?

How to “map” the components of the reference architecture to (virtual) machines in the cloud.

¡ How is data collected, transformed, integrated, loaded, stored and modeled?
¡ How to partition data and functions? (load balancing)
¡ How is the consistency of the data maintained (vs. availability)?
¡ What programming model?
¡ Whether and how to cache?
SCALABILITY PILLARS

[Figure: scalability pillars: computing resources and architectures; programming model (parallelism); execution platforms; DM systems]

Going for ogres, onions or parfaits?

Vinayak Borkar, Michael J. Carey, Chen Li, Inside “Big Data management”: ogres, onions, or parfaits?, EDBT, 2012
CLOUD AWARE APPLICATIONS ARCHITECTURES

¡ Good plan for dividing data with tools implementing master/worker or other parallelization patterns
¡ Data partitioning techniques, real-time analysis
¡ Data physics: balance between local data processing and data transfer costs
¡ Combine data and computing power, e.g. virtual machine location and data storage location
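The "data physics" trade-off can be made concrete with a back-of-the-envelope comparison: process the data where it is stored and ship only the result, or ship the data to a faster remote machine. The function names, the cost model and all figures are assumptions for illustration.

```python
def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Time to ship `size_gb` gigabytes over a link of `bandwidth_gbps` gigabits per second."""
    return size_gb * 8 / bandwidth_gbps


def best_placement(data_gb, result_gb, bandwidth_gbps, local_seconds, remote_seconds):
    """Compare computing next to the data with moving the data to the compute."""
    process_locally = local_seconds + transfer_seconds(result_gb, bandwidth_gbps)
    ship_data = transfer_seconds(data_gb, bandwidth_gbps) + remote_seconds
    if process_locally <= ship_data:
        return "local", process_locally
    return "remote", ship_data


# 100 GB of input, 1 GB of output, slow local machine (600 s) vs fast remote one (120 s).
slow_link = best_placement(100, 1, bandwidth_gbps=1, local_seconds=600, remote_seconds=120)
fast_link = best_placement(100, 1, bandwidth_gbps=10, local_seconds=600, remote_seconds=120)
print(slow_link, fast_link)
```

On the slow link the transfer dominates and computing next to the data wins; on the fast link moving the data pays off. This is why virtual machine location and data storage location have to be chosen together.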
APPLICATION REFERENCE ARCHITECTURE
Reference Architecture

Client
  | HTTP (down) / XML, JSON, HTML (up)
Web server
  | FCGI, ... (down) / XML, JSON, HTML (up)
App server
  | SQL (down) / records (up)
DB server
  | get/put (down) / block (up)
Store
BUT IS IT VALUABLE? AND HOW?

[Figure: client machines (browser, Adobe Flex, Adobe AIR, mobile, games, ...) reach Service 1, Service 2 and Service 3 over the Internet through REST (HTTP); the servers of the utility provider host documents, databases and applications on top of internal and external data]
SOME COMMENTS …

¡ Data management applications are potential candidates for deployment in the cloud
– industry: enterprise database systems have a significant up-front cost that includes both hardware and software costs
– academia: manage, process and share mass-produced data in the cloud

¡ Many “Cloud Killer Apps” are in fact data-intensive
– Batch processing, as with map/reduce
– Online Transaction Processing (OLTP), as in automated business applications
– Online Analytical Processing (OLAP), as in data mining or machine learning
COLLECTING DATA

DATA ACQUISITION

¡ Traditional sensing and measurement: installing sensors dedicated to some applications


¡ Passive crowd sensing
¡ Participatory sensing

DATA ACQUISITION

¡ Traditional sensing and measurement
¡ Passive crowd sensing: wireless cellular networks, built for mobile communication between individuals, can also be used to sense city dynamics (e.g., predict traffic conditions and improve urban planning)
¡ Sensing city dynamics with GPS-equipped vehicles: mobile sensors continually probe the traffic flow on road surfaces; infrastructures process their readings into data representing city-wide human mobility patterns
¡ Ticketing systems of public transportation (e.g., model city-wide human mobility using transaction records of RFID-based card swipes)
¡ Wireless communication systems (e.g., call detail records, CDR)
¡ Social networking services (e.g., geo-tagged posts/photos, posts on natural disasters analysed for detecting anomalous events and mobility patterns in the city)
¡ Participatory sensing
DATA ACQUISITION

¡ Traditional sensing and measurement
¡ Passive crowd sensing
¡ Participatory sensing: people obtain information around them and contribute to formulate collective knowledge to solve a problem (i.e., human as a sensor)
¡ Human crowd-sensing: users willingly sense information gathered from sensors embedded in their own devices (e.g., GPS data from a user's mobile phone used to estimate real-time bus arrivals)
¡ Human crowd-sourcing: users are proactively engaged in the act of generating data: reports on accidents, police traps, or any other road hazard (e.g. Waze), citizens turning into cartographers to create open maps of their cities
DATA INTEGRATION IN THE CLOUD

¡ Resource consumption model focusing on the technical and economic conditions to be fulfilled to access potentially unlimited resources
¡ Integrating and processing heterogeneous data collections calls for efficient methods for
¡ correlating, associating, and filtering them considering their variety (i.e., different formats and data models)
¡ assessing their quality, e.g., trust, freshness, provenance, partial or total consistency
DATA INTEGRATION IN THE CLOUD

¡ Quality of service (QoS) requirements expressed by data consumers and Service Level Agreement (SLA) contracts exported by data services
¡ Cloud providers that host these collections and deliver resources for executing data processing and integration processes
¡ SLA-based data integration for better meeting user requirements related
¡ to the conditions in which data is delivered and integrated
¡ to the quality of the data provided by services
MOOC SCENARIO

¡ Producers characterized by location, provided content type and topic, access conditions (e.g. cost, registration, or exchange unit), and content production time window
¡ Consumers characterized by location, interests during a time interval, maximum cost of the consumed content or of the resources to get the service, and QoS requirements (availability and how critical it is to consume a given type of content)
¡ Producers and consumers
¡ Have subscriptions to different cloud providers for dealing with content storage, processing and exchange
¡ Can ask to minimize the transfer of personal data when they share/consume content
MOOC SCENARIO

¡ MOOC
¡ Aims at respecting the privacy of the producers and consumers participating in courses
¡ Uses privacy-preserving techniques to let users share content anonymously according to the level of trust associated with data providers
¡ Data providers can also wish to give restricted data access credentials w.r.t. their trust level, when their data are used within an integration process
PROBLEM STATEMENT

¡ How can the user efficiently obtain results for her queries such that
¡ they meet her QoS requirements
¡ they respect her subscribed contracts with the involved cloud provider(s)
¡ they do not neglect services contracts

¡ Particularly, for queries that call several services deployed on different clouds
Integration can be done enforcing all/some specified conditions
Matching data providers with requests, and QoS preferences with SLAs, can be computationally costly
→ results should be kept and reused for further integration requests
PROBLEM STATEMENT (II)

[Figure: an energy provision hub with its requirements, connected to several services, each bound to it by an agreed SLA]

How to be sure that all the agreed SLAs are respected while satisfying the user?
PROBLEM STATEMENT (III)
[Figure: the energy provision hub holds agreed SLAs (SLA1, SLA2) and is connected to several services, each with its own agreed SLA]

Can my constraints be satisfied? Which services shall I ask for?

How can resources be saved for the next query?
OBJECTIVES

Propose an SLA-guided continuous data integration and provision system as a DaaS
¡ Integrated SLA computed from the agreed data SLAs
¡ Optimized and adaptable data collection, query rewriting and integration according to user preferences
¡ Learning-based data integration mechanisms
HOW TO EXPLOIT SLA FOR INTEGRATING DATA

List of English poetry content providers that can provide commented Emily Dickinson poems, that are close to my city and are labeled as experts, where the total cost is less than 1 dollar, using only trustful services

User QoS preferences: ⟨cost ≤ $1, freshness = “any”, provenance = “certified”, location = “close”, duration = 7 days, privacy-preserving = “reputation-based”⟩.

estim-cost_i(contents, req size, cost, prov size, loc), agreedSLA_i
engage_i(contents, req size, payment), agreedSLA_i

Agreed SLA
< topic: English poetry
cost: 0.5
av: 9:00-17:00
freshness: every week
Prov: University
Acc control: login/pwd
Anonymity: k-anonymity
Reputation: expert
Location: Evry
Duration: every day >

Agreed SLA
< topic: English poetry
cost: free
av: continuous
freshness: 00
Prov: University
Acc control: void
Anonymity: differential privacy
Reputation: expert
Location: Paris
Duration: continuous >

Cloud SLA
⟨0.05 cents/call, 8 GB I/O volume/month, free, 1 GB storage⟩.

[Figure: a hub connected to services s1 to s5]
HOW TO EXPLOIT SLA FOR INTEGRATING DATA

[Figure: SLA-guided workflow on top of an abstract intercloud layer. Input: query + QoS preferences. Steps: (1) deriving SLA, which produces the derived SLA; (2) choosing services; (3) query rewriting; (4) evaluation and integration, which produces the integrated SLA. The layer spans Cloud1, Cloud2 and Cloud3 (IaaS, PaaS, SaaS), each with an agreed SLA (user subscription), and reaches services s1 to s5 offered by data providers and service providers through an SLA-service directory.]
CHALLENGES BEHIND AN SLA-GUIDED APPROACH

¡ Agreed SLA
¡ Its content should allow matching user preferences with service features
¡ Service-centric monitoring of static and dynamic service deployment conditions
¡ Challenge: how to compute coarse-grained measures from fine-grained ones?

¡ Derived SLA
¡ Guides the way the query will be evaluated, and the way results will be computed and delivered
¡ Helps learning for further data integration operations
¡ Challenge: how to consider the agreed SLA clauses in real time in the rewriting algorithm, especially dynamic clauses?
DERIVED SLA
¡ Set of measures that correspond to the user preferences, computed as a function of different static measures
¡ Inequalities that have to be solved during the execution of a service composition
¡ Guides the way the query will be evaluated, and the way results will be computed and delivered
¡ User preference statement measures are used for defining a derived SLA

• total cost: Σ_{i=1..n} cost(s_i) + data transfer cost + encryption cost ≤ $1;
• availability (of services involved): ≥ 90%;
• freshness: none;
• provenance: all services involved must be expert;
• duration: 7 days;
• I/O volume/month: 8GB;
• reputation level: ≥ threshold;
• storageSpace: 1GB
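A few of the derived SLA clauses above can be written as a check over a candidate composition. The sketch covers only the cost, availability and provenance clauses; the `Service` fields, the sample services and their figures are assumptions for illustration, and availability is read here as a per-service bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    cost: float          # dollars charged for the request
    availability: float  # fraction of time the service answers
    reputation: str      # e.g. "expert"


def satisfies_derived_sla(services, transfer_cost, encryption_cost,
                          max_total_cost=1.0, min_availability=0.90,
                          required_reputation="expert"):
    """Check total cost, per-service availability and provenance clauses."""
    total_cost = sum(s.cost for s in services) + transfer_cost + encryption_cost
    return (total_cost <= max_total_cost
            and all(s.availability >= min_availability for s in services)
            and all(s.reputation == required_reputation for s in services))


# Hypothetical services; the names echo the s1..s5 labels used on the slides.
s1 = Service("s1", cost=0.0, availability=0.99, reputation="expert")
s2 = Service("s2", cost=0.5, availability=0.95, reputation="expert")
s3 = Service("s3", cost=0.5, availability=0.97, reputation="expert")

ok = satisfies_derived_sla([s1, s2], transfer_cost=0.10, encryption_cost=0.05)
too_expensive = satisfies_derived_sla([s1, s2, s3], transfer_cost=0.10, encryption_cost=0.05)
print(ok, too_expensive)
```

Clauses that depend on run-time measurements, such as I/O volume per month, would need monitoring data and are left out of this static check.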
REWRITING QUERIES

¡ The query to be rewritten is seen as an abstract service composition to be expressed in terms of concrete services
¡ Generating translations of an abstract query into several compositions over concrete services
¡ PRINCIPLE: given a Query denoting abstract services and its Preferences → generate the service compositions that express
¡ The query in terms of a composition of concrete services
¡ The preferences as a set of constraints representing the derived SLA
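The principle can be sketched as a brute-force enumeration: pick one concrete service for each abstract service in the query and keep the compositions that the constraints accept. This is not the rewriting algorithm of the presentation, only an exhaustive baseline; the directory contents and costs are hypothetical.

```python
from itertools import product


def rewrite(abstract_services, directory, accept):
    """Yield each composition (one concrete service per abstract one) that `accept` approves."""
    candidates = [directory.get(name, []) for name in abstract_services]
    for combo in product(*candidates):
        composition = dict(zip(abstract_services, combo))
        if accept(composition):
            yield composition


# Hypothetical directory: abstract operation -> (concrete service, cost in dollars).
directory = {
    "estim-cost": [("s1", 0.0), ("s2", 0.5)],
    "engage": [("s3", 0.4), ("s4", 0.7)],
}


def within_budget(composition, budget=1.0):
    """Stand-in for the derived SLA: total cost must not exceed the budget."""
    return sum(cost for _, cost in composition.values()) <= budget


compositions = list(rewrite(["estim-cost", "engage"], directory, within_budget))
for composition in compositions:
    print({operation: name for operation, (name, _) in composition.items()})
```

Exhaustive enumeration grows with the product of the candidate lists, which is one reason the matching is described as computationally costly and why earlier results are worth reusing.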
QUERY EXPRESSION

Q(myIPaddress, “E.Dickinson”, 1MB, “expert”, $1, myCreditCard) ≡
myLoc = loc(myIPaddress),
estim-cost(“E.Dickinson”, 1MB, cost, size, theirLocation),
query total cost + cost ≤ $1,
availability ≥ 90%, freshness = any,
provenance = “expert”, duration = 7 days,
storageSpace ≤ 1GB, I/O volume/month ≤ 8GB,
reputation level ≥ λ (the threshold),
engage(“E.Dickinson”, size, myCreditCard).

¡ Query to be rewritten with respect to the available concrete services
TECHNICAL SUPPORT
Genoveva Vargas-Solar
Senior Scientist, CNRS
LIG-LAFMIA, France

Javier Espinosa
LAFMIA (UMI 3175)
France

[email protected]
http://www.vargas-solar.com/data-management-services-cloud
