Module 4
Syllabus
Aneka: Cloud Application Platform: Framework Overview, Anatomy of the Aneka
Container, Building Aneka Clouds, Cloud Programming and Management, Data
Intensive Computing: Map-Reduce Programming - What is Data-Intensive
Computing?, Technologies for Data-Intensive Computing, Aneka MapReduce
Programming.
Aneka: Cloud Application Platform
● Aneka is a cloud computing solution developed by Manjrasoft Pty. Ltd.
● It is used for developing, deploying, and managing cloud applications.
● Aneka acts as a scalable cloud middleware that can work with different
computing resources.
● It provides a collection of services to:
a. Execute applications efficiently.
b. Help administrators monitor cloud activities.
c. Integrate with existing cloud technologies.
One of its key features is a set of APIs (Application Programming Interfaces) that
support different programming models like:
● Task (for independent units of work).
● Thread (for parallel execution).
● MapReduce (for processing large datasets).
It allows developers to create distributed applications and add new features to the
cloud.
Aneka supports multiple cloud deployment models, including:
● Public Cloud (open to multiple users).
● Private Cloud (for a specific organization).
● Hybrid Cloud (a combination of both).
These features make Aneka different from infrastructure management
software because it focuses on application execution rather than just managing
the underlying hardware.
Framework Overview
Aneka is a software platform designed for developing cloud computing applications.
It helps in combining different computing resources (like computers, servers, and
data centers) into a single virtual cloud environment called the Aneka Cloud where
applications run.
According to the Cloud Computing Reference Model, Aneka is a pure PaaS
(Platform as a Service) solution.
Aneka can work with different types of resources, including:
● Networks of computers
● Multicore servers
● Data centers
● Virtual cloud infrastructures
● A combination of these
The Aneka framework provides:
● Middleware to manage and scale cloud applications
● APIs for developing cloud-based applications
The Aneka infrastructure is designed to work on various platforms and
operating systems.
The Aneka Container is installed on each computing resource (node) and acts as
the building block of the system.
Multiple interconnected containers form the Aneka Cloud, which serves both
users and developers.
The Aneka Container provides three types of services:
○ Fabric Services → Manage computing infrastructure
○ Foundation Services → Support cloud operations
○ Execution Services → Handle application execution
Developers and administrators can access these services through:
○ Application management tools
○ Development interfaces and APIs for building cloud applications
○ Control tools for managing the Aneka Cloud
Aneka uses a Service-Oriented Architecture (SOA), where services are the
core components of an Aneka Cloud.
Services operate at the container level and, apart from the platform
abstraction layer, they provide all the features offered to developers,
users, and administrators.
Customization and Extension:
● New services can be added or existing ones replaced to customize the
cloud.
● The framework includes essential services for managing infrastructure,
nodes, application execution, accounting, and system monitoring.
Dynamic Service Integration:
○ New services can be plugged in dynamically to extend cloud functionality.
○ This flexibility allows Aneka Clouds to support different programming and
execution models.
Programming Models in Aneka:
○ A programming model provides tools for developers to create distributed
applications.
○ The runtime support for a model includes a set of execution and foundation
services that work together to run applications.
○ To implement a new programming model, developers must define the
necessary abstractions and ensure proper runtime support.
Scalability and Elasticity:
○ Aneka Cloud ensures scalable and elastic infrastructure for distributed
applications.
○ Multiple services work together to support application execution efficiently.
Key Features of Aneka Cloud
Elasticity and Scaling
● Aneka allows dynamic scaling of resources using a dynamic provisioning
service.
● The infrastructure can expand or shrink based on application needs.
Runtime Management
● Responsible for keeping the system running smoothly.
● Includes a container and services that manage:
○ Service membership and discovery.
○ Infrastructure maintenance and profiling.
Resource Management
● Resources can be added or removed as per application demand.
● Supports QoS (Quality of Service) execution by:
○ Dynamically provisioning resources.
○ Reserving specific nodes for exclusive application use.
Application Management
● Manages applications using special services for:
○ Scheduling execution.
○ Monitoring performance.
○ Storage management.
User Management
● Aneka supports multiple users running different applications.
● Provides a user system to manage:
○ User accounts, groups, and permissions.
○ Security and accounting for cloud usage.
QoS/SLA Management and Billing
● Tracks resource usage and bills users accordingly.
● Ensures applications meet service-level agreements (SLAs).
Development and Management Tools
Software Development Kit (SDK)
○ Helps developers build applications using:
■ Existing programming models.
■ Custom object models for creating new models.
Management Kit
○ Provides tools for managing infrastructure, users, and applications.
○ Interacts with runtime services to handle cloud operations.
Anatomy of the Aneka Container
● The Aneka container is the core unit in an Aneka Cloud.
● It serves as the runtime system for both services and applications.
● It is a lightweight software layer that:
○ Hosts services.
○ Interacts with the operating system and hardware.
Main Role of the Container
● Provides a lightweight environment for deploying services.
● Manages communication between different nodes in the Aneka Cloud.
● Almost all operations in Aneka are handled by services inside the container.
Types of Services in Aneka Container
● Fabric Services → Manage cloud infrastructure and resources.
● Foundation Services → Provide core functionalities like security and monitoring.
● Application Services → Handle application execution and management.
Platform Abstraction Layer (PAL)
● The PAL sits beneath the services stack and interacts with the operating
system and hardware.
● It provides a uniform interface to ensure smooth operation across different
platforms.
● Key Features of PAL:
○ Works across different platforms (platform-independent).
○ Provides standardized access to system properties.
○ Enables communication with remote nodes.
○ Offers consistent management interfaces for cloud operations.
Security and Reliability
● Security and data persistence apply to all levels of the Aneka Cloud.
● Ensures a secure and reliable cloud infrastructure for applications.
a. PAL (Platform Abstraction Layer) is a small software layer in Aneka.
b. It detects and configures the system automatically during boot.
c. Helps the Aneka container work on different operating systems
(Windows, Linux, Mac OS X).
● Role of PAL
a. Acts as a bridge between the Aneka container and the underlying
operating system.
b. Ensures compatibility across platforms.
c. Uses platform-specific components to gather system information.
● System Data Collected by PAL
a. CPU Details: Number of cores, frequency, and usage.
b. Memory: Total size and current usage.
c. Storage: Available disk space.
d. Network: Network addresses and connected devices.
Key Benefit
● PAL provides a uniform way to access system resources, making Aneka
flexible and adaptable across different environments.
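To make this concrete, below is a minimal, illustrative sketch (in Python, not the actual .NET-based Aneka PAL) of how a uniform "probe the host" call could gather CPU, OS, and disk details on any platform; all class and function names here are hypothetical.

    # Illustrative sketch only: a PAL-like uniform view of host resources.
    # Uses Python's standard library; names are hypothetical, not the Aneka PAL API.
    import os
    import platform
    import shutil
    from dataclasses import dataclass

    @dataclass
    class HostInfo:
        cpu_cores: int        # number of logical cores
        os_name: str          # operating system identifier
        free_disk_bytes: int  # available storage on the probed volume

    def probe_host(path="/") -> HostInfo:
        """Collect basic system data in a platform-independent way."""
        return HostInfo(
            cpu_cores=os.cpu_count() or 1,
            os_name=platform.system(),          # 'Windows', 'Linux', 'Darwin', ...
            free_disk_bytes=shutil.disk_usage(path).free,
        )

    if __name__ == "__main__":
        print(probe_host())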
Fabric services
● They form the lowest level of the Aneka software stack.
● They are part of the Aneka Container and help manage resources.
Functions of Fabric Services
● Resource Provisioning
○ Adds or removes computing nodes dynamically based on demand.
○ Uses virtualization technologies to allocate resources efficiently.
● Monitoring Services
○ Tracks hardware performance and system resources.
○ Creates a basic monitoring system that other services in the container
can use.
Profiling and monitoring
Purpose of Profiling and Monitoring Services
● These services track system performance and report real-time data for better
resource management in the Aneka Cloud.
● They help in resource scheduling, performance optimization, and system health
monitoring.
Key Components of Monitoring Services
● Heartbeat Service
○ Collects and shares real-time performance data (CPU, memory, disk space, OS
details).
○ Publishes this data to the membership service, which helps in optimizing
resource usage.
○ Also collects extra data like installed software and system properties.
○ Uses a Node Resolver to gather system information, even in different
environments (physical or virtual).
● Monitoring & Reporting Services
○ Monitoring Service: Collects and forwards system data to the
Reporting Service.
○ Reporting Service: Stores and makes this data accessible for
performance analysis.
Built-in Monitoring Services
● Membership Catalogue → Tracks node performance.
● Execution Service → Monitors job execution times.
● Scheduling Service → Tracks job state changes.
● Storage Service → Monitors file transfer details (upload/download times,
file sizes, names).
● Resource Provisioning Service → Tracks virtual node lifecycle (creation,
usage, removal).
Key Benefits
● Real-time system monitoring for optimized performance.
● Helps in scheduling and resource allocation based on live data.
● Works across different environments (physical servers, cloud instances
like EC2).
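As an illustration of the heartbeat idea, the sketch below periodically publishes node statistics to a callback that stands in for the membership service; it is a conceptual mock-up, not the Aneka Heartbeat Service.

    # Illustrative heartbeat loop; function and field names are hypothetical.
    import time

    def collect_stats():
        # In a real system this would come from a PAL-like layer (CPU, memory, disk).
        return {"cpu_usage": 0.42, "free_memory_mb": 2048, "free_disk_gb": 120}

    def heartbeat_loop(publish, node_id, interval_s=30, rounds=3):
        """Send periodic heartbeats; 'publish' stands in for the membership service."""
        for _ in range(rounds):
            publish({"node": node_id, "stats": collect_stats(), "ts": time.time()})
            time.sleep(interval_s)

    if __name__ == "__main__":
        heartbeat_loop(print, node_id="worker-01", interval_s=1, rounds=2)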
Resource management
Aneka Clouds handle resources using three key tasks:
1. Resource Membership – Managing which resources (nodes) are part of the
cloud.
2. Resource Reservation – Reserving resources for specific tasks.
3. Resource Provisioning – Allocating and managing resources as needed.
Services for Resource Management
Aneka provides three main services for handling resources:
1. Index Service (Membership Catalogue)
○ Keeps track of all nodes (computers) in the cloud.
○ Stores information about connected and disconnected nodes.
○ Works like a directory, allowing users to search for resources by name or
other details.
2. Reservation Service
○ Helps in reserving resources for particular tasks or users.
3. Resource Provisioning Service
○ Allocates and manages resources efficiently based on demand.
How the Membership Catalogue Works
● The Membership Catalogue is the most important component for managing
resources.
● When a new resource (node) starts, it sends its details to the Membership
Catalogue and updates them regularly.
● External applications and services can use the Membership Catalogue to find
available resources.
● It is structured as a distributed database to improve speed and
performance.
○ Local queries (about nearby resources) are answered directly.
○ If the information is not available locally, the request is forwarded to the
main index node, which has details of the entire cloud.
● It also collects performance data from each node and sends it to a
monitoring service for long-term storage.
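The local-first lookup with forwarding to the main index node can be sketched roughly as follows; this is illustrative Python with hypothetical names, not the Aneka Membership Catalogue implementation.

    # Nodes register and refresh their records; queries not answered locally
    # are forwarded to the main index.
    class MembershipCatalogue:
        def __init__(self, main_index=None):
            self.records = {}          # node_id -> properties (cores, services, ...)
            self.main_index = main_index

        def register(self, node_id, properties):
            self.records[node_id] = properties

        def query(self, predicate):
            local = [nid for nid, p in self.records.items() if predicate(p)]
            if local or self.main_index is None:
                return local                         # answered locally
            return self.main_index.query(predicate)  # forward to the main index node

    main = MembershipCatalogue()
    main.register("node-B", {"cores": 16, "services": ["execution", "storage"]})
    local = MembershipCatalogue(main_index=main)
    local.register("node-A", {"cores": 4, "services": ["execution"]})
    print(local.query(lambda p: p["cores"] >= 8))   # -> ['node-B'] via the main index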
Indexing and Categorizing Resources in Aneka
● Indexing and categorizing resources are key parts of resource
management.
● In addition to indexing, resource provisioning ensures resources are
available as needed.
● Infrastructure management handles container deployment and configuration
(but is separate from Fabric Services).
Dynamic Resource Provisioning in Aneka
● Dynamic provisioning allows Aneka to integrate virtual resources from IaaS
(Infrastructure as a Service) providers.
● It helps the cloud to scale up or down based on demand, ensuring:
○ Node failures are managed.
○ Quality of service for applications is maintained.
○ Cloud performance and speed remain stable.
Flexible Resource Provisioning Infrastructure
● Aneka offers a flexible system where provisioning logic, back-end support,
and runtime strategies can be changed.
● The Resource Provisioning Service is responsible for managing virtual
instances.
● Resource pools are used to interact with different IaaS providers:
○ A resource pool provides a common interface for managing different
cloud providers.
○ It can represent a private cloud, Xen Hypervisor-managed resources,
or physical resources used occasionally.
Open Protocol and Customization
● Aneka’s provisioning system uses an open protocol to support
customization.
● Metadata is used to provide extra details about resource pools.
● The system is designed to easily integrate new features and support
different implementations without disrupting the existing setup.
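A rough sketch of the resource-pool idea follows, assuming a hypothetical acquire/release interface that different back ends (a static private pool here, an IaaS provider in practice) could implement; it is not the Aneka provisioning API.

    from abc import ABC, abstractmethod

    class ResourcePool(ABC):
        @abstractmethod
        def acquire(self, count):
            """Return identifiers of newly provisioned nodes."""

        @abstractmethod
        def release(self, node_ids):
            """Give the nodes back to the pool."""

    class StaticPool(ResourcePool):
        def __init__(self, hostnames):
            self.free = list(hostnames)

        def acquire(self, count):
            taken, self.free = self.free[:count], self.free[count:]
            return taken

        def release(self, node_ids):
            self.free.extend(node_ids)

    pool = StaticPool(["10.0.0.2", "10.0.0.3", "10.0.0.4"])
    nodes = pool.acquire(2)     # scale out by two workers
    pool.release(nodes)         # shrink again when demand drops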
Foundation services
Fabric Services (Basic Infrastructure Management)
● These are fundamental services of Aneka Cloud that handle basic
infrastructure management.
● They provide essential features needed for managing cloud resources.
Foundation Services (Logical Management & Support for Distributed
Applications)
● These services manage the distributed system built on the infrastructure.
● They support the execution of distributed applications.
● All supported programming models can use these services for better
application management.
Key Features of Foundation Services:
1. Storage Management – Manages storage for applications.
2. Accounting & Billing – Handles usage tracking, billing, and pricing of resources.
3. Resource Reservation – Allows reserving cloud resources in advance.
Purpose of Foundation Services:
● They provide a uniform way to manage distributed applications.
● Developers can focus on their programming logic instead of infrastructure
management.
Role of Fabric & Foundation Services Together:
● These two services form the core of the Aneka middleware.
● They are mostly used by Execution Services and Management Consoles.
● External applications can use these services for advanced application
management.
Storage management
Aneka provides two types of storage based on the application's needs:
Centralized File Storage (For Compute-Intensive Applications)
○ Best suited for applications that need powerful processors but minimal
storage.
○ Used when small files are frequently transferred between nodes.
○ A single storage node or a small pool of storage nodes is sufficient.
○ Managed by Aneka’s Storage Service and supports File Transfer
Protocol (FTP).
Distributed File System (For Data-Intensive Applications)
○ Best suited for applications that process large datasets (GBs or TBs).
○ Processing power is not a bottleneck, but scalable storage is required.
○ Uses the combined storage space of multiple cloud nodes.
○ Based on the Google File System (GFS) model.
Google File System (GFS) Model for Distributed Storage
● Master Node: Manages a global file map and tracks all storage nodes.
● Chunk Servers: Store data in fixed-size chunks, each with a unique ID.
● Files are logically arranged in a directory structure but stored using a flat
namespace.
Characteristics of Applications Supported by GFS
● Files are huge (multi-gigabyte).
● New data is appended, not rewritten.
● Two major types of workloads:
1. Large streaming reads (reading big data continuously).
2. Small random reads (accessing specific small sections).
● Sustained bandwidth is more important than low latency.
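To make the chunking idea concrete, here is a toy sketch of a master file map and fixed-size chunks with unique IDs; it is a simplification of the GFS model, not real GFS or Aneka code.

    import uuid

    CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses 64 MB chunks

    def split_into_chunks(data: bytes, chunk_size=CHUNK_SIZE):
        """Return a list of (chunk_id, chunk_bytes) pairs for one file."""
        return [(uuid.uuid4().hex, data[i:i + chunk_size])
                for i in range(0, len(data), chunk_size)]

    file_map = {}        # master: logical file name -> ordered list of chunk IDs
    chunk_store = {}     # chunk servers (flattened here): chunk ID -> bytes

    def write_file(name: str, data: bytes):
        chunks = split_into_chunks(data)
        file_map[name] = [cid for cid, _ in chunks]
        chunk_store.update(dict(chunks))

    write_file("/logs/day1.log", b"x" * 100)   # tiny demo payload
    print(file_map["/logs/day1.log"])          # -> one chunk ID for this small file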
Accounting, billing, and resource pricing
Role of Accounting Services in Aneka Cloud
● Tracks application execution and resource usage in the Aneka Cloud.
● Provides a detailed breakdown of cloud infrastructure usage.
● Helps in managing resources efficiently.
● Maintains a history of application execution, storage, and resource
utilization.
● This information is used for charging users based on their usage.
Billing in Aneka Cloud
● Aneka is a multi-tenant cloud platform that may use resources from
commercial IaaS providers.
● Billing Service provides cost details for each user based on resource
consumption.
● Different resources can have different pricing, depending on:
○ Available services on the Aneka container.
○ Installed software on the cloud node.
● Users can view:
○ Total budget spent on an application.
○ Summary of costs per user.
○ Detailed execution costs for each job.
Key Components of Accounting Services
1. Accounting Service
○ Tracks application execution details:
■ Job distribution among cloud resources.
■ Execution time of each job.
■ Cost calculation based on usage.
2. Reporting Service
○ Collects information from monitoring services for accounting purposes:
■ Storage utilization.
■ CPU performance.
○ This data is used by the Management Console for decision-making.
These services ensure accurate tracking, resource optimization, and fair
billing for users in the Aneka Cloud.
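A toy example of usage-based billing follows, assuming made-up per-node prices, to show how per-job execution records could roll up into per-user costs; it is not the Aneka Billing Service.

    node_price_cents = {"small": 5, "gpu": 90}      # hypothetical price per hour, in cents

    jobs = [
        {"user": "alice", "node": "small", "hours": 2},
        {"user": "alice", "node": "gpu",   "hours": 1},
        {"user": "bob",   "node": "small", "hours": 1},
    ]

    bills = {}
    for job in jobs:
        cost = job["hours"] * node_price_cents[job["node"]]
        bills[job["user"]] = bills.get(job["user"], 0) + cost

    print(bills)   # -> {'alice': 100, 'bob': 5}  (cents)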
Resource reservation
Aneka's Resource Reservation System
● Allows applications to reserve computing resources exclusively for their
use.
● Supports the execution of distributed applications that require guaranteed
resource availability.
● Built on two key services:
1. Resource Reservation Service – Manages and tracks reserved time
slots across the Aneka Cloud.
2. Allocation Service – Runs on each node, managing reservation details
locally.
How Resource Reservation Works
1. Applications that have a deadline can request resource reservations.
2. If resources are available, the Reservation Service provides a reservation
ID.
3. This ID is used during execution to ensure only reserved nodes are
allocated.
4. Each reserved node verifies the validity of the ID before allowing execution.
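The four-step flow above can be sketched as follows; the class names are hypothetical and this is only an illustration, not the Aneka Reservation or Allocation Service API.

    import uuid

    class ReservationService:
        def __init__(self):
            self.reservations = {}   # reservation_id -> set of reserved node IDs

        def reserve(self, node_ids):
            rid = uuid.uuid4().hex
            self.reservations[rid] = set(node_ids)
            return rid               # reservation ID returned to the application

    class AllocationService:
        """Runs on each node and validates incoming reservation IDs."""
        def __init__(self, node_id, reservation_service):
            self.node_id = node_id
            self.rs = reservation_service

        def accept_job(self, reservation_id):
            nodes = self.rs.reservations.get(reservation_id, set())
            return self.node_id in nodes

    rs = ReservationService()
    rid = rs.reserve(["node-1", "node-2"])
    print(AllocationService("node-1", rs).accept_job(rid))   # True
    print(AllocationService("node-3", rs).accept_job(rid))   # False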
Flexible Implementation in Aneka
● Aneka supports different ways to handle resource reservations.
● Various protocols and strategies can be integrated transparently.
● Extensible APIs allow for advanced reservation services.
Types of Resource Reservation in Aneka
1. Basic Reservation
○ Reserves execution slots on nodes.
○ Uses an alternate offers protocol (provides alternatives if initial request
is unavailable).
2. Libra Reservation
○ Similar to Basic Reservation.
○ Allows pricing nodes differently based on hardware capabilities.
3. Relay Reservation
○ A lightweight implementation.
○ Allows a resource broker to reserve nodes.
○ Useful when Aneka operates in an intercloud environment.
This system ensures efficient resource management, meeting deadlines while
providing flexibility for different reservation needs.
Importance of Resource Reservation in Aneka
● Ensures quality of service (QoS) for applications.
● Provides a predictable execution environment so applications can meet
deadlines.
● If an application cannot meet its deadline, it won't be executed at all.
How Reservation Requests Are Handled
● Reservations are made based on available physical/virtual infrastructure.
● Takes into account both current and future load when accepting requests.
● If a node fails, Aneka might not be able to meet the service-level
agreement (SLA).
Handling Failures and Outages
● Some implementations delay node allocation to handle minor failures.
● If there’s a serious outage, remaining nodes may not be enough to meet
demand.
● In such cases, resource provisioning is used:
○ Additional nodes are obtained from external resource providers.
○ Helps ensure the SLA is maintained and applications are not disrupted.
This approach ensures reliable execution of applications while maintaining
service commitments.
Application services
What Are Application Services in Aneka?
● Manage the execution of applications in the Aneka Cloud.
● Differentiate based on the programming model used for distributed
applications.
● The type and number of services in this layer vary depending on the
model’s needs.
Common Activities in All Programming Models
1. Scheduling – Decides which resources will run the application.
2. Execution – Runs the application on the allocated resources.
Aneka’s Reference Model for Application Services
● Two main services:
1. Scheduling Service – Manages how and where applications are
executed.
2. Execution Service – Handles the actual execution of applications.
● Aneka provides base implementations of these services.
● Developers can extend these implementations to support new
programming models.
This approach makes Aneka flexible and adaptable for different application
needs.
Scheduling
What Do Scheduling Services Do in Aneka?
● Plan and manage the execution of distributed applications.
● Assign jobs (tasks of an application) to different nodes (computing
resources).
● Work with other services, such as:
○ Resource Provisioning Service (allocates extra resources if needed).
○ Reservation Service (reserves nodes for specific applications).
○ Accounting Service (tracks resource usage and costs).
○ Reporting Service (monitors job and application status)
Common Tasks of Scheduling Services
1. Job to node mapping – Decides which node will execute a job.
2. Rescheduling failed jobs – Moves jobs to another node if the original one fails.
3. Job status monitoring – Tracks the progress of individual jobs.
4. Application status monitoring – Tracks the progress of the entire application.
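A toy sketch of the first two tasks (job-to-node mapping and rescheduling on failure) is shown below; it is not an Aneka scheduling policy, just an illustration of the idea.

    def schedule(jobs, nodes, run):
        """Assign jobs round-robin; if a node fails a job, retry it on the next node."""
        assignments = {}
        for i, job in enumerate(jobs):
            for attempt in range(len(nodes)):
                node = nodes[(i + attempt) % len(nodes)]
                if run(job, node):              # run() returns True on success
                    assignments[job] = node
                    break
            else:
                assignments[job] = None         # all nodes failed this job
        return assignments

    # Demo: node-2 is "down", so its jobs are rescheduled onto the other nodes.
    ok = lambda job, node: node != "node-2"
    print(schedule(["j1", "j2", "j3"], ["node-1", "node-2", "node-3"], ok))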
How Aneka Handles Scheduling?
● No centralized scheduling engine – Each programming model has its own
scheduling service.
● Advantage: Flexibility in choosing scheduling and resource allocation strategies.
● Challenge: Requires careful design to avoid conflicts, such as:
○ Multiple jobs sent to the same node at the same time.
○ Jobs without reservations being assigned to reserved nodes.
○ Jobs sent to nodes missing the required services.
How Aneka Prevents Scheduling Issues?
● Foundation Services provide necessary information to avoid conflicts.
● No built-in policies for automatic conflict resolution—developers must
handle this.
● Only one job per programming model can run on a node at a time,
unless:
○ Resource Reservation is used to ensure exclusive execution.
This setup gives flexibility but also requires careful management to avoid
resource conflicts.
Execution
Execution Services handle the actual running of jobs in a distributed application.
They prepare, execute, and finalize jobs based on the programming model used.
Key Responsibilities of Execution Services
1. Unpacking jobs – Extracts job details sent by the scheduler.
2. Retrieving input files – Fetches required data before execution.
3. Running jobs in a sandboxed environment – Ensures isolation and
security.
4. Submitting output files – Saves results after job completion.
5. Managing execution failures – Captures failure details for debugging.
6. Monitoring performance – Tracks resource usage during execution.
7. Packing jobs and sending them back to the scheduler – Sends execution
results for further processing.
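The sketch below illustrates a few of these steps in miniature: fetch the inputs, run the job in a throwaway working directory, capture failures, and return a result. It is a simplification, not the Aneka Execution Service.

    import tempfile, traceback
    from pathlib import Path

    def execute_job(job_fn, input_files: dict):
        """job_fn(workdir) runs inside a throwaway 'sandbox' directory."""
        with tempfile.TemporaryDirectory() as workdir:
            for name, content in input_files.items():          # retrieve input files
                Path(workdir, name).write_text(content)
            try:
                result = job_fn(Path(workdir))                  # run the job
                return {"status": "completed", "result": result}
            except Exception:
                return {"status": "failed", "error": traceback.format_exc()}

    # Demo job: count the words in its input file.
    job = lambda wd: len((wd / "input.txt").read_text().split())
    print(execute_job(job, {"input.txt": "map reduce on aneka"}))   # result: 4 words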
How Execution Services Work in Aneka?
● More self-contained than Scheduling Services.
● Integrates mainly with:
○ Storage Service (for retrieving and storing files).
○ Local Allocation and Monitoring Services (to track node performance
and job status).
● Aneka provides a reference implementation that supports these services.
● Some programming models specialize (extend) this reference implementation to suit their needs.
Programming Models Supported by Aneka
Aneka supports different programming models to cater to various distributed
computing needs. Each model is designed for specific types of applications.
1. Task Model
● Best for: Independent tasks that can execute in any order.
● How it works:
○ Applications are broken down into a collection of independent tasks.
○ Tasks can be executed in parallel since they do not depend on each
other.
● Example: Running multiple image processing tasks on different files
simultaneously
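The underlying idea, independent units of work executed in parallel in any order, can be shown with plain Python; this is not the Aneka Task API, only the concept.

    from concurrent.futures import ProcessPoolExecutor

    def process_image(filename):
        # Placeholder for real image processing; tasks do not depend on each other.
        return f"{filename}: processed"

    if __name__ == "__main__":
        files = ["a.png", "b.png", "c.png"]
        with ProcessPoolExecutor() as pool:
            for outcome in pool.map(process_image, files):
                print(outcome)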
2. Thread Model
● Best for: Multithreaded applications needing distributed execution.
● How it works:
○ Extends classical multithreading to execute threads remotely on
different nodes.
○ Uses Thread abstraction to wrap a method that runs remotely.
● Example: A simulation where different calculations run on separate threads
across multiple machines.
3. MapReduce Model
● Best for: Large-scale data processing and analytics.
● How it works:
○ Based on Google’s MapReduce framework.
○ Map function: Splits data into smaller chunks and processes them in
parallel.
○ Reduce function: Collects and aggregates results.
● Example: Analyzing large log files for trends and statistics.
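A minimal word-count example makes the two phases concrete; the Aneka MapReduce APIs express the same idea through Mapper and Reducer classes, which are not reproduced here.

    from collections import defaultdict

    def map_phase(line):                       # map: emit (key, value) pairs
        return [(word, 1) for word in line.split()]

    def reduce_phase(key, values):             # reduce: aggregate values per key
        return key, sum(values)

    lines = ["error warn error", "warn info"]
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):     # shuffle: group values by key
            groups[key].append(value)

    print(dict(reduce_phase(k, v) for k, v in groups.items()))
    # -> {'error': 2, 'warn': 2, 'info': 1}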
4. Parameter Sweep Model
● Best for: Applications that require testing multiple parameter combinations.
● How it works:
○ A template task is defined, and multiple instances of the task are
created by varying input parameters.
○ Each task corresponds to a different combination of input values.
● Example: Running a machine learning model with different hyperparameters
to find the best combination.
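The template-plus-parameters idea can be sketched as follows; the parameter names are made up for illustration and this is not the Aneka Parameter Sweep API.

    from itertools import product

    def template_task(learning_rate, batch_size):
        # Placeholder for the real work (e.g., training a model with these values).
        return f"run(lr={learning_rate}, batch={batch_size})"

    learning_rates = [0.01, 0.1]
    batch_sizes = [32, 64]

    tasks = [template_task(lr, bs) for lr, bs in product(learning_rates, batch_sizes)]
    print(len(tasks), "independent task instances")   # 4 combinations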
Building Aneka clouds
● Aneka is a platform – It is mainly used for developing distributed applications
in cloud environments.
● Needs infrastructure – To work properly, Aneka must be set up on a suitable
infrastructure.
● Requires management – The infrastructure needs proper management to
function efficiently.
● Infrastructure management tools – These tools help in handling and
maintaining the infrastructure.
● Building Clouds – Administrators use these tools to create and manage
cloud environments.
● Supports different cloud models – Aneka works with Public, Private, and
Hybrid Clouds.
Infrastructure Organization
● Reference model for deployments – The setup applies to different types of
Aneka cloud deployments.
● Administrative console – This plays a central role in managing the entire Aneka
cloud system.
● Role of repositories – These store all the required software libraries for setting
up and running Aneka.
● Software image – The libraries in the repository form the software image for the
Aneka node manager and container programs.
● Communication channels – Repositories share these libraries through HTTP,
FTP, or file sharing.
● Multiple repositories – The management console can choose the best repository
for deployment.
● Deploying the infrastructure – A collection of nodes is set up, and the Aneka
node manager (Aneka daemon) is installed on them.
● Aneka daemon – This software component helps remotely manage and
control container instances.
● Aneka Cloud – The group of containers working together forms the Aneka
Cloud.
● Managing physical and virtual nodes – Both types of nodes are managed
in the same way if they have internet and remote access.
● Dynamic provisioning – Virtual machines can be preconfigured with
Aneka, so they only need to be connected to the cloud.
● Installation options – You can either:
○ Install the full Aneka container, or
○ Install just the Aneka daemon, depending on how long the virtual
machine is needed.
Logical Organization
● Aneka Cloud can have different structures – Its organization depends on
how the container instances are set up.
● Configuration matters – The way each container is configured affects the
overall cloud structure.
● Common setup – The most widely used configuration is the master-worker
model.
● Separate storage nodes – In this setup, storage is handled by separate
nodes.
● Fig. 5.4 shows this setup – It visually represents the master-worker
configuration with distinct storage nodes.
1. Master Node (Brain of the Cloud)
● Contains important services that manage the Aneka Cloud.
● The index service (membership catalog) is what defines a node as a
master.
● Some services can be located on other nodes, but key services are usually on
the master.
● Common services in the Master Node:
○ Index service (master copy) – Keeps track of all nodes.
○ Heartbeat service – Monitors if nodes are active.
○ Logging service – Records system activity.
○ Reservation service – Manages resource reservations.
○ Resource provisioning service – Allocates resources to applications.
○ Accounting service – Tracks resource usage.
○ Reporting & monitoring service – Provides system reports and health
checks.
○ Scheduling services – Manages task execution based on the supported
programming models.
2. Worker Nodes (Execution Units)
● These nodes run applications and perform the actual computing work.
● They contain essential services for execution.
● Common services in Worker Nodes:
○ Index service – Registers the worker node in the cloud.
○ Heartbeat service – Reports if the node is active.
○ Logging service – Tracks operations.
○ Allocation service – Assigns resources for execution.
○ Monitoring service – Observes performance and availability.
○ Execution services – Runs the applications based on supported
programming models.
Summary:
● The Master Node is responsible for managing and controlling the cloud.
● The Worker Nodes handle the execution of applications.
● A Master-Worker model is a common configuration in Aneka Cloud.
3. Storage Nodes (Optimized for Storage)
● Purpose: These nodes provide storage support for applications.
● Key Feature: They include a Storage Service to manage files.
● Number of Nodes: Depends on workload and storage needs of
applications.
● Hardware Requirement: These nodes are usually on machines with large
disk space to store a high volume of files.
Common Services in a Storage Node:
1. Index service – Registers the storage node in the cloud.
2. Heartbeat service – Checks if the node is active.
3. Logging service – Records storage activities.
4. Monitoring service – Observes storage performance.
5. Storage service – Manages file storage and retrieval.
Summary:
● Storage nodes are dedicated to handling and managing data storage.
● Their number depends on application storage needs.
● They include essential monitoring and management services along with
the Storage Service.
Private Cloud Deployment Mode
Uses Local Resources – The cloud is built using local physical machines and
infrastructure management software.
Supports Virtualization – Some resources can be virtualized if needed.
Heterogeneous Resource Pool – Aneka Clouds can use different types of
resources, such as:
● Desktop computers
● Clusters
● Workstations
● Resource Grouping – Resources can be partitioned into different groups
for better management.
● Flexible Configuration – Aneka can be set up to use these resources based
on the needs of applications.
● Resource Provisioning Service – Helps in allocating and managing
resources dynamically from the local resource pool.
● Uses OpenStack – Aneka can integrate with OpenStack for managing
resources in a private cloud setup.
● Suitable for Predictable Workloads – Works well when the system
workload is stable and capacity needs are manageable.
● Uses Local Virtual Machine Manager – Extra demand can be handled by
virtual machines if needed.
● Mostly Physical Nodes – Aneka nodes are primarily physical machines
with a long lifetime and fixed configurations.
● Minimal Reconfiguration Needed – These physical nodes usually do not
require frequent changes.
● Resource Management via Policies – Different machines can be managed
using specific policies with the Reservation Service.
Optimized Resource Usage:
● Office desktops – Used for daily work but can run distributed applications
outside working hours.
● Workstations & clusters – Might have legacy software needed for specific
applications.
● Specialized execution – Some machines should be preferred for tasks with
unique requirements.
Public Cloud Deployment Mode
Fully Virtualized Infrastructure – Aneka's master and worker nodes run on
virtual machines provided by cloud providers like Amazon EC2 or GoGrid.
Two Types of Deployment:
● Static Deployment – Nodes are created beforehand and used like physical
machines, without dynamic scaling.
● Dynamic Deployment – Uses elastic cloud features to automatically scale
resources based on demand.
Hosted on a Single Provider – Usually, Aneka Cloud is deployed within one IaaS
provider to:
● Reduce high costs of data transfer between different providers.
● Ensure better network performance.
Scalability with Dynamic Provisioning:
● Aneka Cloud can start with just one node.
● More nodes can be added or removed dynamically based on demand.
Resource Provisioning Service – Plays a key role in configuring, managing,
and scaling virtual machines.
Important Services in the Master Node:
● Accounting & Reporting Services – Track resource usage for billing users
in a multi-tenant cloud.
Summary:
● Public Cloud deployment in Aneka runs on virtual machines from cloud
providers.
● Supports both static and dynamic deployments.
● Dynamic provisioning allows automatic scaling of resources.
● Master node tracks usage and costs for billing in a multi-tenant cloud.
Dynamic Provisioning in Public Cloud Deployment
1. On-Demand Worker Nodes – New instances are created when needed,
usually as worker nodes.
2. Custom Hardware Configurations – In Amazon EC2, different hardware
setups can be selected for worker nodes.
3. Application-Specific Provisioning – If an application requires more CPU or
memory, it informs the scheduler, which then provisions the right
resources.
4. Beyond Application Execution:
○ Dynamic provisioning is not just for running applications.
○ It can also be used for scaling services like Storage Service.
5. Storage Scaling in Multi-Tenant Clouds:
○ Multiple applications use the same storage resources.
○ This can cause bottlenecks or exceed storage limits.
○ Dynamic provisioning can add more storage as needed, just like it
does for computing power.
Summary:
● Dynamic instances are created only when needed to handle workload
demand.
● Custom worker node configurations allow optimized computing.
● Scheduler ensures applications get the right resources based on their
needs.
● Dynamic provisioning helps scale both computing and storage services
to avoid performance issues.
Hybrid Cloud Deployment Mode
Most Common Deployment Model – Hybrid Cloud is widely used because it
combines existing infrastructure with cloud resources.
Uses Existing Computing Infrastructure – Local machines (desktops, servers,
clusters) form the static part of the Aneka deployment.
Elastic Scaling on Demand – If more computing power is needed, additional
cloud resources are dynamically added.
Best of Both Worlds:
● Local Infrastructure – Cost-effective for predictable workloads.
● Cloud Resources – Used only when extra power is needed, reducing
costs.
Efficient Resource Utilization – Applications can switch between local and
cloud resources as needed.
Most Comprehensive Deployment – Hybrid deployment utilizes all of Aneka’s
features efficiently.
Key Capabilities:
● Dynamic Resource Provisioning – Allocates extra resources when needed.
● Resource Reservation – Reserves specific resources for priority tasks.
● Workload Partitioning – Distributes workloads between local and cloud
resources.
● Accounting, Monitoring, and Reporting – Tracks resource usage and
performance.
Optimized Cost and Resource Usage:
● If local virtual machine management is available, resources are used
efficiently, minimizing costs.
Heterogeneous Resource Utilization:
● Different types of resources (desktops, clusters, cloud instances) are used
based on priority.
● Example: Low-priority tasks can be executed on desktop machines, as
discussed in Private Cloud deployment.
● Execution During Non-Working Hours – Desktop machines (low-priority)
and clusters/workstations (high-priority) are used for application
execution.
● Local Virtualization – Additional local virtual machines are used when
more computing power is needed.
● External Cloud Providers – If more power is required, external IaaS
providers can be used to scale resources.
Resource Provisioning Across Multiple Providers
Leveraging Multiple Resource Providers:
● Different from Public Cloud, this model uses a combination of local and
external resources to provision virtual resources.
Data Transfer Costs:
● Data transfer between local infrastructure and external IaaS providers
incurs costs, so careful selection of resources is important.
Resource Provisioning Service:
● The Resource Provisioning Service in Aneka can manage multiple
resource pools at once.
● It helps choose the best pool for fulfilling application requirements.
Custom Policies:
● The service allows the creation of custom policies to optimize how
resources are selected based on application needs.
CLOUD PROGRAMMING AND MANAGEMENT
● Aneka is designed to provide a scalable middleware for running distributed
applications.
Key Features:
● Application Development – For developers to build applications using
Aneka.
● Application Management – For system administrators to manage
applications and infrastructure.
Simplified Development and Management:
● Developers get a comprehensive set of APIs to create and integrate
applications.
● Administrators use intuitive management tools to control the system
Aneka SDK (Software Development Kit)
Purpose:
● Provides APIs for:
○ Developing applications on existing programming models.
○ Creating new programming models for custom applications.
○ Developing new services to enhance Aneka Cloud.
Application Model vs. Service Model:
● Application Model – Helps in building applications and creating new
programming models.
● Service Model – Defines infrastructure for developing and integrating
new services into Aneka Cloud.
Application Model
Aneka Supports Distributed Execution in the Cloud
● Uses programming models as an abstraction for developers.
● Provides runtime support to execute applications on Aneka.
Application Model in Aneka
● A set of APIs common to all programming models.
● Specialized based on different programming needs.
Application Model Components (Fig. 5.8 Overview)
● Each distributed application is an instance of ApplicationBase<M> class.
● M represents the application manager that controls execution.
● Application Classes – Represent how developers view applications in Aneka.
● Application Managers – Internal components that:
○ Interact with Aneka Cloud.
○ Monitor and control execution.
○ Vary based on the programming model used.
Types of Distributed Applications in Aneka
● Applications consist of multiple tasks that together define execution.
● Aneka categorizes applications into two types:
1. User-Generated Tasks – Tasks are created by the user.
2. Runtime-Generated Tasks – Tasks are automatically generated by Aneka.
● Different application base classes and application managers are used for each
type.
Categories of Distributed Applications in Aneka
First Category: User-Generated Tasks
● Most common category, used as a reference for multiple programming models:
○ Task Model
○ Thread Model
○ Parameter Sweep Model
● Applications in this category consist of units of work submitted by the user.
● Work Unit Class represents these tasks and manages input/output files
transparently.
● Different programming models use specific WorkUnit classes:
○ Aneka Task → Used in Task Model.
○ Aneka Thread → Used in Thread Model.
● Applications in this category inherit from Aneka Application<W, M>, where:
○ W = Type of WorkUnit used.
○ M = Type of Application Manager implementing IManualApplicationManager
Second Category: Runtime-Generated Tasks
● Used in MapReduce and similar models where tasks are generated
dynamically.
● No common WorkUnit class since it varies based on the programming model.
● Example: MapReduce Model
○ Applications use map and reduce functions.
○ MapReduceApplication class allows defining Mapper<K, V> and
Reducer<K, V> functions.
○ Developers specify required input files.
● Other programming models may have different structures and interfaces.
● Applications in this category inherit from ApplicationBase<M>, where:
○ M = Implements AutoApplicationManager.
Additional Important Classes
● Configuration Class → Defines application settings and behavior
customization.
● ApplicationData Class → Stores runtime information about the application.
Summary of Aneka Application Model Features
● Designed to be extensible → Developers can modify or extend existing
programming models.
● Base classes serve as a foundation → New models and abstractions can
be defined.
● Existing programming models can be specialized → Example:
○ Parameter Sweep Model is a specialization of the Task Model.
○ Users define a template task and provide custom parameters for
execution.
● Allows customization → Developers can enhance or tailor programming
models as per application needs.
2. Service Model
● Defines the requirements for services hosted in the Aneka Cloud.
● Services run inside a container, which acts as the runtime environment.
● Each service follows the IService interface.
Key Functions of a Service
● Service Information: Name and status.
● Control Operations: Start, Stop, Pause, and Continue.
● Message Handling: Processes messages via the HandleMessage method.
● User Interaction: Some services, like Resource Provisioning and
Reservation, have clients for direct user interaction.
How Services Work in the Aneka Cloud
● Services operate through message processing.
● Each request triggers a specific message, and the service responds with the
required action.
Service Life Cycle (Figure 5.9 Overview)
● Initial States:
○ Unknown or Initialized → Service instance is created.
● Starting Phase:
○ When the container starts, the service moves to Starting state.
○ If successful, it enters the Running state.
○ If an error occurs, it goes back to Unknown state.
Running State:
● The service remains in this state while the container is active.
● It processes messages and executes tasks.
Pause & Resume (Optional Features):
● Pause → Moves to Pausing → Then Paused state.
● Continue → Moves to Resuming → Then back to Running state.
● (Note: Current Aneka framework does not support this feature.)
Stopping the Service:
● When the container shuts down, the Stop method is called.
● The service moves to Stopping state → Finally, Stopped state.
● All resources used by the service are released.
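The life cycle can be summarized as a small transition table; the state names follow the description above, and the code itself is only an illustration, not part of the Aneka framework.

    TRANSITIONS = {
        ("Initialized", "start"): "Starting",
        ("Starting", "ok"): "Running",
        ("Starting", "error"): "Unknown",
        ("Running", "pause"): "Pausing",       # optional, not currently supported
        ("Pausing", "ok"): "Paused",
        ("Paused", "continue"): "Resuming",
        ("Resuming", "ok"): "Running",
        ("Running", "stop"): "Stopping",
        ("Stopping", "ok"): "Stopped",
    }

    def step(state, event):
        return TRANSITIONS.get((state, event), state)   # ignore invalid events

    s = "Initialized"
    for event in ["start", "ok", "stop", "ok"]:
        s = step(s, event)
    print(s)   # -> Stopped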
● Aneka provides a default base class to simplify service development.
● The ServiceBase class is the foundation for creating new services.
Features of ServiceBase Class
● Implements the basic properties of IService.
● Includes control operations (Start, Stop, Pause) with logging and state
control.
● Built-in infrastructure for delivering a service-specific client.
● Supports service monitoring.
Developer Guidelines for Implementing Services
● Use template methods to modify control operations.
● Implement custom message processing logic.
● Provide a service-specific client if needed.
Message Passing Communication Model
● Strongly typed messages are used for communication.
● Each service defines its own message types, which are the only ones it can
process.
Structure of Messages in Aneka
Each message type inherits from the Message base class and contains:
● Source node & Target node (where the message comes from and where it
is sent).
● Source service & Target service (which service sends/receives the
message).
● Security credentials (to ensure secure communication).
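Conceptually, a strongly typed message carries the fields listed above plus service-specific data; the sketch below is a Python mock-up of that structure, not the actual .NET Message class.

    from dataclasses import dataclass

    @dataclass
    class Message:
        source_node: str
        target_node: str
        source_service: str
        target_service: str
        credentials: str          # security credentials for the exchange

    @dataclass
    class QueryNodesMessage(Message):     # a service-specific, strongly typed message
        query: str

    msg = QueryNodesMessage("master", "node-7", "IndexService", "IndexService",
                            credentials="token-abc", query="cores >= 8")
    print(msg.target_service, "->", msg.query)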
Message Properties in Aneka
● Messages carry specific information for each service type.
● Generally used inside Aneka infrastructure for communication.
Service Clients in Aneka
● If a service is directly used by applications, it can provide a service client.
● Service clients offer an object-oriented interface to interact with services.
● Aneka can dynamically inject service clients into applications.
● Services that inherit from ServiceBase automatically support service clients.
Purpose of Service Clients
● Helps integrate Aneka services into existing applications.
● Useful for applications that do not require distributed execution but need
access to specific services.
Advanced Configuration Capabilities
● Developers can define editors and configuration classes.
● Allows Aneka’s management tools to integrate services within its workflow.
Management Tools
Aneka is a Pure PaaS
○ It requires virtual or physical hardware for deployment.
Infrastructure Management
○ Provides tools to install and manage logical Clouds on the
infrastructure.
Management Layer Capabilities
○ Handles services and applications running in the Aneka Cloud.
○ Ensures smooth deployment and operation of cloud resources.
Infrastructure Management
Uses Both Virtual & Physical Hardware
○ Virtual hardware is managed via the Resource Provisioning Service,
which acquires resources as needed.
○ Physical hardware is managed through the Aneka administrative
console and Aneka management API (PAL).
Management Features
○ Focus on provisioning physical hardware.
○ Enable remote installation of Aneka on physical machines.
Platform Management
Basic Layer for Cloud Deployment
● Aneka Clouds are built on physical infrastructure by deploying a set of
services.
● These services help in installing and managing containers.
Containers & Platform
● A group of connected containers forms the platform where applications run.
● The platform's management is focused on the logical organization of Aneka
Clouds.
Cloud Partitioning
● Hardware can be divided into multiple Clouds, each configured for specific
purposes.
Core Features Managed
● Cloud Monitoring
● Resource Provisioning & Reservation
● User Management
● Application Profiling
Application Management
User Contribution to the Cloud
○ Applications represent how users utilize the Aneka Cloud.
Monitoring & Profiling
○ Administrators can track resource usage by users and applications.
○ Useful for billing users based on their resource consumption.
Resource Utilization Insights
○ Provides summary & detailed reports on application execution and
resource usage.
Management Interface
○ All these features are available in the Aneka Cloud Management
Studio, the main admin console.
DATA-INTENSIVE COMPUTING
What is Data-Intensive Computing?
● Focuses on applications dealing with large amounts of data.
● Used in fields like computational science and social networking.
Challenges of Handling Large Data
● Efficient storage and accessibility.
● Indexing and analyzing large datasets.
● Data grows rapidly over time, making management harder.
How Distributed Computing Helps
● Scalable storage architectures for handling big data.
● Better performance in data processing and computation.
Challenges in Using Distributed Computing for Data
● Data representation must be optimized.
● Efficient algorithms are required for processing large data.
● Scalable infrastructure is needed to support growth.
WHAT IS DATA-INTENSIVE COMPUTING?
What is Data-Intensive Computing?
● Involves producing, processing, and analyzing large-scale data.
● Data can range from hundreds of megabytes (MB) to petabytes (PB) and
beyond.
What is a Dataset?
● A collection of information relevant to one or more applications.
● Stored in repositories (storage systems for large datasets).
● Metadata (extra descriptive information) helps in classification and search.
Applications of Data-Intensive Computing
● Computational Science (scientific experiments & simulations).
● Astronomy (telescopes generate hundreds of gigabytes per second,
leading to petabytes yearly).
● Bioinformatics (analyzing massive biological databases).
● Earthquake Simulation (processing data from Earth’s vibrations).
Characterizing Data-Intensive Computations
Large Data Volumes
● Handle datasets in the terabyte (TB) to petabyte (PB) range.
● Data is stored in different formats and spread across multiple locations.
Compute-Intensive Nature
● Requires high computational power for data processing.
● Processing scales almost linearly with data size (i.e., more data = more
computing power needed).
Multistep Analytical Pipelines
● Data goes through multiple processing stages:
○ Transformation (converting raw data into usable formats).
Parallel Processing Capabilities
● Data can be processed simultaneously across multiple systems,
improving speed and efficiency.
Efficient Data Management
● Requires filtering, querying, fusion, and distribution of data efficiently.
Challenges Ahead
Scalable Algorithms
● Need efficient search and processing techniques for massive datasets.
Metadata Management
● Requires new technologies to handle complex, heterogeneous, and distributed
data sources.
High-Performance Computing Platforms
● Platforms must support in-memory multi-terabyte data structures for faster
access.
Reliable Petascale File Systems
● Need high-performance and fault-tolerant distributed storage solutions.
Data Reduction Techniques
● Data signatures help reduce redundancy and speed up processing.
Software Mobility
● Computation should move closer to data rather than transferring large
datasets.
Hybrid Interconnection Architectures
● Support for multi-gigabyte data streams from high-speed networks and
scientific instruments.
Flexible Software Integration
● Efficient techniques to combine software modules across different platforms
for fast data analysis.
Historical Perspective
Handling Large Data Volumes
● Focuses on producing, managing, and analyzing massive datasets.
Technology Integration
● Utilizes storage, networking, algorithms, and infrastructure software for
efficient computation.
Storage & Networking
● Advances in distributed storage and high-speed networking improve data
access and transfer.
Infrastructure Software
● Middleware and cloud platforms provide scalability and management for
large-scale data processing.
Evolution of Data-Intensive Computing
● Ongoing improvements in hardware, software, and algorithms shape the
field.
1. The Early Age: High-Speed Wide Area Networking
1989 – High-Speed Networking Experiments
● First attempts to use high-speed networks for remote visualization of
scientific data.
1991 – High-Speed TCP/IP Distributed Applications
● Demonstrated at Supercomputing 1991 (SC91).
● Remote visualization of an MRI scan of the human brain between
Pittsburgh Supercomputing Center (PSC) and Albuquerque.
Kaiser Project – Advancing Remote Data Access
● Utilized Wide Area Large Data Object (WALDO) system for:
○ Automatic metadata generation.
○ Real-time cataloging of data and metadata.
MAGIC Project – First Data-Intensive Environment
● Funded by DARPA to support large-scale, high-speed distributed
applications.
● Developed the Distributed Parallel Storage System (DPSS).
● Used in TerraVision, a 3D terrain visualization application.
2. Data Grids
Introduction to Data Grids
With the rise of Grid Computing, computational power and storage became
accessible across heterogeneous resources in different administrative domains.
This led to the emergence of Data Grids, which provide essential services for
managing and processing large-scale distributed datasets.
Key Functions of Data Grids
1. High-Performance and Reliable File Transfer
○ Enables efficient movement of large datasets across distributed
locations.
2. Scalable Replica Discovery and Management
○ Helps users easily locate and access replicated datasets.
Security & Access Control
○ Since Data Grids operate across multiple administrative domains,
robust security measures are necessary to protect data access.
Scientific Use Case of Data Grids
● Scientific Instruments (Telescopes, Particle Accelerators, etc.)
○ Produce massive volumes of data.
○ Initial processing occurs locally before storing data in repositories.
● Data Storage & Replication
○ Data is stored in distributed repositories.
○ Replication ensures availability and reliability.
● Scientists & Researchers
○ Use discovery services to locate datasets for experiments.
○ Need access to high-performance computing resources for data
analysis.
Characteristics and Challenges
Handling Massive Datasets
● Datasets can be huge, ranging from gigabytes to terabytes or more.
● To manage such large data, Data Grids use:
○ Fast bulk data transfers to reduce delays.
○ Data replication strategies to ensure quick access.
○ Efficient storage management to optimize space and performance.
Shared Data Collections
● Data Grids allow multiple users to share and access data from distributed
repositories.
● These repositories act as storage hubs where users can store, read, and
process data collaboratively.
Unified Namespace for Data
● Data Grids provide a single, logical system to organize and access
datasets.
● Each data file has a unique name, which can be mapped to different
physical locations for replication and easy access.
Access Restrictions & Security
● While Data Grids are designed for data sharing, some users may need
privacy and restricted access.
● Security measures include:
○ Authentication (verifying user identity).
○ Authorization (defining who can access specific data).
○ Fine-grained access control (allowing only specific users or groups to
view certain data).
3. Data Clouds and "Big Data"
● Traditionally, large datasets were mainly used in scientific research.
● Now, companies like Google, Amazon, and Facebook process massive
amounts of data for purposes like:
○ Search engines (Google analyzing search queries).
○ Online advertisements (Facebook and Google Ads targeting users).
○ Social media insights (analyzing user behavior and trends).
● These companies use a different approach from scientific Grid Computing
to analyze their data.
Examples of Big Data in Real-World Applications
● Web logs (tracking website visits and user behavior).
● RFID (Radio Frequency Identification) (used in retail and logistics).
● Sensor networks (collecting environmental or industrial data).
● Social networks (analyzing trends on platforms like Twitter and Instagram).
● Internet search indexing (organizing and ranking webpages in search
engines).
● Call detail records (telecom companies analyzing call data).
● Military surveillance (large-scale security monitoring).
● Medical records (storing and analyzing patient data).
● Photography & video archives (handling large-scale media storage).
● E-commerce platforms (analyzing customer buying patterns).
● Data grows over time – Unlike traditional databases where data is replaced,
Big Data keeps accumulating.
What Defines Big Data?
● Size beyond traditional software limits – Regular databases and software
cannot handle such large volumes.
● Constantly growing storage needs – The data size starts at terabytes (TB)
and can increase to petabytes (PB) or more.
● Requires specialized tools and infrastructure to process and manage
efficiently.
How Cloud Computing Supports Big Data?
● On-Demand Computing Power – Cloud platforms provide multiple servers
to process large datasets in parallel.
● Optimized Storage Solutions – Cloud storage is designed to handle huge
amounts of data efficiently.
● Big Data Processing Frameworks – Special tools and APIs (Application
Programming Interfaces) help manage and analyze data, often integrated
with cloud storage for better performance.
4. Databases and Data-Intensive Computing
● Massive Datasets: Data-intensive applications handle extremely large datasets,
often ranging from gigabytes to petabytes. Managing such vast amounts of data
requires strategies to minimize delays during large data transfers, effective
replication methods, and efficient storage management.
● Shared Data Collections: In data-intensive computing, data collections are
shared across various systems. Repositories serve as centralized storage
locations where data can be both stored and accessed by multiple users or
applications.
● Unified Namespace: Data Grids implement a unified logical namespace,
providing a consistent way to locate data collections and resources. Each data
element is assigned a unique logical name, which corresponds to different
physical file names to facilitate replication and accessibility.
● Access Restrictions: While Data Grids aim to promote data sharing for
collaborative experiments, it's essential to maintain data confidentiality. Users
can enforce access controls to ensure that only authorized collaborators can
access specific data, implementing authentication and authorization
measures for both broad and detailed access control over shared data
collections.
● Data-Intensive Applications in Industry: Beyond scientific computing,
industries like internet services, online advertising, and social media generate
and analyze massive datasets. Efficient analysis of this data is crucial, as it
provides valuable insights into customer behavior. For instance, companies
process extensive logs daily using distributed infrastructures to extract
meaningful information.
● Big Data Characteristics: Big Data problems are prevalent in various
domains, including web logs, sensor networks, social media, internet search
indexing, and large-scale e-commerce. These datasets are characterized by
their massive size and the continuous accumulation of new data over time,
rather than replacing existing data. Big Data refers to datasets so large that
traditional software tools struggle to capture, manage, and process them
within a reasonable timeframe, with sizes ranging from terabytes to petabytes.
Cloud Technologies Supporting Data-Intensive Computing: Cloud
technologies aid data-intensive computing by:
● Providing on-demand access to numerous computing instances for parallel
processing and analysis of large datasets.
● Offering storage systems optimized for large data objects and distributed data
storage architectures.
● Supplying frameworks and programming interfaces designed for efficient
processing and management of substantial data volumes, often integrated
with specific storage infrastructures to enhance overall system performance.
● Distributed Databases: These are collections of data distributed across
multiple networked sites. Each site can operate independently for local
applications but also collaborates in executing global applications. Distributed
databases can be formed by dividing and distributing data from an existing
database or by linking multiple existing databases. They offer robust features
like distributed transaction processing and query optimization. However, their
focus on maintaining ACID (Atomicity, Consistency, Isolation, Durability)
properties can limit their scalability compared to Data Clouds and Grids.
Cloud Computing vs. Distributed Computing:
● Cloud Computing: Provides on-demand IT resources and services over the
internet, including servers, storage, databases, networking, analytics, and
software.
● Distributed Computing: Involves multiple autonomous computers working
together to solve a single problem, sharing resources and tasks across a network.
● Key Differences:
○ Architecture: Cloud computing offers centralized services accessible over
the internet, while distributed computing consists of decentralized systems
collaborating to achieve a common goal.
○ Scalability: Cloud computing provides scalable resources managed by
service providers, whereas distributed computing's scalability depends on
adding more nodes to the network.
○ Control: Cloud computing users have less control over the underlying
infrastructure, while distributed computing allows for greater customization
and control over the computing environment.
TECHNOLOGIES FOR DATA-INTENSIVE COMPUTING
● This area covers the technologies used to store and process large amounts of
data efficiently.
Key Components:
● Storage Systems: Technologies used to store and manage large data.
● Programming Models: Methods and frameworks used to process the data.
Why is this Important?
● Handling big data requires specialized storage and processing techniques.
● Helps in applications like data analytics, machine learning, and cloud
computing.
Storage Systems
Traditional Database Systems:
● Earlier, database management systems (DBMS) were the standard choice
for storing and managing different types of data.
Challenge with Unstructured Data:
● Today, huge amounts of unstructured data (like blogs, web pages,
software logs, and sensor readings) are being generated.
● The traditional relational database model is not the best choice for handling
large-scale data analytics.
Turning Point in Data Management:
● The database industry is evolving to adapt to new data challenges.
● New opportunities are emerging in data storage and processing.
Factors Driving Changes in Data Management
1. Growing Popularity of Big Data
○ Managing large amounts of data is no longer rare—it is common across
many fields like:
■ Scientific computing
■ Enterprise applications
■ Media and entertainment
■ Natural language processing (NLP)
■ Social network analysis
○ The sheer volume of data requires new and efficient management
techniques.
2. Increasing Role of Data Analytics in Business
● Data is no longer seen as just a cost but as a key factor in business
growth.
● Companies like Facebook rely on data analytics to manage:
○ User profiles
○ Interests and preferences
○ Connections between people
● This huge amount of data needs advanced technologies and strategies
for analysis.
3. Diverse Forms of Data (Not Just Structured Data)
● Traditional structured data (e.g., databases) continues to grow.
● However, with the rise of the internet, more unstructured data is being
generated (e.g., social media posts, images, videos, logs).
● This data does not fit well into the traditional relational database model.
4. New Computing Approaches and Technologies
● Cloud computing allows access to large computing power on demand.
● Engineers can build applications that scale dynamically across hundreds or
thousands of servers.
● Traditional database systems are not designed for such highly flexible
and scalable environments.
High-Performance Distributed File Systems and Storage Clouds
Distributed File Systems:
● Purpose: They manage data by allowing storage and access in the form of
files, supporting read and write operations.
● Specialized Implementations: Some are designed to handle massive
amounts of data across numerous nodes, serving as storage solutions for:
○ Large computing clusters
○ Supercomputers
○ Massively parallel architectures
○ Storage/Computing Clouds
Examples:
1. Lustre File System:
○ Overview: A highly scalable, open-source parallel file system commonly
used in large-scale cluster computing environments.
○ Capabilities: Supports storage capacities reaching petabytes (PBs) and
serves thousands of clients with input/output (I/O) throughput in the range of
hundreds of gigabytes per second (GB/s).
○ Architecture:
■ Metadata Server (MDS): Manages metadata information about the file
system.
■ Object Storage Servers (OSS): Handle actual data storage.
■ Client Access: Users can access the file system through a POSIX-
compliant client, either mounted as a kernel module or via a library.
○ Reliability: Implements robust failover strategies and recovery mechanisms,
ensuring server failures and recoveries are transparent to clients.
2. IBM General Parallel File System (GPFS), now known as IBM Spectrum
Scale:
● Overview: A high-performance clustered file system developed by IBM,
supporting AIX®, Linux®, and Windows systems.
Architecture:
○ Shared Disks: A collection of disks attached to the file system's nodes
via a switching fabric.
○ Data Management: The system stripes large files over the disk array
and replicates portions to ensure high availability.
○ Metadata Distribution: Distributes metadata across the system,
eliminating single points of failure.
● Capabilities: Supports petabytes of storage, accessed at high throughput
while maintaining data consistency.
● Features:
○ Data Replication: Ensures data availability and reliability.
○ Policy-Based Storage Management: Allows automated data placement
and management based on defined policies.
○ Multi-Site Operations: Supports data sharing and management across
multiple locations.
3. Google File System (GFS):
● GFS is Google's proprietary distributed file system designed to manage large-
scale data across many servers.
Key Design Assumptions:
1. Hardware Failures: GFS is built on commodity hardware, which is prone to
failures. The system is designed to handle such failures gracefully.
2. Large Files: Optimized for storing a modest number of large files, often multi-
gigabyte in size.
3. Access Patterns: Designed to handle large streaming reads and small
random reads, as well as large, sequential writes that append data to files.
4. Performance Focus: Prioritizes high sustained bandwidth over low latency.
Architecture:
● Master Server: Holds metadata about the file system, such as namespace,
access control information, and mapping from files to chunks.
● Chunkservers: Store the actual data, divided into fixed-size chunks (typically
64 MB). Each chunk is replicated across multiple chunkservers to ensure
reliability.
● Clients: Access data by querying the master server for chunk locations and
then interacting directly with chunkservers for read and write operations.
Fault Tolerance:
● Chunks are replicated (default is three copies) across different chunkservers.
● The master server monitors chunkserver statuses and ensures data integrity
by re-replicating chunks as needed.
Operation:
● Reading a File: Clients request the master server for chunk locations and
then read directly from the chunkservers.
● Writing/Appending to a File: Clients ask the master server which chunkserver
holds the lease for the chunk (the primary), push the data, and the primary
applies the write and propagates the mutation order to the secondary replicas.
4. Sector Storage System:
● Sector is an open-source distributed file system designed to support data-
intensive applications, particularly within the Sphere framework. It operates
over wide area networks (WANs) and is built on commodity hardware.
Key Features:
1. Whole-File Replication:
○ Unlike traditional file systems that divide files into blocks, Sector
replicates entire files across multiple nodes. This approach allows users
to customize replication strategies for enhanced performance.
2. User-Space Implementation:
○ Sector functions as a user-space file system, meaning it operates above
the operating system's kernel, providing flexibility and ease of
deployment.
Architecture Components:
1. Security Server:
○ Manages access control policies for users and files, ensuring secure
interactions within the system.
2. Master Nodes:
○ Coordinate and handle input/output (I/O) requests from clients,
overseeing the overall operation of the file system.
3. Slave Nodes:
○ Store data and directly interact with client machines for data access and
processing tasks.
4. Client Machines:
○ User devices that initiate requests for data storage and retrieval within
the Sector system.
Data Transfer Protocol:
● UDT (UDP-based Data Transfer):
○ Sector utilizes UDT, a lightweight, connection-oriented protocol optimized
for high-performance data transfer over wide area networks. This
ensures efficient and reliable communication between clients and slave
nodes.
5. Amazon Simple Storage Service (S3):
● Amazon S3 is an object storage service provided by Amazon Web Services
(AWS), offering scalable, secure, and high-performance storage solutions for
a wide range of data types and applications.
Key Features:
1. Scalability and Performance:
○ Designed to handle vast amounts of data, S3 scales seamlessly to meet
varying storage demands while maintaining high performance.
2. Data Availability and Durability:
○ Ensures high data availability and durability, making stored data reliably
accessible when needed.
3. Security:
○ Offers robust security features, including access controls and encryption,
to protect data.
Data Organization:
● Buckets and Objects:
○ Data is organized into "buckets," each associated with an AWS account.
○ Each bucket stores multiple "objects," which are individual data files
identified by unique keys.
Access and Interaction:
● HTTP Protocol:
○ Objects are accessible via unique URLs using the HTTP protocol,
allowing straightforward data retrieval and storage operations.
● POSIX-like Client Libraries:
○ While S3 provides simple get-put semantics, POSIX-like client libraries
are available to mount S3 buckets as part of the local file system,
facilitating integration with existing applications.
Security and Access Control:
● Access Policies:
○ Bucket owners can define access control policies to manage visibility and
accessibility of objects, either restricting access to specific AWS accounts or
making data publicly available.
● Authenticated URLs:
○ Supports the creation of authenticated URLs, granting temporary public access to
specific objects for a configurable duration.
Additional Features:
● BitTorrent Integration:
○ S3 allows objects to be retrieved via the BitTorrent protocol, enabling efficient
distribution of large files.
● Storage Classes:
○ Offers various storage classes to optimize cost and performance based on data
access patterns, including Standard, Intelligent-Tiering, and Glacier for archival
storage.
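As a concrete illustration of the bucket/object model, HTTP access, and authenticated URLs described above, here is a short sketch using boto3, the AWS SDK for Python. The bucket name and object key are hypothetical, and the code assumes AWS credentials are already configured locally.

import boto3

s3 = boto3.client("s3")  # assumes credentials and region are configured locally

# Store an object: the bucket is the container, the key uniquely identifies the object.
s3.put_object(Bucket="example-reports-bucket",      # hypothetical bucket name
              Key="2024/sales-summary.csv",          # hypothetical object key
              Body=b"region,total\nAPAC,1200\n")

# Retrieve the object using the simple get/put semantics.
obj = s3.get_object(Bucket="example-reports-bucket", Key="2024/sales-summary.csv")
print(obj["Body"].read().decode())

# Generate an authenticated (pre-signed) URL granting temporary access for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-reports-bucket", "Key": "2024/sales-summary.csv"},
    ExpiresIn=3600,
)
print(url)

A pre-signed URL embeds a time-limited signature, so an object can be shared temporarily without making the bucket publicly readable.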
Not Only SQL (NoSQL) Systems
NoSQL databases, which stand for "Not Only SQL," are designed to handle large
volumes of unstructured or semi-structured data, offering flexible schemas and
scalability beyond traditional relational databases. They are categorized based on
their data models, each tailored to specific application needs:
1. Key-Value Stores:
○ Structure: Data is stored as a collection of key-value pairs, where each
key is unique, and its associated value can be a simple data type or a
more complex structure.
○ Use Cases: Ideal for applications requiring rapid read and write
operations, such as caching mechanisms and session management.
○ Examples: Redis, DynamoDB.
2. Document Stores:
● Structure: Data is stored in documents (e.g., JSON, BSON, XML), allowing
nested structures and varying fields across documents.
● Use Cases: Suitable for content management systems, blogging platforms,
and applications requiring flexible and evolving data schemas.
● Examples: MongoDB, CouchDB.
3. Column-Family Stores (Wide-Column Stores):
● Structure: Data is organized into rows and columns, but unlike traditional
relational databases, columns are grouped into families, and each row doesn't
need to have the same columns.
● Use Cases: Effective for analytical applications and scenarios involving large-
scale data warehousing.
● Examples: Apache Cassandra, HBase.
4. Graph Databases:
○ Structure: Data is represented as nodes (entities) and edges
(relationships), allowing complex interconnections to be efficiently stored
and queried.
○ Use Cases: Perfect for applications involving social networks,
recommendation engines, and network analysis.
○ Examples: Neo4j, Amazon Neptune.
These diverse NoSQL database types provide tailored solutions for various data
storage and retrieval challenges, enabling developers to choose the most
appropriate model based on their specific application requirements.
(a) Apache CouchDB and MongoDB.
● Apache CouchDB and MongoDB are both document-oriented databases.
● They provide a schema-less store, meaning you don’t need to define a fixed
structure for your data.
● Data is stored as documents with key-value fields.
● Field values can be of different types like string, integer, float, date, or
arrays.
● Both databases use JSON format to represent data.
● They offer a RESTful interface, allowing communication over HTTP.
● MapReduce programming model is used for querying and indexing data.
● JavaScript is the primary language for data queries and manipulation,
instead of SQL.
● They support large files as documents.
● Both databases support data replication and high availability to ensure
reliability.
● CouchDB follows ACID properties (ensuring reliability and consistency of
data).
● MongoDB supports sharding, which helps in distributing data across
multiple nodes for scalability.
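To make the schema-less, JSON-style document model concrete, the sketch below uses pymongo, the MongoDB Python driver. The database, collection, and field names are made up for illustration, and a MongoDB server is assumed to be running locally.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB instance
db = client["blog"]        # hypothetical database
posts = db["posts"]        # hypothetical collection

# Documents in the same collection may have different fields (schema-less store).
posts.insert_one({"title": "Intro to MapReduce", "tags": ["hadoop", "big data"], "views": 120})
posts.insert_one({"title": "Aneka overview", "author": {"name": "A. Student"}})

# Query by field value; the filter itself is a JSON-like document rather than SQL.
for doc in posts.find({"tags": "hadoop"}):
    print(doc["title"], doc.get("views", 0))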
(b) Amazon Dynamo.
● Dynamo is a distributed key-value store developed by Amazon Inc.
● It helps manage business service data for Amazon.
● The main goal is to provide scalability and high availability for storing data.
● Dynamo is designed to handle massive-scale reliability, supporting millions
of requests per day.
● It uses a simple get/put interface, where data is stored and retrieved using a
unique key.
● Unlike traditional databases, Dynamo does not follow ACID properties
(Atomicity, Consistency, Isolation, Durability).
● Instead, it follows an "eventually consistent" model, meaning that over
time, all users will see the same data.
● This trade-off improves efficiency and reliability across thousands of
servers and network components.
● Dynamo's storage system is made up of multiple storage peers arranged in
a ring structure.
● The key space (set of possible keys) is divided among storage peers and
replicated across the ring.
● Replication avoids adjacent peers, ensuring data availability even if some
nodes fail.
● Each peer (node) has access to local storage for storing original data and
replicas.
● Nodes can distribute updates across the ring and detect failures or
unreachable nodes.
● Dynamo relaxes consistency rules and uses object versioning to allow
always writable storage.
● Data consistency is resolved in the background, rather than in real-time.
● This architecture simplifies storage but requires applications to manage
their own data structure.
● No referential integrity constraints, meaning no foreign keys or
relationships.
● Join operations are not supported, making it different from relational
databases.
● This model works well for Amazon’s use cases, where a simple key-value
structure is sufficient.
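Dynamo is an internal Amazon system, so the following is only a toy Python sketch of the ideas above: keys are hashed onto a ring of storage peers, and each key is written to N distinct peers walking clockwise from its position (a simplification of Dynamo's actual placement rules). The peer names and key are hypothetical.

import hashlib
from bisect import bisect_right

class ToyDynamoRing:
    """Toy consistent-hashing ring: illustrates the idea, not Amazon's implementation."""
    def __init__(self, peers, replicas=3):
        self.replicas = replicas
        # Place each peer on the ring at the hash of its name.
        self.ring = sorted((self._hash(p), p) for p in peers)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(self, key):
        """The N distinct peers found by walking clockwise from the key's position."""
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, self._hash(key)) % len(self.ring)
        walk = [self.ring[(start + i) % len(self.ring)][1] for i in range(len(self.ring))]
        return walk[: self.replicas]

    def put(self, stores, key, value):
        """Write the value to every peer in the key's preference list."""
        for peer in self.preference_list(key):
            stores[peer][key] = value

peers = ["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"]
ring = ToyDynamoRing(peers)
stores = {p: {} for p in peers}              # each peer's local storage
ring.put(stores, "cart:user42", {"items": 3})
print(ring.preference_list("cart:user42"))   # the three replicas holding this key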
(c) Google Bigtable.
Bigtable is a distributed storage system designed to handle petabytes of data
across thousands of servers.
It supports various Google applications, handling both batch processing (high-
throughput tasks) and low-latency data serving.
Key design goals:
● Scalability (handles huge data volumes)
● High performance (fast processing)
● High availability (data remains accessible even if some servers fail)
Data Organization in Bigtable
● Data is stored in tables that are distributed over Google File System (GFS).
● A table is a multidimensional sorted map, indexed by a string key of any
length.
● Tables have rows and columns, where:
○ Columns are grouped into column families for better access control,
storage, and indexing.
○ Each column value can have multiple versions, with automatic or
manual timestamps.
● Client applications can easily access data at the column level.
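Bigtable itself is proprietary, but its data model, a map addressed by (row key, column family:qualifier, timestamp), can be sketched in a few lines of plain Python; the table, rows, and columns below are hypothetical, and sorting of keys is omitted for brevity.

import time

# A Bigtable-style cell is addressed by (row key, "family:qualifier", timestamp).
webtable = {}

def put(row, column, value, ts=None):
    webtable[(row, column, ts if ts is not None else time.time())] = value

def latest(row, column):
    """Return the most recent version of a cell (column values are versioned)."""
    versions = [(ts, val) for (r, c, ts), val in webtable.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.example.www", "contents:html", "<html>v1</html>", ts=1)
put("com.example.www", "contents:html", "<html>v2</html>", ts=2)
put("com.example.www", "anchor:news.example", "Example Site", ts=1)

print(latest("com.example.www", "contents:html"))  # '<html>v2</html>'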
Bigtable Processing and Infrastructure
● It supports complex operations, including:
○ Single row transactions (ensuring atomicity for updates within a row).
○ Advanced data manipulation using Sawzall scripting and MapReduce
APIs.
● Bigtable operates in a cluster-based environment with two main types of
processes:
○ Master Server: Manages tablet servers, assigns tasks, and monitors their
status.
○ Tablet Server: Handles requests for tablets (a contiguous set of table rows).
● A single tablet server can manage multiple tablets (from 10 to 1,000 tablets).
● If a tablet server fails, the master server reassigns its workload to other servers,
ensuring availability.
(d) Apache Cassandra.
What is Cassandra?
● Cassandra is a distributed object store designed to handle large-scale
structured data.
● It was originally developed by Facebook and is now part of the Apache
Foundation.
● Used by large web applications like Facebook, Digg, and Twitter.
● Highly reliable and scalable, with no single point of failure.
Cassandra’s Data Model
● Inspired by Dynamo (Amazon) and Bigtable (Google).
● Uses a table-based structure, but implemented as a distributed multi-
dimensional map.
● Each table row is structured, indexed by a unique key.
● Rows contain columns, and related columns are grouped into column
families.
● Simple API functions for insertion, retrieval, and deletion.
○ Insertions happen at the row level.
○ Retrieval and deletion can be done at the column level.
Cassandra’s Infrastructure
● Similar to Dynamo in architecture, using a ring-based structure.
● Nodes share key space and manage multiple, non-continuous portions
of data.
● Data is replicated across multiple nodes (up to N nodes).
● Different replication strategies:
○ Rack-unaware: The simplest strategy; replicas are placed on successive nodes
in the ring without considering the network topology.
○ Rack-aware: Replicas are placed on different racks within the same data center.
○ Data center-aware: Replicas are distributed across multiple data centers.
● Uses a gossip protocol for node communication and system state updates.
Data Persistence and Recovery
● Each node uses local storage for data persistence.
● Uses commit logs to recover from failures.
● Write operations are first logged on disk before being stored in memory.
● Data in memory is periodically dumped to disk when it reaches a set size.
● Read operations are done from memory first, then disk if needed.
● To speed up searches, each file includes a key summary, avoiding full file
scans.
Scalability and Real-World Use
● Cassandra combines the best features of Dynamo and Bigtable to create a
fully distributed, highly reliable storage system.
● One of the largest deployments manages 100 TB of data across 150
machines.
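The sketch below uses the open-source DataStax cassandra-driver for Python to show how a keyspace declares its replication strategy (here the simple, topology-unaware one with three replicas) and how a table groups columns under a row key. The keyspace and table names are hypothetical, and a Cassandra node is assumed to be reachable on localhost.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumes a local Cassandra node
session = cluster.connect()

# SimpleStrategy ignores rack/data-center topology; NetworkTopologyStrategy is the
# rack- and data-center-aware alternative.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS demo.user_events (
        user_id text, event_time timestamp, event text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# Insertions happen at the row level; reads can select individual columns.
session.execute("INSERT INTO demo.user_events (user_id, event_time, event) "
                "VALUES ('u42', toTimestamp(now()), 'login')")
for row in session.execute("SELECT event_time, event FROM demo.user_events WHERE user_id = 'u42'"):
    print(row.event_time, row.event)

In production deployments, NetworkTopologyStrategy would typically replace SimpleStrategy so that replicas are spread across racks and data centers.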
(e) Hadoop HBase.
● HBase is a distributed database designed to handle large-scale structured
data.
● Inspired by Google's Bigtable, it provides real-time read/write access to
tables with billions of rows and millions of columns.
● Built on top of the Hadoop Distributed File System (HDFS), similar to how
Bigtable uses the Google File System (GFS).
Key Features:
● Scalability: Efficiently manages vast amounts of data across clusters of
commodity hardware.
● Data Model: Organizes data into tables with rows and columns, supporting
versioned storage of data.
● Integration: Integrates with the Hadoop ecosystem, supporting MapReduce-based
batch processing over HBase tables.
Comparison with Google's Bigtable:
● Open Source vs. Proprietary: HBase is an open-source project under the
Apache Foundation, while Bigtable is a proprietary system developed by
Google.
● Deployment: HBase can be deployed on any environment that supports
Hadoop and HDFS, whereas Bigtable is available as a service within Google's
infrastructure.
● API Differences: While both share similar architectural principles, there are
differences in their APIs and client implementations.
Use Cases:
● Ideal for applications requiring real-time analytics, random read/write
access, and handling sparse datasets.
● Commonly used in scenarios like log data analysis, time-series data
storage, and as a backend for large-scale web applications.
Limitations:
● Complexity: Requires careful configuration and tuning to achieve optimal
performance.
● Limited Transactions: Guarantees atomicity only at the single-row level, which
may not be suitable for applications requiring multi-row or cross-table transactions.
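As an illustration only (not the canonical Java client), the sketch below uses happybase, a Python library that talks to HBase through its Thrift gateway, to create a table with two column families and read/write a single row. The table, families, and row key are hypothetical, and an HBase Thrift server is assumed to be running on localhost.

import happybase

# Assumes an HBase Thrift server is reachable on localhost.
connection = happybase.Connection("localhost")

# Tables group columns into families, mirroring the Bigtable model.
if b"web_logs" not in connection.tables():
    connection.create_table("web_logs", {"req": dict(), "resp": dict()})

table = connection.table("web_logs")

# Row keys and values are byte strings; columns are written as b"family:qualifier".
table.put(b"2024-06-01#client-17", {b"req:url": b"/index.html", b"resp:status": b"200"})

row = table.row(b"2024-06-01#client-17")
print(row[b"resp:status"])  # b'200'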
Programming Platforms
Platforms for programming data-intensive applications offer tools to handle and
process large volumes of data efficiently. Here's a simplified breakdown of the key
points:
1. Traditional Databases and Big Data Challenges:
○ Traditional relational database management systems (RDBMS) organize
data into structured tables with defined relationships.
○ However, with the rise of "Big Data," much information is unstructured or
semi-structured, often stored in large files or numerous medium-sized
files, making traditional RDBMS less effective.
2. Distributed Workflows:
● To process large datasets, distributed workflows are employed, utilizing
multiple systems to analyze and process data concurrently.
● This approach has led to the development of various workflow management
systems that can leverage the scalability and flexibility of cloud computing.
3. Abstraction of Tasks:
● Many systems focus on the concept of 'tasks,' requiring developers to
manage data handling and transfers, which can be complex and burdensome.
4. High-Level Programming Platforms:
● Modern programming platforms for data-intensive computing provide higher-
level abstractions, allowing developers to focus on data processing while the
system manages data transfers and availability.
MapReduce programming model
- MapReduce is a programming platform introduced by Google for
processing large quantities of data.
- It is a processing technique and a programming model for distributed
computing; Hadoop provides a widely used open-source implementation in Java.
- The algorithm contains two important tasks, Map and Reduce.
- Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
- Reduce takes the output from a map as an input and combines those
data tuples into a smaller set of tuples.
- As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.
- The model is expressed in the form of two functions:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
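Read as ordinary functions, these two signatures can be sketched in Python as follows (a sketch of the model, not Hadoop's Java API), using word counting as the instantiation: map emits (word, 1) pairs from a document and reduce sums the values collected for one word.

# map(k1, v1) -> list(k2, v2): here k1 is a document name and v1 its text.
def map_fn(doc_name, text):
    return [(word.lower(), 1) for word in text.split()]

# reduce(k2, list(v2)) -> list(v2): here it collapses the counts gathered for one word.
def reduce_fn(word, counts):
    return [sum(counts)]

print(map_fn("doc1", "Hadoop is good Hadoop"))  # [('hadoop', 1), ('is', 1), ('good', 1), ('hadoop', 1)]
print(reduce_fn("hadoop", [1, 1]))              # [2]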
How Does MapReduce Work?
• The whole process goes through four phases of execution: splitting, mapping,
shuffling, and reducing.
• Let's understand this with an example. Suppose we want to count how many
times the words Apache, Hadoop, Class, and Track appear in a collection of
six documents stored on a cluster of three servers.
• 1. Splitting: First, the input data (the six documents) is split and distributed
across the cluster (the three servers). In this case, each map task works on a
split containing two documents. During mapping there is no communication
between the nodes; they work independently.
• 2. Mapping: Each map task creates a <key, value> pair for every word it is
counting: the word is the key and its count is the value. For example, one
document contains three of the four words we are looking for: Apache 7 times,
Class 8 times, and Track 6 times. The key-value pairs produced by that map
task look like this:
– <apache, 7>
– <class, 8>
– <track, 6>
• This process runs as parallel tasks on all nodes for all documents, and each
map task produces its own intermediate output.
• 3. Shuffling: After input splitting and mapping complete, the outputs of every
map task are shuffled. This is the first step of the Reduce stage. Since we are
looking for the frequency of occurrence of four words, there are four parallel
reduce tasks. The reduce tasks can run on the same nodes as the map tasks
or on any other nodes.
• The shuffle step ensures that the keys apache, hadoop, class, and track are
sorted and grouped for the reduce step. This process groups the values by key
in the form of <key, value-list> pairs.
• 4. Reducing: In the reduce step of the Reduce stage, each of the four tasks
processes one <key, value-list> pair to produce a final key-value pair. The
reduce tasks also run at the same time and work independently.
• In our example, the reduce tasks produce the following individual results:
– <apache, 22>
– <hadoop, 20>
– <class, 18>
– <track, 22>
• 5. Output: Finally, the results of the Reduce stage are combined into one
output. MapReduce now shows how many times the words Apache, Hadoop,
Class, and Track appeared across all the documents. By default, the
aggregate output is stored in HDFS.
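To tie the four phases together, here is a small, self-contained Python simulation (not Hadoop itself) of splitting, mapping, shuffling, and reducing for a word-count job over a few made-up documents; in a real cluster the map and reduce tasks would run in parallel on different nodes and the output would land in HDFS.

from collections import Counter, defaultdict

documents = {                      # hypothetical input, one entry per input split
    "doc1": "Apache Hadoop Class Apache Track",
    "doc2": "Hadoop Class Track Track Apache",
    "doc3": "Class Apache Hadoop Hadoop",
}
WANTED = {"apache", "hadoop", "class", "track"}

# 1) Splitting: each document becomes the input split of one map task.
splits = list(documents.items())

# 2) Mapping: each map task independently emits <word, count> pairs for its split.
def map_fn(doc_name, text):
    counts = Counter(w.lower() for w in text.split() if w.lower() in WANTED)
    return list(counts.items())

mapped = [pair for name, text in splits for pair in map_fn(name, text)]

# 3) Shuffling: group the intermediate pairs by key into <key, value-list> pairs.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# 4) Reducing: each reduce task sums the value list of one key.
def reduce_fn(word, counts):
    return word, sum(counts)

result = dict(reduce_fn(w, c) for w, c in shuffled.items())
print(result)  # e.g. {'apache': 4, 'hadoop': 4, 'class': 3, 'track': 3}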