
Module - 5

Cloud applications: Cloud application development and architectural styles, coordination of multiple activities, workflow patterns, coordination based on a state machine model (ZooKeeper), the MapReduce programming model.

Case study: the GrepTheWeb application, Hadoop, YARN, and Tez, SQL on Hadoop: Pig, Hive, and Impala.
Challenges with Traditional Large-Scale Computing (Before
Cloud):
• Difficult Application Development: Creating efficient data and computationally
intensive applications was complex.
• System Suitability & Scheduling Issues: Locating appropriate systems, determining
run times, and estimating completion times were problematic.
• Portability Challenges: Moving applications between systems was often difficult,
with performance variations.
• Inefficient Resource Management (Provider): System resources were poorly utilized,
hindering QoS guarantees.
• Operational Complexities (Provider): Handling dynamic loads, security, and rapid
failure recovery at scale was challenging.
• Low Resource Utilization: Economic benefits of resource concentration were
negated by underutilized resources.
Impact of Cloud Computing (The Solution):
• Simplified Application Development: Developers can work in familiar environments with just-in-
time infrastructure.
•Location Independence: Developers don't need to worry about where their applications will run.
•Elasticity: Applications can seamlessly scale to handle varying workloads.
•Parallelization Benefits: Workloads can be partitioned and run concurrently for significant
speedups (useful for CAD, complex modeling).
•Enterprise Focus: Cloud computing primarily targets enterprise computing, unlike grid computing's
scientific/engineering focus.
•Simplified Administration: Cloud resources are within a single administrative domain (advantage
over grid).
•Improved Resource Utilization (Provider): Cloud leads to more efficient use of computing
resources.
•Framework Accommodation: Cloud infrastructure efficiently supports and shares resources
among frameworks like MapReduce.

The future of cloud computing hinges on utility computing providers effectively demonstrating the
benefits of network-centric computing and content to a wider user base by delivering satisfactory
security, scalability, reliability, QoS, and meeting SLA requirements.
Cloud Application development challenges:
•Imbalance of Resources: Inherent mismatch between computing, I/O, and communication bandwidth
is amplified in the cloud.
•Scalability & Distribution: Cloud scale and distributed nature exacerbate resource imbalance for data-
intensive apps.
•Manual Optimization: Developers must still optimize data storage, locality (spatial/temporal), and
minimize inter-instance/thread communication despite auto-distribution efforts.
•Workload Partitioning: Utilizing the scalability of the cloud requires the workload to be arbitrarily
divisible and parallelizable.
•Performance Isolation Issues: Shared infrastructure makes true performance isolation nearly
impossible, leading to VM performance fluctuations.
•Security Isolation Challenges: Maintaining security in multitenant cloud environments is difficult.
•Reliability Concerns: Frequent server failures are expected due to the large number of commodity
components.
Cloud application development challenges:
•Instance Selection Complexity: Choosing the optimal instance type involves trade-offs in
performance, reliability, security, and cost.
•Multi-Stage Application Management: Ensuring efficiency, consistency, and communication
scalability across parallel instances in multi-stage applications is crucial.
•Network Variability: Cloud infrastructure exhibits latency and bandwidth fluctuations affecting
application performance, especially data-intensive ones.
•Data Storage Optimization: Careful analysis of data storage organization, location, and
bandwidth is critical for application performance.
•Metadata Management: Storing and accessing metadata efficiently, scalably, and reliably is
important for data-intensive applications.
•Logging Trade-offs: Balancing performance limitations with the need for sufficient logging for
debugging and analysis is challenging. Logs often require specific preservation strategies.
•Software Licensing: Software licensing in the cloud environment presents ongoing challenges.
Cloud application architectural styles.
Stateless Servers: A stateless server does not require a client to first establish a connection
to the server; instead, it views a client request as an independent transaction and responds
to it.

• Request-Response: Cloud apps heavily use request-response between clients and stateless
servers.

• Independent Transactions: Each client request is treated as a separate transaction.

• Advantages:
• Simplified Recovery: Server failures don't impact clients during requests.
• Simplicity & Robustness: Easier to manage and more resilient.
• Scalability: Doesn't require reserving resources per connection.
• Client Independence: Clients don't track server state.
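
A minimal sketch of a stateless request-response server, using the JDK's built-in com.sun.net.httpserver package; the port, path, and reply format are illustrative and not tied to any particular cloud service. Each request is answered from its own content alone, so a restarted server can handle the next request without recovering any per-client state.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class StatelessEchoServer {
    public static void main(String[] args) throws Exception {
        // Each request is handled as an independent transaction;
        // no per-client session state is kept on the server.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/echo", exchange -> {
            String query = exchange.getRequestURI().getQuery();   // e.g. "msg=hello"
            byte[] body = ("you sent: " + query).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();   // if this server crashes, a restarted copy can serve the next request unchanged
    }
}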
Protocols & Communication:
• HTTP: A stateless request-response application protocol (used by browsers).
• Uses TCP (reliable, connection-oriented transport).
• TCP can be vulnerable to DoS attacks (connection establishment flooding).
• Basic web servers and browsers are stateless.

• Interoperability Challenges: Communicating structured data between different architectures and languages requires:
• Handling different data representations (endianness).
• Managing varying character encodings.
• Serialization at the sender and reconstruction at the receiver.

• Architectural Style Considerations:


• Neutrality: Ability to use different transport protocols (e.g., TCP, UDP).
• Extensibility: Capability to add features (e.g., security).
• Independence: Support for diverse programming styles.
Communication Mechanisms:

• RPCs (Remote Procedure Calls): Common for client-server communication in the cloud.

• Use stubs for parameter marshaling and serialization.

• ORB (Object Request Broker): Middleware facilitating communication between networked applications.

• Handles data transformation and byte sequence transmission/mapping.

• CORBA (Common Object Request Broker Architecture): Enables interoperability between applications in different languages and architectures using IDL (Interface Definition Language).

• SOAP (Simple Object Access Protocol): XML-based message format for web applications.
• Uses TCP (and UDP), can be layered over HTTP, SMTP, JMS.

• Involves senders, receivers, intermediaries, etc.

• Underlying layer for Web Services.


•WSDL (Web Services Description Language): XML grammar for describing communication
endpoints (services, types, operations, port types, bindings, ports).
•REST (Representational State Transfer): Architecture for distributed hypermedia with stateless
servers.
•Platform and language independent, supports caching, firewall-friendly.
•Primarily uses HTTP for CRUD operations (GET, PUT, DELETE).
•Lightweight & Easier to Use: Often simpler than RPC, CORBA, SOAP, WSDL (e.g., URL-based
data retrieval).
•SOAP Tooling Advantage: SOAP has tools (like WSDL) for self-documentation and code
generation.
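
A short, hedged sketch of REST-style CRUD over HTTP using the Java 11 java.net.http client; the https://api.example.com/orders/42 resource URL is hypothetical. It illustrates why REST is often considered lighter-weight than RPC, CORBA, or SOAP: the resource is named by a URL and manipulated with plain HTTP verbs.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // A REST read: the resource is identified by a URL and retrieved with GET.
        HttpRequest get = HttpRequest.newBuilder(URI.create("https://api.example.com/orders/42"))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(get, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());

        // A REST delete of the same resource; PUT and POST follow the same pattern.
        HttpRequest delete = HttpRequest.newBuilder(URI.create("https://api.example.com/orders/42"))
                .DELETE()
                .build();
        client.send(delete, HttpResponse.BodyHandlers.discarding());
    }
}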
Coordination of multiple activities
Basic Workflow Concepts:
• Workflow Models: Abstractions highlighting key properties of entities in a workflow management
system.
• Task: Central concept – a unit of work performed on the cloud.
• Task Attributes:
(i) Name: Unique string identifier.

(ii) Description: Natural language explanation.

(iii) Actions: Modifications to the environment.

(iv) Preconditions: Boolean expressions that must be true before execution.

(v) Postconditions: Boolean expressions that must be true after execution.

(vi) Attributes: Resource needs, responsible actors, security, reversibility, other characteristics.

(vii) Exceptions: Information on handling abnormal events as <event, action> pairs (anticipated
exceptions). Unanticipated exceptions trigger replanning (process restructuring).
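
A minimal, illustrative Java sketch of the task abstraction above (not taken from any workflow engine); preconditions and postconditions are modeled as predicates over a key-value environment, and the field names simply mirror attributes (i)-(vii).

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class Task {
    String name;                                           // (i) unique identifier
    String description;                                    // (ii) natural-language explanation
    List<Predicate<Map<String, Object>>> preconditions;    // (iv) must hold before execution
    List<Predicate<Map<String, Object>>> postconditions;   // (v) must hold after execution
    Map<String, String> attributes;                        // (vi) resource needs, actors, security, ...
    Map<String, Runnable> exceptionHandlers;               // (vii) <event, action> pairs

    boolean ready(Map<String, Object> env) {
        // The task may start only when every precondition evaluates to true.
        return preconditions.stream().allMatch(p -> p.test(env));
    }

    void execute(Map<String, Object> env) {
        // (iii) actions: modify the environment, then check the postconditions.
        // ... task-specific work goes here ...
        if (!postconditions.stream().allMatch(p -> p.test(env))) {
            throw new IllegalStateException("postconditions violated for task " + name);
        }
    }
}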
Task Hierarchy and Control Flow:
•Composite Task: A structured task composed of a subset of other tasks, defining their execution
order.
•Inherits workflow properties: Contains tasks, has one start symbol, may have multiple end symbols.
•Inherits task properties: Has a name, preconditions, and postconditions.
•Primitive Task: A basic task that cannot be further broken down.
•Routing Task: A specialized task that links two tasks in a workflow description, controlling the flow
of execution.
•Predecessor Task: The task that has just finished executing.
•Successor Task: The task that will be initiated next.
•Execution Control: Routing tasks can trigger sequential, concurrent, or iterative execution.
Types of Routing Tasks:
• Fork Routing Task: Triggers the execution of multiple successor tasks concurrently.
Possible semantics include:
(i) All Enabled: All successor tasks are initiated simultaneously.
(ii) Condition-Based (Multiple Enabled): Each successor has a condition; tasks with true conditions are enabled.
(iii) Condition-Based (Single Enabled - XOR): Each successor has a mutually exclusive condition; only the task
with the single true condition is enabled.
(iv) Nondeterministic: A random selection of k out of n successor tasks are enabled (n > k).

• Join Routing Task: Waits for the completion of its predecessor tasks before enabling a
successor task. Possible semantics include:
• (i) AND Join (All Complete): The successor task is enabled only after all predecessor tasks have finished
execution.
• (ii) N out of M Join: The successor task is enabled after a specific number (k) out of a total number of
predecessor tasks (n) have completed (n > k).
• (iii) Iterative Join: The tasks located between a corresponding fork and this join routing task are executed
repeatedly.
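
A small sketch, using plain Java threads, of the AND-join semantics above: the successor runs only after all predecessors complete. A CountDownLatch plays the role of the join routing task; an N-out-of-M join would simply create the latch with count N while M predecessors count down.

import java.util.concurrent.CountDownLatch;

public class AndJoinDemo {
    public static void main(String[] args) throws InterruptedException {
        int predecessors = 3;
        CountDownLatch join = new CountDownLatch(predecessors);   // AND join: wait for all 3

        for (int i = 0; i < predecessors; i++) {
            int id = i;
            new Thread(() -> {
                System.out.println("predecessor task " + id + " finished");
                join.countDown();          // signal completion to the routing task
            }).start();
        }

        join.await();                      // successor is enabled only after all predecessors complete
        System.out.println("successor task starts");
    }
}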
Process descriptions and cases.
Process Descriptions (Workflow Schemas):

A process description, also called a workflow schema, is a structure describing the tasks or activities to be
executed and the order of their execution; a process description contains one start symbol and one end
symbol.

• Elements: Contains one start and one end symbol.

• Specification: Can be defined using a Workflow Definition Language (WFDL) that supports constructs for:
• Choice (e.g., XOR split/join)
• Concurrent execution (e.g., AND split)
• Classical fork and join constructs
• Iterative execution

• Analogy: Resembles a flowchart used in programming.


Workflow Lifecycle:

• Phases:
• Creation: Initial design of the workflow.
• Definition: Formal specification of the workflow (using a WFDL).
• Verification: Checking the workflow definition for correctness and consistency (analogous to
program syntax checking/compilation).
• Enactment: The actual execution of the workflow (analogous to running a compiled program).

• Analogies to Program Development:
• Workflow specification ↔ writing a program
• Planning (automatic workflow generation) ↔ automatic program generation
• Workflow verification ↔ program compilation (syntactic check)
• Workflow enactment ↔ program execution
FIGURE 11.1: Workflows and programs. (a) The life-cycle of a workflow. (b) The life-cycle of a computer program.
Workflow Cases (Workflow Instances):

• Definition: A specific instance of a process description.

• Creation & Termination: Start and end symbols enable case instantiation and completion.

• Enactment Model: Describes the steps to process a case.

• Enactment Engine: Program that executes the tasks of a workflow case.

• State of a Case: Defined by completed tasks at a given time. Tracking state with concurrent
activities is complex.

• Alternative Description: Transition system showing possible paths from start to goal state.

• Planning (Goal-Oriented): System can generate a workflow description to reach a specified goal
state.

• State Space: Includes initial and goal states; a case is a specific path.
Alternative Workflow Description (Transition System):
Transition System: Describes all possible sequences of states from the initial state to the desired
goal state.
AI Planning for Workflow Generation:
• Goal-Oriented Approach: Instead of a direct process description, only the desired goal state is specified.
• Automatic Generation: The system automatically creates a workflow description (sequence of tasks) to
reach the goal.
• Knowledge Required: The system needs to know the available tasks and their associated preconditions and
postconditions.
• AI Planning: This automated workflow creation is a core concept in Artificial Intelligence planning.
State Space and Cases:
• State Space: Encompasses the initial state and the final goal state of the process.
• Transition System Mapping: The transition system outlines all feasible pathways within this state space.
• Workflow Case as a Path: Each specific execution of the workflow (a "case") corresponds to a unique path
through the transition system.
• Case State Tracking: The progress of a particular workflow execution is tracked by the sequence of states
visited along its path.
Requirements for Process Description Languages:
• Unambiguity: The language should have a clear and precise syntax and semantics to
avoid misinterpretations.
• Verifiability: The language should allow for the formal verification of the process
description before any actual execution (enactment). This helps in:
• Detecting potential errors or flaws in the workflow design early on.
• Checking for desirable properties like safety and liveness.
Importance of Verification:
• A process description might execute correctly in some scenarios but fail in others.
• Enactment failures can be expensive and disruptive.
• Thorough verification during the process definition phase is crucial to prevent these
failures.
• Different process description methods have varying degrees of suitability for
verification.
Desirable Workflow Properties

• Safety: Ensures that no undesirable or "bad" situations occur during the workflow
enactment.

• Liveness: Guarantees that the workflow will eventually lead to a successful outcome or "good" state (e.g., the goal state).

Liveness Violation Example (Fig. 11.2(a)):

• Path B -> C: Leads to termination (liveness achieved).

• Path B -> D: Prevents task F (requiring C and E) from ever starting.

• Consequence: Task G (requiring D and F) also never starts, resulting in a workflow that never terminates (violating liveness).
FIGURE 11.2
(a) A process description that violates the liveness requirement; if task C is chosen after completion of B, the process will terminate after
executing task G; if D is chosen, then F will never be instantiated because it requires the completion of both C and E. The process will
never terminate because G requires completion of both D and F.

(b) Tasks A and B need exclusive access to two resources r and q, and a deadlock may occur if the following sequence of events occur:
at time t1, task A acquires r, at time t2, task B acquires q and continues to run; then, at time t3, task B attempts to acquire r, and it blocks
because r is under the control of A; task A continues to run at time t4, attempts to acquire q, and blocks because q is under the control of
B.
Resource Deadlocks During Enactment (A Cautionary Note):
• Even if a process description is inherently live (guaranteed to eventually
complete), actual execution can be hindered by resource deadlocks.
• Deadlock Scenario (Fig. 11.2(b)):
• Concurrent tasks A and B both need exclusive access to resources r and q.
• Task A acquires r at time t1.
• Task B acquires q at time t2.
• Task B tries to acquire r at t3 (blocked by A).
• Task A tries to acquire q at t4 (blocked by B).
• This creates a deadlock where neither task can proceed.
Deadlock Avoidance Strategy:
• Acquire All Resources Simultaneously: A task requests all necessary
resources at once. Trade-off: This strategy can lead to resource
underutilization as resources remain idle while a task waits to acquire all its
requirements.
Workflow patterns
• The term workflow pattern refers to the temporal relationships among the tasks of a
process.
• Workflow description languages and enactment mechanisms must support these
relationships.
• Classified into categories (basic, advanced branching/synchronization, structural,
state-based, cancellation, multiple instances).
• Basic Workflow Patterns (Fig. 11.3)
•Sequence (Fig. 11.3(a)): Tasks execute one after the other, sequentially (A → B → C).
•AND Split (Fig. 11.3(b)): Task A's completion triggers the concurrent execution of multiple
tasks (A → B & C).
•Explicit: Uses a routing node to activate all connected tasks.
•Implicit: Direct connections with conditions; tasks activate only if their branch condition is
true.
•Synchronization (Fig. 11.3(c)): A task starts only after all preceding concurrent tasks have
completed (A & B → C).
•XOR Split (Exclusive OR) (Fig. 11.3(d)): Task A's completion leads to the activation of only
one of the subsequent tasks (A → either B or C), based on a decision.
Fig: 11.3 Basic workflow patterns. (a) Sequence; (b) AND split; (c) Synchronization; (d) XOR split; (e) XOR merge; (f)
OR split
•XOR Join (Exclusive OR Join) (Fig. 11.3(e)): Task C is activated upon the completion of either task
A or task B.
•OR Split (Inclusive OR Split) (Fig. 11.3(f)): After task A completes, one or more of the subsequent
tasks (B and/or C) can be activated.
•Multiple Merge (Fig. 11.3(g)): Allows a task (D) to be activated multiple times based on the
completion of concurrent tasks (B and C). The first completion of either B or C triggers D, and the
subsequent completion of the other triggers D again. No explicit synchronization is required for D to
start the first time.
•Discriminator (Fig. 11.3(h)): Task D is activated after a specific number of incoming branches (from
A, B, or C) complete (in this case, the first one). It then waits for the remaining branches to complete
without further action until all are done, after which it resets.
•N out of M Join (Fig. 11.3(i)): Task E is enabled once a specific number (N) out of a set of
concurrent tasks (M) have completed. In the example, E starts after any two out of the three tasks (A,
B, C) finish.
•Deferred Choice (Fig. 11.3(j)): Similar to an XOR split, but the decision of which branch to take (to
B or C after A) is made by the runtime environment, not explicitly defined in the workflow.
Fig: 11.3 : Basic workflow patterns.
(g) Multiple Merge; (h) Discriminator; (i) N out of M join; (j) Deferred Choice.
Goal State Reachability:
We analyze whether a goal state σ_goal can be reached from an initial state σ_initial within a system Σ. The analysis considers:
• Process Group (P): A set of processes {p_1, p_2, ..., p_n}, where each process p_i has:
• Preconditions: pre(p_i) (conditions that must be true before execution).
• Postconditions: post(p_i) (conditions that are true after execution).
• Attributes: atr(p_i) (characteristics like resource needs).
• Workflow (A or Π):
• Represented by a directed activity graph A, where nodes are processes from P and edges show precedence.
• Alternatively, a procedure Π can construct A given the process group, initial state, and goal state, <P, σ_initial, σ_goal>.
• Precedence rule: p_i → p_j implies that the preconditions of p_j are a subset of the postconditions of p_i, i.e., pre(p_j) ⊆ post(p_i); see the sketch after this list.
• Constraints (C): A set of conditions {C_1, C_2, ..., C_m} that must be satisfied.
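
A tiny sketch of the precedence rule above: an edge p_i → p_j is admissible only if pre(p_j) ⊆ post(p_i). Conditions are modeled as strings, and the condition names are made up for illustration.

import java.util.Set;

public class PrecedenceCheck {
    // Edge p_i -> p_j is valid only if every precondition of p_j appears among the postconditions of p_i.
    static boolean validEdge(Set<String> postOfPi, Set<String> preOfPj) {
        return postOfPi.containsAll(preOfPj);
    }

    public static void main(String[] args) {
        Set<String> postOfPi = Set.of("data.fetched", "data.validated");
        Set<String> preOfPj  = Set.of("data.validated");
        System.out.println(validEdge(postOfPi, preOfPj));   // true: p_j may follow p_i
    }
}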
Workflow Coordination and Enactment:
•Coordination Problem: Reaching σ_goal from σ_initial via P_final's postconditions, satisfying the constraints C_i, where σ_initial enables P_initial's preconditions. This implies a process chain where one process's output feeds the next.
•Process Components:
•Preconditions: Triggering conditions/events or required input data.
•Postconditions: Results produced by the process.
•Attributes: Special requirements or properties.
Enactment Models:
Strong Coordination:
•This approach relies on a central entity to manage and direct the flow of tasks (P).
It's analogous to a conductor leading an orchestra.

Coordinator Process (Enactment Engine):


•This central component has a global view of the workflow. It knows the entire activity
graph, including preconditions, postconditions, and dependencies for all tasks.
•It actively supervises tasks, meaning it determines when a task should start,
provides it with necessary inputs, and waits for its completion.
•It ensures seamless transitions by explicitly activating the next task in the sequence
once its prerequisites are met.
Weak Coordination:
• It's about peer-to-peer communication where there's no single central
orchestrator. Instead, individual processes (or "peers") interact indirectly
through a shared, passive communication channel.
Mechanism:
• Societal Service (e.g., Tuple Space): This acts as a common bulletin board or
shared data store where processes can leave messages or tokens. A "tuple
space" is a specific model where data items (tuples) can be written, read, or
taken based on pattern matching.
• Tokens & Postconditions: When a process completes its task, it doesn't
directly tell the next process to start. Instead, it deposits a "token" into the
societal service. This token typically contains information about what it has
just done or produced (its postconditions).
• Consumer Processes & Preconditions: Other processes that are waiting for
certain conditions to be met (their preconditions) continuously (or periodically)
check the tuple space for tokens that satisfy their requirements.
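
A minimal sketch of weak coordination, under the assumption that a shared blocking queue stands in for the societal service; a real tuple space (e.g., Linda or JavaSpaces) would also support pattern matching on token contents. The producer deposits a token describing its postcondition; the consumer blocks until a token satisfying its precondition appears.

import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

public class WeakCoordinationDemo {
    // Stand-in for the societal service (tuple space): a passive, shared channel.
    static final LinkedBlockingQueue<Map<String, String>> tupleSpace = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Producer: finishes its task and publishes a token carrying its postcondition.
        new Thread(() -> tupleSpace.add(Map.of("postcondition", "image.resized", "object", "img-17"))).start();

        // Consumer: waits for a token matching its precondition, then runs.
        Map<String, String> token = tupleSpace.take();            // blocks until a token is available
        if ("image.resized".equals(token.get("postcondition"))) {
            System.out.println("precondition satisfied, processing " + token.get("object"));
        }
    }
}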
•Static Workflows: Activity graph remains fixed during enactment.
•Dynamic Workflows: Activity graph can be modified during
enactment, posing challenges in:
•Integrating workflow and resource management for cost optimization.
•Guaranteeing consistency after changes.
•Creating dynamic workflows.
•WFDL is suitable for static workflows; dynamic workflows need more flexible
approaches.
Coordination based on a state machine model—ZooKeeper
Cloud Computing Elasticity and Coordination:
• Distributed Environment: Elasticity necessitates distributing computations and data across multiple
systems, requiring robust coordination.
• Coordination Needs: Varies by task (data storage, activity orchestration, event blocking, consensus,
error recovery).
• Coordinated Entities: Processes on cloud servers or across multiple clouds.
High Availability through Replication:
• Server Replication: Critical tasks often run on replicated servers for fault tolerance.
• Hot Standby: Backup servers maintain the same state as the primary for seamless failover.
Proxy-Based Coordination:
• Distributed Data Stores: Proxies manage data access
• Proxy Redundancy: Multiple proxies are needed to avoid a single point of failure.
• State Synchronization: All proxies must maintain consistent state for seamless client access during
failures.
• Configuration File Approach (Limitations):
• Advertising Service Example: Coordination via a shared configuration file for various servers.
• Static Nature: Changes require file updates and redistribution.
• No Failure Recovery: Doesn't allow servers to resume their pre-crash state.

• Paxos Consensus for Coordination:


• Definition: Paxos is a complex but powerful algorithm that allows a group of computers to reliably
reach consensus (agree on a single value or decision) even if some of them fail.
• How it solves the problem: Paxos ensures that if one proxy makes a decision (e.g., "show Ad X"), all
other healthy proxies will agree on that decision and apply the same state transition. This is how they
stay synchronized without a static file.
• Implementing a proxy as a deterministic finite-state machine, which transitions based on client
commands, requires synchronization among multiple proxies to ensure they all execute the same
state transitions; this synchronization can be ensured by the Paxos algorithm.
Introducing ZooKeeper

• ZooKeeper: A Real-World Coordination Service. It provides a service that helps your distributed applications coordinate.

• It's like the central nervous system or the air traffic controller for your
distributed applications. It operates as a replicated state machine
itself, ensuring its own consistency so it can then help your
applications be consistent.
Key ZooKeeper Features
• Ensemble/Pack & Leader Election: It's a group of ZooKeeper servers. They elect a
leader (the 'boss') to manage writes. If the boss falls, they elect a new one.
• Replicated Database & Single System Image: Every ZooKeeper server has a copy of all
the coordination data. You can connect to any one, and it feels like you're talking to one
consistent system.
• READs (Fast) vs. WRITEs (Consensus): Reading data is quick from any server. Writing
data goes to the leader, and then the leader gets agreement from a majority of other
servers before confirming.
• Znodes: These are like files/folders. They store small amounts of data (like configuration
settings, locks, group memberships).
• Version Numbers & Timestamps: ZooKeeper keeps track of changes, so you know if
your data is fresh.
• Atomic Operations: Reads and writes are all-or-nothing. No partial updates.
• Watches (Event Notification):
• Analogy: "Like putting a sticky note on a file. If someone changes that file, you get an alert!"
• Purpose: Allows applications to react immediately to changes in coordination data
(e.g., 'If the leader znode disappears, I know I need to re-elect!').
Zookeeper: A Distributed Coordination Service:
•Apache ZooKeeper is a robust, open-source service designed for high-throughput, low-latency
coordination in large-scale distributed systems.
•Its foundational model is based on a deterministic finite-state machine combined with a strong
consensus algorithm, akin to Paxos, to ensure data consistency and reliability.
•ZooKeeper requires installation on multiple servers to form an ensemble. Clients can connect to any
server within this ensemble, experiencing the service as a single, unified entity (as depicted in Fig.
11.4(a)).
•Communication occurs via TCP, with clients sending requests, receiving responses, and setting
watches on events. To maintain operational integrity, clients synchronize their clocks with their
connected server and are designed to detect server failures through TCP timeouts, automatically
reconnecting to another available server. Within the ensemble, servers constantly communicate to
elect a leader, and a consistent database is meticulously replicated across all of them, guaranteeing
service availability as long as a majority of the servers remain operational.
FIGURE 11.4: Zookeeper coordination service. (a) The service provides a single system image; clients can
connect to any server in the pack.
Zookeeper Operation:

•READ: Directed to any server, returns the same consistent result (Figs.
11.4(b) & (c)).
•WRITE: More complex, involves leader election.
•Follower servers forward WRITE requests to the leader.
•Leader uses atomic broadcast for consensus.
•New leader is elected upon failure of the current leader.
FIGURE 11.4: (b) The functional model of Zookeeper service; the replicated database is accessed directly by READ
commands. (c) Processing a WRITE command: (i) a server receiving a command from a client, forwards it to the
leader; (ii) the leader uses atomic broadcast to reach consensus among all followers.
Data Model:
•Hierarchical Namespace: Organised like a file system with paths (Fig. 11.5).
•Znodes: Equivalent to UFS inodes, can store data.
•Metadata: Each znode stores data, version numbers, ACL changes, and timestamps.
•Watch Mechanism: Clients can set watches on znodes to receive notifications on
changes, enabling coordinated updates.
•Versioned Data: Retrieved data includes a version number; updates are stamped with
a sequence number.
•Atomic Read/Write: Data in znodes is read and written entirely and atomically.
•In-Memory Storage with Disk Logging: State is in server memory for speed; updates
logged to disk for recovery; WRITEs serialized to disk before in-memory application.
FIGURE 11.5: ZooKeeper is organized as a shared hierarchical namespace; a name is a sequence of path elements separated by a slash (/).
Zookeeper Guarantees:
•Atomicity: Transactions either fully complete or fail.
•Sequential Consistency: Updates applied in the order received.
•Single System Image: Clients get the same response from any server.
•Update Persistence: Once applied, updates remain until overwritten.
•Reliability: Functions correctly if a majority of servers are operational.
•ZooKeeper effectively implements the finite-state machine model of coordination,
where znodes store the shared state.
•This foundational service allows developers to build complex, reliable higher-level
coordination primitives like:
•Group Membership (knowing who is active in a cluster).
•Synchronization (distributed locks, barriers).
•Leader Election.
•It is widely used in large-scale distributed applications (e.g., Yahoo's Message
Broker).
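
A brief sketch using the standard ZooKeeper Java client; the ensemble address and the /workers and /config/app znodes are illustrative and assumed to already exist. It shows the two primitives discussed above: an ephemeral znode for group membership/leader election and a watch that fires when coordination data changes.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkMembershipDemo {
    public static void main(String[] args) throws Exception {
        // Connect to any server of the ensemble; the address is a placeholder.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000,
                event -> System.out.println("event: " + event));

        // Announce membership with an ephemeral znode: it disappears automatically if this
        // client's session dies, which is the basis of group membership and leader election.
        zk.create("/workers/worker-1", "10.0.0.5".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Set a watch on a configuration znode; the watcher above fires once when it changes.
        Stat stat = new Stat();
        byte[] config = zk.getData("/config/app", true, stat);
        System.out.println("config version " + stat.getVersion() + ": " + new String(config));
    }
}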
MapReduce programming model
• Cloud Elasticity and Workload Distribution:
•Elasticity Advantage: Cloud allows using a variable number of servers to meet
application cost and timing needs.
•Transaction Processing Example: Front-end distributes transactions to back-
end systems for workload balancing; scales by adding more back-ends.
•Arbitrarily Divisible Load Sharing: Many scientific and engineering applications
can split their workload into numerous small, equal-sized tasks, ideal for cloud
elasticity.
•Data-Intensive Partitioning: Dividing the workload of data-intensive applications
can be complex.
MapReduce for Parallel Data Processing:

•Core Idea: Simple parallel processing for data-intensive tasks with arbitrarily
divisible workloads (Fig. 11.6).
•Phase 1 (Map):
•Split the input data into blocks.
•Assign each block to a separate instance/process.
•Run these instances in parallel to perform computations on their assigned data.
•Phase 2 (Reduce):
•Merge the partial results generated by the individual instances from the Map
phase.
FIGURE 11.6: MapReduce philosophy.
1. An application starts a Master instance and M
worker instances for the Map phase and later R
worker instances for the Reduce phase.
2. The Master partitions the input data in M
segments.
3. Each Map instance reads its input data segment
and processes the data.
4. The results of the processing are stored on the local
disks of the servers where the Map instances run.
5. When all Map instances have finished processing their data, R Reduce instances read the results of the first phase and merge the partial results.
6. The final results are written by Reduce instances to
a shared storage server.
7. The Master instance monitors the Reduce instances,
and when all of them report task completion, the
application is terminated.
MapReduce Programming Model:
• Purpose: Processing and generating large datasets on computing
clusters.
• Transformation:
Converts a set of input <key, value> pairs into a set of output <key, value>
pairs.
• Wide Applicability: Many tasks can be easily implemented using this
model.
Examples:
• URL Access Frequency:
• Map Function: Processes web page request logs, outputs <URL, 1>.
• Reduce Function: Aggregates counts for each URL, outputs <URL, totalcount>.
• Distributed Sort:
• Map Function: Extracts the key from each record, outputs <key, record>.
• Reduce Function: Outputs the <key, record> pairs unchanged (effectively sorting by
key).
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
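
For comparison, a hedged sketch of the same word-count logic written against the Hadoop Java MapReduce API; the Job/driver setup is omitted and the whitespace tokenization is deliberately naive.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit <word, 1> for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts received for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}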
MapReduce Execution Flow:
Let M be the number of Map tasks, R the number of Reduce tasks, and N the
number of systems.
1.Initialization:
1. Input files are split into M chunks (16-64 MB each).
2. N systems are identified for execution.
3. Multiple program copies start: one Master, the rest Workers.
4. Master assigns Map or Reduce tasks to idle Workers (O(M + R) scheduling
decisions).
5. Master keeps O(M × R) worker state vectors in memory (which limits M and R; at the same time, keeping all N systems busy requires M, R ≥ N).
2. Map Phase (Workers):
•Worker reads its assigned input split.
•Parses <key, value> pairs.
•Passes each pair to the user-defined Map function.
•Intermediate <key, value> pairs are buffered in memory.
•Buffered pairs are written to local disk, partitioned into R regions using a
partitioning function.
•Worker reports the locations of these buffered pairs to the Master.
3. Reduce Phase (Workers):
• Master forwards the locations of intermediate data to Reduce Workers.
• Reduce Worker uses RPCs to read the partitioned data from the local disks of
Map Workers.
• After reading all relevant intermediate data, it sorts it by the intermediate keys.
• For each unique intermediate key, the key and its associated values are
passed to the user-defined Reduce function.
• The output of the Reduce function is appended to a final output file.

4. Completion:
• The Master notifies the user program when all Map and Reduce tasks are
finished.
MapReduce Fault Tolerance and Environment:
•Master Fault Tolerance:
•Stores the state (idle, in-progress, completed) and worker identity for each task.
•Pings workers periodically to detect failures.
•Failed worker tasks are reset to idle for rescheduling.
•Master periodically checkpoints its control data for restart upon failure.
•Data Storage: Uses GFS (Google File System) for storage.
•Experimental Environment:
•Commodity hardware: Dual-core x86 CPUs, 2-4 GB RAM, 100-1000 Mbps networking.
•Large clusters: Hundreds to thousands of machines.
•Local Disks: Data stored on IDE disks attached to individual machines.
•Replication: File system uses replication for availability and reliability on unreliable
hardware.
•Data Locality: Input data stored locally to minimize network bandwidth.
The GrepTheWeb application
• GrepTheWeb, an application currently in use at Amazon, serves as a prime example
of the capabilities of cloud computing.
• Its core function is to allow users to define a regular expression and then search the
vast expanse of the web for records that match that pattern.
• The application operates on an extremely large dataset of web records, specifically a
collection of document URLs compiled nightly by the Alexa Web Search system.
• The primary inputs required are the extensive data set and the user's specified regular
expression. The output is the resulting set of records from the dataset that
successfully satisfy the provided regular expression.
• Furthermore, users can interact with the application to monitor the current status of their search process; see Fig. 11.7(a).
FIGURE 11.7: The organization of the GrepTheWeb application. The application uses the Hadoop MapReduce software and four Amazon services: EC2, Simple DB, S3, and SQS. (a) The simplified workflow showing the two inputs, the regular expression and the input records generated by the web crawler; a third type of input are the user commands to report the current status and to terminate the processing. (b) The detailed workflow; the system is based on message passing between several queues; four controller threads periodically poll their associated input queues, retrieve messages, and carry out the required actions.
• The application uses message passing to trigger the activities of multiple controller
threads that launch the application, initiate processing, shutdown the system, and
create billing records.

• GrepTheWeb uses Hadoop MapReduce, an open-source software package that splits a large data set into chunks, distributes them across multiple systems, launches the processing, and, when the processing is complete, aggregates the outputs from various systems into a final result.

• Apache Hadoop is a software library for distributed processing of large data sets
across clusters of computers using a simple programming model.

• The GrepTheWeb workflow, illustrated in Fig. 11.7(b), consists of the following steps:

1. The start-up phase: create the launch, monitor, billing, and shutdown queues and start the corresponding controller threads. Each thread periodically polls its input queue and, when a message is available, retrieves the message, parses it, and takes the required actions (a generic controller-thread sketch follows this list).
2. The processing phase: it is triggered by a StartGrep user request; then a launch
message is enqueued in the launch queue. The launch controller thread picks up the
message and executes the launch task; then, it updates the status and time stamps in
the Amazon Simple DB domain. Lastly, it enqueues a message in the monitor queue
and deletes the message from the launch queue. The processing phase consists of the
following steps:
a. The launch task starts Amazon EC2 instances: it uses an Amazon Machine Image (AMI) with the Java Runtime Environment preinstalled, deploys the required Hadoop libraries, and starts a Hadoop job (run Map/Reduce tasks).
b. Hadoop runs Map tasks on EC2 slave nodes in parallel: a Map task takes files from S3, runs a
regular expression, and writes locally the match results along with a description of up to five
matches; then, the Combine/Reduce task combines and sorts the results and consolidates the
output.
c. Final results are stored on Amazon S3 in the output bucket.
3. The monitoring phase: The monitor controller thread in GrepTheWeb performs the following
actions:
• Retrieves the initial message from the processing phase.
• Validates the status/error information in SimpleDB.
• Executes the monitoring task.
• Periodically checks Hadoop status and updates SimpleDB with status/error and the S3 output file location.
• Enqueues messages for the shutdown and billing queues.
• Deletes the initial message from the monitor queue upon completion.
4. The shutdown phase: the shutdown controller thread retrieves the message from the shutdown
queue and executes the shutdown task that updates the status and time stamps in the Simple DB
domain; finally, it deletes the message from the shutdown queue after processing. The shutdown
phase consists of the following steps:
a. The shutdown task kills the Hadoop processes, terminates the EC2 instances after getting EC2 topology
information from Simple DB, and disposes of the infrastructure.
b. The billing task gets the EC2 topology information, Simple DB usage, S3 file and query input, calculates the
charges, and passes the information to the billing service.
5. The cleanup phase: archives the Simple DB data with user info.
6. User interactions with the system: get the status and output results. The GetStatus is applied to the
service endpoint to obtain the status of the overall system (all controllers and Hadoop) and download
the filtered results from S3 after completion.
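
A generic, illustrative sketch of one controller thread in the style described in step 1; a java.util.concurrent queue stands in for an Amazon SQS queue and the message format is made up, so this is not the GrepTheWeb code itself.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ControllerThread implements Runnable {
    private final BlockingQueue<String> inputQueue;   // stand-in for an SQS input queue
    private final BlockingQueue<String> nextQueue;    // queue of the next phase

    ControllerThread(BlockingQueue<String> inputQueue, BlockingQueue<String> nextQueue) {
        this.inputQueue = inputQueue;
        this.nextQueue = nextQueue;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String message = inputQueue.poll(30, TimeUnit.SECONDS);   // periodic poll
                if (message == null) continue;                            // nothing to do yet
                // ... parse the message, execute the launch/monitor/shutdown/billing task,
                // ... and update the status record (SimpleDB in the real application)
                nextQueue.add("done:" + message);                         // enqueue for the next phase
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> launch = new LinkedBlockingQueue<>();
        BlockingQueue<String> monitor = new LinkedBlockingQueue<>();
        new Thread(new ControllerThread(launch, monitor)).start();
        launch.add("StartGrep:regex=cloud.*computing");   // a user request enters the launch queue
    }
}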
• Multiple S3 files are bundled up and stored as S3 objects to optimize the end-to-
end transfer rates in the S3 storage system.

• Another performance optimization is to run a script and sort the keys, the URL
pointers, and upload them in sorted order in S3.

• Multiple fetch threads are started to fetch the objects.

• This application illustrates the means to create an on-demand infrastructure and run it on a massively distributed system in a manner that enables it to run in parallel and scale up and down, based on the number of users and the problem size.
Hadoop, Yarn, and Tez
• Apache Hadoop is an open-source software framework designed for distributed
storage and processing of massive datasets, built upon the MapReduce programming
model.
• The MapReduce process involves stages such as Map, which processes data items
and produces key-annotated outputs, followed by a local sort, an optional combiner
for partial aggregation, and a shuffle stage to globally organize data by key.
• A Hadoop system fundamentally consists of a MapReduce engine and a database,
often the Hadoop File System (HDFS) seen in Fig. 11.8.
• HDFS is a highly performant, distributed, and replicated file system in Java that
supports data locality, though it is not fully POSIX compliant.
• The Hadoop engine, with a Job Tracker on the master and Task Trackers on slave nodes,
manages job execution and task dispatching, prioritizing data locality for efficiency.
• The Job Tracker dispatches work to Task Trackers, aiming to place tasks near their
data, while the Task Trackers supervise execution using various scheduling algorithms.
HDFS, managed by a Name Node, replicates data (default three replicas) across Data
Nodes and shares location info with the Job Tracker to minimize data movement.
Apache Hadoop Overview:
•Open-source, Java-based framework for distributed storage and processing.
•Handles extremely large data volumes using distributed applications.
•Based on the MapReduce programming model.
•Benefits from extensive community contributions and related Apache projects
(Hive, HBase, etc.).
•Widely adopted across industry, government, and research (Apple, IBM,
Facebook, Twitter, etc.).
MapReduce Programming Model:
•A model for processing large datasets in parallel.
•Stages: Map (process data items, produce key-value pairs), Local Sort (order
data by key), Combiner (optional partial aggregation), Shuffle (redistribute data
globally by key), Reduce (aggregate/process data per key).
Hadoop System Components:
•Consists of a MapReduce engine and a database.
•Database options include Hadoop File System (HDFS), Amazon S3, or CloudStore.

Hadoop Distributed File System (HDFS):


•A highly performant distributed file system written in Java.
•Stores data on commodity machines.
•Provides high aggregate bandwidth across the cluster.
•Portable but not directly mountable or fully POSIX compliant.
•Replicates data across multiple nodes (default 3 replicas).
•Uses a Name Node (master) to manage data distribution and replication.
•Uses Data Nodes (slaves) to store data blocks.
Hadoop Engine Architecture:
•Runs on a multinode cluster (master and slaves).
•Master Node: Has a Job Tracker and a Task Tracker.
•Slave Nodes: Have only a Task Tracker.

Job Tracker: Receives MapReduce jobs, dispatches work to Task Trackers, attempts
to schedule tasks near data location (data locality).
Task Tracker: Supervises the execution of assigned work on a node.
•Supports various scheduling algorithms (e.g., Facebook's fair scheduler, Yahoo's capacity
scheduler).

Data Locality Principle:


•Hadoop emphasizes bringing computations to the data (stored on disk).
•This strategy helps compete with traditional High Performance Computing (HPC).
•Spark extends this by storing data in memory.
Hadoop & Spark Data Processing:
•Bring computations to data on clusters (off-the-shelf components).
•Spark: Extends this by storing data in processor's memory (faster than disk).
•Data locality: Allows Hadoop/Spark to compete with traditional HPC (supercomputers with
high-bandwidth storage/faster networks).

Apache Hadoop Framework Modules:


1.Common: Libraries & utilities for all modules.
2.HDFS (Distributed File System): Stores data on commodity machines; high aggregate
bandwidth.
3.Yarn: Resource management & application scheduling platform.
4.MapReduce: Implementation of the MapReduce programming model.
SQL on Hadoop for Deep Analytics:
• SQL processing is key for gaining insights from large Hadoop data.
• Number of SQL-on-Hadoop systems is increasing.
• Examples of SQL engines on Hadoop: IBM BigSQL, Cloudera Impala,
Pivotal HAWQ.
• These engines:
• Implement a standard language (SQL).
• Compete on performance & extended services.
• Insulate users (applications are portable).
Types of Hadoop-based Systems:
• Native/Hybrid: Pig, Hive, Impala.
• Exploits Hadoop features but uses external DB for queries: Hadapt
(uses PostgreSQL for query fragments, but relies on Hadoop for
scheduling/fault-tolerance).
Apache YARN:
• Apache YARN (Yet Another Resource Negotiator) is a core component of the
Apache Hadoop 2.x (and later) ecosystem. It's often referred to as the "operating
system for Hadoop" because its primary role is to act as a resource manager
and job scheduler for applications running on a Hadoop cluster.
• Before YARN, Hadoop's MapReduce framework was responsible for both resource
management and data processing. YARN was introduced to decouple these two
functions, making Hadoop more flexible, efficient, and capable of supporting a
wider variety of data processing frameworks beyond just MapReduce.
• Core Purpose: Resource Management and Job Scheduling YARN's main goal
is to allocate cluster resources (CPU, memory, disk, network) to various
applications and to schedule their execution. It ensures that multiple applications
can run concurrently on the same Hadoop cluster without interfering with each
other and efficiently share the available resources.
• YARN Architecture: Two Main Components YARN operates with a master-slave
architecture, primarily consisting of:
• ResourceManager (RM - Master): This is the ultimate authority for resource
allocation in the cluster. There's typically one ResourceManager per YARN cluster. Its
responsibilities include:
• Scheduler: Allocates resources (containers) to applications based on various policies (e.g.,
fairness, capacity, FIFO). It doesn't monitor application status or restart failed tasks; that's the job
of the ApplicationMaster.
• ApplicationsManager: Manages the lifecycle of submitted applications, accepts job submissions,
negotiates the first container for the ApplicationMaster, and provides restart capabilities for the
ApplicationMaster on failure.
• NodeManager (NM - Slave): This runs on every data node in the Hadoop cluster. It's
responsible for managing resources on its specific node. Its responsibilities include:
• Container Management: Manages the lifecycle of "containers" (the fundamental unit of resource
allocation in YARN, representing a specific amount of CPU, memory, etc.).
• Resource Monitoring: Monitors resource usage (CPU, memory) of containers on its node.
• Heartbeating: Regularly communicates with the ResourceManager, reporting the health and
resource availability of its node.
• Log Aggregation: Manages application logs on its node.
Application Startup Process (Fig. 11.11):
Application Submission and Execution Flow (Simplified):
1. The user submits an application to the Resource Manager.
2. The Resource Manager invokes the Scheduler and allocates a container for the ApplicationMaster.
3. The Resource Manager contacts the Node Manager where the container will be launched.
4. The Node Manager launches the container.
5. The container executes the ApplicationMaster.
6. The Resource Manager contacts the Node Manager(s) where the tasks of the application will run.
7. Containers for the tasks of the application are created.
8. The ApplicationMaster monitors the execution of the tasks until termination.
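
A hedged sketch of the submission side of this flow using the YARN client API; the application name, queue, ApplicationMaster launch command, and container size are placeholders, and the my.app.ApplicationMaster class is hypothetical.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");
        appContext.setQueue("default");

        // Step 2: describe the container that will run the ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java -Xmx512m my.app.ApplicationMaster"));
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1));   // 1 GB, 1 vCore for the AM container

        // Steps 3-8: the RM picks a NodeManager, which launches the AM container; the AM
        // then requests containers for the application's tasks and monitors them.
        yarnClient.submitApplication(appContext);
        System.out.println("submitted " + appContext.getApplicationId());
    }
}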
Apache Tez:
•An extensible framework for building high-performance batch and interactive
processing. It provides a more efficient and flexible execution engine for complex
analytical applications than traditional MapReduce, leading to significantly faster
query performance for tools like Apache Hive and Pig.
•Runs on YARN-based applications in Hadoop.
•Decomposes jobs into individual tasks, each running as a YARN process.
•Models data processing as a Directed Acyclic Graph (DAG), where vertices are
application logic and edges are data movement.
•Uses a Java API to represent the workflow as a DAG.
•Leverages YARN for resource acquisition.
•Reuses components in the pipeline to avoid redundant operations.
SQL on Hadoop: Pig, Hive, and Impala
•Pre-2009: Parallel SQL DB vendors existed, but none supported SQL queries
alongside MapReduce. Companies like Yahoo & Facebook needed faster, easier SQL
platforms on Hadoop due to extensive data use.
•MapReduce Limitations: Heavyweight, high-latency; lacked native support for
workflows, joins, filtering, aggregation, etc.
•Pig (Yahoo, 2009):
•Dataflow system created to address MapReduce limitations.
•Language: Pig Latin.
•Execution: Parses query, runs series of MapReduce jobs.
•Hive (Facebook):
•Built on MapReduce for SQL on Hadoop (shortest path).
•Language: HiveQL.
•Execution: Parses query, runs series of MapReduce jobs.
•Impala (Cloudera, 2012): Developed as another SQL engine on Hadoop.
Apache Pig:
• Apache Pig is an open-source, high-level platform for analyzing large datasets that
runs on top of Apache Hadoop.

• It provides a high-level data flow language called Pig Latin, which makes it easier to
write complex data transformation and analysis tasks for Hadoop, without having to write
low-level MapReduce Java code.

Apache Pig offers a more procedural, data-flow oriented scripting language.

• Apache Pig provides a powerful and flexible abstraction layer over Hadoop, enabling
users to perform complex data transformations and analysis using a high-level,
procedural language, making it an excellent tool for ETL (Extract, Transform, Load)
operations and data processing workflows where direct control over the data flow is
desired.
Key Components:
•Pig Latin: This is the high-level language used to express data analysis
programs. It's a procedural language that allows users to describe a series of
operations to be performed on data.
•Examples of Pig Latin operators include LOAD, FOREACH, FILTER, GROUP, JOIN, ORDER
BY, STORE, etc.

•Grunt Shell: This is the interactive shell (command-line interface) where users
can write and execute Pig Latin scripts directly or run script files.
•Pig Engine (or Runtime): This component takes Pig Latin scripts, parses them,
optimizes the logical plan, generates a physical plan, and then translates this into
an executable series of MapReduce (or Tez/Spark) jobs.
How it Works (Architecture):
• Script Submission: A user writes a Pig Latin script and submits it to the Pig
engine (e.g., via the Grunt shell).
• Parsing and Optimization: The Pig engine parses the script, checks for
syntax errors, and creates a logical plan. This plan is then optimized (e.g.,
reordering operations, combining small jobs) to make it more efficient.
• Physical Plan Generation: The optimized logical plan is converted into a
physical plan, which outlines the specific MapReduce (or Tez/Spark) jobs
required.
• Execution on Hadoop: The Pig engine submits these generated jobs to the
Hadoop YARN resource manager, which then schedules and executes them
on the Hadoop cluster.
• Data Flow: Data flows through the various operations defined in the Pig Latin
script, with intermediate results often being written to HDFS.
• Result Storage: The final results are typically stored back into HDFS or
another data store.
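
A small sketch of a Pig Latin data flow embedded in Java through PigServer; the input file, schema, and output path are illustrative. Each statement is a high-level operator, and the final store triggers compilation into MapReduce (or Tez) jobs.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for the sketch; ExecType.MAPREDUCE would run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin data flow: load, filter, group, aggregate, store.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (url:chararray, bytes:long);");
        pig.registerQuery("big = FILTER logs BY bytes > 10000;");
        pig.registerQuery("byUrl = GROUP big BY url;");
        pig.registerQuery("counts = FOREACH byUrl GENERATE group AS url, COUNT(big) AS hits;");
        pig.store("counts", "big_requests_per_url");   // compiled into MapReduce (or Tez) jobs
    }
}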
Hive.
Apache Hive is an open-source data warehouse software built on top of Apache
Hadoop.

Its primary purpose is to enable data analysts and developers to query and
analyze large datasets stored in Hadoop Distributed File System (HDFS)
using a familiar SQL-like language called HiveQL (HQL), rather than writing
complex MapReduce Java code.

• These queries are then compiled into MapReduce jobs executed on Hadoop. The
system includes the Metastore, a catalog for schemas, and the statistics used for
query optimization.
Data Organization & Storage:
• Supports data organized as Tables, Partitions, and Buckets.
• Tables:
• Inherited from relational databases.
• Can be stored internally or externally (in HDFS, NFS, local directory).
• Serialized and stored in files within an HDFS directory.
• Serialization format is stored and accessed in the system catalog (Metastore).
• Partitions:
• Components of tables.
• Described by subdirectories within the table's directory.
• Buckets:
• Consist of partition data.
• Stored as a file within the partition's directory.
• Selection based on the hash of a table column.
HiveQL Compiler Components:

1.Parser: Transforms input string into a parse tree.

2.Semantic Analyzer: Transforms parse tree into an internal representation.

3.Logical Plan Generator: Converts internal representation into a Logical Plan.

4.Optimizer: Rewrites the logical plan.


HiveQL Query Language:

• Accepts DDL, DML, and user-defined MapReduce scripts as input.

• DDL (Data Definition Language):

• Syntax for defining data structures (like database schemas).

• DML (Data Manipulation Language):


• Used to retrieve, store, modify, delete, insert, update data.
• Examples: SELECT, UPDATE, INSERT statements.

• Supports user-defined column transformation and aggregation functions


(implemented in Java).

• Allows user-defined MapReduce scripts (written in any language) using a simple


row-based streaming interface.
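
A hedged example of HiveQL issued over the standard HiveServer2 JDBC interface; the host, credentials, and the page_views table are hypothetical. The DDL statement defines a partitioned table, and the DML query is compiled into MapReduce (or Tez) jobs by the HiveQL compiler described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are illustrative.
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = con.createStatement()) {

            // DDL: define a partitioned table over files stored in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                       + "(url STRING, user_id BIGINT) PARTITIONED BY (dt STRING)");

            // DML: a HiveQL query; the compiler turns it into a series of batch jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS views FROM page_views "
              + "WHERE dt = '2024-01-01' GROUP BY url ORDER BY views DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("url") + " " + rs.getLong("views"));
            }
        }
    }
}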
Impala

•Impala is an open-source, massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. Developed by Cloudera, its primary goal is to provide low-latency, interactive SQL queries on large datasets stored in systems like HDFS, Apache Kudu, and Apache HBase, without requiring data movement or transformation.
• Interactive Query Performance: Unlike traditional MapReduce, which is
optimized for batch processing and can have high latency, Impala is designed
for speed. It achieves this through:
• In-memory processing: It performs computations directly in memory,
minimizing disk I/O.
• Direct-read from HDFS: It bypasses MapReduce, reading data directly from
HDFS.
• C++ implementation: Its native C++ implementation offers better
performance than Java-based alternatives for CPU-intensive tasks.
• MPP Architecture: Queries are distributed and executed in parallel across all
nodes in the cluster.
• Columnar Storage Awareness: It works efficiently with columnar file formats, which optimize analytical queries by reading only the necessary columns.
Core Daemons:
1.Impalad:
Deployed on every server. It receives queries, coordinates query execution, and processes data.
2.Statestored:
Provides a metadata publish-subscribe service. Runs on a single node and monitors the health
of all impalad daemons. It broadcasts metadata and coordination messages to all impalad
instances.
3.Catalogd:
Runs on a single node and propagates metadata changes (like CREATE TABLE, ALTER
TABLE) from the Hive Metastore to all impalad daemons, ensuring consistency.
Supported SQL:
•Bulk insertions: INSERT INTO ... SELECT ....
•Does NOT support UPDATE or DELETE.

Architecture and Deployment:


•Deployment: Impala code installed and runs on every node in a Cloudera cluster.
•Co-location: Runs alongside other engines (MapReduce, HBase, Spark, SAS) on nodes.
•Data Access: All engines access the same data.
•Query Execution: Uses long-running daemons on every HDFS DataNode.
•Pipelining: Intermediate results are pipelined between computation stages.
• Impala is ideal for scenarios requiring fast, interactive queries on large datasets,
such as:
• Ad-hoc data exploration and analysis: For business analysts and data scientists
to quickly explore raw data.
• Business Intelligence (BI) dashboards: Powering real-time or near real-time
dashboards that query large data volumes.
• Interactive data warehousing: As a query layer for big data warehouses.
• Data exploration for machine learning: Preparing and exploring data for machine
learning models.
In essence, Impala serves as a powerful, low-latency SQL engine that brings
traditional relational database query performance to the world of big data on
Hadoop.
