Module - 5
Case study: the GrepTheWeb application, Hadoop, YARN, and Tez; SQL on Hadoop:
Pig, Hive, and Impala.
Challenges with Traditional Large-Scale Computing (Before
Cloud):
• Difficult Application Development: Creating efficient data and computationally
intensive applications was complex.
• System Suitability & Scheduling Issues: Locating appropriate systems, determining
run times, and estimating completion times were problematic.
• Portability Challenges: Moving applications between systems was often difficult,
with performance variations.
• Inefficient Resource Management (Provider): System resources were poorly utilized,
hindering QoS guarantees.
• Operational Complexities (Provider): Handling dynamic loads, security, and rapid
failure recovery at scale was challenging.
• Low Resource Utilization: Economic benefits of resource concentration were
negated by underutilized resources.
Impact of Cloud Computing (The Solution):
• Simplified Application Development: Developers can work in familiar environments with just-in-
time infrastructure.
•Location Independence: Developers don't need to worry about where their applications will run.
•Elasticity: Applications can seamlessly scale to handle varying workloads.
•Parallelization Benefits: Workloads can be partitioned and run concurrently for significant
speedups (useful for CAD, complex modeling).
•Enterprise Focus: Cloud computing primarily targets enterprise computing, unlike grid computing's
scientific/engineering focus.
•Simplified Administration: Cloud resources are within a single administrative domain (advantage
over grid).
•Improved Resource Utilization (Provider): Cloud leads to more efficient use of computing
resources.
•Framework Accommodation: Cloud infrastructure efficiently supports and shares resources
among frameworks like MapReduce.
The future of cloud computing hinges on utility computing providers effectively demonstrating the
benefits of network-centric computing and content to a wider user base by delivering satisfactory
security, scalability, reliability, and QoS, and by meeting SLA requirements.
Cloud Application development challenges:
•Imbalance of Resources: Inherent mismatch between computing, I/O, and communication bandwidth
is amplified in the cloud.
•Scalability & Distribution: Cloud scale and distributed nature exacerbate resource imbalance for data-
intensive apps.
•Manual Optimization: Developers must still optimize data storage, locality (spatial/temporal), and
minimize inter-instance/thread communication despite auto-distribution efforts.
•Workload Partitioning: Utilizing the scalability of the cloud requires the workload to be arbitrarily
divisible and parallelizable.
•Performance Isolation Issues: Shared infrastructure makes true performance isolation nearly
impossible, leading to VM performance fluctuations.
•Security Isolation Challenges: Maintaining security in multitenant cloud environments is difficult.
•Reliability Concerns: Frequent server failures are expected due to the large number of commodity
components.
•Instance Selection Complexity: Choosing the optimal instance type involves trade-offs in
performance, reliability, security, and cost.
•Multi-Stage Application Management: Ensuring efficiency, consistency, and communication
scalability across parallel instances in multi-stage applications is crucial.
•Network Variability: Cloud infrastructure exhibits latency and bandwidth fluctuations affecting
application performance, especially data-intensive ones.
•Data Storage Optimization: Careful analysis of data storage organization, location, and
bandwidth is critical for application performance.
•Metadata Management: Storing and accessing metadata efficiently, scalably, and reliably is
important for data-intensive applications.
•Logging Trade-offs: Balancing performance limitations with the need for sufficient logging for
debugging and analysis is challenging. Logs often require specific preservation strategies.
•Software Licensing: Software licensing in the cloud environment presents ongoing challenges.
Cloud application architectural styles.
Stateless Servers: A stateless server does not require a client to first establish a connection
to the server; instead, it views a client request as an independent transaction and responds
to it.
• Request-Response: Cloud apps heavily use request-response between clients and stateless
servers.
• Advantages:
• Simplified Recovery: Server failures don't impact clients during requests.
• Simplicity & Robustness: Easier to manage and more resilient.
• Scalability: Doesn't require reserving resources per connection.
• Client Independence: Clients don't track server state.
Protocols & Communication:
• HTTP: A stateless request-response application protocol (used by browsers).
• Uses TCP (reliable, connection-oriented transport).
• TCP can be vulnerable to DoS attacks (connection establishment flooding).
• Basic web servers and browsers are stateless.
• RPCs (Remote Procedure Calls): Common for client-server communication in the cloud.
• SOAP (Simple Object Access Protocol): XML-based message format for web applications.
• Can use TCP or UDP as transport and can be layered over HTTP, SMTP, or JMS.
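To make the stateless request-response exchange concrete, here is a minimal Python sketch using only the standard library; the endpoint URL and job identifier are placeholders, not part of any real service.

import urllib.request

def fetch_status(base_url, job_id):
    # A stateless exchange: all context (the job id) travels inside the request,
    # so the server needs no memory of earlier requests from this client.
    url = f"{base_url}/status?job={job_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8")

# Hypothetical endpoint; each call is an independent transaction.
print(fetch_status("http://example.com", "42"))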
(vi) Attributes: Resource needs, responsible actors, security, reversibility, other characteristics.
(vii) Exceptions: Information on handling abnormal events as <event, action> pairs (anticipated
exceptions). Unanticipated exceptions trigger replanning (process restructuring).
Task Hierarchy and Control Flow:
•Composite Task: A structured task composed of a subset of other tasks, defining their execution
order.
•Inherits workflow properties: Contains tasks, has one start symbol, may have multiple end symbols.
•Inherits task properties: Has a name, preconditions, and postconditions.
•Primitive Task: A basic task that cannot be further broken down.
•Routing Task: A specialized task that links two tasks in a workflow description, controlling the flow
of execution.
•Predecessor Task: The task that has just finished executing.
•Successor Task: The task that will be initiated next.
•Execution Control: Routing tasks can trigger sequential, concurrent, or iterative execution.
Types of Routing Tasks:
• Fork Routing Task: Triggers the execution of multiple successor tasks concurrently.
Possible semantics include:
(i) All Enabled: All successor tasks are initiated simultaneously.
(ii) Condition-Based (Multiple Enabled): Each successor has a condition; tasks with true conditions are enabled.
(iii) Condition-Based (Single Enabled - XOR): Each successor has a mutually exclusive condition; only the task
with the single true condition is enabled.
(iv) Nondeterministic: A random selection of k out of n successor tasks are enabled (n > k).
• Join Routing Task: Waits for the completion of its predecessor tasks before enabling a
successor task. Possible semantics include:
• (i) AND Join (All Complete): The successor task is enabled only after all predecessor tasks have finished
execution.
• (ii) N out of M Join: The successor task is enabled after a specific number (N) out of the total number of
predecessor tasks (M) have completed (M > N).
• (iii) Iterative Join: The tasks located between a corresponding fork and this join routing task are executed
repeatedly.
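A minimal Python sketch (not from the text) of a fork routing task with "all enabled" semantics followed by an AND join, using the standard concurrent.futures module; the task names mirror the A, B, C examples above.

from concurrent.futures import ThreadPoolExecutor

def task(name):
    # Placeholder for the real work of a primitive task.
    return f"{name} done"

with ThreadPoolExecutor() as pool:
    # Fork routing task, "all enabled" semantics: successors B and C start together.
    futures = [pool.submit(task, name) for name in ("B", "C")]
    # Join routing task, AND semantics: wait for every predecessor to complete.
    results = [f.result() for f in futures]

print(results)      # ['B done', 'C done']
print(task("D"))    # the successor D runs only after the join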
Process descriptions and cases.
Process Descriptions (Workflow Schemas):
A process description, also called a workflow schema, is a structure describing the tasks or activities to be
executed and the order of their execution; a process description contains one start symbol and one end
symbol.
• Specification: Can be defined using a Workflow Definition Language (WFDL) that supports constructs for:
• Choice (e.g., XOR split/join)
• Concurrent execution (e.g., AND split)
• Classical fork and join constructs
• Iterative execution
• Phases:
• Creation: Initial design of the workflow.
• Definition: Formal specification of the workflow (using a WFDL).
• Verification: Checking the workflow definition for correctness and consistency (analogous to
program syntax checking/compilation).
• Enactment: The actual execution of the workflow (analogous to running a compiled program).
Figure: 11.1 Workflows and programs. (a) The life-cycle of a workflow. (b) The life-cycle of a computer program.
Workflow Cases (Workflow Instances):
• Creation & Termination: Start and end symbols enable case instantiation and completion.
• State of a Case: Defined by completed tasks at a given time. Tracking state with concurrent
activities is complex.
• Alternative Description: Transition system showing possible paths from start to goal state.
• Planning (Goal-Oriented): System can generate a workflow description to reach a specified goal
state.
• State Space: Includes initial and goal states; a case is a specific path.
Alternative Workflow Description (Transition System):
Transition System: Describes all possible sequences of states from the initial state to the desired
goal state.
AI Planning for Workflow Generation:
• Goal-Oriented Approach: Instead of a direct process description, only the desired goal state is specified.
• Automatic Generation: The system automatically creates a workflow description (sequence of tasks) to
reach the goal.
• Knowledge Required: The system needs to know the available tasks and their associated preconditions and
postconditions.
• AI Planning: This automated workflow creation is a core concept in Artificial Intelligence planning.
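A toy illustration of goal-oriented workflow generation, assuming each task is described only by precondition and postcondition sets; the task names and conditions are invented for the example.

# Toy forward search: given tasks with pre/postconditions, build a task
# sequence (workflow description) that transforms the initial state into one
# that satisfies the goal conditions.
TASKS = {
    "fetch":   ({"job-submitted"}, {"data-available"}),
    "process": ({"data-available"}, {"results-computed"}),
    "report":  ({"results-computed"}, {"report-stored"}),
}

def plan(initial, goal):
    state, workflow = set(initial), []
    while not goal <= state:
        for name, (pre, post) in TASKS.items():
            if name not in workflow and pre <= state:
                workflow.append(name)     # task is applicable: add it to the plan
                state |= post             # its postconditions now hold
                break
        else:
            return None                   # no applicable task: goal unreachable
    return workflow

print(plan({"job-submitted"}, {"report-stored"}))  # ['fetch', 'process', 'report']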
State Space and Cases:
• State Space: Encompasses the initial state and the final goal state of the process.
• Transition System Mapping: The transition system outlines all feasible pathways within this state space.
• Workflow Case as a Path: Each specific execution of the workflow (a "case") corresponds to a unique path
through the transition system.
• Case State Tracking: The progress of a particular workflow execution is tracked by the sequence of states
visited along its path.
Requirements for Process Description Languages:
• Unambiguity: The language should have a clear and precise syntax and semantics to
avoid misinterpretations.
• Verifiability: The language should allow for the formal verification of the process
description before any actual execution (enactment). This helps in:
• Detecting potential errors or flaws in the workflow design early on.
• Checking for desirable properties like safety and liveness.
Importance of Verification:
• A process description might execute correctly in some scenarios but fail in others.
• Enactment failures can be expensive and disruptive.
• Thorough verification during the process definition phase is crucial to prevent these
failures.
• Different process description methods have varying degrees of suitability for
verification.
Desirable Workflow Properties
• Safety: Ensures that no undesirable or "bad" situations occur during the workflow
enactment.
FIGURE 11.2: (b) Tasks A and B need exclusive access to two resources, r and q; a deadlock may occur if the
following sequence of events occurs: at time t1 task A acquires r; at time t2 task B acquires q and continues to
run; at time t3 task B attempts to acquire r and blocks because r is under the control of A; task A continues to
run and at time t4 attempts to acquire q, blocking because q is under the control of B.
Resource Deadlocks During Enactment (A Cautionary Note):
• Even if a process description is inherently live (guaranteed to eventually
complete), actual execution can be hindered by resource deadlocks.
• Deadlock Scenario (Fig. 11.2(b)):
• Concurrent tasks A and B both need exclusive access to resources r and q.
• Task A acquires r at time t1.
• Task B acquires q at time t2.
• Task B tries to acquire r at t3 (blocked by A).
• Task A tries to acquire q at t4 (blocked by B).
• This creates a deadlock where neither task can proceed.
Deadlock Avoidance Strategy:
• Acquire All Resources Simultaneously: A task requests all necessary
resources at once. Trade-off: This strategy can lead to resource
underutilization as resources remain idle while a task waits to acquire all its
requirements.
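A short Python threading sketch of the "acquire all resources simultaneously" strategy; the resource names r and q follow Fig. 11.2(b), and the guard lock is an illustrative way to make the two acquisitions atomic so the deadlocking interleaving cannot occur.

import threading

r, q = threading.Lock(), threading.Lock()
guard = threading.Lock()          # serializes acquisition of the full resource set

def run_task(name):
    with guard:                   # acquire r and q together, never one at a time
        r.acquire()
        q.acquire()
    try:
        print(f"task {name} holds r and q")
    finally:
        r.release()
        q.release()

a = threading.Thread(target=run_task, args=("A",))
b = threading.Thread(target=run_task, args=("B",))
a.start(); b.start(); a.join(); b.join()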
Workflow patterns
• The term workflow pattern refers to the temporal relationships among the tasks of a
process.
• Workflow description languages and enactment mechanisms must support these
relationships.
• Classified into categories (basic, advanced branching/synchronization, structural,
state-based, cancellation, multiple instances).
• Basic Workflow Patterns (Fig. 11.3)
•Sequence (Fig. 11.3(a)): Tasks execute one after the other, sequentially (A → B → C).
•AND Split (Fig. 11.3(b)): Task A's completion triggers the concurrent execution of multiple
tasks (A → B & C).
•Explicit: Uses a routing node to activate all connected tasks.
•Implicit: Direct connections with conditions; tasks activate only if their branch condition is
true.
•Synchronization (Fig. 11.3(c)): A task starts only after all preceding concurrent tasks have
completed (A & B → C).
•XOR Split (Exclusive OR) (Fig. 11.3(d)): Task A's completion leads to the activation of only
one of the subsequent tasks (A → either B or C), based on a decision.
Fig: 11.3 Basic workflow patterns. (a) Sequence; (b) AND split; (c) Synchronization; (d) XOR split; (e) XOR merge; (f)
OR split
•XOR Join (Exclusive OR Join) (Fig. 11.3(e)): Task C is activated upon the completion of either task
A or task B.
•OR Split (Inclusive OR Split) (Fig. 11.3(f)): After task A completes, one or more of the subsequent
tasks (B and/or C) can be activated.
•Multiple Merge (Fig. 11.3(g)): Allows a task (D) to be activated multiple times based on the
completion of concurrent tasks (B and C). The first completion of either B or C triggers D, and the
subsequent completion of the other triggers D again. No explicit synchronization is required for D to
start the first time.
•Discriminator (Fig. 11.3(h)): Task D is activated after a specific number of incoming branches (from
A, B, or C) complete (in this case, the first one). It then waits for the remaining branches to complete
without further action until all are done, after which it resets.
•N out of M Join (Fig. 11.3(i)): Task E is enabled once a specific number (N) out of a set of
concurrent tasks (M) have completed. In the example, E starts after any two of the three tasks (A,
B, C) finish; a short sketch of this pattern follows the figure caption below.
•Deferred Choice (Fig. 11.3(j)): Similar to an XOR split, but the decision of which branch to take (to
B or C after A) is made by the runtime environment, not explicitly defined in the workflow.
Fig: 11.3 Basic workflow patterns (continued).
(g) Multiple Merge; (h) Discriminator; (i) N out of M join; (j) Deferred Choice.
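For instance, the N out of M join can be sketched in Python with concurrent.futures by reacting to the first N completions; the task bodies below are placeholders.

# N out of M join: the successor (E) is enabled once N of the M concurrent
# tasks have completed; here M = 3 tasks (A, B, C) and N = 2.
import concurrent.futures as cf
import random, time

def work(name):
    time.sleep(random.random())   # placeholder workload
    return name

N = 2
with cf.ThreadPoolExecutor() as pool:
    futures = [pool.submit(work, t) for t in ("A", "B", "C")]
    done = []
    for f in cf.as_completed(futures):
        done.append(f.result())
        if len(done) == N:
            print(f"{done} completed; successor E is enabled")
            break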
Goal State Reachability:
We analyze whether a goal state (σgoal) can be reached from an initial state (σinitial)
within a system (Σ). The analysis considers:
• Process Group (P): A set of processes {p1,p2,...,pn}, where each process pi has:
• Preconditions: pre(pi) (conditions that must be true before execution).
• Postconditions: post(pi) (conditions that are true after execution).
• Attributes: atr(pi) (characteristics like resource needs).
• Workflow (A or Π):
• Represented by a directed activity graph (A) where nodes are processes from P, and edges show
precedence.
• Alternatively, a procedure (Π) can construct A given the process group, initial state, and goal state
(<P,σinitial,σgoal>).
• Precedence rule: pi → pj implies that the preconditions of pj are a subset of the postconditions of
pi, i.e., pre(pj) ⊆ post(pi); see the sketch after this list.
• Constraints (C): A set of conditions {C1,C2,...,Cm} that must be satisfied.
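A minimal sketch of the precedence-rule check mentioned above, assuming each process is described by precondition and postcondition sets; the process names and conditions are invented.

# Check the precedence rule pi -> pj: pre(pj) must be a subset of post(pi).
post = {"p1": {"data-staged", "checksum-ok"}}
pre  = {"p2": {"data-staged"}}

def valid_edge(pi, pj):
    return pre[pj] <= post[pi]      # set inclusion: pre(pj) ⊆ post(pi)

print(valid_edge("p1", "p2"))       # True: p2 may follow p1 in the activity graph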
Workflow Coordination and Enactment:
•Coordination Problem: Reaching σgoal from σinitial via Pfinal's postconditions, satisfying
constraints Ci, where σinitial enables Pinitial's preconditions. This implies a process chain
where one process's output feeds the next.
•Process Components:
•Preconditions: Triggering conditions/events or required input data.
•Postconditions: Results produced by the process.
•Attributes: Special requirements or properties.
Enactment Models:
Strong Coordination:
•This approach relies on a central entity to manage and direct the flow of tasks (P).
It's analogous to a conductor leading an orchestra.
• ZooKeeper, described next, is a widely used central coordination service of this kind. It is like the
central nervous system or the air traffic controller for distributed applications: it operates as a
replicated state machine itself, ensuring its own consistency so that it can, in turn, help
applications remain consistent.
Key ZooKeeper Features
• Ensemble/Pack & Leader Election: A group of ZooKeeper servers that elect a leader (the 'boss')
to manage writes; if the leader fails, they elect a new one.
• Replicated Database & Single System Image: Every ZooKeeper server has a copy of all
the coordination data. You can connect to any one, and it feels like you're talking to one
consistent system.
• READs (Fast) vs. WRITEs (Consensus): Reading data is quick from any server. Writing
data goes to the leader, and then the leader gets agreement from a majority of other
servers before confirming.
• Znodes: These are like files/folders. They store small amounts of data (like configuration
settings, locks, group memberships).
• Version Numbers & Timestamps: ZooKeeper keeps track of changes, so you know if
your data is fresh.
• Atomic Operations: Reads and writes are all-or-nothing. No partial updates.
• Watches (Event Notification):
• Analogy: "Like putting a sticky note on a file. If someone changes that file, you get an alert!"
• Purpose: Allows applications to react immediately to changes in coordination data
(e.g., 'If the leader znode disappears, I know I need to re-elect!').
Zookeeper: A Distributed Coordination Service:
•Apache ZooKeeper is a robust, open-source service designed for high-throughput, low-latency
coordination in large-scale distributed systems.
•Its foundational model is based on a deterministic finite-state machine combined with a strong
consensus algorithm, akin to Paxos, to ensure data consistency and reliability.
•ZooKeeper requires installation on multiple servers to form an ensemble. Clients can connect to any
server within this ensemble, experiencing the service as a single, unified entity (as depicted in Fig.
11.4(a)).
•Communication occurs via TCP, with clients sending requests, receiving responses, and setting
watches on events. To maintain operational integrity, clients synchronize their clocks with their
connected server and are designed to detect server failures through TCP timeouts, automatically
reconnecting to another available server. Within the ensemble, servers constantly communicate to
elect a leader, and a consistent database is meticulously replicated across all of them, guaranteeing
service availability as long as a majority of the servers remain operational.
FIGURE 11.4: Zookeeper coordination service. (a) The service provides a single system image; clients can
connect to any server in the pack.
Zookeeper Operation:
•READ: Directed to any server, returns the same consistent result (Figs.
11.4(b) & (c)).
•WRITE: More complex, involves leader election.
•Follower servers forward WRITE requests to the leader.
•Leader uses atomic broadcast for consensus.
•New leader is elected upon failure of the current leader.
FIGURE 11.4: (b) The functional model of Zookeeper service; the replicated database is accessed directly by READ
commands. (c) Processing a WRITE command: (i) a server receiving a command from a client, forwards it to the
leader; (ii) the leader uses atomic broadcast to reach consensus among all followers.
Data Model:
•Hierarchical Namespace: Organised like a file system with paths (Fig. 11.5).
•Znodes: Equivalent to UFS inodes, can store data.
•Metadata: Each znode stores data, version numbers, ACL changes, and timestamps.
•Watch Mechanism: Clients can set watches on znodes to receive notifications on
changes, enabling coordinated updates.
•Versioned Data: Retrieved data includes a version number; updates are stamped with
a sequence number.
•Atomic Read/Write: Data in znodes is read and written entirely and atomically.
•In-Memory Storage with Disk Logging: State is in server memory for speed; updates
logged to disk for recovery; WRITEs serialized to disk before in-memory application.
FIGURE 11.5: Zookeeper data is organized as a shared hierarchical namespace; a name is a sequence of path
elements separated by a slash (/).
Zookeeper Guarantees:
•Atomicity: Transactions either fully complete or fail.
•Sequential Consistency: Updates applied in the order received.
•Single System Image: Clients get the same response from any server.
•Update Persistence: Once applied, updates remain until overwritten.
•Reliability: Functions correctly if a majority of servers are operational.
•ZooKeeper effectively implements the finite-state machine model of coordination,
where znodes store the shared state.
•This foundational service allows developers to build complex, reliable higher-level
coordination primitives like:
•Group Membership (knowing who is active in a cluster).
•Synchronization (distributed locks, barriers).
•Leader Election.
•It is widely used in large-scale distributed applications (e.g., Yahoo's Message
Broker).
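As an illustration only, a small example using the third-party kazoo Python client (assumed installed, with an ensemble reachable at 127.0.0.1:2181); it registers an ephemeral znode and sets a watch, the building blocks of group membership and leader election.

# Minimal kazoo sketch: ephemeral znodes + a watch. Assumes ZooKeeper is running
# at 127.0.0.1:2181 and that the kazoo package is installed (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                     # connect to any server in the ensemble

zk.ensure_path("/app/workers")                 # znodes form a hierarchical namespace

# Ephemeral + sequential znode: it disappears automatically if this client dies,
# so the remaining members can detect the failure and, e.g., re-elect a leader.
me = zk.create("/app/workers/member-", b"host-1", ephemeral=True, sequence=True)

@zk.ChildrenWatch("/app/workers")
def on_membership_change(children):
    # Called whenever the set of live members changes (the "sticky note" watch).
    print("current members:", sorted(children))

print("registered as", me)
# ... application work ...
zk.stop()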
MapReduce programming model
• Cloud Elasticity and Workload Distribution:
•Elasticity Advantage: Cloud allows using a variable number of servers to meet
application cost and timing needs.
•Transaction Processing Example: Front-end distributes transactions to back-
end systems for workload balancing; scales by adding more back-ends.
•Arbitrarily Divisible Load Sharing: Many scientific and engineering applications
can split their workload into numerous small, equal-sized tasks, ideal for cloud
elasticity.
•Data-Intensive Partitioning: Dividing the workload of data-intensive applications
can be complex.
MapReduce for Parallel Data Processing:
•Core Idea: Simple parallel processing for data-intensive tasks with arbitrarily
divisible workloads (Fig. 11.6).
•Phase 1 (Map):
•Split the input data into blocks.
•Assign each block to a separate instance/process.
•Run these instances in parallel to perform computations on their assigned data.
•Phase 2 (Reduce):
•Merge the partial results generated by the individual instances from the Map
phase.
FIGURE 11.6: MapReduce philosophy.
1. An application starts a Master instance and M
worker instances for the Map phase and later R
worker instances for the Reduce phase.
2. The Master partitions the input data in M
segments.
3. Each Map instance reads its input data segment
and processes the data.
4. The results of the processing are stored on the local
disks of the servers where the Map instances run.
5. When all Map instances have finished processing their
data, the R Reduce instances read the results of the
first phase and merge the partial results.
6. The final results are written by Reduce instances to
a shared storage server.
7. The Master instance monitors the Reduce instances,
and when all of them report task completion, the
application is terminated.
MapReduce Programming Model:
• Purpose: Processing and generating large datasets on computing
clusters.
• Transformation:
Converts a set of input <key, value> pairs into a set of output <key, value>
pairs.
• Wide Applicability: Many tasks can be easily implemented using this
model.
Examples:
• URL Access Frequency:
• Map Function: Processes web page request logs, outputs <URL, 1>.
• Reduce Function: Aggregates counts for each URL, outputs <URL, totalcount>.
• Distributed Sort:
• Map Function: Extracts the key from each record, outputs <key, record>.
• Reduce Function: Outputs the <key, record> pairs unchanged (effectively sorting by
key).
• Word Count (pseudocode):

map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
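A self-contained Python simulation of the same word-count computation is sketched below; it imitates the Map, shuffle, and Reduce phases in a single process and illustrates only the programming model, not a distributed runtime.

# In-process simulation of MapReduce word count: map, group by key, reduce.
from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield (word, 1)                      # EmitIntermediate(w, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))               # Emit(total count)

def map_reduce(documents):
    intermediate = defaultdict(list)         # shuffle: group values by key
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

docs = {"d1": "the cloud runs the job", "d2": "the job runs"}
print(map_reduce(docs))   # {'the': 3, 'cloud': 1, 'runs': 2, 'job': 2}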
MapReduce Execution Flow:
Let M be the number of Map tasks, R the number of Reduce tasks, and N the
number of systems.
1. Initialization:
•Input files are split into M chunks (16-64 MB each).
•N systems are identified for execution.
•Multiple program copies start: one Master, the rest Workers.
•The Master assigns Map or Reduce tasks to idle Workers (O(M + R) scheduling decisions).
•The Master keeps O(M × R) worker state vectors in memory; this limits the size of M and R, while
efficiency requires M and R to be much larger than N.
2. Map Phase (Workers):
•Worker reads its assigned input split.
•Parses <key, value> pairs.
•Passes each pair to the user-defined Map function.
•Intermediate <key, value> pairs are buffered in memory.
•Buffered pairs are written to local disk, partitioned into R regions using a partitioning function
(a one-line sketch follows this list).
•Worker reports the locations of these buffered pairs to the Master.
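The partitioning function referred to above is typically a hash of the intermediate key modulo R; a one-line Python sketch (the key and R values are illustrative):

def partition(key, R):
    # All intermediate values for the same key land in the same region,
    # and therefore at the same Reduce worker.
    return hash(key) % R

print(partition("url-17", 4))   # a region index in [0, R)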
3. Reduce Phase (Workers):
• Master forwards the locations of intermediate data to Reduce Workers.
• Reduce Worker uses RPCs to read the partitioned data from the local disks of
Map Workers.
• After reading all relevant intermediate data, it sorts it by the intermediate keys.
• For each unique intermediate key, the key and its associated values are
passed to the user-defined Reduce function.
• The output of the Reduce function is appended to a final output file.
4. Completion:
• The Master notifies the user program when all Map and Reduce tasks are
finished.
MapReduce Fault Tolerance and Environment:
•Master Fault Tolerance:
•Stores the state (idle, in-progress, completed) and worker identity for each task.
•Pings workers periodically to detect failures.
•Failed worker tasks are reset to idle for rescheduling.
•Master periodically checkpoints its control data for restart upon failure.
•Data Storage: Uses GFS (Google File System) for storage.
•Experimental Environment:
•Commodity hardware: Dual-core x86 CPUs, 2-4 GB RAM, 100-1000 Mbps networking.
•Large clusters: Hundreds to thousands of machines.
•Local Disks: Data stored on IDE disks attached to individual machines.
•Replication: File system uses replication for availability and reliability on unreliable
hardware.
•Data Locality: Input data stored locally to minimize network bandwidth.
The GrepTheWeb application
• GrepTheWeb, an application currently in use at Amazon, serves as a prime example
of the capabilities of cloud computing.
• Its core function is to allow users to define a regular expression and then search the
vast expanse of the web for records that match that pattern.
• The application operates on an extremely large dataset of web records, specifically a
collection of document URLs compiled nightly by the Alexa Web Search system.
• The primary inputs required are the extensive data set and the user's specified regular
expression. The output is the resulting set of records from the dataset that
successfully satisfy the provided regular expression.
• Furthermore, users can interact with the application to monitor the current status of their
search process; see Fig. 11.7(a).
FIGURE 11.7: (a) The simplified workflow showing the two inputs, the regular expression and the
input records generated by the web crawler; a third type of input is the user commands to report
the current status and to terminate the processing.
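Conceptually, the heart of each processing step is just applying the user's regular expression to a share of the records. A minimal Python sketch of this idea, with an invented record list (not Amazon's implementation):

import re

def grep_records(pattern, records):
    # Keep only the records that match the user-supplied regular expression.
    rx = re.compile(pattern)
    return [r for r in records if rx.search(r)]

records = ["http://example.com/a.html", "http://example.org/b.pdf"]
print(grep_records(r"\.pdf$", records))   # ['http://example.org/b.pdf']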
• Another GrepTheWeb performance optimization is to run a script that sorts the keys (the URL
pointers) and uploads them to S3 in sorted order.
• Apache Hadoop is a software library for distributed processing of large data sets
across clusters of computers using a simple programming model.
Job Tracker: Receives MapReduce jobs, dispatches work to Task Trackers, attempts
to schedule tasks near data location (data locality).
Task Tracker: Supervises the execution of assigned work on a node.
•Hadoop supports various scheduling algorithms (e.g., Facebook's fair scheduler, Yahoo's capacity
scheduler).
Pig.
• Apache Pig provides a high-level data flow language called Pig Latin, which makes it easier to
write complex data transformation and analysis tasks for Hadoop without having to write
low-level MapReduce Java code.
• Apache Pig provides a powerful and flexible abstraction layer over Hadoop, enabling
users to perform complex data transformations and analysis using a high-level,
procedural language, making it an excellent tool for ETL (Extract, Transform, Load)
operations and data processing workflows where direct control over the data flow is
desired.
Key Components:
•Pig Latin: This is the high-level language used to express data analysis
programs. It's a procedural language that allows users to describe a series of
operations to be performed on data.
•Examples of Pig Latin operators include LOAD, FOREACH, FILTER, GROUP, JOIN, ORDER
BY, STORE, etc.
•Grunt Shell: This is the interactive shell (command-line interface) where users
can write and execute Pig Latin scripts directly or run script files.
•Pig Engine (or Runtime): This component takes Pig Latin scripts, parses them,
optimizes the logical plan, generates a physical plan, and then translates this into
an executable series of MapReduce (or Tez/Spark) jobs.
How it Works (Architecture):
• Script Submission: A user writes a Pig Latin script and submits it to the Pig
engine (e.g., via the Grunt shell).
• Parsing and Optimization: The Pig engine parses the script, checks for
syntax errors, and creates a logical plan. This plan is then optimized (e.g.,
reordering operations, combining small jobs) to make it more efficient.
• Physical Plan Generation: The optimized logical plan is converted into a
physical plan, which outlines the specific MapReduce (or Tez/Spark) jobs
required.
• Execution on Hadoop: The Pig engine submits these generated jobs to the
Hadoop YARN resource manager, which then schedules and executes them
on the Hadoop cluster.
• Data Flow: Data flows through the various operations defined in the Pig Latin
script, with intermediate results often being written to HDFS.
• Result Storage: The final results are typically stored back into HDFS or
another data store.
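As a rough analogy only (Python, not Pig Latin syntax), the LOAD → FILTER → GROUP → FOREACH data flow that a Pig Latin script describes looks like this; the rows and field names are invented.

# Python analogy of a Pig-style data flow: load, filter, group, aggregate, store.
from collections import defaultdict

rows = [("alice", 34), ("bob", 17), ("carol", 41), ("alice", 29)]   # LOAD

adults = [r for r in rows if r[1] >= 18]                            # FILTER

grouped = defaultdict(list)                                         # GROUP BY name
for name, value in adults:
    grouped[name].append(value)

totals = {name: sum(vals) for name, vals in grouped.items()}        # FOREACH ... GENERATE
print(totals)   # {'alice': 63, 'carol': 41}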
Hive.
Apache Hive is an open-source data warehouse software built on top of Apache
Hadoop.
Its primary purpose is to enable data analysts and developers to query and
analyze large datasets stored in Hadoop Distributed File System (HDFS)
using a familiar SQL-like language called HiveQL (HQL), rather than writing
complex MapReduce Java code.
• These queries are then compiled into MapReduce jobs executed on Hadoop. The
system includes the Metastore, a catalog for schemas, and the statistics used for
query optimization.
Data Organization & Storage:
• Supports data organized as Tables, Partitions, and Buckets.
• Tables:
• Inherited from relational databases.
• Can be stored internally or externally (in HDFS, NFS, local directory).
• Serialized and stored in files within an HDFS directory.
• Serialization format is stored and accessed in the system catalog (Metastore).
• Partitions:
• Components of tables.
• Described by subdirectories within the table's directory.
• Buckets:
• Subdivisions of the data of a partition.
• Each bucket is stored as a file within the partition's directory.
• Rows are assigned to buckets based on the hash of a table column.
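A small sketch of how this hierarchy can map to storage paths; the warehouse location, partition value, and bucket-file naming below are illustrative, not Hive's exact conventions.

def hive_path(warehouse, table, partition_col, partition_val, column_value, num_buckets):
    # Partition -> subdirectory named col=value; bucket chosen by hashing a column.
    bucket = hash(column_value) % num_buckets
    return f"{warehouse}/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"

print(hive_path("/user/hive/warehouse", "logs", "ds", "2024-01-01", "user_12345", 32))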
HiveQL Compiler Components: