U-3 Big Data
Hadoop is an open-source framework from Apache that is used to store, process, and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
The Hadoop Distributed File System (HDFS)
is the primary data storage system used by Hadoop applications. HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Hadoop itself is an open source distributed processing framework that manages data processing and storage for big data applications. HDFS is a key part of the many technologies in the Hadoop ecosystem. It provides a reliable means of managing pools of big data and supporting related big data analytics applications.
HDFS architecture, NameNodes and DataNodes
HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the primary server that manages the file
system namespace and controls client access to files. As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and provides clients with the right access permissions. The
system's DataNodes manage the storage that's attached to the nodes they run on.
HDFS exposes a file system namespace and enables user data to be stored in files. A file is split into one or more blocks, which are stored in a set of DataNodes. The NameNode performs file system namespace operations, including opening,
closing and renaming files and directories. The NameNode also governs the mapping of blocks to the DataNodes. The
DataNodes serve read and write requests from the clients of the file system. In addition, they perform block creation,
deletion and replication when the NameNode instructs them to do so.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
The NameNode records any change to the file system namespace or its properties. An application can stipulate the number
of replicas of a file that the HDFS should maintain. The NameNode stores the number of copies of a file, called the
replication factor of that file.
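As a hedged illustration of these ideas, the sketch below uses the Hadoop FileSystem API from Scala to ask the NameNode for a file's block locations and to change its replication factor. The NameNode address and file path are hypothetical, and the snippet assumes hadoop-common is on the classpath.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:9000") // hypothetical NameNode address
    val fs = FileSystem.get(conf)
    val file = new Path("/data/example.txt")          // hypothetical file

    // Ask the NameNode for the block locations of this file.
    val status = fs.getFileStatus(file)
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach(b => println(s"offset=${b.getOffset} hosts=${b.getHosts.mkString(",")}"))

    // Change the replication factor of this file to 2.
    fs.setReplication(file, 2.toShort)
    fs.close()
  }
}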
Features of HDFS
There are several features that make HDFS particularly useful, including:
Data replication. This is used to ensure that the data is always available and prevents data loss. For example, when a
node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within a cluster, so processing
continues while data is recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large cluster
ensures fault tolerance and reliability.
High availability. As mentioned earlier, because of replication across nodes, data remains available even if the NameNode or
a DataNode fails.
Scalability. Because HDFS stores data on various nodes in the cluster, as requirements increase, a cluster can scale to
hundreds of nodes.
High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster
of nodes. This, plus data locality (see the next bullet), cuts the processing time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than having the data
move to where the computational unit is. By minimizing the distance between the data and the computing process, this
approach decreases network congestion and boosts a system's overall throughput.
MapReduce is a programming model used for efficient parallel processing of large data sets in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, providing less overhead over the cluster network and reducing the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing. There can be multiple
clients available that continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which is comprised of so many smaller tasks
that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are
combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
The MapReduce task is mainly divided into 2 phases, i.e., the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may be a key-value pair where the key can be the id of some kind of address and the value is the actual value that it keeps. The Map() function is executed in its memory repository on each of these input key-value pairs and generates the intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job-parts. These job-parts are then made available for the Map and Reduce tasks. The Map and Reduce tasks contain the program for the use case that the particular company is solving, and the developer writes the logic to fulfill the requirement that the industry needs. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e., these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as per the requirement. The algorithms for Map and Reduce are written in a very optimized way so that the time complexity or space complexity is minimal.
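Hadoop MapReduce jobs are normally written in Java; as a language-neutral illustration of the two phases, here is a minimal word-count sketch in Scala that mimics the map, shuffle, and reduce steps over an in-memory collection (the input lines are hypothetical).
object WordCountPhases {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data is big", "hadoop processes big data")

    // Map phase: emit an intermediate (key, value) pair for every word.
    val mapped: Seq[(String, Int)] = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle/sort: group the intermediate pairs by key.
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: aggregate the values for each key.
    val reduced: Map[String, Int] = grouped.map { case (word, counts) => (word, counts.sum) }

    reduced.foreach { case (word, count) => println(s"$word\t$count") }
  }
}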
Fault tolerance refers to the ability of a system to recover from failures and errors without losing data or functionality. Both
Hadoop and Spark are fault tolerant, meaning that they can handle node failures, data corruption, and network issues without
affecting the overall execution. However, Hadoop and Spark have different approaches to fault tolerance, which have trade-
offs and implications. Hadoop relies on data replication and checkpointing to ensure fault tolerance, which means that it
duplicates the data blocks across multiple nodes and periodically saves the state of the computation to disk. This provides
high reliability and durability, but also consumes more disk space and network bandwidth. Spark relies on data lineage and
lazy evaluation to ensure fault tolerance, which means that it tracks the dependencies and transformations of the RDDs and
only executes them when needed. This provides high performance and flexibility, but also requires more memory and
computation power.
Replication ensures the availability of the data. Replication is nothing but making a copy of something, and the number of times you make a copy of that particular thing can be expressed as its Replication Factor. As we have seen with file blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks. By default the Replication Factor for Hadoop is set to 3, which can be configured, meaning you can change it manually as per your requirement. For example, if a file is divided into 4 file blocks, then 3 replicas or copies of each file block are made, so a total of 4 × 3 = 12 blocks are stored for backup purposes.
Consider the example cluster from the accompanying diagram (not reproduced here): a Master with RAM = 64 GB and disk space = 50 GB, and 4 Slaves each with RAM = 16 GB and disk space = 40 GB. You can observe that the RAM of the Master is larger. It needs to be kept larger because the Master is the one that guides the slaves, so the Master has to process fast. Now suppose you have a file of size 150 MB; then the total number of file blocks will be 2, as shown below.
128 MB = Block 1
22 MB = Block 2
As the replication factor is 3 by default, we have 3 copies of each file block:
FileBlock1-Replica1 (B1R1) FileBlock2-Replica1 (B2R1)
FileBlock1-Replica2 (B1R2) FileBlock2-Replica2 (B2R2)
FileBlock1-Replica3 (B1R3) FileBlock2-Replica3 (B2R3)
These blocks are stored across the Slaves as shown in that diagram, which means that if Slave 1 crashes, the replicas it held (for example, B1R1 and B2R3) are lost. But you can recover B1 and B2 from the other slaves, as replicas of these file blocks are already present on them; similarly, if any other Slave crashes, we can obtain its file blocks from some other slave. Replication increases our storage requirement, but the data is more important to us.
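A small worked sketch of this block and replica arithmetic, following the 150 MB example above (the values are illustrative):
object BlockMath {
  def main(args: Array[String]): Unit = {
    val fileSizeMB  = 150   // size of the file
    val blockSizeMB = 128   // default HDFS block size
    val replication = 3     // default replication factor

    val numBlocks     = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt // 2 blocks: 128 MB + 22 MB
    val totalReplicas = numBlocks * replication                            // 2 x 3 = 6 stored block copies

    println(s"blocks=$numBlocks, total stored copies=$totalReplicas")
  }
}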
An HDFS high availability (HA) cluster uses two NameNodes—an active NameNode and a standby NameNode. Only one
NameNode can be active at any point in time. HDFS HA depends on maintaining a log of all namespace modifications in a
location available to both NameNodes, so that in the event of a failure, the standby NameNode has up-to-date information
about the edits and location of blocks in the cluster.
You can use Cloudera Manager to configure your CDH cluster for HDFS HA and automatic failover. In Cloudera Manager,
HA is implemented using Quorum-based storage. Quorum-based storage relies upon a set of JournalNodes, each of which
maintains a local edits directory that logs the modifications to the namespace metadata. Enabling HA enables automatic
failover as part of the same command.
The Enable High Availability workflow leads you through adding a second (standby) NameNode and configuring
JournalNodes.
The key components and steps involved in Hadoop High Availability are as follows:
Active NameNode:
The active NameNode is the primary NameNode that manages client requests and metadata operations, just like in a non-HA
setup.
Standby NameNode:
The standby NameNode is an additional NameNode that continuously replicates and maintains a copy of the metadata
from the active NameNode.
The standby NameNode remains in constant communication with the active NameNode, receiving regular updates to
keep the metadata synchronized.
Quorum Journal Manager (QJM):
The Quorum Journal Manager is a component that stores the edit logs, which contain the transactional changes made to
the HDFS metadata.
The QJM ensures that the edit logs are written to a majority of nodes in the cluster, forming a quorum. This approach
ensures data consistency and prevents data loss even in the event of multiple node failures.
ZooKeeper (optional):
Hadoop HA can optionally use Apache ZooKeeper to manage the failover process between the active and standby
NameNodes.
ZooKeeper is a highly reliable coordination service that helps in selecting the active NameNode and ensuring that only
one NameNode is active at a time.
DATA LOCALITY
The major drawback of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this
drawback, Data Locality in Hadoop came into the picture. Data locality in MapReduce refers to the ability to move the
computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes
network congestion and increases the overall throughput of the system.
In Hadoop, datasets are stored in HDFS. Datasets are divided into blocks and stored across the datanodes in the Hadoop cluster. When a user runs a MapReduce job, the framework (using block-location information from the NameNode) sends the MapReduce code to the datanodes on which the data related to the job is available.
Categories of Data Locality in Hadoop
Below are the various categories in which Data Locality in Hadoop is categorized:
i. Data local data locality in Hadoop
When the data is located on the same node as the mapper working on the data, it is known as data-local data locality. In this case, the data is very close to the computation. This is the most preferred scenario.
ii. Intra-rack data locality in Hadoop
It is not always possible to execute the mapper on the same datanode due to resource constraints. In such a case, it is preferred to run the mapper on a different node but on the same rack.
iii. Inter-rack data locality in Hadoop
Sometimes it is not possible to execute the mapper on a different node in the same rack due to resource constraints. In such a case, we execute the mapper on nodes on different racks. This is the least preferred scenario.
Data locality is the main advantage of Hadoop MapReduce, as the map code is executed on the same data node where the data resides. However, this is not always achievable in practice, due to reasons such as speculative execution in Hadoop, heterogeneous clusters, data distribution and placement, and data layout and the input splitter.
Data Flow In MapReduce
MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data has to flow through various phases.
DATA INTEGRITY
In HDFS, data is divided into fixed-size blocks, and for each block, a checksum is calculated using algorithms like CRC32C
or CRC32. This checksum serves as a unique fingerprint for the data, making even the slightest alteration in the data
instantly detectable.
Data Integrity: Checksums act as guardians of data integrity. When data is read from HDFS, the system recalculates the
checksum and compares it to the stored checksum. Any mismatch indicates data corruption.
🔄 Data Validation during Transfer: They ensure data remains intact during transfer between nodes, reducing the risk of
corruption due to network issues or hardware failures.
👁️ Identifying Faulty Blocks: Checksums help identify which replica of a block is corrupt, allowing for quick recovery or
replication from healthy copies.
⏳ Early Detection of Bit Rot: Bit rot can silently corrupt data over time. Checksums help spot this early, allowing for timely
restoration or replication.
🚫 Preventing Silent Data Corruption: They act as a safeguard against silent corruption, actively monitoring data integrity.
🛠️ Efficient Error Handling: In case of corruption, HDFS can take automated actions, reducing manual intervention and
enhancing system reliability.
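As a minimal illustration of how a checksum detects corruption, the sketch below uses the standard java.util.zip.CRC32 class rather than Hadoop's own CRC32C implementation; the block contents are hypothetical.
import java.util.zip.CRC32

object ChecksumCheck {
  def crc(bytes: Array[Byte]): Long = {
    val c = new CRC32()
    c.update(bytes)
    c.getValue
  }

  def main(args: Array[String]): Unit = {
    val original = "block contents".getBytes("UTF-8")
    val stored   = crc(original)                 // checksum computed when the block is written

    val corrupted = original.clone()
    corrupted(0) = (corrupted(0) ^ 0x01).toByte  // flip a single bit to simulate corruption

    // On read, recompute and compare: any mismatch signals corruption.
    println(crc(original)  == stored)  // true
    println(crc(corrupted) == stored)  // false
  }
}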
What is Serialization?
Serialization is the process of translating data structures or object state into binary or textual form, either to transport the data over a network or to store it on persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also termed marshalling and deserialization is termed unmarshalling.
Generally in distributed systems like Hadoop, the concept of serialization is used for Interprocess
Communication and Persistent Storage.
Interprocess Communication
To establish interprocess communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used. RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows −
o Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
o Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and
deserialization process should be quick, producing less overhead.
o Extensible − Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers.
o Interoperable − The message format should support the nodes that are written in different languages.
Persistent Storage
Persistent Storage is a digital storage facility that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.
Writable Interface
This is the interface in Hadoop that provides methods for serialization and deserialization. It declares two methods: write(DataOutput out), which serializes the object to the output stream, and readFields(DataInput in), which deserializes it from the input stream.
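A hedged sketch of a custom Writable, written in Scala to match the other code in these notes (Hadoop code is usually Java); the SensorReading class and its fields are hypothetical, and hadoop-common is assumed to be on the classpath.
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// Hypothetical custom Writable holding an id and a temperature reading.
class SensorReading(var id: Long, var temperature: Double) extends Writable {
  def this() = this(0L, 0.0)  // Hadoop needs a no-argument constructor for deserialization

  // Serialize the fields to the binary stream.
  override def write(out: DataOutput): Unit = {
    out.writeLong(id)
    out.writeDouble(temperature)
  }

  // Deserialize the fields from the binary stream, in the same order they were written.
  override def readFields(in: DataInput): Unit = {
    id = in.readLong()
    temperature = in.readDouble()
  }
}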
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize
Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an
open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The architecture of Hive is depicted by a component diagram containing different units; the diagram and the table describing each unit are not reproduced here.
To understand the Hive DML commands, let's see the employee and employee_department table first.
LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
SELECTS and FILTERS
1. hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';
GROUP BY
1. hive> SELECT E.Address, count(E.EMP_ID) FROM Employee E GROUP BY E.Address;
Hive Sort By vs Order By
Hive sort by and order by commands are used to fetch data in sorted order. The main differences between sort by and order
by commands are given below.
Sort by
1. hive> SELECT E.EMP_ID FROM Employee E SORT BY E.empid;
May use multiple reducers for final output.
Only guarantees ordering of rows within a reducer.
May give partially ordered result.
Order by
1. hive> SELECT E.EMP_ID FROM Employee E order BY E.empid;
Uses single reducer to guarantee total order in output.
LIMIT can be used to minimize sort time.
HiveQL - JOIN
The HiveQL Join clause is used to combine the data of two or more tables based on a related column between them. The
various type of HiveQL joins are: -
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
Here, we are going to execute the join clauses on the records of two example tables (the tables themselves are not reproduced here).
Inner Join in HiveQL
The HiveQL inner join is used to return the rows of multiple tables where the join condition is satisfied. In other words, the join criteria find the matching records in every table being joined.
Left Outer Join in HiveQL
The HiveQL left outer join returns all the records from the left (first) table and only those records from the right (second) table where the join criteria find a match.
Right Outer Join in HiveQL
The HiveQL right outer join returns all the records from the right (second) table and only those records from the left (first) table where the join criteria find a match.
Full Outer Join
The HiveQL full outer join returns all the records from both tables. It assigns NULL for missing records in either table.
U-4
Fill
Using the fill function, you can fill one or more columns with a set of values. This can be done by specifying a map, that is, a particular value and a set of columns. For example, to fill all null values in columns of type String, you might specify the following:
df.na.fill("All Null values become this string")
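A hedged sketch of the per-column form mentioned above, using the same df; the column names StockCode and Description are assumptions, not given in the text.
// Map from (assumed) column names to the value used to fill nulls in that column.
val fillColValues = Map("StockCode" -> 5, "Description" -> "No Value")
df.na.fill(fillColValues)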
Grouping
Thus far, we have performed only DataFrame-level aggregations. A more common task is to perform calculations based on
groups in the data. This is typically done on categorical data for which we group our data on one column and perform some
calculations on the other columns that end up in that group. The best way to explain this is to begin performing some
groupings. The first will be a count, just as we did before. We will group by each unique invoice number and get the count
of items on that invoice. Note that this returns another DataFrame and is lazily performed. We do this grouping in two
phases. First we specify the column(s) on which we would like to group, and then we specify the aggregation(s). The first
step returns a RelationalGroupedDataset, and the second step returns a DataFrame. As mentioned, we can specify any
number of columns on which we want to group:
df.groupBy("InvoiceNo", "CustomerId").count().show()
Grouping with Expressions
As we saw earlier, counting is a bit of a special case because it exists as a method. For this, usually we prefer to use the count function. Rather than passing that function as an expression into a select statement, we specify it within agg. This makes it possible for you to pass in arbitrary expressions that just need to have some aggregation specified. You can even do things like alias a column after transforming it for later use in your data flow:
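The example itself does not appear in these notes; a plausible sketch, assuming the same retail DataFrame df with InvoiceNo and Quantity columns used in the earlier grouping example:
import org.apache.spark.sql.functions.{count, expr}

// Aggregations specified inside agg(), with an alias for later use in the data flow.
df.groupBy("InvoiceNo").agg(
  count("Quantity").alias("quan"),
  expr("count(Quantity)")).show()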
Join Expressions
A join brings together two sets of data, the left and the right, by comparing the value of one or more keys of the left and
right and evaluating the result of a join expression that determines whether Spark should bring together the left set of data
with the right set of data. The most common join expression, an equi-join, compares whether the specified keys in your left
and right datasets are equal. If they are equal, Spark will combine the left and right datasets. The opposite is true for keys
that do not match; Spark discards the rows that do not have matching keys.
Join Types :
Whereas the join expression determines whether two rows should join, the join type determines what should be in the result
set. There are a variety of different join types available in Spark for you to use:
Inner joins (keep rows with keys that exist in the left and right datasets)
Outer joins (keep rows with keys in either the left or right datasets)
Left outer joins (keep rows with keys in the left dataset)
Right outer joins (keep rows with keys in the right dataset)
Left semi joins (keep the rows in the left, and only the left, dataset where the key appears in the right dataset)
Left anti joins (keep the rows in the left, and only the left, dataset where they do not appear in the right dataset)
Inner Joins
Inner joins evaluate the keys in both of the DataFrames or tables and include (and join together) only the rows that evaluate
to true. In the following example, we join the graduateProgram DataFrame with the person DataFrame to create a new
DataFrame:
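The example code is not reproduced in these notes; a hedged sketch of how such an inner join typically looks, assuming person has a graduate_program column that references the id column of graduateProgram:
// Equi-join expression on the assumed key columns.
val joinExpression = person.col("graduate_program") === graduateProgram.col("id")

// Inner join is the default join type; it can also be named explicitly as a third argument.
person.join(graduateProgram, joinExpression).show()
person.join(graduateProgram, joinExpression, "inner").show()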
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on
either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in
parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a
dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop
Input Format.
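A brief sketch of both creation paths (the application name, master URL, and HDFS path are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RDDExample").setMaster("local[*]"))

// 1. Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in an external storage system (path is hypothetical).
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")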
Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient. (The illustration of how the existing framework handles interactive queries on MapReduce is not reproduced here.)
Narrow transformations are operations where each input partition of an RDD is used to compute only one output partition
of the resulting RDD. Examples of narrow transformations include map(), filter(), and union(). Narrow transformations are
preferred because they allow for more efficient processing, as they can be executed in parallel on individual partitions
without the need for shuffling or data movement across the cluster.
Wide transformations, on the other hand, are operations where each input partition of an RDD is used to compute multiple
output partitions of the resulting RDD. Examples of wide transformations include groupByKey(), reduceByKey(),
and sortByKey(). Wide transformations require shuffling or data movement across the cluster to redistribute the data, which
can be expensive in terms of performance and network overhead.
- In a wide transformation, shuffling is involved and the data is written to disk, making it a costly and slow transformation.
- In a DAG, a new stage is created for every wide transformation.
- For *optimisation*, one must reduce the usage of wide transformations if possible, or at least apply as many narrow transformations as possible before proceeding to a wide transformation.
- In Apache Spark, transformations are operations that create a new RDD (Resilient Distributed Dataset) from an existing one. There are two types of transformations in Spark: Narrow and Wide.
In general, it is recommended to use narrow transformations whenever possible to optimize performance and minimize data
movement across the cluster. However, there are cases where wide transformations are necessary to achieve the desired
computation, such as aggregations or joins that require combining data across multiple partitions.
In summary, narrow transformations are operations where each input partition is used to compute one output partition, while
wide transformations are operations where each input partition is used to compute multiple output partitions. Narrow
transformations are preferred because they are more efficient and require less data movement across the cluster, but there are
cases where wide transformations are necessary for certain types of computations.
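A short sketch contrasting the two kinds of transformation, continuing the hypothetical sc from the RDD sketch above:
// Continuing the hypothetical SparkContext `sc` from the earlier sketch.
val lines  = sc.textFile("hdfs://namenode:9000/data/input.txt")

// Narrow transformations: each input partition contributes to exactly one output partition.
val words  = lines.flatMap(_.split(" "))
val longer = words.filter(_.length > 3)

// Wide transformation: reduceByKey redistributes records by key (a shuffle), starting a new stage.
val counts = longer.map(w => (w, 1)).reduceByKey(_ + _)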
Action: Description
reduce(func): It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): It returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
takeSample(withReplacement, num, [seed]): It returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]): It returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): It is used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala): It is used to write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system.
saveAsObjectFile(path) (Java and Scala): It is used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey(): It is only available on RDDs of type (K, V). It returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func): It runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.
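A hedged sketch showing a few of these actions, continuing the hypothetical RDDs (numbers, words, counts) from the sketches above; the output path is a placeholder.
val total    = numbers.reduce(_ + _)                          // commutative, associative aggregation
val sample   = words.takeSample(withReplacement = false, num = 5, seed = 42L)
val firstTwo = counts.takeOrdered(2)                          // natural ordering of (word, count) pairs
val perKey   = counts.countByKey()                            // only for (K, V) RDDs
counts.saveAsTextFile("hdfs://namenode:9000/out/wordcounts")  // hypothetical output path
counts.foreach { case (w, c) => println(s"$w -> $c") }        // side effect runs on the executors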
Creating DataFrames
There are many ways to create DataFrames; three of the most commonly used methods are creating them from a local collection, reading a file such as JSON, and converting an existing RDD, as sketched below. The JSON referred to in the original example is a simple employee database file that contains two records/rows (the file itself is not reproduced here).
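A sketch of the three creation methods mentioned above (the column names, values, and the employee.json path are illustrative assumptions):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CreateDataFrames").master("local[*]").getOrCreate()
import spark.implicits._

// 1. From a local collection, using toDF (column names are illustrative).
val fromSeq = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// 2. From a JSON file such as the employee file mentioned above (path is hypothetical).
val fromJson = spark.read.json("employee.json")

// 3. From an existing RDD of tuples.
val fromRdd = spark.sparkContext.parallelize(Seq(("Carol", 41))).toDF("name", "age")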
Accumulator variables are used for aggregating the information through associative and commutative
operations. For example, you can use an accumulator for a sum operation or counters (in MapReduce).
The following code block has the details of an Accumulator class for PySpark.
class pyspark.Accumulator(aid, value, accum_param)
The following example shows how to use an Accumulator variable. An Accumulator variable has an attribute called value, similar to what a broadcast variable has; it stores the data and is used to return the accumulator's value, but it is usable only in the driver program.
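The example itself is not included in these notes, and the class shown is the PySpark one; an equivalent sketch in Scala, which the rest of the code in these notes uses, relies on SparkContext.longAccumulator.
import org.apache.spark.sql.SparkSession

// Scala equivalent of the accumulator usage described above (the text shows the PySpark class).
val spark = SparkSession.builder.appName("AccumulatorExample").master("local[*]").getOrCreate()
val acc = spark.sparkContext.longAccumulator("sum_accumulator")

// add() is called on the executors; value is read back only in the driver program.
spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5)).foreach(x => acc.add(x))
println(acc.value)  // 15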
SPARK DEPLOYMENT
spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all respective cluster managers through a uniform interface. Therefore, you do not have to configure your application for each one.
Example
Let us take the same example of word count, we used before, using shell commands. Here, we consider
the same example as a spark application.
Sample Input
The following text is the input data, and the file is named in.txt.
people are not as beautiful as they look, as
they walk or as they talk.
they are only as beautiful as they love, as
they care as they share.
Look at the following program −
SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._
object SparkWordCount {
def main(args: Array[String]) {
val sc = new SparkContext( "local", "Word Count", "/usr/local/spark", Nil, Map(), Map())
/* Read the input file, tokenize each line into words, map each word to (word, 1),
and sum the counts per word with reduceByKey. */
val inputRDD = sc.textFile("in.txt")
val countRDD = inputRDD.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
countRDD.saveAsTextFile("outfile")
}
}
Save the above program into a file named SparkWordCount.scala and place it in a user-defined directory named spark-application.
Note − While transforming the inputRDD into countRDD, we are using flatMap() for tokenizing the lines (from the text file) into words, the map() method for mapping each word to a (word, 1) pair, and the reduceByKey() method for counting the repetitions of each word.
Use the following steps to submit this application. Execute all steps in the spark-application directory through the
terminal.
Step 1: Download the Spark core JAR
The Spark core jar is required for compilation; therefore, download spark-core_2.10-1.3.0.jar and move the jar file from the download directory to the spark-application directory.
Step 2: Compile program
Compile the above program using the command given below. This command should be executed
from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
is a Hadoop support jar taken from Spark library.
$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala
Step 3: Create a JAR
Create a jar file of the Spark application using the jar command; here, wordcount is the name of the jar file.
Cluster Managers
A cluster manager is a platform (cluster mode) where we can run Spark. Simply put, the cluster manager provides resources to all worker nodes as per need and operates all nodes accordingly. We can say there are a master node and worker nodes available in a cluster; the master node provides an efficient working environment to the worker nodes.
There are three types of Spark cluster manager. Spark supports these cluster managers:
Standalone cluster manager
Hadoop YARN
Apache Mesos
Apache Spark also supports pluggable cluster management. The main task of the cluster manager is to provide resources to all applications. We can say it is an external service for acquiring required resources on the cluster.
1. Standalone Cluster Manager
It is a part of the Spark distribution and is available as a simple cluster manager. The standalone cluster manager is resilient in nature; it can handle work failures. It has the capability to manage resources according to the requirements of applications.
We can easily run it on Linux, Windows, or Mac. It can also access HDFS (Hadoop Distributed File System) data. This is the easiest way to run Apache Spark on a cluster. It also has high availability for the master.
Working with Standalone Cluster Manager
As we discussed earlier, the cluster manager has a master and some number of workers, with a configured amount of memory and CPU cores as available resources. In this cluster mode, Spark allocates resources according to cores; by default, an application may grab all the cores available in the cluster. If the master crashes, a ZooKeeper quorum can help: it recovers the master using a standby master. We can also recover the master by using the file system. All the applications we are working on have a web user interface. This interface works as a watcher over the cluster and its job statistics, and it provides information on memory and running jobs. This cluster manager has detailed log output for every task performed, and the Web UI can reconstruct the application's UI even after the application exits.
2. Hadoop YARN
This cluster manager works as a distributed computing framework. It also maintains job scheduling as well as resource management. In this cluster, masters and slaves are highly available; executors and a pluggable scheduler are also available. We can run it on Linux and even on Windows. Hadoop YARN is also known as MapReduce 2.0; it separates the functionality of resource management from job scheduling.
Working with Hadoop YARN Cluster Manager
In YARN, whenever a job request enters the ResourceManager, it computes the placement according to the number of resources available and then places the job. The YARN system operates at a very large scale, and it is the component that decides where a job should go. It is an evolutionary step of the MapReduce framework: it works as a resource-manager component, largely motivated by the need to scale Hadoop jobs, and we can optimize Hadoop jobs with the help of YARN. YARN is aimed at short but fast Spark jobs; it is not well suited to long-running services or to short-lived queries, because their resource demands, execution models, and architectural demands are not those it was designed for, so it is not an ideal system for them. YARN is suitable for jobs that can be restarted easily if they fail. YARN does not handle distributed file systems or databases, although its massive scheduler handles different types of workloads. YARN is not a lightweight system, and it is not able to support a growing number of concurrent algorithms.
Accessing the Spark Logs - When a Spark job or application fails, you can use the Spark logs to analyze the failures. The QDS UI provides links to the logs in the Application UI and Spark Application UI.
If you are running the Spark job or application from the Analyze page, you can access the
logs via the Application UI and Spark Application UI.
If you are running the Spark job or application from the Notebooks page, you can access the logs
via the Spark Application UI.
Accessing the Application UI - To access the logs via the Application UI from the Analyze page of the QDS UI:
1. Note the command id, which is unique to the Qubole job or command.
2. Enter the command id in the Command Id field and click Apply.
Logs of any Spark job are displayed in the Application UI and Spark Application UI, which are accessible in the Logs and Resources tabs. The information in these UIs can be used to trace any information related to command status.
Accessing the Spark Application UI - You can access the logs by using the Spark Application UI from the Analyze page and Notebooks page.
1. Note the command id, which is unique to the Qubole job or command.
2. Click on the Spark widget on the top right and click on Spark UI.
When you open the Spark UI from the Spark widget of the Notebooks page or from the Analyze page,
the Spark Application UI is displayed in a separate tab . The Spark Application UI displays the
following information:
Jobs: The Jobs tab shows the total number of completed, succeeded and failed jobs. It also shows the number of stages of a job that have succeeded.
Stages: The Stages tab shows the total number of completed and failed stages. If you want to check more details about the failed stages, click on the failed stage in the Description column. The Errors column shows the detailed error message for the failed tasks. You should note the executor id and the hostname to view details in the container logs. For more details about the error stack trace, you should check the container logs.
Storage: The Storage tab displays the cached data if caching is enabled.
Environment: The Environment tab shows information about the JVM, Spark properties, System properties and classpath entries, which helps in knowing the values of the properties used by the Spark cluster during runtime.
Executors : The Executors tab shows the container logs. You can map the container logs using
the executor id and the hostname, which is displayed in the Stages tab.
Spark on Qubole provides the following additional fields in the Executors tab:
o Resident size/Container size: Displays the total physical memory used within the container
(which is the executor’s java heap + off heap memory) as Resident size, and the configured yarn
container size (which is executor memory + executor overhead) as Container size.
o Heap used/committed/max: Displays values corresponding to the executor’s java heap.
The Logs column shows the links to the container logs. Additionally, the number of tasks executed by each executor, with the number of active, failed, completed and total tasks, is displayed.
Spark Performance Tuning refers to the process of adjusting the settings for the memory, cores, and instances used by the system. This process guarantees that Spark has flawless performance and also prevents bottlenecking of resources in Spark.
What is Data Serialization?
In order to reduce memory usage, you might have to store Spark RDDs in serialized form. Data serialization also determines good network performance. You will be able to obtain good results in Spark performance by:
Terminating those jobs that run long.
Ensuring that jobs are running on a precise execution engine.
Using all resources efficiently.
Enhancing the system's performance time.
Spark supports two serialization libraries, as follows:
Java Serialization
Kryo Serialization
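A hedged sketch of switching Spark to Kryo serialization; the Trade case class is a hypothetical application class registered with Kryo.
import org.apache.spark.SparkConf

// Hypothetical application class to register with Kryo.
case class Trade(symbol: String, price: Double)

// Enable Kryo; registering classes up front lets Kryo write compact class identifiers.
val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Trade]))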
Stream processing applications must also handle challenges such as:
Processing out-of-order data based on application timestamps (also called event time)
Maintaining large amounts of state
Supporting high data throughput
Processing each event exactly once despite machine failures
Handling load imbalance and stragglers
Responding to events at low latency
Joining with external data in other storage systems
Determining how to update output sinks as new events arrive
Writing data transactionally to output systems
Updating your application's business logic at runtime
Stream Processing Design Points
To support the stream processing challenges we described, including high throughput, low latency, and out-of-order data,
there are multiple ways to design a streaming system. We describe the most common design options here
Record-at-a-Time Versus Declarative APIs
The simplest way to design a streaming API would be to just pass each event to the application and let it react using custom
code. This is the approach that many early streaming systems, such as Apache Storm, implemented, and it has an important
place when applications need full control over the processing of data
Event Time Versus Processing Time
For the systems with declarative APIs, a second concern is whether the system natively supports event time. Event time is
the idea of processing data based on timestamps inserted into each record at the source, as opposed to the time when the
record is received at the streaming application (which is called processing time).
Event Time
Event time is an important topic to cover discretely because Spark's DStream API does not support processing information with respect to event time. At a higher level, in stream-processing systems there are effectively two relevant times for each event: the time at which it actually occurred (event time), and the time that it was processed or reached the stream-processing system (processing time).
Event time
Event time is the time that is embedded in the data itself. It is most often, though not required to be, the time that an event actually occurs. This is important to use because it provides a more robust way of comparing events against one another. The challenge here is that event data can be late or out of order. This means that the stream processing system must be able to handle out-of-order or late data.
Processing time
Processing time is the time at which the stream-processing system actually receives data. This is usually less important
than event time because when it’s processed is largely an implementation detail. This can’t ever be out of order because it’s
a property of the streaming system at a certain time (not an external system like event time).
The fundamental idea is that the order of the series of events in the processing system does not guarantee an ordering in
event time. This can be somewhat unintuitive, but is worth reinforcing. Computer networks are unreliable. That means that
events can be dropped, slowed down, repeated, or be sent without issue. Because individual events are not guaranteed to
suffer one fate or the other, we must acknowledge that any number of things can happen to these events on the way from
the source of the information to our stream processing system. For this reason, we need to operate on event time and look at
the overall stream with reference to this information contained in the data rather than on when it arrives in the system. This
means that we hope to compare events based on the time at which those events occurred
Stateful processing
is only necessary when you need to use or update intermediate information (state) over longer periods of time (in either a
microbatch or a record-at-a-time approach). This can happen when you are using event time or when you are performing an
aggregation on a key, whether that involves event time or not. For the most part, when you are performing stateful operations, Spark handles all of this complexity for you. For example, when you specify a grouping, Structured Streaming maintains and updates the information for you; you simply specify the logic. When performing a stateful operation, Spark stores the intermediate information in a state store. Spark's current state store implementation is an in-memory state store that is made fault tolerant by storing intermediate state to the checkpoint directory.
In a tumbling window, tuples are grouped in a single window based on time or count. A tuple belongs
to only one window.
An example would be to compute the average price of a stock over the last five minutes, computed
every five minutes.
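A hedged Structured Streaming sketch of such a five-minute tumbling window; the rate source, column names, and console sink are placeholders standing in for a real trade stream.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, window}

val spark = SparkSession.builder.appName("TumblingWindow").master("local[*]").getOrCreate()

// Placeholder streaming source; a real job would read trade events with event_time and price columns.
val trades = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "event_time")
  .withColumn("price", col("value") * 1.0)

// 5-minute tumbling window: each event falls into exactly one window.
val avgPrice = trades
  .groupBy(window(col("event_time"), "5 minutes"))
  .agg(avg("price").alias("avg_price"))

val query = avgPrice.writeStream.outputMode("update").format("console").start()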
Handling Late Data with Watermarks
The preceding examples are great, but they have a flaw. We never specified how late
we expect to see data. This means that Spark is going to need to store that
intermediate data forever because we never specified a watermark, or a time at which
we don’t expect to see any more data. This applies to all stateful processing that
operates on event time. We must specify this watermark in order to age-out data in
the stream (and, therefore, state) so that we don’t overwhelm the system over a long
period of time.
Concretely, a watermark is an amount of time following a given event or set of events
after which we do not expect to see any more data from that time. We know this can
happen due to delays on the network, devices that lose a connection, or any number
of other issues. In the DStreams API, there was no robust way to handle late data in
this way—if an event occurred at a certain time but did not make it to the processing
system by the time the batch for a given window started, it would show up in other
processing batches.
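Continuing the hypothetical trades stream from the tumbling-window sketch above, a watermark can be added so that old state is aged out; the 10-minute delay threshold is an illustrative assumption.
import org.apache.spark.sql.functions.{col, window}

// Drop state for events that arrive more than 10 minutes later than the latest event time seen.
val lateTolerant = trades
  .withWatermark("event_time", "10 minutes")
  .groupBy(window(col("event_time"), "5 minutes"))
  .count()

val wmQuery = lateTolerant.writeStream.outputMode("update").format("console").start()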