U-3 Big Data

What is Hadoop

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up simply by adding nodes to the cluster.
The Hadoop Distributed File System (HDFS)
is the primary data storage system used by Hadoop applications. HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Hadoop itself is an open-source distributed processing framework that manages data processing and storage for big data applications. HDFS is a key part of the Hadoop ecosystem of technologies. It provides a reliable means of managing pools of big data and supporting related big data analytics applications.
HDFS architecture, NameNodes and DataNodes
HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the primary server that manages the file
system namespace and controls client access to files. As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and provides clients with the right access permissions. The
system's DataNodes manage the storage that's attached to the nodes they run on.
HDFS exposes a file system namespace and enables user data to be stored in files. A file is split into one or more blocks that are stored in a set of DataNodes. The NameNode performs file system namespace operations, including opening,
closing and renaming files and directories. The NameNode also governs the mapping of blocks to the DataNodes. The
DataNodes serve read and write requests from the clients of the file system. In addition, they perform block creation,
deletion and replication when the NameNode instructs them to do so.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.

The NameNode records any change to the file system namespace or its properties. An application can stipulate the number
of replicas of a file that the HDFS should maintain. The NameNode stores the number of copies of a file, called the
replication factor of that file.
Features of HDFS
There are several features that make HDFS particularly useful, including:
 Data replication. This is used to ensure that the data is always available and prevents data loss. For example, when a
node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within a cluster, so processing
continues while data is recovered.
 Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large cluster
ensures fault tolerance and reliability.
 High availability. As mentioned earlier, because of replication across nodes, data is available even if the NameNode or a DataNode fails.
 Scalability. Because HDFS stores data on various nodes in the cluster, as requirements increase, a cluster can scale to
hundreds of nodes.
 High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, plus data locality (see next bullet), cuts the processing time and enables high throughput.
 Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than having the data
move to where the computational unit is. By minimizing the distance between the data and the computing process, this
approach decreases network congestion and boosts a system's overall throughput.
MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it into equivalent tasks, which lowers overhead on the cluster network and reduces the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing. There can be multiple
clients available that continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants to do, comprising many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
7. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may be a key-value pair where the key can be the id of some kind of address and the value is the actual value that it keeps. The Map() function is executed in its memory repository on each of these input key-value pairs and generates the intermediate key-value pairs, which work as input for the Reducer or Reduce() function.

8. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer. In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job-parts. These job-parts are then made available for the Map and Reduce tasks. The Map and Reduce tasks contain the program as per the requirement of the use case that the particular company is solving. The developer writes the logic that fulfills the industry's requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be n number of Map and Reduce tasks made available for processing the data as per the requirement. The algorithms for Map and Reduce are written in a very optimized way so that the time complexity or space complexity is minimal.
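To make the Map and Reduce phases concrete, here is a minimal word-count sketch in Python written in the style of Hadoop Streaming, where the mapper and reducer read from standard input and write tab-separated key-value pairs to standard output. The file names and the tab-separated format are illustrative assumptions, not tied to any particular cluster setup.

# mapper.py - emits (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")          # key and value separated by a tab

# reducer.py - input arrives sorted by key, so counts for the same word
# are adjacent and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")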

Fault tolerance comparison

Fault tolerance refers to the ability of a system to recover from failures and errors without losing data or functionality. Both
Hadoop and Spark are fault tolerant, meaning that they can handle node failures, data corruption, and network issues without
affecting the overall execution. However, Hadoop and Spark have different approaches to fault tolerance, which have trade-
offs and implications. Hadoop relies on data replication and checkpointing to ensure fault tolerance, which means that it
duplicates the data blocks across multiple nodes and periodically saves the state of the computation to disk. This provides
high reliability and durability, but also consumes more disk space and network bandwidth. Spark relies on data lineage and
lazy evaluation to ensure fault tolerance, which means that it tracks the dependencies and transformations of the RDDs and
only executes them when needed. This provides high performance and flexibility, but also requires more memory and
computation power.

Replication and Replication Factor

Replication ensures the availability of the data. Replication is simply making a copy of something, and the number of times you make a copy of that particular thing is its Replication Factor. As we have seen with file blocks, HDFS stores the data in the form of blocks, and Hadoop is also configured to make copies of those file blocks. By default, the Replication Factor for Hadoop is set to 3, which is configurable: you can change it manually as per your requirement. In the example above we made 4 file blocks, which means 3 replicas or copies of each file block are made, so a total of 4 × 3 = 12 blocks are made for backup purposes.
In the above image, you can see that there is a Master with RAM = 64 GB and Disk Space = 50 GB, and 4 Slaves with RAM = 16 GB and Disk Space = 40 GB. You can observe that the Master has more RAM. It needs more because the Master guides the slaves, so the Master has to process fast. Now suppose you have a file of size 150 MB; then the total file blocks will be 2, as shown below.
128 MB = Block 1
22 MB = Block 2
As the replication factor by default is 3, we have 3 copies of each file block:
FileBlock1-Replica1 (B1R1)   FileBlock2-Replica1 (B2R1)
FileBlock1-Replica2 (B1R2)   FileBlock2-Replica2 (B2R2)
FileBlock1-Replica3 (B1R3)   FileBlock2-Replica3 (B2R3)
These blocks are stored on our slaves as shown in the above diagram, which means that if Slave 1 crashes, then B1R1 and B2R3 are lost. But you can recover B1 and B2 from the other slaves, because replicas of these file blocks are already present on other slaves; similarly, if any other slave crashes, we can obtain that file block from some other slave. Replication increases our storage requirement, but the data is more important to us.
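A quick way to reason about these numbers is the small Python sketch below; the 128 MB block size, the replication factor of 3, and the 150 MB file are just the values used in the example above.

# block_math.py - estimate HDFS block and replica counts for a file
import math

block_size_mb = 128        # default HDFS block size
replication_factor = 3     # default HDFS replication factor
file_size_mb = 150         # example file from the text

num_blocks = math.ceil(file_size_mb / block_size_mb)    # 2 blocks (128 MB + 22 MB)
total_replicas = num_blocks * replication_factor        # 6 physical blocks stored in total

print(num_blocks, total_replicas)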

An HDFS high availability (HA) cluster uses two NameNodes—an active NameNode and a standby NameNode. Only one
NameNode can be active at any point in time. HDFS HA depends on maintaining a log of all namespace modifications in a
location available to both NameNodes, so that in the event of a failure, the standby NameNode has up-to-date information
about the edits and location of blocks in the cluster.

You can use Cloudera Manager to configure your CDH cluster for HDFS HA and automatic failover. In Cloudera Manager,
HA is implemented using Quorum-based storage. Quorum-based storage relies upon a set of JournalNodes, each of which
maintains a local edits directory that logs the modifications to the namespace metadata. Enabling HA enables automatic
failover as part of the same command.

Enabling High Availability and Automatic Failover

The Enable High Availability workflow leads you through adding a second (standby) NameNode and configuring
JournalNodes.

The key components and steps involved in Hadoop High Availability are as follows:
Active NameNode:
The active NameNode is the primary NameNode that manages client requests and metadata operations, just like in a non-HA
setup.
Standby NameNode:
 The standby NameNode is an additional NameNode that continuously replicates and maintains a copy of the metadata
from the active NameNode.
 The standby NameNode remains in constant communication with the active NameNode, receiving regular updates to
keep the metadata synchronized.
Quorum Journal Manager (QJM):
 The Quorum Journal Manager is a component that stores the edit logs, which contain the transactional changes made to
the HDFS metadata.
 The QJM ensures that the edit logs are written to a majority of nodes in the cluster, forming a quorum. This approach
ensures data consistency and prevents data loss even in the event of multiple node failures.
ZooKeeper (optional):
 Hadoop HA can optionally use Apache ZooKeeper to manage the failover process between the active and standby
NameNodes.
 ZooKeeper is a highly reliable coordination service that helps in selecting the active NameNode and ensuring that only
one NameNode is active at a time.
DATA LOCALITY
The major drawback of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this
drawback, Data Locality in Hadoop came into the picture. Data locality in MapReduce refers to the ability to move the
computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes
network congestion and increases the overall throughput of the system.
In Hadoop, datasets are stored in HDFS. Datasets are divided into blocks and stored across the DataNodes in the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes on which the data related to the MapReduce job is available.
Categories of Data Locality in Hadoop

Below are the various categories in which Data Locality in Hadoop is categorized:
i. Data local data locality in Hadoop

When the data is located on the same node as the mapper working on the data it is known as data local data locality. In this
case, the proximity of data is very near to computation. This is the most preferred scenario.

ii. Intra-Rack data locality in Hadoop

It is not always possible to execute the mapper on the same DataNode due to resource constraints. In such a case, it is preferred to run the mapper on a different node but on the same rack.
iii. Inter-Rack data locality in Hadoop

Sometimes it is not possible to execute the mapper on a different node in the same rack due to resource constraints. In such a case, we execute the mapper on nodes in different racks. This is the least preferred scenario.

Hadoop Data Locality Optimization

Data locality in Hadoop MapReduce is the main advantage of Hadoop MapReduce, as map code is executed on the same DataNode where the data resides. However, this is not always true in practice, for reasons such as speculative execution in Hadoop, heterogeneous clusters, data distribution and placement, and data layout and the input splitter.
Data Flow In MapReduce

MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data has to flow through various phases.

Phases of MapReduce data flow


Input reader
The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.
Once the input reader reads the data, it generates the corresponding key-value pairs. The input files reside in HDFS.
Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may be different from each other.
Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value, and it returns the index of the reducer.
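As a rough sketch of the idea (Hadoop's default partitioner is a hash partitioner implemented in Java; this is only a Python illustration of the same logic), a partition function can look like this:

# hash_partitioner.py - maps a key to a reducer index, a sketch of a default hash partitioner
def partition(key, num_reducers):
    # the same key always lands on the same reducer
    return hash(key) % num_reducers

print(partition("hadoop", 4))   # some index in the range 0..3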
Shuffling and Sorting
The data is shuffled between/within nodes so that it moves out of the map and gets ready for processing by the reduce function. Sometimes, the shuffling of data can take much computation time.
The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted form.
Reduce function
The Reduce function is assigned to each unique key. These keys are already arranged in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.

DATA INTEGRITY

In HDFS, data is divided into fixed-size blocks, and for each block, a checksum is calculated using algorithms like CRC32C
or CRC32. This checksum serves as a unique fingerprint for the data, making even the slightest alteration in the data
instantly detectable.
Data Integrity: Checksums act as guardians of data integrity. When data is read from HDFS, the system recalculates the
checksum and compares it to the stored checksum. Any mismatch indicates data corruption.
🔄 Data Validation during Transfer: They ensure data remains intact during transfer between nodes, reducing the risk of
corruption due to network issues or hardware failures.
👁️ Identifying Faulty Blocks: Checksums help identify which replica of a block is corrupt, allowing for quick recovery or
replication from healthy copies.
⏳ Early Detection of Bit Rot: Bit rot can silently corrupt data over time. Checksums help spot this early, allowing for timely
restoration or replication.
🚫 Preventing Silent Data Corruption: They act as a safeguard against silent corruption, actively monitoring data integrity.
🛠️ Efficient Error Handling: In case of corruption, HDFS can take automated actions, reducing manual intervention and
enhancing system reliability.
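As a rough illustration of the checksum idea (plain Python using zlib's CRC32, not HDFS's own implementation), a reader can detect corruption by recomputing the checksum of the bytes it received and comparing it to the stored one:

# checksum_check.py - detect corruption in a data block with CRC32
import zlib

block = b"some bytes that make up one HDFS block"
stored_checksum = zlib.crc32(block)            # computed when the block was written

received = b"some bytes that make up one HDFS block"   # the block as read back
if zlib.crc32(received) != stored_checksum:
    print("block is corrupt - read a healthy replica instead")
else:
    print("checksum matches - data is intact")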
What is Serialization?
Serialization is the process of translating data structures or object state into binary or textual form to transport the data over a network or to store it on some persistent storage. Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.
Generally in distributed systems like Hadoop, the concept of serialization is used for Interprocess
Communication and Persistent Storage.
Interprocess Communication
 To establish interprocess communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used.
 RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.
 The RPC serialization format is required to be as follows −
o Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
o Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and
deserialization process should be quick, producing less overhead.
o Extensible − Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers.
o Interoperable − The message format should support the nodes that are written in different languages.
Persistent Storage

Persistent storage is a digital storage facility that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.
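For intuition, here is a minimal Python sketch of serialization and deserialization (marshalling and unmarshalling) using the standard pickle module; Hadoop itself uses its own Writable-based binary format rather than pickle.

# serialize_demo.py - marshalling and unmarshalling a record
import pickle

record = {"id": 1, "name": "A"}        # an in-memory object

data = pickle.dumps(record)            # serialize: object -> bytes (marshalling)
# 'data' can now be sent over the network or written to persistent storage

restored = pickle.loads(data)          # deserialize: bytes -> object (unmarshalling)
assert restored == record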

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The following table describes
the methods −

S.No.  Methods and Description

1  void readFields(DataInput in)
   This method is used to deserialize the fields of the given object.

2  void write(DataOutput out)
   This method is used to serialize the fields of the given object.
Writable Comparable Interface
It is the combination of Writable and Comparable interfaces. This interface inherits Writable interface of Hadoop as well
as Comparable interface of Java. Therefore it provides methods for data serialization, deserialization, and comparison.
S.No.  Methods and Description

1  int compareTo(class obj)
   This method compares the current object with the given object obj.
What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize
Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an
open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Architecture of Hive

The following component diagram depicts the architecture of Hive. This component diagram contains different units. The following list describes each unit:

User Interface - Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server).

Meta Store - Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine - HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine - The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE - The Hadoop Distributed File System or HBase is the data storage technique used to store data in the file system.
Data Types in Hive
There are different data types in Hive, which are involved in table creation. All the data types in Hive are classified into four types, given as follows:
 Column Types
 Literals
 Null Values
 Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The integral types, from smallest to largest, are TINYINT, SMALLINT, INT, and BIGINT.
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale), e.g. DECIMAL(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Hive DDL Commands
Create Database Statement
A database in Hive is a namespace or a collection of tables.
1. hive> CREATE SCHEMA userdb;
2. hive> SHOW DATABASES;
Drop database
1. hive> DROP DATABASE IF EXISTS userdb;
Creating Hive Tables
Create a table called Sonoo with two columns, the first being an integer and the other a string.
1. hive> CREATE TABLE Sonoo(foo INT, bar STRING);
Create a table called HIVE_TABLE with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
1. hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Browse the table
1. hive> Show tables;
Altering and Dropping Tables
1. hive> ALTER TABLE Sonoo RENAME TO Kafka;
2. hive> ALTER TABLE Kafka ADD COLUMNS (col INT);
3. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment');
4. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING, baz INT COMMENT 'ba');
Hive DML Commands

To understand the Hive DML commands, let's see the employee and employee_department table first.

LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
SELECTS and FILTERS
1. hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';
GROUP BY
1. hive> SELECT E.Address, count(*) FROM Employee E GROUP BY E.Address;
Hive Sort By vs Order By
Hive sort by and order by commands are used to fetch data in sorted order. The main differences between sort by and order
by commands are given below.
Sort by
1. hive> SELECT E.EMP_ID FROM Employee E SORT BY E.EMP_ID;
May use multiple reducers for final output.
Only guarantees ordering of rows within a reducer.
May give partially ordered result.
Order by
1. hive> SELECT E.EMP_ID FROM Employee E ORDER BY E.EMP_ID;
Uses single reducer to guarantee total order in output.
LIMIT can be used to minimize sort time.
HiveQL - JOIN
The HiveQL Join clause is used to combine the data of two or more tables based on a related column between them. The
various type of HiveQL joins are: -
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
Here, we are going to execute the join clauses on the records of the following table:
Inner Join in HiveQL
The HiveQL inner join is used to return the rows of multiple tables where the join condition is satisfied. In other words, the join criteria find matching records in every table being joined.
Left Outer Join in HiveQL
The HiveQL left outer join returns all the records from the left (first) table and only those records from the right (second) table where the join criteria find a match.
Right Outer Join in HiveQL
The HiveQL right outer join returns all the records from the right (second) table and only those records from the left (first) table where the join criteria find a match.
Full Outer Join
The HiveQL full outer join returns all the records from both the tables. It assigns Null for missing records in either table.
U-4

Working with Dates and Timestamps


Dates and times are a constant challenge in programming languages and databases. It’s always necessary to keep track of
timezones and ensure that formats are correct and valid. Spark does its best to keep things simple by focusing explicitly on
two kinds of time-related information. There are dates, which focus exclusively on calendar dates, and timestamps, which
include both date and time information. Spark, as we saw with our current dataset, will make a best effort to correctly
identify column types, including dates and timestamps when we enable inferSchema.
We can see that this worked quite well with our current dataset because it was able to identify and read our date format
without us having to provide some specification for it. As we hinted earlier, working with dates and timestamps closely
relates to working with strings because we often store our timestamps or dates as strings and convert them into date types at
runtime. This is less common when working with databases and structured data but much more common when we are
working with text and CSV files. Spark’s TimestampType class supports only second-level precision, which means that if
you’re going to be working with milliseconds or microseconds, you’ll need to work around this problem by potentially
operating on them as longs. Any more precision when coercing to a TimestampType will be removed. Spark can be a bit particular about what format you have at any given point in time. It's important to be explicit when parsing or converting to
ensure that there are no issues in doing so. At the end of the day, Spark is working with Java dates and timestamps and
therefore conforms to those standards. Let’s begin with the basics and get the current date and the current timestamps:
# in Python
from pyspark.sql.functions import current_date, current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_date())\
.withColumn("now", current_timestamp())
dateDF.createOrReplaceTempView("dateTable")
dateDF.printSchema()
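Building on dateDF above, here is a short sketch of common date arithmetic; date_add, date_sub, and datediff are functions in pyspark.sql.functions, and the column name today is the one created above.

# in Python
from pyspark.sql.functions import date_add, date_sub, datediff, col

dateDF.select(
    date_sub(col("today"), 5),                            # five days before today
    date_add(col("today"), 5),                            # five days after today
    datediff(col("today"), date_sub(col("today"), 7))     # difference in days
).show(1)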

Working with Nulls in Data

As a best practice, you should always use nulls to represent missing or empty data in your DataFrames. Spark can
optimize working with null values more than it can if you use empty strings or other values. The primary way of interacting
with null values, at DataFrame scale, is to use the .na subpackage on a DataFrame. There are also several functions for
performing operations and explicitly specifying how Spark should handle null values.
Coalesce
Spark includes a function to allow you to select the first non-null value from a set of columns by using the coalesce function. In this case, there are no null values, so it simply returns the first column:
# in Python
from pyspark.sql.functions import coalesce, col
df.select(coalesce(col("Description"), col("CustomerId"))).show()
ifnull, nullIf, nvl, and nvl2:
ifnull allows you to select the second value if the first is null, and defaults to the first. Alternatively, you could use nullif,
which returns null if the two values are equal or else returns the second if they are not. nvl returns the second value if the
first is null, but defaults to the first. Finally, nvl2 returns the second value if the first is not null; otherwise, it will return the
last specified value.
Drop
The simplest function is drop, which removes rows that contain nulls. The default is to drop any row in which any value is null:
df.na.drop()
df.na.drop("any")

Fill
Using the fill function, you can fill one or more columns with a set of values. This can be done by specifying a map—that is
a particular value and a set of columns. For example, to fill all null values in columns of type String, you might specify the
following:
df.na.fill("All Null values become this string"

Working with JSON


Spark has some unique support for working with JSON data. You can operate directly on strings of JSON in Spark and
parse from JSON or extract JSON objects. Let’s begin by creating a JSON column:
jsonDF = spark.range(1).selectExpr("""
'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""")
You can use get_json_object to inline query a JSON object, be it a dictionary or an array. You can use json_tuple if this object has only one level of nesting:
from pyspark.sql.functions import get_json_object, json_tuple, col
jsonDF.select(
    get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"),
    json_tuple(col("jsonString"), "myJSONKey")).show(2)

Grouping
Thus far, we have performed only DataFrame-level aggregations. A more common task is to perform calculations based on
groups in the data. This is typically done on categorical data for which we group our data on one column and perform some
calculations on the other columns that end up in that group. The best way to explain this is to begin performing some
groupings. The first will be a count, just as we did before. We will group by each unique invoice number and get the count
of items on that invoice. Note that this returns another DataFrame and is lazily performed. We do this grouping in two
phases. First we specify the column(s) on which we would like to group, and then we specify the aggregation(s). The first
step returns a RelationalGroupedDataset, and the second step returns a DataFrame. As mentioned, we can specify any
number of columns on which we want to group:
df.groupBy("InvoiceNo", "CustomerId").count().show()
Grouping with Expressions
As we saw earlier, counting is a bit of a special case because it exists as a method. For this, usually we prefer to use the count function. Rather than passing that function as an expression into a select statement, we specify it within agg. This makes it possible for you to pass in arbitrary expressions that just need to have some aggregation specified. You can even alias a column after transforming it for later use in your data flow:
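A short sketch of what this looks like; count and expr come from pyspark.sql.functions, and the column name Quantity is an assumption about the dataset.

# in Python
from pyspark.sql.functions import count, expr

df.groupBy("InvoiceNo").agg(
    count("Quantity").alias("quan"),      # count used as a function inside agg, then aliased
    expr("count(Quantity)")).show()       # the same aggregation expressed as a string expression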

Grouping with Maps


Sometimes, it can be easier to specify your transformations as a series of Maps for which the key is the column, and the
value is the aggregation function (as a string) that you would like to perform. You can reuse multiple column names if you
specify them inline, as well:
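For example, using a Python dictionary of column name to aggregation function (the column names Quantity and UnitPrice are assumptions about the dataset):

# in Python
df.groupBy("InvoiceNo").agg({"Quantity": "avg", "UnitPrice": "max"}).show()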

Join Expressions
A join brings together two sets of data, the left and the right, by comparing the value of one or more keys of the left and
right and evaluating the result of a join expression that determines whether Spark should bring together the left set of data
with the right set of data. The most common join expression, an equi-join, compares whether the specified keys in your left
and right datasets are equal. If they are equal, Spark will combine the left and right datasets. The opposite is true for keys
that do not match; Spark discards the rows that do not have matching keys.
Join Types :
Whereas the join expression determines whether two rows should join, the join type determines what should be in the result
set. There are a variety of different join types available in Spark for you to use:
Inner joins (keep rows with keys that exist in the left and right datasets)
Outer joins (keep rows with keys in either the left or right datasets)
Left outer joins (keep rows with keys in the left dataset)
Right outer joins (keep rows with keys in the right dataset)
Left semi joins (keep the rows in the left, and only the left, dataset where the key appears in the right dataset)
Left anti joins (keep the rows in the left, and only the left, dataset where they do not appear in the right dataset)
Inner Joins
Inner joins evaluate the keys in both of the DataFrames or tables and include (and join together) only the rows that evaluate
to true. In the following example, we join the graduateProgram DataFrame with the person DataFrame to create a new
DataFrame:

joinExpression = person["graduate_program"] == graduateProgram['id']
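The join itself then looks something like the following; person and graduateProgram are the DataFrames named in the text, and inner is Spark's default join type.

# in Python
person.join(graduateProgram, joinExpression).show()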


Outer Joins
Outer joins evaluate the keys in both of the DataFrames or tables and includes (and joins together) the rows that evaluate to
true or false. If there is no equivalent row in either the left or right DataFrame, Spark will insert null
joinType = "outer"
person.join(graduateProgram, joinExpression, joinType).show()
Left Outer Joins
Left outer joins evaluate the keys in both of the DataFrames or tables and includes all rows from the left DataFrame as well
as any rows in the right DataFrame that have a match in the left DataFrame. If there is no equivalent row in the right
DataFrame, Spark will insert null:
joinType = "left_outer"
graduateProgram.join(person, joinExpression, joinType).show()
Right Outer Joins:
Right outer joins evaluate the keys in both of the DataFrames or tables and includes all rows from the right DataFrame as
well as any rows in the left DataFrame that have a match in the right DataFrame. If there is no equivalent row in the left
DataFrame, Spark will insert null:
joinType = "right_outer"
person.join(graduateProgram, joinExpression, joinType).show()
Left Semi Joins
Semi joins are a bit of a departure from the other joins. They do not actually include any values from the right DataFrame.
They only compare values to see if the value exists in the second DataFrame. If the value does exist, those rows will be kept
in the result, even if there are duplicate keys in the left DataFrame. Think of left semi joins as filters on a DataFrame, as
opposed to the function of a conventional join:
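A sketch of a left semi join, using the same joinExpression and DataFrames as above:

# in Python
joinType = "left_semi"
graduateProgram.join(person, joinExpression, joinType).show()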
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of
objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on
either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in
parallel.

There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a
dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop
Input Format.
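A minimal PySpark sketch of both creation paths (the HDFS path below is an illustrative assumption):

# in Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([1, 2, 3, 4, 5])           # parallelize an existing collection
rdd2 = sc.textFile("hdfs:///data/in.txt")        # reference a dataset in external storage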
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Iterative Operations on MapReduce


Reuse intermediate results across multiple computations in multi-stage applications. The following illustration explains how
the current framework works, while doing the iterative operations on MapReduce. This incurs substantial overheads due to
data replication, disk I/O, and serialization, which makes the system slow.

Interactive Operations on MapReduce


User runs ad-hoc queries on the same subset of data. Each query will do the disk I/O on the stable storage, which can
dominate application execution time.

The following illustration explains how the current framework works while doing the interactive queries on MapReduce.

Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory processing. This means it stores the state of memory as an object across the jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network and disk.
Iterative Operations on Spark RDD
The illustration given below shows the iterative operations on Spark RDD. It will store intermediate results in a distributed
memory instead of Stable storage (Disk) and make the system faster.
Interactive Operations on Spark RDD
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
Transformations
Transformations, when executed, result in one or more new RDDs. Transformations are lazy operations, meaning none of the transformations get executed until you call an action on a Spark RDD. Since RDDs are immutable, any transformation on an RDD results in a new RDD, leaving the current one unchanged and creating an RDD lineage.
There are two types of transformations:
 Narrow Transformation
 Wide Transformation

Narrow transformations are operations where each input partition of an RDD is used to compute only one output partition
of the resulting RDD. Examples of narrow transformations include map(), filter(), and union(). Narrow transformations are
preferred because they allow for more efficient processing, as they can be executed in parallel on individual partitions
without the need for shuffling or data movement across the cluster.

Wide transformations, on the other hand, are operations where each input partition of an RDD is used to compute multiple
output partitions of the resulting RDD. Examples of wide transformations include groupByKey(), reduceByKey(),
and sortByKey(). Wide transformations require shuffling or data movement across the cluster to redistribute the data, which
can be expensive in terms of performance and network overhead.

- In a wide transformation, shuffling is involved and the data is written to disk, which makes it a costly and slow transformation.
- In a DAG, a new stage is created for every wide transformation.
- For optimisation, one should reduce the usage of wide transformations if possible, or at least apply as many narrow transformations as possible before proceeding to a wide transformation.
- In Apache Spark, transformations are operations that create a new RDD (Resilient Distributed Dataset) from an existing one. There are two types of transformations in Spark: narrow and wide.

In general, it is recommended to use narrow transformations whenever possible to optimize performance and minimize data
movement across the cluster. However, there are cases where wide transformations are necessary to achieve the desired
computation, such as aggregations or joins that require combining data across multiple partitions.

In summary, narrow transformations are operations where each input partition is used to compute one output partition, while
wide transformations are operations where each input partition is used to compute multiple output partitions. Narrow
transformations are preferred because they are more efficient and require less data movement across the cluster, but there are
cases where wide transformations are necessary for certain types of computations.
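A small PySpark sketch contrasting the two, reusing the SparkContext sc from the earlier example: map and filter are narrow, while reduceByKey forces a shuffle and is wide.

# in Python
words = sc.parallelize(["big", "data", "big", "spark"])

pairs = words.map(lambda w: (w, 1))                     # narrow: each input partition feeds one output partition
filtered = pairs.filter(lambda kv: kv[0] != "spark")    # narrow as well
counts = filtered.reduceByKey(lambda a, b: a + b)       # wide: data is shuffled across partitions
print(counts.collect())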

Action - Description

reduce(func) - It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect() - It returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count() - It returns the number of elements in the dataset.

first() - It returns the first element of the dataset (similar to take(1)).

take(n) - It returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]) - It returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]) - It returns the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path) - It is used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala) - It is used to write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system.

saveAsObjectFile(path) (Java and Scala) - It is used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey() - It is only available on RDDs of type (K, V). It returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func) - It runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.
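A few of these actions in use on a small RDD, again assuming a SparkContext sc is available:

# in Python
nums = sc.parallelize([3, 1, 4, 1, 5, 9])

print(nums.count())                        # 6
print(nums.first())                        # 3
print(nums.take(3))                        # [3, 1, 4]
print(nums.reduce(lambda a, b: a + b))     # 23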

What is Spark DataFrame?


In Spark, DataFrames are the distributed collections of data, organized into rows and columns. Each column in a
DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured
and concise. We can say that DataFrames are relational databases with better optimization techniques.
Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
When there is not much storage space in memory or on disk, RDDs do not function properly as they get exhausted. Besides,
Spark RDDs do not have the concept of schema—the structure of a database that defines its objects. RDDs store both
structured and unstructured data together, which is not very efficient.
RDDs cannot modify the system in such a way that it runs more efficiently. RDDs do not allow us to debug errors during the
runtime. They store the data as a collection of Java objects.
RDDs use serialization (converting an object into a stream of bytes to allow faster processing) and garbage collection (an
automatic memory management technique that detects unused objects and frees them from memory) techniques. This
increases the overhead on the memory of the system as they are very lengthy.
Features of DataFrames
Some of the unique features of DataFrames are:
 Use of Input Optimization Engine: DataFrames make use of the input optimization engines, e.g., Catalyst
Optimizer, to process data efficiently. We can use the same engine for all Python, Java, Scala, and R DataFrame APIs.
 Handling of Structured Data: DataFrames provide a schematic view of data. Here, the data has some meaning to it
when it is being stored.
 Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store data off-heap
(outside the main Java Heap space, but still inside RAM), which in turn reduces the garbage collection overload.

Creating DataFrames
There are many ways to create DataFrames. Here are three of the most commonly used methods to create DataFrames:

 Creating DataFrames from JSON Files

Now, what are JSON files?


JSON, or JavaScript Object Notation, is a type of file that stores simple data structure objects in the .json format. It is mainly used to transmit data between web servers. This is how a simple .json file looks:
{ "employee" : [ { "id": "1", "name": "A" }, { "id": "2", "name": "B" } ] }

The above JSON is a simple employee database file that contains two records/rows
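Reading such a file into a DataFrame is a one-liner; the path employee.json is an illustrative assumption.

# in Python
employeeDF = spark.read.json("employee.json")
employeeDF.printSchema()
employeeDF.show()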

Convert the RDD to DataFrame


Apache Spark RDD transformations are Spark operations that, when executed on a Resilient Distributed Dataset (RDD), result in one or more new RDDs. As RDDs are immutable, transformations always create a new RDD without updating an existing one, which results in the creation of an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD dependency graph. RDD transformations are lazy operations: none of the transformations gets executed until an action is called by the user. As RDDs are immutable, any transformation results in a new RDD, leaving the current one unchanged. A DataFrame is a distributed collection of data organized into named columns, similar to database tables, and it provides optimization and performance improvements.
There are two approaches to convert an RDD to a DataFrame in PySpark:
1. Using createDataFrame(rdd, schema)
2. Using toDF(schema)
Method 1: Using the createDataFrame() function.
After creating the RDD, we convert it to a DataFrame using the createDataFrame() function, to which we pass the RDD and the schema defined for the DataFrame.
Syntax:
spark.createDataFrame(rdd, schema)
Method 2: Using the toDF() function.
After creating the RDD, we convert it to a DataFrame using the toDF() function, to which we pass the schema defined for the DataFrame.
Syntax:
rdd.toDF(schema)
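A compact sketch of both approaches; the column names id and name are illustrative assumptions.

# in Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "A"), (2, "B")])

df1 = spark.createDataFrame(rdd, schema=["id", "name"])   # Method 1
df2 = rdd.toDF(["id", "name"])                            # Method 2
df1.show()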
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container
and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders
of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode
makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to
see your application’s output immediately.
Understanding the difference requires an understanding of YARN's Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process starting the application can go away, and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both
driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that
starts the app doesn’t need to stick around for its entire lifetime.
The yarn-cluster mode, however, is not well suited to using Spark interactively. Spark applications that require user input, like
spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-
client mode, the Application Master is merely present to request executor containers from YARN.
Accumulator

Accumulator variables are used for aggregating the information through associative and commutative
operations. For example, you can use an accumulator for a sum operation or counters (in MapReduce).
The following code block has the details of an Accumulator class for PySpark.
class pyspark.Accumulator(aid, value, accum_param)
The following example shows how to use an Accumulator variable. An Accumulator variable has an attribute called value that is similar to what a broadcast variable has. It stores the data and is used to return the accumulator's value, but it is usable only in a driver program.
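A small PySpark sketch of an accumulator used as a counter, assuming a SparkContext sc:

# in Python
num_acc = sc.accumulator(0)

def add_to_acc(x):
    global num_acc
    num_acc += x          # worker tasks only add to the accumulator

sc.parallelize([1, 2, 3, 4, 5]).foreach(add_to_acc)
print(num_acc.value)      # 15, readable only in the driver program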
SPARK DEPLOYMENT
spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all respective cluster managers through a uniform interface. Therefore, you do not have to configure your application for each one.

Example

Let us take the same example of word count, we used before, using shell commands. Here, we consider
the same example as a spark application.
Sample Input
The following text is the input data and the file is named in.txt.
people are not as beautiful as they look, as
they walk or as they talk.
they are only as beautiful as they love, as
they care as they share.
Look at the following program −
SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
   def main(args: Array[String]) {

      val sc = new SparkContext( "local", "Word Count", "/usr/local/spark", Nil, Map(), Map())

      /* local = master URL; Word Count = application name; */
      /* /usr/local/spark = Spark Home; Nil = jars; Map = environment */
      /* Map = variables to work nodes */

      /* creating an inputRDD to read the text file (in.txt) through the Spark context */
      val input = sc.textFile("in.txt")

      /* Transform the inputRDD into countRDD */
      val count = input.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)

      /* saveAsTextFile is an action that is applied on the RDD */
      count.saveAsTextFile("outfile")

      System.out.println("OK")
   }
}
Save the above program into a file named SparkWordCount.scala and place it in a user-defined directory named spark-application.
Note − While transforming the inputRDD into countRDD, we are using flatMap() to tokenize the lines (from the text file) into words, the map() method to form (word, 1) pairs, and the reduceByKey() method to count each word's repetitions.
Use the following steps to submit this application. Execute all steps in the spark-application directory through the
terminal.
Step 1: Download the Spark core JAR
The Spark core jar is required for compilation; therefore, download spark-core_2.10-1.3.0.jar from the following link (Spark core jar) and move the jar file from the download directory to the spark-application directory.
Step 2: Compile program
Compile the above program using the command given below. This command should be executed
from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
is a Hadoop support jar taken from Spark library.
$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala
Step 3: Create a JAR
Create a JAR file of the Spark application using the jar command from the spark-application directory. Here, wordcount is the file name of the JAR file.

Cluster Managers
A cluster manager is a platform (cluster mode) where we can run Spark. Simply put, a cluster manager provides resources to all worker nodes as per need and operates all nodes accordingly. We can say there are a master node and worker nodes available in a cluster. The master node provides an efficient working environment to the worker nodes.

There are three types of Spark cluster managers. Spark supports these cluster managers:
Standalone Cluster Manager
Hadoop YARN
Apache Mesos

Apache Spark also supports pluggable cluster management. The main task of cluster manager is to
provide resources to all applications. We can say it is an external service for acquiring required
resources on the cluster.
1. Standalone Cluster Manager

It is a part of the Spark distribution and is available to us as a simple cluster manager. The Standalone cluster manager is resilient in nature; it can handle worker failures. It has the capability to manage resources according to the requirements of applications.
We can easily run it on Linux, Windows, or Mac. It can also access HDFS (Hadoop Distributed File System) data. This is the easiest way to run Apache Spark on a cluster. It also has high availability for the master.
Working with Standalone Cluster Manager

As we discussed earlier, the cluster manager has a master and some number of workers. It has available resources in the form of a configured amount of memory as well as CPU cores. In this cluster mode, Spark provides resources according to its cores; an application may grab all the cores available in the cluster by default. If the master crashes, a ZooKeeper quorum can help: it recovers the master using a standby master. We can also recover the master by using several file systems. All the applications we are working on have a web user interface. This interface keeps an eye on the cluster and on job statistics. It helps in providing several pieces of information on memory or running jobs. This cluster manager has detailed log output for every task performed. The Web UI can reconstruct the application's UI even after the application exits.

2. Hadoop Yarn

This cluster manager works as a distributed computing framework. It also maintains job scheduling as well as resource management. In this cluster, masters and slaves are highly available to us. Executors and a pluggable scheduler are also available. We can also run it on Linux and even on Windows. Hadoop YARN is also known as MapReduce 2.0. It separates the functionality of resource management and job scheduling.
Working with Hadoop Yarn Cluster Manager

Whenever a job request enters YARN's ResourceManager, YARN evaluates the number of resources available and
then places the job accordingly; YARN is the component that decides where the job should run. YARN operates at
a very large scale and is an evolutionary step of the MapReduce framework: it works as a resource-management
component, largely motivated by the need to scale Hadoop jobs, and Hadoop jobs can be optimized with its help.
YARN is aimed at short but fast Spark jobs and suits jobs that can be restarted easily if they fail; it is not an
ideal system for long-running services or for very short-lived queries, whose resource demands, execution models,
and architectural demands are different. YARN does not itself handle distributed file systems or databases,
although its large-scale scheduler handles different types of workloads. YARN is not a lightweight system, and it
struggles to support a growing number of concurrent workloads.
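For comparison, the same hypothetical application could be submitted to a YARN cluster by changing only the master and deploy mode; this sketch assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster configuration:
$ ./bin/spark-submit \
  --class SparkPi \
  --master yarn \
  --deploy-mode cluster \
  wordcount.jar
With --deploy-mode cluster the driver itself runs inside a YARN container, whereas client mode keeps the driver on the submitting machine.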
Accessing the Spark Logs - When a Spark job or application fails, you can use the Spark logs to analyze
the failures. The QDS UI provides links to the logs in the Application UI and Spark Application UI.

 If you are running the Spark job or application from the Analyze page, you can access the
logs via the Application UI and Spark Application UI.
 If you are running the Spark job or application from the Notebooks page, you can access the logs
via the Spark Application UI.
Accessing the Application UI - To access the logs via the Application UI from the Analyze page of the QDS UI:

1. Note the command id, which is unique to the Qubole job or command.

2. Click on the down arrow on the right of the search bar.

3. Enter the command id in the Command Id field and click Apply.

Logs of any Spark job are displayed in the Application UI and Spark Application UI, which are
accessible from the Logs and Resources tabs. The information in these UIs can be used to trace any
information related to the command status.


4. Click on the Application UI hyperlink in the Logs tab or Resources tab.

The Hadoop MR application UI displays the following information:


 MR application master logs
 Total Mapper/Reducer tasks
 Completed/Failed/Killed/Successful tasks

Accessing the Spark Application UI - You can access the logs by using the Spark Application
UI from the Analyze page and Notebooks page.

From the Analyze page


1. From the Home menu, navigate to the Analyze page.

2. Note the command id, which is unique to the Qubole job or command.

3. Click on the down arrow on the right of the search bar.


4. Enter the command id in the Command Id field and click Apply.

5. Click on the Logs tab or Resources tab.

6. Click on the Spark Application UI hyperlink.

From the Notebooks page

1. From the Home menu, navigate to the Notebooks page.

2. Click on the Spark widget on the top right and click on Spark UI.

When you open the Spark UI from the Spark widget of the Notebooks page or from the Analyze page,
the Spark Application UI is displayed in a separate tab. The Spark Application UI displays the
following information:
 Jobs: The Jobs tab shows the total number of completed, succeeded, and failed jobs. It also shows
the number of stages that have succeeded for each job.
 Stages: The Stages tab shows the total number of completed and failed stages. To see more details
about a failed stage, click on the failed stage in the Description column.
 The Errors column shows the detailed error message for the failed tasks. Note the executor id and
the hostname to view details in the container logs; for more details about the error stack trace,
check the container logs.
 Storage: The Storage tab displays the cached data if caching is enabled.
 Environment: The Environment tab shows information about the JVM, Spark properties, system properties,
and classpath entries, which helps you find the value of a property used by the Spark cluster at runtime.
 Executors : The Executors tab shows the container logs. You can map the container logs using
the executor id and the hostname, which is displayed in the Stages tab.

Spark on Qubole provides the following additional fields in the Executors tab:

o Resident size/Container size: Displays the total physical memory used within the container
(which is the executor’s java heap + off heap memory) as Resident size, and the configured yarn
container size (which is executor memory + executor overhead) as Container size.
o Heap used/committed/max: Displays values corresponding to the executor’s java heap.

The Logs column shows the links to the container logs. Additionally, the number of tasks executed by
each executor is displayed, along with the number of active, failed, completed, and total tasks.
Spark Performance Tuning refers to the process of adjusting settings for the memory, cores, and instances used by the
system. This process helps ensure that Spark performs well and prevents resource bottlenecks in
Spark.
What is Data Serialization?
To reduce memory usage you might have to store Spark RDDs in serialized form. Data serialization also has a strong
influence on network performance. You can obtain good results in Spark performance by:
 Terminating jobs that run too long.
 Ensuring that jobs run on a precise execution engine.
 Using all resources efficiently.
 Improving the system's performance time.
Spark supports two serialization libraries, as follows:
 Java Serialization
 Kryo Serialization
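As a brief illustration of switching to Kryo (a minimal sketch, not from the original text; MyRecord is a hypothetical class used only to show registration):
import org.apache.spark.{SparkConf, SparkContext}

object KryoExample {
  // Hypothetical record type used only for illustration
  case class MyRecord(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoExample")
      // Switch from the default Java serializer to Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing full class names with every serialized object
      .registerKryoClasses(Array(classOf[MyRecord]))
    val sc = new SparkContext(conf)
    // ... build and cache RDDs of MyRecord here ...
    sc.stop()
  }
}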

What is Memory Tuning?


While tuning memory usage, three aspects stand out:
 The amount of memory used by your objects (ideally, the entire dataset should fit in memory).
 The overhead of garbage collection, which grows when there is a high turnover of objects.
 The cost of accessing those objects.

What is Data Structure Tuning?


One option to reduce memory consumption is to avoid Java features that add overhead. Here are a few ways
to do this:
 If the RAM size is less than 32 GB, set the JVM flag -XX:+UseCompressedOops. This makes object pointers
four bytes instead of eight (see the example command after this list).
 Avoid nested structures with many small objects and pointers.
 Instead of using strings for keys, use numeric IDs or enumerated objects.
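One way to pass this flag (a sketch; the application class and JAR are the hypothetical ones used earlier) is through the extra Java options of the driver and executors on the spark-submit command line:
$ ./bin/spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  --class SparkPi wordcount.jar
The same settings can also be placed in conf/spark-defaults.conf so they apply to every job.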

What is Garbage Collection Tuning?


To avoid the large "churn" related to the RDDs that the program has previously stored, the JVM discards
old objects in order to create space for new ones. However, by using data structures that feature fewer objects the cost is
greatly reduced; one example would be using an array of Ints instead of a LinkedList. Alternatively, you
could store objects in serialized form, so that there is only a single object (a byte array) per RDD partition.
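A useful first step in garbage collection tuning is measuring how often collection happens and how long it takes. A common way to do this (a sketch; the flags are standard options of Java 8-era JVMs) is to add GC logging to the executor Java options:
$ ./bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class SparkPi wordcount.jar
The GC messages then appear in each executor's stdout/stderr, which can be reached from the Executors tab described earlier.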

What is Memory Management?


An efficient memory use is essential to good performance. Spark uses memory mainly for storage and execution. Storage
memory is used to cache data that will be reused later. On the other hand, execution memory is used for computation in
shuffles, sorts, joins, and aggregations. Memory contention poses three challenges for Apache Spark:
 How to arbitrate memory between execution and storage?
 How to arbitrate memory across tasks running simultaneously?
 How to arbitrate memory across operators running within the same task?
Instead of statically reserving memory in advance, Spark can deal with memory contention when it arises by
forcing tasks or cached blocks to spill to disk.
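In Spark's unified memory model the boundary between execution and storage is soft and is governed by two settings. A sketch of how they might be adjusted (the values shown are the defaults, used purely for illustration):
$ ./bin/spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --class SparkPi wordcount.jar
spark.memory.fraction controls how much of the heap (after a fixed reserve) is shared by execution and storage, while spark.memory.storageFraction sets how much of that shared region is protected from eviction by execution.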
Stream Processing
Stream processing is the act of continuously incorporating new data to compute a result. In stream processing, the input
data is unbounded and has no predetermined beginning or end; it simply forms a series of events that arrive at the stream
processing system.
Streaming applications often need to join input data against a dataset written periodically by a batch job, and the output of
streaming jobs is often files or tables that are queried in batch jobs. Moreover, any business logic in your applications needs
to work consistently across streaming and batch execution.
Stream Processing Use Cases
We defined stream processing as the incremental processing of unbounded datasets, but that's a strange way to motivate a
use case; the following are some common use cases in practice.
Notifications and alerting
Probably the most obvious streaming use case involves notifications and alerting. Given some
series of events, a notification or alert should be triggered if some sort of event or series of events occurs.
Real-time reporting
Many organizations use streaming systems to run real-time dashboards that any employee can look at. We use these
dashboards to monitor total platform usage, system load, uptime, and even usage of new features as they are rolled out,
among other applications.
Incremental ETL
One of the most common streaming applications is to reduce the latency companies must endure while
retrieving information into a data warehouse; in short, "my batch job, but streaming."
Update data to serve in real time
Streaming systems are frequently used to compute data that gets served interactively by another application. For example, a
web analytics product such as Google Analytics might continuously track the number of visits to each page, and use a
streaming system to keep these counts up to date.
Real-time decision making
Real-time decision making on a streaming system involves analyzing new inputs and responding to them automatically
using business logic.

Stream processing also introduces a number of challenges, including:
 Processing out-of-order data based on application timestamps (also called event time)
 Maintaining large amounts of state
 Supporting high data throughput
 Processing each event exactly once despite machine failures
 Handling load imbalance and stragglers
 Responding to events at low latency
 Joining with external data in other storage systems
 Determining how to update output sinks as new events arrive
 Writing data transactionally to output systems
 Updating your application's business logic at runtime
Stream Processing Design Points
To support the stream processing challenges we described, including high throughput, low latency, and out-of-order data,
there are multiple ways to design a streaming system. We describe the most common design options here.
Record-at-a-Time Versus Declarative APIs
The simplest way to design a streaming API would be to just pass each event to the application and let it react using custom
code. This is the approach that many early streaming systems, such as Apache Storm, implemented, and it has an important
place when applications need full control over the processing of data.
Event Time Versus Processing Time
For the systems with declarative APIs, a second concern is whether the system natively supports event time. Event time is
the idea of processing data based on timestamps inserted into each record at the source, as opposed to the time when the
record is received at the streaming application (which is called processing time).
Event Time
Event time is an important topic to cover discretely because Spark's DStream API does not support processing information
with respect to event time. At a higher level, in stream-processing systems there are effectively two relevant times for each
event: the time at which it actually occurred (event time), and the time that it was processed or reached the
stream-processing system (processing time).
Event time
Event time is the time that is embedded in the data itself. It is most often, though not required to be, the time that an event
actually occurs. This is important to use because it provides a more robust way of comparing events against one another.
The challenge here is that event data can be late or out of order. This means that the stream processing system must be able
to handle out-of-order or late data.
Processing time
Processing time is the time at which the stream-processing system actually receives data. This is usually less important
than event time because when it’s processed is largely an implementation detail. This can’t ever be out of order because it’s
a property of the streaming system at a certain time (not an external system like event time).
The fundamental idea is that the order of the series of events in the processing system does not guarantee an ordering in
event time. This can be somewhat unintuitive, but is worth reinforcing. Computer networks are unreliable. That means that
events can be dropped, slowed down, repeated, or be sent without issue. Because individual events are not guaranteed to
suffer one fate or the other, we must acknowledge that any number of things can happen to these events on the way from
the source of the information to our stream processing system. For this reason, we need to operate on event time and look at
the overall stream with reference to this information contained in the data rather than on when it arrives in the system. This
means that we hope to compare events based on the time at which those events occurred.
Stateful processing
Stateful processing is only necessary when you need to use or update intermediate information (state) over longer periods
of time (in either a microbatch or a record-at-a-time approach). This can happen when you are using event time or when
you are performing an aggregation on a key, whether that involves event time or not. For the most part, when you're
performing stateful operations, Spark handles all of this complexity for you. For example, when you specify a grouping,
Structured Streaming maintains and updates the information for you; you simply specify the logic. When performing a
stateful operation, Spark stores the intermediate information in a state store. Spark's current state store implementation is
an in-memory state store that is made fault tolerant by storing intermediate state to the checkpoint directory.
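As a minimal sketch of this (the socket source, host, port, and checkpoint path are assumptions chosen for illustration), a streaming word count keeps its running counts as state, and the checkpoint location is what makes that state fault tolerant:
import org.apache.spark.sql.SparkSession

object StatefulCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StatefulCountExample").getOrCreate()
    import spark.implicits._

    // Unbounded input: lines arriving on a socket (host and port are placeholders)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Stateful aggregation: Structured Streaming maintains the running counts for us
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      // Intermediate state is checkpointed here for fault tolerance
      .option("checkpointLocation", "/tmp/stateful-count-checkpoint")
      .start()
      .awaitTermination()
  }
}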

Arbitrary Stateful Processing


The stateful processing capabilities described above are sufficient to solve many streaming problems. However, there are
times when you need fine-grained control over what state should be stored, how it is updated, and when it should be
removed, either explicitly or via a time-out. This is called arbitrary (or custom) stateful processing and Spark allows you to
essentially store whatever information you like over the course of the processing of a stream. This provides immense
flexibility and power and allows for some complex business logic to be handled quite easily. Just as we did before, let’s
ground this with some examples:
 You'd like to record information about user sessions on an ecommerce site. For instance, you might want to track what
pages users visit over the course of a session in order to provide recommendations in real time during their next session.
Naturally, these sessions have completely arbitrary start and stop times that are unique to that user.
 Your company would like to report on errors in the web application, but only if five events occur during a user's session.
You could do this with count-based windows that only emit a result if five events of some type occur.
 You'd like to deduplicate records over time. To do so, you're going to need to keep track of every record that you see
before deduplicating it.
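A minimal sketch of the first example (tracking which pages each user visits in a session) using Spark's mapGroupsWithState; the PageView and Session types, the socket source, and the 30-minute time-out are assumptions made for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object SessionTrackingExample {
  // Hypothetical input and state types
  case class PageView(userId: String, page: String)
  case class Session(userId: String, pages: Seq[String])

  // Called once per trigger for every user that has new events (or a time-out)
  def updateSession(userId: String,
                    events: Iterator[PageView],
                    state: GroupState[Session]): Session = {
    val previous = state.getOption.getOrElse(Session(userId, Seq.empty))
    val updated = previous.copy(pages = previous.pages ++ events.map(_.page))
    state.update(updated)                   // keep the session as explicit state
    state.setTimeoutDuration("30 minutes")  // age out idle sessions via a time-out
    updated                                 // a fuller version would also handle state.hasTimedOut
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SessionTrackingExample").getOrCreate()
    import spark.implicits._

    // Input lines of the form "userId,page" from a placeholder socket source
    val views = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()
      .as[String]
      .map { line => val Array(u, p) = line.split(","); PageView(u, p) }

    val sessions = views
      .groupByKey(_.userId)
      .mapGroupsWithState[Session, Session](GroupStateTimeout.ProcessingTimeTimeout())(updateSession)

    sessions.writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .option("checkpointLocation", "/tmp/session-checkpoint")
      .start()
      .awaitTermination()
  }
}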
Tumbling Windows

In a tumbling window, tuples are grouped in a single window based on time or count. A tuple belongs
to only one window.

For example, consider a time-based tumbling window with a length of five seconds. The first window (w1)
contains events that arrived between the zeroth and fifth seconds. The second window (w2) contains events
that arrived between the fifth and tenth seconds, and the third window (w3) contains events that arrived
between the tenth and fifteenth seconds. The tumbling window is evaluated every five seconds, and none of
the windows overlap; each segment represents a distinct time segment.

An example would be to compute the average price of a stock over the last five minutes, computed
every five minutes.
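A sketch of this in Structured Streaming (the trades DataFrame, with an event-time column timestamp, a symbol column, and a price column, is a hypothetical input):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, window}

// Tumbling five-minute windows: window() with no slide duration produces non-overlapping windows
def fiveMinuteAverages(trades: DataFrame): DataFrame =
  trades
    .groupBy(window(trades("timestamp"), "5 minutes"), trades("symbol"))
    .agg(avg("price").as("avg_price"))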
Handling Late Data with Watermarks
The preceding examples are great, but they have a flaw. We never specified how late
we expect to see data. This means that Spark is going to need to store that
intermediate data forever because we never specified a watermark, or a time at which
we don’t expect to see any more data. This applies to all stateful processing that
operates on event time. We must specify this watermark in order to age-out data in
the stream (and, therefore, state) so that we don’t overwhelm the system over a long
period of time.
Concretely, a watermark is an amount of time following a given event or set of events
after which we do not expect to see any more data from that time. We know this can
happen due to delays on the network, devices that lose a connection, or any number
of other issues. In the DStreams API, there was no robust way to handle late data in
this way—if an event occurred at a certain time but did not make it to the processing
system by the time the batch for a given window started, it would show up in other
processing batches.
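Continuing the same hypothetical stock-price sketch, the watermark is declared on the event-time column before the windowed aggregation; the assumption here is that data arriving more than ten minutes late can safely be dropped:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, window}

def fiveMinuteAveragesWithWatermark(trades: DataFrame): DataFrame =
  trades
    // State for windows older than (latest event time seen - 10 minutes) can be aged out
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(trades("timestamp"), "5 minutes"), trades("symbol"))
    .agg(avg("price").as("avg_price"))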
