PYSPARK Interview Questions
Q. What is PySpark?
This is almost always the first PySpark interview question you will face.
PySpark is the Python API for Apache Spark. It provides the facility to read data from multiple sources which have different data formats, and, with the help of the Py4j library, we can also interface with RDDs (Resilient Distributed Datasets) in a simple way. Classic Hadoop, by contrast, processes data on disk in a MapReduce fashion, which is not efficient for iterative workloads.
PySpark also ships MLlib, its machine learning library, whose main submodules include:
1. spark.mllib
2. mllib.clustering
3. mllib.classification
4. mllib.regression
5. mllib.recommendation
6. mllib.linalg
7. mllib.fpm
Q. What is PySpark SparkFiles?
PySpark SparkFiles is used to load our files onto the Apache Spark application. It is one of the functions under SparkContext that is called using sc.addFile to load the files on Apache Spark. SparkFiles can also be used to get the path using SparkFiles.get or to resolve the paths to files that were added from sc.addFile. The class methods present in the SparkFiles directory are get(filename) and getRootDirectory(), and the added files become available on every node of the cluster.
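A minimal sketch of how these helpers are typically used (the local file path is hypothetical):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFilesDemo")

# Ship a file to every node in the cluster (hypothetical path).
sc.addFile("/tmp/lookup.csv")

# Resolve the node-local copy of the file and the directory holding all added files.
print(SparkFiles.get("lookup.csv"))
print(SparkFiles.getRootDirectory())

sc.stop()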
We run the following code whenever we want to create a SparkConf:
class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)
PySpark StorageLevel is used to decide where an RDD will be stored (in memory, on disk, or both), whether it is kept serialized or deserialized, and how many times its partitions are replicated:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
PySpark SparkJobInfo is used to gain information about the SparkJobs that are in execution. The code for using the SparkJobInfo is as follows:
class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):
Similarly, PySpark SparkStageInfo is used to gain information about the SparkStages that are present at that time.
Q. What is the difference between an RDD, a DataFrame, and a DataSet?
RDD-
● It is Spark's fundamental data structure: the data in an RDD is partitioned and efficiently distributed across the nodes of the cluster.
● It's useful when you need to do low-level transformations, operations, and control on a dataset.
● It's more commonly used to alter data with functional programming structures than with domain-specific expressions.
DataFrame-
● It allows the structure, i.e., rows and columns, to be seen. You can think of it as a database table.
● Optimized Execution Plan- The catalyst analyzer is used to create query plans.
● One of the limitations of dataframes is compile-time type safety, i.e., when the structure of the data is not known at compile time, no type checking of the data can be guaranteed.
● Also, if you're working in Python, start with DataFrames and then switch to RDDs if you need more flexibility.
DataSet-
● It has the best encoding component and, unlike DataFrames, it enables compile-time type safety in a well-structured way.
● If you want a greater level of type safety at compile-time, or if you want typed JVM objects, Datasets are the way to go. Datasets also let you take advantage of Catalyst optimization and benefit from Tungsten's efficient code generation.
The toDF() function of PySpark RDD is used to construct a DataFrame from an existing RDD. The DataFrame is constructed with the default column names "_1" and "_2" to represent the two columns, because an RDD has no named columns.
dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()
Here, the printSchema() method gives you the DataFrame schema with the default column names-
root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
Use the toDF() function with column names as parameters to pass column names to the DataFrame.
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()
The above code snippet gives you the DataFrame schema with the specified column names-
root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)
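A self-contained sketch of the whole flow; the sample records are assumed, not taken from the original:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toDFExample").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]   # assumed sample data
rdd = spark.sparkContext.parallelize(data)

dfFromRDD1 = rdd.toDF()                              # default column names _1, _2
dfFromRDD2 = rdd.toDF(["language", "users_count"])   # explicit column names
dfFromRDD2.printSchema()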
The StructType and StructField classes in PySpark are used to define the schema of the DataFrame and create complex columns such as nested struct, array, and map columns. StructType is a collection of StructField objects that determines column name, column data type, field nullability, and metadata. PySpark refers to such complex columns as "struct."
● To define the columns, PySpark offers the pyspark.sql.types import StructField class, which has the column name (String), column type (DataType), nullable column (Boolean), and metadata (MetaData).
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
    .appName('ProjectPro') \
    .getOrCreate()

data = [("James","","William","36636","M",3000),
    ("Michael","Smith","","40288","M",4000),
    ("Robert","","Dawson","42114","M",4000),
    ("Maria","Jones","","39192","F",4000)
]

schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id",StringType(),True), \
    StructField("gender",StringType(),True), \
    StructField("salary",IntegerType(),True) \
])

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)
Q. What are the different ways to handle row duplication in a PySpark DataFrame?
There are two ways to handle row duplication in PySpark dataframes. The distinct() function in PySpark is used to drop/remove duplicate rows (all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or several selected columns.
Here’s an example showing how to utilize the distinct() and dropDuplicates() methods-
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
df = spark.createDataFrame(data=data, schema=column)
df.printSchema()
df.show(truncate=False)
Output-
The record with the employee name Robert contains duplicate rows in the table above. As we can see, there are two rows with duplicate values in all fields and four rows with duplicate values in the department and salary columns.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
]
column= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=column)
df.printSchema()
df.show(truncate=False)
#Distinct
distinctDF = df.distinct()
distinctDF.show(truncate=False)
#Drop duplicates
df2 = df.dropDuplicates()
df2.show(truncate=False)
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)
Q. Explain PySpark UDF with the help of an example.
The most important aspect of Spark SQL & DataFrame is PySpark UDF (i.e., User Defined
Function), which is used to expand PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in traditional databases: we write a Python function and wrap it in PySpark SQL udf() or register it as a udf, then use it on a DataFrame or in SQL, respectively.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
column = ["Seqno","Name"]
df = spark.createDataFrame(data=data,schema=column)
df.show(truncate=False)
Output-
2. The next step is creating a Python function. The code below generates the
convertCase() method, which accepts a string parameter and turns every word's first letter into uppercase.
def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
By passing the function to PySpark SQL udf(), we can convert the convertCase() function into a UDF, as in the sketch below.
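A short sketch of that step, assuming the df (with columns Seqno and Name) and the convertCase() function defined above; the view name NAME_TABLE is only illustrative:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Wrap the Python function as a PySpark UDF.
convertUDF = udf(lambda z: convertCase(z), StringType())

# Use it on a DataFrame column.
df.select(col("Seqno"), convertUDF(col("Name")).alias("Name")).show(truncate=False)

# Register the same function for use in Spark SQL.
spark.udf.register("convertUDF", convertCase, StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE").show(truncate=False)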
PySpark map or the map() function is an RDD transformation that generates a new RDD by applying a 'lambda', i.e. a transformation function, to each element of the source RDD.
RDD map() transformations are used to perform complex operations such as adding a column, changing a column, converting data, and so on. Map transformations always produce the same number of records as the input.
records = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"]
rdd=spark.sparkContext.parallelize(records)
map(f, preservesPartitioning=False)
● We are adding a new element having value 1 for each element in this PySpark map()
example, and the output of the RDD is PairRDDFunctions, which has key-value pairs,
where we have a word (String type) as Key and 1 (Int type) as Value.
rdd2=rdd.map(lambda x: (x,1))
for element in rdd2.collect():
    print(element)
Output-
Joins in PySpark are used to join two DataFrames together, and by linking them together,
one may join several DataFrames. INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT
ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types it
supports.
‘how’: default inner (options are inner, cross, outer, full, full outer, left, left outer, right, right outer, left semi, and left anti).
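A minimal, self-contained sketch of the join() API with the 'how' parameter; the sample data is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joinExample").getOrCreate()
empDF = spark.createDataFrame([(1, "Smith", 10), (2, "Rose", 20)], ["emp_id", "name", "dept_id"])
deptDF = spark.createDataFrame([(10, "Finance"), (30, "IT")], ["dept_id", "dept_name"])

# Inner join (the default) and a left outer join on the same condition.
empDF.join(deptDF, empDF.dept_id == deptDF.dept_id, "inner").show()
empDF.join(deptDF, empDF.dept_id == deptDF.dept_id, "leftouter").show()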
PySpark ArrayType is a collection data type that extends PySpark's DataType class, which is the superclass for all types. All items in an ArrayType column should be of the same type. The ArrayType() constructor accepts two arguments: elementType and one optional argument containsNull, which specifies whether the array can hold null values and is set to True by default. elementType should extend the DataType class.
arrayCol = ArrayType(StringType(),False)
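A small sketch (with assumed sample data) of declaring an ArrayType column and exploding it into individual rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

spark = SparkSession.builder.appName("arrayTypeExample").getOrCreate()
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType(), True), True)
])
df = spark.createDataFrame([("James", ["Java", "Scala"]), ("Anna", ["Python"])], schema)

df.printSchema()
df.select("name", explode("languages").alias("language")).show()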
Using one or more partition keys, PySpark partitions a large dataset into smaller parts.
When we build a DataFrame from a file or table, PySpark creates the DataFrame in memory with a certain number of partitions. Operations on partitioned data run quicker since each partition's transformations are executed in parallel.
Partitioning in memory (DataFrame) and partitioning on disc (File system) are both
supported by PySpark.
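A brief sketch (assumed data and a hypothetical output path) contrasting in-memory partitioning with partitioning on disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionExample").getOrCreate()
df = spark.createDataFrame([("James", "CA"), ("Maria", "NY"), ("Robert", "CA")], ["name", "state"])

# Partitioning in memory (DataFrame): redistribute into 4 partitions.
df2 = df.repartition(4)
print(df2.rdd.getNumPartitions())

# Partitioning on disk (file system): one sub-directory per distinct state value.
df.write.partitionBy("state").mode("overwrite").parquet("/tmp/people_by_state")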
PySpark MapType accepts two mandatory parameters- keyType and valueType, and one optional parameter- valueContainsNull, which specifies whether the values may contain null and defaults to True.
Here's how to create a MapType with PySpark StructType and StructField. The StructType() accepts a list of StructFields, each of which takes a fieldname and a value type.
from pyspark.sql.types import StructType, StructField, StringType, MapType
schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])
dataDictionary = [
('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'}),
('Washington',{'hair':'grey','eye':'grey'}),
('Jefferson',{'hair':'brown','eye':''})
]
df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
Output-
Q. How can PySpark DataFrame be converted to Pandas
DataFrame?
First, you need to learn the difference between the PySpark and Pandas. The key difference
between Pandas and PySpark is that PySpark's operations are quicker than Pandas'
because of its distributed nature and parallel execution over several cores and computers.
In other words, pandas use a single node to do operations, whereas PySpark uses several
computers.
You'll need to transfer the data back to Pandas DataFrame after processing it in PySpark so
that you can use it in Machine Learning apps or other Python programs.
Below are the steps to convert PySpark DataFrame into Pandas DataFrame-
2. The next step is to convert this PySpark dataframe into a Pandas dataframe using the toPandas() function. toPandas() gathers all records in a PySpark DataFrame and delivers them to the driver program; it should therefore only be used on a small subset of the data. When used on a very large dataset, it can fail with an out-of-memory error on the driver.
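A small sketch (with assumed sample data) of the conversion with toPandas():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toPandasExample").getOrCreate()
df = spark.createDataFrame([("James", 3000), ("Anna", 4000)], ["name", "salary"])

pandasDF = df.toPandas()    # collects every row to the driver as a pandas.DataFrame
print(pandasDF.head())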
Q. What is PySpark ArrayType? Explain with an example.
PySpark ArrayType is a data type for collections that extends PySpark's DataType class. The ArrayType() constructor accepts two arguments: elementType and one optional argument containsNull, which specifies whether the array can hold null values and is set to True by default. elementType should extend the DataType class.
arrayCol = ArrayType(StringType(),False)
The above example generates a string array that does not allow null values.
PySpark pivot() is used to rotate/transpose data from one column into multiple DataFrame columns, and the reverse can be done using an unpivot (stack) expression. Pivot() is an aggregation in which the values of one of the grouping columns are transposed into separate columns holding distinct data.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

data = [("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]
columns= ["Product","Amount","Country"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output-
To determine the entire amount of each product's exports to each nation, we'll group by the Product column, pivot on Country, and sum the Amount:
pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.printSchema()
pivotDF.show(truncate=False)
This will convert the nations from DataFrame rows to columns, resulting in the output seen below.
Q. What do you mean by "broadcast variables" in PySpark? Explain with an example.
Broadcast variables in PySpark are read-only shared variables that are stored and
accessible on all nodes in a cluster so that processes may access or use them. Instead of
sending this information with each job, PySpark uses efficient broadcast algorithms to distribute broadcast variables across the machines, reducing communication costs. The cached value is read through the value attribute:
broadcastVariable.value
states = {"NY":"New York", "CA":"California", "FL":"Florida"}
broadcastStates = spark.sparkContext.broadcast(states)
data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]
rdd = spark.sparkContext.parallelize(data)
def state_convert(code):
    return broadcastStates.value[code]

res = rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).collect()
print(res)
states = {"NY":"New York", "CA":"California", "FL":"Florida"}
broadcastStates = spark.sparkContext.broadcast(states)
data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","William","USA","CA"),
("Maria","Jones","USA","FL")
]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
def state_convert(code):
    return broadcastStates.value[code]

res = df.rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).toDF(columns)
res.show(truncate=False)
Q. Suppose you have a cluster of ten nodes with 24 CPU cores each. The following code works, but it may crash on huge data sets, or at the very least, it may not take advantage of the cluster's full processing capabilities. Which aspect is the most problematic, and how would you fix it?
The repartition command creates ten partitions regardless of how many were originally loaded. On large datasets, these partitions might get fairly huge, and they'll almost certainly outgrow the memory allocated to a single executor.
In addition, each partition can be processed by only one core at a time. This means that just ten of the 240 available cores are engaged (10 nodes with 24 cores each), so most of the cluster sits idle.
If, on the other hand, the number of partitions is set exceptionally high, the scheduler's cost in handling all the partitions grows, lowering performance. It may even exceed the actual processing time in some circumstances.
The optimal number of partitions is between two and three times the number of executors.
RDD[UserActivity] = { sparkSession.sparkContext.parallelize(
The primary function, calculate, reads two pieces of data. (They are given in this case from
parallelize.) Each of them is transformed into a tuple by the map, which consists of a userId
and the item itself. To combine the two datasets, the userId is utilised.
All users' login actions are filtered out of the combined dataset, keeping the uName and the event details.
This is eventually reduced down to merely the initial login record per user, which is then sent
to the console.
Q. Two DataFrames hold user details (keyed by userId) and user activity records with a userId and an eventType. Join the two dataframes using code and count the number of events per uName.
.repartition(col(UIdColName)) // ??????????????? .
.repartition(col(UIdColName))
.join(userActivityRdd, UIdColName)
.select(col(UNameColName))
.groupBy(UNameColName)
.count()
.withColumnRenamed("count", CountColName)
result.show()
Q. In the following code, which parts will run on the master and which parts will run on each worker node?
DateTimeFormatter.ofPattern("yyyy/MM") def
getEventCountOnWeekdaysPerMonth(data:
DayOfWeek.SATURDAY.getValue) . map(mapDateTime2Date)
The driver application is responsible for calling this function. The DAG is defined by the assignment to the result value, and its execution is initiated by the collect() operation. The worker nodes handle the actual data processing (including the logic of the method mapDateTime2Date). Because the result value that is gathered on the master is an array, any further processing of that array happens on the driver.
Q. What are the elements used by the GraphX library, and how are they represented?
RDD[???[PageReference]] =
(1,1.4537951595091907) (2,0.7731024202454048)
(3,0.7731024202454048)
Vertex, and Edge objects are supplied to the Graph object as RDDs of type RDD[VertexId,
VT] and RDD[Edge[ET]] respectively (where VT and ET are any user-defined types
associated with a given Vertex or Edge). For Edge type, the constructor is Edge[ET](srcId:
VertexId, dstId: VertexId, attr: ET). VertexId is just an alias for Long.
Q. Under what scenarios are Client and Cluster modes used for
deployment?
● Cluster mode should be utilized for deployment if the client computers are not near
the cluster. This is done to prevent the network delay that would occur in Client
mode while communicating between executors. In case of Client mode, if the
machine goes offline, the entire operation is lost.
● Client mode can be utilized for deployment if the client computer is located within
the cluster. There will be no network latency concerns because the computer is part
of the cluster, and the cluster's maintenance is already taken care of, so there is no
need to be concerned in the event of a failure.
● Data processing: Only batch-wise data processing is done using MapReduce, whereas Apache Spark can handle data in both real-time and batch mode.
● Data retrieval: In Hadoop, the data is stored in HDFS (Hadoop Distributed File System), which takes a long time to retrieve, while Spark saves data in memory (RAM), making data retrieval quicker and faster when needed.
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("sample_file.txt")
isExist = lines.map(keywordExists)
result = isExist.reduce(lambda x, y: x + y)
Spark executors have the same fixed core count and heap size as the applications created
in Spark. The heap size relates to the memory used by the Spark executor, which is
controlled by the -executor-memory flag's property spark.executor.memory. On each worker
node where Spark operates, one executor is assigned to it. The executor memory is a
measurement of the memory utilized by the application's worker node.
The core engine for large-scale distributed and parallel data processing is SparkCore. The
distributed execution engine in the Spark core provides APIs in Java, Python, and Scala for
constructing distributed ETL applications.
Memory management, task monitoring, fault tolerance, storage system interactions, work
scheduling, and support for all fundamental I/O activities are all performed by Spark Core.
Additional libraries on top of Spark Core enable a variety of SQL, streaming, and machine
learning applications.
Q. What are the drawbacks of using Apache Spark in applications?
Despite the fact that Spark is a strong data processing engine, there are certain drawbacks
to utilizing it in applications.
Q. How can the data transfers involved in shuffling be kept to a minimum while working with PySpark?
The process of shuffling corresponds to data transfers. Spark applications run quicker and
more reliably when these transfers are minimized. There are quite a number of approaches
that may be used to reduce them. They are as follows:
● Using broadcast variables improves the efficiency of joining big and small RDDs.
● Accumulators are used to update variable values in a parallel manner during
execution.
● Another popular method is to prevent operations that cause these reshuffles.
Q. What is the difference between sparse and dense vectors?
Sparse vectors are made up of two parallel arrays, one for indexing and the other for storing
values. These vectors are used to save space by storing non-zero values. E.g.- val
sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
The vector in the above example is of size 5, but the non-zero values are only found at
indices 0 and 4.
When there are just a few non-zero values, sparse vectors come in handy. If there are just a
few zero values, dense vectors should be used instead of sparse vectors, as sparse vectors
would create indexing overhead, which might affect performance.
The usage of sparse or dense vectors has no effect on the outcomes of calculations, but
when they are used incorrectly, they have an influence on the amount of memory needed
and the calculation time.
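The PySpark equivalent of the Scala example above can be sketched with pyspark.ml.linalg.Vectors:

from pyspark.ml.linalg import Vectors

sparse_vec = Vectors.sparse(5, [0, 4], [1.0, 2.0])     # size 5, non-zeros at indices 0 and 4
dense_vec = Vectors.dense([1.0, 0.0, 0.0, 0.0, 2.0])   # the same values stored densely
print(sparse_vec.toArray(), dense_vec)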
The partition of a data stream's contents into batches of X seconds, known as DStreams, is
the basis of Spark Streaming. These DStreams allow developers to cache data in memory,
which may be particularly handy if the data from a DStream is utilized several times. The
cache() function or the persist() method with proper persistence settings can be used to
cache data. For input streams receiving data through networks such as Kafka, Flume, and
others, the default persistence level setting is configured to achieve data replication on two
nodes to achieve fault tolerance.
● Cache method- caches the DStream using the default storage level.
● Persist method- caches the DStream using a storage level that you specify explicitly (see the sketch below).
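A brief sketch (hypothetical host and port) of both methods on a DStream:

from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamCacheDemo")
ssc = StreamingContext(sc, 10)                       # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)      # hypothetical network source

cached = lines.cache()                               # cache method: default storage level
mapped = lines.map(lambda l: l.upper())
persisted = mapped.persist(StorageLevel.MEMORY_AND_DISK)   # persist method: explicit storage level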
Spark RDD is extended with a robust API called GraphX, which supports graphs and graph-
based calculations. The Resilient Distributed Property Graph is an enhanced property of
Spark RDD that is a directed multi-graph with many parallel edges. User-defined
characteristics are associated with each edge and vertex. Multiple connections between the
same set of vertices are shown by the existence of parallel edges. GraphX offers a
collection of operators that can allow graph computing, such as subgraph,
mapReduceTriplets, joinVertices, and so on. It also offers a wide number of graph builders
and algorithms for making graph analytics chores easier.
According to the UNIX Standard Streams, Apache Spark supports the pipe() function on
RDDs, which allows you to assemble distinct portions of jobs that can use any language.
The RDD transformation may be created using the pipe() function, and it can be used to
read each element of the RDD as a String. These may be altered as needed, and the results
can be presented as Strings.
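A minimal sketch of pipe(), streaming each element through the standard Unix cat command and reading the output back as strings:

from pyspark import SparkContext

sc = SparkContext("local", "pipeDemo")
rdd = sc.parallelize(["one", "two", "three"])

piped = rdd.pipe("cat")      # any external script or shell command can be used here
print(piped.collect())       # ['one', 'two', 'three']
sc.stop()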
Q. What are the various levels of persistence that exist in PySpark?
Spark automatically saves intermediate data from various shuffle processes. However, it is
advised to use the RDD's persist() function. There are many levels of persistence for storing
RDDs on memory, disc, or both, with varying levels of replication. The following are the
persistence levels available in Spark:
● MEMORY ONLY: This is the default persistence level, and it's used to save RDDs on
the JVM as deserialized Java objects. In the event that the RDDs are too large to fit
in memory, the partitions are not cached and must be recomputed as needed.
● MEMORY AND DISK: On the JVM, the RDDs are saved as deserialized Java objects.
In the event that memory is inadequate, partitions that do not fit in memory will be
kept on disc, and data will be retrieved from the drive as needed.
● MEMORY ONLY SER: The RDD is stored as serialized Java objects with a one-byte array per partition.
● DISK ONLY: RDD partitions are only saved on disc.
● OFF HEAP: This level is similar to MEMORY ONLY SER, except that the data is saved
in off-heap memory.
The persist() function has the following syntax for employing persistence levels:
df.persist(StorageLevel.<level>)   # for example, df.persist(StorageLevel.MEMORY_AND_DISK)
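A short usage sketch of persist() with an explicit storage level:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistDemo").getOrCreate()
df = spark.range(0, 100)

df.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill to disk if needed
print(df.count())                          # the first action materializes and caches the data
df.unpersist()                             # release the storage when it is no longer needed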
As an example, consider a cluster with:
No. of nodes = 10
No. of cores per node = 15
No. of cores per executor (i.e., how many concurrent tasks the executor can handle) = 5
No. of executors per node = 15/5
= 3
Total no. of executors = 10 * 3
= 30
Q. How would you compute the total count of unique words in Spark?
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/sample_file.txt");
2. Define a function that breaks each line into words:
def toWords(line):
    return line.split();
3. As a flatMap transformation, run the toWords function on each item of the RDD in Spark:
words = lines.flatMap(toWords);
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1);
wordsTuple = words.map(toTuple);
5. Perform the reduceByKey() operation:
def sum(x, y):
    return x+y;
counts = wordsTuple.reduceByKey(sum)
6. Print:
counts.collect()
Q. What do you understand by Resilient Distribution Datasets (RDDs), and how are they created?
Resilient Distribution Datasets (RDD) are a collection of fault-tolerant functional units that
may run simultaneously. RDDs are data fragments that are maintained in memory and
spread across several nodes. In an RDD, all partitioned data is distributed and consistent.
1. Hadoop datasets- Those datasets that apply a function to each file record in the
Hadoop Distributed File System (HDFS) or another file storage system.
2. Parallelized Collections- Created by calling SparkContext's parallelize() method on an existing collection in the driver program, so that its elements can be operated on in parallel (see the sketch below).
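A compact sketch of both creation routes; the HDFS path is hypothetical:

from pyspark import SparkContext

sc = SparkContext("local", "rddCreation")

parallel_rdd = sc.parallelize([1, 2, 3, 4, 5])                  # parallelized collection
file_rdd = sc.textFile("hdfs://namenode:9000/user/data.txt")    # Hadoop/external dataset

print(parallel_rdd.count())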
Q. Explain the custom profilers of PySpark.
PySpark allows you to create custom profiles that may be used to build predictive models.
In general, profilers are calculated using the minimum and maximum values of each
column. It is utilized as a valuable data review tool to ensure that the data is accurate and
appropriate for future usage.
● Avoid dictionaries: If you use Python data types like dictionaries, your code might not
be able to run in a distributed manner. Consider adding another column to a
dataframe that may be used as a filter instead of utilizing keys to index entries in a
dictionary. This proposal also applies to Python types that aren't distributable in
PySpark, such as lists.
● Limit the use of Pandas: using toPandas causes all data to be loaded into memory
on the driver node, preventing operations from being run in a distributed manner.
When data has previously been aggregated, and you wish to utilize conventional
Python plotting tools, this method is appropriate, but it should not be used for larger
dataframes.
● Minimize eager operations: It's best to avoid eager operations that draw whole
dataframes into memory if you want your pipeline to be as scalable as possible.
Reading in CSVs, for example, is an eager activity, thus I stage the dataframe to S3
as Parquet before utilizing it in further pipeline steps.
PySpark provides the facility needed to upload our files to Apache Spark. This is accomplished by using sc.addFile, where 'sc' stands for SparkContext. We use SparkFiles.get to acquire a file's path on a worker and SparkFiles.getRootDirectory() to acquire the directory path holding the added files.
We use the following methods in SparkFiles to resolve the path to the files added using
SparkContext.addFile():
● get(filename),
● getrootdirectory()
Q. Explain SparkConf and its most important features.
SparkConf aids in the setup and settings needed to execute a spark application locally or in a cluster. To put it another way, it offers configurations to run a Spark application. The following are some of SparkConf's most commonly used attributes:
● set(key, value) - sets a configuration property.
● setMaster(value) - sets the master URL.
● setAppName(value) - sets the application name.
● get(key, defaultValue=None) - gets the configuration value of a key.
● setSparkHome(value) - sets the Spark installation path on worker nodes.
The primary difference between lists and tuples is that lists are mutable, but tuples are
immutable.
When a Python object may be edited, it is considered to be a mutable data type. Immutable
data types, on the other hand, cannot be changed.
list_num = [1, 2, 5, 6]
tup_num = (1, 2, 5, 6)

list_num[3] = 7
print(list_num)
tup_num[3] = 7

Output:
[1,2,5,7]
TypeError: 'tuple' object does not support item assignment
We assigned 7 to list_num at index 3 in this code, and 7 is found at index 3 in the output.
However, we set 7 to tup_num at index 3, but the result returned a type error. Because of
their immutable nature, we can't change tuples.
Q. What are the different types of errors in Python?
There are two types of errors in Python: syntax errors and exceptions.
Syntax errors are frequently referred to as parsing errors. Errors are flaws in a program that
might cause it to crash or terminate unexpectedly. When a parser detects an error, it
repeats the offending line and then shows an arrow pointing to the line's beginning.
Exceptions arise in a program when the usual flow of the program is disrupted by an
external event. Even if the program's syntax is accurate, there is a potential that an error will
be detected during execution; nevertheless, this error is an exception. ZeroDivisionError,
TypeError, and NameError are some instances of exceptions.
PySpark is a Python API created and distributed by the Apache Spark organization to make
working with Spark easier for Python programmers. Scala is the programming language
used by Apache Spark. It can communicate with other languages like Java, R, and Python.
Also, because Scala is a compile-time, type-safe language, Apache Spark has several
capabilities that PySpark does not, one of which includes Datasets. Datasets are a highly
typed collection of domain-specific objects that may be used to execute concurrent
calculations.
SparkSession in PySpark
Spark 2.0 includes a new class called SparkSession (pyspark.sql import SparkSession).
It is a unified entry point that combines all of the many contexts we had prior to the 2.0 release (SQLContext, HiveContext, etc). Since version 2.0, SparkSession may replace SQLContext, HiveContext, and other contexts defined before version 2.0.
into the core PySpark technology and construct PySpark RDDs and DataFrames
programmatically. Spark is the default object in pyspark-shell, and it may be generated
programmatically with SparkSession.
In PySpark, we must use the builder pattern function builder() to construct SparkSession
programmatically (in a.py file), as detailed below. The getOrCreate() function retrieves an
already existing SparkSession or creates a new SparkSession if none exists.
spark=SparkSession.builder.master("local[1]") \
.appName('ProjectPro') \
.getOrCreate()
Py4J is a Java library integrated into PySpark that allows Python to actively communicate
with JVM instances. Py4J is a necessary module for the PySpark application to execute,
and it may be found in the $SPARK_HOME/python/lib/py4j-*-src.zip directory.
To execute the PySpark application after installing Spark, set the Py4j module to the
PYTHONPATH environment variable. We’ll get an ImportError: No module named
py4j.java_gateway error if we don't set this module to env.
export SPARK_HOME=/Users/abc/apps/spark-3.0.0-bin-hadoop2.7
export
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python
/lib/py4j-0.10.9-src.zip:$PYTHONPATH
The py4j module version changes depending on the PySpark version we’re using; to
configure this version correctly, follow the steps below:
Use the pip show command to see the PySpark location's path- pip show pyspark
Use the environment variables listed below to fix the problem on Windows-
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%/python;%SPARK_HOME%/python/lib/py4j-0.10.9-
src.zip;%PYTHONPATH%
Spark shell, PySpark shell, and Databricks all have the SparkSession object 'spark' by
default. However, if we are creating a Spark/PySpark application in a.py file, we must
manually create a SparkSession object by using builder to resolve NameError: Name 'Spark'
is not Defined.
# Import PySpark
import pyspark
from pyspark.sql import SparkSession
#Create SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
If you get the error message 'No module named pyspark', try using findspark instead-
#Install findspark
# pip install findspark
# Import findspark
import findspark
findspark.init()
#import pyspark
import pyspark
● Standalone- a simple cluster manager that comes with Spark and makes setting up a
cluster easier.
● Apache Mesos- Mesos is a cluster manager that can also run Hadoop MapReduce
and PySpark applications.
● Hadoop YARN- It is the Hadoop 2 resource management.
● Kubernetes- an open-source framework for automating containerized application
deployment, scaling, and administration.
● local – not exactly a cluster manager, but it's worth mentioning because we use
"local" for master() to run Spark on our laptop/computer.
● Reliable receiver: When data is received and copied properly in Apache Spark
Storage, this receiver validates data sources.
● Unreliable receiver: When receiving or replicating data in Apache Spark Storage,
these receivers do not recognize data sources.
● Speed: Spark runs almost 100 times faster than Hadoop MapReduce, which is slower when it comes to large-scale data processing.
● Data storage: Spark stores data in RAM, i.e. in-memory, so it is easier to retrieve, whereas Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve.
● Caching: Spark provides caching and in-memory data storage, while Hadoop is highly disk-dependent.
Apache Spark has 3 main categories that comprise its ecosystem. Those are:
Q. Explain how Spark runs applications with the help of its architecture.
This is one of the most frequently asked spark interview questions, and the interviewer
will expect you to give a thorough answer to it.
Spark applications run as independent processes that are coordinated by the
SparkSession object in the driver program. The resource manager or cluster manager
assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply
operations repeatedly to the data so they can benefit from caching datasets across
iterations. A task applies its unit of work to the dataset in its partition and outputs a new
partition dataset. Finally, the results are sent back to the driver application or can be
saved to the disk.
So far, if you have any doubts regarding the apache spark interview questions and
answers, please comment below.
Q. What makes Spark good at low latency workloads like graph processing and Machine
Learning?
Apache Spark stores data in-memory for faster processing and building machine
learning models. Machine Learning algorithms require multiple iterations and different
conceptual steps to create an optimal model. Graph algorithms traverse through all the
nodes and edges to generate a graph. These low latency workloads that need multiple
iterations can lead to increased performance.
Q. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
Automatic clean-ups can be triggered by setting the parameter spark.cleaner.ttl, or by splitting long-running jobs into batches and writing the intermediate results to disk.
There are a total of 4 steps that can help you connect Spark to Apache Mesos.
Q. What is shuffling in Spark? When does it occur?
Shuffling is the process of redistributing data across partitions that may lead to data
movement across the executors. The shuffle operation is implemented differently in
Spark compared to Hadoop.
It occurs while joining two tables or while performing byKey operations such as
GroupByKey or ReduceByKey
Suppose you want to read data from a CSV file into an RDD having four partitions.
This is how a filter operation is performed to remove all the multiple of 10 from the
data.
The RDD has some empty partitions. It makes sense to reduce the number of partitions,
which can be achieved by using coalesce.
This is how the resultant RDD would look after applying coalesce (a minimal runnable sketch of the whole flow follows below).
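A minimal sketch (with assumed stand-in data) of the flow just described:

from pyspark import SparkContext

sc = SparkContext("local[4]", "coalesceDemo")

rdd = sc.parallelize(range(1, 41), 4)            # stand-in for a CSV read into 4 partitions
filtered = rdd.filter(lambda x: x % 10 != 0)     # remove all multiples of 10
coalesced = filtered.coalesce(2)                 # reduce the number of partitions without a full shuffle

print(coalesced.getNumPartitions())              # 2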
Spark Core is the engine for parallel and distributed processing of large data sets. The
various functionalities supported by Spark Core include:
● Scheduling and monitoring jobs
● Memory management
● Fault recovery
● Task dispatching
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
          .select("_id", "first_name").toDF()
● Using SparkSession.createDataFrame
Actions: Actions are operations that return a value after running a computation on an
RDD (Example: reduce, first, count)
Q.What is a Lineage Graph?
The need for an RDD lineage graph happens when we want to compute a new RDD or if
we want to recover the lost data from the lost persisted RDD. Spark does not support
data replication in memory. So, if any data is lost, it can be rebuilt using RDD lineage. It
is also called an RDD operator graph or RDD dependency graph.
It represents a continuous stream of data that is either in the form of an input source or
processed data stream generated by transforming the input stream.
For input streams that receive data over the network (such as Kafka or Flume), the default persistence level is set to replicate the data to two nodes for fault-tolerance.
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used to give every
node a copy of a large input dataset in an efficient manner. Spark distributes broadcast
variables using efficient broadcast algorithms to reduce communication costs.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
So far, if you have any doubts regarding the spark interview questions for beginners,
please ask in the comment section below.
1. map(func)
2. transform(func)
3. filter(func)
4. count()
This is one of the most frequently asked spark interview questions where the
interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an
answer as possible here.
Yes, Apache Spark provides an API for adding and managing checkpoints.
Checkpointing is the process of making streaming applications resilient to failures. It
allows you to save the data and metadata into a checkpointing directory. In case of a
failure, the spark can recover this data and start from wherever it has stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the
metadata to fault-tolerant storage like HDFS. Metadata includes configurations,
DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because its need arises
in some of the stateful transformations. In this case, the upcoming RDD depends on the
RDDs of previous batches.
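A brief sketch (hypothetical checkpoint directory and socket source) of enabling checkpointing for a stateful streaming job:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "checkpointDemo")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("hdfs://namenode:9000/checkpoints")     # fault-tolerant checkpoint directory

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .updateStateByKey(lambda new, old: sum(new) + (old or 0))  # stateful op that requires checkpointing
counts.pprint()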
MEMORY_ONLY_SER - Stores the RDD as serialized Java objects with a one-byte array
per partition
MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is
not able to fit in the memory available, some partitions won’t be cached
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the
RDD is not able to fit in the memory, additional partitions are stored on the disk
Q. What is the difference between map and flatMap transformation in Spark Streaming?
● map(): The Spark map function takes one element as an input, processes it according to custom code (specified by the developer) and returns one element at a time.
● flatMap(): flatMap allows returning 0, 1, or more elements from the map function; in the flatMap operation, the returned sequences are flattened into a single output stream.
Q. How would you compute the total count of unique words in Spark?
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt");
2. Define a function that breaks each line into words:
def toWords(line):
    return line.split();
3. Run the toWords function on each element of the RDD as a flatMap transformation:
words = lines.flatMap(toWords);
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1);
wordsTuple = words.map(toTuple);
5. Perform the reduceByKey() operation:
def sum(x, y):
    return x+y;
counts = wordsTuple.reduceByKey(sum)
6. Print:
counts.collect()
Suppose you have a huge text file. How will you check if a particular keyword exists
using Spark?
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt");
def isFound(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0
foundBits = lines.map(isFound);
total = foundBits.reduce(lambda x, y: x + y);
if total > 0:
    print("Found")
else:
    print("Not Found")
Accumulators are variables used for aggregating information across the executors. This
information can be about the data or API diagnosis like how many records are corrupted
or how many times a library API was called.
Hope it is clear so far. Let us know which Apache Spark interview questions you were asked during your interview process.
Spark MLlib supports local vectors and matrices stored on a single machine, as well as
distributed matrices.
Local Vector: MLlib supports two types of local vectors - dense and sparse
Labeled point: A labeled point is a local vector, either dense or sparse that is associated
with a label/response.
Local Matrix: A local matrix has integer type row and column indices, and double type
values that are stored in a single machine.
Distributed Matrix: A distributed matrix has long-type row and column indices and
double-type values, and is stored in a distributed manner in one or more RDDs.
● RowMatrix
● IndexedRowMatrix
● CoordinatedMatrix
A Sparse vector is a type of local vector which is represented by an index array and a value array. Its class is declared as
public class SparseVector extends Object implements Vector
where:
● size is the size of the vector,
● indices is the array of indices of the non-zero entries, and
● values is the array holding the corresponding values.
Do you have a better example for this spark interview question? If yes, let us know.
Q. Describe how model creation works with MLlib and how the model is applied.
Spark MLlib lets you combine multiple transformations into a pipeline to apply complex
data transformations.
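A small, self-contained sketch (with assumed toy data) of combining transformations into a pipeline, fitting a model and applying it:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipelineDemo").getOrCreate()
training = spark.createDataFrame([
    (0, "spark is great", 1.0),
    (1, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)   # model creation
model.transform(training).select("id", "prediction").show()         # model application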
Spark SQL loads the data from a variety of structured data sources.
It queries data using SQL statements, both inside a Spark program and from external
tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
To connect Hive to Spark SQL, place the hive-site.xml file in the conf directory of Spark.
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Select only the "name" column
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+
// Select people older than 21
df.filter(df("age") > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// | 19| 1|
// |null| 1|
// | 30| 1|
// +----+-----+
Q. What are the different types of operators provided by the Apache GraphX library?
In such spark interview questions, try giving an explanation too (not just the name of the
operators).
Property Operator: Property operators modify the vertex or edge properties using a user-
defined map function and produce a new graph.
Structural Operator: Structure operators operate on the structure of an input graph and
produce a new graph.
Join Operator: Join operators add data to graphs and generate new graphs.
GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphX
includes a set of graph algorithms to simplify analytics tasks. The algorithms are
contained in the org.apache.spark.graphx.lib package and can be accessed directly as
methods on Graph via GraphOps.
Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with
an edge between them. GraphX implements a triangle counting algorithm in the
TriangleCount object that determines the number of triangles passing through each
vertex, providing a measure of clustering.
It is a plus point if you are able to explain this spark interview question thoroughly, along
with an example! PageRank measures the importance of each vertex in a graph,
assuming an edge from u to v represents an endorsement of v’s importance by u.
If a Twitter user is followed by many other users, that handle will be ranked high.
PageRank algorithm was originally developed by Larry Page and Sergey Brin to rank
websites for Google. It can be applied to measure the influence of vertices in any
network graph. PageRank works by counting the number and quality of links to a page
to determine a rough estimate of how important the website is. The assumption is that
more important websites are likely to receive more links from other websites.
A typical example of using Scala's functional programming with Apache Spark is the GraphX pageRank operator: calling graph.pageRank(0.0001).vertices iteratively computes and returns the rank of each vertex in the graph.
Q. Explain how an object is implemented in python?
Syntax:
<object-name> = <class-name>(<arguments>)
Example:
class Student:
    id = 25
    estb = 10
    def display(self):
        print("ID:", self.id)
        print("Estb:", self.estb)

stud = Student()
stud.display()
Output:
ID: 25
Estb: 10
Ans: In Python, a method is a function that is associated with an object. Any object type
can have methods.
Example:
class Student:
    roll = 17
    name = "gopal"
    age = 25
    def display(self):
        print(self.roll, self.name, self.age)
In the above example, a class named Student is created which contains three fields as
Student’s roll, name, age and a function “display()” which is used to display the
information of the Student.
Below is the example of encapsulation whereby the max price of the product cannot be
modified as it is set to 75 .
Example:
class Product:
    def __init__(self):
        self.__maxprice = 75
    def sell(self):
        print("Selling Price: {}".format(self.__maxprice))
    def setMaxPrice(self, price):
        self.__maxprice = price
p = Product()
p.sell()
p.__maxprice = 100
p.sell()
Output:
Selling Price: 75
Selling Price: 75
Ans: Inheritance refers to a concept where one class inherits the properties of another.
It helps to reuse the code and establish a relationship between different classes.
Parent class (Super or Base class): A class whose properties are inherited.
Child class (Subclass or Derived class): A class which inherits the properties.
In python, a derived class can inherit base class by just mentioning the base in the
bracket after the derived class name.
The syntax to inherit a base class into the derived class is shown below:
Syntax:
The syntax to inherit multiple classes is shown below by specifying all of them inside
the bracket.
Syntax:
Ans: A for loop in Python requires at least two variables to work. The first is the iterable
object such as a list, tuple or a string and second is the variable to store the successive
values from the sequence in the loop.
Syntax:
for iter in sequence:
    statements(iter)
The “iter” represents the iteration variable. It gets assigned with the successive values
from the input sequence.
The “sequence” may refer to any of the following Python objects such as a list, a tuple
or a string.
Example :
x = []
for i in x:
    print("in for loop")
else:
    print("in else block")
Output:
in else block
Ans: In Python, there are two types of errors - syntax error and exceptions.
Syntax Error: It is also known as parsing errors. Errors are issues in a program which
may cause it to exit abnormally. When an error is detected, the parser repeats the
offending line and then displays an arrow which points at the earliest point in the line.
Exceptions: Exceptions take place in a program when the normal flow of the program is
interrupted due to the occurrence of an external event. Even if the syntax of the program
is correct, there are chances of detecting an error during execution, this error is nothing
but an exception. Some of the examples of exceptions are - ZeroDivisionError, TypeError
and NameError.
Ans:
The key difference between lists and tuples is the fact that lists have mutable nature
and tuples have immutable nature.
It is said to be a mutable data type when a python object can be modified. On the other
hand, immutable data types cannot be modified. Let us see an example to modify an
item list vs tuple.
Example:
list_num = [1, 2, 5, 6]
tup_num = (1, 2, 5, 6)

list_num[3] = 7
print(list_num)
tup_num[3] = 7

Output:
[1,2,5,7]
TypeError: 'tuple' object does not support item assignment
In this code, we had assigned 7 to list_num at index 3 and in the output, we can see 7 is
found in index 3 . However, we had assigned 7 to tup_num at index 3 but we got type
error on the output. This is because we cannot modify tuples due to its immutable
nature.
Ans: The int() method provided by Python is a standard built-in function which converts a string into an integer value.
It can be called with a string containing a number as the argument, and it will return the
number converted to an actual integer.
Example:
print(int("1") + 2)
Output:
3
Ans: Data cleaning is the process of preparing data for analysis by removing or
modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly
formatted.
Q. What is Pyspark and explain its characteristics?
Ans: To support Python with Spark, the Spark community has released a tool called
PySpark. It is primarily used to process structured and semi-structured datasets and
also supports an optimized API to read data from the multiple data sources containing
different file formats. Using PySpark, you can also work with RDDs in the Python
programming language using its library name Py4j.
Unlike Hadoop, which is based on the MapReduce processing model, Spark is based on in-memory computation.
Q. Explain RDD and also state how you can create RDDs in Apache Spark.
Ans: RDD stands for Resilient Distribution Datasets, a fault-tolerant set of operational
elements that are capable of running in parallel. These RDDs, in general, are the portions
of data, which are stored in the memory and distributed over many nodes.
Hadoop datasets: Those who perform a function on each file record in Hadoop
Distributed File System (HDFS) or any other storage system.
Parallelized collections: Those existing RDDs which run in parallel with one another.
By loading an external dataset from external storage like HDFS, HBase, shared file
system.
Spark SQL: Integrates relational processing with Spark’s functional programming API.
Ans: When an action is invoked on a Spark RDD at a high level, Spark submits the lineage graph to the DAG Scheduler.
The DAG Scheduler divides the operations into stages of tasks; a stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together and submits the stages to the Task Scheduler, which launches the tasks through the cluster manager. The dependencies between stages are unknown to the Task Scheduler. Finally, the workers execute the tasks on the worker (slave) nodes.
Ans: Stream processing is an extension to the Spark API that lets stream processing of
live data streams. Data from multiple sources such as Flume, Kafka, Kinesis, etc., is
processed and then pushed to live dashboards, file systems, and databases. In terms of input data handling, it is similar to batch processing: the incoming data is segregated into small streams (batches) and processed accordingly.
Ans: Spark Core is responsible for:
Memory management.
Fault-tolerance and fault recovery.
Monitoring jobs.
Job scheduling.
Moreover, additional libraries, built atop the core, let diverse workloads run for streaming, machine learning, and SQL.
Q. What is the module used to implement SQL in Spark? How does it work?
Ans: The module used is Spark SQL, which integrates relational processing with Spark’s
functional programming API. It helps to query data either through Hive Query Language
or SQL. These are the four libraries of Spark SQL:
Data Source API.
DataFrame API.
Interpreter & Optimizer.
SQL Service.
Querying data using SQL statements, both inside a Spark program and from external
tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
For instance, using business intelligence tools like Tableau.
Providing rich integration between SQL and regular Python/Java/Scala code, including
the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
Spark.mllib.
mllib.clustering.
mllib.classification.
mllib.regression.
mllib.recommendation.
mllib.linalg.
mllib.fpm.
Q. Explain the purpose of serializations in PySpark?
Ans: For improving performance, PySpark supports custom serializers to transfer data.
They are:
PickleSerializer: It is used by default for serializing objects. It supports any Python object, but at a slow speed.
MarshalSerializer: It supports only a limited set of Python data types, but it is faster than PickleSerializer.
Ans: PySpark Storage Level controls storage of an RDD. It also manages how to store
RDD in the memory or over the disk, or sometimes both. Moreover, it even controls whether RDD partitions are replicated or serialized. The code for StorageLevel is as follows:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
Ans: PySpark SparkContext is treated as an initial point for entering and using any Spark
functionality. The SparkContext uses py4j library to launch the JVM, and then create the
JavaSparkContext. By default, the SparkContext is available as ‘sc’.
Ans: PySpark SparkFiles is used to load our files on the Apache Spark application. It is
one of the functions under SparkContext and can be called using sc.addFile to load the
files on the Apache Spark. SparkFIles can also be used to get the path using
SparkFile.get or resolve the paths to files that were added from sc.addFile. The class
methods present in the SparkFiles directory are getrootdirectory() and get(filename).
Ans: Apache Spark is a DAG (directed acyclic graph) execution engine that enables users to analyze massive data sets with high performance. For this, the data first needs to be held in memory, which improves performance drastically when it has to be manipulated through multiple stages of processing.
Ans: SparkConf helps in setting a few configurations and parameters to run a Spark
application on the local/cluster. In simple terms, it provides configurations to run a
Spark application.
Q. What is PySpark?
PySpark is an Apache Spark interface in Python. It is used for collaborating with Spark
using APIs written in Python. It also supports Spark’s features like Spark DataFrame,
Spark SQL, Spark Streaming, Spark MLlib and Spark Core. It provides an interactive
PySpark shell to analyze structured and semi-structured data in a distributed
environment. PySpark supports reading data from multiple sources and different
formats. It also facilitates the use of RDDs (Resilient Distributed Datasets). PySpark
features are implemented in the py4j library in python.
# --serializing.py----
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "Marshal Serialization", serializer =
MarshalSerializer()) #Initialize spark context and serializer
print(sc.parallelize(list(range(1000))).map(lambda x: 3 *
x).take(5))
sc.stop()
When we run the file using the command:
$SPARK_HOME/bin/spark-submit serializing.py
The output of the code would be the list of size 5 of numbers multiplied by 3:
[0, 3, 6, 9, 12]
[
"interview",
"interviewbit"
]
● Action: These operations instruct Spark to perform some computations on the
RDD and return the result to the driver. It sends data from the Executer to the
driver. count(), collect(), take() are some of the examples.
Let us consider an example to demonstrate action operation by making use of
the count() function.
from pyspark import SparkContext
sc = SparkContext("local", "Action Demo")
words = sc.parallelize (
["pyspark",
"interview",
"questions",
"at",
"interviewbit"]
)
counts = words.count()
print("Count of elements in RDD -> ", counts)
In this example, we count the number of elements in the Spark RDD. The output of this code is:
Count of elements in RDD -> 5
PySpark MLlib also ships submodules such as mllib.clustering for grouping data depending on similarity, mllib.linalg for linear algebra, and mllib.fpm for frequent pattern mining.
The above figure shows the position of cluster manager in the Spark ecosystem.
Consider a master node and multiple worker nodes present in the cluster. The master
nodes provide the worker nodes with the resources like memory, processor allocation
etc depending on the nodes requirements with the help of the cluster manager.
● In-Memory Processing: PySpark’s RDD helps in loading data from the disk to the
memory. The RDDs can even be persisted in the memory for reusing the
computations.
● Immutability: The RDDs are immutable which means that once created, they
cannot be modified. While applying any transformation operations on the RDDs, a
new RDD would be created.
● Fault Tolerance: The RDDs are fault-tolerant. This means that whenever an
operation fails, the data gets automatically reloaded from other available
partitions. This results in seamless execution of the PySpark applications.
● Lazy Evaluation: The PySpark transformation operations are not performed as
soon as they are encountered. The operations would be stored in the DAG and
are evaluated once it finds the first RDD action.
● Partitioning: Whenever RDD is created from any data, the elements in the RDD are
partitioned to the cores available by default.
Q. What are the types of PySpark’s shared variables and why are they
useful?
Whenever PySpark performs the transformation operation using filter(), map() or
reduce(), they are run on a remote node that uses the variables shipped with tasks.
These variables are not reusable and cannot be shared across different tasks because
they are not returned to the Driver. To solve the issue of reusability and sharing, we have
shared variables in PySpark. There are two types of shared variables, they are:
Broadcast variables: These are also known as read-only shared variables and are used
in cases of data lookup requirements. These variables are cached and are made
available on all the cluster nodes so that the tasks can make use of them. The variables
are not sent with every task. They are rather distributed to the nodes using efficient
algorithms for reducing the cost of communication. When we run an RDD job operation
that makes use of Broadcast variables, the following things are done by PySpark:
● The job is broken into different stages having distributed shuffling. The actions
are executed in those stages.
● The stages are then broken into tasks.
● The broadcast variables are broadcasted to the tasks if the tasks need to use it.
Broadcast variables are created in PySpark by making use of the broadcast(variable) method from the SparkContext class. The syntax for this goes as follows:
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value    # access the broadcasted value
Accumulator variables: These variables are called updatable shared variables. They are
added through associative and commutative operations and are used for performing
counter or sum operations. PySpark supports the creation of numeric type
accumulators by default. It also has the ability to add custom accumulator types. The
custom types can be of two types:
● Named Accumulators: These accumulators are visible by their name on the PySpark Web UI. Here, we will see the Accumulable section that has the sum of the Accumulator values of the variables modified by the tasks listed in the Accumulator column present in the Tasks table.
● Unnamed Accumulators: These accumulators are not shown on the PySpark Web
UI page. It is always recommended to make use of named accumulators.
Accumulator variables can be created by using
SparkContext.longAccumulator(variable) as shown in the example below:
ac = sc.longAccumulator("sumaccumulator")
sc.parallelize([2, 23, 1]).foreach(lambda x: ac.add(x))
Depending on the type of accumulator variable data - double, long and collection,
PySpark provide DoubleAccumulator, LongAccumulator and CollectionAccumulator
respectively.
class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)
where:
● loadDefaults - whether to load values from Java system properties (True by default).
● _jvm - internal parameter that points to the JVM instance.
● _jconf - internal parameter used to pass in an existing SparkConf handle.
df.select(col("ID_COLUMN"), convertUDF(col("NAME_COLUMN")).alias("NAME_COLUMN")) \
  .show(truncate=False)
The output of the above code would be:
+----------+-----------------+
|ID_COLUMN |NAME_COLUMN |
+----------+-----------------+
|1 |Harry Potter |
|2 |Ronald Weasley |
|3 |Hermoine Granger |
+----------+-----------------+
UDFs have to be designed in a way that the algorithms are efficient and take less time
and space complexity. If care is not taken, the performance of the DataFrame
operations would be impacted.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('InterviewBitSparkSession') \
                    .getOrCreate()
Here,
● master() – This is used for setting up the mode in which the application has to
run - cluster mode (use the master name) or standalone mode. For Standalone
mode, we use the local[x] value to the function, where x represents partition
count to be created in RDD, DataFrame and DataSet. The value of x is ideally the
number of CPU cores available.
● appName() - Used for setting the application name
● getOrCreate() – For returning SparkSession object. This creates a new object if it
does not exist. If an object is there, it simply returns that.
If we want to create a new SparkSession object every time, we can use the newSession
method as shown below:
import pyspark
from pyspark.sql import SparkSession
spark_session = spark.newSession()
+-----------+----------+
| Name | Age |
+-----------+----------+
| Harry | 20 |
| Ron | 20 |
| Hermoine | 20 |
+-----------+----------+
We can get the schema of the dataframe by using df.printSchema()
>> df.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
df = spark.read.csv("/path/to/file.csv")
PySpark supports csv, text, avro, parquet, tsv and many other file extensions.
● startsWith() – returns a Boolean value. It is True when the value of the column starts with the specified string and False when the match is not satisfied in that column value.
● endsWith() – returns a Boolean value. It is True when the value of the column ends with the specified string and False when the match is not satisfied in that column value.
Both the methods are case-sensitive.
import org.apache.spark.sql.functions.col
df.filter(col("Name").startsWith("H")).show()
The output of the code would be:
+-----------+----------+
| Name | Age |
+-----------+----------+
| Harry | 20 |
| Hermoine | 20 |
+-----------+----------+
Notice how the record with the Name “Ron” is filtered out because it does not start with
“H”.
For example, consider we have the following DataFrame assigned to a variable df:
+-----------+----------+----------+
| Name | Age | Gender |
+-----------+----------+----------+
| Harry | 20 | M |
| Ron | 20 | M |
| Hermoine | 20 | F |
+-----------+----------+----------+
In the below piece of code, we will be creating a temporary table of the DataFrame that
gets accessible in the SparkSession using the sql() method. The SQL queries can be run
within the method.
df.createOrReplaceTempView("STUDENTS")
df_new = spark.sql("SELECT * from STUDENTS")
df_new.printSchema()
The schema will be displayed as shown below:
>> df.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Gender: string (nullable = true)
For the above example, let's try running a group by on the Gender column (the aggregation is shown below):
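The aggregation that produces the table below would look like this; the renamed count column Gender_Count is inferred from the output header:

df.groupBy("Gender").count().withColumnRenamed("count", "Gender_Count").show()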
+------+------------+
|Gender|Gender_Count|
+------+------------+
| F| 1 |
| M| 2 |
+------+------------+
how - type of join, by default it is inner. The values can be inner, left, right, cross, full,
outer, left_outer, right_outer, left_anti, left_semi.
The join expression can be appended with where() and filter() methods for filtering
rows. We can have multiple join too by means of the chaining join() method.
-- Employee DataFrame --
+------+--------+-----------+
|emp_id|emp_name|empdept_id |
+------+--------+-----------+
| 1| Harry| 5|
| 2| Ron | 5|
| 3| Neville| 10|
| 4| Malfoy| 20|
+------+--------+-----------+
-- Department DataFrame --
+-------+--------------------------+
|dept_id| dept_name |
+-------+--------------------------+
| 5 | Information Technology |
| 10| Engineering |
| 20| Marketting |
+-------+--------------------------+
We can inner join the Employee DataFrame with Department DataFrame to get the
department information along with employee information as:
emp_dept_df = empDF.join(deptDF,empDF.empdept_id ==
deptDF.dept_id,"inner").show(truncate=False)
The result of this becomes:
+------+--------+-----------+-------+--------------------------+
|emp_id|emp_name|empdept_id |dept_id| dept_name |
+------+--------+-----------+-------+--------------------------+
| 1| Harry| 5| 5 | Information Technology |
| 2| Ron | 5| 5 | Information Technology |
| 3| Neville| 10| 10 | Engineering |
| 4| Malfoy| 20| 20 | Marketting |
+------+--------+-----------+-------+--------------------------+
We can also perform joins by chaining join() method by following the syntax:
df1.join(df2,["column_name"]).join(df3,df1["column_name"] ==
df3["column_name"]).show()
Consider we have a third dataframe called Address DataFrame having columns emp_id,
city and state where emp_id acts as the foreign key equivalent of SQL to the Employee
DataFrame as shown below:
-- Address DataFrame --
+------+--------------+------+
|emp_id| city |state |
+------+--------------+------+
|1 | Bangalore | KA |
|2 | Pune | MH |
|3 | Mumbai | MH |
|4 | Chennai | TN |
+------+--------------+------+
If we want to get address details of the address along with the Employee and the
Department Dataframe, then we can run,
resultDf = empDF.join(addressDF,["emp_id"])
.join(deptDF,empDF["empdept_id"] ==
deptDF["dept_id"])
.show()
The resultDf would be:
+------+--------+-----------+--------------+------+-------+-----
---------------------+
|emp_id|emp_name|empdept_id | city |state |dept_id|
dept_name |
+------+--------+-----------+--------------+------+-------+-----
---------------------+
| 1| Harry| 5| Bangalore | KA | 5 |
Information Technology |
| 2| Ron | 5| Pune | MH | 5 |
Information Technology |
| 3| Neville| 10| Mumbai | MH | 10 |
Engineering |
| 4| Malfoy| 20| Chennai | TN | 20 |
Marketting |
+------+--------+-----------+--------------+------+-------+-----
---------------------+
root
|-- value: string (nullable = true)
Post data processing, the DataFrame can be streamed to the console or any other
destinations based on the requirements like Kafka, dashboards, database etc.
Q. What would happen if we lose RDD partitions due to the failure of the
worker node?
If any RDD partition is lost, then that partition can be recomputed using operations
lineage from the original fault-tolerant dataset.