Databricks RealQuestions

The document discusses different types of transformations in Spark - narrow transformations like map and filter that do not require data shuffling, and wide transformations like reduceByKey and join that require data shuffling. It also discusses the difference between repartition and coalesce operations for partitioning data and how broadcast variables can improve performance.

Transformation:

A transformation is an operation that converts a DataFrame from one form to another. Each transformation returns a new DataFrame. Transformations are lazily evaluated and recorded in a DAG (Directed Acyclic Graph); nothing executes until an action is called. Examples: filter, union.

Action:
An action is called when we want to work with the actual data. It triggers execution of the DAG and returns the result to the driver for display, or writes it to the storage layer. Examples: count, collect, save.

Narrow Transformation:
Narrow transformations are transformations in Spark that do not require shuffling of data
between partitions. They are performed locally on each partition and do not require
any exchange of data between partitions.

Examples of narrow transformations in Spark include map, filter, flatMap,
and union. These transformations are applied to each partition of the data in
parallel, which makes them efficient and fast.
Note: A narrow transformation has no need to shuffle data across nodes. It is a simple, inexpensive transformation.
Wide Transformation:

Wide transformations are transformations in Spark that require shuffling of
data between partitions. These transformations require the exchange of
data between partitions and can be more expensive compared to narrow
transformations.

Examples of wide transformations in Spark
include reduceByKey, groupByKey, and join. Wide transformations are used to
aggregate or combine data from different partitions, which makes them
more complex and slower than narrow transformations.
Note: Wide transformations are expensive because they shuffle data across nodes.

Difference between Narrow and Wide Transformation

Feature        | Narrow Transformation                                  | Wide Transformation
---------------|--------------------------------------------------------|----------------------------------------------------------
Data Shuffling | Does not require any data shuffling                    | Requires shuffling of data between partitions
Performance    | Typically faster; performed locally on each partition  | Can be slower due to the data exchange between partitions
Complexity     | Generally simpler and easier to implement              | More complex due to shuffling and aggregation between partitions
Scalability    | Scales well with large datasets                        | More challenging to scale due to the data exchange between partitions

Pyspark: RDD, DataFrame and Dataset?

What are they?

 They are all APIs provided by Spark for developers for data processing and analytics.
 In terms of functionality they are equivalent: for the same input data they return the same output.
But they differ in how data is handled and processed, so there are differences in
terms of performance, user convenience, language support etc.
 Users can choose any of these APIs while working with Spark. (Note: the typed Dataset API is available only in Scala and Java; PySpark offers RDDs and DataFrames.)
Performance Optimization | Repartition vs Coalesce
Need of a partition strategy:

 A good partition strategy is key to getting the best performance out of a Spark application.
 The right number of partitions, chosen based on the number of cores, boosts performance. The
wrong number hurts it.
 Evenly distributed partitions improve performance; unevenly distributed partitions hurt
performance.
 Say only one partition of size 500 MB is created on a worker node with 16 cores. One
partition can’t be shared among cores, so one core processes the entire 500 MB while the other 15
cores are kept idle.

Default partitions for RDD/DataFrame:

 The parameter sc.defaultParallelism determines the number of partitions when creating data
within Spark. It defaults to the total number of cores available to the application (for example, 8 partitions on a cluster with 8 cores).
 When reading data from an external system, partitions are created based on the parameter
spark.sql.files.maxPartitionBytes, which is by default 128 MB.
Repartition:
o Repartition is used to increase or decrease the number of partitions in Spark.
o Repartition always shuffles the data and builds new partitions from scratch.
o Repartition results in almost equal-sized partitions.
o Because of the full shuffle, it hurts performance in some use cases. But because it creates
equal-sized partitions, it helps performance in others.

Coalesce:

 The coalesce function can only reduce the number of partitions.
 Coalesce doesn’t require a full shuffle.
 Coalesce merges partitions, moving data only from a few of them, thus avoiding a full
shuffle.
 Because of the partition merge, it produces unevenly sized partitions.
 Since a full shuffle is avoided, coalesce is more performant than repartition.

What is the Catalyst optimizer?

A component of Spark (written in Scala) behind the DataFrame/SQL APIs that automatically finds the most
efficient plan to execute the data operations specified in the user code.

The Catalyst optimizer evaluates multiple candidate execution plans for a given statement or
action on a DataFrame, and based on that it chooses the best execution plan.
Broadcast Variable:
What is a broadcast variable?
It is a programming mechanism in Spark through which we can keep a read-only copy of data on each
node of the cluster, instead of sending it to a node every time a task needs it.

What is partitionBy?
A function used to write a DataFrame to disk, partitioned by specific key column(s).
Syntax:
df.write.partitionBy("key").csv(path)
What is autoscaling?

 Databricks dynamically chooses the appropriate number of workers required to run the job,
within a configured range of worker counts.
 It is a performance optimization technique.
 It is also a cost-saving technique.

What is Data Skew?

 Data skew is a condition in which a table’s data is unevenly distributed among
partitions in the cluster.
 Data skew can severely degrade the performance of queries, especially those with joins.
 Joins between big tables require shuffling data, and skew can lead to an extreme
imbalance of work in the cluster.
 Data skew is likely affecting a query if the query appears to be stuck finishing very
few tasks (for example, the last 3 tasks out of 200).

Access Azure Data Lake


1. Create Azure Data Lake Gen2 Storage
2. Access Data Lake using Access Keys
3. Access Data Lake using SAS Token
4. Access Data Lake using Service Principal
5. Using Cluster Scoped Authentication
6. Access Data Lake using AAD Credential Passthrough
Access Azure Data Lake Gen2 using Access Keys
Service Principal:
A service principal is an Azure Active Directory credential, similar to a user
account.
Once we have created the service principal, we need to grant it access to the Data Lake
Storage.
We can then create mount points in DBFS using these credentials. The mount points we create
provide access to the storage without requiring credentials.
We will also be able to use the Data Lake with file-system semantics, such as /mnt/storage1.
The best part is that we don’t have to specify the credentials when accessing the data.
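A hedged sketch of the service-principal mount, runnable only inside a Databricks notebook where dbutils exists; every name here (secret scope, secret keys, storage account, container, tenant placeholder, mount point) is an illustrative assumption, not a value from this document:

```python
# Databricks-notebook sketch; dbutils is available only inside Databricks.
# Scope/key/account/container names below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":
        dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# After this, the storage is reachable as /mnt/storage1 without credentials.
dbutils.fs.mount(
    source="abfss://container@account.dfs.core.windows.net/",
    mount_point="/mnt/storage1",
    extra_configs=configs)
```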
