Databricks RealQuestions
A transformation is an operation that converts a DataFrame from one form to another.
Each transformation returns a new DataFrame. Transformations are lazily evaluated: Spark records them
in a DAG (Directed Acyclic Graph) and runs them only when an action is called. Examples: filter, union.
Action:
When we want to work with the actual data, we call an action. An action triggers execution and returns
the result to the driver for display, or writes it to the storage layer. Examples: count, collect, save.
Narrow Transformation:
Narrow transformations are transformations in Spark that do not require shuffling data
between partitions. They are performed locally on each partition, so no data is exchanged
between partitions.
They are all APIs that Spark provides to developers for data processing and analytics.
In terms of functionality they are the same and return the same output for the same input data.
They differ in how they handle and process data, so they differ in
performance, user convenience, language support, etc.
Users can choose any of these APIs when working with Spark.
Performance Optimization | Repartition vs Coalesce
Need for a partition strategy:
Coalesce:
The Catalyst optimizer evaluates multiple candidate execution plans for a statement or a DataFrame
action and, based on that comparison, chooses the best execution plan.
Broadcast Variable:
What is a broadcast variable?
It is a programming mechanism in Spark through which we can keep a read-only copy of data on each
node of the cluster instead of sending it to the node every time a task needs it.
With autoscaling, Databricks dynamically chooses the number of workers needed to run the job,
within the configured minimum and maximum number of workers.
It is a performance optimization technique.
It is also a cost-saving technique.
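The worker range is set on the cluster definition. The sketch below writes such a definition as a Python dictionary in the shape used by the Databricks Clusters API, where the autoscale block holds min_workers and max_workers. The cluster name, runtime version, node type and the 2 to 8 range are placeholder values assumed for illustration only.

```python
import json

# Hypothetical cluster definition; every value here is a placeholder.
cluster_spec = {
    "cluster_name": "example-autoscaling-cluster",
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    # Databricks picks a worker count between these two bounds based on load.
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

A lower min_workers saves cost when the cluster is lightly loaded, while a higher max_workers lets heavy jobs finish sooner.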