notes (3) - Copy

Delta is a data format based on Parquet.
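
A minimal sketch of writing and reading a Delta table with Spark, assuming the Delta Lake (delta-core/delta-spark) library is on the classpath; the path is made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-example")
  // configuration recommended by the Delta Lake quickstart
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// On disk a Delta table is Parquet data files plus a _delta_log transaction
// log, which adds ACID guarantees on top of the Parquet format.
spark.range(5).toDF("id")
  .write.format("delta")
  .mode("overwrite")
  .save("/tmp/delta/example_table")   // illustrative path

spark.read.format("delta").load("/tmp/delta/example_table").show()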

rank - unlike row_number, rank assigns the same rank to tied values (leaving gaps after ties), while row_number assigns a unique sequential number to every row.
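
A minimal Spark sketch of the difference (data and column names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, row_number}

val spark = SparkSession.builder().appName("rank-demo").master("local[*]").getOrCreate()
import spark.implicits._

val scores = Seq(("A", 50), ("B", 50), ("C", 40)).toDF("name", "score")
val w = Window.orderBy($"score".desc)

scores
  .withColumn("rank", rank().over(w))             // ties share a rank: 1, 1, 3
  .withColumn("row_number", row_number().over(w)) // always unique: 1, 2, 3
  .show()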


In Scala, functions are first-class values: they can be passed as an argument to other functions and returned as a value from other functions. Scala is a pure object-oriented language in the sense that every value is an object. It's even more object-oriented than Java; by that I mean that any value, even integers and doubles, is an object. This opens a new world of no special treatment for primitive values.
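
A tiny illustration of functions as values (names are illustrative):

object FirstClass {
  // a function stored in a value, like any other value
  val double: Int => Int = _ * 2

  // a function passed as an argument to another function
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // a function returned as a value from another function
  def adder(n: Int): Int => Int = (x: Int) => x + n

  def main(args: Array[String]): Unit = {
    println(applyTwice(double, 3)) // 12
    println(adder(10)(5))          // 15
  }
}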

A singleton object is an object which defines a single instance of a class. A singleton object provides an entry point to your program execution. If you do not create a singleton object in your program, then your code compiles successfully but does not produce any output.
A companion object is an object that has the same name as a class and is defined in the same source file. The parameters of a case class are all vals, which means they are immutable.
syntax - case class <classname> (<regular parameters>). To create a Scala object of a case class, we don't use the keyword 'new'. This is because its default apply() method handles the creation of objects.
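
A small sketch tying these together: a case class, its companion object with an extra factory method, and a singleton object as the entry point (names are illustrative):

case class Employee(name: String, dept: String)

// companion object: same name as the class, defined in the same file
object Employee {
  def intern(name: String): Employee = Employee(name, "Internship")
}

// singleton object acting as the program's entry point
object Main {
  def main(args: Array[String]): Unit = {
    val e1 = Employee("Rahul", "Data")  // no 'new': the default apply() creates the object
    val e2 = Employee.intern("Priya")
    // e1.name = "Other"                // would not compile: case class parameters are vals
    println(e1)
    println(e2)
  }
}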

Rack Awareness in Hadoop is the concept of choosing a nearby DataNode (closest to the client which has raised the read/write request), thereby reducing the network traffic.
An edge node is a computer that acts as an end user portal for communication with
other nodes in cluster computing. Edge nodes are also sometimes called gateway
nodes or edge communication nodes. In a Hadoop cluster, three types of nodes exist:
master, worker and edge nodes. Master nodes oversee the key operations of the cluster: coordinating the storage of data in HDFS and running parallel computations on that data using MapReduce. The worker nodes comprise most of the virtual machines in a Hadoop cluster and perform the job of storing the data and running the computations. Each worker node runs the DataNode and TaskTracker services, which receive instructions from the master nodes.
Speculative execution means that Hadoop, overall, doesn't try to fix slow tasks, since it is hard to detect the reason (misconfiguration, hardware issues, etc.); instead, it launches a parallel backup task on a faster node for each task that is performing slower than expected.
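
Spark has the same idea; a minimal sketch of enabling it through configuration (the values shown are illustrative, not recommendations; in classic MapReduce the corresponding properties are mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")            // launch backup copies of slow tasks
  .config("spark.speculation.multiplier", "1.5")  // "slow" = slower than 1.5x the median task
  .config("spark.speculation.quantile", "0.75")   // only after 75% of tasks have finished
  .getOrCreate()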

We can create a temporary table/view using createOrReplaceTempView() and then use this view to create a table either in Hive or Athena.
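
A minimal sketch (the view, database and table names are made up; enableHiveSupport assumes a Hive metastore is available):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tempview-demo")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val orders = Seq((1, "open"), (2, "closed")).toDF("order_id", "status")

// the view lives only for the duration of the SparkSession
orders.createOrReplaceTempView("orders_tmp")

// use the view from SQL, e.g. to materialise a persistent Hive table
spark.sql("CREATE TABLE IF NOT EXISTS db_name.orders AS SELECT * FROM orders_tmp")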

groupByKey receives key-value pairs and groups the records for each key.
reduceByKey provides similar functionality but additionally merges the values for each key with a reduce function.
While reduceByKey and groupByKey can be used to produce the same answer, the reduceByKey version works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
On the other hand, when calling groupByKey all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
reduceByKey uses a combiner to combine the data on the map side, while groupByKey doesn't.
eg - val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))
reduceByKey o/p -> ("A", 2), ("B", 1), ("C", 1)
groupByKey o/p -> ("A", [1, 1]), ("B", [1]), ("C", [1])
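
A runnable version of the example (local mode, illustrative data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bykey-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))

// reduceByKey combines values per key on each partition first (map-side combine),
// so only partial sums are shuffled: ("A", 2), ("B", 1), ("C", 1)
testRdd.reduceByKey(_ + _).collect().foreach(println)

// groupByKey shuffles every key-value pair and only then groups them:
// ("A", List(1, 1)), ("B", List(1)), ("C", List(1))
testRdd.groupByKey().mapValues(_.toList).collect().foreach(println)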

optimization techniques used in HIVE -


Partitioning, bucketing, using serialization in binary format.
On the Spark side: broadcasting, persisting, file format selection, minimal use of ByKey operations, and repartition/coalesce to handle parallelism.
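
An illustrative Spark-side sketch of a few of these techniques (the dataframe names and paths are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("opt-demo").getOrCreate()

val facts = spark.read.parquet("/data/facts")     // file format selection: columnar Parquet
val smallDim = spark.read.parquet("/data/dim")

// broadcasting: ship the small table to every executor instead of shuffling both sides
val joined = facts.join(broadcast(smallDim), Seq("dim_id"))

// persisting: cache a result that is reused by several downstream actions
val cached = joined.persist(StorageLevel.MEMORY_AND_DISK)

// repartition (full shuffle, raises parallelism) vs coalesce (no full shuffle, lowers partition count)
val wide = cached.repartition(200)
val narrow = cached.coalesce(10)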

Resilient because RDDs are immutable (can't be modified once created) and fault tolerant, Distributed because they are distributed across the cluster, and Dataset because they hold data.
A DAG (directed acyclic graph) is the representation of the way Spark will execute your program - each vertex on that graph is a separate operation and the edges represent the dependencies between operations. Your program (and thus the DAG that represents it) may operate on multiple entities (RDDs, Dataframes, etc.). RDD lineage is just the portion of the DAG (one or more operations) that leads to the creation of that particular RDD.
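
A minimal sketch of inspecting an RDD's lineage (local mode, illustrative data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lineage-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("a", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// prints the chain of parent RDDs (the lineage) Spark would replay to
// recompute this RDD after a partition is lost
println(counts.toDebugString)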

difference between data frame and dataset


Dataframes are a Spark SQL structure and are similar to a relational database table, that is, a row-and-column structure. Spark Datasets are an extension of the Dataframe API. Datasets are fast and also provide a type-safe interface. Type safety means that the compiler validates the data types of all the columns in the dataset at compile time and throws an error if there is any mismatch in the data types.
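
A minimal sketch of the type-safety difference (the case class and column names are made up):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Rahul", 30), ("Anand", 25)).toDF("name", "age") // DataFrame = Dataset[Row]
val ds = df.as[Person]                                         // Dataset[Person]

// ds.map(p => p.salary)  // does not compile: the missing column is caught at compile time
// df.select("salary")    // compiles fine, fails only when Spark analyses the query
ds.map(p => p.age + 1).show()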

how is a file processed in map reduce?


MapReduce facilitates concurrent processing by splitting petabytes of data into
smaller chunks, and processing them in parallel on Hadoop commodity servers. In the
end, it aggregates all the data from multiple servers to return a consolidated
output back to the application.
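
A toy, purely local illustration of that split -> process-in-parallel -> aggregate pattern using plain Scala collections (this is not actual Hadoop code, just the shape of the idea):

val lines = Seq("spark and hadoop", "hadoop and hive", "spark and scala")

// map phase: split the input into records and emit (key, 1) pairs
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// shuffle + reduce phase: bring pairs with the same key together and aggregate them
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

counts.toList.sortBy(_._1).foreach(println)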

Spark architecture -
In your master node, you have the driver program, which drives your application.
The code you are writing behaves as a driver program or if you are using the
interactive shell, the shell acts as the driver program. Inside the driver program,
the first thing you do is, you create a Spark Context. Assume that the Spark
context is a gateway to all the Spark functionalities. It is similar to your
database connection. Any command you execute in your database goes through the
database connection. Likewise, anything you do on Spark goes through Spark context.

Now, this Spark context works with the cluster manager to manage various jobs. The driver program and Spark context take care of the job execution within the cluster. A job is split into multiple tasks which are distributed over the worker nodes. Anytime an RDD is created in the Spark context, it can be distributed across various nodes and can be cached there.

Worker nodes are the slave nodes whose job is basically to execute the tasks. These tasks are executed on the partitioned RDDs on the worker nodes, and the results are returned to the Spark Context.
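
A minimal driver-program sketch of these pieces (names are illustrative):

import org.apache.spark.sql.SparkSession

object DriverApp {
  def main(args: Array[String]): Unit = {
    // the driver creates the Spark context (via SparkSession here),
    // the gateway to all Spark functionality
    val spark = SparkSession.builder().appName("architecture-demo").getOrCreate()
    val sc = spark.sparkContext

    // this RDD is partitioned across the worker nodes and cached there
    val rdd = sc.parallelize(1 to 1000, numSlices = 8).cache()

    // the action is broken into tasks that run on the workers,
    // and the results come back to the driver
    println(rdd.count())

    spark.stop()
  }
}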

Spark Context takes the job, breaks it into tasks and distributes them to the worker nodes. To tune a job:
-> we can identify which dataframes are taking longer to execute and use repartition accordingly.
-> adjust shuffle partitions and parallelism.
-> in case the job is failing at any point due to a stage failure, we can experiment with adding a checkpoint directory as well.
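
An illustrative sketch of those knobs (the values and paths are made up, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tuning-demo").getOrCreate()
val sc = spark.sparkContext

// shuffle partitions / parallelism for DataFrame operations
spark.conf.set("spark.sql.shuffle.partitions", "400")

// repartition a dataframe that is slow because of skewed or too few partitions
val df = spark.read.parquet("/data/events").repartition(200)

// checkpointing truncates the lineage, which can help jobs that keep failing at a stage
sc.setCheckpointDir("/tmp/checkpoints")
val checkpointed = df.checkpoint()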

Why should we hire you -


I believe I have worked tirelessly for all the relevant skills that I have acquired over the past few years. I have all/most of the skills that are needed for the position I am interviewing for, which makes me certain that I am a good fit for the role. Outside of work, I like reading books, cooking, and playing various sports and video games.

Strengths and weakness -


Strengths: I am very reliable; I make sure that the tasks assigned to me get done within the deadline. I am sincere about my work and don't take it lightly. I am respectful towards my leads and colleagues, and I am a fast learner.
Weakness: I can get overwhelmed when I try to work on technologies that are new to me in this field. Apart from my professional story, personally I was very good at different kinds of sports during my school days.
Tell me about an instance where you demonstrated leadership skills - The company I am working for expanded drastically last year in terms of hiring more associates, so a lot of new members have been introduced to the project I am working on. For the past few months I have been given the responsibility of getting the new joinees acquainted with the project structure that we follow in our team and of helping them with their tasks. As a result, for the past few months, on top of my own tasks, I have been guiding all the new joinees with their problems and with technical and business understanding. I am also making sure that their tasks are getting delivered within the stipulated timeframe.
