Apache Spark is a fast, open-source cluster computing technology designed for efficient data processing, offering in-memory computation that significantly speeds up tasks compared to Hadoop MapReduce. It supports various processing types, including batch and stream processing, and includes components like Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX for graph processing. YARN serves as the resource management and job scheduling framework for Hadoop, optimizing resource utilization and ensuring fault tolerance across distributed computing environments.
Unit 2
Big Data Platforms
Overview of Apache Spark
• Apache Spark is a fast cluster computing technology designed for fast computation.
• It is a fast, general-purpose, open-source engine for processing data across a wide range of workloads.
• It is built around agility, ease of use, and advanced analytics.
• Apache Spark is best known for running iterative machine learning algorithms.
• It is based on Hadoop MapReduce and extends the MapReduce model so that it can be used efficiently for more types of computation.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• Spark is designed to cover a wide range of processing, such as batch applications, iterative algorithms, interactive queries, and streaming.
• Spark can therefore be used to perform both batch processing and stream processing tasks.

Limitations of MapReduce
• Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm.
• A challenge with MapReduce is the sequential, multi-step process it takes to run a job.
• In each step, MapReduce reads data from the cluster, performs its operations, and writes the results back to HDFS.
• Because each step requires a disk read and a disk write, MapReduce jobs are slower due to the latency of disk I/O.

Advantages of Spark over MapReduce
• Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
• With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution.
• Spark also reuses data by means of an in-memory cache, which greatly speeds up machine learning algorithms.

Overview of Apache Spark
• Spark provides the following functionality:
1. High-level APIs in Java, Scala, Python, and R.
2. A simplified way to handle the computationally intensive task of processing high volumes of real-time data.
3. Fast processing, up to 100x faster than Apache Hadoop MapReduce.
4. Interactive, ad-hoc data analysis.
5. Increased processing speed.
6. In-memory cluster computation capability.

Components of Apache Spark
• Spark Core is the foundation of the platform.
• It is responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs, and interacting with storage systems.
• Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python, and R.
• These APIs hide the complexity of distributed processing behind simple, high-level operators.

Components of Apache Spark
• Spark SQL (Interactive Queries):
• Spark SQL is a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce.
• It includes a cost-based optimizer, columnar storage, and code generation for fast queries, while scaling to thousands of nodes.
• Business analysts can use standard SQL or the Hive Query Language to query data.
• Developers can use APIs available in Scala, Java, Python, and R.
• It supports various data sources out of the box, including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet.
• Connectors for other popular stores, such as Amazon Redshift, Amazon S3, Couchbase, Cassandra, MongoDB, Salesforce.com, Elasticsearch, and many others, are available in the Spark Packages ecosystem.
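To make the Spark Core and Spark SQL discussion concrete, the following is a minimal PySpark sketch that caches a DataFrame in memory and queries it with Spark SQL. The input path, the view name, and the column names are illustrative assumptions, not part of this unit.

```python
# Minimal PySpark sketch: in-memory caching plus a Spark SQL query.
# The input path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSqlDemo").getOrCreate()

# Read a JSON data source (one of the formats Spark SQL supports out of the box).
people = spark.read.json("hdfs:///data/people.json")

# cache() keeps the DataFrame in memory, so repeated queries avoid re-reading from disk.
# This is the in-memory computation that distinguishes Spark from MapReduce.
people.cache()

# Register a temporary view and query it with standard SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()

spark.stop()
```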
Components of Apache Spark
• Spark Streaming (Real-time):
• Spark Streaming is a real-time solution that leverages Spark Core's fast scheduling capability to do streaming analytics.
• It ingests data in mini-batches and enables analytics on that data with the same application code written for batch analytics.
• This improves developer productivity, because the same code can be used for batch processing and for real-time streaming applications.
• Spark Streaming supports data from Twitter, Kafka, Flume, HDFS, ZeroMQ, and many other sources found in the Spark Packages ecosystem.

Components of Apache Spark
• MLlib (Machine Learning):
• Spark includes MLlib, a library of algorithms for doing machine learning on data at scale (a short example follows this section).
• Machine learning models can be trained by data scientists with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java- or Scala-based pipeline.
• Spark was designed for fast, interactive computation that runs in memory, enabling machine learning to run quickly.
• The algorithms include classification, regression, clustering, collaborative filtering, and pattern mining.

Components of Apache Spark
• GraphX (Graph Processing):
• Spark GraphX is a distributed graph processing framework built on top of Spark.
• GraphX provides ETL, exploratory analysis, and iterative graph computation, enabling users to interactively build and transform a graph data structure at scale.
• It comes with a highly flexible API and a selection of distributed graph algorithms.
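As referenced in the MLlib item above, here is a minimal sketch of training a model with Spark's machine learning API (pyspark.ml). The tiny synthetic dataset and the column names are assumptions chosen purely for illustration.

```python
# Minimal MLlib sketch: train a logistic regression classifier on a tiny in-memory dataset.
# The data values and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

# A toy labelled dataset: (label, feature vector).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.3])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

# Fit the model; MLlib distributes the computation across the cluster.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

# Apply the trained model to the training data just to show the prediction column.
model.transform(train).select("label", "prediction").show()

spark.stop()
```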
HDFS
• Hadoop's file system allows applications to be run across multiple servers.
• In HDFS, data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster.
• This means an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster.
• By default, HDFS replicates these smaller pieces of data onto two additional servers, giving three copies in total.
• This redundancy of data has the following benefits:
1. Higher availability of data.
2. Increased fault tolerance.
3. It allows the Hadoop cluster to break the work into smaller chunks and run those jobs on all the servers in the cluster, for better scalability.
4. Hadoop tries to assign workloads to the servers where the data to be processed is stored. This is known as data locality, and it is critical when working with large data sets.

Figure: an example of how data blocks are written to HDFS.
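The block-splitting and replication idea can be sketched in a few lines of Python. The 64 MB block size and the replication factor of three follow the defaults described in this unit; the round-robin placement below is a simplification for illustration, not the real NameNode placement policy.

```python
# Toy illustration of how HDFS splits a file into blocks and assigns replicas.
# Block size and replication factor follow the defaults described in this unit;
# the placement logic is a simplification, not the real HDFS policy.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # one copy plus two additional copies

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of the given size occupies (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (toy round-robin policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement

# Example: a 200 MB file on a 5-node cluster occupies 4 blocks, each stored on 3 DataNodes.
blocks = split_into_blocks(200 * 1024 * 1024)
print(blocks)  # 4
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```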
HDFS
• A data file in HDFS is divided into blocks, and the default size of these blocks in Apache Hadoop is 64 MB.
• Hadoop was designed to scan through very large data sets, so it makes sense to use a very large block size, allowing each server to work on a larger chunk of data at a time.
• Coordination across a cluster has significant overhead, and that coordination is managed by HDFS.
• HDFS makes sure that replicas of each block are stored on a separate server rack, so that the data remains available even if an entire rack of servers is lost.

HDFS: NameNode
• All of Hadoop's data placement logic is managed by a special server called the NameNode.
• The NameNode keeps track of all the data files in HDFS, including where their blocks are stored.
• All of the NameNode's information is held in memory, which allows it to provide quick response times to storage manipulation or read requests.
• When a file is created, HDFS automatically communicates with the NameNode to allocate storage on specific servers and to perform the data replication.
• The NameNode maintains and manages the file system namespace and provides clients with the right access permissions.
• The NameNode performs file system namespace operations, including opening, closing, and renaming files and directories.
• The NameNode records any change to the file system namespace or its properties.

HDFS: DataNodes
• HDFS also has multiple DataNodes on a commodity hardware cluster.
• The DataNodes are generally organized within the same rack in the data center.
• Data is broken down into separate blocks and distributed among the various DataNodes for storage.
• The NameNode knows which DataNode contains which blocks and where the DataNodes reside within the cluster.
• The NameNode also manages access to the files across the DataNodes.
• The DataNodes serve read and write requests from clients and perform block creation, deletion, and replication when the NameNode instructs them to do so.
• The DataNodes are in constant communication with the NameNode, so the NameNode can determine which DataNodes are needed for specific tasks.
• The NameNode is always aware of the status of each DataNode. If the NameNode realizes that a DataNode is not working properly, it can immediately reassign that DataNode's task to a different node containing the same data block.
• DataNodes also communicate with each other, so that they can cooperate during normal file operations.

YARN
• YARN is considered the brain of the Hadoop architecture.
• Apart from resource management and allocation, it also performs job scheduling.
• YARN is the parallel processing framework for implementing distributed computing clusters.
• YARN allows different data processing methods, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
• YARN uses master servers and data servers. There is only one master server per cluster, and it runs the Resource Manager daemon (program).
• There are many data servers in the cluster; each one runs its own Node Manager daemon and, as required, an Application Master.

Components of YARN
1. Resource Manager
• It is responsible for resource allocation.
• On receiving processing requests, it passes parts of the requests to the corresponding Node Managers, where the actual processing takes place.
• It is the arbitrator of the cluster resources and decides the allocation of the available resources among the applications.
• It optimizes cluster utilization by keeping all resources in use as much of the time as possible.
• It has two major components: a) the Scheduler and b) the Application Manager.

a) Scheduler
• The Scheduler is responsible for allocating resources to the various running applications, subject to constraints such as capacities and queues.
• It is called a pure scheduler because it does not perform any monitoring or tracking of application status.
• If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed tasks.
• It performs scheduling based on the resource requirements of the applications.
• It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins: the Capacity Scheduler and the Fair Scheduler.

b) Application Manager
• It is responsible for accepting job submissions.
• It negotiates the first container from the Resource Manager for executing the application-specific Application Master.
• It manages the running Application Masters in the cluster and provides the service of restarting the Application Master container on failure.

2. Node Manager
• It handles the individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
• It registers with the Resource Manager and sends the status of its node.
• Its primary goal is to manage the application containers assigned to it by the Resource Manager.
• The Application Master requests an assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run.
• The Node Manager creates the requested container process and starts it.
• It monitors the resource usage (memory, CPU) of individual containers and also performs log management.

3. Application Master
• An application is a single job submitted to the framework. Each application has a unique Application Master associated with it, which is a framework-specific entity.
• It is the process that coordinates an application's execution in the cluster and also manages faults.
• Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
• It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status, and monitoring progress.

4. Container
• A container is a collection of physical resources, such as RAM, CPU cores, and disks, on a single node.
• YARN containers are managed by a Container Launch Context (CLC), which describes the container life cycle.
• This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
• It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.

Workflow in YARN
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container to start the Application Manager.
3. The Application Manager registers itself with the Resource Manager.
4. The Application Manager negotiates containers from the Resource Manager.
5. The Application Manager notifies the Node Manager to launch the containers.
6. The application code is executed in the container on the Node Manager.
7. The client contacts the Resource Manager/Application Manager to monitor the application's status.
8. Once the processing is complete, the Application Manager un-registers with the Resource Manager.
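The eight-step workflow above is essentially a message sequence. The following toy Python sketch only exists to make the order of interactions concrete; every class and method name in it is hypothetical and does not correspond to any real YARN or Hadoop API.

```python
# Toy model of the YARN application workflow described above.
# All class and method names are hypothetical; they are NOT real YARN APIs.

class ResourceManager:
    def allocate_container(self, purpose):
        print(f"RM: allocating container for {purpose}")
        return {"node": "node-1", "memory_mb": 1024, "vcores": 1}

    def register(self, app_master):
        print(f"RM: registered {app_master}")

    def unregister(self, app_master):
        print(f"RM: un-registered {app_master}")


class NodeManager:
    def launch_container(self, container, command):
        # Steps 5 and 6: the Node Manager starts the container and the application code runs in it.
        print(f"NM: launching {container} to run '{command}'")


def run_application(rm, nm, job_command):
    # Step 1: the client submits an application (modelled here as calling this function).
    # Step 2: the Resource Manager allocates a container to start the application master.
    rm.allocate_container("application master")
    app_master = "app-master-0001"
    # Step 3: the application master registers itself with the Resource Manager.
    rm.register(app_master)
    # Step 4: it negotiates further containers from the Resource Manager.
    task_container = rm.allocate_container("task container")
    # Steps 5 and 6: the Node Manager launches the container and executes the application code.
    nm.launch_container(task_container, job_command)
    # Step 7 (the client polling for status) is omitted from this sketch.
    # Step 8: once processing is complete, the application master un-registers.
    rm.unregister(app_master)


run_application(ResourceManager(), NodeManager(), "wordcount input output")
```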
Key Features of YARN
• Multi-tenancy: YARN allows multiple processing engines to access the same data, fulfils the requirements of real-time systems, and manages the movement of that data within the framework.
• Sharing resources: YARN ensures there is no dependency between compute jobs. Each compute job runs on its own node and does not share its allocated resources; each job is responsible for its own assigned work.
• Cluster utilization: YARN optimizes the cluster by utilizing and allocating its resources dynamically.
• Fault tolerance: YARN is highly fault-tolerant. It allows failed compute jobs to be rescheduled without any implications for the final output.
• Scalability: YARN focuses mainly on scheduling resources; because of this, the number of data nodes can be expanded, increasing the processing capacity.
• Compatibility: Jobs that worked in MapReduce v1 can be migrated to higher versions of Hadoop with ease, ensuring the high compatibility of YARN.

MapReduce
• MapReduce is a programming framework that allows us to perform distributed and parallel processing of large data sets in a distributed environment.
• MapReduce consists of two important tasks: Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output of the Map task as input and combines those data tuples (key-value pairs) into a smaller set of tuples.
• The Reduce task is always performed after the Map job.
• The reducer phase takes place after the mapper phase has completed.
• In the map job, a block of data is read and processed to produce key-value pairs as intermediate output.
• The output of a mapper, or map job (key-value pairs), is the input to the reducer.
• The reducer receives key-value pairs from multiple map jobs.
• The reducer then aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.

MapReduce: Phases
• Input Phase − In this phase, a record reader translates each record in the input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function that takes a series of key-value pairs and processes each one of them to generate key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner − A combiner is a type of local reducer that groups similar data from the map phase into sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of a single mapper.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can easily be iterated over in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a reducer function on each group. Here the data can be aggregated, filtered, and combined in a number of ways, and this can require a wide range of processing. Once the execution is over, the Reducer gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, an output formatter translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

MapReduce: Word Count Example
• A word count job emits each distinct word as a key with its total count as the value, for example:

Key       Value
bad       1
Class     1
good      1
Hadoop    3
is        2
to        1
Welcome   1

MapReduce
• One map task is created for each split, and it executes the map function for each record in the split.
• It is usually beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input, and when the splits are processed in parallel, smaller splits give better load balancing.
• However, it is not desirable to have splits that are too small: when splits are too small, the overhead of managing the splits and creating the map tasks begins to dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default).
• Map tasks write their output to the local disk of the respective node, not to HDFS.
• Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away, so storing it in HDFS with replication would be overkill.
• If a node fails before the map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
• The reduce task does not work on the concept of data locality: the output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
• On that machine, the outputs are merged and then passed to the user-defined reduce function.
• Unlike the map output, reduce output is stored in HDFS; the first replica is stored on the local node and the other replicas are stored on off-rack nodes.
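To tie the phases and the word count example together, here is a sketch of a word count mapper and reducer in plain Python. The example input lines are chosen so that the counts match the table above; this is an illustration of the Map, Shuffle and Sort, and Reduce steps, not the exact code used in the unit.

```python
# Word count sketch: the mapper emits (word, 1) pairs and the reducer sums the counts per key.
# This illustrates the MapReduce phases described above; it is not tied to any cluster setup.
from itertools import groupby


def mapper(lines):
    """Map phase: break each record into (word, 1) intermediate key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: pairs arrive sorted by key (the shuffle-and-sort step);
    group equal keys together and sum their values."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Illustrative input chosen so the counts match the table shown above.
    text = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]
    # Shuffle and sort: order the intermediate pairs by key before reducing.
    intermediate = sorted(mapper(text))
    for word, total in reducer(intermediate):
        print(f"{word}\t{total}")
```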
PageRank Algorithm
• PageRank is a recursive algorithm developed by Google founder Larry Page to assign a real number (a score) to each page on the Web.
• Pages can then be ranked by these scores: the higher the score of a page, the more important the page is.
• According to Google, PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that if a website receives more links from other websites, it is more important.
• The PageRank algorithm outputs a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
• PageRank can be calculated for collections of documents of any size.
• The PageRank computation requires several passes through the collection, called iterations, to adjust the approximate PageRank values so that they predict the true values more accurately.

PageRank Algorithm: Example
• Assume that there are four web pages: A, B, C, and D.
• Links from a page to itself, or multiple links from one single page to another single page, are ignored.
• PageRank is initialized to the same value for all pages.
• In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.
• For simplification, however, PageRank is treated as a probability distribution between 0 and 1, so the initial value for each page in this example is 0.25.
• If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75, i.e. PR(A) = PR(B) + PR(C) + PR(D).
• Suppose instead that page B has links to pages C and A, page C has a link to page A, and page D has links to all three pages.
• Upon the first iteration, page B would transfer half of its existing value (0.125) to page A and the other half (0.125) to page C.
• Page C would transfer all of its existing value (0.25) to the only page it links to, A.
• Since page D has three outbound links, it would transfer one third of its existing value, approximately 0.083, to A.
• At the completion of this iteration, page A will have a PageRank of approximately 0.458, i.e. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3.
• The PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number of its outbound links L( ).
• The general (simplified) equation for calculating the PageRank of a page u is: PR(u) = Σ PR(v) / L(v), where the sum runs over every page v that links to u and L(v) is the number of outbound links of page v.

PageRank Algorithm: MapReduce
• The MapReduce approach tackles the problem by taking advantage of running on a cluster (parallelization), and it scales up to very large data sets.
• At the beginning of each iteration, a node passes its PageRank contributions to the nodes it links to; this is also called spreading probability mass to neighbours via outgoing links.
• At the end of the iteration, each node sums up all the PageRank contributions that have been passed to it and computes an updated PageRank score; this is also called gathering the probability mass passed to a node via its incoming links.
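The following short Python sketch reproduces one iteration of the four-page example above using the simplified formula (no damping factor). The link structure for B, C, and D matches the worked example; page A's outgoing links are not specified in the text, so it is treated as contributing nothing in this iteration.

```python
# One iteration of simplified PageRank (no damping factor) on the four-page example:
# B links to C and A, C links to A, D links to A, B, and C.
# Page A's outgoing links are not specified in the example, so it contributes nothing here.

links = {
    "A": [],                 # not given in the worked example (treated as a dangling node)
    "B": ["C", "A"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Initial PageRank: a uniform probability distribution over the four pages.
pr = {page: 0.25 for page in links}

def pagerank_iteration(pr, links):
    """Spread PR(v)/L(v) from every page v to each page it links to, then gather."""
    new_pr = {page: 0.0 for page in pr}
    for page, outgoing in links.items():
        if not outgoing:
            continue  # dangling nodes need more careful handling in real implementations
        share = pr[page] / len(outgoing)   # PR(v) / L(v)
        for target in outgoing:
            new_pr[target] += share
    return new_pr

pr = pagerank_iteration(pr, links)
print(round(pr["A"], 3))   # ~0.458, matching PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
```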