Networking in the Hadoop Cluster
Data analytics has become a key element of the business decision process over the last decade, and the ability to
process unprecedented volumes of data has become a critical deliverable and differentiator in the information economy.
Classic systems based on relational databases and expensive hardware, while still useful for some applications, are
increasingly unattractive compared to the scalability, economics, processing power, and availability offered by today's
network-driven distributed solutions. Perhaps the most popular of these next-generation systems is Hadoop, a
software framework that drives the compute engines in data centers from IBM to Facebook.
Hadoop and the related Hadoop Distributed File System (HDFS) form an open source framework that allows clusters of
commodity hardware servers to run parallelized, data-intensive workloads. Deployed clusters range from shoestring research
analytics rigs to thirty-petabyte data warehouses, and applications range from the most advanced machine learning
algorithms to distributed databases and transcoding farms. Given sufficient storage, processing, and networking
resources, the possibilities are nearly endless.
HDFS
The Hadoop Distributed File System (HDFS) stores multiple copies of data in 64MB chunks throughout the system for fault tolerance
and improved availability. File location is tracked by the Hadoop NameNode. Replication is increased relative to frequency of
use, and a number of other tunable parameters and features such as RAID can be used depending on the application. Because
replication is accomplished node to node rather than top down, a well-architected Hadoop cluster needs to be able to handle
significant any-to-any traffic loads.
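To make that traffic arithmetic concrete, the short Python sketch below estimates how many 64MB blocks a file occupies and roughly how much data the network must carry to replicate it. The replication factor of 3 is an assumed, commonly cited default; real clusters tune both values per application.

    # Sketch: estimate HDFS block count and replication traffic for a file.
    # The 64MB block size matches the chunking described above; the
    # replication factor of 3 is an assumed (and commonly tuned) default.
    import math

    BLOCK_SIZE_MB = 64
    REPLICATION = 3

    def replication_traffic_mb(file_size_mb):
        """Return (block count, approx MB carried by the network).

        Each block is forwarded node to node along the write pipeline,
        so the network carries roughly one copy per replica.
        """
        blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        return blocks, blocks * BLOCK_SIZE_MB * REPLICATION

    blocks, traffic_mb = replication_traffic_mb(10 * 1024)  # a 10GB file
    print(f"{blocks} blocks, ~{traffic_mb / 1024:.0f} GB across the network")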
Map
Mapping refers to the process of breaking a large file into manageable chunks that can be processed in parallel. In data warehousing
applications where many types of analysis are conducted on the same data set, these chunks may have already been formed and
distributed across the cluster. However, for many processes involving changing data or one-time analyses, the entire multi-terabyte
to multi-petabyte workload must be efficiently transferred from storage to the cluster members on a per-case basis; Facebook's
larger clusters often ingest 2PB per day. In these situations a high-capacity network is critical to time-sensitive analytics.
Once the data has been distributed throughout the cluster, each of the servers processes the data into an intermediate format
paired to a "key" which determines where it will be sent next for processing. A very simple example of this might be mapping each of
the works of Monty Python to a separate server, which will count how many times each word appears in that particular text.
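As a concrete sketch of that word-count Map step, the following Hadoop Streaming-style mapper (plain Python reading text on stdin, one of several ways a Map task can be written) emits each word paired with a count of 1; the word itself is the key that determines where the record is sent next.

    #!/usr/bin/env python3
    # Word-count Map task in Hadoop Streaming style: read raw text on
    # stdin and emit one "word<TAB>1" record per occurrence. The word is
    # the key that routes each record to a Reduce server.
    import sys

    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")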
Figure 2: Data flows without persistence from Map to Reduce and requires complete, any-to-any network topologies.
Reduce
When the Map servers have completed their tasks, they send the intermediate data to the appropriate Reduce server based
on the data key. While many tasks see significant data compression once the Map calculations are completed, other analyses, such
as the sorting used in descriptive statistics, require almost the entire data set to be re-allocated, or "shuffled," to the Reduce servers.
At this point the data network is the critical path, and its performance and latency directly impact the shuffle phase of a data set
reduction. High-speed, non-blocking network switches ensure that the Hadoop cluster is running at peak efficiency.
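The routing decision driving the shuffle is deterministic: every record with a given key must land on the same Reduce server, which is exactly what produces the fan-in shown in Figure 4. The sketch below is modeled on Hadoop's default hash-partitioning scheme; the specific hash function and reducer count shown are illustrative assumptions.

    # Sketch of shuffle routing, modeled on hash partitioning: the key is
    # hashed and reduced modulo the reducer count, so all Map servers that
    # hold a given key stream their results to one Reduce server.
    import hashlib

    NUM_REDUCERS = 8  # illustrative cluster size

    def reducer_for(key: str) -> int:
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_REDUCERS

    print(reducer_for("spam"))  # same key -> same reducer, every run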
The Reduce servers integrate the data received from the Map servers and create an aggregate result per key that can be either
reported directly or used for further analysis. To continue our previous example, each Map server would by now have sent its
intermediate word-frequency results, keyed by word, to the appropriate Reduce servers. The Reduce servers can thus perform a number
of analytics, such as calculating the aggregate count of any word across everything Python wrote, or perhaps classifying which work had the
greatest ratio of 'Spam' to 'Cheese'. Reductions are the final step before useful information can be extracted, but in many cases they also
generate entirely new data sets that can be fed back into the Hadoop cluster for further insight.
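The matching Reduce side of the word-count sketch is equally small: Hadoop delivers the shuffled records sorted by key, so a streaming-style reducer only needs to sum consecutive counts and emit a total whenever the word changes.

    #!/usr/bin/env python3
    # Word-count Reduce task in Hadoop Streaming style. Input arrives
    # sorted by key as "word<TAB>count" lines, so a running total per
    # word can be flushed each time the key changes.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")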
Figure 3: A high-performance, any-to-any network architecture is critical to optimal Hadoop cluster performance.
Figure 4: Hadoop's data shuffle between Map and Reduce creates unavoidable fan-in when multiple Map servers must stream results to a single Reduce server.
Admins gain immediate productivity with their preferred binaries and scripts without needing to learn and relearn proprietary
operating systems. Cluster management systems such as perfSONAR, gmond, Nagios, or Ganglia can be run directly on the switch
to detect and proactively address any unexpected data center conditions; possible responses range from email and SMS alerts to
actions that immediately shift topology and configuration. EOS allows the full power of open source to be leveraged for a smoothly
running cluster.
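As a hedged illustration of the kind of event-driven check this enables, the sketch below polls Linux interface error counters and mails an alert when they climb. The threshold, polling interval, and alert address are hypothetical placeholders; a production deployment would wire the same logic into Nagios or Ganglia plugins instead.

    #!/usr/bin/env python3
    # Hypothetical sketch: poll receive-error counters from /proc/net/dev
    # (available on any Linux-based switch OS) and send mail when they
    # jump. Threshold, interval, and recipient are illustrative only.
    import subprocess
    import time

    THRESHOLD = 100  # errors per interval before alerting (assumed)
    INTERVAL = 60    # seconds between polls

    def rx_errors():
        """Return {interface: cumulative rx error count}."""
        counts = {}
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:       # skip the two header lines
                name, stats = line.split(":", 1)
                counts[name.strip()] = int(stats.split()[2])  # rx errs field
        return counts

    last = rx_errors()
    while True:
        time.sleep(INTERVAL)
        now = rx_errors()
        for ifname, errs in now.items():
            if errs - last.get(ifname, 0) > THRESHOLD:
                subprocess.run(["mail", "-s", f"rx errors rising on {ifname}",
                                "noc@example.com"], input=b"check the link\n")
        last = now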
Figure 5: Linux and open source tools can be combined for event driven reactivity and visibility in the next generation network.
Furthermore, anyone who has managed large installations knows that data center class automation begins the moment hardware
meets rack. Arista's Zero Touch Provisioning allows bare metal switches to be instantly configured and operational in the network,
and EOS can even use PXE to provision servers parameterized by VLAN or DHCP option ID. Finally, dynamic topology detection can
automate modifications to the Hadoop XML files that control location-aware programming (a daunting task when separate teams
install hardware and write software, and one which is critical to efficiently pushing computation to the correct server; a sketch of
such a topology script follows below). Ultimately, Arista's powerful tools for automation significantly reduce the overhead of large
cluster management, allowing IT staff to focus on meeting and exceeding actual business deliverables.
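Hadoop's location awareness is conventionally wired up through a topology script named by the net.topology.script.file.name property: Hadoop invokes the script with node addresses as arguments and reads one rack path per address from stdout. The minimal sketch below uses a static table as a stand-in for whatever dynamic topology detection actually discovers; the addresses and rack names are hypothetical.

    #!/usr/bin/env python3
    # Minimal sketch of a Hadoop rack-awareness topology script. Hadoop
    # calls it with one or more node IPs/hostnames and expects one rack
    # path per argument on stdout. The table below is a placeholder for
    # dynamically detected topology; addresses and racks are hypothetical.
    import sys

    RACK_MAP = {
        "10.1.1.11": "/rack1",
        "10.1.1.12": "/rack1",
        "10.1.2.11": "/rack2",
    }
    DEFAULT_RACK = "/default-rack"  # Hadoop's conventional fallback

    for node in sys.argv[1:]:
        print(RACK_MAP.get(node, DEFAULT_RACK))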
Figure 6: Arista's Zero Touch Provisioning and extensibility enable dynamically provisioned deployment and real-time rack awareness in Hadoop clusters.
Copyright © 2016 Arista Networks, Inc. All rights reserved. CloudVision and EOS are registered trademarks and Arista Networks
is a trademark of Arista Networks, Inc. All other company names are trademarks of their respective holders. Information in this
document is subject to change without notice. Certain features may not yet be available. Arista Networks, Inc. assumes no
responsibility for any errors that may appear in this document. 04/14