GETTING STARTED WITH APACHE HADOOP

CONTENTS

• INTRODUCTION
• DESIGN CONCEPTS
• HDFS
• YARN
• YARN APPLICATIONS
• ECOSYSTEM
• ADDITIONAL RESOURCES

WRITTEN BY PIOTR KREWSKI, FOUNDER AND BIG DATA CONSULTANT AT GETINDATA, AND ADAM KAWA, CEO AND FOUNDER AT GETINDATA
INTRODUCTION

This Refcard presents Apache Hadoop, the most popular software framework enabling distributed storage and processing of large datasets using simple high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, and guide you on how to start using it as well as how to write and execute various applications on Hadoop.

In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a cluster of servers so that these servers can communicate and work together to store and process large datasets. Hadoop has become very successful in recent years thanks to its ability to effectively crunch big data. It allows companies to store all of their data in one system and perform analysis on this data that would otherwise be impossible or very expensive with traditional solutions.

Many companion tools built around Hadoop offer a wide variety of processing techniques. Integration with ancillary systems and utilities is excellent, making real-world work with Hadoop easier and more productive. Together, these tools form the Hadoop ecosystem.

You can think of Hadoop as a Big Data Operating System that makes it possible to run different types of workloads over all your huge datasets, ranging from offline batch processing through machine learning to real-time stream processing.

Visit hadoop.apache.org to get more information about the project and access detailed documentation.

To install Hadoop, you can take the code from hadoop.apache.org or (which is more recommended) use one of the Hadoop distributions. The three most widely used ones come from Cloudera (CDH), Hortonworks (HDP), and MapR. A Hadoop distribution is a set of tools from the Hadoop ecosystem bundled together and guaranteed by the respective vendor to work and integrate with each other well. Additionally, each vendor offers tools (open-source or proprietary) to provision, manage, and monitor the whole platform.

DESIGN CONCEPTS

To solve the challenge of processing and storing large datasets, Hadoop was built according to the following core characteristics:

• Distribution - instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.

• Horizontal scalability - it is easy to extend a Hadoop cluster by just adding new machines. Every new machine proportionally increases the total storage and processing power of the Hadoop cluster.

• Fault-tolerance - Hadoop continues to operate even when a few hardware or software components fail to work properly.

• Cost-optimization - Hadoop does not require expensive top-end servers and can work just fine without commercial licenses.

• Programming abstraction - Hadoop takes care of the messy details related to distributed computing. Thanks to a high-level API, users can focus on implementing business logic that solves their real-world problems.
• Data locality - Hadoop doesn't move large datasets to where the application is running, but runs the application where the data already is.

HADOOP COMPONENTS

Hadoop is divided into two core components:

• HDFS - a distributed file system.

• YARN - a cluster resource management technology.

HOT TIP
Many execution frameworks run on top of YARN, each tuned for a specific use case. The most important ones are discussed under "YARN Applications" below.

HDFS

HDFS is a Hadoop distributed file system. It can run on as many servers as you need - HDFS easily scales to thousands of nodes and petabytes of data. The larger the HDFS setup is, the bigger the probability that some of its components will fail; HDFS is designed to survive such failures by keeping redundant copies of data on multiple servers.

HDFS consists of two types of daemons:

• NameNode - the master process responsible for managing the file system namespace (filenames, permissions and ownership, last modification date, etc.) and controlling access to the data stored in HDFS. If the NameNode is down, you cannot access your data. Fortunately, you can configure multiple NameNodes that ensure high availability for this critical HDFS process.

• DataNodes - slave processes installed on each worker node in the cluster that take care of storing and serving data.

Figure 1 illustrates the installation of HDFS on a 4-node cluster: one of the nodes hosts the NameNode daemon while the other three run DataNode daemons.

Note: HDFS does not allow you to modify a file's content; there is only support for appending data at the end of a file. However, Hadoop was designed with HDFS as one of many pluggable storage options - for example, with MapR-FS, a proprietary filesystem, files are fully read-write. Other HDFS alternatives include Amazon S3, Google Cloud Storage, and IBM GPFS.

To start playing with Hadoop, you don't have to go through the process of setting up a whole cluster. Hadoop can run in so-called pseudo-distributed mode on a single machine. You can download a sandbox Virtual Machine with all the HDFS components already installed and start using Hadoop in no time! Just follow one of these links:

• mapr.com/products/mapr-sandbox-hadoop
• hortonworks.com/products/hortonworks-sandbox/#install
• cloudera.com/downloads/quickstart_vms/5-12.html
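Once a sandbox VM or a pseudo-distributed installation is running, a quick way to confirm that HDFS is healthy is to ask for a report of the DataNodes and run a filesystem check; both are standard HDFS administration commands:

$ hdfs dfsadmin -report   # lists live DataNodes and their remaining capacity
$ hdfs fsck /             # verifies the health and replication of all blocks under /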
INTERACTING WITH HDFS

You interact with HDFS from the command line using the hdfs dfs command. A few examples, operating on a file called songs.txt in the home directory of user adam:

Print the content of a file:

$ hdfs dfs -cat /user/adam/songs.txt

Change the permissions of a file (here, to read/write/execute for the owner only):

$ hdfs dfs -chmod 700 /user/adam/songs.txt

Set the replication factor of a file to 4:

$ hdfs dfs -setrep -w 4 /user/adam/songs.txt

Check the size of a file:

$ hdfs dfs -du -h /user/adam/songs.txt

Create a subdirectory in your home directory:

$ hdfs dfs -mkdir songs

Note that relative paths always refer to the home directory of the user executing the command. There is no concept of a "current" directory on HDFS (in other words, there is no equivalent of the "cd" command).

Move the file to the newly created subdirectory:

$ hdfs dfs -mv songs.txt songs

Remove a directory from HDFS:

$ hdfs dfs -rm -r songs

You can type hdfs dfs without any parameters to get a full list of available commands.
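Note that the examples above assume songs.txt already sits in the HDFS home directory of user adam. One way to upload it there from the local filesystem (the paths are examples only) is:

$ hdfs dfs -mkdir -p /user/adam        # create the home directory if it does not exist yet
$ hdfs dfs -put songs.txt /user/adam/  # copy the local file into HDFS
$ hdfs dfs -ls /user/adam              # verify that the file has arrived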
If you prefer to use a graphical interface to interact with HDFS, you can have a look at the free and open-source HUE (Hadoop User Experience). It contains a convenient "File Browser" component that allows you to browse HDFS files and directories and perform basic operations. You can also use HUE to upload files to HDFS directly from your computer with the "Upload" button.

YARN

YARN (Yet Another Resource Negotiator) is responsible for managing resources on the Hadoop cluster and enables running various distributed applications that process data stored on HDFS.

YARN, similarly to HDFS, follows the master-slave design, with the ResourceManager process acting as a master and multiple NodeManagers acting as workers. They have the following responsibilities:

ResourceManager

• Keeps track of live NodeManagers and the amount of available compute resources on each server in the cluster.

• Allocates available resources to applications.

• Monitors execution of all the applications on the Hadoop cluster.

NodeManager

• Offers the compute resources of its node to applications in the form of resource containers, which represent a combination of an amount of RAM and a number of CPU cores.
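Both daemons can also be inspected from the command line with the yarn utility, for example:

$ yarn node -list          # NodeManagers currently registered with the ResourceManager
$ yarn application -list   # applications currently accepted or running on the cluster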
Each application that executes on the YARN cluster has its own ApplicationMaster process. This process starts when the application is scheduled on the cluster and coordinates the execution of all tasks within this application.

Figure 3 illustrates the cooperation of YARN daemons on a 4-node cluster running two applications that spawned 7 tasks in total.

HDFS DataNodes and YARN NodeManagers are typically collocated on the same nodes to enable data locality. This design makes it possible to perform computations on the machines that store the data, thus minimizing the necessity of sending large chunks of data over the network, which leads to faster execution times.

YARN APPLICATIONS

YARN is merely a resource manager that allocates distributed compute resources to various applications running on a Hadoop cluster. In other words, YARN itself does not provide any processing logic that can analyze data in HDFS. Hence, various processing frameworks must be integrated with YARN (by providing a specific implementation of the ApplicationMaster) to run on a Hadoop cluster and process data from HDFS.

Below are short descriptions of the most popular distributed computation frameworks that can run on a Hadoop cluster powered by YARN:

• MapReduce — the traditional and oldest processing framework for Hadoop that expresses computation as a series of map and reduce tasks. It is currently being superseded by much faster engines like Spark or Flink.

• Apache Spark — a fast and general engine for large-scale data processing that optimizes the computation by caching data in memory (more details in later sections).

Applications running on the cluster can be tracked with the ResourceManager WebUI, which is, by default, exposed on port 8088. Among other things, it shows the total amount of RAM and number of CPU cores available for processing, as well as the current Hadoop cluster load. Check out "Cluster Metrics" at the top of the page.

If you click on entries in the "ID" column, you'll get more detailed metrics and statistics concerning the execution of the selected application.
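The same cluster metrics are also exposed through the ResourceManager REST API, so you can fetch them from a terminal; replace the placeholder host name below with your ResourceManager's address:

$ curl http://resourcemanager-host:8088/ws/v1/cluster/metrics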
PROCESSING DATA ON HADOOP

There are a number of frameworks that make the process of implementing distributed applications on Hadoop easy. In this section, we focus on the most popular ones: Hive and Spark.
HIVE

Hive enables working with data on HDFS using the familiar SQL dialect.

When using Hive, our datasets in HDFS are represented as tables that have rows and columns. Therefore, Hive is easy to learn and appealing to use for those who already know SQL and have experience working with relational databases.

Hive is not an independent execution engine. Each Hive query is translated into code in either MapReduce, Tez, or Spark (work in progress) that is subsequently executed on a Hadoop cluster.

HIVE EXAMPLE

Let's process a dataset about songs listened to by users at a given time. The input data consists of a tab-separated file called songs.tsv:

"Creep" Radiohead piotr 2017-07-20
"Desert Rose" Sting adam 2017-07-14
"Desert Rose" Sting piotr 2017-06-10
"Karma Police" Radiohead adam 2017-07-23
"Everybody" Madonna piotr 2017-07-01
"Stupid Car" Radiohead adam 2017-07-18
"All This Time" Sting adam 2017-07-13
HOT TIP
After you start a session with Beeline, all the tables that you create go under the "default" database. You can change it either by providing a specific database name as a prefix to the table name or by typing the "use <database_name>;" command.

Check if the table was created successfully:

beeline> SHOW tables;

Run a query that finds the two most popular artists in July, 2017:

SELECT artist, COUNT(*) AS total
FROM songs
WHERE year(date) = 2017 AND month(date) = 7
GROUP BY artist
ORDER BY total DESC
LIMIT 2;

You can monitor the execution of your query with the ResourceManager WebUI. Depending on your configuration, you will see either MapReduce jobs or a Spark application running on the cluster.

Note: You can also write and execute Hive queries from HUE. There is a Query Editor dedicated to Hive with handy features like syntax auto-completion and coloring, the option to save queries, and basic visualization of the results in the form of line, bar, or pie charts.
SPARK

As an example of working with Spark's Python DataFrame API, we implement the same logic as with Hive, i.e. finding the two most popular artists in July, 2017.

First, we have to read in our dataset. Spark can read data directly from Hive tables:

songs = spark.table("songs")

Spark can directly read and write data to and from many different data stores, not only HDFS. You can easily create dataframes from records in a table in MySQL or Oracle, rows from HBase, JSON files on your local disk, index data in ElasticSearch, and many, many others.

Spark's dataframe transformations look similar to SQL operators, so they are quite easy to use and understand.
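Continuing from the songs dataframe above, a minimal sketch of the same query in PySpark might look as follows (the column names are assumed to match the Hive table from the previous section):

from pyspark.sql.functions import col, count, month, year

top_artists = (songs
    .where((year("date") == 2017) & (month("date") == 7))  # keep only plays from July 2017
    .groupBy("artist")
    .agg(count("*").alias("total"))                        # number of plays per artist
    .orderBy(col("total").desc())
    .limit(2))                                             # two most popular artists

top_artists.show()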
If you plan to run multiple queries against the same data, it is worth caching it by calling the cache() method on the dataframe (e.g. songs.cache()). Spark will then keep your data in RAM and avoid hitting the disk when you run subsequent queries, giving you an order of magnitude better performance.

Dataframes are just one of the available APIs in Spark. There are also APIs and libraries for near real-time processing (Spark Streaming), machine learning (MLlib), and graph processing (GraphFrames).

Thanks to Spark's versatility, you can use it to solve a wide variety of processing needs, staying within the same framework and sharing pieces of code between different contexts (e.g. batch and streaming).

ECOSYSTEM

Oozie — a workflow scheduler for Hadoop jobs. With Oozie, you can combine different types of actions (e.g. Spark applications, Hive queries, Sqoop imports, etc.) into a workflow and then schedule that workflow for automated execution.

Zookeeper — a distributed synchronization and configuration management service for Hadoop. A number of Hadoop services take advantage of Zookeeper to work correctly and efficiently in a distributed environment.

SUMMARY

Apache Hadoop is the most popular and widely-used platform for big data processing, thanks to features like linear scalability, high-level APIs, the ability to run on heterogeneous hardware (both on-premise and in the cloud), fault tolerance, and an open-source nature. Hadoop has been successfully deployed in production by many companies for over a decade.
ADDITIONAL RESOURCES

• spark.apache.org/docs/latest/sql-programming-guide.html
• dzone.com/articles/apache-hadoop-vs-apache-spark
• dzone.com/articles/hadoop-and-spark-synergy-is-real
• sqoop.apache.org
PIOTR KREWSKI has extensive practical experience in managing and expanding Hadoop clusters. He is a co-founder of GetInData, where he helps companies with building scalable, distributed architectures for storing and processing big data. Piotr also serves as a Hadoop instructor, delivering GetInData proprietary trainings for administrators, developers, and analysts working with big data solutions.