
GETTING STARTED WITH APACHE HADOOP

CONTENTS

• INTRODUCTION
• DESIGN CONCEPTS
• HADOOP COMPONENTS
• HDFS
• YARN
• YARN APPLICATIONS
• MONITORING YARN APPLICATIONS
• PROCESSING DATA ON HADOOP
• OTHER TOOLS FROM THE HADOOP ECOSYSTEM
• ADDITIONAL RESOURCES

WRITTEN BY PIOTR KREWSKI, FOUNDER AND BIG DATA CONSULTANT AT GETINDATA,
AND ADAM KAWA, CEO AND FOUNDER AT GETINDATA

INTRODUCTION

This Refcard presents Apache Hadoop, the most popular software framework enabling distributed storage and processing of large datasets using simple high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, and guide you on how to start using it as well as how to write and execute various applications on Hadoop.

In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a cluster of servers so that these servers can communicate and work together to store and process large datasets. Hadoop has become very successful in recent years thanks to its ability to effectively crunch big data. It allows companies to store all of their data in one system and perform analysis on this data that would otherwise be impossible or very expensive to do with traditional solutions.

Many companion tools built around Hadoop offer a wide variety of processing techniques. Integration with ancillary systems and utilities is excellent, making real-world work with Hadoop easier and more productive. These tools together form the Hadoop ecosystem.

You can think of Hadoop as a Big Data Operating System that makes it possible to run different types of workloads over all your huge datasets. This ranges from offline batch processing through machine learning to real-time stream processing.

HOT TIP: Visit hadoop.apache.org to get more information about the project and access detailed documentation.

To install Hadoop, you can take the code from hadoop.apache.org or (what is more recommended) use one of the Hadoop distributions. The three most widely used ones come from Cloudera (CDH), Hortonworks (HDP), and MapR. A Hadoop distribution is a set of tools from the Hadoop ecosystem bundled together and guaranteed by the respective vendor to work and integrate with each other well. Additionally, each vendor offers tools (open-source or proprietary) to provision, manage, and monitor the whole platform.

DESIGN CONCEPTS

To solve the challenge of processing and storing large datasets, Hadoop was built according to the following core characteristics:

• Distribution - instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.

• Horizontal scalability - it is easy to extend a Hadoop cluster by just adding new machines. Every new machine proportionally increases the total storage and processing power of the Hadoop cluster.

• Fault-tolerance - Hadoop continues to operate even when a few hardware or software components fail to work properly.

• Cost-optimization - Hadoop does not require expensive top-end servers and can work just fine without commercial licenses.

• Programming abstraction - Hadoop takes care of all the messy details related to distributed computing. Thanks to a high-level API, users can focus on implementing business logic that solves their real-world problems.


• Data locality - Hadoop doesn't move large datasets to where the application is running, but runs the application where the data already is.

HADOOP COMPONENTS

Hadoop is divided into two core components:

• HDFS - a distributed file system.

• YARN - a cluster resource management technology.

HOT TIP: Many execution frameworks run on top of YARN, each tuned for a specific use case. The most important ones are discussed under "YARN Applications" below.

Let's take a closer look at their architecture and describe how they cooperate.

HDFS

HDFS is the Hadoop distributed file system. It can run on as many servers as you need - HDFS easily scales to thousands of nodes and petabytes of data.

The larger the HDFS setup is, the bigger the probability that some disks, servers, or network switches will fail. HDFS survives these types of failures by replicating data on multiple servers. HDFS automatically detects that a given component has failed and takes the necessary recovery actions, which happen transparently to the user.

HDFS is designed for storing large files of the magnitude of hundreds of megabytes or gigabytes and provides high-throughput streaming data access to them. Last but not least, HDFS supports the write-once-read-many model. For this use case, HDFS works like a charm. However, if you need to store a large number of small files with random read-write access, then other systems like RDBMS and Apache HBase can do a better job.

Note: HDFS does not allow you to modify a file's content. There is only support for appending data at the end of the file. However, Hadoop was designed with HDFS as one of many pluggable storage options — for example, with MapR-FS, a proprietary filesystem, files are fully read-write. Other HDFS alternatives include Amazon S3, Google Cloud Storage, and IBM GPFS.

ARCHITECTURE OF HDFS

HDFS consists of the following daemons that are installed and run on selected cluster nodes:

• NameNode - the master process responsible for managing the file system namespace (filenames, permissions and ownership, last modification date, etc.) and controlling access to data stored in HDFS. If the NameNode is down, you cannot access your data. Fortunately, you can configure multiple NameNodes that ensure high availability for this critical HDFS process.

• DataNodes - slave processes installed on each worker node in the cluster that take care of storing and serving data.

Figure 1 illustrates the installation of HDFS on a 4-node cluster. One of the nodes hosts the NameNode daemon while the other three run DataNode daemons.

Note: NameNode and DataNode are Java processes that run on top of a Linux distribution such as RedHat, CentOS, Ubuntu, and others. They use local disks for storing HDFS data.

HDFS splits each file into a sequence of smaller, but still large, blocks (the default block size equals 128MB — bigger blocks mean fewer disk seek operations, which results in larger throughput). Each block is stored redundantly on three DataNodes for fault tolerance (the number of replicas for each file is configurable).

Figure 2 illustrates the concept of splitting files into blocks. File X is split into blocks B1 and B2, and File Y comprises only one block, B3. All blocks are replicated twice within the cluster.
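You can ask HDFS how a particular file was split into blocks and where the replicas ended up. The command below uses the standard HDFS file system check tool; the path is just the example file used in the commands later in this Refcard:

$ hdfs fsck /user/adam/songs.txt -files -blocks -locations

The output lists each block of the file together with its replication factor and the DataNodes that hold the replicas.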

INTERACTING WITH HDFS

HDFS provides a simple POSIX-like interface to work with data. You perform file system operations using hdfs dfs commands.

To start playing with Hadoop, you don't have to go through the process of setting up a whole cluster. Hadoop can run in so-called pseudo-distributed mode on a single machine.

You can download the sandbox Virtual Machine with all the HDFS components already installed and start using Hadoop in no time! Just follow one of these links:

• mapr.com/products/mapr-sandbox-hadoop

• hortonworks.com/products/hortonworks-sandbox/#install

• cloudera.com/downloads/quickstart_vms/5-12.html

The following steps illustrate typical operations that an HDFS user can perform:

List the content of your home directory:

$ hdfs dfs -ls /user/adam

Upload a file from the local file system to HDFS:

$ hdfs dfs -put songs.txt /user/adam

Read the content of the file from HDFS:

$ hdfs dfs -cat /user/adam/songs.txt

Change the permissions of a file:

$ hdfs dfs -chmod 700 /user/adam/songs.txt

Set the replication factor of a file to 4:

$ hdfs dfs -setrep -w 4 /user/adam/songs.txt

Check the size of a file:

$ hdfs dfs -du -h /user/adam/songs.txt

Create a subdirectory in your home directory:

$ hdfs dfs -mkdir songs

Note that relative paths always refer to the home directory of the user executing the command. There is no concept of a "current" directory on HDFS (in other words, there is no equivalent to the "cd" command).

Move the file to the newly created subdirectory:

$ hdfs dfs -mv songs.txt songs

Remove a directory from HDFS:

$ hdfs dfs -rm -r songs

Note: Removed files and directories are moved to the trash (.Trash in your home directory on HDFS) and stay there for one day until they are permanently deleted. You can restore them just by copying or moving them from .Trash back to the original location.
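For example, assuming the default trash layout, the file removed above could be restored with a command along these lines (the exact path under .Trash depends on your trash configuration and when the file was deleted):

$ hdfs dfs -mv .Trash/Current/user/adam/songs/songs.txt songs.txt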

YARN assigns cluster resources to various applications in the


HOT TIP

You can type hdfs dfs without any parameters to get a full list
of available commands. form of resource containers, which represent a combination of
the amount of RAM and number of CPU cores.
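You can also inspect this from the command line. The standard yarn node command lists all NodeManagers known to the ResourceManager together with their state and the number of running containers on each, and yarn node -status prints the memory and vcores in use on a given node (the Node-Id is a placeholder you copy from the list output):

$ yarn node -list -all
$ yarn node -status <Node-Id>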


Each application that executes on the YARN cluster has its own ApplicationMaster process. This process starts when the application is scheduled on the cluster and coordinates the execution of all tasks within this application.

Figure 3 illustrates the cooperation of YARN daemons on a 4-node cluster running two applications that spawned 7 tasks in total.

HADOOP = HDFS + YARN

HDFS and YARN daemons running on the same cluster give us a powerful platform for storing and processing large datasets.

DataNode and NodeManager processes are collocated on the same nodes to enable data locality. This design makes it possible to perform computations on the machines that store the data, thus minimizing the need to send large chunks of data over the network, which leads to faster execution times.

YARN APPLICATIONS

YARN is merely a resource manager that knows how to allocate distributed compute resources to various applications running on a Hadoop cluster. In other words, YARN itself does not provide any processing logic that can analyze data in HDFS. Hence, various processing frameworks must be integrated with YARN (by providing a specific implementation of the ApplicationMaster) to run on a Hadoop cluster and process data from HDFS.

Below is a list of short descriptions of the most popular distributed computation frameworks that can run on a Hadoop cluster powered by YARN.

• MapReduce — the traditional and oldest processing framework for Hadoop that expresses computation as a series of map and reduce tasks. It is currently being superseded by much faster engines like Spark or Flink.

• Apache Spark — a fast and general engine for large-scale data processing that optimizes the computation by caching data in memory (more details in the later sections).

• Apache Flink — a high-throughput, low-latency batch and stream processing engine. It stands out for its robust ability to process large data streams in real time. You can find out about the differences between Spark and Flink in this comprehensive article: dzone.com/articles/apache-hadoop-vs-apache-spark

• Apache Tez — an engine designed with the aim of speeding up the execution of SQL queries with Hive. It is available on the Hortonworks Data Platform, where it replaces MapReduce as an execution engine for Hive.

MONITORING YARN APPLICATIONS

The execution of all applications running on the Hadoop cluster can be tracked with the ResourceManager WebUI, which is, by default, exposed on port 8088. For each application, you can read a bunch of important information.

HOT TIP: With the ResourceManager WebUI, you can check the total amount of RAM and number of CPU cores available for processing as well as the current Hadoop cluster load. Check out "Cluster Metrics" at the top of the page.

If you click on entries in the "ID" column, you'll get more detailed metrics and statistics concerning the execution of the selected application.
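If you prefer the command line, or want to monitor applications programmatically, the same information is exposed through standard YARN tooling: the yarn application command and the ResourceManager REST API (replace resourcemanager-host with the hostname of your ResourceManager):

$ yarn application -list
$ curl http://resourcemanager-host:8088/ws/v1/cluster/apps

The REST call returns a list of applications, including their state, progress, and resource usage (JSON or XML, depending on the Accept header you send).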
ApplicationMaster) to run on a Hadoop cluster and process data
from HDFS.
PROCESSING DATA ON HADOOP
Below is a list of short descriptions of the most popular There are a number of frameworks that make the process of
distributed computation frameworks that can run on a Hadoop implementing distributed applications on Hadoop easy. In this
cluster powered by YARN. section, we focus on the most popular ones: Hive and Spark.


HIVE

Hive enables working with data on HDFS using the familiar SQL dialect.

When using Hive, our datasets in HDFS are represented as tables that have rows and columns. Therefore, Hive is easy to learn and appealing to use for those who already know SQL and have experience working with relational databases.

Hive is not an independent execution engine. Each Hive query is translated into code in either MapReduce, Tez, or Spark (work in progress) that is subsequently executed on a Hadoop cluster.
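Which engine is used depends on your distribution and configuration. On many clusters you can switch the engine for the current session with a standard Hive setting — for example, assuming Tez is installed on your cluster:

beeline> SET hive.execution.engine=tez;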
HIVE EXAMPLE

Let's process a dataset about songs listened to by users at a given time. The input data consists of a tab-separated file called songs.txt:

"Creep" Radiohead piotr 2017-07-20
"Desert Rose" Sting adam 2017-07-14
"Desert Rose" Sting piotr 2017-06-10
"Karma Police" Radiohead adam 2017-07-23
"Everybody" Madonna piotr 2017-07-01
"Stupid Car" Radiohead adam 2017-07-18
"All This Time" Sting adam 2017-07-13

We use Hive to find the two most popular artists in July, 2017.

Upload the songs.txt file to HDFS. You can do it with the help of the "File Browser" in HUE or type the following commands using the command line tool:

# hdfs dfs -mkdir /user/training/songs
# hdfs dfs -put songs.txt /user/training/songs

Enter Hive using the Beeline client. You have to provide an address to HiveServer2, which is a process that enables remote clients (such as Beeline) to execute Hive queries and retrieve results.

# beeline
beeline> !connect jdbc:hive2://localhost:10000 <user> <password>

Create a table in Hive that points to our data on HDFS (note that we need to specify the proper delimiter and location of the file so that Hive can represent the raw data as a table):

beeline> CREATE TABLE songs(
    title STRING,
    artist STRING,
    user STRING,
    date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/training/songs';

HOT TIP: After you start a session with Beeline, all the tables that you create go under the "default" database. You can change it either by providing a specific database name as a prefix to the table name or by typing the "use <database_name>;" command.

Check if the table was created successfully:

beeline> SHOW tables;

Run a query that finds the two most popular artists in July, 2017:

SELECT artist, COUNT(*) AS total
FROM songs
WHERE year(date) = 2017 AND month(date) = 7
GROUP BY artist
ORDER BY total DESC
LIMIT 2;

You can monitor the execution of your query with the ResourceManager WebUI. Depending on your configuration, you will see either MapReduce jobs or a Spark application running on the cluster.

Note: You can also write and execute Hive queries from HUE. There is a Query Editor dedicated to Hive with handy features like syntax auto-completion and coloring, the option to save queries, and basic visualization of the results in the form of line, bar, or pie charts.

SPARK

Apache Spark is a general purpose distributed computing framework. It is well integrated with the Hadoop ecosystem, and Spark applications can be easily run on YARN.

Compared to MapReduce - the traditional Hadoop computing paradigm - Spark offers excellent performance, ease of use, and versatility when it comes to different data processing needs.

Spark's speed comes mainly from its ability to store data in RAM between subsequent execution steps and from optimizations in the execution plan and data serialization.

Let's jump straight into the code to get a taste of Spark. We can choose from Scala, Java, Python, SQL, or R APIs. Our examples are in Python. To start the Spark Python shell (called pyspark), type:

# pyspark

After a while, you'll see a Spark prompt. It means that a Spark application was started on YARN (you can go to the ResourceManager WebUI for confirmation; look for a running application named "PySparkShell").
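If the shell starts in local mode in your environment instead, you can request YARN explicitly. The same flag applies when you submit a standalone application with spark-submit (my_script.py is just a placeholder name):

# pyspark --master yarn
# spark-submit --master yarn my_script.py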
HOT TIP: If you don't like to work with the shell, you should check out web-based notebooks such as Jupyter (jupyter.org) or Zeppelin (zeppelin.apache.org).


As an example of working with Spark's Python dataframe API, we implement the same logic as with Hive, i.e. finding the two most popular artists in July, 2017.

First, we have to read in our dataset. Spark can read data directly from Hive tables:

# songs = spark.table("songs")

Data with a schema in Spark is represented as a so-called dataframe. Dataframes are immutable and are created by reading data from different source systems or by applying transformations on other dataframes.

To preview the content of any dataframe, call the show() method:

# songs.show()
+-------------+---------+-----+----------+
|        title|   artist| user|      date|
+-------------+---------+-----+----------+
|        Creep|Radiohead|piotr|2017-07-20|
|  Desert Rose|    Sting| adam|2017-07-14|
|  Desert Rose|    Sting|piotr|2017-06-10|
| Karma Police|Radiohead| adam|2017-07-23|
|    Everybody|  Madonna|piotr|2017-07-01|
|   Stupid Car|Radiohead| adam|2017-07-18|
|All This Time|    Sting| adam|2017-07-13|
+-------------+---------+-----+----------+

To achieve the desired result, we need to chain a couple of intuitive functions together:

# from pyspark.sql.functions import desc
# songs.filter("year(date) = 2017 AND month(date) = 7") \
    .groupBy("artist") \
    .count() \
    .sort(desc("count")) \
    .limit(2) \
    .show()

Spark's dataframe transformations look similar to SQL operators, so they are quite easy to use and understand.

HOT TIP: If you perform multiple transformations on the same dataframe (e.g. when you explore a new dataset), you can tell Spark to cache it in memory by calling the cache() method on the dataframe (e.g. songs.cache()). Spark will then keep your data in RAM and avoid hitting the disk when you run subsequent queries, giving you an order of magnitude better performance.

Dataframes are just one of the available APIs in Spark. There are also APIs and libraries for near real-time processing (Spark Streaming), machine learning (MLlib), and graph processing (GraphFrames).

Thanks to Spark's versatility, you can use it to solve a wide variety of processing needs, staying within the same framework and sharing pieces of code between different contexts (e.g. batch and streaming).

Spark can directly read and write data to and from many different data stores, not only HDFS. You can easily create dataframes from records in a table in MySQL or Oracle, rows from HBase, JSON files on your local disk, index data in ElasticSearch, and many, many others.

OTHER TOOLS FROM THE HADOOP ECOSYSTEM

The Hadoop ecosystem contains many different tools to accomplish the specific tasks needed by modern big data platforms. Below, you can see a list of popular and important projects that were not covered in the previous sections.

Sqoop — an indispensable tool for moving data in bulk between relational datastores and HDFS/Hive. You interact with Sqoop using the command line, selecting the desired action and providing a bunch of parameters that control the data movement process. Importing data about users from a MySQL table is as easy as typing the following command:

# sqoop import \
  --connect jdbc:mysql://localhost/streamrock \
  --username $(whoami) -P \
  --table users \
  --hive-import

Note: Sqoop uses MapReduce to transfer data back and forth between the relational datastore and Hadoop. You can track a MapReduce application submitted by Sqoop in the ResourceManager WebUI.
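Exporting data in the opposite direction works in a similar way. A minimal sketch, assuming the records you want to push back to MySQL sit in the HDFS directory /user/training/results and the target table already exists:

# sqoop export \
  --connect jdbc:mysql://localhost/streamrock \
  --username $(whoami) -P \
  --table results \
  --export-dir /user/training/results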
Oozie — a coordination and orchestration service for Hadoop. With Oozie, you can build a workflow of different actions that you want to perform on a Hadoop cluster (e.g. HDFS commands, Spark applications, Hive queries, Sqoop imports, etc.) and then schedule the workflow for automated execution.

HBase — a NoSQL database built on top of HDFS. It enables very fast random reads and writes of individual records using row keys.

Zookeeper — a distributed synchronization and configuration management service for Hadoop. A number of Hadoop services take advantage of Zookeeper to work correctly and efficiently in a distributed environment.

SUMMARY

Apache Hadoop is the most popular and widely used platform for big data processing, thanks to features like linear scalability, high-level APIs, the ability to run on heterogeneous hardware (both on-premise and in the cloud), fault tolerance, and an open-source nature. Hadoop has been successfully deployed in production by many companies for over a decade.


The Hadoop ecosystem offers a variety of open-source tools for collecting, storing, and processing data, as well as for cluster deployment, monitoring, and data security. Thanks to this amazing ecosystem of tools, each company can now easily and cheaply store and process huge amounts of data in a distributed and highly scalable way.

ADDITIONAL RESOURCES

• hadoop.apache.org

• hive.apache.org

• spark.apache.org

• spark.apache.org/docs/latest/sql-programming-guide.html

• dzone.com/articles/apache-hadoop-vs-apache-spark

• dzone.com/articles/hadoop-and-spark-synergy-is-real

• sqoop.apache.org

• dzone.com/articles/sqoop-import-data-from-mysql-to-hive

• oozie.apache.org

• tez.apache.org

Major packaged distributions:

• Cloudera: cloudera.com/content/cloudera/en/products-and-services/cdh.html

• MapR: mapr.com/products/mapr-editions

• Hortonworks: hortonworks.com/hadoop/

Written by Piotr Krewski, Founder and Big Data Consultant at GetInData

Piotr has extensive practical experience in writing applications running on Hadoop clusters as well as in maintaining, managing, and expanding Hadoop clusters. He is a co-founder of GetInData, where he helps companies build scalable, distributed architectures for storing and processing big data. Piotr also serves as a Hadoop instructor, delivering GetInData's proprietary trainings for administrators, developers, and analysts working with big data solutions.

Written by Adam Kawa, CEO and Founder at GetInData

Adam became a fan of big data after implementing his first Hadoop job in 2010. Since then, he has been working with big data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller, the University of Warsaw, a Cloudera Training Partner, and more. Three years ago, he co-founded GetInData: a company that helps its customers become data-driven and builds innovative big data solutions. Adam is also a blogger, a co-organizer of the Warsaw Hadoop User Group, and a frequent speaker at major big data conferences and meetups.

DZone, Inc.
150 Preston Executive Dr., Cary, NC 27513
888.678.0399 / 919.678.0300

DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects, and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code, and more. "DZone is a developer's dream," says PC Magazine.

Copyright © 2017 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
