
Introduction to Hortonworks Data Platform (HDP)

Big Data Ecosystem

© Copyright IBM Corporation 2019
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Unit objectives
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components


Hortonworks Data Platform (HDP)

• HDP is a platform for data-at-rest
• Secure, enterprise-ready open source Apache Hadoop distribution based on a centralized architecture (YARN)
• HDP is:
▪ Open
▪ Central
▪ Interoperable
▪ Enterprise ready

Hortonworks Data Platform (HDP) is a powerful platform for managing Big Data at rest.
HDP is Open Enterprise Hadoop, and it's a platform that is:
• 100% Open Source
• Centrally architected with YARN at its core
• Interoperable with existing technology and skills, AND
• Enterprise-ready, with data services for operations, governance and security


Hortonworks Data Platform (HDP)

[Slide diagram: the HDP stack. At the base is Data Management: YARN (the Data Operating System) over the Hadoop Distributed File System (HDFS), with WebHDFS and NFS access. On top of YARN sit the Data Access engines (Batch: MapReduce; Script: Pig; SQL: Hive, Big SQL; NoSQL: HBase, Accumulo, Phoenix; Stream: Storm; Search: Solr; In-Mem: Spark; others such as HAWQ and partner engines, with Tez and Slider underneath). Alongside are Data Flow (Kafka, NiFi, Sqoop), Governance and Integration (Falcon, Atlas), Tools (Zeppelin, Ambari User Views), Security (Ranger, Knox, Atlas, HDFS encryption), and Operations (Ambari, Cloudbreak, ZooKeeper, Oozie).]

Here is the high-level view of the Hortonworks Data Platform. It is divided into several categories, listed in no particular order of importance:
• Governance and Integration
• Tools
• Security
• Operations
• Data Access
• Data Management
The next several slides go into more detail on each of these groupings.


Data Flow

In this section you will learn a little about some of the data workflow tools that come
with HDP.


Kafka

• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
▪ Used for building real-time data pipelines and streaming apps
• Often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability, and replication
• Kafka works in combination with a variety of Hadoop tools:
▪ Apache Storm
▪ Apache HBase
▪ Apache Spark

Apache Kafka is a messaging system used for real-time data pipelines. Kafka is used to build real-time streaming data pipelines that move data between systems or applications, and it works with a variety of Hadoop tools. Example use cases include:
• Website activity tracking: capturing user site activity for real-time tracking/monitoring
• Metrics: monitoring data
• Log aggregation: collecting logs from various sources in a central location for processing
• Stream processing: article recommendations based on user activity
• Event sourcing: state changes in applications are logged as a time-ordered sequence of records
• Commit log: an external commit log that helps replicate data between nodes and recover from node failures
More information can be found here: https://kafka.apache.org/
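As a concrete illustration, here is a minimal publish/subscribe sketch using the third-party kafka-python client; the broker address and topic name are placeholders, not values from this course:

# Minimal Kafka produce/consume sketch (kafka-python client).
# Broker address and topic name are assumed for illustration.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="broker1:9092")
producer.send("site-activity", b'{"user": "u123", "page": "/home"}')
producer.flush()  # block until the message is actually delivered

consumer = KafkaConsumer(
    "site-activity",
    bootstrap_servers="broker1:9092",
    auto_offset_reset="earliest",  # read from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)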


Sqoop

• Tool to easily import information from structured databases (Db2, MySQL, Netezza, Oracle, etc.) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster
• Can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
• Helps offload tasks such as ETL from the enterprise data warehouse to Hadoop for lower-cost, efficient execution

Sqoop is a tool for moving data between relational databases and related Hadoop systems. This works both ways: you can take data from your RDBMS and move it into HDFS, or move data from HDFS out to some other RDBMS. You can use Sqoop to offload tasks such as ETL from data warehouses to Hadoop for lower-cost, efficient execution of analytics.
Check out the Sqoop documentation for more info: http://sqoop.apache.org/
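Sqoop is driven from the command line; the following sketch shows a typical import invocation, wrapped in Python's subprocess for consistency with the other examples in this unit. The JDBC URL, credentials, table, and target directory are assumptions:

# Import a relational table into HDFS with Sqoop (invoked as a CLI tool).
# The JDBC URL, user, table, and target directory are placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user", "-P",       # -P prompts for the password
    "--table", "orders",
    "--target-dir", "/user/etl/orders",   # HDFS destination
    "--num-mappers", "4",                 # parallelism of the import
], check=True)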


Data access

In this section you will learn a little about some of the data access tools that come with
HDP. These include MapReduce, Pig, Hive, HBase, Accumulo, Phoenix, Storm, Solr,
Spark, Big SQL, Tez and Slider.


Hive

• Apache Hive is a data warehouse system built on top of Hadoop
• Hive facilitates easy data summarization, ad hoc queries, and the analysis of very large datasets stored in Hadoop
• Hive provides SQL on Hadoop
▪ Provides a SQL interface, better known as HiveQL or HQL, which allows for easy querying of data in Hadoop
• Includes HCatalog
▪ A global metadata management layer that exposes Hive table metadata to other Hadoop applications
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data summarization, ad hoc queries, and analysis of large data sets in Hadoop. For those with a SQL background, Hive is a great tool because it allows you to use a SQL-like syntax to access data stored in HDFS. Hive also works well with other applications in the Hadoop ecosystem. It includes HCatalog, a global metadata management layer that exposes Hive table metadata to other Hadoop applications.
Hive documentation: https://hive.apache.org/
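As an illustrative sketch, HiveQL can be submitted from Python through the third-party PyHive client against HiveServer2; the hostname, port, and table are assumptions:

# Run a HiveQL aggregation through HiveServer2 using the PyHive client.
# Hostname, port, user, and table are placeholders for illustration.
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, username="hdfs")
cursor = conn.cursor()
cursor.execute("""
    SELECT region, COUNT(*) AS order_count
    FROM orders
    GROUP BY region
""")
for region, order_count in cursor.fetchall():
    print(region, order_count)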


Pig

• Apache Pig is a platform for analyzing large data sets
• Pig was designed for scripting long series of data operations (good for ETL)
▪ Pig consists of a high-level language called Pig Latin, which was designed to simplify MapReduce programming
• Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs from the Pig Latin code that you write
• The system is able to optimize your code and "translate" it into MapReduce, allowing you to focus on semantics rather than efficiency

Another data access tool is Pig, which was written for analyzing large data sets. Pig has its own language, called Pig Latin, whose purpose is to simplify MapReduce programming. Pig Latin is a simple scripting language that, once compiled, becomes MapReduce jobs that run against Hadoop data. The Pig system is able to optimize your code, so you as the developer can focus on semantics rather than efficiency.
Pig documentation: http://pig.apache.org/
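Here is a small, hypothetical Pig Latin script written out and submitted with the pig launcher from Python; the input path, field schema, and output path are assumptions for the example:

# Write a small Pig Latin script and submit it with the pig launcher.
# Input/output paths and the field schema are placeholders.
import subprocess

script = """
logs    = LOAD '/data/weblogs' USING PigStorage('\\t')
          AS (user:chararray, bytes:long);
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
STORE totals INTO '/data/weblog_totals';
"""
with open("totals.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "mapreduce", "totals.pig"], check=True)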


HBase

• Apache HBase is a distributed, scalable, big data store
• Use Apache HBase when you need random, real-time read/write access to your Big Data
▪ The goal of the HBase project is to handle very large tables of data running on clusters of commodity hardware
• HBase is modeled after Google's BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS. HBase is a NoSQL datastore.

HBase is a columnar datastore, which means that the data is organized in columns, as opposed to the traditional rows that an RDBMS is built around. HBase uses the concept of column families to store and retrieve data. HBase is great for large datasets, but not ideal for transactional data processing: if your use case relies on transactional processing, you should choose a different datastore that has the features you need. The common use for HBase is when you need random read/write access to your big data.
HBase documentation: https://hbase.apache.org/
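For illustration, here is a random read/write sketch through the HBase Thrift gateway using the third-party happybase client; the Thrift host, table name, and column family are assumptions:

# Random read/write against HBase via its Thrift gateway (happybase client).
# Assumes a running Thrift server and a 'users' table with an 'info' column family.
import happybase

conn = happybase.Connection("hbase-thrift-host", port=9090)
table = conn.table("users")

# Write one row: cells are addressed as b"family:qualifier".
table.put(b"user-001", {b"info:name": b"Ada", b"info:city": b"London"})

# Random read of the same row key.
print(table.row(b"user-001"))
conn.close()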


Accumulo
Accumulo is another key/value store, similar to HBase. You can think of Accumulo as a "highly secure HBase": it provides various features for robust, scalable data storage and retrieval. It is also based on Google's BigTable, the same technology that HBase is modeled on. HBase, however, is gaining more features as the community's needs evolve. It is up to you to evaluate your requirements and determine the best tool for your needs.
Accumulo documentation: https://accumulo.apache.org/


Phoenix

• Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications by combining the best of both worlds:
▪ The power of standard SQL and JDBC APIs with full ACID transaction capabilities
▪ The flexibility of late-bound, schema-on-read capabilities from the NoSQL world, leveraging HBase as its backing store
• Essentially, this is SQL for NoSQL
• Fully integrated with other Hadoop products such as Spark, Hive, Pig, and MapReduce

Phoenix enables online transaction processing and operational analytics in Hadoop for low-latency applications. Essentially, this is SQL for a NoSQL database. Recall that HBase is not designed for transactional processing; Phoenix combines the best of the NoSQL datastore with the need for transactional processing. It is fully integrated with other Hadoop products such as Spark, Hive, Pig, and MapReduce.
Phoenix documentation: https://phoenix.apache.org/
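A sketch of SQL over HBase through the Phoenix Query Server, using the third-party phoenixdb driver; the server URL and the table definition are assumptions:

# SQL on HBase through the Phoenix Query Server (phoenixdb driver).
# The Query Server URL and the table are placeholders.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS metrics (id BIGINT PRIMARY KEY, val DOUBLE)"
)
cursor.execute("UPSERT INTO metrics VALUES (1, 3.14)")  # Phoenix writes are UPSERTs
cursor.execute("SELECT id, val FROM metrics")
print(cursor.fetchall())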


Storm

• Apache Storm is an open source distributed real-time computation system
▪ Fast
▪ Scalable
▪ Fault-tolerant
• Used to process large volumes of high-velocity data
• Useful when milliseconds of latency matter and Spark isn't fast enough
▪ Has been benchmarked at over a million tuples processed per second per node

Storm is designed for real-time computation that is fast, scalable, and fault-tolerant. When you have a use case for analyzing streaming data, consider Storm as an option. Of course, there are numerous other streaming tools available, such as Spark or even IBM Streams, proprietary software with decades of research behind it for real-time analytics.


Solr

• Apache Solr is a fast, open source enterprise search platform built on the Apache Lucene Java search library
• Full-text indexing and search
▪ REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
• Highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more

Solr is built on the Apache Lucene search library. It is designed for full-text indexing and searching, and it powers the search of many big sites around the internet. It is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more.
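As a minimal sketch of indexing and searching over Solr's HTTP API, using the third-party pysolr client; the Solr URL and the "articles" core are assumptions:

# Index and search documents with Solr's HTTP API via the pysolr client.
# The Solr URL and core name ('articles') are placeholders.
import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/articles", timeout=10)

# Index two documents and commit so they become searchable immediately.
solr.add([
    {"id": "1", "title": "Hadoop ecosystem overview"},
    {"id": "2", "title": "Full-text search with Solr"},
], commit=True)

# Full-text query against the indexed titles.
for doc in solr.search("title:search"):
    print(doc["id"], doc["title"])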


Spark

• Apache Spark is a fast and general engine for large-scale data processing
• Spark has a variety of advantages, including:
▪ Speed
− Runs programs faster than MapReduce by working in memory
▪ Ease of use
− Write apps quickly in Java, Scala, Python, or R
▪ Generality
− Can combine SQL, streaming, and complex analytics
▪ Runs in a variety of environments and can access diverse data sources
− Hadoop, Mesos, standalone, cloud…
− HDFS, Cassandra, HBase, S3…
Spark is an in-memory processing engine whose big advantages are speed and scalability. A number of built-in libraries sit on top of the Spark core and take advantage of all its capabilities: Spark MLlib, GraphX, Spark Streaming, and Spark SQL with DataFrames. The three main languages supported by Spark are Scala, Python, and R. In most cases, Spark can run programs faster than MapReduce by utilizing its in-memory architecture.
Spark documentation: https://spark.apache.org/
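A short PySpark DataFrame sketch; the HDFS input path and column names are assumptions for the example:

# Aggregate a CSV file with PySpark's DataFrame API.
# The HDFS input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-demo").getOrCreate()

df = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)
(df.groupBy("region")
   .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
   .show())

spark.stop()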


Druid

• Apache Druid is a high-performance, column-oriented, distributed data store
▪ Interactive sub-second queries
− Unique architecture enables rapid multi-dimensional filtering, ad hoc attribute groupings, and extremely fast aggregations
▪ Real-time streams
− Lock-free ingestion allows simultaneous ingestion and querying of high-dimensional, high-volume data sets
− Explore events immediately after they occur
▪ Horizontally scalable
▪ Deploy anywhere

Druid is a datastore designed for business intelligence (OLAP) queries. Druid provides real-time data ingestion, querying, and fast aggregations. It integrates with Apache Hive to build OLAP cubes and run sub-second queries.
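As a sketch, Druid also exposes a SQL endpoint over HTTP on the broker; the broker host and the "wikipedia" datasource (from Druid's own tutorials) are assumptions here:

# Query Druid through its SQL HTTP endpoint on the broker (default port 8082).
# The broker host and the 'wikipedia' datasource are placeholders.
import requests

resp = requests.post(
    "http://druid-broker:8082/druid/v2/sql",
    json={"query": "SELECT channel, COUNT(*) AS edits "
                   "FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 5"},
)
for row in resp.json():
    print(row["channel"], row["edits"])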


Data Lifecycle and Governance

In this section you will learn a little about some of the data lifecycle and governance
tools that come with HDP.


Falcon

• Framework for managing the data life cycle in Hadoop clusters
• Data governance engine
▪ Defines, schedules, and monitors data management policies
• Hadoop admins can centrally define their data pipelines
▪ Falcon uses these definitions to auto-generate workflows in Oozie
• Addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by deploying a framework for data management and processing
Falcon is used for managing the data life cycle in Hadoop clusters. One example use case is feed management services, such as feed retention, replication across clusters for backup, and archival of data.
Falcon documentation: https://falcon.apache.org/


Atlas

• Apache Atlas is a scalable and extensible set of core foundational governance services
▪ Enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop
• Exchanges metadata with other tools and processes within and outside of Hadoop
▪ Allows integration with the whole enterprise data ecosystem
• Atlas features:
▪ Data classification
▪ Centralized auditing
▪ Centralized lineage
▪ Security and policy engine
Atlas enables enterprises to meet their compliance requirements within Hadoop. It provides data classification, centralized auditing, centralized lineage, and a security and policy engine, and it integrates with the whole enterprise data ecosystem.
Atlas documentation: https://atlas.apache.org/


Security

In this section you will learn a little about some of the security tools that come with HDP.


Ranger

• Centralized security framework to enable, monitor, and manage comprehensive data security across the Hadoop platform
• Manages fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase
• Using the Ranger console, you can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
▪ Policies are enforced within Hadoop
Ranger is used to control data security across the entire Hadoop platform. The Ranger
console can manage policies for access to files, folders, databases, tables and
columns. The policies can be set for individual users or groups.
Ranger documentation: https://ranger.apache.org/
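Policies can also be inspected programmatically. A sketch against what I understand to be Ranger's public REST API on the admin UI port; the host, credentials, and exact endpoint are assumptions to verify against your Ranger version:

# List access policies through Ranger's public REST API (admin UI port 6080).
# Host, credentials, and endpoint path are placeholders/assumptions.
import requests

resp = requests.get(
    "http://ranger-host:6080/service/public/v2/api/policy",
    auth=("admin", "admin-password"),
)
for policy in resp.json():
    print(policy["id"], policy["service"], policy["name"])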


Knox

• REST API and Application Gateway for the Apache Hadoop ecosystem
• Provides perimeter security for Hadoop clusters
• Single access point for all REST interactions with Apache Hadoop clusters
• Integrates with prevalent SSO and identity management systems
▪ Simplifies Hadoop security for users who access cluster data and execute jobs
Knox is a gateway for the Hadoop ecosystem. It provides perimeter-level security for Hadoop. You can think of Knox as the castle walls, with your Hadoop cluster inside the walls. Knox integrates with SSO and identity management systems to simplify Hadoop security for users who access cluster data and execute jobs.
Knox documentation: https://knox.apache.org/
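For illustration, a REST call to WebHDFS routed through the Knox gateway rather than hitting the cluster directly; the gateway host, the "default" topology, and the demo credentials are assumptions:

# Call WebHDFS through the Knox gateway instead of the cluster directly.
# Gateway host, 'default' topology, and credentials are placeholders.
import requests

resp = requests.get(
    "https://knox-host:8443/gateway/default/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    auth=("guest", "guest-password"),
    verify=False,  # demo only: Knox often ships with a self-signed certificate
)
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"])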


Operations

In this section you will learn a little about some of the operations tools that come with
HDP.


Ambari

• For provisioning, managing, and monitoring Apache Hadoop clusters
• Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
• Ambari REST APIs
▪ Allow application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications

You will grow to know your way around Ambari, as it is the central place to manage your entire Hadoop cluster. Installation, provisioning, management, and monitoring of your Hadoop cluster are done with Ambari. It also comes with easy-to-use RESTful APIs that allow application developers to integrate Ambari with their own applications.
Ambari documentation: https://ambari.apache.org/
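A minimal sketch of the Ambari REST API; the host and the default admin/admin credentials are assumptions, and Ambari expects an X-Requested-By header on API calls:

# List clusters and their services through the Ambari REST API.
# Host and credentials are placeholders.
import requests

base = "http://ambari-host:8080/api/v1"
auth = ("admin", "admin")
headers = {"X-Requested-By": "ambari"}  # header Ambari expects on API calls

clusters = requests.get(f"{base}/clusters", auth=auth, headers=headers).json()
for item in clusters["items"]:
    name = item["Clusters"]["cluster_name"]
    services = requests.get(f"{base}/clusters/{name}/services",
                            auth=auth, headers=headers).json()
    print(name, [s["ServiceInfo"]["service_name"] for s in services["items"]])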


The Ambari web interface

Here is a look at the Ambari web interface. Available services are shown on the left, with the metrics of your cluster shown in the center of the page. Additional component and server configurations are available when you drill down on the respective pages.


Cloudbreak

• A tool for provisioning and managing Apache Hadoop clusters in the cloud
• Automates the launching of elastic Hadoop clusters
• Policy-based autoscaling on the major cloud infrastructure platforms, including:
▪ Microsoft Azure
▪ Amazon Web Services
▪ Google Cloud Platform
▪ OpenStack
▪ Platforms that support Docker containers

Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks project and is currently not part of Apache. It automates the launching of clusters on various cloud infrastructure platforms.


ZooKeeper

• Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
▪ All of these kinds of services are used in some form or another by distributed applications
▪ Saves time so you don't have to develop your own
• It is fast, reliable, simple, and ordered
• Distributed applications can use ZooKeeper to store and mediate updates to important configuration information

ZooKeeper provides a centralized service for maintaining configuration information,
naming, providing distributed synchronization and providing group services across your
Hadoop cluster. Applications within the Hadoop cluster can use ZooKeeper to maintain
configuration information.
ZooKeeper documentation: https://zookeeper.apache.org/
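As a sketch of storing and reading back configuration in ZooKeeper, using the third-party kazoo client; the ensemble addresses and znode path are assumptions:

# Store and read back a piece of configuration in ZooKeeper (kazoo client).
# The ensemble address and znode path are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

zk.ensure_path("/myapp/config")            # create the znode if it doesn't exist
zk.set("/myapp/config", b"max_workers=8")  # znode data is raw bytes

data, stat = zk.get("/myapp/config")
print(data.decode(), "version:", stat.version)

zk.stop()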


Oozie

• Oozie is a Java-based workflow scheduler system to manage Apache Hadoop jobs
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions
• Integrated with the Hadoop stack
▪ YARN is its architectural center
▪ Supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop

Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, and Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions. At the heart of this is YARN.
Oozie documentation: http://oozie.apache.org/


Tools

In this section you will learn a little about some of the Tools that come with HDP.


Zeppelin

• Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents
• Documents can contain SparkSQL, SQL, Scala, Python, JDBC connections, and much more
• Easy for both end users and data scientists to work with
• Notebooks combine code samples, source data, descriptive markup, result sets, and rich visualizations in one place

Zeppelin is a web-based notebook designed for data scientists to easily and quickly explore datasets collaboratively. Notebooks can contain Spark SQL, SQL, Scala, Python, JDBC, and more. Zeppelin allows for interaction with, and visualization of, large datasets.
Zeppelin documentation: https://zeppelin.apache.org/


Zeppelin GUI

Here is a screenshot of the Zeppelin notebook showing a visualization of a particular dataset.


Ambari Views

• The Ambari web interface includes a built-in set of Views that are pre-deployed for you to use with your cluster
• These GUI components increase ease of use
• Includes views for Hive, Pig, Tez, Capacity Scheduler, Files, and HDFS
• The Ambari Views Framework allows developers to create new user interface components that plug into the Ambari Web UI

Ambari Views provide a built-in set of views for Hive, Pig, Tez, Capacity Scheduler, Files, and HDFS, which allow developers to monitor and manage the cluster. The framework also allows developers to create new user interface components that plug into the Ambari Web UI.


IBM value-add components

• Big SQL

• Big Replicate

• BigQuality

• BigIntegrate

• Big Match

These are some of the value-add components available from IBM. You will learn a bit
about these components next.


Big SQL is SQL on Hadoop

• Big SQL builds on the Apache Hive foundation
▪ Integrates with the Hive metastore
▪ Instead of MapReduce, uses a powerful native C/C++ MPP engine
• A view on your data residing in the Hive metastore
• No proprietary storage format
• Modern SQL:2011 capabilities
• The same SQL can be used on your warehouse data with little or no modifications

[Slide diagram: Big SQL sits alongside Hive, Pig, and Sqoop, all speaking the Hive APIs against the Hive metastore and the Hadoop file system in the cluster.]
Big SQL is a SQL processing engine for the Hadoop cluster. It provides a SQL-on-Hadoop interface and plugs directly into the cluster. There is no proprietary storage format and no new SQL syntax to learn: the same SQL can be used on the warehouse data with little or no modification. It gives SQL developers a familiar language for working with data on Hadoop.
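Because Big SQL speaks the Db2 protocol, it can typically be reached from Python with the ibm_db driver. A minimal sketch; the hostname, port (32051 is a common Big SQL default), credentials, and table are assumptions:

# Connect to Big SQL with the ibm_db driver (Big SQL uses the Db2 protocol).
# Hostname, port, credentials, and the table queried are placeholders.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=bigsql;HOSTNAME=bigsql-head;PORT=32051;"
    "PROTOCOL=TCPIP;UID=bigsql;PWD=secret;", "", ""
)
stmt = ibm_db.exec_immediate(
    conn, "SELECT region, COUNT(*) AS n FROM sales.orders GROUP BY region"
)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["REGION"], row["N"])
    row = ibm_db.fetch_assoc(stmt)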


Big Replicate
• Provides active-active data replication for Hadoop across supported environments, distributions, and hybrid deployments
• Replicates data automatically with guaranteed consistency across Hadoop clusters running on any distribution, cloud object storage, and local and NFS-mounted file systems
• Provides an SDK to extend Big Replicate replication to virtually any data source
• Patented distributed coordination engine enables:
▪ Guaranteed data consistency across any number of sites at any distance
▪ Minimized RTO/RPO
• Totally non-invasive
▪ No modification to source code
▪ Easy to turn on/off
For active-active data replication on Hadoop clusters, Big Replicate has no competition. It replicates data automatically with guaranteed consistency across Hadoop clusters on any distribution, cloud object storage, and local and NFS-mounted file systems. Big Replicate provides an SDK to extend it to any other data source, and its patented distributed coordination engine enables guaranteed data consistency across any number of sites at any distance.


Information Server and Hadoop: BigQuality and BigIntegrate

• IBM InfoSphere Information Server is a market-leading data integration platform that includes a family of products enabling you to understand, cleanse, monitor, transform, and deliver data, as well as to collaborate to bridge the gap between business and IT
• Information Server can now be used with Hadoop
• You can profile, validate, cleanse, transform, and integrate your big data on Hadoop, an open source framework that can manage large volumes of structured and unstructured data
• This functionality is available with the following product offerings:
▪ IBM BigIntegrate: provides the data integration features of Information Server
▪ IBM BigQuality: provides the data quality features of Information Server
Information Server is a platform for data integration, data quality, and governance that is
unified by a common metadata layer and scalable architecture. This means more
reuse, better productivity, and the ability to leverage massively scalable architectures
like MPP, GRID, and Hadoop clusters.


Information Server - BigIntegrate: Ingest, transform, process and deliver any data into and within Hadoop

Satisfy the most complex transformation requirements with the most scalable runtime available, in batch or real time.

• Connect
▪ Connect to a wide range of traditional enterprise data sources as well as Hadoop data sources
▪ Native connectors with the highest level of performance and scalability for key data sources
• Design & Transform
▪ Transform and aggregate any data volume
▪ Benefit from hundreds of built-in transformation functions
▪ Leverage metadata-driven productivity and enable collaboration
• Manage & Monitor
▪ Use a simple, web-based dashboard to manage your runtime environment
IBM BigIntegrate is a big data integration solution that provides superior connectivity, fast transformation, and reliable, easy-to-use data delivery features that execute on the data nodes of a Hadoop cluster. It provides a flexible and scalable platform to transform and integrate your Hadoop data.
Once you have data sources that are understood and cleansed, the data must be transformed into a usable format for the warehouse and delivered in a timely fashion, whether in batch, real-time, or SOA architectures. All warehouse projects require data integration: how else will the many enterprise data sources make their way into the warehouse? Hand-coding is not a scalable option.
Increase developer efficiency:
• Top-down design with a highly visual development environment
• Enhanced collaboration through design asset reuse
High-performance delivery with flexible deployments:
• Support for multiple delivery styles: ETL, ELT, Change Data Capture, SOA integration, etc.
• High-performance, parallel engine


Rapid integration:
• Pre-built connectivity
• Balanced optimization
• Multiple user configuration options
• Job parameters available for all options
• Powerful logging and tracing
BigIntegrate is built for everything from the simplest to the most sophisticated data transformations. Think of simple transformations, such as calculating total values: the very basics of transforming data, like you would do with a spreadsheet or calculator. Then imagine the more complex, such as a lookup to an automated loan system where the loan qualification date determines the interest rate for that time of day, based on a lookup against an ever-changing system.
These are the types of transformations our customers are doing every day, and they require an easy-to-use canvas that allows you to design as you think. This is exactly what BigIntegrate has been built to do.


Information Server - BigQuality: Analyze, cleanse and monitor your big data

The most comprehensive data quality capabilities that run natively on Hadoop.

• Analyze
▪ Discovers data of interest to the organization based on business-defined data classes
▪ Analyzes data structure, content, and quality
▪ Automates your data analysis process
• Cleanse
▪ Investigate, standardize, match, and survive data at scale, with the full power of common data integration processes
• Monitor
▪ Assess and monitor the quality of your data in any place and across systems
▪ Align quality indicators to business policies
▪ Engage the data steward team when issues exceed the thresholds set by the business
IBM BigQuality provides a massively scalable engine to analyze, cleanse, and monitor data.
Analyze discovers patterns, frequencies, and sensitive data that are critical to the business: the content, quality, and structure of data at rest. While a robust user interface is provided, the process can also be completely automated.
Cleanse uses powerful out-of-the-box (and completely customizable) routines to investigate, standardize, match, and survive free-format data. For example, understanding that William Smith and Bill Smith are the same person, or knowing that BLK really means Black in some contexts.
Monitor measures the content, quality, and structure of data in flight to make operational decisions about data. For example, "exceptions" can be sent to a full workflow engine called the Stewardship Center, where people can collaborate on the issues.


IBM InfoSphere Big Match for Hadoop

• Big Match is a Probabilistic Matching Engine (PME) running natively within Hadoop for customer data matching

[Slide diagram: the PME algorithm running inside Hadoop.]

Big Match provides proven probabilistic matching algorithms natively in Hadoop to accurately and economically link all customer data across structured and unstructured data sources at large scale. Big Match helps provide customer information that consuming applications can act on with confidence. It integrates directly with Apache Hadoop.


Watson Studio (formerly Data Science Experience (DSX))

• Watson Studio is a collaborative platform for data scientists, built on open source components and IBM added value, available in the cloud or on premises
• https://datascience.ibm.com/
• Learn: built-in learning to get started, or go the distance with advanced tutorials
• Create: the best of open source and IBM value-add to create state-of-the-art data products
• Collaborate: community and social features that provide meaningful collaboration
Watson Studio is a collaborative platform designed for data scientists, built on open source components and IBM value-add components, available in the cloud or on premises.


Checkpoint
1) List the components of HDP that provide data access capabilities.
2) List the components that provide the capability to move data from relational databases into Hadoop.
3) Managing Hadoop clusters can be accomplished using which component?
4) True or False? The following components are value-add from IBM: Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
5) True or False? Data Science capabilities can be achieved using only HDP.



Checkpoint solution
1) List the components of HDP that provide data access capabilities.
▪ MapReduce, Pig, Hive, HBase, Phoenix, Spark, and more!
2) List the components that provide the capability to move data from relational databases into Hadoop.
▪ Sqoop, Flume, Kafka
3) Managing Hadoop clusters can be accomplished using which component?
▪ Ambari
4) True or False? The following components are value-add from IBM: Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
▪ True
5) True or False? Data Science capabilities can be achieved using only HDP.
▪ False. Data Science capabilities also require Watson Studio.


Unit summary
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components



Lab 1
• Exploration of the lab environment
▪ Start the VMWare image
▪ Launch the Ambari console
▪ Perform some basic setup
▪ Start services and the Hadoop processing environment
▪ Explore the placement of files in the Linux host-system environment and
explain the file system and directory structures



Lab 1:
Exploration of the lab environment
Purpose:
In this lab, you will explore the lab environment. You will access your lab
environment and launch Apache Ambari. You will startup a variety of services
by using the Ambari GUI. You will also explore some of the directory structure
on the Linux system that you will be using.
Before doing the labs, you must have an IBM Cloud Lite account. One way to get one is to register in the IBM Digital-Nation Africa program (for Africa only), which enables you to explore emerging technologies, build innovative solutions, learn new skills, and find a job.
Here are the steps for registration:
1. Browse to www.DigitalNationAfrica.com.
2. Click Register for Free.
3. If you already have an IBM ID, click Log in and continue with step 10.
4. If you don’t have an IBM ID, enter your data then click Next.
5. Accept the IBM ID Account Privacy by clicking Proceed.
6. Check your email. You should receive an email from [email protected],
mentioning the code that you will use for activating your IBM ID account.
7. Copy the code and paste it in the DNA registration page.
8. Click Verify.
9. Click Complete registration to finish the last step of the registration.
10. After logging in, you are taken to the home page of the site.
11. You will receive another email containing a link to confirm your IBM Cloud
account. Click that link so that the account is activated.
12. After the IBM Cloud account is activated, continue with Task 1 below.
Task 1. Create an instance of the Analytics Engine service
The IBM Analytics Engine service enables you to create Apache Spark and Apache Hadoop clusters in minutes and customize these clusters by using scripts. You can access and administer IBM Analytics Engine through the Cloud Foundry command-line interface, REST APIs, the IBM Cloud portal, and Ambari.
You can check the documentation of the Analytics Engine service here:
https://cloud.ibm.com/docs/services/AnalyticsEngine.
In the following steps you will create an instance of the service:
1. Browse to IBM Cloud console at https://cloud.ibm.com.


2. Enter your IBM ID (email) and password, then click Log in.
3. In the page header, click Create resource.
4. Click the Analytics Engine service.

5. You can specify the Service Name for this instance, e.g. Analytics Engine
demo.
6. You can Choose a region/location to deploy in, e.g. London.
7. Review the features of the Pricing Plan, then click Configure.
8. Set Number of compute nodes to 1.


9. For Software package, select AE 1.1 Spark and Hadoop, then click Create.

10. Wait until the status of the service instance is Provisioned.

11. After the cluster provisioning is complete, click the instance name (Analytics
Engine demo) to open it.
12. Click Reset password to generate a new password.
Now you can see the status of the cluster and the username and password that can be used to access it. You will use this username and password for the following labs.
In the Nodes section, you can see the components of the cluster. The cluster
consists of a management instance and one or more compute instances. The
management instance itself consists of three management nodes (master
management node mn001, and two management slave nodes mn002 and
mn003). Each of the compute nodes runs in a separate compute instance.
The output looks like the following:


13. Launch the Ambari console by clicking Launch Console.


You can see that all services are started.


14. Close the Ambari browser tab to return to the tab of the service instance.
Task 2. Retrieve service credentials and service end points
You can fetch the cluster credentials and the service end points by using the IBM Cloud
CLI, REST API, or from the IBM Cloud console.
1. While inside the service instance, click Service credentials in the left side bar.
2. Click New credential to create a new service credential.

3. In the Add new credential dialog, set Role to Manager, then click Add.


4. In the new credential, click View credentials, then click Copy to clipboard.

5. Open a new text file in a text editor then press Ctrl-V to paste the JSON text.
The following table lists important credentials and their locations in the JSON file
under the cluster object. You will refer to them later in the labs.


Property     Location                               Sample Value
Username     user                                   clsadmin
Password     password                               5uRPjR9F5b6M
SSH          service_endpoints.ssh                  ssh clsadmin@chs-quk-301-mn003.eu-gb.ae.appdomain.cloud
Hostname     service_endpoints.ssh (after the @)    chs-quk-301-mn003.eu-gb.ae.appdomain.cloud
Ambari URL   service_endpoints.ambari_console       https://chs-quk-301-mn001.eu-gb.ae.appdomain.cloud:9443
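A small sketch of pulling those values out of the pasted JSON with Python; it assumes the credential was saved to a file named credentials.json and follows the layout shown in the table above:

# Extract the cluster credentials listed above from the pasted JSON.
# Assumes the credential was saved to a file named credentials.json.
import json

with open("credentials.json") as f:
    creds = json.load(f)

cluster = creds["cluster"]
print("Username:  ", cluster["user"])
print("Password:  ", cluster["password"])
print("SSH:       ", cluster["service_endpoints"]["ssh"])
print("Ambari URL:", cluster["service_endpoints"]["ambari_console"])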
Task 3. Explore the placement of files in the Linux host-system environment
You will review the placement of some configuration files in the Linux host-system environment so that you can become familiar with where the open-source configuration files are installed. This is a behind-the-scenes look at the actual configuration files that are updated when you edit configuration settings for the different components within Ambari on your Hadoop cluster.
1. Open PuTTY (download and install it from www.putty.org) or any SSH client.
2. Use the following information, then click Open (you can also give the connection a name and save it, so that you can use it later without re-entering the same information):
• Session > Host name (or IP address): Hostname
• Connection > Data > Auto-login username: Username
3. When prompted for the password, enter Password.
4. To change the directory to /usr/hdp, type cd /usr/hdp.
The results appear as follows:

5. To review the files and directories in the /usr/hdp directory, type ls -l.


The first of these subdirectories (2.6.5.0-292) is the release level of the HDP software that you are working with; the version in your lab environment may differ if the software has been updated.
6. To display the contents of that directory, type ls -l 2*.
The results appear like the following:

Take note of the user IDs associated with the various directories. Some have a user name that is the same as the software held in the directory; some are owned by root. You will look at the standard users in the Apache Ambari unit when you explore details of the Ambari server and work with the management of your cluster.
7. To view the current subdirectory, which has a set of links that point back to files and subdirectories in the 2.*.*.*-*** directory, type ls -l current.


There are sometimes multiple links to one directory.

The software that will be executed at this time (in the current directory) is determined by these links. Since you can have multiple versions of software installed, it is possible to install the software for an upcoming upgrade and then later have Ambari perform a rolling upgrade that systematically changes the links, as appropriate, to files in the new, upgraded software directories.
8. Change directories to the /etc directory by typing cd /etc in the Linux console.


9. List the contents of the /etc directory by executing the ls -l command. Your
results will look like the following:

You will notice a huge listing of items printed out.


You can view the various Hadoop component configuration files (for most of the software that comes with the HDP installation) within the /etc/<componentname>/conf/* directories. Do NOT try to manually change configuration files here. The configs are managed within Ambari, and you should only edit configuration settings from within the Ambari GUI.
10. Take a quick look at one of the components' conf directories. Enter the following command to change into ZooKeeper's conf directory:
cd /etc/zookeeper/conf


11. List the contents of the directory by typing the ls -l command.

A listing of files is displayed.


12. Close all open windows.
Results:
In this lab, you explored the lab environment. You accessed your lab
environment and launched Apache Ambari. You also explored some of the
directory structure on the Linux system.
