02 HDP Introduction
Introduction to Hortonworks Data Platform (HDP)
Unit objectives
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components
Unit objectives
• HDP is:
▪ Open
▪ Central
▪ Interoperable
▪ Enterprise ready
(HDP architecture diagram: Data Flow and Data Access layers, with engines such as Kafka,
Batch, Script, SQL, NoSQL, Stream, Search, In-Memory, and others.)
Here is the high-level view of the Hortonworks Data Platform. It is divided into several
categories, listed in no particular order of importance.
• Governance, Integration
• Tools
• Security
• Operations
• Data Access
• Data Management
The next several slides go into more detail on each of these groupings.
Data Flow
Data flow
In this section you will learn a little about some of the data workflow tools that come
with HDP.
Kafka
Kafka
Apache Kafka is a messaging system used for real-time data pipelines. Kafka is
used to build real-time streaming data pipelines that move data between systems or
applications. Kafka works with a variety of Hadoop tools for various
applications. Examples of use cases include:
• Website activity tracking: capturing user site activities for real-time
tracking/monitoring
• Metrics: monitoring data
• Log aggregation: collecting logs from various sources to a central location for
processing.
• Stream processing: article recommendations based on user activity
• Event sourcing: state changes in applications are logged as a time-ordered
sequence of records
• Commit log: an external commit log that helps replicate data between nodes
and restore data from failed nodes
More information can be found here: https://kafka.apache.org/
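As a concrete illustration of the website-activity use case, here is a minimal sketch in
Python using the kafka-python client. The broker address and the topic name
"site-activity" are assumptions made for this example, not part of HDP itself.

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: publish a website-activity event as JSON to a Kafka topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("site-activity", {"user": "u123", "page": "/home", "action": "view"})
producer.flush()

# Consumer: another application reads the same topic as a real-time stream.
consumer = KafkaConsumer(
    "site-activity",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break  # stop after one message for this demo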
Sqoop
• Can also be used to extract data from Hadoop and export it to relational
databases and enterprise data warehouses
Sqoop
Sqoop is a tool for moving data between structured, relational databases and the
related Hadoop systems, and it works in both directions: you can take data in your RDBMS
and move it into HDFS, or move data from HDFS into some other RDBMS.
You can use Sqoop to offload tasks such as ETL from data warehouses to Hadoop for
lower-cost, efficient execution of analytics.
Check out the Sqoop documentation for more info: http://sqoop.apache.org/
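As a hedged sketch of both directions, the Python snippet below simply drives the sqoop
command line with subprocess. The JDBC URL, credentials, table names, and HDFS paths
are made-up placeholders for illustration; adjust them for your environment.

import subprocess

# Import: copy a relational table into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user", "--password", "secret",
    "--table", "customers",
    "--target-dir", "/user/etl/customers",  # destination directory in HDFS
    "--num-mappers", "4",                   # parallel map tasks
], check=True)

# Export: push summarized HDFS data back into the relational database.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user", "--password", "secret",
    "--table", "customers_summary",
    "--export-dir", "/user/etl/customers_summary",
], check=True)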
Data access
Data access
In this section you will learn a little about some of the data access tools that come with
HDP. These include MapReduce, Pig, Hive, HBase, Accumulo, Phoenix, Storm, Solr,
Spark, Big SQL, Tez and Slider.
Hive
• Includes HCatalog
▪ Global metadata management layer that exposes Hive table metadata to
other Hadoop applications.
Hive
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data
summarization, ad hoc queries, and analysis of large data sets in Hadoop. For those
who have some SQL background, Hive is a great tool because it allows you to use a
SQL-like syntax to access data stored in HDFS. Hive also works well with other
applications in the Hadoop ecosystem. It includes HCatalog, a global
metadata management layer that exposes the Hive table metadata to other Hadoop
applications.
Hive documentation: https://hive.apache.org/
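To make the SQL-like access concrete, here is a minimal PySpark sketch that queries a
Hive table through the shared metastore. The table name "web_logs" is an assumption for
this example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()   # use the Hive metastore (exposed to other tools via HCatalog)
    .getOrCreate()
)

# Familiar SQL syntax over data stored in HDFS.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()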
Pig
Pig
Another data access tool is Pig, which was written for analyzing large data sets. Pig has
its own language, called Pig Latin, whose purpose is to simplify MapReduce
programming. Pig Latin is a simple scripting language that, once compiled, becomes
MapReduce jobs that run against Hadoop data. The Pig system is able to optimize your
code, so you as the developer can focus on the semantics rather than efficiency.
Pig documentation: http://pig.apache.org/
HBase
HBase
HBase is a columnar datastore, which means that data is organized in columns as
opposed to the traditional rows that an RDBMS is based upon. HBase uses the
concept of column families to store and retrieve data. HBase is great for large datasets,
but it is not ideal for transactional data processing. This means that if you have use cases
that rely on transactional processing, you should go with a different datastore that
has the features you need. The common use case for HBase is when you need to perform
random read/write access to your big data.
HBase documentation: https://hbase.apache.org/
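Here is a minimal sketch of random read/write access from Python with the happybase
client. It assumes an HBase Thrift server on localhost and a table "users" with a column
family "info"; both are illustrative choices, not defaults.

import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Write: a row key plus column-family:qualifier values.
table.put(b"user-001", {b"info:name": b"Ada", b"info:city": b"Nairobi"})

# Random read by row key.
row = table.row(b"user-001")
print(row[b"info:name"])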
Accumulo
Accumulo is another key/value store, similar to HBase. You can think of Accumulo as a
"highly secure HBase". It provides various features for robust, scalable data
storage and retrieval. It is also based on Google's BigTable, the same
technology that HBase builds on. HBase, however, is gaining more features as it aligns
more closely with what the community needs. It is up to you to evaluate your requirements
and determine the best tool for your needs.
Accumulo documentation: https://accumulo.apache.org/
Phoenix
• Fully integrated with other Hadoop products such as Spark, Hive, Pig,
and MapReduce
Phoenix
Phoenix enables online transaction processing and operational analytics in Hadoop for
low-latency applications. Essentially, it is SQL for a NoSQL database. Recall that
HBase is not designed for transactional processing. Phoenix combines the best of the
NoSQL datastore with the need for transactional processing. It is fully integrated with
other Hadoop products such as Spark, Hive, Pig, and MapReduce.
Phoenix documentation: https://phoenix.apache.org/
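As a sketch of the SQL-on-HBase idea, the following uses the phoenixdb Python package
against a Phoenix Query Server. The server URL and the table "orders" are assumptions
for this example.

import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS orders (id BIGINT PRIMARY KEY, amount DECIMAL)"
)
# Phoenix uses UPSERT for inserts and updates on the underlying HBase table.
cursor.execute("UPSERT INTO orders VALUES (1, 19.99)")
cursor.execute("SELECT id, amount FROM orders")
print(cursor.fetchall())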
Storm
• Useful when milliseconds of latency matter and Spark isn't fast enough
▪ Has been benchmarked at over a million tuples processed per second per
node
Storm
Storm is designed for real-time computation that is fast, scalable, and fault-tolerant.
When you have a use case to analyze streaming data, consider Storm as an option. Of
course, there are numerous other streaming tools available, such as Spark or even IBM
Streams, proprietary software with decades of research behind it for real-time
analytics.
Solr
Solr
Solr is built on the Apache Lucene search library. It is designed for full-text indexing
and searching. Solr powers the search of many big sites around the internet. It is highly
reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-
balanced querying, automated failover and recovery, centralized configuration, and more.
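Here is a minimal indexing-and-search sketch with the pysolr client. The core name
"articles" and the documents are made up for illustration.

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles", timeout=10)

# Index a couple of documents.
solr.add([
    {"id": "1", "title": "Intro to HDP", "body": "Hortonworks Data Platform basics"},
    {"id": "2", "title": "Kafka pipelines", "body": "Real-time streaming with Kafka"},
])
solr.commit()

# Full-text search, ranked by relevance.
for doc in solr.search("body:streaming"):
    print(doc["title"])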
Spark
Spark
Spark is an in-memory processing engine whose big advantages are speed and scalability.
A number of built-in libraries sit on top of the Spark core and take advantage of all its
capabilities: Spark ML, GraphX, Spark Streaming, and Spark SQL with DataFrames.
Three main languages are supported by Spark: Scala, Python, and R. In most cases,
Spark can run programs faster than MapReduce can by utilizing its in-memory
architecture.
Spark documentation: https://spark.apache.org/
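The sketch below shows the in-memory idea with the Python API (PySpark). The HDFS
input path is an assumption for this example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

logs = spark.read.text("hdfs:///data/logs/*.log")
words = logs.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))

# cache() keeps the intermediate result in memory, so repeated queries avoid
# re-reading from disk - this is where Spark's speed advantage over MapReduce comes from.
words.cache()
words.groupBy("word").count().orderBy(F.desc("count")).show(10)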
Druid
Druid
Druid is a datastore designed for business intelligence (OLAP) queries. Druid provides
real-time data ingestion, querying, and fast aggregations. It integrates with Apache Hive
to build OLAP cubes and run sub-second queries.
Falcon
Falcon
Falcon is used for managing the data life cycle in Hadoop clusters. One example use
case is feed management services such as feed retention, replication across
clusters for backups, and archival of data.
Falcon documentation: https://falcon.apache.org/
Atlas
Atlas
Atlas enables enterprises to meet their compliance requirements within Hadoop. It
provides features for data classification, centralized auditing, centralized lineage, and a
security and policy engine. It integrates with the whole enterprise data ecosystem.
Atlas documentation: https://atlas.apache.org/
Security
Security
In this section you will learn a little about some of the security tools that come with HDP.
Ranger
• Using the Ranger console, you can manage policies for access to files, folders,
databases, tables, or columns with ease
Ranger
Ranger is used to control data security across the entire Hadoop platform. The Ranger
console can manage policies for access to files, folders, databases, tables and
columns. The policies can be set for individual users or groups.
Ranger documentation: https://ranger.apache.org/
Knox
• Single access point for all REST interactions with Apache Hadoop
clusters
Knox
Knox is a gateway for the Hadoop ecosystem. It provides perimeter-level security for
Hadoop. You can think of Knox like castle walls, where inside the walls is your Hadoop
cluster. Knox integrates with SSO and identity management systems to simplify
Hadoop security for users who access cluster data and execute jobs.
Knox documentation: https://knox.apache.org/
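Because Knox exposes cluster services as REST endpoints behind a single gateway, a
client only needs HTTPS and its gateway credentials. The sketch below lists an HDFS
directory through Knox's WebHDFS path; the host, topology name "default", path, and
credentials are assumptions for this example.

import requests

knox_url = "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp"
resp = requests.get(
    knox_url,
    params={"op": "LISTSTATUS"},       # standard WebHDFS operation
    auth=("guest", "guest-password"),  # Knox authenticates at the perimeter
    verify=False,                      # demo only; verify TLS certificates in practice
)
resp.raise_for_status()
print(resp.json()["FileStatuses"]["FileStatus"])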
Operations
Operations
In this section you will learn a little about some of the operations tools that come with
HDP.
Ambari
Ambari
You will grow to know your way around Ambari, as this is the central place to manage
your entire Hadoop cluster. Installation, provisioning, management, and monitoring of
your Hadoop cluster are done with Ambari. It also comes with easy-to-use RESTful
APIs, which allow application developers to easily integrate Ambari with their own
applications.
Ambari documentation: https://ambari.apache.org/
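As a small taste of those RESTful APIs, the sketch below lists the clusters that Ambari
manages and the services in one of them. The host, credentials, and the cluster name
"demo_cluster" are assumptions for this example.

import requests

ambari = "http://ambari.example.com:8080/api/v1"
auth = ("admin", "admin")
headers = {"X-Requested-By": "ambari"}  # required by Ambari for modifying calls; harmless on GETs

clusters = requests.get(f"{ambari}/clusters", auth=auth, headers=headers).json()
print([c["Clusters"]["cluster_name"] for c in clusters["items"]])

services = requests.get(f"{ambari}/clusters/demo_cluster/services",
                        auth=auth, headers=headers).json()
print([s["ServiceInfo"]["service_name"] for s in services["items"]])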
Cloudbreak
Cloudbreak
Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks
project and is currently not a part of Apache. It automates the launch of clusters onto
various cloud infrastructure platforms.
ZooKeeper
ZooKeeper
ZooKeeper provides a centralized service for maintaining configuration information,
naming, providing distributed synchronization and providing group services across your
Hadoop cluster. Applications within the Hadoop cluster can use ZooKeeper to maintain
configuration information.
ZooKeeper documentation: https://zookeeper.apache.org/
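For example, an application can keep a small piece of shared configuration in a znode.
Here is a minimal sketch with the kazoo client; the ZooKeeper address and the znode path
are assumptions for this example.

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

# Create the znode if needed, then store a small configuration value in it.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=enabled")

value, stat = zk.get("/app/config")
print(value.decode(), "version:", stat.version)

zk.stop()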
Oozie
Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with
the rest of the Hadoop stack. Oozie workflow jobs are Directed Acyclic Graphs
(DAGs) of actions. At the heart of this is YARN.
Oozie documentation: http://oozie.apache.org/
Tools
Tools
In this section you will learn a little about some of the Tools that come with HDP.
Zeppelin
Zeppelin
Zeppelin is a web-based notebook designed for data scientists to easily and quickly
explore datasets collaboratively. Notebooks can contain Spark SQL, SQL, Scala,
Python, JDBC, and more. Zeppelin allows for interaction with and visualization of large
datasets.
Zeppelin documentation: https://zeppelin.apache.org/
Zeppelin GUI
Zeppelin GUI
Here is a screenshot of the Zeppelin notebook showing a visualization of a particular
dataset.
Ambari Views
• Ambari web interface includes a built-in set of Views that are pre-
deployed for you to use with your cluster
• Includes views for Hive, Pig, Tez, Capacity Scheduler, File, HDFS
Ambari Views
Ambari Views provide a built-in set of views for Hive, Pig, Tez, Capacity Scheduler, Files,
and HDFS, which allow developers to monitor and manage the cluster. Ambari also allows
developers to create new user interface components that plug in to the Ambari Web UI.
• Big SQL
• Big Replicate
• BigQuality
• BigIntegrate
• Big Match
Big SQL
• Works directly against data in the Hadoop file system: no proprietary storage format
• Modern SQL:2011 capabilities
• The same SQL can be used on your Hadoop cluster and on your warehouse data
with little or no modification
Big Replicate
• Provides active-active data replication for Hadoop across supported
environments, distributions, and hybrid deployments
• Replicates data automatically with guaranteed consistency across
Hadoop clusters running on any distribution, cloud object storage and
local and NFS mounted file systems
• Provides SDK to extend Big Replicate replication to virtually any data
source
• Patented distributed coordination engine enables:
▪ Guaranteed data consistency across any number of sites at any distance
▪ Minimized RTO/RPO
• Totally non-invasive
▪ No modification to source code
▪ Easy to turn on/off
Big Replicate
For active-active data replication on Hadoop clusters, Big Replicate has no competition.
It replicates data automatically with guaranteed consistency across Hadoop clusters
running on any distribution, cloud object storage, and local and NFS-mounted file
systems. Big Replicate provides an SDK to extend it to virtually any other data source.
Its patented distributed coordination engine enables guaranteed data consistency across
any number of sites at any distance.
• You can profile, validate, cleanse, transform, and integrate your big
data on Hadoop, an open source framework that can manage large
volumes of structured and unstructured data.
• Connect
▪ Connect to a wide range of traditional enterprise data sources as well as
Hadoop data sources
▪ Native connectors with the highest level of performance and scalability for
key data sources
• Design & Transform
▪ Transform and aggregate any data volume
▪ Benefit from hundreds of built-in transformation functions
▪ Leverage metadata-driven productivity and enable collaboration
• Manage & Monitor
▪ Use a simple, web-based dashboard to manage your runtime environment
Information Server - BigIntegrate: Ingest, transform, process and deliver any data into &
within Hadoop
IBM BigIntegrate is a big data integration solution that provides superior connectivity,
fast transformation and reliable, easy-to-use data delivery features that execute on the
data nodes of a Hadoop cluster. IBM BigIntegrate provides a flexible and scalable
platform to transform and integrate your Hadoop data.
Once you have data sources that are understood and cleansed, the data must be
transformed into a usable format for the warehouse and delivered in a timely fashion
whether in batch, real-time or SOA architectures. All warehouse projects require data
integration – how else will the many enterprise data sources make their way into the
warehouse? Hand-coding is not a scalable option.
Increase developer efficiency
• Top down design – Highly visual dev environment
• Enhanced collaboration through design asset reuse
High performance delivery with flexible deployments
• Support for multiple delivery styles: ETL, ELT, Change Data Capture, SOA
integration etc.
• High-performance, parallel engine
Rapid integration
• Pre-built connectivity
• Balanced Optimization
• Multiple user configuration options
• Job parameter available for all options
• Powerful logging and tracing
BigIntegrate is built for everything from the simplest to the most sophisticated data
transformations. Think about simple transformations such as calculating total values.
These are the very basics of transformation across data, like you would do with a
spreadsheet or calculator. Then imagine the more complex, such as a lookup against an
automated loan system where the loan qualification date determines the interest rate
for that time of day, based on a lookup against an ever-changing system.
These are the types of transformations our customers are doing every day, and they
require an easy-to-use canvas that allows you to design as you think. This is exactly
what BigIntegrate has been built to do.
Information Server - BigQuality: Analyze, cleanse and monitor your big data
IBM BigQuality provides a massively scalable engine to analyze, cleanse, and monitor
data.
Analysis discovers patterns, frequencies, and sensitive data that is critical to the
business – the content, quality, and structure of data at rest. While a robust user
interface is provided, the process can be completely automated, too.
Cleanse uses powerful out-of-the-box (and completely customizable) routines to
investigate, standardize, match, and survive free-format data. For example,
understanding that William Smith and Bill Smith are the same person, or knowing that
BLK really means Black in some contexts.
Monitor measures the content, quality, and structure of data in flight to make
operational decisions about the data. For example, exceptions can be sent to a full
workflow engine called the Stewardship Center, where people can collaborate on the
issues.
(Big Match diagram: PME algorithm running on Hadoop.)
Checkpoint
1) List the components of HDP that provide data access capabilities.
2) List the components that provide the capability to move data from
relational databases into Hadoop.
3) Managing Hadoop clusters can be accomplished using which
component?
4) True or False? The following components are value-add from IBM:
Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
5) True or False? Data Science capabilities can be achieved using only
HDP.
Checkpoint
Checkpoint solution
1) List the components of HDP that provide data access capabilities.
▪ MapReduce, Pig, Hive, HBase, Phoenix, Spark, and more!
2) List the components that provide the capability to move data from
relational databases into Hadoop.
▪ Sqoop, Flume, Kafka
3) Managing Hadoop clusters can be accomplished using which
component?
▪ Ambari
4) True or False? The following components are value-add from IBM:
Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
▪ True
5) True or False? Data Science capabilities can be achieved using only
HDP.
▪ False. Data Science capabilities also require Watson Studio.
Checkpoint solution
Unit summary
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components
Unit summary
Lab 1
• Exploration of the lab environment
▪ Start the VMWare image
▪ Launch the Ambari console
▪ Perform some basic setup
▪ Start services and the Hadoop processing environment
▪ Explore the placement of files in the Linux host-system environment and
explain the file system and directory structures
Lab 1:
Exploration of the lab environment
Purpose:
In this lab, you will explore the lab environment. You will access your lab
environment and launch Apache Ambari. You will start up a variety of services
by using the Ambari GUI. You will also explore some of the directory structure
on the Linux system that you will be using.
Before doing the labs, you must have an IBM Cloud Lite account. One way to get it
is to register yourself in the IBM Digital-Nation Africa program (for Africa only). That
enables you to explore emerging technologies, build innovative solutions, learn new
skills and find a job.
Here are the steps for registration:
1. Browse to www.DigitalNationAfrica.com.
2. Click Register for Free.
3. If you already have an IBM ID, click Log in and continue with step 10.
4. If you don’t have an IBM ID, enter your data then click Next.
5. Accept the IBM ID Account Privacy by clicking Proceed.
6. Check your email. You should receive an email from [email protected],
mentioning the code that you will use for activating your IBM ID account.
7. Copy the code and paste it in the DNA registration page.
8. Click Verify.
9. Click Complete registration to finish the last step of the registration.
10. After logging in, you are taken to the home page of the site.
11. You will receive another email containing a link to confirm your IBM Cloud
account. Click that link so that the account is activated.
12. After the IBM Cloud account is activated click
Task 1. Create an instance of the Analytics Engine service
The IBM Analytics Engine service enables you to create Apache Spark and Apache
Hadoop clusters in minutes and customize these clusters by using scripts. You can
access and administer IBM Analytics Engine through the Cloud Foundry command-line
interface, REST APIs, the IBM Cloud portal, and Ambari.
You can check the documentation of the Analytics Engine service here:
https://cloud.ibm.com/docs/services/AnalyticsEngine.
In the following steps you will create an instance of the service:
1. Browse to IBM Cloud console at https://cloud.ibm.com.
2. Enter your IBM ID (email) and password, then click Log in.
3. In the page header, click Create resource.
4. Click the Analytics Engine service.
5. You can specify the Service Name for this instance, e.g. Analytics Engine
demo.
6. You can Choose a region/location to deploy in, e.g. London.
7. Review the features of the Pricing Plan, then click Configure.
8. Set Number of compute nodes to 1.
9. For Software package, select AE 1.1 Spark and Hadoop, then click Create.
11. After the cluster provisioning is complete, click the instance name (Analytics
Engine demo) to open it.
12. Click Reset password to generate a new password.
Now you can see the status of the cluster and the username and password that
can be used to access it. You will use this username and password in the
following labs.
In the Nodes section, you can see the components of the cluster. The cluster
consists of a management instance and one or more compute instances. The
management instance itself consists of three management nodes (master
management node mn001, and two management slave nodes mn002 and
mn003). Each of the compute nodes runs in a separate compute instance.
The output looks like the following:
14. Close the Ambari browser tab to return to the tab of the service instance.
Task 2. Retrieve service credentials and service end points
You can fetch the cluster credentials and the service end points by using the IBM Cloud
CLI, REST API, or from the IBM Cloud console.
1. While inside the service instance, click Service credentials in the left side bar.
2. Click New credential to create a new service credential.
3. In the Add new credential dialog, set Role to Manager, then click Add.
4. In the new credential, click View credentials, then click Copy to clipboard.
5. Open a new text file in a text editor then press Ctrl-V to paste the JSON text.
The following table lists important credentials and their locations in the JSON file
under the cluster object. You will refer to them later in the labs.
5. To review the files and directories in the /usr/hdp directory, type ls -l.
[1] Download and install PuTTY from here: www.putty.org
[2] You can also give the connection a name and save it so that you can use it later
without re-entering the same information again.
The first of these subdirectories (2.6.5.0-292) is the release level of the HDP
software that you are working with; the version in your lab environment may
differ if the software has been updated.
6. To display the contents of that directory, type ls -l 2*.
The results appear like the following:
Take note of the user IDs associated with the various directories. Some have a
user name that is the same as the software held in the directory; some are owned
by root. You will look at the standard users in the Apache Ambari unit
when you explore the details of the Ambari server and work with the management
of your cluster.
7. To view the current subdirectory, which has a set of links that point back to files
and subdirectories in the 2.*.*.*-*** directory, type ls -l current.
9. List the contents of the /etc directory by executing the ls -l command. Your
results will look like the following: