Course Guide v1
This material is meant for IBM Academic Initiative use only. NOT FOR RESALE
IBM Training
Preface
Contents
Preface................................................................................................................. P-1
Contents ............................................................................................................. P-3
Course overview................................................................................................. P-7
Document conventions ....................................................................................... P-8
Additional training resources .............................................................................. P-9
IBM product help .............................................................................................. P-10
BIGINSIGHTS OVERVIEW ...................................................................................... I
Introduction to Big Data ....................................................................... 1-1
Unit objectives .................................................................................................... 1-3
System of Units/Binary System of Units ............................................................. 1-4
The scale............................................................................................................ 1-5
There is an explosion in data and real world events ........................................... 1-6
Some examples of big data ................................................................................ 1-7
The growth of data ............................................................................................. 1-8
Example: The perception gap surrounding social media .................................. 1-10
Streams and oceans of information .................................................................. 1-11
Big data presents big opportunities .................................................................. 1-12
Merging the traditional and big data approaches .............................................. 1-13
What we hear from customers .......................................................................... 1-14
Big data scenarios span many industries ......................................................... 1-15
Big data use study ............................................................................................ 1-17
Big data use: focus areas and data sources ..................................................... 1-19
Unit summary ................................................................................................... 1-20
Exercise 1: Setting up the lab environment ...................................................... 1-21
Introduction to IBM BigInsights ........................................................... 2-1
Unit objectives .................................................................................................... 2-3
IBM big data strategy.......................................................................................... 2-4
IBM BigInsights for Apache Hadoop ................................................................... 2-5
Overview of BigInsights ...................................................................................... 2-6
Hadoop and the enterprise ................................................................................. 2-7
Overview of BigInsights ...................................................................................... 2-8
About the IBM Open Platform for Apache Hadoop ............................................. 2-9
Open source currency ...................................................................................... 2-10
Overview of BigInsights .................................................................................... 2-11
SQL for Hadoop (Big SQL) ............................................................................... 2-12
Spreadsheet-style analysis (BigSheets) ........................................................... 2-13
Course overview
This course provides a foundation in IBM BigInsights. In the IBM BigInsights Overview
part of this course, you will get an overview of IBM's big data strategy and review why
it is important to understand and use big data. It covers IBM BigInsights as a platform
for managing and gaining insights from your big data. You will see how IBM has aligned
its BigInsights offerings to better suit your needs, with the IBM Open Platform (IOP)
plus three specialized value-add modules that sit on top of the IOP. You will also get an
introduction to the BigInsights value-add components, including Big SQL, BigSheets,
and Big R. In the IBM Open Platform with Apache Hadoop part of the course, you will
review how IBM Open Platform (IOP) with Apache Hadoop serves as the collaborative
platform that enables big data solutions to be developed on a common set of Apache
Hadoop technologies. You will also get an in-depth introduction to the main
components of the ODP core, namely Apache Hadoop (including HDFS, YARN, and
MapReduce) and Apache Ambari, as well as a treatment of the main open source
components that are generally made available with the ODP core in a production
Hadoop cluster. Participants engage with the product through interactive exercises.
Intended audience
This course is for those who want a foundation in IBM BigInsights. This includes big
data engineers, data scientists, developers or programmers, and administrators who
are interested in learning about IBM's Open Platform with Apache Hadoop.
Topics covered
Topics covered in this course include:
IBM BigInsights Overview:
• Introduction to Big Data
• Introduction to IBM BigInsights
• IBM BigInsights for Analysts
• IBM BigInsights for Data Scientist
• IBM BigInsights for Enterprise Management
IBM Open Platform with Apache Hadoop:
• IBM Open Platform with Apache Hadoop
• Apache Ambari
• Hadoop Distributed File System
• MapReduce and YARN
• Apache Spark
• Coordination Management and Governance
• Data Movement
• Storing and Accessing Data
• Advanced Topics
Course prerequisites
Participants should have:
• None; however, knowledge of Linux would be beneficial.
Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where
applicable. As well, the following conventions are observed:
• Bold: Bold style is used in demonstration and exercise step-by-step solutions to
indicate a user interface element that is actively selected or text that must be
typed by the participant.
• Italic: Used to reference book titles.
• CAPITALIZATION: All file names, table names, column names, and folder names
appear in this guide exactly as they appear in the application.
To keep capitalization consistent with this guide, type text exactly as shown.
Task-oriented: You are working in the product and you need specific task-oriented
help (see the IBM Product - Help link).
BigInsights Overview
• Introduction to Big Data
• Introduction to IBM BigInsights
• IBM BigInsights for Analysts
• IBM BigInsights for Data Scientist
• IBM BigInsights for Enterprise Management
Introduction to Big Data
Unit objectives
• Understand when and why you would use big data
• Explain the perception gap
• Explain the difference between data-at-rest and data-in-motion
• Describe the 3 Vs
Unit objectives
The scale
• 2.5 petabytes
Memory capacity of the human brain
• 13 petabytes
Amount that could be downloaded from the internet in two minutes, if every
American (300M) were on a computer at the same time
• 4.75 exabytes
Total genome sequences of all people on the Earth
• 422 exabytes
Total digital data created in 2008
• 1 zettabyte
World’s current digital storage capacity
• 1.8 zettabytes
Total digital data expected to be created in 2011
The scale
It is hard for most people to grasp the concept of how large a petabyte or an exabyte is.
For a long time, people thought that a billion was a large number, but given how quickly
most governments spend a billion dollars or euros, it obviously cannot be that large a
number. To better understand extremely large numbers, it is best to view them in
comparison to something that you can understand. The capacity of the human brain is
about 2.5 petabytes. (This is also the estimated size of Walmart databases that handle
1 million customer transactions a day.) The total genome sequences of all people on
the Earth is 4.75 exabytes. The total amount of digital data created in 2008 was 422
exabytes. And the total that was expected to be created in 2011 was 1.8 zettabytes.
In 2000 the Sloan Digital Sky Survey began collecting astronomical data. In the first few
weeks it amassed more data than was collected in the history of astronomy. And the
total amount of data collected by the SDSS is the amount that its successor, the Large
Synoptic Survey Telescope, is expected to collect every 5 days, when it comes online
in 2016.
• 2 billion Internet users by 2011
• 1.3 billion RFID tags in 2005; 30 billion RFID tags today
• 4.6 billion mobile phones worldwide
• Capital market
• 1 in 3 business leaders frequently make decisions based on information they
don't trust, or don't have
• 300,000 tweets per minute
• > 1 PB per day (gas)
• 80% of the world's data is unstructured
• 83% of CIOs cited "Business intelligence and analytics" as part of their
visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information
rapidly in order to make swift business decisions
• 2.8 zettabytes of digital data created in 2012
Sources:
• The Guardian, May 2010
• IDC Digital Universe, 2010
• IBM Institute for Business Value, 2009
• IBM CIO Study 2010
• TDWI: Next Generation Data Warehouse Platforms Q4 2009
• https://blog.kissmetrics.com/facebook-statistics/
• http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
• http://www.computerworlduk.com/news/infrastructure/3433595/boeing-787s-to-create-half-a-terabyte-of-data-per-flight-says-virgin-atlantic/
• http://www.forbes.com/sites/maribellopez/2013/05/10/ge-speaks-on-the-business-
value-of-the-internet-of-things/
• http://www.idc.com/prodserv/4Pillars/bigdata;jsessionid=94A407E4522FB407627ECEBBAAA90A24
• http://www.digitalbuzzblog.com/infographic-24-hours-on-the-internet/
• ZB = 1 billion TB
• IDC reference:
o http://idcdocserv.com/925
o http://www.computer.org/portal/web/news/home/-/blogs/2613266;jsessionid=abbfded1402383e107abfa2641d6
• Agree: 70%
• Neutral: 23%
• Disagree: 7%
Sources: "Capitalizing on Complexity: Insights from the Global Chief Executive
Officer Study," IBM Institute for Business Value, 2010; "What Customers Want,"
first in a two-part series, IBM Institute for Business Value, published March 2011
Traditional approach:
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys
Big data approach:
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: brand sentiment, product strategy, maximum asset utilization
• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
• Imagine if you could make risk decisions, such as whether or not someone
qualifies for a mortgage, in minutes, by analyzing many sources of data, including
real-time transactional data, while the client is still on the phone or in the office.
• Imagine if law enforcement agencies could analyze audio and video feeds in real-
time without human intervention to identify suspicious activity.
As these new sources of data continue to grow in volume, variety and velocity, so too
does the potential of this data to revolutionize the decision-making processes in every
industry.
Gartner Sept. 2014 report: 13% of surveyed organizations have deployed big data solutions, while 73%
have invested in big data or plan to do so.
In the Educate stage, the primary focus is on awareness and knowledge development.
Almost 25 percent of respondents indicated that they are not yet using big data within
their organizations. While some remain relatively unaware of the topic of big data, our
interviews suggest that most organizations in this stage are studying the potential
benefits of big data technologies and analytics, and trying to better understand how big
data can help address important business opportunities in their own industries or
markets.
The focus of the Explore stage is to develop an organization's roadmap for big data
development. Almost half of respondents reported formal, ongoing discussions within
their organizations about how to use big data to solve important business challenges.
Key objectives of these organizations include developing a quantifiable business case
and creating a big data blueprint.
In the Engage stage, organizations begin to prove the business value of big data, as
well as perform an assessment of their technologies and skills. More than one in five
respondent organizations is currently developing proofs-of-concept (POCs) to validate
the requirements associated with implementing big data initiatives, as well as to
articulate the expected returns.
In the Execute stage, big data and analytics capabilities are more widely
operationalized and implemented within the organization. However, only 6 percent of
respondents reported that their organizations have implemented two or more big data
solutions at scale, the threshold for advancing to this stage.
Unit summary
• Understand when and why you would use big data
• Explain the perception gap
• Explain the difference between data-at-rest and data-in-motion
• Describe the 3 Vs
Unit summary
Exercise 1
Setting up the lab environment
Exercise 1:
Setting up the lab environment
Purpose:
You will set up your lab environment by starting the VMWare image, launching
the Ambari console, and starting the required services. You will also learn
about the file system and directory structures.
Estimated time: 30 minutes
User/Password: biadmin/biadmin
root/dalvm3
Services Password: ibm2blue
Task 1. Configure your image.
As copies are made of the VMWare image, additional network devices get
defined and the IP address changes. Configuration changes are required to get
the Ambari console to work.
Note: Occasionally, when you suspend and resume the VM image, the network
may assign a different IP address than the one you had configured. In these
instances, the Ambari console and the services will not run. You will need to
update /etc/hosts file with the newly assigned IP address to continue working
with the image. No restart of the VM image is necessary, just give it a couple of
minutes, at most. In some cases, you may need to restart the Ambari server,
using ambari-server restart from the command line.
1. To open a new terminal, right-click the desktop, and then click Open in
Terminal.
2. Type ifconfig to check for the current assigned IP address.
3. Take note of the IP address next to inet.
You need to edit the /etc/hosts file to map the hostname to the IP address.
4. To switch to the root user, type su -.
5. When prompted for a password, type dalvm3.
6. To open the /etc/hosts file, type gedit /etc/hosts.
7. Ensure that the contents of the file are similar to the following:
10.0.0.118 ibmclass.localdomain ibmclass
127.0.0.1 localhost.localdomain localhost
8. Update the IP address on the first line to the address that you noted in step 3.
9. Save and exit the file, and then close the terminal.
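The edit in Task 1 can also be scripted. The following is a minimal sketch that runs against a scratch copy of the hosts file, so nothing system-wide is touched; the hostname ibmclass.localdomain and the starting address mirror the lab image, while NEW_IP is a made-up illustration value that, in the real VM, you would replace with the "inet" address reported by ifconfig.

```shell
# Work on a temporary copy instead of the real /etc/hosts.
HOSTS_COPY=$(mktemp)
cat > "$HOSTS_COPY" <<'EOF'
10.0.0.118 ibmclass.localdomain ibmclass
127.0.0.1 localhost.localdomain localhost
EOF

NEW_IP="10.0.0.42"   # in the VM: take this from the `ifconfig` output (step 3)

# Rewrite only the ibmclass line, leaving the loopback entry untouched.
sed "s/^[0-9.]\{1,\} ibmclass\.localdomain/${NEW_IP} ibmclass.localdomain/" \
    "$HOSTS_COPY" > "${HOSTS_COPY}.new"

cat "${HOSTS_COPY}.new"
rm -f "$HOSTS_COPY" "${HOSTS_COPY}.new"
```

On the lab image itself you would edit /etc/hosts in place as root, exactly as the numbered steps describe; the script above only demonstrates the substitution.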
Task 2. Start the BigInsights components.
You will start all the services via the Ambari console to ensure that everything is
ready for the exercise. You may stop what you don't need later, but for now, you
will start everything.
1. Launch Firefox, and then if necessary, navigate to the Ambari login page,
http://ibmclass.localdomain:8080.
2. Log in to the Ambari console as admin/admin.
On the left side of the browser are the statuses of all the services. If any are
currently yellow, wait a couple of minutes for them to become red before
proceeding.
3. Once all the statuses are red, at the bottom of the left side, click Actions and
then click Start All to start the services.
This will take several minutes to complete.
4. When the services have started successfully, click OK.
Task 3. Begin to explore Ambari.
This section will provide some basic Ambari administration and cluster
management. The IBM Open Platform (IOP) with Apache Hadoop section
of this course will cover Ambari administration in more detail.
1. Launch Firefox, and then if necessary, navigate to the Ambari login page,
http://ibmclass.localdomain:8080.
2. Log in to the Ambari console as admin/admin.
Once logged in, you will notice the statuses of the services on the left side. If
everything is green, then all services are running.
You can select any of the services to go to the details page for that service.
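The same status information that the console shows can be read from a terminal through Ambari's standard REST API. This is a hedged sketch: the host, port, and admin/admin credentials are the lab defaults from this exercise, and <cluster_name> is a placeholder you would fill in from the first call.

```shell
# Base URL for the lab's Ambari server (default lab hostname and port).
AMBARI="http://ibmclass.localdomain:8080/api/v1"

# Run these inside the lab VM (they will fail if Ambari is not up):
#   curl -s -u admin:admin "$AMBARI/clusters"
#   curl -s -u admin:admin \
#       "$AMBARI/clusters/<cluster_name>/services?fields=ServiceInfo/state"
#
# A service state of STARTED corresponds to a green status in the console;
# INSTALLED (stopped) corresponds to red.
echo "$AMBARI/clusters"
```

This can be handy for scripting health checks once you are comfortable with the console workflow described above.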
Introduction to
IBM BigInsights
Introduction to IBM BigInsights
Unit objectives
• Describe the functions and features of IBM BigInsights
• List the IBM value-add components that come with BigInsights
• Give a brief description of the purpose of each of the value-add
components
Unit objectives
scheduling
• Provide for security and governance (Information Integration & Governance)
• Integrate with enterprise software
• Distinguishing characteristics:
Built-in analytics: enhances business knowledge
Enterprise software integration: complements and extends existing capabilities
Production-ready platform: speeds time-to-value
• IBM advantage: combination of software, hardware, services, and advanced
research
Overview of BigInsights
Overview of BigInsights
At the bottom, there's a 100% open source platform based on key Apache components,
like HDFS, YARN, Spark, HBase, Hive, and others. IBM has joined an industry
consortium (the Open Data Platform Initiative) whose mission is to define, test, and
validate a common core of Hadoop components. When that core set becomes
available, IBM will support it as part of its core platform. The goal here is to contribute to
the open source community in a way that helps all organizations be active in the
Hadoop space and mitigates concerns about "vendor lock in".
But BigInsights is more than that. IBM has incorporated the results of years of research
and development into 3 modules that you can add to the Open Platform stack.
The BigInsights Analyst module includes a sophisticated SQL engine (Big SQL) and an
easy-to-use spreadsheet-style tool (BigSheets) for exploring big data. BigInsights Data
Scientist includes all the Analyst features and adds important analytical technologies for
text, R integration, and machine learning. BigInsights Enterprise Management offers a
robust, POSIX-compliant file system alternative to HDFS as well as key technologies
for managing multiple workloads and multiple tenants on your cluster. These 3 offerings
are fee based. However, if you want to get off to a quick start, we offer a free Quick
Start edition for non-production use.
This will be covered in more detail later; now that you know a little about BigInsights,
let's consider how this technology fits into a broader IT infrastructure.
Overview of BigInsights
Overview of BigInsights
You will start by exploring the base platform.
Overview of BigInsights
Overview of BigInsights
Now that you understand the IBM Open Platform with Apache Hadoop, you will look at
the IBM modules. To start, there is IBM BigInsights Analyst. BigSheets is one part of
the Analyst offering, and Big SQL is another.
Overview of BigInsights
Overview of BigInsights
The IBM BigInsights Data Scientist module includes native support for the R
programming language (Big R) and adds Machine Learning algorithms that are
optimized for Hadoop. It also provides web-based tooling for text analysis. Each of
these capabilities will be explored individually.
What is Big R?
• End-to-end integration of R-Project R Clients
What is Big R?
Big R is a library of functions that provide end-to-end integration with the R language
and BigInsights. Big R can be used for comprehensive data analysis on your
BigInsights cluster, hiding some of the complexity of manually writing MapReduce jobs.
Big R uses the open source R language to enable rich statistical analysis. You can use
Big R to manipulate data by running a combination of R and Big R functions. Big R
functions are similar to existing R functions but are designed specifically for analyzing
big data.
Big R is:
• an R package that provides end-to-end integration of R into IBM BigInsights
• a library that overloads a number of R primitives to work with big data
Big R's native support for open source R statistical computing helps clients leverage
their existing R code or gain from more than 4,500 freely available statistics packages
from the Open R community.
To learn more about R, create an account on the Big Data University website and take
the course on R programming
(http://bigdatauniversity.com/courses/course/view.php?id=522). The main R web site is
http://www.r-project.org/.
Big R implementation
*Dataset: "airline", scheduled flights in the US, 1987-2009.
Text Analytics
• Distills structured info from unstructured text
Sentiment analysis
Consumer behavior
Illegal or suspicious activities
…
• Parses text and detects meaning with annotators
• Understands the context in which the text is analyzed
• Features pre-built extractors for names, addresses, phone numbers,
etc.
Text analytics
Of course, there's more to BigInsights Data Scientist than Big R. Text analytics is
another capability included with this offering.
Key points:
• IBM Research developed a sophisticated text analytics engine, with technology
similar to what was demonstrated in Watson; it is now an integral part of
BigInsights and identifies meaning within unstructured text
• there are hundreds of pre-built rules (annotators)
• the annotators are context sensitive and discover the relationship between terms
even if they are separated by intervening text
• it is built for top performance and has been optimized for BigInsights workloads
Overview of BigInsights
Overview of BigInsights
The final module, IBM BigInsights Enterprise Management, helps administrators
manage, monitor and secure their Hadoop distribution. BigInsights Enterprise
Management introduces tools to allocate resources, monitor multiple clusters, and
optimize workflows to increase performance.
Unlike HDFS, which is optimized for large-block I/O, GPFS (General Parallel File
System) is flexible enough to support
a variety of different access patterns, including applications with small or medium-size
blocks as well as write-intensive applications. This means GPFS can provide better
performance across a wider range of applications. As a specific example, a customer
may be running an SAS workload, including a series of ETL-related steps to manipulate
data on a shared GPFS file system. At a particular stage in the ETL workflow, a
MapReduce program may be the most efficient way to process a specific data set.
Because GPFS data can be accessed by both MapReduce and non-MapReduce
workloads, the MapReduce job can be incorporated into the broader ETL workflow,
executing against the same GPFS resident data, to avoid the time and cost associated
with migrating data from one file system to another.
Also, the grid manager itself can support multiple workload patterns at the same time,
which helps reduce the cost of infrastructure. Traditional batch workloads (using the
IBM Platform LSF® batch scheduler) can coexist with service-oriented workloads and
Hadoop MapReduce workloads (best implemented on Platform Symphony) across a
common resource orchestration layer so that all workloads can share the same physical
infrastructure.
Platform Symphony
• Multiple users, applications and lines of business on a shared,
heterogeneous, multi-tenant grid
(Diagram: applications A through D (IBM Algorithmics, third-party trading and risk
platforms, in-house applications, and BigInsights) share a common Workload
Manager and Resource Orchestration layer on a multi-tenant grid.)
Platform Symphony
Platform Symphony (whose Hadoop-compatible run-time is called Adaptive
MapReduce) replaces the traditional
MapReduce layer, allowing a computing cluster to support many different types of
applications. It includes IBM-specific features for workload management, resource
orchestration, and high performance.
• A heterogeneous grid management platform.
• A high-performance SOA middleware environment.
• Supports diverse compute & data intensive applications:
• ISV applications
• in-house developed applications (C/C++, C#/.NET, Java, Excel, R, etc.)
• optimized low-latency Hadoop compatible run-time
• can be used to launch, persist and manage non-grid aware application
services
• Reacts instantly to time-critical requirements.
• A multi-tenant shared services platform with unique resource sharing capabilities.
• A limited-use run-time for Platform Symphony (called Adaptive MapReduce)
included in BigInsights Enterprise Management.
Overview of BigInsights
Overview of BigInsights
IBM also has a free offering for non-production use called the Quick Start edition. It
includes the IBM Open Platform as well as most features of BigInsights Analyst and
Data Scientist.
Overview of BigInsights
Overview of BigInsights
One last packaging aspect to review is the full production package called IBM
BigInsights for Apache Hadoop. This includes all of the open source components, all of
the IBM added value components in the Analyst, Data Scientist, and Enterprise
Management modules, and a collection of limited use licenses to other IBM offerings,
such as Cognos, Watson Explorer, InfoSphere Streams, and Data Click.
Pricing Terms: Free / Free / Yearly Subscription Only / Perpetual or Monthly License
Support provided: Community / Community / IBM 24x7 support
Usage License: Non-production, five-node cap / Production Usage
Pricing Model: Free / Free / Node-based pricing
• $24B investment in analytics, in both organic development and 30+ acquisitions
• $1B to bring cognitive services and analytics applications to market
• $100M announced investment in IBM Interactive Experience, creating 10 new
labs worldwide
• 9 Analytics Solution Centers
• Developing curriculum and training for 1,000 universities
Background:
• When it comes to what sets IBM apart, IBM has rich offerings that other vendors
can only dream about.
• From our expertise in consulting, to our solutions, to our advances in technology
that address context computing, streams, advanced analytics, cognitive, and
infrastructure for compute-intensive workloads, there is not one vendor out there
that can deliver the depth we can to our clients.
• The reason we have invested in this breadth and depth of technology, expertise,
and reach is that our clients require it to succeed in capitalizing on the
competitive advantage of data.
• We understand the value of data as a new natural resource and the need for
individuals and organizations to exploit it and put it to work.
• The market shifts we see create the opportunity to realize the potential of our
portfolio, and our investment areas position us for the future.
Unit summary
• Describe the functions and features of IBM BigInsights
• List the IBM value-add components that come with BigInsights
• Give a brief description of the purpose of each of the value-add
components
Exercise 1
Getting started with IBM BigInsights
Exercise 1:
Getting started with IBM BigInsights
Purpose:
You will learn more about the file system and directory structures of the IBM
value-adds that are available with IBM BigInsights, and begin working with
basic Hadoop commands.
User/Password: biadmin/biadmin
root/dalvm3
Services Password: ibm2blue
Important: Before doing this exercise, ensure that your access and services are
configured and running. Check that:
• /etc/hosts displays your environment's IP address
• in the Ambari console, ensure that all BigInsights services are running
If you are unsure of the steps, please refer to Unit 1, Exercise 1 to ensure that your
environment is ready to proceed. You should review the steps in Task 1 (Configure
your image) and Task 2 (Start the BigInsights components).
Task 1. Navigate the file system.
In this task, you will get a brief overview of the file system.
1. To open a new terminal, right-click the desktop, and then click Open in
Terminal.
There are two main directories where all of the BigInsights and IBM Open
Platform (IOP) components are installed.
2. To navigate to the IBM value-adds directory, type the following:
cd /usr/ibmpacks
3. Type ls to see a listing of the components that are currently installed in the VM.
Each of those directories contains the files and functions specific to one
component. There is a directory for each of the components installed (such as
bigr, bigsheets, etc.)
4. Navigate to /usr/ibmpacks/bin.
This is where the scripts reside to remove the value-adds from the IOP stack.
This is useful to know if you no longer need any of the services and want to
save space and memory.
5. Navigate to /usr/ibmpacks/current.
This directory links to the current releases of those value-add components.
6. To navigate to the IOP directory, where you can access the Apache stack
containing the open source components, type the following:
cd /usr/iop/current
This is where you navigate to if you want to use the IOP components. You will
use these components in the IOP section of this course.
7. Close the terminal.
Task 2. Working with basic Hadoop commands.
In this task, you will get a brief overview of basic Hadoop (HDFS) commands.
1. To open a new terminal, right-click the desktop, and then click Open in
Terminal.
2. To switch to the root user, type su -, and then type the password dalvm3.
3. Create the biadmin folder on the hdfs under /user:
su - hdfs -c "hdfs dfs -mkdir -p /user/biadmin/"
9. To see a listing of the /user/biadmin directory, and the uploaded file on the
hdfs, type the following:
hdfs dfs -ls /user/biadmin
You are not going to do anything else with that file now. The purpose of this
exercise was to introduce you to some basic HDFS commands. They are
similar, if not identical, to common Linux commands. You will work more with
Hadoop commands in an upcoming exercise.
Results:
You have learned more about the file system and directory structures of the
IBM value-adds that are available with IBM BigInsights, and you began
working with basic Hadoop commands.
IBM BigInsights for Analysts
Unit objectives
• Describe the components that come with the IBM BigInsights Analyst
module
• Explain the benefits of using Big SQL for big data
• Understand the purpose of BigSheets
Overview of BigInsights
(Slide: IBM BigInsights modules — IBM BigInsights Analyst: industry-standard SQL
(Big SQL), spreadsheet-style tool (BigSheets); Big R (R support) for machine learning;
IBM BigInsights Enterprise Management: POSIX distributed filesystem, multi-workload,
multi-tenant scheduling, ...)
Overview of BigInsights
This unit will cover the IBM value-adds that come with the IBM BigInsights Analyst
module. A brief overview was provided earlier in the unit on BigInsights; more detail will
be presented in this unit.
Executive Summary
• What is Big SQL?
Industry-standard SQL query interface for BigInsights data
New Hadoop query engine derived from decades of IBM R&D investment in RDBMS
technology, including database parallelism and query optimization
• Why Big SQL?
Easy on-ramp to Hadoop for SQL professionals
Support familiar SQL tools / applications (via JDBC and ODBC drivers)
• What operations are supported?
Create tables / views. Store data in DFS, HBase, or Hive warehouse
Load data into tables (from local files, remote files, RDBMSs)
Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-in
functions, UDFs, etc.)
GRANT / REVOKE privileges, create roles, create column masks and row permissions
Transparently join / union data between Hadoop and RDBMSs in single query
Collect statistics and inspect detailed data access plan
Establish workload management controls
Monitor Big SQL usage
etc.
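The operations listed above can be sketched in a short Big SQL session. The table name, columns, and file path below are hypothetical, and the LOAD statement is abbreviated; consult the Big SQL reference for the full set of options:

```sql
-- Create a table stored in the distributed file system.
-- The HADOOP keyword and the delimited text format match the
-- defaults discussed later in this unit.
CREATE HADOOP TABLE sales (
    cust_id   INT,
    amount    DECIMAL(10,2),
    sale_date DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Populate the table from a file already in the cluster (path is illustrative)
LOAD HADOOP USING FILE URL '/user/biadmin/sales.csv'
    WITH SOURCE PROPERTIES ('field.delimiter' = ',')
    INTO TABLE sales;

-- Query it with ordinary SQL
SELECT cust_id, SUM(amount) AS total
FROM sales
GROUP BY cust_id;
```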
IBM BigInsights for Analysts © Copyright IBM Corporation 2015
Executive Summary
In this presentation, you will be introduced to IBM's Big SQL technology in BigInsights
Analyst and BigInsights Data Scientist. You'll learn what it can do and why IBM
developed this technology. You'll see how you can create, populate, and query Big SQL
tables. And you will be presented with some important concepts that relational DBMS
experts should understand about Big SQL.
Although a number of vendors offer SQL-on-Hadoop implementations, IBM's vast SQL
experience enabled IBM to deliver a range of SQL capabilities that you'll be hard-
pressed to find in competing offerings today. For example, while most implementations
have limited support for subqueries, perhaps not allowing them in SELECT lists, in the
HAVING clause, or with certain quantifiers (such as SOME, ANY, or ALL), IBM has no
comparable restrictions. In addition, IBM provides more than 200 built-in functions,
including a wide range of OLAP functions, where other implementations have
considerably fewer. IBM supports UDFs written in Java, C, and SQL. Most other
implementations support only Java-based UDFs. Finally, IBM offers fine-grained access
control (column masking and row-based permissions) as well as federated queries.
Again, many competing offerings lack such capabilities.
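The fine-grained access control mentioned above can be illustrated with a column mask. This is a sketch using the DB2-style row and column access control syntax that Big SQL inherits; the table, column, and role names are hypothetical:

```sql
-- Hypothetical column mask: members of role HR see the full SSN;
-- everyone else sees only the last four digits.
CREATE MASK ssn_mask ON employees
    FOR COLUMN ssn RETURN
        CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'HR') = 1
             THEN ssn
             ELSE 'XXX-XX-' || SUBSTR(ssn, 8, 4)
        END
ENABLE;

-- Masks take effect once column access control is activated on the table
ALTER TABLE employees ACTIVATE COLUMN ACCESS CONTROL;
```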
Agenda
• Big SQL overview
motivation
architecture
distinguishing characteristics
• Using Big SQL: the basics
invocation options
creating tables and views
populating tables with data
querying data
Agenda
Here's a quick look at what will be reviewed. You will spend some time on the
motivation and architecture of Big SQL, during which time distinguishing characteristics
will be summarized. The bulk of this presentation will present how you can use Big
SQL. First the basics will be reviewed, and then some advanced topics. Note that there
is a lot more to Big SQL than can be covered in the scope of this introductory course.
Agenda
• Big SQL overview
motivation
architecture
distinguishing characteristics
• Using Big SQL: the basics
invocation options
creating tables and views
populating tables with data
querying data
Agenda
First, a quick overview of Big SQL.
Source: Analytics: The real-world use of big data, How innovative enterprises extract
value from uncertain data, IBM Institute for Business Value and Saïd Business School
at the University of Oxford, 2012
Link:
http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF
SQL-on-Hadoop landscape
• The SQL-on-Hadoop landscape changes constantly
SQL-on-Hadoop landscape
Of course, IBM is not the only vendor to recognize the demand for SQL on Hadoop.
Most of the major Hadoop vendors have jumped into the fray and delivered some level
of SQL support. Features vary significantly, and vendors are rapidly evolving their
offerings. But most players are relatively new to the world of SQL, and it is not quick or
easy to build a truly enterprise-grade SQL engine, which has forced many of them to
compromise on some critical features. By contrast, Big SQL is based on decades of
IBM's research and development investment in relational technology, affording IBM a
better opportunity to deliver advanced technology to the Hadoop community today.
Distinguishing characteristics
Application Portability & Integration
Performance
Rich SQL:
• Comprehensive SQL support
• IBM SQL PL compatibility
• Extensive analytic functions
Federation:
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server
Enterprise Features:
• Advanced security/auditing
• Resource and workload management
• Self-tuning memory management
• Comprehensive monitoring
Distinguishing characteristics
What distinguishes Big SQL from other SQL-on-Hadoop offerings? Summarizing the
key characteristics here, they have been categorized into four broad areas. Many of
these features will be reviewed in greater detail later.
There is also a good white paper that summarizes IBM strengths: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=SWGE_SW_SW_USEN&htmlfid=SWW14019USEN&attachment=SWW14019USEN.PDF#loaded
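As a sketch of the federation capability, a single Big SQL statement can combine a Hadoop-resident table with a table at a remote RDBMS exposed through a nickname. All object names here are hypothetical:

```sql
-- 'sales' is a Big SQL table stored in HDFS; 'rdb.customers' stands in
-- for a nickname that points at a table in a remote source such as
-- DB2 LUW, Oracle, or Teradata.
SELECT s.cust_id, SUM(s.amount) AS hadoop_spend, c.credit_limit
FROM sales s
JOIN rdb.customers c
  ON s.cust_id = c.cust_id
GROUP BY s.cust_id, c.credit_limit;
```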
Agenda
• Big SQL overview
motivation
architecture
distinguishing characteristics
• Using Big SQL: the basics
invocation options
creating tables and views
populating tables with data
querying data
Agenda
With that background, it is time to look at Big SQL in action. Firstly, the basics: how to
invoke Big SQL, how to create tables and views, how to populate tables with data, and
how to query Big SQL tables. And in doing so, you will begin to understand that Big
SQL provides an easy on-ramp to Hadoop for SQL professionals.
Invocation options
• Command-line interface:
Java SQL Shell (JSqsh)
Invocation options
Big SQL includes a command-line interface called JSqsh. JSqsh (pronounced
J-skwish) is a short name for Java SQshell (pronounced s-q-shell). This is an open
source database query tool featuring much of the functionality provided by a good shell,
such as variables, redirection, history, command line editing, and so on. As displayed
on this chart, it includes built-in help information and a wizard for establishing new
database connections.
In addition, when Big SQL is installed, administrators can also install IBM Data Server
Manager (DSM) on the Big SQL Head Node. This web-based tool includes a SQL
editor that runs statements and returns results, as shown here. DSM also includes
facilities for monitoring your Big SQL database.
For more on DSM, visit http://www-03.ibm.com/software/products/en/ibm-data-server-
manager.
Tools that support IBM's JDBC / ODBC driver are also options.
• Worth noting:
"Hadoop" keyword creates table in DFS
Row format delimited and textfile formats are default
Constraints not enforced (but useful for query optimization)
• Examples in these charts focus on DFS storage, both within and external to the
Hive warehouse. HBase examples are provided separately.
Hadoop Keyword:
• Big SQL requires the HADOOP keyword
• Big SQL has internal traditional RDBMS table support
• stored only at the head node
• does not live on HDFS
• supports full ACID capabilities
• not usable for big data
• The HADOOP keyword identifies the table as living on HDFS
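The contrast above can be sketched with two hypothetical CREATE statements:

```sql
-- Without the HADOOP keyword: a traditional head-node table.
-- It does not live on HDFS, supports full ACID capabilities,
-- and is not usable for big data volumes.
CREATE TABLE lookup_codes (
    code  INT NOT NULL PRIMARY KEY,
    descr VARCHAR(40)
);

-- With the HADOOP keyword: the table lives on HDFS and
-- scales across the cluster.
CREATE HADOOP TABLE web_logs (
    ip     VARCHAR(15),
    url    VARCHAR(200),
    hit_ts TIMESTAMP
);
```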
CREATE VIEW
• Standard SQL syntax
CREATE VIEW
You can create Big SQL views in the same way that you would create a view in a
relational DBMS. A simple example is shown here.
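As a minimal sketch (the underlying sales table and its columns are hypothetical), a Big SQL view is created and queried exactly as in a relational DBMS:

```sql
-- Define a view over a Big SQL table
CREATE VIEW big_spenders AS
    SELECT cust_id, SUM(amount) AS total
    FROM sales
    GROUP BY cust_id
    HAVING SUM(amount) > 1000;

-- Query the view like any table
SELECT * FROM big_spenders ORDER BY total DESC;
```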
(Slide: sample SELECT statement illustrating subquery support, including quantified
predicates such as SOME and NOT EXISTS.)
What is BigSheets?
• Browser-based analytics tool for business users
• Spreadsheet-like interface for analyzing big data
• A component of the Analyst module of IBM BigInsights
What is BigSheets?
BigSheets is a browser-based analytic tool designed to work with big data. Unlike many
other big data tools, it is designed to support business users and non-technical
professionals. To do so, it presents a familiar, spreadsheet-like interface that allows
users to gather, filter, combine, explore, and visualize data from various sources.
IBM chose the spreadsheet as the model for organizing data because most users are
already familiar with such software. If users want to represent the data in more complex
ways, the tool works with an IBM visualization tool called Many Eyes, and other
visualization software.
As an important part of IBM's big data strategy, BigSheets is a feature of IBM
BigInsights.
BigSheets
(Slide: BigSheets workflow — data collection and import from web sites, local files,
DBMS exports, etc.; data exploration, manipulation, and analysis; built on the
BigInsights engine.)
Unit summary
• Describe the components that come with the IBM BigInsights Analyst
module
• Explain the benefits of using Big SQL for big data
• Understand the purpose of BigSheets
Exercise 1
Working with BigSheets
Exercise 1:
Working with BigSheets
Purpose:
You will create a BigSheets workbook and derive a chart from it to visualize
your data.
Estimated time: 30 minutes
User/Password: biadmin/biadmin
root/dalvm3
Services Password: ibm2blue
Important: Before doing this exercise, ensure that your access and services are
configured and running. Check that:
• /etc/hosts displays your environment's IP address
• in the Ambari console, ensure that all BigInsights services are running
If you are unsure of the steps, please refer to Unit 1, Exercise 1 to ensure that your
environment is ready to proceed. You should review the steps in Task 1 (Configure
your image) and Task 2 (Start the BigInsights components).
Task 1. Loading data into BigSheets.
BigSheets allows you to analyze data residing on HDFS. You can create
master workbooks, apply various sheet types to refine and filter the data, and
then create charts to visualize the data. This task walks you through the
process end to end, from creating a workbook to visualizing the data with
charts. More functions and features will be covered in the BigSheets-specific
module.
You will load two sets of data into HDFS.
1. To open a new terminal, right-click biadmin's Home, and then click Open in
Terminal.
2. Navigate to /home/biadmin/labfiles/bigsheets to see the files.
3. Upload blogs-data.txt and news-data.txt to /user/biadmin/.
hdfs dfs -put /home/biadmin/labfiles/bigsheets/blogs-data.txt /user/biadmin
hdfs dfs -put /home/biadmin/labfiles/bigsheets/news-data.txt /user/biadmin
Once the files are inside of the HDFS, you are ready to create the BigSheets
workbook.
4. Launch Firefox, and then if necessary, navigate to the Ambari login page,
http://ibmclass.localdomain:8080.
5. Log in to the Ambari console as admin/admin.
15. In the Select a reader list, select JSON Array, and then click Set reader .
You can see now that the data is properly parsed.
16. At the bottom of the New Workbook window, click Save workbook .
Using the same steps, you will create the Blog Data workbook.
17. Click the Workbooks link (breadcrumb) to go back to the BigSheets home
page.
18. Click the New Workbook button.
19. On the New Workbook window, under Name, type Blogs Data.
You can leave the description field blank.
20. On the DFS Files tab, navigate to /user/biadmin/ and select the blogs-data.txt
file.
21. Specify the JSON Array reader.
22. At the bottom of the New Workbook window, click Save workbook .
23. Click the Workbooks link to return to the BigSheets home page.
The results appear as follows:
5. Click Remove item beside each of the following columns to delete them.
• Crawled
• Inserted
• MoreoverUrl
• PostSize
• URL
7. Beside News Data(1), click Edit workbook name , beside Name, type
NewsDataRevised, and then click Save Tag .
8. Expand the Save dropdown, and then click Save and Exit.
9. Note that you could also name the workbook here. Since you have already
named it, click the Save button.
BigSheets lets you preview the effect of your changes on a subset of the
data. For the changes to take effect on the full dataset, you must run the
workbook. When you save and exit from the workbook, you will be prompted
to run it.
10. Click Run to run the workbook on the full set of data.
11. Click Workbooks.
You will use the steps above to revise the Blogs Data workbook.
12. Click Blogs Data.
13. Beside Blogs Data, click the Build new workbook button.
You will now remove unnecessary columns.
14. In the IsAdult column header, expand the dropdown menu, and then click
Remove.
15. In any column header, expand the dropdown menu, and then click Organize
Columns.
16. Click Remove item beside each of the following columns to delete them.
• Crawled
• Inserted
• Url
• PostSize
18. Beside Blogs Data(1), click Edit workbook name , beside Name, type
BlogsDataRevised, and then click Save Tag .
19. Expand the Save dropdown, and then click Save and Exit.
20. Note that you could also name the workbook here. Since you have already
named it, click the Save button.
BigSheets lets you preview the effect of your changes on a subset of the
data. For the changes to take effect on the full dataset, you must run the
workbook. When you save and exit from the workbook, you will be prompted
to run it.
21. Click Run to run the workbook on the full set of data.
The results appear as follows:
Leave The BigInsights - BigSheets window open for the next task.
10. Click Save, click Save & Exit, in the Name box, type NewsAndBlogsData, and
then click Save.
In this case, you have multiple entries that you need to treat as identical. For the
purpose of our exercise, you will stop here, but if you have some time, you may
play around with the different sheets and functions to see other types of
operations you can perform on the data.
12. Close all open windows.
Results:
You have created a BigSheets workbook and a chart from it to visualize your
data.
IBM BigInsights for Data Scientists
Unit objectives
• Describe the components that come with the IBM BigInsights Data
Scientist module
• Understand the benefits of using text analytics as opposed to coding
MapReduce jobs
• Use R / Big R for statistical analysis at Hadoop scale
Overview of BigInsights
This unit will cover the IBM value-adds that come with the IBM BigInsights Data
Scientist module. You saw briefly what they are in the earlier unit on BigInsights.
Remember that as part of the Data Scientist module, you also get Big SQL and
BigSheets, which are covered in the unit on the Analyst module.
Known usage
− Represents salary versus zip code
• Unstructured data has:
No known attribute types or usage
• Usage is based upon context
Tom Brown has brown eyes
• A computer program has to be able to view a word in context to know its
meaning
(Slide: text analytics workflow — sample input documents; label snippets and find
clues; develop extractors; test extractors; profile extractors; export extractors.)
Key Take-Away
Open source R is a powerful tool; however, it has limited functionality in
terms of parallelism and memory, which bounds its ability to analyze big data.
Key Take-Away
Open source R has packages that help deal with parallelism and memory
constraints; however, these packages assume advanced parallel programming
skills and require significant time-to-value.
(Slide: comparison of "all available information" versus "analyzed information" — the
goal is for all available information to be analyzed.)
These types of problems with running large-scale analytics are well known to users of
R; open source R is a statistical computing environment and is the most popular tool in
data science today, with a vibrant and growing community. R is used across industries
and academia for performing cost-effective statistical analysis. An important part of the
open source R community is the CRAN repository, which provides R users with over
4,500 freely available statistics packages. If you come up with your own improvements
or your own algorithms, you can post these to CRAN for the world to use. These
packages vary in quality, but it is important to note that many of the world's leading
statisticians and researchers use R to develop their bleeding edge algorithms. R
enables them to get their algorithms out into the hands of statisticians as quickly as
possible. Many corporations actually have R packages that they are active contributors
to and/or stewards of. R is an essential piece of many analytics workflows.
Although open source R has many, many benefits, it does have some restrictions. For
example, R was originally conceived as a single-user tool. It is naturally single threaded
on a single node. Therefore, you can only perform your analysis if both your
dataset and the accompanying computational requirements will fit into memory. In
the context of big data, R and Hadoop are not naturally friends. The desire to be able to
run R in Hadoop is the reason why IBM developed Big R for BigInsights. Big R extends
open source R so that it can run with Hadoop.
Big R also enables the user to leverage their existing R assets and push them to the
cluster. You can take code snippets from your existing R code base and CRAN
packages and push those to partitions of the data within the cluster for data parallel
execution. This is called "partitioned execution", which follows the same "Apply
approach" that is used in open source R. Through this method, Big R automatically
stands up multiple instances of R in the Hadoop cluster, based on how you need to
partition the data. Each of these instances of R will run your desired analysis, and then
you can pull the analysis back to your client for further exploration or visualization. A
common use case here is when the data scientist wants to build models on multiple
subsets of the data. For example, you may build a decision tree model for each product
category of interest. Beyond data parallelism, this flexible partitioned execution
mechanism can also perform task parallelism. An example of task parallelism is when
the data scientist wants to run concurrent simulations of a given modelling technique
(such as exploring the parameter space), and then use the best model that they find.
So although you can scale out your native R analysis across multiple nodes in your
Hadoop cluster, through this approach you are still confined to the memory restrictions
of R. In the situations where you need to run statistical analysis and machine learning
beyond R in Hadoop, you can call Big R's wide array of scalable algorithms. Big R
comes with a set of prepackaged scalable algorithms. These are written in an R-like
declarative language under the hood, and can run optimally at any scale. Since they
are declarative, they are compiled for automatic parallelization and optimization based
on data characteristics and the Hadoop configuration. This means automatic
performance tuning, something that is key when running analysis on Hadoop. In
the future, IBM plans on opening up this R-like declarative language to the user so that
they can tweak existing prepackaged algorithms, as well as provide the ability to write
their own custom algorithms in a fairly R-like language that will automatically parallelize
and optimize for the computation at hand. It seems that no one else is attempting to
build anything anywhere close to this level of value and sophistication for the data
scientist. Again, the value that these scalable algorithms bring is optimization for high
performance, and flexibility to enable data scientists to customize these (or their own)
algorithms. And you get all this capability from your favorite R tool!
Big R architecture
(Slide: Big R architecture — an R user interface on the client connects to the IBM
BigInsights cluster.)
Big R architecture
The Big R architecture makes a lot of sense to the R user. R runs on your client
machine, using a normal R client, such as RStudio, as the IDE. This IDE will usually be on your data
scientist's laptop. The Big R package itself will be installed on your client as well as the
nodes on the cluster. From your client, Big R provides you with data connectivity to
several different data sources within Hadoop, any delimited file (such as CSV), data
sources cataloged by Hive, HBase data, or JSON files. Under the hood, Big R is simply
opening a proxy to this entire dataset that is stored in Hadoop. It is not actually moving
the data, which is obviously very important when dealing with large datasets. From the
user's end, it will look and feel as if all of the data is sitting on their laptop, but obviously
due to the data volume, that is not possible, as it is still sitting in Hadoop.
Big R moves the function in your R applications to the data in Hadoop. So those three
key capabilities afforded by Big R will all be pushed down into the Hadoop cluster for
scalable analysis. In short, you have the ability to perform scalable data processing,
wrap up native R functions for parallel execution on the cluster, and run the scalable
algorithms that seamlessly run across entire datasets to build machine learning models
and descriptive statistics.
Connect to BI cluster
(Slide: bigr.lm performance and scalability on Hadoop — 28x speedup when data fits
in memory; scales beyond aggregate cluster memory when data is larger than memory;
open source R runs out of memory.)
Unit summary
• Describe the components that come with the IBM BigInsights Data
Scientist module
• Understand the benefits of using text analytics as opposed to coding
MapReduce jobs
• Use R / Big R for statistical analysis at Hadoop scale
Exercise 1
Working with Text Analytics and R / Big R
Exercise 1:
Working with Text Analytics and R / Big R
Purpose:
You will create a new text analytics web tooling project and load documents
to scan for certain keywords. You will start up the R console and run basic
commands on it. You will also load the BigR libraries and run basic BigR
operations.
Estimated time: 1 hour
User/Password: biadmin/biadmin
root/dalvm3
Services Password: ibm2blue
Important: Before doing this exercise, ensure that your access and services are
configured and running. Check that:
• /etc/hosts displays your environment's IP address
• in the Ambari console, ensure that all BigInsights services are running
If you are unsure of the steps, please refer to Unit 1, Exercise 1 to ensure that your
environment is ready to proceed. You should review the steps in Task 1 (Configure
your image) and Task 2 (Start the BigInsights components).
Task 1. Launching the text analytics Web Tooling module.
IBM BigInsights provides a Web Tooling module that makes text analytics easy.
In this task, you will see how to use the Web Tooling module to create a project,
and load some documents to start the analysis.
1. To open a new terminal, right-click the desktop, and then click Open in
Terminal.
You will review the set of files that you will be using for this exercise.
2. Navigate to /home/biadmin/labfiles/ta/WatsonData/Data/, and then type ls to
see the files.
These are sample blog files by IBM.
3. Launch Firefox, and then if necessary, navigate to the Ambari login page,
http://ibmclass.localdomain:8080.
4. Log in to the Ambari console as admin/admin, and ensure that all of the
components have started.
5. Click the Knox component, click Service Actions, and then click Start Demo
LDAP.
6. Click OK, and then when the Start Demo LDAP process is complete, click OK
again.
Leave the Ambari tab open in Firefox for Task 2 of this exercise.
7. To launch the BigInsights home page, open a new browser tab and type:
https://ibmclass.localdomain:8443/gateway/default/BigInsightsWeb/index.html
There is a bookmark saved on the toolbar. The ID and password are guest /
guest-password, but they are also saved for you in the lab environment.
Note: You may need to wait for a minute before the two links display (BigSheets
and Text Analytics).
8. Click the Text Analytics link to open up the Web Tooling module.
You are going to create a project and load in some documents to do a text
extraction for the Watson keyword.
14. Click the Remove Tags button in the upper right corner.
The goal is to find blogs about the Watson computer. The extractor should look
for the word Watson.
15. Click the New Literal button at the top of your Watson project pane.
A textbox appears in the project pane.
16. Type Watson in the textbox, and then press Enter to confirm.
17. Select the Watson literal, and then click Run Selected to run the
extractor.
Hint: You may need to scroll the project pane further to the right to see the rest
of the icons.
It will take a few seconds to run. All of the occurrences of the word Watson will
be highlighted in the Documents pane (on the right). You can see details of
each match in the Results pane (on the bottom).
The results appear as follows:
You will put some context around the word Watson, creating a dictionary of
terms frequently associated with the Watson computer.
18. Click New Dictionary in the middle pane, and then type PositiveClues in
the text box that appears in the canvas.
Now, you will want to peruse your documents for words. For example, Watson
is commonly associated with IBM. You will want to add this to your
PositiveClues dictionary.
19. Click on the PositiveClues extractor, and select Settings (this is located below
your canvas).
21. Add in computer, computing, solutions, and technology as positive clues for
the context around the word Watson.
22. To run the PositiveClues extractor, select it in the canvas, and then click Run
Selected.
The words will be highlighted. Review the results pane for details of the search.
You are going to combine the Watson extractor with the PositiveClues extractor
with a proximity rule so that when the clues appear within five words, you know
that the Watson keyword is of the correct context.
A new sequence is now created that matches each occurrence of Watson within
five tokens of one of your dictionary terms.
26. Click Sequence 1, and then in the Extractor Properties pane, click
Output.
27. Rename Sequence 1 to WatsonSpan.
28. Rename Literal_1 to Watson.
Do not rename PositiveClues.
29. Select the extractor, and then click Run Selected.
30. Click a few of the rows in the Results pane. You can see the occurrences of
Watson that appear together with the dictionary words.
This only extracts the word Watson with clues that come after it. If you want
the same context in front of the word Watson as well, what would you
need to do?
Similarly, you may have false positives, such as a person named Watson.
You would not want to include those in your search. What would you need to do
there?
The last two are left as open exercises for you to try on your own. This exercise
serves as an overview and just barely scratches the surface of text analytics
with WebTooling.
23. To run the script "ex1_huron.R" from within the console using the source()
function, type:
source("ex1_huron.R")
The ex1_huron.R script will generate output to the R console and it will create a
graph of the water level of Lake Huron over many years.
The results appear as follows:
where you supply the host, the user ID, and the password.
In the lab environment, LDAP has been turned off, so the user ID and the
password are ignored. Do not rely on this behavior in a production environment.
If you are unable to connect, make sure the Big R service is started. Try a
restart if it still does not work.
6. To verify that the connection was successful, type is.bigr.connected().
The results appear as follows:
Once connected, you will be able to browse the HDFS file system and examine
datasets that have already been loaded onto the cluster.
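For example, browsing can be done directly from the R console. The listing helper shown here is named as we recall it from the Big R product documentation; treat the exact function name and path as assumptions:

```r
# Assumed helper name from the Big R documentation; path is a placeholder
bigr.listfs("/user/biadmin")   # list datasets already loaded on the cluster
```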
12. This creates an object air of class bigr.frame; to check the class of the air
object, type class(air).
13. Examine the structure of the dataset.
Note that the output looks very similar to R's data.frames. The dataset has 29
variables (that is, columns). The first few values of each column are also
shown.
14. To examine the columns and see what they may possibly represent, type
str(air).
Notice that the column types are all "character" (abbreviated as "chr"). Unless
specified otherwise, Big R automatically assumes all data to be strings.
However, only columns Year (1), Month (2), UniqueCarrier (9), TailNum (11),
Origin (17), Dest (18), CancellationCode (23) are strings, while the rest are
numbers. You will assign the correct column types.
15. To build a vector that holds the column types for all columns, type:
ct <- ifelse(1:29 %in% c(1,2,9,11,17,18,23), "character", "integer")
print(ct)
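Once built, the type vector is typically assigned back to the frame. The coltypes() accessor shown here follows the Big R documentation as we recall it; treat the exact name as an assumption:

```r
# ct marks columns 1, 2, 9, 11, 17, 18, 23 as character, all others as integer
coltypes(air) <- ct
str(air)   # the numeric columns should now display with numeric types
```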
Summarizing columns (vectors) one by one will give you additional information.
In some cases, you will also visualize the information. The following statement
returns the distribution of flights by year. Again, you have 22 years' worth of
data. What you will see is a vector that has the year as the name and the
flight count as the value.
20. Type:
summary(air$Year)
21. To visualize the data using some of R's visualization capabilities to see the
same data distribution graphically, type barplot(summary(air$Year)).
The results appear as follows:
IBM BigInsights for Enterprise Management
Unit objectives
• List the advantages of using GPFS-FPO over HDFS
• Understand the benefits of the POSIX file system
• Describe the YARN architecture
• Understand the role of Platform Symphony
Unit objectives
Topic:
GPFS Overview
HDFS: architecture
• Master / Slave architecture
• NameNode
Manages the file system namespace and metadata
− FsImage
− EditLog
Regulates access to files by clients
• DataNode
Many DataNodes per cluster
Manages storage attached to the nodes
Periodically reports status to NameNode
• Data is stored across multiple nodes
• Nodes and components will fail, so for reliability data is replicated across
multiple nodes
(Diagram: the NameNode tracks File1's blocks a, b, c, d, which are stored and
replicated across the DataNodes.)
HDFS: architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single
NameNode, a master server that manages the file system namespace and regulates
access to files by clients. In addition, there are a number of DataNodes, usually one per
node in the cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction
log called the EditLog to persistently record every change that occurs to file system
metadata. For example, creating a new file in HDFS causes the NameNode to insert a
record into the EditLog indicating this. Similarly, changing the replication factor of a file
causes a new record to be inserted into the EditLog. The NameNode uses a file in its
local host OS file system to store the EditLog. The entire file system namespace,
including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode's local file system
too.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes. In this example, File 1 is a large file that is divided into (a, b, c, d) chunks.
Each chunk is replicated (by default 3 times) to 3 different nodes for data
resiliency. If a node goes down, the blocks on that node are re-replicated to surviving
nodes to re-establish the replication factor of 3. Having 3 copies of a block also allows
Hadoop to run a calculation on any 1 of the 3 different servers - whichever is least busy.
The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file
system's clients. The DataNodes also perform block creation, deletion, and replication
upon instruction from the NameNode.
(Diagram: rack awareness. DataNodes 1-4 sit in Rack 1, DataNodes 5-8 in Rack 2, and
DataNodes 9-12 in Rack 3. The NameNode's rack-aware metadata for file.txt places each
block's replicas across racks: block A on nodes 1, 5, 6; block B on nodes 5, 9, 10;
block C on nodes 9, 1, 2.)
"All user code that may potentially use the Hadoop Distributed File System should
be written to use a FileSystem object."
Source: hadoop.apache.org
IBM BigInsights for Enterprise Management © Copyright IBM Corporation 2015
*requires Enterprise Manager V4.1, which ships with GPFS V4.1.1 (July 2015)
Topic:
POSIX file system
HDFS:
hadoop fs -copyFromLocal /local/source/path /hdfs/target/path
GPFS/UNIX:
cp /source/path /target/path
There is no concept of a current working directory in HDFS.
HDFS:
hadoop fs -mv /always/absolute/path/to/file/that/can/be/really/long /always/absolute/path/to/file/that/can/be/also/really/long
GPFS/regular UNIX:
mv path1/ path2/
GPFS supports relative paths and current working directories.
HDFS:
diff <(hadoop fs -cat /path/to/file1) <(hadoop fs -cat /path/to/file2)
GPFS/regular UNIX:
diff path/to/file1 path/to/file2
(Diagram: with HDFS, raw data must first be copied from a traditional file system such
as ext4 into HDFS before Hadoop jobs can read it, and traditional applications cannot
access that data directly. With GPFS-FPO, applications write raw data directly to the
Hadoop path, and traditional applications have a direct-read path to the same data.)
Topic:
YARN overview
YARN architecture
(Diagram: Client 1 and Client 2 submit App1 and App2 to the Resource Manager. The
Scheduler allocates containers, each with a fixed share of a node's resources, for
example 3 GB of memory and 2 cores, on Node Managers across the cluster; App 1 and
App 2 each run several containers on different nodes.)
YARN architecture
YARN has two core components in its architecture:
• Resource Manager: the central agent and master of all cluster resources; it
plays a role similar to the classic JobTracker.
• Node Manager: the per-node agent that enforces resource limits on its node; it
is similar to a TaskTracker, but manages resources at a more granular
(container) level.
Details
• Core services for YARN are via long-running daemons:
Resource Manager (one per cluster)
Node managers (one per node)
Timeline server (stores application history)
• Node Managers launch and monitor containers on behalf of Application
Master
• A Container executes an application specific process within a constrained set
of resources (memory, CPU).
IOP calculates container defaults based on cluster resources such as # of
nodes, total memory, and number of cores available.
Virtual memory is supported within a container. Program permitted to
exceed memory limits of container up to a (default) factor of 2.1x (or 210%
of real memory in the container). For IOP, the default has been increased to
5x.
• For small jobs, additional containers may not be requested once the Application
Master is allocated, to avoid unnecessary overhead.
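The virtual-memory ceiling described above is simple arithmetic; the following R sketch uses illustrative container sizes only (the 2.1x and 5x ratios come from the text):

```r
# Virtual memory a container may consume before YARN terminates it,
# given its physical allocation and the vmem-to-pmem ratio
vmem_limit <- function(container_mb, ratio = 2.1) {
  container_mb * ratio
}

vmem_limit(1024)             # stock default ratio of 2.1x -> 2150.4 MB
vmem_limit(1024, ratio = 5)  # IOP default ratio of 5x    -> 5120 MB
```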
Details
cgroups (abbreviated from control groups) is a Linux kernel feature that limits,
accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, and
so on) of a collection of processes.
High availability
• Making Hadoop highly available has become a divide-and-conquer
problem
Provide HA for the resource manager
Provide HA for each application (on a per-application basis)
• Hadoop 2 supports HA for both resource manager and AM for map
reduce jobs.
• ResourceManager HA is similar to the NameNode QJM HA.
There is an active RM and a standby one
A group of ZooKeeper nodes determines which RM is active at any point
in time
Since Hadoop 2.6.0, RM restart is work-preserving. Applications running in
the cluster can keep running when the RM fails and a new instance takes
the active role
High availability
Topic:
Platform Symphony
YARN-Plugin
(Diagram: the standard Hadoop schedulers, FIFO, Fair, and Capacity, are replaced by
Platform Symphony EGO.)
Like other schedulers, queues and policies are defined in Platform Symphony EGO.
Platform Symphony makes life easier: it simplifies queue management and configuration.
With Platform Symphony, queues and queue hierarchies are easier to define
and manage through a web-based GUI. Resource lending policies are also
defined here.
Further, the cluster's behavior can be visualized to confirm that the resource sharing
policies are working as desired.
You can see, for example, that workload RED is allowed to use the whole cluster when
nothing else is running. But as soon as workload GREEN starts, RED must give up its
resources.
Further, when workload CYAN (blue) starts, both RED and GREEN need to give up
their resources. As soon as CYAN is complete and nothing else is running, workload
RED ramps up again. The result is very high cluster utilization.
YARN Web Console: Basic view into containers and memory used
This is a screen capture of the YARN web interface. It is fairly basic, but it is
worth showing so that you understand what you get out of the box. Performance
information largely relates to job status, memory usage, container usage, and so on.
• Help application owners improve the efficiency of applications.
Unit summary
• List the advantages of using GPFS-FPO over HDFS
• Understand the benefits of the POSIX file system
• Describe the YARN architecture
• Understand the role of Platform Symphony
Unit summary