Cloudera Developer Training Exercise Manual
General Notes
This course’s exercise environment provides a small cluster that allows you to practice
the concepts taught in the course in a realistic environment. This section will familiarize
you with the environment and provide tips on using it.
Depending on which version of the course you are taking, your setup may run entirely
or partially in the cloud. If there are multiple Starting the Exercise Environment
exercises, check with your instructor to make sure you run the correct one.
Environment Hosts
Get2Cluster Virtual Machine: A virtual machine (VM) that runs in the cloud or as a
VMware virtual machine on your local host machine. It provides the Gnome desktop
environment and is your entry point to the exercise environment.
gateway: This cluster node provides you with access to the Hadoop cluster. You will
log in to gateway to do most of your exercise steps. Connect using the menu item
Applications > Training > Connect to Gateway from the Get2Cluster virtual machine
desktop.
cmhost: This cluster node hosts Cloudera Manager, which installs, configures, and
monitors the services on the Hadoop cluster. You will only need to log in to this host
to launch your cluster. Connect using the menu item Applications > Training >
Connect to CM Host from the Get2Cluster virtual machine desktop.
master-1, master-2: The master nodes run the services that manage the Hadoop
cluster. You may visit the UIs for services running on these hosts using the web
browser on your VM, but you should not need to log in to them directly.
worker-1, worker-2, worker-3: The worker nodes execute the distributed tasks for
applications that run on the Hadoop cluster. You will not need to access these hosts
directly.
The course directory on the gateway host contains several subdirectories, including:
• data—contains the data files used in all the exercises. Usually you will upload the
files to Hadoop’s distributed file system (HDFS) before working with them.
• examples—contains example code and data presented in the chapter slides in the
course.
• scripts—contains the course setup scripts and other scripts required to complete
the exercises.
The dollar sign ($) at the beginning of each line indicates the Linux shell
prompt. The actual prompt will include additional information (for example,
training@gateway:~/training_materials$) but this is omitted from these
instructions for brevity.
The backslash (\) at the end of a line signifies that the command is not complete
and continues on the next line. You can enter the code exactly as shown (on multiple
lines), or you can enter it on a single line. If you do the latter, you should not type in
the backslash.
• The course setup script defines a few environment variables in your gateway host’s
command-line environment that are used in place of longer paths in the instructions.
Since each variable is automatically replaced with its corresponding value when
you run commands in the terminal, this makes it easier and faster for you to enter a
command:
◦ $DEVSH refers to the main course directory for the exercises (for example, the
course catch-up script is run as $DEVSH/scripts/catchup.sh).
◦ $DEVDATA refers to the directory containing the data files used in the exercises.
Use the echo command on the gateway host to see the value of an environment
variable:
$ echo $DEVSH
• Graphical editor
If you prefer a graphical text editor, use the gedit editor on your virtual machine (not
the gateway node). You can start gedit using an icon from the VM tool bar.
Note:
Graphical editors like gedit run on your local VM, but typically the files you need to
work with are located in the file system on the gateway host. For your convenience,
the exercise environment setup remotely mounts the gateway file system on the
local VM. The training user’s home directory on the VM (/home/training)
contains a training_materials directory that links to /home/training/
training_materials on the gateway host. So when you view or edit a file
in training_materials, you can access the same file on the local VM or in a
gateway terminal session.
Bonus Exercises
There are additional challenges for some of the hands-on exercises. If you finish the
main exercise, please attempt the additional steps.
Catch-Up Script
If you are unable to complete an exercise, there is a script to catch you up automatically.
Each exercise has instructions for running the catch-up script if the exercise depends on
completion of prior exercises.
$ $DEVSH/scripts/catchup.sh
The script will prompt you for the exercise that you are starting; it will then set up all
the required data as if you had completed all of the previous exercises.
Note: If you run the catch-up script, you may lose your work. For example, all exercise
data will be deleted from HDFS before uploading the required files.
Troubleshooting
If you have trouble or unexpected behavior in the exercise environment, refer to the
Troubleshooting Tips section at the end of the exercise manual.
Exercise Instructions
Before You Start
Before you start, verify the following:
• If you are taking a custom course, be sure you have the custom course code provided
by your instructor.
2. If the VM .zip file is not yet downloaded to your local host, do so now using the
link provided by your instructor.
3. Unzip the VM file into a directory of your choice. On most systems, you can do this
by simply double-clicking the file.
4. Start the VM. On most systems, you can do this by double-clicking the .vmx file
(such as Cloudera-Training-NGEE-Get2Cluster-VM.vmx) in the unzipped
VM directory. You can also launch the VMware application and load the same file
using the File menu.
When the VM has started, you will automatically be logged in as the user
training, and you will see the VM’s desktop.
5. Copy the Hands-On Exercise Manual (the document you are currently viewing) onto
your Get2Cluster VM desktop and use the VM’s PDF viewer to view it. To do this,
drag the PDF file from your local host’s file browser to the VM window, then double-
click the PDF file on the desktop to open it in a viewer running on the VM. This
allows you to cut and paste commands directly from the PDF into your VM window.
(Some PDF viewers on the local host do not handle cutting and pasting multi-line
commands properly.)
6. Configure the Get2Cluster VM’s network access to the CM Host by selecting the VM’s
Applications menu, and then choosing Training > Configure Hosts Files.
a. When prompted, enter the public IP address for the CM host provided by your
instructor. (IP addresses, such as 54.219.180.24, consist of four numbers
separated by dots.)
b. When prompted, verify that the IP address you entered is correct, then enter y.
Note: The script may take up to five minutes to run. Please allow the script time to
complete.
7. From the VM’s Applications menu, choose Training > Start Proxy Server. This
starts a proxy process that will allow you to access web pages hosted on your
cluster using the web browser on the Get2Cluster VM.
This will open a terminal window with the title proxy. Leave the terminal process
running and minimize the window.
Note: The proxy process must remain running throughout the day. Do not close the
terminal window or exit the process until the end of the day. If you do exit the
process, you will not be able to access web UI pages hosted on the cluster from the
Get2Cluster web browser.
If you accidentally stop the proxy server, restart it following the step above. You will
also need to restart the proxy server if you lose your connection to the internet at any
point during the class. (A possible indication that the proxy has stopped working is if
the terminal no longer displays the "proxy" title in the desktop menu bar.)
8. Open the Firefox browser on the Get2Cluster VM and click on the Cloudera
Manager bookmark. Log in as user admin with password admin.
You may see either of the following indications that the cluster is still starting up:
• The Running Commands icon (a paper scroll) in the upper right corner of the
Cloudera Manager web UI shows the number 1.
• One or more cluster services appears with a red dot indicating an unhealthy
status.
If the Cloudera Manager web UI displays any of these indicators, it means that the
cluster is still starting. Wait a few minutes and reload the page.
9. When Cloudera Manager is fully started and running correctly, the start-up
indicators above will be cleared and the Cloudera Management Service will show a
healthy status, indicated by a green dot next to it.
a. If all the start-up indicators are cleared, but the Cloudera Management Service
still does not have a healthy status, restart the service manually in a CM Host
terminal session.
i. Start a new terminal session connected to the CM host: from the VM’s
Applications menu, select Training > Connect to CM Host. This will open
a new terminal window titled training@cmhost with a session running on
the CM host.
ii. In the CM host terminal window, restart the Cloudera Management Service.
$ ~/config/reset-cm.sh
The service should show a healthy status (green dot icon) after a few moments.
10. Start a new terminal session connected to the CM host: from the VM’s Applications
menu, select Training > Connect to CM Host. This will open a new terminal
window titled training@cmhost with a session running on the CM host.
11. In the CM host terminal session, run the command below to create and launch a
new cluster.
$ ~/create-cluster.sh
a. When prompted for a name for your new cluster, enter your last name. The
name of the cluster can be up to 20 characters, and should include only letters
and/or numbers. Do not include any spaces, punctuation, or special characters.
Note: If you need to rebuild a cluster for any reason, choose a different cluster
name than the one you used the first time by adding 2 to your original cluster
name.
b. When prompted, choose the number corresponding to the course you are
taking.
• If you are taking a regular course (not a custom course) choose the number
for DevSH.
• If you are taking a custom course, select Custom, enter the custom course
code (three numbers separated by dots, such as 32.9.27) provided by your
instructor.
Note: It is important to choose the correct course, so that the correct files will
be downloaded and the correct configurations will be applied to the cluster.
After you choose your course, the script will continue and will take 15 to 30 minutes
to complete. It is important to leave the script running uninterrupted.
Because the script takes a while to complete, your instructor may proceed with the
course while it runs. Before proceeding with the next exercise, be sure to return to the
Verify Your Cluster section below to make sure that your cluster is running correctly. If
the cluster is not running correctly, refer to the Troubleshooting Tips section at the end
of the Exercise Manual, or ask your instructor for help.
12. Review the status of services running on the cluster in the Get2Cluster VM’s web
browser using the Cloudera Manager bookmark.
Log in as user admin with password admin. Make sure that all the services in your
cluster are healthy (indicated by a green dot), as shown below. You may disregard
yellow configuration warning icons.
13. Open a terminal window with a remote connection to the gateway host.
There are three ways to start a gateway terminal session from the Get2Cluster
desktop. Choose whichever you prefer:
• Double-click the Connect to Gateway icon on the VM desktop.
• From the VM’s Applications menu, select Training > Connect to Gateway.
• Open a local terminal window on the VM, then use this command:
$ connect_to_gateway.sh
Any of these steps will open a new window titled training@gateway running a
session on the gateway host.
Note: You will need to start a gateway terminal session several times during the
course. You can repeat this step at any time to create a new terminal window
running a gateway session. You can close a terminal window by clicking the X in the
upper right-hand corner of the window.
14. In the gateway terminal session, verify that the course materials have been
installed by listing the course directory:
$ ls ~/training_materials/devsh
The directory should exist and contain several subdirectories including data and
exercises.
15. From the Get2Cluster VM desktop, select the Applications menu, then select
Training > Stop Cluster to stop your cluster. When prompted if you are sure you
want to stop the cluster, confirm by entering Y.
This will stop all the VMs except for CM host and Get2Cluster.
16. After the cluster stops, perform the following additional steps on the Get2Cluster
VM:
• Exit all open terminal windows including the proxy terminal window.
• Exit all open browser sessions.
• Suspend or stop the Get2Cluster VM.
Restart Your Cluster and Verify Cluster Health
At the start of the second day of class, and each day of class thereafter, restart your
cluster.
17. Restart your Get2Cluster virtual machine if necessary. Exit any terminal windows
or browser sessions still open on the VM, including the proxy terminal window.
18. On the Get2Cluster VM desktop menu bar, select the Applications menu and
choose Training > Start Cluster to restart your cluster.
19. After the cluster restarts, restart the web proxy server on the VM desktop menu
by selecting Applications > Training > Start Proxy Server. Minimize the proxy
terminal window and leave it running for the rest of the day.
20. Launch Firefox on your VM and click on the Cloudera Manager bookmark. Log in
to Cloudera Manager with the username admin and password admin.
21. Some health warnings may appear as the cluster is starting. They will typically
resolve within a few minutes. If you see any remaining health issues after five
minutes, they may be due to clock offset issues. Try resolving the issues following
these steps.
a. Click All Health Issues and then click on Organize by Health Test.
b. Check if there are Clock Offset issues. If there are, open a new CM terminal
window on your VM (Applications > Training > Connect to CM Host) and run
the following command on CM host:
$ ~/config/reset-clocks.sh
Additional troubleshooting tips, if needed, are documented in the appendix. The tip
entitled "Cloudera Manager displays unhealthy status of services" offers solutions
to any other health issues your cluster may experience.
Exercise Instructions
Before You Start
Before you start, verify the following:
• You have a URL from the instructor that provides access to a Get2Cluster virtual
machine running in the cloud.
• If you are taking a custom course, be sure you have the custom course code provided
by your instructor.
1. Open the URL provided by your instructor in your browser.
2. You should see a page that shows the Get2Cluster VM. If it is not already running,
click the play button (triangle icon) to start it.
3. After the Get2Cluster VM has started, click on the desktop thumbnail view to access
the desktop.
The Get2Cluster desktop will open in a new browser tab. You will automatically be
logged in as the user training.
4. In order for cut-and-paste to work correctly, you should view the course Exercise
Manual (the document you are currently viewing) on your Get2Cluster VM rather
than on your local host machine.
a. On your Get2Cluster VM, start the Firefox browser. The default page will display
the Cloudera University home page. (You can return to this page at any time by
clicking the home icon in Firefox.)
b. Log in to your Cloudera University account and from the Dashboard find this
course under Current.
c. Select the course title, then click to download the Exercise Manual under
Materials.
d. Open a local terminal window on the VM and start the evince PDF viewer:
$ evince &
e. Select menu item File > Open and open the Exercise Manual PDF file in the
Downloads directory.
5. Configure the Get2Cluster VM’s network access to the CM Host by selecting the VM’s
Applications menu, and then choosing Training > Configure Hosts Files.
a. When prompted, enter the public IP address for the CM host provided by your
instructor. (IP addresses consist of four numbers separated by dots, such as
54.219.180.24.)
b. When prompted, verify that the IP address you entered is correct, then enter y.
Note: The script may take up to five minutes to run. Please allow the script time to
complete.
6. From the VM’s Applications menu, choose Training > Start Proxy Server. This
starts a proxy process that will allow you to access web pages hosted on your
cluster using the web browser on the Get2Cluster VM.
This will open a terminal window with the title proxy. Leave the terminal process
running and minimize the window.
Note: The proxy process must remain running throughout the day. Do not close the
terminal window or exit the process until the end of the day. If you do exit the
process, you will not be able to access web UI pages hosted on the cluster from the
Get2Cluster web browser.
If you accidentally stop the proxy server, restart it following the step above. You
will also need to restart the proxy server if you lose your connection to the internet at
any point during the class. (A possible indication that the proxy has stopped working
is if the terminal no longer displays the "proxy" title in the desktop menu bar.)
7. Open the Firefox browser on the Get2Cluster VM and click on the Cloudera
Manager bookmark. Log in as user admin with password admin.
You may see either of the following indications that the cluster is still starting up:
• The Running Commands icon (a paper scroll) in the upper right corner of the
Cloudera Manager web UI shows the number 1.
• One or more cluster services appears with a red dot indicating an unhealthy
status.
If the Cloudera Manager web UI displays any of these indicators, it means that the
cluster is still starting. Wait a few minutes and reload the page.
8. When Cloudera Manager is fully started and running correctly, the start-up
indicators above will be cleared and the Cloudera Management Service will show a
healthy status, indicated by a green dot next to it.
a. If all the start-up indicators are cleared, but the Cloudera Management Service
still does not have a healthy status, restart the service manually in a CM Host
terminal session.
i. Start a new terminal session connected to the CM host: from the VM’s
Applications menu, select Training > Connect to CM Host. This will open
a new terminal window titled training@cmhost with a session running on
the CM host.
ii. In the CM host terminal window, restart the Cloudera Management Service.
$ ~/config/reset-cm.sh
The service should show a healthy status (green dot icon) after a few moments.
9. Start a new terminal session connected to the CM host: from the VM’s Applications
menu, select Training > Connect to CM Host. This will open a new terminal
window titled training@cmhost with a session running on the CM host.
10. In the CM host terminal session, run the command below to create and start a new
cluster.
$ ~/create-cluster.sh
a. When prompted for a name for your new cluster, enter your last name. The
name of the cluster can be up to 20 characters, and should include only letters
and/or numbers. Do not include any spaces, punctuation, or special characters.
Note: If you need to rebuild a cluster for any reason, choose a different cluster
name than the one you used the first time by adding 2 to your original cluster
name.
b. When prompted, choose the number corresponding to the course you are
taking.
• If you are taking a regular course (not a custom course) choose the number
for DevSH.
• If you are taking a custom course, select Custom, enter the custom course
code (three numbers separated by dots, such as 32.9.27) provided by your
instructor.
Note: It is important to choose the correct course, so that the correct files will
be downloaded and the correct configurations will be applied to the cluster.
After you choose your course, the script will continue and will take 15 to 30 minutes
to complete. It is important to leave the script running uninterrupted.
Because the script takes a while to complete, your instructor may proceed with the
course while it runs. Before proceeding with the next exercise, be sure to return to the
Verify Your Cluster section below to make sure that your cluster is running correctly. If
the cluster is not running correctly, refer to the Troubleshooting Tips section at the end
of the Exercise Manual, or ask your instructor for help.
11. Review the status of services running on the cluster in the Get2Cluster VM’s web
browser using the Cloudera Manager bookmark.
Log in as user admin with password admin. Make sure that all the services in your
cluster are healthy (indicated by a green dot), as shown below. You may disregard
yellow configuration warning icons.
12. Open a terminal window with a remote connection to the gateway host.
There are three ways to start a gateway terminal session from the Get2Cluster
desktop. Choose whichever you prefer:
• Double-click the Connect to Gateway icon on the VM desktop.
• From the VM’s Applications menu, select Training > Connect to Gateway.
• Open a local terminal window on the VM, then use this command:
$ connect_to_gateway.sh
Any of these steps will open a new window titled training@gateway running a
session on the gateway host.
Note: You will need to start a gateway terminal session several times during the
course. You can repeat this step at any time to create a new terminal window
running a gateway session. You can close a terminal window by clicking the X in the
upper right-hand corner of the window.
13. In the gateway terminal session, verify that the course materials have been
installed by listing the course directory:
$ ls ~/training_materials/devsh
The directory should exist and contain several subdirectories including data and
exercises.
14. From the Get2Cluster VM desktop, select the Applications menu, then select
Training > Stop Cluster to stop your cluster. When prompted if you are sure you
want to stop the cluster, confirm by entering Y.
This will stop all the VMs except for CM host and Get2Cluster.
15. After the cluster stops, perform the following additional steps on the Get2Cluster
VM:
• Exit all open terminal windows including the proxy terminal window.
• Exit all open browser sessions.
• Suspend or stop the Get2Cluster VM.
Restart Your Cluster and Verify Cluster Health
At the start of the second day of class, and each day of class thereafter, restart your
cluster.
16. Return to the URL provided at the start of class by your instructor and restart your
Get2Cluster virtual machine if necessary. Exit any terminal windows or browser
sessions still open on the VM, including the proxy terminal window.
17. On the Get2Cluster VM desktop menu bar, select the Applications menu and
choose Training > Start Cluster to restart your cluster.
18. After the cluster restarts, restart the web proxy server on the VM desktop menu
by selecting Applications > Training > Start Proxy Server. Minimize the proxy
terminal window and leave it running for the rest of the day.
19. Launch Firefox on your VM and click on the Cloudera Manager bookmark. Log in
to Cloudera Manager with the username admin and password admin.
20. Some health warnings may appear as the cluster is starting. They will typically
resolve within a few minutes. If you see any remaining health issues after five
minutes, they may be due to clock offset issues. Try resolving the issues following
these steps.
a. Click All Health Issues and then click on Organize by Health Test.
b. Check if there are Clock Offset issues. If there are, open a new CM terminal
window on your VM (Applications > Training > Connect to CM Host) and run
the following command on CM host:
$ ~/config/reset-clocks.sh
Additional troubleshooting tips, if needed, are documented in the appendix. The tip
entitled "Cloudera Manager displays unhealthy status of services" offers solutions
to any other health issues your cluster may experience.
Exercise Instructions
Start the Virtual Machines
1. Open the URL provided by your instructor in your browser. If you are an
OnDemand student, simply click the Open button at the top of any exercise unit in
the course.
2. You should see a web page that shows eight virtual machines (VMs). They are the
Get2Cluster VM, as well as cmhost and all the other cluster VMs.
You will need to have all eight of the VMs running. If any of them are not already
running, click the play button (triangle icon) in the top left corner of the VMs tab to
start all the VMs at once.
3. Your entry point to your exercise environment is the Get2Cluster VM. After the
Get2Cluster VM is started, click on the desktop thumbnail view to access the
desktop.
The Get2Cluster desktop will open in a new browser tab. You will automatically be
logged in as the user training.
4. In order for cut-and-paste to work correctly, you should view the course Exercise
Manual (the document you are currently viewing) on your Get2Cluster VM rather
than on your local host machine. If you are an OnDemand student, skip this step.
a. On your Get2Cluster VM, start the Firefox browser. The default page will display
the Cloudera University home page. (You can return to this page at any time by
clicking the home icon in Firefox.)
b. Log in to your Cloudera University account and from the Dashboard find this
course under Current.
c. Select the course title, then click to download the Exercise Manual under
Materials.
d. Open a local terminal window on the VM and start the evince PDF viewer:
$ evince &
e. Select menu item File > Open and open the Exercise Manual PDF file in the
Downloads directory.
5. From the Get2Cluster desktop’s Applications menu, choose Training > Start
Cluster. A terminal window will open and a script will run in it.
6. Wait for the script in the terminal window to finish before continuing.
7. Review the status of services running on the cluster in the Get2Cluster VM’s web
browser using the Cloudera Manager bookmark.
Log in as user admin with password admin.
Some health warnings may appear as the cluster is starting. They will typically resolve
themselves once at least five minutes have passed since the Start Cluster terminal
window process finished. Please be patient.
Verify that all the services in your cluster are healthy (indicated by a green dot), as
shown below. You may disregard yellow configuration warning icons.
8. Open a terminal window with a remote connection to the gateway host. There are
three ways to start a gateway terminal session from the Get2Cluster desktop.
Choose whichever you prefer:
• Double-click the Connect to Gateway icon on the VM desktop.
• From the VM’s Applications menu, select Training > Connect to Gateway.
• Open a local terminal window on the VM, then use this command:
$ connect_to_gateway.sh
Any of these steps will open a new window titled training@gateway running a
session on the gateway host.
Note: You will need to start a gateway terminal session several times during the
course. You can repeat this step at any time to create a new terminal window
running a gateway session. You can close a terminal window by clicking the X in the
upper right-hand corner of the window.
9. In the gateway terminal session, verify that the course materials have been
installed by listing the course directory:
$ ls ~/training_materials/devsh
The directory should exist and contain several subdirectories including data and
exercises.
10. In the Get2Cluster VM desktop, exit all open terminal windows. Also exit all open
browser sessions. If you happen to have used any of the other desktop interfaces
(such as cmhost or gateway), exit all terminal and browser windows in those
desktops as well.
11. Return to the URL provided to you by the instructor at the start of class.
12. Click on the stop icon or the pause icon at the top of the VMs tab to stop or suspend
all eight VMs.
Note: If you happen to leave the VMs running, the VMs may be suspended after a few
hours in which no keyboard or mouse interaction with the cluster is detected.
13. Return to the URL provided by your instructor at the beginning of class. If you are
an OnDemand student, simply click the Open button at the top of any exercise unit
in the course.
14. Start all eight VMs if they are not currently running.
15. Open the Get2Cluster VM desktop and exit any terminal windows or browser
sessions still open in the Get2Cluster desktop.
16. From the Get2Cluster VM desktop, select the Applications menu and choose
Training > Start Cluster to restart your cluster daemons and services.
18. Launch Firefox inside your Get2Cluster desktop and click on the Cloudera
Manager bookmark. Log in to Cloudera Manager with the username admin and
password admin.
19. Some health warnings may appear as the cluster is starting. They will typically
resolve themselves once at least five minutes have passed since the Start Cluster
terminal window process finished. Please be patient.
If you see any remaining health issues after five minutes, they may be due to clock
offset issues. Try resolving the issues following these steps.
a. Click All Health Issues and then click on Organize by Health Test.
b. Check if there are Clock Offset issues. If there are, open a new CM terminal
window on your VM (Applications > Training > Connect to CM Host) and run
the following command on CM host:
$ ~/config/reset-clocks.sh
Additional troubleshooting tips, if needed, are documented in the appendix. The tip
entitled "Cloudera Manager displays unhealthy status of services" offers solutions
to any other health issues your cluster may experience.
In this exercise, you will use the Hue Impala Query Editor to explore data in a
Hadoop cluster.
This exercise is intended to let you begin to familiarize yourself with the course
exercise environment as well as Hue. You will also briefly explore the Impala Query
Editor.
Before starting this exercise, return to the instructions for the Starting the Exercise
Environment exercise. Make sure that your exercise environment cluster is running
correctly following the steps in the Verify Your Cluster section.
2. View the Hue UI. Your Hue user name should be training, with password training
(which is the same login information as on your cluster hosts).
Note: Make sure to use this exact username and password. Your exercise environment
is configured with a system user called training and your Hue username must
match. If you accidentally use the wrong username, refer to the instructions in the
Troubleshooting Tips section at the end of the exercise.
a. In the Firefox browser on the Get2Cluster VM, click the Hue bookmark to open
the Hue login page.
b. Because this is the first time anyone has logged into Hue on this server, you
will be prompted to create a new user account. Enter username training and
password training, and then click Create account. (If prompted you may click
Remember.)
Hue Warnings
You may see warnings on the Hue home screen about disabled
or misconfigured services. You can disregard these; the noted
services are not required for these exercises.
3. In Hue, open the Impala Query Editor.
4. In the left panel under the default database, select the accounts table. This will
display the table’s column definitions.
• Note: There are several columns defined. If you do not see them all, try resizing
the Firefox window.
5. Hover your pointer over the accounts table to reveal the associated Show details
icon (labeled i), as shown below.
Click the icon to bring up the details popup, and select the Sample tab. The tab will
display the first several rows of data in the table. When you are done viewing the
data, click the X in the upper right corner of the popup to close it.
6. In the main panel in the query text box, enter a SQL query like the one below:
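For example (any simple query on the accounts table will do):
SELECT acct_num, first_name, last_name FROM accounts LIMIT 10;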
7. Click the Execute button (the blue “play” symbol) to execute the command.
8. To see results, view the Results tab below the query area.
9. Optional: If you have extra time, continue exploring the Impala Query Editor on
your own. For instance, try selecting other tabs after viewing the results.
In this exercise, you will practice working with HDFS, the Hadoop Distributed File
System.
You will use the HDFS command line tool and the Hue File Browser web-based interface
to manipulate files in HDFS.
1. Open a terminal window with an SSH session to the cluster gateway node by
double-clicking the Connect to Gateway icon on your Virtual Machine desktop or
selecting Applications > Training > Connect to Gateway from the VM menu bar.
2. In the gateway terminal session, use the HDFS command line to list the content of
the HDFS root directory using the following command:
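$ hdfs dfs -ls /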
There will be multiple entries, one of which is /user. Each user has a “home”
directory under this directory, named after their username; your username in this
course is training, therefore your home directory is /user/training.
Relative Paths
In HDFS, relative (non-absolute) paths are considered relative
to your home directory. There is no concept of a “current” or
“working” directory as there is in Linux and similar filesystems.
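For example, listing your (currently empty) HDFS home directory:
$ hdfs dfs -ls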
There are no files yet, so the command silently exits. Compare this to the behavior
if you tried to view a nonexistent directory, such as hdfs dfs -ls /foo, which
would display an error message.
Note that the directory structure in HDFS has nothing to do with the directory
structure of the local filesystem; they are completely separate namespaces.
5. Start by creating a new top-level directory for exercises. You will use this directory
throughout the rest of the course.
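For example (the /loudacre directory is used throughout the rest of these exercises):
$ hdfs dfs -mkdir /loudacre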
6. Change directories to the Linux local filesystem directory containing the sample
data we will be using in the course.
$ cd $DEVDATA
If you perform a regular Linux ls command in this directory, you will see several
files and directories that will be used in this course. One of the data directories
is kb. This directory holds Knowledge Base articles that are part of Loudacre’s
customer service website.
This copies the local kb directory and its contents into a remote HDFS directory
named /loudacre/kb.
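For reference, the upload and a follow-up listing look like this (run the put command
from the $DEVDATA directory, where the kb directory is located):
$ hdfs dfs -put kb /loudacre/
$ hdfs dfs -ls /loudacre/kb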
You should see the KB articles that were in the local directory.
9. Practice uploading a directory, confirm the upload, and then remove it, as it is not
actually needed for the exercises.
This prints the first 20 lines of the article to your terminal. This command is handy
for viewing text data in HDFS. An individual file is often very large, making it
inconvenient to view the entire file in the terminal. For this reason, it is often a good
idea to pipe the output of the dfs -cat command into head, more, or less. You
can also use hdfs dfs -tail to more efficiently view the end of the file, rather
than piping the whole content.
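For example (substitute the name of one of the uploaded KB article files for
<article-file>):
$ hdfs dfs -cat /loudacre/kb/<article-file> | head -n 20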
11. To download a file to work with on the local filesystem use the
hdfs dfs -get command. This command takes two arguments: an HDFS path
and a local Linux path. It copies the HDFS contents into the local filesystem:
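For example (again substituting an actual KB article file name, and then viewing the
downloaded copy with less):
$ hdfs dfs -get /loudacre/kb/<article-file> ~/article.html
$ less ~/article.html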
Enter the letter q to quit the less command after reviewing the downloaded file.
$ hdfs dfs
You see a help message describing all the filesystem commands provided by HDFS.
Try playing around with a few of these commands if you like.
Use the Hue File Browser to Browse, View, and Manage Files
13. In Firefox, visit Hue by clicking the Hue bookmark, or by going to the URL
http://master-1:8888/.
14. If your prior session has expired, re-log in using the login credentials you created
earlier: username training and password training.
15. To access HDFS, click File Browser in the Hue menu bar.
• Note: If your Firefox window is too small to display the full menu names, you will
see just the icons instead. (The mouse-over text is “HDFS Browser”.)
16. By default, the contents of your HDFS home directory (/user/training) are
displayed. In the directory path name, click the leading slash (/) to view the HDFS
root directory.
17. The contents of the root directory are displayed, including the loudacre directory
you created earlier. Click that directory to see the contents.
18. Click the name of the kb directory to see the Knowledge Base articles you uploaded.
19. Click the checkbox next to any of the files, and then click the Actions button to see a
list of actions that can be performed on the selected file(s).
20. View the contents of one of the files by clicking on the name of the file.
• Note: In the file viewer, the contents of the file are displayed on the right. In
this case, the file is fairly small, but typical files in HDFS are very large, so rather
than displaying the entire contents on one screen, Hue provides buttons to move
between pages.
21. Return to the directory view by clicking View file location in the Actions panel on
the left.
For your convenience, the exercise environment setup remotely mounts the
gateway file system on the local VM. The training user’s home directory on the VM
(/home/training) contains a training_materials link that links to /home/
training/training_materials on the gateway host. This allows you to use
Hue to upload a file in the training_materials directory on the gateway host,
by browsing to the training_materials link on the local VM.
a. Click the Upload button on the right. You can choose to upload a
plain file, or to upload a zipped file (which will automatically be
unzipped after upload). In this case, select Files, then click Select Files.
c. Confirm that the file was correctly uploaded into the current directory.
24. Optional: Explore the various file actions available. When you have finished, select
any additional files you have uploaded and click the Move to trash button to
delete. (Do not delete base_stations.parquet; that file will be used in later
exercises.)
In this exercise, you will submit an application to the YARN cluster, and monitor
the application using both the Hue Job Browser and the YARN Web UI.
The application you will run is provided for you. It is a simple Spark application
written in Python that counts the occurrence of words in Loudacre’s customer service
Knowledge Base (which you uploaded in a previous exercise). The focus of this exercise
is not on what the application does, but on how YARN distributes tasks in a job across a
cluster, and how to monitor an application and view its log files.
Important: This exercise depends on a previous exercise: “Access HDFS with the
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Take note of the values in the Cluster Metrics section, which displays information
such as the number of applications running currently, previously run, or waiting to
run; the amount of memory used and available; and how many worker nodes are in
the cluster.
3. Click the Nodes link in the Cluster menu on the left. The bottom section will display
a list of worker nodes in the cluster.
7. In your gateway session, run the example wordcount.py program on the YARN
cluster to count the frequency of words in the Knowledge Base file set:
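A representative invocation is shown below; the wordcount.py location and the input
path are assumptions, so use the paths given in your course files or by your instructor:
$ spark2-submit $DEVSH/exercises/yarn/wordcount.py /loudacre/kb/*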
9. The Job Browser displays a list of currently running and recently completed
applications. (If you don’t see the application you just started, wait a few seconds,
the page will automatically reload; it can take some time for the application to be
accepted and start running.) Review the entry for the current job.
This page allows you to click the application ID to see details of the running
application, or to kill a running job. (Do not do that now though!)
10. Reload the YARN RM page in Firefox. Notice that the application you just started is
displayed in the list of applications in the bottom section of the RM home page.
12. Select the node HTTP address link for worker-1 to open the Node Manager UI on
that node.
13. Now that an application is running, you can click List of Applications to see the
application you submitted.
This will display the containers the Resource Manager has allocated on the selected
node for the current application. (No containers will show if no applications are
running; if you missed it because the application completed, you can run the
application again. In the terminal window, use the up arrow key to recall previous
commands.)
If your application is still running, you should see it listed, including the application
ID (such as application_1469799128160_0001), the application name
(PythonWordCount), the type (SPARK), and so on.
If there are no applications on the list, your application has probably finished
running. By default, only current applications are included. Use the -appStates
ALL option to include all applications in the list:
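$ yarn application -list -appStates ALL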
1. In the web browser on your VM, go to the Cloudera Manager UI using the provided
bookmark.
2. Log into Cloudera Manager with the username admin and password admin.
3. On the Cloudera Manager home page, open the Clusters menu and select YARN
Applications.
Applications that are currently running or have recently run are shown. Confirm
that the application you ran above is displayed in the list. (If your application has
completed, you can restart it to explore the CM Applications manager working with
a running application.)
5. Optional: Continue exploring the CM YARN applications manager. For example, try
the Collect Diagnostics button, or other action items available in the drop-down
menu shown to the right of each application.
In this exercise, you will use the Spark shell to work with DataFrames.
You will start by viewing and bookmarking the Spark documentation in your browser.
Then you will start the Spark shell and read a simple JSON file into a DataFrame.
Important: This exercise depends on a previous exercise: “Access HDFS with Command
Line and Hue.” If you did not complete that exercise, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. From the Programming Guides menu, select the guide covering SQL, DataFrames,
and Datasets.
Briefly review the guide and bookmark the page for later review.
3. From the API Docs menu, select either Scala or Python, depending on your
language preference. Bookmark the API page for use during class. Later exercises
will refer you to this documentation.
4. If you are viewing the Scala API, notice that the package names are displayed on
the left. Use the search box or scroll down to find the org.apache.spark.sql
package. This package contains most of the classes and objects you will be working
with in this course. In particular, note the Dataset class. Although this exercise
focuses on DataFrames, remember that DataFrames are simply an alias for Datasets
of Row objects. So all the DataFrame operations you will practice using in this
exercise are documented on the Dataset class.
5. If you are viewing the Python API, locate the pyspark.sql module. This module
contains most of the classes you will be working with in this course. At the top are
some of the key classes in the module. View the API for the DataFrame class; these
are the operations you will practice using in this exercise.
6. If you don’t already have a terminal window connected to the gateway node,
start one now, using the desktop Connect to Gateway icon on your VM or the
Applications > Training > Connect to Gateway item from the VM menu bar.
7. In the terminal window, start the Spark 2 shell. Start either the Python shell or the
Scala shell, not both.
To start the Python shell, use the pyspark2 command.
$ pyspark2
To start the Scala shell, use the spark2-shell command.
$ spark2-shell
You may get several WARN messages, which you can disregard.
8. Spark creates a SparkSession object for you called spark. Make sure the object
exists. Use the first command below if you are using Python, and the second one if
you are using Scala. (You only need to complete the exercises in Python or Scala.)
pyspark> spark
scala> spark
Python will display information about the spark object such as:
<pyspark.sql.session.SparkSession at address>
9. Using command completion, you can see all the available Spark session methods:
type spark. (spark followed by a dot) and then the TAB key.
Note: You can exit the Scala shell by typing sys.exit. To exit the Python shell,
press Ctrl+D or type exit. However, stay in the shell for now to complete the
remainder of this exercise.
11. View the simple JSON file you will be using. Use the less command or a text
editor in a separate window (not the Spark shell) to view the file without editing
it. The file is located at $DEVDATA/devices.json. It contains a record for each of
Loudacre’s supported devices. For example:
{"devnum":1,"release_dt":"2008-10-21T00:00:00.000-07:00",
"make":"Sorrento","model":"F00L","dev_type":"phone"}
Notice the field names and types of values in the first few records.
13. In the Spark shell, create a new DataFrame based on the devices.json file in
HDFS.
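For example, in Python (assuming the file has been uploaded to HDFS as
/loudacre/devices.json):
pyspark> devDF = spark.read.json("/loudacre/devices.json")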
14. Spark has not yet read the data in the file, but it has scanned the file to infer the
schema. View the schema, and note that the column names match the record field
names in the JSON file.
pyspark> devDF.printSchema()
scala> devDF.printSchema
15. Display the data in the DataFrame using the show function. If you don’t pass an
argument to show, Spark will display the first 20 rows in the DataFrame. For this
step, display the first five rows. Note that the data is displayed in tabular form,
using the column names defined in the schema.
> devDF.show(5)
Note: Like many Spark queries, this command is the same whether you are using
Scala or Python.
16. The show and printSchema operations are actions—that is, they return a value
from the distributed DataFrame to the Spark driver. Both functions display the data
in a nicely formatted table. These functions are intended for interactive use in the
shell, but they do not allow you to actually work with the data that is returned. Try
using the take action instead, which returns an array (Scala) or list (Python) of Row
objects. You can display the data by iterating through the collection.
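For example, in Python:
pyspark> for row in devDF.take(5): print(row)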
Query a DataFrame
17. Use the count action to return the number of items in the DataFrame.
> devDF.count()
18. Create a new DataFrame called makeModelDF by selecting only the make and
model columns, then display its schema. Note that only the selected columns are in
the schema.
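For example, in Python (a sketch using the column names from the devices.json data):
pyspark> makeModelDF = devDF.select("make","model")
pyspark> makeModelDF.printSchema()
Then view the data in the new DataFrame: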
pyspark> makeModelDF.show()
scala> makeModelDF.show
pyspark> devDF.select("devnum","make","model"). \
where("make = 'Ronin'"). \
show()
scala> devDF.select("devnum","make","model").
where("make = 'Ronin'").
show
In this exercise, you will work with structured account and mobile device data
using DataFrames.
You will practice creating and saving DataFrames using different types of data sources,
and inferring and defining schemas.
Important: This exercise depends on a previous exercise: “Exploring DataFrames Using
the Spark Shell.” If you did not complete that exercise, run the course catch-up script
and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. If you don’t have one already, open a terminal session to the gateway node, and
start the Spark 2 shell (either Scala or Python, as you prefer).
3. Create a DataFrame called accountsDF based on the Hive accounts table.
4. Print the schema and the first few rows of the DataFrame, and note that the schema
and data are the same as the Hive table.
5. Create a new DataFrame with rows from the accounts data where the zip code is
94913, and save the result to CSV files in the /loudacre/accounts_zip94913
HDFS directory. You can do this in a single command, as shown below, or with
multiple commands.
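One possible single-command version in Python (a sketch; the header option ensures
the CSV files include a header line, as required in the next step):
pyspark> accountsDF.where(accountsDF.zipcode == "94913"). \
   write.option("header","true"). \
   csv("/loudacre/accounts_zip94913")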
6. Use Hue or the command line (in a separate gateway session) to view the
/loudacre/accounts_zip94913 directory in HDFS and the data in one of the
saved files. Confirm that the CSV file includes a header line, and that only records
for the selected zip code are included.
7. Optional: Try creating a new DataFrame based on the CSV files you created above.
Compare the schema of the original accountsDF and the new DataFrame. What’s
different? Try again, this time setting the inferSchema option to true and
compare again.
9. Create a new DataFrame based on the devices.json file. (This command could
take several seconds while it infers the schema.)
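For example:
pyspark> devDF = spark.read.json("/loudacre/devices.json")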
10. View the schema of the devDF DataFrame. Note the column names and types that
Spark inferred from the JSON file. In particular, note that the release_dt column
is of type string, whereas the data in the column actually represents a timestamp.
11. Define a schema that correctly specifies the column types for this DataFrame. Start
by importing the package with the definitions of necessary classes and types.
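In Python, for example:
pyspark> from pyspark.sql.types import *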
pyspark> devColumns = [
StructField("devnum",LongType()),
StructField("make",StringType()),
StructField("model",StringType()),
StructField("release_dt",TimestampType()),
StructField("dev_type",StringType())]
13. Create a schema (a StructType object) using the column definition list.
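For example (the variable name devSchema is illustrative):
pyspark> devSchema = StructType(devColumns)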
14. Recreate the devDF DataFrame, this time using the new schema.
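For example, using the schema object defined above:
pyspark> devDF = spark.read.schema(devSchema).json("/loudacre/devices.json")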
15. View the schema and data of the new DataFrame, and confirm that the
release_dt column type is now timestamp.
16. Now that the device data uses the correct schema, write the data in Parquet format,
which automatically embeds the schema. Save the Parquet data files into an HDFS
directory called /loudacre/devices_parquet.
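For example:
pyspark> devDF.write.parquet("/loudacre/devices_parquet")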
17. Optional: In a separate gateway terminal session, use parquet-tools to view the
schema of the saved files.
$ parquet-tools schema \
hdfs://master-1/loudacre/devices_parquet/
Note that the type of the release_dt column is noted as int96; this is how Spark
denotes a timestamp type in Parquet.
18. Create a new DataFrame using the Parquet files you saved in devices_parquet
and view its schema. Note that Spark is able to correctly infer the timestamp type
of the release_dt column from Parquet’s embedded schema.
In this exercise, you will analyze account and mobile device data using DataFrame
queries.
First, you will practice using column expressions in queries. You will analyze data in
DataFrames by grouping and aggregating data, and by joining two DataFrames. Then
you will query multiple sets of data to find out how many of each mobile device model
is used in active accounts.
Important: This exercise depends on a previous exercise: “Working with DataFrames
and Schemas.” If you did not complete that exercise, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Create a new DataFrame called accountsDF based on the Hive accounts table.
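For example, in Python:
pyspark> accountsDF = spark.read.table("accounts")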
3. Try a simple query with select, using both column reference syntaxes.
pyspark> accountsDF. \
select(accountsDF["first_name"]).show()
pyspark> accountsDF.select(accountsDF.first_name).show()
scala> accountsDF.
select(accountsDF("first_name")).show
scala> accountsDF.select($"first_name").show
4. To explore column expressions, create a column object to work with, based on the
first_name column in the accountsDF DataFrame.
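For example:
pyspark> fnCol = accountsDF.first_name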
5. Note that the object type is Column. To see available methods and attributes, use
tab completion—that is, enter fnCol. followed by TAB.
6. New Column objects are created when you perform operations on existing
columns. Create a new Column object based on a column expression that identifies
users whose first name is Lucy using the equality operator on the fnCol object you
created above.
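For example:
pyspark> lucyCol = (fnCol == "Lucy")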
pyspark> accountsDF. \
select(accountsDF.first_name,accountsDF.last_name,lucyCol).show()
scala> accountsDF.
select($"first_name",$"last_name",lucyCol).show
> accountsDF.where(lucyCol).show(5)
9. Column expressions do not need to be assigned to a variable. Try the same query
without using the lucyCol variable.
10. Column expressions are not limited to where operations like those above. They can
be used in any transformation for which a simple column could be used, such as a
select. Try selecting the city and state columns, and the first three characters
of the phone_number column (in the U.S., the first three digits of a phone number
are known as the area code). Use the substr operator on the phone_number
column to extract the area code.
pyspark> accountsDF. \
select("city", "state", \
accountsDF.phone_number.substr(1,3)). \
show(5)
scala> accountsDF.
select($"city", $"state",
$"phone_number".substr(1,3)).
show(5)
11. Notice that in the last step, the values returned by the query were correct, but the
column name was substring(phone_number, 1, 3), which is long and
hard to work with. Repeat the same query, using the alias operator to rename that
column as area_code.
pyspark> accountsDF. \
select("city", "state", \
accountsDF.phone_number. \
substr(1,3).alias("area_code")). \
show(5)
scala> accountsDF.
select($"city", $"state",
$"phone_number".substr(1,3).alias("area_code")).
show(5)
12. Perform a query that results in a DataFrame with just first_name and
last_name columns, and only includes users whose first and last names both
begin with the same two letters. (For example, the user Robert Roget would be
included, because both his first and last names begin with “Ro”.)
pyspark> accountsDF.groupBy("last_name").count().show(5)
scala> accountsDF.groupBy("last_name").count.show(5)
14. You can also group by multiple columns. Query accountsDF again, this time
counting the number of people who share the same last and first name.
pyspark> accountsDF. \
groupBy("last_name","first_name").count().show(5)
scala> accountsDF.
groupBy("last_name","first_name").count.show(5)
$ parquet-tools schema \
hdfs://master-1/loudacre/base_stations.parquet
$ parquet-tools head \
hdfs://master-1/loudacre/base_stations.parquet
16. In your Spark shell, create a new DataFrame called baseDF using the base stations
data. Review the baseDF schema and data to ensure it matches the data in the
Parquet file.
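For example:
pyspark> baseDF = spark.read.parquet("/loudacre/base_stations.parquet")
pyspark> baseDF.printSchema()
pyspark> baseDF.show(5)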
17. Some account holders live in zip codes that have a base station. Join baseDF and
accountsDF to find those users, and for each, include their account ID, first name,
last name, and the ID and location data for the base station in their zip code.
pyspark> accountsDF. \
select("acct_num","first_name","last_name","zipcode"). \
join(baseDF, baseDF.zip == accountsDF.zipcode). \
show()
scala> accountsDF.
select("acct_num","first_name","last_name","zipcode").
join(baseDF,$"zip" === $"zipcode").show()
20. Use the account device data and the DataFrames you created previously in this
exercise to find the total number of each device model across all active accounts
(that is, accounts that have not been closed). The new DataFrame should be sorted
from most to least common model. Save the data as Parquet files in a directory
called /loudacre/top_devices with the following columns:
• device_id—the ID number of each known device (including those that might not
be in use by any account). Example value: 18
• make—the manufacturer name for the device. Example value: Ronin
Hints:
• Active accounts are those with a null value for acct_close_dt (account close
date) in the accounts table.
• The device_id column in the device accounts data corresponds to the devnum
column in the list of known devices in the /loudacre/devices.json file.
In this exercise, you will use the Spark shell to work with RDDs.
You will start by reading a simple text file into a Resilient Distributed Dataset (RDD)
and displaying the contents. You will then create two new RDDs and use
transformations to union them and remove duplicates.
Important: This exercise depends on a previous exercise: “Accessing HDFS with
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
4. In the Spark shell, define an RDD based on the frostroad.txt text file.
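For example, in Python (this assumes frostroad.txt has been uploaded to the
/loudacre directory in HDFS; adjust the path if your copy is elsewhere):
pyspark> myRDD = sc.textFile("/loudacre/frostroad.txt")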
5. Using command completion, you can see all the available transformations and
operations you can perform on an RDD. Type myRDD. and then the TAB key.
6. Spark has not yet read the file. It will not do so until you perform an action on the
RDD. Try counting the number of elements in the RDD using the count action:
pyspark> myRDD.count()
scala> myRDD.count
The count operation causes the RDD to be materialized (created and populated).
The number of lines (23) should be displayed, for example:
Out[2]: 23 (Python) or
res1: Long = 23 (Scala)
7. Call the collect operation to return all data in the RDD to the Spark driver. Take
note of the type of the return value; in Python it will be a list of strings, and in Scala
it will be an array of strings.
Note: collect returns the entire set of data. This is convenient for very small
RDDs like this one, but be careful using collect for more typical large sets of data.
8. Display the contents of the collected data by looping through the collection.
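For example, in Python:
pyspark> for line in myRDD.collect(): print(line)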
12. Display the contents of the makes1RDD data using collect and then looping
through the returned collection.
13. Repeat the previous steps to create and display an RDD called makes2RDD based
on the second file, /loudacre/makes2.txt.
14. Create a new RDD by appending the second RDD to the first using the union
transformation.
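For example:
pyspark> allMakesRDD = makes1RDD.union(makes2RDD)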
15. Collect and display the contents of the new allMakesRDD RDD.
17. Optional: Try performing different transformations on the RDDs you created above,
such as intersection, subtract, or zip. See the RDD API documentation for
details.
$ $DEVSH/scripts/catchup.sh
2. Copy the weblogs directory from the gateway filesystem to the /loudacre HDFS
directory.
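For example (assuming the web log files are in $DEVDATA/weblogs on the gateway
host):
$ hdfs dfs -put $DEVDATA/weblogs /loudacre/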
3. Create an RDD from the uploaded web logs data files in the /loudacre/
weblogs/ directory in HDFS.
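For example:
pyspark> logsRDD = sc.textFile("/loudacre/weblogs/")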
4. Create an RDD containing only those lines that are requests for JPG files. Use the
filter operation with a transformation function that takes a string RDD element
and returns a boolean value.
pyspark> jpglogsRDD = \
logsRDD.filter(lambda line: ".jpg" in line)
5. Use take to return the first five lines of the data in jpglogsRDD. The return value
is a list of strings (in Python) or array of strings (in Scala).
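For example, in Python:
pyspark> jpgLines = jpglogsRDD.take(5)
To display the returned lines, loop through the collection:
pyspark> for line in jpgLines: print(line)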
scala> jpgLines.foreach(println)
7. Now try using the map transformation to define a new RDD. Start with a simple map
function that returns the length of each line in the log file. This results in an RDD of
integers.
pyspark> lineLengthsRDD = \
logsRDD.map(lambda line: len(line))
8. Loop through and display the first five elements (integers) in the RDD.
9. Calculating line lengths is not very useful. Instead, try mapping each string in
logsRDD by splitting the strings based on spaces. The result will be an RDD in
which each element is a list of strings (in Python) or an array of strings (in Scala).
Each string represents a “field” in the web log line.
pyspark> lineFieldsRDD = \
logsRDD.map(lambda line: line.split(' '))
10. Return the first five elements of lineFieldsRDD. The result will be a list of lists of
strings (in Python) or an array of arrays of strings (in Scala).
11. Display the contents of the return value from take. Unlike the examples above, which
returned collections of simple values (strings and integers), this time you have a set of
compound values (arrays or lists containing strings). Therefore, to display them
properly, you will need to loop through the arrays/lists in lineFields, and then
loop through each string in the array/list. (To make it easier to read the output, use
------- to separate each set of field values.)
12. Now that you know how map works, create a new RDD containing just the IP
addresses from each line in the log file. (The IP address is the first space-delimited
field in each line.)
pyspark> ipsRDD = \
logsRDD.map(lambda line: line.split(' ')[0])
pyspark> for ip in ipsRDD.take(10): print ip
pyspark> ipsRDD.saveAsTextFile("/loudacre/iplist")
scala> ipsRDD.saveAsTextFile("/loudacre/iplist")
• Note: If you re-run this command, you will not be able to save to the same
directory because it already exists. Be sure to first delete the directory using
either the hdfs command (in a separate terminal window) or the Hue file
browser.
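For example, you can delete the output directory from a gateway terminal with the
hdfs command before re-running the save:
$ hdfs dfs -rm -r /loudacre/iplist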
14. In a gateway terminal window or the Hue file browser, list the contents of the
/loudacre/iplist folder. Review the contents of one of the files to confirm that
they were created correctly.
165.32.101.206,8
100.219.90.44,102
182.4.148.56,173
246.241.6.175,45395
175.223.172.207,4115
…
16. Now that the data is in CSV format, it can easily be used by Spark SQL. Load the new
CSV files in /loudacre/userips_csv created above into a DataFrame, then
view the data and schema.
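A minimal sketch of this step in the Python shell (the variable name is illustrative;
without an explicit schema, the columns will get default names such as _c0 and _c1):
pyspark> userIPsDF = spark.read.csv("/loudacre/userips_csv")
pyspark> userIPsDF.show(5)
pyspark> userIPsDF.printSchema()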
18. Determine which delimiter to use (the 20th character—position 19—is the first use
of the delimiter).
19. Filter out any records which do not parse correctly (hint: each record should have
exactly 14 values).
20. Extract the date (first field), model (second field), device ID (third field), and
latitude and longitude (13th and 14th fields respectively).
21. The second field contains the device manufacturer and model name (such as Ronin
S2). Split this field by spaces to separate the manufacturer from the model (for
example, manufacturer Ronin, model S2). Keep just the manufacturer name.
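As a rough sketch of steps 19 through 21, assuming the raw lines are already in an RDD
called statusRDD and that the delimiter you identified in step 18 turned out to be a
comma (adjust the split character and variable names to match what you actually found):
pyspark> cleanRDD = statusRDD \
    .map(lambda line: line.split(',')) \
    .filter(lambda fields: len(fields) == 14)
pyspark> extractedRDD = cleanRDD.map(lambda fields: \
    (fields[0], fields[1].split(' ')[0], fields[2], fields[12], fields[13]))
Here fields[0] is the date, fields[1].split(' ')[0] keeps just the manufacturer,
fields[2] is the device ID, and fields[12] and fields[13] are the latitude and
longitude (the 13th and 14th fields).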
23. Confirm that the data in the file(s) was saved correctly. The lines in the file should
all look similar to this, with all fields delimited by commas.
2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-
b500-28f2342679af,33.6894754264,-117.543308253
24. Review the data on the local Linux filesystem in the directory $DEVDATA/
activations. Each XML file contains data for all the devices activated by
customers during a specific month.
Sample input data:
<activations>
<activation timestamp="1225499258" type="phone">
<account-number>316</account-number>
<device-id>
d61b6971-33e1-42f0-bb15-aa2ae3cd8680
</device-id>
<phone-number>5108307062</phone-number>
<model>iFruit 1</model>
</activation>
…
</activations>
Follow the steps below to write code to go through a set of activation XML files and
extract the account number and device model for each activation, and save the list to a
file as account_number:model.
The output will look something like:
1234:iFruit 1
987:Sorrento F00L
4566:iFruit 1
…
26. Start with the ActivationModels stub script in the bonus exercise directory:
$DEVSH/exercises/rdds/bonus-xml. (Stubs are provided for Scala and
Python; use whichever language you prefer.) Note that for convenience you have
been provided with functions to parse the XML, as that is not the focus of this
exercise. Copy the stub code into the Spark shell of your choice.
27. Use wholeTextFiles to create an RDD from the activations dataset. The
resulting RDD will consist of tuples, in which the first value is the name of the file,
and the second value is the contents of the file (XML) as a string.
28. Each XML file can contain many activation records; use flatMap to map the
contents of each file to a collection of XML records by calling the provided
getActivations function. getActivations takes an XML string, parses it, and
returns a collection of XML records; flatMap maps each record to a separate RDD
element.
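A hedged Python sketch of steps 27 and 28, assuming the XML files have been uploaded to
a /loudacre/activations directory in HDFS (the path is illustrative) and that
getActivations comes from the provided stub code:
pyspark> activationsRDD = sc.wholeTextFiles("/loudacre/activations")
pyspark> activationRecordsRDD = \
    activationsRDD.flatMap(lambda pair: getActivations(pair[1]))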
30. Save the formatted strings to a text file in the directory /loudacre/account-
models.
In this exercise, you will explore the Loudacre web server log files, as well as the
Loudacre user account data, using key-value pair RDDs.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs.” If you did not complete that exercise, run the course catch-up script and advance
to the current exercise:
$ $DEVSH/scripts/catchup.sh
1. Using map-reduce logic, count the number of requests from each user.
a. Use map to create a pair RDD with the user ID as the key and the integer 1
as the value. (The user ID is the third field in each line.) Your data will look
something like this:
(userid,1)
(userid,1)
(userid,1)
…
b. Use reduceByKey to sum the values for each user ID. Your RDD data will be
similar to this:
(userid,5)
(userid,7)
(userid,2)
…
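A minimal Python sketch of both sub-steps, assuming logsRDD holds the web log lines
from the previous exercise:
pyspark> userReqsRDD = \
    logsRDD.map(lambda line: (line.split(' ')[2], 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2)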
2. Use countByKey to determine how many users visited the site for each frequency.
That is, how many users visited once, twice, three times, and so on.
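For example, you could swap each pair so the hit-count becomes the key and then apply
countByKey (a sketch; userReqsRDD is the pair RDD from the previous step):
pyspark> freqCounts = \
    userReqsRDD.map(lambda pair: (pair[1], pair[0])).countByKey()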
3. Create an RDD where the user ID is the key, and the value is the list of all the IP
addresses that user has connected from. (IP address is the first field in each request
line.)
(userid,[20.1.34.55, 74.125.239.98])
(userid,[75.175.32.10, 245.33.1.1, 66.79.233.99])
(userid,[65.50.196.141])
…
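A hedged sketch using groupByKey (the grouped values come back as iterables; wrap them
with list() if you want output that looks like the sample above):
pyspark> userIPsRDD = \
    logsRDD.map(lambda line: (line.split(' ')[2], line.split(' ')[0])) \
    .groupByKey()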
ID, which corresponds to the user ID in the web server logs. The other fields include
account details such as creation date, first and last name, and so on.
4. Join the accounts data with the weblog data to produce a dataset keyed by user ID
which contains the user account information and the number of website hits for
that user.
b. Join the pair RDD with the set of user-id/hit-count pairs calculated in the first
step.
(9012,([9012,2008-11-24 10:04:08,\N,Cheryl,West, 4905 Olive
Street,San Francisco,CA,…],4))
(2312,([2312,2008-11-23 14:05:07,\N,Elizabeth,Kerns, 4703
Eva Pearl Street,Richmond,CA,…],8))
(1195,([1195,2008-11-02 17:12:12,2013-07-18
16:42:36,Melissa, Roman,3539 James Martin
Circle,Oakland,CA,…],1))
…
c. Display the user ID, hit count, and first name (4th value) and last name (5th
value) for the first five elements. The output should look similar to this:
Bonus Exercises
If you have more time, attempt the following extra bonus exercises:
1. Use keyBy to create an RDD of account data with the postal code (9th field in the
CSV file) as the key.
Tip: Assign this new RDD to a variable for use in the next bonus exercise.
2. Create a pair RDD with postal code as the key and a list of names (Last Name,First
Name) in that postal code as the value.
• Hint: First name and last name are the 4th and 5th fields respectively.
--- 85003
Jenkins,Thad
Rick,Edward
Lindsay,Ivy
…
--- 85004
Morris,Eric
Reiser,Hazel
Gregg,Alicia
Preston,Elizabeth
…
In this exercise, you will use the Catalog API to explore Hive tables, and create
DataFrames by executing SQL queries.
Use the Catalog API to list the tables in the default Hive database, and view the schema
of the accounts table. Perform queries on the accounts table, and review the resulting
DataFrames. Create a temporary view based on the accountdevice CSV files, and use
SQL to join that table with the accounts table.
Important: This exercise depends on a previous exercise: “Analyzing Data with
DataFrame Queries.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
scala> spark.catalog.listTables.show
scala> spark.catalog.listColumns("accounts").show
3. Create a new DataFrame based on the accounts table, and confirm that its schema
matches that of the column list above.
5. Optional: Perform the equivalent query using the DataFrame API, and compare the
schema and data in the results to those of the query above.
8. Confirm the view was created correctly by listing the tables and views in the
default database as you did earlier. Notice that the account_dev table type is
TEMPORARY.
9. Using a SQL query, create a new DataFrame based on the first five rows of the
account_dev table, and display the results.
11. Save nameDevDF as a table called name_dev (with the file path as /loudacre/
name_dev).
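A minimal sketch of this step in the Python shell, assuming nameDevDF was created in
the preceding step:
pyspark> nameDevDF.write.option("path", "/loudacre/name_dev"). \
    saveAsTable("name_dev")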
12. Use the Catalog API to confirm that the table was created correctly with the right
schema.
13. Optional: If you are familiar with using Hive or Impala, verify that the name_dev
table now exists in the Hive metastore. If you use Impala, be sure to invalidate
Impala’s local store of the metastore using the INVALIDATE METADATA command
or the refresh icon in the Hue Impala Query Editor.
14. Optional: Exit and restart the shell and confirm that the temporary view is no longer
available.
In this exercise, you will explore Datasets using web log data.
Create an RDD of account ID/IP address pairs, and then create a new Dataset of
products (case class objects) based on that RDD. Compare the results of typed and
untyped transformations to better understand the relationship between DataFrames
and Datasets.
Note: These exercises are in Scala only, because Datasets are not defined in Python.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs.” If you did not complete that exercise, run the course catch-up script and advance
to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Create an RDD of AccountIP objects by using the web log data in /loudacre/
weblogs. Split the data by spaces and use the first field as the IP address and the third
as account ID.
6. Save the accountIPDS Dataset as a Parquet file, then read the file back into
a DataFrame. Note that the type of the original Dataset (AccountIP) is not
preserved, but the types of the columns are.
Bonus Exercises
1. Try creating a new Dataset of AccountIP objects based on the DataFrame you
created above.
2. Create a view on the AccountIPDS Dataset, and perform a SQL query on the view.
What is the return type of the SQL query? Were column types preserved?
In this exercise, you will write your own Spark application instead of using the
interactive Spark shell application.
Write a simple Spark application that takes a single argument, a state code (such as CA).
The program should read the data from the accounts Hive table and save the rows
whose state column value matches the specified state code. Write the results to
/loudacre/accounts_by_state/state-code (such as accounts_by_state/CA).
Depending on which programming language you are using, follow the appropriate set of
instructions below to write a Spark program.
Before running your program, be sure to exit from the Spark shell.
Important: This exercise depends on a previous exercise: “Accessing HDFS with the
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. A simple stub file to get started has been provided in the exercise directory on the
gateway node: $DEVSH/exercises/spark-application/python-stubs/
accounts-by-state.py. This stub imports the required Spark classes and sets
up your main code block. Open the stub file in an editor.
spark = SparkSession.builder.getOrCreate()
4. In the body of the program, load the accounts Hive table into a DataFrame. Select
accounts where the state column value matches the string provided as the first
argument to the application. Save the results to a directory called /loudacre/
accounts_by_state/state-code (where state-code is a string such
as CA.) Use overwrite mode when saving the file so that you can re-run the
application without needing to delete the directory.
spark.stop()
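Putting the pieces together, a minimal sketch of what the completed Python application
might look like (the structure follows the stub described above; the variable names and
argument check are illustrative, not the exact stub contents):
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: accounts-by-state.py <state-code>")
        sys.exit(1)
    stateCode = sys.argv[1]

    # Create the Spark session as in the stub
    spark = SparkSession.builder.getOrCreate()

    # Load the accounts Hive table, keep only rows for the requested state,
    # and save them (overwriting any output from a previous run)
    accountsDF = spark.read.table("accounts")
    accountsDF.where(accountsDF.state == stateCode) \
        .write.mode("overwrite") \
        .save("/loudacre/accounts_by_state/" + stateCode)

    spark.stop()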
6. Run your application. In a gateway terminal session, change to the exercise working
directory, then run the program, passing the state code to select. For example, to
select accounts in California, use the following command:
$ cd $DEVSH/exercises/spark-application/
$ spark2-submit python-stubs/accounts-by-state.py CA
7. Once the program completes, use parquet-tools to verify that the file contents
are correct. For example, if you used the state code CA, you would use the command
below:
$ parquet-tools head \
hdfs://master-1/loudacre/accounts_by_state/CA
8. Skip the section below on writing and running a Spark application in Scala and
continue with Viewing the Spark Application UI.
11. In the body of the program, load the accounts Hive table into a DataFrame. Select
accounts where the state column value matches the string provided as the first
argument to the application. Save the results to a Parquet file called /loudacre/
accounts_by_state/state-code (where state-code is a string such
as CA). Use overwrite mode when saving the file so that you can re-run the
application without needing to delete the save directory.
12. At the end of the application, be sure to stop the Spark session:
spark.stop
13. In a gateway terminal session, change to the project directory, then build your
project. Note that the first time you compile a Spark application using Maven, it
may take several minutes for Maven to download the necessary libraries. When you
build using the same libraries in the future, they will not be downloaded a second
time, and building will be much faster. (The line below must be entered on a single
line in your gateway session.)
$ cd $DEVSH/exercises/spark-application/accounts-by-
state_project
$ mvn package
14. If the build is successful, Maven will generate a JAR file called accounts-by-
state-1.0.jar in the target directory. Run the program, passing the state
code to select. For example, to select accounts in California, use the following
command:
$ spark2-submit \
--class stubs.AccountsByState \
target/accounts-by-state-1.0.jar CA
15. Once the program completes, use parquet-tools to verify that the file contents
are correct. For example, if you used the state code CA, you would use the command
below:
$ parquet-tools head \
hdfs://master-1/loudacre/accounts_by_state/CA/
16. Open Firefox on your VM and visit the YARN Resource Manager UI using the
provided RM bookmark (or go to URI http://master-1:8088/). While the
application is running, it appears in the list of applications something like this:
After the application has completed, it will appear in the list like this:
17. Follow the ApplicationMaster link to view the Spark Application UI, or the History
link to view the application in the History Server UI.
19. Go back to the YARN RM UI in your browser, and confirm that the application name
was set correctly in the list of applications.
20. Follow the ApplicationMaster or History link. View the Environment tab. Take
note of the spark.* properties such as master and app.name.
21. You can set most of the common application properties using submit script
flags such as --name, but for others you need to use the --conf flag. Use --conf to
set the spark.default.parallelism property, which controls how many partitions
result after a "wide" RDD operation like reduceByKey.
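For example, a run of the Python application from the earlier steps might look like this
(the application name and parallelism value are illustrative):
$ spark2-submit \
--name "Accounts by State" \
--conf spark.default.parallelism=4 \
python-stubs/accounts-by-state.py CA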
22. View the application history for this application to confirm that the
spark.default.parallelism property was set correctly. (You will need to
view the YARN RM UI again to view the correct application’s history.)
24. Examine the extra output displayed when the application starts up.
a. The first section starts with Using properties file, and shows the
file name and the default property settings the application loaded from that
properties file.
b. The second section starts with Parsed arguments. This lists the arguments
—that is, the flags and settings—you set when running the submit script
(except for conf). Submit script flags that you didn’t pass use their default
values, if defined by the script, or are shown as null.
• Does the list correctly include the value you set with --name?
• Which arguments (flags) have defaults set in the script and which do not?
c. Scroll down to the section that starts with System properties. This list
shows all the properties set—those loaded from the system properties file,
those you set using submit script arguments, and those you set using the conf
flag.
1. Edit the Python or Scala application you wrote above, and use the builder function
appName to set the application name.
3. View the YARN UI to confirm that the application name was correctly set.
In this exercise, you will explore how Spark executes RDD and DataFrame/
Dataset queries.
First, you will explore RDD partitioning and lineage-based execution plans using the
Spark shell and the Spark Application UI. Then you will explore how Catalyst executes
DataFrame and Dataset queries.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs”. If you did not complete those exercises, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. In the Spark shell, create an RDD called accountsRDD by reading the accounts
data, splitting it by commas, and keying it by account ID, which is the first field of
each line.
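A hedged Python sketch, assuming the accounts data was uploaded to /loudacre/accounts
in HDFS (adjust the path to match your environment):
pyspark> accountsRDD = sc.textFile("/loudacre/accounts") \
    .map(lambda line: line.split(',')) \
    .keyBy(lambda fields: fields[0])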
pyspark> accountsRDD.getNumPartitions()
scala> accountsRDD.getNumPartitions
scala> accountsRDD.toDebugString
6. In the browser, view the application in the YARN RM UI using the provided
bookmark (or http://master-1:8088) and click through to view the Spark
Application UI.
7. Make sure the Jobs tab is selected, and review the list of completed jobs. The most
recent job, which you triggered by calling count, should be at the top of the list.
(Note that the job description is usually based on the action that triggered the job
execution.) Confirm that the number of stages is correct, and the number of tasks
completed for the job matches the number of RDD partitions you noted when you
used toDebugString.
8. Click on the job description to view details of the job. This will list all the stages in
the job, which in this case is one.
9. Click on DAG Visualization to see a diagram of the execution plan based on the
RDD’s lineage. The main diagram displays the stages, but if you click on a stage, it
will show you the tasks within that stage.
10. Optional: Explore the partitioning and DAG of a more complex query like the one
below. Before you view the execution plan or job details, try to figure out how many
stages the job will have.
This query loads Loudacre’s web log data, and calculates how many times each user
visited. Then it joins that user count data with account data for each user.
Note: If you execute the query multiple times, you may note that some tasks within
a stage are marked as “skipped.” This is because whenever a shuffle operation
is executed, Spark temporarily caches the data that was shuffled. Subsequent
executions of the same query re-use that data if it’s available to save some steps and
increase performance.
12. View the full execution plan for the new DataFrame.
pyspark> activeAccountsDF.explain(True)
scala> activeAccountsDF.explain(true)
Can you locate the line in the physical plan corresponding to the command to load
the accounts table into a DataFrame?
How many stages do you think this query has?
14. View the Spark Application UI and choose the SQL tab. This displays a list of
DataFrame and Dataset queries you have executed, with the most recent query at
the top.
15. Click the description for the top query to see the visualization of the query’s
execution. You can also see the query’s full execution plan by opening the Details
panel below the visualization graph.
16. The first step in the execution is a HiveTableScan, which loaded the account data
into the DataFrame. Hover your mouse over the step to show the step’s execution
plan. Compare that to the physical plan for the query. Note that it is the same as the
last line in the physical execution plan, because it is the first step to execute. Did
you correctly identify this line in the execution plan as the one corresponding to the
spark.read.table operation?
17. The Succeeded Jobs label provides links to the jobs that executed as part of this
query execution. In this case, there is just a single job. Click its ID to view the job
details. This will display a list of stages that were completed for the query.
How many stages executed? Is that the number of stages you predicted it would be?
18. Optional: Click the description of the stage to view metrics on the execution of the
stage and its tasks.
19. The previous query was very simple, involving just a single data source with a
where to return only active accounts. Try executing a more complex query that
joins data from two different data sources.
This query reads in the accountdevice data file, which maps account IDs to
associated device IDs. Then it joins that data with the DataFrame of active
accounts you created above. The result is a DataFrame consisting of all device IDs in
use by currently active accounts.
20. Review the full execution plan using explain, as you did with the previous
DataFrame.
Can you identify which lines in the execution plan load the two different data
sources?
How many stages do you think this query will execute?
21. Execute the query and review the execution visualization in the Spark UI.
What differences do you see between the execution of the earlier query and this
one?
How many stages executed? Is this what you expected?
22. Optional: Explore an even more complex query that involves multiple joins with
three data sources. You can use the last query in the solutions file for this exercise
(in the $DEVSH/exercises/query-execution/solution/ directory). That
query creates a list of device IDs, makes, and models, and the number of active
accounts that use that type of device, sorted in order from most popular device type
to least.
$ $DEVSH/scripts/catchup.sh
2. The query code you pasted above defines a new DataFrame called
accountsDevsDF, which joins account data and device data for all active
accounts. Try executing a query starting with the accountsDevsDF DataFrame
that displays the account number, first name, last name and device ID for each row.
pyspark> accountsDevsDF. \
select("acct_num","first_name","last_name","device_id"). \
show(5)
scala> accountsDevsDF.
select("acct_num","first_name","last_name","device_id").
show(5)
3. In your browser, go to the SQL tab of your application’s Spark UI, and view the
execution visualization of the query you just executed. Take note of the complexity
so that you can compare it to later executions when using persistence.
Remember that queries are listed in the SQL tab in the order they were executed,
starting with the most recent. The descriptions of multiple executions of the same
action will not distinguish one query from another, so make sure you choose the
correct one for the query you are looking at.
4. In your Spark shell, persist the accountsDevsDF DataFrame using the default
storage level.
pyspark> accountsDevsDF.persist()
scala> accountsDevsDF.persist
pyspark> accountsDevsDF. \
select("acct_num","first_name","last_name","device_id"). \
show(5)
scala> accountsDevsDF.
select("acct_num","first_name","last_name","device_id").
show(5)
6. In the browser, reload the Spark UI SQL tab, and view the execution diagram for
the query you just executed. Notice that it has far fewer steps. Instead of reading,
filtering, and joining the data from the two sources, it reads the persisted data from
memory. If you hover your mouse over the memory scan step, you will see that the
only operation it performs on the data in memory is the last step of the query: the
unpersisted select transformation. Compare the diagram for this query with the
first one you executed above, before persisting.
7. The first time you execute a query on a persisted DataFrame, Dataset, or RDD, Spark
has to execute the full query in order to materialize the data that gets saved in
memory or on disk. Compare the difference between the first and second queries
after executing persist by re-executing the query one final time. Then use the
Spark UI to compare both queries executed after the persist operation, and
consider these questions.
• Did one query take longer than the other? If so, which one, and why?
• How many partitions of the RDD were persisted and how much space do those
partitions take up in memory and on disk?
• Note that only a small percentage of the data is cached. Why is that? How could
you cache more of the data?
• Click the RDD name to view the storage details. Which executors are storing data
for this RDD?
9. Execute the same query as above using the write action instead of show.
pyspark> accountsDevsDF.write.mode("overwrite"). \
save("/loudacre/accounts_devices")
scala> accountsDevsDF.write.mode("overwrite").
save("/loudacre/accounts_devices")
• What percentage of the data is cached? Why? How does this compare to the last
time you persisted the data?
• How much memory is the data taking up? How much disk space?
> accountsDevsDF.unpersist()
12. View the Spark UI Storage tab to verify that the cache for accountsDevsDF has been
removed.
13. Repersist the same DataFrame, setting the storage level to save the data to files on
disk, replicated twice.
15. Reload the Storage tab to confirm that the storage level for the RDD is set correctly.
Also consider these questions:
• How much memory is the data taking up? How much disk space?
2. Examine the data in the dataset. Note that the latitude and longitude are the 4th and
5th fields, respectively, as shown in the sample data below:
2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-
28f2342679af,33.6894754264,-117.543308253
2014-03-15:10:10:20,MeeToo,ef8c7564-0a1a-4650-a655-
c8bbd5f8f943,37.4321088904,-121.485029632
• addPoints: given two points, return a point which is the sum of the two points
—that is, (x1+x2, y1+y2)
• distanceSquared: given two points, returns the squared distance of the two
—this is a common calculation required in graph analysis
Note that the stub code sets the variable K equal to 5—this is the number of
means to calculate.
4. The stub code also sets the variable convergeDist. This will be used to decide
when the k-means calculation is done—when the amount the locations of the
means changes between iterations is less than convergeDist. A “perfect”
solution would be 0; this number represents a “good enough” solution. For this
exercise, use a value of 0.1.
Or in Scala:
7. Iteratively calculate a new set of K means until the total distance between the
means calculated for this iteration and the last is smaller than convergeDist. For
each iteration:
a. For each coordinate point, use the provided closestPoint function to map
that point to the index in the kPoints array of the location closest to that
point. The resulting RDD should be keyed by the index, and the value should be
the pair: (point, 1). (The value 1 will later be used to count the number of
points closest to a given mean.) For example:
b. Reduce the result: for each center in the kPoints array, sum the latitudes and
longitudes, respectively, of all the points closest to that center, and also find the
number of closest points. For example:
(0, ((2638919.87653,-8895032.182481), 74693))
(1, ((3654635.24961,-12197518.55688), 101268))
(2, ((1863384.99784,-5839621.052003), 48620))
(3, ((4887181.82600,-14674125.94873), 126114))
(4, ((2866039.85637,-9608816.13682), 81162))
c. The reduced RDD should have (at most) K members. Map each to a new center
point by calculating the average latitude and longitude for each set of closest
points: that is, map (index, ((totalX,totalY), n)) to (index, (totalX/
n, totalY/n)).
d. Collect these new points into a local map or array keyed by index.
f. Copy the new center points to the kPoints array in preparation for the next
iteration.
8. When all iterations are complete, display the final K center points.
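Taken together, a hedged Python sketch of the iteration loop in steps 7 and 8 might look
like the following. It assumes pointsRDD holds the (latitude, longitude) pairs, kPoints
is the list of K current centers, and closestPoint, addPoints, and distanceSquared come
from the provided stub code.
tempDist = float("inf")
while tempDist > convergeDist:
    # 7a: key each point by the index of its closest current center
    closestRDD = pointsRDD.map(
        lambda point: (closestPoint(point, kPoints), (point, 1)))
    # 7b: per center index, sum the coordinates and count the closest points
    pointStatsRDD = closestRDD.reduceByKey(
        lambda a, b: (addPoints(a[0], b[0]), a[1] + b[1]))
    # 7c: average the coordinates to get a new center for each index
    newPointsRDD = pointStatsRDD.map(
        lambda pair: (pair[0], (pair[1][0][0] / pair[1][1],
                                pair[1][0][1] / pair[1][1])))
    # 7d: collect the new centers into a local dict keyed by index
    newPoints = newPointsRDD.collectAsMap()
    # measure how far the centers moved in this iteration
    tempDist = sum(
        distanceSquared(kPoints[i], newPoints[i]) for i in newPoints)
    # 7f: install the new centers for the next iteration
    for i in newPoints:
        kPoints[i] = newPoints[i]
# 8: display the final K center points
print(kPoints)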
In this exercise, you will write a Spark Streaming application to count Knowledge
Base article requests.
This exercise has two parts. First, you will review the Spark Streaming documentation.
Then you will write and test a Spark Streaming application to read streaming web
server log data and count the number of requests for Knowledge Base articles.
• Follow the links at the top of the package page to view the DStream and
PairDStreamFunctions classes— these will show you the methods available
on a DStream of regular RDDs and pair RDDs respectively.
For Python:
2. You may also wish to view the Spark Streaming Programming Guide (select
Programming Guides > Spark Streaming on the Spark documentation main
page).
3. Stream the Loudacre web log files at a rate of 20 lines per second using the
provided test script.
This script will exit after the client disconnects, so you will need to restart the script
when you restart your Spark application.
Tip: This exercise involves using multiple terminal windows. To avoid confusion,
set a different title for each one by selecting Set Title… on the Terminal menu:
6. Create a DStream by reading the data from the host and port provided as input
parameters.
7. Filter the DStream to only include lines containing the string KBDOC.
8. To confirm that your application is correctly receiving the streaming web log data,
display the first five records in the filtered DStream for each one-second batch. (In
Scala, use the DStream print function; in Python, use pprint.)
9. For each RDD in the filtered DStream, display the number of items—that is, the
number of requests for KB articles.
Tip: Python does not allow calling print within a lambda function, so define a
named function to print.
10. Save the filtered logs to text files in HDFS. Use the base directory name
/loudacre/streamlog/kblogs.
11. Finally, start the Streaming context, and then call awaitTermination().
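Taken together, a hedged Python sketch of steps 6 through 11 might look like this (the
variable names are illustrative; the one-second batch interval matches the batches
described above):
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext()
ssc = StreamingContext(sc, 1)   # one-second batches

# Step 6: create a DStream from the host and port passed as arguments
logsDStream = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

# Step 7: keep only Knowledge Base requests
kbDStream = logsDStream.filter(lambda line: "KBDOC" in line)

# Step 8: display the first five records in each batch
kbDStream.pprint(5)

# Step 9: display the number of KB requests in each batch
def printCount(rdd):
    print("Number of KB requests: " + str(rdd.count()))
kbDStream.foreachRDD(printCount)

# Step 10: save the filtered logs to HDFS
kbDStream.saveAsTextFiles("/loudacre/streamlog/kblogs")

# Step 11: start the context and wait for termination
ssc.start()
ssc.awaitTermination()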
12. In a new terminal window, change to the correct directory for the language you are
using for your application.
For Python, change to the exercise directory:
$ cd $DEVSH/exercises/streaming-dstreams
$ cd \
$DEVSH/exercises/streaming-dstreams/streaminglogs_project
13. If you are using Scala, build your application JAR file using the mvn package
command.
Note: If this is your first time compiling a Spark Scala application, it may take
several minutes for Maven to download the required libraries to package the
application.
$ spark2-submit \
stubs-python/StreamingLogs.py gateway 1234
15. After a few moments, the application will connect to the test script’s simulated
stream of web server log output. Confirm that for every batch of data received
(every second), the application displays the first few Knowledge Base requests and
the count of requests in the batch. Review the HDFS files the application saved in /
loudacre/streamlog.
16. Return to the terminal window in which you started the streamtest.py test
script earlier. Stop the test script by typing Ctrl+C.
17. Return to the terminal window in which your application is running. Stop your
application by typing Ctrl+C. (You may see several error messages resulting from
the interruption of the job in Spark; you may disregard these.)
In this exercise, you will write a Spark Streaming application to count web page
requests over time.
1. Open a new gateway terminal session. This exercise uses multiple terminal
windows. To avoid confusion, you might wish to set a different title for the new
window such as “Test Stream”.
2. Stream the Loudacre Web log files at a rate of 20 lines per second using the
provided test script.
This script exits after the client disconnects, so you will need to restart the script
when you restart your Spark application.
5. Count the number of page requests over a window of five seconds. Print out the
updated five-second total every two seconds.
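A hedged Python sketch of this step, assuming logsDStream contains the web log lines
read from the socket; window() collects the last five seconds of data every two seconds,
and count() produces the updated total for each window:
countDStream = logsDStream.window(5, 2).count()
countDStream.pprint()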
6. In a different gateway terminal window than the one in which you started the
streamtest.py script, change to the correct directory for the language you are
using for your application. To avoid confusion, you might wish to set a different title
for the new window such as “Application”.
For Python, change to the exercise directory:
$ cd $DEVSH/exercises/streaming-multi
$ cd \
$DEVSH/exercises/streaming-multi/streaminglogsMB_project
7. If you are using Scala, build your application JAR file using the mvn package
command.
For Python:
$ spark2-submit \
stubs-python/StreamingLogsMB.py gateway 1234
9. After a few moments, the application should connect to the test script’s simulated
stream of web server log output. Confirm that for every batch of data received
(every second), the application displays the first few Knowledge Base requests and
the count of requests in the batch. Review the files.
10. Return to the terminal window in which you started the streamtest.py test
script earlier. Stop the test script by typing Ctrl+C.
11. Return to the terminal window in which your application is running. Stop your
application by typing Ctrl+C. (You may see several error messages resulting from
the interruption of the job in Spark; you may disregard these.)
Bonus Exercise
Extend the application you wrote above to also count the total number of page requests
by user from the start of the application, and then display the top ten users with the
highest number of requests.
Follow the steps below to implement a solution for this bonus exercise:
1. Use map-reduce to count the number of times each user made a page request in
each batch (a hit-count).
2. Define a function called updateCount that takes an array (in Python) or sequence
(in Scala) of hit-counts and an existing hit-count for a user. The function should
return the sum of the new hit-counts plus the existing count.
• Hint: You will have to swap the key (user ID) with the value (hit-count) to sort.
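A hedged Python sketch tying the bonus steps together, assuming logsDStream contains
the web log lines, the user ID is the third space-delimited field, and checkpointing has
been enabled on the StreamingContext (updateStateByKey requires it, for example with
ssc.checkpoint("checkpoints")):
# Step 1: count page requests per user in each batch
userReqsDStream = logsDStream \
    .map(lambda line: (line.split(' ')[2], 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2)

# Step 2: maintain a running total per user across batches
def updateCount(newCounts, state):
    return sum(newCounts) + (state or 0)

totalReqsDStream = userReqsDStream.updateStateByKey(updateCount)

# Swap key and value, sort by hit-count descending, and show the top ten users
topUsersDStream = totalReqsDStream \
    .map(lambda pair: (pair[1], pair[0])) \
    .transform(lambda rdd: rdd.sortByKey(False))
topUsersDStream.pprint(10)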
Note: The solution files for this bonus exercise are in the bonus package in the exercise
Maven project directory (Scala) and in solution-python/bonus in the exercise
directory (Python).
In this exercise, you will write an Apache Spark Streaming application to handle
web logs received as messages on a Kafka topic.
• For Python, start with the stub file StreamingLogsKafka.py in the stubs-
python directory.
1. Your application should accept two input arguments that the user will set when
starting the application:
3. Kafka messages are in (key, value) form, but for this application, the key is null
and only the value is needed. (The value is the web log line.) Map the DStream to
remove the key and use only the value.
4. To verify that the DStream is correctly receiving messages, display the first 10
elements in each batch.
5. For each RDD in the DStream, display the number of items—that is, the number of
requests.
Tip: Python does not allow calling print within a lambda function, so define a
named function to print.
6. Save the filtered logs to text files in HDFS. Use the base directory name
/loudacre/streamlog/kafkalogs.
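A hedged Python sketch of the Kafka-specific setup, assuming the direct-stream Kafka
integration used by the exercise project (pyspark.streaming.kafka) is available and that
ssc is an existing StreamingContext; the remaining steps follow the same pattern as the
previous streaming exercise:
import sys
from pyspark.streaming.kafka import KafkaUtils

topic = sys.argv[1]      # for example, weblogs
brokers = sys.argv[2]    # for example, worker-1:9092

# Create a DStream of (key, value) messages from the Kafka topic
kafkaStream = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers})

# Step 3: the key is null, so keep only the value (the web log line)
logsDStream = kafkaStream.map(lambda pair: pair[1])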
$ cd \
$DEVSH/exercises/streaming-kafka/streaminglogskafka_project
8. Build your application JAR file using the mvn package command.
10. Use the kafka-topics script to create a Kafka topic called weblogs from which
your application will consume messages.
11. Confirm your topic was created correctly by listing topics. Make sure weblogs is
displayed.
$ $DEVSH/scripts/streamtest-kafka.sh \
weblogs worker-1:9092 20 $DEVDATA/weblogs/*
The script will begin displaying the messages it is sending to the weblogs Kafka
topic. (You may disregard any SLF4J messages.)
13. Return to the terminal window where your Spark application is running to verify
the count output. Also review the contents of the saved files in HDFS directories
/loudacre/streamlog/kafkalogs-<time-stamp>. These directories hold
part files containing the page requests.
15. Change to the correct directory for the language you are using for your application.
$ cd $DEVSH/exercises/streaming-kafka
$ cd \
$DEVSH/exercises/streaming-kafka/streaminglogskafka_project
16. Use spark2-submit to run your application. Your application takes two
parameters: the name of the Kafka topic from which the DStream will read
messages, weblogs, and a comma-separated list of broker hosts and ports.
• For Python:
$ spark2-submit \
stubs-python/StreamingLogsKafka.py weblogs \
worker-1:9092
• For Scala:
17. Confirm that your application is correctly displaying the Kafka messages it receives,
as well as displaying the number of received messages, every second.
Note: It may take a few moments for your application to start receiving messages.
Occasionally you might find that after 30 seconds or so, it is still not receiving any
messages. If that happens, press Ctrl+C to stop the application, then restart it.
Clean Up
18. Stop the Spark application in the first terminal window by pressing Ctrl+C. (You
might see several error messages resulting from the interruption of the job in
Spark; you may disregard these.)
In this exercise, you will use Kafka’s command line tool to create a Kafka topic.
You will also use the command line producer and consumer clients to publish and
read messages.
$ kafka-topics --create \
--zookeeper master-1:2181,master-2:2181,worker-2:2181 \
--replication-factor 3 \
--partitions 2 \
--topic weblogs
2. Display all Kafka topics to confirm that the new topic you just created is listed:
$ kafka-topics --list \
--zookeeper master-1:2181,master-2:2181,worker-2:2181
$ kafka-console-producer \
--broker-list worker-1:9092,worker-2:9092,worker-3:9092 \
--topic weblogs
You will see a few SLF4J messages, at which point the producer is ready to accept
messages on the command line.
Tip: This exercise involves using multiple terminal windows. To avoid confusion,
set a different title for each one by selecting Set Title… on the Terminal menu:
5. Publish a test message to the weblogs topic by typing the message text and then
pressing Enter. For example:
6. Open a new terminal window and position it beneath the producer window. Set the
title for this window to “Kafka Consumer.”
7. In the new gateway terminal window, start a Kafka consumer that will read from
the beginning of the weblogs topic:
$ kafka-console-consumer \
--zookeeper master-1:2181,master-2:2181,worker-2:2181 \
--topic weblogs \
--from-beginning
After a few SLF4J messages, you should see the status message you sent using the
producer displayed on the consumer’s console, such as:
test weblog entry 1
8. Press Ctrl+C to stop the weblogs consumer, then restart it, this time omitting the
--from-beginning option. You should see that no messages are displayed.
9. Switch back to the producer window and type another test message into the
terminal, followed by the Enter key:
10. Return to the consumer window and verify that it now displays the alert message
you published from the producer in the previous step.
Cleaning Up
11. Press Ctrl+C in the consumer terminal window to end its process.
12. Press Ctrl+C in the producer terminal window to end its process.
In this exercise, you will run a Flume agent to ingest web log data from a local
directory to HDFS.
Apache web server logs are generally stored in files on the local machines running the
server. In this exercise, you will simulate an Apache server by placing provided web log
files into a local spool directory, and then use Flume to collect the data.
Both the local and HDFS directories must exist before using the spooling directory
source.
Configure Flume
A Flume agent configuration file has been provided for you:
$DEVSH/exercises/flume/spooldir.conf.
Review the configuration file. You do not need to edit this file. Take note in particular of
the following:
• The source is a spooling directory source that pulls from the local
/flume/weblogs_spooldir directory.
4. Start the Flume agent using the configuration you just reviewed:
$ flume-ng agent \
--conf /etc/flume-ng/conf \
--conf-file $DEVSH/exercises/flume/spooldir.conf \
--name agent1 -Dflume.root.logger=INFO,console
5. Wait a few moments for the Flume agent to start up. You will see a message like:
Component type: SOURCE, name: webserver-log-source started
$ $DEVSH/scripts/copy-move-weblogs.sh \
/flume/weblogs_spooldir
This script will create a temporary copy of the web log files and move them to the
spooldir directory.
7. Return to the terminal that is running the Flume agent and watch the logging
output. The output will give information about the files Flume is putting into HDFS.
8. Once the Flume agent has finished, enter Ctrl+C to terminate the process.
9. Using the command line or Hue File Browser, list the files that were added by the
Flume agent in the HDFS directory /loudacre/weblogs_flume.
Note that the files that were imported are tagged with a Unix
timestamp corresponding to the time the file was imported, such as
FlumeData.1427214989392.
In this exercise, you will run a Flume agent on the gateway node that ingests web
logs from a local spool directory and sends each line as a message to a Kafka
topic.
The Flume agent is configured to send messages to the weblogs topic you created
earlier.
Important: This exercise depends on two prior exercises: “Collect Web Server Logs
with Flume” and “Produce and Consume Kafka Messages.” If you did not complete both
of these exercises, run the catch-up script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
1. Review the configuration file. You do not need to edit this file. Take note in particular
of the following points:
• The source and channel configurations are identical to the ones in the “Collect
Web Server Logs with Flume” exercise: a spooling directory source that pulls
from the local /flume/weblogs_spooldir directory, and a memory channel.
• Instead of an HDFS sink, this configuration uses a Kafka sink that publishes
messages to the weblogs topic.
3. Wait a few moments for the Flume agent to start up. You will see a message like:
Component type: SINK, name: kafka-sink started
Tip: This exercise involves using multiple terminal windows. To avoid confusion,
set a different title for each window. Set the title of the current window to “Flume
Agent.”
$ kafka-console-consumer \
--zookeeper master-1:2181,master-2:2181,worker-2:2181 \
--topic weblogs
5. In a separate new gateway terminal window, run the script to place the web log
files in the /flume/weblogs_spooldir directory:
$ $DEVSH/scripts/copy-move-weblogs.sh \
/flume/weblogs_spooldir
Note: If you completed an earlier Flume exercise or ran catchup.sh, the
script will ask whether you want to clear out the spooldir directory. Be
sure to enter y when prompted.
6. In the terminal that is running the Flume agent, watch the logging output. The
output will give information about the files Flume is ingesting from the source
directory.
7. In the terminal that is running the Kafka consumer, confirm that the consumer tool
is displaying each message (that is, each line of the web log file Flume is ingesting).
8. Once the Flume agent has finished, enter Ctrl+C in both the Flume agent terminal
and the Kafka consumer terminal to end their respective processes.
In this exercise, you will import tables from MySQL into HDFS using Sqoop.
Important: This exercise depends on a previous exercise: “Accessing HDFS with
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
1. If you don’t already have one started, open a gateway terminal window.
2. Run the sqoop help command to familiarize yourself with the options in Sqoop:
$ sqoop help
$ sqoop list-tables \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training
5. Use Sqoop to import the basestations table in the loudacre database and save
it in HDFS under /loudacre:
$ sqoop import \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training \
--table basestations \
--target-dir /loudacre/basestations_import \
--null-non-string '\\N'
6. Optional: While the Sqoop job is running, try viewing it in the Hue Job Browser or
YARN Web UI, as you did in the previous exercise.
8. Use either the Hue File Browser or the -tail option to the hdfs command to view
the last part of the file for each of the MapReduce partition files, for example:
$ sqoop import \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training \
--table basestations \
--target-dir /loudacre/basestations_import_parquet \
--as-parquetfile
10. View the results of the import command by listing the contents of the
basestations_import_parquet directory in HDFS, using either Hue or the
hdfs command. Note that the Parquet files are each given unique names, such as
e8f3424e-230d-4101-abba-66b521bae8ef.parquet.
• Note: You can’t directly view the contents of the Parquet files because they are
binary files rather than text.
11. Use the parquet-tools head command to view the first few records in the set
of data files imported by Sqoop.
$ parquet-tools head \
hdfs://master-1/loudacre/basestations_import_parquet/
4. Open a new terminal window. (It must be a new terminal so it reloads your
edited .bashrc file.)
The output should include the setting below. If not, the .bashrc file was not edited
or saved properly.
PYSPARK_DRIVER_PYTHON=ipython
PYSPARK_DRIVER_PYTHON_OPTS=notebook --ip gateway --no-browser
6. Enter pyspark2 in the terminal. This will start a notebook server on the gateway
node.
8. On the right hand side of the page select Python 2 from the New menu.
9. Enter some Spark code such as the following and use the play button to execute
your Spark code.
11. To stop the Spark notebook server, enter Ctrl+C in the gateway terminal.
• Continue with the steps documented in the "Verify Your Cluster" section of the
Starting the Exercise Environment exercise.
• Login to Cloudera Manager (if the browser cannot connect, verify your web proxy
is still running). Do you see that a full cluster with the name you specified now
exists and is healthy (as indicated by green status icons)? If not, it may be that the
cluster is still being created. In Cloudera Manager, click on the Running Commands
(paper scroll) icon to see if any commands are still running. You can view All Recent
Commands as well. Verify that the Import Cluster Template command succeeded (it
should have a green checkmark next to it).
• If you want to see the messages that would have displayed in the original terminal
window where you ran create-cluster, had the network interruption not occurred,
run this command from a cmhost terminal: $ cat /home/training/config/
cluster.log
If the cluster did not get created and there are no more running commands, you can
always reuse your cmhost to create a new cluster. This will take approximately 25
additional minutes to complete. If you want to go with this option, run this command from
the /home/training directory of a cmhost terminal (then repeat the Create and Launch
the Exercise Cluster section of the Starting the Exercise Environment exercise):
$ ./create-cluster.sh
display across the top of the Cloudera Manager web UI with messages such as, "Request
to the Service Monitor failed…".
To resolve the issue, try running this command from the /home/training directory of a
cmhost terminal:
$ ./config/reset-cm.sh
• If you have "Process Status" issues where the Cloudera Manager agent is not
responding (as indicated by Hosts with unhealthy status), run this command from the
cmhost terminal:
◦ $ ./config/restart-agents.sh
◦ Allow two to three minutes after running the script for the health issues to
disappear from the CM web UI.
◦ If the restart-agent script throws errors, ensure that your cluster instances are
running. You can run Applications > Training > Start Cluster to ensure they are
running.
• If you have "Clock Offset" issues, run this command from the cmhost terminal:
◦ $ ./config/reset-clocks.sh
◦ Note: it can take two or three minutes after running the above command for the
health issues to clear from the Cloudera Manager web UI
• If you have any type of "Canary" issues, these typically clear up on their own, given
time.
• If any other issues still exist after solving any Process Status and Clock Offset issues:
◦ In Cloudera Manager, note the name of one of the services reporting the issue (e.g.
HDFS).
$ ~/bin/mount_gateway_drive.sh
You may need to restart applications such as the File Browser or gedit in order to
access the re-mounted folder.
Unable to load any cluster web pages in the browser on the Get2Cluster VM
If you are unable to load the YARN Resource Manager UI, Hue, Cloudera Manager or
other cluster pages, your proxy service may have stopped running due to network
issues. Try restarting it following the instructions in the Starting the Exercise
Environment exercise, then reload the page in your VM browser.
disconnected session. You will also have to restart your proxy server as explained in the
Starting the Exercise Environment exercise.
1. Use the Cloudera Manager bookmark on your VM browser to view the Cloudera
Manager web UI, and log in using username admin with password admin.
2. If any of the cluster services you need for the exercises are shown with anything
other than a green dot (such as a gray or red dot), restart the service by clicking on
the dropdown menu next to the service name and selecting Restart.
This screenshot shows an example in which the HDFS-1 service is stopped, and
how you would restart it.
3. After restarting the service, you may find that other services that depend on the
restarted service also need restarting, which is indicated by an icon next to the
service name. For example, Hue depends on HDFS, so after restarting HDFS, you
would need to restart Hue following the same steps. The screenshot below shows
the icon indicating that a service restart is required.
2. Select the Manage Users item on the Administration menu in the Hue menu bar.
4. In the Step 1 tab, enter the correct credentials: Username: training and
Password: training. Uncheck the box labeled Create home directory.
5. Skip step 2 by clicking on the Step 3 tab. Check the box labeled Superuser status.
Workaround 1: Try waiting for a few moments. Sometimes the configuration check
takes a while but eventually completes, either with a confirmation that all services are
working, or that one or more services might be misconfigured. If the misconfigured
services are not ones that are required for the exercises (such as HBase or Oozie), you
can continue with the exercise steps.
Workaround 2: Usually when the configuration check has not completed or warns of
a misconfigured service, the rest of Hue will still work correctly. Try completing the
exercise by going to the Hue page you want to use, such as the Query Editor or File
Browser. If those functions work, you can continue with the exercise steps.
Solution: If the workarounds above are not helpful, you may need to restart the Hue
service using Cloudera Manager. Refer to the section above called "Services on the
cluster are unavailable". Follow those steps to restart the HUE-1 service, even if the
service is displayed with a healthy (green dot) indicator in Cloudera Manager. When the
service is restarted, reload the Hue page in your browser.
FAILED org.spark-project.jetty.server.Server@69419d59:
java.net.BindException: Address already in use
This is usually because you are attempting to run two instances of the Spark shell at the
same time.
To fix this issue, exit one of your two running Spark shells.
If you do not have a terminal window running a second Spark shell, you may have one
running in the background. View the applications running on the YARN cluster using
the Hue Job Browser. Check the start times to determine which application is the one
you want to keep running. Select the other one and click kill.
After a few seconds, you should be notified that the job status is now RUNNING. If the
ACCEPTED message keeps displaying and the application or query never executes, this
means that YARN has scheduled the job to run when cluster resources become available,
but none (or too few) are currently available.
Cause: This usually happens if you are running multiple Spark applications (such as
two Spark shells, or a shell and an application) at the same time. It can also mean that a
Spark application has crashed or exited without releasing its cluster resources.
Fix: Stop running one of the Spark applications. If you cannot find a running
application, use the Hue job browser to see what applications are running on the YARN
cluster, and kill the one you do not need.