
P.S.R.

ENGINEERING COLLEGE
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Sevalpatti (P.O), Sivakasi - 626140.
Virudhunagar Dt.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING



LABORATORY MANUAL

COURSE CODE : CCS334

COURSE NAME : BIG DATA ANALYTICS LABORATORY

SEMESTER : V

ACADEMIC YEAR : 2024-2025 (ODD SEMESTER)

Faculty In-charge

A.Rajalakshmi
P.S.R.ENGINEERING COLLEGE
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Sevalpatti (P.O), Sivakasi - 626140.
Virudhunagar Dt.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

LABORATORY MANUAL

COURSE CODE : CCS334

COURSE NAME : BIG DATA ANALYTICS LABORATORY

SEMESTER : V

ACADEMIC YEAR : 2024-2025 (ODD SEMESTER)

PREPARED BY : A.RAJALAKSHMI

APPROVED BY : HOD/CSE
CCS334 BIG DATA ANALYTICS LABORATORY                L T P C : 0 0 2 1
Programme: B.TECH    Sem: 5    Category: PC
Prerequisites: -
Aim: To provide knowledge of HDFS and to work with big data problems.
Course Outcomes: The students will be able to
CO1: Set up and implement Hadoop clusters
CO2: Learn to use the Hadoop Distributed File System (HDFS) to set up single and multi-node clusters
CO3: Use MapReduce tasks for various applications
CO4: Analyze the various technologies and tools associated with Big Data
CO5: Discuss techniques common to NoSQL datastores
CO6: Propose solutions for Big Data Analytics problems
LIST OF EXPERIMENTS
1. Find the procedure to set up the Hortonworks Data Platform (HDP) / Cloudera CDH stack
2. HDFS commands
3. Find the procedure to set up single and multi-node Hadoop clusters
4. Find the procedure to load data into HDFS using Apache Flume, Apache Kafka and Apache Sqoop
5. Install and run the MongoDB server
6. Demonstrate loading unstructured data into a NoSQL store and perform operations such as NoSQL queries with an API
7. Write a weather forecasting program using MapReduce
8. Write an event detection program using Spark
9. PageRank computation
Total Hours: 60

Course Outcomes vs Program Outcomes (PO1–PO12) and Program Specific Outcomes (PSO1–PSO4)

CO1: 3 2 2 1 2 2 1 1 1
CO2: 2 2 2 1 1 2 1 2 2
CO3: 1 1 2 1 3 2 2 2 2
CO4: 1 2 3 2 3 2 2 2 2
CO5: 1 2 1 1 2 2 2 2
CO6: 2 2 2 3 2 3 2 2 2 2
Ex: 1 Install Cloudera (CDH) & Hortonworks Data Platform (HDP)
Aim:
To set up Cloudera CDH and the Hortonworks Data Platform (HDP) using VirtualBox.

Step 1: Download the VirtualBox executable file from https://www.virtualbox.org/wiki/Downloads.
Download VirtualBox 4.2.16 for Windows hosts.

Step 2: Install VirtualBox by double clicking on the downloaded file.

Step 3: Once the software installation succeeds, you will see the VirtualBox Manager window used to
manage VMs.

Step 4: Download the Cloudera QuickStart VM for VirtualBox.

Go to https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM, select the QuickStart VM
for VirtualBox and click Download.

Step 5: Unzip the downloaded file. When you unzip the file cloudera-quickstart-vm-4.3.0-virtualbox.tar you
will find these two files in the directory.

Step 6: Open VirtualBox and click on "New" to create a new virtual machine.

Give the new virtual machine a name and select Type as Linux and Version as Linux 2.6.

Step 7 : Select Memory Size as 4GB and click Next.

Step 8: On the next page, VirtualBox asks you to select a hard drive for the new VM, as shown in the
screenshot. "Create a virtual hard drive now" is selected by default, but you have to select the "Use an
existing virtual hard drive file" option.

Select “Use an existing virtual hard drive file”.

Step 9: Click on the small yellow icon beside the dropdown to browse and select the
cloudera-quickstart-vm-4.3.0-virtualbox-disk1.vmdk file (extracted in Step 5).

Click on Create to create the Cloudera QuickStart VM.

Step 10: Your VirtualBox Manager should look like the following screenshot. We can see the new virtual
machine named Cloudera Hadoop on the left side.

Step 11: Select the Cloudera VM and click "Start". The virtual machine starts to boot.

Step 12: The system is loaded and CDH is installed on the virtual machine.

Step 13: System redirects you to the index page of Cloudera.

Step 14: Select Cloudera Manager and Agree to the information assurance policy.

Step 15: Login to Cloudera Manager as admin. Password is admin.

Step 16: Click on the Hosts tab. We can see that one host is running, the version of CDH installed on it
is 4, the health of the host is good, and the last heartbeat was received 5.22 s ago.
HORTONWORKS DATA PLATFORM (HDP)
Step 1 – Launch 3 instances of t2.xlarge type
AWS offers various instance configurations. The one we are going with is t2.xlarge, which has 16 GB
of RAM. You can see below how the AWS console looks.

As you can see in the above image, we have selected CentOS 7. The next step is to select the
instance type. As stated earlier, we are going with the t2.xlarge instance type.

In the next step, we will add storage of 100 GB. Please make sure that you select the Magnetic volume
type, because SSD costs more and provides less storage. For reference, HDFS consumes about 3 GB of
raw storage to give you 1 GB of usable storage (the default replication factor is 3), so it makes more
economic sense to go with the magnetic volume type.

In the next step, we give name to the server.

After you give a name to the server, the next step is to create a security group. Here, we are allowing
all the ports so that there is no restriction.

In the next step, you need to create a new key pair.


Amazon provides a private/public key pair feature, which takes away the headache of logging in with a
password on each machine. AWS generates the key pair, gives you the private key, and stores the public
key on all the machines. This will allow you to connect to your instances securely and easily. As you can
see below, we have successfully launched the three instances.
Step 2 – Change the permission of the downloaded key so that nobody else can access it
chmod 400 hadoop-hdp-demo.pem (only the file owner can read it)

Step 3 – Give names to the servers

Name one of the nodes "hadoop-ambari-server" and the other two "hadoop-data-node".

Step 4 – Log in to each of the nodes using the downloaded private key
In this step, you need to provide the private key and the node's public IP address. For example:
ssh -i ~/hadoop-hdp-demo.pem centos@<public-ip-address>
You can locate the public IP address on the AWS console as shown in the image below.
Step 5 – Run "sudo yum update" to update the packages
This updates the packages on all the machines, because the images provided by Amazon are a bit old,
so we need to update the software on them. Yum is the package manager on Red Hat (CentOS)
machines. Run this command on every machine; a scripted way of doing the same thing is sketched below.
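Optionally, the update can be pushed to all three nodes from your laptop instead of logging in to each one by hand. The sketch below is illustrative only and is not part of the original procedure: it assumes the third-party paramiko SSH library (pip install paramiko), and the host IP addresses are placeholders to be replaced with your own public IPs.

import paramiko

HOSTS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]   # placeholder public IPs of the 3 nodes
KEY_FILE = "hadoop-hdp-demo.pem"                            # private key downloaded in Step 1

for host in HOSTS:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username="centos", key_filename=KEY_FILE)
    stdin, stdout, stderr = client.exec_command("sudo yum -y update")
    print(host, "finished with exit code:", stdout.channel.recv_exit_status())   # wait for completion
    client.close()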

Step 6 – Make centos sudoers on all machines


Sudoer is somebody who can do administrative tasks, therefore, we make centos the sudoer.

sudo visudo

centos ALL=(ALL) ALL

With this entry, centos will be able to run any command without restriction.

Step 7 – Now on each machine, verify if the hostname is properly set

hostname -f

The above command prints the fully qualified domain name of the host.

Step 8 – Edit the Network Configuration File


In this step, we are setting up the hostname properly.

sudo vi /etc/sysconfig/network

NETWORKING=yes

HOSTNAME=<fqdn>

Please make sure to replace <fqdn> with the proper hostname. You can get this value from the
command used in Step 7, i.e. hostname -f.

Result:
Thus Cloudera CDH and the Hortonworks Data Platform were installed successfully.
Ex: 2 HDFS COMMANDS
Aim:
To load big data into Hadoop Distributed File System using HDFS commands for creating
directories, moving files, adding files, deleting files, reading files and listing directories.
Commands
1. ls
Description: This command lists the files in a directory. Use -ls -R for a recursive listing; it is useful when
we want the hierarchy of a folder.

SYNTAX: bin/hdfs dfs -ls

2. mkdir
Description: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.

SYNTAX: bin/hdfs dfs -mkdir <folder name>

3. touchz
Description: It creates an empty file.

SYNTAX: hadoop fs -touchz /directory/filename

4. cp:
Description: This command is used to copy files within HDFS.

SYNTAX: bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>

5. moveFromLocal:
Description: This command will move file from local to hdfs.

SYNTAX: bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>

6. put
Description: This command is used to copy a file from the local file system to the Hadoop file system.

SYNTAX: hadoop fs -put <localsrc> <dest>

7. Get
Description: The Hadoop fs shell command get copies the file or directory from the Hadoop file
system to the local file system.

SYNTAX: hadoop fs -get <src> <localdest>

8. df
Description: It shows the capacity, size, and free space available on the HDFS file system.

SYNTAX: hadoop fs -df [-h] <path>

9. fsck
Description: Hadoop command is used to check the health of the HDFS.

SYNTAX: hadoop fsck <path> [ -move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
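As an optional illustration (not part of the original command list), the same operations can be driven from Python with the standard subprocess module; the file and directory names below are placeholders.

import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" sub-command and return its standard output as text.
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/cloudera/lab2")                                      # create a directory
hdfs("-put", "-f", "sample.txt", "/user/cloudera/lab2/")                         # add a local file
print(hdfs("-ls", "/user/cloudera/lab2"))                                        # list the directory
hdfs("-cp", "/user/cloudera/lab2/sample.txt", "/user/cloudera/lab2/copy.txt")    # copy within HDFS
hdfs("-get", "/user/cloudera/lab2/copy.txt", "copy_local.txt")                   # read back to local disk
hdfs("-rm", "-r", "/user/cloudera/lab2")                                         # delete the directory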

Result:
Thus file operations such as create, move and delete were performed in the Cloudera terminal.
Ex: 3 Single and multi-node Hadoop cluster

Aim:
To set up a single-node and multi-node Hadoop cluster on a distributed environment.
Program:
Install Hadoop
Step 1: Download the Java 8 Package. Save the file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig1: Hadoop Installation – Extracting Java Files

Step 3: Download the Hadoop 2.7.3 Package.


Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig2: Extracting Hadoop Files

Step 4: Add the Hadoop and Java paths in the bash file (.bashrc). Open the .bashrc file and add the
Hadoop and Java paths as shown below.
Command: vi .bashrc

Fig 3 Setting Environment Variable

Step 5: Then, save the bash file and close it. To apply these changes to the current terminal, execute
the source command.
Command: source .bashrc

Fig 4 Refreshing environment variables

Step 6: To make sure that Java and Hadoop have been properly installed on your system
and can be accessed through the Terminal, execute the java -version and hadoop version
commands.
Command: java -version

Fig 5 Checking Java Version

Command: hadoop version

Fig6: Checking Hadoop Version

Step 7: Edit the Hadoop configuration files.

Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls

All the Hadoop configuration files are located in hadoop-


2.7.3/etc/hadoop directory as you can see in the snapshot below:

Fig 7: Hadoop Configuration Files

Step 8: Open core-site.xml and edit the property mentioned below inside the configuration tag.
core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It contains
configuration settings of the Hadoop core, such as I/O settings, that are common to HDFS and
MapReduce.

Command: vi core-site.xml

Fig 8: Configuring core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 9: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag.
hdfs-site.xml contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode and
Secondary NameNode). It also includes the replication factor and block size of HDFS.

Command: vi hdfs-site.xml

Fig 9: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>

Step 10: Edit the mapred-site.xml file and edit the property mentioned below inside the
configuration tag. mapred-site.xml contains configuration settings of the MapReduce application,
such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes,
the CPU cores available to a process, etc. In some cases the mapred-site.xml file is not available,
so we have to create it from the mapred-site.xml.template file.

Command: cp mapred-site.xml.template mapred-site.xml


Command: vi mapred-site.xml

Fig10 : Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 11: Edit yarn-site.xml and edit the property mentioned below inside the configuration tag.
yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as
the application memory management size, the operations needed on the program and algorithm, etc.

Command: vi yarn-site.xml

Fig 11: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
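As an optional sanity check (not part of the original procedure), the short sketch below parses the four edited files and prints every property, so typos are caught before the daemons are started; adjust HADOOP_CONF to wherever your hadoop-2.7.3/etc/hadoop directory lives.

import os
import xml.etree.ElementTree as ET

HADOOP_CONF = os.path.expanduser("~/hadoop-2.7.3/etc/hadoop")   # assumed install location

for fname in ["core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml"]:
    root = ET.parse(os.path.join(HADOOP_CONF, fname)).getroot()
    print("---", fname, "---")
    for prop in root.findall("property"):
        # Each Hadoop property is a <property><name>...</name><value>...</value></property> block
        print(prop.findtext("name"), "=", prop.findtext("value"))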

Result:
Thus the single-node and multi-node Hadoop clusters were set up on a distributed environment.
Ex. No: 4A Apache Flume
Aim:

To load Twitter data into HDFS using Apache Flume.

Procedure
Step 1
To create a Twitter application, click on the following link: https://apps.twitter.com/. Sign in to
your Twitter account. You will have a Twitter Application Management window where you
can create, delete, and manage Twitter Apps.

Step 2
Click on the Create New App button. You will be redirected to a window where you will get
an application form in which you have to fill in your details in order to create the App. While
filling the website address, give the complete URL pattern, for example, http://example.com.

Step 3
Fill in the details, accept the Developer Agreement when finished, click on the Create your
Twitter application button which is at the bottom of the page. If everything goes fine, an App
will be created with the given details as shown below.

Step 4
Under keys and Access Tokens tab at the bottom of the page, you can observe a button
named Create my access token. Click on it to generate the access token.

Step 5
Finally, click on the Test OAuth button which is on the right side top of the page. This will
lead to a page which displays your Consumer key, Consumer secret, Access token, and Access
token secret. Copy these details. These are useful to configure the agent in Flume.

Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start Hadoop and
create a folder in it to store Flume data. Follow the steps given below before configuring
Flume.
Step 6: Install / Verify Hadoop
Install Hadoop. If Hadoop is already installed in your system, verify the installation using
Hadoop version command, as shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the path variable, then you will get the
following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar

Step 7: Starting Hadoop
Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file
system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Step 8: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir. Browse through it and
create a directory with the name twitter_data in the required path as shown below.
$cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/user/Hadoop/twitter_data

Configuring Flume
We have to configure the source, the channel, and the sink using the configuration file in
the conf folder. The example given in this chapter uses an experimental source provided by
Apache Flume named Twitter 1% Firehose, together with a memory channel and an HDFS sink.

Setting the classpath


Set the classpath variable to the lib folder of Flume in the flume-env.sh file as shown below.
export CLASSPATH=$CLASSPATH:/FLUME_HOME/lib/*

Example – Configuration File
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source


TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret
TwitterAgent.sources.Twitter.accessToken = Your OAuth consumer key access token
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth consumer key access token
secret
TwitterAgent.sources.Twitter.keywords = tutorials point,java, bigdata, mapreduce, mahout,
hbase, nosql

# Describing/Configuring the sink

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Describing/Configuring the channel


TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel


TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

Execution
Browse through the Flume home directory and execute the application as shown below.
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
Given below is the snapshot of the command prompt window while fetching tweets.

Verifying HDFS
You can access the Hadoop Administration Web UI using the URL given below.
http://localhost:50070/
Click on the dropdown named Utilities on the right-hand side of the page. You can see two
options as shown in the snapshot given below.
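The same check can also be done without the browser. The short sketch below is an addition (not part of the original manual): it queries the NameNode's WebHDFS REST interface and lists the files written by the Flume agent, assuming WebHDFS is enabled on the default port 50070 and using the directory created in Step 8.

import json
import urllib.request

url = "http://localhost:50070/webhdfs/v1/user/Hadoop/twitter_data?op=LISTSTATUS"
with urllib.request.urlopen(url) as resp:
    listing = json.load(resp)

# Print each file name and its size in bytes
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"], "bytes")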

Result:
The cluster ID was generated from the NameNode and the Twitter data was successfully
loaded into HDFS using Apache Flume.
Ex. No: 4B Apache Sqoop

Aim:
To import data from a relational database such as MySQL into HDFS
using Apache Sqoop.
Procedure: The following steps are involved in pulling the data present in a MySQL table
and inserting it into HDFS. We see how Sqoop is used to achieve this.

Step 1: Log in to MySQL using


mysql -u root -p

Enter the required credentials.

Step 2: Before transferring the selected data, check how many records are present in the table.
To do this, switch to the required database using:

use <database name>

Now, check the number of records present in the table whose data we want to move into
HDFS, using count(*):

select count(*) from <table name>;

Step 3: Before running the “sqoop import” command, ensure that the target directory is not
already present. Otherwise, the import command throws an error. To check this, let us try
deleting the directory that we wish to use as our target directory.

hadoop fs -rm -r <target directory>

Step 4: Let us now do the “sqoop import” job to pull the selected data from MySQL and insert
it into HDFS using the command:

sqoop import \
--connect jdbc:mysql://localhost/<database name> \
--table <table name> \
--username <username> --password <password> \
--target-dir <target location in HDFS> \
-m <number of mapper jobs you wish to create>

In our example, the database is “test,” selected data is in table “retailinfo,” and our target
location in HDFS is in the directory “/user/root/online_basic_command.” We proceeded with
only one mapper job.

Upon successful data transfer, the output looks similar to:

We can see that all the records in our database table are retrieved into HDFS.
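As an optional cross-check (not one of the original steps), the row count that landed in HDFS can be compared with the count seen in MySQL in Step 2; the target directory below is the one used in this example.

import subprocess

target_dir = "/user/root/online_basic_command"   # target-dir used in the example above
out = subprocess.run(["hdfs", "dfs", "-cat", target_dir + "/part-m-*"],
                     capture_output=True, text=True, check=True).stdout
rows = [line for line in out.splitlines() if line.strip()]
print("Rows imported into HDFS:", len(rows))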

Result:
Thus Sqoop imported data from an RDBMS (Relational Database Management
System) such as MySQL into HDFS (Hadoop Distributed File System).
Ex. No: 4C Apache Kafka

Aim:
To load data into HDFS using Apache Kafka.

Procedure

Step 1: Install Java. Apache Kafka requires Java. To ensure that Java is installed, first update the
operating system and then install it:
sudo apt-get update
sudo apt-get upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer

Step 2: Download Kafka

First, we need to download the Kafka binaries package.

wget http://www-eu.apache.org/dist/kafka/1.1.1/kafka_2.11-1.1.1.tgz

Step 3: Now, we need to extract the tgz package. We choose to install Kafka in the /opt/kafka
directory:
sudo tar xvf kafka_2.11-1.1.1.tgz --directory /opt/kafka --strip-components 1

Step 4: Kafka stores its logs on disk in the /tmp directory; it is better to create a new directory to
store the logs:
sudo mkdir /var/lib/kafka
sudo mkdir /var/lib/kafka/data

Step 5: Configure Kafka. Now, we need to edit the Kafka server configuration file:
sudo gedit /opt/kafka/config/server.properties
By default, Kafka does not allow us to delete topics. To be able to delete topics, find the following
line and change it (if it is not found, just add it), and point the log directory at the folder created
in Step 4:
delete.topic.enable = true
log.dirs=/var/lib/kafka/data

Step 6: In addition, we can adjust the time interval for logs deletion (Kafka deletes logs after a
particular time or according to disk size):
log.retention.hours=168        # according to time
log.retention.bytes=104857600  # according to disk size

Step 7: We need to give the kafkauser access to the logs directory and the Kafka installation
directory:
sudo chown -R kafkauser:nogroup /opt/kafka
sudo chown -R kafkauser:nogroup /var/lib/kafka

Step 8: To start the Apache Kafka service, you can use the following command. You should see the
output below if the server has started successfully. To start Kafka as a background process, you can
use the nohup command.
sudo /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
[2018-07-23 21:43:48,279] WARN No meta.properties file under dir /var/lib/kafka/data/meta.properties (kafka.server.BrokerMetadataCheckpoint)
[2018-07-23 21:43:48,516] INFO Kafka version : 0.10.0.1 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,525] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,527] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2018-07-23 21:43:48,555] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

sudo nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties > /var/lib/kafka/data/kafka.log 2>&1 &

Step 9: The Kafka server is now running and listening on port 9092.

Step 10: Import a text file into a Kafka topic. To pipe a text file into a Kafka topic, use the cat
command with a pipeline:
cat filename.txt | /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testKafka
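The same ingestion can also be scripted. The sketch below is an illustration (not from the original manual); it assumes the third-party kafka-python package (pip install kafka-python), the broker on localhost:9092 and the topic name testKafka used above.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
with open("filename.txt", "rb") as f:
    for line in f:
        producer.send("testKafka", value=line.rstrip(b"\n"))   # one message per line of the file
producer.flush()    # block until all buffered messages have been sent
producer.close()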

Result:
Thus Kafka, acting as an intermediate data store, makes it easy to replay ingestion,
consume datasets across multiple applications, and perform data analysis.
Ex: 5 INSTALL AND RUN MONGODB
Aim:
To install and run the MongoDB database server.

Procedure:
Step 1 — Download the MongoDB MSI Installer Package
Download the current version of MongoDB. Make sure you select MSI as the package you want
to download.

Step 2 — Install MongoDB with the Installation Wizard


A. Make sure you are logged in as a user with Admin privileges. Then navigate to your
downloads folder and double click on the .msi package you just downloaded. This will
launch the installation wizard

B. Click Next to start installation.

C. Accept the licence agreement then click Next.

D. Select the Complete setup.

E. Select “Run service as Network Service user” and make a note of the data directory.

F. We won’t need Mongo Compass, so deselect it and click Next.

G. Click Install to begin installation.

H. Hit Finish to complete installation.

Step 3— Create the Data Folders to Store our Databases


A. Navigate to the C Drive on your computer using Explorer and create a new folder
called data here.

B. Inside the data folder you just created, create another folder called db.

Result:
Thus MongoDB was installed and run successfully.
Ex: 6 Unstructured data into NoSQL using MongoDB with an API
Aim:
To demonstrate loading unstructured data into a NoSQL database and to perform operations on it,
such as NoSQL queries, through an API.
Program:
Step 1: Start MongoDB in a terminal.
Step 2: Create a database and a collection to insert the necessary data.
Step 3: The db.student.insert() method is used to add or insert new documents into a collection in your
database.
Step 4: The db.student.find() method is used to view the documents in the collection.
Step 5: The db.student.update() method is used to update documents in the collection.
Step 5: db.student.update() method using update the row of table
Microsoft Windows [Version 10.0.17134.1488]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\Administrator>cd ..

C:\Users>cd ..

C:\>cd "Program Files (x86)"

C:\Program Files (x86)>cd "MongoDB"

C:\Program Files (x86)\MongoDB>cd Server

C:\Program Files (x86)\MongoDB\Server>cd 3.0

C:\Program Files (x86)\MongoDB\Server\3.0>cd bin

C:\Program Files (x86)\MongoDB\Server\3.0\bin>mongo


MongoDB shell version: 3.0.13-rc0
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
http://docs.mongodb.org/
Questions? Try the support group
http://groups.google.com/group/mongodb-user
Server has startup warnings:
2021-09-13T15:03:04.319+0530 I CONTROL [initandlisten] ** NOTE: This is a 32-bit MongoDB binary running on a 64-bit operating
2021-09-13T15:03:04.319+0530 I CONTROL [initandlisten] **       system. Switch to a 64-bit build of MongoDB to
2021-09-13T15:03:04.319+0530 I CONTROL [initandlisten] **       support larger databases.
2021-09-13T15:03:04.319+0530 I CONTROL [initandlisten]
2021-09-13T15:03:04.319+0530 I CONTROL [initandlisten]
2021-09-13T15:03:04.320+0530 I CONTROL [initandlisten] ** NOTE: This is a 32 bit MongoDB binary.
2021-09-13T15:03:04.320+0530 I CONTROL [initandlisten] **       32 bit builds are limited to less than 2GB of data (or less with --journal).
2021-09-13T15:03:04.321+0530 I CONTROL [initandlisten] **       Note that journaling defaults to off for 32 bit and is currently off.
2021-09-13T15:03:04.321+0530 I CONTROL [initandlisten] **       See http://dochub.mongodb.org/core/32bit
2021-09-13T15:03:04.321+0530 I CONTROL [initandlisten]

Create Database
> use big-data
switched to db big-data
View Database
> db
big-data
Show Database
> show dbs
dept 0.078GB
local 0.078GB
Create Table
> db.student.insert({name:'sri'})
WriteResult({ "nInserted" : 1 })
View Table
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
> db.student.insert({name:'sri',rollno:'19cs0023'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()

{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }

Update Row
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1
})
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.student.find()

{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }

{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
>
MongoDB Delete Documents
> db.student.remove({})
WriteResult({ "nRemoved" : 5
})
> db.student.find()
Drop Database:
> db.dropDatabase()
{ "ok" : 1 }
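The same CRUD operations can be performed through a programmatic API rather than the mongo shell. The sketch below uses the official PyMongo driver (pip install pymongo) and assumes mongod is running locally on the default port; the database, collection and field names follow the shell session above.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["big-data"]
student = db["student"]

student.insert_one({"name": "sri", "rollno": "19cs0023", "course": "android"})   # insert a document
for doc in student.find({"rollno": "19cs0023"}):                                  # query documents
    print(doc)
student.update_one({"course": "android"}, {"$set": {"course": "big-data"}})       # update one document
student.delete_many({})                                                           # remove all documents
client.drop_database("big-data")                                                  # drop the database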

Result:
Thus unstructured data was loaded into a NoSQL database and operations such as NoSQL
queries through an API were executed.
Ex: 8 Write an event detection program using Spark
Aim:
To predict whether a person has heart disease or not using Apache Spark.
Procedure:
Step 1: Obtain the heart disease dataset (datasets_heart.csv).

Step 2: Spark can read data in different formats like CSV, Parquet, Avro and JSON.
Here the data is in CSV format and read by using the below code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heart_disease_classification").getOrCreate()

# Reading data from CSV and creating a Spark DataFrame
df = spark.read.csv("datasets_heart.csv",
                    inferSchema=True,
                    header=True)
df.show(2)

df.groupBy('target').count().show()

+------+-----+
|target|count|
+------+-----+
|     1|  165|
|     0|  138|
+------+-----+

Step 3: Check each column of the DataFrame for null values. We can drop rows if the percentage of
null values is very small; in this case our dataset is clean, with no null values.
from pyspark.sql.functions import *

def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if nullRows > 0:
            temp = k, nullRows, (nullRows/numRows)*100
            null_columns_counts.append(temp)
    return null_columns_counts

null_columns_calc_list = null_value_calc(df)
if null_columns_calc_list:
    spark.createDataFrame(null_columns_calc_list,
                          ['Column_Name', 'Null_Values_Count', 'Null_Value_Percent']).show()
else:
    print("Data is clean with no null values")

Step 4: Skewness measures how much a distribution of values deviates from symmetry around the
mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a
greater number of smaller values, and a negative value indicates a greater number of larger values.

P a g e | 53
from pyspark.sql.functions import *

# 'numeric_inputs' and 'indexed' come from earlier feature-preparation steps
# of the full notebook (not shown in this excerpt).
d = {}
# Create a dictionary of quantiles
for col in numeric_inputs:
    # If you want it to run faster, increase the last number (the relative error)
    d[col] = indexed.approxQuantile(col, [0.01, 0.99], 0.25)
# Now fill in the values
for col in numeric_inputs:
    skew = indexed.agg(skewness(indexed[col])).collect()  # check for skewness

# Hyper-parameter grid for a tree-based classifier (e.g. RandomForestClassifier)
paramGrid = (ParamGridBuilder() \
    .addGrid(classifier.maxDepth, [2, 5, 10])
    # .addGrid(classifier.maxBins, [5, 10, 20])
    # .addGrid(classifier.numTrees, [5, 20, 50])
    .build())

# Add parameters of your choice here:
if Mtype in ("GBTClassifier"):
    paramGrid = (ParamGridBuilder() \
        # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
        # .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
        .addGrid(classifier.maxIter, [10, 15, 50, 100])
        .build())

# Add parameters of your choice here:
if Mtype in ("DecisionTreeClassifier"):
    paramGrid = (ParamGridBuilder() \
        # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
        .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
        .build())

# Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=folds)  # 3 or more folds is best practice

# Fit Model: run cross-validation and choose the best set of parameters.
fitModel = crossval.fit(train)

Step 5: Logistic Regression has the highest prediction score, so we can use the best Logistic
Regression model to create our final model.

# Select the top n features and view the results
n = 72
# For Logistic Regression or One-vs-Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(data).transform(data)
bestFeaturesDf = bestFeaturesDf.select("label", "selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures", "features")
# Collect features
features = bestFeaturesDf.select(['features']).collect()
# Split into training and test sets
train, test = bestFeaturesDf.randomSplit([0.7, 0.3])
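For reference, the fragments above can be tied together into one small end-to-end script. The sketch below is a simplified stand-in for the full notebook (it skips the quantile, skewness and cross-validation steps) and assumes the label column is named "target", as in the dataset read in Step 2.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("heart_disease_classification").getOrCreate()
df = spark.read.csv("datasets_heart.csv", inferSchema=True, header=True)

# Assemble every column except the label into a single feature vector
feature_cols = [c for c in df.columns if c != "target"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).select("features", df["target"].alias("label"))

train, test = data.randomSplit([0.7, 0.3], seed=42)
model = LogisticRegression(maxIter=50).fit(train)
predictions = model.transform(test)

auc = BinaryClassificationEvaluator().evaluate(predictions)   # area under the ROC curve
print("Test AUC:", auc)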

Result:
Thus heart disease was predicted using Apache Spark (logistic regression) with high accuracy.
Ex: 9 Page Rank Computation
Aim:
To compute PageRank in RStudio.

Procedure:
Step 1: Open the R tool (RStudio).
Step 2: Define a small universe of web pages as a directed graph (the program below uses six pages).
Step 3: Multiple outbound links from one page to the same page are ignored.

PageRank is initialized to the same value for all pages.
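The program below implements the damped PageRank formulation. With M the column-stochastic link matrix (t(M / rowSums(M)) in the code), n the number of pages, U the n x n matrix whose entries are all 1/n, and beta = 0.85, the rank vector r satisfies

r = beta * M * r + ((1 - beta) / n) * 1,   i.e.   r is the leading eigenvector of A = beta*M + (1 - beta)*U.

This is why the same answer is obtained three ways in the code: from eigen(A), from igraph's page.rank(), and from repeatedly multiplying A into a uniform starting vector (A %^% 100 %*% r).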

Program:

> library(igraph)
> library(expm)
> g <- graph(c(
+   1, 2, 1, 3, 1, 4,
+   2, 3, 2, 6, 3, 1,
+   3, 5, 4, 2, 4, 1,
+   4, 5, 5, 2, 5, 6,
+   6, 3, 6, 4),
+   directed=TRUE)
> M = get.adjacency(g, sparse = FALSE)
> M = t(M / rowSums(M))
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> e = eigen(A)
> v <- e$vec[,1]
> v <- as.numeric(v) / sum(as.numeric(v))
> v
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> page.rank(g)$vector
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> library(expm)
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> r = matrix(data=rep(1/n, n), nrow=n, ncol=1)
> t(A%^%100 %*% r)

Output
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.1547936 0.173992 0.2128167 0.1388701 0.1547936 0.1647339

Result:
Thus the PageRank computation was implemented in RStudio.
