BIG DATA ANALYTICS LABORATORY

P.S.R. ENGINEERING COLLEGE
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Sevalpatti (P.O), Sivakasi - 626140.
Virudhunagar Dt.

LABORATORY MANUAL
SEMESTER: V

Faculty In-charge: A. Rajalakshmi

PREPARED BY: A. Rajalakshmi          APPROVED BY: HOD/CSE
CCS334 BIG DATA ANALYTICS LABORATORY          L T P C: 0 0 2 1
Programme: B.TECH    Sem: 5    Category: PC
Prerequisites: -
Aim: To provide knowledge of HDFS and to work with big data problems.
Course Outcomes: The Students will be able to
CO1: Set up and implement Hadoop clusters
CO2: Learn to use the Hadoop Distributed File System (HDFS) to set up single and multi-node clusters
CO3: Use MapReduce tasks for various applications
CO4: Analyze the various technologies & tools associated with Big Data
CO5: Discuss techniques common to NoSQL datastores
CO6: Propose solutions for Big Data Analytics problems
LIST OF EXPERIMENTS
1. Find procedure to set up Hortonworks Data Platform (HDP) - Cloudera CDH stack
2. HDFS commands
3. Find procedure to set up single and multi-node Hadoop cluster
4. Find procedure to load data into HDFS using Apache Flume, Apache Kafka and Apache Sqoop
5. Install & run the MongoDB server
6. Demonstrate loading unstructured data into a NoSQL datastore and perform all operations such as NoSQL queries with an API
7. Write a weather forecasting program using MapReduce
8. Write an event detection program using Spark
9. PageRank computation
Total Hours: 60
Ex: 1 Install Cloudera (CDH) & Hortonworks Data Platform (HDP)
Aim:
To set up Cloudera (CDH) using VirtualBox and the Hortonworks Data Platform (HDP) on AWS.
Step 3: Once the software installation is successful, you will see the VirtualBox Manager window to manage VMs.
Step 4: Select the QuickStart VM for VirtualBox and click on Download.
Step 5: Extract the downloaded file. When you extract the file cloudera-quickstart-vm-4.3.0-virtualbox.tar, you will find these two files in the directory.
Step 6: Open VirtualBox and click on "New" to create a new virtual machine.
Step 7: Give a name to the new virtual machine and select the type as Linux and the version as Linux 2.6.
Step 8: On the next page, VirtualBox asks you to select a hard drive for the new virtual machine, as shown in the screenshot. "Create a virtual hard drive now" is selected by default, but you have to select the "Use an existing virtual hard drive file" option.
Select “Use an existing virtual hard drive file”.
Step 9: Click on the small yellow icon beside the dropdown to browse and select the cloudera-quickstart-vm-4.3.0-virtualbox-disk1.vmdk file (which was downloaded in Step 4).
Click on Create to create the Cloudera QuickStart VM.
Step 10: Your VirtualBox Manager should look like the following screenshot. We can see the new virtual machine named Cloudera Hadoop on the left side.
Step 11: Select the Cloudera VM and click on "Start". The virtual machine starts to boot.
Step 12: The system is loaded and CDH is installed on the virtual machine.
Step 14: Select Cloudera Manager and agree to the information assurance policy.
Step 15: Log in to Cloudera Manager as admin; the password is admin.
Step 16: Click on the Hosts tab. We can see that one host is running, the version of CDH installed on it is 4, the health of the host is good, and the last heartbeat was heard 5.22 s ago.
HORTONWORKS DATA PLATFORMS
Step 1 – Launch 3 instances of t2.xlarge type
AWS offers various instance configurations. The one we are going with has 16 GB RAM, which is the t2.xlarge type. You can see below how the AWS console will look.
As you can see in the above image, we have selected CentOS 7. The next step is to select the instance type; as stated earlier, we are going with the t2.xlarge instance type.
In the next step, we will add storage of 100 GB. Please make sure that you select the Magnetic volume type, because SSD will cost more and provide less storage. For reference, HDFS consumes 3 GB of raw storage to give you 1 GB of usable storage (because of 3x replication), so it makes more sense economically to go with the magnetic volume type.
After you give a name to the server, the next step is to create a security group. Here, we are allowing
all the ports so that there is no restriction.
Step 2 – Change permission of the downloaded key so that nobody else can access it
chmod 400 hadoop-hdp-demo.pem (this makes the key readable only by its owner, so nobody else can read or modify it)
Step 4 – Log in to each of the nodes using the downloaded private key
In this step, you need to provide the private key and the public IP address of the node. For example: ssh -i ~/hadoop-demo.pem centos@<public-ip>
You can locate the Public IP address on AWS server as shown in the image below
Step 5 – Run “sudo yum update” to update the packages
It will update the packages on all the machines, because the images provided by Amazon are a bit old, so we need to update the software on these machines. Yum is the package manager on Red Hat (CentOS) machines. You will run this command on all of the machines.
sudo visudo
With this command you can edit the sudoers file so that the centos user is able to run commands anywhere without any restriction (see the sketch below).
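A typical sudoers entry for this (an assumption; the exact line is not reproduced in this manual) would be:
centos ALL=(ALL) NOPASSWD: ALL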
hostname -f
The above command gives the fully qualified domain name (FQDN) of the host.
sudo vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<fqdn>
Please make sure to replace <fqdn> with a proper hostname. You can get this value from the hostname -f command used above.
Result:
Thus Cloudera (CDH) and the Hortonworks Data Platform (HDP) were installed successfully.
Ex: 2 HDFS COMMANDS
Aim:
To load big data into Hadoop Distributed File System using HDFS commands for creating
directories, moving files, adding files, deleting files, reading files and listing directories.
Commands
1. ls
Description: This command is used to list all the files. Use ls -R (or the older lsr) for a recursive listing; it is useful when we want the hierarchy of a folder.
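For example (the paths are illustrative):
hadoop fs -ls /user
hadoop fs -ls -R /user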
2. mkdir
Description: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.
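For example (the user directory name is an assumption for a typical Cloudera setup):
hadoop fs -mkdir -p /user/cloudera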
3. touchz
Description: It creates an empty file.
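For example (the file name is illustrative):
hadoop fs -touchz /user/cloudera/empty.txt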
4. cp:
Description: This command is used to copy files within HDFS.
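For example (paths illustrative):
hadoop fs -cp /user/cloudera/file1.txt /user/cloudera/backup/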
5. moveFromLocal:
Description: This command moves a file from the local file system to HDFS (the local copy is removed).
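For example (paths illustrative):
hadoop fs -moveFromLocal /home/cloudera/sample.txt /user/cloudera/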
6. Put
Description: This command is used to copy localfile1 of the local file system to the Hadoop filesystem.
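For example, assuming localfile1 exists in the current local directory:
hadoop fs -put localfile1 /user/cloudera/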
7. Get
Description: The Hadoop fs shell command get copies the file or directory from the Hadoop file
system to the local file system.
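For example (paths illustrative):
hadoop fs -get /user/cloudera/file1.txt /home/cloudera/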
8. df
Description: It shows the capacity, size, and free space available on the HDFS file system.
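For example (the -h flag prints human-readable sizes):
hadoop fs -df -h /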
9. fsck
Description: This Hadoop command is used to check the health of the HDFS file system.
SYNTAX: hadoop fsck <path> [ -move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
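For example, to check the whole file system and list its files and blocks:
hadoop fsck / -files -blocks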
Result:
Thus the file operations such as create, move and delete were performed in the Cloudera terminal.
Ex: 3 Single and multi-node Hadoop cluster
Aim:
To set up single and multi-node Hadoop clusters in a distributed environment.
Program:
Install Hadoop
Step 1: Download the Java 8 Package. Save the file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 4: Add the Hadoop and Java paths in the bash file (.bashrc). Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
Fig 3 Setting Environment Variable
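The environment variables typically added here look like the following (the install paths are assumptions based on the versions used in this exercise; adjust them to your actual directories):
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin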
Step 5: Then, save the bash file and close it. To apply all these changes to the current terminal, execute the source command.
Command: source .bashrc
Step 6: To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal, execute the java -version and hadoop version commands.
Command: java -version
Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag. core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings, that are common to HDFS and MapReduce.
Command: vi core-site.xml
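A typical property for a single-node setup looks like the following minimal sketch (the port matches the hdfs://localhost:9000 URI used elsewhere in this manual):
<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
 </property>
</configuration>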
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag. hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
Command: vi hdfs-site.xml
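For a single-node cluster, a minimal sketch of the property usually set here (a replication factor of 1 is an assumption for a single node):
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
</configuration>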
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside the configuration tag. mapred-site.xml contains the configuration settings of the MapReduce application, like the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc. In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml.template file.
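The template can be copied and then opened for editing as follows (a sketch; run inside the hadoop-2.7.3/etc/hadoop directory used in Step 6):
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml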
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 10: Edit yarn-site.xml and edit the property mentioned below inside the configuration tag. yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, like the application memory management size, the operations needed on programs and algorithms, etc.
Command: vi yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Result:
Thus the single-node and multi-node Hadoop clusters were installed in a distributed environment.
Ex. No: 4A Apache Flume
Aim:
To fetch Twitter data and load it into HDFS using Apache Flume.
Step 1
To create a Twitter application, click on the following link https://apps.twitter.com/. Sign in to
your Twitter account. You will have a Twitter Application Management window where you
can create, delete, and manage Twitter Apps.
Step 2
Click on the Create New App button. You will be redirected to a window where you will get
an application form in which you have to fill in your details in order to create the App. While
filling the website address, give the complete URL pattern, for example, http://example.com.
Step 3
Fill in the details, accept the Developer Agreement when finished, click on the Create your
Twitter application button which is at the bottom of the page. If everything goes fine, an App
will be created with the given details as shown below.
Step 4
Under keys and Access Tokens tab at the bottom of the page, you can observe a button
named Create my access token. Click on it to generate the access token.
Step 5
Finally, click on the Test OAuth button which is on the right side top of the page. This will
lead to a page which displays your Consumer key, Consumer secret, Access token, and Access
token secret. Copy these details. These are useful to configure the agent in Flume.
Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start Hadoop and
create a folder in it to store Flume data. Follow the steps given below before configuring
Flume.
Step 6: Install / Verify Hadoop
Install Hadoop. If Hadoop is already installed in your system, verify the installation using
Hadoop version command, as shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the path variable, then you will get the
following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar
Step 7: Starting Hadoop
Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file
system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out
$ start-yarn.sh
starting yarn
daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Step 8: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir. Browse through it and
create a directory with the name twitter_data in the required path as shown below.
$cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/user/Hadoop/twitter_data
Configuring Flume
We have to configure the source, the channel, and the sink using the configuration file in
the conf folder. The example given in this chapter uses an experimental source provided by
Apache Flume named Twitter 1% Firehose, together with a memory channel and an HDFS sink.
Example – Configuration File
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
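The file above only names the agent components and configures the HDFS sink; the source and channel sections are also required. A minimal sketch is given below (the keyword list is illustrative, and the placeholder credentials must be replaced with the values obtained in Step 5):
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# Bind the source and the sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel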
Execution
Browse through the Flume home directory and execute the application as shown below.
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
Given below is the snapshot of the command prompt window while fetching tweets.
Verifying HDFS
You can access the Hadoop Administration Web UI using the URL given below.
http://localhost:50070/
Click on the dropdown named Utilities on the right-hand side of the page. You can see two
options as shown in the snapshot given below.
Result:
Thus the cluster ID was generated from the NameNode and the Twitter data was successfully loaded into HDFS using Apache Flume.
Ex. No: 4B Apache Sqoop
Aim:
To import data from a relational database such as MySQL into HDFS using Apache Sqoop.
Procedure: The following steps are involved in pulling the data present in the MySQL table and inserting it into HDFS. We will see how Sqoop is used to achieve this.
Step 2: Before transferring the selected data, check how many records are present in the table. To do this, change to the required database using the use <database name>; command.
Now, check the number of records present in the table whose data we intend to move into HDFS, using count(*): select count(*) from <table name>;
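For example, with the database and table used later in this exercise:
mysql> use test;
mysql> select count(*) from retailinfo;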
Step 3: Before running the “sqoop import” command, ensure that the target directory is not
already present. Otherwise, the import command throws an error. To check this, let us try
deleting the directory that we wish to use as our target directory.
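For example, the target directory used in this exercise can be removed (assuming it is safe to delete) with:
hadoop fs -rm -r /user/root/online_basic_command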
Step 4: Let us now do the “sqoop import” job to pull the selected data from MySQL and insert
it into HDFS using the command:
sqoop import \
--connect jdbc:mysql://localhost/<database name> \
--table <table name> \
--username <username> --password <password> \
--target-dir <target location in HDFS> \
-m <no. of Mapper jobs you wish to create>
In our example, the database is “test,” selected data is in table “retailinfo,” and our target
location in HDFS is in the directory “/user/root/online_basic_command.” We proceeded with
only one mapper job.
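Substituting those values into the template above, the command would look like the following sketch (the username and password are placeholders):
sqoop import \
--connect jdbc:mysql://localhost/test \
--table retailinfo \
--username <username> --password <password> \
--target-dir /user/root/online_basic_command \
-m 1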
Upon successful data transfer, the output looks similar to:
We can see that all the records in our database table are retrieved into HDFS.
Result:
Thus the Sqoop imported data from RDBMS (Relational Database Management
System) such as MySQL to HDFS (Hadoop Distributed File System).
Ex. No: 4C Apache Kafka
Aim:
To load data into HDFS using Apache Kafka.
Procedure
Step 1: Install Java. Apache Kafka requires Java. To ensure that Java is installed, first update the operating system and then install it:
sudo apt-get update
sudo apt-get upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
Step 2: Download Kafka
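For example, the release used in the next step can be fetched from the Apache archive (the exact URL/mirror is an assumption; pick the version you need):
wget https://archive.apache.org/dist/kafka/1.1.1/kafka_2.11-1.1.1.tgz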
Step 3: Now, we need to extract the tgz package. We choose to install Kafka in /opt/kafka
directory:
sudo tar xvf kafka_2.11-1.1.1.tgz --directory /opt/kafka --strip 1
(If /opt/kafka does not exist yet, create it first with sudo mkdir /opt/kafka.)
Step 4: Kafka stores its logs on disk in the /tmp directory by default; it is better to create a new directory to store the logs:
sudo mkdir /var/lib/kafka
sudo mkdir /var/lib/kafka/data
Step 5: Configure Kafka. We need to edit the Kafka server configuration file:
sudo gedit /opt/kafka/config/server.properties
By default, Kafka does not allow us to delete topics. To be able to delete topics, find the delete.topic.enable line and change it (if it is not found, just add it), and point log.dirs at the directory created in Step 4:
delete.topic.enable = true
log.dirs=/var/lib/kafka/data
Step 6: In addition, we can adjust the time interval for logs deletion (Kafka deletes logs after a
particular time or according to disk size):
log.retention.hours=168        # according to time
log.retention.bytes=104857600  # according to disk size
Step 7: We need to give the kafkauser account access to the logs directory and the Kafka installation directory:
sudo chown -R kafkauser:nogroup /opt/kafka
sudo chown -R kafkauser:nogroup /var/lib/kafka
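Note that the kafkauser account is assumed to exist already; if it does not, it can be created first, for example:
sudo useradd -m kafkauser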
Step 8: To start the Apache Kafka service, you can use the following command. You should see output like the sample below if the server has started successfully. To start Kafka as a background process, you can use the nohup command.
sudo /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
[2018-07-23 21:43:48,279] WARN No meta.properties file under dir /var/lib/kafka/data/meta.properties (kafka.server.BrokerMetadataCheckpoint)
[2018-07-23 21:43:48,516] INFO Kafka version : 0.10.0.1 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,525] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,527] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2018-07-23 21:43:48,555] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
sudo nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties > /var/lib/kafka/data/kafka.log 2>&1 &
Step 9: The Kafka server is now running and listening on port 9092.
Step 10: Import a text file into a Kafka topic. To send a text file into a Kafka topic, use the cat command with a pipeline:
cat filename.txt | /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testKafka
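To verify that the lines were ingested, they can be read back with the console consumer (default settings assumed):
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testKafka --from-beginning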
Result:
Thus Kafka was installed and used as an intermediate data store, which makes it easy to replay ingestion, consume datasets across multiple applications, and perform data analysis.
Ex: 5 INSTALL AND RUN MONGODB
Aim:
To install and run the MongoDB database server.
Procedure:
Step 1 — Download the MongoDB MSI Installer Package
Download the current version of MongoDB. Make sure you select MSI as the package you want
to download.
Step 2 — Run the MongoDB Installer
B. Click Next to start the installation.
C. Accept the licence agreement, then click Next.
D. Select the Complete setup.
E. Select "Run service as Network Service user" and make a note of the data directory.
F. We won't need MongoDB Compass, so deselect it and click Next.
G. Click Install to begin the installation.
H. Hit Finish to complete the installation.
Step 3 — Create the Data Folders to Store our Databases
A. Navigate to the C: drive on your computer using Explorer and create a new folder called data there.
B. Inside the data folder you just created, create another folder called db.
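With the db folder in place, the server can be started against it. A minimal sketch (the installation path and version below are assumptions; adjust them to match your install):
"C:\Program Files\MongoDB\Server\<version>\bin\mongod.exe" --dbpath "C:\data\db"
In another command prompt, the mongo shell can then be used to connect:
"C:\Program Files\MongoDB\Server\<version>\bin\mongo.exe"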
Result:
Thus MongoDB was installed and the MongoDB server was run successfully.
Ex: 6 Unstructured data into NoSQL using MongoDB with an API
Aim:
To demonstrate storing unstructured data in a NoSQL database and to perform operations such as NoSQL queries with an API.
Program:
Step 1: Start MongoDB in the terminal.
Step 2: Create a database and a collection to insert the necessary data.
Step 3: The db.student.insert() method is used to add or insert new documents into a collection in your database.
Step 4: The db.student.find() method is used to view the documents (rows) of the collection.
Step 5: The db.student.update() method is used to update documents in the collection.
Microsoft Windows [Version 10.0.17134.1488]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\Administrator>cd ..
C:\Users>cd ..
Create Database
> use big-data
switched to db big-data
View Database
> db
big-data
Show Database
> show dbs
dept 0.078GB
local 0.078GB
Create Table
> db.student.insert({name:'sri'})
WriteResult({ "nInserted" : 1 })
View Table
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
> db.student.insert({name:'sri',rollno:'19cs0023'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
Update Row
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
>
MongoDB Delete Documents
> db.student.remove({})
WriteResult({ "nRemoved" : 5 })
> db.student.find()
Drop Database:
> db.dropDatabase()
{ "ok" : 1 }
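The same operations can also be driven through an API instead of the mongo shell. A minimal sketch using the Python pymongo driver (the connection string and the driver choice are assumptions; they are not part of the original transcript):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # local MongoDB server
db = client["big-data"]                              # same database as above
student = db["student"]                              # same collection as above

# insert a document
student.insert_one({"name": "sri", "rollno": "19cs0023", "course": "android"})

# query documents
for doc in student.find({"name": "sri"}):
    print(doc)

# update matching documents
student.update_many({"course": "android"}, {"$set": {"course": "big-data"}})

# delete all documents
student.delete_many({})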
Result:
Thus the program for loading unstructured data into a NoSQL database and performing operations such as NoSQL queries with an API was executed.
Ex: 8 Write an event detection program using Spark
Aim:
To predict whether the person has heart disease or not by using Apache Spark.
Procedure:
Step 1: Dataset — the heart disease dataset, stored as a CSV file, is used as input.
Step 2: Spark can read data in different formats like CSV, Parquet, Avro and JSON.
Here the data is in CSV format and read by using the below code.
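The read step itself is not reproduced in this manual; a minimal PySpark sketch (the file name heart.csv is an assumption) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HeartDisease").getOrCreate()
# header=True uses the first row as column names; inferSchema=True detects numeric types
df = spark.read.csv("heart.csv", header=True, inferSchema=True)
df.printSchema()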
df.groupBy('target').count().show()

+------+-----+
|target|count|
+------+-----+
|     1|  165|
|     0|  138|
+------+-----+
Step 3: Check the columns in the dataframe for null values. We can drop rows if the percentage of null values is very small; here, our dataset is clean with no null values.
from pyspark.sql.functions import *

def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if(nullRows > 0):
            temp = k, nullRows, (nullRows/numRows)*100
            null_columns_counts.append(temp)
    return(null_columns_counts)

null_columns_calc_list = null_value_calc(df)
if null_columns_calc_list:
    spark.createDataFrame(null_columns_calc_list,
        ['Column_Name', 'Null_Values_Count', 'Null_Value_Percent']).show()
else:
    print("Data is clean with no null values")
Step 4: Skewness measures how much a distribution of values deviates from symmetry around the
mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a
greater number of smaller values, and a negative value indicates a greater number of larger values.
from pyspark.sql.functions import *

d = {}
# Create a dictionary of quantiles (numeric_inputs and indexed are assumed to be
# prepared in earlier steps that are not reproduced in this manual)
for col in numeric_inputs:
    d[col] = indexed.approxQuantile(col, [0.01, 0.99], 0.25)  # increase the last number to make it go faster

# Now check each column for skewness (the code that caps/floors the skewed values
# is not reproduced here)
for col in numeric_inputs:
    skew = indexed.agg(skewness(indexed[col])).collect()  # check for skewness

# Build a parameter grid and cross-validate the classifier
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
             # .addGrid(classifier.maxBins, [5, 10, 20])
             # .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

# The classifier/estimator object is assumed to be created earlier
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=folds)  # 3+ folds is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)
Step 5 : Logistic Regression is having highest prediction score, so we can use best
model of Logistic Regression to create our final model.
#Select the top n features and view results
n = 72
# For Logistic regression or One vs Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(data).transform(data)
bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")
# Collect features
features = bestFeaturesDf.select(['features']).collect()
# Split
train,test = bestFeaturesDf.randomSplit([0.7,0.3])
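A minimal sketch of training and evaluating the final Logistic Regression model on this split (default hyperparameters assumed; not reproduced from the original):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

lr = LogisticRegression(labelCol="label", featuresCol="features")
lrModel = lr.fit(train)

# score the held-out test set and report accuracy
predictions = lrModel.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(predictions))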
Result:
Thus heart disease was predicted using Apache Spark (logistic regression) with high accuracy.
Ex: 9 Page Rank Computation
Aim:
To compute the PageRank in R-Studio.
Procedure:
Step 1: Open the R tool (R-Studio).
Step 2: Define a small universe of web pages and the links between them (the program below uses six pages).
Step 3: Multiple outbound links from one single page to another single page are ignored.
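The script below builds the column-stochastic link matrix M and the Google matrix A = beta*M + (1-beta)*U, where U is the n-by-n matrix with every entry 1/n and beta = 0.85 is the damping factor. The PageRank vector is the principal eigenvector of A, which is why it can be obtained either from eigen(A) or by repeatedly multiplying A into a uniform starting vector (A %^% 100 %*% r); both routes give the same ranking, as the outputs confirm.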
Program:
library(igraph)
library(expm)
g <- graph(c(
  1, 2, 1, 3, 1, 4,
  2, 3, 2, 6, 3, 1,
  3, 5, 4, 2, 4, 1,
  4, 5, 5, 2, 5, 6,
  6, 3, 6, 4),
  directed=TRUE)
> M = get.adjacency(g, sparse = FALSE)
> M = t(M / rowSums(M))
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> e = eigen(A)
> v <- e$vec[,1]
> v <- as.numeric(v) / sum(as.numeric(v))
>v
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> page.rank(g)$vector
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> library(expm)
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> r = matrix(data=rep(1/n, n), nrow=n, ncol=1)
> t(A%^%100 %*% r)
Output
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.1547936 0.173992 0.2128167 0.1388701 0.1547936 0.1647339
Result:
Thus the page rank computation was implemented in R-Studio.