BIG DATA ANALYTICS LABORATORY

P.S.R. ENGINEERING COLLEGE
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Sevalpatti (P.O), Sivakasi - 626140.
Virudhunagar Dt.

LABORATORY MANUAL
SEMESTER: V

Faculty In-charge: A. Rajalakshmi

PREPARED BY: A. Rajalakshmi          APPROVED BY: HOD/CSE
CCS334 BIG DATA ANALYTICS LABORATORY          L T P C: 0 0 2 1
Programme: B.TECH    Sem: 5    Category: PC
Prerequisites: -
Aim: To provide knowledge of HDFS and to work with big data problems.
Course Outcomes: The Students will be able to
CO1: Set up and implement Hadoop clusters
CO2: Learn to use the Hadoop Distributed File System (HDFS) to set up single and multi-node clusters
CO3: Use MapReduce tasks for various applications
CO4: Analyze the various technologies & tools associated with Big Data
CO5: Discuss techniques common to NoSQL datastores
CO6: Propose solutions for Big Data Analytics problems
LIST OF EXPERIMENTS
1. Find procedure to set up Hortonworks Data Platform (HDP) - Cloudera CDH stack
2. HDFS commands
3. Find procedure to set up single and multi-node Hadoop cluster
4. Find procedure to load data into HDFS using Apache Flume, Apache Kafka and Apache Sqoop
5. Install & run the MongoDB server
6. Demonstrate loading unstructured data into a NoSQL datastore and perform all operations such as NoSQL queries with an API
7. Write a weather forecasting program using MapReduce
8. Write an event detection program using Spark
9. PageRank computation
Total Hours: 60
Ex: 1 Install Cloudera (CDH) & Hortonworks Data Platform (HDP)
Aim:
To set up Cloudera (CDH) using VirtualBox and the Hortonworks Data Platform (HDP) on AWS.
Step 3: Once the software installation is successful, you will see the VirtualBox Manager window to manage VMs.
Step 4: Select the QuickStart VM for VirtualBox and click on Download.
Step 5: Extract the downloaded file. When you extract the file cloudera-quickstart-vm-4.3.0-virtualbox.tar, you will find these two files in the directory.
Step 6: Open VirtualBox and click on "New" to create a new virtual machine.
Step 7: Give a name to the new virtual machine and select the type as Linux and the version as Linux 2.6.
Step 8: On the next page, VirtualBox asks you to select a hard drive for the new virtual machine, as shown in the screenshot. "Create a virtual hard drive now" is selected by default, but you have to select the "Use an existing virtual hard drive file" option.
Select “Use an existing virtual hard drive file”.
Step 9: Click on the small yellow icon beside the dropdown to browse and select the cloudera-quickstart-vm-4.3.0-virtualbox-disk1.vmdk file (which was downloaded in Step 4).
Click on Create to create the Cloudera QuickStart VM.
Step 10: Your VirtualBox Manager should look like the following screenshot. We can see the new virtual machine named Cloudera Hadoop on the left side.
Step 11: Select the Cloudera VM and click on "Start". The virtual machine starts to boot.
Step 12: The system is loaded and CDH is installed on the virtual machine.
Step 14: Select Cloudera Manager and agree to the information assurance policy.
Step 15: Log in to Cloudera Manager as admin; the password is admin.
Step 16: Click on the Hosts tab. We can see that one host is running, the version of CDH installed on it is 4, the health of the host is good, and the last heartbeat was heard 5.22 s ago.
HORTONWORKS DATA PLATFORMS
Step 1 – Launch 3 instances of t2.xlarge type
AWS offers various instance configurations. The one we are going with has 16 GB RAM, which is the t2.xlarge type. You can see below how the AWS console will look.
As you can see in the above image, we have selected CentOS 7. The next step is to select the instance type; as stated earlier, we are going with the t2.xlarge instance type.
In the next step, we will add storage of 100 GB. Please make sure that you select the Magnetic volume type, because SSD will cost more and provide less storage. For reference, HDFS consumes 3 GB of raw storage to give you 1 GB of usable storage (because of 3x replication), so it makes more sense economically to go with the magnetic volume type.
After you give a name to the server, the next step is to create a security group. Here, we are allowing
all the ports so that there is no restriction.
Step 2 – Change permission of the downloaded key so that nobody else can access it
chmod 400 hadoop-hdp-demo.pem (this makes the key readable only by its owner, so nobody else can read or modify it)
Step 4 – Log in to each of the nodes using the downloaded private key
In this step, you need to provide the private key and the public IP address of the node. For example: ssh -i ~/hadoop-demo.pem centos@<public-ip>
You can locate the Public IP address on AWS server as shown in the image below
Step 5 – Run “sudo yum update” to update the packages
It will update the packages on all the machines, because the images provided by Amazon are a bit old, so we need to update the software on these machines. Yum is the package manager on Red Hat (CentOS) machines. You will run this command on all of the machines.
sudo visudo
With this command you can edit the sudoers file so that the centos user is able to run commands anywhere without any restriction (see the sketch below).
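A typical sudoers entry for this (an assumption; the exact line is not reproduced in this manual) would be:
centos ALL=(ALL) NOPASSWD: ALL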
hostname -f
The above command gives the fully qualified domain name (FQDN) of the host.
sudo vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<fqdn>
Please make sure to replace <fqdn> with a proper hostname. You can get this value from the hostname -f command used above.
Result:
Thus Cloudera (CDH) and the Hortonworks Data Platform (HDP) were installed successfully.
Ex: 2 HDFS COMMANDS
Aim:
To load big data into Hadoop Distributed File System using HDFS commands for creating
directories, moving files, adding files, deleting files, reading files and listing directories.
Commands
1. ls
Description: This command is used to list all the files. Use ls -R (or the older lsr) for a recursive listing; it is useful when we want the hierarchy of a folder.
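For example (the paths are illustrative):
hadoop fs -ls /user
hadoop fs -ls -R /user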
2. mkdir
Description: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.
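For example (the user directory name is an assumption for a typical Cloudera setup):
hadoop fs -mkdir -p /user/cloudera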
3. touchz
Description: It creates an empty file.
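For example (the file name is illustrative):
hadoop fs -touchz /user/cloudera/empty.txt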
4. cp:
Description: This command is used to copy files within HDFS.
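For example (paths illustrative):
hadoop fs -cp /user/cloudera/file1.txt /user/cloudera/backup/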
5. moveFromLocal:
Description: This command moves a file from the local file system to HDFS (the local copy is removed).
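For example (paths illustrative):
hadoop fs -moveFromLocal /home/cloudera/sample.txt /user/cloudera/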
6. Put
Description: This command is used to copy localfile1 of the local file system to the Hadoop filesystem.
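For example, assuming localfile1 exists in the current local directory:
hadoop fs -put localfile1 /user/cloudera/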
7. Get
Description: The Hadoop fs shell command get copies the file or directory from the Hadoop file
system to the local file system.
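For example (paths illustrative):
hadoop fs -get /user/cloudera/file1.txt /home/cloudera/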
8. df
Description: It shows the capacity, size, and free space available on the HDFS file system.
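For example (the -h flag prints human-readable sizes):
hadoop fs -df -h /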
9. fsck
Description: This Hadoop command is used to check the health of the HDFS file system.
SYNTAX: hadoop fsck <path> [ -move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
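For example, to check the whole file system and list its files and blocks:
hadoop fsck / -files -blocks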
Result:
Thus the file operations such as create, move and delete were performed in the Cloudera terminal.
Ex: 3 Single and multi-node Hadoop cluster
Aim:
To set up single and multi-node Hadoop clusters in a distributed environment.
Program:
Install Hadoop
Step 1: Download the Java 8 Package. Save the file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 4: Add the Hadoop and Java paths in the bash file (.bashrc). Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
Fig 3 Setting Environment Variable
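The environment variables typically added here look like the following (the install paths are assumptions based on the versions used in this exercise; adjust them to your actual directories):
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin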
Step 5: Then, save the bash file and close it. To apply all these changes to the current terminal, execute the source command.
Command: source .bashrc
Step 6: To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal, execute the java -version and hadoop version commands.
Command: java -version
Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag. core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings, that are common to HDFS and MapReduce.
Command: vi core-site.xml
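A typical property for a single-node setup looks like the following minimal sketch (the port matches the hdfs://localhost:9000 URI used elsewhere in this manual):
<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
 </property>
</configuration>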
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag. hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
Command: vi hdfs-site.xml
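For a single-node cluster, a minimal sketch of the property usually set here (a replication factor of 1 is an assumption for a single node):
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
</configuration>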
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside the configuration tag. mapred-site.xml contains the configuration settings of the MapReduce application, like the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc. In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml.template file.
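The template can be copied and then opened for editing as follows (a sketch; run inside the hadoop-2.7.3/etc/hadoop directory used in Step 6):
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml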
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 10: Edit yarn-site.xml and edit the property mentioned below inside the configuration tag. yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, like the application memory management size, the operations needed on programs and algorithms, etc.
Command: vi yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Result:
Thus the single-node and multi-node Hadoop clusters were installed in a distributed environment.
Ex. No: 4A Apache Flume
Aim:
To fetch Twitter data and load it into HDFS using Apache Flume.
Step 1
To create a Twitter application, click on the following link https://apps.twitter.com/. Sign in to
your Twitter account. You will have a Twitter Application Management window where you
can create, delete, and manage Twitter Apps.
Step 2
Click on the Create New App button. You will be redirected to a window where you will get
an application form in which you have to fill in your details in order to create the App. While
filling the website address, give the complete URL pattern, for example, http://example.com.
Step 3
Fill in the details, accept the Developer Agreement when finished, click on the Create your
Twitter application button which is at the bottom of the page. If everything goes fine, an App
will be created with the given details as shown below.
Step 4
Under keys and Access Tokens tab at the bottom of the page, you can observe a button
named Create my access token. Click on it to generate the access token.
Step 5
Finally, click on the Test OAuth button which is on the right side top of the page. This will
lead to a page which displays your Consumer key, Consumer secret, Access token, and Access
token secret. Copy these details. These are useful to configure the agent in Flume.
Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start Hadoop and
create a folder in it to store Flume data. Follow the steps given below before configuring
Flume.
Step 6: Install / Verify Hadoop
Install Hadoop. If Hadoop is already installed in your system, verify the installation using
Hadoop version command, as shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the path variable, then you will get the
following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar
Step 7: Starting Hadoop
Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file
system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out
$ start-yarn.sh
starting yarn
daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Step 8: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir. Browse through it and
create a directory with the name twitter_data in the required path as shown below.
$cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/user/Hadoop/twitter_data
Configuring Flume
We have to configure the source, the channel, and the sink using the configuration file in
the conf folder. The example given in this chapter uses an experimental source provided by
Apache Flume named Twitter 1% Firehose, together with a memory channel and an HDFS sink.
Example – Configuration File
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
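The file above only names the agent components and configures the HDFS sink; the source and channel sections are also required. A minimal sketch is given below (the keyword list is illustrative, and the placeholder credentials must be replaced with the values obtained in Step 5):
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# Bind the source and the sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel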
Execution
Browse through the Flume home directory and execute the application as shown below.
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
Given below is the snapshot of the command prompt window while fetching tweets.
Verifying HDFS
You can access the Hadoop Administration Web UI using the URL given below.
http://localhost:50070/
Click on the dropdown named Utilities on the right-hand side of the page. You can see two
options as shown in the snapshot given below.
Result:
Thus the cluster ID was generated from the NameNode and the Twitter data was successfully loaded into HDFS using Apache Flume.
Ex. No: 4B Apache Sqoop
Aim:
To import data from a relational database such as MySQL into HDFS using Apache Sqoop.
Procedure: The following steps are involved in pulling the data present in the MySQL table and inserting it into HDFS. We will see how Sqoop is used to achieve this.
Step 2: Before transferring the selected data, check how many records are present in the table. To do this, change to the required database using the use <database name>; command.
Now, check the number of records present in the table whose data we intend to move into HDFS, using count(*): select count(*) from <table name>;
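For example, with the database and table used later in this exercise:
mysql> use test;
mysql> select count(*) from retailinfo;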
Step 3: Before running the “sqoop import” command, ensure that the target directory is not
already present. Otherwise, the import command throws an error. To check this, let us try
deleting the directory that we wish to use as our target directory.
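For example, the target directory used in this exercise can be removed (assuming it is safe to delete) with:
hadoop fs -rm -r /user/root/online_basic_command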
Step 4: Let us now do the “sqoop import” job to pull the selected data from MySQL and insert
it into HDFS using the command:
sqoop import \
--connect jdbc:mysql://localhost/<database name> \
--table <table name> \
--username <username> --password <password> \
--target-dir <target location in HDFS> \
-m <no. of Mapper jobs you wish to create>
In our example, the database is “test,” selected data is in table “retailinfo,” and our target
location in HDFS is in the directory “/user/root/online_basic_command.” We proceeded with
only one mapper job.
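Substituting those values into the template above, the command would look like the following sketch (the username and password are placeholders):
sqoop import \
--connect jdbc:mysql://localhost/test \
--table retailinfo \
--username <username> --password <password> \
--target-dir /user/root/online_basic_command \
-m 1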
Upon successful data transfer, the output looks similar to:
We can see that all the records in our database table are retrieved into HDFS.
Result:
Thus the Sqoop imported data from RDBMS (Relational Database Management
System) such as MySQL to HDFS (Hadoop Distributed File System).
Ex. No: 4C Apache Kafka
Aim:
To load data into HDFS using Apache Kafka.
Procedure
Step 1: Install Java. Apache Kafka requires Java. To ensure that Java is installed, first update the operating system and then install it:
sudo apt-get update
sudo apt-get upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
Step 2: Download Kafka
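For example, the release used in the next step can be fetched from the Apache archive (the exact URL/mirror is an assumption; pick the version you need):
wget https://archive.apache.org/dist/kafka/1.1.1/kafka_2.11-1.1.1.tgz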
Step 3: Now, we need to extract the tgz package. We choose to install Kafka in /opt/kafka
directory:
sudo tar xvf kafka_2.11-1.1.1.tgz --directory /opt/kafka --strip 1
(If /opt/kafka does not exist yet, create it first with sudo mkdir /opt/kafka.)
Step 4: Kafka stores its logs on disk in the /tmp directory by default; it is better to create a new directory to store the logs:
sudo mkdir /var/lib/kafka
sudo mkdir /var/lib/kafka/data
Step 5: Configure Kafka. We need to edit the Kafka server configuration file:
sudo gedit /opt/kafka/config/server.properties
By default, Kafka does not allow us to delete topics. To be able to delete topics, find the delete.topic.enable line and change it (if it is not found, just add it), and point log.dirs at the directory created in Step 4:
delete.topic.enable = true
log.dirs=/var/lib/kafka/data
Step 6: In addition, we can adjust the time interval for logs deletion (Kafka deletes logs after a
particular time or according to disk size):
log.retention.hours=168        # according to time
log.retention.bytes=104857600  # according to disk size
Step 7: We need to give the kafkauser account access to the logs directory and the Kafka installation directory:
sudo chown -R kafkauser:nogroup /opt/kafka
sudo chown -R kafkauser:nogroup /var/lib/kafka
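Note that the kafkauser account is assumed to exist already; if it does not, it can be created first, for example:
sudo useradd -m kafkauser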
Step 8: To start the Apache Kafka service, you can use the following command. You should see output like the sample below if the server has started successfully. To start Kafka as a background process, you can use the nohup command.
sudo /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
[2018-07-23 21:43:48,279] WARN No meta.properties file under dir /var/lib/kafka/data/meta.properties (kafka.server.BrokerMetadataCheckpoint)
[2018-07-23 21:43:48,516] INFO Kafka version : 0.10.0.1 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,525] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser)
[2018-07-23 21:43:48,527] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2018-07-23 21:43:48,555] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
sudo nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties > /var/lib/kafka/data/kafka.log 2>&1 &
Step 9: The Kafka server is now running and listening on port 9092.
Step 10: Import a text file into a Kafka topic. To send a text file into a Kafka topic, use the cat command with a pipeline:
cat filename.txt | /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testKafka
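To verify that the lines were ingested, they can be read back with the console consumer (default settings assumed):
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testKafka --from-beginning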
Result:
Thus Kafka was installed and used as an intermediate data store, which makes it easy to replay ingestion, consume datasets across multiple applications, and perform data analysis.
Ex: 5 INSTALL AND RUN MONGODB
Aim:
To install and run the MongoDB database server.
Procedure:
Step 1 — Download the MongoDB MSI Installer Package
Download the current version of MongoDB. Make sure you select MSI as the package you want
to download.
Step 2 — Run the MongoDB Installer
B. Click Next to start the installation.
C. Accept the licence agreement, then click Next.
D. Select the Complete setup.
E. Select "Run service as Network Service user" and make a note of the data directory.
F. We won't need MongoDB Compass, so deselect it and click Next.
G. Click Install to begin the installation.
H. Hit Finish to complete the installation.
Step 3 — Create the Data Folders to Store our Databases
A. Navigate to the C: drive on your computer using Explorer and create a new folder called data there.
B. Inside the data folder you just created, create another folder called db.
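With the db folder in place, the server can be started against it. A minimal sketch (the installation path and version below are assumptions; adjust them to match your install):
"C:\Program Files\MongoDB\Server\<version>\bin\mongod.exe" --dbpath "C:\data\db"
In another command prompt, the mongo shell can then be used to connect:
"C:\Program Files\MongoDB\Server\<version>\bin\mongo.exe"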
Result:
Thus MongoDB was installed and the MongoDB server was run successfully.
Ex: 6 Unstructured data into NoSQL using MongoDB with an API
Aim:
To demonstrate storing unstructured data in a NoSQL database and to perform operations such as NoSQL queries with an API.
Program:
Step 1: Start MongoDB in the terminal.
Step 2: Create a database and a collection to insert the necessary data.
Step 3: The db.student.insert() method is used to add or insert new documents into a collection in your database.
Step 4: The db.student.find() method is used to view the documents (rows) of the collection.
Step 5: The db.student.update() method is used to update documents in the collection.
Microsoft Windows [Version 10.0.17134.1488]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\Administrator>cd ..
C:\Users>cd ..
Create Database
> use big-data
switched to db big-data
View Database
> db
big-data
Show Database
> show dbs
dept 0.078GB
local 0.078GB
Create Table
> db.student.insert({name:'sri'})
WriteResult({ "nInserted" : 1 })
View Table
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
> db.student.insert({name:'sri',rollno:'19cs0023'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.insert({name:'sri',rollno:'19cs0023',course:'android'})
WriteResult({ "nInserted" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
Update Row
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
> db.student.update({course:'android'},{$set:{course:'big-data'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.student.find()
{ "_id" : ObjectId("613f1bfad7a1744c6d681093"), "name" : "sri" }
{ "_id" : ObjectId("613f1d84d7a1744c6d681094"), "name" : "sri", "rollno" : "19cs0023" }
{ "_id" : ObjectId("613f1db5d7a1744c6d681095"), "name" : "sri", "rollno" : "19cs0023", "course" :
"big-data" }
{ "_id" : ObjectId("613f1e22d7a1744c6d681096"), "name" : "sri", "rollno" : "19cs0023", "course"
: "big-data" }
{ "_id" : ObjectId("613f1e42d7a1744c6d681097"), "name" : "sri", "rollno" : "19cs0023", "course" :
"android" }
>
MongoDB Delete Documents
> db.student.remove({})
WriteResult({ "nRemoved" : 5 })
> db.student.find()
Drop Database:
> db.dropDatabase()
{ "ok" : 1 }
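The same operations can also be driven through an API instead of the mongo shell. A minimal sketch using the Python pymongo driver (the connection string and the driver choice are assumptions; they are not part of the original transcript):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # local MongoDB server
db = client["big-data"]                              # same database as above
student = db["student"]                              # same collection as above

# insert a document
student.insert_one({"name": "sri", "rollno": "19cs0023", "course": "android"})

# query documents
for doc in student.find({"name": "sri"}):
    print(doc)

# update matching documents
student.update_many({"course": "android"}, {"$set": {"course": "big-data"}})

# delete all documents
student.delete_many({})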
Result:
Thus the program for loading unstructured data into a NoSQL database and performing operations such as NoSQL queries with an API was executed.
Ex: 8 Write an event detection program using Spark
Aim:
To predict whether the person has heart disease or not by using Apache Spark.
Procedure:
Step 1: Dataset — the heart disease dataset, stored as a CSV file, is used as input.
Step 2: Spark can read data in different formats like CSV, Parquet, Avro and JSON.
Here the data is in CSV format and read by using the below code.
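The read step itself is not reproduced in this manual; a minimal PySpark sketch (the file name heart.csv is an assumption) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HeartDisease").getOrCreate()
# header=True uses the first row as column names; inferSchema=True detects numeric types
df = spark.read.csv("heart.csv", header=True, inferSchema=True)
df.printSchema()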
df.groupBy('target').count().show()

+------+-----+
|target|count|
+------+-----+
|     1|  165|
|     0|  138|
+------+-----+
Step 3: Check the columns in the dataframe for null values. We can drop rows if the percentage of null values is very small; here, our dataset is clean with no null values.
from pyspark.sql.functions import *

def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if(nullRows > 0):
            temp = k, nullRows, (nullRows/numRows)*100
            null_columns_counts.append(temp)
    return(null_columns_counts)

null_columns_calc_list = null_value_calc(df)
if null_columns_calc_list:
    spark.createDataFrame(null_columns_calc_list,
        ['Column_Name', 'Null_Values_Count', 'Null_Value_Percent']).show()
else:
    print("Data is clean with no null values")
Step 4: Skewness measures how much a distribution of values deviates from symmetry around the
mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a
greater number of smaller values, and a negative value indicates a greater number of larger values.
from pyspark.sql.functions import *

d = {}
# Create a dictionary of quantiles (numeric_inputs and indexed are assumed to be
# prepared in earlier steps that are not reproduced in this manual)
for col in numeric_inputs:
    d[col] = indexed.approxQuantile(col, [0.01, 0.99], 0.25)  # increase the last number to make it go faster

# Now check each column for skewness (the code that caps/floors the skewed values
# is not reproduced here)
for col in numeric_inputs:
    skew = indexed.agg(skewness(indexed[col])).collect()  # check for skewness

# Build a parameter grid and cross-validate the classifier
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
             # .addGrid(classifier.maxBins, [5, 10, 20])
             # .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

# The classifier/estimator object is assumed to be created earlier
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=folds)  # 3+ folds is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)
Step 5 : Logistic Regression is having highest prediction score, so we can use best
model of Logistic Regression to create our final model.
#Select the top n features and view results
n = 72
# For Logistic regression or One vs Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(data).transform(data)
bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")
# Collect features
features = bestFeaturesDf.select(['features']).collect()
# Split
train,test = bestFeaturesDf.randomSplit([0.7,0.3])
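A minimal sketch of training and evaluating the final Logistic Regression model on this split (default hyperparameters assumed; not reproduced from the original):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

lr = LogisticRegression(labelCol="label", featuresCol="features")
lrModel = lr.fit(train)

# score the held-out test set and report accuracy
predictions = lrModel.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(predictions))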
Result:
Thus heart disease was predicted using Apache Spark (logistic regression) with high accuracy.
Ex: 9 Page Rank Computation
Aim:
To compute the PageRank in R-Studio.
Procedure:
Step 1: Open the R tool (R-Studio).
Step 2: Define a small universe of web pages and the links between them (the program below uses six pages).
Step 3: Multiple outbound links from one single page to another single page are ignored.
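The script below builds the column-stochastic link matrix M and the Google matrix A = beta*M + (1-beta)*U, where U is the n-by-n matrix with every entry 1/n and beta = 0.85 is the damping factor. The PageRank vector is the principal eigenvector of A, which is why it can be obtained either from eigen(A) or by repeatedly multiplying A into a uniform starting vector (A %^% 100 %*% r); both routes give the same ranking, as the outputs confirm.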
Program:
library(igraph)
library(expm)
g <- graph(c(
  1, 2, 1, 3, 1, 4,
  2, 3, 2, 6, 3, 1,
  3, 5, 4, 2, 4, 1,
  4, 5, 5, 2, 5, 6,
  6, 3, 6, 4),
  directed=TRUE)
> M = get.adjacency(g, sparse = FALSE)
> M = t(M / rowSums(M))
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> e = eigen(A)
> v <- e$vec[,1]
> v <- as.numeric(v) / sum(as.numeric(v))
>v
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> page.rank(g)$vector
[1] 0.1547936 0.1739920 0.2128167 0.1388701 0.1547936 0.1647339
> library(expm)
> n = nrow(M)
> U = matrix(data=rep(1/n, n^2), nrow=n, ncol=n)
> beta=0.85
> A = beta*M+(1-beta)*U
> r = matrix(data=rep(1/n, n), nrow=n, ncol=1)
> t(A%^%100 %*% r)
Output
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.1547936 0.173992 0.2128167 0.1388701 0.1547936 0.1647339
Result:
Thus the page rank computation was implemented in R-Studio.