BDA Record
CERTIFICATE
Certified that this is a bonafide record of practical work done in Big Data with
Hadoop Laboratory by Ponnam Vasavi, Regd. No. 18761A1245, of
IV B.Tech Course (VII Semester) in IT branch during the academic year 2021-2022.
Date: TEACHER-IN-CHARGE
Prerequisites:
Java Programming
Database Knowledge
Course Outcomes (COs): After the completion of this course, the student will be
able to:
CO1: Prepare data for summarization, query, and analysis.
CO2: Apply data modelling techniques to large data sets.
CO3: Create applications for Big Data analytics.
Week-1:
Week-2:
Week-3:
Week-4:
Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
Week-5:
Week-6:
Week-7:
Week-8:
A) STANDALONE MODE:
Installation of JDK 7
Command: sudo apt-get install openjdk-7-jdk
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
Verify the Java and Hadoop installations
Command: java -version
Command: hadoop version
B) PSEUDO MODE:
A Hadoop single-node cluster runs on one machine: the NameNode and DataNode
daemons both run on the same host. The installation and configuration steps are given below:
Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
NOTE: Verify the passwordless ssh environment from namenode to all datanodes as “huser”
user.
Log in to the master and data nodes:
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5
huser@pcetcse4:$ jps
DataNode
TaskTracker
huser@pcetcse5:$ jps
DataNode
TaskTracker
HDFS JobTracker
http://localhost:50030/
HDFS Logs
http://localhost:50070/logs
HDFS TaskTracker
http://localhost:50060/
Aim: Word count program using MapReduce.
Procedure in eclipse:
Open the Eclipse IDE.
Go to File -> New -> Java Project -> enter the project name (wordcount). Check that the Java version is Java SE-1.8.
Right-click on the project name (wordcount) -> New -> Package, and name the package com.lbrce.wordcount.
Right-click on the package name -> New -> Class, create the classes with their respective names, and add
the code.
Right-click on the package -> Build Path -> Configure Build Path -> go to Libraries -> click Add External JARs, then add
the required jar files.
Right-click on the project (wordcount) -> Export -> type "jar" in the text field and click JAR file,
then browse for a location and save the file name as wordcount -> Next -> Next -> browse for the main class ->
select the wordcount driver -> click Finish.
Procedure in WinSCP:
Open WinSCP.
Enter host address: 172.16.0.70
Username: student
Password: LbrceStudent
Two panels appear in the window. In the right panel (server), open it2021 -> your respective
folder (e.g., 18761A1201). In the left panel, select the respective folder, select your jar file, and drag it to the
right panel (server).
Procedure in Termius:
Host address: ssh [email protected]
Password: LbrceStudent
Then click on connect.
>>cd it2021
>>cd 18761A1201(your directory)
>>scp wordcount.jar (your jar file) [email protected]:/home/hduser/it2021/18761A1201
>>enter password as ipc
>>ssh [email protected]
>>enter password as ipc
>>cd it2021/18761A1201
Create a file named wordcount1:
>>cat >>wordcount1
Enter the text whose words are to be counted.
Then press Ctrl+D.
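The same input file can also be created non-interactively. A small sketch, assuming a POSIX shell with here-document support (the sample text is only an illustration):

```shell
# Create the input file without interactive cat and Ctrl+D,
# using a here-document. The sample text is illustrative.
cat > wordcount1 <<'EOF'
hello hadoop
hello world
EOF
# Show the file contents.
cat wordcount1
```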
>> hadoop fs -mkdir /it2021/18761a1201
>> hadoop fs -put wordcount1 /it2021/18761a1201
>> hadoop jar wordcount.jar /it2021/18761a1201/wordcount1 /it2021/18761a1201/wordcountoutput
>> hadoop fs -cat /it2021/18761a1201/wordcountoutput/part*
Output will be displayed.
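The part* output lists each word with its count. As a hedged local sketch (not the Hadoop job itself), the same map, shuffle, and reduce steps can be imitated with standard Unix tools on a small sample input:

```shell
# Build a small sample input (illustrative text, not the HDFS copy).
printf 'hello hadoop\nhello world\n' > wordcount1
# map: split lines into one word per line; shuffle: sort the words;
# reduce: count duplicates; print as "word<TAB>count".
tr -s ' ' '\n' < wordcount1 | sort | uniq -c | awk '{print $2 "\t" $1}'
```

For this input the pipeline prints hadoop 1, hello 2, world 1: the same key/count pairs the MapReduce WordCount job emits.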
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:$DERBY_HOME/lib/derbyclient.jar
Command: sudo mkdir $DERBY_HOME/data
Command: cd $HIVE_HOME/conf
Command: sudo cp hive-default.xml.template hive-site.xml
Command: sudo gedit $HIVE_HOME/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Example
We will insert the following data into the table. It is a text file named sample.txt in
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin
+------+-------------+--------+-------------------+------+
| ID   | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+
Functions:
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query creates a view named emp_30000 that selects the employees whose salary is more than 30000:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
EXPERIMENT – 7
Step 1) Go to the Apache HBase download page.
Step 2) Select the stable version (here, version 1.1.2).
Step 3) Click on hbase-1.1.2-bin.tar.gz. It will download the tar file. Copy the tar file into an
installation location.
Step 4) Open the ~/.bashrc file and add the HBASE_HOME path.
Step 5) Open hbase-site.xml and place the following properties inside the file
hduser@ubuntu$ gedit hbase-site.xml(code as below)
<property>
<name>hbase.rootdir</name>
<value>file:///home/hduser/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/HBASE/zookeeper</value>
</property>
Step 6) Open the hosts file present in the /etc location and mention the IPs of the nodes.
Step 7) Now run start-hbase.sh from the hbase-1.1.2/bin location.
Check with the jps command whether HMaster is running.
Step 8) The HBase shell can be started using "hbase shell"; it enters an interactive shell
mode. Once it enters shell mode, we can perform all types
of commands.
HBase Shell
HBase contains a shell using which you can communicate with HBase. HBase uses the Hadoop
File System to store its data. It will have a master server and region servers. The data storage will
be in the form of regions (tables). These regions will be split up and stored in region servers.
The master server manages these region servers and all these tasks take place on HDFS. Given
below are some of the commands supported by HBase Shell.
General Commands
status - Provides the status of HBase, for example, the number of servers.
version - Provides the version of HBase being used.
table_help - Provides help for table-reference commands.
whoami - Provides information about the user.
Installing R in Windows:
1. Go to the R download page.
2. Select the latest installer, R-3.4.0, and download it. After the download completes, double-click the installer.
3. Clicking on the ‘Next’ button starts the installation process. This redirects you to the license window.
4. After selecting the Next button from the previous step the installation folder path is required.
Select the desired folder for installation; it is advisable to select the C directory for smooth running
of the program.
5. Next select the components for installation based on the requirements of your operating system
to avoid unwanted use of disk space.
6. In the next dialog box, we need to select the start menu folder. Here, it is better to go with the
default option given by the installer.
7. After setting up the Start menu folder, check the additional options for completing the setup.
8. After clicking Next from the previous step, the installation procedure ends and the completion window is
displayed. Click ‘Finish’ to exit the installation window.
Installing R-Studio
Installing and Configuring R-Studio in Windows: The Integrated Development Environment(IDE) for
R is R Studio and it provides a variety of features such as an editor with direct code execution and
syntax highlighting, a console, tools for plotting graphs, history lookup, debugging, and an
environment for workspace creation. R Studio can be installed in any of the Windows platforms such
as Windows 7/8/10/Vista and can be configured within a few minutes. The basic requirement is R
2.11.1+ version. The following are the steps involved to setup R Studio:
1) Download the latest version of R Studio by clicking on the link provided here:
https://www.rstudio.com/products/rstudio/download/ and it redirects you to the download page.
There are two versions of R Studio available – desktop and server. Based on your usage and
comfort, select the appropriate version to initiate your download. The latest desktop version for R
Studio is 1.0.136.
2) Download the .exe file and double click on it to initiate the installation.
3) Click on the ‘Next’ button and it redirects you to select the installation folder. Select ‘C:\’ as your
installation directory since R and R Studio must be installed in the same directory to avoid path
issues for running R programs.
4) Click ‘Next’ to continue and a dialog box asking you to select the Start menu folder opens. It is
advisable to create your own folder to avoid any possible confusion; then click on the Install button to
install R Studio. After the installation completes, the final window is displayed. Click ‘Finish’ to exit the
installation window.
Installation of R in Ubuntu
Installation of R in Ubuntu: Go to the Software Center, search for R Base, and install it. Then open a
terminal and enter R to get the R command prompt. Installation of R-Studio in Ubuntu: Open a
terminal and run the commands to install the downloaded R-Studio package.
Experiment-8
R Programming language has numerous libraries to create charts and graphs. A pie-chart is a
representation of values as slices of a circle with different colors. The slices are labeled and the
numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart.(value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Example
A very simple pie-chart is created using just the input vector and labels.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the chart.
pie(x, labels)
A bar chart represents data in rectangular bars with length of the bar proportional to the value of the
variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal
bars in the bar chart. In a bar chart, each of the bars can be given different colors.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used −
H is a vector or matrix containing numeric values used in the bar chart.
xlab is the label for the x axis.
ylab is the label for the y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars.
Example
Give the chart file a name, then plot and save the bar chart.
png(file = "barchart.png")
barplot(c(7,12,28,3,41))
dev.off()
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. Each bar in the
histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input and uses some
more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to mention the width of each bar.
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph, then save the histogram.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
png(file = "histogram.png")
hist(v, xlab = "Values", col = "green", border = "red")
dev.off()
# Read the CSV data into a data frame first (file name assumed):
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −
[1] TRUE
[1] 5
[1] 8
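As a cross-check outside R, a file with the same shape (5 columns, 8 data rows) can be counted with awk. This is a sketch; input.csv and its fields here are hypothetical:

```shell
# Build a hypothetical CSV with 5 columns and 8 data rows.
printf 'id,name,salary,start_date,dept\n' > input.csv
for i in 1 2 3 4 5 6 7 8; do
  printf '%s,emp%s,600,2012-01-01,IT\n' "$i" "$i" >> input.csv
done
# Count columns from the header and data rows from the total.
awk -F, 'NR == 1 { print "columns:", NF } END { print "rows:", NR - 1 }' input.csv
```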
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA
values.
# Create a vector containing a missing value.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# mean() returns NA because the vector contains NA.
result.mean <- mean(x)
print(result.mean)
# Drop the NA value from the calculation.
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)
Mean
It is calculated by taking the sum of the values and dividing by the number of values in the data
series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
x is the input vector.
trim is used to drop some observations from both end of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result −
[1] 8.22
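As a quick cross-check outside R, the same arithmetic mean can be sketched with awk: one value per line, then sum divided by count.

```shell
# Mean = sum of values / number of values; same data as the R example.
printf '12 7 3 4.2 18 2 54 -21 8 -5\n' | tr ' ' '\n' |
  awk '{ sum += $1; n++ } END { print sum / n }'
```

This prints 8.22, agreeing with the R result above.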
Median
The middlemost value in a data series is called the median. The median() function is used in R to
calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector and find the median.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
print(median(x))
When we execute the above code, it produces the following result −
[1] 5.6
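The same middle-value rule can be sketched outside R with sort and awk: sort numerically, then take the middle element, or the average of the two middle elements when the count is even.

```shell
# Median of the same data as the R example (10 values, so the
# result is the average of the 5th and 6th sorted values).
printf '12\n7\n3\n4.2\n18\n2\n54\n-21\n8\n-5\n' | sort -n |
  awk '{ a[NR] = $1 }
       END { if (NR % 2) print a[(NR + 1) / 2];
             else print (a[NR/2] + a[NR/2 + 1]) / 2 }'
```

This prints 5.6, matching median() in R for this vector.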
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and
median, mode can have both numeric and character data.
R does not have a standard in-built function to calculate mode. So we create a user function to
calculate mode of a data set in R. This function takes the vector as input and gives the mode value as
output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create a vector and calculate its mode using the user function.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
print(getmode(v))
When we execute the above code, it produces the following result −
[1] 2
library(tidyverse)
library(caret)
theme_set(theme_bw())
# Load the data (the marketing data set from the datarium package is assumed).
data("marketing", package = "datarium")
sample_n(marketing, 3)
set.seed(123)
summary(model)
summary(model)$coef
RMSE(predictions, test.data$sales)
R2(predictions, test.data$sales)
Results
The summary(model) output shows the model Call, the Residuals, and the Coefficients table, ending with:
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1