BDA Final Compiled

BDA

Experiment 1

Aim:
Installation of Hadoop and Experiment on HDFS commands.

Theory:
1. Print the Hadoop version
• To check the Hadoop version, you can use the hadoop version command.
• This command provides information about the installed Hadoop software,
including the version number and build details.

2. List the contents of the root directory in HDFS


• The hadoop fs -ls / command is used to list the contents of the root
directory in HDFS.
• This command displays information about files and directories in the root
of the Hadoop Distributed File System.

3. Report the amount of space used and available on currently mounted
filesystem
• The hadoop fs -df hdfs:/ command reports the disk usage and available
space on the HDFS file system.
• It provides information about the storage capacity and usage for the
specified HDFS location.

4. Count the number of directories, files, and bytes under the paths that
match the specified file pattern
• The hadoop fs -count hdfs:/ command is used to count the number of
directories, files, and bytes under the specified path.
• It helps in understanding the structure and size of data in HDFS.

5. Run a DFS filesystem checking utility


• The hadoop fsck / command runs the Hadoop Distributed File System
checking utility.
• It checks the integrity of the file system and reports any issues or
inconsistencies in the HDFS.
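
The status commands above can be run back to back from any node that has the Hadoop client installed; this is an illustrative session, and the output will vary with your cluster:

hadoop version
hadoop fs -ls /
hadoop fs -df hdfs:/
hadoop fs -count hdfs:/
hadoop fsck /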

6. Run a cluster balancing utility
• The hadoop balancer command is used to initiate the Hadoop cluster
balancing process.
• It redistributes data blocks across DataNodes to ensure data balance
within the Hadoop cluster.

7. Create a new directory named "hadoop" below the /user/training


directory in HDFS
• The hadoop fs -mkdir /user/training/hadoop command creates a new
directory named "hadoop" under the /user/training directory in HDFS.
• It's used to organize and manage data within the Hadoop file system.

8. Add a sample text file from the local directory named "data" to the new
directory you created in HDFS
• The hadoop fs -put data/sample.txt /user/training/hadoop command
uploads a text file from the local "data" directory to the "hadoop"
directory in HDFS.
• It is a common operation to transfer data from the local file system to
HDFS.

9. List the contents of this new directory in HDFS


• The hadoop fs -ls /user/training/hadoop command lists the contents of
the "hadoop" directory within HDFS.
• It displays the files and directories present in the specified HDFS
location.

10. Add the entire local directory called "retail" to the /user/training
directory in HDFS
• The hadoop fs -put data/retail /user/training/hadoop command copies
the entire local directory "retail" to the "hadoop" directory in HDFS.

• This is useful for transferring a complete directory to HDFS.

11. List your home directory in HDFS


• When you use the hadoop fs -ls command without an absolute path, it
lists the contents of your home directory in HDFS.
• This command provides an overview of the files and directories within
your HDFS home directory.

12. See how much space the "retail" directory occupies in HDFS
• The hadoop fs -du -s -h hadoop/retail command shows the disk usage of
the "retail" directory within HDFS.
• It displays the total size occupied by the specified directory in a human-
readable format.

13. Delete a file 'customers' from the "retail" directory
• The hadoop fs -rm hadoop/retail/customers command deletes the file
named 'customers' from the "retail" directory in HDFS.
• This operation removes the specified file from HDFS.

14. Ensure the file 'customers' is no longer in HDFS


• The hadoop fs -ls hadoop/retail/customers command is used to verify
whether the file 'customers' has been successfully deleted from HDFS.
• It checks for the existence of the specified file.

15. Delete all files from the "retail" directory using a wildcard
• The hadoop fs -rm hadoop/retail/* command deletes all files within the
"retail" directory in HDFS using a wildcard '*'.
• It is a convenient way to remove multiple files in one operation.

16. Empty the trash


• The hadoop fs -expunge command is used to empty the HDFS trash,
permanently deleting files that were previously moved to the trash.
• This helps free up storage space and ensures that deleted files are no
longer recoverable.

17. Remove the entire retail directory and its contents in HDFS
• The hadoop fs -rm -r hadoop/retail command deletes the "retail"
directory and all its contents recursively.
• This operation effectively removes an entire directory and its
subdirectories.
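
Assuming a local "data" directory containing sample.txt and a retail folder, the directory and file operations from steps 7 to 17 can be chained as in this sketch:

hadoop fs -mkdir /user/training/hadoop
hadoop fs -put data/sample.txt /user/training/hadoop
hadoop fs -put data/retail /user/training/hadoop
hadoop fs -ls /user/training/hadoop
hadoop fs -du -s -h hadoop/retail
hadoop fs -rm hadoop/retail/customers
hadoop fs -rm hadoop/retail/*
hadoop fs -expunge
hadoop fs -rm -r hadoop/retail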

18. List the hadoop directory again


• The hadoop fs -ls hadoop command lists the contents of the "hadoop"
directory in HDFS.

• It helps confirm the status of the directory after performing various
operations.

19. Add the purchases.txt file from the local directory to the hadoop
directory you created in HDFS
• The hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
command copies the local file "purchases.txt" to the "hadoop" directory
in HDFS.
• This is a way to transfer a single file from the local file system to HDFS.

20. View the contents of the text file purchases.txt in your hadoop directory
• The hadoop fs -cat hadoop/purchases.txt command displays the
contents of the "purchases.txt" file within the "hadoop" directory in
HDFS.
• It allows you to view the content of the specified file.

21. Add the purchases.txt file from the "hadoop" directory in HDFS to the
"data" directory in your local directory
• The hadoop fs -copyToLocal hadoop/purchases.txt
/home/training/data command copies the "purchases.txt" file from the
"hadoop" directory in HDFS to the "data" directory in your local
filesystem.
• This is a way to transfer a file from HDFS to the local file system.

22. Copy files between directories present in HDFS


• The hadoop fs -cp /user/training/*.txt /user/training/hadoop command
is used to copy files from one directory to another within HDFS.
• This is a way to manage and organize files in HDFS by duplicating them
in a different location.

23. '-get' command can be used alternatively to '-copyToLocal' command


• The hadoop fs -get hadoop/sample.txt /home/training/ command is an
alternative to the -copyToLocal command, and it copies the "sample.txt"
file from the "hadoop" directory in HDFS to your local directory.

• It provides flexibility in copying files from HDFS to the local filesystem.

24. Display the last kilobyte of the file "purchases.txt" to stdout


• The hadoop fs -tail hadoop/purchases.txt command displays the last
kilobyte of the "purchases.txt" file to the standard output (stdout).
• This can be useful for quickly inspecting the end of large log or text files.

25. Change file permissions in HDFS


• File permissions in HDFS use the same octal notation as Linux; newly
created files typically default to a permissive setting such as 666 before the
configured umask is applied. The hadoop fs -chmod command is used to
change the permissions of a file.
• In this example, it changes the permissions of the "purchases.txt" file to
600, limiting access to the file.

26. Change file owner and group in HDFS


• In the training environment, the default owner and group of a file are both
"training". The hadoop fs -chown command is used to change the owner and
the group simultaneously.
• In this example, it changes the owner and group of the "purchases.txt" file
to "root:root".

27. Change group name in HDFS
• The default name of the group is usually "training." The hadoop fs -
chgrp command is used to change the group name for a file or directory
in HDFS.
• In this example, it changes the group of the "purchases.txt" file to
"training."

28. Move a directory in HDFS


• The hadoop fs -mv hadoop apache_hadoop command is used to move a
directory from one location to another within HDFS.
• In this example, the "hadoop" directory is renamed to "apache_hadoop."

29. Change the replication factor of a file in HDFS


• The default replication factor for a file in HDFS is typically 3. The
hadoop fs -setrep command is used to change the replication factor of a
file.
• In this example, it sets the replication factor of the "sample.txt" file to 2.
30. Copy a directory between nodes in the cluster
• The hadoop distcp command is used to copy a directory between nodes or
clusters. It provides options like -overwrite and -update for different
behaviors.
• It is used for copying data between Hadoop clusters or nodes and
ensuring data consistency.
31. Make the NameNode leave safe mode
• The hdfs dfsadmin -safemode leave command is used to make the HDFS
NameNode leave safe mode.
• This is important to ensure normal HDFS operations after maintenance or
recovery.

32. List all Hadoop file system shell commands
• The hadoop fs command, when used without specifying a subcommand,
lists all available Hadoop file system shell commands.
• This helps you see a comprehensive list of commands available for
managing HDFS.

33. Always ask for help
• The hadoop fs -help command provides detailed help and usage
information for the Hadoop file system shell commands.
• It's a valuable resource to understand command syntax, options, and how
to use these commands effectively.
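
The administration-oriented commands from steps 19 onwards follow the same pattern; this is a sketch only, with file names matching the earlier examples (superuser privileges may be required for ownership changes):

hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
hadoop fs -tail hadoop/purchases.txt
hadoop fs -chmod 600 hadoop/purchases.txt
hadoop fs -chown root:root hadoop/purchases.txt
hadoop fs -chgrp training hadoop/purchases.txt
hadoop fs -setrep -w 2 hadoop/sample.txt
hdfs dfsadmin -safemode leave
hadoop fs -help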

BDA

Experiment 2

Aim:
Use of Sqoop tool to transfer data between Hadoop and relational database
servers.

Theory:
Sqoop in the Hadoop Ecosystem: Bridging Relational Databases and Big
Data
The growth of data in today's digital landscape is nothing short of explosive.
Enterprises are inundated with vast volumes of data, and much of this data
resides in relational databases managed by traditional application systems.
Meanwhile, Hadoop, with its distributed storage (HDFS) and powerful data
processing tools, has emerged as a leading platform for managing and analyzing
big data. The challenge, however, is how to seamlessly integrate and leverage
the data stored in relational databases with Hadoop's capabilities.
The Crucial Role of Sqoop:
Sqoop comes to the rescue by acting as the bridge between these two disparate
data worlds. It plays an indispensable role in facilitating data transfer between
Hadoop and relational database systems, making it an integral component of the
Hadoop ecosystem. Let's explore Sqoop's role and significance in greater depth:
Purpose and Role of Sqoop:
At its core, Sqoop is designed to accomplish the seamless transfer of data
between Hadoop and relational database servers. It has a twofold purpose:
importing data from relational databases into Hadoop HDFS and exporting data
from Hadoop HDFS back to relational databases. This dual functionality is
encapsulated in its name: "SQL to Hadoop and Hadoop to SQL."
1. Importing Data into HDFS: The import functionality of Sqoop is tasked
with retrieving data from relational databases such as MySQL, Oracle, or
SQL Server and bringing it into Hadoop's distributed file system, HDFS.
Sqoop operates at the table level, allowing users to import individual
tables or select subsets of data.
2. Data Representation in HDFS: Once data arrives in HDFS, Sqoop
organizes it into records. In this context, each row from the source
RDBMS table is treated as a record in HDFS. These records can be stored
as text data within text files or in binary formats like Avro and Sequence
files. This versatility allows Sqoop to adapt to various data storage and
processing requirements, making it a flexible and practical solution for
managing diverse data sources.
Sqoop Export:
1. Exporting Data to RDBMS: The export tool within Sqoop complements
the import functionality by allowing users to move data from HDFS to a
target relational database system. This means that data generated,
processed, or stored in Hadoop can be effectively transported back to a
structured RDBMS environment.
2. Record Transformation and Delimiting: To make this transition
seamless, Sqoop reads and parses data files in HDFS. These data files
contain records, which align with the structure of the destination RDBMS
table. Sqoop offers users the option to specify a delimiter to separate
fields within the records. This transformation ensures that the data being
exported from Hadoop is formatted in a manner that is compatible with
the target RDBMS, thereby maintaining data integrity and consistency.
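
The commands below are a minimal sketch of the two directions described above; the JDBC URL, credentials, table names, and HDFS paths are placeholders rather than values from this experiment:

sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username training --password training \
  --table customers \
  --target-dir /user/training/customers \
  --fields-terminated-by ',' -m 1

sqoop export \
  --connect jdbc:mysql://localhost/retail_db \
  --username training --password training \
  --table customers_export \
  --export-dir /user/training/customers \
  --input-fields-terminated-by ','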
The Significance of Sqoop:
Sqoop's significance lies in its ability to empower organizations to fully harness
the power of Hadoop for big data processing, analytics, and storage, while still
maintaining a connection with the structured data in traditional relational
databases. This is particularly crucial in scenarios where historical or
operational data is essential for comprehensive analysis, reporting, or regulatory
compliance.

Output:

Step1:

Step 2:

Step3:

Step4:

Step5:

Step6:

Step7:

Step8:

Step9:

BDA

Experiment 3

Aim:
Programming Exercises in HBase

Theory:
HBase Overview
• HBase is a distributed, column-oriented, non-relational database
management system that is designed to run on top of the Hadoop
Distributed File System (HDFS). It is part of the Hadoop ecosystem and
is ideal for handling large volumes of data, making it suitable for big data
use cases.
• HBase excels in storing sparse data sets where most of the data is empty
or not present, which is common in big data applications.
• Unlike traditional relational database systems, HBase does not use a
structured query language (SQL). Instead, it is designed for handling
unstructured and semi-structured data.
• HBase applications are primarily written in Java, following the
MapReduce programming model. However, HBase also supports
application development in other languages, including Apache Avro,
REST, and Thrift.
• An HBase system is designed to scale linearly, meaning you can easily
add more machines to your cluster to handle increasing data volumes and
workloads. It's an excellent choice for real-time data processing and
random read/write access to large data sets.
• HBase relies on Apache ZooKeeper for high-performance coordination.
While ZooKeeper is integrated into HBase, in a production cluster, it's
recommended to have a dedicated ZooKeeper cluster that works in
tandem with your HBase cluster.

Step 1: HBase Shell
• HBase provides a shell interface that allows users to interact with the
database. The HBase shell is a command-line tool for executing various
HBase operations.

Step 2: Creating a Table


• In HBase, you can create a table using the create command in the HBase
shell. When creating a table, you must specify the table name and at least
one column family. Column families are used to organize and group
related data.

Step 3: Listing Tables


• The list command in the HBase shell is used to view a list of all tables
currently available in the HBase instance.

Step 4: Disabling a Table


• Before making changes to a table or deleting it, you need to disable the
table using the disable command in the HBase shell.

Step 5: Enabling a Table
• After disabling a table, you can re-enable it using the enable command.

Step 6: Describing a Table


• The describe command provides a detailed description of a specific table
in HBase, including information about column families, versions, and
more.

Step 7: Altering a Table


• The alter command is used to make changes to an existing table, such as
modifying the maximum number of cells in a column family, setting or
deleting table scope operators, or deleting a column family from a table.

Step 8: Checking Table Existence
• You can verify the existence of a table using the exists command in the
HBase shell.

Step 9: Dropping a Table


• To permanently delete a table, you must first disable it and then use the
drop command.
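
A hedged HBase shell sketch of Steps 2 to 9; the table and column family names are illustrative, not taken from this experiment's screenshots:

create 'employee', 'personal', 'professional'
list
describe 'employee'
disable 'employee'
alter 'employee', NAME => 'personal', VERSIONS => 5
enable 'employee'
exists 'employee'
disable 'employee'
drop 'employee'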

Step 10: Creating Data (Put Command)
• In HBase, you can insert data into a table using the put command in the
HBase shell. This command allows you to specify the table name, row
key, column family, column name, and the value you want to store in a
particular cell.

Step 11: Scanning a Table


• The scan command in HBase allows you to view the data stored in an
HTable. It's a useful command to retrieve and display data from a table.

Step 12: Updating Data (Put Command)
• To update an existing cell's value in an HBase table, you can use the put
command. The new value you provide will replace the existing value in
the specified cell.

Step 13: Reading Data (Get Command)


• The get command and the get() method of the HTable class are used to
read data from an HBase table. You can retrieve a single row of data at a
time using this command.

Step 14: Deleting Data (Delete Command)


• The delete command in HBase allows you to remove specific cells or
data from a table. You can specify the table name, row, column name, and
timestamp to delete a particular cell.

Step 15: Counting Rows (Count Command)


• The count command in HBase enables you to count the number of rows
in a specific table, helping you to understand the table's size and the
volume of data it contains.

Step 16: Truncating a Table
• The truncate command in HBase is used to disable, drop, and recreate a
table. This operation is helpful when you need to quickly remove all of a
table's data while keeping its schema, since the recreated table comes back
empty.
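
A corresponding sketch for the data-manipulation commands in Steps 10 to 16, again with illustrative names:

put 'employee', 'row1', 'personal:name', 'Raju'
put 'employee', 'row1', 'personal:city', 'Mumbai'
scan 'employee'
get 'employee', 'row1'
put 'employee', 'row1', 'personal:city', 'Pune'
delete 'employee', 'row1', 'personal:city'
count 'employee'
truncate 'employee'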

BDA

Experiment No 4

Aim:
Experiment for Word Counting using Hadoop Map-Reduce

Theory:
WordCount Program Overview:

The WordCount program is a classic example in the Hadoop ecosystem, illustrating the
MapReduce paradigm for processing and analyzing large datasets. It is commonly used
as a beginner's exercise to understand the fundamental concepts of MapReduce.

Components of the WordCount Program:

1. Mapper (WordMapper.java):
• Responsible for reading input data and emitting key-value pairs.
• Takes a line of text and splits it into individual words.
• Associates each word with the value 1 to indicate its occurrence.

2. Reducer (WordReducer.java):
• Aggregates the intermediate results from the Mapper.
• Takes key-value pairs where the key is a word and the value is the count of
occurrences.
• Summarizes the counts to get the total occurrences of each word.

3. Driver Program (wordCount.java):


• Orchestrates the overall MapReduce job.
• Specifies input and output paths, sets Mapper and Reducer classes.
• Defines the data types for input and output key-value pairs.
• Executes the MapReduce job.

Steps to Run the WordCount Program:

1. Create a Java Project:


• In Eclipse, create a new Java project named "WordCountJob."

2. Add Classes:
• Create three classes: WordCount, WordMapper, and WordReducer.
• Copy the respective program contents into each class file.

3. Build Path:
• Add the Hadoop library (hadoop-core.jar) to the project's build path.
• This allows the project to use Hadoop classes and functionalities.

4. Export Jar File:


• Export the Java project as a JAR file in the project directory.

5. Terminal Commands:
• Open a terminal and navigate to the project directory.

6. Create a Sample File:


• Use the cat command to create a sample text file named sample.txt with some
content.

7. Put File in Hadoop:


• Use hadoop fs -put to upload the local sample.txt file to Hadoop in the
/user/training directory.

8. Run WordCount Program:


• Execute the WordCount program using hadoop jar with input and output
paths.

9. Check Output:
• Use hadoop fs -ls to check the output directory and see the results.

10. Display Output:


• Use hadoop fs -cat to display the contents of the output file.

Steps:

Step 1: Create a new Java project in Eclipse. Name the project WordCountJob and click Finish.

Step 2:
Right-click the project -> New -> Class. Create class files named:
WordCount
WordMapper
WordReducer

Add the respective program contents, given below, to each file.
Run the program as a Java application.

Step 3: Right-click the project -> Build Path -> Add External Archives -> File System ->
usr -> lib -> hadoop-0.20 -> hadoop-core.jar, then click Add.

Step 4: Export the project as a jar file into the same folder (training/workspace/WordCountJob/src).

Step 5: Open a terminal.

[training@localhost ~]$ ls -l

Step 6: Create a sample file.

[training@localhost ~]$ cat sample.txt
hi how are you
bye see you soon
hi hi hi

Put the file in Hadoop.

Step 7: Put the content of sample.txt into HDFS as samplehadoop.txt.

[training@localhost ~]$ hadoop fs -put sample.txt /user/training/samplehadoop.txt
[training@localhost ~]$ hadoop fs -ls

Step 8: Change to the project source directory.

[training@localhost ~]$ cd /home/training/workspace/WordCountJob/src

Step 9: List the contents of that directory.

[training@localhost src]$ ls
sample.txt wordCount.jar WordCount.java WordMapper.java WordReducer.java

Step 10: Run the WordCount program and collect the output in sampleoutdir.

[training@localhost src]$ hadoop jar wordCount.jar WordCount samplehadoop.txt sampleoutdir

Step 11: Check the output directory.

[training@localhost src]$ hadoop fs -ls /user/training/sampleoutdir
Found 3 items
-rw-r--r--   1 training supergroup          0 2019-08-21 00:40 /user/training/sampleoutdir/_SUCCESS
drwxr-xr-x   - training supergroup          0 2019-08-21 00:40 /user/training/sampleoutdir/_logs
-rw-r--r--   1 training supergroup         27 2019-08-21 00:40 /user/training/sampleoutdir/part-r-00000

Step 12: Display the output.

[training@localhost src]$ hadoop fs -cat /user/training/sampleoutdir/part-r-00000
are 1
bye 1
hi 4
how 1

Prog 1: WordCount.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.out.printf("Usage: WordCount <input dir> <output dir>\n");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("wordCount");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}

Prog 2: WordMapper.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        String line = value.toString();
        for (String word : line.split("\\W+"))
        {
            if (word.length() > 0)
                context.write(new Text(word), new IntWritable(1));
        }
    }
}

Program 3: WordReducer.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int wordCount = 0;
        for (IntWritable value : values)
            wordCount += value.get();
        context.write(key, new IntWritable(wordCount));
    }
}

Output:

BDA

Experiment 5

Aim:
Experiment on Pig.

Theory:
1. Overview of Pig:
• Apache Pig is a high-level platform for processing and analysing large
datasets on top of Hadoop. It provides a simple scripting language called
Pig Latin for expressing data transformations.
• Pig is designed for parallel processing and can work with structured and
semi-structured data. It's particularly valuable for ETL (Extract,
Transform, Load) operations and data processing tasks in the Hadoop
ecosystem.
• Pig abstracts the complexities of writing MapReduce programs, making it
more accessible for data engineers, analysts, and data scientists to work
with big data.
2. Pig Architecture:
• Pig operates in two modes: local mode and MapReduce mode. In local
mode, Pig runs on a single machine, which is useful for testing and
debugging. In MapReduce mode, it takes advantage of Hadoop's
distributed processing capabilities to process data across a cluster.
• Pig consists of two main components:
• Pig Latin Interpreter: The Pig Latin interpreter parses and
executes Pig Latin scripts, which are written by users to define data
transformations.
• Execution Engine: Pig employs Hadoop's MapReduce engine or
Apache Tez for executing data processing tasks across a cluster.
3. Basics of Pig Latin:
• Pig Latin is the scripting language used in Pig for expressing data
transformations. It has a simple and expressive syntax for working with
data.

• Pig Latin scripts consist of a series of operations, where each operation
represents a transformation on the data. These transformations can
include loading data, filtering, grouping, joining, and more.
• Pig Latin scripts are translated into a series of MapReduce jobs when
executed in MapReduce mode.
4. Basics of Pig Operators:
• Load: The LOAD operator is used to load data into Pig from various data
sources, including HDFS, local files, and HBase. It specifies the location
and format of the data.
• Dump: The DUMP operator is used for debugging and displaying the
contents of relations in Pig. It's often used during script development to
inspect intermediate data.
• Foreach: The FOREACH operator is used to apply a projection
operation on a relation. It selects specific fields or generates new fields
based on the existing data.
• Filter: The FILTER operator is used to filter rows from a relation based
on specific conditions. It is useful for data selection.
• Join: The JOIN operator is used to combine data from multiple relations
based on a common field or key. It can perform various types of joins,
including inner, outer, and cross joins.
• Distinct: The DISTINCT operator is used to remove duplicate tuples
from a relation, preserving only unique values.
• Cogroup: The COGROUP operator groups data from multiple relations
based on a common field or key, allowing you to perform operations on
grouped data.
• Nested Projection: Pig allows you to project fields from nested data
structures, such as tuples, bags, and maps, using the FOREACH
operator. This is valuable for working with complex, nested data.
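
The operators above can be combined in a short Pig Latin script; the following is a hedged sketch in which the file paths, schemas, and field names are illustrative and not taken from this experiment's screenshots:

emp = LOAD '/user/training/emp.txt' USING PigStorage(',')
      AS (id:int, name:chararray, dept:chararray, sal:double);
names = FOREACH emp GENERATE id, name;      -- projection
high_pay = FILTER emp BY sal > 50000.0;     -- row selection
depts = LOAD '/user/training/dept.txt' USING PigStorage(',')
        AS (dept:chararray, location:chararray);
joined = JOIN emp BY dept, depts BY dept;   -- inner join on the dept field
uniq = DISTINCT names;                      -- remove duplicate tuples
DUMP high_pay;                              -- print a relation for debugging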

Step 1: Running Pig


• This step shows how to run Pig by invoking the pig command. It
initializes the Pig environment and connects to Hadoop for data
processing.

Step 2: Listing Files
• The fs -ls command within the Pig Grunt shell is used to list files and
directories in HDFS. It provides a view of the available data sources.

Step 3: Copying Data into HDFS


• In this step, the copyFromLocal command is used to copy data from the
local file system into HDFS. This is often done to prepare data for
analysis within Pig.

Step 4: Dumping Data
• The dump command is used to display the contents of a relation created
from loaded data. It's a way to verify the data's correctness and structure.

Step 5: Projection
• Projection involves selecting specific fields from a relation using the
FOREACH operator. It helps reduce the data to only the columns of
interest.

Step 6: Joins
• The JOIN operator combines data from different relations based on
specified keys. It is essential for merging data from multiple sources.

Step 7: Relational Operators
a. CROSS Operator:
• The CROSS operator is used to calculate the cross product of two or
more relations. It generates all possible combinations of records from
the input relations.
• In the context of Pig, it takes two or more relations as input and
produces a new relation containing all possible pairs of records from
the input relations.

b. DISTINCT Operator:
• The DISTINCT operator is used to remove duplicate tuples from a
relation. It retains only unique values while discarding redundant
records.
• It's a valuable operator when you want to ensure that the data contains
no duplicate entries.

c. FILTER Operator:
• The FILTER operator is used to select data from a relation based on
specific conditions or predicates. It allows you to filter rows that meet
specific criteria.
• Conditions are defined using comparison and logical operators, and
only rows that satisfy the conditions are retained.

d. FOREACH Operator:
• The FOREACH operator is essential for generating data
transformations based on column data. It allows you to apply
expressions to fields within a relation, generating new fields or
modifying existing ones.
• With FOREACH, you can create calculated fields, perform
mathematical operations, and manipulate data as needed.

e. COGROUP Operator:
• The COGROUP operator groups data from multiple relations based
on a common field or key. It is particularly useful when you need to
perform operations on data that share a common attribute.
• After co-grouping, you can apply operations on groups of data,
making it possible to analyze or process data in a more structured way.
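
Continuing the illustrative emp/depts relations from the earlier sketch, the relational operators above could look like this:

crossed = CROSS emp, depts;                       -- every pair of records
no_dups = DISTINCT emp;                           -- unique tuples only
mumbai = FILTER joined BY depts::location == 'Mumbai';
with_bonus = FOREACH emp GENERATE name, sal, sal * 0.10 AS bonus;
grouped = COGROUP emp BY dept, depts BY dept;     -- group both relations by dept
DUMP grouped;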

Step 8: Nested Projection


• Demonstrates how to perform projection operations on nested data
structures, such as tuples and bags, using the FOREACH operator.

BDA
EXPERIMENT NO. 6

Aim: Implementation of HIVE commands

Theory:

What is HIVE?

Hive is a data warehouse system developed by Facebook for analyzing structured


data on top of Hadoop. It allows reading, writing, and managing large datasets in
distributed storage. HQL (Hive Query Language) is used for running SQL-like
queries that internally get converted to MapReduce jobs. Hive supports Data
Definition Language (DDL), Data Manipulation Language (DML), and User Defined
Functions (UDF).

What is HQL?

Hive Query Language (HQL) is a SQL-like query language for managing large
datasets. It simplifies complex MapReduce programming for data analysis. Users
familiar with SQL can leverage HQL to write custom MapReduce frameworks for
advanced analysis.

Uses of Hive:

1. Apache Hive enables distributed storage.


2. Supports easy data extract/transform/load (ETL).
3. Provides structure for various data formats.
4. Allows access to files stored in Hadoop Distributed File System (HDFS) or
other data storage systems like Apache HBase.

Components of HIVE:

• Metastore: Stores schema information of Hive tables. By default, the


metastore runs in the same process as the Hive service, and the embedded
Derby database is the default metastore.

• SerDe: Serializer, Deserializer provides instructions to Hive on how to process


a record.

HIVE Organization:

Data in Hive is organized into Tables, Partitions, and Buckets. Tables are mapped to
directories in HDFS, partitions are subdirectories, and buckets are divisions within
partitions. Hive also has a Metastore storing metadata.
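
As a hedged illustration of this layout (the table, partition column, and bucket count below are made up for the example, not part of this experiment):

create table sales (id INT, amount DOUBLE)
partitioned by (yr INT)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;
-- e.g. /user/hive/warehouse/sales/yr=2023/ would hold one partition,
-- split into 4 bucket files.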

Limitations of HIVE:

• Not designed for Online Transaction Processing (OLTP); used for Online
Analytical Processing (OLAP).
• Supports overwriting or appending data, not updates and deletes.
• Subqueries are not supported.

Why Hive is used instead of Pig?

Reasons for using Hive over Pig:


• Hive-QL is declarative, similar to SQL, while PigLatin is a data flow language.
• Hive serves as a distributed data warehouse.

Steps:

1. To enter Hive terminal:

• Command: hive
• Explanation:
• Initiates Hive command-line interface.
• Provides an interactive shell for Hive operations.

2. To check databases:

• Command: show databases;


• Explanation:
• Lists available Hive databases.
• Aids in identifying existing databases.

3. To check tables:

• Command: show tables;


• Explanation:
• Displays tables in the current database.
• Facilitates exploration of available datasets.

4. To use a particular database:

• Command: use newemployee;


• Explanation:
• Switches context to a specified database.
• Subsequent commands operate within that database.

5. To create a database:

• Command: create database retail;


• Explanation:
• Initiates the creation of a new database named "retail."
• Provides a dedicated space for organizing related datasets.

6. To create a table "emp" in the "retail" database:

• Command: create table emp(id INT, name STRING, sal DOUBLE) row
format delimited fields terminated by ',' stored as textfile;
• Explanation:
• Changes to "retail" database context.
• Creates a table "emp" with specified columns and storage format.

7. To view schema information of the table:


• Command: describe emp;
• Explanation:
• Offers detailed structure of the "emp" table.
• Displays column names and data types.

8. To create a file "hive-emp.txt" and load data into the "emp" table:

• Command:
• cat /home/training/hive-emp.txt
• load data local inpath '/home/training/hive-emp.txt' into table emp;
• Explanation:

• Displays content of "hive-emp.txt."
• Loads data into "emp" table for data ingestion.

9. To view contents of the table:

• Command: select * from emp;


• Explanation:
• Retrieves all records from "emp" table.
• Allows inspection of stored data.

10. To rename the table "emp" to "emp_data":

• Command: alter table emp rename to emp_data;


• Explanation:
• Renames the "emp" table to "emp_data" for better management.

11. To select data from "emp_data" where id is 1:

• Command: select * from emp_data where id=1;


• Explanation:
• Retrieves and displays records where id is 1.
• Supports data exploration.

12. To count the number of records in the "emp_data" table:

• Command: select count(*) from emp_data;


• Explanation:
• Computes and displays the total record count.
• Provides a quick summary of dataset size.

13. Aggregate commands using HQL:

• Commands: select AVG(sal) as avg_salary from emp_data;


• Explanation:
• Computes average salary values.
• Leverages Hive-QL for analytical queries.

14. Max commands using HQL:

• Commands: select Max(sal) as max_salary from emp_data;


• Explanation:
• Computes maximum salary values.
• Leverages Hive-QL for analytical queries.

Big Data Analysis
Experiment – 7

Aim:
Implement Bloom Filter using Python/R Programming

Theory:
Introduction to Bloom Filters:
A Bloom Filter is a space-efficient probabilistic data structure used for
membership testing. It's designed to answer the question: "Is this element in the
set?" with some probability of false positives. Bloom Filters are particularly
useful when you want to quickly check if an element is part of a large dataset
without actually storing the entire dataset. They have applications in various
fields, including databases, networking, distributed systems, and more.

Basic Structure:
A Bloom Filter consists of the following components:
1. Bit Array (Bitmap): The core of the Bloom Filter is a fixed-size array of
bits. This array is typically initialized with all bits set to 0.
2. Hash Functions: A set of hash functions that map elements to positions
in the bit array. The number of hash functions used is a key parameter of
the Bloom Filter, often denoted as 'k.'

Key Operations:
1. Insertion: To add an element to the Bloom Filter, the element is hashed
by each of the 'k' hash functions, and the corresponding bit positions in
the bit array are set to 1.
2. Membership Test: To check if an element is in the set, you hash the
element using the same 'k' hash functions and check if all corresponding
bits are set to 1. If any of them is 0, you can be certain that the element is
not in the set. If all bits are 1, it's likely that the element is in the set, but
there's a chance of a false positive.

Probability of False Positives:
The probability of a false positive in a Bloom Filter depends on three main
factors:
1. Bit Array Size (m): The size of the bit array, which affects the filter's
capacity to store elements.
2. Number of Hash Functions (k): The number of hash functions used for
mapping elements to bits.
3. Number of Elements Inserted (n): The total number of elements added
to the Bloom Filter.
The formula to calculate the probability of a false positive (P_FP) is given by:
P_FP = (1 - e^(-kn/m))^k
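
For a quick worked example of this formula (the numbers below are illustrative, not taken from the experiment's code):

import math

m, k, n = 9585059, 7, 1000000   # bit array size, hash functions, inserted items
p_fp = (1 - math.exp(-k * n / m)) ** k
print(f"Estimated false positive rate: {p_fp:.4f}")   # roughly 0.01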

Characteristics:
1. Space Efficiency: Bloom Filters are memory-efficient because they only
require a fixed-size bit array, regardless of the number of elements stored.
2. False Positives: Bloom Filters can return false positives but never false
negatives. This means if the filter indicates that an element is not in the
set, it's correct. However, if it indicates that an element is in the set,
there's a small chance it might not be.
3. Parallelism: Bloom Filters allow for parallel insertion and membership
tests, as each hash function can be computed independently.

Use Cases:
Bloom Filters find applications in various scenarios, including:
• Caching: To quickly determine if a requested resource is in a cache
before accessing a slower storage system.
• Spell Checking: To efficiently check if a word exists in a dictionary.
• Distributed Systems: To reduce network requests by checking if a record
exists in a remote database.
• De-duplication: To eliminate duplicate entries in large datasets.

• Malware Detection: For quickly checking if a file is known to be
malware.
• Web Search Engines: For checking if a web page has already been
indexed.

Trade-offs:
• Bloom Filters have a fixed false positive rate, which may or may not be
suitable for your use case.
• As elements are inserted, the false positive rate may increase over time.
• Removing elements from a Bloom Filter is not straightforward because
bits once set to 1 cannot be safely reverted.
• Bloom Filters do not provide information about the stored elements
themselves; they only indicate presence or likely presence.

pip install mmh3

Collecting mmh3
Downloading mmh3-4.0.1-cp310-cp310-win_amd64.whl (36 kB)
Installing collected packages: mmh3
Successfully installed mmh3-4.0.1
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip is available: 23.0.1 -> 23.2.1


[notice] To update, run: python.exe -m pip install --upgrade pip

import math
import mmh3  # You may need to install this library using: pip install mmh3

class BloomFilter:
    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.error_rate = error_rate
        self.bit_array_size = self.calculate_bit_array_size()
        self.hash_functions_count = self.calculate_hash_functions_count()
        self.bit_array = [False] * self.bit_array_size

    def calculate_bit_array_size(self):
        # Calculate the optimal size of the bit array
        return int(-self.capacity * math.log(self.error_rate) / (math.log(2) ** 2))

    def calculate_hash_functions_count(self):
        # Calculate the optimal number of hash functions
        return int(self.bit_array_size * math.log(2) / self.capacity)

    def add(self, item):
        # Add an item to the Bloom Filter
        for i in range(self.hash_functions_count):
            index = mmh3.hash(item, seed=i) % self.bit_array_size
            self.bit_array[index] = True

    def contains(self, item):
        # Check if an item is possibly in the Bloom Filter
        for i in range(self.hash_functions_count):
            index = mmh3.hash(item, seed=i) % self.bit_array_size
            if not self.bit_array[index]:
                return False
        return True

if __name__ == "__main__":
    capacity = 1000000   # Set the expected capacity of your data
    error_rate = 0.001   # Set the desired error rate (false positive probability)

    bloom_filter = BloomFilter(capacity, error_rate)

    # Add items to the Bloom Filter
    items_to_add = ["etash", "komal", "yash", "khushi", "ananya", "diza"]
    for item in items_to_add:
        bloom_filter.add(item)

    # Check if items are in the Bloom Filter
    items_to_check = ["komal", "etash", "jay", "manav", "vedant", "mahek", "yash", "khushi"]
    for item in items_to_check:
        if bloom_filter.contains(item):
            print(f"'{item}' may be in the set (false positive possible)")
        else:
            print(f"'{item}' is definitely not in the set")

'komal' may be in the set (false positive possible)
'etash' may be in the set (false positive possible)
'jay' is definitely not in the set
'manav' is definitely not in the set
'vedant' is definitely not in the set
'mahek' is definitely not in the set
'yash' may be in the set (false positive possible)
'khushi' may be in the set (false positive possible)

BDA

Experiment - 8

Aim:-

Implement FM algorithm using Python/R Programming.

Theory:-

Factorization Machines (FMs) are a class of machine learning models that


are particularly useful for solving problems in recommendation systems,
classification, and regression tasks. They were introduced by Steffen Rendle
in his paper "Factorization Machines" in 2010. FMs are designed to capture
interactions between features in a dataset, especially when dealing with
sparse and high-dimensional data. Below, I'll provide an extensive theoretical
overview of Factorization Machines:

1. Linear Models and Limitations:

- Linear models, such as linear regression and logistic regression, assume


that features are independent of each other. They cannot capture complex
interactions between features.

2. Motivation for Factorization Machines:

- Factorization Machines were introduced to address the limitations of


linear models by capturing pairwise feature interactions.
- These interactions can be essential for improving predictive accuracy,
especially in recommendation systems and collaborative filtering.

3. Basic Idea of Factorization Machines:

- FMs extend the linear model by introducing a latent factorization of


feature interactions.
- Instead of directly modeling the interaction weights, they model them as
latent factors, which are learned from data.
- The model estimates the contribution of each feature's interaction with
others through a latent factor vector.

4. Mathematical Formulation:

- Given a dataset with binary classification or regression tasks:


- Let `X` be the feature matrix with `n` samples and `m` features.
- The linear component of FM is similar to linear regression:
`w0 + Σ(w_i * x_i)`, where `w_i` are weights for each feature.
- The key difference is the factorization of interactions:
- `Σ(Σ(v_ij * x_i * x_j))`, where `v_ij` is the learned factor for the interaction
between features `i` and `j`.

5. Model Parameters:

- `w0`: Global bias term.


- `w_i`: Weight for feature `i`.
- `v_ij`: Interaction factor for features `i` and `j`.
- `n_factors`: The number of latent factors to capture interactions.

6. Predictions:

- The prediction for a single data point `x` is calculated as:


- `Prediction(x) = w0 + Σ(w_i * x_i) + 0.5 * Σ(Σ(v_ij * x_i * x_j))`
- The second term captures pairwise interactions through the latent
factors.
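
The factorized interaction term is equivalent to summing over all feature pairs; a small NumPy check of that identity is sketched below (the sizes and random values are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
m_features, n_factors = 6, 3
x = rng.random(m_features)
V = rng.normal(0, 0.1, size=(m_features, n_factors))

# Naive pairwise sum: sum over i<j of <v_i, v_j> * x_i * x_j
naive = sum(np.dot(V[i], V[j]) * x[i] * x[j]
            for i in range(m_features) for j in range(i + 1, m_features))

# Factorized O(m*k) form used by FMs
factorized = 0.5 * np.sum(np.dot(x, V) ** 2 - np.dot(x ** 2, V ** 2))

print(np.isclose(naive, factorized))   # True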

7. Learning:

- FMs are typically trained using optimization algorithms like Stochastic


Gradient Descent (SGD) or Alternating Least Squares (ALS).
- During training, both the linear weights (`w_i`) and the latent factors (`v_ij`)
are learned from the data.

8. Advantages:

- FMs can capture complex feature interactions in high-dimensional and


sparse datasets.
- They are especially powerful in recommendation systems, where users
and items have intricate relationships.

9. Variants of Factorization Machines:

- Field-Aware Factorization Machines (FFMs) extend FMs to consider


feature interactions within specific feature groups or fields.
- Deep Factorization Machines combine FMs with deep neural networks to
capture higher-order feature interactions.

10. Use Cases:

- Recommendation Systems: FMs are widely used in collaborative filtering


for personalized recommendations.
- Click-Through Rate (CTR) Prediction: FMs are used to predict the
likelihood of a user clicking on an advertisement.
- Natural Language Processing: FMs can be applied to text classification
tasks where feature interactions are important.

11. Challenges:

- Model Complexity: The computational complexity of FMs increases


quadratically with the number of features, which can be challenging for large
datasets.
- Hyperparameter Tuning: Tuning the number of latent factors and
regularization terms is crucial for model performance.

In summary, Factorization Machines are a versatile machine learning model


that excels at capturing feature interactions, making them particularly useful
in recommendation systems and other tasks where complex relationships
between features need to be modeled. However, their computational
complexity can be a challenge for large datasets, and careful
hyperparameter tuning is necessary for optimal performance.

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class FactorizationMachine:
    def __init__(self, n_factors, learning_rate, n_iterations):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    def initialize_parameters(self, X):
        n_features = X.shape[1]
        self.w0 = 0.0
        self.w = np.zeros(n_features)
        self.V = np.random.normal(0, 0.01, size=(n_features, self.n_factors))

    def predict(self, X):
        linear_term = self.w0 + np.dot(X, self.w)
        interaction_term = 0.5 * np.sum((np.dot(X, self.V) ** 2) - np.dot(X ** 2, self.V ** 2), axis=1)
        return linear_term + interaction_term

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def fit(self, X_train, y_train):
        self.initialize_parameters(X_train)

        for _ in range(self.n_iterations):
            y_pred = self.sigmoid(self.predict(X_train))
            error = y_train - y_pred

            self.w0 += self.learning_rate * np.sum(error)
            self.w += self.learning_rate * np.dot(X_train.T, error)
            for i in range(X_train.shape[1]):
                interaction_term = X_train[:, i] * (np.dot(X_train, self.V)[:, i] - X_train[:, i] * self.V[i, i])
                self.V[i, :] += self.learning_rate * np.dot(X_train[:, i], error - interaction_term)

    def evaluate(self, X_test, y_test):
        y_pred = self.predict(X_test)
        y_pred = np.round(self.sigmoid(y_pred))
        accuracy = accuracy_score(y_test, y_pred)
        return accuracy

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X = np.random.randint(2, size=(1000, 10))  # Feature matrix (binary features)
    y = np.random.randint(2, size=1000)        # Binary labels

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and train the FM model
    fm = FactorizationMachine(n_factors=10, learning_rate=0.01, n_iterations=100)
    fm.fit(X_train, y_train)

    # Evaluate the model
    accuracy = fm.evaluate(X_test, y_test)
    accuracy = accuracy * 100
    print(f"Accuracy: {accuracy:.2f}")

Accuracy: 48.50

/var/folders/c5/05rtc57s0vj8p6whh_qf51pc0000gn/T/
ipykernel_501/3572402935.py:24: RuntimeWarning: overflow encountered
in exp
return 1 / (1 + np.exp(-x))

BDA
EXPERIMENT NO. 9

Aim: Data Visualization using R.


Theory:

Histogram:

Histograms are indispensable tools for comprehensively examining the distribution of a


continuous variable, such as MBA percentage. By employing the hist function, the data is
intelligently partitioned into bins, allowing for a nuanced portrayal of the frequency
distribution. This visualization not only elucidates central tendencies and spread but also
identifies potential skewness or outliers in MBA percentages. The histogram serves as a
robust diagnostic tool for assessing the academic performance landscape, aiding educators
and analysts in gauging the prevalence of specific grade ranges or concentration areas.

Scatter Plot:

Scatter plots emerge as invaluable aids in unraveling the intricate relationship between two
continuous variables. The code snippet adeptly uses the plot function to depict the interplay
between test scores and salary. Beyond individual data points, the scatter plot allows for the
discernment of trends, clusters, or outliers, furnishing analysts with a visual narrative of how
test scores might impact salary outcomes. This graphical exploration deepens the
understanding of the nuanced connections within the dataset, enabling data-driven decision-
making and strategic planning.

Boxplot:

Boxplots, as eloquently showcased in the code, offer a sophisticated visualization of the


distribution of a numerical variable like HSC percentage. The boxplot not only encapsulates
key statistical measures like median and quartiles but also flags potential outliers. By
presenting a compact summary of the HSC percentage distribution, this visualization tool

aids in pinpointing variations, asymmetries, or patterns that might influence academic
outcomes. The boxplot's elegance lies in its ability to distill complex distributional
information into a visually accessible format, empowering analysts to glean insights
effortlessly.

Heatmap:

Heatmaps, as harnessed in this code snippet, are versatile instruments for unraveling intricate
patterns within a matrix. The selected subset of columns, transformed into a matrix, is
ingeniously visualized as a heatmap with column scaling. This technique not only facilitates
the identification of correlations but also brings out nuanced variations in the dataset. By
utilizing color gradients, the heatmap offers a nuanced representation of the relationships
between different variables, laying bare underlying structures that might be overlooked in raw
numerical data.

Barplot:

The barplot depicted in the code extends its utility beyond mere enumeration, becoming a
narrative of categorical distributions. By graphically representing the count of males and
females across placement status categories, this visualization allows for an at-a-glance
understanding of demographic proportions within different outcomes. The judicious use of
color and legend placement enhances interpretability, making it a powerful tool for conveying
gender distribution nuances within the context of placement outcomes. This visual aid
expedites decision-making processes related to gender diversity initiatives or targeted
interventions.

- The analysis aims to explore various aspects of the dataset, including the distribution of
MBA percentages, relationships between test scores and salary, the distribution of HSC
percentages, patterns in selected variables via a heatmap, and the distribution of
placement status by gender.
- Each visualization technique provides a unique perspective on the data, helping analysts
and stakeholders gain insights into different facets of the dataset. The combination of

histograms, scatter plots, boxplots, heatmaps, and barplots allows for a comprehensive
exploratory data analysis (EDA). This EDA is crucial for understanding data patterns,
making informed decisions, and potentially identifying areas for further investigation or
modeling.

Code:
data <- read.csv("Placement_Data_Full_Class.csv")
hist(data$mba_p, main = "MBA Percentage", xlim = c(40, 90), freq = TRUE)
data <- read.csv("Placement_Data_Full_Class.csv")
plot(x = data$etest_p, y = data$salary,
xlab = "Test Scores", ylab = "Salary",
xlim = c(45, 100), ylim = c(200000, 800000)
)
data <- read.csv("Placement_Data_Full_Class.csv")
boxplot(data$hsc_p,
xlab = "Box Plot", ylab = "HSC Percentage", notch = TRUE
)
data <- read.csv("Placement_Data_Full_Class.csv")
data3 <- data[, c(3, 5, 8, 11)]
data2 <- as.matrix(data3)
heatmap(data2, scale = "column")
data <- read.csv("Placement_Data_Full_Class.csv")
grouped <- table(data$gender, data$status)
barplot(as.matrix(grouped), col = c("grey", "black"))
legend("topright", legend = rownames(grouped), fill = c("grey", "black"))

Data Set:

[Screenshot of the Placement_Data_Full_Class.csv dataset]
Output:

[Histogram titled "MBA Percentage": frequency of data$mba_p, x-axis 40 to 90]
[Scatter plot: Test Scores (50 to 100) on the x-axis against Salary (2e+05 to 8e+05) on the y-axis]
[Notched box plot of HSC Percentage, roughly 40 to 100]
[Column-scaled heatmap of ssc_p, hsc_p, degree_p and etest_p]
[Stacked bar plot of gender counts (F/M) for placement status "Not Placed" and "Placed"]
Experiment 10

Analyzing property data using Hive refers to leveraging Hive, a data
warehousing system with a SQL-like query language that runs on top of
Hadoop, to process and analyze large datasets related to real estate
properties. Here's a step-by-step guide on how to approach property data
analysis using Hive:

1. Data Ingestion:
- Acquire property data from various sources, such as public records, real
estate websites, or APIs. Ensure the data is in a suitable format, like CSV,
and load it into the Hadoop Distributed File System (HDFS).

2. Hive Data Modeling:


- Define a Hive schema for your property data. This involves creating a
Hive table that maps to the structure of your data files. You can use Hive's
Data Definition Language (DDL) to create tables with appropriate columns
and data types.

create table house (price INT, area INT, bedrooms INT, bathrooms INT,
stories INT, mainroad STRING, guestroom STRING, basement STRING,
hotwaterheating STRING) row format delimited fields terminated by ','
stored as textfile;

3. Data Loading:
- Load the data into the Hive table you've created. This can be done
using Hive's `LOAD DATA` or `INSERT INTO` statements.

Creating and loading table house
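
A minimal sketch of this step (the local CSV path is a placeholder, not a file from this experiment):

load data local inpath '/home/training/housing.csv' into table house;
select * from house limit 5;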

4. Data Exploration:
- Use Hive SQL to explore the data. You can write queries to get an initial
sense of the data, such as finding the average price, distribution of
bedrooms, or the properties built after a certain year.
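
For example (hedged sketch; the queries assume the house table created in Step 2):

select avg(price) from house;
select bedrooms, count(*) from house group by bedrooms;
select * from house where area > 5000 limit 10;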

5. Data Transformation:
- Clean and transform the data as needed. This might involve handling
missing values, outliers, or converting data types. Hive provides various
built-in functions for these operations.

6. Aggregation and Reporting:


- Aggregate data to derive insights. You can generate reports or
visualizations from your property data to identify trends or patterns, like the
average property price in different regions, year-wise trends, or property
age distributions.

Aggregation of data
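
A hedged example of such an aggregation, consistent with the schema defined earlier:

select bedrooms, avg(price) as avg_price, max(price) as max_price
from house
group by bedrooms
order by avg_price desc;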
