
Procedure: 1

Java installation: Ensure your system is up to date and install Java using the following commands

sudo apt update

sudo apt install default-jdk -y

Download Hadoop: Downloading a stable version of Hadoop using the following command

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Extraction of Hadoop: Extracting Hadoop from the tarball using the following command

tar -xzf hadoop-3.3.6.tar.gz

Moving Hadoop: Moving the Hadoop software into the installation location (the lowercase path must match HADOOP_HOME set below)

sudo mv hadoop-3.3.6 /usr/local/hadoop

Set up environment variables: open the ~/.bashrc file using the command nano ~/.bashrc and then

add the following three lines

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

Save the file, then apply the changes to the current shell using the following command

source ~/.bashrc

Configure Hadoop:

Open the core-site.xml file using the command nano $HADOOP_HOME/etc/hadoop/core-site.xml

and add the following lines between the configuration tag

<configuration>

<property>

<name>fs.defaultFS</name>

<value>file:///</value>

</property>
</configuration>

Open the hdfs-site.xml file using the command nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

and add the following lines between the configuration tag

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

Verify the installation: Verify the Hadoop installation by running the bundled MapReduce example that estimates pi (the arguments request 16 map tasks with 100 samples each)

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 16 100
Procedure: 2

Updating Ubuntu and installing Java 8

sudo apt update

sudo apt install openjdk-8-jdk

Checking the Java installation location: cd /usr/lib/jvm

Open the ~/.bashrc file and add the following environment variables

nano ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin

export HADOOP_HOME=~/hadoop-3.4.0/

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.0.jar

export HADOOP_LOG_DIR=$HADOOP_HOME/logs

export PDSH_RCMD_TYPE=ssh

Installing SSH

sudo apt-get install ssh

Download Hadoop 3.4.0 from the Apache website into ~/Downloads, then extract it

tar -zxvf ~/Downloads/hadoop-3.4.0.tar.gz

cd hadoop-3.4.0/etc/hadoop

ls
Configuring hadoop-env.sh

sudo nano hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Configuring core-site.xml

sudo nano core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

<property>
<name>hadoop.proxyuser.dataflair.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.dataflair.hosts</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.server.hosts</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.server.groups</name>
<value>*</value>
</property>

Configuring hdfs-site.xml

sudo nano hdfs-site.xml

<property>

<name>dfs.replication</name>

<value>1</value>

</property>
Configuring mapred-site.xml

sudo nano mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>

Configuring yarn-site.xml

sudo nano yarn-site.xml

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

</property>

Generating Key

ssh localhost

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 0600 ~/.ssh/authorized_keys


Formatting Namenode

hadoop-3.4.0/bin/hdfs namenode -format

export PDSH_RCMD_TYPE=ssh

Starting Hadoop

start-all.sh

Hadoop Web Browser Interface URL (NameNode UI)

localhost:9870
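start-all.sh also launches YARN, so the ResourceManager web interface is normally reachable as well

localhost:8088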
Procedure: 3
Directory Creation: Directory can be created using the following command

hdfs dfs -mkdir /user

The above command creates the directory user at the root of HDFS

hdfs dfs -mkdir /user/yourusername

The above command creates the subdirectory yourusername inside the directory user

File uploading: The files can be uploaded into Hadoop framework using the following commands

hdfs dfs -put input.txt /user/yourusername/

The above command uploads the file input.txt into the folder yourusername

hdfs dfs -copyFromLocal input.txt /user/yourusername/

The above command also uploads the file input.txt into the folder yourusername; -copyFromLocal behaves like -put but accepts only a local file as the source
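To confirm that an upload succeeded, the target directory can be listed (a quick check using the same paths as above)

hdfs dfs -ls /user/yourusername/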

File Downloading: The files can be downloaded from the Hadoop framework using the following command

hadoop fs -get /user/yourusername/output/part-00000 /local/path/

This command will download the file part-00000 from the HDFS directory

/user/yourusername/output/ to the local directory /local/path/.

File Deletion: The files can be deleted from the Hadoop framework using the following commands

hadoop fs -rm /user/yourusername/input/input.txt

This will delete the file input.txt from the directory /user/yourusername/input/ in HDFS.

hadoop fs -rm -r /user/yourusername/output/

This will delete the entire output/ directory, including all its files and subdirectories.

hadoop fs -rm -skipTrash /user/yourusername/input/input.txt

This will delete files immediately, bypassing the trash. When the trash feature is enabled (fs.trash.interval greater than 0), Hadoop moves deleted files to a trash folder before final deletion.

Directory Deletion: Empty directories can be deleted from the Hadoop framework using the following command

hadoop fs -rmdir /user/yourusername/empty_directory/


Program: 4
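The program below uses the fastavro library, which is assumed to be installed already; if it is not, it can be installed with pip

pip install fastavro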
import fastavro

# Define the Avro schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None}
    ]
}

# File path for the Avro file
file_path = 'users_fastavro.avro'

# Sample data to be written to the Avro file (the email address is a placeholder)
records = [
    {"name": "John Doe", "age": 29, "email": "john.doe@example.com"},
    {"name": "Alice Smith", "age": 34, "email": None}
]

# Function to write data into an Avro file
def write_avro_file(file_path, schema, records):
    with open(file_path, 'wb') as avro_file:
        # Use fastavro.writer to write records into the Avro file
        fastavro.writer(avro_file, schema, records)
    print(f"Data written to {file_path}")

# Function to read data from an Avro file
def read_avro_file(file_path):
    with open(file_path, 'rb') as avro_file:
        # Use fastavro.reader to read records from the Avro file
        reader = fastavro.reader(avro_file)
        print("Reading data from Avro file:")
        for record in reader:
            print(record)

# Write data to the Avro file
write_avro_file(file_path, schema, records)

# Read data from the Avro file
read_avro_file(file_path)
Program: 5
Install mrjob

pip install mrjob

Create a Python script for the MapReduce job and save it as word_count.py

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words)
        ]

    def mapper_get_words(self, _, line):
        # Emit each word in the line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        # Sum up the occurrences of each word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()

Run the MapReduce Job

Create a text file called input.txt with some content:

Hello World

Hello Python

Python is fun

Hello MapReduce

MapReduce is powerful

Run the Python script

python word_count.py input.txt

Output:
"fun" 1

"hello" 3

"is" 2

"mapreduce" 2

"powerful" 1

"python" 2

"world" 1
Program: 6
Create a mapper.py:

#!/usr/bin/env python3
# The shebang line lets Hadoop Streaming execute the script directly once it is marked executable
import sys

# Emit a (word, 1) pair for every word read from standard input
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

Create a reducer.py:

#!/usr/bin/env python3
# The shebang line lets Hadoop Streaming execute the script directly once it is marked executable
import sys

current_word = None
current_count = 0
word = None

# Hadoop delivers the mapper output sorted by key, so counts for the same word arrive together
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to STDOUT
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the count for the last word
if current_word == word:
    print(f"{current_word}\t{current_count}")
Make both scripts executable:

chmod +x mapper.py reducer.py

Prepare the Input Data

echo "Hadoop is a framework Hadoop is fun" > input.txt

Upload this file to HDFS:

hadoop fs -mkdir -p /user/hadoop/input

hadoop fs -put input.txt /user/hadoop/input/

Run the Word Count Program using Hadoop Streaming

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \

-input /user/hadoop/input/input.txt \

-output /user/hadoop/output \

-mapper mapper.py \

-reducer reducer.py \

-file mapper.py \

-file reducer.py

Check the Output

hadoop fs -cat /user/hadoop/output/part-00000

output

Hadoop 2

a 1

framework 1

fun 1

is 2
Program: 7
Create a mapper.py:

#!/usr/bin/env python3
import sys

# Dimensions of the example matrices: A is m x n and B is n x p (both 3 x 3 here)
m = 3  # number of rows in matrix A
p = 3  # number of columns in matrix B

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    matrix, row, col, value = line.split(',')
    if matrix == 'A':
        # Each element of A contributes to every cell (row, k) of the result
        for k in range(1, p + 1):
            print(f"{row},{k}\tA,{col},{value}")
    elif matrix == 'B':
        # Each element of B contributes to every cell (i, col) of the result
        for i in range(1, m + 1):
            print(f"{i},{col}\tB,{row},{value}")

Create a reducer.py:

#!/usr/bin/env python3
import sys
from collections import defaultdict

current_key = None
A_values = defaultdict(float)
B_values = defaultdict(float)

def emit(key):
    # Multiply A's row and B's column entries that share the same index j
    result = 0
    for j in A_values:
        result += A_values.get(j, 0) * B_values.get(j, 0)
    # The :g format drops the trailing .0 for whole numbers
    print(f"{key}\t{result:g}")

# Hadoop delivers the mapper output sorted by key, so all values for one result cell arrive together
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, value = line.split('\t')
    matrix, j, v = value.split(',')
    v = float(v)
    if key != current_key and current_key is not None:
        emit(current_key)
        # Reset for the next key
        A_values.clear()
        B_values.clear()
    current_key = key
    if matrix == 'A':
        A_values[j] = v
    elif matrix == 'B':
        B_values[j] = v

# Output the last key result
if current_key:
    emit(current_key)

Make both scripts executable:

chmod +x mapper.py reducer.py

Prepare the Input Data

Matrix A (a.txt):

A,1,1,1

A,1,2,2

A,1,3,3

A,2,1,4

A,2,2,5

A,2,3,6

A,3,1,7

A,3,2,8

A,3,3,9
Matrix B (b.txt):

B,1,1,1

B,1,2,2

B,1,3,3

B,2,1,4

B,2,2,5

B,2,3,6

B,3,1,7

B,3,2,8

B,3,3,9
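Before running on Hadoop, the pipeline can be tested locally in the same way as the word count example (this assumes python3 is on the PATH)

cat a.txt b.txt | python3 mapper.py | sort | python3 reducer.py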

Upload these files to HDFS:

hadoop fs -mkdir -p /user/hadoop/matrix_input

hadoop fs -put a.txt /user/hadoop/matrix_input/

hadoop fs -put b.txt /user/hadoop/matrix_input/

Run the Matrix Multiplication Program using Hadoop Streaming

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \

-input /user/hadoop/matrix_input/a.txt \

-input /user/hadoop/matrix_input/b.txt \

-output /user/hadoop/matrix_output \

-mapper mapper.py \

-reducer reducer.py \

-file mapper.py \

-file reducer.py

Check the Output

hadoop fs -cat /user/hadoop/matrix_output/part-00000

output

1,1 30

1,2 36

1,3 42
2,1 66

2,2 81

2,3 96

3,1 102

3,2 126

3,3 150
Procedure: 8
Step 1: Install Java (if not installed)
sudo apt-get install openjdk-11-jdk
Step 2: Download Apache Pig using the following command
wget https://dlcdn.apache.org/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 3: To untar the pig-0.16.0.tar.gz file, run the following command
tar xvzf pig-0.16.0.tar.gz
Step 4: Move the Pig directory to /usr/local/pig: sudo mv pig-0.16.0 /usr/local/pig
Step 5: Set up environment variables. Open ~/.bashrc using the following command
nano ~/.bashrc
Add the following lines to set the Pig and Java environment variables
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Save and reload the file: source ~/.bashrc
Step 6: Verify the Pig installation
pig -version
Run a Sample Pig Script
Create the input file with one record per line:
printf "John,28\nAlice,32\nBob,45\nCarol,25\n" > input.txt
Create the Pig script using the following command
nano sample.pig
Add the following script to the file sample.pig:
-- Load the file into Pig
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- Filter out people younger than 30
filtered_data = FILTER data BY age > 30;
-- Dump the result to the screen
DUMP filtered_data;

Run the Pig script in local mode: pig -x local sample.pig
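With the sample input above, the DUMP should print the two records whose age is greater than 30, similar to:
(Alice,32)
(Bob,45)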


Procedure: 9

1. Download the HBase tarball using the following command

wget https://downloads.apache.org/hbase/2.4.18/hbase-2.4.18-bin.tar.gz

2. Extract the tarball using the following command

tar -xvzf hbase-2.4.18-bin.tar.gz

3. Move it to the path /usr/local/hbase

sudo mv hbase-2.4.18 /usr/local/hbase

4. Open the bashrc file using the command: nano ~/.bashrc and add the following lines

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin

5. Apply the changes using the command:

source ~/.bashrc

6. Change to the HBase configuration directory using the command


cd $HBASE_HOME/conf

7. Open the file hbase-site.xml using the command: nano hbase-site.xml and add the following
lines inside the configuration tag

<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase/data</value>
<description>The directory shared by RegionServers.</description>
</property>

<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
<description>The mode HBase is running in. `false` means standalone mode.
</description>
</property>

<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
<description>Data directory for ZooKeeper</description>
</property>

8. Create the following data directories:

mkdir -p /usr/local/hbase/data
mkdir -p /usr/local/hbase/zookeeper

9. Start HBase using the following command


start-hbase.sh

10. Open the HBase shell using the following command

hbase shell

11. Sample HBase command: the following command creates a table named student with one column
family named info

create 'student', 'info'

12. Stop HBase using the following command

stop-hbase.sh

HBase practice queries:

Create Table: create 'employee', 'details'

output:
0 row(s) in 1.2340 seconds
=> Hbase::Table – employee
Insert Data into Table

put 'employee', '1', 'details:name', 'Alice'
put 'employee', '1', 'details:age', '30'
put 'employee', '1', 'details:position', 'Manager'
put 'employee', '2', 'details:name', 'Bob'
put 'employee', '2', 'details:age', '25'
put 'employee', '2', 'details:position', 'Developer'
put 'employee', '3', 'details:name', 'Charlie'
put 'employee', '3', 'details:age', '28'
put 'employee', '3', 'details:position', 'Analyst'

Retrieve Data for a Specific Employee

get 'employee', '1'

output

COLUMN CELL
details:age timestamp=1624980701918, value=30
details:name timestamp=1624980701818, value=Alice
details:position timestamp=1624980702003, value=Manager
3 row(s) in 0.0250 seconds

Retrieve a Specific Column (e.g., Name) for an Employee

get 'employee', '2', 'details:name'

output

COLUMN CELL
details:name timestamp=1624980712356, value=Bob
1 row(s) in 0.0100 seconds
Scan the Entire Table

scan 'employee'

output

ROW COLUMN CELL


1 details:age timestamp=1624980701918, value=30
1 details:name timestamp=1624980701818, value=Alice

1 details:position timestamp=1624980702003, value=Manager


2 details:age timestamp=1624980702205, value=25
2 details:name timestamp=1624980702183, value=Bob

2 details:position timestamp=1624980702291, value=Developer


3 details:age timestamp=1624980702557, value=28
3 details:name timestamp=1624980702531, value=Charlie
3 details:position timestamp=1624980702611, value=Analyst
9 row(s) in 0.0560 seconds

Count the Number of Rows in the Table

count 'employee'

output

3 row(s) in 0.0430 seconds


Update Data in a Row

put 'employee', '2', 'details:position', 'Senior Developer'

Delete a Column (e.g., Age) for a Specific Employee

delete 'employee', '3', 'details:age'

Delete an Entire Row

deleteall 'employee', '2'

Describe the Table Schema

describe 'employee'

Table employee is ENABLED


COLUMN FAMILIES DESCRIPTION
{NAME => 'details', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY =>
'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE',
TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'} 1 row(s) in
0.0250 seconds
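To remove the practice table when finished, it must first be disabled and can then be dropped (standard HBase shell commands):

disable 'employee'
drop 'employee'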
Procedure 10

1. Install Java (Java 8 or later is required):


sudo apt update
sudo apt install openjdk-8-jdk -y
Verify Java installation:
java -version

2. Install Hadoop (Hive requires Hadoop to be installed and running):


- Follow the Hadoop Installation Guide, or use this basic guide:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop

Add Hadoop environment variables to your ~/.bashrc:


export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Reload the profile:


source ~/.bashrc

Test Hadoop:
hadoop version

Step 1: Download and Extract Apache Hive

1. Download the Hive 3.1.3 binary from the Apache Hive downloads page:
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

2. Extract the tarball:


tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive

Step 2: Configure Environment Variables

1. Add Hive environment variables to ~/.bashrc:


export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. Apply the changes:


source ~/.bashrc
Step 3: Configure Hive

1. Create Directories for Hive Warehouse:


sudo mkdir -p /usr/local/hive/warehouse
sudo chmod -R 777 /usr/local/hive/warehouse

2. Configure hive-site.xml:
- Go to the conf directory:
cd /usr/local/hive/conf

- Copy the default configuration template:


cp hive-default.xml.template hive-site.xml

- Open hive-site.xml for editing:


nano hive-site.xml

- Replace the contents of hive-site.xml with the following minimal configuration (the copied template already contains a <configuration> block, so the properties should not simply be appended):


<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/local/hive/warehouse</value>
<description>Location of default database for the warehouse</description>
</property>
</configuration>

3. Install MySQL (Optional for Production):


If you want a production-ready metastore, install MySQL and configure it with Hive:
sudo apt install mysql-server -y

Step 4: Start Hadoop

Start the Hadoop services:


start-dfs.sh
start-yarn.sh

Verify that Hadoop is running:


jps

You should see processes like NameNode, DataNode, ResourceManager, etc.
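For Hive 3.x with the embedded Derby metastore, the metastore schema usually needs to be initialized once before the first start (schematool ships with Hive and is on the PATH via HIVE_HOME):

schematool -dbType derby -initSchema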


Step 5: Start Hive

1. Run Hive using the command:


hive

You should see the Hive command-line interface (CLI).

2. Test Hive by creating a table:


CREATE TABLE test (id INT, name STRING);
INSERT INTO test VALUES (1, 'Ubuntu'), (2, 'Hive');
SELECT * FROM test;
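A delimited text file can also be loaded into a table whose row format matches it; the sketch below assumes a comma-separated file at /home/ubuntu/people.csv (both the path and the table are illustrative):

CREATE TABLE people (name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/ubuntu/people.csv' INTO TABLE people;
SELECT * FROM people;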
Procedure 11
Step 1: Connect to Your Cloud Instance
SSH into your first node:
ssh -i classkey ubuntu@<student_IP>
Step 2: Install Required Packages
Before installing Cassandra, install required libraries:
sudo apt-get install libaio1
Step 3: Install Apache Cassandra
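One common way to install Cassandra on Ubuntu is through the official Apache APT repository; the sketch below uses the 4.1 release line, so adjust the series number if a different version is required:
echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra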
Step 4: Configure the Cluster
Modify the cassandra.yaml configuration file to set up your cluster settings (e.g.,
seed nodes).
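The fields most commonly edited in /etc/cassandra/cassandra.yaml look roughly like this (a sketch; the cluster name and addresses are placeholders):
cluster_name: 'LabCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "<first_node_IP>"
listen_address: <this_node_IP>
rpc_address: 0.0.0.0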
Step 5: Launch Cassandra
Start the Cassandra service:
sudo service cassandra start
Step 6: Verify Cluster Status
Check if the cluster is running:
nodetool status
Step 7: Create a Keyspace and Table
Open the CQL shell with the command cqlsh and create a keyspace:
CREATE KEYSPACE lab WITH REPLICATION = { 'class': 'SimpleStrategy',
'replication_factor': 3 };
USE lab;
Create a table:
CREATE TABLE users (
email TEXT PRIMARY KEY,
password TEXT,
user_id UUID
);
Step 8: Insert Data into the Table
Insert sample data:
INSERT INTO users (email, password, user_id) VALUES ('alice@example.com', 'password123', uuid());
Step 9: Query Data from the Table
Retrieve data:
SELECT * FROM users WHERE email = 'alice@example.com';
Output
 email             | password    | user_id
-------------------+-------------+--------------------------------------
 alice@example.com | password123 | 123e4567-e89b-12d3-a456-426614174000
