Procedure: 1
Java installation: Ensure your system is up to date and install Java using the following commands
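The exact commands are not reproduced above; on an Ubuntu/Debian system (assumed here), a typical sequence that matches the Java 8 path used later is:
sudo apt-get update                # refresh the package index (assumed Debian/Ubuntu)
sudo apt-get install openjdk-8-jdk # install Java 8, matching /usr/lib/jvm/java-8-openjdk-amd64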
Download Hadoop: Download a stable version of Hadoop using the following command
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Extraction of Hadoop: Extracting Hadoop from the tarball using the following command
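The extraction command is not shown; a minimal sketch, assuming the tarball downloaded above and the /usr/local/hadoop location used for HADOOP_HOME below, is:
tar -xvzf hadoop-3.3.6.tar.gz            # unpack the downloaded tarball
sudo mv hadoop-3.3.6 /usr/local/hadoop   # move it to the HADOOP_HOME location used below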
Set up environment variables: open the ~/.bashrc file using the command nano ~/.bashrc, add the following lines, and then reload the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
source ~/.bashrc
Configure Hadoop: for standalone mode, edit $HADOOP_HOME/etc/hadoop/core-site.xml so that the default filesystem is the local filesystem:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
</configuration>
Then edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and set the replication factor:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Verify the installation: verify the Hadoop installation by running a default Hadoop example program, as sketched below.
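A minimal check, assuming version 3.3.6 as downloaded above, is to print the version and run the bundled pi estimator from the examples jar:
hadoop version
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5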
Procedure: 2
Set up environment variables: open the ~/.bashrc file using the command nano ~/.bashrc, add the following lines, and reload it with source ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.4.0/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.0.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
Installing SSH
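The install command is not shown; on Ubuntu/Debian (assumed), ssh and pdsh (used via PDSH_RCMD_TYPE above) can be installed with:
sudo apt-get install ssh pdsh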
Move into the Hadoop configuration directory and list the configuration files:
cd hadoop-3.4.0/etc/hadoop
ls
Configuring hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Configuring core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.proxyuser.dataflair.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.dataflair.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.groups</name>
<value>*</value>
</property>
Configuring hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Configuring mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
Configuring yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Generating Key
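The key-generation commands are not reproduced here; the usual passwordless-SSH setup for a single-node cluster (assumed) is:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa        # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # authorize the key for localhost logins
chmod 0600 ~/.ssh/authorized_keys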
ssh localhost
export PDSH_RCMD_TYPE=ssh
Starting Hadoop
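Before the very first start, the HDFS NameNode is normally formatted once (assumed to be needed here for a fresh install):
hdfs namenode -format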
start-all.sh
Open localhost:9870 in a browser to view the NameNode web UI.
Procedure: 3
Directory Creation: Directories can be created in HDFS using the commands sketched below. The first command creates the directory user under the HDFS root, and the second creates the sub-directory yourusername inside user.
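A minimal sketch of the commands (yourusername is a placeholder for your own user name):
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/yourusername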
File Uploading: Files can be uploaded into the Hadoop framework using the commands sketched below; each of them uploads the local file input.txt into the folder yourusername.
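A sketch of the upload commands, assuming the local file input.txt and the HDFS directory created above:
hdfs dfs -put input.txt /user/yourusername/
hdfs dfs -copyFromLocal input.txt /user/yourusername/
Both forms copy the local file into HDFS; -copyFromLocal is simply restricted to local sources.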
File Downloading: Files can be downloaded from the Hadoop framework using the command sketched below, which copies the file part-00000 from an HDFS output directory to the local filesystem.
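A sketch of the download command; the HDFS path is illustrative, since the original does not name the directory:
hdfs dfs -get /user/yourusername/output/part-00000 .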
File Deletion: Files can be deleted from the Hadoop framework using the commands sketched below.
The first deletes the file input.txt from the directory /user/yourusername/input/ in HDFS.
The second deletes the entire output/ directory, including all its files and subdirectories.
The third deletes files directly without moving them to the trash; by default, Hadoop moves deleted files to the trash folder before final deletion.
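Sketches of the three deletion commands described above (paths follow the examples in the text):
hdfs dfs -rm /user/yourusername/input/input.txt
hdfs dfs -rm -r /user/yourusername/output/
hdfs dfs -rm -skipTrash /user/yourusername/input/input.txt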
Directory Deletion: Empty directories can be deleted from the Hadoop framework using the command sketched below.
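A sketch, where emptydir is an illustrative name for an empty HDFS directory:
hdfs dfs -rmdir /user/yourusername/emptydir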
Program: 4
The following Python program writes sample User records to an Avro file with fastavro and reads them back (the field names and sample records are assumptions, since the original listing is incomplete):
import fastavro

# Avro schema for a User record (field names and types assumed for illustration)
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
}
file_path = 'users_fastavro.avro'
records = [{"name": "Alice", "age": 30},
           {"name": "Bob", "age": 25}]

# Write the sample records to the Avro file
with open(file_path, 'wb') as avro_file:
    fastavro.writer(avro_file, schema, records)

def read_avro_file(file_path):
    # Read the records back from the Avro file and print each one
    with open(file_path, 'rb') as avro_file:
        reader = fastavro.reader(avro_file)
        print("Reading data from Avro file:")
        for record in reader:
            print(record)

read_avro_file(file_path)
Program: 5
Install mrjob
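mrjob is available on PyPI and is normally installed with pip:
pip install mrjob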
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRWordCount(MRJob):
    def steps(self):
        return [MRStep(mapper=self.mapper_get_words,
                       reducer=self.reducer_count_words)]

    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every word in the input line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        # Sum the counts emitted for each word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()
Create an input file (for example input.txt) with the following lines:
Hello World
Hello Python
Python is fun
Hello MapReduce
MapReduce is powerful
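Assuming the job above is saved as word_count.py (the filename is not given in the original), it can be run locally with:
python word_count.py input.txt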
Output:
"fun" 1
"hello" 3
"is" 2
"mapreduce" 2
"powerful" 1
"python" 2
"world" 1
Program: 6
Create a mapper.py:
#!/usr/bin/env python3
import sys

# Read lines from standard input and emit one (word, 1) pair per word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
Create a reducer.py:
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so identical words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the count for the last word
if current_word == word:
    print(f"{current_word}\t{current_count}")
Make both scripts executable:
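The commands are not shown in the original; the usual steps are to mark the scripts executable and, if needed, upload the input file to the HDFS path used by the job below (paths assumed):
chmod +x mapper.py reducer.py
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put input.txt /user/hadoop/input/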
Run the streaming job, using the HADOOP_STREAMING jar path exported earlier:
hadoop jar $HADOOP_STREAMING \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
Output:
Hadoop 2
a 1
framework 1
fun 1
is 2
Program: 7
Create a mapper.py:
#!/usr/bin/env python3
import sys

N = 3  # matrix dimension; assumed 3x3 to match a.txt and b.txt below

for line in sys.stdin:
    line = line.strip()
    matrix, i, j, value = line.split(',')
    if matrix == 'A':
        # For each element of matrix A, output a key-value pair for each column of matrix B
        for k in range(1, N + 1):
            print(f"{i},{k}\tA,{j},{value}")
    else:
        # For each element of matrix B, output a key-value pair for each row of matrix A
        for row in range(1, N + 1):
            print(f"{row},{j}\tB,{i},{value}")
Create a reducer.py:
#!/usr/bin/env python3
import sys
from collections import defaultdict

current_key = None
A_values = defaultdict(float)
B_values = defaultdict(float)

for line in sys.stdin:
    line = line.strip()
    key, value = line.split('\t')
    matrix, j, v = value.split(',')
    v = float(v)
    if key != current_key and current_key is not None:
        # A new key has started: emit the dot product for the previous cell
        result = 0
        for j2 in A_values:
            result += A_values[j2] * B_values[j2]
        # :g prints whole numbers without a trailing .0
        print(f"{current_key}\t{result:g}")
        A_values.clear()
        B_values.clear()
    current_key = key
    if matrix == 'A':
        A_values[j] = v
    else:
        B_values[j] = v

# Emit the dot product for the last key
if current_key:
    result = 0
    for j2 in A_values:
        result += A_values[j2] * B_values[j2]
    print(f"{current_key}\t{result:g}")
Matrix A (a.txt):
A,1,1,1
A,1,2,2
A,1,3,3
A,2,1,4
A,2,2,5
A,2,3,6
A,3,1,7
A,3,2,8
A,3,3,9
Matrix B (b.txt):
B,1,1,1
B,1,2,2
B,1,3,3
B,2,1,4
B,2,2,5
B,2,3,6
B,3,1,7
B,3,2,8
B,3,3,9
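Before running the job, the two matrix files are uploaded to the HDFS input directory used below, and the scripts are made executable (a sketch, paths assumed):
chmod +x mapper.py reducer.py
hdfs dfs -mkdir -p /user/hadoop/matrix_input
hdfs dfs -put a.txt b.txt /user/hadoop/matrix_input/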
Run the streaming job, using the HADOOP_STREAMING jar path exported earlier:
hadoop jar $HADOOP_STREAMING \
-input /user/hadoop/matrix_input/a.txt \
-input /user/hadoop/matrix_input/b.txt \
-output /user/hadoop/matrix_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
Output:
1,1 30
1,2 36
1,3 42
2,1 66
2,2 81
2,3 96
3,1 102
3,2 126
3,3 150
Procedure: 8
Step 1: Install Java (if not installed)
sudo apt-get install openjdk-11-jdk
Step 2: Download Apache Pig using the following command
wget https://dlcdn.apache.org/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 3: To untar the pig-0.16.0.tar.gz file, run the following command
tar xvzf pig-0.16.0.tar.gz
Step 4: Move the Pig directory to /usr/local/: sudo mv pig-0.16.0 /usr/local/pig
Step 5: Set up environment variables: open ~/.bashrc using the following command
nano ~/.bashrc
Add the following lines to set the Pig and Java environment variables
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Save and reload the file: source ~/.bashrc
Step 6: Verify the Pig installation
pig -version
Run a Sample Pig Script
Create the input file:
echo -e "John,28\nAlice,32\nBob,45\nCarol,25" > input.txt
Open the script file using the following command
nano sample.pig
Add the following script to the file sample.pig:
-- Load the file into Pig
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- Keep only people older than 30
filtered_data = FILTER data BY age > 30;
-- Dump the result to the screen
DUMP filtered_data;
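The script can then be executed, for example in local mode (assuming a local-mode run is intended for this sample):
pig -x local sample.pig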
wget https://downloads.apache.org/hbase/2.4.18/hbase-2.4.18-bin.tar.gz
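The surrounding steps are not reproduced here; to match the /usr/local/hbase paths used below, the archive is presumably extracted and moved along these lines:
tar -xvzf hbase-2.4.18-bin.tar.gz
sudo mv hbase-2.4.18 /usr/local/hbase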
4. Open the ~/.bashrc file using the command nano ~/.bashrc and add the following lines
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
source ~/.bashrc
7. Open the file hbase-site.xml (in /usr/local/hbase/conf) using the command nano hbase-site.xml and add the following lines inside the <configuration> tag
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase/data</value>
<description>The directory shared by RegionServers.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
<description>The mode HBase is running in. `false` means standalone mode.
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
<description>Data directory for ZooKeeper</description>
</property>
mkdir -p /usr/local/hbase/data
mkdir -p /usr/local/hbase/zookeeper
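HBase has to be started before the shell can connect; with $HBASE_HOME/bin on the PATH (set above), this is:
start-hbase.sh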
hbase shell
11. Sample HBase commands: the command sketched below creates an employee table with a single column family named details. When finished, HBase can be stopped with stop-hbase.sh.
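The create command itself is not reproduced; based on the table and column family shown in the output that follows, it was presumably:
create 'employee', 'details'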
Output:
0 row(s) in 1.2340 seconds
=> Hbase::Table – employee
Insert Data into Table
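The shell commands are not shown in the original; judging from the output below, they were presumably of this form (the row key '1' is an assumption):
put 'employee', '1', 'details:name', 'Alice'
put 'employee', '1', 'details:age', '30'
put 'employee', '1', 'details:position', 'Manager'
get 'employee', '1'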
Output:
COLUMN CELL
details:age timestamp=1624980701918, value=30
details:name timestamp=1624980701818, value=Alice
details:position timestamp=1624980702003, value=Manager
3 row(s) in 0.0250 seconds
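The next output suggests a second row was inserted and read back, presumably along these lines (the row key '2' is an assumption):
put 'employee', '2', 'details:name', 'Bob'
get 'employee', '2'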
Output:
COLUMN CELL
details:name timestamp=1624980712356, value=Bob
1 row(s) in 0.0100 seconds
Scan the Entire Table
scan 'employee'
Output:
count 'employee'
Output:
describe 'employee'
Test Hadoop:
hadoop version
1. Download the Hive 3.1.3 binary from the Apache Hive downloads:
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
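The extraction and environment steps are not shown; to match the /usr/local/hive paths used below, a typical sequence (assumed) is:
tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin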
2. Configure hive-site.xml:
- Go to the conf directory:
cd /usr/local/hive/conf
- Create or edit hive-site.xml (for example with nano hive-site.xml) and add the following configuration:
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/local/hive/warehouse</value>
<description>Location of default database for the warehouse</description>
</property>
</configuration>