big data
Navigate to the binary for the release you’d like to install. In this guide you’ll install Hadoop 3.3.1, but you can
substitute the version numbers with the release of your choice.
On the next page, right-click and copy the link to the release binary.
On the server, you’ll use wget to fetch it:
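For example, if the release binary link you copied points at the Hadoop 3.3.1 tarball, the fetch would look something like this (your mirror URL will differ):
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz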
In order to make sure that the file you downloaded hasn’t been altered, you’ll do a quick check using SHA-512,
or the Secure Hash Algorithm 512. Return to the releases page, then right-click and copy the link to the
checksum file for the release binary you downloaded:
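The commands themselves are not shown here; a typical sequence (assuming the tarball is in the current directory, and with a checksum URL that may differ from yours) is:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512
$ shasum -a 512 hadoop-3.3.1.tar.gz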
Output
2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b72e4a834f131a99f2814b030fbd043df66 hadoop-3.3.1.tar.gz
Compare this value with the SHA-512 value in the .sha512 file:
~/hadoop-3.3.1.tar.gz.sha512
...
SHA512 (hadoop-3.3.1.tar.gz) = 2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b72e4a834f131a99f2814b030fbd043df66
...
The output of the command you ran against the file you downloaded from the mirror should match the value in
the file you downloaded from apache.org.
Now that you’ve verified that the file wasn’t corrupted or changed, you can extract it:
Use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that
you’re extracting from a file.
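Putting those flags together, the extraction command looks like this (assuming the archive name used above):
$ tar -xzvf hadoop-3.3.1.tar.gz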
Finally, you’ll move the extracted files into /usr/local, the appropriate place for locally installed software:
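A minimal way to do this, assuming the directory extracted above and a target of /usr/local/hadoop, is:
$ sudo mv hadoop-3.3.1 /usr/local/hadoop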
The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default
Java binary. You will use readlink with the -f flag to follow every symlink in every part of the path, recursively.
Then, you’ll use sed to trim bin/java from the output to give you the correct value for JAVA_HOME.
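Put together, the command described here is (assuming the usual /usr/bin/java symlink):
$ readlink -f /usr/bin/java | sed "s:bin/java::"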
You can copy this output to set Hadoop’s Java home to this specific version, which ensures that if the default
Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so
that Hadoop will automatically use whatever Java version is set as the system default.
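For illustration, either style works in etc/hadoop/hadoop-env.sh; the static path below is only an example and will differ on your system:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
or, dynamically:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")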
Note: With respect to Hadoop, the value of JAVA_HOME in hadoop-env.sh overrides any values that are set in
the environment by /etc/profile or in a user’s profile.
You’ve now successfully configured Hadoop to run in stand-alone mode.
Experiment 4 : Installation of Hadoop framework in pseudo distribution mode
Step 1: Download Binary Package :
Download the latest binary from the following site:
http://hadoop.apache.org/releases.html
For reference, save the downloaded file to the following folder:
C:\BigData
Open Git Bash, change directory (cd) to the folder where you saved the binary package, and then unzip it as
follows.
$ cd C:\BigData
MINGW64: C:\BigData
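The unzip step is just extracting the downloaded tarball; assuming the archive is named hadoop-3.1.2.tar.gz, it would be:
$ tar -xvzf hadoop-3.1.2.tar.gz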
Next, go to this GitHub repo and download the bin folder as a zip, as shown below. Extract the zip and copy all
the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin. Replace the existing files as well.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
Right-click -> Properties -> Advanced System settings -> Environment variables.
Click New to create a new environment variable.
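As an alternative to the dialog, the same variables can be created from a Command Prompt with setx (the paths assume the layout used in this guide; open a new prompt afterwards for them to take effect):
setx HADOOP_HOME "C:\BigData\hadoop-3.1.2"
setx HADOOP_BIN "C:\BigData\hadoop-3.1.2\bin"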
If you don’t have Java 1.8 installed, you’ll have to download and install it first. If the JAVA_HOME environment
variable is already set, check whether the path has any spaces in it (e.g. C:\Program Files\Java\… ). Spaces in
the JAVA_HOME path will lead to issues. There is a trick to get around it: replace ‘Program Files’ with
‘Progra~1’ in the variable value. Ensure that the version of Java is 1.8 and JAVA_HOME is pointing to JDK 1.8.
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
If the variables are not initialized yet, it is probably because you are testing them in an old session. Make sure
you have opened a new command prompt to test them.
Once environment variables are set up, we need to configure Hadoop by editing the following configuration
files.
hadoop-env.cmd
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hadoop-env.cmd
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
Next, set the replication factor and the locations where the namenode and datanodes store their data.
Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add the below content within the <configuration>
</configuration> tags.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\BigData\hadoop-3.1.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\BigData\hadoop-3.1.2\data\datanode</value>
</property>
</configuration>
Next, open C:\BigData\hadoop-3.1.2\etc\hadoop\core-site.xml and add the below content, which sets the default
file system, within the <configuration> </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
Then open C:\BigData\hadoop-3.1.2\etc\hadoop\yarn-site.xml and add the below content within the
<configuration> </configuration> tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Finally, let’s configure properties for the MapReduce framework. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the below content within the <configuration>
</configuration> tags. If you don’t see mapred-site.xml, open the mapred-site.xml.template file and rename it to
mapred-site.xml.
<configuration>
<property>
<name>mapreduce.job.user.name</name> <value>%USERNAME%</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.apps.stagingDir</name> <value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
Check if the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present; if it’s not, create one, add localhost to it,
and save it.
To format the NameNode, open a new Windows Command Prompt and run the command below. It might give
you a few warnings; ignore them.
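The format command itself is not shown above; the standard one (assuming the Hadoop bin folder is on the PATH) is:
hdfs namenode -format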
Open another Windows Command Prompt; make sure to run it as an Administrator to avoid permission errors.
Once it is open, execute the start-all.cmd command. Since we have added %HADOOP_HOME%\sbin to the
PATH variable, you can run this command from any folder. If you haven’t done so, go to the
%HADOOP_HOME%\sbin folder and run the command from there.
Four new command prompt windows will open, one for each of the following daemon processes:
namenode
datanode
node manager
resource manager
Don’t close these windows, minimize them. Closing the windows will terminate the daemons. You can run
them in the background if you don’t like to see these windows.
Finally, let’s monitor how the Hadoop daemons are doing. You can also use the Web UI for all kinds of
administrative and monitoring purposes. Open your browser and get started.
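For reference, with the default Hadoop 3.x ports the NameNode UI is served at http://localhost:9870 and the ResourceManager UI at http://localhost:8088.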
1. Download Hadoop from the official website and put it in an appropriate directory (e.g. /usr/local/); later
on we shall refer to this folder as HADOOP_PARENT_DIR. The full path of the Hadoop home
directory shall instead be referred to as HADOOP_HOME.
2. Set the hadoop user as the owner of the hadoop home directory.
3. Configure Hadoop. This is just a matter of editing a few configuration files, which you will find in
the $HADOOP_HOME/etc/hadoop/ directory. The files to be edited are core-site.xml, hdfs-site.xml,
mapred-site.xml, yarn-site.xml, workers, and hadoop-env.sh:
core-site.xml
In this file we can see a property which specifies that the HDFS namenode process is hosted by the node whose
hostname is master, and is running on port 9000.
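The file content is not reproduced here; a minimal core-site.xml matching that description (assuming the fs.defaultFS property name) would be:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>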
hdfs-site.xml
In this file, we specify the replication factor for HDFS (here it is set to 3) and where to store the HDFS data on
each node: in the /home/hadoop/data/hdfs/namenode folder for the namenode, and in
the /home/hadoop/data/hdfs/datanode folder for the datanodes.
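Reconstructed from that description (paths and replication factor as stated), the file would look roughly like:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/hdfs/datanode</value>
</property>
</configuration>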
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
</configuration>
In this file, the amount of RAM to allocate to the MapReduce applications on each node is specified. We use the
same configuration on every node; the exact values you should use depend on the specs of the machine and its
workload besides Hadoop. The configuration above is one adopted on an existing cluster whose datanodes have
4 GB of RAM.
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
For more information on how to choose the memory configuration values when editing the mapred-site.xml
and yarn-site.xml files, I suggest reading this post.
workers
Probably the simplest configuration file: we just list the hostnames of the nodes hosting the datanodes in our
cluster, one hostname per line, as sketched below.
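For example, a three-worker cluster’s workers file might simply read (hypothetical hostnames):
worker1
worker2
worker3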
hadoop-env.sh
We only have to set the JAVA_HOME environment variable at the end of the file.
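For example, assuming an OpenJDK 8 installation at the usual Debian/Ubuntu location (adjust the path to your system), the line would be:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64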
8. The last step is to update the hadoop user’s bash profile, that is, to edit the /home/hadoop/.bashrc file. We
have to add the Hadoop environment variables and PATH entries, for example as sketched below.
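A typical sketch, assuming Hadoop lives in /usr/local/hadoop, is:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin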
9. Finished!
To check that all went right, you can launch YARN and HDFS by typing the following command lines on your
master node.
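The command lines themselves are not shown above; the usual ones (assuming $HADOOP_HOME/sbin is on the PATH) are:
$ start-dfs.sh
$ start-yarn.sh
$ jps
jps should then list the NameNode, SecondaryNameNode and ResourceManager on the master, and the DataNode and NodeManager processes on the worker nodes.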
Step 1
Open the homepage of Apache Pig website. Under the section News, click on the link release page as shown in
the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under
the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the
link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download
the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop,
Java, and other software were installed. (In our tutorial, we have created the Pig directory under the user named
Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar file as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
.bashrc file
In the .bashrc file, set the following variables −
PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations
(the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
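Only PIG_CLASSPATH is spelled out above; a plausible set of .bashrc entries, assuming Pig was placed in /home/Hadoop/Pig and Hadoop’s configuration lives in $HADOOP_HOME/etc/hadoop, is:
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop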
You can list Pig’s runtime properties with pig -h properties; some notable entries from that listing are:
If true, prints count of warnings of each type rather than logging each warning.
Note that this memory is shared across all large bags used by the application.
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
Determines if partial aggregation is done within map phase, before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.
Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
$ pig -version
Java must be installed on your system before installing Hive. Let us verify java installation using the following
command:
$ java -version
If Java is already installed on your system, you will see the version in the response.
If Java is not installed on your system, then follow the steps given below for installing Java.
Installing Java
Step I:
Step II:
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the
jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make Java available to all users, you have to move it to the location /usr/local/. Switch to the root user and
type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Now verify the installation using the command java -version from the terminal as explained above.
$ cd Downloads
$ ls
apache-hive-0.14.0-bin.tar.gz
The following steps are required for installing Hive on your system. Let us assume the Hive archive is
downloaded into the Downloads directory.
The following command is used to verify the download and extract the hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the superuser “su -”. The following commands are used to copy the files from
the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the
template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to configure
Metastore. We use Apache Derby database.
Follow the steps given below to download and install Apache Derby:
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
$ ls
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to the /usr/local/derby directory
We need to copy the files as the superuser “su -”. The following commands are used to copy the files from the
extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ source ~/.bashrc
Create a directory named data in the $DERBY_HOME directory to store Metastore data:
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.
Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the
hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the
following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Before verifying Hive, you need to create the /tmp folder and a separate Hive warehouse folder in HDFS and set
group write permission (chmod g+w) on them. Use the following commands:
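The commands are not reproduced above; a conventional sequence (assuming $HADOOP_HOME is set and HDFS is running) is:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
With those directories in place, the following commands verify the Hive installation: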
$ cd $HIVE_HOME
$ bin/hive
………………….
hive>
OK
Time taken: 2.798 seconds
hive>