BDA Lab
Standalone Mode:
Step 1
Download Java (JDK <latest version> - X64.tar.gz) by visiting www.oracle.com.
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
Step 2
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step 3
To make Java available to all users, move it to the location /usr/local/. Switch to root and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step 4
For setting up the PATH and JAVA_HOME variables, add the following commands
to the ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes to the currently running system.
$ source ~/.bashrc
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following
commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three
supported modes −
● Local/Standalone Mode − After downloading Hadoop in your system, by default, it is
configured in standalone mode and can be run as a single Java process.
● Pseudo-Distributed Mode − It is a distributed simulation on a single machine. Each Hadoop
daemon, such as HDFS, YARN, and MapReduce, will run as a separate Java process. This mode
is useful for development.
● Fully Distributed Mode − This mode is fully distributed with minimum two or more
machines as a cluster. We will come across this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable
for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands
to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Then apply the changes to the current session:
$ source ~/.bashrc
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the
following command −
$ hadoop version
If everything is fine with your setup, then you should see the following result −
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
It means your Hadoop's standalone mode setup is working fine. By default, Hadoop is configured
to run in a non-distributed mode on a single machine.
Example
Let's check a simple example of Hadoop. The Hadoop installation delivers the following example
MapReduce jar file, which provides basic MapReduce functionality and can be used for
calculations such as estimating the value of Pi, counting the words in a given list of files, etc.
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar
Let's have an input directory where we will push a few files; our requirement is to count the
total number of words in those files. To calculate the total number of words, we do not need to
write our own MapReduce program, since the .jar file already contains an implementation of word
count. You can try other examples using the same .jar file; running it with the hadoop jar command
and no program name lists the MapReduce example programs supported by hadoop-mapreduce-examples-2.4.1.jar.
Step 1:
Create an input directory and copy a few text files into it −
$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input
It will give the following files in your input directory −
total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root 101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root 1366 Feb 21 10:14 README.txt
These files have been copied from the Hadoop installation home directory. For your experiment,
you can use different and larger sets of files.
Step 2:
Let's start the Hadoop process to count the total number of words in all the files available in the
input directory, as follows −
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount input output
Step 3:
Once the job finishes, view the result −
$ cat output/*
It will list all the words along with their total counts from all the files available in the
input directory.
"AS 4
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"NOTICE” 1
"Not 1
"Object" 1
"Source” 1
"Work” 1
"You" 1
"Your") 1
"[]" 1
"control" 1
"printed 1
"submitted" 1
(50%) 1
(BIS), 1
(C) 1
(Don't) 1
(ECCN) 1
(INCLUDING 2
(INCLUDING, 2
.............
Hadoop Installation: Pseudo-Distributed Mode
To run Hadoop in pseudo-distributed mode, edit the following configuration files, located in $HADOOP_HOME/etc/hadoop.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Generate an SSH key pair (if one does not already exist) and add the public key to authorized_keys. Just use the ssh-copy-id command, which will take care
of this step automatically and assign appropriate permissions to these files.
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
hadoop@localhost's password:
Now try logging into the machine, with "ssh 'localhost'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
Jps command
After starting the Hadoop daemons, run the jps command to verify that processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are running.
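A minimal sketch of formatting, starting, and checking the daemons (the format step only needs to be run once, before the first start):
$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
$ jps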
Program WordCount.java
package WordCountEx;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount
{
// (The mapper class and the main() driver are given in the sketch that follows this listing.)
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException,
InterruptedException
{
// Sum the counts emitted by the mapper for this word
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
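The mapper class and the driver (main method) of WordCount.java are not reproduced above. A minimal sketch follows; it is meant to sit inside the WordCount class next to ReduceForWordCount, and the class name MapForWordCount, the whitespace tokenization, and the job settings are illustrative assumptions rather than the original code.
// Sketch of the missing members of WordCount (names and settings are assumptions)
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
{
// Split each input line on spaces and emit (word, 1) for every word
String[] words = value.toString().split(" ");
for (String word : words)
{
con.write(new Text(word.trim()), new IntWritable(1));
}
}
}
public static void main(String[] args) throws Exception
{
// Configure the job: mapper, reducer, key/value types, and the
// input/output paths supplied on the command line
Configuration c = new Configuration();
String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
Job job = Job.getInstance(c, "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(MapForWordCount.class);
job.setReducerClass(ReduceForWordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(files[0]));
FileOutputFormat.setOutputPath(job, new Path(files[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}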
Manifest.txt
Main-Class: WordCountEx.WordCount
SalesMapper
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
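The body of SalesMapper is not included above. A minimal sketch follows, written against the old mapred API that matches the imports; the assumption that the input is comma-separated sales records with the country name in the eighth field (index 7) is illustrative.
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
// Split the record on commas and emit (country, 1); the field index is an assumption
String[] singleCountryData = value.toString().split(",");
output.collect(new Text(singleCountryData[7]), one);
}
}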
SalesCountryReducer
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
Text key = t_key;
int frequencyForCountry = 0;
// Sum the counts emitted by the mapper for this country key
while (values.hasNext()) {
IntWritable value = values.next();
frequencyForCountry += value.get();
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
SalesCountryDriver
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
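// The class declaration and job setup that precede the surviving lines below are missing
// from this listing. A minimal sketch follows, assuming the standard old-API (mapred)
// driver shape; the job name and the use of command-line arguments for the paths are
// illustrative assumptions.
public class SalesCountryDriver {
public static void main(String[] args) {
JobClient my_client = new JobClient();
// Create a job configuration and name the job
JobConf job_conf = new JobConf(SalesCountryDriver.class);
job_conf.setJobName("SalePerCountry");
// Key/value types produced by the reducer
job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);
// Mapper and reducer classes defined above
job_conf.setMapperClass(SalesMapper.class);
job_conf.setReducerClass(SalesCountryReducer.class);
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
// Input and output HDFS paths are taken from the program arguments
FileInputFormat.setInputPaths(job_conf, args[0]);
FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));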
my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
} }}
Implementing Hadoop Commands
How to read a file from the HDFS file system, like the cat command
URLCat.java
package HDFSIO;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URL;
import java.io.*;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
public class URLCat {
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws Exception {
InputStream in = null;
try {
in = new URL(args[0]).openStream();
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
Manifest.txt
Main-Class: HDFSIO.URLCat
Create the jar file and run it with the hadoop jar command (an example is shown below). The file with the name filename has to be present in HDFS.
The output of the command will be the content of the file, just like the cat command.
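A hedged example of the build-and-run commands; the jar name urlcat.jar and the file name filename are illustrative, and the Hadoop jars are assumed to be on the classpath (see the export CLASSPATH example later in this manual):
$ javac HDFSIO/URLCat.java
$ jar cfm urlcat.jar Manifest.txt HDFSIO/*.class
$ hadoop jar urlcat.jar hdfs://localhost:9000/filename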
Writing Files to Hadoop
package HDFSWrite;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URL;
import java.net.URI;
import java.io.*;
public class FileWriteToHDFS {
public static void main(String[] args) throws Exception {
// Source file in the local file system and destination path in HDFS
// (assumed here to be passed as command-line arguments)
String localSrc = args[0];
String dst = args[1];
//Input stream for the file in local file system to be written to HDFS
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
// Get the HDFS FileSystem for the destination URI and open an output stream there
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
// Copy the bytes from the local file to HDFS and close both streams
IOUtils.copyBytes(in, out, 4096, true);
}
}
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.4.1.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.4.1.jar:~/hadoopwrite/HDFSWrite/*:$HADOOP_HOME/lib/*"
As the root user, after compiling the Java program, generate the jar file and copy it to the Hadoop location.
Start Hadoop.
Then, after Hadoop starts successfully, run the program with the following commands.
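A sketch of these commands, assuming the CLASSPATH export above is in effect, the jar is named hdfswrite.jar, and sample.txt is the local file being copied into HDFS (all names are illustrative):
$ javac HDFSWrite/FileWriteToHDFS.java
$ jar cf hdfswrite.jar HDFSWrite/*.class
$ hadoop jar hdfswrite.jar HDFSWrite.FileWriteToHDFS sample.txt hdfs://localhost:9000/sample.txt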
Pig Installation
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is
nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop
filesystems) from your workstation.
export PIG_INSTALL=/usr/local/pig
export PATH=$PATH:$PIG_INSTALL/bin
After setting the path, navigate to the Pig installation location and type pig.
The Grunt shell (Pig's interactive shell) will open, where we can run Pig commands.
Create a file Employee.txt with the following contents
001,Mehul,Hyderabad
002,Ankur,Kolkata
003,Shubham,Delhi
and copy it to HDFS.
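For example (the /pig_data directory matches the path used in the LOAD statement below):
$ hdfs dfs -mkdir /pig_data
$ hdfs dfs -put Employee.txt /pig_data/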
Sample_script.pig
Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING PigStorage(',')
as (id:int,name:chararray,city:chararray);
Dump Employee;
Copy the script to HDFS, then run exec hdfs://localhost:9000/Sample_script.pig in the Grunt shell; this executes the Pig script and displays the contents of
Employee.txt.
Pig Programs
Loading and displaying data
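The load-and-dump commands for this section are not spelled out; a minimal sketch, assuming the fruit data used in the next subsection is stored in HDFS as /pig_data/A.txt (path and schema are illustrative):
grunt> A = LOAD 'hdfs://localhost:9000/pig_data/A.txt' USING PigStorage(',') AS (name:chararray, fruit:chararray, qty:int);
grunt> DUMP A;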
FOREACH .. GENERATE
The FOREACH .. GENERATE operator is used to act on every row in a relation. It can be used to
remove fields, or to generate new ones. In this example, we do both.
Suppose the input file for relation A contains:
Joe,cherry,2
Ali,apple,3
Joe,banana,2
Eve,apple,7
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
Here we have created a new relation B with three fields. Its first field is a projection of the first field
($0) of A. B's second field is the third field of A ($2) with one added to it.
B's third field is a constant field (every row in B has the same third field) with the chararray value
Constant.
STREAM
The STREAM operator allows you to transform data in a relation using an external program or
script.
STREAM can use built-in commands with arguments. Here is an example that uses the Unix cut
command to extract the second field of each tuple in A. Note that the command and its arguments
are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)
Joining datasets in MapReduce takes some work on the part of the programmer, whereas Pig has
very good built-in support for join operations, making it much more approachable. Since the large
datasets that are suitable for analysis by Pig (and MapReduce in general) are not normalized, joins
are used more infrequently in Pig than they are in SQL.
JOIN
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
This is a classic inner join, where each match between the two relations corresponds to a row in the
result. (It's actually an equijoin, since the join predicate is equality.) The result's fields are made up
of all the fields of all the input relations.
COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar to JOIN, but
creates a nested set of output tuples. This can be useful if you want to exploit the structure in
subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple is the key,
and the remaining fields are bags of tuples from the relations with a matching key. The first bag
contains the matching tuples from relation A with the same key. Similarly, the second bag contains
the matching tuples from relation B with the same key. If for a particular key a relation has no
matching key, then the bag for that relation is empty. For example, since no one has bought a scarf
(with ID 1), the second bag in the tuple for that row is empty.
GROUP
Although COGROUP groups the data in two or more relations, the GROUP statement groups the
data in a single relation. GROUP supports grouping by more than equality of keys: you can use an
expression or user-defined function as the group key. For example, consider the following relation
A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
The following statement groups A by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5L,{(Ali,apple),(Eve,apple)})
(6L,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given the alias group. The
second field is a bag containing the grouped fields with the same schema as the original relation (in
this case, A).
Sorting Data
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
The following example sorts A by the first field in ascending order and by the second field in
descending order:
grunt> B = ORDER A BY $0 ASC, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
Filtering in Pig
Pig allows you to remove unwanted records based on a condition. The FILTER functionality is
similar to the WHERE clause in SQL. The FILTER operator in Pig is used to remove
unwanted records from a data file. The syntax of the FILTER operator is shown below:
new_relation = FILTER relation BY condition;
Here relation is the data set on which the filter is applied, condition is the filter condition, and
new_relation is the relation created after filtering the rows.
Suppose the input data has the fields year, product, and quantity, and has been loaded into a relation B:
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)
2. Select products whose quantity is greater than 1000 and the year is 2001:
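The FILTER statement for this query is not shown; a sketch, assuming the loaded relation is named B as above and the result relation is C:
grunt> C = FILTER B BY quantity > 1000 AND year == 2001;
grunt> DUMP C;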
(2001,iphone,1500)
(2001,nokia,1500)
Relation D below is the result of another filter (for example, one selecting all records with a year greater than 2000):
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)
You can use all the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=,
>=, <=) in the filter conditions.
Pig Joins
Running Pig
Note: The following should be performed as the ***root*** user (not the hadoop user).
The hadoop user is only for starting Hadoop.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Orders data (for example, orders.txt):
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Inner join:
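The LOAD and JOIN statements are not reproduced here; a sketch, assuming both files have been copied to the HDFS root and that the relation names (and the orders file name) are illustrative:
c = LOAD 'hdfs://localhost:9000/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
o = LOAD 'hdfs://localhost:9000/orders.txt' USING PigStorage(',') as (oid:int,odate:chararray,customer_id:int,amount:int);
inner_join = JOIN c BY id, o BY customer_id;
Dump inner_join;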
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Self Join:
To perform a self join in Pig, the same file is loaded under two different aliases and then joined on the key:
c1 = LOAD 'hdfs://localhost:9000/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
c2 = LOAD 'hdfs://localhost:9000/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
c3 = JOIN c1 BY id, c2 BY id;
Dump c3;
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
3) Outer Join
Unlike inner join, outer join returns all the rows from at least one of the relations. An outer
join operation is carried out in three ways -
a) Left outer join
b) Right outer join
c) Full outer join
a) Left outer join
The left outer Join operation returns all rows from the left table, even if there are no matches
in the right relation.
c= LOAD 'hdfs://localhost:9000/c.txt' using PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
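The orders relation and the join statement themselves are not shown; a sketch, assuming the orders data above is in HDFS as o.txt with an illustrative schema:
o = LOAD 'hdfs://localhost:9000/o.txt' USING PigStorage(',') as
(oid:int,odate:chararray,customer_id:int,amount:int);
outer_left = JOIN c BY id LEFT OUTER, o BY customer_id;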
Dump outer_left;
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
b) Right outer join
The right outer join operation returns all rows from the right relation, even if there are no matches
in the left relation.
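A sketch of the corresponding statement, using the same relations c and o as above:
outer_right = JOIN c BY id RIGHT OUTER, o BY customer_id;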
Dump outer_right;
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
c) Full outer join
The full outer join operation returns rows when there is a match in either of the relations.
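A sketch of the corresponding statement:
outer_full = JOIN c BY id FULL OUTER, o BY customer_id;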
Dump outer_full;
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
PIG UDF
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.9.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.2.jar:~/pigudf/*:$HADOOP_HOME/lib/*:/usr/local/pig/pig-0.16.0-core-h1.jar:/usr/local/pig/pig-0.16.0-core-h2.jar"
package pig;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class Sample_Eval extends EvalFunc<String> {
// Convert the first field of the input tuple to upper case
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
String str = (String) input.get(0);
return str.toUpperCase();
}
}
Compile the program, generate the jar file, and copy the jar to HDFS so that it can be registered from the Grunt shell.
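A sketch of the commands, assuming the source file is pig/Sample_Eval.java and the CLASSPATH export above is in effect; the jar name MyUDF.jar matches the Register statement below:
$ javac pig/Sample_Eval.java
$ jar cf MyUDF.jar pig/Sample_Eval.class
$ hdfs dfs -put MyUDF.jar /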
1,John,2007-01-24,250
2,Ram,2007-05-27,220
3,Jack,2007-05-06,170
3,Jack,2007-04-06,100
4,Jill,2007-04-06,220
5,Zara,2007-06-06,300
5,Zara,2007-02-06,35
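The records above are assumed to be stored in HDFS as /emp_data.txt and loaded into the employee_data relation used in Step 9; the path and field names (other than name) are illustrative:
grunt> employee_data = LOAD 'hdfs://localhost:9000/emp_data.txt' USING PigStorage(',') as (id:int, name:chararray, workdate:chararray, daily_hours:int);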
Register 'hdfs://localhost:9000/MyUDF.jar';
Step 9
Let us now convert the names of the employees into upper case using the UDF Sample_Eval.
Upper_case = FOREACH employee_data GENERATE pig.Sample_Eval(name);
Dump Upper_case;
Retrieving user login credentials from /etc/passwd using Pig Latin
First copy the passwd file from /etc to the working directory. Assume the working directory
is /usr/local/pig.
export PIG_HOME=/usr/local/pig
In the Grunt shell:
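The statements are not reproduced here; a minimal sketch of the usual passwd example, assuming Pig is started in local mode (pig -x local) so it can read the copied file from the working directory:
grunt> A = LOAD 'passwd' USING PigStorage(':');
grunt> B = FOREACH A GENERATE $0 AS id;
grunt> DUMP B;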
Configuring Hive with a MySQL Metastore
Prerequisites are
2. MySQL
Procedure
3. Create a soft link for the MySQL JDBC connector in the Hive lib directory, or copy the connector jar to the lib folder.
In the MySQL shell, create the metastore database and the Hive user, then switch to the database (a sketch is given below):
use metastore;
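A sketch of the MySQL commands; the database name and the hiveuser/hivepassword credentials are taken from the hive-site.xml configuration shown below:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost';
mysql> FLUSH PRIVILEGES;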
Add the following properties to hive-site.xml (in $HIVE_HOME/conf):
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
7. We are all set now. Start the Hive console with the hive command.
HIVE EXPERIMENTS
create table emp(id int,ename string) row format delimited fields terminated by ',';
Create a local text file with the following records:
1,Sandeep
2,Ramesh
3,Pradeep
Load data from the above file into the table emp as shown below:
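A sketch of the LOAD statement, assuming the records above were saved locally as /home/hadoop/emp.txt (the path is illustrative):
hive> LOAD DATA LOCAL INPATH '/home/hadoop/emp.txt' INTO TABLE emp;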
Managed Table:
When we load data into a managed table, Hive moves the data into the Hive warehouse
directory.
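The statements behind the example that follows are not reproduced; a sketch, assuming a managed table with a single string column and a file b.txt already present in HDFS:
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/hadoop/b.txt' INTO TABLE managed_table;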
This moves the file b.txt into Hive's warehouse directory for the managed_table table, which
is /user/hive/warehouse/managed_table.
If we then drop the table with DROP TABLE managed_table;, Hive deletes the table metadata, including its data. The data no longer exists
anywhere. This is what it means for Hive to manage the data.
External Tables – An external table behaves differently: we control the creation and
deletion of the data. The location of the external data is specified at table creation time:
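A sketch of such a definition, with an illustrative column and location:
hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/hadoop/external_table';
hive> LOAD DATA INPATH '/user/hadoop/b.txt' INTO TABLE external_table;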
Now, with the EXTERNAL keyword, Apache Hive knows that it is not managing the data,
so it doesn't move the data to its warehouse directory. It does not even check whether the
external location exists at the time it is defined. This is a very useful feature because it means we
can create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will leave the data
untouched and only delete the metadata.
ii. Security
● Managed Tables – Hive solely controls managed table security. Security needs to be
managed within Hive, probably at the schema level (depending on the organization).
● External Tables – These tables' files are accessible to anyone who has access to the
HDFS file structure, so security needs to be managed at the HDFS file/folder level.
iii. When to use Managed and External tables
Use a Managed table when –
● We want Hive to completely manage the lifecycle of the data and the table.
● The data is temporary.
Use an External table when –
● Data is used outside of Hive. For example, the data files are read and processed by an
existing program that does not lock the files.
● We are not creating a table based on the existing table.
● We need data to remain in the underlying location even after a DROP TABLE. This
may apply if we are pointing multiple schemas at a single data set.
● Hive shouldn't own the data and control settings, directories, etc.; we may have
another program or process that will do these things.
Creating Partitions and buckets
Hive Partitions and buckets
Partitioning – Apache Hive organizes tables into partitions, grouping the same type of data
together based on a column or partition key. Each table in Hive can have one or more
partition keys to identify a particular partition. Using partitions, we can make it faster to run
queries on slices of the data.
Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash
function of a column in the table, giving the data extra structure that may be used for more
efficient queries.
Table creation
hive> create table student(sid int,sname string,smarks int) row format delimited fields
terminated by ',';
1,abc,20
2,xyz,30
3,def,40
Partition Creation
After the partitioned table is created and loaded (a sketch is given below), Hive creates one
subdirectory per partition value under the table's warehouse directory, for example:
sid=20
sid=30
sid=40
sid=__HIVE_DEFAULT_PARTITION__
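A minimal sketch of creating and loading a partitioned table with dynamic partitioning; the table name student_part and the choice of partition column are illustrative and may differ from what produced the listing above:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table student_part(sname string, smarks int) partitioned by (sid int) row format delimited fields terminated by ',';
hive> insert overwrite table student_part partition(sid) select sname, smarks, sid from student;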
Buckets:
Buckets in Hive are used to segregate Hive table data into multiple files or directories; they are
used for efficient querying.
1. The data present in partitions can be divided further into buckets.
2. The division is performed based on a hash of the particular column that we select in the
table.
3. Buckets use a hashing algorithm behind the scenes to read each record and
place it into a bucket.
4. In Hive, we have to enable bucketing by using set hive.enforce.bucketing=true;
Enable bucketing
hive>set hive.enforce.bucketing=true;
Create bucketed table
create table student_bucket(sid int,sname string,smarks int) clustered by (sid) into 3 buckets
row format delimited fields terminated by ',';
Inserting data
insert overwrite table student_bucket select * from student;
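To verify the bucketing, you can list the table's warehouse directory from the Hive shell; with 3 buckets there should be three output files (the warehouse path shown is the default and may differ on your setup):
hive> dfs -ls /user/hive/warehouse/student_bucket;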