Apache Hive Interview Questions
Here is the comprehensive list of the most frequently asked Apache Hive Interview
Questions that have been framed after deep research and discussion with the industry
experts.
1. Define the difference between Hive and HBase?
Hive vs HBase:
HBase: HBase operations run in real-time on its database rather than as MapReduce jobs.
Hive: Hive queries are executed internally as MapReduce jobs.
Hive supports client applications written in the following languages:
Java
PHP
Python
C++
Ruby
By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can
change it by specifying the desired directory in the hive.metastore.warehouse.dir
configuration parameter present in hive-site.xml.
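For example, a minimal hive-site.xml override might look like this (the path below is
an illustrative value, not a required one):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/alice/hive/warehouse</value>
</property>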
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The
reason for choosing an RDBMS is to achieve low latency, since HDFS read/write operations
are time-consuming.
Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the
Hive service is running and connects to a database running in a separate JVM, either on
the same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate
JVM and not in the Hive service JVM. Other processes communicate with the metastore
server using Thrift network APIs. You can run one or more metastore servers in this case
to provide higher availability.
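As a brief illustration, clients locate a remote metastore through the
hive.metastore.uris property in hive-site.xml (the host name below is a placeholder;
9083 is the conventional Thrift metastore port):
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>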
By default, Hive provides an embedded Derby database instance backed by the local disk
for the metastore. This is called the embedded metastore configuration.
8. Scenario:
Suppose I have installed Apache Hive on top of my Hadoop cluster using default
metastore configuration. Then, what will happen if we have multiple clients trying
to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time
for accessing the metastore. Therefore, if multiple clients try to access the metastore at
the same time, they will get an error. One has to use a standalone metastore, i.e. a local
or remote metastore configuration, to allow multiple clients to access Hive concurrently.
Following are the steps to configure a MySQL database as the local metastore in Apache
Hive:
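At a high level: install MySQL, create a database for the metastore, place the MySQL JDBC
connector JAR in Hive's lib directory, and point Hive at the database in hive-site.xml. A
minimal sketch, assuming a database named metastore on localhost and a hypothetical
hiveuser account:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>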
Here is the key difference between an external table and a managed table:
If one drops a managed table, the metadata information along with the table data is
deleted from the Hive warehouse directory.
On the contrary, in the case of an external table, Hive deletes only the metadata
information regarding the table and leaves the table data present in HDFS untouched.
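For example (the table name and location below are illustrative), an external table is
declared with the EXTERNAL keyword and, typically, an explicit location:
CREATE EXTERNAL TABLE page_views (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/page_views';
Dropping page_views removes only its metastore entry; the files under /data/page_views
remain in HDFS.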
Note: I would suggest going through the blog on Hive Tutorial to learn more about
managed tables and external tables in Hive.
Yes, it is possible to change the default location of a managed table. It can be achieved
by using the clause – LOCATION ‘<hdfs_path>’.
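For instance (the table name and path here are placeholders):
CREATE TABLE customer (id INT, name STRING)
LOCATION '/user/alice/custom_location/customer';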
We should use SORT BY instead of ORDER BY when we have to sort huge datasets, because the
SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the
data together using a single reducer. Therefore, ORDER BY against a large number of inputs
takes a long time to execute.
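For example, assuming the transaction table used later in this post, the following query
sorts within each reducer rather than forcing everything through a single reducer:
SELECT cust_id, amount FROM transaction_details SORT BY amount DESC;
Keep in mind that SORT BY guarantees ordering only within each reducer's output, not a
total order across the full result.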
Hive organizes tables into partitions to group similar types of data together based on a
column or partition key. Each table can have one or more partition keys to identify a
particular partition. Physically, a partition is nothing but a sub-directory inside the
table directory.
Partitioning provides granularity in a Hive table and therefore reduces query latency by
scanning only the relevant partitioned data instead of the whole data set.
In dynamic partitioning, the values of the partition columns are known only at runtime,
i.e. they are determined while the data is being loaded into the Hive table.
15. Scenario:
Suppose, I create a table that contains the details of all the transactions done by the
customers in the year 2016: CREATE TABLE transaction_details (cust_id INT, amount
FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' ;
Now, after inserting 50,000 tuples into this table, I want to know the total revenue
generated for each month. But Hive is taking too much time to process this query. How
will you solve this problem, and what steps will you take to do so?
We can solve this query latency problem by partitioning the table according to each
month, so that for each month we scan only the relevant partition instead of the whole
data set. The steps are: first, create a partitioned table (partitioned_transaction) with
month as the partition key; second, enable dynamic partitioning; and third, transfer the
data from the non-partitioned table into the newly created partitioned table.
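A minimal HiveQL sketch of these steps, with the DDL reconstructed from the
transaction_details schema above:
-- 1. Create the partitioned table; month becomes the partition key
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- 2. Enable dynamic partitioning
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- 3. Transfer the data; the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month)
SELECT cust_id, amount, country, month FROM transaction_details;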
16. How can you add a new partition for the month December in the above
partitioned table?
For adding a new partition to the above table partitioned_transaction, we will issue the
command given below:
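A sketch of the command:
-- Add an empty partition for December; an explicit LOCATION '<hdfs_path>' clause
-- may be appended if the partition data already resides elsewhere.
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec');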
Note: I suggest you go through the dedicated blog on Hive Commands, where all the
commands present in Apache Hive are explained with examples.
17. What is the default maximum dynamic partition that can be created by
a mapper/reducer? How can you change it?
By default, the maximum number of partitions that can be created by a mapper or reducer
is set to 100. One can change it by issuing the following command:
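The relevant setting is hive.exec.max.dynamic.partitions.pernode, which caps the number
of partitions created per mapper/reducer node:
SET hive.exec.max.dynamic.partitions.pernode = <value>;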
Note: You can set the total number of dynamic partitions that can be created by one
statement by using: SET hive.exec.max.dynamic.partitions = <value>
18. Scenario:
Things to Remember:
A map-side join requires the data belonging to a unique join key to be present in the
same partition. In cases where the partition key differs from the join key, you can
still perform a map-side join by bucketing the table on the join key.
Bucketing also makes the sampling process more efficient and therefore helps decrease
query time.
21. What will happen if you have not issued the command ‘SET
hive.enforce.bucketing=true;’ before bucketing a table in Apache Hive 0.x or 1.x?
The command ‘SET hive.enforce.bucketing=true;’ ensures that the correct number of
reducers is used with the ‘CLUSTER BY’ clause when bucketing a column. If it is not
issued, the number of files generated in the table directory may not be equal to the
number of buckets. As an alternative, one may set the number of reducers equal to the
number of buckets by using set mapred.reduce.tasks = num_bucket.
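As an illustration (the table and column names are assumptions, reusing the earlier
transaction example), bucketing a table and populating it looks like this:
SET hive.enforce.bucketing = true;
CREATE TABLE bucketed_transaction (cust_id INT, amount FLOAT)
CLUSTERED BY (cust_id) INTO 4 BUCKETS;
-- With enforcement on, Hive launches 4 reducers so that exactly 4 bucket files
-- are written into the table directory.
INSERT OVERWRITE TABLE bucketed_transaction
SELECT cust_id, amount FROM transaction_details;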
One of the Hive query optimization methods is the Hive index. A Hive index is used to
speed up access to a column or set of columns in a Hive database, because with an index
the database system does not need to read all the rows in the table to find the data that
one has selected.
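For example, a compact index could be created as follows (the index and table names reuse
the earlier example and are illustrative; note that built-in indexing was removed in Hive
3.0, so this applies to older releases):
-- Create a compact index on the amount column, then build it
CREATE INDEX idx_amount ON TABLE transaction_details (amount)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx_amount ON transaction_details REBUILD;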
23. Scenario:
Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the
following entries:
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for
sample.csv by issuing the commands given below:
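A sketch using Hive's built-in OpenCSVSerde (the column names are illustrative, since the
file's entries are not reproduced here):
CREATE EXTERNAL TABLE sample
(id STRING, first_name STRING, last_name STRING, email STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/temp';
-- OpenCSVSerde treats every column as STRING; cast values as needed when querying.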
24. Scenario:
Suppose, I have a lot of small CSV files present in /input directory in HDFS and I
want to create a single Hive table corresponding to these files. The data in these files
are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop
performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for
lots of small files without degrading the performance of the system?
One can use the SequenceFile format, which will group these small files together to form
a single SequenceFile. The steps to do so are as follows:
1. Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
2. Load the small CSV files into the temporary table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
3. Create a table that stores its data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
4. Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated containing the data from all of the input
files, and the problem of having lots of small files is eliminated.