
Profiling Apache HIVE Query from Run Time Logs

Givanna Putri Haryono, Ying Zhou
School of Information Technologies, The University of Sydney, NSW 2008
Email: [email protected], [email protected]

Abstract: Apache Hive is a widely used data warehousing and analysis tool. Developers write SQL-like HIVE queries, which are converted into MapReduce programs to run on a cluster. Despite its popularity, there is little research on performance comparison and diagnosis. Part of the reason is that the instrumentation techniques used to monitor execution cannot be applied to the intermediate MapReduce code generated from a Hive query. Because the generated MapReduce code is hidden from developers, run time logs are the only places a developer can get a glimpse of the actual execution. Having an automatic tool to extract information and to generate reports from logs is essential to understanding the query execution behavior.

We designed a tool to build the execution profile of individual Hive queries by extracting information from HIVE and Hadoop logs. The profile consists of detailed information about the MapReduce jobs, tasks and attempts belonging to a query. It is stored as a JSON document in MongoDB and can be retrieved to generate reports in charts or tables. We have run several experiments on AWS with TPC-H data sets and queries to demonstrate that our profiling tool is able to assist developers in comparing HIVE queries written in different formats, running on different data sets and configured with different parameters. It is also able to compare tasks/attempts within the same job to diagnose performance issues.

I. INTRODUCTION

Apache Hive [1] is a popular data analytic extension to the Hadoop MapReduce [2] framework. The SQL-like language makes it a convenient alternative to hand-coded MapReduce programs. Simple HIVE queries are able to achieve performance similar to that of hand-coded MapReduce programs. There could be performance gaps between a HIVE query and a hand-coded MapReduce program for complex analytic workloads. Over the years, HIVE has made many significant technical advancements in storage format, indexing, the SQL-to-MapReduce translator and the execution engine to improve execution performance. Certain in-depth knowledge may be required to fully utilize such advancements. The performance of complex queries can be unpredictable and sometimes hard to understand for most users. There are cases where small changes in the way a query is written have a big impact on performance. There are also cases where performance on a large data set varies a lot from performance on a small data set using exactly the same query. In addition, HIVE also allows fine tuning of performance through parameters such as the number of reducers and the timing of shuffle. The effect of such parameters, most of the time, can only be evaluated by experimenting. Developers often have to resort to meticulously walking through countless log messages, or performing numerous keyword searches, to find the information needed.

Performance and debugging information may be obtained by instrumenting query execution. This approach is intrusive and often incurs overheads. In addition, these options are only available for hand-coded MapReduce programs. HIVE generates and executes MapReduce code on the fly, making it impossible for any instrumentation tool to perform source or binary instrumentation.

In this paper, we present a profiling tool which builds execution profiles of individual HIVE queries with information extracted from log files. The profile is stored in a semi-structured format and can be retrieved to produce various reports for performance comparison, diagnosis and other uses. We discuss preliminaries about HIVE and MapReduce execution in section II; Section III presents the overall profiler design with an emphasis on how data is extracted; Experiments and results are presented in section IV. This is followed by a discussion of related work and the conclusions.

II. PRELIMINARY

In this section, we briefly explain the execution of a HIVE query and the logs generated that are of interest. Hive supports a high level programming language called Hive Query Language (HiveQL). It closely resembles Structured Query Language (SQL). Queries written in HiveQL will be analyzed and converted into one or a few Hadoop MapReduce jobs. These jobs are then submitted to the underlying Hadoop cluster to run.

A typical MapReduce job consists of a map and an optional reduce phase. When a reduce phase is present, a shuffle phase is required to synchronize these two phases.

In the map phase, a number of map tasks run on various nodes to process different partitions of the job input data. Intermediate results produced by all map tasks are sorted by key and stored temporarily in the local file system. They are partitioned and transferred to the nodes running reduce tasks. This is the shuffle phase. Each reduce task is responsible for a subset of the keys produced by the map phase. After a reduce task has received all intermediate data in its key space, it starts to run a developer specified reduce function on each key and value list pair. This is the reduce phase.

A task is the basic execution unit and each execution of a task is called an attempt. If the speculative execution feature is enabled, new attempts may be started for tasks that are detected as stragglers. The maximum number of attempts per task is configurable.



Ideal execution time may be achieved if tasks belonging to the same phase (map or reduce) can all run in parallel. However, such an ideal case rarely happens, because the resources in a Hadoop cluster are supposed to be shared among many jobs and users. Most clusters are configured to allow jobs running at the same time to share the computing resources fairly.

A Hadoop cluster configured with MRv1 views resources as fixed map or reduce slots on each node. Job scheduling and monitoring is managed by a JobTracker daemon running on the master node. TaskTracker daemons running on the slave nodes manage the execution of individual tasks. This is considered ineffective because the JobTracker daemon is responsible for managing the resources for the whole cluster as well as monitoring the execution of each individual job. Recent versions of Hadoop introduce a generic resource management framework, YARN [3]. A Hadoop cluster configured with YARN is also referred to as MRv2. In this setting, YARN is responsible for allocating and managing resources for the whole cluster. Each application running on the cluster has its own ApplicationMaster to monitor execution status and to decide if a new attempt should be started.

The run time logs under the different managing systems contain similar information with slightly different formats. We focus on the logs generated by YARN. The YARN framework allows users to configure a log directory in HDFS. All logs of a particular job will be stored in sub-directories identified by user name and job ID. Each attempt's log is stored in a file identified by the actual node running the attempt. The log representing the overall job is stored in a file identified by the node running the ApplicationMaster.

In summary, we are able to extract execution information from three types of logs: the hive log, the job log and the attempt log. The hive log is generated by HIVE; it contains the execution information of multiple Hive queries. The other two are generated by the YARN framework and contain detailed MapReduce job execution information.
III. PROFILER DESIGN

The key objective of our profiler is to help developers diagnose the execution of a HIVE query by obtaining detailed running statistics. In general, HIVE queries share the cluster resources with other applications, so it is normal to have variable execution times. But more often than not, it is hard to differentiate normal execution variation from variations caused by design or execution inefficiency. In the latter case, there are chances to make improvements if we are able to isolate the exact cause.

Apart from cluster resource restrictions, the execution time of a HIVE query is affected by a number of factors: the number of MapReduce jobs generated, the actual execution of each job and the associated networking and I/O costs. Our profiler starts by extracting relevant information from various logs and storing it in a structured way. Each query is identified by user supplied tags. The tags are used later to retrieve a query's profile, or to retrieve a few queries' profiles for comparison. We provide reporting tools to visualize the data for easy comparison.

A. Profile Data

Logging behavior is a configurable feature in most systems. It is normal for the default logging behaviour to change in a new version. In fact, the content of the Hadoop logs changed slightly between the two versions we used during this study. We decided to store the profile data in a schema-flexible system. MongoDB is chosen for its powerful random query feature and for its flexible schema. The MongoDB schema is shown in Figure 1.

Fig. 1: Schema of MongoDB database

Each box in Figure 1 represents a type of document. They are either stored as separate documents in their own collections or are embedded in other documents.

Individual query profile data is stored in a query document. This document contains basic information such as the actual query expression, total duration, start and end time, and the number of jobs generated. It also contains an array of links pointing to the job documents representing the job profiles. The overall I/O cost of the query is stored in an array of embedded HDFS documents, which summarize the read and write bytes of each job. To facilitate performance comparison, each query document also stores identifying information as supplied by the user. These include tags and data set size. Tags are used to identify individual queries. For instance, a TPC-H query 1 running at 2015-03-27 10:30:30 may be tagged as TPC-H, Query-1, 2015-03-27 10:30:30. The actual content of the tags is purely decided by the user. The schema free feature of MongoDB allows different users to add different identifying fields.

A document representing a job profile contains information about its map, shuffle and reduce phases, stored as embedded documents. Both map and reduce phases may contain multiple tasks; task profiles are stored in separate task documents and referenced by the job profile.
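A minimal sketch of what a query document and its links to job documents could look like when stored with pymongo. The collection and field names here are our own illustration of Figure 1, not the tool's exact schema, and the values are placeholders.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["hive_profiler"]

# Job profiles live in their own collection; the query document keeps
# references to them (illustrative fields only).
job_ids = db.jobs.insert_many([
    {"jobId": "job_1427766547871_9738", "numMaps": 4, "numReduces": 1},
]).inserted_ids

db.queries.insert_one({
    "expression": "SELECT ...",                       # the HiveQL text
    "tags": ["TPC-H", "Query-1", "2015-03-27 10:30:30"],
    "datasetSize": "1gb",
    "startTime": "2015-03-27 10:30:30",
    "endTime": "2015-03-27 10:41:02",                 # placeholder value
    "jobs": job_ids,                                  # links to job documents
    "hdfs": [{"jobId": "job_1427766547871_9738",
              "readBytes": 171955000, "writeBytes": 3712000}],  # placeholders
})

# Profiles are later retrieved by the user-supplied tags for comparison.
for q in db.queries.find({"tags": {"$all": ["TPC-H", "Query-1"]}}):
    print(q["startTime"], len(q["jobs"]))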
B. Query profile

Most of the data used to build a query profile can be obtained from the HIVE log file. The logs are generated systematically using a predefined structure. The file contains the logs of many queries. The very first message logged when a query is submitted normally describes Hive's attempt to identify the metastore version used. We use this as a template to segment the HIVE log into sections, each belonging to an individual query. Within a section, much run time information can be extracted using standard information extraction techniques. We manually analyzed and annotated several hive log sections to derive patterns for extracting information such as the query expression, job IDs, and start and end time.

The query expression appears at the beginning of a particular query's log section. It is sandwiched between a log message containing the text Starting command and a log message outlining the number of jobs to be launched. Query start time is the time when Hive starts preparing the environment required for executing the query. It is obtained by extracting the timestamp of the very first message in the Hive log file. Query end time is the time when a query has been executed successfully. It is obtained by reading the very last log message of that particular query's log section. This log message generally contains an OK message indicating the query's successful execution.

Not all required information can be extracted directly from the log file. For instance, the per job HDFS read and write bytes can be extracted from the summary appearing at the end of a query's log section. The information presented there does not contain the actual job ID but a HIVE generated stage number. The mapping between a job ID and a stage number can be found from three early entries when a job is ready to run. When a job is ready to run, it prints out a log entry like Starting Job = job_1427766547871_9705, Tracking URL = .... This entry indicates the start of a job. The next entry shows the command to kill the job. Both entries contain the job ID (job_1427766547871_9705). A third entry shows the number of mappers and reducers of that job, and also contains a stage number. We identify these entries and extract the job ID and the stage number to establish a mapping. Figure 2 shows an example log with the extracted information highlighted.
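A simplified sketch of this extraction step follows. The marker string and regular expressions below are assumptions about the HIVE log wording (which changes between versions), not the tool's exact patterns.

import re

STARTING_JOB = re.compile(r"Starting Job = (job_\d+_\d+)")
# Assumed wording of the per-stage entry, e.g.
# "Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1"
STAGE_INFO = re.compile(r"(Stage-\d+): number of mappers: \d+; number of reducers: \d+")

def split_into_query_sections(hive_log_lines, marker="metastore"):
    # Start a new section whenever the metastore-version message that
    # opens every query's log section is seen.
    sections, current = [], []
    for line in hive_log_lines:
        if marker in line and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections

def job_stage_mapping(section_lines):
    # Pair each job ID with the HIVE stage number reported shortly after it.
    mapping, pending_job = {}, None
    for line in section_lines:
        m = STARTING_JOB.search(line)
        if m:
            pending_job = m.group(1)
            continue
        s = STAGE_INFO.search(line)
        if s and pending_job:
            mapping[pending_job] = s.group(1)
            pending_job = None
    return mapping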

C. Job and Task Profile


Most data about jobs and tasks are recorded in the job and task logs. Both jobs and tasks undergo several state transitions, all of which are captured in the logs. Another closely related concept is the task attempt. A task is a conceptual term while a task attempt represents a running task. The life cycle of a task attempt also goes through many states. Figure 3 shows the state transitions of a successful job. It is clear that the life cycles are interconnected. In addition to the SUCCEEDED state, an attempt may end up in the FAILED or KILLED state. Jobs and tasks can finish in the FAILED or KILLED state as well.
Fig. 2: Hive Log Example
When a state transition happens, a message is recorded in the log file. A typical message looks like attempt_7871_9762_m_000001_0 TaskAttempt Transitioned from NEW to UNASSIGNED. We rely on such log lines to rebuild the life cycles of jobs, tasks and attempts.

The total execution time of a MapReduce job reported after the job completes is the time between a job's start state (NEW) and a job's finish state (SUCCEEDED, KILLED or FAILED). As shown in figure 3, a job goes through a few preparation states before it is put into the RUNNING state, where it waits in a queue for resources. A job's actual running time can be further divided into a map phase, a shuffle phase and a reduce phase. The phases may overlap with each other. The YARN Resource Manager's web UI displays the execution time of each phase. There is no formal definition of a phase or a standard way to compute the duration of a phase. It is quite common to see a phase with zero or negative duration in the web UI. We propose our own way to compute the duration of each phase based on task states.
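The sketch below shows one way to turn such transition messages into phase boundaries, using the rule given in the next paragraph for the map phase (first map task entering RUNNING to last map task reaching SUCCEEDED). The timestamp format and regular expression are simplified assumptions about the job log layout.

import re
from datetime import datetime

TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*?"
    r"(?P<id>(?:task|attempt)_\S+).*?Transitioned from (?P<src>\w+) to (?P<dst>\w+)")

def parse_transitions(job_log_lines):
    # Yield (timestamp, entity id, from-state, to-state) for every
    # state-transition message found in the job log.
    for line in job_log_lines:
        m = TRANSITION.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group("id"), m.group("src"), m.group("dst")

def map_phase_bounds(transitions):
    # Map phase: first map task entering RUNNING, last map task SUCCEEDED.
    transitions = list(transitions)
    starts = [ts for ts, tid, _, dst in transitions
              if tid.startswith("task_") and "_m_" in tid and dst == "RUNNING"]
    ends = [ts for ts, tid, _, dst in transitions
            if tid.startswith("task_") and "_m_" in tid and dst == "SUCCEEDED"]
    return (min(starts), max(ends)) if starts and ends else (None, None)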
A map phase consists of many map tasks running on a number of nodes. When a job is in the RUNNING state, all of its tasks, including both map and reduce tasks, will be put in the SCHEDULED state. After that, the corresponding task attempts will be put in the UNASSIGNED state, waiting for resource assignment. If resource containers are available, some task attempts will be assigned to containers and launched by the node manager. After an attempt is launched, its state changes to RUNNING and the corresponding task enters the RUNNING state as well. We consider a map phase to start when its first map task changes to the RUNNING state. It finishes when its last map task is in the SUCCEEDED state.

Fig. 3: Job, Task and Task Attempt State Transition in a Successful Job

Shuffle is the phase where intermediate map output is sorted locally, then transferred to and merged on its destination reduce task. The shuffle phase begins with the ApplicationMaster generating and sending a unique service ID and security tokens to the node manager. This usually happens at the same time when map task attempts are assigned resources. The service ID and tokens are recorded in the log file and can be used to extract the shuffle phase start time. The actual transfer starts when at least one reduce task attempt is running and starts to request map output. This event is also recorded in the job log. The shuffle phase ends when the output of all map tasks has been transferred to and merged on the reducer nodes. Because the final merge happens on the reducer side, this event is recorded in the corresponding attempt log, instead of the job log. The job log file contains information on the node(s) running the reduce attempt(s). From that we are able to locate the actual reduce attempt log file.

Upon inspection of the reduce attempt log file, we noticed that there are two contiguous log entries indicating the end of merging and the start of the actual reduce task. Below is an example of the entries.

2015-05-26 03:05:11,904 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 95 bytes
2015-05-26 03:05:11,957 INFO [main] ExecReducer: maximum memory = 1528823808

The first entry indicates that it is doing the last merge. The second entry indicates that the task attempt is setting up the environment for executing the reduce function on the merged result. We use the timestamp of the second entry as the end of the merge stage. A job may contain several reduce tasks. The shuffle phase completes when the last reduce task finishes merging the intermediate results.

After the intermediate results are sorted and merged, they are ready to be passed to a reduce function. This is called the reduce phase. It starts when the intermediate results are ready on the first reduce task and ends when the last reduce task has its state transitioned from RUNNING to SUCCEEDED. The data can be extracted from the job log and from the individual task attempt log(s).
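A minimal sketch of that rule: find the last merge-pass message in a reduce attempt log and take the timestamp of the entry that follows it. The line layout is an assumption based on the example above.

import re
from datetime import datetime

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def merge_end_time(attempt_log_text):
    # Return the timestamp of the entry immediately after the
    # "Down to the last merge-pass" message, treated here as the moment
    # this reduce attempt finished merging (see the two entries above).
    lines = attempt_log_text.splitlines()
    for i, line in enumerate(lines):
        if "Down to the last merge-pass" in line and i + 1 < len(lines):
            m = TIMESTAMP.match(lines[i + 1])
            if m:
                return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return None

# The shuffle phase of a job ends at the latest merge_end_time()
# over all of its reduce attempts.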
D. Reporting tool

Our reporting tool supports a range of chart types and tables to visualize the performance data. We use bar charts and line charts for overall comparisons, such as execution times, number of jobs, number of tasks and so on. We use a customized Gantt chart to show the timeline of query execution; this includes the start and end times of the map, shuffle and reduce phases per job, as well as the duration of each phase. Individual tasks run concurrently on cluster machines; we show details of individual tasks, such as duration, memory, I/O, location and so on, using tables and boxed formats with interactive features to reveal additional details. We give examples of the various charts and tables in the experiment section.
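The paper does not give the implementation of these charts; below is a rough, non-interactive matplotlib approximation of the Gantt-style timeline, with made-up phase intervals standing in for values read from the stored profiles.

import matplotlib.pyplot as plt

# Hypothetical (map_start, map_end, shuffle_start, shuffle_end,
# reduce_start, reduce_end) offsets in seconds for two jobs of one query.
jobs = {
    "job_1427766547871_9738": (0, 35, 20, 50, 45, 70),
    "job_1427766547871_9739": (72, 95, 85, 110, 105, 130),
}

fig, ax = plt.subplots()
for row, (name, (ms, me, ss, se, rs, re_)) in enumerate(jobs.items()):
    ax.broken_barh([(ms, me - ms)], (row - 0.35, 0.2), facecolors="tab:blue")    # map
    ax.broken_barh([(ss, se - ss)], (row - 0.10, 0.2), facecolors="tab:orange")  # shuffle
    ax.broken_barh([(rs, re_ - rs)], (row + 0.15, 0.2), facecolors="tab:green")  # reduce
ax.set_yticks(range(len(jobs)))
ax.set_yticklabels(jobs.keys())
ax.set_xlabel("seconds since query start")
plt.show()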
IV. EXPERIMENT

We have designed and run several experiments simulating a number of real world use cases to demonstrate that our profiler can be used in performance comparison, parameter selection, performance debugging and more.

A. Experiment Environments

We conducted our experiments on a small cluster of 14 nodes running the Linux Ubuntu 12.04.5 LTS operating system. 12 nodes are used to run data nodes and node managers; six of them are m3.medium nodes and another six are m3.large nodes. The other two nodes host the HDFS name node and the YARN resource manager respectively. The YARN configuration allocates 1 GB of memory to each map task and 2 GB of memory to each reduce task regardless of whether the task is running on a medium or large node.

The cluster is built using Cloudera's CDH 5.3 package. In particular, we installed HADOOP version 2.5.0 and HIVE version 0.13.1-cdh5.3.2. HIVE adopts the local Metastore deployment mode with a MySQL server as the metadata storage. A single MySQL server is installed on one of the data nodes. All other data nodes run a HiveMetaStore service communicating with the MySQL server to get table information.

B. Experiment Data Set and Queries
We used data and queries from the TPC-H [4] benchmark in our experiments. TPC-H consists of 8 tables and 15 queries, some with multiple versions. It is used as a standard benchmark for Decision Support Systems [5]. It has been used to test various optimizations in HIVE [6], [7]. In our experiments, we used all tables and a subset of the queries. We provide our own variants for a few queries. We used scale factors of 1, 5, 10, 15, 20, 25 and 30 for data generation, which produced overall data sizes of 1G to 30G respectively.
C. Query Variants and Performance Comparison
Hive supports many SQL like clauses such as Group By, stage in tabular format. Those numbers are good indicators job
SORT BY, JOIN, HAVING and so on. Each clause represents performance. For jobs doing similar kind of processing, less
a processing stage. It is likely that complex analytic workloads I/O is preferred. Table I is an example of the I/O cost (we
with many of those processing stages will have a few query shorten the job id to save some table space). It is clear that
variants. Query variants may end up with similar performance some of the performance variation are not caused by cluster
if the query planner is able to pick up an efficient plan resources variation, but by the jobs generated from the Hive
regardless of the actual expression. query. We can also tell that all three queries generate four
jobs. The last three jobs are doing similar processing in all
HIVE query planner is able to combine and remove unnec- three queries because they read and write the same number
essary map reduce jobs by examining the relationship between of bytes. The first job of the original query is different to the
data used in various stages[6]. A complex Hive query may first job of the two variants. It writes less bytes after the job
require only a couple of map reduce jobs to produce the results. finishes. The only difference between the two variants are that
However, because the data Hive used is always distributed one uses implicit join while the other uses explicit join. We
across the cluster and traditional statistics used by RDBMS can safely conclude that there is no performance difference
for query planning and optimization are not always available, with respect to implicit or explicit join. Both variants filter
most of the time it is a developers job to prepare and submit the order date before grouping while the original query filter
the right query for desirable performance. Writing a few it after grouping using a having clause. It indicates that early
query variants and running on a small data set to compare the filtering causes the first job to write out more data, resulting
performance is a typical practice. in longer execution time.
To simulate this use case, we provides two variants of TPC-
H query 3. Both are validated to produce the same results. This Tags Jobs HDFS read HDFS write
query retrieves the top 10 orders that belong to BUILDING job 9738 163.99MB 3.54MB
tpch,query3,
segment, placed before 1995-03-15, but are not yet shipped job 9739 724.68MB 396682B
version1, original,
as of 1995-03-15. The result is sorted based on the amount job 9740 398472B 396318B
cluster, 1gb
of revenue received by each order. The original query uses job 9742 396705B 435574B
implicit join. It performs filtering on the customer key, order job 9743 163.99MB 6.11MB
tpch,query3,
key, marketing segment, and ship date before grouping, but job 9744 724.68MB 396682B
version2, cluster,
perform filtering on the order date after grouping. One of job 9745 398472B 396318B
1gb
our variants move all filtering before grouping; the other uses job 9746 396705B 435574B
explicit join. job 9747 163.99MB 6.11MB
tpch,query3,
job 9748 724.68MB 396682B
We run these three queries on 1G data set and built profiles version3, cluster,
job 9749 398472B 396318B
for each. For general performance comparison, our profiler will 1gb
job 9750 396705B 435574B
by default plot the overall execution time as well as the average
map/reduce/shuffle phase time as a bar chart. Figure 4 shows TABLE I: IO Cost of Each Job in the Three Query Variants
an example of the overall execution comparison. Please note
that the phase duration of map/reduce/shuffle as show in the
bar chart are the average duration of all jobs generated by Some queries may perform well in small data set but rather
that query. All three versions generate 4 jobs, that is why the poorly in large data set. Measuring query scalability is also
average map/reduce/shuffle duration is much shorter than the important when comparing different query variants. A query
overall execution time. Because queries may generate different can be viewed as scalable if the execution, number of tasks,
number of jobs, we find plotting the average makes more sense. total memories used and amount of I/O generated all grow as
If preferred, a user can choose to show the total duration of either O(N ) or O(log N ), where N is the size of the dataset.
each phase instead of an average. Our profiler is able to produce a serie of bar charts and line
charts to plot individual metrics against data set size. Figure
A user can choose to display various details after viewing 5 shows the scalability metrics of TPC-H query 6 with data
the overall picture. It is possible to show the number of jobs set size grows from 1G to 30G. It is clear that most metrics
as well as each jobs execution time. We find it particularly has either a linear or O(log N ) growth pattern, indicating good
useful to check the HDFS READ and WRITE bytes of each scalability of this particular query.

D. Configuration Parameter Comparison
Both HIVE and the MapReduce framework allow users to set configuration parameters to fine tune performance. The parameters may be at the cluster level or the job level. More often than not, users need to experiment with different settings to determine the best one. That involves setting a parameter, running some queries/jobs and comparing the performance. We designed two use cases to illustrate how our profiler can help.

Shuffling is a synchronization stage where map tasks distribute their results to reduce tasks. A reduce task has to wait until it receives all data from the map tasks before it calls the reduce function. Shuffling could become a bottleneck because it involves m map tasks exchanging data with r reducers. By default, shuffling starts when the map tasks start to emit data. That means reduce tasks need to be scheduled early to request the data through RPC calls. The benefit of early shuffling is that the network traffic is evenly distributed across some time period. The downside is that a reduce task will need to hold its resources longer, for the whole period of shuffling.

It is possible to configure the start time of shuffle by setting a shuffle threshold. This threshold is a percentage representing the fraction of map tasks that must finish before shuffle can start. To study the effect of this parameter, we ran query 6 on the 1G data set several times, each with a different shuffle threshold. The thresholds we used are 5%, 7%, 15%, and 30%. We plot the results on timeline charts to highlight the differences, as shown in figure 6. The overall job execution time can be viewed by hovering the mouse over the plotted area (figure 6a). We can see from the figure that the query duration will be longer if shuffle starts at the 30% threshold, but not at the 15% threshold, indicating an optimal setting of 15% to balance the overall cluster resources and the individual query performance.
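The paper does not name the setting; in Hadoop 2.x the per-job threshold is normally exposed as mapreduce.job.reduce.slowstart.completedmaps, which we assume is the parameter being varied here. A minimal sketch of repeating the run under each threshold:

import subprocess

QUERY_FILE = "tpch_query6.sql"   # hypothetical file holding the HiveQL text

for threshold in (0.05, 0.07, 0.15, 0.30):
    # Reduce tasks may start fetching map output once this fraction of
    # the map tasks has completed (0.05 corresponds to the 5% setting).
    subprocess.run(
        ["hive",
         "--hiveconf",
         "mapreduce.job.reduce.slowstart.completedmaps={}".format(threshold),
         "-f", QUERY_FILE],
        check=True)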
Apart from per job settings, global settings are even harder to decide. For instance, the current YARN framework allocates resources mainly by node memory. Determining the optimal allocation is very important. Under-provisioning may see lots of tasks get killed due to insufficient memory. Over-provisioning would result in an under-utilized cluster. Most of the online guides suggest that users allocate memory based on hardware capacity. Moreover, it is recommended that reduce tasks be allocated more memory than map tasks. This is based on two general assumptions: there are usually more map tasks than reduce tasks, and reduce tasks do aggregating types of work and may require more memory to hold the incoming value lists. This might not be true if the map tasks are also doing lots of filtering, which is the case with most TPC-H queries.

We argue that many stable clusters run similar queries/jobs daily. After some period of time, users should be able to get an idea of the typical memory usage patterns of the jobs. Our profiler can help by summarizing the task memory consumption in an easy to understand way. Figure 7 shows the memory consumption summary of TPC-H query 6 running on the 10G data set. TPC-H query 6 is converted to one map reduce job. In this graph, each box represents a task, while the number represents memory usage in MB. We also highlight the tasks where the memory consumption is less than 50% of the maximum memory allocated. Moving the mouse over a particular box will show more details about that particular task.

For instance, the details of the reduce task in Figure 7 show that the maximum memory is 1458M, the task only uses 4.57% of the memory, it runs on node ip-10-79-156-149.ec2.internal, and the task attempt ID is attempt_1427766547871_12271_r_000000_0. It is clear that even though there are more map tasks than reduce tasks, the memory consumption of the reduce task is much smaller than that of the map tasks. In fact, we found similar patterns in most of the TPC-H queries we ran. We assume this is because most queries filter a lot of data at the early (map) stage by applying the given query conditions. If most of the queries have a similar pattern, it is more resource efficient to allocate similar amounts of memory to both map and reduce tasks.

Fig. 6: Timeline for TPC-H query 6 version 1 with various shuffle start settings: (a) 5% threshold, (b) 7% threshold, (c) 15% threshold, (d) 30% threshold

Fig. 7: Memory consumption of TPC-H query 6 running on 10G data
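A sketch of how the stored task profiles could be scanned for the under-used tasks that the box graph highlights; the collection and field names are our own illustration rather than the tool's schema.

from pymongo import MongoClient

db = MongoClient()["hive_profiler"]

ALLOCATED_MB = {"MAP": 1024, "REDUCE": 2048}   # 1 GB map / 2 GB reduce, as configured

# Flag tasks of this query that used less than half of their allocation.
for task in db.tasks.find({"queryTags": {"$all": ["tpch", "query6", "10gb"]}}):
    used_mb = task["maxMemoryMB"]
    if used_mb < 0.5 * ALLOCATED_MB[task["type"]]:
        print("under-used:", task["taskId"], used_mb, "MB")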
E. Debugging under-performing queries

We reuse the box graph of figure 7 to highlight task execution time and related details. This can help developers identify slow tasks that may have dragged down the performance of the overall query. Shuffling is the synchronization stage between the map and reduce phases; a slow map task would delay the shuffling phase and force all reduce tasks to wait for the intermediate results from that map task before they can start the actual reducing work. For a similar reason, a slow reduce task may delay the return of the job results and hence delay the overall execution. A slow task may be caused by a node hardware problem or by uneven data distribution among tasks. For instance, if the data is skewed in a way that some keys have a lot more data than others, the reduce tasks assigned to handle the popular keys may have to process a lot more data than the reduce tasks assigned the less popular keys. Our profiler displays all such information to help developers diagnose or trace the problem.

Figure 8 shows the task durations of all tasks in TPC-H query 3 running on 1G data. It shows the average duration of tasks belonging to the same phase and highlights the ones that are 50% slower than the average duration. The interactive feature allows a developer to move the mouse over a particular box to get details of that task, such as its ID, the node running it, and the number of rows processed by that particular task.

Fig. 8: Task Duration of TPC-H query 3 running on 1G data
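The highlighting rule described above translates directly into a small helper; the field names of the task records are illustrative.

from collections import defaultdict

def flag_stragglers(tasks):
    # tasks: iterable of dicts with 'phase' ("map" or "reduce"), 'taskId'
    # and 'duration' in seconds. Returns the tasks that are more than 50%
    # slower than the average of their own phase, as highlighted in Figure 8.
    by_phase = defaultdict(list)
    for t in tasks:
        by_phase[t["phase"]].append(t)
    slow = []
    for phase_tasks in by_phase.values():
        avg = sum(t["duration"] for t in phase_tasks) / len(phase_tasks)
        slow.extend(t for t in phase_tasks if t["duration"] > 1.5 * avg)
    return slow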
The average map task duration can be used as a guide in determining the optimal input split size. Adjusting the split size will increase or decrease the average map task duration. The average reduce task duration can be used as a guide in determining the number of reduce tasks required. It is possible to shorten the overall execution time by having more reduce tasks.

The task details can be used by developers to either check the node's health state or to revise the current data partitioning strategy.

V. RELATED WORK

Profiling tools (profilers) serve an important role in measuring and evaluating various aspects of software. They provide information ranging from memory usage to execution time and the frequency of function calls, often used to aid software developers in performance measurement [8] and bug detection. Profiling data can be obtained either by instrumenting the program (instrumentation) or by log analysis. Some systems utilize both.

Instrumentation modifies a program's source code or executable files to collect the required run time information. It can be performed by manually injecting extra lines of code into the program's source, by compilers, or by automated tools. Instrumentation can be quite accurate in terms of the data it obtains but may introduce overhead to the running system.

DMTracker [9] is a profiling tool aimed at finding bugs in large-scale parallel programs. It uses binary instrumentation to collect data movement information. Herodotou and Babu [10] use a dynamic Java tracing tool, BTrace [11], to gather data from running Hadoop programs to build job profiles. lprof [12] is another profiling tool for distributed systems in general. lprof performs a static code analysis on the program's binary code to identify the structure of the log files. It then uses the learned patterns to extract request level messages from the logs. Instrumentation based profiling tools have been quite popular in collecting request traces. Examples include Magpie [13], X-Trace [14] and so on.

Profilers can be built at various levels. Profilers designed for general distributed systems usually focus on lower level metrics without trying to understand the application. These include data movement [9], event traces [13] or request traces ([15], [12]). Such profilers are used mainly for bug detection, but can be used for performance diagnosis as well [15]. Specialized profilers aim to understand application level data and build profiles that closely represent the application's execution units.

Herodotou and Babu [10] collect data such as the number of merge rounds, HDFS reads, HDFS writes and so on to understand Hadoop MapReduce job behavior. Historical data are then used to build profiles of a particular type of workload to run what-if analysis as well as a cost based optimizer. Our profiler is a specialized one, extracting data from the logs produced by running queries. Different to [10], we focus on profiling individual queries to understand performance variation and to diagnose possible problems.

VI. CONCLUSIONS

We presented a HIVE profiling tool based on log analysis. This profiler is able to extract information from various log files to build profiles of individual queries. It also provides visualization tools to help developers show details of a particular query, job or task and to compare profiles of multiple queries against each other on various aspects. Our experiments show that it can be used in performance analysis, diagnosis and parameter selection. Though not presented in the experiments, our profiler can be a useful tool to test the impact of new software features. It can effectively replace the hand drawn charts and tables in reports on new features and new improvements.

ACKNOWLEDGMENT

This work is supported by an Amazon AWS Education Grant.

REFERENCES

[1] Apache Hive. [Online]. Available: http://hive.apache.org/
[2] Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
[3] Apache Hadoop NextGen MapReduce (YARN). [Online]. Available: http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
[4] TPC-H benchmark. [Online]. Available: http://www.tpc.org/tpch/
[5] M. Poess and C. Floyd, "New TPC benchmarks for decision support and web commerce," SIGMOD Rec., vol. 29, no. 4, pp. 64-71, Dec. 2000. [Online]. Available: http://doi.acm.org/10.1145/369275.369291
[6] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang, "YSmart: Yet another SQL-to-MapReduce translator," in Distributed Computing Systems (ICDCS), 2011 31st International Conference on. IEEE, 2011, pp. 25-36.
[7] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson, O. O'Malley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang, "Major technical advancements in Apache Hive," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 1235-1246.
[8] A. Srivastava and A. Eustace, "ATOM: A system for building customized program analysis tools." ACM, 1994, vol. 29, no. 6.
[9] Q. Gao, F. Qin, and D. K. Panda, "DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements," in Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. ACM, 2007, p. 15.
[10] H. Herodotou and S. Babu, "Profiling, what-if analysis, and cost-based optimization of MapReduce programs," Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111-1122, 2011.
[11] BTrace: A dynamic instrumentation tool for Java. [Online]. Available: https://kenai.com/projects/btrace
[12] X. Zhao, Y. Zhang, D. Lion, M. Faizan, Y. Luo, D. Yuan, and M. Stumm, "lprof: A nonintrusive request flow profiler for distributed systems," in Proceedings of the 11th Symposium on Operating Systems Design and Implementation, 2014.
[13] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan, "Magpie: Online modelling and performance-aware systems," in HotOS, 2003, pp. 85-90.
[14] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, "X-Trace: A pervasive network tracing framework," in NSDI, 2007.
[15] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger, "Diagnosing performance changes by comparing request flows," in NSDI, 2011.

