International Journal of Distributed and Parallel Systems (IJDPS) Vol.7, No.5, September 2016
DOI: 10.5121/ijdps.2016.7501
ABSTRACT
This paper studies the performance and energy consumption of several multi-core, multi-CPU and many-core hardware platforms and software stacks for parallel programming. It uses the Multimedia Multiscale Parser (MMP), a computationally demanding image encoder application, which was ported to several hardware and software parallel environments, as a benchmark. Hardware-wise, the study assesses NVIDIA's Jetson TK1 development board, the Raspberry Pi 2, and a dual Intel Xeon E5-2620/v2 server, as well as NVIDIA's discrete GPUs GTX 680, GTX Titan Black Edition and GTX 750 Ti. The assessed parallel programming paradigms are OpenMP, Pthreads and CUDA, plus a single-threaded sequential version, all running in a Linux environment. While the CUDA-based implementation delivered the fastest execution, the Jetson TK1 proved to be the most energy-efficient platform, regardless of the parallel software stack used. Although it has the lowest power demand, the Raspberry Pi 2's energy efficiency is hindered by its lengthy execution times, effectively consuming more energy than the Jetson TK1. Surprisingly, OpenMP delivered twice the performance of the Pthreads-based implementation, attesting to the maturity of the tools and libraries supporting OpenMP.
KEYWORDS
CUDA, OpenMP, Pthreads, multi-core, many-core, high performance computing, energy consumption
1. INTRODUCTION
Multi- and many-core systems have changed high performance computing in the last decade. Indeed, multi-core CPU systems have brought parallel computing capabilities to every desktop, requiring developers to adapt their applications to multi-core CPUs whenever high performance is an issue. In fact, multi-core CPUs have become ubiquitous, existing not only in traditional laptop, desktop and server computers, but also in smartphones, tablets and embedded devices. With the advent of GPUs and software stacks for parallel programming such as CUDA[1] and OpenCL[2], a new trend has started, making thousands of cores available to developers[3]. To properly take advantage of many-core systems, applications need to exhibit a certain level of parallelism, often requiring changes to their inner organization and algorithms[4]. Nonetheless, a low- to middle-range mainstream GPU like the NVIDIA GTX 750 Ti delivers a peak of roughly 1.4 TFLOPS of single-precision FP computing power for a price tag below 200 US dollars. More recently, so-called System-on-a-Chip (SoC) devices like the NVIDIA Jetson TK1 and the Raspberry Pi have emerged. Both are quite dissimilar, with the Raspberry Pi targeting pedagogical and low cost markets, and the Jetson
TK1 delivering high performance computing to embedded systems at affordable prices. More importantly, both systems provide for energy-efficient computing, an important topic since the dominant cost of ownership for computing is energy, accounting not only for the energy directly consumed by the devices, but also for the energy used for cooling purposes. The quest for performance and computing efficiency does not rest on hardware alone. In particular, the recent wide adoption of many- and multi-core platforms by developers has been facilitated by the consolidation of software platforms. These platforms have taken away some of the burden of parallel programming, helping developers to be more productive and efficient. Examples include Pthreads[5] for multi-core CPUs, OpenMP[6] for multi-core CPUs and many-core GPUs (OpenMP version 4 or higher is required for targeting GPUs), CUDA[1] and OpenACC[7] for NVIDIA GPU devices, and OpenCL for CPUs, GPUs and other accelerators[2].
In this paper, we resort to a compute-intensive image encoder/decoder named Multimedia Multiscale Parser (MMP) to evaluate the performance of several software platforms over distinct hardware devices. Specifically, we assess the sequential, Pthreads and OpenMP versions of MMP on the CPU-based hardware platforms, and CUDA on the GPU-based hardware. The assessment comprises computing performance and energy consumption over several heterogeneous hardware platforms. MMP is a representative signal processing algorithm for images. It uses a pattern-matching-based compression algorithm, performs regular and irregular memory accesses and dictionary searches, uses loops and conditional statements, and allocates a large number of buffers. For all these reasons, MMP addresses the major aspects that developers face when programming applications for these architectures. These challenges are common to other signal processing applications, which can therefore benefit from the considerations of our study. The CPU-based hardware includes a server with two Intel Xeon E5-2620/v2 CPUs, an NVIDIA Jetson TK1 development board[8] and a Raspberry Pi 2[9]. Regarding GPUs, the study comprises the following NVIDIA devices: one GTX 680, one GTX Titan Black Edition, one GTX 750 Ti and, again, the Jetson TK1, since it has a 192-core CUDA GPU. The GTX 680, the Titan Black and the Jetson TK1 are based on the Kepler GPU architecture, while the GTX 750 Ti is based on the Maxwell architecture.
Through the assessment of the throughput performance and energy consumption of several multi- and many-core hardware and software environments, this study contributes to a better understanding of the behaviour of these platforms for parallel computing. Indeed, a relevant contribution of this work is the assessment of two embedded platforms: the Jetson TK1 development board and the Raspberry Pi 2. This study confirms that the Jetson TK1 development board, with its quad-core CPU and CUDA-capable GPU, is an effective platform for delivering high performance with low energy consumption. Conversely, the Raspberry Pi 2 is clearly not appropriate for performance-bound workloads. Another contribution of this work lies in the comparison between the OpenMP and Pthreads paradigms applied to the same problem, with a clear performance advantage for OpenMP. This study also confirms the need for different parallelization approaches, depending on whether multi-core/multi-CPU or many-core systems are targeted. Finally, it also shows that speedups, albeit moderate, can be attained even with applications that are challenging to parallelize.
The paper is organized as follows. Section 2 reviews related work. Section 3 presents the
hardware and parallel paradigms targeted in this work. Section 4 outlines the MMP algorithm,
while Section 5 presents the main results. Finally, Section 6 concludes the paper and presents
future work.
2. RELATED WORK
Since the Jetson development boards are relatively recent, scientific studies regarding their performance and energy consumption are still scarce. Paolucci et al. analyse performance vs. energy consumption for a distributed simulation of spiking neural networks[10]. The comparison involves two Jetson TK1 development boards connected through Ethernet and a multiprocessor system with Intel Xeon E5-2620/2.10 GHz CPUs, while the application is based on the Message Passing Interface standard (MPI)[11]. The authors report that, performance-wise, the server system is 3.3 times faster than the parallel embedded system, but its total energy consumption is 4.4 times higher than that of the dual TK1 system. In[12], the authors evaluate the RX algorithm for anomaly detection on images for several low-power architectures. The study assesses systems based on general-purpose processors from Intel (Atom S1260 with two cores) and ARM (Cortex-A7, Cortex-A9, Cortex-A15, all quad-core systems), and two low-power CUDA-compatible GPUs (the 96-core Quadro 1000M and the 192-core GK20a of the Jetson TK1). As a reference, they use an Intel i7-3930 CPU with no accelerators. They report that, for the IEEE 754 double-precision arithmetic RX detector, the Jetson TK1 system yields an execution time close to the reference desktop system, using one tenth of the energy. Fatica and Phillips report on the port and optimization of a synthetic aperture radar (SAR) imaging application on the Jetson TK1 development board[13]. The port involves the adaptation of the Octave-based application to CUDA. Through several software optimizations, the execution time of the application is brought down from 18 minutes to 1.5 seconds, although the main performance improvements come from refactoring the code, and not from using the Jetson TK1 GPU through CUDA.
The Glasgow Raspberry Pi Cloud project reports that its 56-node Raspberry Pi data center consumes only about 196 watts (3.5 watts per system), while an equivalent testbed of commodity servers would require 10,080 watts (180 watts per system), that is, roughly 50 times more[14]. Similarly, Baun thoroughly studies the performance of several clusters comprised of SoC boards: RPi-B, RPi2-B and the Banana Pi[15]. The author concludes that the studied cluster of RPi2-B boards provides 284.04 MFLOPS per watt, which would be sufficient for 6th place in the November 2015 Green 500 list if solely the performance per watt were considered. Additionally, these low cost and low maintenance clusters are interesting for several academic purposes and research projects.
Since the emergence of multi-core and many-core systems in the 2000s, a significant volume of scientific literature has been produced, often comparing the performance of both types of systems. Lee et al.[16] report that a regular GPU is, on average, 14x faster than a state-of-the-art 6-core CPU over a set of several CPU- and GPU-optimized kernels. Bordawekar et al. study the performance of an application that computes the spatial correlation for a large image dataset derived from natural scenes[17]. They report that the optimized CPU version of the application requires 1.02 seconds on an IBM POWER7-based system and 1.82 seconds on an Intel Xeon, while the CUDA-based version runs in 1.75 seconds on an NVIDIA GTX 285. Stamatakis and Ott report on a performance study in the bioinformatics field involving OpenMP, Pthreads and MPI[18]. They use the RAxML application, which performs large-scale phylogenetic inference. The authors mention some numerical issues with reduction operations under OpenMP due to the non-determinism of the order of additions. We encountered a similar situation in our initial adaptation of the code, where the determinism of the sequential version could not be reproduced in the parallel version, yielding slightly different final results. Regarding performance, the authors report better scalability of OpenMP relative to Pthreads on a two-way 4-core Opteron system (8 cores) using the Intel C Compiler (ICC) suite.
3. COMPUTING ENVIRONMENTS
Next, we describe the hardware and software environments used in this study.
3.1. HARDWARE
We present the hardware used in the experiments, namely the Xeon-based server, the discrete GPUs and the embedded boards, as well as the energy consumption measurement hardware.
3.1.1. SERVER SYSTEM
All the tests requiring a server system were performed on a machine with two Intel Xeon E5-2620/v2 CPUs, clocked at 2.10 GHz. Each physical core has a 32 KiB L1 data cache and a 32 KiB L1 instruction cache, plus a unified 256 KiB level 2 cache. Additionally, all the physical cores of a CPU share a 15 MiB on-chip level 3 cache. Each CPU holds 6 physical cores that are doubled through simultaneous multithreading (Hyper-Threading). Therefore, in total, the server machine has 12 physical cores (6 cores per CPU) that yield 24 virtual cores.
3.1.2. DISCRETE GPUS
The CUDA-based tests involving discrete GPUs were conducted with a GTX 680, a GTX Titan Black Edition and a GTX 750 Ti, all from NVIDIA. Both the GTX 680 and the Titan Black Edition are Kepler-based GPUs, while the GTX 750 Ti is based on the Maxwell architecture. All of them were used through the PCI Express interface of the Xeon E5-2620/v2 server. The main characteristics of the GPUs are summarized in Table 1.
Table 1. Main characteristics of the GPUs (TFLOPS are for 32-bit FP).

              CUDA cores   Mem. (DDR5)   Mem. width (bits)   Power (watts)   TFLOPS   Architecture
GTX 680             1536         2 GiB                 256             195    3.090   Kepler
Titan Black         2880         6 GiB                 384             250    5.121   Kepler
GTX 750 Ti           640         2 GiB                 128              60    1.306   Maxwell
Jetson TK1           192         2 GiB                  64              14    0.300   Kepler
3.1.3. JETSON TK1

Unless configured for maximum performance, only the needed hardware modules of the Jetson TK1 are enabled. For instance, when running a CPU-bound application that does not use the GPU, the system does not enable the GPU.
3.1.4. RASPBERRY PI 2
The Raspberry Pi is a low cost, low power, credit-card-sized single-board computer developed by the Raspberry Pi Foundation[9]. The Raspberry Pi has attracted a lot of attention, with both models of the first version (model A and model B) reaching sales in the order of millions. A major contributor to its popularity has been its low price and the ability to run the Linux OS and its software stack. Version 2 of the Raspberry Pi, which is the one used in this study, was released in 2015. Model B, the high-end model of the Raspberry Pi 2, has a quad-core 32-bit ARM Cortex-A7 CPU operating at 900 MHz, a Broadcom VideoCore IV GPU and 1 GiB of RAM shared between the CPU and the GPU. Besides the doubling of the RAM, an important upgrade from the original Raspberry Pi lies in the CPU, which has four cores and thus can be used for effective multithreading. Each CPU core has a 32 KiB instruction cache and a 32 KiB data cache, while a 512 KiB L2 cache is shared by all cores. The CPU implements version 7 of the ARM architecture, which means that Linux distributions available for ARMv7 can be run on the Raspberry Pi 2. The GPU is praised for its capability of decoding video with resolutions of up to 1080p (full HD), supporting the H.264 standard[21]. However, to the best of our knowledge, no standard parallel programming interfaces like OpenMP 4 and OpenCL are available for the GPU of the Raspberry Pi. Although the Raspberry Pi provides six different performance modes, we solely consider two of them. The low power mode corresponds to the None mode of the Raspberry Pi 2, with the ARM CPU set to 700 MHz, the core clock to 250 MHz and the SDRAM to 400 MHz. The high performance mode increases the ARM CPU to 1000 MHz, the core clock to 500 MHz and the SDRAM to 600 MHz; it corresponds to the Turbo mode of the Raspberry Pi 2. The main characteristics of both the Jetson TK1 and the Raspberry Pi 2 are shown in Table 2. Table 3 displays the memory bandwidth measured for copies between non-pageable RAM (host) and the GPUs (devices) and vice-versa. The values were measured with the bandwidthTest utility (NVIDIA SDK).
Table 2. Main characteristics of the Embedded Systems.

Device            CPU cores     GPU cores     TFLOPS (32-bit FP)
Jetson TK1        4+1 ARM-v7    192 (CUDA)    0.300
Raspberry Pi 2    4 ARM-v7      n.a.          0.244
Table 3. Memory bandwidth between non-pageable host RAM and device, measured with bandwidthTest (NVIDIA SDK).

               Host to Device (MB/s)   Device to Host (MB/s)
GTX 680                        6004                    6530
Titan Black                    6119                    6529
Jetson TK1                      997                     997
GTX 750 Ti                     6380                    6387
3.2. SOFTWARE
We briefly present the software frameworks OpenMP, Pthreads and CUDA.
3.2.1. OPENMP
OpenMP (Open Multi-Processing) is a parallel programming standard for shared memory computers, available for the C, C++ and Fortran programming languages. Although the standard appeared in 1997, the emergence of multi-core CPUs has contributed to renewed interest in the standard.
Within the GPU code, CUDA provides a set of built-in identifiers that allow locating the current thread within a block (threadIdx.x, .y and .z) and the current block within the grid (blockIdx.x, .y and .z). Through these identifiers, the programmer can assign a particular zone of the dataset to each thread. For example, the addition of two matrices can be performed by creating a 2D execution geometry with the dimensions of the matrices, where each thread performs the addition of the corresponding pair of elements of the matrices. This way, the addition is performed in parallel. For matrices larger than the maximum dimensions of the execution geometry, each thread can loop, performing an addition and then moving on to the next assigned pair of elements.
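As a purely illustrative sketch (not part of MMP), the following kernel adds two matrices using a 2D execution geometry and grid-stride loops, so that it also works for matrices larger than the launched geometry:

__global__ void matrixAdd(const float *a, const float *b, float *c,
                          int width, int height)
{
    /* Each thread starts at its own (row, col) position and then strides by
       the total number of launched threads in each dimension. */
    for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < height;
         row += gridDim.y * blockDim.y) {
        for (int col = blockIdx.x * blockDim.x + threadIdx.x; col < width;
             col += gridDim.x * blockDim.x) {
            int idx = row * width + col;
            c[idx] = a[idx] + b[idx];
        }
    }
}

/* Possible launch with 16x16 threads per block (illustrative values only):
   dim3 block(16, 16);
   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
   matrixAdd<<<grid, block>>>(dev_a, dev_b, dev_c, width, height);            */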
Regarding memory, CUDA distinguishes between host memory and device memory. The former is the system RAM, while the latter corresponds to the memory attached to the GPU. By default, CUDA code running on a GPU can only access the GPU memory. Proper memory management is important in CUDA and can have a deep impact on performance[24][25]. CUDA's software stack includes compilers, profilers, libraries and a vast set of examples and samples. From the programming language point of view, CUDA extends C and C++ through the addition of a few modifiers, identifiers and functions. Nonetheless, the logic and semantics of the original programming language are preserved. In this study, CUDA was used in a C environment.
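As a minimal host-side sketch of the host/device memory model described above (buffer names are hypothetical and error checking is omitted), a typical allocate/copy/compute/copy-back sequence looks as follows:

/* Hypothetical sketch of CUDA host/device memory handling; not MMP code. */
size_t bytes = num_elements * sizeof(float);
float *host_buf = NULL, *dev_buf = NULL;

cudaMallocHost((void **)&host_buf, bytes);   /* pinned (non-pageable) host memory */
cudaMalloc((void **)&dev_buf, bytes);        /* device (GPU) memory */

/* ... fill host_buf ... */
cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
/* some_kernel<<<grid, block>>>(dev_buf, num_elements); */
cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);

cudaFree(dev_buf);
cudaFreeHost(host_buf);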
For a given block, the best fit corresponds to the block or set of sub-blocks that yields the lowest cost. The distortion is handled through the single-precision IEEE 754 floating point format. The recurrent optimization of an input block works by segmenting it into two sub-blocks, each with half the pixels. The two halves are then recursively optimized using new searches in the dictionary. The decision to segment a block is made by comparing the sum of the costs for each half with the cost of approximating the original block. The need to compute the distortion between a vast set of dictionary blocks and the input block is the main cause of the high computational load of the algorithm. For example, the single-threaded MMP requires around 2000 seconds to encode the 512x512-pixel, 8-bit gray level Lenna image when run on an Intel Xeon E5-2620/v2 machine.

Every time MMP segments a block, a new pattern is created by the concatenation of two smaller codewords. This new pattern is then inserted into the dictionary, allowing for future uses in the coding procedure. Furthermore, scale transformations are used in order to adjust the dimensions of the vectors and create new patterns that can be used to approximate future blocks of any possible dimensions[26][27].
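The segment-or-not decision described above can be sketched as the following recursion over a one-dimensional block of pixels; the dictionary search is reduced to a toy stub, and all names are illustrative rather than taken from the MMP source:

#include <stddef.h>

/* Toy stand-in for the dictionary search: distortion against the zero block
   plus a fixed rate penalty, so that segmenting has a price. */
static double best_match_cost(const float *block, size_t len)
{
    double d = 0.0;
    for (size_t i = 0; i < len; i++)
        d += (double)block[i] * block[i];
    return d + 8.0;
}

/* Recursively decide whether approximating the whole block is cheaper than
   approximating its two halves (each with half the pixels). */
static double optimize_block(const float *block, size_t len)
{
    double cost_whole = best_match_cost(block, len);
    if (len < 2)
        return cost_whole;              /* smallest scale: cannot segment */

    double cost_split = optimize_block(block, len / 2) +
                        optimize_block(block + len / 2, len - len / 2);

    return (cost_split < cost_whole) ? cost_split : cost_whole;
}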
Another relevant feature of MMP is the use of a hierarchical prediction scheme, similar to the one used by the H.264/AVC video encoding standard[28]. For each original block X, a prediction block P is determined using the previously encoded neighbouring samples, located to the left and/or above the block to be predicted. A residue block can then be computed by using a pixel-wise difference: R = X - P. This allows the encoding of the residue block R instead of X, since the decoder is able to determine P and compute X' = P + R', where X' and R' represent the encoded (approximated) versions of X and R, respectively. By using different prediction models, the residual patterns tend to be more homogeneous than the original image patterns. These homogeneous patterns are easier to learn, thus increasing the efficiency of the dictionary and of the approximation of the encoded blocks, resulting in a more efficient method. Figure 2 presents three examples of the available prediction modes (vertical, horizontal and diagonal down/right) and, at the bottom right, all possible prediction directions. These prediction modes are available in both MMP and H.264/AVC[28].
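In code, the pixel-wise residue computation at the encoder and the corresponding reconstruction at the decoder can be sketched as follows (plain C, illustrative only):

/* Encoder side: residue R = X - P for an n x n block. */
void compute_residue(const int *x, const int *p, int *r, int n)
{
    for (int i = 0; i < n * n; i++)
        r[i] = x[i] - p[i];
}

/* Decoder side: reconstruction X' = P + R', where R' is the decoded
   (approximated) residue. */
void reconstruct_block(const int *p, const int *r_approx, int *x_approx, int n)
{
    for (int i = 0; i < n * n; i++)
        x_approx[i] = p[i] + r_approx[i];
}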
MMP uses a hierarchical prediction scheme, meaning that blocks of different dimensions can be used in the prediction process (16x16 down to 4x4). For each possible prediction scale, MMP tests all available prediction modes and selects the one with the best result. This full search scheme enables MMP to choose not only the most favourable prediction mode, but also the best block size to be used in the prediction step. As a result, MMP becomes highly flexible and achieves a relevant performance improvement, but at the cost of an exponential complexity increase, related to the many new coding options that have to be tested.
5. MAIN RESULTS
We now discuss the main results. We first present the configuration of the experimental tests and then analyse the most relevant results regarding execution time and energy consumption.
5.1. CONFIGURATION
Each test was run 20 times, except on the Raspberry Pi 2, where only 10 executions were performed per test, due to its slower speed. As the standard deviation values are close to zero, we only report the average of the execution times. The tests consisted of performing the MMP encode operation on the Lenna image in a 512x512, 8-bit gray format. The quality parameter of MMP was set to 10, a good balance between quality and output bitrate.
5.1.1 OPERATING SYSTEM AND TOOLS
For each platform, the following operating systems and compiler tools were used:
Xeon E5-2620/v2: Ubuntu 14.10, kernel 3.13.0-39 SMP, gcc 4.8.2, CUDA driver 340.58,
nvcc 6.5.12
Jetson TK1: 32-bit tegra-ubuntu, kernel 3.10.40, gcc 4.8.2, nvcc 6.5.12
The Energy (ratio) corresponds to the energy usage ratio between the current device and the reference system, again the Xeon E5-2620/v2 server. Finally, the Efficiency (ratio) corresponds to the ratio between the speed ratio and the energy ratio. It measures the efficiency of the device running MMP against the E5-2620/v2 reference.
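For instance, plugging in the Table 5 values for the Jetson TK1 in low power mode, with the Xeon E5-2620/v2 server as the reference:

Efficiency = (T_ref / T_dev) / (E_dev / E_ref)
           = (1967.813 / 4297.038) / (7.183 / 76.630)
           ≈ 0.458 / 0.094
           ≈ 4.87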
The efficiency metric shows that the Jetson TK1 is the most appropriate device when considering execution speed and energy consumption together, with an efficiency ratio in the vicinity of 4.8 and a slight advantage for the low power over the high performance mode. Note that the efficiency ratio of the Jetson TK1 is due to its low energy consumption (roughly 7.2 watt-hours), which is less than 1/10 of the energy needed by the Xeon E5-2620/v2, while its execution speed for Sequential-MMP is approximately 0.46 of the speed achieved by the Xeon E5-2620/v2 server. At the other extreme of the scale, the low power mode of the Raspberry Pi 2 delivers half the efficiency of the reference system. In fact, the Raspberry Pi 2 consumes more energy than the Jetson TK1 (13.240 Wh vs. 7.183 Wh), since its lower instantaneous power usage is overshadowed by the fact that it takes roughly 5 times longer to execute the MMP encoding operation than the Jetson TK1.
Interestingly, at least for the single-threaded CPU version of MMP, there seem to be no meaningful differences on the Jetson TK1 between the low power mode and the maximum performance mode, while only a marginal difference exists in the execution times between the low power and the high performance modes of the Raspberry Pi 2 (24826.195 vs. 24442.232 seconds). This is mostly due to the memory-bound nature of Sequential-MMP, where a faster CPU does not meaningfully improve the execution time due to saturation of the CPU/RAM traffic.
Table 5. Execution times and power usage for the sequential version of MMP.

Sequential       Exec. Time (s)   Speed (ratio)   Avg. Power (W)   Energy (Wh)   Energy (ratio)   Efficiency (ratio)
Xeon E5-2620          1967.813        1                 140.191        76.630        1                  1
Jetson TK1 LP         4297.038        0.458               6.018         7.183        0.094              4.872
Jetson TK1 HP         4290.311        0.459               6.086         7.253        0.095              4.832
RPi 2 LP             24826.195        0.079               1.920        13.240        0.173              0.457
RPi 2 HP             24442.232        0.081               1.879        12.762        0.167              0.485
5.2.2 OPENMP
The OpenMP version of MMP (henceforth OpenMP-MMP) was run with a number of working threads ranging from 1 to 16 on the Jetson TK1 board and the Raspberry Pi 2, and from 1 to 48 on the Xeon E5-2620/v2. The rationale is that both the Jetson TK1 and the Raspberry Pi 2 have a quad-core CPU, while the server system has two E5-2620/v2 hexa-core CPUs, totalling 24 virtual cores. However, since in all experiments no behaviour change was observed past 32 threads, we only present results up to 32 threads for the Xeon E5-2620/v2 server. The execution times for OpenMP-MMP across all studied platforms are plotted in Figure 3a. The X-axis represents the number of OpenMP working threads, that is, the number of threads excluding the main thread (the main thread does not perform any computation). The entry for 0 threads corresponds to the execution time of Sequential-MMP and aims to ease comparisons.
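For reference, the number of OpenMP working threads can be selected either through the OMP_NUM_THREADS environment variable or programmatically; the following minimal example (not taken from the MMP source) illustrates the programmatic route:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(10);   /* e.g., one thread per prediction mode */

    #pragma omp parallel
    {
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}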
For the E5-2620/v2 server, OpenMP-MMP attains its minimum execution time of 719.287 seconds when the execution is performed with 10 working threads, thus being 2.74 times faster than the single-threaded sequential version. The 10-thread performance barrier matches the underlying algorithm used to adapt MMP to OpenMP, where parallelization is performed along the 10 prediction modes, as previously seen in Section 4.1 (a simplified sketch is given below). Increasing the number of threads beyond 10 degrades the execution times (895.727 seconds for 16 threads and 1072.816 seconds for 32 threads), mostly due to the overhead of having more threads while the available parallelism has already been exhausted. However, an unexpected disruption occurs with 24 working threads, with the
execution times worsening to 2283.686 seconds, roughly double both the 23-thread execution (994.851 seconds) and the 25-thread execution (1065.163 seconds). Note that this peak in the execution times coincides with the saturation of the 24-virtual-core E5-2620/v2 server, since 24 working threads actually corresponds to 25 threads (the workers plus the main thread) on top of the regular OS activity. After checking the source code of the OpenMP implementation used, we confirmed that it enforces CPU affinity, assigning one thread per core, whenever the number of threads is less than or equal to the number of (virtual) cores of the underlying system. Thus, when the number of threads is equal to the number of cores, one of the cores used by OpenMP is also necessarily used by the operating system, thus disturbing the balance of the OpenMP execution. Moreover, due to the natural OpenMP organization of the execution in fork/join sections, a delayed thread impacts a whole section, with each delayed section adding to the total execution time. This hypothesis is further confirmed by the fact that, with 25 or more threads, the execution time of OpenMP-MMP returns to the slow degradation observed before the 24-thread peak, due to the fact that OpenMP no longer enforces CPU affinity. This is further confirmed by the average power demand (Figure 5), which drops from 173.1981 to 140.6071 watts, indicating that the number of active cores drops sharply after the 24-thread peak.
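As a simplified sketch of the parallelization strategy along the prediction modes mentioned above (function and variable names are hypothetical, not the actual MMP source):

#include <float.h>
#include <omp.h>

#define NUM_PRED_MODES 10

/* Hypothetical work unit: returns the coding cost of one prediction mode. */
extern double evaluate_prediction_mode(const unsigned char *block, int mode);

double best_prediction(const unsigned char *block, int *best_mode)
{
    double best_cost = DBL_MAX;
    *best_mode = -1;

    #pragma omp parallel for
    for (int mode = 0; mode < NUM_PRED_MODES; mode++) {
        double cost = evaluate_prediction_mode(block, mode);  /* independent work */
        #pragma omp critical
        {
            if (cost < best_cost) {
                best_cost = cost;
                *best_mode = mode;
            }
        }
    }
    return best_cost;
}

With only NUM_PRED_MODES loop iterations to distribute, threads beyond 10 cannot be given useful work, which is consistent with the 10-thread performance ceiling observed above.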
For the Raspberry Pi 2, the fastest execution time is 11177.974 seconds, achieved with 10 working threads, meaning that the OpenMP version of MMP is roughly 2.19 times faster than the Sequential-MMP version run on the Raspberry Pi 2. Beyond 10 threads, the execution times slowly degrade, reaching 11280.480 seconds with 16 working threads. As can be seen in the plot, where the curves for the low power and high performance modes match, the execution time difference between the two modes is marginal. The behaviour of the Jetson TK1 is strongly influenced by the running mode. Indeed, in performance mode, the Jetson TK1 achieves its fastest OpenMP-MMP execution time of 1888.244 seconds when run with 10 working threads. Beyond this threshold, the execution time slowly degrades (1915.380 seconds for 16 working threads). This matches the behaviour of the other platforms. However, the low power mode behaves substantially differently: it achieves its fastest execution of 2539.583 seconds with 8 working threads, which corresponds to two threads per physical core. Moreover, contrary to the other platforms and to the high performance mode, the evolution of the execution times in scenarios with fewer than 8 threads is not linear. For instance, while two threads yield 3234.894 seconds, the execution with three threads requires 4979.784 seconds, and the execution with four threads (matching the number of cores) only takes 2906.157 seconds. This indicates that, in low power mode, the performance achieved with OpenMP on the Jetson TK1 is strongly dependent on the number of threads. In particular, the best results seem to be achieved with four and eight threads.
The energy consumption results are shown in Figure 3b. The Raspberry Pi 2 yields a similar behaviour for both the low power and high performance modes, consuming around 6.5 watt-hours for the OpenMP-MMP version. Regarding the power demand, the Raspberry Pi 2 has a stable behaviour in both execution modes, increasing slightly with the number of threads. Moreover, there are practically no differences between the two execution modes. Indeed, for the execution of OpenMP-MMP, the average power demand ranges from 1.877 watts (one worker thread) to 2.110 watts (16 threads) in the low power mode, and from 1.866 watts (one thread) to 2.125 watts in the high performance mode. The energy consumption of the Raspberry Pi 2 is shown in Figure 3b along with the energy consumption of all the other devices. The average power demand is plotted in Figure 4. The Xeon E5-2620/v2 server attains its best performance with 10 working threads, consuming 33.528 watt-hours, while the highest consumption occurs in the aforementioned pathological 24-thread execution, with 109.870 watt-hours. Note that the one-thread OpenMP-MMP requires less average power than the sequential execution (128.114 vs. 140.191 watts, Figure 5) and thus consumes less energy. This is most probably due to the strict core-affinity policy enforced by OpenMP, which aims to keep each thread on the same core throughout the whole run. This minimizes the number of used cores, allowing unused cores to remain idle in a low power state. The high performance mode of the Jetson TK1 only consumes slightly more energy than the low power mode (5.236 vs. 4.936 watt-hours), while the execution times are significantly different (1888.244 vs. 2539.583 seconds). The plot of the average power usage of the Jetson TK1 (Figure 4a) shows that the power demand varies widely in the low power mode, possibly due to individual CPU cores being activated/deactivated in response to the system load.
Table 6 summarizes the best results for OpenMP-MMP. While the Xeon E5-2620/v2 server provides the fastest execution, the Jetson TK1 yields the best efficiency ratio, 15.324 in high performance mode, and the lowest overall energy consumption, 4.936 watt-hours in the
low power mode. While the Raspberry Pi 2 only requires an average power of about 2 watts, overall it consumes more energy than the Jetson TK1, since it takes much longer to perform the MMP encoding operation.
Table 6. Execution times and energy consumption for OpenMP-MMP.

OpenMP                       #threads   Exec. Time (s)   Speed (ratio)   Avg. Power (W)   Energy (Wh)   Energy (ratio)   Efficiency (ratio)
Xeon E5-2620 (sequential)        -           1967.813        1                 140.191        76.630        1                  1
Xeon E5-2620 (OpenMP)           10            719.287        2.736             167.806        33.528        0.438              6.247
Jetson TK1 (LP)                  8           2539.583        0.775               6.997         4.936        0.064             12.109
Jetson TK1 (HP)                 10           1888.244        1.042               9.983         5.236        0.068             15.324
RPi 2 (LP)                      10          11185.531        0.176               2.093         6.503        0.085              2.071
RPi 2 (HP)                      10          11177.974        0.176               2.103         6.530        0.085              2.071
5.2.3 PTHREADS
The execution times for Pthreads-MMP are shown in Figure 6. Once again, practically no differences exist between the two studied modes of the Raspberry Pi 2. Both modes yield their fastest execution times, close to 11580 seconds, with 4 working threads. Relative to the sequential execution, this corresponds to a speedup of 2.1. The fastest average execution time for Pthreads-MMP on the Xeon E5-2620/v2 server is 1458.617 seconds, achieved with 11 working threads. This corresponds to a speedup of 1.35 relative to the sequential version. Surprisingly, Pthreads-MMP executes in practically the same time as Sequential-MMP when the Jetson TK1 is set to low power. The only exception occurs with three working threads, when it achieves a marginal speedup of 1.13 relative to the sequential single-threaded version. However, when set to performance mode, the Jetson TK1 achieves a speedup of 2.23 with 4 working threads. In fact, with 1924.222 seconds, the Jetson TK1 in high performance mode is only 25% slower than the Xeon E5-2620/v2 for Pthreads-MMP (1924.222 vs. 1458.617 seconds) and slightly faster than the execution of Sequential-MMP on the Xeon server (1924.222 vs. 1967.813 seconds), all this with a fraction of the energy consumption. The main performance and energy consumption results of Pthreads-MMP are grouped in Table 7.
Table 7. Execution times and energy consumption for Pthreads-MMP.

Pthreads                     #threads   Exec. Time (s)   Speed (ratio)   Avg. Power (W)   Energy (Wh)   Energy (ratio)   Efficiency (ratio)
Xeon E5-2620 (sequential)        -           1967.813        1                 140.191        76.630        1                  1
Xeon E5-2620 (Pthreads)         11           1458.617        1.349             143.967        58.331        0.761              1.772
Jetson TK1 (LP)                  3           3772.305        0.522               5.628         5.897        0.077              6.779
Jetson TK1 (HP)                 10           1924.222        1.023               9.251         4.944        0.065             15.850
RPi 2 (LP)                      10          11570.112        0.170               2.162         6.948        0.091              1.868
RPi 2 (HP)                      10          11584.317        0.170               2.077         6.684        0.087              1.954
Under Pthreads-MMP, the usage of the (virtual) cores never reaches more than 65%, while CPU usage under OpenMP-MMP is close to 100%. This also explains the almost constant average power demand of Pthreads-MMP, which is close to 144 watts regardless of the number of working threads, while OpenMP-MMP requires an average of 167.801 watts when it achieves its fastest execution on the Xeon server. It is important to note that Pthreads-MMP also relies on a pool of threads, and that some further optimization efforts were made on this version, but the performance achieved by OpenMP-MMP could not be matched.
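For illustration, a minimal pthreads worker-pool skeleton of the kind described could look like the sketch below; this is not the actual MMP implementation, and process_task stands for a hypothetical MMP work unit:

#include <pthread.h>

#define NUM_TASKS 10               /* e.g., one task per prediction mode */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;

extern void process_task(int task);   /* hypothetical work unit */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int task = (next_task < NUM_TASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&lock);
        if (task < 0)
            break;                     /* no work left: leave the pool */
        process_task(task);
    }
    return NULL;
}

/* Workers are launched with pthread_create(&tid[i], NULL, worker, NULL) and
   collected with pthread_join() once all tasks have been processed. */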
5.3 CUDA
Under CUDA, execution times depend on the execution geometry, that is, (i) the number of blocks per grid and (ii) the number of threads per block[32]. The best geometry configuration, that is, the configuration that yields the fastest execution times, depends not only on the application, but also on the GPU. For MMP, the configurations that yielded the fastest execution times per GPU are shown in Table 8. These best geometries were found through experimentation, although, since version 6.5, CUDA provides API calls reporting on the occupancy of a GPU that may help steer the execution geometry towards optimal performance[33].
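For reference, a sketch of how the occupancy API can suggest a block size is shown below; mmp_kernel is a hypothetical kernel name used only for illustration:

/* Available since CUDA 6.5: ask the runtime for an occupancy-based block size. */
int min_grid_size = 0, block_size = 0;
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                   mmp_kernel, 0 /* dynamic shared memory */,
                                   0 /* no block size limit */);
int grid_size = (num_elements + block_size - 1) / block_size;
/* mmp_kernel<<<grid_size, block_size>>>(...); */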
Table 8. Best CUDA geometries per GPU for CUDA-MMP.

                     GTX 680   GTX Titan Black   GTX 750 Ti   Jetson TK1
Blocks                  1024              1024          128            6
Threads per block        128                64          288          608
The CUDA-MMP execution results are shown in Table 9. The plots in Figure 9 show the execution times (left plot) and the energy consumption (right plot). Both plots also display the efficiency ratio, again considering Sequential-MMP on the Xeon server as the reference. The discrete GPUs present the best execution times, being roughly 6.5 times faster than Sequential-MMP. Interestingly, the marginal execution time differences among the three GPUs, which have different capabilities, are an indication that the performance of CUDA-MMP is not limited by the GPU, but by the CPU. Therefore, since the GTX 750 Ti has the lowest power consumption, it yields the best efficiency ratio of the set of discrete GPUs. Regarding the Jetson TK1, and contrary to what was previously observed with Sequential-MMP, the difference in efficiency between the low power and the high performance modes is huge: 17.46 vs. 175.03. This is an indication that the main difference between the low power and the high performance modes lies mostly in the GPU, benefiting the execution of CUDA code. Indeed, as stated before, in low power mode the GPU operates at 72 MHz, while in high performance mode its operating frequency is boosted to 852 MHz. Overall, the Jetson TK1, especially in high performance mode, is the most efficient. Although its execution times are more than double those of the discrete GPUs, it has a much lower power demand, consuming far less energy.
Table 9. Execution times and power usage for CUDA-MMP.

CUDA                           Exec. Time (s)   Speed (ratio)   Avg. Power (W)   Energy (Wh)   Energy (ratio)   Efficiency (ratio)
Xeon E5-2620/v2 (sequential)        1967.813        1                 140.191        76.630        1                  1
GTX 680                              299.571        6.569             186.844        15.548        0.203             32.360
GTX Titan Black                      293.437        6.706             216.223        17.624        0.230             29.157
GTX 750 Ti                           298.737        6.581             150.448        12.484        0.163             40.411
Jetson TK1 LP                       2939.568        0.669               3.599         2.939        0.038             17.605
Jetson TK1 HP                        646.497        3.044               7.422         1.333        0.017            179.059
As future work, we plan to study the performance and energy consumption of MMP versions using the OpenCL paradigm, exploring the advantage of OpenCL's availability for both multi-core/multi-CPU environments and many-core hardware[2]. We also aim to explore OpenMP with GPUs, taking advantage of existing implementations of version 4 of OpenMP for accelerators[34].
ACKNOWLEDGEMENTS
Financial support was provided in the scope of R&D Unit 50008, financed by the applicable financial framework (FCT/MEC through national funds and, when applicable, co-funded by FEDER under the PT2020 partnership agreement). We would like to thank Dr. Pedro Marques and Gilberto Jorge for their valuable contributions to the energy consumption measurements.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
AUTHORS
Pedro M. M. Pereira received his B.Sc. in Computer Engineering from Instituto Politécnico de Leiria, Portugal, in 2013. He is currently an M.Sc. student in Mobile Computing at the same institution. His interests include low level algorithms, energy efficient coding in high-performance computing, and artificial intelligence.
Patricio Domingues is with the Department of Informatics Engineering at ESTG, Instituto Politécnico de Leiria, Portugal. He holds a Ph.D. (2009) in Informatics Engineering from the University of Coimbra, Portugal. His research interests include multi-core and many-core systems, parallel computing, image and video processing, and digital forensics.
Nuno M. M. Rodrigues has a Ph.D. from Universidade de Coimbra (2009), developed in collaboration with Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil. Since 1997, he has been with the Department of Electrical Engineering, ESTG/Instituto Politécnico de Leiria, Portugal. His current research interests include image compression and many-core programming.
Gabriel Falcão holds a Ph.D. degree in electrical and computer engineering from the University of Coimbra, Coimbra, Portugal. He is currently an Assistant Professor at the University of Coimbra and a Researcher at Instituto de Telecomunicações. His research interests include high-performance and parallel computing, hybrid computation on heterogeneous systems, and digital signal processing algorithms. Dr. Falcão is a member of the IEEE Signal Processing Society.
Sérgio M. M. de Faria holds a Ph.D. degree in Electronics and Telecommunications from the University of Essex, England (1996). He has been a Professor at ESTG/IPLeiria, Portugal, since 1990, and is a Senior Researcher with Instituto de Telecomunicações, Portugal. He is an Area Editor of Signal Processing: Image Communication, and a reviewer for several scientific journals and conferences (IEEE, IET and EURASIP). He is a Senior Member of the IEEE.