
International Journal of Advanced Computer Research (ISSN (print): 2249-7277, ISSN (online): 2277-7970), Volume-3, Number-3, Issue-11, September-2013

Review on High Performance Energy Efficient Multicore Embedded Computing

Archana S. Shinde, Swapnali B. Bhosale, Ashok R. Suryawanshi
(A. S. Shinde, S. B. Bhosale and A. R. Suryawanshi are with PCCOE, Pune University.)

Abstract

With Moore's law supplying billions of transistors on chip, embedded systems are undergoing a transition from single core to multicore to exploit this high transistor density for high performance. The main objective of developing high performance energy efficient multicore embedded computing is to satisfy both performance requirements and power constraints, power being a first-order constraint for embedded systems. To achieve this objective, this paper outlines typical requirements of embedded applications and surveys state-of-the-art hardware/software high performance energy efficient embedded computing (HPEEC) techniques that help meet these requirements. Finally, the paper presents design challenges and future research directions for HPEEC system development. This paper is a literature review of high performance energy efficient multicore embedded computing; the techniques reported by researchers to date are encouraging and motivate further research in this domain.

Keywords

High performance computing (HPC), multicore, energy efficient computing, low power embedded systems.

1. Introduction

Embedded system design is traditionally power centric, but there has been a recent shift toward high performance embedded computing (HPEC) due to the proliferation of compute-intensive embedded applications. The design challenges compete with each other because high performance typically requires maximum processor speeds with enormous energy consumption, whereas low power typically requires nominal or low processor speeds that offer modest performance. HPEEC therefore requires thorough consideration of the relationship between thermal design power (TDP) and processor frequency while selecting an appropriate processor for an embedded application.

The distinction between HPC for supercomputers and HPEEC is important because performance is the most significant metric for supercomputers, with less emphasis given to energy efficiency, whereas energy efficiency is a primary concern for HPEEC.

This paper focuses on high performance and energy efficient techniques that are applicable to embedded systems to meet particular application requirements. Many of the HPEEC techniques at different levels are complementary in nature and work in conjunction with one another to better meet application requirements.

Background and relevance
The trend of increasing a processor's clock speed to get a boost in performance is a thing of the past. Multicore processors are the new direction manufacturers are focusing on. Using multiple cores on a single chip is advantageous for raw processing power, but nothing comes for free. With additional cores, power consumption and heat dissipation become a concern and must be simulated before layout to determine the floorplan that best distributes heat across the chip while being careful not to form any hot spots.

Organization of the paper
Section 2 covers multicore basics. Section 3 covers the proposed techniques, grouped under three approaches for achieving high performance energy efficient multicore embedded computing; under each approach, different techniques are suggested for different embedded applications to achieve high performance with energy efficiency. High performance energy efficient multicore processors are discussed in Section 4. Section 5 covers embedded applications. Section 6 presents the conclusion and Section 7 the future scope of this work.

2. Multicore Basics

Brief History of Microprocessors
Intel manufactured the first microprocessor, the 4-bit 4004, in the early 1970s; it was basically just a number-crunching machine. Fig. 1 shows the world's first single-chip processor. Shortly afterwards they
developed the 8008 and 8080, both 8-bit. Companies then fabricated 16-bit microprocessors, and Intel produced the 8086 and 8088. The 8086 would be the basis for Intel's 32-bit 80386 and later their popular Pentium line-up, which appeared in the first consumer PCs. Each generation of processors grew smaller and faster, dissipated more heat and consumed more power.

Figure 1: The world's first single-chip processor

Moore's Law
One of the guiding principles of computer architecture is known as Moore's Law. In 1965 Gordon Moore stated that the number of transistors on a chip would roughly double each year (he later revised this, in 1975, to every two years). What is often quoted as Moore's Law is Dave House's revision that computer performance will double every 18 months.

As shown in Figure 2, the number of transistors has roughly doubled every two years. Multicore processors are often run at lower frequencies but have much better performance than a single-core processor because "two heads are better than one" [5].

Figure 2: Depiction of Moore's Law

Multicore
A multicore architecture is an architecture in which multiple processors are integrated on a single chip. A multicore processor has two or more independent cores, and each of these processors is referred to as a core. The core is the part of the processor responsible for correctly reading and executing instructions. Fig. 3 shows (a) the shared memory model and (b) the distributed memory model. If we set two cores side by side, one can see that a method of communication between the cores and to main memory is necessary. This is usually accomplished using either a single communication bus or an interconnection network. The bus approach is used with a shared memory model, whereas the interconnection network approach is used with a distributed memory model. Beyond approximately 32 cores the bus becomes overloaded with processing, communication and contention, which leads to diminished performance; a communication bus therefore has limited scalability.

Figure 3: (a) Shared Memory Model, (b) Distributed Memory Model
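To make the shared memory model concrete, the following minimal C sketch (an illustration assuming a POSIX threads environment; the worker count and loop bound are arbitrary) runs several threads against a single counter in one shared address space. The mutex that keeps the counter consistent is the software-visible counterpart of the hardware arbitration that eventually saturates a shared bus.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4            /* illustrative core count */

static long shared_counter = 0;  /* lives in the single shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* serialize access to the shared data */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", shared_counter);  /* NUM_WORKERS * 100000 */
    return 0;
}
```

Under a distributed memory model, each node would instead hold a private counter and exchange partial results as explicit messages over the interconnection network.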

Table 1 below shows a comparison of a single-core and a multicore (8 cores in this case) processor reported by the Packaging Research Center at Georgia Tech [5]. With the same source voltage, and with the multiple cores run at a lower frequency, there is an almost tenfold increase in bandwidth while the total power consumption is reduced by a factor of four.

Table 1: Single Core vs. Multicore [5]

Parameter | Single-core processor (45 nm) | Multicore processor (45 nm)
Vdd | 1.0 V | 1.0 V
I/O pins (total) | 1280 (ITRS) | 3000 (ITRS)
Operating frequency | 7.8 GHz | 4 GHz
Chip-package data rate | 7.8 Gb/s | 4 Gb/s
Bandwidth | 125 GByte/s | 1 TByte/s
Power | 429.78 W | 107.39 W
Total pins on chip | 3840 | 9000 (estimated)
Total pins on package | 2480 | 4500 (estimated)

The Need for Multicore
A single-core processor is capable of executing only one instruction stream at a time. As we move from one generation to the next, i.e. from the 8086, 80186, 80286 and 80386 towards the Pentium IV, we see
that the processor clock frequency increases. Faster clock speeds typically require additional transistors and higher supply voltages, resulting in greater power consumption. The increasing clock speed has created a power dissipation problem for semiconductor manufacturers.

The latest semiconductor technologies support more and more transistors. The downside is that every transistor leaks a small amount of current, and the sum of this leakage is problematic. Instead of pushing chips to run faster, CPU designers are adding resources, such as more cores and more cache, to provide comparable or better performance at lower power. Adding a core can nearly double the throughput of a parallel workload while dissipating less heat than a comparable increase in clock frequency, even though, in practice, each individual core of a multicore processor is clocked slower than a high-end single-core processor.
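The argument above follows from the standard CMOS dynamic power relation P_dyn = a * C * V^2 * f (activity factor times switched capacitance times supply voltage squared times clock frequency). The short sketch below plugs in purely illustrative numbers (not measurements from any processor discussed here) to show why several slower, lower-voltage cores can supply more aggregate cycles per second than one fast core while drawing less total power.

```c
#include <stdio.h>

/* Standard CMOS dynamic power model: P = a * C * V^2 * f.
 * All constants below are illustrative, not measured values. */
static double dyn_power(double a, double cap, double volt, double freq)
{
    return a * cap * volt * volt * freq;
}

int main(void)
{
    const double a = 0.2;        /* switching activity factor      */
    const double cap = 1.0e-9;   /* switched capacitance in farads */

    /* One fast core: a higher supply voltage is needed to sustain 3.2 GHz. */
    double single = dyn_power(a, cap, 1.3, 3.2e9);

    /* Four slower cores at a reduced supply voltage. */
    double quad = 4.0 * dyn_power(a, cap, 0.9, 1.0e9);

    printf("single core: %.2f W for 3.2 Gcycles/s\n", single); /* ~1.08 W */
    printf("quad core  : %.2f W for 4.0 Gcycles/s\n", quad);   /* ~0.65 W */
    return 0;
}
```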
3. Proposed Techniques

To meet embedded application requirements, different approaches are defined under which different techniques are proposed. The approaches used in HPEEC are architectural approaches, hardware-assisted middleware approaches and software approaches. Fig. 4 gives an overview of the HPEEC domain, which spans architectural approaches to middleware and software approaches.

A. Architectural Approaches
HPEEC architectural approaches play an important role in meeting varying application requirements. These approaches can be broadly categorized into four categories: core layout, memory design, interconnection networks and reduction techniques. Each is discussed below.

1. Core Layout
There are various core layout considerations during chip and processor design, such as whether to use homogeneous cores (cores of the same type) or heterogeneous cores (cores of varying types), whether to position the cores in a 2D or 3D layout on the chip, and whether to design independent processor cores with switches that can turn processor cores on or off or to use a reconfigurable integrated circuit that can be configured to form processor cores of different granularity. Core layout techniques include heterogeneous CMP, conjoined-core CMP, tiled multicore architectures, 3D multicore architectures, composable multicore architectures, multicomponent architectures and stochastic processors.

2. Memory Design
The cache miss rate, fetch latency and data transfer bandwidth are some of the main factors impacting the performance and energy consumption of embedded systems [8]. The memory subsystem encompasses the main memory and the cache hierarchy, and must take into consideration issues such as consistency, sharing, contention, size and power dissipation. HPEEC memory design techniques include transactional memory, cache partitioning, cooperative caching and smart caching.

3. Interconnection Network
As the number of on-chip cores increases, a scalable and high-bandwidth interconnection network to connect on-chip resources becomes crucial. Interconnection networks can be static or dynamic. Static interconnection networks consist of point-to-point communication links between computing nodes and are also referred to as direct networks (e.g., bus, ring, hypercube). Dynamic interconnection networks consist of switches and links and are also referred to as indirect networks (e.g., packet-switched networks) [1]. Prominent interconnect topologies include the bus, the 2D mesh and the hypercube, and prominent interconnect technologies include packet-switched, photonic and wireless interconnects.
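As a concrete illustration of one of these topologies, the sketch below (a generic model, not any vendor's routing API) represents a 2D mesh of tiles addressed by grid coordinates; with dimension-ordered (XY) routing, the minimum hop count between two tiles is simply their Manhattan distance.

```c
#include <stdio.h>
#include <stdlib.h>

/* A tile in an N x N 2D mesh, identified by its grid coordinates. */
struct tile { int x, y; };

#define MESH_DIM 8   /* illustrative 8 x 8 mesh (64 tiles) */

/* Minimum hops between two tiles: the Manhattan distance, i.e. the
 * path length taken by simple XY (dimension-ordered) routing. */
static int mesh_hops(struct tile a, struct tile b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

int main(void)
{
    struct tile src = { 0, 0 };                        /* one corner tile  */
    struct tile dst = { MESH_DIM - 1, MESH_DIM - 1 };  /* opposite corner  */

    printf("worst-case hops in a %dx%d mesh: %d\n",
           MESH_DIM, MESH_DIM, mesh_hops(src, dst));   /* 14 for 8 x 8 */
    return 0;
}
```

The worst-case distance grows only linearly with the mesh dimension, which is one reason mesh interconnects scale to far higher core counts than a single shared bus.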
4. Reduction Techniques
Due to an embedded system's constrained resources, embedded system architectural design must consider power dissipation reduction techniques. Power reduction techniques can be applied at various design levels: the complementary metal-oxide-semiconductor (CMOS) level targets leakage and short-circuit current reduction; the processor level targets instruction/data supply energy reduction as well as power-efficient management of other processor components (e.g., execution units, reorder buffers); and the interconnection network level targets minimizing interconnection length using an appropriate network layout. Power reduction techniques include leakage current reduction, short-circuit current reduction, peak power reduction and interconnection length reduction.

B. Hardware Assisted Middleware Approaches
Various HPEEC techniques are implemented as middleware and/or as part of an embedded OS to meet application requirements. The HPEEC middleware techniques below are assisted by hardware that provides the required functionality.
Figure 4: High Performance Energy Efficient Embedded Computing Domain

HPEEC hardware-assisted middleware techniques include dynamic voltage and frequency scaling (DVFS), the advanced configuration and power interface (ACPI), threading techniques (hyper-threading, helper threading and speculative threading), energy monitoring and management, dynamic thermal management, dependable HPEEC (DHPEEC) techniques (N-modular redundancy (NMR), dynamic constitution and proactive checkpoint deallocation), and various low-power gating techniques (power gating, per-core power gating, split power planes and clock gating) [2].
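As an example of how middleware can drive DVFS, the sketch below assumes a Linux target exposing the standard cpufreq sysfs interface with the userspace governor available; the file paths follow that interface, while the chosen frequency is illustrative and must match one of the platform's supported operating points.

```c
#include <stdio.h>

/* Write a string to a sysfs file; returns 0 on success, -1 on failure. */
static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fputs(value, f) >= 0);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    /* Hand frequency control of CPU0 to user space (requires root and a
     * kernel built with the userspace cpufreq governor). */
    if (write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "userspace") != 0) {
        perror("scaling_governor");
        return 1;
    }

    /* Request 800 MHz; the value is in kHz and must be one of the
     * frequencies listed in scaling_available_frequencies. */
    if (write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                    "800000") != 0) {
        perror("scaling_setspeed");
        return 1;
    }

    puts("cpu0 scaled down to 800 MHz");
    return 0;
}
```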
C. Software Approaches
The performance and power efficiency of an embedded platform depend not only on the built-in hardware techniques but also on the software's ability to effectively leverage the hardware support. Software-based HPEEC techniques assist dynamic power management (DPM) by signalling the hardware about the resource requirements of an application phase. Software approaches enable high performance by scheduling and migrating tasks statically or dynamically to meet application requirements. HPEEC software-based techniques include data forwarding, task scheduling, task migration and load balancing. A workload-aware load-unbalancing strategy reduces the mean waiting time of aperiodic tasks by 92 percent with similar power efficiency compared to a workload-unaware load-unbalancing strategy [3].
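A minimal sketch of software-directed task placement, assuming a Linux/glibc target where the non-portable pthread_setaffinity_np() extension is available: each worker thread is pinned to a chosen core, which is the same mechanism a task migration or load balancing policy would use to redistribute work as the load changes.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_WORKERS 4   /* illustrative: one worker per core */

/* Placeholder per-core workload. */
static void *worker(void *arg)
{
    long core = (long)arg;
    /* ... compute on the data assigned to this core ... */
    return (void *)core;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];

    for (long core = 0; core < NUM_WORKERS; core++) {
        /* Build a CPU mask containing only the target core. */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET((int)core, &mask);

        pthread_create(&threads[core], NULL, worker, (void *)core);

        /* Pin the worker to that core; a load balancer or task migration
         * policy would recompute and reapply this mask as the load changes. */
        pthread_setaffinity_np(threads[core], sizeof(mask), &mask);
    }

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);

    printf("%d workers pinned to cores 0-%d\n", NUM_WORKERS, NUM_WORKERS - 1);
    return 0;
}
```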

4. High Performance Energy Efficient Multicore Processors and Discussions

Silicon and chip vendors have developed various high performance multicore processors that leverage the HPEEC techniques discussed in this paper. Here we discuss some prominent multicore processors (summarized in Table 2) and focus on their HPEEC features.

Table 2: HPEEC Multicore Processors

Processor | Cores | Speed | Power | Performance
ARM11 MPCore | 1-4 | 620 MHz | 600 mW | 2600 DMIPS
ARM Cortex-A9 MPCore | 1-4 | 800 MHz - 2 GHz | 250 mW per CPU | 4000-10,000 DMIPS
MPC8572E PowerQUICC III | 2 | 1.2 - 1.5 GHz | 17.3 W @ 1.5 GHz | 6897 MIPS @ 1.5 GHz
Tilera TILEPro64 | 64 tiles | 700 - 866 MHz | 19-23 W @ 700 MHz | 443 GOPS
Tilera TILE-Gx | 16/36/64/100 tiles | 1 - 1.5 GHz | 10-55 W | 750 GOPS
Intel Sandy Bridge | 4 | 3.8 GHz | 35-45 W | 121.6 GFLOPS
NVIDIA GeForce GTX 460 | 336 CUDA cores | 1.3 GHz | 160 W | 748.8 GFLOPS
NVIDIA GeForce 9800 GX2 | 256 CUDA cores | 1.5 GHz | 197 W | 1152 GFLOPS
NVIDIA GeForce GTX 295 | 480 CUDA cores | 1.242 GHz | 289 W | 748.8 GFLOPS
NVIDIA Tesla C2050/C2070 | 448 CUDA cores | 1.15 GHz | 238 W | 1.03 TFLOPS
AMD FireStream 9270 | 800 stream cores | 750 MHz | 160 W | 1.2 TFLOPS
ATI Radeon HD 4870 X2 | 1600 stream cores | 750 MHz | 423 W | 2.4 TFLOPS

A. Tilera TILEPro64 and TILE-Gx
Tilera advances high performance multicore embedded computing by leveraging a tiled multicore architecture (e.g., the TILEPro64 and TILE-Gx processor families [9], [10]). The TILEPro64 and TILE-Gx processors offer 5.6 MB and 32 MB of on-chip cache, respectively, and implement Tilera's dynamic distributed cache (DDC) technology, which provides a 2x improvement on average in cache-coherence performance over traditional cache technologies using a cache coherence protocol.

B. Intel Xeon Processor
Intel leverages hafnium Hi-K metal gates in next-generation Xeon processors to achieve higher clock speeds and better performance per watt. The Xeon processors also implement hyper-threading and wide dynamic execution technologies for high performance. The wider execution pipelines enable each core to simultaneously fetch, dispatch, execute and retire up to four instructions per cycle [11]. Intel's deep power down technology enables both the cores and the L2 cache to be powered down when the processor is idle [12].

C. Graphics Processing Units
GPUs feature high memory bandwidth that is typically 10x that of contemporary CPUs. NVIDIA and AMD/ATI are the two main GPU vendors. NVIDIA's PowerMizer technology, available on all NVIDIA GPUs, is a DPM technique that adapts the GPU to suit an application's requirements [13].

5. Embedded Applications

Different embedded applications have different characteristics. We discuss below some of these application characteristics in the context of their associated embedded domains.

1. Throughput Intensive
Throughput-intensive embedded applications are applications that require high processing throughput. Networking and multimedia applications, which constitute a large fraction of embedded applications, are typically throughput intensive due to ever-increasing quality of service (QoS) demands. An embedded system containing an embedded processor requires a network stack and network protocols to connect with other devices. A telemedicine application, for example, requires processing of 5 million blocks per second [6].

2. Thermal Constrained
An embedded application is thermal constrained if an increase in temperature above a threshold could lead to incorrect results or even failure of the embedded system. Depending on the target market, embedded applications typically operate above 45 °C (e.g., telecommunication embedded equipment temperatures exceed 55 °C), in contrast to traditional computer systems, which normally operate below 38 °C.

3. Reliability Constrained
Embedded systems with high reliability constraints are typically required to operate for many years without errors and/or must recover from errors, since many reliability-constrained embedded systems are deployed in harsh environments where post-deployment removal and maintenance is infeasible. Hence, hardware and software for reliability-constrained embedded systems must be developed and tested more carefully than for traditional computer systems.

4. Real Time
In addition to correct functional operation, real-time embedded applications have additional stringent timing constraints, which impose real-time operational deadlines on the embedded system's response time. Although real-time operation does not strictly imply high performance, real-time embedded systems require high performance only to the point that the deadline is met, after which higher performance is no longer needed. Hence, real-time embedded systems require predictable high performance. Real-time operating systems (RTOSs) provide guarantees for meeting the stringent deadline requirements of embedded applications [7].
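The sketch below illustrates the kind of predictable, deadline-driven execution described above on a POSIX system: a job that must run every 10 ms sleeps until an absolute release time with clock_nanosleep(), so scheduling jitter in one period does not accumulate into later ones. The period and iteration count are illustrative; a hard real-time deployment would additionally rely on an RTOS or a real-time scheduling policy and priority assignment.

```c
#include <stdio.h>
#include <time.h>

#define PERIOD_NS 10000000L   /* 10 ms period (illustrative deadline) */

/* Advance a timespec by the task period, carrying into seconds. */
static void add_period(struct timespec *t)
{
    t->tv_nsec += PERIOD_NS;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int i = 0; i < 100; i++) {
        /* ... sample sensors / process one frame before the deadline ... */

        add_period(&next);
        /* Sleep until the absolute release time of the next job, so delays
         * in one period do not shift all later release times. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    puts("completed 100 periodic jobs");
    return 0;
}
```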

5. Parallel and Distributed
Parallel and distributed embedded applications leverage distributed embedded devices to cooperate and aggregate their functionalities or resources. Wireless sensor network (WSN) applications use sensor nodes to gather sensed information (statistics and data) and use distributed fault-detection algorithms. Mobile-agent (autonomous software agent) based distributed embedded applications allow the process state to be saved and transported to another embedded system, where the process resumes execution from the suspended point. Many embedded applications exhibit varying degrees (low to high levels) of parallelism, such as instruction-level parallelism (ILP) and thread-level parallelism (TLP). Innovative architectural and software HPEEC techniques are required to exploit an embedded application's available parallelism to achieve high performance with low power consumption.

6. Conclusion
Multicore chips are an important new trend in computer architecture, and several new multicore chips are in their design phases. This paper reviewed the HPEEC techniques that multicore processors use to achieve high performance with energy efficiency. HPEEC is an active and expanding research domain with applications ranging from consumer electronics to supercomputers. The HPEEC techniques surveyed here address both high performance and energy efficiency in multicore embedded computing.

7. Future Scope

Although power is a first-order constraint in HPEEC platforms, several additional challenges face the HPEEC domain:

1. Complex Design Space: The design space is large due to the various core types and each core's tunable parameters (e.g., instruction window size, issue width, fetch gating).
2. High On-Chip Bandwidth: Increased communication due to the growing number of cores requires high-bandwidth on-chip interconnects.
3. Synchronization: Synchronization primitives (e.g., locks, barriers) serialize parts of a program, degrading performance.
4. Shared Memory Bottleneck: Threads running on different cores make a large number of accesses to various shared-memory data partitions.
5. Cache Coherence: Heterogeneous cores with different cache line sizes require cache coherence protocols to be redesigned, and synchronization primitives (e.g., semaphores, locks) increase cache coherence traffic [4].
6. Cache Thrashing: Threads working concurrently evict each other's data out of the shared cache to bring in their own data.

References

[1] A. Shacham, K. Bergman, and L. Carloni, "Photonic Networks-on-Chip for Future Generations of Chip Multiprocessors," IEEE Trans. Computers, vol. 57, no. 9, pp. 1246-1260, Sept. 2008.
[2] R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. Cameron, "PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications," IEEE Trans. Parallel and Distributed Systems, vol. 21, no. 5, pp. 658-671, May 2010.
[3] H. Jeon, W. Lee, and S. Chung, "Load Unbalancing Strategy for Multi-Core Embedded Processors," IEEE Trans. Computers, vol. 59, no. 10, pp. 1434-1440, Oct. 2010.
[4] T. Berg, "Maintaining I/O Data Coherence in Embedded Multicore Systems," IEEE Micro, vol. 29, no. 3, pp. 10-19, May/June 2009.
[5] B. Schauer, "Multicore Processors - A Necessity," ProQuest Discovery Guides, Sept. 2008.
[6] G. Kornaros, Multi-core Embedded Systems. Taylor and Francis Group, CRC Press, 2010.
[7] K. V. K. K. Prasad, Embedded/Real-Time Systems: Concepts, Design & Programming, Black Book.
[8] F. Vahid and T. Givargis, Embedded System Design.
[9] Tilera, "Manycore without Boundaries: TILEPro64 Processor," http://www.tilera.com/products/processrs/TILEPRO64, June 2011.
[10] Tilera, "Manycore without Boundaries: TILE-Gx Processor Family," http://www.tilera.com/products/processors/TILE-Gx_Family, June 2011.
[11] Intel, "High-Performance Energy-Efficient Processors for Embedded Market Segments," http://www.intel.com/design/embedded/downloads/315336.pdf, June 2011.
[12] Intel, "Intel Core 2 Duo Processor Maximizing Dual-Core Performance Efficiency," ftp://download.intel.com/products/processor/core2duo/mobile_prod_brief.pdf, June 2011.
[13] NVIDIA, "NVIDIA PowerMizer Technology," http://www.nvidia.com/object/feature_powermizer.html, June 2011.

Archana S. Shinde received her B.E. degree from CWIT, University of Pune, in 2012. She is currently pursuing an M.E. in VLSI and Embedded Systems at P.C.C.O.E., Pune University.

Swapnali B. Bhosale received her B.E. degree from Dr. BAMU, Aurangabad, in 2012. She is currently pursuing an M.E. in VLSI and Embedded Systems at P.C.C.O.E., Pune University.

Prof. Ashok Suryawanshi is an assistant professor of electronics and telecommunication engineering at Pimpri Chinchwad College of Engineering, University of Pune, India. He received his B.E. degree from Shivaji University and his M.E. degree from M.S. University. His areas of interest include electronic devices and networks, and power electronics/industrial electronics.