
Embedded Systems Design - January February 2011

EMBEDDED SYSTEMS DESIGN VOLUME 24, NUMBER 1 JANUARY / FEBRUARY 2011 Demystifying constructors 9 Benchmarking that works 28 Trends in power management 33 solving the USB puzzle. Pre-configured USB packages, with running sample projects, are available for most popular microcontroller architectures and development boards.

Uploaded by

Godapati Pavan
Copyright
© Attribution Non-Commercial (BY-NC)

VOLUME 24, NUMBER 1 JANUARY/ FEBRUARY 2011

EMBEDDED SYSTEMS DESIGN


The Official Publication of The Embedded Systems Conferences and Embedded.com

Modeling an SoC with SystemC 14
Demystifying constructors 9
Benchmarking that works 28
Trends in power management 33


COLUMNS

programming pointers
Demystifying constructors 9
BY DAN SAKS
Even the experienced C++ programmer can be confused about what exactly constructors do and when they get called.

break points
Power management, 2011 33
BY JACK G. GANSSLE
From Microchip's eXtreme Low Power to TI's OMAP, new chips contain some interesting and complex power management techniques.

Cover Feature:
Using SystemC to build a system-on-chip platform 14
BY JAMES ALDIS
How Texas Instruments designers used the SystemC hardware design language to do performance modeling when creating both the company's OMAP-2 platform and the devices based on it.

CoreMark: A realistic way to benchmark CPU performance 28
BY SHAY GAL-ON AND MARKUS LEVY
EEMBC's CPU benchmark maximizes simplicity and efficacy.

DEPARTMENTS

#include
C to silicon. Really?
BY RON WILSON
What ever became of the idea that we could define an embedded system in C, push the Compile button, and watch the tool spit out a complete hardware and software system design?

parity bit

ESC Silicon Valley
May 2–5, 2011 San Jose, CA http://esc-sv.techinsightsevents.com/

ESC Chicago
June 6–8, 2011 Chicago, IL http://esc-chicago.techinsightsevents.com/

ESC India
July 20–22, 2011 Bangalore, India www.esc-india.com/

ESC Boston
July 20–22, 2011 Boston, MA http://esc-boston.techinsightsevents.com/

EMBEDDED SYSTEMS DESIGN (ISSN 1558-2493) print; (ISSN 1558-2507 PDF-electronic) is published 10 times a year as follows: Jan/Feb, March, April, May, June, July/August, Sept., Oct., Nov., Dec. by the EE Times Group, 600 Harrison Street, 5th floor, San Francisco, CA 94107, (415) 947-6000. Please direct advertising and editorial inquiries to this address. SUBSCRIPTION RATE for the United States is $55 for 10 issues. Canadian/Mexican orders must be accompanied by payment in U.S. funds with additional postage of $6 per year. All other foreign subscriptions must be prepaid in U.S. funds with additional postage of $15 per year for surface mail and $40 per year for airmail. POSTMASTER: Send all changes to EMBEDDED SYSTEMS DESIGN, EE Times/ESD, PO Box #3609, Northbrook, IL 60065-3257, [email protected]. For customer service, telephone toll-free (847) 559-7597. Please allow four to six weeks for change of address to take effect. Periodicals postage paid at San Francisco, CA and additional mailing offices. EMBEDDED SYSTEMS DESIGN is a registered trademark owned by the parent company, EE Times Group. All material published in EMBEDDED SYSTEMS DESIGN is copyright 2010 by EE Times Group. All rights reserved. Reproduction of material appearing in EMBEDDED SYSTEMS DESIGN is forbidden without permission.

ONLINE
www.embedded.com


#include

C to silicon. Really?

BY Ron Wilson

This month we lead off with an article on using SystemC to develop an SoC at Texas Instruments. The author describes a very specific use of SystemC in the tightly constrained context of the OMAP platform, and in the hands of an experienced IC design team. Yet the subject itself raises the question of an old promise: what ever became of the idea that we could define an embedded system in C, push the Compile button, and watch the tool spit out a complete hardware and software system design?

Maybe this question belongs in the same bin with "Where's my flying car?" and "What happened to my robot butler?" Some forecasts are overly optimistic. But the synthesis question is not that easy to dismiss. Many systems designers say that it is possible, with some care, to define the behavior of a system in an extended C dialect. The challenge lies in moving from a description of the behavior to a description of the implementation.

One problem is that most hardware designs in the real world contain large predefined hardware blocks, such as microcontrollers, DRAMs, or peripheral chips. We need to synthesize connections between, and ancillary blocks to, existing chips for which we may not have good models. Tricky.

We can simplify the problem, though. Suppose we write code for an existing embedded computing system, and we just want to separate out a few hot-spots in the code and compile them into a custom hardware accelerator. Now the picture is a little brighter. High-level synthesis tools are good at datapaths. And because the hardware in question is usually small, the target is generally an FPGA instead of an ASIC, so there is no need to call in a chip-design team. In fact there are EDA tools, both specialized accelerator-generators and more general C-to-RTL synthesis tools, intended for just such applications.

An evaluation last year by Berkeley Design Technology, Inc. (BDTI) demonstrated that such tools could in fact produce hardware designs about as good as a hand-crafted design. But the report had fine print, too. BDTI found that it was necessary to tune the high-level code for synthesis rather than for software compilation. And they determined that you still need an experienced FPGA designer to take the design from RTL to a finished FPGA.

So, yes: you can get a flying car, though it's not the Sci-Fi version. No butler either, but maybe you'd take a robot vacuum cleaner. And no pushbutton embedded hardware system from C. But it can be very much worth your while to learn about C-synthesis tools for generating hardware accelerators.

Ron Wilson, [email protected]

Ron Wilson is the director of content/media, EE Times Group Events and Embedded. You may reach him at [email protected].

EMBEDDED SYSTEMS DESIGN

Director of Content/Media, EE Times Group Events and Embedded: Ron Wilson, (415) 947-6317, [email protected]
Managing Editor: Susan Rambo, [email protected]
Acquisitions/Newsletter Editor, Embedded.com Site Editor: Bernard Cole, [email protected]
Contributing Editors: Michael Barr, John Canosa, Jack W. Crenshaw, Jack G. Ganssle, Dan Saks, Larry Mittag
Art Director: Debee Rommel, [email protected]
Production Director: Donna Ambrosino, [email protected]
Subscriptions/RSS Feeds/Newsletters: www.eetimes.com/electronics-subscriptions
Subscriptions Customer Service (Print): Embedded Systems Design, PO Box #3609, Northbrook, IL 60065-3257, [email protected], (847) 559-7597
Article Reprints, E-prints, and Permissions: Mike O'Brien, Wright's Reprints, (877) 652-5295 (toll free), (281) 419-5725 ext. 117, Fax: (281) 419-5712, www.wrightsreprints.com/reprints/index.cfm?magid=2210
Publisher: David Blaza, (415) 947-6929, [email protected]
Editorial Review Board: Michael Barr, Jack W. Crenshaw, Jack G. Ganssle, Bill Gatliff, Nigel Jones, Niall Murphy, Dan Saks, Miro Samek

Corporate: EE Times Group
Paul Miller, Chief Executive Officer
Felicia Hamerman, Group Marketing Director
Brent Pearson, Chief Information Officer
Jean-Marie Enjuto, Financial Director
Amandeep Sandhu, Manager Audience Engagement
Barbara Couchois, Vice President Sales Ops

Corporate: UBM LLC
Marie Myers, Senior Vice President, Manufacturing
Pat Nohilly, Senior Vice President, Strategic Development and Business Administration

www.embedded.com | embedded systems design | JANUARY/FEBRUARY 2011


parity bit

Not there yet

Dan Saks' Fibonacci example is not instructive because the class he is using is far too simple (Dan Saks, "Measuring instead of speculating," December 2010, p9, www.eetimes.com/discussion/programming-pointers/4211118/Measuring-instead-of-speculating). Drawing any conclusion from this is borderline meaningless. As soon as you start using the toys that come with C++ (inheritance, function overriding, and friends) you will likely see the picture change dramatically.

Of course, people don't have to use those extra C++ features, but it is often difficult to not use them. Firstly, the compiler will use many constructs by default. Secondly, programmers, being the experimental sort, will tend to use them for fun.

My main objection to C++ is that it is far more complex than C. Visibility into software, embedded in particular, is extremely challenging. C++ hides what is going on. In C, a simple statement like a = b; can only represent a binary copy with maybe a type change thrown in. I have seen C++ code where that resulted in 28 constructor calls + over 20 destructor calls.
eembedded_janitor

The polystate objects require the passing of a pointer. This pointer is used to address the data. But the monostates are also objects, even though they alias the same data; and they require the passing of a pointer, even though this pointer is completely useless and is merely a consequence of the typically convoluted object-oriented implementation. The difference then comes down to the difference between register/offset and direct address modes (or similar) in a handful of instructions. The monostate case is burdened with overheads that it cannot use.

In my non-inline C monostate implementation on an MSP430, the called code turned out to be the same as the C++ case, but the call was more expensive for C++ (to pass the useless pointer). In a C polystate implementation, the calling and called code was identical to the C++ code. This is not surprising.

The ARM processor, however, is problematic for this kind of test, because timing measurements are likely to be influenced by its pipelining and cache features, which will have subtle effects that are highly dependent on the test code. One would have to look carefully at the details.

The fact that the inline versions are the same with each compiler, and all better than the non-inline versions, simply shows that when the compiler is presented with all of the information at once, the optimizer is able to reduce everything to the same essentials. But to get that degree of benefit on any scale in a C(++) system, you would need to make extensive use of file includes and textual macros, which would be typically ugly and error-prone.

But this is not really the main game. As the article says, there is no straightforward standard way in C or C++ of specifying something as basic as a device's address. There is no comprehensive and precise system for declaring data representation. There is no real package mechanism. The type model is weak. The language is fragile. That is the main game.
willc2010

Medical mindfulness

I can't think of a more interesting field to be in today than biomedical electronics (Ron Wilson, "A medical matter," December 2010, p5, www.eetimes.com/discussion/-include/4211167/A-medical-matter). You can't call this field a specialty because of the broad skill set a person would require: for example, mathematics, DSP, analog/digital electronics, RF microwave, software development, wavelets, chemistry, biology, and power electronics.
Test_engineer

The more things change, the more they stay the same. Fifty years ago computer programmers had to figure out the language and critical domain knowledge for banking, space flight, etc. As software and electronics move into new territories, the different players have had to learn to work together. Sometimes the cost of failure has been low (MP3 player rebooting) and sometimes high (automatic braking, aeronautics, and medical). Just like all industries before, medical will go through the pain but get there in the end.
cdhmanning

We welcome your feedback. Letters to the editor may be edited. Send your comments to Ron Wilson at [email protected] or post a comment online, under the article you wish to discuss. We edit letters and posts for brevity.



programming pointers

Demystifying constructors

By Dan Saks

Even the experienced C++ programmer can be confused about what exactly constructors do and when they get called.

One of the easiest ways to misuse a structure object in C is to fail to initialize it properly. In C++, a class can have special member functions, called constructors, that provide guaranteed initialization for objects of that class type. The guarantee isn't absolute; you can subvert it using a cast. Nonetheless, using constructors can reduce the incidence of uninitialized objects.

While most C++ programmers use constructors frequently, I keep running into C++ programmers, even experienced ones, who seem to misunderstand how constructors really work. They're surprised, and somewhat dismayed, when a seemingly simple statement generates a flurry of constructor calls that they didn't expect.

Initialization is rarely optional. When it doesn't get done, subsequent operations often fail. However, initialization can be a problem when it happens at unexpected times, especially when the affected code is time-critical.

This month, I'll start to take some of the mystery out of when constructors execute and what it is that they actually do. As I often do, I'll explain the behavior of C++ by showing equivalent code in C. If you're a C programmer who doesn't use C++, I think you'll still find these insights helpful. C code that mimics the discipline imposed by C++ is often better code.

SHALLOW PARTS VS. DEEP PARTS

I'll begin by introducing a little terminology that should simplify the remaining discussion.

Consider an abstract type that implements a ring buffer of characters. A ring buffer is a first-in-first-out data structure. Data can be inserted at the buffer's back end and removed from the front end. The C++ definition for a ring buffer class might look, in part, like:

    class ring_buffer {
        ~~~
    private:
        char *base;
        size_t size;
        size_t head, tail;
    };

Member base represents an array that holds the buffered characters. Member size represents the number of elements in that array. Members head and tail are the indices of the elements at the buffer's front and back ends, respectively.

As I explained in a prior column, a class without base classes and virtual functions, and with all data members having the same accessibility (all public or all private), has essentially the same storage layout as a structure containing the same data members in the same order [1]. Thus, for example, the ring_buffer class above has the same storage layout as a C structure defined as:

    typedef struct ring_buffer ring_buffer;
    struct ring_buffer {
        char *base;
        size_t size;
        size_t head, tail;
    };
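The layout claim can be checked at compile time. The following is a minimal sketch, assuming a C++11 (or later) compiler; the names ring_buffer_c, ring_buffer_cpp, and layouts_match are hypothetical and chosen only so that both declarations can coexist in one translation unit. The size and alignment checks are expected to hold on mainstream compilers, but they illustrate the claim rather than prove it for every implementation.

```cpp
#include <cassert>      // for the assert shown in the usage note
#include <cstddef>
#include <type_traits>

// C-style structure with the data members from the column (hypothetical name).
struct ring_buffer_c {
    char *base;
    std::size_t size;
    std::size_t head, tail;
};

// C++ class with the same data members in the same order, all private
// (hypothetical name). It has no base classes and no virtual functions.
class ring_buffer_cpp {
public:
    ring_buffer_cpp() : base(nullptr), size(0), head(0), tail(0) {}
private:
    char *base;
    std::size_t size;
    std::size_t head, tail;
};

// Compile-time checks: the class is a standard-layout type, and it has the
// same size and alignment as the equivalent C structure.
static_assert(std::is_standard_layout<ring_buffer_cpp>::value,
              "class should be a standard-layout type");
static_assert(sizeof(ring_buffer_cpp) == sizeof(ring_buffer_c),
              "class and struct should have the same size");
static_assert(alignof(ring_buffer_cpp) == alignof(ring_buffer_c),
              "class and struct should have the same alignment");

// Run-time form of the same comparison (hypothetical helper).
bool layouts_match() {
    return sizeof(ring_buffer_cpp) == sizeof(ring_buffer_c)
        && alignof(ring_buffer_cpp) == alignof(ring_buffer_c);
}
```

If the file compiles, the static assertions have already passed; a caller could additionally write assert(layouts_match()); at run time.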

Dan Saks is president of Saks & Associates, a C/C++ training and consulting company. For more information about Dan Saks, visit his website at www.dansaks.com. Dan also welcomes your feedback: e-mail him at [email protected].

In truth, member base stores a pointer to the initial element of the array, not the array itself. The array is part of ring_buffer's implementation, but it's not one of the data members. The array occupies storage allocated separately from the ring_buffer object.

The ring_buffer is an example of a class with deep structure. A class has deep structure if it has at least one data member that refers to separately-allocated resources managed by the class. Classes with members that are pointers to dynamically-allocated memory are the most common classes with deep structure. However, a class with a member of any type that designates a separately-allocated managed resource, such as an integer designating a file, also has deep structure.

Obviously, not all classes have deep structure. For example, a class representing complex numbers typically has just two data members of some floating-point type, as in:

    class complex {
        ~~~
    private:
        double real, imaginary;
    };

Nothing in this class refers to resources beyond the data members. Such classes have shallow structure.

The shallow part of an object is the storage that contains the object's data members, as well as its base class sub-objects, vptr and padding, if any. (I briefly described base class sub-objects and vptrs in an earlier column [1].) The sizeof operator applied to a class object (or the class itself) yields the number of bytes in the shallow part of the object (or class). The deep part of an object is any storage used to represent the object's state beyond the shallow part. Objects with shallow structure have no deep part.

WHERE THE SHALLOW PARTS COME FROM

When you define an object in either C or C++, as in:

    ring_buffer rb;

the compiler generates code to allocate the object's shallow part. The initialization of the ring_buffer, including the allocation of its deep part, won't happen unless you write additional code. In C, you have to invoke the initialization code explicitly every time you define a ring_buffer. In C++, you can provide constructors for ring_buffer, which the compiler will use to initialize each ring_buffer automatically.

Before I explain where the shallow parts come from, I want to dispel a common misconception: With modern C++ compilers, a constructor doesn't allocate the shallow part of the object it constructs. Rather, the program allocates the shallow part using one of the usual run-time mechanisms for storage allocation. The constructor then initializes the shallow part, and in so doing, may allocate and initialize a deep part as well. (In some early C++ implementations, the constructor did memory allocation for the shallow part, but only for new-expressions. C++ has evolved so that such implementations are now extinct and can be found only in museums.)

Now, back to the allocation of the shallow parts. By "the usual run-time mechanisms for storage allocation" I mean whatever the compiler normally does depending on whether the object to be allocated has automatic, static, or dynamic storage duration [2]. These mechanisms are essentially the same in C++ as they are in C.

For an object with automatic storage duration (an automatic object), the shallow part will be allocated on the run-time stack. During optimization, the compiler may decide to place some automatic objects into CPU registers. It might even do this for a class object whose shallow part is small enough to fit into the available registers. However, it's easier to discuss automatic allocation if we don't belabor this detail and instead speak as if automatic objects are always placed in the stack.

If an automatic object is a function parameter, its storage will be allocated as the program evaluates function arguments prior to the call. If an automatic object is declared within a function body, its storage will be allocated upon entering the function.

For an object with static storage duration, the compiler, linker, and loader collaborate to place the object's shallow part in memory before the program starts running. From the running program's perspective, an object with static storage duration is always there. (In reality, the constructor doesn't run until run time. The new draft standard for C++ provides a new keyword constexpr, which will allow some constructors to run at compile time.)

In C++, objects with dynamic storage duration are those created by new-expressions. A new-expression allocates memory by calling a function named operator new or operator new []. I've described the behavior of these functions in previous columns [3, 4].

CONSTRUCTORS

In C++, a constructor is a special class member function that initializes objects of its class type. A constructor's function name is always the same as its class name. A class can have more than one constructor, each with a distinct parameter list, as in:

    class ring_buffer {
    public:
        ring_buffer();
        ring_buffer(size_t n);
        ~~~
    };
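As a rough illustration of the shallow/deep distinction and of the three storage durations discussed in this section, the following sketch fills in a ring_buffer with two constructors. It assumes a C++11 (or later) compiler. The destructor, the deleted copy operations, the capacity() accessor, the default capacity of 16, the object static_rb, and the helper shallow_size_is_fixed() are hypothetical additions made only to keep the example self-contained; they are not taken from the column.

```cpp
#include <cassert>   // for the assert shown in the usage note
#include <cstddef>

class ring_buffer {
public:
    ring_buffer();                    // default constructor
    ring_buffer(std::size_t n);       // constructor taking a capacity
    ~ring_buffer();                   // releases the deep part (assumption)
    ring_buffer(const ring_buffer &) = delete;             // copying is left
    ring_buffer &operator=(const ring_buffer &) = delete;  // out of this sketch
    std::size_t capacity() const { return size; }          // hypothetical accessor
private:
    enum { default_size = 16 };       // assumed default capacity
    char *base;                       // shallow part: pointer to the deep part
    std::size_t size;
    std::size_t head, tail;
};

// Each constructor initializes the shallow part and acquires the deep part.
ring_buffer::ring_buffer()
    : base(new char[default_size]), size(default_size), head(0), tail(0) {}

ring_buffer::ring_buffer(std::size_t n)
    : base(new char[n]), size(n), head(0), tail(0) {}

ring_buffer::~ring_buffer() { delete[] base; }

// Static storage duration: the shallow part exists before main starts;
// the constructor still runs at run time.
ring_buffer static_rb(64);

// Returns true if sizeof (the shallow part) is the same for all three
// objects even though their deep parts differ in size.
bool shallow_size_is_fixed() {
    ring_buffer auto_rb;                          // automatic storage duration
    ring_buffer *dyn_rb = new ring_buffer(1024);  // dynamic storage duration
    bool same = sizeof(auto_rb) == sizeof(*dyn_rb)
             && sizeof(static_rb) == sizeof(ring_buffer)
             && auto_rb.capacity() != dyn_rb->capacity();
    delete dyn_rb;                                // destroys the object, then frees it
    return same;
}
```

A caller could check the result with assert(shallow_size_is_fixed());. The point of the sketch is that sizeof never sees the array that each constructor allocates: a 16-character buffer and a 1024-character buffer have shallow parts of identical size.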
A constructor can't specify a return type. You don't write calls to constructors, so there's no opportunity to use the return value. Again, you write object definitions, and the compiler automatically generates constructor calls for you.

A constructor that requires no arguments is called a default constructor. Since the ring_buffer class defined above has a default constructor, you can write definitions for ring_buffer objects as just:

    ring_buffer rb;

When this definition appears at block or namespace scope, it automatically calls ring_buffer's default constructor. When this definition appears elsewhere, such as at class scope, it might use a constructor other than the default. I'll explore this complication in a later column.

When a class has no default constructor, every definition for an object of that class must specify arguments for one of the class's constructors. For ring_buffer, such a definition looks like:

    ring_buffer rb (32);

This definition automatically calls the constructor with a parameter of type size_t. The compiler will reject any object definition whose argument list doesn't match any constructor's parameter list, as in:

    ring_buffer rb ("xyzzy");

A constructor is like every other ordinary (nonstatic) member function in that it has an implicitly-declared parameter named this, which points to an object of the constructor's class. Whenever the program calls a constructor, the constructor's this parameter points to storage for an uninitialized object: the shallow part allocated by one of the usual run-time mechanisms. The constructor's job is to place appropriate initial values into the shallow part and, if there is a deep part, acquire and initialize it, too. For example, the ring_buffer(size_t) constructor might be defined as:

    ring_buffer::ring_buffer(size_t n) {
        base = new char [n];
        size = n;
        head = tail = 0;
    }

The new-expression acquires the ring_buffer's deep part. By default, it throws an exception if it fails. The rest of the constructor initializes the shallow part. A C function that does essentially the same job looks like:

    #include <stdlib.h>

    /* 'this' is an ordinary identifier in C, but a keyword in C++ */
    void rb_construct(ring_buffer *this, size_t n) {
        if ((this->base = (char *)malloc(n)) == NULL) {
            /* return or announce failure more overtly */
            return;
        }
        this->size = n;
        this->head = this->tail = 0;
    }

In C, you should probably call this function as soon as possible after the definition or statement that allocates the shallow part, as in:

    ring_buffer rb;
    rb_construct(&rb, 32);

WHERE CONSTRUCTORS GET CALLED

Again, whenever your program defines an object with a class type, the compiler automatically plants a call to the object's constructor at the appropriate place in the generated code. If you learn to anticipate where those places are, you're less likely to be surprised by the code the compiler generates. Among the places you're likely to see constructors called are:

- Definitions for objects of class type, or for arrays with elements of class type.
- New-expressions that create objects of class type, or arrays with elements of class type.
- Return statements that return class objects by value.
- Function calls with parameters of class type passed by value.
- Explicit type conversions (cast expressions).
- Any other expressions that create temporary objects of class type.
- Throwing an exception of class type.
- Catching an object of class type by value.

I'll look at some of these in detail in my next column.

ENDNOTES:
1. Saks, Dan. "Classes are structures, and then some," Embedded.com, July 2009. www.eetimes.com/discussion/programming-pointers/4027034/Classes-are-structures-and-then-some
2. Saks, Dan. "Storage class specifiers and storage duration," Embedded Systems Design, January 2008, p. 9. www.eetimes.com/discussion/programming-pointers/4026823/Storage-class-specifiers-and-storage-duration
3. Saks, Dan. "Allocating objects vs. allocating storage," Embedded Systems Design, September 2008, p. 11. www.eetimes.com/discussion/programming-pointers/4026897/Allocating-objects-vs-allocating-storage
4. Saks, Dan. "Allocating arrays," Embedded Systems Design, January 2009, p. 9. www.eetimes.com/discussion/programming-pointers/4026953/Allocating-arrays
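Some of the places listed under WHERE CONSTRUCTORS GET CALLED in the constructors column can be observed directly. The sketch below uses a hypothetical instrumented class, counted, whose constructors increment static counters; the class, the function by_value, and the three helper functions are illustrative names, not part of the column. It covers only cases whose counts are fixed by the language: array definitions, array new-expressions, and passing an existing object by value. Return statements and temporaries are left out on purpose, because compilers may elide those copies, so the observed counts can vary.

```cpp
#include <cassert>   // for the asserts shown in the usage note

// Hypothetical instrumented class: each constructor counts its own calls.
struct counted {
    static int default_calls;
    static int copy_calls;
    counted() { ++default_calls; }
    counted(const counted &) { ++copy_calls; }
};
int counted::default_calls = 0;
int counted::copy_calls = 0;

// A function with a parameter of class type passed by value.
void by_value(counted) {}

// Defining an array of class type runs the default constructor per element.
int calls_for_array_definition() {
    int before = counted::default_calls;
    counted a[3];
    (void)a;  // silence unused-variable warnings
    return counted::default_calls - before;
}

// An array new-expression also runs the default constructor per element.
int calls_for_new_array() {
    int before = counted::default_calls;
    counted *p = new counted[2];
    int n = counted::default_calls - before;
    delete[] p;
    return n;
}

// Passing an existing object by value copy-constructs the parameter.
int copies_for_pass_by_value() {
    counted c;
    int before = counted::copy_calls;
    by_value(c);
    return counted::copy_calls - before;
}
```

With these definitions, assert(calls_for_array_definition() == 3);, assert(calls_for_new_array() == 2);, and assert(copies_for_pass_by_value() == 1); should all hold. None of the three helper functions contains an explicit constructor call; every call counted here was planted by the compiler.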


cover feature
How Texas Instruments designers used the SystemC hardware design language to do performance modeling when creating both the company's OMAP-2 platform and the devices based on it.

Using SystemC to build a system-on-chip platform

BY JAMES ALDIS

As an embedded systems designer, you may find you're working more with hardware design languages and the system-on-chip (SoC). Perhaps you're building boards or systems using components, both of which often now deal with SoCs either in ASIC or FPGA form. How SoCs are modeled and simulated may feel like a prequel to your design, but it's a valuable story.
This article discusses the role of performance modeling in creating both the OMAP-2 platform and the devices based on it. The OMAP-2 is a platform from Texas Instruments for creating SoCs. The platform is underpinned by a basic set of rules and guidelines covering programming models, bus interfaces, and RTL (register transfer level) design.

The platform is highly generic. It's capable of supporting a wide range of functional and performance requirements, some of which may be unknown when the platform is created. Surprisingly, the same can be true for the specific devices, which are frequently openly programmable and expected to have a life extending beyond that of the products that drive their development. It is, however, generally true that device requirements are more precise than platform requirements.

Because the SoCs built from OMAP-2 are highly complex, it's not possible to analyze performance satisfactorily using static calculations such as in spreadsheets. Therefore, simulation is used. The requirements on the simulation technology are first and foremost ease in creating test cases and models and credibility of results. The emphasis on test-case creation is a consequence of the complexity of the devices and of the way in which an SoC platform such as OMAP-2 is used: because the whole motivation is to be able to move from marketing requirements to RTL freeze and tape-out in a very short time, and because in many cases large parts of the software will be written by the end customer and not by the SoC provider (Texas Instruments, in this article), the performance-area-power tradeoff of a proposed new SoC must be achieved without the aid of the software. Secondary requirements are simulation speed, visibility of results and behavior, modularity and reusability, and the ability to integrate legacy and third-party models.

We created a modeling technology based on:

SystemC.
Standard cycle-based modeling technology for bus interfaces taken from the Open Core Protocol International Partnership (OCP-IP).
Privately developed technology for test-case specification, module configuration, run-time control, and results extraction.

We used cycle-based interfaces throughout, because cycle accuracy is required in some areas and use of a single interface technology throughout the platform was essential. Cycle-accurate interfaces do not necessarily imply cycle-accurate functionality, and in general the OMAP-2 simulations can be described as timing-approximate. The aim is to move to public domain technology in all areas as soon as appropriate solutions become available. We never use this modeling technology for software development but independently create virtual SoC platforms for software development.

The challenges for the future lie in making this technology usable outside the core OMAP-2 architecture team and in being able to import models from third-party suppliers. Achievement of these goals is currently hampered by the lack of public standards for specifying test cases and configuring and controlling modules.

OMAP-2 OVERVIEW
The OMAP-2 platform provides the basic building blocks to create a general-purpose computer system-on-a-chip.1 It's designed for application and modem processors for mobile telephones. The principal shared characteristics of the modules to be found in the OMAP-2 platform (shown in Figure 1) are:

Bus interfaces. All OMAP-2 modules use the same protocol, namely OCP (Open Core Protocol).2
Power management functionality and interfaces.
Interrupt/direct memory access (DMA)-request interfaces.
Synthesis scripts assuring timing closure at common frequencies at common process nodes.
Programming models derived from a common base and common principles.
Security- and debug-related functionality.

Figure 1: OMAP2430 functional view. (Block diagram of the OMAP2430 and the devices around it. Labeled blocks include the ARM1136, the IVA 2 imaging video and audio accelerator, the 2D/3D graphics accelerator, the shared memory controller/DMA, the GPMC and SDRC memory interfaces, the display controller, the camera interface, and the high-speed USB2 OTG controller.)

One vital element of the OMAP-2 platform that is not shown in Figure 1 is the interconnect or NOC (network on chip) technology. It's absent because it's largely transparent to the user, whether software developer or hardware integrator. The NOC enables the processors and DMA controllers to access the memories and peripherals, using a common SoC-wide memory map. The NOC provides:

Address-based routing of bus requests.
Arbitration for concurrent access to shared memories.
Adaptation of OCP features between incompatible initiators and targets, for example bus-width conversion, conversion of burst types to those supported at the target, or conversion from single-request-multiple-data to multirequest-multiple-data bursts.
Clock-frequency conversion between modules running at different rates.
Programmable address-based connectivity control for enhanced security.
Detection, routing, and logging of error events.

Typically an OMAP device contains multiple levels of NOC, and different NOC technologies are available within OMAP-2, optimized for the different levels. Superficially these provide the same functionality but are very different in terms of performance and area. The connectivity between the principal processors and the principal memories is critical to system performance and is allowed to consume more silicon area and power than the paths to rarely used peripherals.

OCP and the modular multilevel NOC are cornerstones of the OMAP-2 platform. They permit the rapid creation of SoC products. The architects of the new product are able to select the processors and peripherals they desire and be confident that these are compatible with each other and that they can be connected as required. This is in some ways an intrinsically bottom-up process, with apparently little scope for optimization except through selection of the modules from the library, if the development timescales are to be held. One module, however, in every SoC is created specifically for that SoC: the NOC. By playing with the topology, the level of concurrency, and the level of pipelining in the NOC, it's possible to create SoCs from the same basic modules with quite different capabilities.

This approach to SoC creation puts product performance analysis in the critical path. The product architects are able to fashion an SoC rapidly from existing material and to know immediately how big it will be, how fast it will run, and (to a first approximation) how much power it will consume; but they must also know whether it meets its performance requirements. For this, architecture-level simulation is used, based on transaction-level modeling (TLM) concepts and the SystemC language.3 This simulation capability is also a part of the OMAP-2 platform. The basic requirement on it is to be able to provide feedback on questions of product performance in the timescales for definition of an OMAP-2-based SoC, in the project phase before development resources are allocated and development starts. During this definition phase, the SoC architecture is not stable and the performance analysis technology must live with this fact.

PERFORMANCE MODELING
The OMAP-2 performance modeling technology is used for the following:

Support of OMAP-2 platform development and maintenance.
Support of SoC product definition: validation of the SoC's performance before RTL development starts.
Validation of details of SoC implementation, in particular the NOC configurations, during development.
Provision of reference performance data to RTL and silicon validation teams.
Response to queries from marketing and customers when new applications of an existing SoC design are proposed.
Support of customers wishing to optimize the implementation of their application on the SoC (which DMA is better to use, what size of burst should be used, what are the best arbitration options).

MODULES
Figure 2 shows a simplified representation of the top level of an OMAP-2 SoC performance model. All the boxes are SystemC modules connected by SystemC channels. The modules fall into a small number of different categories:

Subsystems, which are just hierarchical divisions and contain further modules of the same sorts, connected in the same way. The hierarchy in general matches the hardware hierarchy of the SoC. Typically a subsystem is composed of one or more processors and one or more DMA controllers or traffic generators.

Figure 2: Generic example of SoC architecture model top-level assembly. (Blocks shown: MPU subsystem, DMA, display controller, camera controller, USB-HS, image-processing subsystem, a network-on-chip with typically 20 attached modules, SDRAM/DDR subsystem, flash controller, on-chip memory, and a peripheral bus with typically 80 attached modules such as timers, UARTs, and configuration. The legend distinguishes four kinds of model: fully cycle- and functional-accurate; cycle-accurate but missing functionality such as memory, registers, external connections, or signal processing; cycle-accurate CPU representation, either stochastic or trace-driven; and generic traffic generator, maybe with real-time requirements.)

Processors: Three different styles of processor model are used:

Stochastic, in which the processor generates random instructions, pretends to fetch them, then executes them. External memory accesses for fetch, load, and store are filtered by stochastic cache models, in which the decision whether an access hits or misses is made through comparison of a random number with a cache miss ratio parameter. The power of such models is that with a small number of parameters, representative bus activity can be created, even of the most complex software. Cache miss ratio and code profiling statistics are available for many classes of software, and so a large range of tests (such as protocol stack, signal processing, high-level operating system with user applications, games) can be run without significant software development or porting effort. Furthermore, because no actual software is run, the parameters can be slightly degraded to test the sensitivity of the SoC to potential software variations. The models provide estimates of processor MIPS and simulated NOC and memory traffic. Such models have been developed for RISC and DSP processors, Harvard and unified-memory, with L1 and L2 caches. Although they do not implement the function of processors, they may be said to be cycle-accurate. CPU and cache pipelines are modeled correctly, write buffers are implemented, and so on.

Trace-driven. Where the performance of the processor for a specific piece of software is the primary consideration, a more detailed model is required that takes into consideration not only the statistics of the software but also the order of instructions executed. To achieve this, cycle-accurate processor and cache models are available, which replay a trace of the software execution rather than executing it afresh. The advantage of this is that software from another platform, for example a previous generation OMAP, may be tested without first being ported. Software porting, especially where OSes are involved, is a major task and is not attempted during the product-definition phase. Furthermore, the use of traces, which include the effects of user I/O, provides repeatability that is hard to achieve if the software is actually executed, for example in a game environment.

Instruction-set simulators. Although used less than the other processor types, it's possible to instantiate a cycle-accurate instruction-set simulator (ISS) in the OMAP-2 performance simulation. This is at the moment restricted to DSPs. Such processor models are heavily used in DSP software optimization, and instantiation in the SoC model allows the effects of the overall system (for example, increased latency caused by congestion on external memory) to be taken into consideration. The SoC model is not in general safe to use for the processor: there is no guarantee that memory exists where it should or that memory will not be overwritten by some random interference traffic, so the processor I/O is usually taken from the host filesystem and the external memory is fully cached inside the processor model.

There is no requirement for an ISS model of the main RISC processor of the OMAP SoC. The cost of implementing a SystemC SoC model capable of supporting the interesting applications is too high, even before the cost of software porting and maintenance is considered. Configuration of the SoC, a task done by the RISC CPU in reality, is more easily accomplished in performance simulation by direct configuration of the modules.

Memory controllers and memories: The memory controllers are modeled in a fully cycle-accurate way. However, they aren't normally connected to memories. Although a read operation will return data at the correct time, it will not be the correct data.
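The hit-or-miss decision of the stochastic cache model described above for the stochastic processor style can be pictured with a short sketch. This is not TI's model: the class name stochastic_cache, its constructor parameters, and the choice of std::mt19937 as the random source are assumptions made for illustration. Only the idea of comparing a random number with a miss-ratio parameter comes from the text.

```cpp
#include <cstdint>
#include <random>

// Hypothetical stochastic cache filter. It stores no tags and no data;
// each access is declared a hit or a miss by comparing a random number
// with the configured miss ratio.
class stochastic_cache {
public:
    stochastic_cache(double miss_ratio, std::uint32_t seed)
        : miss_ratio_(miss_ratio), rng_(seed), uniform_(0.0, 1.0) {}

    // Returns true if this access should be forwarded to external memory.
    bool access_misses() {
        ++accesses_;
        const bool miss = uniform_(rng_) < miss_ratio_;
        if (miss) {
            ++misses_;
        }
        return miss;
    }

    std::uint64_t accesses() const { return accesses_; }
    std::uint64_t misses() const { return misses_; }

private:
    double miss_ratio_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> uniform_;
    std::uint64_t accesses_ = 0;
    std::uint64_t misses_ = 0;
};
```

With a fixed seed the sequence of hits and misses is repeatable, and degrading the miss ratio slightly is a one-parameter change, which matches the sensitivity testing the text mentions.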
In general, the whole OMAP-2 performance-simulation platform can be described as dataless.

DMAs and peripherals: Similarly to the memory controllers, DMAs and peripherals are modeled cycle-accurately, but certain aspects of their functionality are not implemented. In the case of a DMA, it is the run-time programmability that is not present. Whereas in reality a processor writes to DMA registers to provoke a transfer, in the simulation the transfer is simply requested from the DMA model through a SystemC interface. This may be done at elaboration time, optionally with a delayed start, or at any time during the simulation by some other process. Peripherals of interest to the performance simulation include serial-port controllers, cryptographic accelerators, and so on. A serial-port controller model would be cycle-accurate on its bus and DMA/CPU interfaces, but the serial data would not exist. Likewise a cryptographic accelerator would not encrypt the data given it, but it would act as though it had, making dummy data available at the correct time. In both cases any configuration (such as baud rate) would be done using a high-level SystemC interface and not by writing to simulated registers.

Generic traffic generators: Many of the main bandwidth consumers in an OMAP SoC have relatively simple and repetitive traffic patterns. The best examples of this are the display and camera controllers. In the OMAP-2 performance model such things are represented by simple traffic generators. These generators have a range of addressing modes and traffic types (such as burst/non-burst and SRMD). They generate traffic at a constant rate (with optional jitter) and may have real-time requirements and internal pipelining limitations. By combining several such generators, relatively complex traffic flows may be created. They may also be configured to behave in a highly randomized way, to create a sort of background load in the system.

NOCs: The networks-on-chip are the only fully cycle- and functional-accurate parts of the SoC model. The NOC technology used in OMAP-2 is based on generation of an NOC from a configuration file, which contains details of all the initiators and targets and the desired topology of the NOC. It is possible to generate both RTL code and SystemC code from the same input.
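Address-based routing, the first of the NOC functions listed earlier, amounts to looking up each request address in the SoC-wide memory map. The sketch below shows the idea in plain C++ with no SystemC dependency. The names target_region and address_router are assumptions for this example, and it says nothing about how the generated OMAP-2 NOC actually decodes addresses.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One target in a hypothetical SoC-wide memory map.
struct target_region {
    std::string name;
    std::uint32_t base;   // first address of the region
    std::uint32_t size;   // length in bytes
};

// Hypothetical address decoder: maps a request address to a target.
class address_router {
public:
    explicit address_router(std::vector<target_region> regions)
        : regions_(std::move(regions)) {}

    // Returns the region containing addr, or nullptr if the address is
    // unmapped (the case in which a NOC would detect and log an error).
    const target_region *route(std::uint32_t addr) const {
        for (std::size_t i = 0; i < regions_.size(); ++i) {
            const target_region &r = regions_[i];
            // Comparing the offset with the size avoids overflow in base + size.
            if (addr >= r.base && addr - r.base < r.size) {
                return &r;
            }
        }
        return nullptr;
    }

private:
    std::vector<target_region> regions_;
};
```

A linear search is enough for a sketch with a handful of targets; any memory-map values passed in are the caller's own, since none are given in the article.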

INTERFACES AND CHANNELS The modules just described all support the same basic set of SystemC interfaces. A small set of SystemC channels is used to connect them together.

OCP TL1: The OCP-IP has proposed a method for SystemC modeling of OCP interfaces.3 Documentation and SystemC code (interfaces, channels, and data types) are available. The proposals cover a wide range of abstraction levels: TL1 being cycle-accurate, TL2 being protocol-specific with approximate timing, and so on. In the OMAP performance model the OCP TL1 technology is used exclusively. It can be argued that many of the simulations do not require cycle-accuracy, and certainly many of the traffic generators or peripheral models are not in any way cycle-accurate in their functionality. However, the advantages of having a single interface and a single channel to deal with outweigh the simulation speed gains that might be available in a mixed TL1/TL2 simulation platform. The OCP TL1 channel includes a monitor interface, and a simple monitor that dumps a trace to a text file is available. A TI-developed statistics-gathering monitor is also used in the OMAP simulations to enable bandwidths and latencies to be extracted as simulation outputs. Any OCP interface in the SoC may be monitored in this way. OCP is a synchronous protocol. A clock is associated with every point-to-point OCP connection. In the SystemC model, the synchronisation is accomplished using sc_clock() objects. All modules with OCP ports also have sc_clock input ports.

Interrupts and DMA requests: TI has developed a simple TL1 interface for DMA requests and interrupts, and a SystemC channel for connecting interrupt generators to interrupt consumers. The main point of interest in this technology is that a single channel is instantiated in the whole SoC simulation. This allows the routing of interrupts and DMA requests to be done at run time, based on a configuration file, rather than being hard-wired as in reality.

Listing 1
# part of a map file for a stochastic CPU model
mpu.cpu_type s:arm1136
mpu.cpu_ocp_clock_ratio i:3
mpu.data_addr_range x:00300000
mpu.data_base_addr x:05000000
mpu.inst_base_addr x:07000000
mpu.inst_miss_ratio f:0.02
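The entries in a map file such as Listing 1 pair a parameter name with a value that carries a one-letter type prefix. A minimal sketch of how such a file might be read into a map of strings and then decoded is given below. The type name config_map is borrowed from Listing 2, but its definition here is an assumption, as are the function names, the treatment of lines starting with # as comments, and the meaning of the letters s, i, x, and f (string, decimal integer, hexadecimal, floating point), which is inferred from Listing 1 only.

```cpp
#include <cstdint>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Assumed definition: parameter name -> "<type letter>:<value>".
typedef std::map<std::string, std::string> config_map;

// Reads "name type:value" lines. Blank lines and lines whose first token
// starts with '#' are skipped. Values containing spaces are not supported.
config_map read_config_map(std::istream &in) {
    config_map cm;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name, typed_value;
        if (!(fields >> name) || name[0] == '#') continue;
        if (!(fields >> typed_value)) {
            throw std::runtime_error("missing value for " + name);
        }
        cm[name] = typed_value;
    }
    return cm;
}

// Returns the text after "<type>:"; throws if the parameter is absent
// or was stored with a different type letter.
std::string typed_payload(const config_map &cm, const std::string &name, char type) {
    config_map::const_iterator it = cm.find(name);
    if (it == cm.end()) throw std::runtime_error("unknown parameter " + name);
    const std::string &v = it->second;
    if (v.size() < 2 || v[0] != type || v[1] != ':') {
        throw std::runtime_error("wrong type for " + name);
    }
    return v.substr(2);
}

std::string get_string(const config_map &cm, const std::string &name) {
    return typed_payload(cm, name, 's');
}

long get_int(const config_map &cm, const std::string &name) {
    return std::strtol(typed_payload(cm, name, 'i').c_str(), nullptr, 10);
}

std::uint32_t get_hex(const config_map &cm, const std::string &name) {
    return static_cast<std::uint32_t>(
        std::strtoul(typed_payload(cm, name, 'x').c_str(), nullptr, 16));
}

double get_float(const config_map &cm, const std::string &name) {
    return std::strtod(typed_payload(cm, name, 'f').c_str(), nullptr);
}
```

Keeping the type letter inside the stored string means one map type can carry every kind of parameter, at the cost of a check on each read.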

Static configuration: All the modules and channels in an OMAP SoC performance model support an elaboration-time/end-of-elaboration configuration procedure. This is used for:

Providing hardware parameters to generic modules, for example cache sizes, filenames for trace or executable binaries, FIFO depths, bus widths, and clock frequencies.
Providing modules with configuration information that would in the hardware be provided through register writes, for example baud rates, arbitration parameters, and FIFO trigger thresholds.
Providing behavioral parameters to autonomous initiators, for example cache miss ratios, display refresh rates and screen size, and DMA transfer parameters.

Each module or channel has a set of parameters that may be set, and the parameter values are passed to it in the form of an STL map, templated with a pair of STL strings. The first string is the parameter name and the second includes a letter to indicate the type and then the parameter value. Such maps may be read from text files created by a user in a text editor. For example, see Listing 1.

Run-time control: Most of the simulations that run on the OMAP performance model need only static configuration of the initiators in order to produce the desired behavior. However, in such simulations, initiators do not interact. A process completing on the DSP cannot trigger the start of a DMA transfer, for example. In order to address this limitation, the modules (mainly the initiators) also support a second interface, which allows dynamic control of their behavior during the simulation. Basically, a secondary pure-functional simulation runs: it may start tasks on the OMAP initiators and is informed when these tasks complete. Preemption of tasks is possible, so the secondary simulation can model multiple tasks using the same CPU under control of an RTOS. This is illustrated in Figure 3.

The top half of Figure 3 shows a simplified representation of a video conference application. It's purely functional, simply a chain of functions that have to be completed, each one triggering others. A complex application like this is hierarchical. In implementation, each function is mapped to some hardware, for example a DMA controller or a CPU. The black arrows show such a mapping. In our simulation technology, two separate simulations run within the same SystemC sc_main(). The pure-functional simulation provides the interactions between functions. For example, it waits until the MPEG compression is complete before starting a DMA transfer of the compressed data to mass storage. The other simulation is the OMAP performance model as described earlier. Each function in the pure-functional simulation includes a set of parameters for one of the OMAP modules, on which it will be executed. The two simulations are linked by a set of schedulers, which allow multiple functions to be active at the same time even if they share the same CPU. For example, several functions can run in turn on a stochastic CPU, each with its own cache

miss ratios and a number of instructions to execute before it is done. The interface implemented by the OMAP-2 initiator models to enable this dynamic control is shown in Listing 2. Use cases (applications) defined in this way can also be executed standalone and can be reused from a model of one SoC to a model of another, without requiring the same function-to-hardware mapping.

Figure 3: Pure-functional simulation driving OMAP performance simulation. (Top half: the functions of a video-conference use case, including image capture from sensor, color space conversion, rotation and scaling, writing to and swapping the idle and active screen buffers, screen refresh, MPEG compression broken down into macroblock fetch, diff, DCT, IQ, VLC, IDCT, and reference-macroblock and encoded-frame stores, MPEG decompression of new images from the network, encapsulation and sending to the network, writing to a memory card, and a 66-ms wait. Generic schedulers link these functions to the bottom half, which repeats the SoC model blocks and legend of Figure 2.)
Listing 2 The interface implemented by the OMAP-2 initiator models for dynamic control.
class dynamic_configuration_if
{
public:
    // start a task on a processor
    virtual bool start_task(sc_event &done, int task_id, const config_map &cm)
    { return(false); }

    // ask a processor to stop executing a task
    // so that a new one can be started
    virtual bool preempt_task(sc_event &ready, int task_id)
    { return(false); }

    // find out the state of a running task
    virtual bool get_task_status(int task_id, config_map &cm)
    { return(false); }
};
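The way the pure-functional simulation sequences functions on shared hardware can be pictured with a toy model in plain C++. It does not use SystemC or the interface of Listing 2, and it is not the TI scheduler: the names use_case_task and run_use_case are assumptions, each duration is a fixed number standing in for what the performance model would report, and preemption is left out to keep the sketch short.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One function of a use case, mapped to a hardware resource (hypothetical).
struct use_case_task {
    std::string name;
    std::size_t resource;            // index of the CPU, DMA, etc. that runs it
    unsigned duration;               // time units; stand-in for a simulated result
    std::vector<std::size_t> deps;   // indices of tasks that must complete first
};

// Returns the finish time of every task. Tasks must be listed so that each
// dependency appears before its dependent. Tasks that share a resource run
// one after another in list order, without preemption.
std::vector<unsigned> run_use_case(const std::vector<use_case_task> &tasks,
                                   std::size_t resource_count) {
    std::vector<unsigned> finish(tasks.size(), 0);
    std::vector<unsigned> resource_free(resource_count, 0);
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        const use_case_task &t = tasks[i];
        if (t.resource >= resource_count) {
            throw std::invalid_argument("unknown resource for " + t.name);
        }
        // A task starts when its resource is free and all triggers have fired.
        unsigned start = resource_free[t.resource];
        for (std::size_t k = 0; k < t.deps.size(); ++k) {
            const std::size_t d = t.deps[k];
            if (d >= i) {
                throw std::invalid_argument("dependency out of order for " + t.name);
            }
            start = std::max(start, finish[d]);
        }
        finish[i] = start + t.duration;
        resource_free[t.resource] = finish[i];
    }
    return finish;
}
```

For instance, a DMA transfer that depends on MPEG compression cannot start before the compression finishes, and two functions mapped to the same processor are serialized even when their inputs are ready at the same time.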

OTHER OMAP-2 SIMULATION PLATFORMS
The architecture-level SoC performance model described here is not the only simulation model of an OMAP-2 SoC that is created. All of the following models are available:

SystemC architecture-level performance model.
Virtual platform model for software development.
RTL simulator, including the options to substitute fast ISSs or simple traffic generators for the processors.
FPGA model.

The different models serve different purposes, require different levels of effort to use, and become available at different times during the project. The SystemC performance model is always available first and is always the simplest to create and use. The virtual platform is the next to become available. It is used

for software development and has very little timing accuracy. TI uses Virtio technology to create this model rather than SystemC.5 The lack of accurate timing in this model means that low-level software has to be validated on another platform, which is the motivation behind the FPGA model. The FPGA model can also be used for performance investigations. It complements the SystemC model, being much less flexible and requiring software but having a degree of completeness and accuracy that is not attempted in SystemC. RTL simulations are in general too slow for either software development or performance investigations but are the final reference in cases of doubt and have the advantage of complete visibility into the SoC behavior.

It would appear that the choice of two different technologies for the virtual platform and the performance model is inefficient, wasting potential code reuse. However, the two have completely different (almost fully orthogonal) requirements, and at module level almost no code reuse is possible. This is illustrated in Figures 4 and 5.

Figure 4 shows a breakdown of a module into different aspects, whose importance varies depending on the level of abstraction. This example is an OCP slave, a peripheral of some kind, with a register file, some functionality triggered by writes to the registers, and some timing associated with execution of the function. Furthermore, the peripheral has a bus interface that is compliant with some protocol, in this case OCP. A complete model of the peripheral would implement all this, and a generic model architecture following this breakdown has often been proposed in the ESL industry.

Figure 4: Aspects of a typical module to be modeled. (Labels: registers, functionality, timing, and bus interface, together with a transactor, an OCP-TL1 sc_port, a PV sc_export, an interrupt sc_port, and callbacks.)

Figure 5 shows that the model needed for the SoC architect's performance analysis is completely orthogonal to that needed by the software developer's virtual platform. On the left, we see that the architect needs the timing and the bus interface. The architect is concerned that the parameters of the bus interface are correctly chosen (such that the function can be implemented) and needs to be


able to look at the cycle-by-cycle behaviour on that interface. But the functionality is of no interest to the architect. Suppose this is a cryptographic accelerator: it doesn't matter to the architect whether the encryption is done properly or not. In fact, it may be a hindrance if apparently random data is visible on the bus, making it much more difficult to correlate input and output. Furthermore, the architect will want to be able to trigger the function without writing into the registers via the bus, because that would require writing or modifying software, a time-consuming activity that may not even be possible.

On the right, by contrast, we see that in the virtual platform only the encryption function and the registers are important. The software engineer does not care at all about the bus interface, and the virtual platform generally discards timing to improve simulation speed.

Figure 5: Architect's view and programmer's view of the module. (Architect's view: bus and timing, with the registers replaced by a direct external control interface. Programmer's view: registers and functionality, with the bus interface and timing marked as not wanted.)

For NOCs and memory controllers, the difference between the architect's view and the programmer's view is yet starker. These modules are the central core of the architect's model, but do not even exist in the programmer's view, except maybe as a few register stubs, their

Example results, from simulations of an ensemble or 144 video-conference scenarios.


sms long-term mean

LCD long-term mean

Mbytes/s

LCD 10-s maximum

Non-LCD 10-s maximum

20 Figure 6

40

60

80

100

120

140

26

JANUARY/FEBRUARY 2011 | embedded systems design | www.embedded.com

cover feature
main functionality being fully softwaretransparent. USAGE EXAMPLE OF MODEL The performance model exists because a generic and rapid-deployment performance model is essential for a platformbased SoC factory, as weve discussed in this article. But more than this, certain things can not be achieved without this kind of technology. Here is an example in which the performance limits of an OMAP-2 SoC have been probed using the model. The use case is a videoconference. This is easy to say, but when it comes to the details many choices need to be made. Development of software able to implement all these choices as run-time or even compile-time options is practically impossible. On FPGA or silicon each videoconference to be analysed requires development resources. On the SystemC performance model, on the other hand, a regression of 144 videoconferences has been created that can be applied to any OMAP application processor. The results of this are available to OMAP marketing for understanding the limits of each platform in advance of specific customer queries. Figure 6 shows the results. Some of the parameters that are varied in order to create the 144 scenarios include: The figure shows some of the results generated. These are bandwidths measured on the external memories. Similar bar charts exist for latencies, CPU occupancies, FIFO occupancies for hard-realtime functions, and so on. THE NEED FOR STANDARDS In this article, Ive described the OMAP-2 platform and its SystemC-based performance-modeling infrastructure. This infrastructure is one of the technologies essential if real benefits are to be drawn from a platform-based SoC methodology to a SystemC-for-everything worldview. Its not expected that the OMAP performance model will be used to generate RTL. Rather, its expected that the same tool will generate the top-level RTL and the performance model. 
Work on the modeling platform is continuing but in some areas there is a strong desire for public standards, to replace the ad-hoc technology developed within TI, making it cleaner, available to third-party suppliers, and supportable by EDA vendors. Also there needs to be widespread agreement on the types of model that are required. So far, it seems that the types of model used in OMAP-2 performance modeling have not been widely proposed.
James Aldis is co-chair for OCP-IP systemlevel design working group as well as a senior member of Group Technical Staff at Texas Instruments,where he works on the architecture of OMAP SoCs, specifically on-chip networking and SoC performance modeling. He joined TI in 2002. Previously he worked at Ascom AG in Switzerland on specification and implementation of wireless LAN, cellular and powerline communications modems. He has many academic publications and has made contributions to standardisation of GSM, UMTS, 802.11, and the language SystemC. His degree is in pure mathematics from the University of Liverpool and his PhD is from the University of York, on the subject of coded modulation and multidimensional geometry.

! !

SystemC is seen as a part of the overall ESL puzzle rather than as a central uniting theme.

Display size, refresh rate, and orientation. Configuration of windows on the display, including rescaling and rotation requirements for the video. Size of the base image used in the videoconference. Compression algorithm used (MPEG or other, with stabilization or without, and so on). Mapping of videoconference functional elements to OMAP hardware. Configuration of SoC, including external memory size and performance, arbitration options, burst usage, and clock frequencies.

My emphasis has been more on the platform-users requirements and workflow than on those of the platform-supplier. Within TIs OMAP organization, the distinction between platform-user and platform-supplier is relatively small and most of the issues raised apply to both. The use of SystemC for performance modeling must fit into a broader methodology for SoC definition and development. Electronic System Level [Design], or ESL, is generally used as an umbrella term for this kind of thing. ESL encompasses themes as diverse as synthesis from sequential C code to RTL and virtual platform use for early software development. Within TI, a number of tools and technologies have been or are being adopted and SystemC is seen as a part of the overall ESL puzzle rather than as a central uniting theme. NonSystemC ESL activity includes use of executable specifications, requirements and use-case capture, top-level SoC integration automization, and memory-map and register-map capture. The performance-analysis modeling environment must interwork with these tools, and therefore its important that the SystemC TLM technology not restrict itself

ENDNOTES:
1. OMAP Platform, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&path=templatedata/cm/general/data/wtbovrvw/omap
2. Open Core Protocol (OCP), www.ocpip.org
3. For OCP transaction-level modeling, see: Haverinen, Anssi, Maxime Leclercq, Norman Weyrich, and Drew Wingard. White Paper for SystemC based SoC Communication Modeling for the OCP Protocol, V1.0, October 14, 2002, www.ocpip.org/uploads/documents/ocpip_wp_SystemC_Communication_Modeling_2002.pdf
4. SystemC, www.systemc.org
5. OMAP Code Development Tools, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigtionId=12013&path=templatedata/cm/general/data/wtbmiddl/omap_development

feature

CPU benchmark maximizes simplicity and efficacy.

CoreMark: A realistic way to benchmark CPU performance


BY SHAY GAL-ON AND MARKUS LEVY, EEMBC

The need still exists for a simple and standardized benchmark that provides meaningful information about the CPU core. Introducing CoreMark, available for free download from www.coremark.org.

Many attempts have been made to provide a single number that can totally quantify the ability of a CPU. Be it MHz, MOPS, or MFLOPS, all are simple to derive but misleading when looking at actual performance potential. Dhrystone was the first attempt to tie a performance indicator, namely DMIPS, to execution of real code: a good attempt that has long served the industry but is no longer meaningful. BogoMIPS attempts to measure how fast a CPU can do nothing, for what that's worth.

CoreMark ties a performance indicator to execution of simple code, but rather than being entirely arbitrary and synthetic, the code for the benchmark uses basic data structures and algorithms that are common in practically any application. Furthermore, in developing this benchmark, the Embedded Microprocessor Benchmark Consortium (EEMBC) carefully chose the CoreMark implementation such that all computations are driven by run-time-provided values to prevent code elimination during compile-time optimization. CoreMark also sets specific rules about how to run the code and report results, thereby eliminating inconsistencies.

COREMARK COMPOSITION

To appreciate the value of CoreMark, it's worthwhile to dissect its composition, which in general consists of lists, strings, and arrays (matrices to be exact). Lists commonly exercise pointers and are also characterized by non-serial memory-access patterns. In terms of testing the core of a CPU, list processing predominantly tests how fast data can be used to scan through the list. For lists larger than the CPU's available cache, list processing can also test the efficiency of cache and memory hierarchy.

List processing consists of reversing, searching, or sorting the list according to different parameters, based on the contents of the list's data items. In particular, each list item can either contain a precomputed value or a directive to invoke a specific algorithm with specific data to provide a value during sorting. To verify correct operation, CoreMark performs a 16-bit cyclic redundancy check (CRC) based on the data contained in elements of the list. Since CRC is also a commonly used function in embedded applications, this calculation is included in the timed portion of CoreMark.

In many simple list implementations, programs allocate list items as needed with a call to malloc. However, on embedded systems with constrained memory, lists are commonly constrained to specific programmer-managed memory blocks. CoreMark uses the latter approach to avoid calls to library code (malloc/free). CoreMark partitions the available data space into two blocks, one containing the list itself and the other containing the data items.
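The two-block partitioning described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the type names `item_t` and `node_t`, the functions `list_init` and `list_length`, and the `seed` parameter are hypothetical and are not taken from the CoreMark sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for this sketch; they are not CoreMark's declarations. */
typedef struct item_s {
    int16_t data16;   /* payload value */
    int16_t idx;      /* original position in the list */
} item_t;

typedef struct node_s {
    struct node_s *next;
    item_t        *info;
} node_t;

/*
 * Build a singly linked list of `count` nodes inside a caller-supplied block.
 * The front of the block holds the list nodes and the rest holds the data
 * items, so no call to malloc/free is needed. `block` must be suitably
 * aligned for node_t. Returns the list head, or NULL if the block is too small.
 */
node_t *list_init(void *block, size_t block_size, size_t count, int16_t seed)
{
    if (block == NULL || count == 0 ||
        count > block_size / (sizeof(node_t) + sizeof(item_t)))
        return NULL;

    node_t *nodes = (node_t *)block;            /* first sub-block: the list  */
    item_t *items = (item_t *)(nodes + count);  /* second sub-block: the data */

    for (size_t i = 0; i < count; i++) {
        items[i].data16 = (int16_t)(seed ^ (int16_t)i); /* derived from a run-time seed */
        items[i].idx    = (int16_t)i;
        nodes[i].info   = &items[i];
        nodes[i].next   = (i + 1 < count) ? &nodes[i + 1] : NULL;
    }
    return &nodes[0];
}

/* Count the nodes by walking the next pointers. */
size_t list_length(const node_t *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

A caller would pass a statically allocated, aligned buffer (for example an array of `uint64_t`) together with a seed read at run time, which keeps the compiler from precomputing the list contents.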
This partitioning also applies to embedded systems designs where data can accumulate in a buffer (items) and pointers to the data are kept in lists (or sometimes ring buffers). The data16 items are initialized based on data that is not available at compile time.

    /* Data item referenced by a list node. */
    typedef struct list_data_s {
        ee_s16 data16;
        ee_s16 idx;
    } list_data;

Each data16 item really consists of two 8-bit parts, with the upper 8 bits containing the original value for the lower 8 bits. The data contained in the lower 8 bits is as shown in Listing 1. The benchmark code modifies the data16 item during each iteration of the benchmark. The idx item maintains the original order of the list items, so that CoreMark can recreate the original list without reinitializing the list (a requirement for systems with low memory capacity).

Listing 1: Data contained in the lower 8 bits.

    0..2: Type of function to perform to calculate a value.
    3..6: Type of data for the operation.
    7   : Indicator for pre-computed or cached value.

    /* List node: a pointer to the next node and a pointer to the data. */
    typedef struct list_head_s {
        struct list_head_s *next;
        struct list_data_s *info;
    } list_head;

The list head is modified during each iteration of the benchmark and the next pointers are modified when the list is sorted or reversed. At each consecutive iteration of the benchmark, the algorithm sorts the list according to the information in the data16 member, performs the test, and then recreates the original list by sorting back to the original order and rewriting the list data. Figure 1 shows the basic structure.

Figure 1: Basic structure of linked-list access mechanism. Each list item has a pointer to the data and a pointer to the next item in the list.

Since pointers on CPUs can range from 8 bits to 64 bits, the number of items initialized for the list is calculated such that the list will contain the same number of items regardless of pointer size. In other words, a CPU with 8-bit pointers will use a quarter of the memory that a 32-bit CPU uses to hold the list headers.

MATRIX PROCESSING

Many algorithms use matrices and arrays, warranting significant research on optimizing this type of processing. These algorithms test the efficiency of tight loop operations as well as the ability of the CPU and associated toolchain to use instruction set architecture (ISA) accelerators such as multiply-accumulate (MAC) units and single instruction, multiple data (SIMD) instructions. These algorithms are composed of tight loops that iterate over the whole matrix. CoreMark performs simple operations on the input matrices, including multiplication with a constant, a vector, or another matrix. CoreMark also tests operating on part of the data in the matrix in the form of extracting bits from each matrix item for operations. To validate that all operations have been performed, CoreMark again computes a CRC on the results from the matrix test.

Within the matrix algorithm for CoreMark, the available data space is split into three portions: an output matrix (with a 32-bit value in each cell) and two input matrices (with 16-bit values in each cell). The input matrices are initialized based on input values that aren't available at compile time. During each iteration of the benchmark, the input matrices are changed based on input values that cannot be computed at compile time. The input matrices are recreated with the last operation, and the same function can be invoked to repeat exactly the same processing.

STATE-MACHINE PROCESSING

An important function of a CPU core is the ability to handle control statements other than loops. A state machine based on switch or if statements is an ideal candidate for testing that capability. The two common methods for state machines use either switch statements or a state transition table. Because CoreMark already uses the latter method in the list-processing algorithm to test load and store behavior, CoreMark uses the former method, switch and if statements, to exercise the CPU control structure.

The state machine tests an input string to detect if the input is a number; if the input is not a number, the state machine will reach the invalid state. Figure 2 shows a simple state machine with nine states. The input is a stream of bytes, initialized to ensure we pass all available states, based on an input that is not available at compile time. The entire input buffer is scanned with this state machine. To validate operation, CoreMark keeps count of how many times each state was visited. During each iteration of CoreMark, some of the data is corrupted based on input that is not available at compile time. At the end of processing, the data is restored based on inputs not available at compile time.

Figure 2: Overall functionality of CoreMark's state machine processing. States: START, S1, INT, FLOAT, S2, EXPONENT, SCIENTIFIC, VALID, and INVALID. Transition labels: digit, dot, + or -, E or e, separator, and other.

COREMARK PROFILING

Since CoreMark contains multiple algorithms, it's interesting to demonstrate how the behavior changes over time. For example, looking at the percentage of control code executed (samples taken at each 1,000 cycles) and branch mispredictions in Figure 3, it's obvious where the matrix algorithm is being called. This is portrayed by the low misprediction rate and high percentage of control operations, indicative of tight loops (for example, between points 330 and 390).

Figure 3: Distribution of control instructions and mispredictions over CoreMark execution. Note that the graph contains both the exact data points and the moving average for control operation. This is to emphasize the extremes. Axes: percent versus thousands of instructions executed. Series: % control, 20-period moving average (% control), 20-period moving average (misprediction rate).

By default, CoreMark only requires the allocation of 2 Kbytes to accommodate all data. This minimal memory size is necessary to support operation on the smallest microcontrollers, so that it can truly be a standard performance metric for any CPU core. Figure 4 examines the memory-access pattern during the benchmark's execution. The information is represented as a percentage of memory operations that access memory within a certain distance from the previous access. It is easy to deduce that the distance peaks are caused by switching between the different algorithms (since each algorithm operates on a slice of a third of the total available data space).

Figure 4: Distribution of memory access distance over time during CoreMark execution. Globally, about 20% of the time, memory access is serial, with peaks of serial access likely due to state-machine operation. This access pattern will test the cache mechanism (if any) and memory-access efficiency on systems without cache. Axes: percent versus thousands of instructions executed. Series (20-period moving averages): serial, close (8 bytes or less), medium (64 bytes or less), far (512 bytes or less).
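The access-distance measure used for Figure 4 can be sketched as a simple classifier over an address trace. This is a hedged illustration rather than the profiling tool the authors used: the names `acc_bucket`, `classify_access`, and `access_histogram` are hypothetical, the 8-, 64-, and 512-byte bands follow the figure's series names, and "serial" is assumed here to mean that an access starts exactly where the previous one ended.

```c
#include <stddef.h>
#include <stdint.h>

/* Distance bands; the band limits follow the Figure 4 series names. */
typedef enum {
    ACC_SERIAL,   /* assumed: starts right after the previous access */
    ACC_CLOSE,    /* within 8 bytes of the previous access   */
    ACC_MEDIUM,   /* within 64 bytes of the previous access  */
    ACC_FAR,      /* within 512 bytes of the previous access */
    ACC_OTHER,    /* anything further away */
    ACC_BUCKETS
} acc_bucket;

/* Classify one access relative to the previous access. */
acc_bucket classify_access(uintptr_t prev_addr, size_t prev_size, uintptr_t addr)
{
    if (addr == prev_addr + prev_size)
        return ACC_SERIAL;

    uintptr_t dist = (addr > prev_addr) ? addr - prev_addr : prev_addr - addr;
    if (dist <= 8)   return ACC_CLOSE;
    if (dist <= 64)  return ACC_MEDIUM;
    if (dist <= 512) return ACC_FAR;
    return ACC_OTHER;
}

/* Tally a trace of n accesses (address and size per access) into the bands. */
void access_histogram(const uintptr_t *addrs, const size_t *sizes, size_t n,
                      size_t counts[ACC_BUCKETS])
{
    for (int b = 0; b < ACC_BUCKETS; b++)
        counts[b] = 0;
    for (size_t i = 1; i < n; i++)
        counts[classify_access(addrs[i - 1], sizes[i - 1], addrs[i])]++;
}
```

Dividing each count by the number of classified accesses in a window of the trace gives the kind of percentage the figure plots over time.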

RESULTS

More than 120 CoreMark results are available online at www.coremark.org; Table 1 shows a few that display some interesting patterns. Because the results depend on the compiler version and flags, these details must be included with every score; otherwise it is impossible to make a useful comparison. Accordingly, the run and reporting rules for CoreMark require that exact tool versions be reported along with any performance results.


Table 1: CoreMark/MHz performance data. Complete benchmark environment details are available at www.coremark.org.

No. | Processor | Compiler | CoreMark/MHz
1 | Analog Devices BF536 0.3 393MHz | gcc4.1.2 | 1.01
2 | Analog Devices BF536 0.3 393MHz | gcc4.3.3 | 1.12
3 | Microchip PIC24FJ64GA004 32MHz | gcc4.0.3 (dsPIC30, Microchip v3_20) | 0.93
4 | Microchip PIC24FJ64GA004 32MHz | gcc4.0.3 (dsPIC30, Microchip v3_20) | 0.75
5 | Microchip PIC24HJ128GP202 40MHz | gcc4.0.3 (dsPIC30, Microchip v3.12) | 1.86
6 | Microchip PIC24HJ128GP202 40MHz | gcc4.0.3 (dsPIC30, Microchip v3.12) | 1.29
7 | Microchip PIC32MX360F512L (MIPS 32 M4K) 72MHz | gcc3.4.4 MPLAB C32 v1.00-20071024 | 1.71
8 | Microchip PIC32MX360F512L (MIPS 32 M4K) 72MHz | gcc3.4.4 MPLAB C32 v1.00-20071024 | 1.90
9 | Microchip PIC32MX360F512L (MIPS 32 M4K) 80MHz | gcc4.3.2 (Sourcery G++ Lite 4.3-81) | 2.30
10 | NXP LPC1114 48MHz | Keil ARMcc v4.0.0.524 | 1.06
11 | NXP LPC1114 48MHz | gcc 4.3.3 (Code Red) | 0.98
12 | NXP LPC1768 100MHz | arm cc 4.0 | 1.75
13 | NXP LPC1768 72MHz | Keil ARMCC v4.0.0.524 | 1.76
14 | Texas Instruments OMAP 3530 500MHz | gcc4.3.3 | 2.42
15 | Texas Instruments OMAP 3530 600MHz | gcc4.3.3 | 2.19
16 | TI Stellaris LM3S9B96 Cortex M3 50MHz | Keil ARMCC v4.0.0.524 | 1.92
17 | TI Stellaris LM3S9B96 Cortex M3 80MHz | Keil ARMCC v4.0.0.524 | 1.60
18 | Xilinx MicroBlaze v7.20.d in Spartan xC3S700A FPGA, 3-s | gcc4.1.1 (Xilinx MicroBlaze) | 1.48
19 | Xilinx MicroBlaze v7.20.d in Spartan xC3S700A FPGA, 5-s | gcc4.1.1 (Xilinx MicroBlaze) | 1.66

Blackfin results (1, 2) show a 10% increase in performance when moving from GCC 4.1.2 to GCC 4.3.3, a reasonable expectation for a newer compiler version. Results (8, 9) show an even more pronounced difference of 18% for a mature compiler, while other results (10, 11) and (12, 13) show minor effects only, as all of those compilers are based on the GCC4 series.

The compiler can also balance code size vs. performance, as we can see in results (3, 4). The compiler and platform are the same, but when directed to build a smaller executable (-Os -mpa), performance drops 19% vs. optimizing the code with -O3. The distinction is even sharper with results (5, 6), at a 30% performance difference (using the -mpa switch). Note: compiler switch information is available from the CoreMark website reports. Other compiler options affect how much the compiler tries to optimize the code. Results (7, 8) show a typical effect from safest (-O2) vs. normal (-O3) optimizations of about 10%.

When operating frequency is scaled up, the system memory and/or on-chip flash cannot always maintain a 1:1 ratio. It is common to see extra wait states on the flash when using higher processor frequencies. When the code resides in flash, the efficiency (as expressed in CoreMark/MHz) is impacted; (14, 15) shows the efficiency dropping almost 10%. For (16, 17) the wait-state effect is even more pronounced, as the CPU:memory ratio can only be maintained at 1:1 up to 50 MHz; when operating at the device's highest frequency (80 MHz), the ratio drops to 1:2, resulting in an efficiency drop of 15%. However, running at 80 MHz still yields an absolute performance improvement of 25% vs. running at 50 MHz.

The results (18, 19) explore the situation where the cache is too small to contain all of the data. In 18, the cache is exactly 2K, which will fit all the data just barely, but have no room left for function arguments that must be passed on the stack. This causes a small amount of bus traffic external to the cache, but when the cache is enlarged (result 19), performance improves 10% even though the pipeline has been changed to a less efficient five-stage pipeline vs. a three-stage pipeline for the first implementation.
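The distinction between efficiency (CoreMark/MHz) and absolute performance in the frequency-scaling discussion above comes down to simple arithmetic, sketched here in C. The helper names `absolute_score` and `percent_change` are hypothetical, and the example figures are the Table 1 entries for results 14 and 15 (2.42 at 500 MHz and 2.19 at 600 MHz).

```c
/* Absolute CoreMark score implied by an efficiency figure and a clock rate. */
double absolute_score(double coremark_per_mhz, double mhz)
{
    return coremark_per_mhz * mhz;
}

/* Relative change from one value to another, in percent. */
double percent_change(double from, double to)
{
    return (to - from) / from * 100.0;
}
```

With those inputs, `percent_change(2.42, 2.19)` comes out near -9.5, the "almost 10%" efficiency drop quoted in the text, while `percent_change(absolute_score(2.42, 500), absolute_score(2.19, 600))` is about +8.6, so the faster clock still delivers more work per second despite the lower efficiency.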

Overall, CoreMark is well suited to comparing embedded processors. It is small, highly portable, well understood, and highly controlled. CoreMark verifies that all computations were completed correctly during execution, which helps debug any issues that may come up. The run rules are clearly defined and reporting rules are enforced on the CoreMark web site. In addition, EEMBC offers certification for CoreMark scores and even has a standardized method to measure energy consumption while running the benchmark.

Markus Levy is founder and president of EEMBC. He is also president of The Multicore Association and chairman of Multicore Technical Conference and Expo. Markus was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded systems industry.

Shay Gal-On is EEMBC's director of software engineering and leader of the EEMBC Technology Center. At EEMBC Shay created the EnergyBench and MultiBench Standards for benchmarking. Prior to EEMBC Shay was principal performance analyst in the Microprocessor Products Group at PMC Sierra, where he influenced the design of new processors, including instruction set design, and optimized both hardware and software products.

break points

Power management, 2011

By Jack G. Ganssle

From Microchip's eXtreme Low Power to TI's OMAP, new chips contain some interesting and complex power management techniques.

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at [email protected].

The first decade of this century was surely a story of wireless connectivity. Billions of people now have cellular connections, and many have smart phones that, astonishingly, put the Internet into our pockets. 2010 saw the success of always-connected tablets, such as the iPad, and 2011 promises to see the release of a veritable zoo of similar products.

But the back story is more interesting and nuanced. These portable devices run for hours to weeks off batteries that make the proverbial cigarette pack look monstrous. Yet the CPU runs at hundreds of MHz, with tens of GBs of various forms of semiconductor memory. That half-pound phone has the compute horsepower of the recently-retired desktop rotting in your basement.

The secret sauce behind portable electronics is the power management hardware and software, consisting of hundreds of thousands of lines of code and big chunks of silicon. Most microcontrollers today have at least a few power-saving sleep modes. But you will likely be surprised, shocked even, at the range of low-energy resources provided by the processors behind the portable connectivity revolution. But first, a little background. Where does the power go?

CURRENT SINKS

Most of the energy supplied to a CPU is consumed in three different ways (there are some other current sinks but their importance is small). The first is driving I/O pins. A differential driver can take a fair amount of oomph, but in a typical connected mobile device the processor generally drives high impedance signals, and so these I/O lines represent a small portion of the energy used.

Leakage and dynamic current requirements eat most of the power drawn by a processor. Leakage is just that: electrons that sneak through the silicon from power to ground. The effect is complex but is proportional to the applied voltage and the intrinsic leakage of the material. The latter grows substantially as the transistor sizes shrink; in fact, it grows by about five orders of magnitude as the geometry goes from 250 nm to 65 nm. (A white paper from Altera explains the leakage problem.)¹ To make matters worse, leakage increases with temperature, to the tune of about an order of magnitude increase per 100°C.² The effect is an insidious feedback loop: the chip gets hot, so it leaks more, making it even hotter, all the while sucking ever more power. And that's the static dissipation, before the clocks are turned on, before the chip is doing anything useful. With small geometries typically a third to a half of the power consumed is lost to leakage.

Dynamic current comes from charging the capacitive loads on the chip, since:

$$i = C\,\frac{dv}{dt} \qquad (1)$$

Every bit switch charges a capacitor (for instance, a transistor's gate). Typical advanced processors have hundreds of millions of transistors switching at a furious pace, so even tiny amounts of per-transistor current add up fast. Really fast. For instance, a high-end Pentium may consume 100 amps at times. The dynamic power consumed by a processor is approximately:

$$P = \alpha C f v^2 \qquad (2)$$

where C is capacitance, f is the clock frequency, v is the applied voltage, and α is the percentage of the circuit that switches with each clock transition. That formula is a little simplistic as it assumes the CPU is running at a constant speed all of the time. Since even simple microprocessors employ sleep modes, it's more useful to think in terms of the amount of power consumed per work-item accomplished. If you have to do one thing and then shut the system down till there's something else needing attention, the energy used per task is:

$$E = Pt \qquad (3)$$

where t is the time spent accomplishing the task. Substituting:

$$E = \alpha C f v^2 t \qquad (4)$$

Given that power is proportional to the voltage squared, it's critical to minimize v. Alas, lowering the voltage limits the maximum frequency attainable, which means the system must be out of a sleep mode longer to get a particular bit of work done. To paraphrase that last sentence: ft is a constant for a given task. However, if the processor must be awake for more than doing one particular activity, cutting the clock frequency may be a better way to save power.

Most of us have programmed various sleep modes in our microcontrollers, and generally this is a pretty simple process. Texas Instruments' MSP430 is touted as an ultra-low power controller, and indeed contains about 25 registers associated with setting and monitoring voltage levels.³ It provides about a half-dozen low power modes. Microchip's eXtreme Low Power controllers such as the PIC16F1827 are also typical.⁴ The 1827 has a single very-low-power sleep mode controlled by a dozen registers, and its onboard peripherals may or may not be active during sleep.

Then there are the connected mobile device parts, which come with vastly more sophisticated power-management features.

THE OMAP 3530

TI's OMAP 3530 is a part targeted to connected mobile devices and is an excellent example of a chip that offers a mind-bending array of power-management capabilities. It's far more than just a processor. The OMAP contains an ARM Cortex-A8 main CPU, a C64x DSP, a graphics accelerator, another processor to handle camera data, a display processor, and a huge number of peripherals and peripheral controllers. It's a monster chip accompanied by a monstrous Technical Reference Manual, which is 3,444 pages long and yet woefully incomplete.⁵ The part's Power, Reset and Clock Management (PRCM) component provides the user with the resources needed to extend battery life, and just this section consumes 400 pages of the TRM. By my count, 189 documented registers in the OMAP are dedicated to power management; there are actually more, but the additional ones are proprietary and not in the public domain.

The OMAP 3530, like devices slated for similar applications, uses (among other things) clock and power gating to minimize power consumption. Remembering that the dynamic and static power together represent the bulk of the energy consumed by a device of this type:

$$E = P_{dynamic} + P_{static} \qquad (5)$$

But that assumes no sleep modes. With clock gating, clocks going to portions of the chip that aren't needed get turned off (for instance, why power the USB interface when there's no comm going on?). For clock gating, the energy consumed by a particular portion of the chip during time t is:

$$E = P_{dynamic}\,t + P_{static} \qquad (6)$$

That is, the energized component still draws static power even when clocks are off. Power gating removes the voltage to a component entirely:

$$E = (P_{dynamic} + P_{static})\,t \qquad (7)$$
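To make the trade-offs in the dynamic-power and gating expressions concrete, here is a small C sketch. It is one possible reading, not the column's own model: the separate variables `t_total` (the whole interval considered) and `t_on` (the part of it during which the block is actually working) are assumptions introduced for illustration, as are the struct and function names.

```c
/* Power drawn by one block of the chip, in watts (illustrative fields). */
typedef struct {
    double p_dynamic;  /* switching power while clocked */
    double p_static;   /* leakage power while powered   */
} block_power;

/* Dynamic power P = alpha * C * f * v^2, as in the dynamic-power formula. */
double dynamic_power(double alpha, double c, double f, double v)
{
    return alpha * c * f * v * v;
}

/* No gating: the block is clocked and powered for the whole interval. */
double energy_no_gating(block_power p, double t_total)
{
    return (p.p_dynamic + p.p_static) * t_total;
}

/* Clock gating (assumed reading): dynamic power only while working,
   static power for the whole interval because the block stays powered. */
double energy_clock_gating(block_power p, double t_total, double t_on)
{
    return p.p_dynamic * t_on + p.p_static * t_total;
}

/* Power gating (assumed reading): nothing is drawn outside the working time. */
double energy_power_gating(block_power p, double t_on)
{
    return (p.p_dynamic + p.p_static) * t_on;
}
```

For a block drawing 0.2 W dynamic and 0.1 W static that works for 2 s out of a 10 s interval, these give 3.0 J, 1.4 J, and 0.6 J respectively under the stated assumptions. Doubling `v` in `dynamic_power` quadruples the result, which is why dropping the voltage pays off so strongly.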

And so the OMAP is portioned into a number of voltage, power, and clock domains, with each one controllable via the software. Voltage domains are subsections of the chip powered by a particular voltage regulator. In fact, the OMAP doesnt like to play by itself; it really needs to be coupled to an external device like TIs TWL5030, which contains a dozen regulators, each of whose voltages are independently programmable via the firmware. Since power used is proportional to the voltage squared, it pays off to get V to just the level needed. For instance, when listening to music, little computation is going on, so its possible to throttle the clocks down and drop V to a lower value. Processing a camera image takes a lot of horsepower for a short period of time; V and F go to their max values. The voltage swing is tens of percent while clock frequencies can vary by over an order of magnitude. Each voltage domain has a number of power domains, which further partition voltage distribution. The power domains enable or disable power to a subsection, or can in some cases put the subsection of the chip into a low-leakage retention mode. The eighteen different power domains include the wakeup logic, the MPU, the video processor, graphics engine, camera, USB, and DPLLs, among others. The alert EEs eyebrows may now
be arched a bit. Turning power off to one part of a chip that is connected to another is a recipe for SCR latchup. But the chip automatically isolates connections to avoid this chip-destroying Armageddon.

Clock domains enable, disable, and control clock rates to components within a power domain. The camera, for instance, has a number of clocks that can be turned on or off as required. There's no need to enable the serial-communications clock to the image sensor when doing white balance adjustment, for example.

These hardware capabilities are combined with software to define Operating Performance Points (OPPs), which are typical combinations of clock frequencies and voltage levels. One OPP for high-performance needs might clock the ARM at 650 MHz, the DSP at 430 MHz, and set the voltage to those two components to 1.35 V. Simpler needs could define an OPP with 125/90 MHz clocks and a 0.95 V Vdd. The software sets registers to program the clocks and sends I2C commands to the external power controller chip to program the voltage regulators. This approach is referred to as dynamic voltage and frequency scaling (DVFS) and is quite common in the mobile world.

Clearly, these mobile devices offer enormously fine-grained control of power consumption. In practice, the OMAP may consume just a few milliamps or nearly an amp. Your mobile phone is constantly going through an exquisite ballet of modulated current consumption. But it's important to note that the concept of a global sleep or hibernation mode so often found on common processors doesn't really exist on the OMAP.

Then it starts getting complicated. The weary firmware engineer must set up clock and power domain dependencies. That is, the OMAP lets one link domains together so that, for instance, on a sleep or wakeup transition of one domain, other linked domains will follow along as well. Clocks can be put in autoidle modes that shut them down or

enable them depending on other on-chip activities.

TI takes DVFS even further with their SmartReflex. This is a gestalt of the hardware resources noted above, extra undocumented hardware in the silicon, plus software that dynamically and automatically manages power to squeeze the most out of every electron drawn from the battery. It's very proprietary and secret, but a paper gives some details.[6] TI claims SmartReflex can reduce leakage current by three orders of magnitude using esoteric techniques that apply odd biases to modulate the body voltage of transistor cells or blocks. I have no idea what that means or how it works, but it's clear an awful lot is going on at the transistor level in the silicon.

Then there's battery management, which, in the case of the OMAP, takes place either in the external TWL5030 power controller or yet another IC. Two general approaches are used (or, sometimes, both together). An A/D can monitor the battery's voltage, but that's a poor indicator of reserve capacity. Most batteries have a discharge curve with a rather sharp knee; pass the knee and the battery will rather suddenly run out of juice. Better is to measure amp-hours charged into the cell and withdrawn, and then apply corrections for battery temperature and aging. In fact, the TWL5030 companion chip also measures the battery's temperature, looks for over-voltage conditions, manages charging via a USB interface, and looks for over-current conditions. There's a lot going on!

This is but a simplified discussion of the OMAP's PRCM since the thing is bewilderingly complex. And there's a lot going on that TI won't divulge. I doubt that it's possible to actually use one of these parts without a lot of interaction with the vendor, but in the mobile world the volumes are so high that I'm sure vendors and customers form close engineering partnerships.

The world of power management is far bigger than the simple sleep modes we use on our smaller controllers. It's great for consumers, who get devices that will run for days on a charge. I wonder if these techniques will find their way into other consumer appliances, like TVs, in the days ahead, which promise perhaps significantly higher energy costs.

ENDNOTES:

1. Altera. "40-nm FPGA Power Management and Advantages," white paper, December 2008, v.1.2, available at www.altera.com/literature/wp/wp-01059-stratix-iv-40nm-power-management.pdf.

2. Fallah, Farzan and Massoud Pedram. "Standby and Active Leakage Current Control and Minimization in CMOS VLSI Circuits," IEICE Trans. on Electronics, Special Section on Low-Power LSI and Low-Power IP, Vol. E88-C, No. 4, Apr. 2005, pp. 509-519, available at SPORT Lab: http://atrak.usc.edu/~massoud/Papers/IEICE-leakage-review-journal.pdf.

3. Texas Instruments. "MSP430x5xx/MSP430x6xx Family User's Guide," June 2008/revised December 2010, available at http://focus.ti.com/lit/ug/slau208h/slau208h.pdf.

4. Microchip. "PIC16F/LF1826/27 Data Sheet: 18/20/28-Pin Flash Microcontrollers with nanoWatt XLP Technology," 2010, available at http://ww1.microchip.com/downloads/en/DeviceDoc/41391C.pdf.

5. Texas Instruments. "OMAP35x Applications Processor Technical Reference Manual," available at http://focus.ti.com/lit/ug/spruf98m/spruf98m.pdf.

6. Carlson, Brian and Bill Giolma. "SmartReflex Power and Performance Management Technologies: reduced power consumption, optimized performance," white paper, Texas Instruments, February 2008, available at http://focus.ti.com/lit/wp/swpy015a/swpy015a.pdf.
