Apnt 329
Abstract
This application note contains the project that is used in the webinar “How to get started with Arm Cortex-M55
software development”. It explains the project in depth and repeats the steps that were shown in the webinar.
The project itself contains four different implementations of a multiply-accumulate (MLA) function. The note
explains how the implementations differ and what performance gains you can expect when using optimized
code on Arm Cortex-M55. The project can run on a Fixed Virtual Platform (FVP) model that is shipped with MDK
v5.30, requiring an MDK-Professional license.
Contents
Abstract
Introduction
  Prerequisites
Project Structure
  Software Components
  main.c
  mla_functions.S
Running the project
  Targets
  FVP – Scalar only implementation
    Code Coverage
    M-Profile Vector Extension window
  FVP – Data type optimized vector implementation
  MPS3 – Data type optimized vector implementation
    Code Coverage
    Performance Analyzer
  MPS3 – Reducing Execution Time Using Component Viewer
    Variables.scvd
    Results
Summary
Appendix
AN329 – Get started with Arm Cortex-M55 Copyright © 2020 Arm Ltd. All rights reserved
www.keil.com/appnotes/docs/apnt_329.asp
Introduction
This application note explains how to implement four different versions of a multiply-accumulate function on
Arm Cortex-M55. You will learn how to use the performance monitoring unit (PMU) to examine the differences
in computing performance. Also, some debugging concepts of MDK are discussed and used.
Finally, the application note quickly touches on the usage of MDK with the Arm MPS3 prototyping board that
can host the netlist of Arm Cortex-M55 and that enables real code profiling using target hardware.
Prerequisites
To run the project, you need the following software:
• MDK v5.30, available from www.keil.com/demo/eval/arm.htm
• An MDK-Professional license. If you do not have access to this MDK edition, you can request a 30-day
trial license from within the tool: www.keil.com/support/man/docs/license/license_eval.htm
Project Structure
The project is configured for the Arm Cortex-M55 and has two source files:
• main.c contains the main() function and calls the different implementations of the MLA function.
• mla_functions.S is an assembly file that contains the different MLA implementations.
The MLA implementations are as follows:
• Scalar only MLA function implementation.
• Scalar MLA function implementation with low-overhead loops (LOL) (refer to Appendix).
• Vectorized MLA function implementation with scalar code for loop tail prediction (refer to Appendix).
• Vectorized MLA function implementation with low-overhead loops (refer to Appendix).
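For reference, the computation that all of these implementations perform can be sketched in portable C. This is a minimal sketch and not the code from mla_functions.S: the names mla_ref and mla_blocked are hypothetical, and the block size of 4 assumes that four 32-bit elements fit into one 128-bit vector. The second function only mimics the structure of a vectorized loop with a scalar tail: it handles full blocks first and then the leftover elements one by one.

```c
#include <assert.h>

/* Hypothetical reference: returns the sum of a[i] * b[i] for i in [0, n). */
int mla_ref(const int a[], const int b[], int n) {
    assert(n >= 0);
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Assumption: four 32-bit elements fit into one 128-bit vector. */
#define LANES 4

/* Same result as mla_ref(), structured like a vector loop with a scalar tail. */
int mla_blocked(const int a[], const int b[], int n) {
    assert(n >= 0);
    int acc = 0;
    int i = 0;

    /* Main loop: one full block of LANES elements per iteration. */
    for (; i + LANES <= n; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            acc += a[i + lane] * b[i + lane];
        }
    }

    /* Scalar tail: the remaining n % LANES elements. */
    for (; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Both functions return the same value for the same input, provided the sum fits into an int. With 127 elements, the blocked version runs 31 full blocks (124 elements) and leaves 3 elements for the scalar tail.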
The project also contains a custom scatter file (ARMC55_ac6.sct) that is used to place one of the software
components in an uninitialized part of the target’s memory.
Software Components
Apart from the two source files, the project contains the following software components:
• ::CMSIS:Core for access to the Cortex-M55 header file
• ::Device:Startup for startup and systems files
• ::Compiler:Event Recorder and ::Compiler:I/O:STDOUT:EVR for retargeting the printf() output to the
Debug (printf) Viewer window
main.c
In main.c, we first include a couple of required header files. The EventRecorder.h file is only included if the
component is present (which is noted in RTE_Components.h). This will be used in the last step when we remove
the component and thus don’t want to include its header file:
#include "RTE_Components.h"
#include CMSIS_device_header
#ifdef RTE_Compiler_EventRecorder
#include "EventRecorder.h" // Keil.ARM Compiler::Compiler:Event Recorder
#endif
#include <stdio.h>
After that, we create the variables that are used to compute the results, depending on the #defines, as shown below:
#define LENGTH 127
#ifndef DATATYPE
extern int mla_sca(int a[], int b[], int n);
extern int mla_sca_lol(int a[], int b[], int n);
extern int mla_vec(int a[], int b[], int n);
extern int mla_vec_lol(int a[], int b[], int n);
int a[LENGTH];
int b[LENGTH];
#else
short a[LENGTH];
short b[LENGTH];
#endif
Next, there is an initialization function that is used to fill the variables with data that will be used for the MLA
function:
__attribute__((noinline)) void init_arrays(void) {
int i;
In main(), we initialize Event Recorder (if the component is present) and enable the PMU and its cycle counter.
Then we initialize the arrays and run the calculations with the four different implementations. Each calculation is
enclosed in two reads of the cycle counter, and the difference between them measures the performance of that
implementation. The cycle count is then stored in a variable that is printed on the debug console:
int main(void) {
#ifdef RTE_Compiler_EventRecorder
EventRecorderInitialize(EventRecordAll, 1);
#endif
ARM_PMU_Enable();
ARM_PMU_CNTR_Enable(PMU_CNTENSET_CCNTR_ENABLE_Msk);
init_arrays();
cycle_count_before = ARM_PMU_Get_CCNTR();
sca_result = mla_sca(a, b, LENGTH);
cycle_count_after = ARM_PMU_Get_CCNTR();
scalar_cycle_count = cycle_count_after - cycle_count_before;
#ifdef RTE_Compiler_EventRecorder
printf("Scalar only : Result = %d, Cycles = %d\n", sca_result, scalar_cycle_count);
#endif
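The measurement pattern in the listing above can be tried out on a host machine without the CMSIS headers. The following is a minimal sketch under stated assumptions: the helper names cycles_elapsed and format_result are hypothetical, and the cycle counts are assumed to be stored as uint32_t, which is the return type of ARM_PMU_Get_CCNTR(). It shows two details that are easy to miss: unsigned subtraction still gives the right difference when the counter wraps around once, and PRIu32 is the portable format specifier for a uint32_t.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
 * Unsigned arithmetic is modulo 2^32, so the result is still correct
 * if the counter wrapped around at most once between the two reads. */
uint32_t cycles_elapsed(uint32_t before, uint32_t after) {
    return (uint32_t)(after - before);
}

/* Formats one result line into buf and returns the number of characters
 * that the complete line needs (snprintf semantics). */
int format_result(char *buf, size_t size, const char *label,
                  int result, uint32_t cycles) {
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%s : Result = %d, Cycles = %" PRIu32 "\n",
                    label, result, cycles);
}
```

On the target, the before and after values would come from ARM_PMU_Get_CCNTR(), as in the listing above.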
mla_functions.S
The capital “S” in the file extension indicates to the Arm assembler that the file requires preprocessing as it
contains a couple of #ifndefs that are used to steer the selected implementation. To see the best result of
each implementation, you need to set the following #defines in the Options for Target – C/C++ (AC6) and Asm
tabs:
Result of          C/C++ (AC6) tab    Asm tab
Scalar only        -                  PLAIN
Scalar LOL         -                  SCALOL
Vector LOL         -                  VECLOL
Optimized types    DATATYPE           DATATYPE
The last one uses different data types than the others (also in main.c). Instead of 32-bit integers, 16-bit fixed-
point values are used that enable higher throughput in a single iteration.
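The effect of the narrower data type on the loop structure can be worked out with simple arithmetic. The sketch below is an illustration only, with hypothetical helper names: it assumes 128-bit vector registers and computes how many elements fit into one vector, how many vector-loop iterations are needed for n elements, and how many elements are left for a final partial iteration.

```c
#include <assert.h>

/* Assumption: vector registers are 128 bits wide. */
#define VECTOR_BITS 128

/* Number of elements of the given width that fit into one vector. */
int lanes_per_vector(int element_bits) {
    assert(element_bits > 0 && VECTOR_BITS % element_bits == 0);
    return VECTOR_BITS / element_bits;
}

/* Number of vector-loop iterations needed to cover n elements,
 * counting a final partial iteration as one iteration. */
int vector_iterations(int n, int element_bits) {
    assert(n >= 0);
    int lanes = lanes_per_vector(element_bits);
    return (n + lanes - 1) / lanes;
}

/* Number of elements in the final partial iteration
 * (0 if n is a multiple of the lane count). */
int tail_elements(int n, int element_bits) {
    assert(n >= 0);
    return n % lanes_per_vector(element_bits);
}
```

For LENGTH = 127, this gives 32 iterations with a 3-element tail for 32-bit elements, and 16 iterations with a 7-element tail for 16-bit elements. The iteration count alone does not determine the cycle count, but it indicates why the 16-bit version has less work per array.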
Running the project
Targets
The project supports two targets:
• FVP: The project runs in simulation using the Cortex-M55 FVP that is delivered with MDK. An FVP is
useful for prototyping as it gives an indication about code performance. However, it does not give
accurate measurements.
• MPS3: The project connects to a Cortex-M55 FPGA image running on the MPS3 prototyping platform.
Using a hardware implementation gives you accurate code performance measurements.
Note: General information on how to use Fixed Virtual Platforms in MDK can be found here:
www.keil.com/support/man/docs/fstmdls/
FVP – Scalar only implementation
First, we need to tell the assembler to use the right implementation. Go to Options for Target – Asm (Alt+F7)
and enter PLAIN in the Define: section:
Build (F7) the project and Start a Debug Session (Ctrl + F5). You will see two windows opening in the
background – these are issued by the Fast Model and must not be closed during the debug session.
Run (F5) the application. It will hit a breakpoint at the while(1) loop at the end of the program. Observe the
output of the printf() calls in the Debug (printf) Viewer window:
Note that the cycle counts in your output do not have to match exactly, but the general relation between the
implementations should be the same.
Code Coverage
Stop the Debug Session (Ctrl + F5) and restart it immediately afterwards. You will notice that the display of
your code has changed. It now contains code coverage markings:
This is due to a new feature in MDK v5.30 that allows you to extract coverage information from an FVP.
Unfortunately, this information cannot be shown live in a debug session, but needs to be loaded when entering debug.
In the Models ARMv8-M Target Driver Setup dialog, you can specify to save the coverage information and to
load recorded coverage information on debug entry (Options for Target – Debug – Settings (Alt+F7)):
In a debug session with the data loaded from the previous run, you can use the COVERAGE command to store
the coverage information in Gcov format. This is useful for CI/CD environments where your server can run
automated testing and create coverage information based on Gcov. Enter the following in the Command
window:
The Gcov files (one for each module) will be saved in the directory where the objects are stored. You can use a
tool like gcovr to create an HTML table showing the project’s overall code coverage.
M-Profile Vector Extension window
MDK v5.30 introduces a new System Viewer window – the M-Profile Vector Extension window. This window
allows you to check the MVE vector registers.
Go to View – System Viewer – Core Peripherals – M-Profile Vector Extension (MVE) to open the dialog:
The Vectors area displays the values of vectors Q0 - Q7. The Cortex-M55 works in parallel on 2 x 64-bit vectors.
You can configure the display of this window to show the native number format that you are using in your
algorithms, from 64-bit down to 8-bit. You can specify to see the content of the vector register in int, float, or
even q number format. This makes it easy to verify the correct operation of your application.
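To illustrate what a q number display means, the sketch below converts between a raw 16-bit pattern and its fractional value. It assumes the common Q15 layout (1 sign bit, 15 fractional bits); the source does not state which q format the project uses, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Interprets a raw 16-bit pattern as Q15: value = raw / 2^15,
 * which covers the range [-1.0, 1.0). */
float q15_to_float(int16_t raw) {
    return (float)raw / 32768.0f;
}

/* Converts a float to Q15, saturating at the ends of the Q15 range.
 * The fractional part beyond 15 bits is truncated toward zero. */
int16_t float_to_q15(float x) {
    assert(x == x); /* reject NaN */
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)scaled;
}
```

The same register content therefore reads as 16384 in the int view and as 0.5 in a Q15 view.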
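FVP – Data type optimized vector implementation
With the define DATATYPE set on the C/C++ (AC6) and Asm tabs, rebuild the project, Start a Debug Session
(Ctrl + F5), and Run (F5) the application. You should see results like the following ones: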
Comparing the different results, we see that an optimized implementation of an algorithm with the right
selection of the variable data types can increase performance significantly. In our case, the Vector + LOL version
is more than 6.5 times faster than the simple scalar implementation. Let’s see how the numbers compare on a
real hardware implementation in an FPGA.
MPS3 – Data type optimized vector implementation
Make sure that the define DATATYPE is still set on the C/C++ (AC6) and Asm tabs.
Build (F7) the project, Start a Debug Session (Ctrl + F5), and Run (F5) the application. It will hit a
breakpoint at the while(1) loop at the end of the program. Observe the output of the printf() calls in the
Debug (printf) Viewer window:
Notice that the Vector + LOL implementation needs more cycles than estimated with the model, but the relation
to the Scalar implementation is roughly right: in real life it is even more than eight times faster than the simple
implementation.
Code Coverage
Notice that the code is already annotated with coverage information once you run through it. This is an
advantage when using real hardware. You can also use the Code Coverage window to check the coverage for
each module/function:
As before, you can write coverage information to a Gcov file for further processing.
Performance Analyzer
Using ETM trace, you also get access to Performance Analyzer. ULINKpro allows applications to be run for long
periods of time while collecting trace information. This is used by Performance Analyzer to record and display
execution times for functions and program blocks. It shows the processor cycle usage and enables you to
identify algorithms that require optimization.
Go to View – Analysis Windows – Performance Analyzer to open the window:
MPS3 – Reducing Execution Time Using Component Viewer
Stop the Debug Session (Ctrl + F5) and open the Manage Run-Time Environment window. Disable the
following components:
• ::Compiler:Event Recorder
• ::Compiler:I/O:STDOUT
Instead of using printf() to display the variables, you could add them to the Watch window. However, this
requires the variables to be in scope and does not render nicely (only the variable name and value are shown).
Another option is to use Component Viewer to create your own window to view the variables by just adding
a simple XML file to the project.
Go to Options for Target – Debug (Alt+F7) and click Manage Component Viewer Description Files … at the
bottom of the dialog. In the next window, click Add Component Viewer Description File and browse to the file
Variables.scvd in the project directory, and click Add:
Click OK twice.
Variables.scvd
Here is the content of the SCVD file:
<?xml version="1.0" encoding="utf-8"?>
</component_viewer>
Basically, you create the objects that you want to display in the window by reading program symbols. Then you
tell the window how to display these objects (items) in a consistent way (using printf-style formatting).
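On the C side, reading values by program symbol implies that the values live in objects with static storage duration, so that they have a symbol and a fixed address. The following is a minimal sketch of that idea: the variable names are taken from main.c, but their types, the volatile qualifier, and the helper store_scalar_measurement are assumptions made for illustration. The volatile qualifier is one way to make sure the stores are actually performed even though the program never reads the values back.

```c
#include <stdint.h>

/* File-scope objects have static storage duration, so a debugger can
 * locate them by symbol at any time, whatever function is executing.
 * Assumption: types and qualifiers are illustrative, not from main.c. */
volatile int      sca_result;
volatile uint32_t scalar_cycle_count;

/* Hypothetical helper: stores one measurement in the global variables. */
void store_scalar_measurement(int result, uint32_t before, uint32_t after) {
    sca_result = result;
    scalar_cycle_count = (uint32_t)(after - before);
}
```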
Results
Rebuild the project, Start a Debug Session (Ctrl + F5) and go to View – Watch Windows – Cycle Counts to
open the Component Viewer window. Run (F5) the application. The Cycle Counts window shows the
following results:
Using a less invasive method of reading the variables, we could reduce the overall run-time of the application
drastically.
Summary
This application note showed how you can use the Fixed Virtual Platforms (FVPs) that are shipped with Arm Keil
MDK to start early software prototyping or architecture exploration without real hardware. This method gives an
indication of the relative performance of your application code when you do not have access to a target device,
although the cycle counts are not accurate measurements. The note also showed how, on a prototyping board,
advanced features of Arm Keil MDK help you to profile your application further and to reduce the overall
run-time of the program.
Appendix
Here are some useful resources regarding the Armv8.1-M architecture:
1. White paper: Introducing the new Armv8.1-M architecture
2. Armv8-M Architecture Reference Manual Documentation
3. Cortex-M55 on developer.arm.com