dm00596687 stm32mp157 Gpu Application Programming Manual Stmicroelectronics
Programming manual
Introduction
The STMicroelectronics STM32MP157 line microprocessors embed a Vivante GPU (graphics processing unit) that supports the
Khronos® Group OpenGL® ES (embedded system) and OpenVG™ standards. This document addresses a number of items
that need to be considered when using the Vivante GPU to accelerate a graphics-based application. When used efficiently, the
Vivante GPU accelerates the eye-catching visuals of an application while minimizing system resource (CPU, memory,
bandwidth, power) loading. The end result is an optimized solution that maximizes the user experience. In the remaining
sections it is assumed the reader understands the fundamentals of OpenGL® ES programming or another graphics API
(application programming interface).
A few hints and tricks help an application take full advantage of the GPU hardware. The following sections
give general recommendations and best practices for OpenGL ES and for graphics programming as a whole.
This document describes generic GPU features; see the related STM32MP157 documentation [1], [2] for more technical details.
This document contains proprietary copyright material disclosed with permission of Vivante Corporation.
Reference documents
1 General information
The following table presents a non-exhaustive list of the acronyms used in this document.
Acronym   Definition
AI        Artificial intelligence
API       Application programming interface
ES        Embedded systems
FPS       Frames per second
GUI       Graphical user interface
GPU       Graphics processing unit
MSAA      Multisample anti-aliasing
OpenGL    Open graphics library
SoC       System on chip
VBO       Vertex buffer object
VTK       Vivante tool kit/STM32MP157 line GPU tool kit
This document applies to the STM32MP157 line devices, which are Arm®-based devices.
Note: Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
2 Introduction to OpenGL/OpenGL ES
OpenGL is a royalty-free, cross platform C-based API that is portable across a wide range of systems. The API is
maintained by the Khronos Group and additional information can be found at www.khronos.org. OpenGL is a
method of rendering graphics (3D / 2D) data onscreen by transforming graphics primitives (points, lines, triangles)
into visual information that a user can see. For example, a developer can program and configure the Vivante 3D
GPU pipeline, send data to the pipeline, and the GPU executes the graphics commands.
OpenGL ES (OpenGL for embedded systems) is a subset of OpenGL that removes redundant functionality and
packages the API in a library that is targeted for mobile and embedded platforms. Even though this document
focuses on OpenGL ES 2.0 in the following sections, the same rules apply to other OpenGL ES versions and to other
3D rendering APIs.
3 Application programming recommendations
The recommendations listed below take a holistic approach centered on overall system level optimizations that
balance graphics and system resources.
3.2 Optimize off-chip data transfer such as accessing off-chip DDR memory/mobile DDR memory
Any off-chip data transfer takes bandwidth and resources from other functional blocks in the SoC, increases
power consumption, and adds latency because the GPU pipeline must wait for data to return from memory.
Using the on-chip cache and writing the application to exploit cache locality and coherency increases
performance. In addition, accessing the GPU frame buffer from the CPU (not recommended) forces the driver to
flush all render commands queued in the command buffer. This slows performance because the GPU has to wait
while the command queue is partially empty (an inefficient use of resources) and the CPU and GPU no longer
run in parallel.
3.12 Use VBO for vertex data instead of static or stack data
A vertex buffer object (VBO) is a buffer object that provides the benefits of both a vertex array and a display
list, and allows a substantial performance gain when uploading data (vertex position, color, normals, and texture
coordinates) to the GPU. VBOs create buffer objects in memory and allow the GPU to access that memory directly,
without CPU intervention (DMA). The memory manager can optimize buffer placement using feedback from the
application. VBOs can also handle static and dynamic data sets and are managed by the Vivante driver. The
benefits of the vertex array and of the display list are:
• A vertex array reduces the number of function calls and allows redundant data to be shared between related
vertices, instead of re-sending all the data each time. Access to data can be referenced by the array index.
• The display list allows commands to be stored for later execution and can be used repeatedly over multiple
frames without re-transmitting data, thus minimizing CPU cycles to transfer data. The display list can also be
shared by multiple OpenGL / OpenGL ES clients so they can access the same buffer with the corresponding
identifier. If computationally expensive operations (such as lighting or material calculations) are put inside
the display lists, then these computations are processed once when the list is created and the final result
can be re-used multiple times without needing to re-calculate again.
By using VBOs, the application combines the benefits of both the vertex array and the display list, and
performance increases compared with static or stack data sets.
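The data sharing described for vertex arrays can be illustrated with a small size calculation. The sketch below is not taken from the Vivante driver; the `Vertex` layout and both helper functions are hypothetical names introduced for illustration, and 16-bit indices are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vertex layout: position (x, y, z) plus texture coordinate (u, v). */
typedef struct {
    float position[3];
    float texcoord[2];
} Vertex;

/* The layout is assumed to be tightly packed (five floats, no padding). */
static_assert(sizeof(Vertex) == 5 * sizeof(float), "Vertex is expected to be packed");

/* Bytes needed when every triangle carries its own three vertices. */
static size_t unindexed_bytes(size_t triangle_count)
{
    return triangle_count * 3 * sizeof(Vertex);
}

/* Bytes needed when triangles reference shared vertices through 16-bit indices. */
static size_t indexed_bytes(size_t unique_vertices, size_t triangle_count)
{
    return unique_vertices * sizeof(Vertex)
         + triangle_count * 3 * sizeof(uint16_t);
}
```

For a quad (two triangles, four unique vertices) and 4-byte floats, `unindexed_bytes(2)` gives 120 bytes while `indexed_bytes(4, 2)` gives 92 bytes. In an OpenGL ES 2.0 application, such data would typically be uploaded once into buffer objects with glGenBuffers, glBindBuffer, and glBufferData rather than being re-sent from client memory on every draw.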
3.25 Use indexed triangle strips; most optimal are pairs of six triangles
Indexed triangle strips can maximize the vertex cache utilization as each set of vertex data can be used in two
triangles. Pairs of six triangles can fit exactly into the vertex cache.
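As a rough illustration of how an indexed strip reuses vertices, the sketch below builds the index list for one band of a vertex grid. The function names and the grid numbering (top row first, then bottom row) are assumptions made for this example, not part of any Vivante API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill `out` with triangle-strip indices for one band of a grid that has
 * `cols` vertices in its top row (indices 0 .. cols-1) and `cols` vertices
 * in its bottom row (indices cols .. 2*cols-1).
 * `out` must have room for 2 * cols entries.
 * Returns the number of indices written (2 * cols).
 */
static size_t build_strip_indices(uint16_t *out, size_t cols)
{
    assert(out != NULL && cols >= 2);
    size_t n = 0;
    for (size_t c = 0; c < cols; ++c) {
        out[n++] = (uint16_t)c;          /* top-row vertex */
        out[n++] = (uint16_t)(cols + c); /* bottom-row vertex */
    }
    return n;
}

/* A strip of n indices describes n - 2 triangles (none if n < 3). */
static size_t strip_triangle_count(size_t index_count)
{
    return index_count < 3 ? 0 : index_count - 2;
}
```

With `cols` set to 4, the strip needs 8 indices to describe 6 triangles, where an indexed triangle list would need 18. The resulting array could then be drawn with glDrawElements using the GL_TRIANGLE_STRIP mode and GL_UNSIGNED_SHORT index type.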
3.26 Vertex attribute stride should not be larger than 256 bytes
Most Vivante GPUs provide native support for a vertex attribute stride of up to 256 bytes. If the vertex
attribute stride is larger than 256 bytes, then the driver has to copy the vertex data around.
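One way to guard against an oversized stride is to check the interleaved vertex structure at compile time. This is a minimal sketch: the `InterleavedVertex` layout, the `MAX_NATIVE_STRIDE` constant, and `stride_is_native` are hypothetical names, and only the 256-byte figure comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Limit quoted in this section; the macro name is an assumption. */
#define MAX_NATIVE_STRIDE 256

/* Hypothetical interleaved layout; sizeof() of this struct is the stride. */
typedef struct {
    float         position[3];
    float         normal[3];
    float         texcoord[2];
    unsigned char color[4];
} InterleavedVertex;

/* Fails the build if the layout grows beyond the native stride. */
static_assert(sizeof(InterleavedVertex) <= MAX_NATIVE_STRIDE,
              "vertex stride exceeds 256 bytes");

/* Run-time variant for strides that are only known at run time. */
static int stride_is_native(size_t stride_bytes)
{
    return stride_bytes <= MAX_NATIVE_STRIDE;
}
```

The value of `sizeof(InterleavedVertex)` is what the application would pass as the stride argument of glVertexAttribPointer for each attribute of this layout.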
3.32 Avoid using VBO and non-VBO for vertex attribute inputs within a draw call
Within one draw call, a vertex attribute input may come from a conventional vertex array (in application system
memory, instead of in a vertex buffer object (VBO)), or may not come from an array at all but from the current
vertex attribute value set by the API function glVertexAttrib* (not in a VBO). Mixing either of these with VBO
inputs requires an added scan of the index buffer. If the index buffer is in GPU video memory, this causes a
performance lag. Avoiding a scan of the index buffer is better. However, if a scan must be performed, it is
better to have the index buffer in a conventional array (in system memory) to reduce the performance hit.
The driver scans the index buffer to get the minimum and maximum index values. It uses these values to decide
how much video memory to allocate for the data that is copied from application system memory or from the
current vertex attribute value. Scanning the index buffer object is slower if it resides in GPU video memory.
For the best draw-call performance, put all vertex attributes in a VBO and do not reference the current vertex
attribute value.
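The cost of the scan can be pictured with the sketch below, which finds the minimum and maximum index and derives how many bytes of attribute data the range covers. It illustrates the kind of work described above and is not the actual driver code; both function names, the 16-bit index type, and the size formula are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk every index once to find the smallest and largest value referenced. */
static void index_range(const uint16_t *indices, size_t count,
                        uint16_t *min_out, uint16_t *max_out)
{
    assert(indices != NULL && count > 0);
    uint16_t lo = indices[0];
    uint16_t hi = indices[0];
    for (size_t i = 1; i < count; ++i) {
        if (indices[i] < lo) lo = indices[i];
        if (indices[i] > hi) hi = indices[i];
    }
    *min_out = lo;
    *max_out = hi;
}

/* Bytes of attribute data covered by the vertices min_index .. max_index. */
static size_t copy_size_bytes(uint16_t min_index, uint16_t max_index,
                              size_t stride_bytes)
{
    assert(max_index >= min_index);
    return ((size_t)max_index - min_index + 1) * stride_bytes;
}
```

The loop touches every index, so its cost grows with the size of the draw call, and each read is more expensive when the indices live in GPU video memory. Keeping all attributes in VBOs avoids the scan altogether.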
3.34 Schedule a dependable amount of work per frame to obtain a steady frame rate
For example, a frame that renders ten times as many pixels as the frames around it takes longer to complete and
lowers the frame rate.
3.35 Select the number of pixels being rendered to correspond to the desired
performance
The number of pixels being rendered directly determines performance: the more pixels rendered per frame, the
lower the achievable frame rate. This also includes offline rendering.
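The relationship between resolution, frame rate, and pixel workload can be estimated with simple arithmetic. The sketch below assumes a deliberately simplified model in which cost depends only on the pixel count (no overdraw, shader cost, or bandwidth effects); the function names and the idea of a fixed pixel budget per second are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels written for one full-screen frame. */
static uint64_t pixels_per_frame(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height;
}

/* Pixels written per second at a given frame rate. */
static uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return pixels_per_frame(width, height) * fps;
}

/* Highest whole frame rate that stays within a per-second pixel budget. */
static uint32_t max_fps_for_budget(uint64_t pixel_budget_per_second,
                                   uint32_t width, uint32_t height)
{
    uint64_t per_frame = pixels_per_frame(width, height);
    assert(per_frame > 0);
    return (uint32_t)(pixel_budget_per_second / per_frame);
}
```

Under this model, an 800 x 480 display at 60 frames per second requires 23 040 000 pixels per second; spending the same budget on 1280 x 720 frames allows only 25 frames per second. Keeping the per-frame pixel count roughly constant is also what keeps the frame rate steady.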
4 Recommendations summary
Revision history
Contents
1 General information
2 Introduction to OpenGL/OpenGL ES
3 Application programming recommendations
3.1 Understand the system configuration and target application
3.2 Optimize off-chip data transfer such as accessing off-chip DDR memory/mobile DDR memory
3.3 Avoid random cache or memory accesses
3.4 Optimize the use of system memory
3.5 Take advantage of fast memory transfers
3.6 Target a fixed frame rate that is visibly smooth
3.7 Minimize GL state changes
3.8 Batch primitives to minimize the number of draw calls
3.9 Perform calculations per vertex instead of per fragment/pixel
3.10 Enable early-Z, hierarchical-Z and back face culling
3.11 Use branching carefully
3.12 Use VBO for vertex data instead of static or stack data
3.13 Use dynamic VBO if data is changing frame by frame
3.14 Tessellate data for a better hierarchical Z (HZ) performance
3.15 Use dynamic textures as a texture cache (texture atlas)
3.16 Stitch together small triangle strips
3.17 Specify EGL configuration attributes precisely
3.18 Use power-of-two aligned texture/render buffers
3.19 Disable MSAA rendering unless high quality is needed
3.20 Avoid partial clears
3.21 Avoid mask operations
3.22 Use MIPMAP textures
3.23 Use compressed textures if possible
3.24 Draw objects from near to far if possible
3.25 Use indexed triangle strips; most optimal are pairs of six triangles
3.26 Vertex attribute stride should not be larger than 256 bytes
3.27 Avoid binding buffers to mixed index/vertex array
3.28 Avoid using CPU to update texture/buffer contexts during render
3.29 Avoid frequent context switching
3.30 Optimize resources within a shader
3.31 Avoid using glScissor Clear for small regions
3.32 Avoid using VBO and non-VBO for vertex attribute inputs within a draw call
3.33 Do not call glFinish unnecessarily
3.34 Schedule a dependable amount of work per frame to obtain a steady frame rate
3.35 Select the number of pixels being rendered to correspond to the desired performance
3.36 Consider using front-to-back drawing order
3.37 Use eglWaitSync when compositing from multiple threads
4 Recommendations summary
4.1 Best practices
4.2 Bad practices