Migrating Software To Multicore SMP Systems: Satyaki Mukherjee
1
Agenda
2
ARM Cortex-A MPCore
ARMv7 architecture implementations
Configurable number of cores
Distributed Interrupt Controller
CoreSight™ Multicore Debug and Trace
Multimedia (VFP/NEON™)
Security (TrustZone®)
ACP (for MP implementations)
Virtualization (in Cortex-A15)
Large Addresses (in Cortex-A15)
3
Announced MP SoC implementations
Silicon Vendor | Platform name | Marketed performance | App Processor | Number of cores
Nvidia | Tegra 250 | up to 1 GHz | A9 MPCore | 2
Samsung | Orion | 1 GHz | A9 MPCore | 2
Texas Instruments | OMAP 4430 | up to 1 GHz | A9 MPCore | 2
Texas Instruments | OMAP 4440 | over 1 GHz | A9 MPCore | 2
ST-Ericsson | U8500 | up to 1.2 GHz | A9 MPCore | 2
ST-Ericsson | U5500 | Optimized for power | A9 MPCore | 2
Renesas (NEC) | EMMA Mobile/EV2 | 533 MHz | A9 MPCore | 2
Renesas (NEC) | EMMA Car EC-4270 (NaviEngine) | 400 MHz | ARM11 MPCore | 4
Renesas (NEC) | EMMA Car EC-4250 (NaviEngine-mini) | 400 MHz | ARM11 MPCore | 2
Renesas (NEC) | EMMA Car EC-4260 (NaviEngine-MID) | 400 MHz | ARM11 MPCore | 3
Renesas (NEC) | EMMA Car EC-43xx Series | TBC | A9 MPCore | 4
STMicroelectronics | SPEAr1310 (Comms MPU) | 600 MHz | A9 MPCore | 2
NuFront | NuSmart 2816 | up to 2 GHz | A9 MPCore | 2
MindSpeed | Transcede 4000 (4G wireless baseband stations) | 600 MHz | A9 MPCore | 4+2
MindSpeed | Transcede 4020 | 750 MHz | A9 MPCore | 4+2
MindSpeed | M8500 (Comcerto® 5000 Media Processing SoC) | 600 MHz | A9 MPCore | 4+2
Marvell | 88F6323/2/1 (Sheeva) | 0.8 to 1 GHz | ARMv5TE | 2
Marvell | ARMADA 1000 (88DE3010) (Sheeva) | 1.2 GHz | ARMv5TE | 2
Marvell | ARMADA Quadcore (announced at CES 2010) | - | - | 4
Marvell | ARMADA 628 | 2x 1.5 GHz + 1x 624 MHz | ARMv7 MP | 3
Qualcomm | MSM8260 Snapdragon | up to 1.2 GHz | ARMv7 | 2
Qualcomm | MSM8660 Snapdragon | up to 1.2 GHz | ARMv7 | 2
Qualcomm | QSD8672 Snapdragon | 1.5 GHz | ARMv7 | 2
Publicly announced SoCs based on multicore ARM processors. Does not include future roadmap. Correct at time of publishing.
4
Toshiba AC100
MSI WindPad 110
Notion Ink Adam
NVIDIA/Foxconn Tablet
8
The SMP OS Takes Care of it All!
Automatic load balancing enables:
Optimal multi-tasking
Dynamic power management
[Diagram: the SMP Operating System running on a Cortex-A MPCore schedules the tasks — User Interface, Multimedia, Communication, Web, other tasks — across CPU 1–4]
11
SMP Power Management Options
[Diagram: CPU0 and CPU1 shown under two power-management options — Full Operation and DVFS — with per-CPU load levels stepping through 100%, 66%, 33%, down to OFF]
13
Example: Android SMP
Every application runs in its own Linux process
Java threading abstracts the multicore SMP architecture
Low-level PThreads API still accessible for advanced developers
Each instance of the Dalvik virtual machine runs in its own process
By default all components of an application run in the same process, but it is
possible to change this and spawn threads/processes
Middleware can be independently optimized for SMP
Linux is a mature and optimized SMP kernel
[Diagram: simplified Android stack — Applications / Application Framework / Runtime / Libraries / SMP Linux Kernel / SMP Processor]
14
Migrating Software to SMP
VALIDATION
Ensure that software works on an SMP system
OPTIMIZATION
• Remove serialization bottlenecks
• Modify software to expose concurrency
• Parallelise if the algorithm is suitable and
the effort is worthwhile
16
Migrating software to multi-core
DEVICE DRIVERS
17
Shared Resources
19
Critical Sections and Mutual Exclusion
Thread A:
    ...
    mutex_lock(&lock);
    /* critical section */
    mutex_unlock(&lock);
    ...

Thread B:
    ...
    mutex_lock(&lock);
    /* critical section */
    mutex_unlock(&lock);
    ...
20
Implications of Enabling Multitasking
To enable multitasking various scheduling models are available:
Run To Completion
Cooperative: the running task yields to relinquish CPU
Pre-emptive: the OS is in charge of scheduling
On a UP system
Only one task is executing at any time
Cooperatively scheduled code may rely on this
The kernel is protected by:
Making kernel processes non-pre-emptable
Disabling interrupts inside critical sections
On an SMP system
More than one task can be executing at the same time
UP kernel protection mechanisms are no longer adequate
Can no longer assume that lower priority tasks are in the wait queue
21
Spinlocks (kernel)
Busy-wait mechanism to enforce mutual exclusion on multicore systems
It is a binary mechanism (lock/unlock). It is very quick.
Not suitable when holding the lock for a long time – spinlocks do not sleep
Energy-efficient implementation for the ARM architecture
If the protected resource can be accessed from an ISR, use the *_irq*
variants
Not recursive: acquiring the same lock twice causes deadlock

#include <linux/spinlock.h>
DEFINE_SPINLOCK(lock);   /* statically define and initialize */
...
spin_lock(&lock);        /* acquire the lock */
/* critical section */
spin_unlock(&lock);      /* release the lock */
...
23
Semaphores (kernel)
Use semaphores for synchronization between processes
There are also killable/interruptible/timeout/try API variants
Semaphores can sleep: cannot be used in interrupt handlers
They are relatively slow
Not recursive: acquiring the same semaphore twice causes deadlock
In practice, use down_interruptible(): down() ignores signals, so a task
blocked in it cannot be interrupted by one

#include <linux/semaphore.h>
struct semaphore sem;
sema_init(&sem, 1);    /* initialize the semaphore */
...
down(&sem);            /* acquire the semaphore */
/* critical section */
up(&sem);              /* release the semaphore */
...
24
Spinlocks in Interrupt Handlers
Potential for deadlock if a locked spinlock is acquired inside an interrupt
handler

    ...
    spin_lock(&lock);
    /* critical section */
    ...
        --- INTERRUPT ---
        /* Interrupt handler */
        ...
        spin_lock(&lock);    /* DEADLOCK: lock already held on this CPU */
        ...
    spin_unlock(&lock);      /* never reached */
    ...
25
Spinlocks in Interrupt Handlers
Use the appropriate API variant: spin_lock_irqsave()/spin_unlock_irqrestore()
They disable interrupts on the calling CPU whilst the lock is held
Use to synchronize interrupt handler and non-interrupt code
26
Multiple-reader Locking
Read/write semaphores
Same semantics as conventional semaphores, allow multiple readers
Read/write spinlocks
Same semantics as conventional spinlocks, allow multiple readers
Read-Copy-Update (RCU)
Specialist synchronization mechanism for when readers >> writers
Example: shared access to network routing tables: every outgoing packet
(process) needs to read the routing table. An entry is only reclaimed once
in-flight reads have completed, so removal can be deferred and reading can be
done without locking
Seqlocks
Fast, lockless, suitable when the critical region is very small
Example: A global variable is regularly updated with system time or event
counters, and many threads regularly sample this information
27
Thread Safety and Re-entrancy
Functions that can be used concurrently by several threads need to be
thread-safe and re-entrant
Re-entrant function
All data provided by the caller:
Function does not hold static (global) data over successive calls
Function does not return a pointer to static data
Does not call non-re-entrant functions
28
Explicit Task Ordering Using Priorities
Code written for a uni-processor system using real-time scheduling policies
may assume ordered execution of tasks based on their priority
Higher priority task runs to completion, or until it explicitly yields
Tasks’ priorities are used to guarantee execution order
The issue: On a multicore SMP system a task of lower priority may in fact
run concurrently, on another CPU
The expected execution order is no longer guaranteed
29
Solution 1: Use Task Affinity
Setting task affinity to a specific CPU and specifying SCHED_FIFO for that
process will ensure that implicit execution order of legacy code on SMP systems
is maintained.
#include <sched.h>
int sched_setaffinity(pid_t pid,
                      size_t cpusetsize,
                      const cpu_set_t *mask);
Disadvantages:
Reliance on a scheduler’s mechanism
Breaks the SMP paradigm and comes in the way of OS load balancing
30
Solution 2: Use Explicit Synchronization
Enforce serial execution using synchronization mechanisms
Semaphores, spinlocks, signals, completions, etc
The task that has to wait blocks on a lock
The task that has to signal releases the lock
Advantages
The programmer explicitly controls the execution flow
Performance is likely to be similar or better on multi-core vs single-core
... actually, other tasks can run on other cores, therefore multi-core wins!
Disadvantages
Need to manually serialize the code, and fine tuning may be required
Latency associated with specific locking mechanisms
31
Linux Kernel Memory Barriers
Barriers are needed to enforce ordering of memory operations
Aggressive compiler optimization
Processor micro-architecture optimizations
Multicore CPUs
Relaxed memory ordering systems
32
Barriers in Linux
Compiler barrier: barrier()
  Generic explicit compiler barrier
  Tells the compiler not to reorder memory accesses to/from either side of it

I/O barrier: mmiowb()
  To be used with memory-mapped I/O writes
  I/O memory is un-cached, but ordering behavior cannot be guaranteed
  Ensures ordering between CPU and I/O device

CPU memory barriers: mb(), wmb(), rmb(), read_barrier_depends()
  Used for ordering on Normal Cacheable memory
  Can be used to control MMIO effects on accesses through relaxed memory I/O windows
  They also imply a compiler barrier
  Relevant to both UP and SMP systems
  Examples of typical memory barrier uses:
  - after writing DMA buffers and before starting DMA transfers
  - when a device driver writes some state to memory and expects it to be
    visible in an interrupt routine which may arrive on a different CPU

SMP barriers: smp_mb(), smp_wmb(), smp_rmb(), smp_read_barrier_depends()
  Used to control the ordering of references to shared memory between CPUs
  within a cache-coherent SMP cluster
  A compiler barrier is also implied
  On UP systems (no CONFIG_SMP build option), all smp_*() barriers are
  reduced to compiler barriers
  smp_* barriers are weaker than the mandatory *mb() variants
33
Software Configurable Interrupts
On ARM multicore systems, interrupts are assigned to CPU0 by default
They can be re-assigned dynamically using Linux kernel APIs:
int irq_set_affinity(unsigned int irq, const struct cpumask *m);
34
Interrupt Load Balancing
Linux used to support in-kernel interrupt load balancing for some architectures
This has been removed from 2.6.29 (http://bit.ly/bYj4Op)
35
Migrating software to multi-core
36
Application software migration to multi-core
All modern Operating Systems support multicore, various models:
Symmetric multi-processing
Asymmetric multi-processing
Bound computational domains
Message passing architectures and distributed systems
37
Fundamentals of Explicit Parallelism
Code modified to split workloads across available CPUs
ANALYZE: analyze the UP application/problem to find parallelisable areas
40
Data Decomposition
Each data item is independent
Split a large quantity of DATA into smaller chunks that can be operated on in
parallel
[Diagram: the data is divided into tasks, each dispatched to its own CPU]
41
Data Decomposition
Example: digital camera, continuous shoot mode
[Diagram: each image in the data set is signal-processed on a different CPU]
42
Data Decomposition Granularity
Subdividing a data processing operation into several threads executing in
parallel on smaller chunks of data
Pixel by pixel – 1 pixel per thread
Line by line – 1 line per thread
Section by section – 1/n of picture per thread (n = number of PUs)
Individual images or group of images in a stream
[Diagram: the same picture partitioned four ways — per pixel, per line, per
1/n section, and per image in a stream]
43
Task Decomposition
Each task item is functionally independent
[Diagram: a stream of independent tasks is distributed across the available CPUs]
44
Functional Block Partitioning
Functional blocks are serially dependent
but temporally independent
45
Task Allocation Models
FORK-EXEC MODEL
Create a thread on demand
[Diagram: a master thread forks worker threads for a parallel region, then
joins them and continues as the master thread]

WORKER-POOL MODEL
Hand off work to a pool of worker threads
[Diagram: tasks A, B, C from a task queue are dispatched to a thread pool;
completed tasks are collected]
46
Typical Pitfalls of SMP Software
Synchronization overheads
Serialized regions caused by critical sections or rendezvous points
Dead-locks
Two or more threads each wait for another to release a resource
Such threads are blocked on a lock that will never be released
Live-lock
Multiple threads continue to run (without blocking indefinitely like in
the case of a dead-lock) but the system as a whole is unable to make
progress due to repeating patterns of non-productive contention
Priority Inversion
When a lower priority task acquires a lock on a resource and inhibits
a higher priority task from progressing
Addressed by priority inheritance and priority ceiling techniques
Cache thrashing and false sharing
49
Summary
51
Thank You
52