Embedded Systems Unit 8 Notes

Copyright
© Attribution Non-Commercial (BY-NC)
Embedded Computer: Memory System, Input/Output

Outline
Memory System
  Types of memory
  Caches
Input/Output
  Memory Mapped I/O
  Polling
  Interrupt

Ingo Sander [email protected]

September 4, 2007

IL2206 Embedded Systems

Memory System

The memory bottleneck
Most instructions in a RISC processor can execute in a single clock cycle
BUT
Access to the main memory (typically in SDRAM) is slow
If memory access time can be shortened, the system would perform considerably better

Memory Performance
Memory bandwidth
  Rate at which information can be transferred from the memory system
Latency
  Time between the following two instants:
    the instant at which the processor issues a request to the memory
    the instant at which the requested data arrives and is available for use by the processor

Memory Bandwidth
If R is the number of requests that the memory can serve simultaneously and L is the latency, then BW = R/L
Example:
  A 32-bit memory with a latency of 20 ns has a bandwidth BW = 32 bit / 20 ns = 1.6 Gbit/s = 200 MByte/s
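The bandwidth relation BW = R/L can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and 1 MByte is taken as 10^6 bytes.

```c
#include <assert.h>

/* Bandwidth in bit/s of a memory that delivers bits_per_access bits
   per access with a latency of latency_ns nanoseconds (BW = R / L). */
double bandwidth_bit_per_s(double bits_per_access, double latency_ns)
{
    assert(latency_ns > 0.0);
    return bits_per_access / (latency_ns * 1e-9);
}

/* The same bandwidth in MByte/s (assumption: 1 MByte = 10^6 bytes). */
double bandwidth_mbyte_per_s(double bits_per_access, double latency_ns)
{
    return bandwidth_bit_per_s(bits_per_access, latency_ns) / 8.0 / 1e6;
}
```

For the 32-bit memory with 20 ns latency, the first function gives about 1.6e9 bit/s and the second about 200 MByte/s.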


Types of memory
ROM (Read Only Memory)
  Mask-programmable
  Flash programmable (can be reprogrammed, but has long access times)
RAM (Random Access Memory)
  DRAM
  SRAM

SRAM vs. DRAM
SRAM (Static RAM)
  Faster
  Easier to integrate with logic
  Higher power consumption
DRAM (Dynamic RAM)
  Denser
  Must be refreshed

Synchronous DRAM
Clock signal is used internally to pipeline accesses
  Memory must be fast enough to respond to request
  Request takes multiple clock cycles
Provides burst mode access:
  1, 2, 4, 8 locations

Flash issues
Flash is programmed at system voltages
Erasure time is long
Must be erased in blocks
Limited number of erasures
A flash memory is very useful in combination with SRAM or SDRAM devices, since it can load these devices at power-on

Memory Access Times and Costs

Memory Technology   Typical Access Time            $ per GB in 2004
SRAM                0.5 ns - 5 ns                  $4000 - $10000
DRAM                50 ns - 70 ns                  $100 - $200
Magnetic disk       5,000,000 ns - 20,000,000 ns   $0.5 - $2

Source: Patterson and Hennessy, 2004

Embedded system memories
Large fast memories are very expensive
Embedded systems have to be produced at a low cost
  A single SRAM main memory is in general too expensive
  A combination of fast and slow memories is often still feasible

Caches
Large fast memories are too expensive, but small fast memories are feasible
A cache memory is a small but fast memory that is located near the CPU to reduce memory access times
Ideally the processor only needs to access the cache and not the main memory

Memory is a bottleneck
While the CPU is fast, each memory access takes a long time and slows down the system
Caches can increase the performance if most memory requests do not need to access the main memory
[Figure: without a cache, the CPU (fast) reaches the memory (very slow) over the bus (slow); with a cache, a cache (fast) is placed between the CPU and the bus]

Caches and CPUs
[Figure: the CPU exchanges addresses and data with the cache controller, which is connected to the cache and to the main memory]
2000 Wolf (Morgan Kaufman)

Cache operation
Many main memory locations are mapped onto one cache entry
May have caches for:
  instructions;
  data;
  data + instructions (unified).
Memory access time is no longer deterministic!
2000 Wolf (Morgan Kaufman)

Terms
Cache hit: required location is in cache.
Cache miss: required location is not in cache.
Working set: set of locations used by program in a time interval.
2000 Wolf (Morgan Kaufman)

Types of misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in working set map to same cache entry.
2000 Wolf (Morgan Kaufman)

Memory system performance
h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
  t_av = h * t_cache + (1 - h) * t_main
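The average access time formula translates directly into C. This is a minimal sketch; the function name and the example timings below are assumptions chosen for illustration, not values from the slides.

```c
#include <assert.h>

/* Average memory access time: t_av = h * t_cache + (1 - h) * t_main.
   All times must use the same unit (for example nanoseconds). */
double avg_access_time(double h, double t_cache, double t_main)
{
    assert(h >= 0.0 && h <= 1.0);   /* hit rate is a fraction */
    return h * t_cache + (1.0 - h) * t_main;
}
```

With an assumed cache access time of 5 ns and a main memory access time of 60 ns, a hit rate of 0.9 gives about 10.5 ns, and a hit rate of 0.99 gives about 5.55 ns. Small changes in the hit rate therefore have a large effect on the average.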

Write operations
Write-through: immediately copy write to main memory
  Causes unnecessary memory communication
  Memory always has a valid copy of the cache block
Write-back: write to main memory only when location is removed from cache
  Tries to minimize communication with memory
  Memory may have an invalid copy of the cache block; it must be updated when a cache block is replaced
2000 Wolf (Morgan Kaufman)

Replacement
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).
In a write-back cache, replacing a modified (dirty) cache entry also means writing its contents back to the memory. Thus a cache miss can be expensive!

Cache performance benefits
Keep frequently-accessed locations in fast cache.
Cache retrieves more than one word at a time.
  Sequential accesses are faster after first access.
2000 Wolf (Morgan Kaufman)

Data Transfer to Cache
Words are transferred between cache and processor
Blocks (of multiple words, given by the block size) are transferred between cache and memory
[Figure: word transfer between CPU and cache; block transfer between cache and main memory]

Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of N entries.

Direct-mapped cache
A direct-mapped cache consists of several cache lines, where each cache line has a status bit, a tag and data (cache block)
There is a given mapping for each memory location!
[Figure: cache lines 0 to 7, each with a status bit, a tag and a cache block of four words (Wd 0 to Wd 3), next to the memory blocks at addresses 0, 10, 20, ... up to FF0]

Example Direct Mapped Cache
Cache has 2 KBytes (512 words), organized as 64 cache lines with a block size of 8 words
Memory has 64 KBytes (16 KWords), which can be seen as 2048 blocks of 8 words
Address size is 16 bits
The direct-mapped technique uses the modulo (remainder) operation to map a memory block onto a cache block
  Block 0, 64, 128, ... is mapped onto Block 0 in the cache
  Block 1, 65, 129, ... is mapped onto Block 1 in the cache
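The modulo mapping of this example can be sketched in C by splitting a 16-bit byte address into its fields. The field widths follow from the example (4-byte words, 8 words per block, 64 cache lines, which leaves 5 tag bits); the type and function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of the example cache: 16-bit byte addresses,
   4-byte words, 8 words per block, 64 cache lines. */
enum {
    BYTE_BITS = 2,   /* byte offset within a word */
    WORD_BITS = 3,   /* word within a block       */
    LINE_BITS = 6,   /* cache line (block index)  */
    TAG_BITS  = 5    /* remaining upper bits      */
};

typedef struct {
    unsigned tag;
    unsigned line;
    unsigned word;
    unsigned byte;
} cache_addr;

/* Split a 16-bit address into its tag, line, word and byte fields. */
cache_addr split_address(uint16_t addr)
{
    cache_addr f;
    assert(BYTE_BITS + WORD_BITS + LINE_BITS + TAG_BITS == 16);
    f.byte = addr & ((1u << BYTE_BITS) - 1u);
    f.word = (addr >> BYTE_BITS) & ((1u << WORD_BITS) - 1u);
    f.line = (addr >> (BYTE_BITS + WORD_BITS)) & ((1u << LINE_BITS) - 1u);
    f.tag  = addr >> (BYTE_BITS + WORD_BITS + LINE_BITS);
    return f;
}

/* Memory block number of an address (32 bytes per block). */
unsigned block_number(uint16_t addr)
{
    return addr >> (BYTE_BITS + WORD_BITS);
}
```

The line field equals the block number modulo 64. Block 65, for example, starts at address 0x0820 and lands in cache line 1 with tag 1.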


Example Direct Mapped Cache
Memory address (16 bits): Tag (5 bits) | Block (6 bits) | Word (3 bits) | Byte offset (2 bits)
[Figure: main memory blocks 0 to 2047 (addresses 0x0000 to 0xFFE0, 8 words per block) map onto cache lines 0 to 63; blocks 0 and 64 share line 0, blocks 1 and 65 share line 1, blocks 63 and 127 share line 63. Each cache line holds a valid bit (1 bit), a tag (5 bits) and data (8 words)]

Direct-mapped cache
[Figure: the address is split into tag, index and offset; the index selects a cache line (valid bit, tag, data bytes); the stored tag is compared with the address tag to produce the hit signal, and the offset selects the byte (or halfword/word) value]

Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, ...
  Array b[] uses locations 1024, 1025, 1026, ...
  Operation a[i] + b[i] generates conflict misses.
2000 Wolf (Morgan Kaufman)
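Whether two addresses compete for the same cache line can be tested as sketched below. The geometry is passed in as parameters. The usage note assumes the 2 KByte example cache (32-byte blocks, 64 lines) and 4-byte words, which is an assumption made here and not part of Wolf's example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cache line that a byte address maps to in a direct-mapped cache. */
unsigned cache_line(uint32_t addr, unsigned block_bytes, unsigned num_lines)
{
    assert(block_bytes > 0 && num_lines > 0);
    return (addr / block_bytes) % num_lines;
}

/* True if two addresses lie in different memory blocks that map to the
   same cache line, so alternating accesses keep evicting each other. */
bool conflicts(uint32_t a, uint32_t b, unsigned block_bytes, unsigned num_lines)
{
    bool same_block = (a / block_bytes) == (b / block_bytes);
    return !same_block &&
           cache_line(a, block_bytes, num_lines) ==
           cache_line(b, block_bytes, num_lines);
}
```

Under these assumptions, word locations 0 and 1024 are byte addresses 0 and 4096. Both map to cache line 0, so `conflicts(0, 4096, 32, 64)` is true and a loop over a[i] + b[i] misses on every block.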

Example 2-way set-associative cache
Memory address (16 bits): Tag (6 bits) | Set (5 bits) | Offset (5 bits)
[Figure: main memory blocks (8 words each) map onto 32 sets (Set 0 to Set 31), each with two ways (Way 0 and Way 1); each way holds a valid bit (1 bit), a tag (6 bits) and data (8 words)]

Set-Associative Caches
[Figure: a cache with eight blocks in different organizations; every element consists of a tag and data]
One-way set associative (direct-mapped): blocks (sets) 0 to 7, 1 element per set
Two-way set associative: sets 0 to 3, 2 elements per set
Eight-way set associative (fully associative): 8 elements per set

Fully associative cache
There is complete freedom where to place a block in the cache
But all blocks have to be searched for the correct tag pattern
In order to have acceptable performance, the tags must be searched in parallel

Example caches
StrongARM:
  16 Kbyte, 32-way, 32-byte block instruction cache.
  16 Kbyte, 32-way, 32-byte block data cache (write-back).
Nios II
  512 Bytes to 64 KBytes direct-mapped I- and D-cache with a cache block size of 4 (D), 16 (D) or 32 (I&D) Bytes

Summary Memory Systems
Memory is a bottleneck in the system
Different memories exist
  Cost increases with memory performance
A cache memory can significantly decrease execution time at low cost
  Execution time is very hard to predict
    Problem for design of real-time systems
  Locality is important to utilize caches efficiently
There can be several levels of different caches
  Embedded systems usually have only one cache level

Input/Output

Input and Output Devices
Input/Output devices are used to communicate with the environment
An example is a UART (Universal Asynchronous Receiver/Transmitter)
These devices (like other peripheral devices) are controlled by reading and writing to registers
[Figure: an I/O device with a status register, a mode register and a data register; it is connected to the processor through register select, data bus and control signals, and to the environment through input and output lines]

Serial communication
Characters are transmitted separately
[Figure: signal over time for one character: no char, start bit, bit 0, bit 1, ..., bit n-1, stop]
2000 Morgan Kaufman (Wayne Wolf)

Universal Asynchronous Receiver/Transmitter (UART)
Component for serial to parallel conversion
Has a serial receiver/transmitter
Many parameters can be configured
  Baud rate
  Number of bits per character
  Parity bits
  Length of stop bit
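How the UART parameters determine the time needed for one character can be sketched as follows. The function names are made up for illustration, and the sketch assumes a whole number of stop bits and one bit per baud.

```c
#include <assert.h>

/* Bits on the line for one character: one start bit, the data bits,
   an optional parity bit and the stop bits. */
unsigned frame_bits(unsigned data_bits, unsigned parity_bits, unsigned stop_bits)
{
    assert(parity_bits <= 1);
    return 1u + data_bits + parity_bits + stop_bits;
}

/* Time in seconds to transmit one character at the given baud rate
   (assumption: one bit per symbol). */
double char_time_s(unsigned baud, unsigned data_bits,
                   unsigned parity_bits, unsigned stop_bits)
{
    assert(baud > 0);
    return (double)frame_bits(data_bits, parity_bits, stop_bits) / baud;
}
```

With 8 data bits, no parity and one stop bit, a character occupies 10 bit times, which is roughly 1.04 ms at 9600 baud.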

Memory-Mapped I/O
Peripheral components can be connected to the processor by memory-mapped I/O
The components are reached through a reserved range of the processor's address space
Memory-mapped I/O requires extra hardware for address decoding
The chip-enable output has to be active when the input of the decoder is a correct address
Other address bits are used for register select
The decoder can be implemented with a small block of programmable logic or custom hardware (VHDL)
[Figure: the CPU address bus feeds the decoder, which drives the chip enable of the peripheral; the remaining address bits go to the register select; the read/write signal and the data bus connect CPU and peripheral; the peripheral provides the interface to the environment]
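What the address decoder computes can be modelled in software. The sketch below is only a model of the logic, not hardware code. It assumes a hypothetical device with eight registers at base address 0x1000 on a 32-bit address bus, and the names `DEVICE_BASE`, `chip_enable` and `register_select` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEVICE_BASE  0x00001000u  /* assumed base address of the device */
#define REG_SEL_BITS 3u           /* 2^3 = 8 registers in the device    */

/* Chip enable: active when all address bits above the register-select
   bits match the corresponding bits of the base address. */
bool chip_enable(uint32_t addr)
{
    return (addr >> REG_SEL_BITS) == (DEVICE_BASE >> REG_SEL_BITS);
}

/* Register select: the low address bits pick one of the registers. */
unsigned register_select(uint32_t addr)
{
    assert(chip_enable(addr));
    return addr & ((1u << REG_SEL_BITS) - 1u);
}
```

For the address 0x1002, the chip enable is active and register 2 is selected. The address 0x1008 is outside the device, so the chip enable stays inactive.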

Example Memory-Mapped I/O
A device with eight 8-bit registers shall be connected to the address 0x1000
[Figure: address bus (ADR31-ADR0) and data bus (D31-D0). ADR31-ADR3 feed the decoder, which drives the chip enable (CE) of the device. ADR2, ADR1 and ADR0 drive the register select inputs RS2, RS1 and RS0 of the registers R0 ... R7. The device uses the data lines D7-D0. For the address 0x00001002: RS2 RS1 RS0 = 0 1 0 and CE = 1]
CE is active when ADR12 = 1 and all other decoder inputs are 0!
The registers can now be accessed in the address range 0x1000 (R0) to 0x1007 (R7)
  movia r1, 0x1002
  movi r3, 0x08
  stb r3, 0(r1)    # sets bit 3 and clears all other bits of device register R2

Accessing Memory Locations in C
Symbolic names can be defined for memory locations
  #define MEM_LOCATION 0x18
Functions can be defined to access memory
  peek can be used to read a memory location (byte)
    char peek(volatile char *location) { return *location; }
  poke can be used to write to a memory location (byte)
    void poke(volatile char *location, char newval) { *location = newval; }
Don't do this!

Memory locations shouldn't be accessed directly!
Software shall be flexible
  Hardware could change
Programmers may make mistakes that the compiler would not make (e.g. memory alignment)
HAL (Hardware Abstraction Layer) offers optimized device drivers to access peripheral devices and memory

Busy Wait I/O
Busy Wait I/O is the most basic way to communicate with an I/O device
The processor waits until the I/O device has completed its current task
Disadvantage: the processor cannot be used for other tasks during the waiting period!
This method is also often called polling!
Example: sending a string via a serial link
Busy Wait I/O pseudo code:
  Characters = String;
  While not all characters sent
    Send next character;
    While Sender = Busy
      Wait;
  Done!

C-Programming Testing of Bits
In order to test specific bits, the other bits have to be masked
Example: Busy Flag (BF): Busy = 1; Non-Busy = 0
[Figure: Status Sender register at 0x1000 (bits 7 to 0) with the busy flag BF at bit 5; Sender Buffer register at 0x1001]

  #define Status  ((char *) 0x1000)
  #define SendBuf ((char *) 0x1001)
  char *myString = "Hello World";
  char *current_char;

Here you should use HAL functions!
  current_char = myString;
  while (*current_char != '\0') {
      poke(SendBuf, *current_char++);
      while ((peek(Status) & 0x20) != 0)   /* wait while the busy flag (bit 5) is set */
          ;
  }
  /* Mask needed, since other bits in status register may not be zero */

Simultaneous busy/wait input and output
Example: copying characters from input to output
Busy Wait I/O pseudo code:
  Loop
    While inBuffer busy
      Wait;
    Read Character
    Copy Character to Output Buffer
    Send Character
    While outBuffer busy
      Wait;
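One iteration of the copy loop could look as follows in C. The register layout, the meaning of the busy bits and all names are assumptions made for this sketch. On real hardware the structure would be placed at the device's address, and HAL functions would be preferable.

```c
#include <stdint.h>

/* Hypothetical register layout; names and bit positions are assumptions. */
typedef struct {
    volatile uint8_t in_status;   /* bit 5 set: no new character yet   */
    volatile uint8_t in_data;     /* received character                */
    volatile uint8_t out_status;  /* bit 5 set: transmitter still busy */
    volatile uint8_t out_data;    /* character to send                 */
} io_regs;

#define BUSY_FLAG 0x20u  /* bit 5 */

/* One iteration of the copy loop: wait for input, copy, wait for output. */
uint8_t copy_one_char(io_regs *dev)
{
    while (dev->in_status & BUSY_FLAG)
        ;                           /* busy wait for a character */
    uint8_t c = dev->in_data;       /* read character            */
    dev->out_data = c;              /* send character            */
    while (dev->out_status & BUSY_FLAG)
        ;                           /* busy wait for the sender  */
    return c;
}
```

The `volatile` qualifier keeps the compiler from optimizing away the repeated reads of the status registers. While the processor waits in the second loop, it cannot notice new input, which is why simultaneous I/O is hard with busy wait.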

Interrupt I/O
Busy/wait is very inefficient.
  CPU can't do other work while testing device.
  Hard to do simultaneous I/O.
Interrupts allow a device to change the flow of control in the CPU.
  Causes subroutine call to handle device.
2000 Wolf (Morgan Kaufman)

Interrupt Scheme
[Figure: the device sends an interrupt request to the CPU; the CPU answers with an interrupt acknowledge; CPU and device are also connected by data/address lines]

Interrupt physical interface
CPU and device are connected by CPU bus
CPU and device handshake:
  device asserts interrupt request;
  CPU asserts interrupt acknowledge when it can handle the interrupt.
2000 Wolf (Morgan Kaufman)

Interrupt behavior
Based on subroutine call mechanism
Interrupt forces next instruction to be a subroutine call to a predetermined location
  Return address is saved to resume executing foreground program
2000 Wolf (Morgan Kaufman)

Programming Interrupt
[Figure: control flow between foreground program, interrupt vector and interrupt handler]
Foreground Program
  Do something
  Interrupt Event
Interrupt Vector
  Branch to Interrupt Handler
Interrupt Handler
  Save Registers
  Handle Interrupt
  Restore Registers
  Restore PC
  Clear interrupt disable flag

Receive-Send with Polling
Assume a program that, as part of its duties, receives characters and forwards them to another device
Solution with polling:
  loop
    Wait for new character;
    Do something;
    Send character;
  end loop;
System cannot do anything while it waits for a new character or until the sender is ready
System resources are utilized very inefficiently!

Better Receive-Send Implementation with Interrupt
Parallelization of duties
  Wait for new character (interrupt)
    If a character is received, it is stored in a buffer
  Do something (foreground program)
    Work with the stored buffer elements
  Send character if transmitter ready (interrupt)
    Check if the transmitter is ready and send the first character of the buffer
System can do other things while waiting for receiver or sender
Buffer is needed to store elements
Size of buffer must be chosen carefully
  too small => buffer overflow
  too large => too expensive design

Typical Embedded Design Problems
Embedded systems are inherently parallel (concurrent), since they interact with a heterogeneous environment
  Parallelization allows for faster processing, since work can be done in parallel
  Waiting times can be avoided
The need for buffers is a logical consequence of parallelization
System designer needs to find the right amount of parallelization and the right buffer size!

Send-Receive with Circular Buffer (Wolf)
Independent receive and send, realized by two interrupt routines
  Receive-interrupt routine
    Puts a character into the queue
  Send-interrupt routine
    Sends a character when the sender is ready
[Figure: circular buffer with head and tail pointers]

Send-Receive with Circular Buffer (Wolf)
A circular buffer can be realised in a memory with a pointer for head and tail
If a pointer is at the end of the buffer, the next position is the start of the buffer
[Figure: circular buffer holding the elements f, g, h, i]
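A circular buffer with head and tail indices can be sketched in C as follows. This is not Wolf's code. The buffer size, the names and the convention that head marks the oldest element while tail marks the next free slot are assumptions, and one slot is left unused so that a full buffer can be told apart from an empty one.

```c
#include <stdbool.h>

#define BUF_SIZE 8  /* assumed capacity; one slot stays unused */

typedef struct {
    char data[BUF_SIZE];
    unsigned head;  /* index of the oldest element (next to send) */
    unsigned tail;  /* index of the next free slot (next to fill) */
} ring_buffer;

bool rb_empty(const ring_buffer *rb)
{
    return rb->head == rb->tail;
}

bool rb_full(const ring_buffer *rb)
{
    return (rb->tail + 1) % BUF_SIZE == rb->head;
}

/* Called by the (hypothetical) receive routine; false on overflow. */
bool rb_put(ring_buffer *rb, char c)
{
    if (rb_full(rb))
        return false;
    rb->data[rb->tail] = c;
    rb->tail = (rb->tail + 1) % BUF_SIZE;  /* wrap around at the end */
    return true;
}

/* Called by the (hypothetical) send routine; false if nothing to send. */
bool rb_get(ring_buffer *rb, char *c)
{
    if (rb_empty(rb))
        return false;
    *c = rb->data[rb->head];
    rb->head = (rb->head + 1) % BUF_SIZE;  /* wrap around at the end */
    return true;
}
```

In real interrupt code the indices would additionally have to be declared `volatile`, and accesses shared between interrupt routines and the foreground program would need protection, for example by briefly disabling interrupts.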

Send-Receive sequence diagram (Wolf)
[Figure: sequence diagram with the lifelines :foreground, :input, :output and :queue; the queue contents change over time: empty, a, empty, b, bc, c]
2000 Wolf (Morgan Kaufman)

Debugging interrupt code
What if you forget to save and restore registers?
  Foreground program can exhibit mysterious bugs
  Bugs will be hard to repeat; they depend on interrupt timing
It is difficult to debug an interrupt routine!
2000 Wolf (Morgan Kaufman)

Prioritized Interrupts
Some CPUs (such as Nios II) support several interrupt levels in hardware
Otherwise extra hardware (priority decoder) can be used to create several levels of interrupt

Interrupt prioritization
Masking: an interrupt with priority lower than the current priority is not recognized until the pending interrupt is complete.
Non-maskable interrupt (NMI): highest priority, never masked.
  Often used for power-down.
2000 Wolf (Morgan Kaufman)

Example: Prioritized I/O
[Figure: sequence diagram with the lifelines :interrupts, :foreground, :A, :B and :C; the interrupts arrive in the order B, C, A, then A and B together]
2000 Wolf (Morgan Kaufman)

Sources of interrupt overhead
Handler execution time
Interrupt mechanism overhead
Register save/restore
Pipeline-related penalties
Cache-related penalties

Summary
Peripherals can be made accessible to software by memory-mapped I/O
Two basic approaches for communication with I/O devices
  polling: processor checks if data has arrived
  interrupt: processor is notified when data has arrived
Interrupt is not always better than polling!