Direct memory access
Direct memory access (DMA) is a feature of computer systems that allows certain hardware
subsystems to access main system memory independently of the central processing unit (CPU).[1]
Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the
entire duration of the read or write operation, and is thus unavailable to perform other work. With
DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in
progress, and it finally receives an interrupt from the DMA controller (DMAC) when the operation is
done. This feature is useful at any time that the CPU cannot keep up with the rate of data transfer, or
when the CPU needs to perform work while waiting for a relatively slow I/O data transfer.
Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards
and sound cards. DMA is also used for intra-chip data transfer in some multi-core processors.
Computers that have DMA channels can transfer data to and from devices with much less CPU
overhead than computers without DMA channels. Similarly, a processing element inside a multi-core
processor can transfer data to and from its local memory without occupying its processor time,
allowing computation and data transfer to proceed in parallel.
DMA can also be used for "memory to memory" copying or moving of data within memory. DMA can
offload expensive memory operations, such as large copies or scatter-gather operations, from the
CPU to a dedicated DMA engine. An implementation example is the I/O Acceleration Technology.
DMA is of interest in network-on-chip and in-memory computing architectures.
Principles
Third-party
Motherboard of a NeXTcube computer (1990). The two large integrated circuits below the middle of
the image are the DMA controller (left) and, unusually, an extra dedicated DMA controller (right) for
the magneto-optical disc that was used instead of a hard disk drive in the first series of this
computer model.
Standard DMA, also called third-party DMA, uses a DMA controller. A DMA controller can generate
memory addresses and initiate memory read or write cycles. It contains several hardware registers
that can be written and read by the CPU. These include a memory address register, a byte count
register, and one or more control registers. Depending on what features the DMA controller
provides, these control registers might specify some combination of the source, the destination, the
direction of the transfer (reading from the I/O device or writing to the I/O device), the size of the
transfer unit, and/or the number of bytes to transfer in one burst.[2]
To carry out an input, output or memory-to-memory operation, the host processor initializes the
DMA controller with a count of the number of words to transfer, and the memory address to use.
The CPU then commands the peripheral device to initiate a data transfer. The DMA controller then
provides addresses and read/write control lines to the system memory. Each time a byte of data is
ready to be transferred between the peripheral device and memory, the DMA controller increments
its internal address register until the full block of data is transferred.
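For illustration, this programming model might look like the following C sketch for a hypothetical memory-mapped third-party DMA controller. The register layout, base address, and bit definitions (dma_regs_t, DMA_BASE, DMA_CTRL_*) are invented for the example and do not correspond to any particular device.

    #include <stdint.h>

    /* Hypothetical register block of a simple third-party DMA controller.
     * The layout and bit values are illustrative only. */
    typedef struct {
        volatile uint32_t src_addr;   /* memory address register (source)      */
        volatile uint32_t dst_addr;   /* memory address register (destination) */
        volatile uint32_t count;      /* byte count register                   */
        volatile uint32_t control;    /* control/status register               */
    } dma_regs_t;

    #define DMA_BASE         ((dma_regs_t *)0x40001000u) /* assumed MMIO address */
    #define DMA_CTRL_START   (1u << 0)   /* begin the transfer                   */
    #define DMA_CTRL_MEM2DEV (1u << 1)   /* direction: memory -> I/O device      */
    #define DMA_CTRL_IRQ_EN  (1u << 2)   /* raise an interrupt on completion     */
    #define DMA_CTRL_BUSY    (1u << 31)  /* controller still transferring        */

    /* Program one block transfer from a memory buffer to a device.
     * After this returns, the CPU is free to do other work; the DMA
     * controller signals completion with an interrupt. */
    static void dma_start_mem_to_device(const void *buf, uint32_t dev_addr,
                                        uint32_t nbytes)
    {
        dma_regs_t *dma = DMA_BASE;

        dma->src_addr = (uint32_t)(uintptr_t)buf;  /* where to read from */
        dma->dst_addr = dev_addr;                  /* where to write to  */
        dma->count    = nbytes;                    /* how many bytes     */
        dma->control  = DMA_CTRL_MEM2DEV | DMA_CTRL_IRQ_EN | DMA_CTRL_START;
    }

An interrupt handler would typically check that a busy/status bit such as DMA_CTRL_BUSY has cleared and then notify the software that requested the transfer.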
Some examples of buses using third-party DMA are PATA, USB (before USB4), and SATA; however,
their host controllers use bus mastering.
Bus mastering
In a bus mastering system, also known as a first-party DMA system, the CPU and peripherals can
each be granted control of the memory bus. Where a peripheral can become a bus master, it can
directly write to system memory without the involvement of the CPU, providing memory address and
control signals as required. Some measures must be provided to put the processor into a hold
condition so that bus contention does not occur.
Modes of operation
Burst mode
In burst mode, an entire block of data is transferred in one contiguous sequence. Once the DMA
controller is granted access to the system bus by the CPU, it transfers all bytes of data in the data
block before releasing control of the system buses back to the CPU; this renders the CPU inactive for
relatively long periods of time. The mode is also called "Block Transfer Mode".
Cycle stealing mode
The cycle stealing mode is used in systems in which the CPU should not be disabled for the length
of time needed for burst transfer modes. In the cycle stealing mode, the DMA controller obtains
access to the system bus the same way as in burst mode, using BR (Bus Request) and BG (Bus
Grant) signals, which are the two signals controlling the interface between the CPU and the DMA
controller. However, in cycle stealing mode, after one unit of data transfer, the control of the system
bus is deasserted to the CPU via BG. It is then continually requested again via BR, transferring one
unit of data per request, until the entire block of data has been transferred.[3] By continually
obtaining and releasing the control of the system bus, the DMA controller essentially interleaves
instruction and data transfers. The CPU processes an instruction, then the DMA controller transfers
one data value, and so on. Data is not transferred as quickly, but the CPU is not idled for as long as in
burst mode. Cycle stealing mode is useful for controllers that monitor data in real time.
Transparent mode
Transparent mode takes the most time to transfer a block of data, yet it is also the most efficient
mode in terms of overall system performance. In transparent mode, the DMA controller transfers
data only when the CPU is performing operations that do not use the system buses. The primary
advantage of transparent mode is that the CPU never stops executing its programs and the DMA
transfer is free in terms of time, while the disadvantage is that the hardware needs to determine
when the CPU is not using the system buses, which can be complex. This is also called "Hidden
DMA data transfer mode".
Cache coherency
DMA can lead to cache coherency problems. Imagine a CPU equipped with a cache and an external
memory that can be accessed directly by devices using DMA. When the CPU accesses location X in
the memory, the current value will be stored in the cache. Subsequent operations on X will update
the cached copy of X, but not the external memory version of X, assuming a write-back cache. If the
cache is not flushed to the memory before the next time a device tries to access X, the device will
receive a stale value of X.
Similarly, if the cached copy of X is not invalidated when a device writes a new value to the memory,
then the CPU will operate on a stale value of X.
This issue can be addressed in one of two ways in system design: Cache-coherent systems
implement a method in hardware, called bus snooping, whereby external writes are signaled to the
cache controller which then performs a cache invalidation for DMA writes or cache flush for DMA
reads. Non-coherent systems leave this to software, where the OS must then ensure that the cache
lines are flushed before an outgoing DMA transfer is started and invalidated before a memory range
affected by an incoming DMA transfer is accessed. The OS must make sure that the memory range
is not accessed by any running threads in the meantime. The latter approach introduces some
overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line
individually.
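In a non-coherent system, the driver-side discipline described above might look like the following sketch. The cache-maintenance and transfer primitives (cache_clean_range, cache_invalidate_range, dma_to_device, dma_from_device, dma_wait_complete) stand in for whatever the platform actually provides; they are assumptions, not a real API.

    #include <stddef.h>

    /* Assumed platform-specific cache maintenance primitives.
     * "Clean" writes dirty lines back to memory; "invalidate" discards lines. */
    void cache_clean_range(const void *addr, size_t len);
    void cache_invalidate_range(const void *addr, size_t len);

    /* Assumed hardware hooks for starting a transfer and waiting for it. */
    void dma_to_device(const void *buf, size_t len);
    void dma_from_device(void *buf, size_t len);
    void dma_wait_complete(void);

    /* Outgoing transfer: flush dirty cache lines so the device reads the
     * data the CPU most recently wrote. */
    void send_buffer(const void *buf, size_t len)
    {
        cache_clean_range(buf, len);      /* write back before the device reads */
        dma_to_device(buf, len);
        dma_wait_complete();
    }

    /* Incoming transfer: invalidate the range so later CPU reads fetch the
     * freshly DMA-written data instead of stale cached copies. */
    void receive_buffer(void *buf, size_t len)
    {
        cache_invalidate_range(buf, len); /* discard stale lines */
        dma_from_device(buf, len);
        dma_wait_complete();
        /* Some CPUs can speculatively refill lines during the transfer, so an
         * additional invalidate after completion is often required as well. */
        cache_invalidate_range(buf, len);
    }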
Hybrids also exist, where the secondary L2 cache is coherent while the L1 cache (typically on-CPU)
is managed by software.
Examples
ISA
In the original IBM PC (and the follow-up PC/XT), there was only one Intel 8237 DMA controller
capable of providing four DMA channels (numbered 0–3). These DMA channels performed 8-bit
transfers (as the 8237 was an 8-bit device, ideally matched to the PC's i8088 CPU/bus architecture),
could only address the first (i8086/8088-standard) megabyte of RAM, and were limited to
addressing single 64 kB segments within that space (although the source and destination channels
could address different segments). Additionally, the controller could only be used for transfers to,
from or between expansion bus I/O devices, as the 8237 could only perform memory-to-memory
transfers using channels 0 & 1, of which channel 0 in the PC (& XT) was dedicated to dynamic
memory refresh. This prevented it from being used as a general-purpose "Blitter", and consequently
block memory moves in the PC, limited by the general PIO speed of the CPU, were very slow.
With the IBM PC/AT, the enhanced AT bus (more familiarly retronymed as the Industry Standard
Architecture (ISA)) added a second 8237 DMA controller to provide three additional channels (5–7;
channel 4 is used as a cascade to the first 8237). These extra channels were much needed, as the
resource clashes caused by the XT's greater expandability over the original PC had shown. The page
register was also rewired to address the full 16 MB memory address space of the 80286 CPU. This second controller
was also integrated in a way capable of performing 16-bit transfers when an I/O device is used as
the data source and/or destination (as it actually only processes data itself for memory-to-memory
transfers, otherwise simply controlling the data flow between other parts of the 16-bit system,
making its own data bus width relatively immaterial), doubling data throughput when the upper
three channels are used. For compatibility, the lower four DMA channels were still limited to 8-bit
transfers only, and whilst memory-to-memory transfers were now technically possible due to the
freeing up of channel 0 from having to handle DRAM refresh, from a practical standpoint they were
of limited value because of the controller's consequent low throughput compared to what the CPU
could now achieve (i.e., a 16-bit, more optimised 80286 running at a minimum of 6 MHz, vs an 8-bit
controller locked at 4.77 MHz). In both cases, the 64 kB segment boundary issue remained, with
individual transfers unable to cross segments (instead "wrapping around" to the start of the same
segment) even in 16-bit mode, although this was in practice more a problem of programming
complexity than performance as the continued need for DRAM refresh (however handled) to
monopolise the bus approximately every 15 μs prevented use of large (and fast, but uninterruptible)
block transfers.
Due to their lagging performance (1.6 MB/s maximum 8-bit transfer capability at 5 MHz,[4] but no
more than 0.9 MB/s in the PC/XT and 1.6 MB/s for 16-bit transfers in the AT due to ISA bus
overheads and other interference such as memory refresh interruptions[1]) and unavailability of any
speed grades that would allow installation of direct replacements operating at speeds higher than
the original PC's standard 4.77 MHz clock, these devices have been effectively obsolete since the
late 1980s. Particularly, the advent of the 80386 processor in 1985 and its capacity for 32-bit
transfers (although great improvements in the efficiency of address calculation and block memory
moves in Intel CPUs after the 80186 meant that PIO transfers even by the 16-bit-bus 286 and 386SX
could still easily outstrip the 8237), as well as the development of further evolutions to (EISA) or
replacements for (MCA, VLB and PCI) the "ISA" bus with their own much higher-performance DMA
subsystems (up to a maximum of 33 MB/s for EISA, 40 MB/s MCA, typically 133 MB/s VLB/PCI)
made the original DMA controllers seem more of a performance millstone than a booster. They
were retained only to the extent required to support built-in legacy PC hardware on later
machines. The pieces of legacy hardware that continued to use ISA DMA after 32-bit expansion
buses became common were Sound Blaster cards that needed to maintain full hardware
compatibility with the Sound Blaster standard; and Super I/O devices on motherboards that often
integrated a built-in floppy disk controller, an IrDA infrared controller when FIR (fast infrared) mode
is selected, and an IEEE 1284 parallel port controller when ECP mode is selected. In cases where
original 8237s or direct compatibles were still used, transfers to or from these devices may still be
limited to the first 16 MB of main RAM regardless of the system's actual address space or amount
of installed memory.
Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To
initiate a data transfer the device driver sets up the DMA channel's address and count registers
together with the direction of the data transfer, read or write. It then instructs the DMA hardware to
begin the transfer. When the transfer is complete, the device interrupts the CPU.
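On the classic PC, such a transfer is set up by writing the 8237's I/O ports directly. The sketch below programs channel 2 (traditionally the floppy disk controller) for a device-to-memory transfer; the port numbers and mode byte follow the conventional PC assignments, and the code assumes x86 port I/O is available (e.g. in a kernel, or via ioperm on Linux).

    #include <stdint.h>

    /* x86 port output; requires I/O privilege (ring 0, or ioperm() on Linux). */
    static inline void outb(uint8_t val, uint16_t port)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    /* Program ISA DMA channel 2 (first 8237) to receive 'count' bytes into a
     * physical buffer that must lie below 16 MB and must not cross a 64 KB
     * boundary. */
    static void isa_dma2_read(uint32_t phys_addr, uint16_t count)
    {
        uint16_t len = count - 1;             /* the 8237 counts transfers - 1 */

        outb(0x04 | 2, 0x0A);                 /* mask (disable) channel 2      */
        outb(0x00, 0x0C);                     /* clear byte-pointer flip-flop  */
        outb(0x46, 0x0B);                     /* mode: single, increment,
                                                 write to memory, channel 2    */
        outb(phys_addr & 0xFF, 0x04);         /* address low byte              */
        outb((phys_addr >> 8) & 0xFF, 0x04);  /* address high byte             */
        outb((phys_addr >> 16) & 0xFF, 0x81); /* page register for channel 2   */
        outb(len & 0xFF, 0x05);               /* count low byte                */
        outb((len >> 8) & 0xFF, 0x05);        /* count high byte               */
        outb(2, 0x0A);                        /* unmask (enable) channel 2     */
    }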
Scatter-gather or vectored I/O DMA allows the transfer of data to and from multiple memory areas
in a single DMA transaction. It is equivalent to the chaining together of multiple simple DMA
requests. The motivation is to off-load multiple input/output interrupt and data copy tasks from the
CPU.
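A common way to express such a chained request is a linked list of descriptors in memory that the DMA engine walks on its own. The exact descriptor format is device-specific; the C sketch below is purely illustrative, with invented field names and flag bits.

    #include <stdint.h>

    /* Illustrative scatter-gather descriptor; real engines define their own
     * layout, alignment, and flag bits. */
    struct sg_descriptor {
        uint64_t buffer_addr;   /* physical address of this fragment           */
        uint32_t length;        /* number of bytes in this fragment            */
        uint32_t flags;         /* e.g. "last descriptor", "interrupt here"    */
        uint64_t next_desc;     /* physical address of the next descriptor,
                                   or 0 to terminate the chain                 */
    };

    #define SG_FLAG_LAST       (1u << 0)   /* end of the chain                 */
    #define SG_FLAG_IRQ_ON_END (1u << 1)   /* interrupt when this one is done  */

    /* Link an array of descriptors into a chain the engine can follow.
     * desc_phys[i] is the physical address of descs[i]. */
    static void sg_build_chain(struct sg_descriptor *descs,
                               const uint64_t *desc_phys, int n)
    {
        for (int i = 0; i < n; i++) {
            descs[i].flags = 0;
            descs[i].next_desc = (i + 1 < n) ? desc_phys[i + 1] : 0;
        }
        descs[n - 1].flags = SG_FLAG_LAST | SG_FLAG_IRQ_ON_END;
    }

The driver would then write the physical address of the first descriptor into the engine's chain-start register and let the hardware process the whole list, raising a single interrupt at the end.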
DRQ stands for Data request; DACK for Data acknowledge. These symbols, seen on hardware
schematics of computer systems with DMA functionality, represent electronic signaling lines
between the CPU and DMA controller. Each DMA channel has one Request and one Acknowledge
line. A device that uses DMA must be configured to use both lines of the assigned DMA channel.
The standard ISA DMA channel assignments are:
0. DRAM refresh on the original PC and XT (freed for other uses on the AT and later)
1. User hardware, usually a sound card (8-bit DMA)
2. Floppy disk controller
3. WDMA for the hard disk controller (replaced by UDMA modes), the parallel port (ECP-capable port), or certain Sound Blaster clones such as the OPTi 928
4. Cascade to the first 8237 DMA controller
5. Hard disk controller (PS/2 only), or user hardware, usually an ISA sound card
6. User hardware
7. User hardware
PCI
Unlike ISA, the PCI architecture has no central DMA controller. Instead, a PCI device can request
control of the bus ("become the bus master") and request to read from and write to system memory.
More precisely, a PCI component requests bus ownership from the PCI bus controller (usually the PCI
host bridge or a PCI-to-PCI bridge[6]), which will arbitrate if several devices request bus ownership
simultaneously, since there can only be one bus master at one time. When the component is
granted ownership, it will issue normal read and write commands on the PCI bus, which will be
claimed by the PCI bus controller.
As an example, on an Intel Core-based PC, the southbridge will forward the transactions to the
memory controller (which is integrated on the CPU die) using DMI, which will in turn convert them to
DDR operations and send them out on the memory bus. As a result, there are quite a number of
steps involved in a PCI DMA transfer; however, that poses little problem, since the PCI device or the
PCI bus itself is an order of magnitude slower than the rest of the components (see list of device
bandwidths).
A modern x86 CPU may use more than 4 GB of memory, either by utilizing the native 64-bit mode of an
x86-64 CPU or the Physical Address Extension (PAE), a 36-bit addressing mode. In such a case, a
device using DMA with a 32-bit address bus is unable to address memory above the 4 GB line. The
Double Address Cycle (DAC) mechanism, if implemented on both the PCI bus and the device
itself,[7] enables 64-bit DMA addressing. Otherwise, the operating system would need to work
around the problem by either using costly double buffers (DOS/Windows nomenclature) also known
as bounce buffers (FreeBSD/Linux), or it could use an IOMMU to provide address translation
services if one is present.
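In Linux driver terms, this decision is usually made by the core DMA-mapping layer once the driver declares what the device can address. The sketch below assumes the standard dma_set_mask_and_coherent() interface; my_dev_enable() is a hypothetical helper called from a PCI probe routine, not a real kernel function.

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    /* If the device can drive 64-bit addresses (e.g. via DAC), say so;
     * otherwise fall back to a 32-bit mask and let the kernel use bounce
     * buffers or an IOMMU as needed. */
    static int my_dev_enable(struct pci_dev *pdev)
    {
        if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) == 0)
            return 0;                  /* full 64-bit DMA addressing */

        /* Device limited to 32-bit addresses: buffers above 4 GB will be
         * bounced or remapped transparently by the DMA layer / IOMMU. */
        return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
    }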
I/OAT
As an example of a DMA engine incorporated alongside a general-purpose CPU, some Intel Xeon chipsets
include a DMA engine called I/O Acceleration Technology (I/OAT), which can offload memory
copying from the main CPU, freeing it to do other work.[8] In 2006, Intel's Linux kernel developer
Andrew Grover performed benchmarks using I/OAT to offload network traffic copies and found no
more than 10% improvement in CPU utilization with receiving workloads.[9]
DDIO
Further performance-oriented enhancements to the DMA mechanism have been introduced in Intel
Xeon E5 processors with their Data Direct I/O (DDIO) feature, allowing the DMA "windows" to reside
within CPU caches instead of system RAM. As a result, CPU caches are used as the primary source
and destination for I/O, allowing network interface controllers (NICs) to DMA directly to the Last
level cache (L3 cache) of local CPUs and avoid costly fetching of the I/O data from system RAM. As
a result, DDIO reduces the overall I/O processing latency, allows processing of the I/O to be
performed entirely in-cache, prevents the available RAM bandwidth/latency from becoming a
performance bottleneck, and may lower the power consumption by allowing RAM to remain longer
in low-powered state.[10][11][12][13]
AHB
In systems-on-a-chip and embedded systems, typical system bus infrastructure is a complex on-
chip bus such as AMBA High-performance Bus. AMBA defines two kinds of AHB components:
master and slave. A slave interface is similar to programmed I/O, through which the software
(running on an embedded CPU, e.g. ARM) can write/read I/O registers or (less commonly) local
memory blocks inside the device. A master interface can be used by the device to perform DMA
transactions to/from system memory without heavily loading the CPU.
Therefore, high bandwidth devices such as network controllers that need to transfer huge amounts
of data to/from system memory will have two interface adapters to the AHB: a master and a slave
interface. This is because on-chip buses like AHB do not support tri-stating the bus or alternating
the direction of any line on the bus. Like PCI, no central DMA controller is required since the DMA is
bus-mastering, but an arbiter is required when multiple masters are present on the system.
Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent
scatter-gather operations as programmed by the software.
Cell
DMA in Cell is fully cache coherent (note, however, that the local stores of the SPEs operated upon by
DMA do not act as a globally coherent cache in the standard sense). In both read ("get") and write
("put"), a DMA command can transfer either a single block area of size up to 16 KB, or a list of 2 to
2048 such blocks. The DMA command is issued by specifying a pair of a local address and a remote
address: for example, when an SPE program issues a put DMA command, it specifies an address of
its own local memory as the source and a virtual memory address (pointing to either the main memory
or the local memory of another SPE) as the target, together with a block size. According to an
experiment, an effective peak performance of DMA in Cell (3 GHz, under uniform traffic) reaches
200 GB per second.[14]
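On the SPE side, such a transfer is issued through the memory flow controller (MFC) command interface. The sketch below uses the mfc_get()/mfc_put() intrinsics from the Cell SDK's spu_mfcio.h; the exact call signatures are given from memory and should be treated as an assumption.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 3                       /* any MFC tag id in the range 0..31 */

    /* 16 KB is the maximum size of a single MFC DMA block; local-store
     * buffers used for DMA should be 128-byte aligned for best performance. */
    static volatile uint8_t ls_buf[16384] __attribute__((aligned(128)));

    /* Pull a block from the effective (main-memory) address 'ea' into the
     * SPE's local store, then wait for the transfer to finish. */
    void fetch_block(uint64_t ea, uint32_t size)
    {
        mfc_get(ls_buf, ea, size, TAG, 0, 0);  /* get: remote -> local store */
        mfc_write_tag_mask(1 << TAG);          /* select which tags to wait on */
        mfc_read_tag_status_all();             /* block until the DMA completes */
    }

    /* Push the (possibly modified) block back out to main memory. */
    void store_block(uint64_t ea, uint32_t size)
    {
        mfc_put(ls_buf, ea, size, TAG, 0, 0);  /* put: local store -> remote */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }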
DMA controllers
Intel 8257
Am9517[15]
Intel 8237
Z80 DMA[16]
Pipelining
Processors with scratchpad memory and DMA (such as digital signal processors and the Cell
processor) may benefit from software overlapping DMA memory operations with processing, via
double buffering or multibuffering. For example, the on-chip memory is split into two buffers; the
processor may be operating on data in one, while the DMA engine is loading and storing data in the
other. This allows the system to avoid memory latency and exploit burst transfers, at the expense of
needing a predictable memory access pattern.
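A minimal C sketch of this double-buffering pattern follows; the DMA start/wait primitives and the chunk-oriented processing function are placeholders for device-specific code.

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK 4096

    /* Assumed DMA primitives: start an asynchronous load of chunk number
     * 'chunk_index' into 'dst', and wait for a previously started load. */
    void dma_start_load(uint8_t *dst, size_t chunk_index);
    void dma_wait(void);

    void process(uint8_t *data, size_t len);   /* the actual computation */

    /* Classic double buffering: while the CPU processes one buffer, the DMA
     * engine fills the other, hiding memory latency behind computation. */
    void process_stream(size_t nchunks)
    {
        static uint8_t buf[2][CHUNK];
        int cur = 0;

        dma_start_load(buf[cur], 0);                  /* prime the pipeline   */
        for (size_t i = 0; i < nchunks; i++) {
            dma_wait();                               /* chunk i is in buf[cur] */
            if (i + 1 < nchunks)
                dma_start_load(buf[cur ^ 1], i + 1);  /* prefetch chunk i+1   */
            process(buf[cur], CHUNK);                 /* compute while DMA runs */
            cur ^= 1;                                 /* swap buffers         */
        }
    }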
References
1. "DMA Fundamentals on various PC platforms, National Instruments, pages 6 & 7" (https://ww
w.ing.unlp.edu.ar/catedras/E0225/descargar.php?secc=0&id=E0225&id_inc=1196) .
Universidad Nacional de la Plata, Argentina. Retrieved 20 April 2019.
2. Osborne, Adam (1980). An Introduction to Microcomputers: Volume 1: Basic Concepts (https://a
rchive.org/details/introductiontomi00adam/page/5) (2nd ed.). Osborne McGraw Hill. pp. 5–
64 through 5–93 (https://archive.org/details/introductiontomi00adam/page/5) .
ISBN 0931988349.
3. Hayes, John.P (1978). Computer Architecture and Organization. McGraw-Hill International Book
Company. p. 426-427. ISBN 0-07-027363-4.
6. "Bus Specifics - Writing Device Drivers for Oracle® Solaris 11.3" (https://docs.oracle.com/cd/E
53394_01/html/E54850/hwovr-25520.html) . docs.oracle.com. Retrieved 2020-12-18.
10. "Intel Data Direct I/O (Intel DDIO): Frequently Asked Questions" (http://www.intel.com/content/
dam/www/public/us/en/documents/faqs/data-direct-i-o-faq.pdf) (PDF). Intel. March 2012.
Retrieved 2015-10-11.
11. Rashid Khan (2015-09-29). "Pushing the Limits of Kernel Networking" (http://rhelblog.redhat.co
m/2015/09/29/pushing-the-limits-of-kernel-networking/) . redhat.com. Retrieved 2015-10-11.
12. "Achieving Lowest Latencies at Highest Message Rates with Intel Xeon Processor E5-2600 and
Solarflare SFN6122F 10 GbE Server Adapter" (http://www.solarflare.com/content/userfiles/doc
uments/intel_solarflare_webinar_paper.pdf) (PDF). solarflare.com. 2012-06-07. Retrieved
2015-10-11.
13. Alexander Duyck (2015-08-19). "Pushing the Limits of Kernel Networking" (https://events.stati
c.linuxfound.org/sites/events/files/slides/pushing-kernel-networking.pdf) (PDF).
linuxfoundation.org. p. 5. Retrieved 2015-10-11.
14. Kistler, Michael (May 2006). "Cell Multiprocessor Communication Network: Built for Speed" (htt
p://portal.acm.org/citation.cfm?id=1158825.1159067) . IEEE Micro. 26 (3): 10–23.
doi:10.1109/MM.2006.49 (https://doi.org/10.1109%2FMM.2006.49) . S2CID 7735690 (http
s://api.semanticscholar.org/CorpusID:7735690) .