Non-Uniform Memory Access
1 Basic concept
2 Cache coherent NUMA (ccNUMA)
3 NUMA vs. cluster computing
4 See also
5 References
6 External links
Basic concept
[Figure: One possible architecture of a NUMA system. The processors connect to the bus or crossbar through links of varying thickness/number, showing that different CPUs have different access priorities to memory based on their location.]
Modern CPUs operate considerably faster than the main memory to which they are
attached. In the early days of computing and data processing, the CPU generally ran slower
than its memory. The performance lines crossed in the 1960s with the advent of the
first supercomputers and high-speed computing. Since then, CPUs, increasingly starved for
data, have had to stall while they wait for memory accesses to complete. Many
supercomputer designs of the 1980s and 1990s focused on providing high-speed memory
access as opposed to faster processors, allowing them to work on large data sets at speeds
other systems could not approach.
Limiting the number of memory accesses provided the key to extracting high performance
from a modern computer. For commodity processors, this means installing an
ever-increasing amount of high-speed cache memory and using increasingly sophisticated
algorithms to avoid "cache misses". But the dramatic increase in the size of operating
systems and of the applications run on them has generally overwhelmed these
cache-processing improvements. Multi-processor systems make the problem considerably worse.
Now a system can starve several processors at the same time, notably because only one
processor can access memory at a time.
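As a rough illustration of how access patterns drive cache behaviour, the short C sketch below (a hypothetical example, not part of the original article) sums the same matrix twice: the row-major loop walks memory sequentially and stays in cache, while the column-major loop strides across rows and misses the cache far more often on large matrices.

/* Sketch: identical arithmetic, very different cache behaviour. */
#include <stdio.h>
#include <stdlib.h>

#define N 2048

static double sum_row_major(const double *m)    /* cache-friendly */
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i * N + j];                  /* consecutive addresses */
    return s;
}

static double sum_col_major(const double *m)    /* cache-hostile */
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i * N + j];                  /* stride of N * sizeof(double) bytes */
    return s;
}

int main(void)
{
    double *m = calloc((size_t)N * N, sizeof *m);
    if (m == NULL)
        return 1;
    printf("%f %f\n", sum_row_major(m), sum_col_major(m));
    free(m);
    return 0;
}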
NUMA attempts to address this problem by providing separate memory for each processor,
avoiding the performance hit that occurs when several processors attempt to address the same
memory. For problems whose data can be spread across separate tasks (common for servers and
similar applications), NUMA can improve the performance over a single shared memory by a
factor of roughly the number of processors (or separate memory banks).
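On Linux, the libnuma library exposes this per-node placement to applications. The sketch below (assuming a Linux system with libnuma installed; compile and link with -lnuma) allocates a buffer on the memory node local to the CPU the calling thread runs on, so the processor works against its own memory bank rather than a remote one.

/* Sketch: allocate memory on the NUMA node local to the current CPU.
 * Assumes Linux with libnuma (gcc local_alloc.c -lnuma). */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();                       /* CPU this thread is running on   */
    int node = (cpu >= 0) ? numa_node_of_cpu(cpu) : 0;
    if (node < 0)
        node = 0;                                    /* fall back to node 0 if unknown  */

    /* Place the working set in the local node's memory, so this processor
     * does not contend with others for a remote memory bank. */
    size_t bytes = 64 * 1024 * 1024;
    double *data = numa_alloc_onnode(bytes, node);
    if (data == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("CPU %d allocated %zu bytes on its local node %d\n", cpu, bytes, node);

    numa_free(data, bytes);
    return 0;
}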
Of course, not all data ends up confined to a single task, which means that more than one
processor may require the same data. To handle these cases, NUMA systems include
additional hardware or software to move data between banks. This operation has the effect
of slowing down the processors attached to those banks, so the overall speed increase due
to NUMA will depend heavily on the exact nature of the tasks run on the system at any given
time.
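On Linux this movement can also be requested explicitly: the kernel's move_pages facility, wrapped by libnuma as numa_move_pages(), reports which node currently backs a page and can migrate it to another node. The following sketch (again assuming Linux with libnuma, and at least two memory nodes for the migration step) is a minimal illustration of that mechanism, not a recipe from the article.

/* Sketch: query and migrate a page between NUMA nodes with libnuma.
 * Assumes Linux with libnuma (link with -lnuma); error handling kept minimal. */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>     /* MPOL_MF_MOVE for the underlying move_pages() call */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Allocate one page worth of data on node 0 and touch it so it is backed. */
    size_t page = (size_t)numa_pagesize();
    char *buf = numa_alloc_onnode(page, 0);
    memset(buf, 0, page);

    void *pages[1]  = { buf };
    int   status[1] = { -1 };

    /* With nodes == NULL, numa_move_pages() only reports where each page lives. */
    numa_move_pages(0, 1, pages, NULL, status, 0);
    printf("page currently on node %d\n", status[0]);

    /* If the machine has a second node, ask the kernel to migrate the page there. */
    if (numa_max_node() >= 1) {
        int target[1] = { 1 };
        numa_move_pages(0, 1, pages, target, status, MPOL_MF_MOVE);
        printf("after migration request: node %d (negative = error)\n", status[0]);
    }

    numa_free(buf, page);
    return 0;
}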
Current[when?] ccNUMA systems are multiprocessor systems based on the AMD Opteron,
which can be implemented without external logic, and Intel Itanium, which requires the
chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super
hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and
those found in recent NEC Itanium-based systems. Earlier ccNUMA systems such as those
from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7)
processor.
Intel announced NUMA support for its x86 and Itanium servers in late 2007
with the Nehalem and Tukwila CPUs[citation needed]. Both CPU families share a common chipset;
the interconnect is called the Intel QuickPath Interconnect (QPI).