Optimisation of AJIT Processor
Submitted in partial fulfilment of the requirements
for MTP Phase 1
by
Aswin Jith S.
(Roll No. 153070047)
under the guidance of
M. P. Desai
Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
2016
Abstract

Contents

Abstract
1 Introduction
2 Evolution of Memory
3 Memory Hierarchy
4 Performance of Memory Subsystems
  4.1 Performance Measurement
  4.2 Performance Improvement
5 Emerging Memory Technologies
  5.1 Embedded DRAM
  5.2 Resistive RAM
  5.3 Phase Change Memory
  5.4 Spin Transfer Torque-RAM
  5.5 STT-RAM as a possible replacement for SRAM
6 Conclusion
Bibliography
Chapter 1
Introduction
The performance of a system is often gauged by its ability to perform arithmetic computations. The number of calculations a computer performs per second alone, however, does not define its capability; it also depends on the ability to present data for those computations. All data, including inputs, parameters, and temporary values produced over the course of a computation, must be fed to the CPU.

A CPU cannot think on its own and must be given instructions on how to solve a problem. Any computation, simple or complex, is divided into arithmetic and logical operations, and instructions are constructed from a defined instruction table [17]. As directed by each instruction, an operation is performed on data, or operands, present at a particular address. A machine cannot process all the instructions and data corresponding to a problem simultaneously, and hence it needs to remember: data must be stored into and loaded from memory whenever required.
In early computers, memory access and computation rates were nearly equal. As processors evolved, computation rates increased along with the complexity of problems, but the improvement in memory access rates was comparatively slow. Designers therefore predicted which instructions and data the computer would need and placed them, at the appropriate time, where the computation elements could process them. In the modern memory hierarchy, main memory is supplemented by on-chip registers and multiple levels of caches, which are located closer to the processor and contain a fraction of the data available in main memory. Though memory was costly in the early days, technology evolution increased transistor densities and the cost per bit fell over time, with transistor counts increasing multi-fold according to Moore's law [21]. The current trend is to divide memory across several computation nodes to enable parallel tasking, so that the computational demands of large volumes of data can be managed.
Chapter 2
Evolution of Memory
Punched cards are among the earliest technologies used to implement a memory. They were developed in the 1890s, even before the advent of the electronic computer, and though never used extensively, they stayed in service until around the 1970s. Von Neumann designed EDVAC [24] in 1945, comprising an ALU, a control unit and memory, and argued that a single type of memory would suffice regardless of the kind of operation. EDVAC employed a sequential-access mercury delay-line memory with an average delay of 192 µs, and each of its 128 delay lines gave a throughput of one word every 48 µs [7].
Constrained by the cost of high-speed delay-line memories, commercial systems used drum memory. A drum is a rotating cylinder coated with ferromagnetic material, with read/write heads attached to it. A bit was represented by an electromagnetic pulse, and values were changed by flipping the magnetic orientation of the ferromagnetic particles. High rotational latency meant that drums were largely confined to secondary storage rather than main memory until the 1960s.

Magnetic core memory emerged in the 1960s, and unlike drum memory it could randomly access any word, as it had no moving parts. Core memory offered superior access time compared to drum memory and was reliable; it served as the dependable high-speed memory before the advent of the dynamic memory in use today.
By the 1970s, semiconductor memories had replaced core memory, and the invention of dynamic random access memory (DRAM), the single-transistor DRAM cell in particular, marked the start of a new era [4]. DRAM had better density than core memory, but it was volatile and consumed power to retain state. DRAMs were prone to soft errors and needed to be refreshed periodically to retain their charge. Despite these disadvantages, DRAM technology is still used as main memory, with density increasing and cost per bit decreasing over the years. As processor clock frequencies increased rapidly, memory latency became a bottleneck and led to the introduction of an additional level of memory, the cache, which was faster than main memory. Caches, made of static RAM (SRAM), were placed closer to the processor and exploited the locality principles to improve latency. An SRAM cache has far less capacity than a DRAM main memory. As fabrication technology evolved, more transistors could be packed onto a die, and many architectures were proposed with separate instruction and data caches to improve performance.
Magnetic disks surfaced in the 1970s, replacing drum memory as the non-volatile backup for DRAM main memory [11]. By using movable read/write heads instead of the fixed heads of the drum, disks achieved lower access times, though at a cost in price and design complexity; cost per bit came down as the technology progressed. The concept of virtual memory [6] was introduced in the 1970s, and its popularity enhanced the importance of non-volatile memories. Memory virtualization allows programs to use more memory than is physically present in main memory. Unlike cache memory, where hardware decides which memory blocks are placed in the cache, memory virtualization gives the user the freedom to divide virtual memory between main memory and disk storage with the aid of the operating system.
Around 1984 came optical disks, which could be used with any computer. Flash memory [16] was invented in 1984 at Toshiba. Data on flash memory can be erased and reprogrammed multiple times, and it is faster than magnetic disks. Owing to these features, flash memory gained popularity and is still used in mobile phones, solid-state drives, and similar devices. In the recent past, non-volatile memories based on magnetic and resistive properties, such as MRAM, RRAM and the memristor, have emerged as potential replacements for DRAM and SRAM.
Chapter 3
Memory Hierarchy
The widely used Von Neumann architecture fetches instructions from memory and executes them in the processor. Over the course of execution, certain instructions fetch values from memory using load operations and store results back via store operations. Since instructions and data may be needed over the long term, they are kept on disk.
A processor's usage of memory is far from random and is highly predictable, since programs solve problems in largely sequential fashion. From this observation evolved the principle of locality, which states that data accessed now is likely to be used again soon (temporal locality) and that neighbouring data will be accessed soon (spatial locality). Implementing an ideal memory is unrealistic due to capacity, speed and cost constraints. A feasible solution is to organize memory into several levels, each level faster and smaller than the one below it. Each stage is optimised for a certain task, and careful design can deliver performance close to that of the fastest component at a cost per bit comparable to that of the cheapest. By exploiting temporal and spatial locality, the memory hierarchy reduces accesses to main memory. A small experiment below illustrates the effect of spatial locality.
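To see spatial locality at work, here is a minimal sketch (in Python, using numpy; the array size is illustrative and timings are machine-dependent) that traverses the same 2-D array in row-major and column-major order. The row-major walk touches consecutive addresses, so each cache line fetched is fully used; the column-major walk jumps N elements at a time and wastes most of each line.

```python
import time
import numpy as np

N = 4096
a = np.ones((N, N))          # C-order: elements of a row are contiguous in memory

def traverse(by_rows: bool) -> float:
    """Sum the array one row (or one column) at a time; return elapsed seconds."""
    start = time.perf_counter()
    total = 0.0
    for k in range(N):
        total += a[k, :].sum() if by_rows else a[:, k].sum()
    return time.perf_counter() - start

print(f"row-major   : {traverse(True):.3f} s")   # contiguous, cache-friendly
print(f"column-major: {traverse(False):.3f} s")  # strided, cache-hostile
```

On typical hardware the column-major traversal is noticeably slower, purely because of how the same accesses map onto cache lines.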
Registers: Registers are small, fast storage units made of flip-flops, located within the processor. The compiler manages the registers and decides which values to keep in each available register as instructions execute.
Cache: Caches are located close to the CPU and contain the most recently accessed instructions and data. Caches are built from SRAM and are therefore smaller and faster than a main memory made of DRAM. In systems where the access performance of main memory does not meet requirements, access time is reduced by inserting a cache between the processor and main memory.
Main Memory: A level down in the hierarchy, main memory provides random access that is cheap and large compared to cache, and fast compared to the storage disk. Main memory, made of DRAM, has a large capacity compared to SRAM caches. The capacitor in the DRAM cell increases its access time relative to an SRAM cell, which is built from transistors alone. Being volatile, DRAM must be refreshed periodically to counter charge leakage; the additional clock cycles spent refreshing reduce its performance relative to SRAM. Still, the simple one-transistor one-capacitor structure and low cost per bit make DRAM the favoured candidate for main memory.
Storage Disk: Storage disks provide large permanent storage at a low cost per bit. All computers hold large amounts of data that the processor modifies infrequently, which necessitates a permanent storage space. Storage disks house most of the data and, being non-volatile, retain it even when the power is off. The most commonly used storage devices are magnetic disks, coated with a magnetic material such as ferric oxide or a cobalt-based alloy. An actuator arm with magnetic read/write heads moves over the magnetic surface to perform read/write operations. Other permanent storage devices include flash memory, EEPROM and optical disks. Flash memories sit one level above magnetic disks in the hierarchy, owing to their better access rates and lower capacity, while optical disks form the bottom level.
To save execution time and energy, caches reduce the number of accesses made to the next lower level of memory, namely main memory. If the requested data is available in the cache, it is a cache hit; if not, it is a cache miss. SRAM is used to build caches, though recent trends suggest DRAM can serve as a low-level cache in systems with multiple cache levels. As memory sizes grew, latency increased and necessitated at least three levels of cache, each larger and slower than the one above it. As more hardware goes into hiding the details of caching at the appropriate layers, multiple cache memories become very expensive. In multiprocessor systems, it must be ensured that memory accesses do not deviate from the programmer's expectations. Inconsistency between a cached copy and the shared memory, or between cached copies themselves, due to the existence of multiple cached copies of data, results in the cache coherence problem; cache snooping protocols are used to maintain coherence. The toy model below sketches how hits and misses arise in a direct-mapped cache.
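The following sketch (all parameters hypothetical) makes the hit/miss bookkeeping concrete: a direct-mapped cache selects a line by the index bits of the block address and declares a hit only when the stored tag matches.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that counts hits and misses for a byte-address trace."""

    def __init__(self, num_lines: int = 64, line_size: int = 16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines      # tag stored in each line (None = invalid)
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> None:
        block = addr // self.line_size      # block number in memory
        index = block % self.num_lines      # which cache line the block maps to
        tag = block // self.num_lines       # distinguishes blocks sharing a line
        if self.tags[index] == tag:
            self.hits += 1                  # cache hit
        else:
            self.misses += 1                # cache miss: fetch and fill the line
            self.tags[index] = tag

cache = DirectMappedCache()
for addr in range(0, 4096, 4):              # sequential 4-byte accesses
    cache.access(addr)
total = cache.hits + cache.misses
print(f"hits={cache.hits} misses={cache.misses} miss rate={cache.misses/total:.0%}")
```

With 16-byte lines and 4-byte accesses, the sequential trace misses once per line and then hits three times, giving a 25% miss rate; spatial locality is what turns the one cold miss into the subsequent hits.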
Chapter 4
Performance of Memory Subsystems
4.1 Performance Measurement
Over the past 30 years, the rate of improvement in microprocessor speed has exceeded that of memory access rates exponentially. This growing gap between memory and processor is often referred to as the "memory wall" [25]. The memory wall problem is experienced not only at main memory but also in the on-chip caches; main memory latency is roughly ten times that of the caches. In high-end computing systems, the memory access rate is the critical factor determining overall system performance, owing to the large performance gap between processor and memory. Some commonly used figures of merit for performance are given below:
Cycles per Instruction (CPI) = Total execution cycles / Total user-level instructions committed    (4.1)

Memory Cycles per Instruction (MCPI) = Total cycles spent in memory system / Total user-level instructions committed    (4.2)

Cache miss rate = Total cache misses / Total cache accesses    (4.3)

Cache hit rate = 1 - Cache miss rate

The commonly used performance metric, often referred to as latency, is

Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty    (4.4)
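Plugging illustrative numbers (hypothetical, not measured) into equations (4.3) and (4.4) shows how the metrics compose; with two cache levels, the L1 miss penalty is itself the AMAT of the L2 and the memory below it.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time, eq. (4.4): hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Single-level cache: 1-cycle hit, 5% miss rate, 100-cycle memory penalty.
print(amat(1, 0.05, 100))                 # -> 6.0 cycles

# Two-level hierarchy: the L1 miss penalty is the AMAT of L2 plus memory
# (10-cycle L2 hit, 20% L2 miss rate, 100-cycle memory penalty).
l1_miss_penalty = amat(10, 0.20, 100)     # -> 30.0 cycles
print(amat(1, 0.05, l1_miss_penalty))     # -> 2.5 cycles
```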
4.2 Performance Improvement
As the gap between processor frequency and DRAM speed keeps widening, computer system design has focused on improving the performance of the memory subsystem rather than the processor. In most cases the performance of the memory subsystem dominates that of the whole system, especially where memory is shared between multiple processors.
Memory Bandwidth: Memory bandwidth refers to the amount of data transferred in each access; it determines the rate at which memory can accept requests from the processor.

Memory Latency: Memory latency refers to the time elapsed between the initiation of a memory request and its completion.
The increasing divergence between memory and processor speeds is due to growing latency. An ideal memory system, optimized for maximum performance, is characterized by infinite bandwidth and zero latency. The two major techniques to reduce the impact of high memory latency are [2]:

Latency Reduction: reduces the time the memory takes to provide the required operand once the memory request has been issued.

Latency Tolerance: hides the memory latency partially or completely by performing other computations while a memory request is being serviced.
Both techniques improve effective latency at the cost of greater pressure on bandwidth, since the memory is invoked more frequently to supply the necessary operands. Efforts to bridge the gap between processor and memory have concentrated on efficient memory subsystem architectures that optimise metrics such as hit time, miss rate and miss penalty. A few approaches employed to improve memory performance are discussed below.
1. Improving the memory bus: Traditional approaches such as speeding up the clock or widening the bus can improve memory bandwidth. Clock-rate scaling imposes stringent timing requirements, so the components and PCB must be modelled precisely, while widening the memory bus results in excessive I/O power and PCB layout issues. Conventional DRAM had such limitations, and hence SDRAM, DDR SDRAM, and later RDRAM surfaced, offering better access rates.
2. Logic/DRAM integration: Integrating the processor on the same chip as the memory reduces latency, improves bandwidth, and is energy efficient. IRAM (Intelligent RAM) [19] merged a microprocessor and DRAM on the same chip. IRAM's popularity was limited by the amount of memory that fits on chip, which would have had to grow by 60% per year, so it was confined to systems with modest memory requirements.
3. Near-data processing: Systems that process huge amounts of data follow irregular data-access patterns, and achieving spatial/temporal locality on both the hardware and algorithm fronts is very difficult. For such systems, the algorithms needed to exploit a deep memory hierarchy become complex and power-hungry, and are often deemed ineffective. It was therefore proposed that moving the computation to the location of the data would be more energy efficient and save time. The IBM Netezza database appliance [10] is a commercial system that exploits this concept: for operations such as join, sort and aggregation, commonly used in database systems, partial results of in-memory processing are propagated to the main processor, where they are combined. The technique need not be confined to DRAM main memory; it can be extended to solid-state drives (SSDs), since moving computation away from the processor and closer to the SSD saves bandwidth, power and energy, reduces latency, and yields more accurate results.
4. Processing in Memory (PIM): In early PIM architectures such as Terasys [8], a large array of computational elements was built into the DRAM arrays, taking advantage of the DRAM's internal bandwidth by integrating computation elements and DRAM on the same chip. Though promising, PIM had serious drawbacks: a DRAM-technology process focuses on providing low-cost, high-density DRAM chips, whereas computation requires the fast transistors of a processor process; and in the shared-memory multiprocessing paradigm, each processor has only limited memory available. To counter these drawbacks, computation blocks can be moved from the memory to layers between the processor and memory, for example onto a memory controller, a separate chip optimised for computation using fast transistor circuits. The memory controller supports the DRAM memory chips, handles refreshing of the DRAM cells, and performs functions such as error detection and correction.
5. Multiprocessor on a chip: Integrating multiple processors on the same chip offers incentives such as improved memory access rates, reduced inter-processor communication latency, and increased memory bandwidth. Since each memory node requires a large number of resources, the total number of parallel nodes is limited.
6. 3-D stacking: A mature 3-D stacking technology can address the deficiencies associated with PIM. Stacking in 3-D can provide a large amount of RAM without consuming much die area, and allows dies from different technologies to be connected using through-silicon vias (TSVs). 3-D technology promises improved bandwidth between the DRAM and the computational elements located on the same stack. Micron's Hybrid Memory Cube (HMC) [20] stacks several layers of DRAM dies one over the other to increase capacity, and connects them to a CMOS logic layer using TSVs.
7. Non-Von Neumann architectures: Mainstream systems have stayed loyal to the Von Neumann architecture, even though several non-Von Neumann systems have been developed over the years for applications demanding better performance, lower power consumption, or smaller area [18]. As the data, and the computations performed over it, have changed over the years, new computational paradigms need to be explored. Content-addressable memories (CAMs) and ternary content-addressable memories (TCAMs) are examples of such architectures: in a CAM, each memory cell is combined with comparison logic, so that memory blocks are addressed by their contents rather than by location, as sketched below. Adding logic alongside memory reduces density compared to DRAM and consumes more power. Designers are also proposing computer architectures that function like a brain, employing the artificial neural network model [9].
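The sketch below (toy data, not a hardware model) contrasts the two addressing modes; a real CAM performs all the comparisons in parallel in hardware, which the search loop here only emulates.

```python
# Location-addressed memory: supply an address, get the word stored there.
ram = ["cat", "dog", "owl", "dog"]
print(ram[2])                            # -> "owl"

# Content-addressed memory: supply a word, get every location that holds it.
def cam_search(memory: list, key: str) -> list:
    return [i for i, word in enumerate(memory) if word == key]

print(cam_search(ram, "dog"))            # -> [1, 3]

# A TCAM cell adds a third, "don't care" state (X), so one stored pattern can
# match many keys, as used for routing-table prefix lookups.
def tcam_match(pattern: str, key: str) -> bool:
    return all(p in ("X", k) for p, k in zip(pattern, key))

print(tcam_match("10XX", "1011"))        # -> True
```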
Chapter 5
Emerging Memory Technologies
Designers are under increasing pressure to provide memories with large capacity and low power consumption. SRAM, the conventional choice for on-chip caches, has high leakage power and low memory density; as a result, SRAM caches consume a major fraction of the die area and power budget. DRAM, which has served as main memory for decades, is slower and consumes power due to its mandatory refresh policy. Upcoming memories such as eDRAM and various non-volatile memories have features that promise to solve several of these problems: they can either improve capacity for the same die area, or reduce the area and power consumed for the same capacity. eDRAM, RRAM, STT-RAM and PCM are the prominent candidates, and the feasibility of replacing SRAM with them as on-chip cache is discussed below [14].
5.1 Embedded DRAM
eDRAM is similar in structure to the conventional 1T-1C DRAM cell, or uses a gain-cell configuration consisting of two or three transistors implemented in CMOS technology. eDRAM uses a capacitor to store charge and hence must be refreshed periodically to counter charge leakage. An eDRAM cache has lower read/write latency than standard DRAM, as it uses high-speed transistors. The major drawback of eDRAM is its short data-retention period, a consequence of faster charge leakage, which demands a higher refresh rate.
5.2 Resistive RAM
Resistive RAM is a memristance-based device in which the resistance across a dielectric material, such as a metal oxide, is varied [15]. Resistance states store the logical values in RRAM, with low resistance representing logic 1 and high resistance representing logic 0. Resistive switching can be unipolar, where switching does not depend on the polarity of the applied voltage, or bipolar, where the SET and RESET operations happen at different voltage polarities. When a sufficiently high voltage is applied, a conducting filament forms across the dielectric; by applying suitable voltages thereafter, the filament may be SET to the low-resistance state (logic 1) or RESET to the high-resistance state (logic 0). In comparison to SRAM, an RRAM cache offers higher density and similar read latency at much lower leakage power. The major drawbacks of RRAM are low write endurance, high write latency and high write energy [13].
5.3 Phase Change Memory
PCM is based on the phase change of an alloy of germanium, antimony and tellurium, popularly known as GST, when an electric current is passed through it. If the alloy is heated to a temperature between its crystallization point and melting point and cooled slowly, a crystal with a low resistance state is formed; when the alloy is heated to a high temperature and cooled rapidly, a high-resistance amorphous material is formed. These distinct physical states of PCM are used to store binary values. Limited write endurance and high write latency hamper the use of PCM in on-chip caches: a PCM with a write endurance of 10^8 writes can, for specific applications, fail within an hour when used as a cache [12].
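A back-of-the-envelope check of that failure claim (the write rate below is an assumption for illustration, not a figure from [12]): a single hot cache line written a few tens of thousands of times per second exhausts a 10^8-write budget in about an hour.

```python
endurance = 1e8              # writes a PCM cell can sustain before failing [12]
writes_per_second = 3e4      # assumed write rate seen by the hottest cache line
lifetime_s = endurance / writes_per_second
print(f"hottest-line lifetime: {lifetime_s:.0f} s (~{lifetime_s/3600:.1f} h)")
```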
5.4 Spin Transfer Torque-RAM
The giant magnetoresistive effect in thin films was discovered in 1988, and it led to the development of solid-state magnetic memories. STT-RAM [23] uses the resistance states of a magnetic tunnel junction (MTJ) to store binary values. An STT-RAM cell consists of an MTJ connected in series with a MOS transistor, where the MTJ provides the variable-resistance behaviour. The MTJ consists of two ferromagnetic layers, the free layer and the reference layer, separated by an oxide barrier layer. The magnetization direction of the reference layer is fixed, while that of the free layer is altered by passing a current. The relative magnetization direction determines the resistance state of the MTJ: high resistance when the two layers differ in magnetization direction, and low resistance when they are aligned. STT-RAM offers high write endurance and is hence suitable for designing caches; its main drawbacks are high write latency and high write energy.

All non-volatile memories (NVMs) use a change in physical state to store data. A write to a non-volatile memory involves a change of physical state and hence consumes more time and energy than a read, resulting in read-write asymmetry.
5.5 STT-RAM as a possible replacement for SRAM
Among the non-volatile memories, STT-MRAM is the most promising candidate to replace SRAM in cache memories. STT-MRAM based caches for an ARM processor architecture have been explored [22]. Simulation results showed that for a 512 kB L2 cache, STT-MRAM has higher write latency than SRAM, but lower hit latency, because STT-MRAM is denser than SRAM. This higher density also means that STT-MRAM requires less cache area than SRAM for the same capacity. STT-MRAM shows a noticeable difference in hit latency only for large caches; for a 32 kB L1 cache, the simulations showed nearly the same hit latency for SRAM and STT-MRAM. STT-MRAM remains slower in write operations, but since the instruction cache is read-only, STT-MRAM can replace SRAM there without affecting performance.

In terms of energy consumption in the L2 cache, STT-MRAM incurs higher write energy, while its hit energy is similar to SRAM's. For the L1 cache, STT-MRAM consumes around four times more energy than SRAM for both hit and write operations. In terms of static power, STT-MRAM has a considerable advantage over SRAM: an STT-MRAM based L2 cache consumes only one-tenth the power. For the L1 cache too, STT-MRAM can gain in static power, owing to the zero leakage of the MTJ. For L2, the simulation results showed STT-MRAM to be 92% more energy efficient, while there is no energy gain for L1. The best architecture at present is therefore to use SRAM for the L1 cache and STT-RAM for the lower levels, as the simple model below illustrates. MRAMs can also be evaluated at the register level to explore new applications using non-volatile registers inside the processor. Relaxing non-volatility can also help STT-MRAM achieve lower write latencies [23].
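A simple energy model (all parameters below are illustrative assumptions, not the figures reported in [22]) shows why STT-MRAM can win at L2 despite its expensive writes: over a long enough interval, the leakage saving dominates the extra write energy.

```python
def cache_energy(accesses: float, write_frac: float, e_read: float,
                 e_write: float, p_static: float, seconds: float) -> float:
    """Total energy (J) = read energy + write energy + leakage over time."""
    reads = accesses * (1 - write_frac)
    writes = accesses * write_frac
    return reads * e_read + writes * e_write + p_static * seconds

# Assumed L2 workload over one second: 10^9 accesses, 30% writes.
# SRAM: cheap, symmetric accesses (0.2 nJ) but 1 W of leakage.
# STT-MRAM: 5x costlier writes (1 nJ) but one-tenth the leakage.
sram = cache_energy(1e9, 0.3, 0.2e-9, 0.2e-9, p_static=1.0, seconds=1.0)
stt  = cache_energy(1e9, 0.3, 0.2e-9, 1.0e-9, p_static=0.1, seconds=1.0)
print(f"SRAM: {sram:.2f} J, STT-MRAM: {stt:.2f} J")   # -> 1.20 J vs 0.54 J
```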
Chapter 6
Conclusion
Memory has been of great assistance to computation: providing the inputs needed for calculation, saving intermediate results, and storing the results of computation. Over the years, memory densities have increased affordably, largely due to advances in technology. CMOS scaling is approaching its end, as it becomes ever harder to build nanometre-scale device structures. Emerging technologies such as STT-RAM and RRAM will replace DRAM in the not-so-distant future. Computing has grown from a single processor working on a memory to a connected world of computers, and memory has moved from being a place of simple data storage towards in-memory computation. The quest continues for a universal memory exhibiting low power consumption, low operating voltage, high operating speed, long retention time, high endurance, and a simple structure; MRAM, RRAM and other upcoming memory technologies hold the key to its realization. Computation tasks have evolved, and to meet their requirements, memory organizations will be developed for approximate computing, which achieves low latency at lower cost by compromising on accuracy and precision. Industry is moving towards architectures that can tackle both new problems and large-scale versions of older ones, and non-Von Neumann architectures are being explored to provide cost- and energy-efficient solutions.
Bibliography
[1] Doug Burger. Memory bandwidth limitations of future microprocessors. In Proc. of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[2] Carlos Carvalho. The gap between processor and memory speeds. In Proc. of the IEEE International Conference on Control and Automation, 2002.

[3] Mu-Tien Chang, Paul Rosenfeld, Shih-Lien Lu, and Bruce Jacob. Technology comparison for large last-level caches (L3Cs): low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. In High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on, pages 143–154. IEEE, 2013.

[4] Robert H. Dennard. Field-effect transistor memory (US patent no. 3,387,286). IEEE Solid-State Circuits Newsletter, 13(1):17–25, 2008.

[5] Robert H. Dennard, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, 1974.

[6] Peter J. Denning. Virtual memory. ACM Computing Surveys (CSUR), 2(3):153–189, 1970.

[7] Electronic computers within the Ordnance Corps, Chapter III. [Online]. Available: [Link] mike/comphist/61ordnance/ [Link]

[8] Maya Gokhale, Bill Holmes, and Ken Iobst. Processing in memory: the Terasys massively parallel PIM array. Computer, 28(4):23–31, 1995.

[9] W. Daniel Hillis. The Connection Machine. MIT Press, 1989.

[10] IBM Netezza data warehouse appliances. [Online]. Available: http://www-[Link]/software/data/netezza/

[11] Bruce Jacob, Spencer Ng, and David Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2010.

[12] Yongsoo Joo, Dimin Niu, Xiangyu Dong, Guangyu Sun, Naehyuck Chang, and Yuan Xie. Energy- and endurance-aware design of phase change memory caches. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 136–141. European Design and Automation Association, 2010.

[13] Young-Bae Kim, Seung Ryul Lee, Dongsoo Lee, Chang Bum Lee, Man Chang, Ji Hyun Hur, Myoung-Jae Lee, Gyeong-Su Park, Chang Jung Kim, U-In Chung, et al. Bi-layered RRAM with unlimited endurance and extremely uniform switching. In VLSI Technology (VLSIT), 2011 Symposium on, pages 52–53. IEEE, 2011.

[14] Mark H. Kryder and Chang Soo Kim. After hard drives--what comes next? IEEE Transactions on Magnetics, 45(10):3406–3413, 2009.

[15] Hai Li and Yiran Chen. An overview of non-volatile memory technology and the implication for tools and architectures. In Design, Automation & Test in Europe Conference & Exhibition (DATE '09), pages 731–736. IEEE, 2009.

[16] Fujio Masuoka, Masamichi Asano, Hiroshi Iwahashi, Teisuke Komuro, and Shinichi Tanaka. A new flash E2PROM cell using triple polysilicon technology. In Electron Devices Meeting, 1984 International, volume 30, pages 464–467. IEEE, 1984.

[17] F. C. Williams and T. Kilburn. A storage system for use with binary-digital computing machines. (763), 1948.

[18] Ravi Nair. Evolution of memory architecture. Proceedings of the IEEE, 103(8):1331–1345, 2015.

[19] David Patterson, Thomas Anderson, Neal Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A case for intelligent RAM: IRAM. IEEE Micro, April 1997.

[20] J. Thomas Pawlowski. Hybrid Memory Cube: breakthrough DRAM performance with a fundamentally re-architected DRAM subsystem. In Proceedings of the 23rd Hot Chips Symposium, 2011.

[21] Robert R. Schaller. Moore's law: past, present and future. IEEE Spectrum, 34(6):52–59, 1997.

[22] Sophiane Senni, Lionel Torres, Gilles Sassatelli, Abdoulaye Gamatie, and Bruno Mussard. Emerging non-volatile memory technologies exploration flow for processor architecture. In VLSI (ISVLSI), 2015 IEEE Computer Society Annual Symposium on, pages 460–460. IEEE, 2015.

[23] Clinton W. Smullen, Vidyabhushan Mohan, Anurag Nigam, Sudhanva Gurumurthi, and Mircea R. Stan. Relaxing non-volatility for fast and energy-efficient STT-RAM caches. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 50–61. IEEE, 2011.

[24] John von Neumann. First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4):27–75, 1993.

[25] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.