Invited Paper, DATE 2021

Intelligent Architectures for Intelligent Computing Systems

Onur Mutlu
ETH Zurich
[email protected]

ABSTRACT

Computing is bottlenecked by data. Large amounts of application data overwhelm the storage capability, communication capability, and computation capability of the modern machines we design today. As a result, many key applications' performance, efficiency, and scalability are bottlenecked by data movement. In this invited special session talk, we describe three major shortcomings of modern architectures in terms of 1) dealing with data, 2) taking advantage of the vast amounts of data, and 3) exploiting different semantic properties of application data. We argue that an intelligent architecture should be designed to handle data well. We show that handling data well requires designing architectures based on three key principles: 1) data-centric, 2) data-driven, 3) data-aware. We give several examples for how to exploit each of these principles to design a much more efficient and high-performance computing system. We especially discuss recent research that aims to fundamentally reduce memory latency and energy, and practically enable computation close to data, with at least two promising novel directions: 1) processing using memory, which exploits analog operational properties of memory chips to perform massively-parallel operations in memory, with low-cost changes; 2) processing near memory, which integrates sophisticated additional processing capability in memory controllers, the logic layer of 3D-stacked memory technologies, or memory chips to enable high memory bandwidth and low memory latency to near-memory logic. We discuss how to enable adoption of such fundamentally more intelligent architectures, which we believe are key to efficiency, performance, and sustainability. We conclude with some guiding principles for future computing architecture and system designs. This accompanying short paper provides a summary of the invited talk and points the reader to further work that may be beneficial to examine.

I. INTRODUCTION

Existing computing systems process increasingly large amounts of data. Data is key for many modern (and likely even more future) workloads and systems. Important workloads (e.g., machine learning, artificial intelligence, genome analysis, graph analytics, databases, video analytics, online collaboration), whether they execute on cloud servers or mobile systems, are all data-intensive; they require efficient processing of large amounts of data. Today, we can generate more data than we can process, as exemplified by the rapid increase in the data obtained in astronomy observations and genome sequencing [1].

Unfortunately, the way they are designed, modern computers are not efficient at dealing with large amounts of data: large amounts of application data greatly overwhelm the storage capability, the communication capability, and the computation capability of the modern machines we design today. As such, data becomes a large performance and energy bottleneck, and it greatly impacts system robustness and security as well. As a prime example, we provide evidence that the potential of new genome sequencing technologies, such as nanopore sequencing [2, 113], is greatly limited by how fast and how efficiently we can process the huge amounts of genomic data the underlying technology can provide us with [3, 83, 113, 119, 143]. A similar observation can also be made for video analytics [163, 7] and machine learning [198-199, 7].

The processor-centric design paradigm (and the resulting processor-centric execution model) of modern computing systems is one prime cause of why data overwhelms modern machines [4, 5, 120]. With this paradigm, there is a clear dichotomy between processing and memory/storage: data has to be brought from storage and memory units to computation units (e.g., general-purpose processors or special-purpose accelerators), which are far away from the memory/storage units, before any processing can be done on the data. The dichotomy exists at the macro-scale (e.g., across the internet) as well as the micro-scale (e.g., within a single compute node, or even within a single CPU processing core). This processor-memory dichotomy leads to large amounts of data movement across the entire computing system, degrading performance and expending large amounts of energy. For example, a recent work [7] shows that more than 60% of the entire mobile system energy is spent on data movement across the memory hierarchy when executing four major commonly-used consumer workloads, including machine learning inference, video processing and playback, and web browsing. Similarly, due to the current processor-centric design paradigm, a large fraction of the system resources is dedicated to units that store and move data (i.e., to serve the computation units), and actual computation units constitute only ~5% of an entire processing node [8]; yet, even then, data access is still a major bottleneck due to the large latency and energy costs of accessing large amounts of data.

II. FUNDAMENTAL PRINCIPLES

Our starting axiom for an intelligent architecture is that it should handle (i.e., store, access, and process) data well. But what does it mean for an architecture to handle data well? We posit (and later demonstrate with examples) that the answer lies in satisfying three major desirable properties (or principles): 1) data-centric, 2) data-driven, and 3) data-aware.

First, the system should ensure that data does not overwhelm its components. Doing so requires effort in intelligent algorithms, intelligent architectures, and intelligent whole-system designs that are co-optimized cross-layer (i.e., optimizations spanning algorithms, architectures, and devices), in a manner that puts data and its processing at the center of the design, minimizing data movement and maximizing the efficiency with which data is handled, i.e., stored, accessed, and processed (e.g., as exemplified in [4-38, 120]). We call this first principle data-centric architectures.

Second, an intelligent architecture takes advantage of the vast amounts of data and metadata that flow through the system to continuously improve its decision making, by bettering both its policies and mechanisms based on online learning and self-optimization. In other words, the architecture should make data-driven, self-optimizing decisions in its components (e.g., as exemplified in [39-51, 121]). We call this second principle data-driven architectures.

Third, an intelligent architecture understands and exploits various properties of each piece of data so that it can improve and adapt its algorithms, mechanisms, and policies based on the characteristics of data. In other words, the architecture should make data-characteristics-aware decisions in its components and across the entire system (e.g., as exemplified in [52-58, 107, 116, 11, 149]). We call this third principle data-aware architectures.
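The data-driven principle can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical illustration in the spirit of (but not the actual design of) the reinforcement-learning-based self-optimizing memory controller of [39]: the controller treats candidate scheduling policies as actions, observes a reward signal (here, an assumed bus-utilization metric with made-up numbers), and learns online which policy to prefer instead of forever following one fixed, human-coded heuristic.

```python
import random

# Toy data-driven controller: tabular reinforcement learning over two
# candidate scheduling policies. All names and reward values below are
# illustrative assumptions, not the design of any cited controller.
ACTIONS = ["row_hit_first", "oldest_first"]

class DataDrivenController:
    def __init__(self, alpha=0.1, epsilon=0.1):
        self.q = {a: 0.0 for a in ACTIONS}  # learned value of each policy
        self.alpha = alpha                  # learning rate
        self.epsilon = epsilon              # exploration probability

    def choose(self):
        # Explore occasionally; otherwise exploit the best-known policy.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Move the value estimate toward the observed reward.
        self.q[action] += self.alpha * (reward - self.q[action])

random.seed(42)
ctrl = DataDrivenController()
for _ in range(2000):
    a = ctrl.choose()
    # Hypothetical environment: one policy yields higher average reward
    # (e.g., higher achieved bus utilization) under the current workload.
    reward = random.gauss(0.8 if a == "row_hit_first" else 0.5, 0.05)
    ctrl.update(a, reward)

print(max(ctrl.q, key=ctrl.q.get))  # row_hit_first
```

After enough interactions, the learned values separate and the controller prefers the higher-reward policy under the observed conditions; a hardcoded controller would never change, no matter how much evidence it sees.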
III. EXISTING COMPUTING ARCHITECTURES

Based on our qualitative and quantitative analyses, we find that existing computing architectures greatly fall short of handling data well. In particular, they violate all three of the major desirable principles. We analyze each briefly next.

First, modern architectures are poor at dealing with data: they are designed to mainly store and move data, as opposed to actually compute on the data. Most system resources serve the processor (and accelerators) without being capable of processing data. As such, existing architectures are processor-centric as opposed to data-centric: they place the most value in the processor (not data), and everything else in the system is viewed as secondary, serving the processor. We believe this is the wrong mindset and approach in designing a balanced system that handles data well: such a system should be data-centric, i.e., data should be the prime thing that is valued, and everything else in the system should be designed to 1) minimize data movement by enabling computation capability at and close to where data resides and 2) maximize the value and efficiency of processing data by enabling low-latency and low-energy access to, as well as low-energy and low-cost storage of, vast amounts of data. Doing so would eliminate the huge data access bottleneck of processor-centric systems, thereby improving performance, reducing energy consumption, alleviating off-chip bandwidth requirements (and hence area and cost), likely reducing system and hardware design complexity, and opening up new opportunities for improving system security and reliability by handling data more locally, in or near where it resides.

Second, modern architectures are poor at taking advantage of the vast amounts of data (and metadata) available to them during online operation and over time. They are designed to make simple decisions based on fixed policies, ignoring massive amounts of easily-available data. This is because existing architectural policies make human-driven decisions as opposed to data-driven decisions, and humans, by nature, do not seem capable of designing policies and heuristics that consider hundreds, if not thousands, of different state attributes that may be useful to examine in a control policy that makes dynamic decisions. It is instructive to notice that a modern memory controller, for example, keeps executing exactly the same fixed policy for scheduling or power management (e.g., FR-FCFS [59, 60], PAR-BS [61], or some other heuristic-based policy [62-73, 117-118, 122-133]) during the entire lifetime of a system (for many, many years!), regardless of the positive or negative impact of the decisions resulting from the policy at any given point of time on the system. The same is true for a modern prefetch controller, a cache controller, a network controller, and many other hardware controllers in a system (e.g., [150-162, 200-214]). Each controller sees a vast amount of data and makes a vast number of decisions even in the timeframe of a single millisecond (let alone years), yet it is incapable of learning from that data and changing its policy to another, dynamically-determined better policy, because the policy it follows is rigid and hardcoded by a human. This is clearly not intelligent: for example, as humans, we have the capability to learn from the past and adapt our actions accordingly, so as not to repeat the same mistakes as in the past, or to choose the best policy/actions that we believe will provide the highest benefits in the future. Enabling similar intelligence and far-sightedness in controller and system policies in an architecture is necessary for obtaining good performance and efficiency (as well as better reliability, security, and perhaps other metrics) under a variety of system conditions and workloads.

Third, modern architectures are poor at knowing and exploiting different properties of application and system data. They are designed to treat all data as the same (except for a small set of specialized hints that provide some opportunity to optimize based on data characteristics, in a limited manner that is very specific to the particular optimization). As such, the decisions existing architectures make are component-aware decisions as opposed to data-aware decisions: a component's (e.g., a cache's or a memory controller's) structural and performance characteristics dominate the policies designed to control that component, and the accessed/manipulated data's characteristics are rarely conveyed to the policies, or even known. If the characteristics of the data to be accessed or manipulated were known, the decisions taken could be very different: for example, if we knew the relative compressibility of different types of data, e.g., different data types or different objects [55, 74-81, 135-138], different components in the entire system could be designed in a manner that adaptively scales their capability to match the compressibility of different data elements, in order to maximize both performance and efficiency. Modifying the architecture and its interface to become richer and more expressive, and to include rich and accurate information on various properties of data that is to be processed, is therefore critical to customizing the architecture to the characteristics of the data and, thus, enabling intelligent adaptation of system policies to data characteristics.
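The compressibility example can be made concrete with a toy sketch, loosely in the spirit of base-delta compression [74] but not the actual hardware algorithm: the same simple base+delta encoder compresses a block of pointer-like values (high value locality) well, while a block of near-random values falls back to its uncompressed size. Both value patterns below are illustrative assumptions; a data-aware system that knew which kind of block it was handling could size its resources accordingly.

```python
def base_delta_size(block, delta_bytes=1):
    """Toy base+delta encoding: store one 8-byte base plus one small signed
    delta per value, if all deltas fit in delta_bytes; otherwise keep the
    block uncompressed (8 bytes per value). Illustrative only."""
    base = block[0]
    limit = 1 << (8 * delta_bytes - 1)
    if all(-limit <= v - base < limit for v in block):
        return 8 + delta_bytes * len(block)   # compressed size in bytes
    return 8 * len(block)                     # uncompressed size in bytes

# Pointer-like data: eight nearby 64-bit addresses (hypothetical values).
pointers = [0x7F3A10008000 + 8 * i for i in range(8)]
# Hash-like data: eight near-random 64-bit values (hypothetical generator).
hashes = [(i * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF for i in range(8)]

print(base_delta_size(pointers))  # 16  (8-byte base + 8 one-byte deltas)
print(base_delta_size(hashes))    # 64  (incompressible, stays 8 B/value)
```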
IV. INTELLIGENT COMPUTING ARCHITECTURES

A major chunk of our invited talk describes in detail the characteristics of an intelligent computing architecture, via concrete examples and their empirical evaluation. This short paper does not go into detail, but provides a brief overview with references to other works that exemplify such architectures. Multiple detailed versions of this talk can be found online [82, 139-142]. We also refer the reader to recent detailed survey and overview papers we have written on the topic [120, 4].

Data-Centric

A data-centric architecture has at least four major characteristics. First, it enables processing capability in or near where data resides (i.e., in or near memory structures), as described in detail in [4-6, 8, 38, 120] and exemplified by [7-12, 14, 19, 20, 24, 27, 30, 34, 84, 108-113, 144-147]. Second, it provides low-latency and low-energy access to data, as exemplified by [11-13, 15-18, 21, 23, 31-33, 84-86]. Third, it enables low-cost data storage and processing (i.e., high-capacity memory at low cost, via techniques like new memory technologies, hybrid memory systems, and/or compressed memory systems), as exemplified by [22, 87-96, 74, 76, 78, 107, 116]. Fourth, it provides mechanisms for intelligent data management (with intelligent controllers handling robustness, security, cost, etc.), as described in detail in [97-103, 116, 120] and exemplified by, e.g., [104-106, 116, 120, 179-190].

Our talk provides significant detail on providing processing capability in or near where data resides, focusing on processing in memory (PIM). There is a pressing need for enabling PIM in modern systems due to 1) a bottom-up push, i.e., circuit- and device-level memory technology scaling issues requiring intelligent main memory controllers to solve low-level scaling and reliability challenges, such as RowHammer [104-106, 99, 102], data retention [21, 167-170, 97, 191-193], energy consumption [171-172, 127, 132], and enabling scalable emerging technologies [22, 87-93, 172-174], and 2) a top-down pull, i.e., systems and applications requiring near-data processing capability with minimal data movement, to reduce the data access bottleneck and its large negative effect on performance [154-155, 164-165], energy [7, 166], and sustainability.

There are at least two new approaches to enabling processing-in-memory in modern systems. The first approach, processing using memory (PUM), exploits the existing memory architecture and the operational principles of the memory circuitry to enable operations inside memory structures with minimal changes. PUM makes use of intrinsic properties and operational principles of the memory cells and cell arrays, by inducing interactions between cells such that the cells and/or cell arrays can perform useful computation. PUM architectures enable a wide range of different functions, such as data copy/initialization, bitwise operations, and simple arithmetic operations. We focus on how to minimally and practically change DRAM chips to perform fast and energy-efficient bulk data copy and initialization [84, 12, 147, 175] as well as bulk bitwise operations [6, 10, 109, 175]. Similar approaches are also applicable to SRAM, MRAM, RRAM, and other NVM technologies [176-178].
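To illustrate the logical effect of such in-DRAM bulk bitwise operations (in the spirit of Ambit [10, 109], modeling only the functional behavior of triple-row activation, not the DRAM circuitry): activating three rows simultaneously yields their bitwise majority, and initializing a control row to all zeros or all ones turns that majority into a bulk AND or OR of the other two rows.

```python
def majority(a, b, c):
    """Bitwise majority of three equal-length bit rows: the logical effect
    of simultaneously activating three DRAM rows in Ambit-style PUM."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def bulk_and(row_a, row_b):
    # MAJ(a, b, 0) == a AND b, bit by bit.
    return majority(row_a, row_b, [0] * len(row_a))

def bulk_or(row_a, row_b):
    # MAJ(a, b, 1) == a OR b, bit by bit.
    return majority(row_a, row_b, [1] * len(row_a))

a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
print(bulk_and(a, b))  # [1, 0, 0, 0]
print(bulk_or(a, b))   # [1, 1, 1, 0]
```

A single such activation operates on an entire DRAM row (kilobytes of data) at once, across all chips of a rank in parallel, which is the source of PUM's massive bitwise parallelism.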
The second approach, processing near memory (PNM), involves adding or integrating computation units (e.g., accelerators, simple processing cores, reconfigurable logic) close to or inside the memory. Computation units can be placed in the logic layer of 3D-stacked memories, in the memory controller, or even inside memory chips. Recent advances in silicon interposers (in-package wires that connect directly to the through-silicon vias in a 3D-stacked chip) also allow separate logic chips to be placed in the same die package as a 3D-stacked memory while still taking advantage of the TSV bandwidth.

Both PUM and PNM approaches can greatly accelerate real applications, including database systems, graph analytics, machine learning, genome analysis, GPU workloads, pointer-chasing-intensive workloads, data analytics, climate modeling, etc. Recent results show up to approximately two orders of magnitude improvement in energy and performance over conventional processor-centric systems. More functionality can potentially be integrated into a memory chip using PNM than using PUM, but both approaches can be combined to obtain even higher benefit from PIM. For both approaches, we describe and tackle relevant cross-layer research, design, and practical adoption challenges in devices, architecture, systems, and programming models in our talk. Our recent PIM overview work comprehensively analyzes modern PIM systems and issues [120, 4].

Data-Driven

A data-driven architecture enables the machine itself to learn the best policies for managing itself and executing programs. Controllers in such an architecture, when needed, are data-driven autonomous agents that automatically learn far-sighted policies. A prime example of such a controller is the reinforcement-learning-based self-optimizing memory controller [39]. Such controllers can not only improve performance and efficiency under a wide variety of conditions and workloads but also reduce the hardware and system designer's burden in designing sophisticated controllers [39]. We believe an intelligent architecture will consist of a collection of such intelligent controllers that perform automatic data-driven online policy learning, including learning how to best coordinate with each other to make decisions that benefit the overall system. Such machines learn the best policies over time and thus become better as they learn, adapting, evolving, and executing far-sighted policies. To enable such a machine, we need to revisit the design of all controllers (e.g., caching, prefetching, storage, memory, interconnect) and turn them into data-driven agents.

Data-Aware

A data-aware architecture understands what it can do with and to each piece of data (and the associated computations on the data), and uses this information about data characteristics to maximize system efficiency and performance. In other words, it customizes itself (i.e., its policies and mechanisms) to the characteristics of the data and computations it is dealing with. Such an architecture requires knowledge of various characteristics of different data elements and structures, as well as computations. Many semantic and other characteristics of data (e.g., compressibility, approximability, sparsity, criticality, access and security semantics, locality, latency vs. bandwidth sensitivity, privacy requirements, data types, error vulnerability) are invisible or unknown to modern hardware and thus need to be communicated or discovered. We believe efficient and expressive software/hardware interfaces and the resulting cross-layer mechanisms, as exemplified by X-Mem (Expressive Memory) [52, 53] and the Virtual Block Interface [56], as well as other works [54, 55, 57, 58, 107, 116, 11], are promising and critically-needed approaches to creating general-purpose data-aware architectures.

ACKNOWLEDGMENTS

An earlier version of this talk was delivered as a plenary keynote talk at the VLSI-DAT/TSA conferences [142], with an accompanying paper [148], of which this paper is an extension. The very first version of this talk was delivered as a keynote talk at the SRC-Mubadala-Khalifa Forum on The Future of Artificial Intelligence Hardware Systems in April 2019. We thank all of the members of the SAFARI Research Group, and our collaborators at Carnegie Mellon, ETH Zurich, and other universities, who have contributed to the various works we describe in this paper. Thanks also go to our research group's industrial sponsors over the past ten years, especially ASML, Google, Huawei, Intel, Microsoft, NVIDIA, Samsung, Seagate, SRC, and VMware, who have supported various pieces of research described in this paper and the associated talk.
REFERENCES
[1] Z. D. Stephens et al., "Big data: astronomical or genomical?", PLoS Biology, 2015.
[2] D. Senol Cali et al., "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions", BIB 2019.
[3] O. Mutlu, "Accelerating Genome Analysis: A Primer on an Ongoing Journey", Keynote Talk at HiCOMB-17, 2018.
[4] S. Ghose et al., "Processing-in-Memory: A Workload-Driven Perspective", IBM JRD 2019.
[5] O. Mutlu et al., "Processing Data Where It Makes Sense: Enabling In-Memory Computation", MICPRO 2019.
[6] V. Seshadri and O. Mutlu, "In-DRAM Bulk Bitwise Execution Engine", Advances in Computers, 2020.
[7] A. Boroumand et al., "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks", ASPLOS 2018.
[8] O. Mutlu, "Enabling Computation with Minimal Data Movement: Changing the Computing Paradigm for High Efficiency", Design Automation Summer School Lecture, DAC 2019. https://people.inf.ethz.ch/omutlu/pub/onur-DAC-DASS-EnablingInMemoryComputation-June-2-2019.pptx
[9] J. Ahn et al., "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing", ISCA 2015.
[10] V. Seshadri et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology", MICRO 2017.
[11] H. Luo et al., "CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off", ISCA 2020.
[12] K. Chang et al., "Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM", HPCA 2016.
[13] D. Lee et al., "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case", HPCA 2015.
[14] K. Hsieh et al., "Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation", ICCD 2016.
[15] K. Chang et al., "Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization", SIGMETRICS 2016.
[16] K. Chang et al., "Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms", SIGMETRICS 2017.
[17] D. Lee et al., "Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms", SIGMETRICS 2017.
[18] S. Ghose et al., "What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study", SIGMETRICS 2018.
[19] K. Hsieh et al., "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems", ISCA 2016.
[20] J. Ahn et al., "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture", ISCA 2015.
[21] J. Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh", ISCA 2012.
[22] B. C. Lee et al., "Architecting Phase Change Memory as a Scalable DRAM Alternative", ISCA 2009.
[23] D. Lee et al., "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM", PACT 2015.
[24] V. Seshadri et al., "Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses", MICRO 2015.
[25] D. Lee et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost", TACO 2016.
[26] H. Hassan et al., "ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality", HPCA 2016.
[27] M. Hashemi et al., "Accelerating Dependent Cache Misses with an Enhanced Memory Controller", ISCA 2016.
[28] M. Patel et al., "The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions", ISCA 2017.
[29] S. Khan et al., "Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content", MICRO 2017.
[30] J. S. Kim et al., "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies", BMC Genomics 2018.
[31] A. Das et al., "VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency", DAC 2018.
[32] J. S. Kim et al., "Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines", ICCD 2018.
[33] Y. Wang et al., "Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration", MICRO 2018.
[34] J. S. Kim et al., "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput", HPCA 2019.
[35] H. Hassan et al., "CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability", ISCA 2019.
[36] S. Song et al., "Improving Phase Change Memory Performance with Data Content Aware Access", ISMM 2020.
[37] G. Singh et al., "NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning", DAC 2019.
[38] O. Mutlu et al., "Enabling Practical Processing in and near Memory for Data-Intensive Computing", DAC 2019.
[39] E. Ipek et al., "Self Optimizing Memory Controllers: A Reinforcement Learning Approach", ISCA 2008.
[40] D. A. Jimenez and C. Lin, "Dynamic Branch Prediction with Perceptrons", HPCA 2001.
[41] D. A. Jimenez, "Fast Path-Based Neural Branch Prediction", MICRO 2003.
[42] D. A. Jimenez, "Piecewise Linear Branch Prediction", ISCA 2005.
[43] D. A. Jimenez, "An optimized scaled neural branch predictor", ICCD 2011.
[44] E. Teran et al., "Perceptron learning for reuse prediction", MICRO 2016.
[45] E. Garza et al., "Bit-level perceptron prediction for indirect branches", ISCA 2019.
[46] E. Bhatia et al., "Perceptron-based prefetch filtering", ISCA 2019.
[47] L. Peled et al., "A Neural Network Prefetcher for Arbitrary Memory Access Patterns", TACO 2020.
[48] L. Peled et al., "Semantic locality and context-based prefetching using reinforcement learning", ISCA 2015.
[49] R. Bitirgen et al., "Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach", MICRO 2008.
[50] J. Mukundan and J. F. Martinez, "MORSE: Multi-objective reconfigurable self-optimizing memory scheduler", HPCA 2012.
[51] J. F. Martinez and E. Ipek, "Dynamic Multicore Resource Management: A Machine Learning Approach", IEEE Micro 2009.
[52] N. Vijaykumar et al., "A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory", ISCA 2018.
[53] N. Vijaykumar et al., "The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs", ISCA 2018.
[54] S. Koppula et al., "EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM", MICRO 2019.
[55] K. Kanellopoulos et al., "SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations", MICRO 2019.
[56] N. Hajinazar et al., "The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework", ISCA 2020.
[57] Z. Yu et al., "Labeled RISC-V: A New Perspective on Software-Defined Architecture", CARRV 2017.
[58] J. Ma et al., "Supporting Differentiated Services in Computers via Programmable Architecture for Resourcing-on-Demand (PARD)", ASPLOS 2015.
[59] S. Rixner et al., "Memory access scheduling", ISCA 2000.
[60] W. K. Zuravleff and T. Robinson, "Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order", U.S. Patent Number 5,630,096, May 1997.
[61] O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems", ISCA 2008.
[62] H. Usui et al., "DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators", TACO 2016.
[63] O. Mutlu and T. Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", MICRO 2007.
[64] Y. Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers", HPCA 2010.
[65] Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior", MICRO 2010.
[66] I. Hur and C. Lin, "Adaptive History-Based Memory Schedulers", MICRO 2004.
[67] I. Hur and C. Lin, "Memory scheduling for modern microprocessors", ACM TOCS 2007.
[68] C. Natarajan et al., "A study of performance impact of memory controller features in multi-processor server environment", WMPI 2004.
[69] S. Rixner, "Memory controller optimizations for web servers", MICRO 2004.
[70] L. Subramanian et al., "BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling", IEEE TPDS 2016.
[71] L. Subramanian et al., "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost", ICCD 2014.
[72] R. Ausavarungnirun et al., "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", ISCA 2012.
[73] K. J. Nesbit et al., "Fair Queuing Memory Systems", MICRO 2006.
[74] G. Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches", PACT 2012.
[75] N. Vijaykumar et al., "A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps", ISCA 2015.
[76] G. Pekhimenko et al., "Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework", MICRO 2013.
[77] G. Pekhimenko et al., "A Case for Toggle-Aware Compression for GPU Systems", HPCA 2016.
[78] M. Ekman and P. Stenstrom, "A Robust Main-Memory Compression Scheme", ISCA 2005.
[79] A. Arelakis et al., "HyComp: a hybrid cache compression method for selection of data-type-specific compression methods", MICRO 2015.
[80] A. Arelakis and P. Stenstrom, "SC2: A statistical compression cache scheme", ISCA 2014.
[81] G. Pekhimenko et al., "Exploiting Compressed Block Size as an Indicator of Future Reuse", HPCA 2015.
[82] O. Mutlu, "Intelligent Architectures for Intelligent Machines", Keynote Talk at 17th ChinaSys Workshop, December 2019.
[107] Y. Luo et al., "Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost via Heterogeneous-Reliability Memory", DSN 2014.
[108] A. Boroumand et al., "CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators", ISCA 2019.
[109] V. Seshadri et al., "Fast Bulk Bitwise AND and OR in DRAM", IEEE CAL 2015.
[110] A. Boroumand et al., "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory", IEEE CAL 2016.
[111] M. Hashemi et al., "Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads", MICRO 2016.
[112] A. Pattnaik et al., "Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities", PACT 2016.
[113] D. Senol Cali et al., "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis", MICRO 2020.
[114] H. Hassan et al., "SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies", HPCA 2017.
[115] Y. Kim et al., "Ramulator: A Fast and Extensible DRAM Simulator", IEEE CAL 2015.
[116] J. Meza et al., "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory", WEED 2013.
[117] L. Subramanian et al., "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems", HPCA 2013.
[118] C. J. Lee et al., "Prefetch-Aware DRAM Controllers", MICRO 2008.
[119] M. Alser et al., "Accelerating Genome Analysis: A Primer on an
https://www.youtube.com/watch?v=n8Aj_A0WSg8 Ongoing Journey”, IEEE Micro, September/October 2020.
[83] M. Alser et al., “Shouji: A Fast and Efficient Pre-Alignment Filter for [120] O. Mutlu et al., “A Modern Primer on Processing in Memory”, Invited
Sequence Alignment”, Bioinformatics 2019. Book Chapter in Emerging Computing: From Devices to Systems -
[84] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Looking Beyond Moore and Von Neumann, Springer, 2021.
Bulk Data Copy and Initialization”, MICRO 2013. https://arxiv.org/abs/2012.03112
[85] D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost [121] L. N. Vintan and M. Iridon, “Towards a High Performance Neural
DRAM Architecture”, HPCA 2013. Branch Predictor”, IJCNN 1999.
[86] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism [122] L. Subramanian et al., “"The Application Slowdown Model: Quantifying
(SALP) in DRAM”, ISCA 2012. and Controlling the Impact of Inter-Application Interference at Shared
[87] B. C. Lee et al., “Phase Change Technology and the Future of Main Caches and Main Memory”, MICRO 2015.
Memory”, IEEE Micro 2010. [123] S. P. Muralidhara et al., “Reducing Memory Interference in Multicore
[88] Y. Li et al., “Utility-Based Hybrid Memory Management”, CLUSTER Systems via Application-Aware Memory Channel Partitioning",
2017. MICRO 2011.
[89] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for [124] E. Ebrahimi et al., “Parallel Application Memory Scheduling", MICRO
Hybrid Memories”, ICCD 2012. 2011.
[90] C. Wang et al., “Panthera: Holistic Memory Management for Big Data [125] C. J. Lee et al., “Prefetch-Aware Memory Controllers", IEEE TC 2011.
Processing over Hybrid Memories”, PLDI 2019. [126] E. Ebrahimi et al., “Prefetch-Aware Shared Resource Management for
[91] J. Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Multi-Core Systems", ISCA 2011.
Fine-Granularity DRAM Cache Management”, IEEE CAL 2012. [127] H. David et al., “Memory Power Management via Dynamic
[92] M. K. Qureshi et al., “Scalable high performance main memory system Voltage/Frequency Scaling", ICAC 2011.
using phase-change memory technology”, ISCA 2009. [128] I. Hur and C. Lin, “A Comprehensive Approach to DRAM Power
[93] M. K. Qureshi et al., “Morphable memory system: a robust architecture Management”, HPCA 2008.
for exploiting multi-level phase change memories”, ISCA 2010. [129] C. J. Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing
[94] C-C. Chou et al., “CAMEO: A Two-Level Memory Organization with Write-Caused Interference in Memory Systems", HPS Technical
Capacity of Main Memory and Flexibility of Hardware-Managed Report 2010.
Cache”, MICRO 2014. [130] E. Ebrahimi et al., “"Fairness via Source Throttling: A Configurable and
[95] V. Young et al., “Enabling Transparent Memory-Compression for High-Performance Fairness Substrate for Multi-Core Memory
Commodity Memory Systems”, HPCA 2014. Systems”, ASPLOS 2010.
[96] X. Yu et al., “Banshee: Bandwidth-Efficient DRAM Caching via [131] C. J. Lee et al., “"Improving Memory Bank-Level Parallelism in the
Software/Hardware Cooperation", MICRO 2017. Presence of Prefetching”, MICRO 2009.
[97] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective”, [132] Q. Deng et al., “MemScale: Active Low-Power Modes for Main
IMW 2013. Memory”, ASPLOS 2011.
[98] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in [133] B. Diniz et al., “Limiting the Power Consumption of Main Memory”,
Memory Systems”, SUPERFRI 2015. ISCA 2007.
[99] O. Mutlu and J. Kim, “RowHammer: A Retrospective", IEEE TCAD [134] V. Pandey et al., “DMA-aware Memory Energy Management”, HPCA
2019. 2006.
[100] Y. Cai et al., “Error Characterization, Mitigation, and Recovery in Flash [135] F. Zhang et al., “TADOC: Text Analytics Directly on Compression”,
Memory Based Solid State Drives”, Proc. IEEE 2017. VLDB Journal 2020.
[101] Y. Cai et al., “Errors in Flash-Memory-Based Solid-State Drives: [136] F. Zhang et al., “Enabling Efficient Random Access to Hierarchically-
Analysis, Mitigation, and Recovery”, Inside Solid-State Drives, 2018. Compressed Data”, ICDE 2020.
[102] O. Mutlu, “The RowHammer Problem and Other Issues We May Face [137] F. Zhang et al., “Efficient Document Analytics on Compressed Data:
as Memory Becomes Denser”, DATE 2017. Method, Challenges, Algorithms, Insights”, VLDB 2018.
[103] O. Mutlu et al., “Recent Advances in DRAM and Flash Memory [138] F. Zhang et al., “Zwift: A Programming Framework for High
Architectures”, IPSI TIR, July 2018. Performance Text Analytics on Compressed Data”, ICS 2018.
[104] Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An [139] O. Mutlu, “Intelligent Architectures for Intelligent Machines", Invited
Experimental Study of DRAM Disturbance Errors”, ISCA 2014. Talk at Texas State University Computer Science Seminar, Nov. 2019.
[105] J. S. Kim et al., “Revisiting RowHammer: An Experimental Analysis of https://www.youtube.com/watch?v=sJwO_BB4LaY
Modern Devices and Mitigation Techniques”, ISCA 2020. [140] O. Mutlu, “Memory-Centric Computing Sytems", Invited Tutorial at the
[106] P. Frigo et al., “TRRespass: Exploiting the Many Sides of Target Row 66th International Electron Devices Meeting (IEDM), Dec. 2019.
Refresh”, S&P 2020. [141] O. Mutlu, “Intelligent Architectures for Intelligent Machines", Keynote
Talk at ACM SYSTOR Conference, October 2020.
https://www.youtube.com/watch?v=V6Sq7OiQD90
[142] O. Mutlu, “Intelligent Architectures for Intelligent Machines", Plenary [179] Y. Luo et al., “Improving 3D NAND Flash Memory Lifetime by
Keynote Talk at VLSI-DAT/TSA, August 2020. Tolerating Early Retention Loss and Process Variation”, SIGMETRICS
https://www.youtube.com/watch?v=c6_LgzuNdkw 2018.
[143] M. Alser et al., “SneakySnake: A Fast and Accurate Universal Genome [180] Y. Luo et al., “HeatWatch: Improving 3D NAND Flash Memory Device
Pre-Alignment Filter for CPUs, GPUs, and FPGAs”, Bioinformatics Reliability by Exploiting Self-Recovery and Temperature-Awareness”,
2020. HPCA 2018.
[144] G. Singh et al., “NERO: A Near High-Bandwidth Memory Stencil [181] Y. Cai et al., “Data Retention in MLC NAND Flash Memory:
Accelerator for Weather Prediction Modeling”, FPL 2020. Characterization, Optimization and Recovery”, HPCA 2015.
[145] I. Fernandez et al., “NATSA: A Near-Data Processing Accelerator for [182] Y. Cai et al., “Flash Correct-and-Refresh: Retention-Aware Error
Time Series Analysis”, ICCD 2020. Management for Increased Flash Memory Lifetime”, ICCD 2012.
[146] Y. Wang et al., “FIGARO: Improving System Performance via Fine- [183] Y. Cai et al., “Error Patterns in MLC NAND Flash Memory:
Grained In-DRAM Data Relocation and Caching”, MICRO 2020. Measurement, Characterization, and Analysis”, DATE 2012.
[147] S. H. S. Rezaei et al., “NoM: Network-on-Memory for Inter-Bank Data [184] Y. Luo et al., “WARM: Improving NAND Flash Memory Lifetime with
Transfer in Highly-Banked Memories”, IEEE CAL 2020. Write-hotness Aware Retention Management”, MSST 2015.
[148] O. Mutlu, “Intelligent Architectures for Intelligent Machines”, VLSI- [185] Y. Cai et al., “Neighbor-Cell Assisted Error Correction for MLC
DAT 2020. NAND Flash Memories”, SIGMETRICS 2014.
[149] S. Dustdar et al., “Rethinking Divide and Conquer - Towards Holistic [186] Y. Cai et al., “Program Interference in MLC NAND Flash Memory:
Interfaces of the Computing Stack”, IEEE Internet Computing 2020. Characterization, Modeling, and Mitigation”, ICCD 2013.
[150] S. Srinath et al., “Feedback Directed Prefetching: Improving the [187] Y. Cai et al., “Threshold Voltage Distribution in MLC NAND Flash
Performance and Bandwidth-Efficiency of Hardware Prefetchers”, Memory: Characterization, Analysis and Modeling”, DATE 2013.
HPCA 2007. [188] Y. Cai et al., “Vulnerabilities in MLC NAND Flash Memory
[151] R. Bera et al., “DSPatch: Dual Spatial Pattern Prefetcher", MICRO Programming: Experimental Analysis, Exploits, and Mitigation
2019. Techniques”, HPCA 2017.
[152] E. Ebrahimi et al., “Techniques for Bandwidth-Efficient Prefetching of [189] Y. Cai et al., “Read Disturb Errors in MLC NAND Flash Memory:
Linked Data Structures in Hybrid Prefetching Systems”, HPCA 2009. Characterization and Mitigation”, DSN 2015.
[153] E. Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in [190] Y. Luo et al., “Enabling Accurate and Practical Online Flash Channel
Multi-Core Systems”, MICRO 2009. Modeling for Modern MLC NAND Flash Memory”, IEEE JSAC 2016.
[154] O. Mutlu et al., “Runahead Execution: An Alternative to Very Large [191] M. Qureshi et al., “AVATAR: A Variable-Retention-Time (VRT)
Instruction Windows for Out-of-order Processors”, HPCA 2003. Aware Refresh for DRAM Systems”, DSN 2015.
[155] O. Mutlu et al., “Techniques for Efficient Processing in Runahead [192] S. Khan et al., “PARBOR: An Efficient System-Level Technique to
Execution Engines”, ISCA 2005. Detect Data-Dependent Failures in DRAM”, DSN 2016.
[156] K. J. Nesbit and J. E. Smith, “Data Cache Prefetching Using a Global [193] S. Khan et l., “Detecting and Mitigating Data-Dependent DRAM
History Buffer”, HPCA 2004. Failures by Exploiting Current Memory Content”, MICRO 2017.
[157] M. Shevgoor et al., “Efficiently Prefetching Complex Address [194] H. Xin et al., “Shifted Hamming Distance: A Fast and Accurate SIMD-
Patterns”, MICRO 2015. friendly Filter to Accelerate Alignment Verification in Read Mapping”,
[158] J. Kim et al., “Path Confidence based Lookahead Prefetching”, MICRO Bioinformatics 2015.
2016. [195] M. Alser et al., “GateKeeper: A New Hardware Architecture for
[159] M. K. Qureshi et al., “A Case for MLP-Aware Cache Replacement”, Accelerating Pre-Alignment in DNA Short Read Mapping",
ISCA 2006. Bioinformatics 2017.
[160] V. Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism [196] H. Xin et al., “"Accelerating Read Mapping with FastHASH", BMC
to Address Both Cache Pollution and Thrashing”, PACT 2012. Genomics 2013.
[161] M. K. Qureshi et al., “Utility-Based Cache Partitioning: A Low- [197] C. Alkan et al., “Personalized copy number and segmental duplication
Overhead, High-Performance, Runtime Mechanism to Partition Shared maps using next-generation sequencing”, Nature Genetics 2009.
Caches”, MICRO 2006. [198] K. Hsieh et al., “Gaia: Geo-Distributed Machine Learning Approaching
[162] M. K. Qureshi et al., “Adaptive Insertion Policies for High Performance LAN Speeds”, NSDI 2017.
Caching” ISCA 2007. [199] K. Hsieh et al., “The Non-IID Data Quagmire of Decentralized Machine
[163] K. Hsieh et al., “Focus: Querying Large Video Datasets with Low Learning", ICML 2020.
Latency and Low Cost”, OSDI 2018. [200] T. Moscibroda and O. Mutlu, “A Case for Bufferless Routing in On-
[164] R. L. Sites, “It’s the Memory, Stupid!”, MPR, 1996. Chip Networks”, ISCA 2009.
[165] S. Kanev et al., “Profiling a Warehouse-Scale Computer”, ISCA 2015. [201] R. Das et al., “Application-Aware Prioritization Mechanisms for On-
[166] V. J. Reddi et al., “Web Search using Mobile Cores: Quantifying and Chip Networks”, MICRO 2009.
Mitigating the Price of Efficiency”, ISCA 2010. [202] R. Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip
[167] J. Liu et al., “An Experimental Study of Data Retention Behavior in Networks”, ISCA 2010.
Modern DRAM Devices: Implications for Retention Time Profiling [203] R. Das et al., “Application-to-Core Mapping Policies to Reduce
Mechanisms”, ISCA 2013. Memory System Interference in Multi-Core Systems”, HPCA 2013.
[168] M. Patel et al., “The Reach Profiler (REAPER): Enabling the Mitigation [204] B. Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and
of DRAM Retention Failures via Profiling at Aggressive Conditions”, Cost-effective QOS Scheme for Networks-on-Chip”, MICRO 2009.
ISCA 2017. [205] C. Fallin et al., “CHIPPER: A Low-Complexity Bufferless Deflection
[169] S. Khan et al., “The Efficacy of Error Mitigation Techniques for Router”, HPCA 2011.
DRAM Retention Failures: A Comparative Experimental Study”, [206] O. Kayiran et al., “Managing GPU Concurrency in Heterogeneous
SIGMETRICS 2014. Architectures”, MICRO 2014.
[170] U. Kang et al., “Co-architecting Controllers and DRAM to Enhance [207] C. Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for
DRAM Process Scaling”, Memory Forum 2014. Energy-Efficient Interconnect”, NOCS 2012.
[171] S. Ghose et al., “What Your DRAM Power Models Are Not Telling [208] J. Zhao et al., “FIRM: Fair and High-Performance Memory Control for
You: Lessons from a Detailed Experimental Study”, SIGMETRICS Persistent Memory Systems”, MICRO 2014.
2018. [209] G. Nychis et al., “On-Chip Networks from a Networking Perspective:
[172] E. Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Congestion and Scalability in Many-core Interconnects”, SIGCOMM
Memory Alternative”, ISPASS 2013. 2012.
[173] B. C. Lee et al., “Phase Change Memory Architecture and the Quest for [210] N. Vijaykumar et al., “A Case for Core-Assisted Bottleneck
Scalability”, CACM 2010. Acceleration in GPUs: Enabling Flexible Data Compression with Assist
[174] H. Yoon et al., “Efficient Data Mapping and Buffering Techniques for Warps”, ISCA 2015.
Multi-Level Cell Phase-Change Memories”, ACM TACO 2015. [211] J. A. Joao et al., “Utility-Based Acceleration of Multithreaded
[175] F. Gao et al., “ComputeDRAM: In-Memory Compute Using Off-the- Applications on Asymmetric CMPs”, ISCA 2013.
Shelf DRAMs”, MICRO 2019. [212] K. K. Rangan et al., “Thread Motion: Fine-grained Power Management
[176] S. Li et al., “Pinatubo: A Processing-in-Memory Architecture for Bulk for Multi-core Systems”, ISCA 2009.
Bitwise Operations in Emerging Non-volatile Memories”, DAC 2016. [213] V. J. Reddi et al., “Voltage Emergency Prediction: Using Signatures to
[177] S. Kvatinsky et al., “MAGIC - Memristor-Aided Logic”, IEEE TCAS II Reduce Operating Margins”, HPCA 2009.
2014. [214] J. Haj-Yahya et al., “Techniques for Reducing the Connected-Standby
[178] S. Aga, “Compute Caches”, HPCA 2017. Energy Consumption of Mobile Devices”, HPCA 2020.
