A Dynamic Hash Table for the GPU

Abstract—We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.

I. INTRODUCTION

A key deficiency of the GPU ecosystem is its lack of dynamic data structures, which allow incremental updates (such as insertions and deletions). Instead, GPU data structures (e.g., cuckoo hash tables [1]) typically address incremental changes to a data structure by rebuilding the entire data structure from scratch. A few GPU data structures (e.g., the dynamic graph data structure in cuSTINGER [2]) implement phased updates, where updates occur in a different execution phase than lookups. In this work we describe the design and implementation of a hash table for GPUs that supports truly concurrent insertions and deletions that can execute together with lookups.

Supporting high-performance concurrent updates of data structures on GPUs represents a significant design challenge. Modern GPUs support tens of thousands of simultaneous resident threads, so traditional lock-based methods that enforce concurrency will suffer from substantial contention and will thus likely be inefficient. Non-blocking approaches offer more potential for such massively parallel frameworks, but most of the multi-core system literature (e.g., classic non-blocking linked lists [3]) neglects the sensitivity of GPUs to memory access patterns and branch divergence, which makes it inefficient to directly translate those ideas to the GPU.

In this paper, we present a new GPU hash table, the slab hash, that supports bulk and incremental builds. One might expect that supporting incremental insertions and deletions would result in significantly reduced query performance compared to static data structures. However, our hash table not only supports updates with high performance but also sustains build and query performance on par with static GPU hash tables. Our hash table is based on a novel linked list data structure, the slab list. Previous GPU implementations of linked lists [4], which operate on a thread granularity and contain a data element and pointer per linked list node, exhibit poor performance because they suffer from control and memory divergence and incur significant space overhead. The slab list instead operates on a warp granularity, with a width equal to the SIMD width of the underlying machine, and contains many data elements per linked list node. Its design minimizes control and memory divergence and uses space efficiently. We then construct the slab hash from this slab list as its building block, with one slab list per hash bucket. Our contributions in this work are as follows:

• The slab list is based on a node structure that closely matches the GPU's hardware characteristics.
• The slab list implementation leverages a novel warp-cooperative work sharing strategy that minimizes branch divergence, using warp-synchronous programming and warp-wide communications.
• The slab hash, based on the slab list, supports concurrent operations with high performance.
• To allow concurrent updates, we design and implement a novel memory allocator that dynamically and efficiently allocates and deallocates memory in a way that is well-matched to our underlying warp-cooperative implementation.
• Our memory allocator is scalable, allowing us to support data structures up to 1 TB (far larger than the memory size of current GPUs) and without any CPU intervention.
• The slab hash's bulk-build and search rates are comparable to those of static methods (e.g., GPU cuckoo hashing [1]), while additionally achieving efficient incremental updates.
II. BACKGROUND & RELATED WORK

Graphics Processing Unit (GPU): GPUs are massively parallel processors with thousands of parallel active threads. Threads are grouped into SIMD units of width 32, called warps, and each warp executes instructions in lockstep. As a result, any branch statements that cause threads to run different instructions are serialized (branch divergence). A group of threads (multiple warps) is called a thread block and is scheduled to run on different streaming processors (SMs) on the GPU. The memory hierarchy of GPUs is organized into a large global memory accessible by all threads within the device (e.g., 12 GB on the Tesla K40c), smaller but faster shared memory for each thread block (48 KB per SM on the Tesla K40c), and local registers for each thread in the thread block (64 KB per SM on the Tesla K40c). Maximizing achieved memory bandwidth requires accessing consecutive memory indices within a warp (coalesced access). NVIDIA GPUs support a set of warp-wide instructions (e.g., shuffles and ballots) so that all threads within a warp can communicate with each other.

Hash tables: There are several efficient static hash tables implemented for GPUs. Alcantara et al. [1] proposed an open-addressing cuckoo hashing scheme for GPUs. This method supports bulk build and search, both of which require minimal memory accesses in the best case: a single atomic operation for inserting a new element, and a regular memory read for a search. As the load factor increases, it is increasingly likely that a bulk build using cuckoo hashing fails. Garcia et al. [5] proposed a method based on Robin Hood hashing that focuses on higher load factors and uses more spatial locality for graphics applications, at the expense of performance degradation compared to cuckoo hashing. Khorasani et al.'s stadium hashing [15] is also based on a cuckoo hashing scheme but stores two tables instead of one. Its focus is mainly on out-of-core hash tables that cannot fit in a single GPU's memory. In the best case (i.e., an empty table) and with a randomly generated key, an insertion in this method requires one atomic operation and a regular memory write. A search operation in stadium hashing requires at least two memory reads. Although hash tables may be specifically designed for special applications, Alcantara's cuckoo hashing appears to be the best general-purpose in-core hash table option with the best performance measures. We use this method for our comparisons in Section VI.

Misra and Chaudhuri [4] implemented a lock-free linked list, which led to a lock-free hash table with chaining that supported concurrent insertion, deletion and search. Their implementation is not fully dynamic, because it pre-allocates all future insertions into an array (which must be known at compile time), and it does not address the challenge of dynamically allocating new elements and deallocating deleted elements at runtime. However, we briefly compare it to the slab hash in Section VI-C. Inspired by Misra and Chaudhuri's work, Moscovici et al. [6] recently proposed a lock-based GPU-friendly skip list (GFSL) with an emphasis on the GPU's preferred coalesced memory accesses. We will also discuss in Section VI-C why we believe GFSL (either by itself or as a building block of a larger data structure) cannot outperform our lock-free slab hash in updates and searches. I/O-sensitive linked lists were studied in the CPU context by Bender et al. [7].

Dynamic memory allocation: Although a mature technology for single and multi-core systems, dynamic memory allocation is still considered a challenging research problem on massively parallel frameworks such as GPUs. Massive parallelism makes it difficult to directly exploit traditional allocation strategies such as lock-based or private-heap schemes without a significant performance degradation. CUDA [8] provides a built-in malloc that dynamically allocates memory on the device (GPU). However, it is not efficient for small allocations (less than 1 kB). To address malloc's inefficiencies for small allocations, almost every competitive proposed method so far is based on the idea of allocating numerous large enough memory pools (with different terminology), assigning each memory pool to a thread, a warp, or a thread block (to decrease parallel contention), dynamically allocating or deallocating small portions of it based on received requests, and finally implementing a mechanism to use another memory pool once one is fully allocated. Some methods use hashing to operate on different memory pools (e.g., Halloc [9]). Other methods use various forms of linked lists to move into different memory pools (e.g., CMalloc [10]). All these methods maintain various flags (or bitmaps) and operate on them atomically to be able to allocate or deallocate memory.

Vinkler et al. have provided an extensive study of all these methods and some benchmarks to compare their performance [10]. The most efficient ones, CMalloc and Halloc, perform best when there are multiple allocation requests within each warp that can be formed into a single but larger allocation per warp (a coalesced allocation). However, for the warp-cooperative work sharing strategy we use in this work (Section IV-A), we need an allocator that can handle numerous independent but sequentially available allocation requests per warp, which cannot be formed into a single larger coalesced allocation to avoid divergence overheads. As we will see in Section V, existing allocators perform poorly in such scenarios. Instead, we propose a novel warp-synchronous allocator, SlabAlloc, that uses the entire warp to efficiently allocate fixed-size slabs with modest register usage and minimal branch divergence (more details in Section V).

III. DESIGN DESCRIPTION

A linked list is a linear data structure whose elements are stored in non-contiguous parts of the memory. These
arbitrary memory accesses are handled by storing the memory address (i.e., a pointer) of the next element of the list alongside the data stored at each node. The simplicity of linked lists makes concurrent updates relatively easy to support, using compare-and-swap (CAS) operations [3]. New nodes can be inserted by (1) allocating a new node, (2) initializing the data it contains and storing the successor's address into its next pointer, then (3) atomically compare-and-swapping the new node's address with its predecessor's next pointer. Similarly, nodes can be deleted by (1) atomically marking a node as deleted (to make sure no new node is inserted beyond it) and then (2) compare-and-swapping its predecessor's pointer with its successor's address.
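As a minimal illustration of steps (1)-(3), the following CUDA-flavored sketch shows the classic CAS-based insertion for a simplified, key-only node (our own simplification: it ignores the deletion marks that a complete non-blocking list also needs):

#include <cstdint>

struct Node {
  uint32_t key;
  Node*    next;
};

// Insert new_node immediately after pred, retrying until the CAS succeeds.
__device__ void insert_after(Node* pred, Node* new_node) {
  // The next pointer is updated through a 64-bit atomicCAS on its address.
  unsigned long long* slot = reinterpret_cast<unsigned long long*>(&pred->next);
  unsigned long long expected = *slot;
  while (true) {
    new_node->next = reinterpret_cast<Node*>(expected);            // step (2)
    unsigned long long old =
        atomicCAS(slot, expected,
                  reinterpret_cast<unsigned long long>(new_node)); // step (3)
    if (old == expected) break;  // success: pred->next now points to new_node
    expected = old;              // another thread changed pred->next; retry
  }
}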
On GPUs, it is possible to implement the same set of operations for a linked list, and then use it as a building block of other data structures (e.g., in hash tables) [4]. However, this implementation requires an arbitrary random memory access per unit of stored data, which is not ideal for any high-performance GPU program. Furthermore, making any change to a linked list data structure requires dynamic memory allocation, which itself is challenging to perform efficiently, especially on massively parallel devices such as GPUs (Section II). In this work, we propose a new linked list data structure, the slab list, and then use it to implement a dynamic hash table (slab hash). In our design, we have two major goals in mind: (1) maximizing performance in maintaining several slab lists concurrently (suited for hash tables), and (2) having better memory utilization by reducing the memory overhead in classic linked lists. We present the slab list and slab hash in this section, and then provide implementation details in Section IV.

A. Slab list

Classic singly linked lists consist of nodes with two main distinctive parts: a single unit of data element (a key, a key-value pair, or any other arbitrary metadata associated with a key), and a next pointer to its successor node. Storing the successor's memory address makes linked lists flexible and powerful in dealing with mutability issues. However, it introduces additional memory overhead per stored unit of data. Moreover, the efficiency of linked list operations is one of our primary concerns.

In classic linked list usage, an operation (insertion/deletion/search) is often requested from a single independent thread. These high-level operations translate into lower-level operations on the linked list itself. In turn, these lower-level operations result in a series of random, sequential memory operations per thread. Because we expect that a parallel program that accesses such a data structure will feature numerous simultaneous operations on the data structure, we can easily parallelize operations across threads. But in modern parallel hardware with SIMD cores, including GPUs, the peak memory bandwidth is only achieved when threads within each SIMD unit (i.e., a warp in GPUs) access consecutive memory indices with a certain fixed alignment (e.g., on NVIDIA GPUs, each thread fetches a 32-bit word per memory access, i.e., 128 bytes per warp).

There are some well-known tactics to avoid coalesced-memory issues in GPUs, such as using structure-of-arrays instead of array-of-structures data layouts, or first fetching a big block of items into a faster but locally shared memory (with coalesced memory accesses) and then accessing the local memory with an arbitrary alignment. However, none of these methods is effective with a linked list data structure that requires singleton structures distributed randomly in the memory domain. As a result, we propose to use an alternate linked list design that is more suitable for our hardware platform.

Braginsky and Petrank proposed a lock-free linked list on the CPU [11] that achieves better locality of reference by ensuring that a certain number of regular linked list nodes (an entry; a data element and a pointer) will all be arbitrarily placed into a larger contiguous structure (a chunk) that would fit in a single cache line. As a result, each entry would only point to another entry within that chunk. Each chunk itself would point to another chunk to form the whole list. We also achieve better locality, but in a different way. We use a larger linked list node (called a slab, or interchangeably a memory unit) that consists of multiple data elements and a single pointer to its successor slab (shown in Fig. 1). The main difference is that slabs are fixed in size, and all data elements within a slab share a single next pointer.

An immediate advantage of slab lists is that their memory overhead is reduced by approximately a factor of M (if there are M data elements per slab). However, our main motivation for using large slabs is to be able to maintain them in parallel, meaning that the whole slab is accessed with a minimum number of memory accesses and in parallel (distributed among multiple threads), and then operations are also performed in parallel. The optimal size of these slabs will depend on the hardware characteristics of the target platform, including both the way memory accesses are handled as well as communication possibilities among different threads.

On GPUs, we operate on each slab with a single SIMD unit (a warp) and use available warp-wide intrinsics such as shuffles and ballots for communications. So, the size of a slab will be a modest multiple of the warp width (e.g., 32 consecutive 32-bit words). We do not maintain order within our slabs. GPU hardware enables us to search within an unordered set of 32 words with a single ballot instruction. So, as long as we keep the slab list relatively short (e.g., ∼10 slabs), we can have faster updates with negligible extra search cost. If not, extra measures should be taken, including maintaining an inter-slab order.
Figure 1: Regular linked list and the slab list. (A regular node stores one key-value pair and a next pointer; a slab stores multiple key-value pairs kv1, kv2, ..., kvM that share a single next pointer.)

B. Supported Operations in Slab Lists

Suppose our slab list maintains a set of keys (or key-value pairs), here represented by S. Depending on whether or not we allow duplicate keys in our data structure, we support the following operations:

• INSERT(k, v): S ← S ∪ {⟨k, v⟩}. Insert a new key-value pair into the slab list.
• REPLACE(k, v): S ← (S − {⟨k, ∗⟩}) ∪ {⟨k, v⟩}. Insert a new key-value pair with an extra restriction on maintaining uniqueness among keys (i.e., replace a previously inserted key if it exists).
• DELETE(k): S ← S − {⟨k, v⟩ ∈ S}. Remove the least recently inserted key-value pair ⟨k, v⟩.
• DELETEALL(k): S ← S − {⟨k, ∗⟩}. Delete all instances of a key in the slab list.
• SEARCH(k): Return the least recent ⟨k, v⟩ ∈ S, or ⊥ if not found.
• SEARCHALL(k): Return all found instances of k in the data structure ({⟨k, ∗⟩ ∈ S}), or ⊥ if not found.

1) Search (SEARCH and SEARCHALL): Searching for a specific key in a slab list is similar to classic linked lists. We start from the head of the list and look for the key within that memory unit. If none of the data units possess such a key, we load the next memory unit based on the stored successor pointer. In SEARCH we return the first found matching element, but in SEARCHALL we continue searching the whole list. In both cases, if no matching key is found we return ⊥.

2) Insertion (INSERT and REPLACE): For the INSERT operation, we make no extra effort to ensure uniqueness among the keys, which makes the operation a bit easier. We simply start from the head of the list and use an atomic CAS to insert the new key-value pair into the first empty data unit we find. If the CAS operation is successful, then insertion is done. Otherwise, it means that some other thread has successfully inserted a pair into that empty data unit, and we have to try again and look for a new empty spot. If the current memory unit is completely full (no empty spot), we load the next memory unit and repeat the procedure. If we reach the end of the list, it means that the linked list requires more memory units to contain the new element. As a result, we dynamically allocate a new memory unit and use another atomic CAS to replace the null pointer currently existing in the tail's successor address with the address of the newly allocated memory unit. If it is successful, we restart our insertion procedure from the tail again. If it failed, it means some other thread has successfully added a new memory unit to the list. Hence, we release the newly allocated memory unit and restart our insertion process from the tail.

REPLACE is similar to INSERT except that we have to search the entire list to see if there exists a previously inserted key k. If so, then we use an atomic CAS to replace it with the new pair. If not, we simply perform INSERT starting from the tail of the list.

3) Deletion (DELETE and DELETEALL): To delete a key, we start from the head slab and look for the matching key. If found, we mark the element as deleted.¹ If not, we continue to the next slab. We continue this process until we reach the end of the list. For DELETE, we return after deleting the first matching element, but for DELETEALL we process the whole list. We later describe our FLUSH operation that locks the list, removes all stale elements (marked as deleted), and rebalances the list to have the minimum number of necessary memory units, releasing extra memory units for later allocations.

In case we allow duplicates, we can simply mark a to-be-deleted element as empty. In this case, later insertions that use INSERT can potentially find these empty spots down the list and insert new items in them. However, if we do not allow duplicates, in order to correctly maintain the uniqueness condition, we must mark deleted elements differently than being empty to avoid inserting a key that already exists in the list (somewhere in its successive memory units).

¹In our design we reserve two 32-bit values in the key domain to denote 1) an empty spot, and 2) a deleted key.

C. Slab Hash: A Dynamic Hash Table

Our slab hash is a dynamic hash table (meaning that we support not only operations like searches that do not change the contents of the hash table but also operations like insertions and deletions that do) built from a set of B slab lists (buckets).² This hash table uses chaining as its collision resolution. More specifically, we use a direct-address table of B buckets (base slabs), where each bucket corresponds to a unique hashed value from 0 to B − 1 [12]. Each base slab is the head of an independent slab list consisting of slabs, as introduced in Section III-A, each with M data points to be filled. In general, base slabs and regular slabs can differ in their structures in order to allow additional implementation features (e.g., pointers to the tail, number of slabs, number of stored elements, etc.). For simplicity and without loss of generality, here we assume there is no difference between them.

²Similar to other hash tables (on CPU or GPU), the slab hash is capacity based, meaning that our performance depends on the initial number of buckets. As shown in Section VI, for any choice of B, we can cause performance degradation by continually increasing the number of elements (but it never breaks).
We use a simple universal hash function such as h(k; a, b) = ((ak + b) mod p) mod B, where a and b are arbitrary random integers and p is a random prime number. As a result, on average, keys are distributed uniformly among all buckets with an average slab count of β = n/(MB) slabs per bucket, where n is the total number of elements in the hash table. For searching a key that does not exist in the table (i.e., an unsuccessful search), we should perform Θ(1 + β) memory accesses. A successful search is slightly better, but has similar asymptotic behavior.
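A minimal sketch of this bucket mapping (the helper name is ours; a, b, and p are drawn on the host, with p a prime larger than any key):

#include <cstdint>

// h(k; a, b) = ((a*k + b) mod p) mod B, computed in 64 bits to avoid overflow.
__host__ __device__ inline uint32_t hash_to_bucket(uint32_t k, uint32_t a,
                                                   uint32_t b, uint32_t p,
                                                   uint32_t num_buckets) {
  uint64_t h = (static_cast<uint64_t>(a) * k + b) % p;
  return static_cast<uint32_t>(h % num_buckets);
}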
In order to be able to compare our memory usage with open-addressing hash tables that do not use any pointers (e.g., cuckoo hashing [1]), we define the memory utilization to be the amount of memory actually used to store the data over the total amount of used memory (including pointers and unused empty slots). If each element and pointer take x and y bytes of memory respectively, then each slab requires Mx + y bytes. As a result, our slab hash would achieve a memory utilization equal to xn / ((Mx + y) Σ_{i=0}^{B−1} k_i) ≤ Mx/(Mx + y), where k_i denotes the number of slabs for bucket i. For open-addressing hash tables, memory utilization is equal to the load factor, i.e., the number of stored elements divided by the table size.
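As a quick sanity check of the upper bound, take the slab layout of Section IV-B: a 128 B slab holds M = 15 key-value pairs of x = 8 bytes each, with y = 8 bytes left over for the address and auxiliary lanes, so Mx/(Mx + y) = (15 × 8)/(15 × 8 + 8) = 120/128 ≈ 0.94, consistent with the 94% maximum utilization quoted there.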
IV. IMPLEMENTATION DETAILS

In this section we focus on our technical design choices and implementation details, primarily influenced by the hardware characteristics of NVIDIA GPUs.

A. Our warp-cooperative work sharing strategy

A traditional, but not necessarily efficient, way to perform a set of independent tasks on a GPU is to assign and process an independent task on each thread (e.g., classic linked list operations on the GPU [4]). An alternative approach is to do a per-warp work assignment followed by per-warp processing (e.g., warp-wide histogram computation [13]). In this work we propose a new approach where threads are still assigned to do independent tasks (per-thread assignment), but the work is processed in parallel by the whole warp (per-warp processing). We call this a warp-cooperative work sharing (WCWS) strategy. This strategy is particularly useful under the following circumstances: 1) threads are assigned to independent-but-different tasks (irregular workload); 2) each task requires an arbitrarily placed but vectorized memory access (accessing consecutive memory units); 3) it is possible to process each task in parallel within a warp using warp-wide communication (warp friendly). In our data structure context, this means that we form a work queue of the arbitrary operations requested by different threads within a warp, and all threads within that warp cooperate to process these operations one at a time (based on a pre-defined intra-warp order) until the whole work queue is empty.

If data is properly distributed among the threads, as it is naturally in our slab-based design, then regular data structure operations such as looking for a specific element can simply be implemented in parallel using simple warp-wide instructions (e.g., ballots and shuffles). An immediate advantage of the WCWS strategy is that it significantly reduces branch divergence when compared to traditional per-thread processing. A disadvantage is that we should always keep all threads within a warp active in order to correctly perform even a single task (avoiding branches on threads). But this limitation already exists for many CUDA warp-wide instructions and can be easily avoided by using the same tricks [8, Chapter B.15].
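A generic sketch of this work-queue pattern (ours, not the released code; the elided body is where the warp cooperatively processes one task):

#include <cstdint>

// WCWS skeleton: every thread may carry one task (is_active == true); the
// whole warp cooperates on one task at a time until no task remains.
__device__ void wcws_loop(bool is_active, uint32_t my_task) {
  const unsigned FULL_MASK = 0xFFFFFFFFu;
  const int lane_id = threadIdx.x & 0x1F;

  uint32_t work_queue = __ballot_sync(FULL_MASK, is_active);
  while (work_queue != 0) {
    int src_lane = __ffs(work_queue) - 1;                       // next task owner
    uint32_t task = __shfl_sync(FULL_MASK, my_task, src_lane);  // broadcast it

    // ... all 32 threads process `task` here, e.g., each reads one word of a
    // 128-byte slab and the warp votes on the outcome with __ballot_sync ...

    if (lane_id == src_lane) is_active = false;                 // task resolved
    work_queue = __ballot_sync(FULL_MASK, is_active);
  }
}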
B. Choice of parameters

As we emphasized in Section III, the main motivation behind introducing slabs in our design is to have better coalesced memory accesses. Hence, we chose our slab sizes to be a multiple of each warp's physical memory access width, i.e., at least 32 × 4 B for current architectures. Throughout the rest of the paper, we assume each slab is exactly 128 B, so that once a warp accesses a slab each thread has exactly 1/32 of the slab's content. So, when we use the term "lane" for a slab, we mean that portion of the slab that is read by the corresponding warp's thread. We currently support two item data types: 1) 32-bit entries (key-only), and 2) 64-bit entries (key-value pairs), but our design can be extended to support other data types. In both cases, slab lanes 0–29 contain the data elements (in the key-value case, even and odd lanes contain keys and values respectively). We refer to lane 31 as the address lane, while lane 30 is used as an auxiliary element (flags and pointer information if required). As a result, slab lists (and the derived slab hash) can achieve a maximum memory utilization of 94%.
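A sketch of the resulting slab layout (constant names are ours and may differ from the released implementation):

#include <cstdint>

// One slab = 32 consecutive 32-bit words, fetched by one warp in a single
// coalesced 128-byte read (one word per lane).
constexpr int      SLAB_WORDS     = 32;
constexpr int      AUX_LANE       = 30;  // flags / auxiliary information
constexpr int      ADDRESS_LANE   = 31;  // 32-bit SlabAlloc address of the next slab
constexpr uint32_t VALID_KEY_MASK = 0x3FFFFFFFu;  // lanes 0-29 carry data (a key-value
                                                  // build restricts this to the key lanes)

struct Slab {
  uint32_t data[30];  // keys, or interleaved key/value pairs on even/odd lanes
  uint32_t aux;       // lane 30
  uint32_t next;      // lane 31
};
static_assert(sizeof(Slab) == 128, "a slab must be exactly 128 bytes");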
C. Operation details

Here we provide more details about some of the slab hash operations discussed in Section III-B. We thoroughly discuss SEARCH, REPLACE (insertion when uniqueness is maintained), and DELETE, and then briefly explain our methodology for the FLUSH operation. In our explanations, we use some simplified code snippets. For example, ReadSlab() takes a 32-bit address layout of a slab as input; each thread reads its corresponding data portion. SlabAddress() extends a 32-bit address layout to a 64-bit memory address (more details about memory address layouts are in Section V).

1) SEARCH: Figure 2 shows a simplified pseudocode for the SEARCH procedure in our slab hash. As an input, any thread that has a search query to perform sets is_active to true. Keys are stored in myKey and the result will be stored in myValue. By following the WCWS strategy introduced before, all threads within a warp participate in performing every search operation within that warp, one operation at a time. First, we form a local warp-wide work queue (line 3) by using a ballot instruction and asking whether any thread has something to search for. Then, all threads go into a while loop (line 4) and repeat until all search queries are processed. At each round, all threads can process the work queue and find the next lane within the
warp that has the priority to perform its search query (the source lane, line 6). This is done by using a pre-defined procedure next_prior(), which can be implemented as simply as finding the first set bit in the work queue (using CUDA's __ffs). Then all threads ask for the source lane's query key using a shuffle instruction (line 6), and hash it to compute its corresponding bucket id (line 7).

The whole warp then performs a coalesced memory read from global memory (ReadSlab()), which takes the 32-bit address layout of the slab as well as the lane id of each thread. If we are currently at the linked list's base slab (the bucket head), we will find the corresponding slab's contents in a fixed array. Otherwise, we use our SlabAlloc allocator and compute the unique 64-bit address of that allocated slab by using the 32-bit next variable. Now, every thread has read its portion of the target slab. By using a ballot instruction we can ask whether any valid thread possesses the source lane's query (src_key), and then compute its position found_lane (line 14). If found, we ask for its corresponding value by using a shuffle instruction and asking for its subsequent thread's read_data, which stores the result from the requested source lane. The source lane then stores back the result and marks its query as resolved (line 18). If not found, we must go to the next slab. To find it, we ask the address lane for its address (line 21) and update the next_ptr. If the next_ptr was empty, it means that we have reached the slab list's tail and the query does not exist (line 24). Otherwise, we update the next variable and continue with the next loop iteration. At each iteration, we initially check whether the work queue has changed (someone has successfully processed its query) or we are still processing the same query (but are now searching in allocated slabs rather than the base slab).

1:  __device__ void warp_operation(bool &is_active, uint32_t &myKey, uint32_t &myValue) {
2:    next ← BASE_SLAB;
3:    work_queue ← ballot(is_active);
4:    while (work_queue != 0) do
5:      next ← (if work_queue is changed) ? (BASE_SLAB) : next;
6:      src_lane ← next_prior(work_queue); src_key ← shfl(myKey, src_lane);
7:      src_bucket ← hash(src_key); read_data ← ReadSlab(next, laneId);
8:      warp_search_macro() OR warp_replace_macro() OR warp_delete_macro()
9:      work_queue ← ballot(is_active);
10:   end while
11: }
12: // ============================================
13: warp_search_macro()
14: found_lane ← ffs(ballot(read_data == src_key) & VALID_KEY_MASK);
15: if (found_lane is valid) then
16:   found_value ← shfl(read_data, found_lane + 1);
17:   if (laneId == src_lane) then
18:     myValue ← found_value; is_active ← false;
19:   end if
20: else
21:   next_ptr ← shfl(read_data, ADDRESS_LANE);
22:   if (next_ptr is an empty address pointer) then
23:     if (laneId == src_lane) then
24:       myValue ← SEARCH_NOT_FOUND; is_active ← false;
25:     end if
26:   else
27:     next ← next_ptr;
28:   end if
29: end if
30: // ============================================
31: warp_replace_macro()
32: dest_lane ← ffs(ballot(read_data == EMPTY || read_data == src_key) & VALID_KEY_MASK);
33: if dest_lane is valid then
34:   if (src_lane == laneId) then
35:     old_pair ← atomicCAS(SlabAddress(next, dest_lane), EMPTY_PAIR, ⟨myKey, myValue⟩);
36:     if (old_pair == EMPTY_PAIR) then
37:       is_active ← false;
38:     end if
39:   end if
40: else
41:   next_ptr ← shfl(read_data, ADDRESS_LANE);
42:   if next_ptr is empty then
43:     new_slab_ptr ← SlabAlloc::warp_allocate();
44:     if (laneId == ADDRESS_LANE) then
45:       temp ← atomicCAS(SlabAddress(next, ADDRESS_LANE), EMPTY_POINTER, new_slab_ptr);
46:       if (temp != EMPTY_POINTER) then
47:         SlabAlloc::deallocate(new_slab_ptr);
48:       end if
49:     end if
50:   else
51:     next ← next_ptr;
52:   end if
53: end if
54: // ============================================
55: warp_delete_macro()
56: dest_lane ← ffs(ballot(read_data == src_key) & VALID_KEY_MASK);
57: if dest_lane is valid then
58:   if (src_lane == laneId) then
59:     *(SlabAddress(next, dest_lane)) ← DELETED_KEY;
60:     is_active ← false;
61:   end if
62: else
63:   next_ptr ← shfl(read_data, ADDRESS_LANE);
64:   if next_ptr is empty then
65:     is_active ← false;
66:   else
67:     next ← next_ptr;
68:   end if
69: end if

Figure 2: Pseudocode for search (SEARCH), insert (REPLACE), and delete (DELETE) operations in the slab hash.

2) REPLACE: The main skeleton of the REPLACE procedure (Fig. 2) is similar to search, but now instead of looking for a particular key, we look for either that same key (to replace it), or an empty spot (to insert it for the first time). Any thread with an insertion operation will mark is_active as true. As with search, threads loop until the work queue of all insertion operations is completely processed. Within a loop, all threads read their corresponding portions of the target slab (lines 2–7), searching for the source key or an empty spot (an empty key-value pair) within the read slab (called the destination lane). If found, the source lane inserts its key-value pair into the destination lane's portion of the slab with a 64-bit atomicCAS operation. If that insert is successful, the source lane marks its operation as resolved, which will be reflected in the next work queue computation. If the insert fails, it means some other warp has inserted into that empty spot and the whole process should be restarted. If no empty spot or source key is found at all, all threads fetch the next slab's address from the address lane. If that address is not empty, the new slab is read and the insertion process repeats. If the address is empty, it means that a new slab should be allocated. All threads use the SlabAlloc routine, allocating a new slab, then the source lane uses a 32-bit atomicCAS to update the empty address previously stored in the address lane. If the atomicCAS is successful, the whole insertion process is repeated with the newly allocated slab. If not, it means some other warp has successfully allocated and inserted the new slab and hence, this warp's allocated slab should be deallocated. The process is then restarted again with the new valid slab.
3) DELETE: Deletion (shown in Fig. 2) is similar to both the SEARCH and REPLACE operations. Each thread with a deletion operation to perform (a true is_active) updates the work queue accordingly. Then, for each deletion operation in the work queue, the source lane and its to-be-deleted key are queried by the whole warp (lines 2–7). Now, the current slab is searched for the source key. We name the lane that possesses it as the destination lane (line 56). If the destination lane is valid (a match is found), the source lane itself proceeds with overwriting the corresponding element with DELETED_KEY (line 59). If not found, then the next pointer is updated (line 63). If we reach the end of the list (an empty next pointer), the source key does not exist in the list and the operation terminates successfully (line 65). Otherwise, the next slab is loaded and we repeat the procedure.

4) FLUSH: Since we do not physically remove deleted elements in the slab hash but instead mark them as deleted, after a while it is possible to have slab lists that can be reorganized to occupy fewer slabs. A FLUSH operation takes a bucket as an argument and then a warp processes all slabs within that bucket's slab list and compacts them into fewer slabs. In the end, we deallocate the emptied slabs in the SlabAlloc so that they can be reused by others. In order to guarantee correctness, we implement this operation as a separate kernel call so that no other thread can perform an operation in those buckets while we are flushing their contents.

V. DYNAMIC MEMORY ALLOCATION

Motivation: Today's GPU memory allocators (Section II) are generally designed for variable-sized allocations and aim to avoid too much memory fragmentation (so as not to run out of memory with large allocations). These allocators are designed for flexibility and generality at the cost of high performance; for instance, they do not emphasize branch and memory-access divergence. For high-throughput mutability scenarios such as hash table insertions (e.g., the slab hash) that require many allocations, the memory allocator would be a significant bottleneck.

The WCWS strategy (Section IV-A) that we chose for the slab hash results in the following allocation problem: insertion operations that are assigned to a single warp are sequentially processed (one at a time) and hence we will require numerous independent fixed-size slab allocations per warp at different times during a warp's lifetime. These allocations cannot be simply formed into a single larger coalesced allocation that suits other allocators.

Consequently, current allocators perform poorly on this pattern of allocations. For example, on a Tesla K40c (ECC disabled), with one million slab allocations, 128 bytes per slab, one allocation per thread, and with similar total used memory for each allocator, CUDA's malloc spends 1.2 s (0.8 M slabs/s). Halloc takes 66 ms (16.1 M slabs/s). We designed our own memory allocator that is better suited for this allocation workload. Our SlabAlloc takes 1.8 ms (600 M slabs/s), which is about 37x faster than Halloc.

Terminology: We use a hierarchical memory structure as follows: several (N_S) memory pools called super blocks are each divided into several (N_M) smaller memory blocks. Each memory block consists of a fixed N_U = 1024 memory units (i.e., slabs). Figure 3 shows this hierarchy.

Figure 3: Memory layout for SlabAlloc (super blocks, each containing memory blocks together with their bitmaps; each memory block holds 1024 memory units of 128 bytes and a 32 × 32-bit bitmap).

SlabAlloc: There are a total of N_S N_M memory blocks. We distribute memory blocks uniformly among all warps, such that different warps may be assigned to each memory block. We call each warp's assigned memory block its resident block. A resident block is used for all allocations requested by its owner warp, for up to 1024 memory units (slabs). Once a resident block gets full, its warp randomly chooses (with a hash) another memory block and uses that as its new resident block. After a threshold number of resident changes, we add new super blocks and reflect them in the hash functions.

In the most general case, both the super block and its memory block are chosen randomly using two different hash functions (taking the global warp ID and the total number of resident change attempts as input arguments). This creates a probing effect in the way we assign resident blocks. Since there are 1024 memory units within each memory block, by using just one 32-bit bitmap variable per thread (a total of 32 × 32 bits across the warp), a warp can fully store a memory block's full/empty availability.

Upon each allocation request, all threads in the warp look into their local resident bitmap (stored in a register) and announce whether there are any unused memory units in their portion of the memory block. For example, thread 0 is in charge of the first 32 memory units of its resident block, thread 1 has memory units 32–63, etc. Following a pre-defined priority order (e.g., the least indexed unused memory unit), all threads then know which thread should allocate the next memory unit. That thread uses an atomicCAS operation to update its resident bitmap in global memory. If successful, then the newly allocated memory unit's address is shared with all threads within that warp (using shuffle instructions). If not, it means some other warp has previously allocated new memory units from this memory block and the local register-level resident bitmap should be updated. As a result, in the best case scenario and with low contention, each allocation can be addressed with just a single atomic
operation. If necessary, each resident change requires a single coalesced memory access to read all the bitmaps for the new resident block. Deallocation is done by first locating the slab's memory block's bitmap in global memory and then atomically unsetting the corresponding bit.
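A simplified sketch of this warp-cooperative allocation step (our own approximation: lane-to-chunk mapping, bitmap refresh, and the handling of a full resident block are reduced to the essentials):

#include <cstdint>

// Each lane caches 32 bits of its resident memory block's bitmap in a register
// (resident_bitmap); d_bitmap[lane_id] is the global-memory copy of that word.
// Returns the index (0-1023) of the allocated unit within the block, or -1 if full.
__device__ int warp_allocate_unit(uint32_t& resident_bitmap, uint32_t* d_bitmap) {
  const unsigned FULL_MASK = 0xFFFFFFFFu;
  const int lane_id = threadIdx.x & 0x1F;

  while (true) {
    // Which lanes still see a free unit (a zero bit) in their 32-unit chunk?
    uint32_t free_lanes = __ballot_sync(FULL_MASK, resident_bitmap != FULL_MASK);
    if (free_lanes == 0) return -1;          // resident block is full

    int src_lane = __ffs(free_lanes) - 1;    // pre-defined priority: lowest lane first
    int alloc_unit = -1;

    if (lane_id == src_lane) {
      int bit = __ffs(~resident_bitmap) - 1; // lowest unused unit in this lane's chunk
      uint32_t expected = resident_bitmap;
      uint32_t old = atomicCAS(&d_bitmap[lane_id], expected, expected | (1u << bit));
      if (old == expected) {
        resident_bitmap = expected | (1u << bit);
        alloc_unit = lane_id * 32 + bit;     // success: unit index inside the block
      } else {
        resident_bitmap = old;               // another warp raced us: refresh and retry
      }
    }
    // Broadcast the result; -1 means the CAS failed and the loop retries.
    alloc_unit = __shfl_sync(FULL_MASK, alloc_unit, src_lane);
    if (alloc_unit >= 0) return alloc_unit;
  }
}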
Memory structure: In general, we need a 64-bit pointer variable to uniquely address any part of the GPU's memory space. Almost all general-purpose memory allocators use the same format. Since our main target is to improve our data structure's dynamic performance, we trade off the generality of our allocator to gain performance: we use 32-bit address layouts, which are less expensive to store and share (especially because shuffle instructions work only with 32-bit registers). In order to uniquely address a memory unit, we use a 32-bit variable: 1) the first 10 bits represent the memory unit's index within its memory block, 2) the next 14 bits are used for the memory block's index within its super block, and 3) the next 8 bits represent the super block. Each super block is assumed to be allocated contiguously in a single array (< 4 GB). Each memory unit is at least 2^7 bytes, and there are a total of 1024 units in each memory block. Considering 2^7 bytes for each block's bitmap, we can at most put 2^14 memory blocks within each super block (i.e., (2^7 + 2^17) N_M ≤ 2^32 ⇒ N_M < 2^15). As a result, with this version we can dynamically allocate memory up to 2^7 N_S N_M N_U < 1 TB (much larger than any current GPU's DRAM size).
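A sketch of this 32-bit layout and its decoding (the helpers are ours; we assume "first bits" means the least-significant bits, and we ignore where each block's bitmap lives inside the super block):

#include <cstdint>
#include <cstddef>

// 32-bit slab address: | super block : 8 | memory block : 14 | memory unit : 10 |
__host__ __device__ inline uint32_t make_slab_addr(uint32_t unit, uint32_t block,
                                                   uint32_t super_block) {
  return (unit & 0x3FFu) | ((block & 0x3FFFu) << 10) | ((super_block & 0xFFu) << 24);
}

// d_super_block_base[s] is the 64-bit base pointer of super block s (kept in
// shared memory in SlabAlloc; a single global base suffices for SlabAlloc-light).
__device__ inline uint32_t* decode_slab_addr(uint32_t addr, char** d_super_block_base) {
  uint32_t unit        = addr & 0x3FFu;           // 1024 units per memory block
  uint32_t block       = (addr >> 10) & 0x3FFFu;  // up to 2^14 blocks per super block
  uint32_t super_block = (addr >> 24) & 0xFFu;    // up to 2^8 super blocks
  size_t offset = (static_cast<size_t>(block) * 1024 + unit) * 128;  // 128 B units
  return reinterpret_cast<uint32_t*>(d_super_block_base[super_block] + offset);
}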
In SlabAlloc, we assume that each super block is allocated as a contiguous array. In order to look up allocated slabs using our 32-bit layout, we store the actual beginning addresses (64-bit pointers) of each super block in shared memory. Before each memory access, we must first decode the 32-bit layout variable into an actual 64-bit memory address. This requires a single shared memory access per memory lookup (using the above 32-bit variable), which is costly, especially when performing search queries. To address this cost, we can also implement a lightweight memory allocator, SlabAlloc-light, where all super blocks are allocated in a single contiguous array. In this case, a single beginning address for the first super block, which is stored as a global variable, is enough to find the address of all memory units, resulting in a less expensive memory lookup, but with less scalability (at most about 4 GB). In scenarios where memory lookups are heavily required (e.g., the bulk search scenarios in Section VI-A), SlabAlloc-light gives us up to 25% performance improvement compared to the regular SlabAlloc.

VI. PERFORMANCE EVALUATION

We evaluate our slab list and slab hash on an NVIDIA Tesla K40c GPU (with ECC disabled), which has a Kepler microarchitecture with compute capability 3.5, 12 GB of GDDR5 memory, and a peak memory bandwidth of 288 GB/s. We compile our codes with the CUDA 8.0 compiler (V8.0.61). In this section, all our insertion operations maintain uniqueness (REPLACE). We have also used SlabAlloc with 32 super blocks (on a contiguous allocation), 256 memory blocks, and 1024 memory units of 128 bytes each. We believe this is a fair comparison, because all other methods that we compare against (CUDPP and Misra's) pre-allocate a single contiguous array for use. If necessary, our SlabAlloc can be scaled up to 1 TB allocations. We divide our performance evaluations into two categories. First, we compare against other static hash tables (such as CUDPP's cuckoo hashing implementation [1]) in performing operations such as building the data structure from scratch and processing search queries afterwards. Alcantara did an extensive study over various GPU hash tables including linear and quadratic probing methods and reached the conclusion that their cuckoo hashing implementation was substantially superior [14]. Second, we design a concurrent benchmark to evaluate the dynamic behavior of our proposed methods with a random mixture of certain operations performed asynchronously (insertion, deletion, and search queries). We compare the slab hash to Misra and Chaudhuri's lock-free hash table [4].

A. Bulk benchmarks (static methods)

There are two major operations defined for static hash tables such as CUDPP's hash table: (1) building the data structure given a fixed load factor (i.e., memory utilization) and an input array of key-value pairs, and (2) searching for an array of queries (keys) and returning an array of corresponding values (if found). By giving the same set of inputs to our slab hash, where each thread reads a key-value pair and dynamically inserts it into the data structure, we can build a hash table. Similarly, after the hash table is built, each thread can read a query from an input array, search for it dynamically in the slab hash, and store back the search result into an output array. By doing so, we can compare the slab hash with other static methods.³

³In the slab hash, there is no difference between a bulk build operation and incremental insertions of a batch of key-value pairs. However, for a bulk search we assign more queries to each thread.

For many data structures, the performance cost of supporting incremental mutability is significant: static data structures often sustain considerably better bulk-build and query rates when compared to similar data structures that additionally support incremental mutable operations. We will see, however, that the performance cost of supporting these additional operations in the slab hash is modest.

Figure 4a shows the build rate (M elements/s) for various memory utilizations. n = 2^22 elements are stored in the table. For CUDPP's hash table, memory utilization (load factor) can be directly fixed as an input argument, but the situation is slightly more involved for the slab hash: Given a fixed number of buckets (B), the average slab count is β = n/(MB), where M is the number of elements per
slab as defined in Section III-C. The performance of the slab hash and its achieved memory utilization are directly affected by the number of buckets (and β). Figure 4c shows the achieved memory utilization vs. the average slab count and number of buckets. So, on average, in order to achieve a particular memory utilization we can refer to Fig. 4c and choose the optimal β and then compute the required number of initial buckets. The maximum memory utilization (given our choice of parameters for the slab hash) is about 94%, which is achieved as B → 1.

Figure 4: Performance (M operations/s) versus memory efficiency. 2^22 elements are stored in the hash table in each trial. On top, the number of buckets for the slab hash is shown. (a) The whole table is built from scratch, dynamically, and in parallel. (b) There are 2^22 search queries where all (or none) of them exist. (c) Achieved memory utilization versus the average slab count (and number of buckets) is shown.

Figure 5: Build and search rates versus the total number of elements stored in the table; memory utilization is fixed to be 60%. (a) The whole table is built from scratch, dynamically, and in parallel. (b) There are as many search queries as there are elements in the table where either all (or none) of them exist.

In our simulations, then, we build a CUDPP hash table with the same utilization and with the same input elements. The process is averaged over 50 independent randomly generated trials. For search queries on a hash table with n = 2^22 elements, we generate two sets of n = 2^22 random queries: 1) all queries exist in the data structure; 2) none of the queries exist. These two scenarios are important as they represent, on average, the best and worst case scenarios respectively. Figure 4b shows the search rate (M queries/s) for both scenarios with various memory utilizations.

The slab hash gets its best performance from 19–60% memory utilization; these utilizations have a 0.2–0.7 average slab count. Intuitively, this is when the average list size fits in a single slab. Peak performance is 512 M insertions/s and 937 M queries/s. At about 65% memory utilization there is a sudden drop in performance for both insertions and search queries. This drop happens when the average slab count is around 0.9–1.1, which means that almost all buckets will have more than one slab and most of the operations will have to visit the second slab. The slab hash appears to be competitive with cuckoo hashing, especially around 45–65% utilization. For example, our slab hash is marginally better in insertions (at 65%) and 1.1x faster in search queries when no queries exist. But, using a geometric mean over all utilizations and n = 2^22, cuckoo hashing is 1.33x, 2.08x, and 2.04x faster than the slab hash for build, search-all, and search-none respectively.

Figure 5 shows the build rate (M elements/s) and search rate (M queries/s) vs. the total number of elements (n) stored in the hash table, where memory utilization is fixed to be 60% (an average slab count of 0.7). Here we witness that CUDPP's building performance is particularly high when the table size is small, which is because most of the atomic operations can be done at the cache level. The slab hash saturates the GPU's resources for 2^20 ≤ n ≤ 2^24, where both methods perform roughly the same. For very large table sizes, both methods degrade, but the slab hash's performance decline starts at smaller table sizes. For search queries, the slab hash shows a relatively consistent performance with a harmonic mean of 861 and 793 M queries/s for search-all and search-none. In this experiment, with a geometric mean over all table sizes and 65% memory utilization, the speedup of CUDPP's cuckoo hashing over the slab hash is 1.19x, 1.19x, and 0.94x for bulk build, search-all, and search-none respectively.

Ideally, the "fast path" scenario for CUDPP's cuckoo hash table requires a single atomicCAS for insertion and a single random memory access for a search. Unless there is some locality to be extracted from input elements (which does not
exist in most scenarios), any hash table is doomed to have at least one global memory access (atomic or regular) per operation. This explains why CUDPP's peak performance is hard to beat, and other proposed methods such as stadium hashing [15] and Robin Hood hashing [5] are unable to compete with its peak performance. In the slab hash, for insertion, ideally we will have one memory access (reading the slab) and a single atomicCAS to insert into an empty lane. For search, it will be a single memory access plus some overhead from extra warp-wide instructions (Section IV).

Figure 6: Incremental batch update for the slab hash, as well as building from scratch for CUDPP's cuckoo hashing. Final memory utilization for both methods is fixed to be 65%. Time is reported on a logarithmic scale.

B. Incremental insertion

Suppose we periodically add a new batch of elements to a hash table. For CUDPP, this means building from scratch every time. For the slab hash, this means dynamically inserting new elements into the same data structure. Figure 6 shows both methods inserting new batches of different sizes (32k, 64k, and 128k) until there are 2 million elements stored in the hash table. For CUDPP, we use a fixed 65% load factor. For the slab hash, we choose the initial number of buckets so that its final memory utilization (after inserting all batches) is 65%. As expected, the slab hash significantly outperforms cuckoo hashing, reaching final speedups of 6.4x, 10.4x, and 17.3x for batches of size 128k, 64k, and 32k. As the number of inserted batches increases (as with smaller batches), the performance gap increases.

C. Concurrent benchmarks (dynamic methods)

One notable feature of the slab hash is its ability to perform truly concurrent query and mutation (insertion/deletion) operations without having to divide different operations into different computation phases. To evaluate our concurrent features, we design the following benchmark. Suppose we build our hash table with an initial number of elements. We then continue to perform operations in one of the following four categories: a) inserting a new element, b) deleting a previously inserted element, c) searching for an existing element, d) searching for a non-existing element. We define an operation distribution Γ = (a, b, c, d), such that every item is non-negative and a + b + c + d = 1. Given any Γ, we can construct a random workload where, for instance, a denotes the fraction of new insertions compared to all other operations. To ensure correctness, we generate operations in batches and process batches one at a time, but each in parallel. For each batch, operations are randomly assigned to each thread (one operation per thread) such that all four operations may occur within a single warp. In the end, we average the results over multiple batches. We consider three scenarios: 1) Γ0 = (0.5, 0.5, 0, 0) where all operations are updates, 2) Γ1 = (0.2, 0.2, 0.3, 0.3) where there are 40% updates and 60% search queries, and 3) Γ2 = (0.1, 0.1, 0.4, 0.4) where there are 20% updates and 80% search queries.
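A host-side sketch of how one such batch can be drawn (the type and function names are ours):

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

enum class Op : uint8_t { Insert, Delete, SearchHit, SearchMiss };

// Build one batch of batch_size operations following Gamma = (a, b, c, d),
// where d = 1 - a - b - c; one operation is later assigned to each thread.
std::vector<Op> make_batch(double a, double b, double c, std::size_t batch_size,
                           std::mt19937& rng) {
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  std::vector<Op> batch(batch_size);
  for (auto& op : batch) {
    double r = uniform(rng);
    if (r < a)              op = Op::Insert;
    else if (r < a + b)     op = Op::Delete;
    else if (r < a + b + c) op = Op::SearchHit;
    else                    op = Op::SearchMiss;
  }
  return batch;
}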
Figure 7: (a) Concurrent benchmark for the slab hash: performance (M ops/s) versus initial memory utilization. (b) Performance (M ops/s) versus Misra and Chaudhuri's lock-free hash table [4]. Three different operation distributions are shown in different colors, as in (a).

Figure 7a shows the slab hash performance (M ops/s) for the three different operation distributions and various initial memory utilizations. Since updates are computationally more expensive than searches, given a fixed memory utilization, performance gets better with fewer updates (Γ0 < Γ1 < Γ2). Similar to the behavior in Fig. 4, the slab hash sharply degrades in performance with more than 65% memory utilization, falling to about 100 M ops/s at about 90% utilization. Comparing against our bulk benchmark in Fig. 4, it is clear that the slab hash performs slightly worse in our concurrent benchmark (e.g., Γ0 in Fig. 7a and Fig. 4a). There are two main reasons: (1) since it is assumed that in static situations all operations are available, we can assign multiple operations per thread and hide potential memory-related latencies, and (2) in concurrent benchmarks we run three different procedures (one for each operation type) compared to the bulk benchmark that runs just one.

Misra's hash table: Misra and Chaudhuri have implemented a lock-free hash table using classic linked lists [4]. This is a key-only hash table (i.e., an unordered set), without any pointer dereferencing or dynamic memory allocation;
based on the required number of insertions, an array of linked list nodes is allocated at compile time, and then indices of that array are used in the linked lists. Since it uses a simplified version of a classic linked list (32-bit keys and 32-bit next indices), it theoretically can reach at most 50% memory utilization. In order to compare its performance with our slab hash, we use our concurrent benchmarks and the three operation distributions discussed above. Figure 7b shows performance (M ops/s) versus number of buckets, where each case has exactly one million operations to perform. The slab hash significantly outperforms Misra's hash table, with geometric mean speedups of 5.1x, 4.3x, and 3.1x for distributions with 100%, 40% and 20% updates respectively.

As discussed in Section II, Moscovici et al. have recently proposed a lock-based skip list (GFSL). On a GeForce GTX 970, with 224 GB/s memory bandwidth, they report that its peak performance is about 100 M queries/s for searches and 50 M updates/s for updates (compared to our peak results of 937 and 512 M ops/s respectively). In the best case, GFSL requires at least two atomic operations (lock/unlock) and two other regular memory accesses for a single insertion. This cost makes it unlikely that GFSL can outperform static cuckoo hashing (1 atomic per insert) or our dynamic slab hash (1 read and 1 atomic per insert) at their peak performance.

VII. CONCLUSION

The careful consideration of GPU hardware characteristics, together with our warp-cooperative work sharing strategy, led us to the design and implementation of an efficient dynamic hash table for GPUs. Beyond achieving significant speedups over previous semi-dynamic hash tables, the slab hash proves to be competitive with the fastest static hash tables as well. We believe our slab list design and its use in the slab hash can be a promising first step toward providing a larger family of dynamic data structures with specialized analytics, which can also be used to target other interesting problems such as sparse data representation and dynamic graph analytics.

ACKNOWLEDGMENTS

Thanks to NVIDIA who provided the GPUs that made this research possible. We appreciate the funding support from a 2016–17 NVIDIA Graduate Fellowship, from NSF awards CCF-1637442, CCF-1629657, OAC-1740333, CCF-1724745, CCF-1715777, CCF-1637458, and IIS-1541613, from the Defense Advanced Research Projects Agency (DARPA), from an Adobe Data Science Research Award, and from gifts from EMC and NetApp.

REFERENCES

[1] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta, "Real-time parallel hashing on the GPU," ACM Transactions on Graphics, vol. 28, no. 5, pp. 154:1–154:9, Dec. 2009.
[2] O. Green and D. A. Bader, "cuSTINGER: Supporting dynamic graph algorithms for GPUs," in 2016 IEEE High Performance Extreme Computing Conference, ser. HPEC 2016, Sep. 2016.
[3] M. M. Michael, "High performance dynamic lock-free hash tables and list-based sets," in Proceedings of the 14th Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA '02. ACM, Aug. 2002, pp. 73–82.
[4] P. Misra and M. Chaudhuri, "Performance evaluation of concurrent lock-free data structures on GPUs," in IEEE 18th International Conference on Parallel and Distributed Systems, ser. ICPADS 2012. IEEE, Dec. 2012, pp. 53–60.
[5] I. García, S. Lefebvre, S. Hornus, and A. Lasram, "Coherent parallel hashing," ACM Transactions on Graphics, vol. 30, no. 6, pp. 161:1–161:8, Dec. 2011.
[6] N. Moscovici, N. Cohen, and E. Petrank, "Poster: A GPU-friendly skiplist algorithm," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '17, Feb. 2017, pp. 449–450.
[7] M. A. Bender, R. Cole, E. D. Demaine, and M. Farach-Colton, "Scanning and traversing: Maintaining data for traversals in a memory hierarchy," in Algorithms — ESA 2002, R. Möhring and R. Raman, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 139–150.
[8] NVIDIA Corporation, "NVIDIA CUDA C programming guide," 2016, version 8.0.
[9] A. V. Adinetz and D. Pleiter, "Halloc: A high-throughput dynamic memory allocator for GPGPU architectures," Mar. 2014, https://github.com/canonizer/halloc.
[10] M. Vinkler and V. Havran, "Register efficient dynamic memory allocator for GPUs," Computer Graphics Forum, vol. 34, no. 8, pp. 143–154, Dec. 2015.
[11] A. Braginsky and E. Petrank, Locality-Conscious Lock-Free Linked Lists. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 107–118. [Online]. Available: https://doi.org/10.1007/978-3-642-17679-1_10
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Second Edition. The MIT Press, Sep. 2001.
[13] S. Ashkiani, A. A. Davidson, U. Meyer, and J. D. Owens, "GPU multisplit," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 2016, Mar. 2016, pp. 12:1–12:13.
[14] D. A. F. Alcantara, "Efficient hash tables on the GPU," Ph.D. dissertation, University of California, Davis, 2011.
[15] F. Khorasani, M. E. Belviranli, R. Gupta, and L. N. Bhuyan, "Stadium hashing: Scalable and flexible hashing on GPUs," in International Conference on Parallel Architecture and Compilation, ser. PACT 2015, Oct. 2015, pp. 63–74.