Implementing Global Memory Management in a Workstation Cluster

Michael J. Feeley, William E. Morgan,† Frederic H. Pighin, Anna R. Karlin, Henry M. Levy
Department of Computer Science and Engineering
University of Washington
and
Chandramohan A. Thekkath
DEC Systems Research Center
Abstract

Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collection of loosely-connected processors, where each workstation acts as an autonomous and independent agent. This operating system structure makes it difficult to exploit the characteristics of current clusters, such as low-latency communication, huge primary memories, and high-speed processors, in order to improve the performance of cluster applications.

This paper describes the design and implementation of global memory management in a workstation cluster. Our objective is to use a single, unified, but distributed memory management algorithm at the lowest level of the operating system. By managing memory globally at this level, all system- and higher-level software, including VM, file systems, transaction systems, and user applications, can benefit from available cluster memory. We have implemented our algorithm in the OSF/1 operating system running on an ATM-connected cluster of DEC Alpha workstations. Our measurements show that on a suite of memory-intensive programs, our system improves performance by a factor of 1.5 to 3.5. We also show that our algorithm has a performance advantage over others that have been proposed in the past.

1 Introduction

This paper examines global memory management in a workstation cluster. By a cluster, we mean a high-speed local-area network with 100 or so high-performance machines operating within a single administrative domain. Our premise is that a single, unified memory management algorithm can be used at a low level of the operating system to manage memory cluster-wide. In contrast, each operating system in today's clusters acts as an autonomous agent, exporting services to other nodes, but not acting in a coordinated way. Such autonomy has advantages, but results in an underutilization of resources that could be used to improve performance. For example, global memory management allows the operating system to use cluster-wide memory to avoid many disk accesses; this becomes more important with the widely growing disparity between processor speed and disk speed. We believe that as processor performance increases and communication latency decreases, workstation or personal computer clusters should be managed more as a multicomputer than as a collection of independent machines.

We have defined a global memory management algorithm and implemented it in the OSF/1 operating system, running on a collection of DEC Alpha workstations connected by a DEC AN2 ATM network [1]. By inserting a global memory management algorithm at the lowest OS level, our system integrates, in a natural way, all cluster memory for use by all higher-level functions, including VM paging, mapped files, and file system buffering. Our system can automatically reconfigure to allow machines to join and, in most cases, to depart the cluster at any time. In particular, with our algorithm and implementation, no globally-managed data is lost when a cluster node crashes.

Using our system, which we call GMS (for Global Memory Service), we have conducted experiments on clusters of up to 20 machines using a suite of real-world application programs. Our results show that the basic costs for global memory management operations are modest and that application performance improvement can be significant. For example, we show a 1.5- to 3.5-fold speedup for a collection of memory-intensive applications running with GMS; these speedups are close to optimal for these applications, given the relative speeds of remote memory and disk.

The paper is organized as follows. Section 2 compares our work to earlier systems. In Section 3 we describe our algorithm for global memory management. Section 4 details our OSF/1 implementation. We present performance measurements of the implementation in Section 5. Section 6 discusses limitations of our algorithm and implementation, and possible solutions to those limitations. Finally, we summarize and conclude in Section 7.

†Author's current address: DECwest Engineering, Bellevue, WA.
This work was supported in part by the National Science Foundation (Grant nos. CDA-9123308, CCR-9200832, and GER-9450075), ARPA Carnegie Mellon University Subcontract #381375-50196, the Washington Technology Center, and Digital Equipment Corporation. M. Feeley was supported in part by a fellowship from Intel Corporation. W. Morgan was supported in part by Digital Equipment Corporation.
2 Comparison With Previous Work

Several previous studies have examined various ways of using remote memory. Strictly theoretical results related to this problem include [2, 3, 7, 24]. Leach et al. describe remote paging in the context of the Apollo DOMAIN System [15]. Each machine in
the network has a paging server that accepts paging requests from remote nodes. This system allowed local users to statically restrict the amount of physical memory available to the paging server. Comer and Griffioen described a remote memory model in which the cluster contains workstations, disk servers, and remote memory servers [8]. The remote memory servers were dedicated machines whose large primary memories could be allocated by workstations with heavy paging activity. No client-to-client resource sharing occurred, except through the servers. Felten and Zahorjan generalized this idea to use memory on idle client machines as paging backing store [12]. When a machine becomes idle, its kernel activates an otherwise dormant memory server, which registers itself for remote use. Whenever a kernel replaces a VM page, it queries a central registry to locate active memory servers, picking one at random to receive the replacement victim. Felten and Zahorjan used a simple queueing model to predict performance.

In a different environment, Schilit and Duchamp have used remote paging to enhance the performance of mobile computers [18]. Their goal is to permit small memory-starved portable computers to page to the memories of larger servers nearby; pages could migrate from server to server as the portables migrate.

Franklin et al. examine the use of remote memory in a client-server DBMS system [13]. Their system assumes a centralized database server that contains the disks for stable store plus a large memory cache. Clients interact with each other via a central server. On a page read request, if the page is not cached in the server's memory, the server checks whether another client has that page cached; if so, the server asks that client to forward its copy to the workstation requesting the read. Franklin et al. evaluate several variants of this algorithm using a synthetic database workload.

Dahlin et al. evaluate several algorithms for utilizing remote memory, the best of which is called N-chance forwarding [10]. Using N-chance forwarding, when a node is about to replace a page, it checks whether that page is the last copy in the cluster (a "singlet"); if so, the node forwards that page to a randomly-picked node, otherwise it discards the page. Each page sent to remote memory has a circulation count, N, and the page is discarded after it has been forwarded to N nodes. When a node receives a remote page, that page is made the youngest on its LRU list, possibly displacing another page on that node; if possible, a duplicate page or recirculating page is chosen for replacement. Dahlin et al. compare algorithms using a simulator running one two-day trace of a Sprite workload; their analysis examines file system data pages only (i.e., no VM paging activity and no program executables).

Our work is related to these previous studies, but also differs in significant ways. First, our algorithm is integrated with the lowest level of the system and encompasses all memory activity: VM paging, mapped files, and explicit file access. Second, in previous systems, even where client-to-client sharing occurs, each node acts as an autonomous agent. In contrast, we manage memory globally, attempting to make good choices both for the faulting node and the cluster as a whole (we provide a more detailed comparison of the global vs. autonomous scheme following the presentation of our algorithm in the next section). Third, our system can gracefully handle addition and deletion of nodes in the cluster without user intervention. Finally, we have an implementation that is well integrated into a production operating system: OSF/1.

Several other efforts, while not dealing directly with remote paging, relate to our work. Most fundamental is the work of Li and Hudak, who describe a number of alternative strategies for managing pages in a distributed shared virtual memory system [16]. Similar management issues exist at the software level in single address space systems such as Opal [6], and at the hardware level in NUMA and COMA architectures [9, 21]. Eager et al. [11] describe strategies for choosing target nodes on which to offload tasks in a distributed load sharing environment.

3 Algorithm

This section describes the basic algorithm used by GMS. The description is divided into two parts. First, we present a high-level description of the global replacement algorithm. Second, we describe the probabilistic process by which page information is maintained and exchanged in the cluster.

3.1 The Basic Algorithm

As previously stated, our goal is to globally coordinate memory management. We assume that nodes trust each other but may crash at any time. All nodes run the same algorithm and attempt to make choices that are good in a global cluster sense, as well as for the local node. We classify pages on a node P as being either local pages, which have been recently accessed on P, or global pages, which are stored in P's memory on behalf of other nodes. Pages may also be private or shared; shared pages occur because two or more nodes might access a common file exported by a file server. Thus, a shared page may be found in the active local memories of multiple nodes; however, a page in global memory is always private.

In general, the algorithm changes the local/global memory balance as the result of faults caused by an access to a nonresident page. Node P, on a fault, performs the following global replacement algorithm, which we describe in terms of four possible cases:

Case 1: The faulted page is in the global memory of another node, Q. We swap the desired page in Q's global memory with any global page in P's global memory. Once brought into P's memory, the faulted page becomes a local page, increasing the size of P's local memory by one. Q's local/global memory balance is unchanged. This is depicted in Figure 1.

Case 2: The faulted page is in the global memory of node Q, but P's memory contains only local pages. Exchange the LRU local page on P with the faulted page on Q. The sizes of the global memory on Q and the local memory on P are unchanged.

Case 3: The page is on disk. Read the faulted page into node P's memory, where it becomes a local page. Choose the oldest page in the cluster (say, on node Q) for replacement and write it to disk if necessary. Send a global page on node P to node Q, where it continues as a global page. If P has no global pages, choose P's LRU local page instead. This is shown in Figure 2.

Case 4: The faulted page is a shared page in the local memory of another node Q. Copy that page into a frame on node P, leaving the original in local memory on Q. Choose the oldest page in the cluster (say, on node R) for replacement and write it to disk if necessary. Send a global page on node P to node R, where it becomes a global page (if P has no global pages, choose P's LRU local page).

[Figure 1: Global replacement with hit in the global cache.]

[Figure 2: Global replacement showing miss in the global cache. The faulted page is read from disk, and the oldest page in the network is either discarded (if clean) or written back to disk.]
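To make the case analysis concrete, the four cases can be restated as a short C sketch. This is a minimal illustration, not the GMS kernel code: every type and helper function below is a hypothetical name invented for the sketch, and it omits locking, message formats, and races between concurrent faults.

/* Sketch of the four-case global replacement algorithm of Section 3.1.
 * All types and helpers are hypothetical stand-ins for kernel code. */

typedef struct page page_t;
typedef struct node node_t;

/* Hypothetical helpers, assumed provided by the surrounding kernel. */
node_t *global_copy_holder(page_t *pg);   /* node caching pg as a global page */
node_t *shared_local_holder(page_t *pg);  /* node caching pg as a local page  */
node_t *oldest_page_holder(void);         /* node with the cluster-oldest page */
int     has_global_pages(node_t *n);
page_t *any_global_page(node_t *n);
page_t *lru_local_page(node_t *n);
void    exchange_pages(node_t *a, page_t *out, node_t *b, page_t *in);
void    copy_page(node_t *from, node_t *to, page_t *pg);
void    read_from_disk(node_t *n, page_t *pg);
void    replace_oldest_with(node_t *oldest, node_t *from, page_t *victim);
void    make_local(node_t *n, page_t *pg);

static void push_victim_toward_oldest(node_t *P)
{
    /* Shared tail of Cases 3 and 4: the cluster's oldest page (say on
     * node R) is written to disk if dirty and discarded; P sends one of
     * its global pages (or, lacking any, its LRU local page) to R,
     * where it becomes a global page. */
    node_t *R = oldest_page_holder();
    page_t *victim = has_global_pages(P) ? any_global_page(P)
                                         : lru_local_page(P);
    replace_oldest_with(R, P, victim);
}

void handle_fault(node_t *P, page_t *faulted)
{
    node_t *Q;

    if ((Q = global_copy_holder(faulted)) != NULL) {
        /* Cases 1 and 2: hit in some node's global memory. */
        page_t *trade = has_global_pages(P) ? any_global_page(P) /* Case 1 */
                                            : lru_local_page(P); /* Case 2 */
        exchange_pages(P, trade, Q, faulted);
    } else if ((Q = shared_local_holder(faulted)) != NULL) {
        /* Case 4: copy the shared page, leaving Q's copy in place. */
        copy_page(Q, P, faulted);
        push_victim_toward_oldest(P);
    } else {
        /* Case 3: miss everywhere; read the page from disk. */
        read_from_disk(P, faulted);
        push_victim_toward_oldest(P);
    }
    make_local(P, faulted);  /* the faulted page always becomes local on P */
}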
The behavior of this algorithm is fairly straightforward. Over time, nodes that are actively computing and using memory will fill their memories with local pages and will begin using remote memory in the cluster; nodes that have been idle for some time and whose pages are old will begin to fill their memories with global pages. The balance between local and global storage on a node is thus dynamic and depends on its workload and the workload in the cluster. The basic issue is when to change the amount of global storage and local storage, both on a node and in the cluster overall. In general, on a fault requiring a disk read, the (active) faulting node grows its local memory, while the cluster node with the oldest page (an "idle" node) loses a page to disk. Global memory grows when the faulting node has no global pages and the oldest page in the network is a local page (i.e., the oldest local page on the faulting node becomes a global page, replacing the oldest cluster page).

Ultimately, our goal is to minimize the total cost of all memory references within the cluster. The cost of a memory reference depends on the state of the referenced page: in local memory, in global memory on another node, or on disk. A local hit is over three orders of magnitude faster than a global memory or disk access, while a global memory hit is only two to ten times faster than a disk access. Therefore, in making replacement decisions, we might choose to replace a global page before a local page of the same age, because the cost of mistakenly replacing a local page is substantially higher. Which decision is better depends on future behavior. To predict future behavior, a cost function is associated with each page. This cost function is related to LRU, but is based on both the age of the page and its state. Our current implementation boosts the ages of global pages to favor their replacement over local pages of approximately the same age.

3.2 Managing Global Age Information

When a faulted page is read from disk (Cases 3 and 4), our algorithm discards the oldest page in the cluster. As described so far, we assume full global information about the state of nodes and their pages in order to locate this oldest page. However, it is obviously impossible to maintain complete global age information at every instant; therefore, we use a variant in which each node has only approximate information about global pages. The objective of our algorithm is to provide a reasonable tradeoff between the accuracy of information that is available to nodes and the efficiency of distributing that information. The key issue is guaranteeing the validity of the age information and deciding when it must be updated.

Our algorithm divides time into epochs. Each epoch has a maximum duration, T, and a maximum number of cluster replacements, M, that will be allowed in that epoch. The values of T and M vary from epoch to epoch, depending on the state of global memory and the workload. A new epoch is triggered when either (1) the duration of the epoch, T, has elapsed, (2) M global pages have been replaced, or (3) the age information is detected to be inaccurate. Currently, each epoch is on the order of 5-10 seconds.

Our system maintains age information on every node for both local and global pages. At the start of each epoch, every node sends a summary of the ages of its local and global pages to a designated initiator node. Using this information, the initiator computes a weight, w_i, for each node i, such that out of the M oldest pages in the network, w_i reside in node i's memory at the beginning of the epoch. The initiator also determines the minimum age, MinAge, that will be replaced from the cluster (i.e., sent to disk or discarded) in the new epoch. The initiator sends the weights w_i and the value MinAge to all nodes in the cluster. In addition, the initiator selects the node with the most idle pages (the largest w_i) to be the initiator for the following epoch.

During an epoch, when a node P must evict a page from its memory to fault in a page from disk (Cases 3 and 4), it first checks if the age of the evicted page is older than MinAge. If so, it simply discards the page (since this page is expected to be discarded sometime during this epoch). If not, P sends the page to node i, where the probability of choosing node i is proportional to w_i. In this case, the page discarded from P becomes a global page on node i, and the oldest page on i is discarded.

Our algorithm is probabilistic: on average, during an epoch, the ith node receives w_i/M of the evictions in that epoch, replacing its oldest page for each one. This yields two useful properties. First, our algorithm approximates LRU in the sense that if M pages are discarded by global replacement during the epoch, they are the globally oldest M pages in the cluster. Second, it yields a simple way to determine statistically when M pages have been replaced; i.e., when the node with the largest w_i receives w_i pages, it declares an end to the epoch.

To reduce the divergence from strict LRU, it is thus important to keep the duration of the epoch T and the value of M appropriate for the current behavior of the system. The decision procedure for choosing these values considers (1) the distribution of global page ages, (2) the expected rate at which pages will be discarded from the cluster, and (3) the rate at which the distributed age information is expected to become inaccurate.† The latter two rates are estimated from their values in preceding epochs. Roughly speaking, the more old pages there are in the network, the longer T should be (and the larger M and MinAge are); similarly, if the expected discard rate is low, T can be larger as well. When the number of old pages in the network is too small, indicating that all nodes are actively using their memory, MinAge is set to 0, so that pages are always discarded or written to disk rather than forwarded.

†The age distribution on a node changes when its global pages are consumed due to an increase in its local cache size.
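The per-eviction decision just described is simple enough to state directly in code. The following is a minimal sketch, not the GMS implementation: it assumes each node keeps a hypothetical copy of the epoch parameters (MinAge and the weight vector w) distributed by the initiator, that ages are measured so that a larger value means an older page, and that a target node is drawn with probability proportional to w_i.

#include <stdlib.h>

/* Hypothetical per-node copy of the epoch parameters distributed by
 * the initiator at the start of each epoch (Section 3.2). */
struct epoch_params {
    unsigned long  min_age;  /* MinAge: pages older than this are discarded */
    int            nnodes;
    unsigned long *w;        /* w[i]: node i's share of the M oldest pages */
    unsigned long  wsum;     /* sum of all w[i] */
};

/* Pick a target node with probability proportional to w[i]. */
static int choose_target_node(const struct epoch_params *ep)
{
    unsigned long r = (unsigned long)(drand48() * ep->wsum);
    for (int i = 0; i < ep->nnodes; i++) {
        if (r < ep->w[i])
            return i;
        r -= ep->w[i];
    }
    return ep->nnodes - 1;   /* rounding fallback */
}

/* On eviction: discard pages older than MinAge (they would be replaced
 * this epoch anyway); otherwise forward the page to a weighted-random
 * node, where it becomes a global page. */
void evict_page(const struct epoch_params *ep, unsigned long page_age,
                void (*forward)(int node), void (*discard)(void))
{
    if (page_age > ep->min_age)
        discard();
    else
        forward(choose_target_node(ep));
}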
3.3 Node Failures and Coherency

Node failures in the cluster do not cause data loss in global memory, because all pages sent to global memory are clean; i.e., a dirty page moves from local to global memory only when it is being written to disk.
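This invariant can be maintained at a single point, the pageout path, as the following minimal sketch shows. The helper names are hypothetical (they are not from the GMS sources): a dirty victim is written to disk first, and only the now-clean copy is ever forwarded to another node's global memory, so a crashed node can never hold the sole copy of modified data.

/* Hedged sketch of the clean-pages-only invariant of Section 3.3.
 * All helpers are hypothetical stand-ins for kernel facilities. */

struct frame;                          /* a physical page frame */
int  frame_is_dirty(struct frame *f);
void write_to_disk(struct frame *f);   /* synchronous writeback */
void forward_to_global_memory(struct frame *f);

void page_out(struct frame *victim)
{
    if (frame_is_dirty(victim))
        write_to_disk(victim);         /* disk always has the data...     */
    forward_to_global_memory(victim);  /* ...before any remote node does */
}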
4 Implementation

We have modified the OSF/1 operating system on the DEC Alpha platform to incorporate the algorithm described above. This section presents the details of our implementation.

Figure 3 shows a simplified representation of the modified OSF/1 memory management subsystem. The boxes represent functional components and the arrows show some of the control relationships. The two key components of the basic OSF/1 memory system are (1) the VM system, which supports anonymous pages devoted to process stacks and heaps, and (2) the Unified Buffer Cache (UBC), which caches file pages. The UBC contains pages from both mapped files and files accessed through normal read/write calls and is dynamically-sized; this is similar in some ways to the Sprite file system [17]. At the same level as VM and UBC, we have added the GMS module, which holds global pages housed on the node. Page-replacement decisions are made by the pageout daemon and GMS. A custom TLB handler provides information about the ages of VM and UBC pages for use by GMS. We modified the kernel to insert calls to the GMS at each point where pages were either added to or removed from the UBC. Similarly, we inserted calls into the VM swapping code to keep track of additions and deletions to the list of anonymous pages.

[Figure 3: Simplified representation of the modified OSF/1 memory management subsystem.]

Each page managed by GMS is identified by a cluster-wide unique identifier (UID); at any given time, a page is in one of four states: (1) cached locally on a single node, (2) cached locally on multiple nodes, (3) cached on a single node on behalf of another node, or (4) not cached at all.

We maintain three principal data structures, keyed by UID:

1. The page-frame-directory (PFD) is a per-node structure that contains a record for each page (local or global) that is present on the node. A successful UID lookup in the PFD yields information about the physical page frame containing the data, LRU statistics about the frame, and whether the page is local or global. An unsuccessful lookup implies that the particular page is not present on this node.

2. The global-cache-directory (GCD) is a cluster-wide data structure that is used to locate the IP address of a node that has a particular page cached. For performance reasons, the GCD is organized as a hash table, with each node storing only a portion of the table.

3. The page-ownership-directory (POD) maps the UID for a shared page to the node storing the GCD section containing that page. For non-shared pages, the GCD entry is always stored on the node that is using the page.
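To illustrate how the three directories cooperate, here is a schematic sketch in C. The field layouts, the UID width, and all function names are hypothetical (the text above does not give them); only the lookup order is taken from the description: local PFD first, then POD to find the GCD node, then GCD to find the caching node.

#include <stdint.h>

/* Hypothetical shapes for the three UID-keyed directories of Section 4;
 * these are illustrative only. */

typedef struct { uint8_t b[16]; } uid_t128;   /* cluster-wide unique ID */

struct pfd_entry {             /* page-frame-directory: one per node */
    uid_t128  uid;
    void     *frame;           /* physical page frame holding the data */
    uint32_t  age;             /* LRU statistics for the frame */
    int       is_global;       /* global page vs. recently used local page */
};

struct gcd_entry {             /* global-cache-directory: hashed across nodes */
    uid_t128  uid;
    uint32_t  node_ip;         /* IP address of a node caching the page */
};

/* Assumed lookup primitives (hypothetical). */
struct pfd_entry *pfd_lookup_local(uid_t128 uid);
struct gcd_entry *gcd_lookup_on(uint32_t gcd_node_ip, uid_t128 uid);
uint32_t pod_lookup(uid_t128 uid);   /* node storing this UID's GCD section */
uint32_t local_node_ip(void);
struct pfd_entry *getpage_from(uint32_t node_ip, uid_t128 uid);

/* Locate a page: local PFD first, then POD -> GCD -> caching node. */
struct pfd_entry *locate_page(uid_t128 uid, int shared)
{
    struct pfd_entry *pfe = pfd_lookup_local(uid);
    if (pfe != NULL)
        return pfe;                         /* already resident here */

    /* For non-shared pages the GCD entry is kept locally; for shared
     * pages the POD names the node holding the right GCD section. */
    uint32_t gcd_node = shared ? pod_lookup(uid) : local_node_ip();
    struct gcd_entry *gce = gcd_lookup_on(gcd_node, uid);
    if (gce == NULL)
        return NULL;                        /* not cached: go to disk */
    return getpage_from(gce->node_ip, uid); /* fetch from remote memory */
}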
To collect page age information, GMS periodically flushes all TLB entries. Subsequently, when the TLB handler performs a virtual-to-physical translation on a TLB miss, it sets a bit for that physical frame. A kernel thread samples the per-frame bit every period in order to maintain LRU statistics for all physical page frames.

4.3 Inter-node Communication
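The getpage path whose costs Table 1 breaks down (request generation at the faulting node, GCD processing, forwarding to the target node, and the reply) can be sketched as a simple message sequence. All message types and send/reply primitives below are hypothetical, and the directory types mirror the earlier sketch; the code only follows the row structure of Table 1.

#include <stdint.h>

typedef struct { uint8_t b[16]; } uid_t128;   /* as in the Section 4 sketch */

enum msg_type { GETPAGE_REQ, GETPAGE_FWD };

struct getpage_msg {
    enum msg_type type;
    uid_t128      uid;        /* page being requested */
    uint32_t      reply_to;   /* IP address of the faulting node */
};

/* Hypothetical primitives assumed from the kernel and network layers. */
uint32_t local_node_ip(void);
uint32_t gcd_node_for(uid_t128 uid);    /* POD lookup, or self if non-shared */
void     send_msg(uint32_t node_ip, struct getpage_msg *m);
void     wait_for_reply(uid_t128 uid);
void     reply_miss(uint32_t node_ip, uid_t128 uid);
void     reply_with_page(uint32_t node_ip, uid_t128 uid, void *frame);
int      gcd_lookup_ip(uid_t128 uid, uint32_t *node_ip); /* nonzero on hit */
void    *pfd_frame(uid_t128 uid);       /* local PFD lookup */

/* Faulting node: "Request Generation" row of Table 1. */
void getpage(uid_t128 uid)
{
    struct getpage_msg m = { GETPAGE_REQ, uid, local_node_ip() };
    send_msg(gcd_node_for(uid), &m);    /* a local call for non-shared pages */
    wait_for_reply(uid);                /* "Reply Receipt" row */
}

/* GCD node: "GCD Processing" row. On a hit, forward the request to the
 * node caching the page; on a miss, the faulting node goes to disk. */
void gcd_handle(struct getpage_msg *m)
{
    uint32_t target;
    if (gcd_lookup_ip(m->uid, &target)) {
        m->type = GETPAGE_FWD;
        send_msg(target, m);            /* "Network HW&SW" row */
    } else {
        reply_miss(m->reply_to, m->uid);
    }
}

/* Target node: "Target Processing" row; returns the page data. */
void target_handle(struct getpage_msg *m)
{
    reply_with_page(m->reply_to, m->uid, pfd_frame(m->uid));
}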
                        Latency in μs
                    Non-Shared Page      Shared Page
Operation            Miss      Hit      Miss      Hit
Request Generation      7       61        65       65
Reply Receipt           -      156         5      150
GCD Processing          8        8        59       61
Network HW&SW           -     1135       211     1241
Target Processing       -       80         -       81
Total                  15     1440       340     1558

Table 1: Performance of the Getpage Operation (μs)
                        Latency in μs
Operation            Non-Shared Page   Shared Page
Request Generation          58             102
GCD Processing               7              12
Network HW&SW              989             989
Target Processing          178             181
Sender Latency              65             102

Table 2: Performance of the Putpage Operation (μs)

Table 4: Average Access Times for Shared Pages (ms)
overhead is about 300 μs.

Table 3 compares the average data-read time for non-shared pages with and without GMS. For this experiment, we ran a synthetic program on a machine with 64 Mbytes of memory. The program repeatedly accesses a large number of anonymous (i.e., non-shared) pages, in excess of the total physical memory. In steady state for this experiment, every access requires a putpage to free a page and a getpage to fetch the faulted page. The average read time thus reflects the overhead of both operations.

The first row of Table 3 shows the average performance of sequential reads to non-shared pages. The numbers shown with no GMS reflect the average disk access time; the difference between the sequential and random access times indicates the substantial benefit OSF gains from prefetching and clustering disk blocks for sequential reads. Nevertheless, using GMS reduces the average sequential read time by 41% for non-shared pages. For non-sequential accesses, GMS shows nearly a 7-fold performance improvement. Here the native OSF/1 system is unable to exploit clustering to amortize the cost of disk seeks and rotational delays.

Table 4 shows the data-access times for NFS files that can be potentially shared. In this experiment, a client machine with 64 Mbytes of memory tries to access a large NFS file that will not fit into its main memory, although there is sufficient memory in the cluster to hold all of its pages. There are four cases to consider. In the first case, shown in the first column of Table 4, we assume that all NFS pages accessed by the client will be put into global memory. This happens in practice when a single NFS client accesses the file from a server. For the most part, the behavior of the system is similar to the experiment described above: there will be a putpage and a getpage for each access. In this case, the pages will be fetched from global memory on idle machines.

The second case is a variation of the first, where two clients are accessing the same NFS file. One client has ample memory to store the entire file while the other client does not. Because of the memory pressure, the second client will do a series of putpage and getpage operations. The putpage operations in this case are for shared pages, for which copies already exist in the file buffer cache of the other client (i.e., they are duplicates). Such a putpage operation causes the page to be dropped; there is no network transmission. The average access cost in this case is therefore the cost of a getpage.

The next two cases examine the cost of a read access when there is no GMS. In the first case, we constrain the NFS file server so that it does not have enough buffer cache for the entire file. A client read access will thus result in an NFS request to the server, which will require a disk access. In the final case, the NFS server has enough memory so that it can satisfy client requests without accessing the disk. Here, the cost of an access is simply the overhead of the NFS call and reply between the client and the server. Notice that an NFS server-cache hit is 0.2 ms faster than a GMS hit for a single client. This reflects the additional cost of the putpage operation performed by GMS when a page is discarded from the client cache. In NFS, discarded pages are dropped as they are in GMS for duplicates, in which case GMS is 0.2 ms faster than NFS.

5.2 Bookkeeping Overheads

This section describes the cost of performing the essential GMS bookkeeping operations, which include the periodic flushing of TLB entries as well as the overhead of collecting and propagating global page age information.

On the 225-MHz processor, our modified TLB handler introduces a latency of about 60 cycles (an additional 18 cycles over the standard handler) on the TLB fill path. In addition, since TLB entries are flushed every minute, with a 44-entry TLB, we introduce a negligible overhead of 2640 (60 x 44) cycles per minute. In practice, we have seen no slowdown in the execution time of programs with the modified TLB handler.

Collecting and propagating the age information consists of multiple steps: (1) the initiator triggers a new epoch by sending out a request to each node asking for summary age information, (2) each node gathers the summary information and returns it to the initiator, and (3) the initiator receives the information, calculates weights and epoch parameters, and distributes the data back to each node.

The three rows of Table 5 represent the CPU cost and the network traffic induced by each of these operations. For steps one and three, the table shows the CPU overhead on the initiator node and the network traffic it generates as a function of the number of nodes, n. The CPU cost in step two is a function of the number of pages each node must scan: 0.29 μs per local page and 0.54 μs for each global page scanned. The overhead shown in the table assumes that each node has 64 Mbytes (8192 pages) of local memory and that 2000 global pages are scanned. We display network traffic as a rate in bytes per second by assuming a worst-case triggering interval of 2 seconds (a 2-second epoch would be extremely short). Given this short epoch length and a 100-node network, CPU overhead is less than 0.8% on the initiator node and less than 0.2% on other
nodes, while the impact on network bandwidth is minimal.
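As a rough check on these numbers, the step-two scan cost quoted above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch using only figures from the text (64 Mbytes of 8-Kbyte pages, 2000 global pages scanned, and the worst-case 2-second triggering interval); it is not code from the system.

#include <stdio.h>

/* Reproduce the per-epoch scan cost quoted in Section 5.2. */
int main(void)
{
    double local_pages  = 8192;   /* 64 Mbytes of 8-Kbyte pages */
    double global_pages = 2000;   /* global pages scanned, as assumed above */
    double scan_us = local_pages * 0.29 + global_pages * 0.54;

    double epoch_s = 2.0;         /* worst-case triggering interval */
    printf("per-epoch scan cost: %.0f us (%.2f%% of one CPU)\n",
           scan_us, scan_us / (epoch_s * 1e6) * 100.0);
    /* prints roughly: per-epoch scan cost: 3456 us (0.17% of one CPU),
     * consistent with the "less than 0.2% on other nodes" figure. */
    return 0;
}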
5.3 Execution Time Improvement

This section examines the performance gains seen by several applications with the global memory system. These applications are memory and file I/O intensive, so under normal circumstances, performance suffers due to disk accesses if the machine has insufficient memory for application needs. In these situations we would expect global memory management to improve performance, assuming that enough idle memory exists in the network. The applications we measured were the following:
Render is a graphics rendering program that displays a computer-generated scene from a pre-computed 178-Mbyte database [5]. In our experiment, we measured the elapsed time for a sequence of operations that move the viewpoint progressively closer to the scene without changing the viewpoint angle.

Web Query Server is a server that handles queries against the full text of Digital's internal World-Wide-Web pages (and some popular external Web pages). We measured its performance for processing a script containing 150 typical user queries.

To provide a best-case estimate of the performance impact of global memory management, we measured the speedup of our applications relative to a native OSF system. Nine nodes were used for these measurements: eight 225-MHz DEC 3000 Model 700 machines rated at 163 SPECint92 and one 233-MHz DEC AlphaStation 400 4/233 rated at 157 SPECint92. The AlphaStation had 64 Mbytes of memory and ran each application in turn. The other eight machines housed an amount of idle memory that was equally divided among them. We varied the total amount of idle cluster memory to see the impact of free memory size. Figure 6 shows the speedup of each of the applications as a function of the amount of idle network memory.

[Figure 6: Speedup of each application (Boeing CAD, VLSI Router, Compile and Link, 007, Render, Web Query Server) as a function of the amount of idle network memory, from 0 to 250 Mbytes.]

As Figure 6 shows, global memory management has a beneficial impact on all the applications. With zero idle memory, application performance with and without GMS is comparable. This is in agreement with our microbenchmarks that indicate GMS overheads are only 0.4-1% when there is no idle memory. Even when idle memory is insufficient to meet the application's demands, our system provides relatively good performance. Beyond about 200 Mbytes of free memory in the cluster, the performance of these applications does not show any appreciable change, but at that point, we see speedups of from 1.5 to 3.5, depending on the application. These speedups are significant and demonstrate the full potential of using remote memory to reduce the disk bottleneck.

Figure 6 shows the speedup of each application when running alone with sufficient global memory. To demonstrate that those benefits remain when multiple applications run simultaneously, competing for memory in a larger network, we ran another experiment. Here we varied the number of nodes from five to twenty; in each group of five workstations, two were idle and each of the remaining three ran a different workload (007, Compile and Link, or Render). The idle machines had sufficient memory to meet the needs of the workloads. Thus, when running with twenty nodes, eight were idle and each of the three workloads was running on four different nodes. The results of this experiment, shown in Figure 7, demonstrate that the speedup remains nearly constant as the number of nodes is increased.

5.4 Responsiveness to Load Changes

Our algorithm is based on the distribution and use of memory load information. An obvious question, then, is the extent to which our system can cope with rapid changes in the distribution of idle pages in the network. To measure this, we ran a controlled experiment, again using nine nodes. In this case, the 233-MHz AlphaStation
[Figure 10: Effect of varying distribution of idleness on the performance of a program actively accessing local memory, half of which consists of shared data that is duplicated on other nodes. X-axis: idleness skew (X% of nodes have 100-X% of idle memory), at 25%, 37.5%, and 50%; bars compare N-Chance (Idle = Needed, Idle = 1.5 x Needed, Idle = 2 x Needed) and GMS (Idle = Needed).]

[Figure 11: Effect of varying distribution of idleness on network activity. Same configurations and idleness-skew axis as Figure 10.]

network traffic.
As one test of this effect, we ran an experiment as before with 007, increasing the number of 007 client nodes from one to seven, ensuring in all cases that the idle node had sufficient memory to handle all of their needs.

Figure 12 shows that when seven copies of 007 were simultaneously using the remote memory of the idle node, the average speedup achieved by GMS was only moderately lowered. That is, the applications using the idle node's memory did not seriously degrade in performance as a result of their sharing a single global memory provider.

[Figure 12: Average speedup as the number of nodes running the benchmark varies from 1 to 7.]

On the other hand, Figure 13 shows the result of that workload on the idle node itself. The bar graph shows the CPU overhead experienced by the idle node as a percentage of total CPU cycles. As well, Figure 13 plots the rate of page-transfer (getpage and putpage) operations at the idle node during that execution. From this data, we see that when seven nodes were running 007 simultaneously, the idle node received an average of 2880 page-transfer operations per second, which required 56% of the processor's CPU cycles. This translates to an average per-operation overhead of 194 μs, consistent with our micro-benchmark measurements.

[Figure 13: Impact of multiple clients on CPU performance of an idle node. X-axis: number of nodes (1-7) running the benchmark.]

6 Limitations

The most fundamental concern with respect to network-wide resource management is the impact of failures. In most distributed systems, failures can cause disruption, but they should not cause permanent data loss. Temporary service loss is common on any distributed system, as anyone using a distributed file system is well aware. With our current algorithm, all pages in global memory are clean, and can therefore be retrieved from disk should a node holding global pages fail. The failure of the initiator or master nodes is more difficult to handle; while we have not yet implemented such schemes, simple algorithms exist for the remaining nodes to elect a replacement.

A reasonable extension to our system would permit dirty pages to be sent to global memory without first writing them to disk. Such a scheme would have performance advantages, particularly given distributed file systems and faster networks, at the risk of data loss in the case of failure. A commonly used solution is to replicate pages in the global memory of multiple nodes; this is future work that we intend to explore.

Another issue is one of trust. As a cluster becomes more closely coupled, the machines act more as a single timesharing system. Our mechanism expects a single, trusted, cluster-wide administrative domain. All of the kernels must trust each other in various ways. In particular, one node must trust another to not reveal or corrupt its data that is stored in the second node's global memory. Without mutual trust, the solution is to encrypt the data on its way to or from global memory. This could be done most easily at the network hardware level [19].

Our current algorithm is essentially a modified global LRU replacement scheme. It is well known that in some cases, such as sequential file access, LRU may not be the best choice [22]. The sequential case could be dealt with by limiting its buffer space, as is done currently in the OSF/1 file buffer cache. Other problems could exist as well. The most obvious is that a single badly-behaving program on one node could cause enormous paging activity, effectively flushing global memory. Of course, even without global memory, a misbehaving program could flood the network or disk, disrupting service. Again, one approach is to provide a threshold limiting the total amount of global memory storage that a single node or single application could consume.

If only a few nodes have idle memory, the CPU load on these nodes could be high if a large number of nodes all attempt to use that idle memory simultaneously. If there are programs running on the idle machines, this could adversely affect their performance. This effect was measured in the previous section. A possible solution is to incorporate CPU-load information with page age information, and to use it to limit CPU overhead on heavily-loaded machines.

In the end, all global memory schemes depend on the existence of "a reasonable amount of idle memory" in the network. If the idle memory drops below a certain point, the use of global memory management should be abandoned until it returns to a reasonable level. Our measurements show the ability of our algorithm and implementation to find and effectively utilize global memory even when idle memory is limited.

7 Conclusions

Current-generation networks, such as ATM, provide an order-of-magnitude performance increase over existing 10 Mb/s Ethernets; another order of magnitude, to gigabit networks, is visible on the horizon. Such networks permit a much tighter coupling of interconnected computers, particularly in local area clusters. To benefit from this new technology, however, operating systems must integrate low-latency high-bandwidth networks into their design, in order to increase the performance of both distributed and parallel applications.

We have shown that global memory management is one practical and efficient way to share cluster-wide memory resources. We have designed a memory management system that attempts to make cluster-wide decisions on memory usage, dynamically adjusting the local/global memory balance on each node as the node's behavior and the cluster's behavior change. Our system does not cause data loss should nodes fail, because only clean pages are cached in global memory; cached data can always be fetched from disk if necessary.

The goal of any global memory algorithm is to reduce the average memory access time. Key to our algorithm is its use of periodically-distributed cluster-wide age information in order to: (1) house global pages in those nodes most likely to have idle memory, (2) avoid burdening nodes that are actively using their memory, (3) ultimately maintain in cluster-wide primary memory the pages most likely to be globally reused, and (4) maintain those pages in the right places. Algorithms that do not have these properties are
unlikely to be successful in a dynamic cluster environment.

We have implemented our algorithm on the OSF/1 operating system running on a cluster of DEC Alpha workstations connected by a DEC AN2 ATM network. Our measurements show the underlying costs for global memory management operations in light of a real implementation, and the potential benefits of global memory management for applications executing within a local-area cluster.

References

[9] A. L. Cox and R. J. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989.

[10] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.