An Operating System Framework For Large
Abstract
Little work has been done on operating systems for massively parallel computing. This paper proposes a framework for such an operating system. It is assumed that there are multiple jobs executing
on a large MIMD computer. Each job is assumed to be data parallel, using as many virtual processors
as necessary to exploit its inherent parallelism. We view the notion of virtual processors as playing
a unifying role in the conceptual design of the operating system. Our main thesis is that the vari-
ous functions performed by the operating system may be viewed as operations on the set of virtual
processors.
In the context of the above framework, several open theoretical problems are identified, and in particular, the twin problems of spatial and temporal scheduling are addressed. Preliminary analysis indicates the viability of horizontal spatial schedules and periodic temporal schedules.
This research was supported in part by the Air Force Office of Scientific Research under grant numbers F49620-92-J-0126 and AFOSR-90-0144, by NASA under grant number NAG-5-1897, and by the NSF under grant numbers MIP-9106949 and MIP-9205737.
† On leave from the Institute of Informatics, Warsaw University.
‡ Supported by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), Brazilian Government, under grant number 200358-92.8.
1 Introduction
When programming a uniprocessor machine, the user has to know very little about the machine to write efficient programs. With current massively parallel processors (MPPs), however, the user needs to know machine-dependent details such as the number of processors available and the amount of memory at each node. In addition, users often analyze the machine's communication delays and the application's communication patterns in order to speed up programs [1]. Current parallel programs are, therefore, highly machine-specific, and difficult to port across machines. One reason for this situation is the lack of operating systems that can efficiently bridge the gap between parallel machines and high-level parallel programming models.
Moreover, the high cost of current MPPs demands that they be utilized efficiently and to the fullest. (In the past, the same economic considerations applied to mainframes.) Typically, this requires the accommodation of multiple users, and an operating system to manage the sharing of machine resources among these users.
There are, in fact, several MPP operating systems on the market, for example, the CM-5's CMost [25] and the T3D's UNICOS [17]. Unfortunately, the convenience afforded by these commercial operating systems is nowhere near what we have grown accustomed to with uniprocessor operating systems. Duties that properly belong to the operating system, such as processor virtualization and virtual memory management, are often ignored and dubbed "programmer responsibility." The inability of these operating systems to offer programming convenience without compromising performance may stem from the fact that they are largely extensions of uniprocessor operating systems.
This paper attempts to rethink, ab initio, the role of operating systems in massively parallel computing. We envisage several jobs executing on a large MIMD machine; each job is assumed to be data parallel, using as many virtual processors as necessary to exploit its inherent parallelism. We believe that the notion of virtual processors unifies the conceptual design of the operating system, in much the same way as the filesystem does in UNIX™. Our main thesis is that most activities of the operating system may be viewed as operations or manipulations on the set of virtual processors. Viewing the activities of the operating system in this context facilitates the examination of the merits and demerits of various operating system policies and reveals the underlying basic theoretical problems.
Section 2 presents a framework to view operating system issues in terms of virtual processors. In the context of this framework, the sections that follow take a closer look at two complementary problems: spatial and temporal scheduling. In Section 4, we consider spatial schedules, and make a qualitative comparison between two simple spatial scheduling policies. In Section 5, a simplified, discrete model for temporal scheduling is proposed, and metrics for evaluating temporal schedules are described. Section 6 finds the optimal temporal scheduling policy in terms of these metrics.
Figure 1. The function of the operating system is to emulate several virtual machines (corresponding to multiple programs) on a single physical machine; each virtual machine (VM) is a user job.
The combination of the data parallel programming model and a MIMD machine model is called an SPMD execution model [18, page 606]. SPMD stands for Single Program Multiple Data, indicating that all processors execute the same program, but may be at different instructions at a given time, owing to asynchronous execution. Throughout this paper, our discussion of operating systems is predicated on the SPMD execution model.
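To make the model concrete, here is a minimal SPMD sketch (our illustration, written with the mpi4py library, which is not part of the paper): every processor runs this same program, but the rank-dependent branch sends different processors down different paths, and nothing forces them to reach the same instruction at the same time.

    # Minimal SPMD sketch (illustrative): one program, run on every processor;
    # behavior is parameterized by the processor's rank.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()              # this processor's identity
    size = comm.Get_size()              # total number of processors

    data = rank * rank                  # each rank computes on its own data
    if rank != 0:
        comm.send(data, dest=0)         # workers proceed asynchronously...
    else:
        total = data
        for src in range(1, size):
            total += comm.recv(source=src)   # ...rank 0 gathers the results
        print("sum of squares:", total)

Run as, for example, "mpiexec -n 4 python spmd.py": all four processes execute the identical file, which is precisely the SPMD discipline.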
2.2 Virtual Processors as a Basis for Operating Systems
As pointed out before, a program is nothing more than a description of a virtual machine. In the case of a data parallel program, the virtual machine consists of a (typically large) number of identical virtual processors (VPs), communicating through an interconnection network. For instance, the standard data parallel program to multiply two N × N matrices [10] might be viewed as a virtual machine consisting of N² VPs communicating in a mesh.
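To illustrate (the code and names below are ours, not from [10]): each of the N² VPs is an identical little program that owns one row of A, one column of B, and one output element, so the number of VPs is fixed by the problem size rather than by the physical machine.

    # Sketch: the N x N matrix product C = A x B viewed as N^2 virtual
    # processors, VP (i, j) owning row i of A, column j of B, and C[i][j].
    class VP:
        def __init__(self, row, col):
            self.row, self.col = row, col   # this VP's slice of the data

        def step(self):                     # the identical per-VP program
            return sum(a * b for a, b in zip(self.row, self.col))

    def matmul(A, B):
        n = len(A)
        cols = list(zip(*B))                # column j of B as a tuple
        vps = [[VP(A[i], cols[j]) for j in range(n)] for i in range(n)]
        return [[vps[i][j].step() for j in range(n)] for i in range(n)]

    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]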
For many years now, the concept of virtual processors has been relegated to the status of a mere logical aid to programmers [24, 8]. In our view, the notion of virtual processors should form the fundamental basis of an MPP operating system. We claim two advantages to this approach.

It is our thesis that most of the functions of an MPP operating system can be viewed as operations on the set of VPs. Thus, the notion of virtual processors provides a unified framework within which several operating system issues may be considered and evaluated. We briefly illustrate our thesis below, by phrasing various well-known operating system issues in terms of VP manipulations:
- Spatial scheduling: When a job enters the system, the spatial scheduling (or space sharing) policy defines which processor each VP is allocated to.
- Temporal scheduling: The temporal scheduling (or time sharing) policy dictates how each processor switches between the execution of the VPs allocated to it.
- Load balancing: Once a VP is allocated to a processor, it usually does not get reallocated, since moving VPs between processors is quite expensive. However, there are situations when VPs do indeed move between processors. For example, if a job spawns and kills VPs dynamically in an unpredictable fashion, it is periodically necessary to load balance the VPs among processors.
- Memory and I/O problems: Memory limitations and I/O bottlenecks may also be phrased in terms of VPs. Memory limitations occur when a processor does not have enough local memory to hold all the VPs allocated to it. I/O bottlenecks are most pronounced while loading (roll-in) the VPs of a job from the disk into the local memories of various processors.¹
The above problems are not independent of each other: any policy on one issue has subtle repercussions on the other issues. The VP model enables us to understand and tackle these complex interactions. For example, suppose each processor schedules the VPs assigned to it in FIFO order. A VP that is at the end of a processor's queue may work its way towards the head, only to be bumped off to the end of another processor's queue because of an ill-timed load balancing. If this continues, the VP in question might never get a chance to be executed.
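As a sketch of what "operations on the set of VPs" might look like in code (entirely our illustration; every name is invented), the three policies above become three methods on one central data structure:

    # Sketch: the set of VPs as the OS's one central data structure; spatial
    # scheduling, temporal scheduling, and load balancing operate on it.
    from collections import deque

    class VPSet:
        def __init__(self, num_pes):
            self.queues = [deque() for _ in range(num_pes)]  # VPs on each PE

        def spatial_schedule(self, job, num_vps):
            # Horizontal allocation: spread the job's VPs over all PEs.
            for vp in range(num_vps):
                self.queues[vp % len(self.queues)].append((job, vp))

        def temporal_step(self, pe):
            # One FIFO/round-robin time slice on processor pe.
            if self.queues[pe]:
                job, vp = self.queues[pe].popleft()
                # ... execute (job, vp) for one time slice ...
                self.queues[pe].append((job, vp))

        def load_balance(self):
            # Move one VP from the most- to the least-loaded processor.
            src = max(self.queues, key=len)
            dst = min(self.queues, key=len)
            if len(src) > len(dst) + 1:
                dst.append(src.pop())   # lands at the END of dst's queue

The last comment is exactly the FIFO pitfall described above: a migrated VP re-enters at the tail of its new queue, so repeated migrations can starve it.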
An operating system based on virtual processors alleviates several constraints imposed by current commercial MPP operating systems, such as those of Thinking Machines' CM-5 and Cray's T3D. Since these operating systems do not support the virtual processor abstraction, it is the programmer's (or compiler's) responsibility to manage the virtual processors. This involves grouping logical VPs together into coarse-grain processes, forcing the number of such processes to fit a legal partition size. If all partitions of that size happen to be in use, then the programmer must either wait, or re-group the VPs (perhaps compiling the code with different options) to fit another partition size.
¹ For example, it takes the Cray T3D (128 processors with 64 MB per processor, and 2 I/O gateways at 50 MB/sec) about 1.3 minutes to load the whole machine: 128 × 64 MB = 8 GB, moved at an aggregate 100 MB/sec.
Furthermore, if some jobs terminate, leaving the machine lightly loaded, then current MPP operating systems are unable to load balance the VPs of still-running jobs over a larger number of processors.²

² This information was obtained from discussions with engineers from MasPar, Cray Research, and Thinking Machines [1, 9, 19].
On the flip side, there are, conceivably, performance overheads in designing an operating system based on virtual processors. Whether the convenience won is worth the performance lost cannot be settled by rhetoric: extensive experimentation and analysis are required before anything can be said on the matter.
As an application of the virtual processor framework, the following sections take a preliminary look at two of the problems mentioned above, namely spatial and temporal scheduling. As the names suggest, these two issues are complementary: the first determines where VPs must be executed, while the second determines when VPs must be executed.
4 Spatial Scheduling
Whenever a job enters the system, the spatial scheduling policy must specify the processor that each VP of the job is allocated to. For simplicity, we restrict ourselves to the static case, wherein a set of jobs presents itself for allocation initially, and no jobs arrive or leave the system thereafter. Moreover, we assume that jobs do not spawn and kill VPs dynamically, which would otherwise necessitate load balancing.
In the static case, a spatial schedule is simply a mapping from the set of VPs to the set of processors.
Figure 2 shows an example of such a spatial schedule.
Spatial scheduling can be done in many ways; two policies suggest themselves immediately. A vertical spatial schedule is one in which the VPs of every job are granted exclusive access to a subset of the processors. This scheme is also called partitioning. At the other extreme, a horizontal spatial schedule is one in which the VPs of every job are spread evenly over all the processors in the system (or over as many processors as possible, if the number of VPs is smaller than the number of processors).
[Figure 2: each column is one of processors 1-5; the numbers stacked above a processor are the jobs whose VPs it holds.]
Figure 2. A spatial schedule, or allocation. In this toy example, 5 jobs need to be allocated on a machine with 5 processors. The jobs have 1, 4, 1, 5, and 1 VPs each.
Processor utilization.
  Vertical: Wasted if the number of idle PEs is not enough to satisfy the minimum requirements of any queued job. If jobs leave the system, freeing many PEs, then the VPs of running jobs cannot exploit the free PEs unless load balancing is performed (with large OS overhead).
  Horizontal: Fully utilized. When a job terminates and leaves the system, its resources are automatically shared among the remaining jobs in the system: no load balancing required.

Memory utilization.
  Vertical: As in the processor utilization case, memory resources may not be fully utilized. On the other hand, since many VPs of the same job are on each PE, the number of copies of the program can be minimized by having just one per PE. Memory fragmentation does not occur.
  Horizontal: Since each job uses all PEs, a copy of each program has to be on each PE, increasing code memory. Also, since VPs of different jobs exist on the same PE, memory fragmentation may occur.

Interprocessor communication.
  Vertical: Reduced, as VPs on the same PE can communicate by local memory access; however, these communications will be sequential.
  Horizontal: Many communications performed concurrently. Although the network delay may be significant, horizontal allocation makes the communication patterns random, which makes networks behave well [11].

Roll-in/roll-out time.
  Vertical: When only one job is allocated on a processor, time lost in roll-in/out is unavoidable, and usually significant.
  Horizontal: Can be fully masked: while the PEs are executing the jobs allocated to them, the new job is loaded into the system by a DMA controller. Once the job has been loaded, the local schedule is augmented with the VPs of the new job. A similar procedure masks the roll-out time.

Table 1. A qualitative comparison of vertical and horizontal spatial schedules.
Table 1 gives a qualitative comparison of vertical and horizontal spatial schedules in terms of system
performance metrics such as processor utilization, memory utilization, interprocessor communication, and
roll-in/roll-out time. These preliminary considerations indicate several advantages of horizontal allocations
over the conventionally adopted vertical/partitioning schemes.
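The two policies are easy to state as code (our sketch; a job is represented only by its VP count, and the staggering in the horizontal case is one of many reasonable choices):

    # Sketch: vertical vs. horizontal spatial schedules. job_sizes[j] is the
    # number of VPs of job j; each returns, per PE, a list of (job, vp) pairs.
    def vertical(job_sizes, num_pes):
        pes = [[] for _ in range(num_pes)]
        next_pe = 0
        for job, size in enumerate(job_sizes):
            # Exclusive partition, one VP per PE (the simplest legal partition).
            assert next_pe + size <= num_pes, "not enough idle PEs: job waits"
            for vp in range(size):
                pes[next_pe].append((job, vp))
                next_pe += 1
        return pes

    def horizontal(job_sizes, num_pes):
        pes = [[] for _ in range(num_pes)]
        offset = 0
        for job, size in enumerate(job_sizes):
            for vp in range(size):              # deal the VPs across all PEs
                pes[(offset + vp) % num_pes].append((job, vp))
            offset += size                      # stagger jobs for balance
        return pes

    sizes = [1, 4, 1, 5, 1]         # the toy jobs of Figure 2
    print(vertical(sizes, 12))      # this partitioning needs 12 PEs
    print(horizontal(sizes, 5))     # fits on 5 PEs, each job spread out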
5 Temporal Scheduling
Once spatial scheduling is done, the problem is more local in nature. Several VPs, possibly belonging to different jobs, may have been allocated to the same processor. The temporal scheduling policy specifies how each processor multiplexes the execution of the VPs that are allocated to it.

[Figure (trace diagram): processors PE 1 through PE 6 run along the horizontal axis and time slices along the vertical axis; the slices granted to each job's VPs (e.g., Job B) are marked, and the schedule repeats in periods (Period 1, Period 2).]
model this fact in the trace diagram, it is postulated that a trace diagram is legal if all VPs belonging to the same job receive the same number of time slices "on average". More precisely: over any period of time, the numbers of time slices devoted to the various VPs of a job differ by no more than a constant.
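Stated symbolically (a formalization we supply; the notation s_v(t_1, t_2), the number of time slices granted to VP v during the interval [t_1, t_2], is ours): a trace diagram is legal if there is a constant C such that, for every pair of VPs u and v belonging to the same job,

\[
\bigl| \, s_u(t_1, t_2) - s_v(t_1, t_2) \, \bigr| \;\le\; C \qquad \text{for all intervals } [t_1, t_2].
\]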
extension of a round robin schedule, wherein a processor may return to a VP many times within the same round.

An optimal temporal schedule is defined as one with the least idling ratio that satisfies the happiness requirements of all the users. The following theorem proves that for any temporal schedule, there exists a periodic schedule with an idling ratio no higher, in which each job is at least as happy. This implies that the search for an optimal temporal schedule may be restricted to the class of periodic schedules.
Theorem 1. For every temporal schedule S, there exists a periodic schedule S_p, such that the idling ratio of S_p is at most that of S, and every job's happiness in S_p is at least as much as in S.
Proof: Define the progress of a job at a particular time as the number of time slices granted to each of its VPs up to that time. Thus, if a job has V VPs, its progress at time slice t may be represented by a progress vector of V components, where each component is an integer less than or equal to t.

By the rules of legal execution, no VP may lag behind another VP of the same job by more than a constant number C of time slices. Therefore, no two elements in the progress vector can differ by more than C.
Define the differential progress of a job at a particular time as the number of time slices by which each VP leads the slowest VP of the job. Thus, the differential progress vector at time t is also a vector of V components, where each component is an integer less than or equal to C. The differential progress vector is obtained by subtracting the minimum component of the progress vector from each component of the progress vector.
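In symbols (our restatement, with s_v(t) denoting component v of the job's progress vector at time t and J the set of the job's VPs):

\[
d_v(t) \;=\; s_v(t) - \min_{u \in J} s_u(t), \qquad 0 \le d_v(t) \le C.
\]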
The system's differential progress vector (SDPV) at time t is the concatenation of all jobs' differential progress vectors at time t. The key is to note that the SDPV can only assume a finite number of values. Therefore, there exists an infinite sequence of times t_{i_1}, t_{i_2}, ... such that the SDPVs at these times are identical.
Consider any time interval [t_{i_k}, t_{i_{k'}}], with k < k'. One may construct a periodic schedule by cutting out the portion of the trace diagram between t_{i_k} and t_{i_{k'}}, and replicating it infinitely in the vertical direction.
First of all, we claim that such a periodic schedule is legal. From the equality of the SDPVs at t_{i_k} and t_{i_{k'}}, it follows that all VPs belonging to the same job receive the same number of time slices during each period. In other words, at the end of each period, all the VPs belonging to the same job have made equal progress. Therefore, no VP lags behind another VP of the same job by more than a constant number of time slices.
Secondly, observe that it is possible to choose a time interval [t_{i_k}, t_{i_{k'}}] such that the happiness of each job during this interval is at least as much as in the complete trace diagram. This implies that the happiness of each job in the constructed periodic schedule is greater than or equal to its happiness in the original temporal schedule.
Finally, the idling ratio of the constructed periodic schedule must be less than or equal to the idling ratio of the original temporal schedule: since the fraction of the area of the trace diagram covered by each job does not decrease, the fraction covered by the holes cannot increase. This concludes the proof. □
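The proof is constructive and easy to mechanize. The sketch below (ours; the trace representation and all names are assumptions) computes the SDPV after every slice of a finite trace prefix and reports the first interval between two occurrences of the same SDPV; replicating that interval forever is exactly the construction in the proof.

    # Sketch: find a repeating SDPV state in a finite prefix of a trace.
    # A trace is a list of time slices; each slice maps PE -> VP id, and
    # vp_job maps VP id -> job id (both representations are assumptions).
    from collections import defaultdict

    def find_period(trace, vp_job):
        progress = defaultdict(int)        # VP id -> slices received so far
        seen = {}                          # SDPV -> earliest time observed
        for t, time_slice in enumerate(trace):
            for vp in time_slice.values():
                progress[vp] += 1          # grant one slice per running VP
            per_job = defaultdict(list)
            for vp, job in vp_job.items():
                per_job[job].append((vp, progress[vp]))
            # Differential progress: each VP's lead over its job's slowest VP.
            sdpv = tuple(
                (vp, count - min(c for _, c in members))
                for job, members in sorted(per_job.items())
                for vp, count in sorted(members)
            )
            if sdpv in seen:               # same SDPV twice: a period exists
                return seen[sdpv] + 1, t + 1   # replicate trace[start:end]
            seen[sdpv] = t
        return None                        # no repetition within this prefix

    # Toy usage: job A has VPs 0 and 1, job B has VP 2, two PEs.
    vp_job = {0: "A", 1: "A", 2: "B"}
    trace = [{0: 0, 1: 2}, {0: 1, 1: 2}] * 4   # alternate A's VPs on PE 0
    print(find_period(trace, vp_job))          # -> (1, 3)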
We assume that the spatial scheduling is done, and VPs have been allocated to processors somehow. The spatial schedule can be summarized in the form of an m × n matrix A, called the allocation matrix, where A_{j,p} gives the number of VPs of job J_j on processor p. For example, Figure 4 gives the allocation matrix
any VP of job J_j in one period of the periodic schedule. (Recall that within each period of a periodic schedule, all VPs belonging to the same job must receive exactly the same number of time slices.) Of course, T and the R_j's are not yet known.
To summarize, the input data available to us are the happiness requirements and the A_{j,p}'s; the output data to be computed are T and the R_j's. Without loss of generality, let us normalize the period T to 1, and redefine the R_j's to be the old R_j's divided by T. (Thus, while the old R_j's were integer variables, the new R_j's are rational numbers.)
The minimization of the idling ratio may be written algebraically as:
\[
\min \left( 1 - \frac{\sum_{p} \sum_{j} A_{j,p} R_j}{n} \right) \qquad (1)
\]
The objective function (1), along with the constraints (2) and (3), forms a linear program, which may be solved using standard techniques. An apparent complication is that we seek not just any R_j's, but R_j's that are rational. This is really not a problem: as long as the happiness requirements are rational, the R_j's will automatically be rational, since a linear program with rational coefficients has a rational optimal basic solution.
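For illustration, the LP is small enough to hand to an off-the-shelf solver. In the sketch below (ours), constraints (2) and (3) are not reproduced in the text above, so we substitute a plausible reading: no processor may be oversubscribed within the unit period, and each job has a minimum rate r_j standing in for its happiness requirement.

    # Sketch: solving the idling-ratio LP with scipy (the allocation matrix
    # and the rates r_j below are our toy data, not the paper's Figure 4).
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[0, 0, 0, 0, 1],     # A[j, p]: VPs of job j on PE p
                  [1, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0],
                  [0, 2, 0, 3, 0],
                  [0, 0, 1, 0, 0]])
    m, n = A.shape
    r = np.full(m, 0.1)                # assumed per-job minimum rates

    # Minimizing 1 - (sum_p sum_j A[j,p] R_j) / n is the same as
    # minimizing the negated weighted sum of the R_j's.
    c = -A.sum(axis=1) / n
    A_ub = A.T                         # one capacity row per processor:
    b_ub = np.ones(n)                  #   sum_j A[j,p] R_j <= 1   (T = 1)
    bounds = [(r[j], 1.0) for j in range(m)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print("rates R_j:", res.x)
    print("idling ratio:", 1 + c @ res.x)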
7 Conclusions
This paper presents the notion of virtual processors as the unifying concept in the design of operating sys-
tems for massively parallel computing. We propose that several well-recognized activities of the operating
system can be viewed as operations or manipulations on the set of virtual processors. To illustrate the applicability of the virtual processor framework, we present preliminary analyses of spatial and temporal scheduling.
We believe that the many conceptual benefits of founding an MPP operating system on virtual processors will outweigh the overheads in terms of performance. However, a definitive answer will require extensive experimentation and analysis.
References
[1] Tom Blank. Personal communications, 1993. MasPar Computer Corporation, Sunnyvale, CA.
[2] Guy E. Blelloch. Vector Models for Data-parallel Computing. MIT Press, Cambridge, MA, 1990.
[3] Walter S. Brainerd, Charles H. Goldberg, and Jeanne C. Adams. Programmer's Guide to Fortran 90. McGraw-Hill Book Co., 1990.
[4] M. Crovella et al. Multiprogramming on multiprocessors. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, pages 590–597, December 1991.
[5] R. Cytron, J. Lipkis, and E. Schonberg. A computer-assisted approach to SPMD execution. In Proceedings of Supercomputing '90, pages 398–406, November 1990.
[6] Hesham El-Rewini, Theodore G. Lewis, and Hesham H. Ali. Task Scheduling in Parallel and Distributed Systems. Prentice Hall, Englewood Cliffs, NJ, 1994.
[7] Philip J. Hatcher and Michael J. Quinn. Data-parallel programming on MIMD computers. MIT Press,
Cambridge, MA, 1991.
[8] W. D. Hillis. The Connection Machine. MIT Press, Cambridge, Mass., 1985.
[9] Kent K. Koeninger. Personal communications, 1994. Cray Research Corp., Minneapolis, MN.
[10] T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Mor-
gan Kaufmann Publishers, Inc., San Mateo, CA 94403, 1992.
[11] Tom Leighton. Methods for message routing in parallel machines. In 24th Annual ACM Symposium on Theory of Computing, pages 77–95, 1992.
[12] David B. Loveman. High Performance Fortran. IEEE Parallel and Distributed Technology, 1(1):25–42, 1993.
[13] S. T. Leutenegger and M. K. Vernon. The performance of multiprogrammed multiprocessor scheduling policies. In Performance Evaluation Review, pages 226–236, May 1990.
[14] C. McCann, R. Vaswani, and J. Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Transactions on Computer Systems, 11(2):146–178, 1993.
[15] C. McCann and J. Zahorjan. Processor allocation policies for message-passing parallel computers. In Performance Evaluation Review, pages 19–32, May 1994.
[16] Michael Metcalf and John Reid. Fortran 90 Explained. Oxford University Press, New York, 1990.
[17] Wilfried Oed. The Cray Research Massively Parallel Processor System CRAY T3D. Available by anonymous ftp from ftp.cray.com, November 1993.
[18] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hard-
ware/Software Interface. Morgan Kaufmann Publishers, San Mateo, CA, 1994.
[19] David M. Ray. Personal communications, 1994. Thinking Machines Corporation, Cambridge, MA.
[20] Gary Sabot. The Paralation Model: Architecture-Independent Parallel Programming. MIT Press,
Cambridge, MA, 1988.
[21] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. Pitman Publishing, London, 1989.
[22] S. K. Setia, M. S. Squillante, and S. K. Tripathi. Analysis of processor allocation in multiprogrammed, distributed-memory parallel processing systems. IEEE Transactions on Parallel and Distributed Systems, 5(4):401–430, April 1994.
[23] K. C. Sevcik. Characterizations of parallelism in applications and their use in scheduling. In Performance Evaluation Review, pages 171–180, May 1989.
[24] Thinking Machines Corporation, Cambridge, MA. *Lisp release notes, 1987.
[25] Thinking Machines Corporation, Cambridge, MA. The Connection Machine CM-5 Technical Sum-
mary, October 1991.