Adaptive Co-Allocation of Distributed
Resources for Parallel Applications

by

Marco A. S. Netto

Doctor of Philosophy

November 2009
Abstract
Parallel applications can speed up their execution by accessing resources hosted by multiple autonomous providers. Applications following the message passing model demand that processors be available at the same time, a problem known as resource co-allocation. Other application models, such as workflows and bag-of-tasks (BoT), can also
benefit from the coordinated allocation of resources from autonomous providers. Applications waiting for resources require constant rescheduling. However, unlike in single-provider settings, rescheduling across multiple providers is challenging due to the autonomy of the participants and the limited information they are willing to disclose.
This thesis contributes to the area of distributed systems by proposing adaptive resource co-allocation policies for message passing and BoT applications, which aim at reducing user response time and increasing system utilisation. For message passing applications, the co-allocation policies rely on start time shifting and process remapping operations, whereas for BoT applications, the policies consider limited information access from providers and coordinated rescheduling. This thesis also shows practical deployment of the co-allocation policies in a real distributed computing environment. The four major findings of this thesis are:
1. Adaptive co-allocation for message passing applications is necessary since single-cluster applications may not fill all scheduling queue fragments generated by inaccurate run time estimates. It also allows applications to be rescheduled to a single cluster, thus eliminating inter-cluster network overhead;

2. Metaschedulers using system-generated run time estimates can reschedule applications to faster or slower resources without forcing users to overestimate execution times. Overestimations have a negative effect when scheduling parallel applications in multiple providers;
3. It is possible to keep information from providers private, such as local load and total computing power, when co-allocating resources for deadline-constrained BoT applications. Resource providers can use execution offers to advertise their interest in executing an entire BoT or only part of it without revealing private information, which is important for companies to protect their business strategies;

4. Tasks of the same BoT can be spread over time due to inaccurate run time estimates and environment heterogeneity. Coordinated rescheduling of these tasks can reduce response time for users accessing single and multiple providers. Moreover, accurate run time estimates assist metaschedulers to better distribute tasks of BoT applications on multiple providers.
This is to certify that
(ii) due acknowledgement has been made in the text to all other material used,
(iii) the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies, appendices and footnotes.
Signature
Date
ACKNOWLEDGMENTS
Marco A. S. Netto
Melbourne, Australia
November 2009
CONTENTS
1 Introduction 1
1.1 Resource Co-Allocation and Rescheduling . . . . . . . . . . . . . . . . . 4
1.2 Research Question and Objectives . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions and Main Findings . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Workload Files from Parallel Machines . . . . . . . . . . . . . . 8
1.4.2 Scheduling System . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Thesis Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.4.2 Rescheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.3 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . 53
4.5.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
References 121
LIST OF FIGURES
5.7 Throughput for the Small-World topology. . . . . . . . . . . . . . . . . . 73
5.8 Throughput for the Scale-Free topology. . . . . . . . . . . . . . . . . . . 73
5.9 Throughput for the Random topology. . . . . . . . . . . . . . . . . . . . 73
5.10 Comparison of predicted and actual execution times . . . . . . . . . . . . 74
5.11 Overestimations to avoid application being aborted due to rescheduling. . 74
5.12 Epsilon indicator for three resource sets on both communication models. . 75
5.13 Epsilon indicator showing the importance of mixing topologies. . . . . . 75
7.1 Time to generate execution time estimates and their accuracy . . . . . . . 100
7.2 Set up for experiments with coordinated rescheduling. . . . . . . . . . . . 102
7.3 Node configurations for the experiments in Grid’5000. . . . . . . . . . . 103
7.4 Comparison of results from Grid’5000 and simulations. . . . . . . . . . . 110
Chapter 1
Introduction
The Ghan1 is a passenger train that operates between Darwin and Adelaide in Australia. A person who departs from Melbourne needs to book the following: a flight from Melbourne to Darwin and another from Adelaide to Melbourne; accommodation in Darwin and Adelaide; and, naturally, the train ticket (Figure 1.1). When booking these
services, one can either contact an agency or book each service directly. The process of
allocating these services in a coordinated manner is called co-allocation or co-scheduling.
Apart from co-allocating these services, one must take into account possible changes of plans, i.e. rescheduling the bookings. The challenge is that changing a single booking may
affect all the other bookings due to their interdependency. Co-allocation is also necessary
in many other activities of our daily lives. For example, when scheduling a meeting with
multiple participants, it is necessary to make sure all participants can attend the meeting
at a specified time, and that a room and additional resources, such as a projector, are
available. In the digital world, several software systems also require co-allocation of
multiple components to execute properly, and most importantly, co-allocation decisions
may change over time to meet user requirements.
Distributed computing [117] is a field of Computer Science that investigates systems consisting of multiple components over a computer network. These components, which can be software or hardware, interact with one another via a communication protocol, communicating either simultaneously or in sequence. The complexity of
component interactions depends on the scale of the system. Large-scale systems, with
hundreds or thousands of resources, tend to have more complex communication protocols in order to handle heterogeneous components and failures in hardware and software layers.
Large-scale distributed computing systems gained attention in the 1990s due to the
1 The Ghan Website: http://www.gsr.com.au
[Figure 1.1: Map of The Ghan route between Darwin and Adelaide, with Melbourne, Sydney, Perth, and Adelaide marked, together with the numbered bookings of the travel example.]

Figure 1.2: Example of two applications requiring resources from multiple sites (user, clusters, supercomputer centers, a storage center, a geophysics center, a data generator, a visualisation tool, and a data analysis application connected by a network).
available at the same time and the unavailability of a single processor can compromise the
entire application. Therefore, in order to execute large-scale MPI applications, researchers
developed policies to allocate resources from multiple sites in a coordinated manner. To
co-allocate resources, users rely on metaschedulers, which are responsible for generating
co-allocation requests that are submitted to management systems of each site. These
requests, also known as jobs, represent the amount of time and number of processors
required by the user application. In order to guarantee that all jobs start at the same time,
the metascheduler allocates resources using advance reservations [51].
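The decomposition of a metajob into per-site jobs that share a reserved start time can be sketched as follows. This is a minimal illustration: the `Job` fields and the latest-earliest-start rule are assumptions for exposition, not the request format of any of the metaschedulers cited here.

```python
from dataclasses import dataclass

@dataclass
class Job:
    """One request submitted to a single provider's local scheduler."""
    provider: str
    processors: int
    runtime: int     # user-estimated run time in seconds
    start_time: int  # advance-reservation start time

def co_allocate(parts, earliest_starts, runtime):
    """Decompose a metajob into per-provider jobs that all start at the
    latest of the providers' earliest feasible start times, so every
    advance reservation can be honoured simultaneously."""
    common_start = max(earliest_starts[p] for p in parts)
    return [Job(p, n, runtime, common_start) for p, n in parts.items()]

# A 96-process metajob split across two sites; siteB cannot start
# before t=1500, so both reservations are placed at t=1500.
jobs = co_allocate({"siteA": 64, "siteB": 32},
                   {"siteA": 1000, "siteB": 1500}, runtime=3600)
assert all(j.start_time == 1500 for j in jobs)
```

The key property is that no job starts before every sibling reservation is feasible, which is what the advance reservations guarantee.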
Recently, several researchers investigated resource co-allocation for workflow appli-
cations [134]. Workflows are applications composed of tasks that have time dependencies
and control flow specifications; a process cannot start if its input is not available or a set of
conditions are not satisfied. Therefore, resources that execute a workflow have to be avail-
able in a coordinated manner, usually in a sequence. However, in the literature, resource
co-allocation for workflows is usually referred to as workflow scheduling [41, 134].
Bag-of-Tasks (BoT) applications [27, 33] are another application model that requires co-allocation, but with different constraints. These applications have been used in several fields including computational biology [92], image processing [110], and massive searches [5]. In comparison to the message passing model, BoT applications can be easily executed on multiple resource providers to meet a user deadline or reduce the user response time. Although BoT applications comprise independent tasks, the results produced by all tasks constitute the solution of a single problem. In most cases, users need the whole set of tasks executed in order to post-process or analyse the results. The optimisation of the aggregate set of results is important, and not the optimisation of a particular
task or group of tasks [8]. Therefore, message passing, workflow, and bag-of-tasks applications have different resource co-allocation requirements.
Various projects have developed software systems with resource co-allocation support for large-scale computing environments, such as TeraGrid, Distributed ASCI Supercomputer (DAS), and Grid'5000. TeraGrid has deployed Generic Universal Remote (GUR) [133] and Highly-Available Resource Co-allocator (HARC) [78], the DAS project has developed KOALA [83, 84], and Grid'5000 [18] has relied on the OAR(Grid) scheduler [25] to allow the execution of applications requiring co-allocation. In these systems, co-allocation is mostly performed for applications that require simultaneous access to resources from several sites. For workflows, researchers rely on workflow engines, which are middleware systems to schedule workflow applications. In the Cloud Computing space, initiatives such as RESERVOIR [100] are emerging to co-allocate resources from multiple commercial data centers. These centers hold information that metaschedulers require to co-allocate resources, such as scheduling policies, local load, and total computing capabilities, but may be unwilling to disclose it.
[Figure 1.3: (a) Message Passing Application and (b) Bag-of-Tasks Application, each shown as tasks (T) placed on resources over time; the message passing tasks run simultaneously and communicate, whereas the bag-of-tasks tasks run independently until the overall completion time.]

1.2 Research Question and Objectives
“What are the benefits for users and resource providers when rescheduling message
passing and bag-of-tasks applications on multiple autonomous providers?”
The following objectives are requirements to answer the thesis research question:
To meet the above objectives, this thesis proposes resource co-allocation policies that
consider the following aspects:
• Completion time guarantees: users can better plan other activities, such as result analysis, when they have a precise estimate of their application's completion time, in particular when an application is part of a workflow. Thus, resource providers and metaschedulers have an important role in generating completion time estimations;
• Limited information access: resource providers may want to keep their load and
total computing power private, especially in utility computing facilities. Therefore,
co-allocation in these environments is more difficult due to the limited information
access.
The main findings from this model are that local jobs may not fill all the fragments in the scheduling queues, and hence rescheduling co-allocation requests reduces the response time of both local and multi-site jobs. We have also observed in some scenarios that process remapping increases the chance of placing the tasks of multi-site jobs into a single cluster, thus eliminating any inter-cluster network overhead.
Moreover, a simple and practical approach can be used to generate run time predictions, depending on the application. Predictions are important since applications may be aborted when rescheduled to slower resources, unless users provide high run time overestimations. When applications are rescheduled to faster resources, backfilling may not be exploited if estimated run times are not reduced.
The main findings are that offer-based scheduling delays fewer jobs that cannot meet deadlines in comparison to scheduling based on load availability (i.e. free time slots); thus it is possible to keep providers' load information private when scheduling multi-site BoTs. Furthermore, if providers publish their total computing power configuration, more local jobs can meet deadlines.
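The idea behind execution offers can be sketched as follows: a provider inspects its schedule privately and advertises only how many tasks of a BoT it could complete by the deadline. The free-slot representation and the offer computation below are simplifying assumptions for illustration, not the actual offer format used by the policies studied in this thesis.

```python
def make_offer(free_slots, task_runtime, deadline, now=0):
    """Compute an execution offer: the number of BoT tasks this
    provider could finish by `deadline`. `free_slots` is the
    provider's private list of (start, end, processors) gaps;
    only the resulting count is disclosed, never the local load
    or total computing power."""
    tasks = 0
    for start, end, procs in free_slots:
        window = min(end, deadline) - max(start, now)
        if window >= task_runtime:
            # Each processor can run one task per task_runtime window.
            tasks += (window // task_runtime) * procs
    return tasks  # the offer: a single number

# Two private gaps of 2 processors each: 50s and 40s wide.
offer = make_offer([(0, 50, 2), (60, 100, 2)], task_runtime=20, deadline=100)
assert offer == 8
```

Note how the metascheduler learns only the offer, not the queue state that produced it, which is the privacy property finding 3 refers to.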
• Show the importance of accurate run time estimates when co-allocating resources for bag-of-tasks applications on multiple providers;
The main findings are that tasks of the same BoT can be spread over time due to inaccurate run time estimates and environment heterogeneity. Coordinated rescheduling of these tasks can reduce user response time. Moreover, accurate run time estimates assist metaschedulers in better distributing the tasks of BoT applications over multiple sites. Although system-generated predictions may take time to compute, the schedules produced with more accurate run time estimates pay off the profiling time, since users obtain better response times than by simply overestimating resource usage.
1.4 Methodology
Most of the results presented in this thesis are based on simulations using workloads from real production systems. In order to analyse the technical challenges of deploying the co-allocation policies, we performed experiments using Grid'5000, which consists of a set of clusters in France dedicated to large-scale experiments2. Here we present an overview of the workloads used in our experiments and the scheduling system in which we developed the co-allocation policies.
1.4.1 Workload Files from Parallel Machines

We used workload logs from real production systems available at the Parallel Workloads Archive3. These logs are ASCII text files that follow the Standard Workload Format4. The top of a log file contains comments on the machine where the log was obtained, such as its name, number of processors and their configuration, and scheduling queue attributes. The body of the log file contains a sequence of lines, each representing a job. There are eighteen fields containing job attributes, four of which are relevant to this thesis: arrival time, estimated run time, actual run time, and number of required processors.
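Extracting those four fields can be sketched as below. The field positions follow the published Standard Workload Format ordering (1-based: field 2 is submit time, field 4 actual run time, field 8 requested processors, field 9 requested run time); since the text does not say whether "required processors" maps to the requested or the allocated field, the requested field is assumed here.

```python
def parse_swf(lines):
    """Extract the four job attributes used in this thesis from a
    Standard Workload Format log. Lines starting with ';' are
    header comments; data lines have 18 whitespace-separated
    fields."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip machine-description comments
        f = line.split()
        jobs.append({
            "arrival": int(f[1]),     # field 2: submit (arrival) time
            "run_time": int(f[3]),    # field 4: actual run time
            "processors": int(f[7]),  # field 8: requested processors
            "estimate": int(f[8]),    # field 9: requested (estimated) run time
        })
    return jobs

log = ["; Computer: example cluster",
       "1 0 5 100 8 -1 -1 8 120 -1 1 1 1 1 -1 -1 -1 -1"]
jobs = parse_swf(log)
assert jobs[0] == {"arrival": 0, "run_time": 100,
                   "processors": 8, "estimate": 120}
```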
We adapted the workload files to meet the requirements of our experiments. In order to evaluate the co-allocation policies under different loads, we modified job arrival times, by either reducing or increasing them, so as to achieve the required load. This strategy was also used by other researchers [104]. We also incorporated parameters, such as network overhead, deadlines, and the time to obtain system-generated run time predictions. We discuss their inclusion when describing the experiments in which each of these parameters was required.
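The arrival-time adjustment can be sketched as a simple rescaling. The linear rule is an assumption for illustration; the text only states that arrivals were reduced or increased until the target load was reached.

```python
def scale_arrivals(arrivals, factor):
    """Multiply job arrival times by `factor`: a factor below 1 packs
    the same jobs closer together, raising the offered load, while a
    factor above 1 spreads them out, lowering it. Run times and job
    sizes are left unchanged."""
    return [round(t * factor) for t in arrivals]

# Halving the inter-arrival times roughly doubles the offered load.
assert scale_arrivals([0, 100, 300], 0.5) == [0, 50, 150]
```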
2 Grid'5000 website: https://www.grid5000.fr
3 Parallel Workloads Archive: http://www.cs.huji.ac.il/labs/parallel/workload
4 Standard Workload Format: http://www.cs.huji.ac.il/labs/parallel/workload/swf.html
Figure 1.4: Parallel Job Fit (PaJFit) architecture and its main components, including the core components and the communication layer.
1.4.2 Scheduling System
We developed a system called Parallel Job Fit (PaJFit) to perform experiments in both simulated and actual execution modes. The discrete-event simulator provides the means to perform experiments under various conditions and over long runs that would not be possible in a real environment. The communication layer, based on sockets, allowed us to perform experiments in a real testbed. We used Java to implement the system, which currently consists of 70 classes and approximately 20 thousand lines of source code.
The PaJFit architecture is composed of a metascheduler, a resource provider, an event handler, and a job submission handler. For both the metascheduler and resource provider components, we implemented plug-ins to schedule message passing and bag-of-tasks applications. Figure 1.4 illustrates the main components, which we detail throughout the thesis. The event handler and communication layer components are responsible for differentiating the execution between simulated and actual modes. In simulated mode, the event handler contains a simulated clock and a list of events that are executed at each simulated time unit. In actual execution mode, each component, such as the metascheduler and the resource providers, contains its own clock and list of events to process. The communication layer is a Java Interface with two implementations: one based on method calls, in which the simulator contacts any component by calling Java methods, and another based on sockets, in which components running on different machines communicate through sockets.
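The two communication layer variants can be sketched as one abstract interface with an in-process implementation and a socket-based one. The sketch is in Python for brevity rather than the Java actually used in PaJFit, and all class, method, and component names are illustrative.

```python
import abc
import json
import socket

class CommunicationLayer(abc.ABC):
    """Plays the role of PaJFit's communication interface."""
    @abc.abstractmethod
    def send(self, component, message):
        """Deliver a message to a named component and return its reply."""

class MethodCallLayer(CommunicationLayer):
    """Simulated mode: 'messages' are plain method calls on component
    objects living in the same process as the simulator."""
    def __init__(self, components):
        self.components = components  # {name: object with handle()}
    def send(self, component, message):
        return self.components[component].handle(message)

class SocketLayer(CommunicationLayer):
    """Actual execution mode: components on different machines
    exchange JSON messages over TCP sockets."""
    def __init__(self, addresses):
        self.addresses = addresses  # {name: (host, port)}
    def send(self, component, message):
        with socket.create_connection(self.addresses[component]) as s:
            s.sendall(json.dumps(message).encode())
            s.shutdown(socket.SHUT_WR)  # signal end of request
            return json.loads(s.makefile().read())

class EchoProvider:
    """Toy component used to exercise the in-process layer."""
    def handle(self, message):
        return {"ack": message["op"]}

layer = MethodCallLayer({"provider0": EchoProvider()})
assert layer.send("provider0", {"op": "status"}) == {"ack": "status"}
```

The design pay-off is that scheduler code calls `send` and never knows whether it is running inside the simulator or on a real testbed.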
The implementation of the co-allocation algorithms is distributed between the scheduler and rescheduler of both the resource provider and metascheduler components, as illustrated in Figure 1.5. Note that, as we used the metascheduler as a mediator between providers during the rescheduling phase, the metascheduler also requires methods for rescheduling. An alternative implementation could distribute the rescheduling responsibilities among the resource providers. This second approach would increase the middleware complexity and would only be required for settings with a large number of providers, which is not the case in this thesis.
PaJFit also contains a simple but effective graphical user interface to visualise job
Figure 1.5: Algorithms implemented in the metascheduler and resource provider classes, including rescheduling with shift start time (with and without process remapping), coordinated and uncoordinated rescheduling, scheduling based on free time slots, scheduling with FIFO and EDF plus conservative backfilling, and scheduling based on execution offers (plain, with load balancing, and with double load balancing).
schedules (Figure 1.6), which proved to be an important debugging tool during the development of the co-allocation policies. Each resource provider window has a display with its job schedule, which can be zoomed in or out, and a list of scheduled jobs and their attributes, such as arrival time, run time estimate, and number of required processors. In simulated mode, it is possible to execute events one by one ("STEP" button), by an amount of simulated time ("STEP SIZE"), until the last event ("RUN ALL"), or until a specified simulated time ("RUN UNTIL").
1.5 Thesis Roadmap

[Thesis roadmap figure: Background (Chapter 2); Flexible Advance Reservations (Chapter 3); Adaptive Co-allocation for Message Passing Applications (Chapter 4); Automatic Process Mapping (Chapter 5); Offer-based Co-allocation for BoT Applications (Chapter 6); Adaptive Co-allocation for BoT Applications and Implementation of Resource Co-allocation (Chapter 7); all built on core middleware for co-allocation over distributed computing resources.]
during the course of the PhD candidature. The thesis chapters and their respective papers are the following:
• Chapter 4 presents the resource co-allocation model with rescheduling support for
message passing applications. The model consists of two operations that reduce
user response time and increase system utilisation:
• Chapter 7 presents the coordinated rescheduling for BoT applications and the impact of run time estimates when executing these applications across multiple providers. Results are based on both simulations and real executions using Grid'5000. This chapter also describes an example of application profiling using a ray-tracing tool:
Chapter 8 concludes the thesis with a discussion of our main findings and future research directions in the area of resource co-allocation and related areas.
Chapter 2

Background, Challenges, and Existing Solutions
One of the promises of distributed systems is the execution of applications across multiple resources. Several applications require coordinated allocation of resources hosted on autonomous domains, a problem known as resource co-allocation. This chapter describes and categorises existing solutions for the main challenges in resource co-allocation: distributed transactions, fault tolerance, network overhead, and schedule optimisation. The chapter also presents projects that have developed systems with resource co-allocation support, and positions the thesis in relation to existing research.
2.1 Introduction
When users require resources from multiple places, they submit requests called metajobs, also known as multi-site jobs, multi-cluster jobs, or co-allocation requests, to a metascheduler, which in turn contacts local schedulers to acquire resources (Figure 2.1). These metajobs are decomposed into a set of jobs or requests to be submitted to resource providers. In this thesis, a provider contains a cluster with a set of processors, and therefore we use provider and cluster interchangeably. A site is the physical location of a provider. We consider the scheduling to be on-line: users submit jobs to resource providers over time, and the schedulers make decisions based only on currently accepted jobs.
The metascheduler is software that runs either on the user's desktop machine or on a remote server. The local scheduler runs on the front-end node of each provider and is responsible for managing a scheduling queue to control resource access. The scheduling queue contains jobs, also known as requests, coming from local or remote sites. These jobs are specifications given by users containing the required number of processors and the estimated usage time. The empty spaces in the scheduling queue are called fragments.
When the metascheduler asks for resources, the local schedulers look for free time slots, which are fragments in the scheduling queue and empty spaces after the last expected completion time.

[Figure 2.1: A user submits a metajob, via a metascheduler on the local desktop or a remote server, to resource providers; each provider's local scheduler manages a queue containing local and remote jobs/requests over its resources and time.]
The metascheduler has to keep information about multi-site jobs, such as the total number of required processors and the location of the resource providers holding the jobs from the same metajob. An alternative is to associate with each job a list of the resource providers holding the other jobs from the same application. The first approach simplifies the middleware of the local schedulers, since they need to negotiate with and keep track of only a single entity, i.e. the metascheduler. However, such a centralised entity becomes a bottleneck when managing a large number of providers. The second approach has the opposite advantages and drawbacks.
When co-allocating resources for message passing applications, the metascheduler uses the free time slots to make advance reservations, whereas for BoT applications, the metascheduler uses the free time slots as an indicator of the number of tasks to be placed in each provider.
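The local scheduler's search for fragments and trailing free space can be sketched as follows, assuming the schedule is summarised as a step profile of free processors; this is a simplification for illustration, as real schedulers maintain richer queue structures.

```python
def first_fit_slot(profile, horizon, procs, runtime):
    """Find the earliest start time at which `procs` processors are
    free for `runtime` consecutive time units. `profile` is a sorted
    list of (time, free_processors) steps; each free-capacity value
    holds until the next step, and the last holds until `horizon`.
    The same scan finds both fragments inside the queue and the
    empty space after the last expected completion time."""
    times = [t for t, _ in profile] + [horizon]
    start = None
    for i, (t, free) in enumerate(profile):
        if free >= procs:
            if start is None:
                start = t           # a candidate slot begins here
            if times[i + 1] - start >= runtime:
                return start        # slot is wide enough: first fit
        else:
            start = None            # capacity dips below the request
    return None                     # no slot before the horizon

# 4 CPUs free at t=0, a 2-CPU fragment during t=10..30, all free after 30.
profile = [(0, 4), (10, 2), (30, 4)]
assert first_fit_slot(profile, 1000, procs=4, runtime=20) == 30
assert first_fit_slot(profile, 2000, procs=2, runtime=25) == 0
```

For message passing jobs the returned start would back an advance reservation; for BoT jobs the slot widths would instead suggest how many tasks fit at each provider.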
2.2 Challenges and Solutions

From the moment users request resources to the moment applications start execution, four challenges have to be addressed regarding resource co-allocation: distributed transactions, fault tolerance, inter-site network overhead, and schedule optimisation [90].
Distributed transactions is the first challenge discussed in this chapter. Resource
co-allocation involves the interaction of multiple entities, namely clients and resource
providers. More than one client may ask for resources at the same time from the same
providers. This situation may generate deadlocks if the resource providers use a locking procedure, or livelocks if there is a time out associated with the locks. Distributed transactions is the research area focused on avoiding deadlocks and livelocks, and on minimising the number of messages exchanged during these transactions.
Another common problem in the resource co-allocation field is that a failure in a single resource compromises the entire execution of an application that requires multiple resources at the same time. One approach to minimise this problem is defining a fault tolerance strategy that notifies applications of a problem with a resource. A software layer could then provide the application with a new resource, or discard the failed resource if it is not essential.
From the applications' perspective, one of the main problems when executing them over multiple clusters is the inter-cluster network overhead. Several parallel applications require inter-process communication, which may become a bottleneck due to the high latency of wide-area networks. Therefore, it is important to evaluate the benefits of multi-site execution and develop techniques for mapping application processes considering communication costs.
Scheduling multi-cluster applications is more complex than scheduling single-cluster applications due to the time dependencies among tasks. In addition, as some applications have more flexibility in how tasks are mapped to resources, the scheduler has to analyse more mapping options. For parallel applications with inter-process communication, the scheduler also has to take into account the network overhead. Moreover, the scheduling of a co-allocation request depends on the goals and policies of each resource provider.
When implementing and deploying a software system that supports resource co-allocation, developers initially face the first three problems mentioned above. Once a system is in production, schedule optimisation becomes one of the most important issues. Most of the work on co-allocation has focused on schedule optimisation, mainly evaluated by means of simulations.
In the next section, we describe in detail the solutions proposed for these four major problems in resource co-allocation, which mainly focus on message passing parallel applications. We also present relevant work on the scheduling of BoT applications, which assists in positioning the thesis regarding this application model. We also give an overview of each project before detailing their solutions. Some projects, especially those with middleware implementations, have faced more than one challenge. For these projects, we have included a section with a comparison of their features and limitations.
Table 2.1: Summary of research challenges and solutions for resource co-allocation (columns: Research Topic, Description, Solutions).
Some of the projects have focused on more than one aspect of resource co-allocation. However, each such project is described in the section of the research topic to which it made its most significant contribution.
local schedulers. The prepare message holds the resources, whereas the commit message
allocates the resources. There are variations of this protocol to enhance its functionalities
as described below.
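The basic prepare/commit exchange can be sketched as follows. The provider interface (`prepare`/`commit`/`release`) is a hypothetical simplification of the protocol messages, not the API of any of the systems discussed below.

```python
def two_phase_commit(providers, request):
    """Two-phase commit for co-allocation: 'prepare' asks every
    provider to hold the requested resources; only if all holds
    succeed does 'commit' turn them into allocations, otherwise
    every hold acquired so far is released."""
    held = []
    for p in providers:
        if p.prepare(request):
            held.append(p)
        else:
            for q in held:          # one refusal aborts the transaction
                q.release(request)
            return False
    for p in held:
        p.commit(request)
    return True

class FakeProvider:
    """Toy provider that refuses requests larger than its free capacity."""
    def __init__(self, free):
        self.free, self.state = free, "idle"
    def prepare(self, req):
        ok = self.free >= req
        self.state = "held" if ok else "refused"
        return ok
    def commit(self, req):
        self.state = "allocated"
    def release(self, req):
        self.state = "released"

a, b = FakeProvider(8), FakeProvider(2)
assert not two_phase_commit([a, b], request=4)  # b refuses, a is released
assert a.state == "released" and b.state == "refused"
```

The all-or-nothing outcome is exactly what prevents a metajob from holding processors at one site while lacking them at another.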
Kuo et al. [70] proposed a co-allocation protocol based on the two-phase commit protocol to support cancellations that may occur at any time. Their protocol supports nested configurations, i.e. a resource can be a co-allocator for other resource sets. However, it has no support for atomic transactions. Therefore, a transaction may reach a state where a reservation executes on some resources while other reservations are cancelled. They deal with race conditions in the request phase and propose a non-blocking protocol with a time out mechanism.
Takefusa et al. [115] extended the two-phase commit protocol by including polling
from the client to the server. The authors argued that although there is a communication
overhead between the client and server due to the polling, this non-blocking approach
allows asymmetric communication, and hence, the client does not need a global address.
Moreover, it eliminates firewall problems, avoids hang-ups because of server or client side
troubles, and enables the recovery of each process from a failure.
Deadlocks and livelocks are problems that may occur during a distributed transaction depending on the allocation protocol and computing environment. Park [94] introduced a decentralised protocol for co-allocating large-scale distributed resources, which is free from deadlocks and livelocks. The protocol is based on the Order-based Deadlock Prevention Protocol (ODP2), but with parallel requests in order to increase its efficiency. The protocol uses the IP address as the unique local identifier to order the resources. Another
approach to avoid deadlocks and livelocks is the exponential back-off mechanism, which does not require the ordering of resources. Jardine et al. [62] investigated such a mechanism for co-allocating resources.
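The back-off idea can be sketched as follows. The retry count, the doubling delay window, and the allocation callback are illustrative assumptions, and a real implementation would actually sleep for the computed delay between attempts.

```python
import random

def allocate_with_backoff(try_allocate, max_tries=5, base=1.0,
                          rng=random.random):
    """Exponential back-off for co-allocation conflicts: after a
    failed attempt, wait a random delay drawn from a window that
    doubles each retry, so competing metaschedulers are unlikely to
    collide repeatedly. Returns the number of attempts used, or
    None if every attempt fails. `try_allocate` stands in for a
    full prepare/commit round against the providers."""
    for attempt in range(max_tries):
        if try_allocate():
            return attempt + 1
        delay = rng() * base * (2 ** attempt)
        _ = delay  # a real system would time.sleep(delay) here
    return None

# The allocation round succeeds on the third attempt.
outcomes = iter([False, False, True])
assert allocate_with_backoff(lambda: next(outcomes)) == 3
```

Unlike ODP2-style ordering, no global resource identifiers are needed; randomised delays alone break the symmetry between competing requesters.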
Service Negotiation and Acquisition Protocol (SNAP) is a well-known protocol aimed at managing access to and use of distributed computing resources in a coordinated fashion by means of Service Level Agreements (SLAs) [36]. SNAP coordinates resource management through three types of SLAs, which separate task requirements, resource capabilities, and the binding of tasks to resources. From the moment users identify target resources to the moment they submit tasks, other users may access the chosen resources. This happens because information obtained from the providers may become out-of-date between the selection and the actual submission of tasks. In order to solve this problem, Haji et al. [55] developed a Three-Phase commit protocol for SNAP-based brokers. The key feature of their protocol is the use of probes, which are signals sent from the providers to the candidates interested in the same resources, so that they are aware of changes in resource status.
Maclaren [78] also proposed a Three-Phase Commit Protocol, based on the Paxos consensus algorithm [53]. In this algorithm, the coordinator responsible for receiving confirmation answers from resource providers is replaced with a set of replicated processes called Acceptors. A leader process coordinates the acceptor processes to agree on a value or condition. Any acceptor can act as the leader and replace the leader if it fails. This algorithm allows messages to be lost, delayed, or even duplicated. Therefore, the Paxos Commit protocol is a valuable algorithm when considering fault tolerance in distributed transactions for co-allocating resources in Grids.
Jobs can also allocate resources in a pull manner, which is the approach described by Azougagh et al. [9], who introduced the Availability Check Technique (ACT) to reduce conflicts during the process of resource co-allocation. Conflicts are generated when multiple jobs try to allocate two or more resources in a crossing pattern simultaneously, resulting in deadlocks, starvation, and livelocks. Rather than allocating resources, jobs wait for updates from resource providers until they fulfil their requirements.
Czajkowski et al. [35] proposed a layered architecture to address failures in co-allocation requests. The architecture has two co-allocation methods: Atomic Transaction and Interactive Transaction. In the atomic transaction, all the required resources are specified at request time. The request succeeds if all resources are allocated; otherwise, the request fails and none of the resources is acquired. The user can modify the co-allocation content until the request initialises. In the interactive transaction method, the content of a co-allocation request can be modified via add, delete, and substitute operations. Resources are classified in three categories: required (failure or time out of this type of resource causes the entire computation to be terminated, similar to the atomic operation); interactive (failure or time out of a resource results in a call-back to the application, which can delete the resource or substitute another one, i.e. the resource is not essential or it is easy to find replacements); and optional (failure or time out is ignored). A similar approach was explored by Sinaga et al. for the DUROC system [108]. The authors extended DUROC to keep trying to schedule jobs until they could acquire all the required resources, or until the number of attempts reached a certain threshold.
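The resource-category semantics above can be sketched as follows; this is a hedged illustration of the idea, and the function and category names are ours, not from [35]:

```python
def co_allocate(request, acquire, substitute):
    """Acquire the resources in `request` (a list of (resource, category) pairs).

    'required'    : failure aborts the whole request (atomic-like semantics)
    'interactive' : failure triggers a call-back that proposes a replacement
    'optional'    : failure is simply ignored
    Returns the list of acquired resources, or None if the request fails.
    """
    acquired = []
    for resource, category in request:
        if acquire(resource):
            acquired.append(resource)
        elif category == "required":
            return None
        elif category == "interactive":
            replacement = substitute(resource)        # call-back to application
            if replacement is None or not acquire(replacement):
                return None
            acquired.append(replacement)
        # 'optional' failures fall through with no effect
    return acquired

up = {"cpu-a", "cpu-b", "net-backup"}                 # resources that respond
result = co_allocate(
    [("cpu-a", "required"), ("net-main", "interactive"), ("viz", "optional")],
    acquire=lambda r: r in up,
    substitute=lambda r: "net-backup",                # application's fallback
)
print(result)   # ['cpu-a', 'net-backup']
```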
The Globus Architecture for Reservation and Allocation (GARA) was one of the first co-allocation systems to consider Quality of Service (QoS) guarantees [51]. The main goal of GARA was to provide resource access guarantees through advance reservations. GARA introduced the concept of backtracking: when a resource fails, the system tries other resources until the request succeeds or fails.
Röblitz and Reinefeld [97] presented a framework to manage reservations for applications running concurrently on multiple sites and for applications with components that may be linked by temporal or spatial relationships, such as job flows. They defined and described co-reservations along with their life cycle, and presented an architecture for processing co-reservation requests with support for fault tolerance. When handling confirmed co-reservations, as part of the requested resources may not be available, alternative ones may have to be selected.
20 Chapter 2. BACKGROUND, CHALLENGES, AND EXISTING SOLUTIONS
Table 2.4: Summary of methods and goals for network overhead evaluations.
The Message Passing Interface (MPI) has been broadly used for developing parallel applications in single-site environments. However, executing these applications in multi-site environments imposes different challenges due to network heterogeneity: intra-site communication has much lower latency than inter-site communication. There are several MPI implementations, such as MPICH-VMI [93], MPICH-Madeleine [7], and MPICH-G2 [65], that take network heterogeneity into account and simplify the application development process.
2.2. CHALLENGES AND SOLUTIONS 21
Ernemann et al. [45] studied the benefits of sharing jobs among independent sites and of executing parallel jobs across multiple sites. When co-allocating resources, the scheduler looks for a site that has enough resources to start the job. If no such site exists, the scheduler sorts the sites in descending order of free resources and allocates resources in this order to minimise the number of combined sites. If it is not possible to map the job, the scheduler queues the job using Easy Backfilling [85]. The authors varied the network overhead from 0 to 40% and concluded that multi-site applications reduce average weighted response time when the communication overhead is limited to about 25%. This threshold has been used in most of the subsequent work that considers network overhead for co-allocation.
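The placement rule described above can be sketched roughly as follows. This is our own reconstruction of the idea, not the authors' implementation, and the site names and sizes are invented:

```python
def map_job(job_size, free):
    """Map a job onto sites; `free` maps site name -> free processors.

    Prefer a single site that can host the whole job; otherwise combine
    sites in descending order of free resources, so the job spans as few
    sites as possible. Returns a site -> processors mapping, or None.
    """
    for site, n in free.items():                      # single-site placement
        if n >= job_size:
            return {site: job_size}
    mapping, remaining = {}, job_size                 # multi-site placement
    for site, n in sorted(free.items(), key=lambda kv: -kv[1]):
        take = min(n, remaining)
        if take:
            mapping[site] = take
            remaining -= take
        if remaining == 0:
            return mapping
    return None   # not enough resources anywhere: the job would be queued

print(map_job(10, {"s1": 4, "s2": 6, "s3": 3}))   # {'s2': 6, 's1': 4}
print(map_job(5,  {"s1": 4, "s2": 6, "s3": 3}))   # {'s2': 5}
```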
Bucur et al. [20] investigated the feasibility of executing parallel applications across wide-area systems. Their evaluation takes as input parameters the structure and size of jobs, the scheduling policy, and the communication speed ratio between intra- and inter-cluster links. They investigated various scheduling policies and concluded that when the ratio between inter- and intra-cluster communication speed is 50, it is worth co-allocating resources instead of waiting for all resources to become available in a single cluster.
Jones et al. [64] proposed scheduling strategies that use available information on network link utilisation and job communication topology to define job partition sizes and job placement. Rather than assuming a fixed amount of time for all inter-cluster communication or assigning execution time penalties for the network overhead, the authors considered that inter-cluster bandwidth changes over time with the number and duration of multi-site executions in the environment. Therefore, they explored the scheduling of multiple co-allocation jobs sharing the same computing infrastructure. As for the co-allocation strategies, the authors investigated:
• First-Fit, which performs resource co-allocation by assigning tasks starting with the cluster having the largest number of free nodes, and uses no information about either the job communication characterisation or network link saturation;
• Link Saturation Level Threshold Only, which is similar to First-Fit but discards clusters with saturated links;
• Link Saturation Level Threshold with Constraint Satisfaction, which tries to place jobs into a large portion of a single cluster (e.g. 85% of its resources); and
• Integer Constraint Satisfaction, which uses the jobs' communication characterisation and current link utilisation to prevent link saturation.
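A rough sketch of the second strategy, link-saturation filtering on top of First-Fit, is given below. The threshold value and data structures are our own assumptions, not taken from [64]:

```python
def place_with_saturation_filter(job_size, clusters, saturation=0.9):
    """First-Fit over clusters ordered by free nodes, skipping clusters
    whose inter-cluster link utilisation is at or above `saturation`.

    `clusters`: list of (name, free_nodes, link_utilisation) tuples.
    """
    usable = [c for c in clusters if c[2] < saturation]
    mapping, remaining = {}, job_size
    for name, free, _util in sorted(usable, key=lambda c: -c[1]):
        take = min(free, remaining)
        if take:
            mapping[name] = take
            remaining -= take
        if remaining == 0:
            return mapping
    return None   # cannot place the job without using a saturated link

clusters = [("c1", 8, 0.95), ("c2", 6, 0.20), ("c3", 4, 0.10)]
print(place_with_saturation_filter(8, clusters))   # {'c2': 6, 'c3': 2}
```

Note how c1, despite having the most free nodes, is discarded because its link utilisation exceeds the threshold.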
Jones et al. [64] concluded that it is possible to reduce multi-site jobs' response time by using information on network usage and the jobs' network requirements. In addition, they concluded that this performance gain depends heavily on the characteristics of the arriving workload stream.
Mohamed and Epema [83] addressed the problem of co-allocating processors and data. They presented two features of their metascheduler, namely different job priority levels and incremental claiming of processors. The metascheduler may not be able to find enough resources when jobs claim them. In this case, if a job j claiming resources has high priority, the metascheduler verifies whether the number of processors used by low-priority jobs is enough to serve job j. If it is, the metascheduler preempts the low-priority jobs in descending order until enough resources are released, and moves the preempted jobs into the low-priority placement queue. The metascheduler uses the Close-to-Files (CF) job-placement algorithm to select target sites for job components [82]. The CF algorithm attempts to place jobs at the sites where the estimated delay of transferring the input file to the execution sites is minimal.
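The CF selection rule can be sketched as follows; this is a simplified illustration of the idea in [82], and the bandwidth model, site names, and numbers are ours:

```python
def close_to_files(processors_needed, input_size_gb, sites):
    """Pick the site with the smallest estimated input-file staging delay.

    `sites`: dict name -> (free_processors, bandwidth_gb_per_s to the file).
    Only sites with enough free processors are considered.
    """
    delays = {
        name: input_size_gb / bandwidth
        for name, (free, bandwidth) in sites.items()
        if free >= processors_needed
    }
    return min(delays, key=delays.get) if delays else None

sites = {"s1": (16, 0.1), "s2": (32, 1.0), "s3": (8, 10.0)}
# s3 has the fastest link to the file but too few processors, so s2 wins.
print(close_to_files(16, 50, sites))   # 's2'
```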
tasks to resources assuming that all the applications hold all the required resources for their entire execution. The second phase is the run-time adaptation, where the scheduler maps tasks according to the actual computation and communication costs, which may differ from the estimated costs used in the first phase. In addition, applications may release a portion of their resources before they finish. The authors considered the scheduling of a set of applications rather than a single one (batch mode). Their optimisation criterion was to minimise the completion time of the last application, i.e. the makespan. They modelled the applications as Directed Acyclic Graphs (DAGs) and used graph theory to optimise the mapping of tasks.
Ernemann et al. [46] studied the effects of applying constraints for job decomposition when scheduling multi-site jobs. These constraints limit the number of processes for each site (lower bound) and the number of sites per job. When selecting the number of processors used at each site, they sort the list of sites by decreasing number of free nodes in order to minimise the number of fragments per job. The decision to use multiple sites or a single site to execute the application is automatic and depends on the load of the clusters. In their study, a lower bound of half of the total number of available resources appeared to be beneficial in most cases. Their evaluation considers the network overhead for multi-site jobs: they summarised the overhead caused by communication and data migration as an increase in the job's run time.
Azzedin et al. [10] proposed a co-allocation mechanism that requires no advance reservations. Their main argument for this approach is the strict timing constraints that advance reservations impose on the client side, i.e. once a user requests an allocation, the initial and final times are fixed. Consequently, advance reservations generate fragments that schedulers cannot utilise. Furthermore, the authors argued that a resource provider can reject a co-allocation request at any time in favour of internal requests, and hence the co-allocation would fail. Their scheme, called synchronous queuing (SQ), synchronises jobs at scheduling cycles, or more often, by speeding them up or slowing them down.
Li and Yahyapour [75] introduced a negotiation model that supports co-allocation. They extended a bilateral model, which consists of a negotiation protocol, utility functions or preference relationships for the negotiating parties, and a negotiation strategy. For the negotiation protocol, the authors adopted and modified Rubinstein's sequential alternating offer protocol, in which players bargain at certain times. In each period, one of the players proposes an agreement and the other player either accepts or rejects it. If the second player rejects, it presents a counter-proposal, which the first player in turn accepts or rejects. This negotiation continues until an agreement between the parties is established or the negotiation period expires. They evaluated the model with different input parameters for prices, negotiation behaviours, and optimisation weights.
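The alternating-offer pattern can be sketched as a toy loop. The price values, concession strategies, and acceptance rules below are invented for illustration and are not taken from [75]:

```python
def negotiate(buyer_offer, seller_offer, buyer_accepts, seller_accepts,
              max_rounds=10):
    """Rubinstein-style alternating offers: the parties take turns proposing
    until one side accepts or the negotiation period expires."""
    for rnd in range(max_rounds):
        if rnd % 2 == 0:                       # buyer's turn to propose
            offer = buyer_offer(rnd)
            if seller_accepts(offer):
                return offer
        else:                                  # seller's turn to propose
            offer = seller_offer(rnd)
            if buyer_accepts(offer):
                return offer
    return None                                # negotiation period expired

# Buyer concedes upward, seller concedes downward, as rounds pass.
price = negotiate(
    buyer_offer=lambda r: 50 + 5 * r,
    seller_offer=lambda r: 100 - 5 * r,
    buyer_accepts=lambda p: p <= 75,
    seller_accepts=lambda p: p >= 80,
)
print(price)   # 75: the seller's round-5 offer is the first acceptable one
```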
Sonmez et al. [113] presented two job placement policies that take into account the wide-area communication overhead when co-allocating applications across multiple clusters. The first policy is Cluster Minimisation, in which users specify how to decompose jobs and the scheduler maps as many job components as possible to each cluster according to its processor availability (most available processors first). The second policy is Flexible Cluster Minimisation, in which users specify only the number of required processors and the scheduler fills the maximum number of processors in each cluster. The main goal of these two policies is to minimise the number of clusters involved in a co-allocation request in order to reduce the wide-area communication overhead. The authors implemented these policies in their system, called KOALA, and evaluated several metrics, including the average response time, wait time, and execution time of user applications. Their policies do not use advance reservations; instead, at regular time intervals, the scheduler looks for idle nodes for the co-allocation requests in the waiting queues.
Bucur et al. [21, 22] investigated scheduling policies on various queuing structures for resource co-allocation in multi-cluster systems. They evaluated the differences between having a single global scheduler, only local schedulers, or both together, as well as different priorities for local and meta jobs. They used First Come First Serve in the scheduling queues. They concluded that multi-site applications should not spend more than 25% of their time on wide-area communication and that there should be restrictions on how to decompose multi-site jobs in order to produce better schedules.
Elmroth and Tordsson [44] modelled the co-allocation problem as a bipartite graph-matching problem, where tasks can be executed on specific resources and have different requirements. Their model relies on advance reservations with flexible time intervals. They explored a relaxed notion of simultaneous start time, where job components may start within a short period of one another. When a resource provider cannot grant an advance reservation, it suggests a new feasible reservation, identical to the rejected one but with a later start time. They presented an algorithm to schedule all the jobs within the start window interval, which tries to minimise the jobs' start times.
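The interaction can be sketched as a simple retry loop. This is our reading of the pattern, not JSS code; `provider_earliest` is a stand-in for the provider's admission logic:

```python
def reserve_within_window(request_start, window_end, provider_earliest):
    """Try to reserve at `request_start`; on rejection, retry at the
    provider's counter-offer (an identical reservation with a later start),
    staying inside the start window [request_start, window_end]."""
    t = request_start
    while t <= window_end:
        feasible = provider_earliest(t)   # provider's earliest start >= t
        if feasible == t:
            return t                      # reservation granted
        t = feasible                      # adopt the counter-offer and retry
    return None                           # no feasible start inside the window

# Toy provider: busy until time 40, so earlier requests are countered with 40.
print(reserve_within_window(10, 60, lambda t: max(t, 40)))   # 40
print(reserve_within_window(10, 30, lambda t: max(t, 40)))   # None
```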
Decker and Schneider [40] investigated resource co-allocation as part of workflow
tasks that must be executed at the same time. They extended the HEFT (Heterogeneous
Earliest-Finish-Time) algorithm to find a mapping of tasks to resources in order to min-
imise the schedule length (makespan), to support advance reservations and co-allocation,
and to consider data channel requirements between two activities. They observed that
most of the workflows were rejected because no co-allocation could be found that covered
all activities of a synchronous dependency or because there was not enough bandwidth
available for the data channels. Therefore, they incorporated a backtracking method, which considers not only the earliest feasible allocation slot for each activity that is part of a co-allocation requirement, but also all possible allocation ranges.
Siddiqui et al. [105] introduced a mechanism for capacity planning to optimise user QoS requirements in a Grid environment. Their mechanism supports negotiation and is based on advance reservations. A co-allocation request contains sub-requests that are submitted to the resource providers, which in turn send counter-offers when users and resource providers cannot establish an agreement.
GARA (The Globus Architecture for Reservation and Allocation) enables applications to
co-allocate resources, which include networks, computers, and storage. GARA uses ad-
vance reservations to support co-allocation with Quality-of-Service and uses backtracking
to handle resource failure [51]. GARA was one of the first projects to consider QoS for
co-allocation requests.
OAR is the batch scheduler used in Grid'5000 [25, 26]. OAR uses a simple all-or-none policy to co-allocate resources using advance reservations. One of the main design goals of OAR is the use of high-level tools to keep software complexity low.
KOALA is a grid scheduler that has been deployed on the DAS-2 and DAS-3 multi-cluster systems in the Netherlands [83, 84]. KOALA users can co-allocate both processors and files located in autonomous clusters. KOALA supports malleable jobs, which can receive messages to expand and reduce their number of processors at application run time, and has fault tolerance mechanisms based on flexible resource selection.
JSS (Job Submission Service) is a tool for resource brokering designed for software com-
ponent interoperability [44]. JSS has been used in NorduGrid and SweGrid, and relies
on advance reservations for resource co-allocation. These advance reservations are flex-
ible, i.e. users can provide a start time interval for the allocation. JSS also considers
time prediction for file staging when ranking resources to schedule user applications. JSS
does not access the resource provider scheduling queues to decide where to place the ad-
vance reservations. Thus, the co-allocation is based on a set of interactions between the
metascheduler and resource providers until the co-allocation can be accomplished.
2.3. SYSTEMS WITH RESOURCE CO-ALLOCATION SUPPORT 27
Table 2.6 summarises the main features of each system according to each co-allocation challenge. The systems have relied on different methods to deal with distributed transactions and fault tolerance problems. Most of them have no support for scheduling applications considering network overhead; therefore, it is the user who has to deal with this problem at the application level. Regarding schedule optimisation, most of the systems rely on advance reservations.
Limited information access to the metascheduler. Resource providers can use execution offers when they are not willing to disclose private information such as local load, resource capabilities, and scheduling strategies. Bag-of-tasks (BoT) applications are one of the main application models that can execute across multiple providers. Co-allocation for BoT applications is important since the results produced by all tasks constitute the solution of a single problem. The closest work to our execution offer-based scheduling comes from Elmroth and Tordsson [44], who proposed a co-allocation algorithm that relies on the interaction between resource providers and the metascheduler, and from Singh et al. [109], who investigated the scheduling of tasks with deadline constraints, but focused on single-cluster environments and provided no feedback when deadlines cannot be satisfied.
2.6 Conclusions
This chapter described the main research efforts in the area of resource co-allocation.
These efforts involve four research directions: distributed transactions, fault tolerance,
evaluation of inter-site network overhead, and schedule optimisation. We have presented
existing work for each of these research directions. We have also described and compared
six systems that support resource co-allocation.
When implementing real systems for deployment in production environments, properly managing distributed transactions becomes an important issue. In terms of fault tolerance, co-allocation systems have supported the notion of optional and alternative resources, which allows the scheduler to remap the application to other resources in case of failures. As for wide-area network overhead, most of the existing work that considers it has used 25% of the execution time as the threshold in experiments. In addition, researchers have been considering the location of data and computing resources when scheduling multi-site applications. This is particularly necessary when scheduling data-intensive applications.
Most of the work on resource co-allocation focuses on schedule optimisation, which has been mainly based on the use of advance reservations. When scheduling multi-site applications, there are several factors to take into account. Apart from the network overhead and fault tolerance aspects, scheduling relies on the amount of information available for finding a placement for jobs. The use of a global queue or autonomous queues has a considerable influence on the scheduling strategies. Fortunately, several researchers have been considering the use of autonomous queues in their experiments, which is a fundamental characteristic of large-scale computing environments.
Although several researchers are working on scheduling policies for co-allocation requests, we have observed that most groups that developed middleware systems with co-allocation support use simple scheduling techniques. That is because several technical difficulties must still be overcome before more advanced scheduling policies can be deployed. Some of these technical problems are the interoperability between metaschedulers and resource providers' middleware, inter-site network overhead, and the autonomous policies of each resource provider.
This chapter also presented the thesis position in relation to existing work. The main
research direction of this thesis is the rescheduling of co-allocation requests for message
passing and bag-of-tasks applications. The next chapters describe in detail our contri-
butions for this research direction, starting by understanding the impact of rescheduling
advance reservations in a single provider.
Chapter 3
Advance reservations are important building blocks for the coordinated allocation of resources hosted by multiple providers. This chapter presents a detailed study of reservations with flexible time intervals, using system utilisation as the main performance metric. The study evaluates four scheduling algorithms, along with measurements of how long reservations remain flexible and of alternative offers made by resource providers when reservations cannot be granted. The results show the importance of rescheduling advance reservations in single-site environments, which motivates a further study for multi-site applications in the following chapters.
3.1 Introduction
Advance reservations are an allocation mechanism to ensure resource availability in fu-
ture. Resources can be display devices for a meeting, network channels required for data
transmission, and computers from multiple clusters to execute a parallel application.
When a provider accepts an advance reservation, the user expects to be able to access
the agreed resources at the specified time. However, changes may arise in the scheduling
queue between the time the user submits the reservation to the time the user receives the
resources. There are a number of reasons for such changes including: users cancelling
and modifying requests, resource failures, and errors in estimating usage time. When
reservations are not rescheduled, scheduling queues increase their fragmentation. This
fragmentation reduces the potential scheduling opportunities and results in lower utilisa-
tion. Indeed, fragmentation also limits the positions in which other jobs can be scheduled.
In order to minimise fragmentation due to advance reservations, researchers in this area
have introduced and investigated the impact of flexible time intervals for advance reser-
vations [30, 47, 67, 86, 98].
34 Chapter 3. FLEXIBLE ADVANCE RESERVATIONS
1. Strict interval: Users require resources at the same amount of time as the interval
length and hence there is no flexibility permitted to the scheduler. This scenario
maps well to the availability of a physical resource that may need to be booked for
a specific period.
2. Strict deadline: Users require that the execution completes prior to a deadline. This
scenario typically applies when there are subsequent dependencies on the results of
a given computation.
3. Flexible interval: There is a strict start and finish time, but the time between these two points exceeds the length of the computation. This scenario fits well with forward and backward timing dependencies, such as those encountered in a workflow computation.
3.2. RESERVATION SPECIFICATION 35
Figure 3.2: Reschedule of workflow tasks due to inaccurate run time estimations.
• $f^{mol}_j : R_j \rightarrow T^e_j$: moldability function that specifies the relation between the number of resources $R_j$ and the execution time $T^e_j$;
3.3.1 Sorting
We separate the jobs currently allocated into two queues: the running queue $Q_r = \{o_1, \ldots, o_u\}$, $u \in \mathbb{N}$, and the waiting queue $Q_w = \{j_1, \ldots, j_n\}$, $n \in \mathbb{N}$ [88]. The first queue contains jobs
already in execution that cannot be rescheduled. The second queue contains jobs that
can be rescheduled. The approach we adopt here is to try to reschedule the jobs in the
waiting queue by sorting them first and then attempting to create a new schedule. We use
five sorting techniques: Shuffle, First In First Out (FIFO), Biggest Job First (BJF), Least
Flexible First (LFF), and Earliest Deadline First (EDF). The only sorting criterion that
needs explanation is LFF, which sorts the jobs according to their flexibility in terms of
time intervals. This approach is based on the work from Wu et al. [127], but considers
only the time intervals. We define the time flexibility $\Delta_j$ of a job $j$ as follows:

$$\Delta_j = \begin{cases} D_j - \max(T^r_j, CT) - T^e_j & \text{for advance reservation jobs} \\ D_j - CT - T^e_j & \text{for jobs with deadline} \end{cases}$$
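The flexibility measure above, and the LFF ordering built on it, can be sketched as follows. The dictionary field names and the sample values are illustrative; `Tr`, `Te`, and `D` follow the thesis notation for $T^r_j$, $T^e_j$, and $D_j$:

```python
def flexibility(job, current_time):
    """Time flexibility Δ_j as defined above."""
    if job["advance_reservation"]:
        return job["D"] - max(job["Tr"], current_time) - job["Te"]
    return job["D"] - current_time - job["Te"]     # deadline-constrained job

def lff_sort(waiting_queue, current_time):
    """Least Flexible First: schedule jobs with the smallest Δ_j first."""
    return sorted(waiting_queue, key=lambda j: flexibility(j, current_time))

queue = [
    {"id": "j1", "advance_reservation": True,  "Tr": 20, "Te": 10, "D": 60},
    {"id": "j2", "advance_reservation": False, "Tr": 0,  "Te": 15, "D": 30},
]
# At time 5: Δ_j1 = 60 - max(20, 5) - 10 = 30, and Δ_j2 = 30 - 5 - 15 = 10.
print([j["id"] for j in lff_sort(queue, 5)])   # ['j2', 'j1']
```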
3.3.2 Scheduling
Algorithm 1 presents the pseudo-code for scheduling a new job $j_k$ at the current time $CT$, returning true if it is possible to schedule it, or false and a list of alternative possible schedules otherwise. The list of jobs in $Q_w$ is sorted by a given criterion (e.g. EDF or LFF). Before a new job is scheduled, the state of the system is consistent, which means that the current schedule of all jobs meets the users' QoS requirements. Therefore, during the scheduling, if a job $j_i$ is rejected there are two options: (i) $j_i = j_k$, i.e. the new job cannot be scheduled; or (ii) $j_i \neq j_k$, i.e. the new job is scheduled but generates a scheduling problem for another job $j_i \in Q_w$. In the second case, we swap the positions of $j_k$ and $j_i$, and all jobs between $j_k$ and $j_i$ go back to their original schedule, a function that we call fixqueue.
Each job is scheduled using the first-fit approach with conservative backfilling: the first available time slot is assigned to the job [85]. For jobs with a deadline, the scheduler looks for a time slot within the interval $[CT, D_j - T^e_j]$, and for advance reservations, the scheduler looks for a time slot within the interval $[T^r_j, D_j - T^e_j]$.
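The slot search can be sketched as follows. This is a minimal illustration of the interval rule only; the `fits` predicate abstracts the processor-availability check, and integer time steps are an assumption of the sketch, not of the thesis:

```python
def find_slot(exec_time, earliest, latest_start, fits):
    """Return the first start time t in [earliest, latest_start] such that
    the job fits in [t, t + exec_time), or None (first fit)."""
    for t in range(earliest, latest_start + 1):
        if fits(t, t + exec_time):
            return t
    return None

def slot_for_job(job, current_time, fits):
    """Search [CT, D - Te] for deadline jobs, [Tr, D - Te] for reservations."""
    earliest = job["Tr"] if job["advance_reservation"] else current_time
    return find_slot(job["Te"], earliest, job["D"] - job["Te"], fits)

# Toy availability: processors are busy until time 12.
job = {"advance_reservation": False, "Tr": 0, "Te": 5, "D": 30}
print(slot_for_job(job, 0, lambda start, end: start >= 12))   # 12
```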
When job $j_k$ is rejected, all jobs in $Q_w$ after $j_k$, including $j_k$ itself, must be rescheduled (Algorithm 2). However, in this rescheduling phase, other options are used to reschedule $j_k$. The list of options $\Psi$ is generated based on the intersection of the new job $j_k$ with the jobs in the running queue and the jobs in the waiting queue that are before $j_k$ (Figure 3.4). The list $\Psi$ contains scheduling options defining start time, duration, and number of processors. For each job $j_i$ that intersects $j_k$, job $j_k$ is tested before $T^s_i$ and after $D_i$. Once the list of options $\Psi$ is generated, the scheduler sorts it according to the percentage difference $\varphi$ between the original $T^s_j$ and $D_j$ values and the alternative scheduler-suggested options $OPT\_T^r_j$ and $OPT\_D_j$:

$$\varphi_{opt} = \begin{cases} \dfrac{OPT\_D_j - D_j}{T^e_j} & \text{option generated by placing } j_k \text{ after } j_i \\[2ex] \dfrac{OPT\_T^r_j - T^r_j}{T^e_j} & \text{option generated by placing } j_k \text{ before } j_i \end{cases}$$
Figure 3.4: Example of alternative options generated when an advance reservation cannot
be granted.
Once the possible positions of the new job $j_k$ are defined, all jobs in $Q_w$ after $j_k$ (including $j_k$) are rescheduled. If a job $j_i$ is rejected, there are again two options: (i) $j_i = j_k$, the new job cannot be scheduled; or (ii) $j_i \neq j_k$, the new job is scheduled but delays another job $j_i \in Q_w$. In contrast to Algorithm 1, in Algorithm 2, when $j_i = j_k$, the scheduler has already tried all the possibilities to fit $j_k$ in the queue, and hence $j_k$ is not rescheduled again. However, if $j_i \neq j_k$, then the queue $Q_w$ is fixed, the index of $j_k$ is updated, $T^r_k$ and $D_k$ are set to their original values, and the rest of $Q_w$ is rescheduled again. This process finishes when there are no more scheduling options to test. The first successful option is enough for a user who does not require an advance reservation.
Algorithm 2: Pseudo-code for rescheduling the rejected part of $Q_w$ using the list of options $\Psi$ for the rejected new job $j_k$.
1   $OT^r_k \leftarrow T^r_k$, $OD_k \leftarrow D_k$  {keep original values}
2   for all $OPT \in \Psi$ do
3       $jobscheduled \leftarrow$ true
4       for all $j_i \in Q_w \mid i \geq k$ and $jobscheduled =$ true do
5           if $j_i = j_k$ then
6               set $T^r_k$ and $D_k$ with option $OPT$
7           $jobscheduled \leftarrow schedule(j_i)$
8           if $jobscheduled =$ false then
9               if $i \neq k$ then
10                  $fixqueue(Q_w, i, k)$
11                  $T^r_k \leftarrow OT^r_k$, $D_k \leftarrow OD_k$  {restore original values}
12                  return reschedule $\forall j_i \in Q_w \mid i \geq k$
13              else
14                  return false  {already tested new options for $j_k$}
15          else
16              {valid option $OPT$ in $\Psi$: inform user about this possibility}
3.4 Evaluation
The goal of the evaluation is to observe the impact of rescheduling advance reservations
on the system utilisation. We evaluated the use of flexible QoS parameters for advance
reservations on the PaJFit simulator.
bution with λ = 5 (mean of the delay factor), and $T^{sub}_j$ is the request submission time defined in the workload traces. As we are working with advance reservations, we defined the release time of jobs as $T^r_j = D_j - T^e_j$. To model higher loads and the subsequent performance of the scheduler, we increased the frequency of request submissions from the trace by 25% and 50%.
We also analysed four flexible interval sizes, which we again define as a Poisson distribution: fixed interval, short interval (λ ← φ = 25%), medium interval (λ ← φ = 50%), and long interval (λ ← φ = 100%). For all experiments using flexible intervals, we modified only half of each workload; the other half continues to have fixed intervals. We believe a portion of users would still continue to specify strict deadlines.
[Figure 3.5 (plot): system utilisation gain (%) of the LFF, BJF, EDF, and FIFO sorting criteria for fixed, short, medium, and long time intervals.]
Users may want to know with some assurance when their jobs will execute. They
can ask the resource provider to fix their jobs when the time to receive the resources
Figure 3.7: Impact of time interval size on resource utilisation with inaccurate run time
estimations.
gets closer, i.e. remove the time interval flexibility by renegotiating the reservation. We evaluated the system utilisation by fixing the $T^r_j$ and $D_j$ of each job $j$ when 25%, 50%, and 75% of the waiting time had passed. We compared these results with an approach that fixes the schedule immediately after the job is accepted.
As in the first set of experiments (Figure 3.5), we performed runs for different workloads. However, in this case the results for all workloads are similar; therefore, we only present the graph for the medium workload in Figure 3.8. We observe that the longer users wait to fix their job requirements, the better the system utilisation, since the scheduler has more opportunities to reschedule the workload.
Figure 3.8: Effects of the duration the advance reservations are flexible.
Instead of using flexible intervals to meet time QoS requirements of users, we wanted
to see what would happen when the resource provider offered an alternative slot to the
user. When the resource provider cannot schedule a job j with the required starting time,
Figure 3.9: System utilisation using suggested option from resource provider.
Figure 3.10: Average actual φ of jobs accepted through resource provider’s suggestion.
it provides the user with other options (if possible) before and after the interval $[T^r_j, D_j]$. We selected the option with the lowest difference $\varphi$ for each job $j$, given thresholds of 25%, 50%, and 100%. Figure 3.9 shows that while this approach does increase the system utilisation, it does not perform as well as the flexible interval technique. Nevertheless, returning to the user with an alternative option is a useful technique for users who cannot accept flexible intervals.
We also measured the difference between the actual and the thresholds φ for the jobs
accepted through the option suggested by the resource provider. From Figure 3.10 we
observe that in average, the value of actual φ is significantly less than the maximum φ
defined by the resource provider. This means that even when users let providers choose
options that are not close to the original request, the scheduler has flexibility to find alter-
native options better than the user threshold.
3.5 Conclusions
In this chapter we outlined user scenarios for advance reservations with flexible and adap-
tive time QoS parameters and presented the benefits for resource providers in terms of
system utilisation. We evaluated these flexible advance reservations by using different
scheduling algorithms, and different flexibility and adaptability QoS parameters. We in-
vestigated cases where users do not or cannot specify the execution time of their jobs
accurately. We also examined resource providers that do not utilise flexible time QoS pa-
rameters, but rather return alternative scheduling options to the user when it is not possible
to meet the original QoS requirements.
In our experiments we observed that system utilisation increases with the flexibility of
request time intervals and with the time the users allow this flexibility while they wait in
the scheduling queue. This benefit is mainly due to the ability of the scheduler to rearrange
the jobs in the scheduling queue, which reduces the fragmentation generated by advance
reservations. This is particularly true when users overestimate application run time.
The results presented in this chapter are a solid foundation for scheduling applications
that require co-allocation using advance reservations. Unlike single-site advance
reservations, when rescheduling applications on multiple sites, resource providers have to
keep all the advance reservations of a co-allocation request synchronised. In spite of this
complexity, we will see in the following chapter that rescheduling applications that require
multiple advance reservations brings benefits.
Chapter 4
Adaptive Co-Allocation for Message Passing Applications
This chapter proposes adaptive co-allocation policies based on flexible advance reser-
vations and process remapping. Metaschedulers can modify the start time of each job
component and remap the number of processors they use in each site. The experimen-
tal results show that local jobs may not fill all the fragments in the scheduling queues
and hence rescheduling co-allocation requests reduces response time of both local and
multi-site jobs. Moreover, process remapping increases the chances of placing the tasks
of multi-site jobs into a single cluster, thus eliminating the inter-cluster network overhead.
4.1 Introduction
Most of the current resource co-allocation solutions rely on advance reservations [40, 44,
51, 78, 97]. Although advance reservations are important to guarantee that resources are
available at the expected time, they reduce system utilisation due to the inflexibility intro-
duced in scheduling other jobs around the reserved slots [112]. To overcome this problem,
several researchers have been working with flexible (or elastic) advance reservations, i.e.
requests that have relaxed time intervals [47, 67, 86, 98, 99]. Nevertheless, the use of
these flexible advance reservations for resource co-allocation has been barely explored
[66].
By introducing flexibility to the advance reservations of co-allocation requests [66],
schedulers can hence reschedule them to increase system utilisation and reduce response
time of both local and multi-site jobs. This is particularly necessary due to the inaccurate
run time estimations provided by users [73, 85, 118].
Little research has been devoted to resource co-allocation with rescheduling support.
We have evaluated our model and scheduling strategies with extensive simulations
and analysed several metrics to have a better understanding of the improvements achieved
here. We also discuss issues on deploying the co-allocation policies on real environments.
[Figure: a user submits a co-allocation request spanning multiple resource providers
(e.g., a geophysics centre and a storage centre); each provider schedules the sub-requests
in its own queue over time.]
Resource Management Systems. These systems, also named local schedulers, schedule
both local and external requests in a machine mi . We do not assume that a metascheduler
has complete information about the local schedulers. In our scenario, rather than publish-
ing the complete scheduling queue to the metascheduler, the local schedulers may want
to only publish certain time slots to optimise local system usage. Moreover, in our com-
puting environment schema the resource providers have no knowledge about one another.
The scheduling management policy we use here is FIFO with conservative backfilling,
which provides completion time guarantees once users receive their scheduling time slots
[85].
request. A sub-request has a limited number of processors, which is dictated by the capac-
ity of the machine mi. However, the number of processes of each application component
is determined by the user and can exceed the number of processors.
Metrics. Our main aim is to optimise job response time, i.e. the difference between the
submission time of the user request and its completion time. We also evaluate system
utilisation, number of machines used by each job, number of jobs that received resources
before expected, among other metrics.
• Rjmk : number of processors required in each machine mi , where k is the total num-
ber of jobs of the metajob j;
A FlexCo request has two operations (Figure 4.2): (i) start time shifting, which changes
the start time according to the relaxed time interval (the change must be the same for all
jobs coming from the same metajob); and (ii) process remapping, which changes the
number of required resources of two or more jobs. Combining both operations is also
important for the scheduler. In Figure 4.2, we observe that after using the process
remapping operation, it is possible to reduce the metajob response time by shifting the jobs
associated with it. The schedulers perform these operations while jobs are waiting for
resources, not during run time. The idea here is to redefine the request specifications, not
to migrate jobs [81].
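A minimal sketch of the two operations, assuming a metajob is represented as a list of (site, processors, start time) sub-requests; the representation and function names are illustrative, not the thesis's data structures:

```python
def shift(metajob, delta):
    """Start time shifting: the same delta is applied to every job of the
    metajob, keeping the co-allocated components synchronised."""
    return [(site, procs, start + delta) for site, procs, start in metajob]

def remap(metajob, src, dst, procs):
    """Process remapping: move `procs` processors from the sub-request at
    site `src` to the one at site `dst`; the total stays unchanged and
    empty sub-requests are dropped."""
    counts = dict((s, p) for s, p, _ in metajob)
    counts[src] -= procs
    counts[dst] += procs
    starts = dict((s, t) for s, _, t in metajob)
    return [(s, p, starts[s]) for s, p in counts.items() if p > 0]

m = [("siteA", 8, 50), ("siteB", 4, 50)]
print(shift(m, -10))                  # [('siteA', 8, 40), ('siteB', 4, 40)]
print(remap(m, "siteA", "siteB", 8))  # [('siteB', 12, 50)]
```

Note how, in the second call, remapping collapses the metajob onto a single site, which is exactly the case that eliminates inter-cluster network overhead.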
[Figure 4.2: the start time shifting and process remapping operations applied to a
multi-site job Jn scheduled on Site 1 and Site 2 (original versus modified schedules).]
Start Time Shifting (Shift): finding a common time slot may be difficult for users; hence,
once they commit the co-allocation based on advance reservations, they will not be willing
to change it. The modification of the start time may be useful for one resource provider
in order to fill a fragment in the scheduling queue. If the other resource providers are also
willing to shift the advance reservations to start earlier, the users will also benefit.
Note that this operation is application independent in the sense that it is only a shift of the
start time of the user application.
Process Remapping (Remap): A user requiring a certain number of resources tends to
decompose the metajob statically according to the available providers at a certain time.
Therefore, users may not be able to reduce the start time of their applications when re-
sources become available. To overcome this problem, Remap allows automatic remap-
ping of the processes once the jobs are queued. This operation is application dependent
since the throughput offered by each resource provider may influence the overall appli-
cation performance. Thus, users may also want to incorporate restrictions on how the
metascheduler should map and remap their jobs. Branch-and-bound-based solvers for op-
timisation problems are an example of applications that are flexible to deploy and hence
can benefit from this operation. For network-demanding applications, this operation
allows reducing the number of clusters required by a co-allocation request, which
has a direct impact on the network utilisation.
1. Ask the resource providers for the list of available time slots, T S = {ts1 , ts2 , ..., tsn },
where n is the number of time slots;
2. Find the earliest common start time Tjs that meets the request constraints, such as
number of resources, start time, and completion time;
In order to find the common start time Tjs , the metascheduler verifies the values of Tjs
according to the list of available time slots T S and gets the maximum number of resources
available in each machine mi starting at time Tjs that fits the job. Note that if the number
of resources available in a particular mi is greater than or equal to Rj , there is no need
to consider the network overhead Tjxo since the job will be submitted to a single machine
mi .
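The search for the common start time can be sketched as follows. The representation of each provider's slots as (start, end, free processors) tuples and the function names are illustrative assumptions:

```python
def earliest_common_start(slots_per_machine, needed, runtime):
    """Find the earliest start time at which the machines together offer
    `needed` processors for `runtime` time units (sketch; real slot lists
    would come from each resource provider)."""
    candidates = sorted({s for slots in slots_per_machine.values()
                         for s, _, _ in slots})
    for t in candidates:
        total = 0
        for slots in slots_per_machine.values():
            # processors free on this machine during [t, t + runtime)
            total += sum(p for s, e, p in slots
                         if s <= t and t + runtime <= e)
        if total >= needed:
            return t
    return None

slots = {"m1": [(0, 100, 4)], "m2": [(10, 100, 4)]}
print(earliest_common_start(slots, 8, 20))  # 10
```

In the example, 8 processors are only jointly available once m2's slot opens at t=10; a real implementation would additionally check the single-machine case, in which the network overhead term can be dropped.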
When generating the list of sub-requests, the metascheduler could follow different
approaches. For example, it could try to decompose the multi-site jobs evenly in order
to maintain the same load in each resource provider. In our approach, the metascheduler
allocates as many processors as possible from a single resource provider per request.
Every time a new external job arrives, the metascheduler uses the next-fit approach to give
priority to the next resource provider. The idea behind this approach is to increase
the chances of fitting multi-site jobs into a single site over time due to the rescheduling.
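The greedy next-fit decomposition described above can be sketched as follows; the provider map and function name are illustrative assumptions:

```python
def decompose(total_procs, providers, start_index):
    """Split a multi-site request by taking as many processors as possible
    from each provider, starting from the next-fit position (sketch;
    `providers` maps provider name -> free processors)."""
    names = list(providers)
    subrequests, remaining = [], total_procs
    for i in range(len(names)):
        name = names[(start_index + i) % len(names)]
        take = min(remaining, providers[name])
        if take > 0:
            subrequests.append((name, take))
            remaining -= take
        if remaining == 0:
            break
    return subrequests if remaining == 0 else None

free = {"clusterA": 16, "clusterB": 8, "clusterC": 8}
print(decompose(20, free, 0))  # [('clusterA', 16), ('clusterB', 4)]
print(decompose(20, free, 1))  # [('clusterB', 8), ('clusterC', 8), ('clusterA', 4)]
```

Advancing `start_index` with each arriving external job gives the next-fit rotation: load spreads across providers on submission, while each sub-request still grabs as many processors as one provider can give.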
4.4.2 Rescheduling
As described in the previous subsection, the initial scheduling of a multi-site job involves
manipulation and transfer of time slots over the network. In order to reschedule multi-
site jobs, one must consider the cost-benefit of transferring and manipulating time slots
to optimise the schedule. Therefore, our approach is to reschedule a multi-site job only
when the resource provider is not able to find a local job that fills the fragments generated
due to the early completion of a job (Figure 4.3). The local schedulers use Algorithm 3
to reschedule jobs whenever a job completes before its estimated time. The rescheduling
is based on the compressing method described by Mu'alem and Feitelson [85], which
consists of bringing the jobs towards the current time according to their estimated start
times, not their arrival times (Lines 3-5, 11-14). This avoids the violation of the completion time
of jobs given by the original schedule. When implementing the algorithm, one could keep
a list of jobs sorted according to start time instead of sorting them when rescheduling
(Line 2).

Figure 4.4: Class diagram for the metascheduler of message passing applications. The
four classes are MetaschedulerForMPJobs, Scheduler, Rescheduler, and ScheduledJobs;
their operations include submitJobs, reconfigureJobs, cancelJob, scheduleMetaJob,
rescheduleJob, findCommonTimeSlot, findCommonTimeSlotForShifting, shiftStartTime,
getFreeTimeSlots, and getMetaJob.
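The compression step can be sketched for a single resource; representing each waiting job as an (id, scheduled start, runtime) tuple is a simplifying assumption:

```python
def compress(queue, now):
    """Compress a schedule after an early completion: re-insert waiting
    jobs in order of their previously scheduled start times (not arrival
    times), so no job ends later than originally guaranteed
    (single-resource sketch)."""
    free_at = now
    new_schedule = []
    for jid, start, runtime in sorted(queue, key=lambda j: j[1]):
        new_start = max(free_at, now)
        # never move a job later than its original guarantee
        new_start = min(new_start, start)
        new_schedule.append((jid, new_start, runtime))
        free_at = new_start + runtime
    return new_schedule

# Job B was scheduled behind A; the running job finished early at t=10.
print(compress([("A", 30, 10), ("B", 40, 5)], 10))
# [('A', 10, 10), ('B', 20, 5)]
```

Iterating in scheduled-start order is what preserves the completion time guarantees of conservative backfilling: a job can only move earlier, never past another job that was ahead of it.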
Once the metascheduler receives a notification for rescheduling a multi-site job ji from
the resource provider (Line 8), it performs the rescheduling in a similar way as described
in the initial scheduling procedures (Section 4.4.1). The main differences are that (i)
for the Shift operation, the metascheduler asks for time slots only from those resource
providers that hold the jobs of the multi-site job ji ; and (ii) for the Remap operation
the metascheduler contacts other resource providers rather than only the original ones.
In addition, for this latter operation, the metascheduler may remove sub-requests from
resource providers.
Local schedulers usually handle jobs by using their IDs. In order to implement the co-
allocation policies, jobs require an additional ID, composed of the metascheduler ID,
which is necessary to contact the metascheduler managing the job, and the ID used
internally by the metascheduler, which helps the metascheduler identify the jobs.
The co-allocation request also needs an estimate of the network overhead required to
execute the application in multiple providers.
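The two-part ID described above can be sketched as a small value object; the class and field names are illustrative:

```python
class CoAllocatedJobId:
    """Composite ID carried by each sub-request: the metascheduler ID tells
    a resource provider whom to notify, and the internal ID lets that
    metascheduler map the notification back to its metajob (sketch)."""

    def __init__(self, metascheduler_id, internal_id):
        self.metascheduler_id = metascheduler_id
        self.internal_id = internal_id

    def __str__(self):
        return f"{self.metascheduler_id}/{self.internal_id}"

jid = CoAllocatedJobId("meta-01", "job-42")
print(jid)  # meta-01/job-42
```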
Figure 4.4 represents a simplified version of the class diagram for the metascheduler.
There are four main components: the metascheduler main class, the scheduler, the
rescheduler, and a list of scheduled jobs. The main responsibilities of the metascheduler
class are to submit jobs to resource providers, and to reconfigure and cancel jobs when the
Remap operation is used. The complexity of implementing the scheduler and rescheduler
classes lies in finding the common time slot (Figure 4.5), which has to consider the
requirements of the metajob. The metascheduler also contains a list of scheduled jobs to
keep track of where each job of a metajob is scheduled and to find metajobs based on the
job IDs given by providers at the rescheduling phase.
[Figure 4.5: interaction between the metascheduler and the resource providers: the
metascheduler asks for free time slots, each provider generates and returns its time slots,
the metascheduler waits for the time slots from all providers, finds a common start time,
and sends the sub-requests; the providers then schedule the requests and confirm the
schedule with the expected completion time.]
4.5 Evaluation
We used the simulator PaJFit and real traces from supercomputers available at the Par-
allel Workloads Archive to model the user applications. We compared the use of Shift
and Shift with Remap operations against the co-allocation model based on rigid advance
reservations, which provides response time guarantees but suffers from high fragmentation
inside the resource providers' scheduling queues. This section presents a description
of the environment setup and metrics, followed by the results and our analysis.
For the workload characteristics, we observe that the estimated load is much higher
than the actual load due to the inaccurate user estimations. More details on the workloads
can be found at the Parallel Workloads Archive.
For the network overhead of multi-site jobs, as there is no trace available with such
information, we assigned to each job a random value drawn from a Poisson distribu-
tion with λ=20. A study by Ernemann et al. [45] shows that co-allocation is advantageous
when the penalty for network overhead is up to 25%; therefore, we limited the network
overhead to this value.
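One possible reading of this workload setup is to sample overheads (as percentages) from a Poisson distribution and reject values above the 25% bound; the sampling routine below uses Knuth's classic algorithm, since the Python standard library has no Poisson variate:

```python
import math
import random

def poisson_sample(rng, lam):
    """Knuth's algorithm for sampling a Poisson random variate."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def network_overhead_percent(rng, lam=20, cap=25):
    """Overhead (%) for a multi-site job: Poisson with mean `lam`,
    resampled until it falls within the 25% bound reported as
    advantageous [45] (an illustrative reading of the setup above)."""
    while True:
        sample = poisson_sample(rng, lam)
        if sample <= cap:
            return sample

rng = random.Random(1)
print(all(network_overhead_percent(rng) <= 25 for _ in range(100)))  # True
```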
We evaluated the system utilisation and response time, i.e. the difference between
the job completion time and submission time. In addition, we analysed the behaviour of
multi-site jobs due to rescheduling. We investigated these metrics according to the run
time estimation precision of all jobs in the system, using the original values and the
original values with 25% and 50% added precision. The metrics are also a function of
the external load.
Figure 4.6: Global response time reduction as a function of the run time estimation preci-
sion and external load.
Part of this reduction comes from the fact that some multi-site jobs can be remapped to a
single cluster, which reduces the execution time of these jobs by up to 25%.
Figure 4.7: Response time reduction of local jobs as a function of the run time estimation
precision and external load.
Filling the gaps using FlexCo requests has a direct impact on the system utilisation,
as can be observed in Figure 4.9. For system utilisation, we see that Shift+Remap con-
sistently provides better results than only shifting the requests, peaking at an improvement
of over 10% in relation to co-allocation based on rigid advance reservations. However,
for an external load of 10% and a precision increase of 25%, the difference between
Shift+Remap and Shift is minimal, and for a precision increase of 50%, utilisation for
Shift+Remap is lower than for rigid advance reservations. This difference does not
reflect the improvement of Shift+Remap observed in the response time: it happens because
Shift+Remap reduces system utilisation when multi-site jobs are remapped to a single
site, which eliminates network overhead. For both response time and system utilisation,
we observe that the higher the imprecision of run time estimations, the greater the benefit
of rescheduling multi-site jobs.

Figure 4.8: Response time reduction of external jobs as a function of the run time estima-
tion precision and external load.

Figure 4.9: Global utilisation as a function of the run time estimation precision and exter-
nal load.
To better understand what happens with the multi-site jobs, we measured the number
of metajobs remapped to a single cluster due to the rescheduling. From Figure 4.10,
we observe that approximately 15% of multi-site jobs that would otherwise use inter-
cluster communication were migrated to a single cluster. Different from the utilisation
and response time, this metric does not present a smooth behaviour; moving jobs to a
single site is highly dependent on the characteristics and packing of the jobs.
Figure 4.13 illustrates the percentage of multi-site jobs that are initially submitted to
more than one site and are able to access the resources earlier than expected due to
rescheduling. We observe that this improvement occurs for almost all multi-site jobs in
all scenarios. Both operations helped to improve the schedule of multi-site jobs.
Figure 4.10: Percentage of multi-site jobs moved to a single cluster as a function of the
run time estimation precision and external load.
Figure 4.11: Number of clusters used by job with system external load=10%.
Figure 4.12: Number of clusters used by job with system external load=30%.
Figure 4.13: Percentage of multi-site jobs that reduce their response time due to reschedul-
ing as a function of the run time estimation precision and external load.
When the external load is only 10%, more jobs can be moved to a single cluster. However,
when the system has a higher load, it is more difficult to find fragments in a single cluster.
Moreover, in this case, multi-site jobs may end up accessing fragments of more sites to
reduce their response time.
The experimental results presented in this section demonstrate that local jobs are not
able to fill all the fragments in the scheduling queues and therefore co-allocation jobs
need to be rescheduled. The more imprecise the users' estimations, the more important
the rescheduling becomes, because the need for rescheduling increases with the number
and size of the fragments that wrong estimations generate at the head of the scheduling
queues.
4.6 Conclusions
This chapter has shown the impact of rescheduling co-allocation requests in environments
where resource providers deal with inaccurate run time estimations. As local jobs are not
able to fill all the fragments in the scheduling queues, the co-allocation requests should not
be based on rigid advance reservations. Our flexible co-allocation (FlexCo) model relies
on shifting of advance reservations and process remapping. These operations allow the
rescheduling of co-allocation requests, therefore overcoming the limitations of existing
solutions in terms of response time guarantees and fragmentation reduction.
Regarding the rescheduling operations, Shift provides good results against the rigid-
advance-reservation-based co-allocation and is not application dependent since it only
changes the start time of the applications. Shift with Remap provides even better results
but is application dependent since it also modifies the amount of work submitted to each
site. Parallel applications that have flexible deployment requirements, such as branch-
and-bound-based solvers for optimisation problems, can benefit from the Remap
operation. In our experiments we showed that, depending on the system load, Remap can
reduce the number of clusters used by multi-site requests. In the best case, a job initially
mapped to multiple sites can be remapped to a single site, thus eliminating unnecessary
network overhead, which is important for network-demanding parallel applications.
This chapter meets two objectives proposed in Chapter 1 for message passing appli-
cations: understanding the impact of inaccurate run time estimates, and designing and
evaluating co-allocation policies with rescheduling support. Process remapping has
practical deployment difficulties, especially for environments containing clusters with
different resource capabilities. Therefore, the next chapter discusses how to overcome
these difficulties and the kinds of applications that can benefit from the process remapping
operation.
Chapter 5
Implementation of Automatic Process Mapping
5.1 Introduction
Co-allocating resources from multiple clusters is difficult for users, especially when re-
sources are heterogeneous. Users have to specify the number of processors and usage
times for each cluster. Apart from the difficulty of estimating application run times,
these static requests limit the initial scheduling due to the lack of resource options given
by users to metaschedulers. In addition, static requests prevent rescheduling of appli-
cations to other resource sets: applications may be aborted when rescheduled to slower
resources unless users provide large run time overestimations. When applications are
rescheduled to faster resources, backfilling [85] may not be exploited if estimated run
times are not reduced. Therefore, performance predictions play an important role for
automatic scheduling and rescheduling.
The use of performance predictions for scheduling applications has been extensively
studied [56, 63, 101, 103]. However, predictions have mostly been used for single-cluster
applications and require access to the user application source code [15, 16, 102, 130].
Parallel applications can also use multiple clusters, and performance predictions can as-
sist their deployment. One application model for multi-cluster environments is based on
iterative algorithms, which have been used to solve a variety of problems in science and
engineering [11], and have also been used for large-scale computations through the asyn-
chronous communication model [42, 76].
This chapter proposes a resource co-allocation model with rescheduling support based
on performance predictions for multi-cluster iterative parallel applications. For iterative
applications with regular execution steps, run times can be predicted by observing their
behaviour during a short partial execution. This chapter also proposes two scheduling al-
gorithms for multi-cluster iterative parallel applications based on the synchronous and
asynchronous models. The algorithms can be utilised to co-allocate resources for iterative
applications with two heterogeneity levels: the computing power of cluster nodes and
process sizes.
We performed experiments using an iterative parallel application, which consists of
benchmark multiobjective problems, with both synchronous and asynchronous commu-
nication models on Grid’5000. The results using our case study application and seven
resource sets show that it is possible to generate performance predictions with no access
to the user application source code. By using our co-allocation model, metaschedulers be-
come responsible for run time predictions, process mapping, and application reschedul-
ing, releasing the user from these difficult tasks. The use of performance predictions
presented here can also be applied when rescheduling single-cluster applications among
multiple clusters.
[Figure 5.1: execution of an iterative application over time under the synchronous and
the asynchronous communication models.]
Resource co-allocation is important for both models since: for the synchronous model,
it prevents processes from being idle and thus completes the execution faster; whereas
for the asynchronous model, it increases interaction among results of the application pro-
cesses.
[Figure 5.2: metascheduler components for co-allocation based on performance
predictions: a performance model and an application scheduler, centralised or distributed
among the system schedulers, which obtain machine lists from the system scheduler of
each cluster (Cluster 1 to Cluster N).]
The metascheduler, responsible for co-allocating resources from multiple clusters, re-
lies on four components to enable automatic process selection and rescheduling support.
Here we present the sequence of steps to co-allocate resources using performance predic-
tions and an overview of the metascheduler components (component details are presented
in the next sections), as illustrated in Figure 5.2:
• User preferences (Step 1): users specify the total number of processes and ap-
plication parameters. Users also have to specify a script to determine the application
throughput on a resource. Section 5.3.1 details how to provide the script.
• Performance predictions (Step 2): the metascheduler executes the script to collect
application throughput on multiple resource types. Section 5.4 details an example
of how to generate predictions.
• Machine list (Step 3): the metascheduler contacts system schedulers to obtain a
list of resources that can be used by the user application. Section 5.3.3 describes
the interaction between metascheduler and system schedulers.
• Application scheduler (Step 4): uses the machine list, performance predictions,
and user preferences to generate a schedule of the application processes. This
component generates a set of scripts used to deploy the application. Section 5.3.2
presents two application schedulers for iterative parallel applications.
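The four steps above can be sketched as a driver function; every callable here is a stand-in for the corresponding metascheduler component, not an actual interface from the thesis:

```python
def co_allocate(user_prefs, system_schedulers, run_throughput_script,
                application_scheduler):
    """Drive Steps 1-4: take user preferences, run the throughput script
    on each resource type, collect machine lists from the system
    schedulers, and hand everything to the application scheduler
    (illustrative sketch)."""
    # Step 2: run the user-provided script on each resource type
    throughputs = {rtype: run_throughput_script(rtype)
                   for rtype in user_prefs["resource_types"]}
    # Step 3: collect candidate machines from the system schedulers
    machines = [m for s in system_schedulers for m in s.free_machines()]
    # Step 4: map processes using machines, predictions and preferences
    return application_scheduler(machines, throughputs, user_prefs)
```

The key point of this structure is that the throughput script is the only application-specific part; the metascheduler itself never needs the application source code.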
The application scheduler determines the process locations and the overall
execution time, as described in the following section. Network overhead can also
be included in the overall execution time; it can be estimated from the amount of data
that has to be transferred among application processes and the network latency among
clusters.
1. The application scheduler asks the system scheduler for the earliest n machines
available;
13 /* optional */
14 for each resource r do
15   Make the last process on this resource complete at MaxCompletionTime by
     increasing its number of iterations
3. The metascheduler, or a system scheduler, verifies with the other system sched-
uler(s) whether it is possible to commit requests;
4. Step 1 is repeated if it is not possible to commit requests. A maximum number of
trials can be specified.
Note that by using this algorithm, resource providers can keep their schedules private
[44]. Alternatively, in Step 1, the application scheduler could ask system schedulers for
all free time slots (which are available time intervals for resources) and then minimise
interactions between the metascheduler and the system scheduler.
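The request-commit loop described in the steps above can be sketched as follows; the scheduler interfaces (`earliest_machines`, `try_commit`, `release`) are illustrative assumptions:

```python
def co_allocate_with_retries(app_scheduler, system_schedulers, n,
                             max_trials=3):
    """Ask for the earliest n machines, then try to commit the requests on
    every system scheduler; roll back and retry on failure, up to a
    maximum number of trials (sketch of the protocol in the text)."""
    for _ in range(max_trials):
        offer = app_scheduler.earliest_machines(n)           # Step 1
        if all(s.try_commit(offer) for s in system_schedulers):  # Step 3
            return offer
        for s in system_schedulers:                          # undo partial commits
            s.release(offer)
    return None  # give up after max_trials (Step 4)
```

Because providers only answer commit/reject for a concrete offer, they never have to publish their full scheduling queues, which is how the protocol keeps schedules private [44].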
Rescheduling frequently occurs when applications finish before the estimated time
[89]. Rescheduling can also be necessary when resources fail or users modify or cancel
requests. Whenever one of these events arises, the system scheduler triggers the
rescheduling process.
5.4 Evaluation
This section describes the experiments that show how much time the metascheduler
requires to generate application throughputs, the accuracy of the run time predictions, and
the impact of these predictions on the rescheduling. We used an iterative parallel
application based on evolutionary algorithms, which consists of benchmark multiobjective
problems, with both synchronous and asynchronous communication models. We conducted
the experiment on Grid'5000, which consists of clusters in France dedicated to large-scale
experiments [18]. The clusters are space-shared machines shared by multiple applications.
From these clusters, we selected seven resource sets to execute the application. These sets
are examples of resources that are dynamically chosen by the metascheduler to execute
our case study application. Experiments on how the metascheduler selects resource sets,
and when it reschedules multi-cluster applications, are described in the previous chapter.
Before describing the experiment, we provide an overview of the benchmark application.
For the synchronous model, EMOMerge merges the approximation sets generated at each
stage of the execution. For the asynchronous model, EMOMerge uses the last results
received from the other processes.
[Figure: Grid'5000 clusters used in the experiments: Paradent (128 CPUs), Paramount
(66 CPUs), Paraquad (132 CPUs), Azur (144 CPUs), Bordeplage (102 CPUs), Sol
(100 CPUs), and Bordemer (96 CPUs), spread over sites in France including Lille,
Rennes, Paris, Nancy, Lyon, Bordeaux, Grenoble, Toulouse, and Sophia.]
Table 5.2: Resource sets selected by the metascheduler on seven clusters in Grid’5000.
Clusters/Resource Sets 1 2 3 4 5 6 7
paradent 32
paramount 04 20 04
paraquad 04 20
sol 12 12 20 20 12
bordemer 02 08 20 08 06 20
azur 06 06 10
bordeplage 06 08 20
Resource sets. We configured the metascheduler to access seven resource sets in Grid’5000.
Table 5.2 presents the list of clusters and number of cores used in each resource set. These
resource sets are examples of resources chosen dynamically by the metascheduler for the
EMO application. The clusters are space-shared machines, and hence the resource sets are
dedicated to the application, which is a common setup for existing HPC infrastructures.
As the predictions are important for rescheduling, we measured the time to generate
predictions and analysed the difference between the actual and the predicted execution
times. The prediction for each resource set helps schedulers to know whether they can
reschedule the new sub-requests into the scheduling queues of other schedulers. Therefore,
we measured the impact of the predictions on the application rescheduling. To understand
the application output on different resource sets, we also measured execution times and
the Epsilon indicator for the synchronous and asynchronous models.
Table 5.3: Throughput (iterations/sec.) for each machine type and topology.
Cluster/Topology Regular 2D Scale-Free Small-World Random
paradent 10.87 11.36 3.57 3.29
paramount 10.00 10.42 3.33 3.05
sol 9.26 9.62 3.09 2.81
bordemer 7.81 7.81 2.60 2.38
azur 7.14 7.35 2.34 2.14
bordeplage 5.81 6.10 1.89 1.68
Table 5.4: Time in seconds to obtain the throughputs for each machine type and topology.
Cluster/Topology Regular 2D Scale-Free Small-World Random
paradent 23 14 42 46
paramount 25 15 45 49
sol 27 16 49 54
bordemer 32 19 58 63
azur 35 21 64 70
bordeplage 43 26 80 90
iteration, which is common for several iterative applications. Figures 5.6-5.9 represent
the throughput (iterations/second) of an EMO execution using each topology on a single
core of seven machine types as a function of number of iterations.
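Once per-resource throughputs are measured, a run time prediction for the synchronous model follows from the observation that every process is paced by the slowest resource in the set. The sketch below uses the Regular 2D throughputs from Table 5.3; the formula is an illustrative simplification (it ignores network overhead), not the thesis's exact predictor:

```python
def predict_runtime(iterations, throughputs):
    """Predict execution time (seconds) for a synchronous iterative run:
    all processes advance in lockstep, so the job is paced by the slowest
    resource (simplified sketch; `throughputs` maps cluster -> iter/sec)."""
    slowest = min(throughputs.values())
    return iterations / slowest

# A resource set spanning the fastest and slowest clusters of Table 5.3:
tp = {"paradent": 10.87, "bordeplage": 5.81}
print(round(predict_runtime(5000, tp), 1))  # 860.6
```

This also makes the benefit of remapping visible: dropping bordeplage from the set would nearly halve the predicted time, which is the kind of comparison the metascheduler performs when deciding between resource sets.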
Accuracy of predictions. Figure 5.10 presents the predicted and actual execution times
for synchronous and asynchronous models. Actual execution times are averages of five
executions for each resource set. We observe that the execution time for the asynchronous
model is shorter than the synchronous model for all resource sets. For the asynchronous
model, all EMO processes execute the minimum required number of iterations, whereas
for the synchronous model, EMO processes may execute more iterations in order to wait
for processes that take longer. In addition, the difference between actual and predicted
execution times is on average 8.5% for the synchronous and 7.3% for the asynchronous model.
These results highlight that it is possible to reschedule processes on multiple clusters
since schedulers can predict the execution time for different resource sets. Note that the
predictions for the asynchronous model are slightly better than for the synchronous model.
The reason is that the asynchronous model requires less accurate inter-process communication
predictions than the synchronous model, since the network overhead has minimal impact
in the former.
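Under this prediction model, a metascheduler can turn the measured throughputs into run time predictions. The sketch below uses the Regular 2D per-core throughputs from Table 5.3; the function name is illustrative, and the thesis' full model additionally accounts for the merging phase and inter-cluster communication.

```python
# Per-core throughput (iterations/sec) for the Regular 2D topology (Table 5.3).
THROUGHPUT = {
    "paradent": 10.87, "paramount": 10.00, "sol": 9.26,
    "bordemer": 7.81, "azur": 7.14, "bordeplage": 5.81,
}

def predict(clusters, iterations):
    """Predicted run time for a resource set: the slowest cluster in the set
    finishes its iterations last and thus bounds the execution time."""
    return iterations / min(THROUGHPUT[c] for c in clusters)
```

For example, a resource set mixing paradent and bordeplage is bounded by bordeplage's 5.81 iterations/sec, so 581 iterations are predicted to take about 100 seconds.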
For the quality of the predictions (Figure 5.10), resource sets 2 and 3 present better
results compared to the other sets for the synchronous model. This happens because
Figure 5.6: Throughput for the Regular 2D topology.
the merging phase is split by sites (locations in France). For these sets, three sites are
used, and therefore the load for merging results is well balanced. Resource set 6 also
comprises three sites, but only four resources in one of the sites. For the asynchronous
model, the worst prediction is for resource set 7 since 20 resources from the worst cluster
(bordeplage) are used, which makes the merging process slower.
Figure 5.10: Comparison of predicted and actual execution times for synchronous and
asynchronous models.
Figure 5.11: Run time overestimations required to avoid the application being aborted
due to rescheduling from resource set 1 to other sets without co-allocation based on
performance predictions.
Figure 5.12: Epsilon indicator for three resource sets on both communication models
(Epsilon Sync/Async: 0.161/0.154, 0.231/0.168, and 0.234/0.171 for the three sets,
respectively).
Figure 5.13: Epsilon indicator for resource set 1 showing the importance of mixing
topologies for both communication models. (a) Only Scale-Free topology (Epsilon
Sync/Async: 0.235/0.204); (b) only Random topology (0.173/0.184); (c) four topologies
together (0.196/0.151).
Figure 5.13 compares optimisation results from different topologies. The results show
that although Random, which is the most CPU-consuming topology, has the greatest impact
on the Epsilon indicator, the less CPU-consuming Scale-Free topology also contributes to
the function optimisation. Moreover, even for single-topology executions, the asynchronous
model produces better optimisation results and converges faster than its synchronous
counterpart. Similar results were obtained for the other resource sets. The comparison
between the synchronous and asynchronous models presented here corroborates the results
of Desell et al. [42] with their application in the astronomy field; i.e. the asynchronous
model has better convergence rates, especially when heterogeneous resources are in place.
5.5 Conclusions
Resource co-allocation ensures that applications access processors from multiple clusters
in a coordinated manner. Current co-allocation models mostly depend on users to specify
the number of processors and usage time for each cluster, which is particularly difficult
due to heterogeneity of the computing environment.
This chapter presented a resource co-allocation model with rescheduling support based
on performance predictions for multi-cluster iterative parallel applications. Due to the reg-
ular nature of these applications, a simple and effective performance prediction strategy
can be used to determine the execution time of application processes. The metascheduler
can generate the application performance model without requiring access to the appli-
cation source code, but by observing the throughput of a process in each resource type
using a short partial execution. Predictions also enable automatic rescheduling of parallel
applications; in particular they prevent applications from being aborted due to run time
underestimations and increase backfilling chances when rescheduled to faster resources.
From the experiments using an iterative benchmark parallel application on Grid’5000,
we observed run time predictions with an average error of 7% and prevention of up to
35% and 57% of run time overestimations for synchronous and asynchronous models,
respectively. The results are encouraging since automatic co-allocation with rescheduling
support is fundamental for multi-cluster iterative parallel applications, in particular
because these applications, based on the asynchronous communication model, are used to
solve problems in large-scale systems.
This chapter meets the third objective presented in Chapter 1, which is the investigation
of technical difficulties in deploying the co-allocation policies in real environments, in
particular for the process remapping operation. Another application model that has been used for
large-scale distributed systems is the bag-of-tasks; and therefore we present co-allocation
policies for this model in the next two chapters.
Chapter 6
Offer-based Co-allocation for BoT Applications
6.1 Introduction
The execution of a BoT application on multiple utility computing facilities is an attractive
solution to meet user deadlines. This is because more tasks of a single BoT application
can execute in parallel and these facilities have to deliver a certain QoS level, otherwise
the providers are penalised. A service provider containing a metascheduler is responsible
for distributing the tasks among resource providers according to their load and system
configuration. However, allocating resources from multiple providers is challenging be-
cause these resource providers cannot disclose much information about their local load to
the metascheduler. Workload is private information that companies do not disclose easily
since it may affect the business strategy of competitors.
Much work has been done on scheduling BoT applications [19, 74, 80]. However,
78 Chapter 6. OFFER-BASED CO-ALLOCATION FOR BOT APPLICATIONS
little effort has been devoted to scheduling these applications with deadline requirements
[14, 68, 125], in particular considering limited load information available from resource
providers.
This chapter introduces three policies for composing offers to schedule BoT appli-
cations. Offers are a mechanism in which resource providers expose their interest in
executing an entire BoT or only part of it without revealing their local load and system
capabilities. Whenever providers cannot meet a deadline, they generate offers with an-
other feasible deadline. For the offer generation within resource providers, we leverage
the work developed by Islam et al. [60, 61], whereas the concept of combining offers for
executing an application on multiple resource providers is inspired by the provisioning
model of Singh et al. [109] and the capacity planning of Siddiqui et al. [105]. In addition,
this chapter analyses the amount of information resource providers need to expose to the
metascheduler and its impact on the scheduling.
Scheduler’s Goal. Our goal is to meet users’ deadlines, and when not possible, try to
schedule jobs as close as possible to these deadlines. The challenge from the metasched-
uler’s point of view is to know how much work to submit to each resource provider such
that it meets the BoT user’s deadline, whereas from the resource providers’ point of view
is to know how much work they can admit without violating the deadlines of already
accepted requests.
6.3 Offer Generation
Figure 6.1: Components interaction for scheduling a bag-of-tasks using offers from mul-
tiple providers. The user asks the service provider to meet the deadline or, failing that,
the earliest feasible deadline; the metascheduler (1) publishes the BoT size and deadline
to resource providers 1..N, which (2) generate offers and (3) send them back; the
metascheduler then (4) generates composite offers and (5) submits the tasks.
To generate offers, a resource provider performs the following steps:
1. Defines a set of possible numbers of tasks ∆ it is willing to accept. It creates this list
by calculating a percentage of the total number of tasks in the BoT application, e.g.
∆ ← {BoT_s, 0.75 * BoT_s, 0.50 * BoT_s, 0.25 * BoT_s, 0.10 * BoT_s}, where BoT_s is
the BoT size, i.e. the total number of tasks.
2. A procedure genOffer generates an offer for each BoT size. To generate an offer,
the resource provider creates a temporary job jk that has the BoT specifications,
which include deadline and estimated execution time, but with a different number
of tasks, as defined in ∆. The scheduling algorithm used is Earliest Deadline First.
3. The genOffer procedure returns the completion time of that offer. This time can
be the deadline provided by the user (in the case of a successful schedule), or a
completion time that is longer than the deadline (when the scheduler cannot meet
the user deadline).
4. The algorithm compares the current offer with the previous offer in Φ. If the com-
pletion time is the same, the previous offer simply has its number of tasks increased.
If the completion time is longer, the resource provider includes the current offer in
the list Φ.
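A minimal sketch of steps 1-4 follows. Here `edf_completion_time` is a hypothetical stand-in for the provider's EDF schedule simulation, and iterating ∆ from the smallest fraction up is our reading of step 4 (so offers with the same completion time are merged into one larger offer).

```python
def generate_offers(bot_size, deadline, est_runtime, edf_completion_time):
    """Provider-side offer generation (steps 1-4). edf_completion_time(n, rt, d)
    simulates an EDF schedule for a temporary job of n tasks and returns its
    completion time (the deadline d itself when the schedule is feasible)."""
    offers = []                                   # the list Phi
    for f in (0.10, 0.25, 0.50, 0.75, 1.0):       # fractions defining the set Delta
        n_tasks = max(1, int(f * bot_size))
        completion = edf_completion_time(n_tasks, est_runtime, deadline)
        if offers and completion == offers[-1][1]:
            offers[-1] = (n_tasks, completion)    # same completion: grow previous offer
        else:
            offers.append((n_tasks, completion))
    return offers
```

With a toy schedule where up to 50 tasks fit by the deadline of 40 and anything larger completes at 200, the provider ends up with just two offers: (50, 40) and (100, 200).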
FEEDBACK. In order to provide users with feedback when it is not possible to meet their
deadlines, we use the approach proposed by Islam et al. [61]. The idea is to use a binary
search whose first point is the deadline defined by the user and whose last point is the
longest feasible deadline, i.e. the one obtained when the job is placed at the end of the
scheduling queue.
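The binary search can be sketched as follows; `fits` is a hypothetical callback asking the local scheduler whether the job can complete by a given deadline, and the tolerance parameter is an assumption of this sketch.

```python
def earliest_feasible_deadline(user_deadline, last_feasible, fits, tol=1.0):
    """Binary search between the user's deadline (first point) and the longest
    feasible deadline (job at the back of the queue, last point)."""
    lo, hi = user_deadline, last_feasible   # invariant: fits(hi) is True
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(mid):
            hi = mid                        # feasible: try an earlier deadline
        else:
            lo = mid                        # infeasible: move the lower bound up
    return hi
```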
balance the load distributed among the resource providers according to the information
available to the metascheduler.
2. Sort the list such that all offers that meet the user’s deadline come before those that
do not meet. For those that meet, the offers are sorted in the decreasing order of
number of tasks. For those that do not meet, the offers are sorted in the ascending
order of completion time;
3. Remove all offers after the first offer that is able to execute the entire BoT;
4. Create a list L that is dynamically updated with possible composite offers. The
creation of L is based on the order of the offers. For each offer analysed, the
metascheduler updates the number of remaining requested tasks and the last com-
pletion time of each list that uses that offer. Note that what makes this algorithm
simple is the list pre-processing, i.e. sorting and filtering;
5. Return the first composite offer in L that provides the earliest possible deadline.
Table 6.1 illustrates an example of how the metascheduler composes offers. The
example considers a list of offers Φ = {(s1 , d1 )rid1 , ..., (sn , dn )ridn }, where rid is the
resource provider id, s is the BoT size, and d is the offer's deadline, and a BoT with
deadline = 40 time units and number of tasks = 512. As we can observe, the choice of
offers is not greedy. In the example, the metascheduler does not use the offer
(256, 40)1 , even though it is the best single offer, because the remaining resource
providers would then complete the BoT only by 200 time units, via (512, 200)3 , the next
best offer that can accept enough tasks, rather than by 100 time units.
Table 6.1: Example of offer composition with the OfferNoLB policy for a BoT with num-
ber of tasks = 512 and deadline = 40 time units.
Operation Offers (size, deadline)provider
Original Offers (256, 40)1 , (512, 100)1 , (128, 40)2 , (512, 200)2 , (64, 40)3 , (512, 200)3
Sorted Offers (256, 40)1 , (128, 40)2 , (64, 40)3 , (512, 100)1 , (512, 200)2 , (512, 200)3
Filtered Offers (256, 40)1 , (128, 40)2 , (64, 40)3 , (512, 100)1
Composite Offers (List L) {(128, 40)2 , (64, 40)3 , (320, 100)1 } OR
{(256, 40)1 , (128, 40)2 , (64, 40)3 (not enough tasks)}
Selected Composite Offer (128, 40)2 , (64, 40)3 , (320, 100)1
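The composition steps above can be sketched in Python. This is a simplified sketch; the tie-break among composites with the same completion time (preferring the one that places the most tasks within the user's deadline) is our reading of the example in Table 6.1, not stated explicitly in the policy.

```python
def compose_offers(offers, bot_size, deadline):
    """OfferNoLB-style composition. offers: list of (size, completion, provider)."""
    meets = sorted((o for o in offers if o[1] <= deadline), key=lambda o: -o[0])
    late = sorted((o for o in offers if o[1] > deadline), key=lambda o: o[1])
    ordered = meets + late
    for i, o in enumerate(ordered):          # keep nothing past the first offer
        if o[0] >= bot_size:                 # that can run the entire BoT
            ordered = ordered[:i + 1]
            break
    candidates = []                          # (assignments, remaining, completion)
    for size, comp, prov in ordered:
        for assigns, rem, last in candidates + [([], bot_size, 0)]:
            if rem > 0 and all(p != prov for _, _, p in assigns):
                take = min(size, rem)        # use at most one offer per provider
                candidates.append((assigns + [(take, comp, prov)],
                                   rem - take, max(last, comp)))
    done = [c for c in candidates if c[1] == 0]
    if not done:
        return None, None                    # not enough offered tasks
    # earliest completion first; tie-break (our assumption): the composite that
    # places the most tasks within the user's deadline
    best = min(done, key=lambda c: (c[2],
               -sum(t for t, d, _ in c[0] if d <= deadline)))
    return best[0], best[2]
```

Running this on the offers of Table 6.1 reproduces the selected composite offer: (128, 40)2, (64, 40)3, (320, 100)1, completing at time 100.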
These policies work only with offers that meet users' deadlines; all other offers are
discarded. Therefore, each resource provider has only one offer that meets the deadline.
The OffersWithPLB policy uses the offer size in order to balance the number of tasks
submitted to each resource provider. The proportional parameter P is calculated per offer
as follows: P ← OfferSize / totalOfferedTasks, where OfferSize is the number of tasks in
an offer, and totalOfferedTasks is the total number of tasks from all offers. For each offer,
the number of tasks is multiplied by P . As the metascheduler does not know the load and
the total computing power of resource providers, the offer size serves as an indicator of
how much work a resource provider should receive in relation to the others.
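A sketch of the proportional split is below; how the rounding remainder is distributed is our assumption, since the policy description does not detail it.

```python
def plb_split(offer_sizes, bot_size):
    """OffersWithPLB: each provider receives a share of the BoT proportional
    to the size of its deadline-meeting offer (P = OfferSize / totalOfferedTasks)."""
    total = sum(offer_sizes)                          # totalOfferedTasks
    shares = [bot_size * s // total for s in offer_sizes]
    # assumption: any rounding remainder goes to the provider with the largest offer
    shares[offer_sizes.index(max(offer_sizes))] += bot_size - sum(shares)
    return shares
```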
For the OffersWithDPLB policy (Algorithm 6), the parameter P is calculated per
group of offers with the same number of tasks (Line 3). The additional parameter Double
Proportional (DP) is used to distribute the load of a given group of resource providers
that contains offers with the same number of tasks (offer size) according to their capabili-
ties (e.g. the total number of resources a provider hosts) (Line 12). It is not always possible
to distribute the tasks of a given offer size exactly proportionally to the resource capabili-
ties. For this reason, we sort the offers of the same size in decreasing order of the providers'
total computing power (Line 1) and adjust the number of tasks assigned to a resource
provider when necessary (Lines 14-15).
Algorithm 6 (final steps, lines 16-18): set the offer's number of tasks
(offer.setNTasks(nTasks)), decrement the remaining task count for that offer size
(remainingNTasksSameSize.decrement(offer.nTasks)), and add the offer to the composite
offer (compositeoffer.add(offer)).
6.5 Evaluation
We have evaluated the scheduling policies by means of simulations to observe their effects
in a long-term usage. We have used our event-driven simulator PaJFit and real traces from
supercomputers available at the Parallel Workloads Archive and extended them according
to our needs.
We have evaluated the following scheduling policies:
• FreeTimeSlots: scheduling based on free time slots, i.e. the metascheduler has a
detailed access to the load available (Section 6.3.2);
• OffersWithDPLB: the metascheduler considers offer sizes and the total computing
power of resource providers (Section 6.4.2);
LOAD. Regarding the load used in the resource providers, approximately 50% comes
from the multi-site BoTs (external load), which could be executed in any cluster or mul-
tiple clusters, and approximately 50% comes from users submitting parallel or sequential
jobs directly to a particular cluster (local load). The global load is therefore the external
load plus the local load submitted to the clusters. We have chosen the same load for local
and external loads in order to be able to compare the impact of the scheduling policies
on local and external jobs in a fair manner. We varied the loads using a strategy similar
to that described by Shmueli and Feitelson to evaluate their backfilling strategy [104], in
which they modify the jobs’ arrival time. However, we fixed the simulation time inter-
val and modified the number of jobs in the traces. Table 6.2 summarises the workload
characteristics. More details on the workloads can be found at the Parallel Workloads
Archive.
DEADLINES. To the best of our knowledge, there are no traces available with deadlines.
Therefore, we have incorporated deadlines into the existing traces using the following
function: T_j^s + T_j^r + k, where T_j^s is the job submission time, T_j^r is the job
estimated run time, and k is a parameter that assumes three values according to two
Deadline Schemata. For
Deadline Schema 1, k assumes the values 18 hours, 36 hours, and 10 days, and for Dead-
line Schema 2, k assumes the values 12 hours, 1 day, and 1 week. Therefore, Deadline
Schema 2 has more jobs with tighter deadlines than Deadline Schema 1. We have used
a uniform distribution for the values of k for all jobs in each workload. Note that k is
not a function of job size. Modelling k independently of job size allowed the environment
to have both small and big jobs with relaxed and tight deadlines. We have generated 30
workloads for each original trace varying the seed for the deadlines. By having 60 days
of simulated time, 30 workloads with different deadlines, and 2 deadline schemas, we
believe that we have been able to evaluate the policies under various conditions.
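The deadline assignment above can be sketched as follows (k values converted to seconds; the function name is illustrative):

```python
import random

# k values for the two deadline schemata, in seconds:
# Schema 1: 18 hours, 36 hours, 10 days; Schema 2: 12 hours, 1 day, 1 week.
SCHEMA_1 = [18 * 3600, 36 * 3600, 10 * 86400]
SCHEMA_2 = [12 * 3600, 24 * 3600, 7 * 86400]

def assign_deadline(submit_time, est_runtime, schema, rng=random):
    """Deadline = T_j^s + T_j^r + k, with k drawn uniformly and
    independently of job size."""
    return submit_time + est_runtime + rng.choice(schema)
```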
1. Jobs Delayed: number of jobs that are not able to meet their deadlines;
2. Work Delayed: amount of work (processors x execution time) of the jobs that are
not able to meet their deadlines;
3. Total Weighted Delay: weighted difference between jobs’ deadlines and their new
deadline given by the system;
$$ TWDelay = \sum_{D_j^N > D_j} R_j \times \left( \frac{D_j^N - T_j^s}{D_j - T_j^s} - 1 \right) \times 100 \qquad (6.1) $$
where D_j is the job deadline, D_j^N is the new job deadline provided by the system,
R_j is the number of tasks of the job, and T_j^s is the job submission time. We use
R_j as the weight for the percentage difference between the original and the new
deadline.
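As a quick sketch, the metric can be computed directly from Equation 6.1:

```python
def total_weighted_delay(jobs):
    """Equation 6.1: jobs is a list of (deadline D_j, new deadline D_j^N,
    number of tasks R_j, submission time T_j^s). Only jobs whose new
    deadline exceeds the original one contribute."""
    return sum(r * ((dn - ts) / (d - ts) - 1) * 100
               for d, dn, r, ts in jobs if dn > d)
```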
For utility computing environments, the first two metrics represent the loss in revenue
due to possible rejections, whereas the third metric could represent penalties for not meet-
ing user’s demand. The last two metrics allow us to verify how jobs are spread across the
clusters and the impact of the policies on the system utilisation.
GOAL. Assess the impact of the information available to the metascheduler for schedul-
ing deadline-constrained jobs.
Deadline tightness. We observe that the difference in the results among the offer-based
policies increases when jobs have more relaxed deadlines (Schema 1 has more relaxed
deadline jobs than Schema 2). That is because as jobs have more relaxed deadlines in
Schema 1, resource providers can reschedule more jobs in order to generate better offers.
Therefore, the metascheduler has more options to schedule BoT applications. When it is
not possible to meet a user deadline, which is more frequent in Schema 2, the load
balancing policies cannot be used; the metascheduler has to use the OffersWithNoLB
policy (Section 6.4.1) for most of the jobs.
Information access. In relation to the differences between the FreeTimeSlots policy and
offer-based policies, we observe that the former handles local jobs better than, or
similarly to, the offer-based policies. That is because the metascheduler has more detailed
load information using the FreeTimeSlots policy, and hence it can better distribute the load among
resource providers. As local jobs do not have the option to choose the resource providers,
they enjoy more benefits using this policy. The OffersWithDPLBV2 policy generates sim-
ilar results as FreeTimeSlots for local jobs. This happens because in OffersWithDPLBV2,
the metascheduler has rough access to the local loads, which is enough to balance the
load and gives local jobs equal opportunity. Finally, we observe that for these two met-
rics, having access to the providers’ total computing power is enough to provide as good
results as the FreeTimeSlots policy, in which the metascheduler has a detailed access to
the resource providers’ loads, i.e. the free time slots. If rough load information is also
available (OffersWithDPLBV2), it is possible to get even better results.
Job delays. Figure 6.4 illustrates total weighted delay for not meeting the deadlines of
Figure 6.2: Number of jobs delayed for local, external, and global load.
local, external, and global load. This metric is interesting because it shows the difference
between what users asked and what the system provides. The behavior of this metric is
similar to the previous metrics, except that the delayed external jobs suffer much more
in the FreeTimeSlots policy. In this policy, the providers disclose their free time slots
to the metascheduler, which has no knowledge of the deadlines of the already accepted
jobs. Therefore, the metascheduler makes blind decisions in terms of deadlines, which
have a considerable impact on external jobs. Thus, even though resource providers dis-
close detailed information of their local load to the metascheduler using the FreeTimeSlots
Figure 6.3: Amount of work delayed for local, external, and global load.
policy, such a policy produces much worse results than the simplest offer-based policy,
i.e. OffersWithPLB. In this offer-based policy, the metascheduler uses only the offers,
without knowing the resource providers’ total computing power and load. This reveals
that indeed, the offer sizes are a good indicator to balance load among resource providers
without accessing their private information.
Load and system utilisation. Another important factor is the Number of Clusters used
by the BoT applications. Figure 6.5 shows the number of clusters used by BoT appli-
cations for each policy. We observe that the tighter the deadlines, the fewer options the
Figure 6.4: Total Weighted Delay for local, external, and global load.
metascheduler has to distribute the tasks of BoT applications among resource providers.
That is because there is more imbalance in the load when jobs have tighter deadlines,
and hence the metascheduler tends to use fewer options in offer-based policies, and fewer
time slots in the FreeTimeSlots policy. In addition, the offer-based policies with double
proportional load balancing tend to distribute the load better than the other two policies.
Therefore, they allow more scheduling options for the next jobs arriving into the system.
Regarding the System Utilisation (Figure 6.6), we observe that the difference between the
policies is minimal, i.e. less than 1 percent. OffersWithDPLB has a minimal decrease in
Figure 6.5: Number of clusters used by BoT applications for each policy.
Figure 6.6: System utilisation for each policy.
relation to the other policies because it considers the total computing power of providers
without their actual loads. Therefore, a provider that has less computing power than the
others may receive fewer tasks even if the other big providers have higher load. This
situation happens only when all offers that meet the deadline have the same size.
6.6 Conclusions
This chapter described three policies for composing resource offers from multiple providers
to schedule deadline-constrained BoT applications. These offers express the interest of re-
source providers in executing an entire BoT or only part of it without revealing their local
load and total system capabilities. When the metascheduler receives enough offers to meet
user deadlines, it decides how to balance the tasks among the resource providers according
to the information it has access to, such as the resource providers' total computing power
and their local loads. Whenever providers cannot meet a deadline, they generate offers
with another feasible deadline. The metascheduler is then responsible for composing the
offers and providing users with a feedback containing the new deadline.
From our experiments, we observed that by using the free time slots of resource
providers, BoT applications cannot access resources in the short term even when local jobs
could be rescheduled without violating their deadlines. The only benefit of publishing
the free time slots to the metascheduler is that it can balance the load among resource
providers, which makes more local jobs meet deadlines. However, when using offer-
based policies, more BoTs can meet deadlines and the delays between the user deadline
and the new deadline assigned by the system are much lower (in some cases 50% lower) in
comparison to the policy that uses free time slots (FreeTimeSlots).
We also observed that the simplest offer-based policy (OffersWithPLB) produces
schedules that delay fewer jobs in comparison to the FreeTimeSlots policy. However,
OffersWithPLB rejects more local jobs than FreeTimeSlots. This happens because
OffersWithPLB cannot balance the load among resource providers. If resource providers also publish
the total computing power (the OffersWithDPLB policy), the metascheduler can balance
the load and have similar acceptance rates as the FreeTimeSlots policy for local jobs. If the
resource providers can make their load available (OffersWithDPLBV2), the metascheduler
can further reduce the number of jobs delayed; however, the benefit is not significant.
Therefore, our main conclusions are: (i) offer-based scheduling produces less delay for
jobs that cannot meet deadlines in comparison to scheduling based on load availability
(i.e. free time slots); thus it is possible to keep providers’ load private when scheduling
multi-site BoTs; and (ii) if providers publish their total computing power they can have
more local jobs meeting deadlines.
This chapter meets part of the second objective of the thesis, which is design, imple-
mentation, and evaluation of co-allocation policies. It considered co-allocation as offer-
composition for BoT applications with deadline constraints. This chapter is a building
block for the next chapter, which deals with inaccurate run time estimates, reschedul-
ing, and implementation issues, thus meeting the three objectives of this thesis for BoT
applications.
Chapter 7
Adaptive Co-allocation for BoT Applications
The expected completion time of the user applications is calculated based on the run
time estimates of all applications running and waiting for resources. However, due to
inaccurate run time estimates, initial schedules are not those that provide users with the
shortest completion time. This chapter proposes a coordinated rescheduling algorithm and
evaluates the impact of this algorithm and system-generated predictions for bag-of-tasks
in multi-cluster environments. The coordinated rescheduling defines which tasks can have
start time updated based on the expected completion time of the entire BoT application,
whereas system-generated predictions assist metaschedulers to make scheduling decisions
with more accurate information. We performed experiments using simulations and an
actual distributed platform, Grid’5000, considering three main variables: time to generate
run times, accuracy of run time predictions, and time users are willing to wait to schedule
their applications.
7.1 Introduction
Metaschedulers can distribute parts of a BoT application among various resource providers
in order to speed up its execution. The expected completion time of the user application
is then calculated based on the run time estimates of all applications running and waiting
for resources. A common practice is to overestimate execution times in order to avoid
user applications being aborted [72, 73]. Therefore, initial completion time promises are
usually not accurate. In addition, when a BoT application is executed across multiple
clusters, inaccurate estimates increase the time difference between the completion of its
first and last task, which increases average user response time in the entire system. This
94 Chapter 7. ADAPTIVE CO-ALLOCATION FOR BOT APPLICATIONS
time difference, which we call stretch factor, increases mainly because rescheduling is
performed independently by each provider.
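The stretch factor is simple to compute from the tasks' completion times; a minimal sketch:

```python
def stretch_factor(task_completion_times):
    """Time difference between the completion of the first and the last task
    of a BoT spread across multiple clusters."""
    return max(task_completion_times) - min(task_completion_times)
```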
System generated predictions can reduce inaccurate run time estimates and prevent
users from having to specify these values. Several techniques have been proposed to pre-
dict application run times and queue wait times. One common approach is to analyse
scheduling traces; i.e. historical data [91, 111, 119]. Techniques based on trace anal-
yses have the benefit of being application independent, but may have limitations when
workloads are highly heterogeneous. Application profiling has also been vastly studied
to predict execution times [63, 101, 103, 130]. Application profiling can generate run
time predictions for multiple environments, but usually requires application source code
access.
This chapter proposes a coordinated rescheduling strategy for BoT applications run-
ning across multiple resource providers. Rather than providers performing independent
rescheduling of tasks of a BoT application, the metascheduler keeps track of the expected
completion time of the entire BoT. This strategy minimises the stretch factor and re-
duces user response time. We also show that on-line system-generated predictions, even
though they take time to obtain, can reduce user response time when compared to user
estimations. Moreover, with more accurate predictions, providers can offer tighter ex-
pected completion times, thus increasing system utilisation by attracting more users. We
performed experiments using simulations and an actual distributed platform, Grid’5000,
on homogeneous and heterogeneous resources. We also provide an example of system-
generated predictions using POV-Ray, which is a ray-tracer tool to generate three dimen-
sional images to produce animations.
Figure 7.1: Components interaction for scheduling a BoT using offers with system-generated
predictions. The user asks to optimise the completion time of the BoT, using either a run
time estimation or an application profiler; an offer consists of the number of tasks and the
expected completion time (admission control). The metascheduler (1) publishes the
application profile; resource providers (2) generate offers (and run time estimates) and
(3) send them; the metascheduler (4) generates composite offers and (5) submits the
tasks; (6) resource providers later contact the metascheduler for rescheduling.
scheduled tasks and to consider the tasks’ run time estimation errors. Once the resource
providers generate the offers, they send them to the metascheduler (step 3), which com-
poses them according to the user requirements (step 4), and submits the tasks to resource
providers (step 5). After the tasks of a BoT are scheduled, resource providers contact the
metascheduler for rescheduling purposes (step 6).
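As an illustration of step 4, the metascheduler could compose offers greedily by earliest expected completion time. This is a minimal sketch under that assumption, not the thesis's exact composition policy; the Offer structure and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    num_tasks: int          # tasks the provider is willing to execute
    completion_time: float  # expected completion time for those tasks

def compose_offers(offers, total_tasks):
    """Greedily pick the offers with the earliest expected completion
    times until every task of the BoT is assigned."""
    plan, remaining = [], total_tasks
    for offer in sorted(offers, key=lambda o: o.completion_time):
        if remaining == 0:
            break
        take = min(offer.num_tasks, remaining)
        plan.append((offer.provider, take))
        remaining -= take
    if remaining > 0:
        raise ValueError("offers cannot cover the whole BoT")
    used = {provider for provider, _ in plan}
    # the composite offer completes when its slowest used provider does
    bot_completion = max(o.completion_time for o in offers if o.provider in used)
    return plan, bot_completion
```

The composite offer's completion time is that of the slowest provider used, which is the quantity the rescheduling phase later tries to reduce.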
Due to system heterogeneity and the different loads in each resource provider, offers
arrive at different times to the metascheduler. Once the metascheduler receives all offers,
some of them may not be valid any more since other users submitted applications to the
providers. To overcome this problem, we use an approach similar to the one developed by
Haji et al. [55], who introduced a Three-Phase commit protocol for SNAP-based brokers.
We used probes, which are signals sent from the providers so that metaschedulers
interested in the same resources are aware of resource status changes.
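The probe bookkeeping on the metascheduler side might be sketched as follows; the class and method names are hypothetical, and the messaging layer itself is omitted.

```python
class OfferCache:
    """Tracks offers per provider and invalidates them when a probe
    signals that the provider's resource status has changed."""
    def __init__(self):
        self.offers = {}  # provider -> offer payload
        self.valid = {}   # provider -> bool

    def store(self, provider, offer):
        self.offers[provider] = offer
        self.valid[provider] = True

    def on_probe(self, provider):
        # a probe means other users changed the provider's queue
        self.valid[provider] = False

    def stale_providers(self):
        """Providers whose offers must be requested again
        before the metascheduler composes them."""
        return [p for p, ok in self.valid.items() if not ok]
```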
Figure 7.2 represents a simplified version of the class diagram for the metascheduler.
There are four main components: the metascheduler main class, scheduler, rescheduler,
and a list of scheduled jobs. The main responsibilities of the metascheduler class are to
submit jobs to resource providers, and keep a table with the expected completion time
of the BoTs. The complexity of implementing the scheduler lies in composing the offers.
The rescheduler is responsible for updating BoT completion times in the scheduling
queues of resource providers. Figure 7.3 illustrates the sequence diagram for the initial
co-allocation of a BoT application.
The schedulers’ goal is to provide users with expected completion time and reduce
such a time as much as possible during rescheduling phases.
96 Chapter 7. ADAPTIVE CO-ALLOCATION FOR BOT APPLICATIONS
Figure 7.2: Simplified class diagram of the metascheduler. MetaschedulerForMPJobs
holds a ResourceProvidersList and ExpectedCompletionTimes and exposes
submitJobs(jobs); Scheduler provides scheduleMetaJob(metaJob),
composeOffers(offers, metaJob), and getOffers(metaJob, resourceProviders, strategy);
Rescheduler provides updateExpectedCompletionTime(metaJob); ScheduledJobs
provides getMetaJob(jobId).

Figure 7.3: Sequence diagram for the initial co-allocation of a BoT application: the user
requests resources; the metascheduler asks the providers for execution offers; each
provider generates and returns its offers; the metascheduler waits for offers from all
providers, composes them, and sends the scheduling requests; providers schedule the
requests and confirm; finally, the metascheduler returns the expected completion time
to the user.
At the local scheduler, jobs are rescheduled one by one. For each job ji that is part of
a BoT, the scheduler verifies whether ji holds the expected completion time of the entire
BoT. Both BoT jobs and other types of jobs are then rescheduled using FIFO with
conservative backfilling. If a BoT job holds the expected completion time of the entire
job and receives a new completion time due to rescheduling, the algorithm keeps this job
in a structure called newCompletionTimes, which contains the job id and the new
completion time. The algorithm is executed again, but this time sorting the jobs by their
start time. This is done to fill fragments in the queue that cannot be filled when sorting
by completion time. After all jobs are rescheduled, the local scheduler sends the
newCompletionTimes structure to the metaschedulers holding the respective BoTs.
From the metascheduler side, each time it receives the newCompletionTimes structure,
it verifies whether the new completion times are local or global. If they are global, the
metascheduler sends this new information to the local schedulers holding the BoT's tasks.
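The two-pass procedure above can be sketched as follows, with conservative backfilling abstracted behind a caller-supplied function; the names and structures are hypothetical simplifications, not the thesis's implementation.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    start_time: float
    completion_time: float

def coordinated_pass(jobs, backfill, holds_bot_completion):
    """One local rescheduling round (sketch). `backfill` reschedules a
    job with conservative backfilling and returns its new completion
    time; `holds_bot_completion` tells whether the job currently
    defines its BoT's expected completion time. Jobs are processed by
    completion time first, then by start time, so fragments the first
    ordering cannot use are filled by the second."""
    new_completion_times = {}
    for key in (lambda j: j.completion_time, lambda j: j.start_time):
        for job in sorted(jobs, key=key):
            old = job.completion_time
            job.completion_time = backfill(job)
            if holds_bot_completion(job) and job.completion_time != old:
                # recorded and later sent to the metascheduler (step 6)
                new_completion_times[job.job_id] = job.completion_time
    return new_completion_times
```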
By profiling a few tasks, it is possible to estimate the overall application execution time.
In addition, depending on the application, it is possible to reduce the problem size in
order to speed up the prediction phase. For example, image processing applications can
have their problem size reduced by lowering the image resolution. The following sections
describe an example of a run time generator and discuss when and how to execute it.
to Clouds in order to increase the workload). To generate the animation for Sky Vase, we
rotated the vase 360 degrees. Box consists of a chessboard floor and a box containing a
few mirrored objects; we included a camera that approaches the box and crosses it to the
other side. Fish consists of a fish over water that rotates 360 degrees. Unlike Sky Vase,
Fish has a more heterogeneous animation due to the fish's shape (a vase is vertically
symmetric).
The animations have different execution time behaviours. Sky Vase has a steady
execution time since the vase is the only object that rotates and its texture requires
similar work on each frame. For the Box animation, at the beginning the box is still far
away, and hence small, consuming little processing time. However, as the camera
approaches the box, more work has to be processed, reaching its maximum when the
camera is inside the box. After the camera crosses the box, only the floor has to be
rendered. The Fish animation has a very heterogeneous execution time due to the fish's
shape, which affects the amount of work needed to render the reflection of the fish on
the water. Figure 7.5 shows an example image for each of the three animations.
Figure 7.6: Predicted execution time using the most CPU-consuming frame as the base
for estimations. Each panel plots execution time (s) against frame ID for one animation,
comparing the actual time at 2048x1536 resolution with predictions generated from
640x480 and 320x240 renderings.
Table 7.1: Time to generate execution time estimates and their accuracy using three
rescaled resolutions. The times include the processing of the base frame at the original
resolution. Note that the total execution times for the Sky Vase, Box, and Fish animations
are 36h, 2.5h, and 8h, respectively.
Animation Resolution Exec. Time (min) Perc. of total time Accuracy
When predicting the execution time, one must consider the trade-off between the time
spent on prediction and the prediction accuracy. The prediction should be fast enough to
allow prompt scheduling decisions and accurate enough to be meaningful for the
schedulers. Apart from that, it should be easy to deploy in practice. One possibility is to
render the animation at a much lower resolution and render one base frame at the actual
resolution. Using the execution times of the base frame at the actual and the reduced
resolutions, it is possible to generate a factor that, multiplied over the lower-resolution
animation, predicts the actual execution time of each frame. Figure 7.6 presents
predictions using the frame with maximum execution time as the base frame. For this
experiment, we used 640x480, 320x240, and 160x120 as the lower resolutions for
generating predictions. If the base frame is the one with maximum execution time, the
predictions tend to be overestimated, whereas using the minimum execution time they
tend to be underestimated. We also observe that both the 640x480 and 320x240
resolutions, using the base frame with maximum execution time, provided much better
predictions than 160x120. Table 7.1 summarises the execution times to generate the
predictions and their accuracies using the base frame with maximum execution time.
The results show that good predictions are time consuming since we are using the entire
animation.
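The factor-based prediction, and its sampling variant from the next paragraph, can be sketched as follows; rendering and timing are abstracted away, and the function names are hypothetical.

```python
def predict_frame_times(low_res_times, base_low, base_full):
    """Scale per-frame times from a low-resolution rendering by the
    ratio between the base frame's times at the actual (full) and the
    reduced resolutions."""
    factor = base_full / base_low
    return [t * factor for t in low_res_times]

def predict_sampled(sampled, frame_id, base_low, base_full):
    """Sampling variant: only some frames are profiled at low
    resolution (a dict frame_id -> time); an unsampled frame reuses
    the nearest sampled one, since neighbouring frames in an
    animation have similar content."""
    nearest = min(sampled, key=lambda f: abs(f - frame_id))
    return sampled[nearest] * base_full / base_low
```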
It is possible to reduce the profiling time by sampling a set of frames at a lower
resolution rather than using the entire animation. Figures 7.7 and 7.8 show the execution
time and the prediction accuracy as a function of the number of frames sampled, using
resolutions 640x480 and 320x240 respectively. For this experiment, we used the frame
with maximum execution time as the base frame. The results show that it is possible to
considerably reduce the profiling time while keeping a good prediction accuracy. This
happens because neighbouring frames in an animation have similar content and,
depending on the case, the variation is minimal during the entire animation, as for
Sky Vase.
(Figures 7.7 and 7.8: profiling execution time (min) and prediction overestimation (%)
as functions of the number of frames sampled, at resolutions 640x480 and 320x240
respectively, for each of the three animations.)
7.5 Evaluation
This section evaluates the coordinated rescheduling algorithm and the impact of inaccu-
rate run time estimates when scheduling BoT applications on multiple resource providers.
We performed experiments using both a simulator and a real testbed. Simulations allowed
us to perform repeatable and controllable experiments using various parameters. The ex-
periments in a real testbed allowed us to verify how the scheduler architecture can be used
in practice. We used PaJFit and workloads produced by the Lublin-Feitelson model. For
the real experiments, we used an extended version of PaJFit, which uses sockets for
communication between modules, on Grid'5000. In the following, we describe the
experiment configuration and analyse the results.
Table 7.2: Set-up for experiments varying hardware configuration and estimation types.
Each cluster has 300 processors.
Hardware | Estimation type and rescheduling
C1 = C2 = C3 = C4 | UE with uncoordinated rescheduling
C1 = C2 = C3 = C4 | SE (X and Y more accurate than UEs) with uncoordinated rescheduling; 5-30 min generation time
C1 = C2 = C3 = C4 | UE with coordinated rescheduling
C1 = C2 = C3 = C4 | C1 and C2 with UEs; C3 and C4 with SEs
C1 = C2 20% and 50% faster than C3 = C4 | UEs
C1 = C2 20% and 50% faster than C3 = C4 | SEs
C1 = C2 20% and 50% faster than C3 = C4 | UEs with coordinated rescheduling
We used the workload model proposed by Lublin and Feitelson [77] to generate traces
for both the simulations and the experiments in Grid’5000. We simulated 15 days of the
workload and used 15 workloads for each experiment. We also considered 20 run time
estimation values. Therefore, for each scenario described in Table 7.2, we have a total
of 300 simulations. For all experiments we set the system load to 70% by changing
the arrival times of the external jobs. To achieve this load we used a strategy similar to
that described by Shmueli and Feitelson to evaluate their backfilling strategy [104], but
we fixed the time interval and included more jobs from the trace.
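This load-adjustment idea can be sketched as below, assuming the offered load is computed as demanded processor-time over capacity (the names are hypothetical, and the actual strategy adjusts arrival times rather than truncating the trace).

```python
from dataclasses import dataclass

@dataclass
class TraceJob:
    runtime: float   # execution time (e.g. minutes)
    processors: int

def offered_load(jobs, num_processors, interval):
    """Offered load: total demanded processor-time over capacity."""
    demand = sum(j.runtime * j.processors for j in jobs)
    return demand / (num_processors * interval)

def take_jobs_for_load(trace, num_processors, interval, target=0.70):
    """Fix the time interval and keep adding jobs from the trace
    until the offered load reaches the target."""
    selected = []
    for job in trace:
        if offered_load(selected, num_processors, interval) >= target:
            break
        selected.append(job)
    return selected
```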
We performed our experiments in Grid’5000 by placing a local scheduler in four clus-
ters with access to 300 processors. Table 7.3 presents an overview of the node config-
urations in which we deployed the local schedulers and the metascheduler. Figure 7.9
illustrates the resource locations in Grid’5000 used in this experiment. We present the
results obtained through simulations followed by results from Grid’5000.
Table 7.3: Overview of the node configurations for the experiments in Grid’5000.
Scheduler | Cluster | Location | CPU configuration
metascheduler | sol | Sophia | AMD Opteron 246, 2.0 GHz
provider 1 | paradent | Rennes | Intel Xeon L5420, 2.5 GHz
provider 2 | bordemer | Bordeaux | AMD Opteron 248, 2.2 GHz
provider 3 | grelon | Lille | AMD Opteron 285, 2.6 GHz
provider 4 | chicon | Nancy | Intel Xeon 5110, 1.6 GHz
(Figure 7.9: map of the Grid'5000 sites used in this experiment, showing the
metascheduler at Sophia (sol) and providers 1 to 4 at the sites listed in Table 7.3.)
jobs can be backfilled. However, there is a limit to which backfilling can be exploited.
Figures 7.10 and 7.11 show the requested run times, fragment lengths, and number of
jobs that would fit into the fragments for run time estimates with accuracies of 85% and
50%, respectively. We observe that the higher the accuracy, the smaller the number of
jobs that have chances of being backfilled. In this example, we are not considering the
submission time of the jobs. Figure 7.12 presents the total number of jobs that would
have chances of backfilling as a function of run time accuracy. From this figure we notice
that there is a limit on the backfilling chances; in particular, after an overestimation of
200%, the chances of backfilling become steady due to the fragment lengths and
requested run times.
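The counting behind Figures 7.10 to 7.12 can be illustrated as follows; this is a simplification in which a job is counted if its requested run time fits the longest fragment, and submission times are ignored, as in the text.

```python
def backfill_chances(requested_run_times, fragment_lengths):
    """Number of jobs whose requested run time fits into at least one
    scheduling-queue fragment (submission times are ignored)."""
    longest = max(fragment_lengths, default=0)
    return sum(1 for r in requested_run_times if r <= longest)
```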
(a) Requested run times. (b) Fragment lengths. (c) Chances to be backfilled.
Figure 7.10: Requested run times and fragment lengths for accuracy of 85%.
(a) Requested run times. (b) Fragment lengths. (c) Chances to be backfilled.
Figure 7.11: Requested run times and fragment lengths for accuracy of 50%.
Figure 7.12: Backfilling limit as a function of run time overestimations.
The main motivation for developing coordinated rescheduling for bag-of-tasks
applications is the observation that the stretch factor increases with run time
overestimations. Figure 7.13
presents the stretch factor for applications scheduled in multiple clusters as a function of
run time overestimation for homogeneous and heterogeneous environments. Up to 30%
overestimation, there is no difference between the rescheduling strategies, because at
that point only a few jobs have chances of being backfilled. Beyond 30%, however,
tasks of BoT applications spread over the scheduling queues due to rescheduling, thus
increasing the stretch factor. Coordinated rescheduling minimises this effect, especially
when run time accuracy is low.
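For reference, the stretch factor reported here can be read with the definition commonly used in the BoT scheduling literature; the chapter's exact formulation may differ in detail:

```latex
\[
  \text{stretch} \;=\;
  \frac{t_{\text{completion}} - t_{\text{submission}}}{t_{\text{execution}}}
\]
```

A stretch of 1 means the BoT experienced no queueing delay; tasks spread across scheduling queues inflate the completion time and hence the stretch.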
(Each panel plots the stretch factor against user run time overestimation (%), comparing
uncoordinated and coordinated rescheduling.)
(a) Homogeneous environment.
(b) Heterogeneous environment.
Figure 7.13: Stretch factor variation as a function of the run time estimation accuracy and
rescheduling policy.
For the heterogeneous environment, although the stretch factor is reduced using
coordinated rescheduling over the uncoordinated one, the improvement is slightly lower
(Figure 7.13 (b)). The reason is that applications tend to execute in fewer clusters (the
fastest ones), and therefore the importance of coordinated rescheduling among providers
is reduced. As shown in Figure 7.14, the number of clusters per job is reduced in the
heterogeneous environment. Most applications are scheduled to one or two clusters,
whereas for the homogeneous environment a similar number of applications access two,
three, and four clusters.
Reducing the stretch factor has a direct impact on the user response time. Figures 7.15
and 7.16 plot the response time and slowdown reductions (%) obtained with coordinated
rescheduling against user run time overestimation (%). Smaller jobs have more chances
of backfilling than big ones; similar behaviour occurs in the heterogeneous environment.
We also analysed the user response time separately for multi- and single-cluster jobs.
Figure 7.17 presents the results for single-cluster jobs. Increasing user overestimations
actually reduces user response time for these jobs, which corroborates previous studies
on the effects of run time estimates on job scheduling [120]. For these jobs, coordinated
rescheduling improves user response time by up to 5% in relation to uncoordinated
rescheduling. The main benefits of higher run time accuracy and coordinated
rescheduling come from multi-cluster jobs, as illustrated in Figure 7.18.
We have calculated the system utilisation for user and system-generated estimations
and uncoordinated/coordinated rescheduling algorithms. The results are similar with a
difference of less than 1%. This difference may increase if we consider a competition
scenario with providers offering different levels of completion time guarantees. In such
a scenario, users tend to execute their applications on providers with more optimised
completion time guarantees. Figure 7.19 illustrates the average system utilisation level of
providers with different run time estimation approaches; the higher the accuracy of run
time predictions the higher the chances of attracting more users.
We have also performed experiments in Grid'5000. Due to the complexity of gathering
resources from several sites, usage policy restrictions of large-scale environments,
(Each panel plots system utilisation (%) against user run time overestimation (%),
comparing providers with user estimations and providers with system estimations.)
(a) SE = 0.5 UE
(b) SE = 0.2 UE
Figure 7.19: Impact of estimations on the system utilisation by attracting more users
through more optimised completion time guarantees.
and execution time of real experiments, we selected only three of the workloads used in
the simulations, with run time overestimation parameters of 50%, 100%, and 150%.
The goal of these experiments is to compare the results of simulation with execution in
the real system. PaJFit supports both simulation and real execution, using sockets as the
communication mechanism between the metascheduler, providers, and users. Therefore,
the implementation of the rescheduling algorithm is the same for both execution modes.
The main difference lies in the network delay and the order in which messages are
exchanged between the system components; in a real system the network delay is higher
and the messages require more complex treatment than in simulations. In spite of these
differences, we observe in Table 7.4 that, for these experiments, simulations and
executions in the real environment provided similar results, showing the practical
benefits of coordinated rescheduling in a real environment. As described in Section 7.2,
the required modification to an existing scheduling architecture is minimal.
7.6 Conclusion
This chapter presented a coordinated rescheduling algorithm for BoT applications
executing across multiple providers and analysed the impact of run time estimates on
these applications.
Due to inaccurate run time estimates, initial schedules have to be updated, and therefore,
when each provider reschedules tasks of a BoT application independently, such tasks may
have their completion time reduced locally, but not globally. Tasks of the same BoT can
be spread over time due to rescheduling. The main idea of coordinated rescheduling for
BoT applications is to consider the completion time of the entire BoT. Therefore, small
jobs can have more chances of backfilling without delaying the BoT applications.
Moreover, accurate run time estimates assist metaschedulers to better distribute the
tasks of BoT applications on multiple sites. Although system-generated predictions may
take time, the schedules produced by more accurate estimates pay off the profiling time,
since users obtain better response times than they would by simply overestimating
resource usage.
This chapter meets the three objectives proposed in Chapter 1 for BoT applications,
which concludes the core chapters of the thesis. In the following, we discuss the main findings
of the thesis and future research directions for resource co-allocation.
Chapter 8
Conclusions and Future Directions
This thesis contains an analysis of the impact of inaccurate run time estimates in three
scenarios: a single provider scheduling advance reservations, multiple providers schedul-
ing message passing applications, and multiple providers scheduling bag-of-tasks. It also
proposes co-allocation policies with rescheduling support for message passing and bag-
of-tasks application models, along with a description of the technical difficulties in
deploying these policies. The proposed policies consider four aspects: inaccurate run time esti-
mations of user applications, completion time guarantees, coordinated rescheduling, and
limited information access from resource providers.
As an important building block of resource co-allocation for message passing applica-
tions, we started the thesis by investigating flexible advance reservations. These advance
reservations have flexible start and completion time intervals, which can be explored by
schedulers to increase system utilisation when unexpected events happen. One common
event is the completion of executions before the expected time. We investigated the im-
portance of rescheduling advance reservations for system utilisation using four scheduling
heuristics under several workloads, reservation time intervals and inaccurate run time es-
timates. In addition, we studied cases when users accept an alternative offer from the
resource provider on failure to schedule the initial request. Our main finding on this study
is that system utilisation increases with the flexibility of request time intervals and the
time users allow this flexibility while waiting for resource access. This benefit is mainly
due to the ability of the scheduler to rearrange jobs in the scheduling queue, which re-
duces the fragmentation generated by advance reservations. This is particularly true when
users overestimate application run time.
Based on flexible advance reservations for single resource provider settings, we ex-
tended the concept of flexible time intervals for applications requiring co-allocation of
resources from multiple providers. We proposed a co-allocation model that relies on two
operations to reschedule requests: start time shifting and process remapping. By using this
model, metaschedulers can modify the start time of each job component and remap the
number of processors they use in each provider. From our experiments, using workloads
from real clusters, we showed that local jobs may not fill all the fragments in the schedul-
ing queues and hence rescheduling co-allocation requests reduces response time of both
local and multi-site jobs. Moreover, process remapping increases the chances of plac-
ing the tasks of multi-cluster jobs into a single cluster, thus eliminating the inter-cluster
network overhead.
From the deployment point of view of the adaptive resource co-allocation for mes-
sage passing applications, we observed that the use of start time shifting can be widely
adopted by several parallel applications. This operation is not application dependent since
it is only a shift on the start time of the user application. The process remap operation,
on the other hand, is application dependent, which may limit its adoption for some ap-
plications. To better understand how to use this operation in practice, we developed an
application-level scheduler for iterative parallel applications. We concluded that users can
remap the processes at the cost of overestimating the execution time to avoid
applications being aborted by the schedulers. To overcome this problem, metaschedulers can
use performance predictions, and in particular for iterative applications, the cost to obtain
predictions is negligible and requires no access to the user application source code.
Regarding resource co-allocation for BoT applications, we investigated how to distribute
tasks of the same application on providers that are not willing to disclose private
information to metaschedulers. We mainly focused on two types of information: the total
computing power of a resource provider and its local load. To keep this information
private, we introduced the concept of execution offers, in which resource providers advertise
their interest in executing an entire BoT application or only part of it without revealing
their load and total computing power. The main findings from this study are that
offer-based scheduling produces less delay for jobs that cannot meet deadlines, in
comparison to scheduling based on load availability (i.e. free time slots); that it is
therefore possible to keep providers' load private when scheduling multi-site BoT
applications; and that if providers publish their total computing power, they can have
more local jobs meeting deadlines.
As one of the key aspects of this thesis is the inaccurate run time estimates of user ap-
plications, we investigated the importance of accurate predictions when scheduling BoT
applications on multiple providers and how tasks from the same application should be
rescheduled. We observed that tasks of the same BoT can be spread over time due to
inaccurate run time estimates and environment heterogeneity. To minimise the effect of
having completion time variation of tasks from the same application, this thesis proposes
a coordinated rescheduling algorithm, which reduces response time both for users
accessing a single provider and for those accessing multiple providers. We also observed
that accurate run time estimates
assist metaschedulers to better distribute the tasks of BoT applications on multiple sites.
In order to obtain accurate run time estimates, users or the metascheduler can profile a
sample of tasks from the same application. We concluded that the cost to obtain run time
predictions pays off since users have better response times than simply overestimating
resource usages.
Rescheduling affects several aspects of distributed systems, such as system utilisation,
power consumption, and user response time. Rescheduling also has an impact on the
management of contracts, also called Service Level Agreements, in utility computing
environments. In these environments, users pay to access resources or services, and
providers have to guarantee the delivery of these services at a pre-established
Quality-of-Service level. For co-allocation users, several entities may participate in these
contracts, and hence managing issues such as violations becomes a complex task. Thus,
for the coming years, especially due to the increasing number of utility computing
centers around the world, researchers will face the challenge of developing and
improving policies for managing contracts involving multiple entities.
Virtualisation is another concept that will be heavily explored to provide transparency
to users when co-allocating multiple resources hosted in either a single or multiple
administrative domains. Virtual clusters can be dynamically formed to deploy
applications with various requirements [29]. Moreover, with the consolidation of Cloud
Computing, resource/service provisioning centers can avoid contract violations and
increase system utilisation by co-allocating resources from multiple parties on demand.
In the following sections we describe in more detail some of the future directions
identified during the development of this thesis.
(Figure: the user, with QoS requirements, establishes an SLA with the metascheduler
and service provider, which in turn establish SLAs with resource providers 1 to N.)
• When co-allocating resources from multiple cloud computing providers, the cost
to access resources becomes an important issue [58, 132]. Metaschedulers have to
consider prices from multiple resource providers, which impact the utility of the
user requesting resources.
As cloud computing technology evolves, cloud centers will provide different services
that users will want to compose to meet their demand. Therefore, co-allocation policies,
especially with re-planning features, will be fundamental to make cloud computing the
new utility service for individual users and organisations.
Chapter 2). However, this increase in execution time comes with the energy consumption
of clusters and communication devices (Figure 8.3). A future direction is to evaluate
energy consumption [71] when executing applications over multiple providers. This
research would require detailed power consumption monitoring of the resources
involved in the computation and communication. A possible outcome would be a new
threshold for inter-cluster communication overhead that pays off the benefits of
co-allocation.
(Figure 8.3: schedules of resources over time at each site, contrasting execution with
and without inter-cluster network overhead.)
• How long should the metascheduler delay the data transfer to increase the chances
of rescheduling options?
• When should resource providers notify the metascheduler about their interest in
rescheduling tasks of a data-intensive application?
• Workloads with deadlines. Current work on scheduling, including this thesis, uses
deadline generators based on a distribution function and/or on job sizes. It would
be interesting to investigate more methods for deadline generation;
• Run time rescheduling. This thesis investigated rescheduling for jobs waiting
for resources in scheduling queues. A possible extension is to consider
rescheduling policies for jobs already accessing resources.
• Admission control with other criteria. This thesis used expected completion time
as the criterion for admission control. Other criteria, such as chances of meeting
completion time proposals in case of failures or unexpected peak demand could be
also investigated.
References

[1] Abramson, D., Buyya, R., and Giddy, J. (2002). A computational economy for grid
computing and its implementation in the Nimrod-G resource broker. Future Generation
Computer Systems., 18(8):1061–1074.
[2] Abramson, D., Giddy, J., and Kotler, L. (2000). High performance parametric model-
ing with Nimrod/G: Killer application for the global grid? In Proceedings of the 14th
International Parallel and Distributed Processing Symposium (IPDPS’00), Cancun,
Mexico. IEEE Computer Society.
[4] Alhusaini, A. H., Raghavendra, C. S., and Prasanna, V. K. (2001). Run-time adapta-
tion for grid environments. In Werner, B., editor, International Parallel and Distributed
Processing Symposium (IPDPS’01), pages 864–874, Los Alamitos, California. IEEE
Computer Society.
[5] Altschul, S., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic
local alignment search tool. Journal of Molecular Biology, 215(3):403–410.
[6] Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G.,
Patterson, D., Rabkin, A., Stoica, I., et al. (2009). Above the clouds: A Berkeley view
of cloud computing. EECS Department, University of California, Berkeley, Tech. Rep.
UCB/EECS-2009-28.
[8] Auyoung, A., Grit, L., Wiener, J., and Wilkes, J. (2006). Service contracts and ag-
gregate utility functions. In Proceedings of the 15th International Symposium on High
Performance Distributed Computing (HPDC’06), Paris, France. IEEE.
[9] Azougagh, D., Yu, J.-L., Kim, J.-S., and Maeng, S. R. (2005). Resource co-allocation:
A complementary technique that enhances performance in grid computing environ-
ment. In Barolli, L., editor, International Conference on Parallel and Distributed Sys-
tems (ICPADS’05), volume 1, pages 36–42, Los Alamitos, California. IEEE Computer
Society.
[10] Azzedin, F., Maheswaran, M., and Arnason, N. (2004). A synchronous co-allocation
mechanism for grid computing systems. Cluster Computing, 7(1):39–49.
[11] Bahi, J. M., Contassot-Vivier, S., and Couturier, R. (2006). Performance compar-
ison of parallel programming environments for implementing AIAC algorithms. The
Journal of Supercomputing, 35(3):227–244.
[12] Bal, H. E., Plaat, A., Bakker, M. G., Dozy, P., and Hofman, R. F. H. (1998). Opti-
mizing parallel applications for wide-area clusters. In Proceedings of the 12th Inter-
national Parallel Processing Symposium / 9th Symposium on Parallel and Distributed
Processing (IPPS/SPDP’98).
[13] Beaumont, O., Carter, L., Ferrante, J., Legrand, A., Marchal, L., and Robert, Y.
(2008). Centralized versus distributed schedulers for bag-of-tasks applications. IEEE
Transactions on Parallel and Distributed Systems, 19(5):698–709.
[14] Benoit, A., Marchal, L., Pineau, J.-F., Robert, Y., and Vivien, F. (2008). Offline
and online master-worker scheduling of concurrent bags-of-tasks on heterogeneous
platforms. In Proceedings of the 22nd IEEE International Symposium on Parallel and
Distributed Processing (IPDPS’08), Miami, USA. IEEE Computer Society.
[15] Berman, F., Casanova, H., Chien, A. A., Cooper, K. D., Dail, H., Dasgupta, A.,
Deng, W., Dongarra, J., Johnsson, L., Kennedy, K., Koelbel, C., Liu, B., Liu, X.,
Mandal, A., Marin, G., Mazina, M., Mellor-Crummey, J. M., Mendes, C. L., Olugbile,
A., Patel, M., Reed, D. A., Shi, Z., Sievert, O., Xia, H., and YarKhan, A. (2005). New
grid scheduling and rescheduling methods in the GrADS project. International Journal
of Parallel Programming, 33(2-3):209–229.
[16] Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., Faerman, M., Figueira,
S. M., Hayes, J., Obertelli, G., Schopf, J. M., Shao, G., Smallen, S., Spring, N. T., Su,
A., and Zagorodnov, D. (2003). Adaptive computing on the grid using AppLeS. IEEE
Transactions on Parallel and Distributed Systems, 14(4):369–382.
[18] Bolze, R., Cappello, F., Caron, E., Daydé, M., Desprez, F., Jeannot, E., Jégou, Y.,
Lanteri, S., Leduc, J., Melab, N., et al. (2006). Grid’5000: a large scale and highly
reconfigurable experimental grid testbed. International Journal of High Performance
Computing Applications, 20(4):481.
[19] Braun, T. D., Siegel, H. J., Beck, N., Bölöni, L., Maheswaran, M., Reuther, A. I.,
Robertson, J. P., Theys, M. D., Yao, B., Hensgen, D. A., and Freund, R. F. (2001). A
comparison of eleven static heuristics for mapping a class of independent tasks onto
heterogeneous distributed computing systems. Journal of Parallel and Distributed
Computing, 61(6):810–837.
[21] Bucur, A. I. D. and Epema, D. H. J. (2003b). Priorities among multiple queues for
processor co-allocation in multicluster systems. In Bilof, R., editor, Annual Simulation
Symposium (ANSS’03), pages 15–27, Los Alamitos, California. IEEE Computer
Society.
[24] Buyya, R., Yeo, C., Venugopal, S., Broberg, J., and Brandic, I. (2009). Cloud com-
puting and emerging IT platforms: Vision, hype, and reality for delivering computing
as the 5th utility. Future Generation Computer Systems, 25(6):599–616.
[25] Capit, N., Costa, G. D., Georgiou, Y., Huard, G., Martin, C., Mounié, G., Neyron,
P., and Richard, O. (2005). A batch scheduler with high level components. In Interna-
tional Symposium on Cluster Computing and the Grid (CCGrid’05), pages 776–783,
Los Alamitos, California. IEEE Computer Society.
[26] Cappello, F., Caron, E., Daydé, M. J., Desprez, F., Jégou, Y., Primet, P. V.-B., Jean-
not, E., Lanteri, S., Leduc, J., Melab, N., Mornet, G., Namyst, R., Quétier, B., and
Richard, O. (2005). Grid’5000: a large scale and highly reconfigurable grid exper-
imental testbed. In International Conference on Grid Computing (GRID’05), pages
99–106, Los Alamitos, California. IEEE.
[27] Carriero, N. and Gelernter, D. (1989). How to write parallel programs: A guide to
the perplexed. ACM Computing Surveys, 21(3):323–357.
[28] Casanova, H., Legrand, A., Zagorodnov, D., and Berman, F. (2000). Heuristics
for scheduling parameter sweep applications in grid environments. In Proceedings of
the Heterogeneous Computing Workshop (HCW’00), pages 349–363, Cancun, Mexico.
IEEE Computer Society.
[29] Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., and Sprenkle, S. (2003). Dy-
namic virtual clusters in a grid site manager. In International Symposium on High-
Performance Distributed Computing (HPDC’03), pages 90–103, Los Alamitos, Cali-
fornia. IEEE Computer Society.
[30] Chen, Y. T. and Lee, K. H. (2001). A flexible service model for advance reservation.
Computer Networks, 37(3/4):251–262.
[31] Chiang, S.-H., Arpaci-Dusseau, A. C., and Vernon, M. K. (2002). The impact of
more accurate requested runtimes on production job scheduling performance. In Pro-
ceedings of the 8th International Workshop on Job Scheduling Strategies for Parallel
Processing (JSSPP’02), volume 2537 of Lecture Notes in Computer Science, pages
103–127, Edinburgh, Scotland, UK. Springer.
[32] Cirne, W. and Berman, F. (2002). Using moldability to improve the performance
of supercomputer jobs. Journal of Parallel and Distributed Computing, 62(10):1571–
1601.
[33] Cirne, W., Paranhos, D., Costa, L., Santos-Neto, E., Brasileiro, F., Sauvé, J., Silva,
F., Barros, C., and Silveira, C. (2003). Running bag-of-tasks applications on computational
grids: The MyGrid approach. In Proceedings of the International Conference on
Parallel Processing (ICPP’03), pages 407–416.
[34] Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W., and
Tuecke, S. (1998). A resource management architecture for metacomputing systems.
In Feitelson, D. G. and Rudolph, L., editors, International Workshop on Job Schedul-
ing Strategies for Parallel Processing (JSSPP’98), volume 1459 of Lecture Notes in
Computer Science, pages 62–82, Berlin. Springer.
[35] Czajkowski, K., Foster, I., and Kesselman, C. (1999). Resource co-allocation in
computational grids. In International Symposium on High Performance Distributed
Computing (HPDC’99), pages 219–228, Los Alamitos, California. IEEE Computer
Society.
[36] Czajkowski, K., Foster, I. T., Kesselman, C., Sander, V., and Tuecke, S. (2002).
SNAP: A protocol for negotiating service level agreements and coordinating re-
source management in distributed systems. In Feitelson, D. G., Rudolph, L., and
Schwiegelshohn, U., editors, Proceedings of the 8th International Workshop Job
Scheduling Strategies for Parallel Processing (JSSPP’02), Lecture Notes in Computer
Science, pages 153–183, Berlin. Springer.
[37] de Assunção, M. D., di Costanzo, A., and Buyya, R. (2009). Evaluating the cost-
benefit of using cloud computing to extend the capacity of clusters. In Kranzlmüller,
D., Bode, A., Hegering, H.-G., Casanova, H., and Gerndt, M., editors, Proceedings of
the 18th ACM International Symposium on High Performance Distributed Computing
(HPDC’09), pages 141–150, Garching, Germany.
[39] Deb, K., Thiele, L., Laumanns, M., and Zitzler, E. (2005). Scalable test problems
for evolutionary multiobjective optimization. Evolutionary Multiobjective Optimization.
[40] Decker, J. and Schneider, J. (2007). Heuristic scheduling of grid workflows sup-
porting co-allocation and advance reservation. In Schulze, B., Buyya, R., Navaux, P.,
Cirne, W., and Rebello, V., editors, International Symposium on Cluster Computing
and the Grid (CCGrid’07), pages 335–342, Los Alamitos, California. IEEE Computer
Society.
[41] Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Blackburn, K.,
Lazzarini, A., Arbree, A., Cavanaugh, R., et al. (2003). Mapping abstract complex
workflows onto grid environments. Journal of Grid Computing, 1(1):25–39.
[42] Desell, T. J., Szymanski, B. K., and Varela, C. A. (2008). Asynchronous genetic
search for scientific modeling on large-scale heterogeneous environments. In Pro-
ceedings of the 17th Heterogeneity in Computing Workshop (HCW’08), in conjunc-
tion with 22nd IEEE International Symposium on Parallel and Distributed Processing
(IPDPS’08).
[43] Dong, S., Karniadakis, G. E., and Karonis, N. T. (2005). Cross-site computations on
the TeraGrid. Computing in Science and Engineering, 7(5):14–23.
[45] Ernemann, C., Hamscher, V., Schwiegelshohn, U., Yahyapour, R., and Streit, A.
(2002a). On advantages of grid computing for parallel job scheduling. In Interna-
tional Symposium on Cluster Computing and the Grid (CCGrid’02), pages 39–, Los
Alamitos, California. IEEE Computer Society.
[46] Ernemann, C., Hamscher, V., Streit, A., and Yahyapour, R. (2002b). Enhanced
algorithms for multi-site scheduling. In Parashar, M., editor, Proceedings of the 3rd
International Workshop on Grid Computing (GRID’02), volume 2536 of Lecture Notes
in Computer Science, pages 219–231, Berlin. Springer.
[47] Farooq, U., Majumdar, S., and Parsons, E. W. (2006). A framework to achieve guar-
anteed QoS for applications and high system performance in multi-institutional grid
computing. In Proceedings of the 35th International Conference on Parallel Process-
ing (ICPP’06), pages 373–380, Columbus, USA. IEEE Computer Society.
[48] Feitelson, D. G., Rudolph, L., Schwiegelshohn, U., Sevcik, K. C., and Wong, P.
(1997). Theory and practice in parallel job scheduling. In Feitelson, D. G. and Rudolph,
L., editors, Proceedings of the 3rd Workshop on Scheduling Strategies for Parallel
Processing (JSSPP’97), volume 1291 of Lecture Notes in Computer Science, pages
1–34, Geneva, Switzerland. Springer.
[49] Ferrari, D., Gupta, A., and Ventre, G. (1997). Distributed advance reservation of
real-time connections. Multimedia Systems, 5(3):187–198.
[50] Foster, I. and Kesselman, C. (1999). The Grid: Blueprint for a New Computing
Infrastructure. Morgan-Kaufman, San Francisco, CA, EUA.
[51] Foster, I., Kesselman, C., Lee, C., Lindell, B., Nahrstedt, K., and Roy, A. (1999). A
distributed resource management architecture that supports advance reservations and
co-allocation. In Proceedings of the International Workshop on Quality of Service
(IWQoS’99), pages 27–36, Piscataway, New Jersey. IEEE Computer Society.
[52] Foster, I., Kesselman, C., and Tuecke, S. (2001). The anatomy of the grid: Enabling
scalable virtual organizations. International Journal of High Performance Computing
Applications, 15(3):200.
[53] Gray, J. and Lamport, L. (2006). Consensus on transaction commit. ACM Transac-
tions on Database Systems, 31(1):133–160.
[54] Gropp, W., Lusk, E., and Skjellum, A. (1999). Using MPI: portable parallel pro-
gramming with the message passing interface. MIT press.
[55] Haji, M. H., Gourlay, I., Djemame, K., and Dew, P. M. (2005). A SNAP-based com-
munity resource broker using a three-phase commit protocol: A performance study.
The Computer Journal, 48(3):333–346.
[56] He, L., Jarvis, S. A., Spooner, D. P., Chen, X., and Nudd, G. R. (2004). Dynamic
scheduling of parallel jobs with QoS demands in multiclusters and grids. In Proceed-
ings of the International Conference on Grid Computing (GRID’04).
[57] Iosup, A., Sonmez, O. O., Anoep, S., and Epema, D. H. J. (2008). The performance
of bags-of-tasks in large-scale distributed systems. In Proceedings of the 17th Inter-
national Symposium on High-Performance Distributed Computing (HPDC’08), pages
97–108, Boston, USA. ACM.
[58] Irwin, D. E., Grit, L. E., and Chase, J. S. (2004). Balancing risk and reward in
a market-based task service. In Proceedings of the 13th International Symposium
on High-Performance Distributed Computing (HPDC’04), pages 160–169, Honolulu,
USA. IEEE Computer Society.
[59] Islam, M., Balaji, P., Sabin, G., and Sadayappan, P. (2007). Analyzing and minimiz-
ing the impact of opportunity cost in QoS-aware job scheduling. In Proceedings of the
International Conference on Parallel Processing (ICPP’07), page 42, Xi-An, China.
IEEE Computer Society.
[60] Islam, M., Balaji, P., Sadayappan, P., and Panda, D. K. (2003). QoPS: A QoS
Based Scheme for Parallel Job Scheduling. In Proceedings of the 9th International
Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP’03), volume
2862 of Lecture Notes in Computer Science, pages 252–268, Seattle, USA. Springer.
[61] Islam, M., Balaji, P., Sadayappan, P., and Panda, D. K. (2004). Towards provision
of quality of service guarantees in job scheduling. In Proceedings of the IEEE Interna-
tional Conference on Cluster Computing (CLUSTER’04), pages 245–254, San Diego,
USA. IEEE Computer Society.
[62] Jardine, J., Snell, Q., and Clement, M. J. (2001). Livelock avoidance for meta-
schedulers. In Williams, A. D., editor, International Symposium on High Performance
Distributed Computing (HPDC’01), pages 141–146, Los Alamitos, California. IEEE
Computer Society.
[63] Jarvis, S. A., Spooner, D. P., Keung, H. N. L. C., Cao, J., Saini, S., and Nudd,
G. R. (2006). Performance prediction and its use in parallel and distributed computing
systems. Future Generation Computer Systems, 22(7):745–754.
[64] Jones, W. M., Ligon III, W. B., and Shrivastava, N. (2006). The impact of information
availability and workload characteristics on the performance of job co-allocation
in multi-clusters. In International Conference on Parallel and Distributed Systems
(ICPADS’06), pages 123–134, Los Alamitos, California. IEEE Computer Society.
[65] Karonis, N. T., Toonen, B. R., and Foster, I. T. (2003). MPICH-G2: A Grid-enabled
implementation of the Message Passing Interface. Journal of Parallel and Distributed
Computing, 63(5):551–563.
[66] Kaushik, N., Figueira, S., and Chiappari, S. A. (2007). Resource co-allocation using
advance reservations with flexible time-windows. SIGMETRICS Performance Evalua-
tion Review, 35(3):46–48.
[67] Kaushik, N. R., Figueira, S. M., and Chiappari, S. A. (2006). Flexible time-windows
for advance reservation scheduling. In Proceedings of the 14th International Sym-
posium on Modeling, Analysis, and Simulation of Computer and Telecommunication
Systems (MASCOTS’06), pages 218–225, Monterey, USA. IEEE Computer Society.
[68] Kim, J.-K., Shivle, S., Siegel, H. J., Maciejewski, A. A., Braun, T. D., Schneider,
M., Tideman, S., Chitta, R., Dilmaghani, R. B., and Joshi, R. (2007). Dynamically
mapping tasks with priorities and multiple deadlines in a heterogeneous environment.
Journal of Parallel and Distributed Computing, 67(2):154–169.
[70] Kuo, D. and Mckeown, M. (2005). Advance reservation and co-allocation protocol
for grid computing. In Stockinger, H., Buyya, R., and Perrott, R., editors, International
Conference on e-Science and Grid Technologies (e-Science’05), pages 164–171, Los
Alamitos, California. IEEE Computer Society.
[72] Lee, C. B., Schwartzman, Y., Hardy, J., and Snavely, A. (2004). Are user runtime
estimates inherently inaccurate? In Proceedings of the 10th International Workshop on
Job Scheduling Strategies for Parallel Processing (JSSPP’04), volume 3277 of Lecture
Notes in Computer Science, pages 253–263, New York, USA. Springer.
[75] Li, J. and Yahyapour, R. (2006). Negotiation model supporting co-allocation for
grid scheduling. In Gannon, D., Badia, R. M., and Buyya, R., editors, International
Conference on Grid Computing (GRID’06), pages 254–261, Los Alamitos, California.
IEEE Computer Society.
[76] Li, Z. and Parashar, M. (2006). A decentralized computational infrastructure for
grid-based parallel asynchronous iterative applications. Journal of Grid Computing,
4(4):355–372.
[77] Lublin, U. and Feitelson, D. G. (2003). The workload on parallel supercomputers:
modeling the characteristics of rigid jobs. Journal of Parallel and Distributed Comput-
ing, 63(11):1105–1122.
[78] Maclaren, J., Keown, M. M., and Pickles, S. (2006). Co-allocation, fault tolerance
and grid computing. In Cox, S. J., editor, UK e-Science All Hands Meeting (AHM’06),
pages 155–162. NeSC Press.
[79] Maghraoui, K. E., Desell, T. J., Szymanski, B. K., and Varela, C. A. (2009). Mal-
leable iterative MPI applications. Concurrency and Computation: Practice and Expe-
rience, 21(3):393–413.
[80] Maheswaran, M., Ali, S., Siegel, H. J., Hensgen, D. A., and Freund, R. F. (1999).
Dynamic mapping of a class of independent tasks onto heterogeneous computing sys-
tems. Journal of Parallel and Distributed Computing, 59(2):107–131.
[81] Milojičić, D. S., Douglis, F., Paindaveine, Y., Wheeler, R., and Zhou, S. (2000).
Process migration. ACM Computing Surveys, 32(3):241–299.
[82] Mohamed, H. H. and Epema, D. H. J. (2004). An evaluation of the close-to-files pro-
cessor and data co-allocation policy in multiclusters. In International Conference on
Cluster Computing (CLUSTER’04), pages 287–298, Los Alamitos, California. IEEE
Computer Society.
[83] Mohamed, H. H. and Epema, D. H. J. (2005). Experiences with the koala co-
allocating scheduler in multiclusters. In Proceedings of the International Symposium
on Cluster Computing and the Grid (CCGrid’05), pages 784–791, Los Alamitos, Cal-
ifornia. IEEE Computer Society.
[84] Mohamed, H. H. and Epema, D. H. J. (2008). KOALA: a co-allocating grid sched-
uler. Concurrency and Computation: Practice and Experience, 20(16):1851–1876.
[85] Mu’alem, A. W. and Feitelson, D. G. (2001). Utilization, predictability, workloads,
and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans-
actions on Parallel and Distributed Systems, 12(6):529–543.
[86] Naiksatam, S. and Figueira, S. (2007). Elastic reservations for efficient bandwidth
utilization in LambdaGrids. Future Generation Computer Systems, 23(1):1–22.
[87] Netto, M. A. S., Bubendorfer, K., and Buyya, R. (2007). SLA-based advance reser-
vations with flexible and adaptive time QoS parameters. In Proceedings of the 5th Inter-
national Conference on Service-Oriented Computing, pages 119–131, Vienna, Austria.
[88] Netto, M. A. S. and Buyya, R. (2007). Impact of adaptive resource allocation re-
quests in utility cluster computing environments. In Proceedings of the 7th IEEE Inter-
national Symposium on Cluster Computing and the Grid (CCGrid’07), Los Alamitos,
California. IEEE Computer Society.
[90] Netto, M. A. S. and Buyya, R. (2010). Resource co-allocation in grid computing
environments. In Antonopoulos, N., Exarchakos, G., Li, M., and Liotta, A., editors,
Handbook of Research on P2P and Grid Systems for Service-Oriented Computing:
Models, Methodologies and Applications. IGI Global.
[91] Nurmi, D., Brevik, J., and Wolski, R. (2007). QBETS: queue bounds estimation
from time series. In Proceedings of the 13th International Workshop on Job Scheduling
Strategies for Parallel Processing (JSSPP’07), volume 4942 of Lecture Notes in
Computer Science, pages 76–101. Springer.
[92] Pande, V. S., Baker, I., Chapman, J., Elmer, S., Larson, S. M., Rhee, Y. M., Shirts,
M. R., Snow, C. D., Sorin, E. J., and Zagrovic, B. (2003). Atomistic protein folding
simulations on the submillisecond time scale using worldwide distributed computing.
Peter Kollman Memorial Issue, Biopolymers, 68(1):91–109.
[93] Pant, A. and Jafri, H. (2004). Communicating efficiently on cluster based grids with
MPICH-VMI. In Proceedings of the International Conference on Cluster Computing
(CLUSTER’04), pages 23–33, Los Alamitos, California. IEEE Computer Society.
[94] Park, J. (2004). A deadlock and livelock free protocol for decentralized internet
resource coallocation. IEEE Transactions on Systems, Man, and Cybernetics, Part A,
34(1):123–131.
[95] Parkhill, D. (1966). The challenge of the computer utility. Addison-Wesley Educa-
tional Publishers Inc., US.
[97] Röblitz, T. and Reinefeld, A. (2005). Co-reservation with the concept of virtual re-
sources. In Proceedings of the International Symposium on Cluster Computing and the
Grid (CCGrid’05), pages 398–406, Los Alamitos, California. IEEE Computer Society.
[98] Röblitz, T., Schintke, F., and Reinefeld, A. (2006). Resource reservations with fuzzy
requests. Concurrency and Computation: Practice and Experience, 18(13):1681–
1703.
[99] Röblitz, T., Schintke, F., and Wendler, J. (2004). Elastic grid reservations with
user-defined optimization policies. In Proceedings of the Workshop on Adaptive Grid
Middleware (AGridM’04), Los Alamitos, California. IEEE Computer Society.
[100] Rochwerger, B., Breitgand, D., Levy, E., Galis, A., Nagin, K., Llorente, I. M.,
Montero, R., Wolfsthal, Y., Elmroth, E., Caceres, J., Ben-Yehuda, M., Emmerich, W.,
and Galan, F. (2009). The reservoir model and architecture for open federated cloud
computing. IBM Journal of Research and Development, 53(4).
[101] Romanazzi, G. and Jimack, P. K. (2008). Parallel performance prediction for nu-
merical codes in a multi-cluster environment. In Proceedings of the 2008 International
Multiconference on Comp. Science and Information Technology (IMCSIT’08), Wisla,
Poland.
[102] Sadjadi, S. M., Shimizu, S., Figueroa, J., Rangaswami, R., Delgado, J., Duran, H.,
and Collazo-Mojica, X. J. (2008). A modeling approach for estimating execution time
of long-running scientific applications. In Proceedings of the 22nd IEEE International
Symposium on Parallel and Distributed Processing (IPDPS’08).
[104] Shmueli, E. and Feitelson, D. G. (2005). Backfilling with lookahead to optimize the
packing of parallel jobs. Journal of Parallel and Distributed Computing, 65(9):1090–1107.
[105] Siddiqui, M., Villazón, A., and Fahringer, T. (2006). Grid capacity planning
with negotiation-based advance reservation for optimized QoS. In Proceedings of the
ACM/IEEE Conference on High Performance Networking and Computing (SC’06),
Tampa, USA. ACM Press.
[106] Sievert, O. and Casanova, H. (2004). A simple MPI process swapping architec-
ture for iterative applications. International Journal of High Performance Computing
Applications, 18(3):341–352.
[109] Singh, G., Kesselman, C., and Deelman, E. (2007). A provisioning model and its
comparison with best-effort for performance-cost optimization in grids. In Proceed-
ings of the 16th International Symposium on High-Performance Distributed Computing
(HPDC’07), pages 117–126, Monterey, USA. ACM.
[110] Smallen, S., Casanova, H., and Berman, F. (2001). Applying scheduling and tun-
ing to on-line parallel tomography. In Proceedings of the ACM/IEEE Conference on
Supercomputing (SC’01), Denver, USA. ACM.
[111] Smith, W., Taylor, V. E., and Foster, I. T. (1999). Using run-time predictions to esti-
mate queue wait times and improve scheduler performance. In Proceedings of the International
Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP’99),
volume 1659 of Lecture Notes in Computer Science, pages 202–219. Springer.
[112] Snell, Q., Clement, M. J., Jackson, D. B., and Gregory, C. (2000). The performance
impact of advance reservation meta-scheduling. In Feitelson, D. G. and Rudolph, L.,
editors, Proceedings of the International Workshop on Job Scheduling Strategies for
Parallel Processing (JSSPP’00), volume 1911 of Lecture Notes in Computer Science,
pages 137–153, Berlin. Springer.
[113] Sonmez, O., Mohamed, H., and Epema, D. (2006). Communication-aware job
placement policies for the KOALA grid scheduler. In Proceedings of the International
Conference on e-Science and Grid Computing (e-Science’06), page 79, Los Alamitos,
California. IEEE Computer Society.
[114] Srinivasan, S., Subramani, V., Kettimuthu, R., Holenarsipur, P., and Sadayappan, P.
(2002). Effective selection of partition sizes for moldable scheduling of parallel jobs.
In Proceedings of the 9th International Conference on High Performance Comput-
ing (HiPC’02), volume 2552 of Lecture Notes in Computer Science, pages 174–183.
Springer.
[115] Takefusa, A., Nakada, H., Kudoh, T., Tanaka, Y., and Sekiguchi, S. (2007). Gri-
dARS: an advance reservation-based grid co-allocation framework for distributed com-
puting and network resources. In Frachtenberg, E. and Schwiegelshohn, U., editors,
Proceedings of the International Workshop on Job Scheduling Strategies for Parallel
Processing (JSSPP’07), Lecture Notes in Computer Science, Berlin. Springer.
[116] Takemiya, H., Tanaka, Y., Sekiguchi, S., Ogata, S., Kalia, R. K., Nakano, A., and
Vashishta, P. (2006). Sustainable adaptive grid supercomputing: multiscale simulation
of semiconductor processing across the pacific. In Proceedings of the Conference on
High Performance Networking and Computing (SC’06), page 106, New York, USA.
ACM Press.
[118] Tsafrir, D., Etsion, Y., and Feitelson, D. G. (2005). Modeling user runtime esti-
mates. In Proceedings of the 11th International Workshop Job Scheduling Strategies
for Parallel Processing (JSSPP’05), volume 3834 of Lecture Notes in Computer Sci-
ence, pages 1–35. Springer.
[119] Tsafrir, D., Etsion, Y., and Feitelson, D. G. (2007). Backfilling using system-
generated predictions rather than user runtime estimates. IEEE Transactions on Paral-
lel and Distributed Systems, 18(6):789–803.
[120] Tsafrir, D. and Feitelson, D. G. (2006). The dynamics of backfilling: Solving the
mystery of why increased inaccuracy may help. In Proceedings of the IEEE International
Symposium on Workload Characterization (IISWC’06), San Jose, USA.
[121] Vazhkudai, S. (2003). Enabling the co-allocation of grid data transfers. In Werner,
B., editor, Proceedings of the International Workshop on Grid Computing (GRID’03),
pages 44–51, Los Alamitos, California. IEEE Computer Society.
[122] Vecchiola, C., Kirley, M., and Buyya, R. (2009). Multi-objective problem solving
with offspring on enterprise clouds. In Proceedings of the 10th International Confer-
ence on High-Performance Computing in Asia-Pacific Region (HPC Asia’09).
[123] Venugopal, S., Buyya, R., and Ramamohanarao, K. (2006). A taxonomy of data
grids for distributed data sharing, management, and processing. ACM Computing Sur-
veys, 38(1).
[124] Vieira, G. E., Herrmann, J. W., and Lin, E. (2003). Rescheduling manufacturing
systems: A framework of strategies, policies, and methods. Journal of Scheduling,
6(1):39–62.
[125] Viswanathan, S., Veeravalli, B., and Robertazzi, T. G. (2007). Resource-aware dis-
tributed scheduling strategies for large-scale computational cluster/grid systems. IEEE
Transactions on Parallel and Distributed Systems, 18(10):1450–1461.
[126] Wilkinson, B. and Allen, M. (1998). Parallel programming: techniques and ap-
plications using networked workstations and parallel computers. Prentice-Hall, Inc.
Upper Saddle River, NJ, USA.
[127] Wu, Y.-L., Huang, W., Lau, S.-C., Wong, C. K., and Young, G. H. (2002). An effec-
tive quasi-human based heuristic for solving the rectangle packing problem. European
Journal of Operational Research, 141(2):341–358.
[128] Xu, D., Nahrstedt, K., and Wichadakul, D. (2001). QoS and contention-aware
multi-resource reservation. Cluster Computing, 4(2):95–107.
[129] Yang, C.-T., Yang, I.-H., Wang, S.-Y., Hsu, C.-H., and Li, K.-C. (2007). A
recursively-adjusting co-allocation scheme with cyber-transformer in data grids. Fu-
ture Generation Computer Systems.
[130] Yang, L. T., Ma, X., and Mueller, F. (2005). Cross-platform performance predic-
tion of parallel applications using partial execution. In Proceedings of the ACM/IEEE
Conference on High Performance Networking and Computing (SC’05).
[131] Yeo, C. S. and Buyya, R. (2007a). Integrated risk analysis for a commercial com-
puting service. In Proceedings of the 21st International Parallel and Distributed
Processing Symposium (IPDPS’07), pages 1–10, Long Beach, USA. IEEE.
[132] Yeo, C. S. and Buyya, R. (2007b). Pricing for utility-driven resource management
and allocation in clusters. International Journal of High Performance Computing Ap-
plications, 21(4):405.
[133] Yoshimoto, K., Kovatch, P. A., and Andrews, P. (2005). Co-scheduling with
user-settable reservations. In Feitelson, D. G., Frachtenberg, E., Rudolph, L., and
Schwiegelshohn, U., editors, Proceedings of the International Workshop on Job
Scheduling Strategies for Parallel Processing (JSSPP’05), volume 3834 of Lecture
Notes in Computer Science, pages 146–156, Berlin. Springer.
[134] Yu, J. and Buyya, R. (2005). A taxonomy of workflow management systems for
grid computing. Journal of Grid Computing, 3(3-4):171–200.
[135] Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C. M., and da Fonseca, V. G. (2003).
Performance assessment of multiobjective optimizers: An analysis and review. IEEE
Transactions on Evolutionary Computation, 7(2):117–132.