AppNote Work Load Management in RedHawk-SC
RedHawk-SC
Version: 2020.04.29
Table of Contents
1. Introduction
2. Creating and Launching Workers
2.1 Creating Launcher Objects
2.2 Auto-Launching Workers
2.3 Explicit Launching of Workers
2.4 Registering multiple default launchers
2.5 Defining custom Launchers
2.6 Reserving certain launchers for certain jobs
2.7 Save (and reload) the default configuration for creating launchers
1. Introduction
RedHawk-SC is unique in its ability to automatically distribute the complex tasks involved in the import, extraction, logic propagation and simulation of complex integrated circuits. The only interaction required from the user to enable this modern, fully distributed analysis system is to provide the syntax for requesting a CPU resource with enough memory. The provided syntax is used to launch processes, known as workers, in the compute farm. In RedHawk-SC, each worker executes its job on a single core of the execution host.
The workers are used by RedHawk-SC for distributed processing. Both the master RedHawk-SC process (the one the user directly invokes) and the worker processes use the same RedHawk-SC executable. A launcher script is used to start the worker processes on the same and/or remote execution hosts. Out of the box, RedHawk-SC supports creating launchers for LSF, UGE and RTDA NC compute farms. It can also create launchers based on the SSH protocol. In addition, RedHawk-SC provides the infrastructure for the user to define the launcher process for any custom grid.
2. Creating and Launching Workers
2.1 Creating Launcher Objects
Listing 1 provides an example launcher object created for an LSF farm, allocating processors on the same host, with each job consuming 16GB and using the RHEL7 platform. We use the API create_lsf_launcher to create the LSF launcher object. We need to provide an identification string for the launcher object, given as user_label in this example.
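A minimal sketch along the lines of Listing 1 (the bsub queue and resource options shown here are site-specific assumptions; the platform selection string in particular varies by installation):
bsub_command = 'bsub -q all_hosts -R "rusage[mem=16000]" -R "span[hosts=1]"'
ll = create_lsf_launcher('user_label', bsub_command)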
The API create_rtda_launcher can be used to create launcher objects for the RTDA NC Farm.
For each of the examples provided, we can also specify the number of workers to be launched per grid submission by providing the argument num_workers_per_launch.
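For example (a sketch reusing the hypothetical bsub command above):
ll = create_lsf_launcher('user_label', bsub_command, num_workers_per_launch=4)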
RedHawk-SC also provides ways to create SSH-based launchers. Instead of relying on the compute cluster to manage and launch jobs, an SSH-based launcher starts the workers directly on the execution hosts using the SSH protocol.
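A minimal sketch (the API name create_ssh_launcher and its host-list argument are assumptions here, not confirmed syntax):
ll = create_ssh_launcher('ssh1', ['host1', 'host2'])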
We can also launch workers on the same host as the master with the API create_local_launcher. With access to a sufficiently large machine, local launchers can be used to create the workers required for the jobs.
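For example (a sketch, assuming the local launcher needs only an identifying label):
ll = create_local_launcher('local1')
ll.launch(8)   # start 8 workers on the master host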
When creating a launcher, if we need a specific function to be executed when each worker starts, we can configure it using the argument initial_exec_function. For example,
def worker_startup_func():
    # script commands to be executed at the start of each worker
    gp_print('New worker launched')
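The function can then be passed when creating a launcher (the exact placement of the keyword argument is an assumption based on the argument name above):
ll = create_local_launcher('local1', initial_exec_function=worker_startup_func)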
2.2 Auto-Launching Workers
RedHawk-SC provides the API register_default_launcher, which informs the system of the launcher object to be used when a new worker is needed. By registering a created launcher as the default launcher, RedHawk-SC can start workers on demand. The view system in RedHawk-SC allows each view to request as many workers as it considers "optimal" for a given design. The user can impose constraints: limiting the total number of workers launched, starting with a minimum number of workers, or delaying view creation until a certain percentage of the optimal workers are online.
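For example (a sketch following the pattern of the UGE examples later in this section; the qsub command is site-specific):
qsub_command = "qsub -V -cwd -j y -b y -o run.out -q all_hosts -l mfree=16G -l platform='linux64e7'"
ll = create_uge_launcher('user_label', qsub_command)
register_default_launcher(ll, min_num_workers=16, max_num_workers=600)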
Here, we have created a UGE launcher object ll and have asked the system to use it as the "default launcher". Whenever a view requires workers, the system uses the specified default launcher to submit a request to the grid to bring up a worker. We have also added constraints for the minimum and maximum number of workers. Even though RedHawk-SC can begin the analysis as soon as at least one worker is online, here we are explicitly requesting the system to have a minimum of 16 workers before the run can begin. The system is also capped at a maximum of 600 workers. This limit implies that RedHawk-SC, at any given time, can execute at most 600 jobs. Whenever more jobs need to run, RedHawk-SC waits until one of the running jobs finishes before starting its next scheduled jobs.
By default, RedHawk-SC will wait as long as it takes for at least one worker to come online. The user is
free to change the default behavior to abort the run if no workers have come online even after waiting
for a specified time.
qsub_command = "qsub -V -cwd -j y -b y -o run.out -q all_hosts -l mfree=16G -l
platform='linux64e7'"
ll = create_uge_launcher('user_label', qsub_command)
register_default_launcher(ll, min_num_workers=16, max_num_workers=100, time_out=3600)
Here, the user needs a minimum of 16 workers for the run to start. But the user has also specified that
the run needs to abort if no workers are available even after the specified time out period of 3600 seconds
(1 hour).
qsub_command = "qsub -V -cwd -j y -b y -o run.out -q all_hosts -l mfree=16G -l
platform='linux64e7'"
ll = create_uge_launcher('user_label', qsub_command)
register_default_launcher(ll, min_num_workers=16, max_num_workers=100,
wait_for_workers_time_out=300)
Sometimes it is prudent to wait for a certain percentage of the workers requested by a view to be online before starting the view creation. This improves performance and gives better control over peak worker memory. In the example below, we want RedHawk-SC to wait until at least 70% of the suggested number of workers for AnalysisView are available before proceeding with the view creation. We can also use the soft time out mentioned earlier for finer control over the run.
qsub_command = "qsub -V -cwd -j y -b y -o run.out -q all_hosts -l mfree=16G -l
platform='linux64e7'"
ll = create_uge_launcher('user_label', qsub_command)
register_default_launcher(ll, min_num_workers=16, max_num_workers=100,
wait_for_workers={AnalysisView:0.7, ScenarioView:0.5},
wait_for_workers_time_out=360)
We have specified that 50% of the optimal workers for ScenarioView and 70% of the optimal workers for AnalysisView need to be available before the respective view creation can start. For both views, we will wait only 360 seconds before proceeding with the available number of workers.
2.4 Registering multiple default launchers
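As an illustration, the three launchers used in the function below might be created as follows (a sketch; the creation calls and the bsub command are assumptions, while the labels match the name checks in the function):
launcher1 = create_local_launcher('local1')
launcher2 = create_lsf_launcher('l2', 'bsub -q all_hosts -R "rusage[mem=16000]"')
launcher3 = create_lsf_launcher('l3', 'bsub -q all_hosts -R "rusage[mem=16000]"')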
def my_launcher_func(last_launcher_name):
    # rotate through the launchers based on the one used for the last worker
    if last_launcher_name == 'local1':
        return launcher2
    elif last_launcher_name == 'l2':
        return launcher3
    else:
        return launcher1

register_default_launcher(launcher_func=my_launcher_func)
We have three launcher objects created in RedHawk-SC. While launcher1 is a local launcher, launcher2 and launcher3 are LSF launchers. The specified launcher_func works by checking the name of the launcher object used to launch the last worker. It then ensures that the workers are launched in the order launcher1, launcher2, launcher3. We are not registering a default launcher object here, but rather defining the function to be called whenever a new launcher object is required.
launcher1 = gp.create_sge_launcher('sge1', 'qsub -q queue1')
launcher2 = gp.create_sge_launcher('sge2', 'qsub -q queue2')
launcher3 = gp.create_sge_launcher('sge3', 'qsub -q queue3')
launchers_with_limits = [(launcher1, 20), (launcher2, 10), (launcher3, 60)]

def my_launcher_func(launcher_name):
    # launchers that have not yet reached their per-launcher worker limit
    available_launchers = [ll for ll, limit in launchers_with_limits
                           if ll.get_num_launched() < limit]
    ll = None
    for ll in available_launchers:
        # prefer a launcher different from the one used for the last worker
        if ll.get_name() != launcher_name:
            break
    if ll is None:  # fail condition, in case max_num_workers is not set right
        gp_assert(0, 'fall back launcher used')
        ll = launcher3
    return ll

gp.register_default_launcher(launcher_func=my_launcher_func)
In the above example, we have defined three SGE launchers and have specified the maximum number of workers that can be launched with each launcher. In the function, we collect all the launchers that can still be used, based on the number of workers already launched by each. We then return the first launcher that is different from the previously used launcher, falling back to launcher3 in case of any errors.
2.6 Reserving certain launchers for certain jobs
The user needs to be sure that multiple CPUs will actually be consumed by the jobs launched with such a launcher. For grid-based launchers like LSF or UGE, the grid-specific commands can be used for this purpose. For example, the -n 3 option in the LSF launcher command will reserve 3 CPUs for each worker launched with this launcher.
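For example, the regular launcher and the bigger launcher discussed below might be created as follows (a sketch; the bsub commands and labels are illustrative):
launcher = create_lsf_launcher('regular', 'bsub -q all_hosts -R "rusage[mem=16000]"')
big_launcher = create_lsf_launcher('big', 'bsub -q all_hosts -n 4 -R "rusage[mem=64000]"')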
big_launcher.set_jobs(['cpm.write_spice_deck*', 'cpm.run_asim_power_model*'])
big_launcher.launch(1)
register_default_launcher(launcher)
Here, we create a regular launcher and a launcher for bigger jobs (requesting bigger resources). Using the set_jobs API of the created launcher object, we can direct RedHawk-SC to pair certain launcher objects (and their corresponding workers) with certain jobs. The set_jobs API takes a list of jobs, or regular expression patterns of jobs, to be assigned. In this example, we have workers from both launchers online. Only the workers created with big_launcher will be used for the CPM-specific jobs (cpm.write_spice_deck* and cpm.run_asim_power_model*), but all available workers will be used for the other scheduled jobs.
If we need to pair the workers and jobs exclusively, we need to use the API set_exclusive_jobs. In the sketch below, we use a regular expression to define the pattern of the jobs to be launched on workers created with the ll_big launcher. These workers will be used exclusively for the specified patterns because we are using the API set_exclusive_jobs.
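A sketch of this (the launcher creation command, the pattern, and the worker count are illustrative):
ll_big = create_lsf_launcher('big', 'bsub -q all_hosts -n 4 -R "rusage[mem=64000]"')
ll_big.set_exclusive_jobs(['cpm.*'])
ll_big.launch(2)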
2.7 Save (and reload) the default configuration for creating launchers
Instead of specifying the launcher type for every analysis, the user can create a config file in the home
directory from where RedHawk-SC can pick up the default launcher configuration to be used. For
example, by creating a file called ~/.seascape_rc/launcher.config with the following
contents,
[{'launcher_name': 'uge_launcher',
  'launcher_command': 'qsub -l mfree=16G',
  'launcher_type': 'uge',
  'num_workers_per_launch': 1}]
RedHawk-SC will use a UGE launcher with the specified command every time it needs a new worker and no default launcher has been registered with register_default_launcher. In addition to the keys shown in the example dictionary, RedHawk-SC recognizes optional and mandatory keys for ~/.seascape_rc/launcher.config. The required keys are 'launcher_name' and 'launcher_type'.