Platform MPI™ 5.6.5 User’s Guide
Version 5.6.5
September 29, 2008 11:58 am
Comments to: [email protected]
Support: [email protected]
Copyright © 1994-2008, Platform Computing Inc.
Although the information in this document has been carefully reviewed, Platform Computing Inc.
(“Platform”) does not warrant it to be free of errors or omissions. Platform reserves the right to make
corrections, updates, revisions or changes to the information in this document.
UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM, THE PROGRAM DESCRIBED
IN THIS DOCUMENT IS PROVIDED “AS IS” AND WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN
NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO ANYONE FOR SPECIAL,
COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT
LIMITATION ANY LOST PROFITS, DATA, OR SAVINGS, ARISING OUT OF THE USE OF OR
INABILITY TO USE THIS PROGRAM.
We’d like to hear from you: You can help us make this document better by telling us what you think of the content, organization,
and usefulness of the information. If you find an error, or just want to make a suggestion for
improving this document, please address your comments to [email protected].
Your comments should pertain only to Platform documentation. For product support, contact
[email protected].
Document redistribution and translation: This document is protected by copyright and you may not redistribute or translate it
into another language, in part or in whole.
Internal redistribution: You may only redistribute this document internally within your organization (for example, on an
intranet) provided that you continue to check the Platform Web site for updates and update your
version of the documentation. You may not make it available to your organization over the Internet.
Trademarks: LSF is a registered trademark of Platform Computing Inc. in the United States and in other
jurisdictions.
ACCELERATING INTELLIGENCE, PLATFORM COMPUTING, PLATFORM SYMPHONY,
PLATFORM JOBSCHEDULER, PLATFORM ENTERPRISE GRID ORCHESTRATOR,
PLATFORM EGO, and the PLATFORM and PLATFORM LSF logos are trademarks of Platform
Computing Inc. in the United States and in other jurisdictions.
UNIX is a registered trademark of The Open Group in the United States and in other jurisdictions.
Microsoft is either a registered trademark or a trademark of Microsoft Corporation in the United
States and/or other countries.
Windows is a registered trademark of Microsoft Corporation in the United States and other countries.
Other products or services mentioned in this document are identified by the trademarks or service
marks of their respective owners.
Third-party license agreements: http://www.platform.com/Company/third.part.license.htm
Third-party copyright notices: http://www.platform.com/Company/Third.Party.Copyright.htm
Table of Contents
Chapter 1 Introduction ..............................................................................................1
1.1 Platform MPI product context ................................................................................. 1
1.2 Support .............................................................................................................. 2
1.2.1 Platform MPI FAQ ............................................................................................ 2
1.2.2 Platform MPI release documents ........................................................................ 2
1.2.3 Problem reports .............................................................................................. 2
1.2.4 Platforms supported ......................................................................................... 3
1.2.5 Licensing ........................................................................................................ 3
1.2.6 Feedback ........................................................................................................ 3
1.3 How to read this guide .......................................................................................... 3
1.3.1 Acronyms and abbreviations ............................................................................ 4
1.3.2 Terms ............................................................................................................ 5
1.3.3 Typographic conventions .................................................................................. 6
This manual is written for users who have a basic programming knowledge of C or Fortran, as
well as an understanding of MPI.
Figure 1.1 shows a simplified view of the underlying architecture of clusters using Platform MPI:
a number of compute nodes are connected in an Ethernet network, through which a
front-end interfaces the cluster with the corporate network. A high performance interconnect can
be attached to service the communication requirements of key applications.
The front-end imports services like file systems from the corporate network to allow users to run
applications and access their data.
Platform MPI implements the MPI standard for a number of popular high performance
interconnects, such as Gigabit Ethernet, Infiniband and Myrinet.
While the high performance interconnect is optional, the networking infrastructure is mandatory.
Without it the nodes in the cluster will have no way of sharing resources. TCP/IP functionality
implemented by the Ethernet network enables the front-end to issue commands to the nodes,
provide them with data and application images, and collect results from the processing the
nodes perform.
CPU-intensive parallel applications use a programming library called MPI (Message Passing
Interface), the state-of-the-art library for high performance computing. Note that the MPI library
itself is NOT described within this manual; MPI is defined by a standards committee, and the API
and its user guides are available free of charge on the Internet. A link to the MPI Standard and other
MPI resources can be found in chapter 7, "Related documentation".
Platform MPI consists of Platform's implementation of the MPI programming library and the
necessary support programs to launch and run MPI applications. This manual often uses the
term Platform MPI to refer to the specifics of the MPI library itself, and not the support applications.
1.2 Support
1.2.1 Platform MPI FAQ
An updated list of Frequently Asked Questions is posted on http://www.platform.com.
In addition, for users who have installed Platform MPI, the version of the FAQ that was current
when Platform MPI was installed is available as a text file:
in UNIX: ${MPI_HOME}/doc/ScaMPI/FAQ
in Windows: %MPI_HOME%\doc\FAQ.TXT
in Windows: %MPI_HOME%\doc
in UNIX: ${MPI_HOME}/doc/ScaMPI
1.2.5 Licensing
Platform MPI is licensed using the Platform license manager system. In order to run Platform MPI,
a valid demo or permanent license must be obtained. Customers with valid software
maintenance contracts with Platform may request this directly from [email protected].
All other requests, including DEMO licenses, should be directed to [email protected].
1.2.6 Feedback
Platform appreciates any suggestions users may have for improving both this Platform MPI
User’s Guide and the software described herein. Please send your comments by e-mail to
[email protected].
Users of parallel tools software using Platform MPI on a Platform System are also encouraged
to provide feedback to the National HPCC Software Exchange (NHSE) - Parallel Tools Library [7].
The Parallel Tools Library provides information about parallel system software and tools, and also
provides for communication between software authors and users.
Acronym/Abbreviation   Meaning
EM64T   The Intel implementation of the 64 bit extension to the x86 ISA. Also see AMD64.
1.3.2 Terms
Unless explicitly specified otherwise, gcc (gnu c-compiler) and bash (gnu Bourne-Again-SHell)
are used in all examples.
Term    Description
Power   A generic term that covers the PowerPC and POWER processor families. These processors
        are both 32 and 64 bit capable. The common case is to have a 64 bit OS that supports both
        32 and 64 bit executables. See also PPC64.
Typography Description
Platform MPI is a C library built using the GNU compiler. Applications can however be compiled
with most compilers, as long as they are linked with the GNU runtime library.
The details of the process of linking with the Platform MPI libraries vary depending on which
compiler is used. Check the “Platform MPI Release Notes” for information on supported
compilers and how linking is done.
-c Compile only
-v Verbose output
-I${MPI_HOME}/include
The following strings outline the setup for the necessary linker flags (bash syntax):
-L${MPI_HOME}/lib -lmpi       (32 bit libraries)
-L${MPI_HOME}/lib64 -lmpi     (64 bit libraries)
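Example
A minimal compile-and-link sketch using gcc and the flags above (a sketch only; the exact
procedure depends on your compiler, see the "Platform MPI Release Notes"):
gcc -c -I${MPI_HOME}/include hello.c
gcc hello.o -L${MPI_HOME}/lib -lmpi -o hello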
Start the hello-world program on the three nodes called nodeA, nodeB and nodeC.
These processors are capable of running 32 bit programs at full speed while running a 64 bit OS.
For this reason Platform supports running both 32 bit and 64 bit MPI programs while running 64
bit OS.
Having both 32 bit and 64 bit libraries installed at the same time requires some tweaks to the
compiler and linker flags.
Please note:
• All compilers for x86-64 generate 64 bit code by default, but have flags for 32 bit code
generation. For gcc/g77 these are -m32 and -m64 for making 32 and 64 bit code
respectively. For Portland Group Compilers these are -tp k8-32 and -tp k8-64. For
other compilers please check the compiler documentation.
• It is not possible to link 32 and 64 bit object code into one executable (no cross
dynamic linking either), so there must be a double set of libraries.
• It is common convention on x86-64 systems that all 32 bit libraries are placed in lib
directories (for compatibility with x86 OSes) and all 64 bit libraries in lib64. This
means that when linking a 64 bit application with Platform MPI, you must use the
-L${MPI_HOME}/lib64 argument instead of the normal -L${MPI_HOME}/lib, as
sketched in the example below.
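Example
Building the same program as 32 bit and 64 bit on x86-64 with gcc (a sketch; flags and library
directories as described above):
gcc -m32 -I${MPI_HOME}/include hello.c -L${MPI_HOME}/lib -lmpi -o hello32
gcc -m64 -I${MPI_HOME}/include hello.c -L${MPI_HOME}/lib64 -lmpi -o hello64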
2.2.3.2 Compiling and linking on Power series
The Power series processors (PowerPC, POWER4 and POWER5) are both 32 and 64 bit capable.
There are only 64 bit versions of Linux provided by SUSE and RedHat, and only a 64 bit OS is
supported by Platform. However the Power families are capable of running 32 bit programs at
full speed while running a 64 bit OS. For this reason Platform supports running both 32 bit and
64 bit MPI programs.
Please note:
• gcc compiles 32 bit code by default on Power; use the gcc/g77 flags -m32 and -m64
explicitly to select code generation.
• It is not possible to link 32 and 64 bit object code into one executable (no cross
dynamic linking either), so there must be a double set of libraries.
• It is common convention on ppc64 systems that all 32 bit libraries are placed in lib
directories and all 64 bit libraries in lib64. This means that when linking a 64 bit
application with Platform MPI, you must use the -L${MPI_HOME}/lib64 argument
instead of the normal -L${MPI_HOME}/lib.
For C programs mpio.h must be included in your program and you must link with the libmpio
shared library in addition to the Platform MPI 1.2 C shared library (libmpi):
For Fortran programs you will need to include mpiof.h in your program and link with the
libmpio shared library in addition to the Platform MPI 1.2 C and Fortran shared libraries:
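Example
Linking a C MPI-IO program (a sketch, assuming gcc and the compile/link flags shown earlier;
the source file includes mpio.h):
gcc -I${MPI_HOME}/include mpio_test.c -L${MPI_HOME}/lib -lmpio -lmpi -o mpio_test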
Attention
Note: Platform MPI requires a homogeneous file system image, i.e. a file system
providing the same path and program names on all nodes of the cluster on which
Platform MPI is installed.
# export SCAMPI_DISABLE_ARGV0_MOD=1
# setenv SCAMPI_DISABLE_ARGV0_MOD 1
Example
Starting the program “${MPI_HOME}/examples/bin/hello” on a node called “hugin”:
mpimon ${MPI_HOME}/examples/bin/hello -- hugin
Example
Starting the same program with two processes on the same node:
mpimon ${MPI_HOME}/examples/bin/hello -- hugin 2
Example
Starting the same program on two different nodes, “hugin” and “munin”:
mpimon ${MPI_HOME}/examples/bin/hello -- hugin munin
Example
Using bracket expansion and grouping (if configured):
mpimon ${MPI_HOME}/examples/bin/hello -- node[1-16] 2 node[17-32] 1
For more information regarding bracket expansion and grouping, refer to “Bracket expansion
and grouping” on page 86.
This control over placement of processes can be very valuable when application performance
depends on all the nodes having the same amount of work to do.
mpimon [<mpimon option>]... <program & node-spec> [-- <program & node-spec>]...
<mpimon options> are options to mpimon (see “mpimon options” and the Platform
MPI Release Notes for a complete list of options),
-- is the separator that signals the end of user program options,
<program & node-spec> is an application and node specification consisting of
<program spec> -- <node spec> [<node spec>]...
Table 2.4 mpimon option syntax
Example
Starting the program “${MPI_HOME}/examples/bin/hello” on a node called “hugin” and the
program “${MPI_HOME}/examples/bandwidth” with 2 processes on “munin”:
mpimon ${MPI_HOME}/examples/bin/hello -- hugin --
${MPI_HOME}/examples/bin/bandwidth -- munin 2
Example
Changing one of the mpimon-parameters:
mpimon -channel_entry_count 32 ${MPI_HOME}/examples/bin/hello -- hugin 2
Attention
Note: The default for -stdin is none.
It is possible to control the placement of the output file by setting the environment variable
SCAMPI_OUTPUT_FILE=prefix. The prefix will replace the ScaMPIoutput portion of the filenames
used for output. For example, by setting SCAMPI_OUTPUT_FILE=/tmp/MyOutput together with
the -separate_output option, the output files will be placed in the /tmp directory and be named
MyOutput_<host>_<pid>_<rank>.
• Command line options: Options for mpimon must be placed after mpimon, but before the
program name
• Environment-variable options: Setting an mpimon-option with environment variables
requires that variables are defined as SCAMPI_<uppercase-option> where SCAMPI_ is a
fixed prefix followed by the option converted to upper case. For example
SCAMPI_CHANNEL_SIZE=64K means setting -channel_size to 64K
• Configuration-files options: mpimon reads up to three different configuration files when
starting. First the system-wide configuration (${MPI_HOME}/etc/ScaMPI.conf) is read.
If the user has a file on his/her home-directory, that file(${HOME}/ScaMPI.conf) is then
read. Finally if there is a configuration file in the current directory, that file(./ScaMPI.conf)
is then read. The files should contain one option per line, given as for command line options.
The options described either on the command line, as environment variables or in configuration
files are prioritized the following way (ranked from lowest to highest):
1 System-wide configuration-file(${MPI_HOME}/etc/ScaMPI.conf)
2 Configuration-file on home-directory(${HOME}/ScaMPI.conf)
3 Configuration-file on current directory(./ScaMPI.conf)
4 Environment-variables
5 Command line-options
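Example
Setting -channel_size to 64K through each of the three mechanisms (the value is illustrative):
on the command line: mpimon -channel_size 64K ${MPI_HOME}/examples/bin/hello -- hugin 2
as an environment variable (bash): export SCAMPI_CHANNEL_SIZE=64K
in a configuration file, one option per line, e.g. in ./ScaMPI.conf: -channel_size 64K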
2.3.2.7 Network options
Platform MPI is designed to handle several networks in one run. There are two types of
networks, built-in standard-devices and DAT-devices. The devices are selected by giving the
option “-networks <net-list>” to mpimon. <net-list> is a comma-separated list of device
specifications. Platform MPI uses the list when setting up connections to other MPI-processes.
It starts off with the first device specification in the list and sets up all possible connections with
that device. If this fails the next on list is tried and so on until all connections are live or all
adapters in <net-list> have been tried. We have introduced a new syntax for
device-specifications, but the old syntax is still supported for backwards compatibility (marked
as deprecated below).
-net <spec>[,<spec>]
<spec> = <nspec> | <dspec>
New syntax:
<nspec> = <type>[<delimit><typespec>][<delimit><parname>=<parvalue>]
<type> = ibverbs | vapi | ibta | dapl | gm | tcp | smp
-networks file:hwr
file "scampi-networks.conf":
hwr ibverbs:mthca0:2
/* File format
<user-def-name> <spec>
<spec> = <type>[<delimit><typespec>]
NB. no bracket-expandable expression! one name == one net-spec */
A list of preferred devices can be defined in several ways (ranked from lowest to highest):
1 System-wide configuration-file(${MPI_HOME}/etc/scampi_networks.conf)
2 Configuration-file on home-directory(${HOME}/scampi_networks.conf)
3 Configuration-file on current directory(./scampi_networks.conf)
4 Environment-variables
5 Command line-options
The values should be provided in a comma-separated list of device names. For each MPI process
Platform MPI will try to establish contact with each other MPI process, in the order listed. This
enables mixed interconnect systems, and provides a means for working around failed hardware.
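Example
A device list with SMP first, then Myrinet (gm0), and TCP/IP as the last resort (a sketch; device
names depend on your installation):
mpimon -networks smp,gm0,tcp ${MPI_HOME}/examples/bin/hello -- hugin munin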
In a system interconnect where the primary interconnect is Myrinet, if one node has a faulty
card, using the device list in the example, all communication to and from the faulty node will
happen over TCP/IP while the remaining nodes will use Myrinet. This offers the unique ability to
continue running applications over the full set of nodes even when there are interconnect faults.
Option Description
-cpu <time> Limit runtime to <time> minutes.
-np <count> Total number of MPI-processes to be started, default 2.
-npn <count> Maximum number of MPI-processes per node; the default is the -np <count>
divided by the number of nodes.
-pbs Submit job to PBS queue system
-pbsparams <“params”> Specify PBS scasub parameters
-p4pg <pgfile> Use mpich compatible pgfile for program, MPI-process and node
specification. pgfile entry: <nodename> <#procs> <progname>
The program name given on the command line is additionally started
with one MPI-process on the first node.
-v Verbose
-gdb Debug all MPI-processes using the GNU debugger gdb.
-maxtime <time> Limit runtime to <time> minutes.
-machinefile <filename> Take the list of possible nodes from <filename>
-mstdin <proc> Distribute stdin to MPI-process(es). The choices for <proc> are all (default),
none, or MPI-process number(s).
-part <part> Use nodes from partition <part>
-q Keep quiet, no mpimon printout.
-t test mode, no MPI program is started
<params> Parameters not recognized are passed on to mpimon.
Table 2.6 mpirun options
Assuming that the process identifier for this mpimon is <PID>, the user interface for this is:
or
Similarly the suspended job can be resumed by sending it a SIGUSR2 or SIGCONT signal, i.e.,
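For example, from a shell on the front-end (a sketch; <PID> is the process id of mpimon):
kill -USR2 <PID>
or
kill -CONT <PID>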
Currently the Platform MPI Infiniband (ib0), Myrinet (gm0) and all DAT-based interconnects
are supported.
Some failures will not result in an explicit error value propagating to Platform MPI. Platform
MPI handles this by treating a lack of progress within a specified time as a failure. You may alter
this time by setting the environment variable SCAMPI_FAILOVER_TIMEOUT to the desired
number of seconds.
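For example, to allow 60 seconds before a connection is considered failed (bash; the value is
illustrative):
export SCAMPI_FAILOVER_TIMEOUT=60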
To set debug-mode for one or more MPI-processes, specify the MPI-process(es) to debug using
the mpimon option -debug <select>. In addition, note that the mpimon option -display
<display> should be used to set the display for the xterm terminal emulator. An xterm
terminal emulator, and one debugger, is started for each of the MPI-processes being debugged.
For example, to debug an application using the default gdb debugger do:
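A sketch of such an invocation (the selection value all and the display name are illustrative):
mpimon -debug all -display myworkstation:0 ${MPI_HOME}/examples/bin/hello -- hugin 2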
Initially, for both MPI-process 0 and MPI-process 1, an xterm window is opened. Next, in the
upper left hand corner of each xterm window, a message containing the application program’s
run parameter(s) is displayed. Typically, the first line reads Run parameters: run
<programoptions>. The information following the colon, i.e. run <programoptions>, is
SCAMPI_INSTALL_SIGNAL_HANDLER=11
will install a handler for signal number 11 (which is the same as SIGSEGV).
SCAMPI_INSTALL_SIGNAL_HANDLER=11,4
will install handlers for both signal number 11 and signal number 4.
The default action has the handler dump all registers and start the processes looping. Attaching
with a debugger will then make it possible to examine the situation which resulted in the
segmentation violation.
As an alternative, the handler can dump all registers and have all processes exit afterwards.
To attach to process <pid> on a machine with the GNU debugger (gdb) use the following
command:
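For example (a sketch of a typical gdb attach; gdb can also be started as gdb <application> <pid>):
gdb -p <pid>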
In general, this will allow gdb to inspect the stack trace and identify the functions active when
the sigsegv occurred, and disassemble the functions. If the application is compiled with debug
info (-g) and the source code is available, then source level debugging can be carried out.
The built-in devices SMP and TCP/IP use a simplified protocol based on serial transfers. This can
be visualized as data being written into one end of a pipe and read from the other end. Messages
arriving out-of-order are buffered by the reader. The names of these standard devices are SMP
for intra-node communication and TCP for node-to-node-communication.
The size of the buffer inside the pipe can be adjusted by setting the following environment
variables:
The ring buffers are divided into equally sized entries. The entry size differs for different
architectures and networks; see “Platform MPI Release Notes” for details. An entry in the ring
buffer, which is used to hold the information forming the message envelope, is reserved each
time a message is being sent, and is used by the inline protocol, the eagerbuffering protocol, and
the transporter protocol. In addition, one or more entries are used by the inline protocol for
application data being transmitted.
mpimon has the following interface for the eagerbuffer and channel thresholds:
Variable Description
channel_inline_threshold <size> to set threshold for inline buffering
eager_threshold <size> to set threshold for eager buffering
Platform MPI operates on a buffer pool. The pool is divided into equally sized parts called
chunks. Platform MPI uses one chunk per connection to other processes. The mpimon option
“pool_size” limits the total size of the pool and the “chunk_size” limits the block of memory that
can be allocated for a single connection.
Variable Description
pool_size <size> to set the buffer pool size
chunk_size <size> to set the chunk size.
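Example
Adjusting the buffer pool and the per-connection chunk in one invocation (the values are
illustrative only):
mpimon -pool_size 32M -chunk_size 1M ${MPI_HOME}/examples/bin/hello -- hugin 2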
while (...) {
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
    if (sts->MPI_TAG == SOME_VALUE) {
        MPI_Recv(buf, cnt, dtype, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
        doStuff();
    }
    doOtherStuff();
}
For MPI implementations that have one, and only one, receive-queue for all senders, the
program’s code sequence works as desired. However, the code will not work as expected with
Platform MPI. Platform MPI uses one receive-queue per sender (inside each MPI-process).
Thus, a message from one sender can bypass the message from another sender. In the time-gap
between the completion of MPI_Probe() and before MPI_Recv() matches a message, another
new message from a different MPI-process could arrive, i.e. it is not certain that the message
found by MPI_Probe() is identical to one that MPI_Recv() matches.
while (...) {
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
    if (sts->MPI_TAG == SOME_VALUE) {
        MPI_Recv(buf, cnt, dtype, sts->MPI_SOURCE, sts->MPI_TAG, comm, sts);
        doStuff();
    }
    doOtherStuff();
}
2.3.13.3 MPI_Bsend()
Using buffered sends, e.g. MPI_Bsend(), usually degrades performance significantly in
comparison with their unbuffered relatives.
A typical example that will not work with Platform MPI (for long messages):
while (...) {
MPI_Send(buf, cnt, dtype, partner, tag, comm);
MPI_Recv(buf, cnt, dtype, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
doStuff();
}
This code tries to use the same buffer for both sending and receiving. Such logic can be found,
e.g., where processes form a ring and communicate with their neighbors. Unfortunately,
writing the code this way leads to deadlock; to make it work the MPI_Send() must be
replaced with MPI_Isend() and MPI_Wait(), or the whole construction should be replaced with
MPI_Sendrecv() or MPI_Sendrecv_replace().
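One possible rewrite uses MPI_Sendrecv_replace(), keeping the variable names from the
fragment above (a sketch only):
while (...) {
    /* Combined send and receive that safely reuses the same buffer */
    MPI_Sendrecv_replace(buf, cnt, dtype, partner, tag,
                         MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
    doStuff();
}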
Given that Platform MPI has not fixed its OS routines to specific libraries, it is good
programming practice to avoid using OS functions or standard C-lib functions as application
function names.
STOP!
Warning: Naming routines or global variables as send, recv, open, close, yield,
internal_error, failure, service or other OS reserved names may result in unpredictable
and undesirable behavior.
Using the -verbose option enables mpimon to print more detailed warnings.
Option Description
-affinity_mode <mode> Set affinity-mode for process(es).
-automatic <selection> Set automatic-mode for process(es).
-backoff_enable <selection> Set backoff-mode for process(es).
-ccp_mode <mode> Set checkpoint/restart mode.
-channel_entry_count <count> Set number of entries per channel.
-channel_entry_size <size> Set entry_size (in bytes) per channel.
-channel_inline_threshold <size> Set threshold for in-lining (in bytes) per channel.
-channel_size <size> Set buffer size (in bytes) per channel.
-chunk_size <size> Set chunk-size for communication.
-debug <selection> Set debug-mode for process(es).
-debugger <debugger> Set debugger to start in debug-mode.
-disable-timeout Disable process timeout.
disable_toss_page_cache <true|false> Please see “Platform MPI components” on page 40 for
more details.
-display <display> Set display to use in debug-/manual-mode.
-dryrun <mode> Set dryrun-mode. The default value is 'none'.
-eager_count <count> Set number of buffers for eager protocol.
-eager_factor <factor> Set factor for subdivision of eagerbuffers.
-eager_size <size> Set buffer size (in bytes) for eager protocol.
-eager_threshold <size> Set threshold (in bytes) for eager protocol.
Table 2.7 mpimon options
[<prefix>]<numeric value>[<postfix>]
<prefix> selects the numeric base when interpreting the value: “0x” indicates a hex-number (base =
16), “0” indicates an octal-number (base = 8). If <prefix> is omitted, a decimal-number (base = 10)
is assumed. <postfix> selects a multiplication factor.
Example
Input: Value as interpreted by mpimon (in decimal):
123 123
0x10 16
0200 128
1K 1 024
2M 2 097 152
The details of the process of linking with the Platform MPI libraries vary depending on which
compiler is used. Check the “Platform MPI Release Notes” for information on supported
compilers and how linking is done.
All examples below have been compiled with the Microsoft Visual C compiler and Intel Visual
Fortran. See the documentation for your compiler for the proper syntax and further information.
To compile and link with the Platform MPI-IO features you need to do the following depending on
whether it is a C or a Fortran program:
For C programs mpio.h must be included in your program and you must link with the
smcmpio.lib library in addition to the Platform MPI 1.2 C library (scampi.lib):
For Fortran programs you will need to include mpiof.h in your program and link with the
scampio.lib library in addition to the Platform MPI 1.2 C and Fortran shared libraries:
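Example
A possible compile-and-link line for the C case with the Microsoft Visual C compiler (a sketch;
the include and library directories under %MPI_HOME% are assumptions, check the Release Notes
for the exact names on your installation):
cl /I"%MPI_HOME%\include" mpio_test.c /link /LIBPATH:"%MPI_HOME%\lib" smcmpio.lib scampi.lib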
Start the hello-world program on the three nodes called nodeA, nodeB and nodeC.
Attention
Note: Platform MPI requires a homogeneous file system image, i.e. a file system
providing the same path and program names on all nodes of the cluster on which
Platform MPI is installed.
Example
Starting the program “C:\Program Files\Platform\Platform MPI\examples\bin\hello” on a node called “hugin”:
mpimon "C:\Program Files\Platform\Platform MPI\examples\bin\hello" -- hugin
For more information regarding bracket expansion and grouping, refer to “Bracket expansion
and grouping” on page 86.
The identification of nodes and the number of processes to run on each particular node
translates directly into the rank of the MPI processes. For example, specifying n1 2 n2 2 will
place process 0 and 1 on node n1 and process 2 and 3 on node n2. On the other hand, specifying
n1 1 n2 1 n1 1 n2 1 will place process 0 and 2 on node n1 while process 1 and 3 are placed on
node n2.
This control over placement of processes can be very valuable when application performance
depends on all the nodes having the same amount of work to do.
The program mpimon has a multitude of options which can be used for optimising Platform
MPI performance. Normally it should not be necessary to use any of these options. However,
unsafe MPI programs might need buffer adjustments to solve deadlocks. Running multiple
applications in one run may also be facilitated with some of the “advanced options”.
Example
Starting the program “C:\Program Files\Platform\Platform MPI\examples\bin\hello” on a node called “hugin” and the
program “C:\Program Files\Platform\Platform MPI\examples\bin\bandwidth” with 2 processes on “munin”:
mpimon "C:\Program Files\Platform\Platform MPI\examples\bin\hello" -- hugin --
"C:\Program Files\Platform\Platform MPI\examples\bin\bandwidth" -- munin 2
Example
Changing one of the mpimon-parameters:
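A sketch mirroring the corresponding UNIX example (the option and its value are illustrative):
mpimon -channel_entry_count 32 "C:\Program Files\Platform\Platform MPI\examples\bin\hello" -- hugin 2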
The -stdin option specifies which MPI-process rank should receive the input. You can in fact
send stdin to all the MPI-processes with the all argument, but this requires that all
MPI-processes read the exact same amount of input. The most common way of doing it is to
send all data on stdin to rank 0:
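For example (a sketch; the input file name is illustrative):
mpimon -stdin 0 "C:\Program Files\Platform\Platform MPI\examples\bin\hello" -- hugin 2 < input.dat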
By default the processes’ output to stdout all appear in the stdout of mpimon, where they are
merged in some random order. It is however possible to keep the outputs apart by directing
them to files that have unique names for each process. This is accomplished by giving mpimon
the option -separate_output <selection>, e.g., -separate_output all to have each process deposit
its stdout in a file. The files are named according to the following template:
ScaMPIoutput_<host>_<pid>_<rank>, where <host> and <pid> identify the particular
invocation of mpimon on the host, and <rank> identifies the process.
It is possible to control the placement of the output file by setting the environment variable
SCAMPI_OUTPUT_FILE=prefix. The prefix will replace the ScaMPIoutput portion of the filenames
used for output. For example, by setting SCAMPI_OUTPUT_FILE=\tmp\MyOutput together with
the -separate_output option, the output files will be placed in the \tmp directory and be named
MyOutput_<host>_<pid>_<rank>.
There are three different ways to provide options to mpimon. The most common way is to
specify options on the command line invoking mpimon. Another way is to define environment
variables, and the third way is to define options in configuration file(s).
• Command line options: Options for mpimon must be placed after mpimon, but before the
program name
• Environment-variable options: Setting an mpimon-option with environment variables
requires that variables are defined as SCAMPI_<uppercase-option> where SCAMPI_ is a
fixed prefix followed by the option converted to upper case. For example
SCAMPI_CHANNEL_SIZE=64K means setting -channel_size to 64K
• Configuration-files options: mpimon reads up to three different configuration files when
starting. First the system-wide configuration
(C:\Program Files\Platform\Platform MPI\etc\ScaMPI.conf) is read. If the user has
a file in his/her home directory, that file (${HOME}\ScaMPI.conf) is then read. Finally if
there is a configuration file in the current directory, that file(.\ScaMPI.conf) is then read.
The files should contain one option per line, given as for command line options.
1 System-wide configuration-file
(C:\Program Files\Platform\Platform MPI\etc\ScaMPI.conf)
2 Configuration-file on home-directory(${HOME}\ScaMPI.conf)
3 Configuration-file on current directory(.\ScaMPI.conf)
4 Environment-variables
5 Command line-options
3.3.1.7 Network options
With Platform MPI you can utilize multiple networks in the same invocation.
There are two types of networks, built-in standard-devices and DAT-devices. The devices are
selected by giving the option “-networks <net-list>” to mpimon. <net-list> is a
comma-separated list of device names. Platform MPI uses the list when setting up connections
to other MPI processes. Each point-to-point connection between MPI processes starts off with
the first device in the list and sets up all possible connections with that device. If this fails the
next on list is tried and so on until all connections are live or all adapters in <net-list> have been
tried.
Example
For systems installed with the Platform Manage installer, a list of preferred devices is provided in
ScaMPI.conf. An explicit list of devices may be set either in a private ScaMPI.conf, through
the SCAMPI_NETWORKS environment variable, or by the -networks parameter to mpimon. The
values should be provided in a comma-separated list of device names.
mpimon -networks smp,gm0,tcp...
For each MPI process Platform MPI will try to establish contact with each other MPI process, in
the order listed. This enables mixed interconnect systems, and provides a means for working
around failed hardware.
In a system interconnect where the primary interconnect is Myrinet, if one node has a faulty
card, using the device list in the example, all communication to and from the faulty node will
happen over TCP/IP while the remaining nodes will use Myrinet. This offers the unique ability to
continue running applications over the full set of nodes even when there are interconnect faults.
Built-in tools for debugging in Platform MPI covers discovery of the MPI calls used through
tracing and timing, and an attachment point to processes that fault with segmentation violation.
The tracing (see “Tracing” on page 53) and timing (see “Timing” on page 57) are covered
in Chapter 4.
When running applications that terminate with a SIGSEGV signal it is often useful to be able to
freeze the situation instead of exiting, which is the default behavior. In previous versions of Platform
MPI you could use SCAMPI_INSTALL_SIGSEGV_HANDLER for this purpose.
SCAMPI_INSTALL_SIGNAL_HANDLER has replaced this environment variable, while Platform MPI
remains backward compatible with the earlier variable. The signal values are given in a comma
separated list.
SCAMPI_INSTALL_SIGNAL_HANDLER=11
will install a handler for signal number 11 (which is the same as SIGSEGV).
SCAMPI_INSTALL_SIGNAL_HANDLER=11,4
will install handlers for both signal number 11 and signal number 4.
The default action has the handler dump all registers and start the processes looping. Attaching
with a debugger will then make it possible to examine the situation which resulted in the
segmentation violation.
As an alternative, the handler can dump all registers and have all processes exit afterwards.
You may attach a debugger to process <pid> on a machine with your preferred debugger. In
general, this will allow the debugger to inspect the stack trace and identify the functions active
when the sigsegv occurred, and disassemble the functions. If the application is compiled with
debug info and the source code is available, then source level debugging can be carried out.
The built-in devices SMP and TCP/IP use a simplified protocol based on serial transfers. This can
be visualized as data being written into one end of a pipe and read from the other end. Messages
arriving out-of-order are buffered by the reader. The names of these standard devices are SMP
for intra-node communication and TCP for node-to-node-communication.
The size of the buffer inside the pipe can be adjusted by setting the following environment
variables:
The ring buffers are divided into equally sized entries. The entry size differs for different
architectures and networks; see “Platform MPI Release Notes” for details. An entry in the ring
buffer, which is used to hold the information forming the message envelope, is reserved each
time a message is being sent, and is used by the inline protocol, the eagerbuffering protocol, and
the transporter protocol. In addition, one or more entries are used by the in-line protocol for
application data being transmitted.
mpimon has the following interface for the eagerbuffer and channel thresholds:
Variable Description
channel_inline_threshold <size> to set threshold for inline buffering
eager_threshold <size> to set threshold for eager buffering
Platform MPI operates on a buffer pool. The pool is divided into equally sized parts called
chunks. Platform MPI uses one chunk per connection to other processes. The mpimon option
“pool_size” limits the total size of the pool and the “chunk_size” limits the block of memory that
can be allocated for a single connection.
Variable Description
pool_size <size> to set the buffer pool size
chunk_size <size> to set the chunk size.
while (...) {
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
    if (sts->MPI_TAG == SOME_VALUE) {
        MPI_Recv(buf, cnt, dtype, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
        doStuff();
    }
    doOtherStuff();
}
For MPI implementations that have one, and only one, receive-queue for all senders, the
program’s code sequence works as desired. However, the code will not work as expected with
Platform MPI. Platform MPI uses one receive-queue per sender (inside each MPI-process).
Thus, a message from one sender can bypass the message from another sender. In the time-gap
between the completion of MPI_Probe() and before MPI_Recv() matches a message, another
new message from a different MPI-process could arrive, i.e. it is not certain that the message
found by MPI_Probe() is identical to one that MPI_Recv() matches.
while (...) {
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
    if (sts->MPI_TAG == SOME_VALUE) {
        MPI_Recv(buf, cnt, dtype, sts->MPI_SOURCE, sts->MPI_TAG, comm, sts);
        doStuff();
    }
    doOtherStuff();
}
A typical example that will not work with Platform MPI (for long messages):
while (...) {
MPI_Send(buf, cnt, dtype, partner, tag, comm);
MPI_Recv(buf, cnt, dtype, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, sts);
doStuff();
}
This code tries to use the same buffer for both sending and receiving. Such logic can be found,
e.g., where processes form a ring and communicate with their neighbors. Unfortunately,
writing the code this way leads to deadlock; to make it work the MPI_Send() must be
replaced with MPI_Isend() and MPI_Wait(), or the whole construction should be replaced with
MPI_Sendrecv() or MPI_Sendrecv_replace().
STOP!
Warning: Naming routines or global variables as send, recv, open, close, yield, internal_error,
failure, service or other OS reserved names may result in unpredictable and undesirable
behavior.
Using the -verbose option enables mpimon to print more detailed warnings.
Option Description
-affinity_mode <mode> Set affinity-mode for process(es). The default value is
'automatic'. The legal choices are 'automatic' and 'none'.
-automatic <selection> Set automatic-mode for process(es).
-backoff_enable <selection> Set backoff-mode for process(es). The default value is 'none'.
-channel_entry_count <count> Set number of entries per channel. The default value is '512,smp:32'.
-channel_entry_size <size> Set entry_size (in bytes) per channel.
-channel_inline_threshold <size> Set threshold for in-lining (in bytes) per channel.
-channel_size <size> Set buffer size (in bytes) per channel.
-chunk_size <size> Set chunk-size for communication.
Table 3.3 mpimon options (windows)
[<prefix>]<numeric value>[<postfix>]
<prefix> selects the numeric base when interpreting the value: “0x” indicates a hex-number (base =
16), “0” indicates an octal-number (base = 8). If <prefix> is omitted, a decimal-number (base = 10)
is assumed. <postfix> selects a multiplication factor.
Example
Input: Value as interpreted by mpimon (in decimal):
123 123
0x10 16
0200 128
1K 1 024
2M 2 097 152
• mpimon is a monitor program which is the user’s interface for running the application program.
• mpisubmon is a submonitor program which controls the execution of application programs. One
submonitor program is started on each node per run.
• mpiboot is a bootstrap program used when running in manual-/debug-mode.
• Platformservices is a service responsible for starting daemons.
• mpid is a daemon program running on all nodes that are able to run Platform MPI. mpid is used for
starting the mpisubmon programs (to avoid using Unix facilities like the remote shell rsh). mpid is
started automatically when a node boots, and must run at all times. There are two options that are for
mpid use only:
• user_authentication <true|false>
• disable_toss_page_cache <true|false>
4.1.1 user_authentication
When set to “true” all users are required to generate public/private key pairs in order to launch
MPI applications. Key pairs can be generated by running “mpikeygen”. Key pairs should be
located in the user's home directory, in the Unix subdirectory ${HOME}/.scampi
(in the Windows subdirectory: %APPDATA%/.scampi).
4.1.2 disable_toss_page_cache
When set to “true”, mpid will no longer toss the page cache prior to starting the application.
Tossing the page cache is useful on NUMA machines to ensure memory locality and thus optimal
performance. If you are experiencing kernel issues or weird behavior you might consider
disabling this feature.
Figure 4.1 illustrates how applications started with mpimon have their communication system
established by a system of daemons on the nodes. This process uses TCP/IP communication over
the networking Ethernet, whereas optional high performance interconnects are used for
communication between processes.
mpimon performs parameter control, checking as many of the specified options and
parameters as possible. The user program names are checked for validity, and the nodes are
contacted (using sockets) to ensure they are responding and that mpid is running.
Via mpid, mpimon establishes contact with the nodes and transfers the basic information needed
for mpid to start the submonitor mpisubmon on each node. Each submonitor establishes a
connection to mpimon for the exchange of control information between each mpisubmon and
mpimon, enabling mpisubmon to start the specified user programs (MPI-processes).
As mpisubmon starts the MPI-processes to be executed, they call MPI_Init(). Inside MPI_Init()
the user processes wait for all the mpisubmons involved to coordinate via mpimon. Once all
processes are ready, mpimon will return a “start running” message to the processes. They will
then return from MPI_Init() and start executing the user code.
Stopping MPI application programs is requested by the user processes as they enter the
MPI_Finalize() call. The local mpisubmon will signal mpimon and wait for mpimon to return an
“all stopped” message. This comes when all processes are waiting in MPI_Finalize(). As the user
processes return from MPI_Finalize() they release their resources and terminate. Then the
local mpisubmon terminates, and eventually mpimon terminates.
The TCP/IP stack can run on any network supported by the socket stream. That includes
Ethernet, Myrinet, Infiniband, and SCI. For more details please refer to the Release Notes
included with the Platform MPI package.
To find out what network device is used between two processes, set the environment variable
SCAMPI_NETWORKS_VERBOSE=2. The MPI library will print out during startup a table of processes
and devices in use.
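For example (bash; a sketch, set before invoking mpimon):
export SCAMPI_NETWORKS_VERBOSE=2
mpimon ${MPI_HOME}/examples/bin/hello -- hugin munin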
The other type of devices use the DAT µDAPL API in order to have an open API for generic third
party vendors. µDAPL is an abbreviation for User DAT Provider library. This is a shared library
that Platform MPI loads at runtime through the static DAT registry. These libraries are normally
listed in /etc/dat.conf (in Windows: C:\dat\dat.conf). For clusters using ‘exotic’ interconnects,
please note that Platform has a certification program, and may not provide support for unknown
third party vendors.
The DAT header files and registry library conforming to the µDAPL v1.2 specification are provided
by the dat-registry package.
The TCP device is really a generic device that works over any TCP/IP network, even WANs. This
network device requires only that the node names given to mpimon map correctly to the nodes'
IP addresses. TCP/IP connectivity is required for Platform MPI operation, and for this reason the
TCP device is always operational.
Attention
Note: Users should always append the TCP device at the end of a device list as the device of last
resort. This way communication will fall back to the management Ethernet, which has to be
present anyway for the cluster to work.
4.3.2.2 Myrinet GM
This is an RDMA capable device that uses the Myricom GM driver and library. A GM release above
2.0 is required. This device is straightforward and requires no configuration other than the
presence of the libgm.so library in the library path (see /etc/ld.so.conf).
Attention
Note: Myricom GM software is not provided by Platform. If you have purchased a
Myrinet interconnect you have the right to use the GM source, and a source tar ball is
available from Myricom. It is necessary to obtain the GM source since it must be
compiled per kernel version.
Your management software should provide tools for generating binary RPMs to ease
installation and management.
If you used Platform Manage to install your compute nodes, and supplied it with the GM
source tar ball, the installation is already complete.
4.3.2.3 Infiniband
Infiniband is an interconnect that has been available since 2002 and has gained popularity in
recent years. There are several flavors to choose from, each having different performance
profiles. Infiniband vendors provide different hardware and software environments. Platform
MPI supports 3 different Infiniband software stack types:
• OFED (libibverbs.so.1) is the new OpenIB software stack which all vendors are
embracing (though they still ship their proprietary stacks). Both OFED 1.1 and
1.2 work with Platform MPI 5.5.
• IBT (/dev/SysIbt) is Silverstorm specific and is only available when running the
Silverstorm Infiniband stack.
• VAPI (libvapi.so) is the Mellanox type stack used by Cisco/Topspin and Voltaire.
Attention
Note: When you specify “-networks ib”, Platform MPI will try OFED first, then
Silverstorm and finally VAPI. If the first two aren't available you will see “Mutable
errors” in the output when running with the -verbose flag. If none of them are available,
you will get the error message “Unable to complete network setup”. See release notes
on the exact versions of software stack that Platform MPI supports.
Your management software should provide a utility that automates installation of some of these
stacks.
The different vendors’ InfiniBand switches vary in feature sets, but the most important
difference is whether they have a built in subnet manager or not. An InfiniBand network must
have a subnet manager (SM) and if the switches don't come with a built-in SM, one has to be
started on a node attached to the IB network. The SM’s of choice for software SM’s are OpenSM
or minism. If you have SM-less switches your vendor will provide one as part of their software
bundle.
Platform MPI uses either the µDAPL (User DAT Provider Library) supplied by the IB vendor, or
the low level VAPI/IBA layer. DAT is an established standard and is guaranteed to work with
Platform MPI, but better performance is usually achieved with the VAPI/IBT interfaces.
However, VAPI is an API that is in flux, and Platform MPI is not guaranteed to work with all
(current or future) versions of VAPI.
Multirailing is the ability to stripe data across the multiple Infiniband HCAs in a system as well as
to balance HCA loads between processes on the same system. See “Multi-rail Support in
Platform MPI” on page 92 for more details.
The default thresholds that control whether a message belongs to the in-line, eager buffering or
transporter protocols can be controlled from the application launch program (mpimon) described
in chapter 3.
Figure 4.4 illustrates the node resources associated with communication and mechanisms
implemented in Platform MPI for handling messages of different sizes. The communication
protocols from Figure 4.4 rely on buffers located in the main memory of the nodes. This memory
is allocated as shared, i.e., it is not private to a particular process in the node. Each process has
one set of receiving buffers for each of the processes it communicates with.
As the figure shows all communication relies on the sending process depositing messages
directly into the communication buffers of the receiver. For In-line and eager buffering the
management of the buffer resources does not require participation from the receiving process,
because of their designs as ring buffers.
The Channel ring buffer is divided into equally sized entries. The entry size differs for different
architectures and networks; see “Platform MPI Release Notes” for details. An entry in the
ring buffer, which is used to hold the information forming the message envelope, is reserved
each time a message is being sent, and is used by the In-line protocol, the eager buffering
protocol, and the transporter protocol. In addition, one or more entries are used by the In-line
protocol for application data being transmitted.
If the message data is small enough, the sender uses the in-line protocol to write the data in the
message header, using one or more of the receiver's channel ring buffer entries.
The sender uses the eager buffering protocol for medium-sized messages: a message header is
written to the receiver's channel ring buffer, identifying the data's position in the eager buffer.
The sender allocates buffer resources, which are released by the receiver, without any explicit
communication between the two communicating partners.
The eager buffering protocol uses one channel ring buffer entry for the message header, and one
eager buffer entry for the application data being sent.
When large messages are transferred the transporter protocol is used. The sender
communicates with the receiver to negotiate which resources to use for data transport. The
receiver's runtime system processes the message and returns a response.
The transporter protocol utilizes one channel ring buffer entry for the message header, and
transporter buffers for the application data being sent. The sender's runtime system
fragments large messages whose size is larger than the size of the transporter ring buffer entry
(transporter_size), and they are reassembled on the receiving side. The data is then written to
the agreed upon position in the transporter ring buffer.
Platform recommends that you select the zerocopy protocol if your underlying hardware can
support it. To disable the zerocopy protocol, set the zerocopy_count or the zerocopy_size
parameters to 0.
The trace and timing facilities are initiated by environment variables that either can be set and
exported or set at the command line just before running mpimon.
There are different tools available that can be useful to detect and analyze the cause of
performance bottlenecks:
• Built-in proprietary trace and profiling tools provided with Platform MPI
• Commercial tools that collect information during the run, post-process it and present the
results afterwards, such as Vampir from Pallas GmbH.
See http://www.pallas.de for more information.
The main difference between these tools is that the Platform MPI built-in tools can be used
with an existing binary while the other tools require reloading with extra libraries.
The powerful run time facilities Platform MPI trace and Platform MPI timing can be used to
monitor and keep track of MPI calls and their characteristics. The various trace and timing
options can yield many different views of an application's usage of MPI. Common to most of
these logs are the massive amount of data which can sometimes be overwhelming, especially
when run with many processes and using both trace and timing concurrently.
The second part shows the timing of these different MPI calls. The timing is a sum of the timing
for all MPI calls for all MPI processes and since there are many MPI processes the timing can look
unrealistically high. However, it reflects the total time spent in all MPI calls. For situations in
which benchmarking focuses primarily on timing rather than tracing MPI calls, the timing
functionality is more appropriate. The trace functionality introduces some overhead and the total
wall clock run time of the application goes up. The timing functionality is relatively light and can
be used to time the application for performance benchmarking.
Example
To illustrate the potential of tracing and timing with Platform MPI consider the code fragment
below (full source reproduced in E.1).
This code uses collective operations (broadcast, scatter, reduce and gather) to employ multiple
processes to perform operations on an image. For example, the figure below shows the result of
processing an ultrasonic image of a fetus.
option Description
-b Trace beginning and end of each MPI call
-s <seconds> Start trace after <seconds> seconds
-S <seconds> End trace after <seconds> seconds
-c <calls> Start trace after <calls> MPI calls
-C <calls> End trace after <calls> MPI calls
-m <mode> Special modes for trace
This format can be extended by using the "-f" option. Adding "-f arguments" will provide
some additional information concerning message length. If "-f timing" is given, some timing
information is provided between the <absRank> and <MPIcall> fields.
Calls may be specified with or without the "MPI_"-prefix, and in upper- or lower-case. The
default format of the output has the following parts:
"-f rate" will add some rate-related information. The rate is calculated by dividing the number
of bytes transferred by the elapsed time to execute the call. All parameters to -f can be
abbreviated and can occur in any mix.
Normally no error messages are provided concerning the options which have been selected. But
if -verbose is added as a command-line option to mpimon, errors will be printed.
Trace provides information about which MPI routines were called and possibly information about
parameters and timing.
This will print a trace of all MPI calls for this relatively simple run:
0: MPI_Init
0: MPI_Comm_rank Rank: 0
0: MPI_Comm_size Size: 2
0: MPI_Bcast root: 0 Id: 0
my_count = 32768
0: MPI_Scatter Id: 1
1: MPI_Init
1: MPI_Comm_rank Rank: 1
1: MPI_Comm_size Size: 2
1: MPI_Bcast root: 0 Id: 0
my_count = 32768
1: MPI_Scatter Id: 1
1: MPI_Reduce Sum root: 0 Id: 2
1: MPI_Bcast root: 0 Id: 3
0: MPI_Reduce Sum root: 0 Id: 2
0: MPI_Bcast root: 0 Id: 3
1: MPI_Gather Id: 4
1: MPI_Keyval_free
0: MPI_Gather Id: 4
0: MPI_Keyval_free
Example
mpimon -trace "-f arg;timing" ./kollektive-8 ./uf256-8.pgm -- n1 2
0: +-0.951585 s 951.6ms MPI_Init
0: +0.000104 s 3.2us MPI_Comm_rank Rank: 0
0: +0.000130 s 1.7us MPI_Comm_size Size: 2
0: +0.038491 s 66.3us MPI_Bcast root: 0 sz: 1 x 4 = 4 Id: 0
my_count = 32768
0: +0.038634 s 390.0us MPI_Scatter Id: 1
1: +-1.011783 s 1.0s MPI_Init
1: +0.000100 s 3.8us MPI_Comm_rank Rank: 1
1: +0.000129 s 1.7us MPI_Comm_size Size: 2
1: +0.000157 s 69.6us MPI_Bcast root: 0 sz: 1 x 4 = 4 Id: 0
my_count = 32768
1: +0.000300 s 118.7us MPI_Scatter Id: 1
There are a number of parameters for selecting only a subset, either by limiting the number of
calls and intervals as described above under ‘Timing’ , or selecting or excluding just some MPI
calls.
5.1.2 Features
The "-b" option is useful when trying to pinpoint which MPI-call has been started but not
completed (i.e. are deadlocked). The "-s/-S/-c/-C" -options also offer useful support for an
application that runs well for a longer period and then stop, or for examining some part of the
execution of the application.
From time to time it may be desirable to trace only one or a few of the processes.
Specifying the "-p" option offers the ability to pick the processes to be traced.
All MPI calls are enabled for tracing by default. To view only a few calls, specify a "-t
<call-list>" option; to exclude some calls, add a "-x <call-list>" option. The "-t" option will disable
all tracing and then enable those calls that match the <call-list>. The matching is done using
POSIX regular-expression syntax. "-x" does the opposite: it first enables all tracing and
then disables those calls matching <call-list>.
Example
"-t MPI_[b,r,s]*send" : Trace only send-calls (MPI_Send, MPI_Bsend, MPI_Rsend,
MPI_Ssend)
Example
"-t MPI_Irecv" : Trace only immediate recv (MPI_Irecv)
Example
"-t isend;irecv;wait" :Trace only MPI_Isend, MPI_Irecv and MPI_Wait
Example
“-t i[a-z]*" : Trace only calls beginning with MPI_I
Option Description
-v verbose
Example
mpimon -timing ”-s 1” ./kollektive-8 ./uf256-8.pgm -- r1 r2
where '<seconds>' is the number of seconds per printout from Platform MPI
produces:
The <seconds> field can be set to a large number in order to collect only final statistics.
The output gives statistics about which MPI calls are used, their frequency and their timing,
both as delta numbers since the last printout and as accumulated totals. By setting the interval
(-s <seconds>) to a large number, only the cumulative statistics at the end are printed. The
timings are presented for each process, and with many processes this can yield a huge amount of
output. There are many options for modifying SCAMPI_TIMING to reduce this output: the selection
parameter restricts timing to only those MPI processes that are to be monitored, and selected
MPI calls can be screened away, either before or after a certain number of calls or between an
interval of calls. Some examples are given below.
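The first example uses the "-s <seconds>" sub-option shown above; the second assumes that
-timing accepts the same "-t <call-list>" selection syntax as -trace:
Example
mpimon -timing "-s 3600" ./kollektive-8 ./uf256-8.pgm -- r1 r2
Example
mpimon -timing "-t MPI_All*" ./kollektive-8 ./uf256-8.pgm -- r1 r2
The first prints statistics only once an hour, so that in practice only the final cumulative
statistics matter; the second would restrict the statistics to the calls matching MPI_All*.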
Each line of the first part of the output has the form:
<MPIcall><Dcalls><Dtime><Dfreq> <Tcalls><Ttime><Tfreq>
where <MPIcall> is the name of the MPI call, the D-fields give the number of calls, the time
spent and the call frequency since the last printout, and the T-fields give the corresponding
accumulated totals.
The second part of the output, containing the buffer statistics, has two types of lines, one for
receives and one for sends, of the form:
!<count>!<avrLen>!<zroLen>!<inline>!<eager>!<transporter>!
In order to extract information from the huge amount of data, Platform has developed a simple
analysis tool called scanalyze. This analysis tool accepts output from Platform MPI applications
run with certain predefined trace and timing variables set.
Digesting the massive amount of information in these files is a challenge, but scanalyze produces
the following summaries for tracing:
==========================================================
                 #calls     time  tim/call      #calls     time  tim/call
0: MPI_Alltoall 0 0.0ns 12399 10.6s 855.1us
0: MPI_Barrier 0 0.0ns 26 1.2ms 45.8us
0: MPI_Comm_rank 0 0.0ns 1 3.2us 3.2us
0: MPI_Comm_size 0 0.0ns 1 1.4us 1.4us
0: MPI_Init 0 0.0ns 1 1.0s 1.0s
0: MPI_Keyval_free 1 27.9us 27.9us 1 27.9us 27.9us
0: SUM 2 29.0us 14.5us 12481 11.7s 933.5us
0: Overhead 0 0.0ns 12481 12.6ms 1.0us
==========================================================
                 #calls     time  tim/call      #calls     time  tim/call
1: MPI_Alltoall 0 0.0ns 12399 10.6s 854.9us
1: MPI_Barrier 0 0.0ns 26 2.9ms 109.6us
1: MPI_Comm_rank 0 0.0ns 1 3.5us 3.5us
1: MPI_Comm_size 0 0.0ns 1 1.5us 1.5us
1: MPI_Init 0 0.0ns 1 1.0s 1.0s
1: MPI_Keyval_free 1 10.8us 10.8us 1 10.8us 10.8us
1: SUM 2 29.0us 14.5us 12481 11.7s 933.5us
1: Overhead 0 0.0ns 12479 12.7ms 1.0us
The information displayed is collected with the system-call "times"; see man-pages for more
information.
For example, to get the CPU usage when running the image enhancement program in Unix enter
this:
Process timing stat. in secs (Own):   Elapsed   User   System   Sum
Forcing size parameters to mpimon is usually not necessary. This is only a means of optimizing
Platform MPI for a particular application, based on knowledge of its communication patterns. For
unsafe MPI programs it may be necessary to adjust buffering to allow the program to complete.
chunk = inter_pool_size / P
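According to this formula, with a hypothetical inter_pool_size of 4 MB and P = 8 communicating
processes, each chunk would be 4 MB / 8 = 512 kB.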
Learn about the performance behaviour of your particular MPI applications on a Platform System
by using a performance analysis tool.
To maximize performance, Platform MPI polls when waiting for communication to terminate, instead
of using interrupts. Polling means that the CPU performs a busy-wait (looping) while waiting for
data over the interconnect. All exotic interconnects require polling.
Some applications create threads, which may result in more active threads than you have CPUs.
This has a huge impact on MPI performance. In threaded applications with irregular communication
patterns, other threads could probably make use of the processor. To increase performance in
this case, Platform MPI provides a “backoff” feature. The backoff feature still polls when
waiting for data, but starts to enter sleep states at intervals when no data is arriving. The
algorithm is as follows: Platform MPI polls for a short time (the idle time), then sleeps for a
period, and polls again.
The sleep period starts at a parameter-controlled minimum and is doubled every time until it
reaches the maximum value. These parameters are set through environment variables.
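A minimal sketch of the backoff idea in C is shown below; the function and parameter names
(wait_with_backoff, idle_iterations, min_sleep_usec, max_sleep_usec) are hypothetical
placeholders for illustration, not Platform MPI's internal interface or the actual environment
variables:

#include <stdio.h>
#include <unistd.h>     /* usleep() */

/* Backoff waiting: poll (busy-wait) for a bounded idle period; if no data
 * has arrived, sleep and double the sleep period up to a maximum, then
 * poll again. All names here are illustrative placeholders. */
static void wait_with_backoff(int (*data_ready)(void),
                              long idle_iterations,
                              long min_sleep_usec,
                              long max_sleep_usec)
{
    long sleep_usec = min_sleep_usec;

    for (;;) {
        /* Poll for a short while (the "idle time"). */
        for (long i = 0; i < idle_iterations; i++)
            if (data_ready())
                return;                     /* data arrived while polling */

        /* Nothing arrived: sleep, then double the period up to the maximum. */
        usleep((useconds_t)sleep_usec);
        if (sleep_usec < max_sleep_usec) {
            sleep_usec *= 2;
            if (sleep_usec > max_sleep_usec)
                sleep_usec = max_sleep_usec;
        }
    }
}

/* Trivial stand-in for the interconnect test, for demonstration only. */
static long calls;
static int fake_data_ready(void) { return ++calls > 50000; }

int main(void)
{
    wait_with_backoff(fake_data_ready, 10000, 100, 100000);
    printf("data arrived\n");
    return 0;
}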
Platform MPI allows you to assign which CPU will run a particular process through the use of bit
masks. This is known as “hard CPU affinity”. Hard affinity cuts down on overhead by optimizing
cache performance.
6.1.3.5 Granularity
You have a choice of three settings for the bitmasks in Platform MPI: execunit, core and
socket. The smallest processing unit in a node is called an “execution unit”. Using the
granularity setting of execunit allows the process to run on a specified single execunit. Using the
granularity setting of core allows the process to run on all execunits in a single core. Using the
granularity setting of socket allows the process to run on all execution units in a single socket.
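For example, on a hypothetical node with two sockets, four cores per socket and two execution
units (hardware threads) per core, a process bound with execunit granularity may run on exactly
one of the 16 execution units, with core granularity on the two execution units of one core, and
with socket granularity on all eight execution units of one socket.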
6.1.3.6 Policy
A policy specifies how processes are placed in relation to each other on one node. Platform MPI
has four policies for binding processes on a node automatically and one for binding manually:
leastload - Processes can run on the least loaded socket, core or hyperthread.
latency - The processes run on as many cores as possible within one socket. If the number of
processes is higher than the number of cores, the next socket is used as well.
bandwidth - Only one process is run on a socket, as long as there are enough sockets. If the
number of processes exceeds the number of sockets, then one socket will run the additional
"excess" process.
none - No limitations are set on the CPU affinity. The processes can run on all execunits in all cores
in all sockets.
manual - You can specify bind values for processes from rank 0 to rank n-1, using a
colon-separated list of hexadecimal values. If the number of processes is greater than the
number of hexadecimal values, the list is reused from the beginning, so that more than one
process shares a hexadecimal value (see the example and sketch below).
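With four processes, for instance, a hypothetical list 0x1:0x2:0x4:0x8 would bind rank 0 to the
first execution unit, rank 1 to the second, and so on; with six processes, ranks 4 and 5 would
wrap around and share the masks 0x1 and 0x2. The sketch below (Linux-specific, and not Platform
MPI's own code) illustrates how such a hexadecimal mask maps to hardware by binding the calling
process to the CPUs set in the mask chosen for its rank:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Apply one mask from a manual-style list of hexadecimal bind values.
 * With more ranks than masks the list wraps around, so several ranks may
 * share a mask. Illustrative only; the mask values are hypothetical. */
static int bind_rank(int rank, const unsigned long *masks, int nmasks)
{
    unsigned long mask = masks[rank % nmasks];      /* wrap-around */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < (int)(8 * sizeof mask); cpu++)
        if (mask & (1UL << cpu))
            CPU_SET(cpu, &set);                     /* bit n selects CPU n */
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = calling process */
}

int main(int argc, char **argv)
{
    /* Hypothetical manual list 0x1:0x2:0x4:0x8 */
    unsigned long masks[] = { 0x1, 0x2, 0x4, 0x8 };
    int rank = (argc > 1) ? atoi(argv[1]) : 0;

    if (bind_rank(rank, masks, 4) != 0)
        perror("sched_setaffinity");
    return 0;
}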
6.1.3.7 Benchmarking
Benchmarking is that part of performance evaluation that deals with the measurement and
analysis of computer performance using various kinds of test programs. Benchmark figures
should always be handled with special care when making comparisons with similar results.
Remember that group operations (MPI_Comm_{create, dup, split}) may involve creating
new communication buffers. If this is a problem, decreasing chunk_size may help.
Consider the Integer Sort (IS) benchmark in NPB (NAS Parallel Benchmarks). When running on
ten processes on 5 nodes over Gigabit Ethernet (mpimon -net smp,tcp bin/is.A.16.scampi
-- r1 2 r2 2 r3 2 r4 2 r5 2) the resulting performance is:
Attention
The run-time selectable algorithms and their values may vary between Platform MPI release
versions.
For information on which algorithms are selectable at run time, and their valid values, set the
environment variable SCAMPI_ALGORITHM (the exact command depends on whether you use a Windows or
a Unix shell) and run an example application.
This produces a listing of the different implementations of particular collective MPI calls.
For each collective operation the listing consists of a number and a short description of the
algorithm, e.g., for MPI_Alltoallv() the following:
SCAMPI_ALLTOALLV_ALGORITHM alternatives
0 pair0
1 pair1
2 pair2
3 pair3
4 pair4
5 pipe0
6 pipe1
7 safe
def 8 smp
For this particular combination of Alltoallv-algorithm and application (IS) the performance varies
significantly, with algorithm 0 close to doubling the performance over the default.
Consider the image processing example from “Profiling with Platform MPI” on page 51 which
contains four collective operations. All of these can be tuned with respect to algorithm according
to the following pseudo-language pattern:
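Based on the variable names used in this chapter (SCAMPI_ALLTOALLV_ALGORITHM,
SCAMPI_REDUCE_ALGORITHM), the pattern is of the form
SCAMPI_<OPERATION>_ALGORITHM=<algorithm number>
where <OPERATION> names the collective operation (presumably BCAST, SCATTER, REDUCE and GATHER
for the four collectives in the image processing example) and <algorithm number> is one of the
values listed via SCAMPI_ALGORITHM.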
For example, trying out the alternative algorithms for MPI_Reduce with two processes can be done
as follows (assuming the Bourne Again Shell, bash):
user% for a in 0 1 2 3 4 5 6 7 8; do
export SCAMPI_REDUCE_ALGORITHM=$a
mpimon ./kollektive-8 ./uf256-8.pgm -- r1 r2;
done
Given that the application then reports the timing of the relevant parts of the code a best choice
can be made. Note however that with multiple collective operations working in the same
program there may be interference between the algorithms. Also, the performance of the
implementations is interconnect dependent.
There are two types of product keys: demo keys and permanent product keys. Demo keys have
an expiry date and cannot be activated. Both types of keys have the following format:
XXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXX
You will be requested to enter your product key. The product keys can be found on the activation
form that came with your Platform software.
The product keys allow Platform software to work right out of the box, giving you immediate full
use of your Platform software. For the software to remain operational, it must be activated
within 30 days of installation. Activation is simple and is described in further detail in the
next section.
http://www.platform.com/Products/platform-manager/product-activation
1 Obtain the host ID of the designated license server:
lmhostid
2 Go to the Platform product activation web server, and enter your product details,
including the host ID. You will be given the activation key for your product.
3 Add the activation code for the product using the command-line interface. The installed
products and their status can then be listed:
lmutil show
Product Version Used Free Expire date Status
-------------------------------------------------------------------
Platform MPI 5.2 0 10 18-jun-2006 Unknown
manage 5.000 1 9 18-jun-2006 Unknown....
Some of the cluster administration tools have support for installing Platform MPI automatically;
check with the supplier.
The -h option gives you details on the installation command and shows which options you need to
specify in order to install the software components you want:
Option Description
-x Ignore errors.
-V Print version.
B.1.2.1 Example
root# ./smcinstall -m
When the -m or -b option is selected, Platform MPI will default to Myrinet or InfiniBand,
respectively, as the default transport device. If this is not desired, modify the networks line
in the global ${MPI_HOME}/etc/ScaMPI.conf configuration file. See the section “Network devices”
on page 42 for more information regarding network device selection.
root# ${MPI_HOME}/sbin/smcuninstall
In order for Platform MPI to locate the InfiniBand libraries, it is important that
/etc/ld.so.conf is correctly set up with paths to the library location. Please check with your
InfiniBand vendor on how to troubleshoot InfiniBand network issues.
Platform MPI comes with a predefined list of InfiniBand device names. If you are using an
OFED-based InfiniBand stack, you may override the default device name and other parameters by
creating an iba_params.conf file in one of the following locations:
• ${MPI_HOME}/etc/iba_params.conf
• ${HOME}/iba_params.conf
• ./iba_params.conf
The format of the file is one parameter=value pair per line. Recognized parameters are:
• "hcadevice"
• "mtu"
Please refer to the OFED stack documentation for descriptions of each parameter. An example
configuration file could look like this:
$ cat ${MPI_HOME}/etc/iba_params.conf
hcadevice=ibhca0
To verify that the gm0 device is operational, run an MPI test job on two or more of the nodes in
question. If the gm0 device fails, the MPI job should fail with a "[ 1] No valid network
connection from 1 to 0" message.
First of all, keep in mind that the GM source must be obtained from Myricom, and compiled on
your nodes.
Verify that the "gm" kernel module is loaded by running lsmod(8) on the compute node in
question.
root# /opt/gm/bin/gm_board_info
You should see all the nodes on your GM network listed. (This command must be run on a node
with a Myrinet card installed!)
A simple cause of failure is that /opt/gm/lib is not listed in /etc/ld.so.conf and/or ldconfig
has not been run; if this is the case, you will get an "unable to find libgm.so" error message.
After reading the text, click Next to continue to the License Agreement screen. Platform MPI is
a licensed product. On each cluster you must designate a license server. This server validates
licenses through the product key. You must obtain a valid product key to continue successfully.
This product key remains valid for a maximum of 30 days; within that time you must activate the
license. Please see the chapter on Licensing for details.
There are two types of product keys: demo keys and permanent product keys. Demo keys have an
expiry date and cannot be activated. Both types of keys have the following format, with eight
groups of four alphanumeric characters:
XXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXX
Note: Only enter the product key and activate it on the license server.
3 Do NOT activate the product key on the compute nodes. If you install the product key on one
of the compute nodes first, you will activate the license for a license server on that node.
Activation is irreversible: you can install and activate the product key on ONLY ONE server,
the designated license server. On all compute nodes, enter the license server node name and IP
address. Accept the settings as they are and click Next, then confirm the installation by
clicking Next. If a window with rolling text appears: DON’T PANIC. This is merely a set-up tool
configuring your installation. When the "Installation Complete" screen comes up, click Close to
close the wizard.
<bracket>== "["<number_or_range>[,<number_or_range>]*"]"
<number_or_range>== <number> | <from>-<to>[:<stride>]
<number>== <digit>+
<from>== <digit>+
<to> == <digit>+
<stride>== <digit>+
<digit>== 0|1|2|3|4|5|6|7|8|9
This is typically used to expand nodenames from a range, using from 1 to multi-dimensional
numbering, or an explicit list.
If <to> or <from> contains leading zeros, then the expansion will contain leading zeros such
that the width is constant and equal to the larger of the widths of <to> and <from>.
The syntax does not allow for negative numbers. <from> does not have to be less than <to>. A
small sketch of the expansion rule is given after the examples below.
Examples:
n[0-2]
is equivalent to
n0 n1 n2
n[00-10:3]
is equivalent to
n00 n03 n06 n09
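As a rough sketch of the expansion rule described above (not the actual implementation), the
following C program expands a single ascending <from>-<to>[:<stride>] range, padding with
leading zeros to the wider of the two bounds:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print one name per number in the range, zero-padded to the wider of the
 * two bounds. Sketch only: handles a single ascending range. */
static void expand_range(const char *prefix,
                         const char *from, const char *to, int stride)
{
    int width_from = (int)strlen(from);
    int width_to   = (int)strlen(to);
    int width = width_from > width_to ? width_from : width_to;
    int lo = atoi(from), hi = atoi(to);

    for (int n = lo; n <= hi; n += stride)
        printf("%s%0*d ", prefix, width, n);   /* %0*d gives the zero padding */
    printf("\n");
}

int main(void)
{
    expand_range("n", "0", "2", 1);     /* n0 n1 n2 */
    expand_range("n", "00", "10", 3);   /* n00 n03 n06 n09 */
    return 0;
}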
D.2 Grouping
Utilities that use scagroup will accept a group alias wherever a host name or hostlist is
expected. The group alias is resolved to a list of hostnames as specified in the scagroup
configuration file. If the file ${HOME}/scagroup.conf exists in the user's home directory, it
will be used. Otherwise, the system default file
${MPI_HOME}/etc/scagroup.conf
will be used.
Examples:
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main( int argc, char** argv )
{
int width, height, rank, size, sum, my_sum;
int numpixels, my_count, i, val;
unsigned char pixels[65536], recvbuf[65536];
unsigned int buffer;
double rms;
FILE *infile;
FILE *outfile;
char line[80];
original
P2
8 8
255
168 168 168 122 122 168 188 168
168 168 168 122 168 168 168 64
168 168 168 122 168 122 168 168
168 168 122 168 148 122 168 60
168 168 122 168 128 122 168 168
168 122 168 168 88 122 122 122
168 122 168 28 168 122 168 168
168 168 168 168 168 168 168 108
enhanced contrast
F.4 Data-striping
Data-striping is the ability to split the data that is sent from the MPI layer into multiple “chunks”
and send them across the configured underlying HCA devices. This can increase the available
bandwidth to the MPI process by a factor of N where N is the number of HCAs configured for the
multi-rail device in use. The stripe-size as well as which HCAs should be used for the multi-rail
device is configured through the static DAT registry.
Although this technique in theory provides N times higher interconnect bandwidth, its actual
benefit to application performance depends on several issues.
F.5 Load-balancing
The purpose of load balancing is to distribute the available HCAs in the system among the MPI
processes, so that multiple processes do not congest a single HCA. This increases the aggregated
system bandwidth available to all MPI processes compared to using a single HCA, and hence has
the potential of improving MPI application performance.
mlx4_0:1,2" ""
The first field must match the name that will be used with the “-networks” option to
mpirun/mpimon for network selection. Please note that this name cannot match any of the built-in
names in Platform MPI (such as “smp”, “tcp” and “ib”). The next five fields should always be as
in the example line above; see the DAPL 1.2 API document for more information about the static
registry configuration.
The seventh field in the configuration line, also known as the “instance data” field, is what
controls the multi-rail device. It should always be a quoted string, as in the example above.
As you can see from the example above, the “iam1” device would use 1 HCA (mlx4_0) and 2
ports (port1 and port2).
Another example, using two HCAs with only one port on each, is shown below:
In both cases, data larger than 8192 bytes would be striped across the multiple rails using 8192
bytes per rail in a round robin fashion.
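For example, with either of these two-rail configurations, a 32768-byte message would be split
into four 8192-byte chunks sent round robin across the rails, so that the first and third chunks
travel over one rail and the second and fourth over the other.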
The application code itself (“application”) is linked together with functions from the Platform
MPI library (“libmpi.so”). That unit constitutes one of the parallel processes executing the work
of a running job. The running processes in every processing node are started by an independent
Platform MPI process (“mpisubmon”), which subsequently provides local MPI monitoring
services.
Each “mpisubmon” process, in turn, is started from a single top process from which the entire
parallel job is initially submitted (“mpimon”). The top process ‘mpimon’ will thereafter
communicate with and monitor all ‘mpisubmon’ processes running under it.
The “libvendorcpr.so” library is a plugin component. Currently, Platform MPI supports Berkeley
Lab's open source checkpoint/restart implementation, BLCR. The library contains the necessary
functions to interface with the BLCR libraries to create an image of each running process and
store its state during a checkpoint, as well as to restore that image during a restart.
mpimon halts all ongoing communication, and the existing MPI fabric is “drained” of messages.
mpimon then closes down the current MPI fabric established on the available HPC interconnect
environment.
The components from Platform MPI will then invoke the necessary functions in “libvendorcpr.so”
to perform the checkpoint (i.e., save the state of all running processes onto disks).
G.1.2.4 Restart
Applications can be restarted from a checkpoint. A restart resembles the continue-part of the
"checkpoint & continue". The processes are loaded into memory by the
libvendorcpr.so-functionality and control is transferred to the process which re-establishes /
re-opens the MPI fabric before continuing.
Depending on whether --abort is given, the job will continue or stop after the checkpoint is
done. If the jobid option is omitted, the newest job for the user is checkpointed.
The <cp-spec> can be omitted, and if so the newest job/checkpoint will be restarted. The job can
be identified by giving "--jobid <jobid>" and/or "--checkpoint-number <cpn>". (<jobid> and <cpn>
can be found with "mpictl listJobsAndCheckpoints".)
It is possible to make changes to the environment for the restarted job; the node list, network
and other parameters to mpimon can be changed.
mpictl listJobsAndCheckpoints gives a list of currently executing jobs and of the checkpoints
taken, with their checkpoint numbers.
Checkpoint files can be deleted for a given jobid; if no jobid is given, all checkpoint files
are deleted.
When problems occur, first check the list of common errors and their solutions; an updated list of
Platform MPI-related Frequently Asked Questions (FAQ) is posted in the Support section
of the Platform website (http://www.platform.com). If you are unable to find a solution to
the problem(s) there, please read this chapter before contacting [email protected].
Problems and fixes reported to Platform will eventually be included in the appropriate sections of
this manual. Please send relevant remarks by e-mail to [email protected].
Many problems stem from not using the right application code, from daemons that Platform MPI
relies on having stopped, or from an incomplete specification of the network drivers. Some
typical problems and their solutions are described below.
mpid, mpimon, mpisubmon and the libraries all have version variables that are checked at
start-up. To ensure that these are correct, try the following:
Solution: Platform MPI assumes a homogeneous file structure. If you start mpimon from a
directory that is not available on all nodes, you must set SCAMPI_WORKING_DIRECTORY to point to
a directory that is available on all nodes.
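Example (bash, assuming a hypothetical shared directory, and that the variable is set in the
environment like the other SCAMPI_ variables):
export SCAMPI_WORKING_DIRECTORY=/shared/scratch
mpimon ./kollektive-8 ./uf256-8.pgm -- r1 r2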
A
activation code ..........................................................................................................78
B
Benchmarking Platform MPI ........................................................................................66
C
-chunk_size ......................................................................................................... 21, 34
Command line options ........................................................................................... 14, 31
Communication protocols in ScaMPI
Eagerbuffering protocol ...........................................................................................48
Inline protocol ........................................................................................................47
Transporter protocol........................................................................................... 49, 50
Zerocopy protocol ...................................................................................................50
Communication resources in Platform MPI ................................................................ 21, 34
D
-debug .....................................................................................................................18
-debugger ................................................................................................................18
-display ....................................................................................................................18
E
eagerbuffering protocol .................................................................................... 21, 34, 48
Environment .............................................................................................................27
Environment-variable options ................................................................................. 14, 31
F
-failover_mode ..........................................................................................................18
G
global variables .................................................................................................... 23, 36
GM ..........................................................................................................................44
I
inline protocol ................................................................................................. 21, 34, 47
K
Key Activation ...........................................................................................................77
L
libfmpi ..................................................................................................................8, 28
libmpi ....................................................................................................................... 8
Licensing ..................................................................................................................77
lmhostid ...................................................................................................................78
M
Managing Product keys...............................................................................................78
manual activation ......................................................................................................78
MPI........................................................................................................................ 105
MPI_Abort() ......................................................................................................... 24, 37
MPI_Bsend() ........................................................................................................ 22, 36
MPI_COMM_WORLD .................................................................................................... 5
MPI_Irecv() ......................................................................................................... 22, 35
MPI_Isend()......................................................................................................... 22, 35
MPI_Probe()......................................................................................................... 22, 35
MPI_Recv().......................................................................................................... 22, 35
MPI_SOURCE ....................................................................................................... 22, 35
MPI_TAG ............................................................................................................. 22, 35
MPI_Waitany() ..................................................................................................... 23, 36
mpi.h .................................................................................................................. 23, 36
mpiboot ...................................................................................................................40
mpic++..................................................................................................................... 8
mpicc........................................................................................................................ 8
MPICH.................................................................................................................... 105
mpid ........................................................................................................................40
mpif.h ................................................................................................................. 23, 36
mpif77 ...................................................................................................................... 8
mpimon......................................................... 11, 14, 17, 18, 21, 24, 26, 29, 31, 34, 37, 39, 40
mpimon-parameters ..................................................................................................30
mpirun ................................................................................................................ 16, 21
mpirun options ..........................................................................................................17
mpisubmon...............................................................................................................40
Myrinet ....................................................................................................................44
N
-networks ............................................................................................................ 14, 32
O
Optimize Platform MPI performance .............................................................................65
P
Platform MPI Environment ........................................................................................... 7
Platform product activation web service ........................................................................78
-pool_size............................................................................................................ 21, 34
Product Activation......................................................................................................78
product key ..............................................................................................................77
demo keys ........................................................................................................ 77, 84
permanent product key ...................................................................................... 77, 84
Profiling
ScaMPI..................................................................................................................51
R
Run parameters.........................................................................................................18
T
TCP .................................................................................................................... 21, 34
totalview ..................................................................................................................21
transporter protocol ......................................................................................... 21, 34, 49
tvmpimon .................................................................................................................21
U
Unsafe MPI programs ............................................................................................ 23, 36
User interface errors and warnings.......................................................................... 24, 37
V
-verbose.............................................................................................................. 24, 37
X
xterm.......................................................................................................................18
Z
zerocopy protocol ......................................................................................................50