Numerical Libraries For Petascale Computing: Brett Bode, William Gropp
Performance Estimate
How fast should this run?
Standard complexity analysis in numerical analysis counts floating point operations
Our matrix-matrix multiply algorithm has 2n^3 floating point operations
  3 nested loops, each with n iterations
  1 multiply, 1 add in each inner iteration
For n = 100, that is 2x10^6 operations, or about 1 msec on a 2 GHz processor :)
For n = 1000, 2x10^9 operations, or about 1 sec
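For reference, a minimal C sketch (my own illustration, not taken from the slides) of the triple loop being counted:

/* Triple-loop matrix-matrix multiply: n*n results, each needing
   n multiplies and n adds, hence 2*n^3 flops in total. */
void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];   /* 1 multiply + 1 add */
            c[i*n + j] = sum;
        }
    }
}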
The Reality
Measured performance of matrix-matrix multiply:
  N=100:  1818 MF (1.1 ms)
  N=1000: 335 MF (6 s)
[Chart compares hand-tuned code, compiler-generated code, and the routine from ATLAS; the remaining values did not survive extraction]
Enormous effort required to get good performance
Sometimes Slower
Using a library routine is not always the best choice:
Library routines add overhead
Fewer routines (simpler for the user) means more overhead in determining the exact operation
Apply the usual rules:
  Instrument your code (a timing sketch follows below)
  Know what performance you need/expect
  Only worry about code that takes a significant fraction of the total run time
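A minimal instrumentation sketch (my own illustration; compute_kernel is a hypothetical routine standing in for the code under study):

/* Time the candidate region before deciding whether a library call,
   or a hand-written replacement, is worth the effort. */
#include <stdio.h>
#include <mpi.h>

void compute_kernel(void);   /* hypothetical routine under study */

void profile_kernel(void)
{
    double t0 = MPI_Wtime();
    compute_kernel();
    double t1 = MPI_Wtime();
    printf("kernel took %.6f seconds\n", t1 - t0);
}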
Algorithms and Moore's Law
This advance took place over a span of about 36 years, or 24 doubling times for Moore's Law
2^24 ≈ 16 million, the same as the factor from algorithms alone!
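Spelling out the arithmetic (assuming the conventional 18-month doubling time):

\frac{36 \text{ years}}{1.5 \text{ years per doubling}} = 24 \text{ doublings}, \qquad 2^{24} = 16{,}777{,}216 \approx 16 \text{ million}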
[Figure: relative speedup vs. year, comparing algorithmic improvements with Moore's Law]
Example: Multigrid
Multigrid can be a very effective algorithm for certain classes of problems
Efficient implementations must address:
  Algorithmic choices (e.g., smoother)
  Implementation for memory locality
  Use as a preconditioner within a Krylov method
And that's just on a single processor
Parallel versions add questions about efficient coarse-grid solves, data exchange, etc.
Libraries such as hypre (https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html) contain efficient implementations for parallel systems
Correct
Some operations are subtle and require care to get them right
Example: (pseudo) random number generation in parallel
  Using a local random generator such as srand produces correlated values, not random at all
  Simply using different seeds for each thread/process in a parallel program isn't enough (unless the seeds are picked very carefully)
SPRNG: Scalable Parallel Random Number Generator
Provides good pseudo-random number generators, suitable for use in a parallel program
http://sprng.cs.fsu.edu/
Greater Productivity
Parallel programming is widely viewed as difficult
Much effort has gone into programming languages that make parallel programming easy
But what is really needed is a way to provide the data structures, algorithms, and methods needed by the computational scientist
A general-purpose language is not the best way to do this (though it may be a good way to implement it)
An alternative is through carefully designed libraries
PETSc objects hide the details of distributed data structures and function parameters
/* Get the mesh size. Use 10 by default */
n = 10;
PetscOptionsGetInt( PETSC_NULL, "-n", &n, 0 );
/* Get the process decomposition. Default is the same as without DAs */
px = 1;
PetscOptionsGetInt( PETSC_NULL, "-px", &px, 0 );
MPI_Comm_size( PETSC_COMM_WORLD, &worldSize );
py = worldSize / px;
/* Create a distributed array */
DACreate2d( PETSC_COMM_WORLD, DA_NONPERIODIC, DA_STENCIL_STAR,
            n, n, px, py, 1, 1, 0, 0, &grid );
/* Form the matrix and the vector corresponding to the DA */
A = FormLaplacianDA2d( grid, n );
b = FormVecFromFunctionDA2d( grid, n, func );
VecDuplicate( b, &x );
PETSc provides routines to create, allocate, and manage distributed data structures
SLESCreate( PETSC_COMM_WORLD, &sles );
SLESSetOperators( sles, A, A, DIFFERENT_NONZERO_PATTERN );
SLESSetFromOptions( sles );
SLESSolve( sles, b, x, &its );
PetscPrintf( PETSC_COMM_WORLD, "Solution is:\n" );
VecView( x, PETSC_VIEWER_STDOUT_WORLD );
PetscPrintf( PETSC_COMM_WORLD, "Required %d iterations\n", its );
MatDestroy( A ); VecDestroy( b ); VecDestroy( x );
SLESDestroy( sles ); DADestroy( grid );
PetscFinalize( );
return 0;
}
PETSc provides routines that solve linear systems
PETSc provides coordinated I/O (the behavior is as if there were a single process), including the output of the distributed Vec object
/* -*- Mode: C; c-basic-offset:4 ; -*- */
#include "petsc.h"
#include "petscvec.h"
#include "petscda.h"

/* Form a vector based on a function for a 2-d regular mesh on the unit square */
Vec FormVecFromFunctionDA2d( DA grid, int n,
                             double (*f)( double, double ) )
{
    Vec    V;
    int    is, ie, js, je, in, jn, i, j;
    double h;
    double **vval;

    h = 1.0 / (n + 1);
    DACreateGlobalVector( grid, &V );
    DAVecGetArray( grid, V, (void **)&vval );
    /* Get global coordinates of this patch in the DA grid */
    DAGetCorners( grid, &is, &js, 0, &in, &jn, 0 );
    ie = is + in - 1;
    je = js + jn - 1;
    for (i=is ; i<=ie ; i++) {
        for (j=js ; j<=je ; j++) {
            vval[j][i] = (*f)( (i + 1) * h, (j + 1) * h );
        }
    }
    DAVecRestoreArray( grid, V, (void **)&vval );
    return V;
}
Almost the uniprocess code
for (i=is; i<=ie; i++) {
    for (j=js; j<=je; j++) {
        row.j = j; row.i = i; nelm = 0;
        if (j - 1 > 0) { vals[nelm] = oneByh2; cols[nelm].j = j - 1; cols[nelm++].i = i; }
        if (i - 1 > 0) { vals[nelm] = oneByh2; cols[nelm].j = j;     cols[nelm++].i = i - 1; }
        vals[nelm] = - 4 * oneByh2; cols[nelm].j = j; cols[nelm++].i = i;
        if (i + 1 < n - 1) { vals[nelm] = oneByh2; cols[nelm].j = j;     cols[nelm++].i = i + 1; }
        if (j + 1 < n - 1) { vals[nelm] = oneByh2; cols[nelm].j = j + 1; cols[nelm++].i = i; }
        MatSetValuesStencil( A, 1, &row, nelm, cols, vals, INSERT_VALUES );
    }
}
MatAssemblyBegin( A, MAT_FINAL_ASSEMBLY );
MatAssemblyEnd( A, MAT_FINAL_ASSEMBLY );
return A;
}
Just the usual code for setting the elements of the sparse matrix (the complexity comes, as it often does, from the boundary conditions)
Sequential write
Parallel writes are carried out by shipping data to a single process
PnetCDF: parallel read/write to a shared netCDF file
Built on top of MPI-IO, which uses the optimal I/O facilities of the parallel file system and the MPI-IO implementation
Allows MPI-IO hints and datatypes for further optimization (see the sketch below)
[Diagram: processes P0, P1, P2, P3 accessing a shared file]
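A minimal PnetCDF sketch (my own illustration; the file name, variable name, and row decomposition are assumptions, and error checking is omitted) in which every process collectively writes its block of a global 2-D array to a shared file:

#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

#define NX 1024          /* assumed global array sizes */
#define NY 1024

int main(int argc, char *argv[])
{
    int        rank, nprocs, ncid, dimids[2], varid;
    MPI_Offset start[2], count[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collective create; an MPI_Info object could carry MPI-IO hints */
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", NX, &dimids[0]);
    ncmpi_def_dim(ncid, "y", NY, &dimids[1]);
    ncmpi_def_var(ncid, "field", NC_DOUBLE, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* 1-D decomposition: each rank owns NX/nprocs consecutive rows
       (assumes nprocs divides NX evenly) */
    count[0] = NX / nprocs;     count[1] = NY;
    start[0] = rank * count[0]; start[1] = 0;

    double *local = (double *)malloc(count[0] * count[1] * sizeof(double));
    for (MPI_Offset i = 0; i < count[0] * count[1]; i++)
        local[i] = (double)rank;

    /* Collective write: all processes write their pieces in parallel */
    ncmpi_put_vara_double_all(ncid, varid, start, count, local);

    ncmpi_close(ncid);
    free(local);
    MPI_Finalize();
    return 0;
}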
Performance inconsistencies
MPI-IO and PnetCDF should have similar performance
MPI-IO and HDF5 should have similar performance for data
POSIX I/O and comparable MPI-IO patterns should have similar performance
Performance consistency is important (but not sufficient) for scalability
But performance inconsistencies are common in practice
Recommendations
Don't do it yourself!
Use frameworks and libraries where possible
Exploit the principles used in those libraries if you need to write your own
Summary
There are many reasons to use libraries:
Faster
Correct
Real parallel I/O
More productive programming
The best reason: they let you focus on getting your science done
There are many libraries available
Only a few were mentioned in this talk
Many other good ones are available: ask!