Tree Codes for Vortex Dynamics: Application of a Programming Framework

Sandeep Bhatt et al.
Bell Communications Research, Morristown NJ 07960, and Computer Science, Rutgers University, Piscataway NJ 08855. Mechanical and Aerospace Engineering, Rutgers University, Piscataway NJ 08855.
Abstract
This paper describes the implementation of a fast algorithm for the study of multi-filament vortex simulations. The simulations involve distributions that are irregular and time-varying. We adapt our programming framework which was earlier used to develop high-performance implementations of the Barnes-Hut and Greengard-Rokhlin algorithms for gravitational fields. We describe how the additional subtleties involved in vortex filament simulations are accommodated efficiently within the framework. We describe the ongoing experimental studies and report preliminary results with simulations of one million particles on the 128-node Connection Machine CM-5. The implementation sustains a rate of almost 30% of the peak machine rate. These simulations are more than an order of magnitude larger in size than previously studied using direct methods. We expect that the use of the fast algorithm will allow us to study the generation of small scales in the vortex collapse and reconnection problem with adequate resolution.
1 Introduction
Computational methods to track the motions of bodies which interact with one another, and possibly subject to an external field as well, have been the subject of extensive research for centuries. So-called "N-body" methods have been applied to problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics. This paper describes the implementation of a tree code for vortex dynamics simulations, which we apply to the study of vortex collapse and reconnection [20]. We expect that the use of the fast algorithm will allow
us to study the generation of small scales with adequate resolution, and make it possible to realistically simulate the flows around bodies of complex shapes encountered in engineering applications. The advantages of the fast particle vortex methods will be especially important in the high Reynolds number regime, which cannot be studied with the current grid-based methods (finite difference, finite volume, and spectral) because of the resolution limitations imposed by current computers.

Computing the field at a point involves summing the contributions from each of the N − 1 particles. The direct method evaluates all pairs of two-body interactions. While this method is conceptually simple, vectorizes well, and is the algorithm of choice for many applications, its O(N^2) arithmetic complexity rules it out for large-scale simulations involving millions of particles iterated over many time steps. For vortex simulations the largest number of particles reported is around fifty thousand [3, 21, 36]. Larger simulations require faster methods involving fewer interactions to evaluate the field at any point.

In the last decade a number of approximation algorithms have emerged. These algorithms exploit the observation that the effect of a cluster of particles at a distant point can be approximated by a small number of initial terms of an appropriate power series. This leads to an obvious divide-and-conquer algorithm in which the particles are organized in a hierarchy of clusters which allows the approximation to be applied efficiently. Barnes and Hut [5] applied this idea to gravitational simulations. More sophisticated schemes were developed by Greengard and Rokhlin [15] and subsequently refined by Zhao [37] and Anderson [2]. Better data structures have recently been developed by Callahan and Kosaraju [9]. Several parallel implementations of the algorithms
mentioned above have been developed. Salmon [27] implemented the Barnes-Hut algorithm on the NCUBE and Intel iPSC, Warren and Salmon [34] reported experiments on the 512-node Intel Touchstone Delta, and later developed hashed implementations of a global tree structure which they report in [35, 18]. They have used their codes for astrophysical simulations and also for vortex dynamics. This paper builds on our CM-5 implementation [25] of the Barnes-Hut algorithm for astrophysical simulations and contrasts our approach and conclusions with the aforementioned efforts.

This abstract is organized as follows. Section 2 describes the application problem in some detail, and outlines the Barnes-Hut fast summation algorithm. Section 3 describes how we use our framework for building implicit global trees in distributed memory as well as methods for accessing and modifying these structures efficiently. Section 4 describes experimental results on the Connection Machine CM-5.
Figure 1: Induced velocity U vs. nondimensional grid size q_a for a circular vortex ring with C = 0.07595.

models, the O(N^2) nature of the computational expense of the Biot-Savart direct method (where N is the number of grid points) severely limits the vortex collapse simulations, leaving the most interesting cases of collapse beyond the cases that have been examined to date [12, 13]. Recent multi-filament simulations have raised doubts about the convergence of the results [1, 3]. Figure 1 presents a convergence study of the velocity U induced on itself by a circular vortex ring with thickness C = 0.07595 and radius a = 1. The resolution is represented (along the horizontal axis) by the nondimensional grid size q_a = 2σ/h_a, where σ is the smoothing parameter in the simulation and h_a is the minimum distance between grid points. Two types of discretization schemes are employed. One underpredicts and the other overpredicts the velocity U. We observe in Figure 1 that, in order for convergence, it is necessary to have both a sufficiently high number of grid points N and a sufficiently small value of the smoothing parameter σ, in agreement with the convergence theorems [6, 14]. The small values of σ necessary for convergence make it necessary to employ millions of particles for these particular values of the ratio C/σ.
The N_f filaments forming the bundle are advected according to the velocity field

    u(x) = -\frac{1}{4\pi} \sum_{i=1}^{N_f} \Gamma_i \oint \frac{(x - \chi) \times d\chi}{|x - \chi|^3} \, g_\sigma(|x - \chi|) .    (1)

The strength of the particles is α_p = Γ_p Δχ_p, and the smoothing function g_σ is defined in terms of the vortex core profile g(s) through an integral of the form ∫_0^ρ g(s) ds/s². The term r^m is the truncation error that results from using a finite number of terms in the multipole expansion (4).
2.2 Discretization
Each filament of the vortex ring is discretized by n_0 grid points or vortex elements. Once this is done, the order of the summations in Equation (1) is unimportant, i.e., (1) is solved numerically at N discrete points or vortex elements χ_p by using the approximation

    u(x) = -\frac{1}{4\pi} \sum_{p=1}^{N} \Gamma_p \frac{(x - \chi_p) \times \Delta\chi_p}{|x - \chi_p|^3} \, g_\sigma(|x - \chi_p|) ,    (2)

where the filament ordering has to be considered in the computation of the central difference

    \Delta\chi_p = \tfrac{1}{2} (\chi_{p+1} - \chi_{p-1}) .    (3)

This is a characteristic of the filament approach in 3D vortex methods. In contrast with the "vortex arrow" approach [18, 23], updating the "strength" of the vortex elements in the filament method does not require the evaluation of the velocity gradient, which involves the computation of another integral over all of the particles. Also, filaments in the form of closed curves satisfy the divergence-free condition of the vorticity field. This is not always the case at all times in the vortex arrow approach. The velocity field in Eq. (2) can be computed more efficiently by using the multipole expansion
    u(x) = -\sum_{n=0}^{m} \sum_{j+k \le n} M(j,k,n-j-k) \, D(j,k,n-j-k)(x - \chi_0) + r^m ,    (4)

where

    M(j,k,n-j-k) = \sum_{p=1}^{N} \alpha_p \, (\chi_p - \chi_0)_1^{j} \, (\chi_p - \chi_0)_2^{k} \, (\chi_p - \chi_0)_3^{n-j-k} .    (5)

The centerline of the ring is given by

    \chi(\theta) = (a \cos\theta,\; b \sin\theta,\; c \sin 2\theta) ,    (7)

where 0 ≤ θ ≤ 2π. The grid points are located at equally spaced intervals or at variable intervals, with the smaller intervals on the collapse region. We call this geometry the "Lissajous-elliptic" ring because of its projections on two orthogonal planes (for c > 0). The thickness of the multi-filament ring is C. The circulation distribution Γ_i and the initial filament core radius σ_i also need to be specified. Besides the fact that its parameter space contains cases of very rapid collapse, the low number of parameters of this configuration allows a complete parameter study at less computational expense. The case a = b = 1, c = 0 corresponds to the circular ring we use for the static studies (Figure 1). Low aspect ratio elliptic rings (a > b, c = 0) correspond to rings with periodic behavior that can be used for dynamic and long-time behavior testing of the algorithm. After thorough testing is carried out, production simulations of collapsing vortex configurations with c > 0 will be performed.

In order to evaluate the quality of the simulations we have also implemented different types of diagnostics. These include the motion invariants (linear and angular impulse, and energy). For low aspect ratio elliptic rings we measure the period of oscillation and the deviation of the initial planar shape of the ring at each period. Other types of diagnostics have been implemented to obtain physical insight into the collapse problem. These include measurements of vorticity, strain rate, and vortex stretching. The efficient evaluation of these diagnostics requires the use of the fast summation algorithm employed in the simulation itself.
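To make Equations (2)-(3) concrete, the following is a minimal C sketch of the direct O(N^2) evaluation for a single closed filament. It is not the production CDPEAC kernel described later; the cutoff function g_sigma is left as a caller-supplied placeholder, and all names are illustrative.

    #include <math.h>
    #include <stddef.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    typedef struct { double x, y, z; } vec3;

    /* Central difference along a closed filament of n points (Eq. 3). */
    static vec3 delta_chi(const vec3 *chi, size_t p, size_t n)
    {
        const vec3 *nx = &chi[(p + 1) % n];
        const vec3 *pv = &chi[(p + n - 1) % n];
        vec3 d = { 0.5 * (nx->x - pv->x),
                   0.5 * (nx->y - pv->y),
                   0.5 * (nx->z - pv->z) };
        return d;
    }

    /* Direct O(N^2) smoothed Biot-Savart sum (Eq. 2) at one evaluation point x.
       gamma[p] is the circulation of element p; g_sigma is the cutoff function. */
    vec3 induced_velocity(vec3 x, const vec3 *chi, const double *gamma,
                          size_t n, double (*g_sigma)(double))
    {
        vec3 u = { 0.0, 0.0, 0.0 };
        for (size_t p = 0; p < n; p++) {
            vec3 r = { x.x - chi[p].x, x.y - chi[p].y, x.z - chi[p].z };
            double d = sqrt(r.x * r.x + r.y * r.y + r.z * r.z);
            if (d == 0.0) continue;                 /* skip the singular self term */
            vec3 dc = delta_chi(chi, p, n);
            vec3 c = { r.y * dc.z - r.z * dc.y,     /* (x - chi_p) x (delta chi_p) */
                       r.z * dc.x - r.x * dc.z,
                       r.x * dc.y - r.y * dc.x };
            double w = -gamma[p] * g_sigma(d) / (4.0 * M_PI * d * d * d);
            u.x += w * c.x;  u.y += w * c.y;  u.z += w * c.z;
        }
        return u;
    }

In a multi-filament run, Eq. (3) would be applied per filament using each filament's own ordering; this sketch is only the O(N^2) baseline that the tree code accelerates.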
To reduce the complexity of computing the sums in Equation 2, we use the Barnes-Hut algorithm. To organize a hierarchy of clusters, we first compute an oct-tree partition of the three-dimensional box (region of space) enclosing the set of bodies. The partition is computed recursively by dividing the original box into eight octants of equal volume until each undivided box contains exactly one body. Alternative tree decompositions have been suggested [4, 9]; the Barnes-Hut algorithm applies to these as well. Each internal node of the oct-tree represents a cluster. Once the oct-tree has been built, the moments of the internal nodes are computed in one phase up the tree, starting at the leaves. The next step is to compute induced velocities; each particle traverses the tree in depth-first manner starting at the root. For any internal node, if the distance D from the corresponding box to the particle exceeds the quantity R/θ, where R is the side-length of the box and θ is an accuracy parameter, then the effect of the subtree on the particle is approximated by a two-body interaction between the particle and a point vortex located at the geometric center of the tree node. The tree traversal continues, but the subtree is bypassed. Once the induced velocities of all the bodies are known, the new positions and vortex element strengths are computed. The entire process, starting with the construction of the oct-tree, is repeated for the desired number of time steps. For convenience we refer to the set of nodes which contribute to the induced velocity on a particle as the essential nodes for the particle. Each particle has a distinct set of essential nodes, which changes with time. One remark concerning distance measurements is in order. There are several ways to measure the distance between a particle and a box; Salmon [27] discusses various alternatives in some detail. For consistency, we measure distances from bodies to the perimeter of a box in the L1 metric. In our experiments we vary θ to test the accuracy of the code compared to solutions computed using the direct method at specific points.
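A schematic C rendering of this tree walk follows; the node fields, the box-to-particle distance routine, and the interaction routines are placeholders, and the names are illustrative rather than the paper's actual data structures.

    typedef struct Node {
        double side;               /* side length R of the box for this node      */
        double center[3];          /* geometric center used for the point vortex  */
        int    is_leaf;
        struct Node *child[8];     /* children of an internal node (may be NULL)  */
        /* ... multipole moments, list of bodies for a leaf, etc.                 */
    } Node;

    double dist_to_box(const Node *n, const double pos[3]);      /* the distance D */
    void cluster_interaction(const Node *n, const double pos[3], double u[3]);
    void leaf_interactions(const Node *n, const double pos[3], double u[3]);

    /* Depth-first walk: a subtree is bypassed (approximated by a single
       point-vortex interaction) whenever D > R / theta.
       The accumulator u[3] is assumed to be zeroed by the caller.        */
    void tree_walk(const Node *n, const double pos[3], double theta, double u[3])
    {
        if (dist_to_box(n, pos) > n->side / theta) {
            cluster_interaction(n, pos, u);     /* node is essential; stop here */
            return;
        }
        if (n->is_leaf) {
            leaf_interactions(n, pos, u);       /* direct two-body interactions */
            return;
        }
        for (int i = 0; i < 8; i++)
            if (n->child[i] != NULL)
                tree_walk(n->child[i], pos, theta, u);
    }

The set of nodes at which the walk stops in cluster_interaction is exactly the particle's set of essential nodes.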
distributed memory. The tension between the communication overhead and computational throughput is of central concern to obtaining high performance. The challenges can be summarized as follows.

1. The oct-tree is irregularly structured and dynamic; as the tree evolves, a good mapping must change adaptively.

2. The data access patterns are irregular and dynamic; the set of essential tree nodes cannot be predicted without traversing the tree. The overhead of traversing a distributed tree to find the essential nodes can be prohibitive unless done carefully.

3. The number of floating point operations necessary to update the position can vary tremendously between bodies; the difference often ranges over an order of magnitude. Therefore, it is not sufficient to map equal numbers of bodies among processors; rather, the work must be equally distributed among processors. This is a tricky issue since mapping the nodes unevenly can create imbalances in the work required to build the oct-tree.

The techniques we use to guarantee efficiency include: (a) exploiting physical locality, (b) an implicit tree representation which requires only incremental updates when necessary, (c) sender-driven communication, (d) aggregation of computation and communication in bulk quantities, and (e) implementing the "all-to-some" communication abstraction.
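Item (e), the all-to-some abstraction, can be sketched in present-day MPI terms as follows. The original code used the CM-5 CMMD library, so this is only an illustration of the idea, with illustrative names: per-destination buffers are aggregated and exchanged in two bulk steps, even though each processor communicates with only a subset of the others.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Each processor has an outgoing byte buffer for only "some" destinations
       (sendbytes[i] == 0 for most i).  Sizes are exchanged first, then the
       aggregated payloads; the receiver recovers per-source offsets by a
       prefix sum over recvbytes.                                            */
    void all_to_some(char **sendbuf, const int *sendbytes, int nprocs,
                     char **recvbuf_out, int **recvbytes_out, MPI_Comm comm)
    {
        int *recvbytes = malloc(nprocs * sizeof(int));
        MPI_Alltoall((void *)sendbytes, 1, MPI_INT, recvbytes, 1, MPI_INT, comm);

        int *sdispl = malloc(nprocs * sizeof(int));
        int *rdispl = malloc(nprocs * sizeof(int));
        int stotal = 0, rtotal = 0;
        for (int i = 0; i < nprocs; i++) {
            sdispl[i] = stotal;  stotal += sendbytes[i];
            rdispl[i] = rtotal;  rtotal += recvbytes[i];
        }
        char *spacked = malloc(stotal > 0 ? stotal : 1);
        char *rpacked = malloc(rtotal > 0 ? rtotal : 1);
        for (int i = 0; i < nprocs; i++)
            if (sendbytes[i] > 0)
                memcpy(spacked + sdispl[i], sendbuf[i], (size_t)sendbytes[i]);

        MPI_Alltoallv(spacked, (int *)sendbytes, sdispl, MPI_BYTE,
                      rpacked, recvbytes, rdispl, MPI_BYTE, comm);

        *recvbuf_out = rpacked;
        *recvbytes_out = recvbytes;
        free(spacked);  free(sdispl);  free(rdispl);
    }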
number of processors is small relative to the number of particles. We chose ORB decomposition for several reasons. It provides a simple way to decompose space among processors, and a way to quickly map points in space to processors. This latter property is essential for sender-directed communication of essential data, for relocating bodies which cross processor boundaries, for finding successor particles along a filament, and also for building the global BH-tree. Furthermore, ORB preserves data locality reasonably well and permits simple load-balancing. While it is expensive to recompute the ORB at each time step [28], the cost of incremental load-balancing is negligible [25].

The ORB decomposition is incrementally updated in parallel as follows. At the end of an iteration, each ORB tree node is assigned a weight equal to the total number of operations performed in updating the states of particles in the processors that are descendants of the node. A node is overloaded if its weight exceeds the average weight of nodes at its level by a small, fixed percentage, say 5%. We identify nodes which are not overloaded but one of whose children is overloaded; call each such node an initiator. Only the processors within the corresponding subtree participate in balancing the load for the region of space associated with the initiator. The subtrees for different initiators are disjoint so that non-overlapping regions can be balanced in parallel. At each step of the load-balancing procedure it is necessary to move bodies from the overloaded child to the non-overloaded child. This involves computing a new separating hyperplane; we use a binary search combined with a tree traversal on the local BH-tree to determine the total weight within a parallelepiped.
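The binary search for a new separating hyperplane can be sketched as follows; weight_below stands for the local BH-tree traversal (plus a reduction over the participating processors) that sums the work weight on one side of a candidate cut within the initiator's region. The names and the tolerance-based termination are illustrative, not the paper's exact procedure.

    /* Total work weight of bodies in the initiator's region whose coordinate
       along axis dim lies below cut (placeholder for the BH-tree traversal). */
    double weight_below(int dim, double cut);

    /* Binary search for a hyperplane that splits the region's weight in half. */
    double find_separator(int dim, double lo, double hi,
                          double total_weight, double tol)
    {
        while (hi - lo > tol) {
            double mid = 0.5 * (lo + hi);
            if (weight_below(dim, mid) < 0.5 * total_weight)
                lo = mid;       /* too little weight below: move the cut up   */
            else
                hi = mid;       /* too much weight below: move the cut down   */
        }
        return 0.5 * (lo + hi);
    }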
Representation. We assign each BH-tree node an owner as follows. Since each tree node represents a fixed region of space, the owner is chosen to be the processor whose domain contains a canonical point, say the center of the corresponding region. The data for a tree node, the multipole representation for example, is maintained by the owning processor. Since each processor contains the ORB-tree it is a simple calculation to figure out which processor owns a tree node. The only complication is that the region corresponding to a node can be spanned by the domains of multiple processors. In this case each of the spanning processors computes its contribution to the node; the owner accepts all incoming data and combines the individual contributions. This can be done efficiently when the combination is a simple linear function, as is the case with all tree codes.
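Because every processor holds the ORB tree, the owner of a BH-tree node can be computed locally by descending the ORB tree with the node's canonical point, here taken to be its box center. The structure and field names below are illustrative.

    typedef struct OrbNode {
        int    dim;                     /* splitting axis: 0, 1, or 2            */
        double cut;                     /* position of the separating hyperplane */
        struct OrbNode *left, *right;   /* both NULL at a leaf                   */
        int    proc;                    /* owning processor, valid at leaves     */
    } OrbNode;

    /* Map a point in space to the processor whose ORB domain contains it. */
    int owner_of_point(const OrbNode *orb, const double p[3])
    {
        while (orb->left != NULL)
            orb = (p[orb->dim] < orb->cut) ? orb->left : orb->right;
        return orb->proc;
    }

    /* The owner of a BH-tree node is the processor whose domain contains
       the center of the region of space the node represents.             */
    int owner_of_node(const OrbNode *orb, const double box_center[3])
    {
        return owner_of_point(orb, box_center);
    }

The same lookup serves to route bodies that cross processor boundaries and to find successor particles along a filament, as described above.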
Construction. Each processor first builds a local oct-tree for the bodies which are within its domain. At the end of this stage, the local trees will not, in general, be structurally consistent. For example, Figure 2 shows a set of processor domains that span a node; the node contains four bodies, one per processor domain. Each local tree will contain the node as a leaf, but this is inconsistent with the global tree.
Figure 2: An internal node which appears as a leaf in each local tree.

The next step is to make the local trees be structurally consistent with the global oct-tree. This requires adjusting the levels of all nodes which are split by ORB lines. We omit the details of the level-adjustment procedure in this paper. A similar process was developed independently in [28]; an additional complication in our case is that we build the oct-tree until each leaf contains a number, L, of bodies. Choosing L to be much larger than 1 reduces the size of the BH-tree, but makes level adjustment somewhat tricky.
The level adjustment procedure also makes it easy to update the oct-tree incrementally. We can insert and delete bodies directly on the local trees because we do not explicitly maintain the global tree. After the insertion/deletion within the local trees, level adjustment restores coherence to the implicitly represented distributed tree structure.
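For illustration, the following is a minimal sketch of inserting a body into a local oct-tree whose leaves hold up to L bodies. Level adjustment is a separate step and is omitted here, as in the paper, and the data layout is schematic rather than the paper's actual representation.

    enum { LEAF_CAP = 8 };              /* the parameter L; 8 is only an example */

    typedef struct OctNode {
        double cmin[3], half;           /* box: lower corner and half side length */
        struct OctNode *child[8];       /* all NULL for a leaf                    */
        int    nbodies;                 /* bodies stored at a leaf                */
        int    body[LEAF_CAP];          /* indices into the body arrays           */
    } OctNode;

    OctNode *make_child(OctNode *parent, int octant);   /* allocates the sub-box  */
    const double *body_pos(int b);                      /* position of body b     */

    static int octant_of(const OctNode *n, const double pos[3])
    {
        int o = 0;
        for (int d = 0; d < 3; d++)
            if (pos[d] >= n->cmin[d] + n->half)
                o |= 1 << d;
        return o;
    }

    /* Insert body b, splitting a leaf into eight children once it overflows. */
    void oct_insert(OctNode *n, int b)
    {
        if (n->child[0] == NULL && n->nbodies < LEAF_CAP) {
            n->body[n->nbodies++] = b;                  /* room in this leaf */
            return;
        }
        if (n->child[0] == NULL) {                      /* overflow: split   */
            for (int i = 0; i < 8; i++)
                n->child[i] = make_child(n, i);
            for (int i = 0; i < n->nbodies; i++)
                oct_insert(n->child[octant_of(n, body_pos(n->body[i]))], n->body[i]);
            n->nbodies = 0;
        }
        oct_insert(n->child[octant_of(n, body_pos(b))], b);
    }

Deletion removes a body from its leaf in the same way; as the text notes, level adjustment afterwards restores consistency of the implicitly represented global tree.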
The owner of a tree node sends essential data, aggregated into bulk messages, to the processors where it will be needed rather than requesting it whenever it is needed. Indeed, without the use of the CM-5 vector units we found that these two ideas kept the overhead due to parallelism minimal.
Most of the code was taken directly from our earlier gravitational simulation code. For example, the ORB decomposition, load remapping, communication protocol, and the BH-tree module are exactly the same. The only modifications required for the vortex simulation are the extra phase (step 1.2) to compute the vortex element strength of each particle, and the dynamic update of filament-trees, which contain the indices of the particles in one filament (step 5).
0. build local BH trees
for every time step do:
  1. construct the BH-tree representation
     1.1 adjust node levels
     1.2 compute the strength of each particle
     1.3 compute partial node values on local trees
     1.4 combine partial node values at owners
  2. owners send essential data
  3. calculate induced velocity
  4. update velocities and positions of bodies
  5. update incremental data structures:
     local BH-tree and filament-trees
  6. if the workload is not balanced
     update the ORB incrementally
enddo
Figure 3: Outline of code structure.

Step 0 builds a local BH tree in each processor. After that the local BH trees are maintained incrementally and never rebuilt from scratch. Step 1 constructs an implicit representation of the global BH tree by combining information in the local trees. Step 5 updates the local trees after particles move to their new positions. Step 2 broadcasts essential information and is the most communication-intensive phase. The sender-oriented communication avoids two-way traffic in fetching remote data, and aggregates messages for higher communication throughput. Step 3 computes the induced velocity for each vortex element and is the most time-consuming. We use the Knio-Ghoniem algorithm to solve the Biot-Savart equations, using monopole, dipole and quadrupole terms in the multipole expansion to approximate the effect of a cluster. We use CDPEAC assembly language to implement this computation-intensive routine. One significant optimization was to approximate the exponential smoothing term in Equation (1). When the ratio of the separation |x − χ_p| to the smoothing parameter σ is larger than 20 we ignore the term; for smaller values we compute the first seven terms in the Taylor expansion. This guarantees accuracy to within at least five decimal places. Step 6 redistributes particles when the workload becomes imbalanced. The ORB is incrementally updated so that the workload is balanced among processors.
4 Performance
Our platform is the Connection Machine CM-5 with SPARC vector units [30]. Each processing node has 32 Mbytes of memory and can perform floating point operations at a peak rate of 128 Mflop/s [33]. We use the CMMD communication library [32]. The vector units are programmed in CDPEAC, which provides an interface between C and the DPEAC assembly language for the vector units [31]. The rest of the program is written in C.

The vortex dynamics code along with several diagnostics are operational. Starting with the gravitational code, the effort involved adding the phase to compute vortex element strength and writing the computational CDPEAC kernel. As expected, the latter part took most of the effort; the data structures were adapted and debugged in less than a week. We are currently running extensive tests to check the accuracy of the Barnes-Hut method as a function of the approximation parameter θ. Preliminary results show that with quadrupole expansions we can obtain up to five digits of accuracy when θ is about 0.25 and the time step is 0.002.

    phase                                   time (seconds)
    adjust level                                  1.407
    compute vortex strength and moments           1.435
    construct global tree                         0.317
    collect local essential data                 15.873
    compute the new velocity                    307.221
    update particle position                      0.144
    move particles into right processors          1.720
    remap the workload                            0.141
    total time                                  328.258

Figure 4: Timing breakdown for vortex simulation.

Figure 4 shows the timing breakdown for a 3D vortex simulation on a 128-node CM-5. The input configuration is a 22-layer, 1012-filament vortex ring. Each filament has 1024 particles and the total number of particles is about one million. The radius of the cross section of the vortex ring is 0.07595 and the core radius is 0.06. We use θ = 1/4 in the opening test for the multipole approximation, so that the error in the total induced velocity is within 0.00001. The vector units compute the induced velocity at 54 Mflop/s, and the simulation sustains an overall rate of over 37
Mflop/s per node (averaged over 5 time steps, not including the first), about 29% of the 128 Mflop/s peak. Since the locally essential trees overlap considerably, the total size of the local trees exceeds the size of the global tree. Figure 5 shows the memory overhead (the ratio of total distributed memory used to the global tree size) versus granularity (number of particles per node) on a 32-node CM-5. This ratio is less than 2.4 when the granularity exceeds 10000 particles per processor, even for a very small θ = 0.25.
Figure 5: Ratio of total local (essential) tree size to global tree size vs. granularity (particles per node, ×10^3) on 32 processors, for θ = 0.25, 0.33, and 0.5.
As part of the initial testing, we have examined two cases. The first is a low aspect ratio elliptic ring with a = 1, b = 0.8 and c = 0 (see Eq. 7). This ring presents an oscillating behavior between the two states presented in Fig. 6. The ring was discretized with N_f = 60 filaments with a total number of particles N = 15360. The final state in Fig. 6 is after 1000 time steps with Δt = 6.28×10^-3. The second case tested corresponds to a collapsing vortex ring with a = 1, b = 0.4 and c = 0.5. The initial condition and the collapse state are shown in Fig. 7. This ring has N_f = 220 filaments with a total number of particles N = 112640. The final state corresponds to 2000 time steps with Δt = 10^-3.

Figure 6 (side view): Low aspect ratio elliptic ring oscillates between the states shown. The ring has N = 15360 particles. The last time shown is after 1000 time steps with Δt = 6.28×10^-3.

Figure 7 (top and side views): Collapsing vortex ring. This ring is discretized with N = 112640 particles. The collapse state shown is obtained after 2000 time steps with Δt = 10^-3.

Acknowledgments

We thank Apostolos Gerasoulis of Rutgers for helping set up the collaboration and for helpful discussions. This work is supported in part by ONR Grant N00014-93-1-0944 and ARPA contract DABT-63-93-C-0064, "Hypercomputing & Design." The content of the information herein does not necessarily reflect the position of the Government and official endorsement should not be inferred. The code was developed and tested on the CM-5s at the Naval Research Laboratory, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, and the Minnesota Supercomputing Center.
References
[1] A. S. Almgren, T. Buttke, and P. Colella. A fast adaptive vortex method in three dimensions.
[2] C. Anderson. An implementation of the fast multipole method without multipoles. SIAM Journal on Scientific and Statistical Computing, 13, 1992.
[3] C. Anderson and C. Greengard. The vortex merger problem at infinite Reynolds number. Communications on Pure and Applied Mathematics, XLII:1123-1139, 1989.
[4] R. Anderson. Nearest neighbor trees and N-body simulation. Manuscript, 1994.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324, 1986.
[6] J. T. Beale and A. Majda. Vortex methods. I: Convergence in three dimensions. Mathematics of Computation, 39(159):1-27, 1982.
[7] S. Bhatt, M. Chen, J. Cowie, C. Lin, and P. Liu. Object-oriented support for adaptive methods on parallel machines. In Object Oriented Numerics Conference, 1993.
[8] S. Bhatt and P. Liu. A framework for parallel N-body simulations. Manuscript, 1994.
[9] P. Callahan and S. Kosaraju. A decomposition of multi-dimensional point-sets with applications to k-nearest-neighbors and N-body potential fields. In 24th Annual ACM Symposium on Theory of Computing, 1992.
[10] S. C. Crow. Stability theory for a pair of trailing vortices. AIAA Journal, 8(12):2172-2179, 1970.
[11] S. Douady, Y. Couder, and M. E. Brachet. Direct observation of the intermittency of intense vorticity filaments in turbulence. Physical Review Letters, 67(8):983-986, 1991.
[12] V. M. Fernandez. Vortex intensification and collapse of the Lissajous-elliptic ring: Biot-Savart simulations and visiometrics. Ph.D. Thesis, Rutgers University, New Brunswick, New Jersey, 1994.
[13] V. M. Fernandez, N. J. Zabusky, and V. M. Gryanik. Near-singular collapse and local intensification of a "Lissajous-elliptic" vortex ring: Nonmonotonic behavior and zero-approaching local energy densities. Physics of Fluids A, 6(7):2242-2244, 1994.
[14] C. Greengard. Convergence of the vortex filament method. Mathematics of Computation, 47(176):387-398, 1986.
[15] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73, 1987.
[16] H. S. Husain and F. Hussain. Elliptic jets. Part 3. Dynamics of preferred mode coherent structure. Journal of Fluid Mechanics, 248:315-361, 1993.
[17] F. Hussain and H. S. Husain. Elliptic jets. Part 1. Characteristics of unexcited and excited jets. Journal of Fluid Mechanics, 208:257-320, 1989.
[18] M. Warren, J. Salmon, and G. Winckelmans. Fast parallel tree codes for gravitational and fluid dynamical N-body problems. Intl. J. Supercomputer Applications, 8.2, 1994.
[19] R. T. Johnston and J. P. Sullivan. A flow visualization study of the interaction between a helical vortex and a line vortex. Submitted to Experiments in Fluids, 1993.
[20] S. Kida and M. Takaoka. Vortex reconnection. Annu. Rev. Fluid Mech., 26:169-189, 1994.
[21] O. M. Knio and A. F. Ghoniem. Numerical study of a three-dimensional vortex method. Journal of Computational Physics, 86:75-106, 1990.
[22] A. Leonard. Vortex methods for flow simulation. Journal of Computational Physics, 37:289-335, 1980.
[23] A. Leonard. Computing three-dimensional incompressible flows with vortex elements. Ann. Rev. Fluid Mech., 17:523-559, 1985.
[24] P. Liu, W. Aiello, and S. Bhatt. An atomic model for message passing. In 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1993.
[25] P. Liu and S. Bhatt. Experiences with parallel N-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures, 1994.
[26] A. J. Majda. Vorticity, turbulence and acoustics in fluid flow. SIAM Review, 33(3):349-388, September 1991.
[27] J. Salmon. Parallel Hierarchical N-body Methods. PhD thesis, Caltech, 1990.
[28] J. Singh. Parallel Hierarchical N-body Methods and their Implications for Multiprocessors. PhD thesis, Stanford University, 1993.
[29] R. E. Tarjan. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, 1983.
[30] Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary, 1991.
[31] Thinking Machines Corporation. CDPEAC: Using GCC to program in DPEAC, 1993.
[32] Thinking Machines Corporation. CMMD Reference Manual, 1993.
[33] Thinking Machines Corporation. DPEAC Reference Manual, 1993.
[34] M. Warren and J. Salmon. Astrophysical N-body simulations using hierarchical tree data structures. In Proceedings of Supercomputing, 1992.
[35] M. Warren and J. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of Supercomputing, 1993.
[36] G. S. Winckelmans. Topics in vortex methods for the computation of three- and two-dimensional incompressible unsteady flows. Ph.D. Thesis, California Institute of Technology, Pasadena, California, 1989.
[37] F. Zhao. An O(N) algorithm for three-dimensional N-body simulation. Technical report, MIT, 1987.