PMEMD 3.00 (PARTICLE MESH EWALD MOLECULAR DYNAMICS) RELEASE NOTES

PMEMD is a new version of Sander written with the major goal of improving performance in Particle Mesh Ewald molecular dynamics simulations and minimizations. The code has been completely rewritten in Fortran 90 and is capable of running in either an Amber 6 or an Amber 7 mode. Functionality is more complete in Amber 6 mode; Amber 7 mode is designed mostly to do the same sorts of things that Amber 6 does, but with output comparable to Amber 7 Sander. The calculations done in PMEMD are intended to exactly replicate either Sander 6 or Sander 7 calculations within the limits of roundoff error. The calculations are simply done more rapidly and in less memory, and runs may be made efficiently on significantly larger numbers of processors.

A large number of benchmarks are presented at the end of these release notes. In these benchmarks, PMEMD ran on average slightly more than 2 times faster than Sander 6 and slightly less than 2 times faster than Sander 7. While this level of improvement is very significant in runs that can last for days or weeks, these figures do not tell the whole story. Because of the improved scalability, if you have access to a computer system with more than 16 processors, you can obtain better throughput than the above figures would indicate while at the same time utilizing the computer resource more efficiently.

As an example, consider a 90906 atom solvated protein constant pressure simulation run on a Linux Athlon cluster (Myrinet switch), an IBM SP3, or an IBM SP4. This represents the type of problem our group typically runs, and also the type of problem that has to be run for the longest duration in our simulations. If we aim for about 50% parallelization efficiency in our runs, we get the following improvements in throughput (i.e., psec/day simulated) AND computer utilization efficiency (i.e., reduction in the number of processors x time required to run a given problem):

Comparison to Sander 7:

System                        PMEMD Speedup   PMEMD CPU Efficiency   no. procs
                                              Increase               (PMEMD / Sander)
Linux 1.4 GHz Athlon Cluster  3.35x           1.68x                  32 / 16
  (Myrinet Switch)
IBM SP3                       5.97x           1.19x                  80 / 16
IBM SP4                       3.69x           1.47x                  40 / 16

Comparison to Sander 6:

System                        PMEMD Speedup   PMEMD CPU Efficiency   no. procs
                                              Increase               (PMEMD / Sander)
Linux 1.4 GHz Athlon Cluster  5.86x           1.47x                  32 / 8
  (Myrinet Switch)
IBM SP3                       4.50x           1.80x                  80 / 32
IBM SP4                       3.99x           1.59x                  40 / 16

PMEMD also requires about half the memory required by Sander 7, and memory configuration is totally automatic, with no requirement to specify anything in a sizes.h file or in the mdin file for the run. Running a 90906 atom constant pressure problem on a uniprocessor Linux Athlon PC requires about 79 MB in PMEMD and about 163 MB in Sander 7. Running a 329844 atom constant pressure problem on the same machine requires about 281 MB in PMEMD and about 576 MB in Sander 7; this is the largest simulation PMEMD has run without hitting a problem in parameter input. Note that this largest simulation can be run on a PC with less than 512 MB of memory. A pairlist compression algorithm, along with extensive use of temporary stack storage in Fortran 90, is instrumental in reducing memory utilization. When running on multiple processors there is a further reduction in memory requirements, so PMEMD is capable of running very large problems on hardware with fairly modest amounts of memory.
PMEMD accepts both Amber 6 and Amber 7 Sander input files (prmtop, inpcrd, restrt, mdin), with the exception of the "new" Amber 7 prmtop format. If you generate a "new" format prmtop file using Amber 7 leap, you may convert it to the "old" format using the Amber 7 utility "new2oldparm".

This is the first release of PMEMD being made available to the general Amber community. The previous two releases were used by folks doing modeling in Prof. Lee Pedersen's lab at UNC-Chapel Hill.

PMEMD MODES

PMEMD runs in three different modes:

1) Amber 6 Compatibility, CIT mode. This is the default mode. CIT stands for "Coordinate Index Table", a data structure that is key to a number of the performance improvements. This is the "fast" mode of PMEMD, but support for some less-used Sander 6 options has been dropped to streamline the code and reduce the testing impact.

2) Amber 6 Compatibility, NoCIT mode. NoCIT mode uses slower, less scalable algorithms that are closer in structure to the original Sander code, but it supports almost all of Sander 6's functionality. This was the first version of PMEMD written. It will be selected automatically if you specify an option that the CIT code paths do not support. It is always faster than Sander 6, and generally faster than Sander 7, though Sander 7 running on the SGI is faster than PMEMD NoCIT for some levels of multiprocessing (Sander 7 and PMEMD CIT use RC FFTs; PMEMD NoCIT does not). This mode may be forced by setting use_cit = 0 in the &ewald namelist in mdin. Generally, the only reason to do this is for testing.

3) Amber 7 Compatibility, CIT mode. This mode is an enhancement of CIT mode, designed to allow the user to get Sander 7-like results in a PMEMD run. Basically, Sander 7-style output formatting is produced, kinetic energies are calculated as in Sander 7, and the default options are those used by Sander 7. The NTT parameter (temperature regulation) is also interpreted in the same manner as in Sander 7. In this mode the mdin input may look like either Sander 6 or Sander 7 input, and Sander 7 dynamic memory options are simply ignored for user convenience. This mode is selected by setting amber7_compat = 1 in the &cntrl namelist in mdin. If you select Amber 7 compatibility mode and also select an option not supported by the CIT code, PMEMD will exit with an error message, as there is no Amber 7 compatibility support in the NoCIT code paths.

If you have two molecules that have been crosslinked in your simulation, Sander 7 produces slightly different results for the virial than Sander 6 or PMEMD does. This difference is caused by an undocumented change in Sander 7 in how the atomic and molecular virials are computed in this situation. Neither the Sander 6 nor the Sander 7 treatment is superior, and we have retained the Sander 6 treatment in PMEMD.

FUNCTIONALITY DETAILS

As mentioned above, PMEMD is not a complete implementation of Sander 6 or Sander 7. Instead, it is intended to be a fast implementation of the functionality most likely to be used by someone doing simulations on large solvated systems. The following functionality is missing entirely:

1) igb = 1 - Generalized Born simulations are not supported.

2) igb = 2 - Vacuum simulations are not supported.

3) igb = 3 - Vacuum simulations with a distance-dependent dielectric are not supported.

4) nmropt = 2 - A variety of NMR-specific options such as NOESY restraints, chemical shift restraints, pseudocontact restraints, and direct dipolar coupling restraints are not supported.
5) ipol = 2 - Polarization calculations using 3-body interactions are not supported. This code was experimental and broken anyway.

6) Anything supported in the &debugf namelist is not supported. This functionality is nice for developers but not very useful for production work. Retaining it would have required a lot of work, as it is heavily impacted by the algorithms used in the rest of the code.

7) Anything that is in Sander 7 but not in Sander 6 is not supported, with the exception of Sander 7-like mdout output, the Sander 7 handling of kinetic energies (step integration), and the Sander 7 interpretation of NTT.

The following functionality is only supported in Amber 6 NoCIT mode:

1) ipol = 1 - Polarization calculations are not supported under CIT. Such calculations tend to be significantly slower and are significantly more complex, so to cut the testing impact and the overall impact on performance, we left them out of the CIT code paths.

2) ew_type != 0 - Under CIT, only Particle Mesh Ewald calculations are supported. ew_type = 1 (regular Ewald calculations) may be used in the NoCIT code paths.

3) eedmeth != 1 - Under CIT, only a cubic spline switch function for the direct sum Coulomb interaction is supported. This is the default, and the most widely used, setting for eedmeth.

4) frc_int = 1 - Force interpolation in PME is not supported under CIT. This would give better conservation of momentum, but it is typically not used due to the high additional FFT overhead.

5) order != 4 - Under CIT, only the default order of 4 for the PME B-spline interpolation is supported.

6) cutoff + skinnb < 6.d0 - Under CIT, an assumption is made that all nonbonded force adjustments will be found within a distance of cutoff + skinnb. To ensure that this is a safe assumption, a minimum length check on cutoff + skinnb is made.

The functionality in items 2-5 above could be restored relatively easily, but there would be a slight performance impact and a more significant testing impact.

I would strongly suggest that new PMEMD users simply take an existing Sander 6 or Sander 7 mdin file and attempt a short 10-30 step run. The output will tell you which mode you have been placed in, or whether PMEMD simply cannot handle the particular problem at hand. If you want Sander 7 output, simply add amber7_compat = 1 to &cntrl. That is really all there is to moving from Sander 6 or 7 to PMEMD.

NEW MDIN VARIABLES

Very few new variables have been introduced into the mdin namelists in PMEMD, and only one of them (amber7_compat) ever really needs to be used. The new variables are:

amber7_compat = 1 - In &cntrl, this turns on Amber 7 compatibility mode.

use_cit = 0 - In &ewald, this turns off the fast CIT code paths and forces a NoCIT run. This is only intended to be used for testing.

mdinfo_flush_interval - In &cntrl, this variable controls the minimum time, in integer seconds, between "flushes" of the mdinfo file. PMEMD does not use file flush() calls at all, because flush functionality is broken in some versions of the SGI compilers/libraries and in some versions of the Intel Fortran Compiler (it is actually worse than that - SGI changed the flush() interface, and depending on your compiler/library version, a call to flush() may corrupt the stack). Instead, PMEMD does an open/close cycle on mdinfo at a default minimum interval of 60 seconds. This interval can be changed with this variable if desired.
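For illustration only, here is a minimal sketch of how these new variables might appear in an otherwise ordinary mdin file. Only the amber7_compat, mdinfo_flush_interval, and use_cit lines are PMEMD additions; every other value shown is a generic Sander-style placeholder rather than a recommendation, and the &ewald namelist is included here only to show where use_cit would go (use_cit = 1 is the default, so that line is normally unnecessary):

   short constant pressure MD run, Sander 7-style output from PMEMD
    &cntrl
      imin = 0, nstlim = 500, dt = 0.0015,
      ntb = 2, ntp = 1, ntt = 1, temp0 = 300.0,
      cut = 8.0, ntpr = 50, ntwx = 500,
      amber7_compat = 1,
      mdinfo_flush_interval = 30,
    &end
    &ewald
      use_cit = 1,
    &end

With these settings, PMEMD would produce Sander 7-style output and refresh mdinfo no more often than every 30 seconds; setting use_cit = 0 instead would force the slower NoCIT code paths, which is normally only done for testing.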
Note that mdinfo under PMEMD simply serves as a heartbeat for the simulation, updated at mdinfo_flush_interval, and mdinfo will probably not contain the data for the last step at the end of a run. If mdinfo_flush_interval is set to 0, then both mdinfo and mdin will be reopened and closed at each step. If you are having problems with Sander 7 under SGI, the use of flush() may be the source of the difficulty.

SLIGHTLY CHANGED FUNCTIONALITY

An I/O optimization and a bugfix have been introduced in PMEMD. First, the default value of NTWR (the frequency of writing the restart file) has been changed so that the default minimum is 500 steps, and this value is increased incrementally for multiprocessor runs. In general, frequent writes of restrt, especially in runs with a high processor count, are wasteful. Second, if the mden file is being written, it is always written as formatted output, regardless of the value of IOUTFM. Sander 6 was broken in that it would die attempting to do a binary write of mden, and there is little or no reason to ever want to write mden as a binary file.

INSTALLATION

Installation is very similar to what one would do to install Sander 6 or Sander 7, though we use a separate source directory (src.pmemd) to isolate the machine file changes associated with using Fortran 90/95. The recommended procedure is as follows (a consolidated example appears after the caveats below):

1) Move src.pmemd.rel3.00.tar.gz to $AMBERHOME (your amber6 or amber7 product tree).
2) gunzip src.pmemd.rel3.00.tar.gz
3) tar xvf src.pmemd.rel3.00.tar
   This provides you with an expanded src.pmemd directory under $AMBERHOME.
4) cd src.pmemd
5) ln -s Machines/"Appropriate Machinefile" MACHINE
6) make install

You should end up with an executable named "pmemd" in $AMBERHOME/exe.

The caveats:

1) You must have a Fortran 90/95 compiler.

2) Currently supported systems include Linux Athlon, Linux Pentium, IBM SPx, and most SGIs. Other systems should not be that hard to support.

3) If you have a Linux system, I STRONGLY recommend using the Intel Fortran Compiler (ifc), Release 7.1 or later. The performance of this compiler is superior to anything else I have seen in terms of the optimizations possible. If you are associated with a nonprofit organization, this compiler may be obtained for free from the Intel web site (www.intel.com), though you may have to poke around to find the evaluation compiler and note the fine print about the free compiler being unsupported. Another good compiler is Lahey (Fujitsu) F95, though it is not as fast as Intel. I used the Absoft and Portland Group compilers a year or so ago; the performance was not bad, but the compilers were very buggy and apparently had problems with floating point register control, as sporadic results sometimes occurred. I will therefore not support PMEMD built with these last two compilers (the vendor reps were also not very helpful).

4) If you have a Linux cluster, it is easiest to build PMEMD with the Intel Fortran Compiler using MPICH version 1.2.5 or later (or a product based on it, such as MPICH-GM 12510 for Myrinet). I have been compiling with IFC 5 on MPICH-GM since about version 1.2.0, but I had to hack the configure scripts to get it to work. Note that as of MPICH-GM version 12510, changes have been made that do not help PMEMD performance but do fix a potential memory corruption bug that occurs if you compile with ifc 7.0 (i.e., you should not use ifc 7.0 or later with versions of MPICH-GM earlier than 12510).
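To make the procedure above concrete, the whole sequence for a typical build and a first test run looks something like the following sketch. The machine file is simply whichever file under src.pmemd/Machines matches your system and compiler; the run lines assume that pmemd takes the familiar sander-style command line arguments and that mpirun is the launcher provided by your MPICH installation. Adjust file names and processor counts to your own setup:

   cd $AMBERHOME
   gunzip src.pmemd.rel3.00.tar.gz
   tar xvf src.pmemd.rel3.00.tar
   cd src.pmemd
   ln -s Machines/"Appropriate Machinefile" MACHINE
   make install
   $AMBERHOME/exe/pmemd -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt
                                        (short 10-30 step serial test run)
   mpirun -np 16 $AMBERHOME/exe/pmemd -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt
                                        (parallel run against an MPICH-based build)

As noted earlier, the output of a short run like this will tell you which mode (CIT, NoCIT, or Amber 7 compatibility) PMEMD actually selected for your input.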
With MPICH version 1.2.5 or something of the same ilk, the following MPICH build process should work for IFC (assuming a C shell):

a) cat mpich-1.2.5.tar.gz | gunzip | tar xvf -
b) cd mpich-1.2.5
c) setenv FC ifc
d) setenv F90 ifc
e) setenv FLIB_LIST -lPEPCF90
f) setenv FFLAGS "-auto -tpp6 -mp1"      (works for Athlon)
g) setenv RSHCOMMAND /usr/bin/rsh        (depends on your installation)
h) ./configure --disable-devdebug \
     -prefix=/opt/pkg/mpich-1.2.5-ifc -optf77=-O2 -optcc=-O2 \
     --with-device=ch_p4 \
     --disable-c++

This is just a suggestion, and may need modification to support different devices, etc. Note, however, that in order to be able to link to MPICH, you will need to have built it with the same compiler you are building PMEMD with (not always strictly necessary, but it will always work). When you build PMEMD with one of the MPICH machine files, you must set the MPICH_HOME variable in the script to point to the appropriate place (/opt/pkg/mpich-1.2.5-ifc in the example above). If you have a systems staff that does all this for you, they will have to provide you with a built MPICH that matches your compiler.

I have also worked briefly with MPI-LAM, and a machine file for MPI-LAM on the Athlon processor is included in the distribution. In brief tests on a highly nonoptimal 100 Mbps setup, I found MPI-LAM to be about 10% slower than MPICH. Scott Brozell sees roughly equal performance for MPICH and MPI-LAM in most of his testing. You should use the "SLOW_NONBLOCKING_MPI" conditional described below with MPI-LAM.

5) There is a new conditional defined in some of the machine files - "SLOW_NONBLOCKING_MPI". This conditional controls the selection of MPI calls that put less demand on the cluster interconnect. Slower switches, such as Gigabit Ethernet, can be driven into a state of degraded performance by buffer overruns caused by lots of nonblocking I/O (at least that is my guess as to what is going on). For such machines, I have introduced code paths that are inherently slower due to less overlap of I/O and computation, but that scale better because they do not clobber the interconnect switch. On the IBM SP3, the switch appears to be really good, and performance is significantly enhanced by extensive use of nonblocking MPI I/O. So if you develop a new machine file or get upgraded hardware, you may want to try running versions compiled with and without this conditional defined to see which works best.

ACKNOWLEDGEMENTS

This code was developed by Dr. Robert Duke in Prof. Lee Pedersen's lab at UNC-Chapel Hill, starting from the version of Sander in Amber 6. I would like to thank Prof. Pedersen for his support in the development of this code, and would also like to acknowledge funding support from NIH grant HL-06350 (PPG) and NSF grant 2001-0759-02 (ITR/AP). I would also like to acknowledge Drs. Lalith Perera and Divi Venkateswarlu in the Pedersen lab for helpful conversations and a willingness to actually use early releases of PMEMD, as well as Dr. Tom Darden of NIEHS for helpful conversations. This work required the availability of large piles of processors. The North Carolina Supercomputer Center was instrumental in making the work possible by making a 720 processor IBM SP3 available for my research efforts. The folks at the Edinburgh Parallel Computing Centre have also been most helpful and generous in providing me access to their HPCx facility (a total of 1280 IBM SP4 processors) for performance testing.
Thanks also to Vance Shaffer of IBM, who put me in touch with the EPCC folks and also did some benchmarking on other IBM systems. Finally, I would like to thank the Amber developers for being helpful and willing to let me dink around rather drastically with their code base.

When citing PMEMD (Particle Mesh Ewald Molecular Dynamics) in the literature, please use both the Amber Version 7 citation given in the Amber 7 manual and the following citation:

Robert E. Duke and Lee G. Pedersen (2003) PMEMD 3, University of North Carolina-Chapel Hill

BENCHMARKING RESULTS

Test runs of PMEMD, Sander 6, and Sander 7 have been done on a variety of uniprocessor and multiprocessor machines using four protein simulation problems. The first is a simulation of a Factor IX protein with Ca++ ions and a total of 90906 atoms (provided courtesy of Dr. Lalith Perera). The second is the JAC benchmark from Amber 7, a 23558 atom simulation, modified in that default values are used for cut and skinnb in mdin. The third is a simulation of a Factor IX protein fragment, also involving Ca++ ions, but with only 20943 atoms. The first two systems were used for constant volume and constant pressure simulation benchmarks; the third system was used for a minimization benchmark. A fourth system, a simulation of hemoglobin in a truncated octahedral box with a total of 44247 atoms, has recently been added, and this benchmark was run on a subset of the available systems.

Most parallel processor runs were done for either 500 or 1000 steps; all uniprocessor runs were done for 100 steps (except hemoglobin, which was done for only 40 steps). Benchmarks done on UNC's IBM Blade cluster (Gigabit Ethernet) were done for 250 steps, which was adequate for accurate timing because I had exclusive access to the machine.

Originally, runs were made for PMEMD in CIT Sander 6 compatibility mode, PMEMD in CIT Sander 7 compatibility mode, PMEMD in NoCIT mode, Sander 6, and Sander 7. After surveying the results, it was clear that most of the time PMEMD/CIT in either Sander 6 or Sander 7 compatibility mode showed the same performance (typically within a percent or two). The decision was therefore made to average the values for these two modes and present the result as a single "PMEMD" value. Thus, there are values for PMEMD, PMEMD/NoCIT, Sander 6, and Sander 7. All molecular dynamics run data is given in psec simulated per day, and "speedup" is the ratio of PMEMD to Sander 6 or 7 psec/day. The psec/day values for PMEMD/NoCIT were tabulated, but a speedup was not calculated; if you need to run PMEMD/NoCIT, you can quickly determine how it will perform by referring directly to the psec/day values. The time step for the 90906 atom problem is 0.0015 psec; the time step for all other problems is the default of 0.001 psec.

NOTE ON TRUNCATED OCTAHEDRA: PMEMD supports simulations in truncated octahedral unit cells. For nearly spherical solutes, this is the best choice for performance because adequate solute image separation can be obtained using less solvent. For solutes with significant asymmetry, better performance may be obtained in multiprocessor runs by using a rectangular parallelepiped (the default orthogonal unit cell) and not adding extra solvent to make the unit cell isometric (i.e., don't use a cube). This is true because PMEMD optimizes the orientation of nonisometric periodic boxes to achieve better workload locality in multiprocessor runs. Truncated octahedral unit cells always have to be isometric, so this optimization is not possible in that case.
We have not done rigorous performance comparisons, but as a rough guess I would consider not using a truncated octahedral unit cell if the solute is 40% or more longer in one dimension than in the others.

Special Notations:

R - An "R" following a psec/day entry indicates that the configuration is "recommended". Configurations are marked R if the parallel efficiency of the configuration is 50% or greater when compared to the smallest sized parallel processor run used (usually 2 processors, but 8 processors on the SP4 due to accounting and architecture issues). In other words, the parallel efficiency here is the ratio of throughput at N processors to throughput at the base processor count, divided by N over the base processor count. For example, in the Myrinet cluster tables below, PMEMD delivers 64 psec/day on 2 processors and 556 psec/day on 32 processors for the 90906 atom constant volume run, so the efficiency is (556/64)/(32/2), or about 54%, and the 32 processor run is marked R. Basically, if you stick within recommended configurations, you are not wasting a lot of compute power on a configuration with bad parallel scaling.

NP2 - This indicates that a Sander 7 value could not be obtained for a given configuration because the number of processors used is "not a power of two". Sander 7 places this restriction on parallel runs, presumably as an efficiency measure, but it can prevent you from using the number of processors that would give you the best throughput at an acceptable CPU efficiency.

ND - Not determined. For some configurations it is clear that performance is not increasing significantly, and in fact may be decreasing. When this was clearly the case, we stopped running tests at higher processor counts (generally done for Sander 6). Also, in some cases it was felt that additional data on Sander 6 or PMEMD NoCIT was not that useful.

*******************************************************************************
LINUX UNIPROCESSOR PERFORMANCE, MICRONPC 1.6 GHZ ATHLON:
*******************************************************************************

The Sander 6 build uses the default (g77) machine file. The Sander 7 build uses the ifc machine file (ifc was not available in the Amber 6 timeframe). Both the PMEMD and Sander 7 builds were done with ifc 5.0; ifc 7.0 has recently become available, and Sander 7 compiled with ifc 7.0 is about 10% faster than indicated here. There is not much difference in the performance of PMEMD compiled with ifc 7.0 versus ifc 5.0.
90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         45.4        33.0          22.2        27.9        2.05x           1.63x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         45.0        32.9          22.0        27.8        2.04x           1.62x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         122         93.9          62.6        74.5        1.94x           1.63x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         121         93.9          62.2        74.5        1.94x           1.62x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
1         100         77.9          52.6        66.7        1.90x           1.50x

44247 Atoms, Constant Volume Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         54.9        42.1          29.3        36.8        1.87x           1.49x

44247 Atoms, Constant Pressure Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         54.9        42.1          29.0        36.8        1.89x           1.49x

*******************************************************************************
LINUX UNIPROCESSOR PERFORMANCE, IBM BLADE XEON 2.4 GHZ PROCESSOR
*******************************************************************************

I did some vectorization work on an IBM Blade Xeon 2.4 GHz machine, but did not do competitive benchmarks against Sander 6 or Sander 7. I did, however, get a chance to run it against the Spring 2003 prerelease version of Sander 8, which was interesting because that version also takes advantage of vectorization on the Pentium IV. The Pentium IV vectorized code for PMEMD should be compiled using one of the machine files with a name ending in "p4"; there is conditional code that is selected by these machine files, as well as specific compilation options. I present below data for unvectorized PMEMD, vectorized PMEMD (what you get with the P4 machine files), and Sander 8 (also vectorized).

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD          PMEMD          Sander 8       PMEMD/Sander 8
          Unvectorized   Vectorized     Vectorized     Speedup
          psec/day       psec/day       psec/day
1         47.6           54.5           39.2           1.39x

*******************************************************************************
LINUX CLUSTER PERFORMANCE, 1.4 GHZ ATHLON, MYRINET SWITCH (UNC-CH):
*******************************************************************************

The Sander 6 and Sander 7 builds used here are, I believe, "tweaked" g77 builds done by the computer systems folks at UNC-CH. I have compared the performance of the Sander 7 build used here to my own ifc build of Sander 7 (using the default ifc machine file), and the Sander 7 build used here is actually slightly FASTER than the ifc build. I did not use my own build due to some MPICH-GM problems I was having with the Sander 7 build.
The versions of Myrinet software used here are MPICH-GM-1.2.1..7b and GM-1.5.1, and PMEMD is actually slightly faster when built with the older Myrinet software and ifc 5 than with the current release and ifc 7.1 (the joys of "upgrades" that are "downgrades").

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         64 R        46 R          36 R        41 R        1.81x           1.55x
4         121 R       81 R          63 R        74 R        1.91x           1.64x
8         216 R       123 R         98 R        124 R       2.19x           1.74x
16        359 R       187 R         150 R       189 R       2.39x           1.89x
24        466 R       218           173         NP2         2.70x           NP2
32        556 R       238           182         251         3.06x           2.21x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         62 R        46 R          36 R        40 R        1.74x           1.56x
4         116 R       78 R          60 R        72 R        1.92x           1.59x
8         212 R       125 R         96 R        112 R       2.22x           1.89x
16        363 R       174           134         168 R       2.70x           2.16x
24        461 R       186           152         NP2         3.03x           NP2
32        563 R       196           161         208         3.50x           2.71x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         173 R       130 R         103 R       116 R       1.68x           1.49x
4         320 R       222 R         174 R       211 R       1.84x           1.52x
8         554 R       346 R         282 R       335 R       1.96x           1.65x
16        882 R       520 R         389         514 R       2.27x           1.71x
24        1080 R      576           450         NP2         2.40x           NP2
32        1234        626           465         635         2.66x           1.94x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         172 R       127 R         101 R       113 R       1.70x           1.52x
4         318 R       192 R         172 R       186 R       1.85x           1.71x
8         561 R       351 R         262 R       330 R       2.14x           1.70x
16        891 R       475           360         460 R       2.47x           1.94x
24        1054 R      527           393         NP2         2.68x           NP2
32        1122        533           396         508         2.83x           2.21x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
2         138 R       102 R         83 R        95 R        1.67x           1.45x
4         239 R       153 R         143 R       153 R       1.67x           1.56x
8         380 R       238 R         203 R       227 R       1.87x           1.67x
16        526         326           278         303         1.89x           1.74x
24        594         390           337         NP2         1.76x           NP2
32        612         411           326         353         1.88x           1.73x

*******************************************************************************
LINUX CLUSTER PERFORMANCE, IBM BLADE XEON 2.4 GHZ, GIGABIT ETHERNET
*******************************************************************************

We got amazingly good performance from an IBM Blade Pentium IV cluster with a Gigabit Ethernet interconnect. The MPI version was MPICH 1.2.5. Both PMEMD and Sander 7 were built using the Intel Fortran Compiler.
90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         86 R        58 R          47 R        59 R        1.83x           1.46x
4         154 R       97 R          77 R        104 R       2.00x           1.48x
6         216 R       ND            ND          NP2         ND              NP2
8         275 R       155 R         124 R       175 R       2.22x           1.57x
10        306 R       ND            ND          NP2         ND              NP2
12        338 R       174 R         141         NP2         2.40x           NP2
14        356 R       ND            ND          NP2         ND              NP2
16        379 R       196           152         215         2.49x           1.76x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         85 R        56 R          46 R        59 R        1.85x           1.45x
4         153 R       88 R          71 R        99 R        2.16x           1.55x
6         215 R       ND            ND          NP2         ND              NP2
8         272 R       139 R         114 R       154 R       2.39x           1.77x
10        297 R       ND            ND          NP2         ND              NP2
12        326 R       146           122         NP2         2.67x           NP2
14        338 R       ND            ND          NP2         ND              NP2
16        379 R       160           127         183         2.98x           2.07x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         234 R       160 R         130 R       169 R       1.80x           1.38x
4         408 R       270 R         208 R       292 R       1.96x           1.40x
6         584 R       ND            ND          NP2         ND              NP2
8         771 R       441 R         338 R       514 R       2.28x           1.50x
10        720 R       ND            ND          NP2         ND              NP2
12        864 R       470           360         NP2         2.40x           NP2
14        982 R       ND            ND          NP2         ND              NP2
16        1005 R      441           379         568         2.65x           1.77x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         234 R       153 R         129 R       164 R       1.81x           1.42x
4         408 R       254 R         204 R       277 R       2.00x           1.47x
6         568 R       ND            ND          NP2         ND              NP2
8         771 R       393 R         313 R       450 R       2.46x           1.71x
10        745 R       ND            ND          NP2         ND              NP2
12        847 R       386           313         NP2         2.71x           NP2
14        939 R       ND            ND          NP2         ND              NP2
16        900         386           327         480         2.75x           1.87x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
2         179 R       123 R         99 R        136 R       1.81x           1.32x
4         280 R       188 R         152 R       205 R       1.84x           1.37x
8         441 R       259 R         203 R       288 R       2.17x           1.53x
12        370         272           205         NP2         1.80x           NP2
16        286         300           231         319         1.24x           0.90x

NOTE - Compromises were made in the PMEMD minimization code that result in poor scaling on a slow interconnect. Nonetheless, peak PMEMD performance at 8 processors is substantially higher than peak Sander 6 or 7 performance at 16 processors.
*******************************************************************************
IBM SP3 CLUSTER PERFORMANCE (NORTH CAROLINA SUPERCOMPUTING CENTER):
*******************************************************************************

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         41 R        32 R          23 R        32 R        1.79x           1.28x
4         79 R        59 R          42 R        58 R        1.87x           1.36x
8         159 R       116 R         84 R        106 R       1.88x           1.50x
16        297 R       202 R         143 R       175 R       2.07x           1.70x
32        559 R       345 R         224 R       259 R       2.50x           2.16x
48        738 R       414 R         262         NP2         2.82x           NP2
64        900 R       444           284         322         3.17x           2.79x
80        1005 R      ND            311         NP2         3.23x           NP2
96        1098 R      ND            313         NP2         3.50x           NP2
112       1189 R      ND            ND          NP2         ND              NP2
128       1168        ND            ND          334         ND              3.50x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         40 R        30 R          23 R        31 R        1.72x           1.29x
4         80 R        60 R          44 R        57 R        1.80x           1.39x
8         158 R       99 R          82 R        100 R       1.92x           1.57x
16        295 R       178 R         131 R       154 R       2.25x           1.91x
32        549 R       288 R         204 R       194         2.68x           2.82x
48        704 R       322           232         NP2         3.04x           NP2
64        842 R       299           254         276         3.32x           3.04x
80        919 R       ND            264         NP2         3.48x           NP2
96        946         ND            261         NP2         3.62x           NP2
112       989         ND            ND          NP2         ND              NP2
128       953         ND            ND          271         ND              3.52x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         110 R       91 R          70 R        83 R        1.58x           1.33x
4         218 R       177 R         135 R       161 R       1.62x           1.35x
8         417 R       325 R         235 R       282 R       1.78x           1.48x
16        758 R       547 R         379 R       450 R       2.00x           1.68x
32        1350 R      765 R         584 R       686 R       2.31x           1.97x
48        1662 R      971           613         NP2         2.71x           NP2
64        2009 R      1167          697         909         2.88x           2.21x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         106 R       89 R          69 R        84 R        1.53x           1.26x
4         215 R       170 R         131 R       156 R       1.65x           1.38x
8         409 R       302 R         229 R       265 R       1.79x           1.54x
16        726 R       455 R         330 R       382 R       2.20x           1.90x
32        1122 R      745 R         524         592         2.14x           1.90x
48        1309 R      765           533         NP2         2.45x           NP2
64        1350        873           584         726         2.31x           1.86x

44247 Atoms, Constant Volume Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         49.6 R      40.8 R        32.6 R      40.9 R      1.52x           1.21x
4         98.3 R      78.7 R        63.1 R      76.7 R      1.56x           1.28x
8         183 R       138 R         112 R       131 R       1.63x           1.40x
16        354 R       236 R         198 R       227 R       1.79x           1.56x
32        554 R       351 R         265 R       304         2.09x           1.82x
64        793 R       428           251         302         3.16x           2.63x

44247 Atoms, Constant Pressure Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
2         48.5 R      40.2 R        30.9 R      39.6 R      1.57x           1.22x
4         97.3 R      76.2 R        55.0 R      70.4 R      1.77x           1.38x
8         187 R       131 R         109 R       126 R       1.72x           1.48x
16        353 R       206 R         183 R       205 R       1.93x           1.72x
32        557 R       282           214         245         2.60x           2.27x
64        732         313           213         253         3.44x           2.89x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
2         89 R        68            55 R        37 R        1.62x           2.41x
4         170 R       126           102 R       67 R        1.68x           2.56x
8         303 R       205           160 R       103 R       1.89x           2.94x
16        462 R       316           246 R       164 R       1.88x           2.82x
32        714 R       476           370         ND          1.93x           ND
48        784         550           417         NP2         1.88x           NP2
64        833         571           403         ND          2.07x           ND
*******************************************************************************
IBM SP4 CLUSTER PERFORMANCE (EDINBURGH PARALLEL COMPUTING CENTRE):
*******************************************************************************

The SP4 processors are grouped into "Multi-Chip Modules" (MCMs). An MCM is composed of a group of four dual-processor chips with some shared cache. Thus, processors are typically allocated in groups of eight, and the user is billed for CPUs in groups of eight. However, better performance can sometimes be obtained by not using all the CPUs on an MCM, presumably due to bottlenecks in the shared components. We therefore show results below in which 8, 6, or 4 processors per MCM are in use. The total number of processors indicated is the number allocated to you, but not necessarily the number in use; an entry such as "64 (6x 8)" means 64 processors allocated, with 6 processors in use on each of 8 MCMs.

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs        PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
              psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
8 (8x 1)      351 R       253 R         189 R       243 R       1.86x           1.44x
16 (8x 2)     601 R       366 R         272 R       335 R       2.21x           1.80x
24 (8x 3)     771 R       399 R         297 R       NP2         2.60x           NP2
32 (8x 4)     919 R       444           323         387         2.84x           2.38x
40 (8x 5)     1016 R      476           356         NP2         2.85x           NP2
64 (8x 8)     1147        512           377         399         3.04x           2.88x
64 (6x 8)     1200        576           426         NP2         2.81x           NP2
72 (6x 9)     1303        589           419         NP2         3.10x           NP2
80 (6x10)     1316        600           415         NP2         3.17x           NP2
128 (4x16)    1610        ND            ND          531         ND              3.03x
256 (4x32)    1775        ND            ND          480         ND              3.70x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs        PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
              psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
8 (8x 1)      353 R       243 R         182 R       233 R       1.94x           1.51x
16 (8x 2)     594 R       337 R         258 R       279 R       2.31x           2.13x
24 (8x 3)     774 R       360           268         NP2         2.89x           NP2
32 (8x 4)     929 R       399           297         306         3.13x           3.04x
40 (8x 5)     1029 R      425           320         NP2         3.21x           NP2
64 (8x 8)     1127        448           339         318         3.32x           3.55x
64 (6x 8)     1206        504           390         NP2         3.09x           NP2
72 (6x 9)     1309        500           379         NP2         3.45x           NP2
80 (6x10)     1316        514           379         NP2         3.47x           NP2
128 (4x16)    1590        ND            ND          414         ND              3.84x
256 (4x32)    1751        ND            ND          370         ND              4.73x

*******************************************************************************
SGI O2 UNIPROCESSOR PERFORMANCE, R12000 300 MHZ:
*******************************************************************************

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         10.8        8.8           5.0         5.8         2.16x           1.88x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         10.5        8.7           5.0         5.7         2.10x           1.84x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         30.2        25.9          16.3        19.1        1.86x           1.58x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         29.9        25.3          16.2        18.9        1.84x           1.58x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
1         23.0        21.5          16.9        18.3        1.36x           1.25x

44247 Atoms, Constant Volume Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         13.0        11.7          7.9         9.2         1.65x           1.41x

44247 Atoms, Constant Pressure Molecular Dynamics (Hemoglobin Truncated Octahedron Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
1         12.8        11.5          7.9         9.0         1.62x           1.42x

*******************************************************************************
SGI ORIGIN MULTIPROCESSOR PERFORMANCE, R14000 500 MHZ (UNC-CH):
*******************************************************************************

Sander 7 was apparently well optimized for the SGI Origin. It is practically impossible for me to get access to more than eight SGI Origin processors, and I have not been able to do benchmarking in the interesting range of 16-64 processors. Prof. David Case has generously done some benchmarking at higher processor counts, however, and indicates that PMEMD scaling is good there. The benchmark data I received from Prof. Case shows that on 32 SGI Origin 3800 processors PMEMD runs the JAC benchmark (23558 atoms) about 42% faster than Sander 7, and the rt-polymerase benchmark (~140,000 atoms, nrespa=1) 81% faster than Sander 7.

90906 Atoms, Constant Volume Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
4         102         78            52          96          1.97x           1.07x
8         190         130           86          151         2.20x           1.26x

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
4         104         72            48          90          2.16x           1.15x
8         197         127           87          131         2.27x           1.50x

23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
4         295         230           190         270         1.55x           1.09x
8         524         396           313         470         1.66x           1.11x

23558 Atoms, Constant Pressure Molecular Dynamics (JAC Benchmark)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          psec/day    psec/day      psec/day    psec/day    Speedup         Speedup
4         289         225           178         251         1.63x           1.15x
8         497         320           290         400         1.71x           1.24x

20943 Atoms, Minimization (Factor IX Fragment)

#procs    PMEMD       PMEMD/NoCIT   Sander 6    Sander 7    PMEMD/Sander 6  PMEMD/Sander 7
          step/min    step/min      step/min    step/min    Speedup         Speedup
4         225         171           147         98          1.53x           2.30x
8         341         268           244         140         1.40x           2.44x

Bob Duke
University of North Carolina - Chapel Hill
Chemistry Department
rduke@email.unc.edu