Installation Notes for PMEMD 10

I. Installation of PMEMD on Supported Platforms

Installation of PMEMD is fairly simple for a number of common platforms:

1) Make amber10/src/pmemd your current working directory.

2) Enter the command 'configure -help' to get a list of supported platforms, Fortran 90 compilers, parallel implementations (or nopar to build uniprocessor code), fft implementations, and trajectory file format implementations.

3) Presuming a supported combination describes your system, you can now make the config.h header for the PMEMD build by entering:

   './configure <platform> <f90 compiler> <parallel implementation> [fft option] [traj option]'

   (e.g., './configure linux_p4 ifort mpich fftw bintraj'). For generic parallel implementations (everything but mpi or nopar), you will be queried for the location where the parallel implementation is installed. The fft and traj options have defaults; all other arguments must be provided, and must be in the order listed.

4) Now simply enter the command 'make install'. If all goes well, a pmemd executable will be deposited under amber10/exe.

Additional Details on Some Platforms and Options

Intel Math Kernel Library Support

If you are using the Intel ifort compiler, you will be asked whether you want to use the Intel Math Kernel Library (MKL). If you answer "yes", you will be prompted for its location unless the MKL_HOME environment variable has been set; if MKL_HOME is set, it is assumed that you want to use MKL. The MKL is primarily helpful for Generalized Born simulations on Intel architecture platforms (*athlon, *p4, *opteron, *em64t, and sgi_altix, our only supplied ia64 platform at present).

FFT Options

The default FFT implementation, PUBFFT, is based on fftpack code from netlib, originally written by Swarztrauber, first ported to Amber by Tom Darden, and since converted to F90 by Bob Duke. This version has fairly reasonable performance and supports fft's with prime factors of 2, 3, and 5. In many cases there is little advantage to selecting anything else. However, if you are running a machine that supports Intel SSE2 instructions, you will get slightly better performance by using FFTW 3.0 or FFTW 3.1, and you will also likely get a more optimal grid size because FFTW also supports a prime factor of 7. The only vendor FFT currently supported is for SGI, and we support both the newest interface as well as the interface that was used in earlier versions of Amber. A relatively clean interface for adding new FFT implementations has been created in fft1d.fpp; you would of course have to create a new defined constant for any implementation you add.

NetCDF Support

For both pmemd and sander, trajectory and velocity files may now be written using the portable binary NetCDF format. The files created are about half the size of their formatted equivalents, and the overhead associated with writing NetCDF files is low. A NetCDF source tree is provided as part of Amber; you may also just use an existing NetCDF installation, which is fairly common at large computer installations. At high processor count there are significant performance advantages associated with using NetCDF files, on the order of 10%, depending of course on various system and simulation problem specifics. In order to be able to write NetCDF files, you must compile pmemd with the bintraj option, and then specify ioutfm=1 in the mdin &cntrl namelist. The output trajectories may be analyzed or converted to formatted output for use in other programs with Amber 9 ptraj. VMD 1.8.3 also has support for the NetCDF format.
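Putting the steps above together, a typical build session might look like the following sketch. The platform, compiler, and parallel keywords shown are just the example combination from step 3; substitute the keywords that 'configure -help' reports for your own system.

   cd amber10/src/pmemd
   ./configure -help                                # list supported platforms, compilers, etc.
   ./configure linux_p4 ifort mpich fftw bintraj    # writes config.h for this combination
   make install                                     # deposits pmemd under amber10/exe

If the build was configured with the bintraj option, NetCDF trajectory output is then selected at run time via ioutfm in the mdin &cntrl namelist. The fragment below shows only that one setting; a real mdin would of course contain the rest of your run parameters.

   &cntrl
     ioutfm = 1,
   /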
IBM AIX SP5 Platforms

We strongly recommend building a 32-bit pmemd executable for AIX SP5 platforms, as the performance is better.

IBM AIX SP3/SP4/SP5 Platforms

For the IBM AIX platforms, it is possible to use the MASSVP libraries if one wishes. In pmemd 8, this improved performance. In pmemd 9, different algorithms were used, and for PME it did not really matter whether the MASSVP libraries were linked to pmemd or not; this is presumed to still be the case for pmemd 10. For Generalized Born simulations, however, linking in the MASSVP libraries should improve performance.

For all IBM AIX platforms we specify the USE_MPI_MODULE define to cause MPI module files to be used. If this is not done, some build combinations may not correctly call MPI functions, and pmemd will crash. This is only required for the IBM AIX platforms; all other builds work fine if MPI is treated as an F77 library.

OSF1 AlphaServer / Quadrics Interconnect Platforms

We routinely ran on lemieux.psc.edu, which was an OSF1 AlphaServer with a Quadrics interconnect. Typically, better performance was obtained by using both communications "rails" of the interconnect, which was done by specifying some job submission options (please see PSC documentation). Using two rails significantly improved performance at 64 or more processors. In the timeframe of the pmemd 8 release, there was a problem with the interconnect system software that made using two rails with pmemd unreliable (pmemd could crash); this problem has been fixed. To our knowledge, lemieux has been decommissioned, but we still support this configuration.

Autoconfiguration issues no longer exist. In the past, the configuration script would autodetect things like the cpu type for Alpha servers, SGI workstations/servers, etc. This is no longer done, because sometimes pmemd is built on a frontend platform that differs in significant ways from the intended target. Thus we now query the user regarding cpu type for the IBM AIX, AlphaServer, and SGI MIPS platforms.

The Build Process. A New Dependencies Scheme.

A simpler build process was introduced for pmemd 9 and has been retained for pmemd 10. The original source is provided in *.fpp files, which are preprocessed into *.f90 files. Dependency management is handled by a simple perl script that analyzes both module and include (*.i) file dependencies. There is a requirement that there be only one module in a source file, and that the source file name and module name be related such that if the source file name is foo.fpp, the module name will be foo_mod. Dependencies are generated by entering 'make depends' in the directory amber10/src/pmemd; this should only be necessary if you modify the code.
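As a purely illustrative example of the naming convention just described (the file name foo.fpp comes from the text above; the module variable shown is hypothetical), a source file foo.fpp would contain exactly one module, named foo_mod:

   ! foo.fpp -- one module per source file; module name = base file name + "_mod"
   module foo_mod
     implicit none
     integer, save :: foo_example_count = 0   ! hypothetical module data
   end module foo_mod

After adding or renaming such a file, regenerate the dependency information by running 'make depends' in amber10/src/pmemd.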
Pathscale Compiler Linkage Issues.

The Pathscale F90 compiler produces very high quality, fast code, but linkage can be complicated because different linkage options are chosen at different sites. The linkage options you use will be dictated by how the local MPI implementation's F90 interfaces were built. The default pathf90 linkage convention is to use g77- or GNU-compatible name mangling; alternatively, ifort/pgf90 mangling can be used. The configure script will ask you which convention you wish to use. Probably the simplest thing to do is to take the default, and if you have linkage problems, select the alternative (I build pmemd at two sites; one uses the default and the other uses the alternative, and I can never remember which is which).

Beware of Nonstandard MPI Library Installations!

The configure script assumes that lam, mpich, mpich2, mpich-gm, or mvapich will be installed in a standard manner, with headers in an 'include' subdirectory and mpi libraries in a 'lib' subdirectory under an mpi home directory which the configure script requests from the user. Installation of mpich-gm or mvapich will also typically require specification of a directory where device libraries may be found. Some mpi installers install the mpi implementations in nonstandard directories, and sometimes even rename library files. Also, as new releases of the generic mpi software occur, the list of required libraries sometimes changes. For these reasons, you may have to hand-tweak config.h after you run configure (typically, attempt a build and see if it works; if it doesn't, go looking for the problem). To determine the directories and libraries for lam, you can enter 'mpif77 -showme'. For mpich, mpich2, mpich-gm, or mvapich, enter 'mpif77 -link_info'. Note that if there are multiple mpi implementations installed on the machine, you must be sure that you are executing the correct mpif77 (a 'which mpif77' will show you what you are running; you then have to ascertain whether that is the right one or not). Unfortunately, we have not had much time for updating mpi configuration info, so the procedure for getting link information given above may well be needed more often than before.

Beware of static linkage on Linux systems!

None of the Linux installation scripts use static linkage, due to the fact that there are Linux distributions in which the statically linked threads libraries are broken: if you statically link an executable that includes these threads libraries, you are very apt to get a segment violation. However, dynamic linkage is also a headache, in that the mechanisms used to find shared libraries during dynamic loading are not all that robust on Linux systems running MPICH or other MPI packages. Typically the LD_LIBRARY_PATH environment variable will be used to find shared libraries during loading, but this variable is not reliably propagated to all processes. For this reason, for the compilers that use compiler shared libraries (ifort, pathscale), we use LD_LIBRARY_PATH during configuration to set an -rpath linkage option, which is reliably available in the executable. This works well as long as you ensure that the path is the same for all machines running pmemd. Earlier versions of ifort actually also set -rpath, but this was dropped due to confusing error messages when ifort is executed without arguments. The static linkage problem has become less common in the time since the Amber 8 release, but because systems that have these problems are still "out there", we have opted to be conservative on this issue for the time being.

Beware of stack overflows!

This should be a problem of the past, as since pmemd 9 the code resets the stack limits itself. If pmemd cannot set the stack limit to a value it believes to be adequate, it will issue a warning in the mdout file. If you then have problems with segment violations caused by stack overflows, you may need to talk to your system administrator about increasing the hard stack limit.
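A few quick shell checks related to the warnings above (bash syntax shown; the directories, libraries, and limits reported will of course be site-specific):

   which mpif77        # confirm which MPI implementation's wrapper you are actually running
   mpif77 -showme      # lam: show the include/lib directories and libraries used for linking
   mpif77 -link_info   # mpich, mpich2, mpich-gm, mvapich: equivalent link information
   ulimit -Ss          # current soft stack limit
   ulimit -Hs          # hard stack limit; raising it may require your system administrator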
II. Installation of PMEMD on Unsupported Platforms

The configuration process for pmemd is actually fairly simple, and basically involves providing a number of environment variables, for use by the Compile script, in a file named config.h. These environment variables control the choice of compilers, compiler and link options, libraries, and the definition of defined constants that are used in conditionally compiled code. If you have an unsupported system, it is fairly simple to generate a config.h for some other system (the closer to your system in characteristics, the better) and then modify the values as required. The only hard part is choosing the correct defined constants to use for optimizations and compiler compatibility; this has actually gotten even more complicated than it was in pmemd 8 because there are more optimization options.

Communications Optimizations

The default code paths in the PMEMD parallel code make extensive use of asynchronous mpi i/o and overlap of communications and computation. For slower interconnects, this may not produce the best performance, because the interconnect hardware gets overwhelmed and data retransmission is necessary. For this reason, we introduced the SLOW_NONBLOCKING_MPI defined constant. It is currently used for slow interconnects like gigabit ethernet, and should probably be tried when porting the code to systems with previously untested interconnects. In most instances, I would expect the default code paths that use asynchronous mpi i/o to have better performance. In fact, the differences for gigabit ethernet are now minimal, and this option is being maintained mostly "just in case"; however, it is still the default for gigabit ethernet platforms.

Use of the interconnect files for LAM, MPICH, MPICH2, MPICH-GM or MVAPICH

The Linux mpi interconnect implementations are now supported by "interconnect" data files, e.g., config_data/interconnect.mpich, etc. If a config data file matching the pattern <platform>.<compiler>.<parallel implementation> is not found, the configure script will look for two files: <platform>.<compiler> and interconnect.<parallel implementation>. With this mechanism it is possible to support a wide range of interconnects for an architecture-compiler combination by adding only one file to config_data. I would recommend looking at linux_p4.ifort and interconnect.mpich to see how this is done.

We briefly describe each defined constant below. These should of course all be specified as -D options in the appropriate variable. If you have a new processor, it may be best to just pick a configuration already in use for a similar processor; the values found in the release were determined to be optimal through extensive testing.

Table of Defined Constants Used in PMEMD:

MPI Defined Constants:

Use in: MPI_DEFINES

MPI - Use this to build a multiprocessor version of PMEMD.

SLOW_NONBLOCKING_MPI - See the note above on communications optimizations.

USE_MPI_MODULE - Generally, a simple F77-level interface to MPI is used. However, on some systems (AIX), there may be problems interfacing to MPI if the module definitions are not used. This define is provided to allow usage of the MPI module definitions instead of mpif.h.

Use in: F90_DEFINES

FFTLOADBAL_2PROC - Normally FFT loadbalancing is only done in scenarios where there is clearly the potential to benefit from assigning the FFT workload to a subset of the processors. However, on some systems with slow interconnects, there may be an advantage to assigning all the FFT workload to one of two processors in a two processor run. This is a loadbalancing scenario because, if performance bottlenecks in the processor with all the FFT workload, FFT slabs can be reassigned. This is probably worth trying for any slow interconnect.
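For illustration, the two groups listed so far might appear in config.h roughly as follows. This is only a sketch: the combination shown (a slow-interconnect build with two-processor FFT loadbalancing enabled) is purely illustrative, the exact layout and syntax of config.h differ from platform file to platform file, and the direct force and C linkage groups described next are omitted here.

   # excerpt from a hypothetical config.h for a slow-interconnect Linux cluster
   MPI_DEFINES = -DMPI -DSLOW_NONBLOCKING_MPI
   F90_DEFINES = -DFFTLOADBAL_2PROC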
Direct Force Optimization Defined Constants

The direct force code, where the pairlist is processed to compute direct forces and energies, is critical to overall pmemd performance. This code is optimized differently for different platforms. The optimizations were originally developed to take advantage of the characteristics of a given platform, but with the proliferation of optimizations, the interplay has become difficult to predict. For this reason, we typically benchmark all combinations to determine the best settings.

In past releases the default pmemd build used a pairlist compression algorithm to reduce the memory required by the pairlist by a factor of roughly four. For most machines in common use today, and for most problems currently being attempted, this compression algorithm has a cost in execution time, and it has been dropped from the code base. The practical fact of the matter is that for really big problems that would benefit from pairlist compression, it is best to run at larger processor counts, since the pairlist is then divided among the processors.

Use in: DIRFRC_DEFINES

DIRFRC_COMTRANS - Take advantage of the knowledge that all of an atom's pairs use the same unit cell translation vector.

DIRFRC_EFS - Use a different erfc splining switch than the standard one. The alternative switch avoids the need to use sqrt() for the vast majority of direct force pair evaluations, has lower overall splining error, thrashes memory less, and is faster on many but not all machines.

DIRFRC_NOVEC - Use no vectorization of the direct force workload data in evaluating the pairs associated with one atom. The default mechanism is to place data about the pairs in a vector, which for many but not all platforms produces faster code.

SLOW_INDIRECTVEC - On some machines it is actually cheaper to do some unnecessary calculations than it is to introduce an indirection layer in the vectors (basically a set of indexes specifying which entries in the vectors actually need to be used).

HAS_10_12 - Enables use of the Lennard-Jones 10-12 potential instead of the Lennard-Jones 6-12 potential. Little used, and not a performance issue.

Compiler Linkage Defined Constants:

Use in: CFLAGS

CLINK_CAPS, DBL_C_UNDERSCORE, NO_C_UNDERSCORE - These options control the naming conventions used to link in C code used by PMEMD. There is very little C code in PMEMD (only pmemd_clib.c), and if you have trouble linking to the routines there, you may need to use one of these defined constants. Please look at how the defined constants control name definition, and at what your compiler expects, if you have problems.

III. Some Information on Parallel Performance Tuning

Namelist Variables

New algorithms have been introduced in pmemd 10 with the goal of further extending the scaling capabilities of the code. These algorithms may be controlled to some extent by five &ewald namelist variables; however, these variables are assigned default values that are determined taking run conditions into consideration, and in general it is probably best that the user just use the defaults and not attempt to make adjustments. In some instances, though, fine tuning may yield slightly better performance. The variables involved are described in a bit more detail below, should a user want to experiment (an example &ewald namelist follows the descriptions):

block_fft = 0..1
  = 0 - use slab fft's.
  = 1 - use block fft's; requires at least 4 processors, and not permitted for minimizations or if nrespa > 1.
  When using block fft's, we essentially start with a collection of fft x-y slabs and further divide each slab into fft_blk_y_divisor "blocks" in the y dimension; each "block" is thus a collection of contiguous x-runs.

fft_blk_y_divisor = 2..nfft2 (after axis optimization reorientation); default = 2 or 4, depending on numtasks.
  The y divisor must be a value between 2 and nfft2, where nfft2 is the nfft2 value AFTER the axis optimization operation has been performed if it is allowed (thus it is really only safe to either use min(nfft1,nfft2,nfft3), or turn off axis optimization, or think carefully about what axis optimization will actually do). In all instances tested so far, relatively small values work best for fft_blk_y_divisor. This value is used only if block_fft .eq. 1.

excl_recip = 0..1 - Exclusive reciprocal tasks flag. This flag, when 1, specifies that tasks that do reciprocal force calculations will not also do direct force calculations. This has some benefits at higher task count; at lower task count, setting this flag can result in significant underutilization of the reciprocal tasks. This flag will automatically be cleared if block fft's are not in use.

excl_master = 0..1 - Exclusive master task flag. This flag, when 1, specifies that the master task will not do force and energy calculations. At high scaling, this insures that no tasks are waiting for the master to initiate collective communications events; the master is thus basically dedicated to handling loadbalancing and output. At lower task count, this is obviously wasteful. This flag will automatically be cleared if block fft's are not in use or if excl_recip .ne. 1. Note also that when block fft's are in use, that implies you are not doing a minimization and are not using nrespa > 1.

atm_redist_freq = 16..1280 - The frequency (in pairlist build events) for reassigning atom ownership to tasks. As a run progresses, diffusion causes the atoms originally collocated and assigned to one task to occupy a larger volume. With time, this starts to cause a higher communications load, though the increase in load is lower than one might expect. Currently, by default, we reassign atoms to tasks every 320 pairlist builds at low to medium task count and every 32 pairlist builds at higher task counts (currently defined as >= 96 tasks, redefinable in config.h). The user can, however, specify the specific value he desires. At low task count, frequent atom redistribution tends to have a noticeable cost and little benefit; at higher task count, the cost is lower and the benefit is higher.
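As promised above, here is an example &ewald namelist fragment that sets all five variables explicitly. The values are purely illustrative, and as noted the computed defaults are usually the best choice; remember that block_fft = 1 requires at least 4 processors and is not permitted for minimizations or nrespa > 1, and that excl_master is honored only when block fft's are in use and excl_recip = 1.

   &ewald
     block_fft = 1,
     fft_blk_y_divisor = 2,
     excl_recip = 1,
     excl_master = 1,
     atm_redist_freq = 320,
   /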
IV. Representative Benchmarks

Representative benchmarks will be published at the Amber website, amber.scripps.edu.

- Bob Duke, NIEHS and UNC-Chapel Hill