Re: Sander parallel compilation

From: David Case <>
Date: Wed 28 Feb 2001 10:46:04 -0800

On Wed, Feb 28, 2001, David Case wrote:
> I'm putting below our current version of "INSTALL.parallel"; study the
> section on "Installing and running the MPI version". Users are encouraged
> to submit more detailed explanations than this one!

...oops, left out the file from my previous mail. Here it is:

        Amber 7 information for parallel architectures

---- Introduction

This file contains notes about the various parallel implementations supplied
in the current release. Only sander, sander_classic, gibbs and roar are
parallel programs; all others are single threaded. For information on
parallel roar, see its documentation. NOTE: Parallel machines and networks
can fail in unexpected ways. PLEASE check short parallel runs against a
workstation version of Amber before embarking on long parallel simulations!

---- What architectures are supported or should I be reading this file?

This release supports both shared memory (FORTRAN directives) and
message passing (MPI) code for sander and gibbs for particular
architectures:

        Shared memory: (for sander_classic and gibbs)

                All SGI multiprocessors
                Cray multiprocessors running Unicos

        Message passing environments: (MPI, for sander and sander_classic)

        (Presumably) any MPI-compliant parallel environment; we have
           tested the following:
                SGI with MPICH library
                SGI with vendor-supplied MPI library
                Cray T3D/E using SHMEM wrappers to intercept MPI calls
                Cray T3E with Cray-supplied MPT library
                IBM POE MPL (SP1/SP2)
                Linux clusters with various MPI libraries, including MPICH
                HP clusters


---- Overview of the shared memory version

The shared memory version was developed within AMBER 4.0 (minmd and gibbs)
and later optimized and extended into an early version of AMBER 4.1 (sander
and gibbs) by Roberto Gomperts and Michael Schlenkrich of SGI with the
assistance of Thomas Cheatham at UCSF, specifically for Silicon Graphics,
Inc. multiprocessors. The 4.1 SGI shared memory version was ported to Cray
multiprocessors by Jeyapandian Kottalam and Mike Page of Cray Research and
incorporated into the 4.1 distribution by Thomas Cheatham. The Particle
Mesh Ewald parallelization for SGI multiprocessors was done by Tom Darden.

In general, the bonds, angles, dihedrals, pairlist generation, SHAKE
(currently only in the SGI version of sander), and nonbonds are completely
multitasked. Most of the Particle Mesh Ewald code is parallelized; note
that it uses by default the optimized SGI fast Fourier transform libraries,
which may not be present on all systems.

Currently the code has only been ported to SGI and Cray multiprocessors.
Multitasking is handled via the placement of FORTRAN parallel directives in
the source which multitask over loops or parallel regions. Scratch memory
is allocated on an as-needed basis using SGI and Cray specific FORTRAN calls
to allocate memory.

Porting to other machines (with FORTRAN multitasking directives) should be
relatively straightforward. There are a few machine dependencies introduced
in the code, the trickiest of which is how to handle allocation of scratch
memory.

Under development is an OpenMP version which should be portable to a variety
of shared memory multiprocessor machines.

---- Overview of the message passing (MPI) version

This message passing version was independently developed and contributed by
James Vincent and Ken Merz of the Pennsylvania State University based on 4.0
and later an early prerelease 4.1 version.

   see: J. Vincent and K.M. Merz, "A highly portable parallel implementation
   of AMBER 4 using the Message Passing Interface Standard", J. Comp. Chem.
   16: 1420-1427 (1995).

This version was optimized, integrated and extended by James Vincent,
Dave Case and Thomas Cheatham with help from Michael Crowley (PSC, T3D/E
portable namelist port and PME development), Thomas Huber (TCGMSG
library) and Asiri Nanayakkara (T3D optimization).

The bonds, angles, dihedrals, SHAKE (only on bonds involving hydrogen when
NTC=2), nonbonded energies and forces, pairlist creation, and integration
steps are parallelized. The code is pure SPMD (single program multiple
data) using a master/slave, replicated data model. Basically, the master
node does all of the initial set-up and performs all the I/O. Depending on
the version and/or the particular input options chosen, all the non-master
nodes either execute force() in parallel or run the dynamics (runmd(), the
more efficient option) in parallel. Communication is done to accumulate
partial forces and/or update coordinates, etc. Please see the paper by
Vincent and Merz referenced above for more information.
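
The replicated-data scheme described above can be illustrated with a toy,
serial simulation. This is only a sketch under stated assumptions, not code
from Amber: the harmonic pair force, the round-robin split of pairs, and the
names pair_force, partial_forces and replicated_data_forces are all
hypothetical. The final summation stands in for the communication step that
accumulates partial forces across nodes.

```python
"""Toy serial illustration of a replicated-data force calculation."""


def pair_force(xi, xj, k=1.0):
    """Hypothetical 1-D harmonic force on particle i due to particle j."""
    return -k * (xi - xj)


def partial_forces(coords, pairs):
    """Forces from one node's share of pairs; each node holds all coords."""
    forces = [0.0] * len(coords)
    for i, j in pairs:
        f = pair_force(coords[i], coords[j])
        forces[i] += f
        forces[j] -= f  # equal and opposite force on the partner
    return forces


def replicated_data_forces(coords, nproc):
    """Split the pair list over nproc simulated nodes and sum the results."""
    n = len(coords)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Round-robin assignment of pairs to simulated nodes (an assumption).
    shares = [all_pairs[rank::nproc] for rank in range(nproc)]
    partials = [partial_forces(coords, share) for share in shares]
    # Stand-in for the communication step: element-wise sum of partials.
    return [sum(column) for column in zip(*partials)]


if __name__ == "__main__":
    coords = [0.0, 1.0, 3.0, 6.0]
    print(replicated_data_forces(coords, nproc=1))
    print(replicated_data_forces(coords, nproc=3))
```

The point of the sketch is that the accumulated forces should not depend on
how many nodes share the work, which is also why a short parallel run can be
checked against a single-processor run.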


---- Installing and running the MPI version

The message passing in sander is performed by calls to MPI (Message Passing
Interface) library routines.

The MPI source code is generally wrapped with the CPP wrapper:

#ifdef MPI

    ...parallel sections with calls to MPI library routines...

#endif

Addition of -DMPI to the MACHINEFLAGS will lead to compilation of the MPI
code. Resolving of the MPI library calls and header files requires the
specification of more information and is discussed below. Most of the
message passing model and machine dependencies are isolated into the
specific Machine system directories, the Machine/mpi subdirectory, and
special include files.

All of the CPP #include directives are specified in the source code without
hardcoding paths for the files. In order to find the MPI implementation-
specific include files, it is necessary to specify an include path. This is
either done explicitly in the MACHINEFLAGS (using the -I option to the CPP)
in the case of a vendor-supplied or public domain MPI implementation...

    setenv MACHINEFLAGS "-DMPI -I/usr/local/mpi/include"

...or through the use of a variable in the Machine file called LOCINCLUDE
which specifies where the Compile script will look for the include file
relative to the top level amber src directory. When using the MPI wrappers
provided in this release, set LOCINCLUDE as follows:

    setenv LOCINCLUDE Machine/mpi

If you plan on running with an MPI version and there is no pre-made MACHINE
file (these files end in "_mpi" in the src/Machines directory) then you will
need to modify the Machine file as follows:

   (1) Add "-DMPI" to the MACHINEFLAGS variable.

   (2) Add the include path for the (implementation-supplied) mpif.h
       file to the MACHINEFLAGS variable, for example:

           setenv MACHINEFLAGS "-DMPI -I/usr/local/src/mpi/include"

   (3) Reference any necessary MPI libraries in the LOADLIB variable.

   (4) Add any special compile and load flags to the L0, L1, L2,
       L3, and LOAD variables.

   (5) Pick a particular system directory, specified by the SYSDIR
       variable, and heed the next warning...

To run the resulting codes, you need to use the "mpirun" (or equivalent)
command for your system. The naming and syntax of this are unfortunately
not well standardized: on the T3E it is called "mpprun"; on DEC OSF/1
machines it is called "dmpirun"; etc. Consult your MPI documentation.

For the test suites, you need to set the DO_PARALLEL environment variable
to include mpirun or its equivalent, e.g.:

    setenv DO_PARALLEL 'mpirun -np 4'

will run the test cases with four processors.
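
As an illustration of how such a variable can be consumed, here is a minimal
sketch of a test driver that prepends the contents of DO_PARALLEL to a
command line. The helper name build_command and the program arguments in
the usage line are hypothetical, chosen only for this example; the test
scripts shipped with Amber may handle the variable differently.

```python
import os
import shlex


def build_command(program_args, env=None):
    """Prepend the launcher held in DO_PARALLEL (if set) to a command."""
    env = os.environ if env is None else env
    # Split the launcher string the way a shell would, e.g. "mpirun -np 4".
    launcher = shlex.split(env.get("DO_PARALLEL", ""))
    return launcher + list(program_args)


if __name__ == "__main__":
    # Hypothetical program name and arguments, for illustration only.
    fake_env = {"DO_PARALLEL": "mpirun -np 4"}
    print(build_command(["sander", "-i", "mdin"], env=fake_env))
```

When DO_PARALLEL is unset or empty, the helper returns the command
unchanged, so the same driver would cover both serial and parallel runs.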


---- Installing and running the shared memory version

Currently this version only supports SGI and Cray multiprocessors, and
only gibbs and sander_classic. This version is not currently being actively
developed, and is likely to disappear in future versions.

In general, the shared memory source code is all wrapped with the CPP
wrappers:

#ifdef SHARED_MEMORY

    ...general shared memory code...

# ifdef SGI_MP

    ...SGI specific code...

# endif
# ifdef CRAY_MP

    ...Cray specific code...

# endif

#endif

Specification of either -DSGI_MP or -DCRAY_MP in the MACHINEFLAGS of the
MACHINE file leads to auto-setting of the #define SHARED_MEMORY in the code.
[Note that specifying -DSHARED_MEMORY alone in the MACHINEFLAGS will not
lead to a correctly compiled source.]

Beware that some of the shared memory source is placed in extra files which
are included when the appropriate #define's are specified. This is to
minimize obfuscation of the source where possible. In particular, this is
apparent in force.f (#include forcemp.f) and resnba.f (#include resnbamp.f).

In addition to setting either -DSGI_MP or -DCRAY_MP in the MACHINEFLAGS,
optionally one can specify the maximum number of processors for a given
executable (which determines allocation of scratch memory) to be compiled in
by adding the following define to the MACHINEFLAGS:


where N is the maximum number of processors to run on.

If this is not set, it will default to the current maximum size for a given
machine based on default values set in sander.f (sander) and gib.f (gibbs)
and in various other routines.

To set the number of processors to run on, set the MP_SET_NUMTHREADS
environment variable at runtime. E.g. to run on 4 processors:

    setenv MP_SET_NUMTHREADS 4

David A. Case                     |  e-mail:
Dept. of Molecular Biology, TPC15 |  fax:          +1-858-784-8896
The Scripps Research Institute    |  phone:        +1-858-784-9768
10550 N. Torrey Pines Rd.         |  home page:                   
La Jolla CA 92037  USA            |
Received on Wed Feb 28 2001 - 10:46:04 PST