Re: [AMBER] Sander.MPI parallel run

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 27 Oct 2011 15:32:20 -0400

On Thu, Oct 27, 2011 at 3:08 PM, Lianhu Wei <lianhu.wei.gmail.com> wrote:

> Hi Ross and other experts,
>
> This time I used pmemd.MPI. It generally runs a lot faster than
> sander.MPI, but I still have the same issue when I run on multiple nodes.
> I also looked carefully at my input options. All of these runs used the
> same input file, just a varying number of nodes on the cluster. The
> interconnect among the nodes is infiniband. In my PBS scripts, I removed
> all the unnecessary options. The result is still that with 2 nodes I got
> the maximum performance (x1.65 compared to one node). On 4 nodes it was
> x0.3, and on 8 nodes it was x0.1, i.e. much slower than running on one
> node. My simulation system is about 100K atoms. I checked all of the
> distributed nodes, and there were 8 threads on each node.
>
> When pmemd is running on 1 or 2 nodes, most of the threads on the nodes
> are actually running. On 4 nodes and 8 nodes, I saw many threads with
> status "S" (sleeping).
>
> I do not know whether I failed to use the options properly, or whether
> pmemd did not distribute the calculation well enough. Please give
> suggestions.
>

pmemd does a good job of load balancing, and it does so dynamically. The
problem is unlikely to be workload distribution. I'm guessing your issue is
the interconnect. You say you have infiniband (but not what speed of
infiniband). Keep in mind that with 8 threads per node, you're effectively
cutting the bandwidth that each process sees down to 1/8 of the full speed
(certain MPI implementations may be able to deal with that more efficiently
than others).
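
For example (a minimal sketch, assuming the standard OFED diagnostics are
installed on the compute nodes), running something like this from inside a
2-node job would show the infiniband link rate and confirm that the ranks
really do land 8 per node:

  # report the infiniband port state and link rate (OFED utility)
  ibstat | grep -E 'State|Rate'

  # launch one trivial process per requested slot and count ranks per host;
  # with select=2:ncpus=8 you should see two hostnames with a count of 8 each
  mpirun -np 16 -machinefile $PBS_NODEFILE hostname | sort | uniq -c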

Another possibility is that your MPI is not actually taking advantage of
your infiniband hardware. Can you install a different MPI (like mvapich2 or
mvapich) and try with that? Those two packages in particular are tailored
for infiniband.
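
Before rebuilding, it is also worth confirming what your current OpenMPI is
actually doing. A minimal sketch (assuming OpenMPI 1.4.x, whose MCA options
are used below; adjust for your MPI) is to restrict it to the infiniband
transport and see whether the job still launches:

  # confirm which MPI launcher and compiler wrappers are on the PATH
  which mpirun mpif90

  # allow only the infiniband (openib), shared-memory and self transports;
  # if the openib BTL is unusable the job should fail loudly rather than
  # silently falling back to TCP over ethernet
  mpirun --mca btl openib,sm,self -np 16 -machinefile $PBS_NODEFILE \
      pmemd.MPI -O -i ENZ_KP94_CnsP_4_6ns.in ...

If that run aborts with complaints about unreachable processes or missing
BTL components, your MPI is almost certainly not using the infiniband
fabric, and switching to mvapich2 (and rebuilding pmemd.MPI against it) is
the next thing to try.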

HTH,
Jason


> Thanks,
> WIlliam
>
> Here is my PBS script for 2 nodes:
>
> [william.speed]$ more Qsub_T
> #!/bin/bash
> #PBS -V
> #PBS -r n
> #PBS -N ENZ_T
> #PBS -l select=2:ncpus=8
> #PBS -l select=arch=linux
> #PBS -m abe
> #PBS -q verylong
> #
> export WRKDIR=/home/william/Work/PAD4/MD/KP94_ENZ_T
> cd $WRKDIR
>
> ulimit -a
>
> nohup mpirun -np 16 -machinefile $PBS_NODEFILE pmemd.MPI -O -i
> ENZ_KP94_CnsP_4_6ns.in \
> -o T_ENZ_KP94_CnsP_6ns.out \
> -p K94_ENZ.top \
> -c ENZ_KP94_CnsP_4ns.rst \
> -x T_ENZ_KP94_CnsP_6ns.mdcrd \
> -v T_ENZ_KP94_CnsP_6ns.mdvel \
> -e T_ENZ_KP94_CnsP_6ns.mden \
> -r T_ENZ_KP94_CnsP_6ns.rst \
> -inf T_ENZ_KP94_CnsP_6ns.mdinfo
>
> =========================
>
> This is my pmemd input file:
> &cntrl
> timlim = 999999,
> imin=0,
> nmropt = 0,
>
> ntx=7,
> irest=1,
>
> ntxo=1,
> ntpr=50,
> ntwr=50,
> iwrap=0,
> ntwx=500,
> ntwv=500,
> ntwe=500,
>
> ntf=2,
> ntb=2,
> dielc=1.0,
> igb=0,
> scnb=2.0,
> scee=1.2,
>
> nstlim=1000000,
> t=10.0,
> dt=0.002,
>
> temp0=300,
> tempi=300,
> heat=0.0,
> ntt=1,
> tautp=1.0,
> vlimit=0.0,
>
> ntp=1,
> pres0=1.0,
> comp=44.6,
> taup=1.0,
> npscal=1,
>
>
> ntc=2,
> tol=0.0005,
>
> cut=12.0,
> &end
>
> &ewald
> a = 104.8389167,
> b = 132.7362072,
> c = 72.5503008,
> alpha=90,
> beta=90,
> gamma=90,
> nfft1=100,
> nfft2=144,
> nfft3=81,
> order=4,
> ischrgd=0,
> verbose=1,
> ew_type=0,
> dsum_tol=0.00001,
> &end
>
> Here is the speed reported in the pmemd mdinfo files:
>
> #PBS -l select=1:ncpus=8
> ...
> nohup mpirun -np 8 ...
>
> ==> T1_ENZ_KP94_CnsP_6ns.mdinfo <==
> | Elapsed(s) = 69.08 Per Step(ms) = 345.38
> | ns/day = 0.50 seconds/ns = 172688.77
> |
> | Average timings for all steps:
> | Elapsed(s) = 3215.09 Per Step(ms) = 347.58
> | ns/day = 0.50 seconds/ns = 173788.91
> |
> |
> | Estimated time remaining: 95.7 hours.
>
> ------------------------------------------------------------------------------
>
> #PBS -l select=2:ncpus=8
> ...
> nohup mpirun -np 16
>
> ==> T_ENZ_KP94_CnsP_6ns.mdinfo <==
> | Elapsed(s) = 62.11 Per Step(ms) = 207.05
> | ns/day = 0.83 seconds/ns = 103524.14
> |
> | Average timings for all steps:
> | Elapsed(s) = 2305.82 Per Step(ms) = 206.80
> | ns/day = 0.84 seconds/ns = 103399.92
> |
> |
> | Estimated time remaining: 56.8 hours.
>
> ------------------------------------------------------------------------------
>
> #PBS -l select=4:ncpus=8
> ...
> nohup mpirun -np 32
>
> ==> T4_ENZ_KP94_CnsP_6ns.mdinfo <==
> | Elapsed(s) = 147.48 Per Step(ms) = 1474.84
> | ns/day = 0.12 seconds/ns = 737418.58
> |
> | Average timings for all steps:
> | Elapsed(s) = 3122.34 Per Step(ms) = 1178.24
> | ns/day = 0.15 seconds/ns = 589120.22
> |
> |
> | Estimated time remaining: 326.4 hours.
>
> ------------------------------------------------------------------------------
>
> #PBS -l select=8:ncpus=8
> ...
> nohup mpirun -np 64
>
> ==> T8_ENZ_KP94_CnsP_6ns.mdinfo <==
> | Elapsed(s) = 282.79 Per Step(ms) = 5655.89
> | ns/day = 0.03 seconds/ns = 2827947.20
> |
> | Average timings for all steps:
> | Elapsed(s) = 2922.61 Per Step(ms) = 3896.82
> | ns/day = 0.04 seconds/ns = 1948409.06
> |
> |
> | Estimated time remaining: 1081.6 hours.
>
> ------------------------------------------------------------------------------
>
> Best,
> William
>
> On Thu, Oct 20, 2011 at 12:33 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Hi Lianhu,
> >
> > > I have been struggling with my MD simulation using sander.MPI for many
> > > days. I have tried many ways, but still cannot figure out why my
> > > parallel run does not speed up. Using 2 nodes is faster than using 1
> > > node, but when I use 8 nodes the speed is similar to using 1 node. My
> > > system has 101,679 atoms. The following are the details of my tests.
> >
> > This is normal for sander. It generally won't scale much beyond 32 cores
> > or so, especially with these multicore boxes that have a large number of
> > cores in a single box but do not have a corresponding interconnect to
> > match.
> >
> > You don't say what your interconnect is. If it is infiniband then you
> > are in with a shot. If it is something else then all bets are off.
> >
> > A few suggestions:
> >
> > > #PBS -v LD_LIBRARY_PATH=/home/appmgr/Software/Openmpi/openmpi-1.4.3/exe/lib
> >
> > Consider using MVAPICH instead of openmpi. It generally performs better
> > and is optimized for infiniband.
> >
> > > #PBS -l select=8:ncpus=8
> >
> > I assume you have 8 real cores per node and not 4 cores and 4
> > hyperthreads? - Check this.
> >
> > > #PBS -l select=arch=linux
> > > #PBS -l place=scatter
> >
> > I am not sure what the 'scatter' implies here for thread placement. It
> > is probably better just to remove this line altogether and use whatever
> > the default placement is.
> >
> > > export OMP_NUM_THREADS=64
> > > ##unset OMP_NUM_THREADS
> >
> > This does absolutely nothing for sander.
> >
> > >
> > > mpirun -np 64 -machinefile $PBS_NODEFILE sander.MPI -O -i
> > > ENZ_KP94_CnsP_4_6ns.in \
> > > -o ENZ_KP94_CnsP_6ns.out \
> > > -p K94_ENZ.top \
> > > -c ENZ_KP94_CnsP_4ns.rst \
> > > -x ENZ_KP94_CnsP_6ns.mdcrd \
> > > -v ENZ_KP94_CnsP_6ns.mdvel \
> > > -e ENZ_KP94_CnsP_6ns.mden \
> > > -r ENZ_KP94_CnsP_6ns.rst \
> > > -inf ENZ_KP94_CnsP_6ns.mdinfo
> >
> > Consider using pmemd instead of sander. If your input options are
> > supported, then pmemd.MPI will generally be much faster and scale much
> > better than sander.
> >
> > I would also consider manually checking that the MPI threads get placed
> > on the correct nodes, i.e. that you are not just ending up with 64
> > threads running on the first node.
> >
> > You can also try:
> >
> > #PBS -l select=8:ncpus=4
> >
> > mpirun -np 32 ...
> >
> > Often leaving cores on a node idle can actually give you higher
> > performance, since then the interconnect is not so overloaded.
> >
> > It would also be helpful to see your input file so we can offer some
> > suggestions on tweaking that for performance. I note you have mdvel and
> > mden specified above. Do you actually need these files? Doing too much
> > I/O can seriously hurt performance in parallel. I would suggest turning
> > off writing to mden and mdvel unless you absolutely need the info in
> > them.
> >
> > The biggest improvement is likely to come from using pmemd.MPI though.
> >
> > All the best
> > Ross
> >
> > /\
> > \/
> > |\oss Walker
> >
> > ---------------------------------------------------------
> > | Assistant Research Professor |
> > | San Diego Supercomputer Center |
> > | Adjunct Assistant Professor |
> > | Dept. of Chemistry and Biochemistry |
> > | University of California San Diego |
> > | NVIDIA Fellow |
> > | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> > ---------------------------------------------------------
> >
> > Note: Electronic Mail is not secure, has no guarantee of delivery, may
> > not be read every day, and should not be used for urgent or sensitive
> > issues.
> >
> >
> >
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Oct 27 2011 - 13:00:03 PDT