Re: [AMBER] Sander.MPI parallel run

From: Lianhu Wei <lianhu.wei.gmail.com>
Date: Fri, 28 Oct 2011 10:33:04 -0400

The speed on my system is much slower. I ran on 1, 2, 4, and 8 nodes (each
node has 8 cores); the average speeds were 1.45, 1.70, 0.26, and 0.20 ns/day.
I will investigate a possible hardware issue.
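
For reference, plugging those rates into a quick script (the ns/day figures are taken from this message; the script itself is only an illustration) makes the breakdown explicit: beyond 2 nodes the runs are slower than a single node.

```python
# ns/day measured for each node count (8 cores per node), from the message above
rates = {1: 1.45, 2: 1.70, 4: 0.26, 8: 0.20}

base = rates[1]
for nodes, ns_per_day in rates.items():
    speedup = ns_per_day / base      # relative to the 1-node run
    efficiency = speedup / nodes     # fraction of ideal linear scaling
    print(f"{nodes} node(s): {speedup:.2f}x speedup, {efficiency:.1%} efficiency")
```

The 2-node run is only about a 1.17x speedup (59% efficiency), and the 4- and 8-node runs are well below 1x, which points at communication overhead rather than compute.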

Thanks,
William

On Thu, Oct 27, 2011 at 4:21 PM, Jason Swails <jason.swails.gmail.com> wrote:

> On Thu, Oct 27, 2011 at 4:02 PM, Lianhu Wei <lianhu.wei.gmail.com> wrote:
>
> > Hi Jason,
> >
> > On my cluster, I have two network interfaces, one Ethernet and one
> > InfiniBand. How can I tell whether the data exchange goes via Ethernet or
> > InfiniBand?
> >
>
> Not sure -- that's a question to ask your sysadmin. Naively, I would say
> switch to mvapich/mvapich2 and see if you get a big speedup. If so, I would
> say that's your answer. If not, you're still stuck with the same question.
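
One rough way to check, sketched below under the assumption that the infiniband-diags tools and Open MPI are installed (the trailing program arguments are placeholders): confirm the IB port is active, see which transports the MPI build includes, then force the InfiniBand transport so the run fails loudly at startup if it cannot be used.

```shell
# Confirm the InfiniBand port is up (ibstat is from infiniband-diags)
ibstat | grep -i state          # look for "State: Active"

# With Open MPI, list the byte-transfer layers it was built with
ompi_info | grep -i btl         # "openib" means InfiniBand support is compiled in

# Force the InfiniBand transport; the job aborts at startup if it is unavailable
mpirun --mca btl openib,self,sm -np 16 -machinefile $PBS_NODEFILE pmemd.MPI ...
```

If the forced-openib run starts and is suddenly much faster, the default runs were almost certainly falling back to Ethernet (TCP).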
>
> But I did mean to mention not to use nohup inside a submission script,
> which Ross pointed out (that's what reminded me). The last thing you want
> is to launch a process inside an automated environment like this that
> resists hang-up signals. nohup is for interactive use on a personal
> machine, when you don't want the process killed if you end your shell
> session and go home.
>
> HTH,
> Jason
>
>
> > Sorry for the system question.
> > William
> >
> > On Thu, Oct 27, 2011 at 3:32 PM, Jason Swails <jason.swails.gmail.com>
> > wrote:
> >
> > > On Thu, Oct 27, 2011 at 3:08 PM, Lianhu Wei <lianhu.wei.gmail.com>
> > > wrote:
> > >
> > > > Hi Ross and other experts,
> > > >
> > > > This time I used pmemd.MPI. It generally runs a lot faster than
> > > > sander.MPI, but I still have the same issue when I run on multiple
> > > > nodes. I also looked carefully at my input options: all these runs
> > > > used the same input file, just with varying numbers of nodes on the
> > > > cluster. The interconnect between the nodes is InfiniBand. In my PBS
> > > > scripts I removed all the unnecessary options. The result is still
> > > > that with 2 nodes I get the maximum performance (x1.65 compared to
> > > > one node). On 4 nodes it was x0.3; on 8 nodes, x0.1 -- much slower
> > > > than running on one node. My simulation system is about 100K atoms.
> > > > I checked all the distributed nodes; there were 8 threads on each
> > > > node.
> > > >
> > > > When pmemd is running on 1 or 2 nodes, most of the threads on the
> > > > nodes are running. Running on 4 or 8 nodes, I saw many threads in
> > > > status "S" (sleeping).
> > > >
> > > > I do not know whether I did not use the options properly or pmemd
> > > > did not distribute the calculation well enough. Please give
> > > > suggestions.
> > > >
> > >
> > > pmemd does a good job of load balancing, and it does it dynamically.
> > > The problem is unlikely to be workload distribution. I'm guessing your
> > > issue is the interconnect. You say you have InfiniBand (but not what
> > > speed of InfiniBand). Keep in mind that with 8 threads per node, you're
> > > effectively cutting the bandwidth each processor sees down to 1/8 of
> > > the full speed (certain MPI implementations may be able to deal with
> > > that more efficiently than others).
> > >
> > > Another possibility is that your MPI is not actually taking advantage
> > > of your InfiniBand hardware. Can you install a different MPI (like
> > > mvapich2 or mvapich) and try with that? Those two packages are
> > > tailored specifically for InfiniBand.
> > >
> > > HTH,
> > > Jason
> > >
> > >
> > > > Thanks,
> > > > William
> > > >
> > > > Here is my PBS scripts on 2 nodes:
> > > >
> > > > [william.speed]$ more Qsub_T
> > > > #!/bin/bash
> > > > #PBS -V
> > > > #PBS -r n
> > > > #PBS -N ENZ_T
> > > > #PBS -l select=2:ncpus=8
> > > > #PBS -l select=arch=linux
> > > > #PBS -m abe
> > > > #PBS -q verylong
> > > > #
> > > > export WRKDIR=/home/william/Work/PAD4/MD/KP94_ENZ_T
> > > > cd $WRKDIR
> > > >
> > > > ulimit -a
> > > >
> > > > nohup mpirun -np 16 -machinefile $PBS_NODEFILE pmemd.MPI -O -i
> > > > ENZ_KP94_CnsP_4_6ns.in \
> > > > -o T_ENZ_KP94_CnsP_6ns.out \
> > > > -p K94_ENZ.top \
> > > > -c ENZ_KP94_CnsP_4ns.rst \
> > > > -x T_ENZ_KP94_CnsP_6ns.mdcrd \
> > > > -v T_ENZ_KP94_CnsP_6ns.mdvel \
> > > > -e T_ENZ_KP94_CnsP_6ns.mden \
> > > > -r T_ENZ_KP94_CnsP_6ns.rst \
> > > > -inf T_ENZ_KP94_CnsP_6ns.mdinfo
> > > >
> > > > =========================
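
For what it's worth, a trimmed version of the script above that applies the advice given in this thread (no nohup inside a batch script; the -v/-e velocity and energy outputs dropped; file names and paths copied from the original, everything else unchanged) might look like:

```shell
#!/bin/bash
#PBS -V
#PBS -N ENZ_T
#PBS -l select=2:ncpus=8
#PBS -q verylong

cd /home/william/Work/PAD4/MD/KP94_ENZ_T

# No nohup here: the batch scheduler owns the job's lifetime, and a
# hang-up-resistant process can outlive or confuse it.
mpirun -np 16 -machinefile $PBS_NODEFILE pmemd.MPI -O \
    -i ENZ_KP94_CnsP_4_6ns.in \
    -o T_ENZ_KP94_CnsP_6ns.out \
    -p K94_ENZ.top \
    -c ENZ_KP94_CnsP_4ns.rst \
    -x T_ENZ_KP94_CnsP_6ns.mdcrd \
    -r T_ENZ_KP94_CnsP_6ns.rst \
    -inf T_ENZ_KP94_CnsP_6ns.mdinfo
```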
> > > >
> > > > This is my pmemd input file:
> > > > &cntrl
> > > > timlim = 999999,
> > > > imin=0,
> > > > nmropt = 0,
> > > >
> > > > ntx=7,
> > > > irest=1,
> > > >
> > > > ntxo=1,
> > > > ntpr=50,
> > > > ntwr=50,
> > > > iwrap=0,
> > > > ntwx=500,
> > > > ntwv=500,
> > > > ntwe=500,
> > > >
> > > > ntf=2,
> > > > ntb=2,
> > > > dielc=1.0,
> > > > igb=0,
> > > > scnb=2.0,
> > > > scee=1.2,
> > > >
> > > > nstlim=1000000,
> > > > t=10.0,
> > > > dt=0.002,
> > > >
> > > > temp0=300,
> > > > tempi=300,
> > > > heat=0.0,
> > > > ntt=1,
> > > > tautp=1.0,
> > > > vlimit=0.0,
> > > >
> > > > ntp=1,
> > > > pres0=1.0,
> > > > comp=44.6,
> > > > taup=1.0,
> > > > npscal=1,
> > > >
> > > >
> > > > ntc=2,
> > > > tol=0.0005,
> > > >
> > > > cut=12.0,
> > > > &end
> > > >
> > > > &ewald
> > > > a = 104.8389167,
> > > > b = 132.7362072,
> > > > c = 72.5503008,
> > > > alpha=90,
> > > > beta=90,
> > > > gamma=90,
> > > > nfft1=100,
> > > > nfft2=144,
> > > > nfft3=81,
> > > > order=4,
> > > > ischrgd=0,
> > > > verbose=1,
> > > > ew_type=0,
> > > > dsum_tol=0.00001,
> > > > &end
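
In line with Ross's earlier remark about mden/mdvel I/O hurting parallel performance, the output-control entries above could be trimmed to something like the following sketch (only the output-control lines are shown; the rest of the &cntrl namelist stays as written; ntwv=0 and ntwe=0 disable the velocity and energy files, and ntpr=500 prints energies less often than the original ntpr=50):

```
 &cntrl
   ntpr=500,
   ntwx=500,
   ntwv=0,
   ntwe=0,
 &end
```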
> > > >
> > > > Here is the speed reported by pmemd info:
> > > >
> > > > #PBS -l select=1:ncpus=8
> > > > ...
> > > > nohup mpirun -np 8 ...
> > > >
> > > > ==> T1_ENZ_KP94_CnsP_6ns.mdinfo <==
> > > > | Elapsed(s) = 69.08 Per Step(ms) = 345.38
> > > > | ns/day = 0.50 seconds/ns = 172688.77
> > > > |
> > > > | Average timings for all steps:
> > > > | Elapsed(s) = 3215.09 Per Step(ms) = 347.58
> > > > | ns/day = 0.50 seconds/ns = 173788.91
> > > > |
> > > > |
> > > > | Estimated time remaining: 95.7 hours.
> > > >
> > > >
> > > > ------------------------------------------------------------------------------
> > > >
> > > > #PBS -l select=2:ncpus=8
> > > > ...
> > > > nohup mpirun -np 16
> > > >
> > > > ==> T_ENZ_KP94_CnsP_6ns.mdinfo <==
> > > > | Elapsed(s) = 62.11 Per Step(ms) = 207.05
> > > > | ns/day = 0.83 seconds/ns = 103524.14
> > > > |
> > > > | Average timings for all steps:
> > > > | Elapsed(s) = 2305.82 Per Step(ms) = 206.80
> > > > | ns/day = 0.84 seconds/ns = 103399.92
> > > > |
> > > > |
> > > > | Estimated time remaining: 56.8 hours.
> > > >
> > > >
> > > > ------------------------------------------------------------------------------
> > > >
> > > > #PBS -l select=4:ncpus=8
> > > > ...
> > > > nohup mpirun -np 32
> > > >
> > > > ==> T4_ENZ_KP94_CnsP_6ns.mdinfo <==
> > > > | Elapsed(s) = 147.48 Per Step(ms) = 1474.84
> > > > | ns/day = 0.12 seconds/ns = 737418.58
> > > > |
> > > > | Average timings for all steps:
> > > > | Elapsed(s) = 3122.34 Per Step(ms) = 1178.24
> > > > | ns/day = 0.15 seconds/ns = 589120.22
> > > > |
> > > > |
> > > > | Estimated time remaining: 326.4 hours.
> > > >
> > > >
> > > > ------------------------------------------------------------------------------
> > > >
> > > > #PBS -l select=8:ncpus=8
> > > > ...
> > > > nohup mpirun -np 64
> > > >
> > > > ==> T8_ENZ_KP94_CnsP_6ns.mdinfo <==
> > > > | Elapsed(s) = 282.79 Per Step(ms) = 5655.89
> > > > | ns/day = 0.03 seconds/ns = 2827947.20
> > > > |
> > > > | Average timings for all steps:
> > > > | Elapsed(s) = 2922.61 Per Step(ms) = 3896.82
> > > > | ns/day = 0.04 seconds/ns = 1948409.06
> > > > |
> > > > |
> > > > | Estimated time remaining: 1081.6 hours.
> > > >
> > > >
> > > > ------------------------------------------------------------------------------
> > > >
> > > > Best,
> > > > William
> > > >
> > > > On Thu, Oct 20, 2011 at 12:33 PM, Ross Walker <ross.rosswalker.co.uk>
> > > > wrote:
> > > >
> > > > > Hi Lianhu,
> > > > >
> > > > > > I have been struggling with my MD simulation using sander.MPI
> > > > > > for many days. I have tried many things but still cannot figure
> > > > > > out why my parallel run does not speed up. Using 2 nodes is
> > > > > > faster than using 1 node, but when I use 8 nodes the speed is
> > > > > > similar to 1 node. My system has 101,679 atoms. The following
> > > > > > are the details of my tests.
> > > > >
> > > > > This is normal for sander. It generally won't scale much beyond
> > > > > 32 cores or so, especially with these multicore boxes that have a
> > > > > large number of cores in a single box but do not have a
> > > > > corresponding interconnect to match.
> > > > >
> > > > > You don't say what your interconnect is. If it is InfiniBand then
> > > > > you are in with a shot. If it is something else then all bets are
> > > > > off.
> > > > >
> > > > > A few suggestions:
> > > > >
> > > > > > #PBS -v LD_LIBRARY_PATH=/home/appmgr/Software/Openmpi/openmpi-
> > > > > > 1.4.3/exe/lib
> > > > >
> > > > > Consider using MVAPICH instead of OpenMPI. It generally performs
> > > > > better and is optimized for InfiniBand.
> > > > >
> > > > > > #PBS -l select=8:ncpus=8
> > > > >
> > > > > I assume you have 8 real cores per node and not 4 cores and 4
> > > > > hyperthreads? Check this.
> > > > >
> > > > > > #PBS -l select=arch=linux
> > > > > > #PBS -l place=scatter
> > > > >
> > > > > I am not sure what 'scatter' implies here for thread placement. It
> > > > > is probably better just to remove this line altogether and use
> > > > > whatever the default placement is.
> > > > >
> > > > > > export OMP_NUM_THREADS=64
> > > > > > ##unset OMP_NUM_THREADS
> > > > >
> > > > > This does absolutely nothing for sander.
> > > > >
> > > > > >
> > > > > > mpirun -np 64 -machinefile $PBS_NODEFILE sander.MPI -O -i
> > > > > > ENZ_KP94_CnsP_4_6ns.in \
> > > > > > -o ENZ_KP94_CnsP_6ns.out \
> > > > > > -p K94_ENZ.top \
> > > > > > -c ENZ_KP94_CnsP_4ns.rst \
> > > > > > -x ENZ_KP94_CnsP_6ns.mdcrd \
> > > > > > -v ENZ_KP94_CnsP_6ns.mdvel \
> > > > > > -e ENZ_KP94_CnsP_6ns.mden \
> > > > > > -r ENZ_KP94_CnsP_6ns.rst \
> > > > > > -inf ENZ_KP94_CnsP_6ns.mdinfo
> > > > >
> > > > > Consider using pmemd instead of sander. If your input options are
> > > > > supported, then pmemd.MPI will generally run much faster and scale
> > > > > much better than sander.
> > > > >
> > > > > I would also consider manually checking that the MPI threads get
> > > > > placed on the correct nodes, i.e. that you are not just ending up
> > > > > with 64 threads running on the first node.
> > > > >
> > > > > You can also try:
> > > > >
> > > > > #PBS -l select=8:ncpus=4
> > > > >
> > > > > mpirun -np 32 ...
> > > > >
> > > > > Often, leaving cores on a node idle can actually give you higher
> > > > > performance, since then the interconnect is not so overloaded.
> > > > >
> > > > > It would also be helpful to see your input file so we can offer
> > > > > some suggestions on tweaking it for performance. I note you have
> > > > > mdvel and mden specified above. Do you actually need these files?
> > > > > Doing too much I/O can seriously hurt performance in parallel. I
> > > > > would suggest turning off writing to mden and mdvel unless you
> > > > > absolutely need the info in them.
> > > > >
> > > > > The biggest improvement is likely to come from using pmemd.MPI,
> > > > > though.
> > > > >
> > > > > All the best
> > > > > Ross
> > > > >
> > > > > /\
> > > > > \/
> > > > > |\oss Walker
> > > > >
> > > > > ---------------------------------------------------------
> > > > > | Assistant Research Professor |
> > > > > | San Diego Supercomputer Center |
> > > > > | Adjunct Assistant Professor |
> > > > > | Dept. of Chemistry and Biochemistry |
> > > > > | University of California San Diego |
> > > > > | NVIDIA Fellow |
> > > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> > > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> > > > > ---------------------------------------------------------
> > > > >
> > > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> > > > > may not be read every day, and should not be used for urgent or
> > > > > sensitive issues.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > AMBER mailing list
> > > > > AMBER.ambermd.org
> > > > > http://lists.ambermd.org/mailman/listinfo/amber
> > > > >
> > >
> > >
> > >
> > > --
> > > Jason M. Swails
> > > Quantum Theory Project,
> > > University of Florida
> > > Ph.D. Candidate
> > > 352-392-4032
>
>
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
Received on Fri Oct 28 2011 - 08:00:02 PDT