Hi Ross and other experts,
This time I used pmemd.MPI. Generally it runs a lot faster than
sander.MPI, but I still have the same issue when I run on multiple nodes.
I also looked carefully at my input options. All of these runs used the
same input file, just with varying numbers of nodes on the cluster. The
interconnect among the nodes is InfiniBand. In my PBS scripts, I removed
all the unnecessary options. The result is still that on 2 nodes I got the
maximum performance (x1.65 compared to one node). On 4 nodes it was x0.3,
and on 8 nodes it was x0.1. The speed is much slower than running on one
node. My simulation system is about 100K atoms. I checked all the
distributed nodes, and there were 8 threads on each node.
When pmemd is running on 1 or 2 nodes, most of the threads on the nodes
are running. Running on 4 and 8 nodes, I saw many threads in status "S"
(sleeping).
I do not know whether I am not using the options properly, or pmemd is
not distributing the calculation well enough. Please give suggestions.
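For reference, this is roughly how the thread states can be checked on a
compute node (just a sketch; it assumes GNU ps and that the process is
named pmemd.MPI):

```shell
# List every pmemd.MPI thread with its scheduler state and CPU usage.
# STAT column: R = running, S = sleeping (often blocked waiting on MPI).
# A healthy run shows mostly R; "|| true" keeps the pipeline from
# failing when no pmemd process is present.
ps -eLo stat,pcpu,comm --sort=-pcpu | grep '[p]memd' || true
```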
Thanks,
William
Here is my PBS script for 2 nodes:
[william.speed]$ more Qsub_T
#!/bin/bash
#PBS -V
#PBS -r n
#PBS -N ENZ_T
#PBS -l select=2:ncpus=8
#PBS -l select=arch=linux
#PBS -m abe
#PBS -q verylong
#
export WRKDIR=/home/william/Work/PAD4/MD/KP94_ENZ_T
cd $WRKDIR
ulimit -a
nohup mpirun -np 16 -machinefile $PBS_NODEFILE pmemd.MPI -O \
   -i ENZ_KP94_CnsP_4_6ns.in \
-o T_ENZ_KP94_CnsP_6ns.out \
-p K94_ENZ.top \
-c ENZ_KP94_CnsP_4ns.rst \
-x T_ENZ_KP94_CnsP_6ns.mdcrd \
-v T_ENZ_KP94_CnsP_6ns.mdvel \
-e T_ENZ_KP94_CnsP_6ns.mden \
-r T_ENZ_KP94_CnsP_6ns.rst \
-inf T_ENZ_KP94_CnsP_6ns.mdinfo
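To rule out all 16 ranks landing on the first node, a quick check (a
sketch, run from inside the same PBS job) is to launch hostname through
the same mpirun line and count ranks per node:

```shell
# Each MPI rank prints its host name; sort | uniq -c then shows the
# number of ranks per node. For select=2:ncpus=8 with -np 16 you
# would expect two host names, each with a count of 8.
# The guard and "|| true" just keep the snippet harmless outside a job.
command -v mpirun >/dev/null && \
    mpirun -np 16 -machinefile "$PBS_NODEFILE" hostname | sort | uniq -c \
    || true
```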
=========================
This is my pmemd input file:
&cntrl
timlim = 999999,
imin=0,
nmropt = 0,
ntx=7,
irest=1,
ntxo=1,
ntpr=50,
ntwr=50,
iwrap=0,
ntwx=500,
ntwv=500,
ntwe=500,
ntf=2,
ntb=2,
dielc=1.0,
igb=0,
scnb=2.0,
scee=1.2,
nstlim=1000000,
t=10.0,
dt=0.002,
temp0=300,
tempi=300,
heat=0.0,
ntt=1,
tautp=1.0,
vlimit=0.0,
ntp=1,
pres0=1.0,
comp=44.6,
taup=1.0,
npscal=1,
ntc=2,
tol=0.0005,
cut=12.0,
&end
&ewald
a = 104.8389167,
b = 132.7362072,
c = 72.5503008,
alpha=90,
beta=90,
gamma=90,
nfft1=100,
nfft2=144,
nfft3=81,
order=4,
ischrgd=0,
verbose=1,
ew_type=0,
dsum_tol=0.00001,
&end
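One thing I plan to try, per the advice about mdvel/mden elsewhere in this
thread: turn off the velocity and energy archives and write the log and
restart less often. An untested sketch of just the changed &cntrl lines:

```
 &cntrl
   ...
   ntpr=500, ntwr=5000,   ! log/restart far less often than every 50 steps
   ntwx=500,
   ntwv=0, ntwe=0,        ! skip mdvel/mden unless the data is really needed
   ...
 &end
```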
Here is the speed reported by pmemd info:
#PBS -l select=1:ncpus=8
...
nohup mpirun -np 8 ...
==> T1_ENZ_KP94_CnsP_6ns.mdinfo <==
| Elapsed(s) = 69.08 Per Step(ms) = 345.38
| ns/day = 0.50 seconds/ns = 172688.77
|
| Average timings for all steps:
| Elapsed(s) = 3215.09 Per Step(ms) = 347.58
| ns/day = 0.50 seconds/ns = 173788.91
|
|
| Estimated time remaining: 95.7 hours.
------------------------------------------------------------------------------
#PBS -l select=2:ncpus=8
...
nohup mpirun -np 16
==> T_ENZ_KP94_CnsP_6ns.mdinfo <==
| Elapsed(s) = 62.11 Per Step(ms) = 207.05
| ns/day = 0.83 seconds/ns = 103524.14
|
| Average timings for all steps:
| Elapsed(s) = 2305.82 Per Step(ms) = 206.80
| ns/day = 0.84 seconds/ns = 103399.92
|
|
| Estimated time remaining: 56.8 hours.
------------------------------------------------------------------------------
#PBS -l select=4:ncpus=8
...
nohup mpirun -np 32
==> T4_ENZ_KP94_CnsP_6ns.mdinfo <==
| Elapsed(s) = 147.48 Per Step(ms) = 1474.84
| ns/day = 0.12 seconds/ns = 737418.58
|
| Average timings for all steps:
| Elapsed(s) = 3122.34 Per Step(ms) = 1178.24
| ns/day = 0.15 seconds/ns = 589120.22
|
|
| Estimated time remaining: 326.4 hours.
------------------------------------------------------------------------------
#PBS -l select=8:ncpus=8
...
nohup mpirun -np 64
==> T8_ENZ_KP94_CnsP_6ns.mdinfo <==
| Elapsed(s) = 282.79 Per Step(ms) = 5655.89
| ns/day = 0.03 seconds/ns = 2827947.20
|
| Average timings for all steps:
| Elapsed(s) = 2922.61 Per Step(ms) = 3896.82
| ns/day = 0.04 seconds/ns = 1948409.06
|
|
| Estimated time remaining: 1081.6 hours.
------------------------------------------------------------------------------
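To summarize the scaling, the averaged ns/day from each mdinfo file can be
turned into speedups with a small bash sketch (it assumes the mdinfo
format shown above, where ns/day is the 4th field of the "| ns/day = ..."
lines and the averaged value is the last such line in the file):

```shell
# Print each mdinfo file's averaged ns/day and the speedup relative
# to the first file on the command line (the 1-node baseline).
speedup () {
    local base="" f nsday
    for f in "$@"; do
        [ -f "$f" ] || continue
        # ns/day is the 4th field of lines like "| ns/day = 0.50 ...";
        # END picks up the last (averaged) occurrence.
        nsday=$(awk '/ns\/day/ {v=$4} END {print v}' "$f")
        [ -z "$base" ] && base=$nsday
        awk -v n="$nsday" -v b="$base" -v f="$f" \
            'BEGIN {printf "%-34s %5.2f ns/day  x%.2f\n", f, n, n/b}'
    done
}
# usage: speedup T1_*.mdinfo T_*.mdinfo T4_*.mdinfo T8_*.mdinfo
```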
Best,
William
On Thu, Oct 20, 2011 at 12:33 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Lianhu,
>
> > I have been struggling with my MD simulation using sander.MPI for many
> > days. Tried many ways, but still can not figure out why my parallel
> > running
> > is not speed up. Using 2 nodes, is faster than using 1 node. But when
> > I
> > used 8 nodes, the speed is similar as using 1 node. My system have
> > 101,679
> > atoms. The following is the detail of my tests.
>
> This is normal for sander. It generally won't scale much beyond 32 cores or
> so and especially with these multicore boxes that have a large number of
> cores in a single box but do not have a corresponding interconnect to
> match.
>
> You don't say what your interconnect is. If it is InfiniBand then you
> are in with a shot. If it is something else then all bets are off.
>
> A few suggestions:
>
> > #PBS -v LD_LIBRARY_PATH=/home/appmgr/Software/Openmpi/openmpi-
> > 1.4.3/exe/lib
>
> Consider using MVAPICH instead of openmpi. It generally performs better and
> is optimized for infiniband.
>
> > #PBS -l select=8:ncpus=8
>
> I assume you have 8 real cores per node and not 4 cores and 4 hyperthreads?
> - Check this.
>
> > #PBS -l select=arch=linux
> > #PBS -l place=scatter
>
> I am not sure what the 'scatter' implies here for thread placement. It is
> probably better just to remove this line altogether and use whatever the
> default placement is.
>
> > export OMP_NUM_THREADS=64
> > ##unset OMP_NUM_THREADS
>
> This does absolutely nothing for sander.
>
> >
> > mpirun -np 64 -machinefile $PBS_NODEFILE sander.MPI -O \
> >   -i ENZ_KP94_CnsP_4_6ns.in \
> > -o ENZ_KP94_CnsP_6ns.out \
> > -p K94_ENZ.top \
> > -c ENZ_KP94_CnsP_4ns.rst \
> > -x ENZ_KP94_CnsP_6ns.mdcrd \
> > -v ENZ_KP94_CnsP_6ns.mdvel \
> > -e ENZ_KP94_CnsP_6ns.mden \
> > -r ENZ_KP94_CnsP_6ns.rst \
> > -inf ENZ_KP94_CnsP_6ns.mdinfo
>
> Consider using pmemd instead of sander. If your input options are
> supported, then pmemd.MPI will generally be much faster and scale much
> better than sander.
>
> I would also consider manually checking that the MPI threads get placed on
> the correct nodes. I.e. that you are not just ending up with 64 threads
> running on the first node.
>
> You can also try:
>
> #PBS -l select=8:ncpus=4
>
> mpirun -np 32 ...
>
> Often leaving cores on a node idle can actually give you higher performance
> since then the interconnect is not so overloaded.
>
> It would also be helpful to see your input file so we can offer some
> suggestions on tweaking that for performance. I note you have mdvel and
> mden
> specified above. Do you actually need these files? Doing too much i/o can
> seriously hurt performance in parallel. I would suggest turning off writing
> to mden and mdvel unless you absolutely need the info in them.
>
> The biggest improvement is likely to come from using pmemd.MPI though.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Oct 27 2011 - 12:30:02 PDT