Re: [AMBER] Error in PMEMD run from Robert Duke on 2009-05-08 (Amber Archive May 2009)

From: Robert Duke <rduke.email.unc.edu>
Date: Sat, 9 May 2009 01:01:23 +0100

Hi Marek,
I glanced at the diffs, but I will let Ross or somebody more used to looking
at the strange things that may happen in the full suite comment on them. If
pmemd passed all its tests, then it should be good.

At 16 processors, I guess I am not greatly surprised that there are not huge
differences in performance - you expect things to be hitting you more as you
go to 32, 48, 64... So the biggest difference you see is ntt 3 vs ntt 1, and
that I would expect. Where you will see the cut make more of a difference,
honestly, is at relatively low processor count. What happens is that the
recip space and data distrib costs start going up as you scale, while the
direct space costs scale reasonably. I think the run with less frequent
trajectory output coming out slower is a matter of your test times being too
short. Also, is anything else running on this cluster? Any chance,
whatsoever, that there are other jobs running on the actual nodes you are
using? That also makes performance poor and unreliable.

On ntt 3 vs ntt 1: well, I am working with a bunch of guys that still use
ntt 1. There are theoretical objections that can be raised about the quality
of results with this thermostat. With ntt 2 or 3, if you don't change the
random seed at each restart, then your results can have serious artifacts
(another point of some contention). So all sorts of wild things were
happening, it seemed to me, when these thermostats were first introduced (3
in particular), but they were reputed to equilibrate temperature better.
They probably do; you just have to be sure to use a different random seed
with each restart.

I have steered clear of them for three reasons: all of our work went okay
with the older ntt 1; there was this period of bad results, probably due to
not resetting the random seed; and finally, if you really try to scale up,
the random number generation methods will start eating up more and more of
your time and keep you from scaling very well. I expect at 32 cpu it is more
noticeable. It is not a huge effect probably until 64-128+ or so, but that
is an area that is interesting to me.

So that's the history; probably if you don't routinely want to run on a ton
of cpu's and you change the seed religiously, there is virtue in ntt 3, but
many usec have been piled up with ntt 1 over the last decade. Bear in mind,
I am more of a computer guy than an MD guy, though I am trained in both
computer science and the sciences; still, my focus in all this is more on
providing the tools so you all can do the simulations, not on doing them
myself.
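
To make the "different random seed with each restart" advice concrete, here
is a minimal Python sketch that writes one mdin per restart segment, each
with its own seed. It reuses the &cntrl values from the mdin quoted further
down in this thread and adds the seed variable ig; the helper names, the
seed range and the segment numbering are assumptions made for illustration,
not anything prescribed by the AMBER tools, so check the manual for your
version before relying on it.

```python
import secrets

# Template based on the mdin quoted in this thread; only ig=... is added.
MDIN_TEMPLATE = """restart segment {segment}
 &cntrl
  imin=0, irest=1, ntx=5,
  nstlim=1000, dt=0.002,
  ntc=2, ntf=2,
  cut=10.0, ntb=2, ntp=1, taup=2.0,
  ntpr=200, ntwx=200,
  ntt=3, gamma_ln=2.0, ig={seed},
  temp0=310.0,
 /
"""


def fresh_seed() -> int:
    """Return a positive seed from the OS entropy source (range is an assumption)."""
    return secrets.randbelow(999_999) + 1


def make_mdin(segment: int, seed: int) -> str:
    """Fill the template for one restart segment with the given seed."""
    return MDIN_TEMPLATE.format(segment=segment, seed=seed)


if __name__ == "__main__":
    # Print three hypothetical restart inputs, each with its own seed.
    for segment in range(1, 4):
        print(make_mdin(segment, fresh_seed()))
```

Keeping the seed in the generated file (rather than leaving it implicit)
also means each segment records which seed it actually used.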

Okay, last point. Please just benchmark some with factor ix, and see how
what you get compares to what other folks are getting on their clusters. So
the goal here is to try to sort out if there are any problems with your
hardware or software in the performance area. Without comparing something
for which we have data elsewhere, we can't really tell...
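
For comparing against published numbers, timings have to be turned into
nsec/day first. The sketch below does that arithmetic in Python; the
function name is hypothetical, and the example figures (1000 steps,
dt=0.002 ps, 85 s) are the ones quoted further down in this thread. It
assumes the reported time is wall-clock time for the whole run, which on
such a short test also includes setup overhead, so treat the result as a
rough lower bound rather than a benchmark figure.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(nsteps: int, dt_ps: float, wall_seconds: float) -> float:
    """Convert a timed MD run into simulated nanoseconds per day."""
    simulated_ns = nsteps * dt_ps / 1000.0  # ps -> ns
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


if __name__ == "__main__":
    # 1000 steps at dt=0.002 ps in 85 s, as reported in this thread.
    print(f"{ns_per_day(1000, 0.002, 85.0):.2f} ns/day")
```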

Best Regards - Bob

----- Original Message -----
From: "Marek Malý" <maly.sci.ujep.cz>
To: "AMBER Mailing List" <amber.ambermd.org>
Sent: Friday, May 08, 2009 7:37 PM
Subject: Re: [AMBER] Error in PMEMD run

Dear Bob,

thanks a lot for your analysis!

I ran some tests (PMEMD only) of your hypothesis: the same short test as
before, with the same input files and 1000 steps.

In each additional test I changed just one parameter (from my original
configuration) to see its influence on CPU time. As for the node/CPU
setting, I tested only one case: 2 nodes x 8 CPUs = a 16-process job using
all 8 CPUs per node.

my original setting : 85 s

cut = 8 : 84 s
ntpr, ntwx = 1000 : 87 s (strange but true :)) )
ntt = 1 : 78 s
ntt = 2, vrand = 1000 : 83 s
ntt = 3, gamma_ln = 0 : 82 s
t0 = 300 : 87 s

As you can see, there are only small changes compared to my original
setting, which is listed below (in your last reply).
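
To put the timings above on a common scale, a small Python sketch can
express each run as a percentage change from the 85 s baseline. The numbers
are copied from the list above; the labels and function name are only
illustrative. With single 1000-step runs, differences of one or two seconds
are probably within run-to-run noise.

```python
BASELINE_S = 85.0  # original setting

# One changed parameter per run, times in seconds (from the list above).
TIMINGS_S = {
    "cut = 8": 84.0,
    "ntpr, ntwx = 1000": 87.0,
    "ntt = 1": 78.0,
    "ntt = 2, vrand = 1000": 83.0,
    "ntt = 3, gamma_ln = 0": 82.0,
    "t0 = 300": 87.0,
}


def percent_change(seconds: float, baseline: float = BASELINE_S) -> float:
    """Relative change versus the baseline run, in percent (negative = faster)."""
    return 100.0 * (seconds - baseline) / baseline


if __name__ == "__main__":
    # Fastest run first.
    for label, seconds in sorted(TIMINGS_S.items(), key=lambda kv: kv[1]):
        print(f"{label:<24} {seconds:5.0f} s  {percent_change(seconds):+5.1f} %")
```

The only change that stands out is ntt = 1, at roughly 8 % faster than the
baseline.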

Of course, the question is how the influence of the tested parameters
changes in other node/CPU configurations (4/8 CPUs, 4/4 CPUs ...) or in a
longer test, like the 5000 steps you recommended ...

Anyway, in this short test I originally set ntpr and ntwx to 200, but of
course in a real simulation they are much bigger (5000).
Regarding ntt, it seems to me that you do not recommend ntt=3 (at least for
explicit solvent), so what is your favourite choice for this type of
simulation?

OK, and now back to the reliability question.

I ran all the tests with my "ifort 11" build of Amber and the 10.1.019 build
of PMEMD, which just uses the new cc and MKL libs.

Here are the results:

#1 - AmberTools - I think OK
#2 - AmberSerial_MM - I think OK
#3 - AmberSerial_QMMM - I think OK

(please see the attached files)

#4 - AmberParallel_MM

I ran it on one full node = 8 CPUs

Here is my script to run this test:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#!/bin/bash
mpdboot -f ~/.mpd11.hosts -n $NODES
export DO_PARALLEL="mpiexec -np 8"
make test.parallel.MM
mpdallexit
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<,<<<<

A big part of the test passed without any problems, but after a while it got
stuck. I waited approx. 45 min; for that whole period all processors were
100% busy, as this "top" listing shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
   472 mmaly 20 0 153m 14m 5672 R 101 0.1 43:25.90 sander.MPI
   468 mmaly 20 0 153m 14m 5668 R 100 0.1 46:12.26 sander.MPI
   469 mmaly 20 0 157m 16m 7728 R 100 0.1 45:38.51 sander.MPI
   470 mmaly 20 0 153m 14m 5672 R 100 0.1 46:12.23 sander.MPI
   467 mmaly 20 0 157m 16m 7736 R 100 0.1 45:57.42 sander.MPI
   473 mmaly 20 0 159m 16m 7728 R 100 0.1 46:03.65 sander.MPI
   466 mmaly 20 0 153m 14m 5664 R 99 0.1 45:57.73 sander.MPI
   471 mmaly 20 0 157m 16m 7748 R 90 0.1 45:24.29 sander.MPI
   512 mmaly 20 0 10740 1480 1032 R 0 0.0 0:01.18 top

.........

==============================================================
cd PIMD/part_cmd_water/restart && ./Run.cmdyn
diffing cmd.out.save with cmd.out
PASSED
==============================================================
cd PIMD/part_rpmd_water && ./Run.rpmd
diffing spcfw_rpmd.top.save with spcfw_rpmd.top
PASSED
==============================================================
diffing spcfw_rpmd.xyz.save with spcfw_rpmd.xyz
PASSED
==============================================================
diffing spcfw_rpmd.out.save with spcfw_rpmd.out
PASSED
==============================================================
cd ti_mass/pent_LES_PIMD && ./Run.pentadiene
This test not set up for parallel
  cannot run in parallel with #residues < #pes
make[1]: Leaving directory `/home/mmaly/_applications/amber/test'
cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
Testing Centroid MD <<<< - HERE IT GOT STUCK

so I had to kill this process, since I do not believe that this test should
take longer than several minutes on 8 CPUs ...
Anyway, the relevant TEST_FAILURES file was created (please see the attached
TEST_FAILURES_AMBER_PARALLEL_MM.diff).
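
A hung test like this can be caught automatically instead of being watched
in top. The sketch below is a generic Python wrapper that runs a command
with a time limit; the function name and the limits are assumptions, and it
is not part of the AMBER test suite. Note that subprocess.run only kills
the direct child on timeout, so MPI ranks started underneath it may still
need to be cleaned up separately (for example with mpdallexit, as in the
script above).

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The direct child has been killed by subprocess.run at this point.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung test: a Python child that sleeps far past the limit.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    result = run_with_timeout(hung, timeout_s=1)
    print("timed out" if result is None else f"exit code {result}")
```

For the real suite one would pass something like ["make", "test.parallel.MM"]
with a generous limit, run from the test directory.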

#5 - AmberParallel_QMMM

This test crashed very early, as you can see in the listing below:

export TESTsander=/opt/amber/exe/sander.MPI; make test.sander.QMMM
make[1]: Entering directory `/home/mmaly/_applications/amber/test'
cd qmmm2/xcrd_build_test/ && ./Run.oct_nma_imaged
diffing mdout.oct_nma_imaged.save with mdout.oct_nma_imaged
PASSED
==============================================================
cd qmmm2/xcrd_build_test/ && ./Run.oct_nma_noimage
diffing mdout.oct_nma_noimage.save with mdout.oct_nma_noimage
PASSED
==============================================================
cd qmmm2/xcrd_build_test/ && ./Run.ortho_qmewald0

  * NB pairs 145 185645 exceeds capacity ( 185750) 3
      SIZE OF NONBOND LIST = 185750
  SANDER BOMB in subroutine nonbond_list
  Non bond list overflow!
  check MAXPR in locmem.f
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
rank 3 in job 3 enode11_56157 caused collective abort of all ranks
   exit status of rank 3: return code 1
   ./Run.ortho_qmewald0: Program error
make[1]: *** [test.sander.QMMM] Error 1
make[1]: Leaving directory `/home/mmaly/_applications/amber/test'
make: *** [test.sander.QMMM.MPI] Error

There is some problem with MAXPR, but as I learned (after looking at the
file locmem.f), this is not a typical constant but a variable that is
evaluated by the program itself, or am I wrong?

Anyway, can I do something to prevent this error and get through the whole
AmberParallel_QMMM test?

#6 PMEMD test

Absolutely without problems. Everything passed after a while and no
TEST_FAILURES file was created.

Bob, I would be very grateful if you could look at the attached files and at
least indicate :)) whether my installation seems to be reliable, or whether
it would be better to do a complete reinstallation using your recommended
ifort 10.1.021 ...

Thank you very much in advance!

    Best,

      Marek

On Fri, 08 May 2009 21:44:08 +0200, Robert Duke <rduke.email.unc.edu>
wrote:

> Ah, now we are getting somewhere!
> A 60000 atom system - that is fine.
> Now, let's look at the mdin file you sent:
> heat ras-raf
> &cntrl
> imin=0,irest=1,ntx=5,
> nstlim=1000,dt=0.002,
> ntc=2,ntf=2,
> cut=10.0, ntb=2, ntp=1, taup=2.0,
> ntpr=200, ntwx=200,
> ntt=3, gamma_ln=2.0,
> temp0=310.0,
> /
>
> Here, things get interesting. Let's go through the potential problems
> in the order they occur:
>
> cut=10.0 - This is a really big cutoff for pme, generally unnecessary.
> The default cut is 8 angstrom; you will run roughly twice as slow for
> your direct space calcs with a cutoff this big. Not really a great idea
> (some folks go to 9 angstrom to get a longer vdw interaction; with pmemd
> you can actually just increase the vdw while leaving the electrostatic
> cut at 8 and get better performance). Now the other thing - if you are
> having trouble with scaling, larger cutoffs will slow you down even more
> because there is more information interchange.
>
> ntwx=200 - You are dumping a trajectory snapshot every 0.4 psec - this
> is not outrageous, but is probably also a bit of overkill. You could
> probably print every psec and be fine (ntwx=500). If your disk is at
> all slow, this will hurt. It sounded like what you were doing on the
> disks is okay, as long as there is not some screwy nfs mount issue
> (sounds like there is not).
>
> ntt=3 - AhHa! This is a Langevin thermostat. There is a huge
> inefficiency here, associated with random number generation. I don't
> know how expensive it gets, but it does get expensive, and I view ntt 3
> as not a production tool for this reason. Others undoubtedly disagree,
> as lots of folks like this thermostat. BUT the way it is currently
> implemented, it really kills scaling.
>
> temp0=310. Additional motion at higher temp. More listbuilds. Less
> efficient (but you are driving the dynamics further in less time).
> Probably a very small effect.
>
> nstlim=1000 - PMEMD is still adjusting the run parameters out to roughly
> step 4000. So for higher scaling stuff, I typically do about 5000 steps
> minimum to see what is going on.
>
> - This stuff is at least some of the reason you are not scaling as well
> as one might hope... The devil is in the details, and he can be a real
> pain...
>
> Best Regards - Bob
>
> ----- Original Message -----
> From: "Marek Malý" <maly.sci.ujep.cz>
> To: "AMBER Mailing List" <amber.ambermd.org>
> Sent: Friday, May 08, 2009 3:10 PM
> Subject: Re: [AMBER] Error in PMEMD run
>
>
> Hi Bob,
>
> my test system is a generation-4 PPI dendrimer + explicit water,
> approx. 60000 atoms in total.
>
> Here are the input files for testing:
>
> http://physics.ujep.cz/~mmaly/MySystem/
>
> I know it is not a big system, but for a benchmark on 16-32 CPUs it is
> OK, I think, or am I wrong?
>
> For testing I used just 1000 steps from the equilibration phase (NPT
> simulation, see equil_DEN_PPIp_D.in).
>
> Regarding the disk question:
>
> Each node has its own local hard drive (SATA, 250 GB), so I run my jobs
> from the first node listed in the relevant .mpd.hosts file.
>
> Let's say I want to run my job on 2 nodes (for example 11 and 12):
> I go to the local disk of node 11 and run the job from there.
>
> These local disks are not shared yet.
>
> Regarding to MPI, we are using Intel MPI (actually version 3.2.0.011).
>
> here are my config commands for compilation of parallel Amber/PMEMD:
>
> ./configure_amber -intelmpi ifort (Parallel Amber)
>
> ./configure linux_em64t ifort intelmpi (PMEMD)
>
> We have 14 nodes in total; each node = 2 x Intel Xeon Quad-core 5365
> (3.00 GHz) = 8 CPUs.
> The nodes are connected using "Cisco InfiniBand".
>
> So that's all I can say about my test system and our cluster.
>
> Thanks for your time !
>
> Best,
>
> Marek
>
>
>
>
>
>
> On Fri, 08 May 2009 20:24:35 +0200, Robert Duke <rduke.email.unc.edu>
> wrote:
>
>> Yes, Ross makes points I was planning on making next. We need to know
>> your benchmark. You should be running something like JAC, or even
>> better yet, factor ix, from the benchmarks suite. Then you should
>> convert your times to nsec/day and compare to some of the published
>> values at www.ambermd.org to have a clue as to just how well or badly
>> you are doing. Once you have a reasonable benchmark (not too small,
>> balanced i/o, not asking for extra features that are known not to
>> scale, etc etc), then we can look for other problems. Given a GOOD
>> infiniband setup (high bandwidth, configured correctly, balance
>> between pci express and the infiniband hca's, well-scaled infiniband
>> switch layout, no noise from loose cables, etc etc etc), then the next
>> likely source of grief is the disk. Are you all perhaps using an
>> nfs-mounted volume, and even worse, one volume, not a parallel file
>> system, being written to by multiple running jobs? Bad idea.
>> Parallel jobs will hang like crazy waiting for the master to do disk
>> i/o. Is mpi really set up correctly? The only way you know is if the
>> setup has passed other benchmarks (I typically tell by comparison of
>> pmemd on the candidate system to other systems, but believe me, mpi
>> can really be screwed up pretty easily). Which mpi? OpenMPI is known
>> to be bad with infiniband (I don't know if it is actually "good" with
>> anything). Intel mpi is supposed to be good, but I have never tried
>> to jump through all the configuration hoops. MVAPICH is pretty
>> standard; once again, though, because I don't admin a system of this
>> type, I have no idea how hard it is to get everything right. I am
>> really sorry you are having so much "fun" with all this; I know it
>> must be frustrating, but there is a reason bigger clusters get run by
>> staff. By the way, how big is the cluster?
>> Best Regards - Bob
>> ----- Original Message -----
>> From: "Ross Walker" <ross.rosswalker.co.uk>
>> To: "'AMBER Mailing List'" <amber.ambermd.org>
>> Sent: Friday, May 08, 2009 2:11 PM
>> Subject: RE: [AMBER] Error in PMEMD run
>>
>>
>> Hi Marek,
>>
>> I don't think I've seen anywhere what the actual simulation you are
>> running is. This will have a huge effect on parallel scalability. With
>> infiniband and a 'reasonable' system size you should easily be able to
>> get beyond 2 nodes. Here are some numbers for the JAC NVE benchmark from
>> the suite provided on http://ambermd.org/amber10.bench1.html
>>
>> This is for NCSA Abe, which is dual quad-core Clovertown (E5345 2.33GHz,
>> so very similar to your setup) and uses SDR infiniband.
>>
>> Using all 8 processors per node (time for benchmark in seconds):
>> 8 ppn 8 cpu 364.09
>> 8 ppn 16 cpu 202.65
>> 8 ppn 24 cpu 155.12
>> 8 ppn 32 cpu 123.63
>> 8 ppn 64 cpu 111.82
>> 8 ppn 96 cpu 91.87
>>
>> Using 4 processors per node (2 per socket):
>> 4 ppn 8 cpu 317.07
>> 4 ppn 16 cpu 178.95
>> 4 ppn 24 cpu 134.10
>> 4 ppn 32 cpu 105.25
>> 4 ppn 64 cpu 83.28
>> 4 ppn 96 cpu 67.73
>>
>> As you can see, it is still scaling to 96 cpus (24 nodes at 4 threads
>> per node). So I think you must either be running an unreasonably small
>> system to expect scaling in parallel, or there is something very wrong
>> with the setup of your computer.
>>
>> All the best
>> Ross
>>
>>> -----Original Message-----
>>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
>>> Behalf Of Marek Malý
>>> Sent: Friday, May 08, 2009 10:58 AM
>>> To: AMBER Mailing List
>>> Subject: Re: [AMBER] Error in PMEMD run
>>>
>>> Hi Gustavo,
>>>
>>> thanks for your suggestion, but we have only 14 nodes in our cluster
>>> (each node = 2 x Xeon Quad-core 5365 (3.00 GHz) = 8 CPUs per node,
>>> connected with "Cisco InfiniBand").
>>>
>>> If I allocate 8 nodes and use just 2 CPUs per node for one of my jobs,
>>> it means that 8x6 = 48 CPUs will be wasted. In this case I am sure that
>>> my colleagues will kill me :)) Moreover, I do not assume that the
>>> 8/2CPU combination will have significantly better performance than
>>> 2/8CPU, at least in the case of PMEMD.
>>>
>>> But anyway, thank you for your opinion/experience !
>>>
>>> Best,
>>>
>>> Marek
>>>
>>>
>>>
>>>
>>> On Fri, 08 May 2009 19:28:35 +0200, Gustavo Seabra
>>> <gustavo.seabra.gmail.com> wrote:
>>>
>>> >> the best performance I have obtained in case of using combination
>>> >> of 4 nodes and 4 CPUs (from 8) per node.
>>> >
>>> > I don't know exactly what you have in your system, but I gather you
>>> > are using 8core-nodes, and from it you got the best performance by
>>> > leaving 4 cores idle. Is that correct?
>>> >
>>> > In this case, I would suggest that you go a bit further, and also
>>> > test using only 1 or 2 cores per node, i.e., leaving the remaining
>>> > 6-7 cores idle. So, for 16 MPI processes, try allocating 16 or 8
>>> > nodes. (I didn't see this case in your tests)
>>> >
>>> > AFAIK, the 8-core nodes are arranged in 2 4-core sockets, and the
>>> > communication between cores, which is already bad within the 4 cores
>>> > in the same socket, gets even worse when you need to pass information
>>> > between two sockets. Depending on your system, if you send 2
>>> > processes to the same node, it may put both in the same socket or
>>> > automatically split them one per socket. You may also be able to tell
>>> > it to make sure that they get split into 1 process per socket. (Look
>>> > into the mpirun flags.) From the tests we've run on that kind of
>>> > machine, we do get the best performance by leaving ALL BUT ONE core
>>> > idle in each socket.
>>> >
>>> > Gustavo.
>>> >
>>> > _______________________________________________
>>> > AMBER mailing list
>>> > AMBER.ambermd.org
>>> > http://lists.ambermd.org/mailman/listinfo/amber
>>> >
>


Received on Wed May 20 2009 - 15:15:45 PDT