Re: AMBER: pmemd 10 output

From: Vlad Cojocaru <Vlad.Cojocaru.eml-r.villa-bosch.de>
Date: Thu, 10 Jul 2008 16:18:28 +0200

Bob,

A small correction regarding the architecture and compilation:

The nodes are dual-CPU, and each CPU is a dual-core AMD Opteron (4
cores per node in total). I used the term AMD64 incorrectly; I meant
the x86_64 architecture. The compiler we used for AMBER 10 was PGI 7.1
and the MPI library was OpenMPI 1.2.5. The nodes are interconnected via
gigabit ethernet.

Vlad


Robert Duke wrote:
> I would consider trying MPICH 1 or 2, which I do support and which
> have been tested extensively with pmemd (if this is Infiniband, then I
> believe MVAPICH gives better performance than OpenMPI, based on at
> least one set of benchmarks I have seen). Given that this is random,
> it could be buffer problems somewhere, or heaven only knows, but it is
> not likely a matter of the combination of the output params. I would
> also look at PGI and see what, if anything, they may have done to
> Fortran I/O, just in case. This could very easily be a subtle "linked
> to the wrong stuff" problem, but I am just throwing out wild guesses
> here. You are basically exploring all the corners of the hardware and
> software space here - AMD64, OpenMPI, PGI - all stuff not routinely
> tested with pmemd (either due to availability or because, given the
> options, I test and recommend the stuff that has the best performance
> and is most reliable). The thing that worries me about all this is
> that it suggests a memory stomp on the file I/O buffer control
> variables at some point, and I am wondering what else might be getting
> stomped on.
> Regards - Bob
> Regards - Bob
>
> ----- Original Message -----
> From: Vlad Cojocaru <Vlad.Cojocaru.eml-r.villa-bosch.de>
> To: amber.scripps.edu
> Sent: Thursday, July 10, 2008 9:26 AM
> Subject: Re: AMBER: pmemd 10 output
>
> Thanks Bob for the details
>
> I tested on two different machines. These are not really different
> architectures, just different clusters (different generations of
> AMD64 nodes with 4 cores/node). The problem is reproducible. The time
> at which it appears differs from run to run and looks random.
>
> What is weird is that I have never observed this with pmemd from
> AMBER 9, which made me think it might be a compilation issue (we
> compiled with PGI and OpenMPI), but of course that does not make too
> much sense. I could test without ntave (in the AMBER 9 runs I did not
> use ntave). I will also try modifying mdout_flush_interval (a small
> example of what I mean is below). I'll let you know if something
> changes.
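>
> For reference, the change I plan to test is just adding one variable
> to the &cntrl block of the input quoted below, leaving everything else
> unchanged. The value of 0 is my assumption for "flush as often as
> allowed", based on Bob's description of the 0-3600 second range; 3600
> would go to the other extreme:
>
>    &cntrl
>     ...
>     mdout_flush_interval=0,
>    /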
>
> Best
> vlad
>
> Robert Duke wrote:
>> Hmmm. Two things to try, Vlad. Can you reproduce this on another
>> (different type of) machine? Secondly, the only difference I can
>> think of in how mdout is processed between AMBER 9 and AMBER 10 is
>> the energy average sampling switch. But I just looked at your output,
>> and because you are using ntave, energy average sampling is turned
>> off, so we are not hitting some weird combination of events there.
>> Also, really, looking at the output, the only way this can be
>> happening is in the Fortran I/O buffers for mdout. It looks to me as
>> if the pointers to that buffer are somehow getting messed up, and
>> that is below the application code.
>>
>> Now, what does pmemd do that is different from sander? Well, for one
>> thing it flushes mdout at regular intervals based on a timer, by
>> closing and then reopening it. The frequency of this activity is
>> controlled by the mdout_flush_interval namelist variable in &cntrl.
>> It defaults to a close/open every 300 seconds, and can be set over
>> the range of 0 to 3600. You can dink with this to see if your problem
>> moves. I suspect some weird problem with the close/open calls on this
>> machine, perhaps still being attached to stdout with some unexpected
>> results, or some such, but I don't know. The reasons this mechanism
>> exists in pmemd: 1) there really is no standard flush mechanism in
>> Fortran (at least the last time I looked), and 2) on some really big
>> machines (the XT3 comes to mind) flushing could be delayed for hours
>> (at least as best I recollect), so it was possible for folks to run a
>> simulation and not be able to see mdout until the run completed. I
>> did not want constant flushing for performance reasons, but I did
>> want some visibility into how the simulation was proceeding, so I put
>> both mdinfo and mdout on flush timers, closing them and reopening
>> them in append mode. This has got to be a problem with the compile
>> and/or the libraries for this specific build, or some idiosyncrasy
>> ("feature") of this machine. The relevant code has simply not
>> changed.
>>
>> By the way, is there a reason you get an NVE simulation by setting
>> ntp=1 but then setting a very high value of taup? This has got to be
>> extraordinarily inefficient (keeping track of all the pressure stuff
>> has a really big cost, especially at high scaling).
>> Regards - Bob Duke
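>>
>> P.S. In case a picture of the mechanism helps: the following is a
>> minimal, self-contained sketch of the close-and-reopen-in-append
>> idiom described above. It is only an illustration, not the actual
>> pmemd source; the file name and unit number are made up.
>>
>>   program flush_sketch
>>     implicit none
>>     integer, parameter :: u = 17    ! arbitrary unit number for the demo
>>     integer :: step
>>     open(unit=u, file='mdout_demo', status='replace', action='write')
>>     do step = 1, 3
>>       write(u, '(a,i6)') ' NSTEP = ', step
>>       ! "flush" by closing (which forces the runtime to write out its
>>       ! buffer) and immediately reopening the same file in append mode
>>       close(u)
>>       open(unit=u, file='mdout_demo', status='old', action='write', position='append')
>>     end do
>>     close(u)
>>   end program flush_sketch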
>>
>> ----- Original Message -----
>> From: Vlad Cojocaru <Vlad.Cojocaru.eml-r.villa-bosch.de>
>> To: amber.scripps.edu
>> Sent: Thursday, July 10, 2008 6:19 AM
>> Subject: Re: AMBER: pmemd 10 output
>>
>> Dear Amber users,
>>
>> Coming back to the pmemd 10 output problem I reported in the thread
>> below: I tested different nodes (writing locally as well as over the
>> network), with iwrap=1 and iwrap=0, and the problem is fully
>> reproducible. I get it every time I run pmemd 10, but never with
>> sander.MPI from AMBER 10 or with AMBER 9. Attached is a sample of
>> the output. This is very strange.
>>
>> If anybody is able to explain this, I'd be very grateful for some
>> suggestions (it could be a compilation issue). If it were a file
>> system issue, why does it not happen with any other executable?
>>
>> Best wishes
>> vlad
>>
>> -----------------input script --------------------
>> # NVE production run
>> &cntrl
>> imin=0, ntx=5, irest=1, ntrx=1, ntxo=1,
>> ioutfm=1, ntave=100000, iwrap=0,
>> ntpr=250, ntwx=1000, ntwv=25000, ntwe=25000,
>> ntf=1, ntb=2,
>> dielc=1.0, scnb=2.0, scee=1.2, cut=10.0,
>> nsnb=100, igb=0,
>> ntr=0,
>> nstlim=1000000,
>> t=0.0, dt=0.001,
>> ntt=3, gamma_ln=0.0, tempi=300.0, temp0=300.0,
>> vlimit=15,
>> ntp=1, taup=9999999, pres0=1.0, comp=44.6,
>> ntc=2, tol=0.00000001,
>> /
>>
>>
>>
>>
>>
>> Ross Walker wrote:
>>>
>>> Hi Vlad,
>>>
>>> This really does look to me like an issue with your file system - I
>>> have never seen this from PMEMD myself and I can't see how you would
>>> end up in this situation. It looks more to me like you have some
>>> kind of malfunctioning RAID device or something.
>>>
>>> I have seen something similar on GPFS parallel file systems where
>>> one of the metadata servers had failed, such that you only see, for
>>> example, 4/5 of the striped data. This can happen in both read and
>>> write mode, i.e. a perfectly good file on disk can be read back by
>>> the user as bad because of the striping issue, or, if the error
>>> occurs during a write, the data can get written to disk with chunks
>>> missing.
>>>
>>> How reproducible is the problem? Can you try running it while
>>> writing to a local scratch disk on the master node instead of a
>>> network drive (if that is what you were doing) and see if the
>>> problem recurs?
>>>
>>> All the best
>>>
>>> Ross
>>>
>>> From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
>>> Sent: Friday, June 27, 2008 9:25 AM
>>> To: amber.scripps.edu
>>> Subject: Re: AMBER: pmemd 10 output
>>>
>>> Hi Ross,
>>>
>>> Yes, at some point the ---- lines get truncated and the "check COM
>>> velocity" phrase overflows into the data lines. VOLUME stops being
>>> printed, and towards 100000 steps I get lines where "check COM"
>>> appears after NSTEP, and so on; the output gets really messy.
>>>
>>> As for the input, I am well aware of the performance loss from
>>> running NVE this way. However, this was a test run in which I wanted
>>> to follow the pressure of the system, and unfortunately ntp=0 does
>>> not allow that.
>>>
>>> Best
>>> vlad
>>>
>>>
>>> Ross Walker wrote:
>>>
>>> Hi Vlad,
>>>
>>> I assume you mean the truncated --- lines, the missing data, and the missing
>>> carriage returns. This looks to me like a file system issue where your
>>> machine is actually not writing to disk properly. If this is over an NFS
>>> mount then I would run some serious stress tests on the system to make sure
>>> things are working properly.
>>>
>>> Also, you may want to note that your input file is probably not optimal for
>>> performance. You have:
>>>
>>> ntp=1, taup=9999999, pres0=1.0, comp=44.6,
>>>
>>>
>>> That is effectively the same as running constant volume with ntb=1.
>>> However, computationally it still runs NPT, which involves much more
>>> communication. This generally affects parallel scaling more than low
>>> processor count performance.
>>>
>>> Generally the performance goes as:
>>>
>>> NVE > NVT > NPT
>>>
>>> And for thermostats:
>>>
>>> NTT=0 > NTT=1 >> NTT=3
>>>
>>> Hence you are effectively running an NVT calculation but paying the
>>> performance penalty of an NPT calculation.
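>>>
>>> If you do not actually need the pressure terms, a constant-volume
>>> equivalent of your &cntrl would look something like the sketch below
>>> (only the relevant lines are shown, everything else kept as in your
>>> input, and the taup/pres0/comp settings dropped):
>>>
>>>    &cntrl
>>>     ...
>>>     ntb=1, ntp=0,
>>>    /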
>>>
>>> All the best
>>> Ross
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
>>> Sent: Friday, June 27, 2008 8:49 AM
>>> To: AMBER list
>>> Subject: AMBER: pmemd 10 output
>>>
>>> Dear Amber users,
>>>
>>> The pmemd of AMBER 10 produces some really strange-looking output
>>> (see attached; the three lines of dots between NSTEP=250 and
>>> NSTEP=56500 are there to indicate that I truncated the output). What
>>> is actually strange is that the output looks fine until NSTEP=57500.
>>> Only after that does the output get messed up.
>>>
>>> I haven't noticed this with any previous version of pmemd, nor with
>>> sander.MPI from AMBER 10.
>>>
>>> Thanks
>>> vlad

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo.scripps.edu
Received on Sun Jul 13 2008 - 06:07:29 PDT