Thanks, Bob, for the details.
I tested on two different machines. Not really different architectures,
but different clusters (different generations of AMD64, 4 cores/node).
The problem is reproducible; the time at which it appears differs from
run to run and looks random.
What is odd is that I have never observed this with pmemd from AMBER 9,
which made me think it might be a compilation issue (we compiled with
pgi and openmpi), but of course that does not make much sense. I could
test without ntave (in the AMBER 9 runs I did not use ntave). I will
also try modifying the mdout_flush_interval. I'll let you know if
something changes.
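For that test I would simply add something like the line below to
&cntrl (the 30-second value is just an example; per Bob's description
anything in the allowed 0 to 3600 range should work):

  mdout_flush_interval=30,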
Best
vlad
Robert Duke wrote:
> Hmmm. Two things to try, Vlad. Can you reproduce this on another
> (different type of) machine? Secondly, the only difference I can
> think of in how mdout is processed between amber 9 and amber 10 is the
> energy average sampling switch. But I just looked at your output, and
> because you are using ntave, energy average sampling is turned off, so
> we are not hitting some weird combination of events there. Also,
> really, looking at the output, the only way this can be happening is
> in the fortran i/o buffers for mdout. It looks to me as if the
> pointers into that buffer are getting messed up, and that is below the
> application code. Now, what does pmemd do that is different from
> sander? Well, for one thing it flushes mdout at regular intervals,
> based on a timer, by closing and then reopening it. The frequency of
> this activity is controlled by the mdout_flush_interval namelist
> variable in &cntrl. It defaults to a close/open every 300 seconds,
> and can be set over the range of 0 to 3600. You can dink with this to
> see if your problem moves. I suspect some weird problem with the
> close/open calls on this machine, still being attached to stdout with
> some unexpected results, or some such, but I don't know. The reasons
> this mechanism exists in pmemd: 1) there really is no standard flush
> mechanism in fortran (at least the last time I looked), and 2) on some
> really big machines (the xt3 comes to mind) flushing could be delayed
> for hours (at least as best I recollect), so it was possible for folks
> to run a simulation and not be able to see mdout until the run
> completed. I did not want constant flushing for performance reasons,
> but I did want some visibility into how the simulation was proceeding,
> so I put both mdinfo and mdout on flush timers, closing them and
> reopening in append mode. This has got to be a problem with the
> compile and/or the libraries for this specific build, or some
> idiosyncrasy ("feature") of this machine; the relevant code has simply
> not changed.
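> In case it helps you poke at it, the flush amounts to something like
> the following (just an illustrative sketch; the unit number and names
> here are made up, not the actual pmemd source):
>
>   subroutine flush_mdout(mdout_unit, mdout_name)
>     ! Push buffered output to disk by closing the file and
>     ! immediately reopening it in append mode; done this way because
>     ! fortran has no standard flush call.
>     integer, intent(in)          :: mdout_unit
>     character(len=*), intent(in) :: mdout_name
>     close(mdout_unit)
>     open(unit=mdout_unit, file=mdout_name, status='old', &
>          position='append', form='formatted')
>   end subroutine flush_mdout
>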
> By the way, is there a reason you get an nve simulation by setting
> ntp=1 but then setting a very high value of taup? This has got to be
> extraordinarily inefficient (keeping track of all the pressure stuff
> has a really big cost, especially at high scaling).
> Regards - Bob Duke
>
> ----- Original Message -----
> *From:* Vlad Cojocaru
> *To:* amber.scripps.edu
> *Sent:* Thursday, July 10, 2008 6:19 AM
> *Subject:* Re: AMBER: pmemd 10 output
>
> Dear Amber users,
>
> Coming back to the pmemd 10 output problem I reported in the
> thread below, I tested different nodes (writing locally as well
> as over the network), with iwrap=1 and iwrap=0, and the problem is
> very reproducible: I get it every time I run pmemd 10, but not with
> sander.MPI 10 or amber 9. Attached is a sample of the output. This
> is very strange.
>
> If anybody is able to explain this, I'd be very grateful for some
> suggestions (it could be a compilation issue). If it were a file
> system issue, why does it not happen with any other executable?
>
> Best wishes
> vlad
>
> -----------------input script --------------------
> # NVE production run
> &cntrl
> imin=0, ntx=5, irest=1, ntrx=1, ntxo=1,
> ioutfm=1, ntave=100000, iwrap=0,
> ntpr=250, ntwx=1000, ntwv=25000, ntwe=25000,
> ntf=1, ntb=2,
> dielc=1.0, scnb=2.0, scee=1.2, cut=10.0,
> nsnb=100, igb=0,
> ntr=0,
> nstlim=1000000,
> t=0.0, dt=0.001,
> ntt=3, gamma_ln=0.0, tempi=300.0, temp0=300.0,
> vlimit=15,
> ntp=1, taup=9999999, pres0=1.0, comp=44.6,
> ntc=2, tol=0.00000001,
> /
>
>
>
>
>
> Ross Walker wrote:
>>
>> Hi Vlad,
>>
>> This really does look to me like an issue with your file system.
>> I have never seen this from PMEMD myself, and I can't see how you
>> would end up in this situation; it looks more to me like you have
>> some kind of malfunctioning RAID device or something.
>>
>> I have seen something similar to this on GPFS parallel file
>> systems where one of the metadata servers had failed such that
>> you only see, for example, 4/5 of the striped data. This can happen
>> in both read and write mode, i.e., a perfectly good file on disk
>> can be read back by the user as bad because of the striping
>> issue, or, if the error occurs during a write, the data can get
>> written to disk with chunks missing.
>>
>> How reproducible is the problem? Can you try running it, writing
>> to a local scratch disk on the master node instead of a network
>> drive (if that is what you were doing), to see whether the problem recurs?
>>
>> All the best
>>
>> Ross
>>
>> *From:* owner-amber.scripps.edu
>> *On Behalf Of* Vlad Cojocaru
>> *Sent:* Friday, June 27, 2008 9:25 AM
>> *To:* amber.scripps.edu
>> *Subject:* Re: AMBER: pmemd 10 output
>>
>> Hi Ross,
>>
>> Yes, at some point the ---- lines are truncated, the "check COM
>> velocity" phrase overflows into the data lines, VOLUME stops being
>> printed, and towards 100000 steps I get lines where "check COM"
>> appears after NSTEP, and so on; the output gets really messy.
>>
>> As for the input, I am well aware of the performance loss from
>> running NVE this way. However, this was a test run in which I
>> wanted to follow the pressure of the system, and unfortunately
>> ntp=0 does not allow that.
>>
>> Best
>> vlad
>>
>>
>> Ross Walker wrote:
>>
>> Hi Vlad,
>>
>> I assume you mean the truncated --- lines, the missing data, and the missing
>> carriage returns. This looks to me like a file system issue where your
>> machine is actually not writing to disk properly. If this is over an NFS
>> mount then I would run some serious stress tests on the system to make sure
>> things are working properly.
>>
>> Also, you may want to note that your input file is probably not optimal for
>> performance. You have:
>>
>> ntp=1, taup=9999999, pres0=1.0, comp=44.6,
>>
>>
>> This is effectively the same as running constant volume with ntb=1.
>> However, computationally it still runs NPT, which involves much more
>> communication. This generally affects parallel scaling more than low
>> processor-count performance.
>>
>> Generally the performance goes as:
>>
>> NVE > NVT > NPT
>>
>> And for thermostats:
>>
>> NTT=0 > NTT=1 >> NTT=3
>>
>> Hence you are running an NVT calculation but paying the performance penalty
>> of an NPT calculation.
>>
>> All the best
>> Ross
>>
>>
>>
>> -----Original Message-----
>> From: owner-amber.scripps.edu On Behalf Of Vlad Cojocaru
>> Sent: Friday, June 27, 2008 8:49 AM
>> To: AMBER list
>> Subject: AMBER: pmemd 10 output
>>
>> Dear Amber users,
>>
>> The pmemd of AMBER 10 produces some really strange-looking output (see
>> attached; the three lines of dots between NSTEP=250 and NSTEP=56500 are
>> there to indicate that I truncated the output). What is actually strange
>> is that the output looks fine until NSTEP=57500; only after that is the
>> output messed up.
>>
>> I haven't noticed this with any previous version of pmemd, nor with
>> sander.MPI from amber 10.
>>
>> Thanks
>> vlad
>>
>
>
>
>
--
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
to majordomo.scripps.edu
Received on Sun Jul 13 2008 - 06:07:27 PDT