Re: AMBER: pmemd 10 output from Robert Duke on 2008-07-10 (Amber Archive Jul 2008)

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 10 Jul 2008 09:12:25 -0400

Hmmm. Two things to try, Vlad. Can you reproduce this on another (different type of) machine? Secondly, the only difference I can think of in how mdout is processed between amber 9 and amber 10 is the energy average sampling switch. But I just looked at your output, and because you are using ntave, energy average sampling is turned off, so we are not hitting some weird combination of events there.

And really, looking at the output, the only way this can be happening is in the fortran i/o buffers for mdout. It looks to me like the pointers into that buffer are somehow getting messed up - and that is below the application's code.

Now, what does pmemd do that is different than sander? Well, for one thing it flushes mdout at regular intervals based on a timer by closing and then reopening it. The frequency of this activity is controlled by the mdout_flush_interval namelist variable in &cntrl. It defaults to a close/open every 300 seconds, and can be set over the range of 0 to 3600. You can dink with this to see if your problem moves. I suspect some weird problem with the close/open calls on this machine, with the file still being attached to stdout with some unexpected results, or some such, but I don't know.

The reasons this mechanism exists in pmemd: 1) there really is no standard flush mechanism in fortran (at least last time I looked), and 2) on some really big machines (the xt3 comes to mind) flushing could be delayed for hours (at least as best I recollect), so it was possible for folks to run a simulation and not be able to see mdout until the run completed. I did not want constant flushing for performance reasons, but I did want some visibility into how the simulation was proceeding, so I put both mdinfo and mdout on flush timers, closing them and reopening in append mode.

This has got to be a problem with the compile and/or the libraries for this specific build, or some idiosyncrasy ("feature") of this machine. The relevant code has simply not changed.
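[Editorial sketch] The timer-driven close/reopen described above can be illustrated roughly as follows. This is a minimal Python sketch of the general technique, not pmemd's Fortran code: the class name `TimedReopenWriter`, the `demo` helper, and the injectable `clock` argument are all hypothetical, and only the 300-second default and the "close, then reopen in append mode" behaviour are taken from the message.

```python
import time


class TimedReopenWriter:
    """Write text to a file, closing and reopening it in append mode once
    `flush_interval` seconds have elapsed since the last reopen.
    (Hypothetical illustration; not taken from the pmemd source.)"""

    def __init__(self, path, flush_interval=300.0, clock=time.monotonic):
        self.path = path
        self.flush_interval = flush_interval
        self.clock = clock
        self.reopen_count = 0
        self._fh = open(path, "w")  # the first open truncates, like a fresh run
        self._last_flush = clock()

    def write(self, text):
        self._fh.write(text)
        # Only check the timer when something is written, so an idle
        # file is never touched.
        if self.clock() - self._last_flush >= self.flush_interval:
            self._reopen()

    def _reopen(self):
        # close() pushes buffered data out to the file; reopening in
        # append mode continues after what is already there.
        self._fh.close()
        self._fh = open(self.path, "a")
        self._last_flush = self.clock()
        self.reopen_count += 1

    def close(self):
        self._fh.close()


def demo(path, nsteps=1000, ntpr=250):
    """Write one line every `ntpr` steps using a fake clock that advances
    200 s per write, and return how often the file was reopened."""
    fake_time = [0.0]
    writer = TimedReopenWriter(path, flush_interval=300.0,
                               clock=lambda: fake_time[0])
    for step in range(ntpr, nsteps + 1, ntpr):
        fake_time[0] += 200.0
        writer.write(f" NSTEP = {step:8d}\n")
    writer.close()
    return writer.reopen_count
```

With the fake clock in `demo`, four lines are written and the file is closed and reopened twice. If a run of this kind produces intact output on a given machine while pmemd does not, that would point toward the Fortran runtime or the build rather than the close/reopen pattern itself, though a sketch like this cannot prove it.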
By the way, is there a reason you get an NVE simulation by setting ntp=1 but then setting a very high value of taup? This has got to be extraordinarily inefficient (keeping track of all the pressure stuff has a really big cost, especially at high scaling).
Regards - Bob Duke
  ----- Original Message -----
  From: Vlad Cojocaru
  To: amber.scripps.edu
  Sent: Thursday, July 10, 2008 6:19 AM
  Subject: Re: AMBER: pmemd 10 output

  Dear Amber users,

  Coming back to the pmemd 10 output problem I reported in the thread below, I did test different nodes (writing locally as well as via the network), with iwrap=1 and iwrap=0, and the problem is very reproducible. I get it every time I run pmemd 10, but not with sander.MPI 10 or amber 9. Attached is a sample of the output. This is very strange.

  If anybody is able to explain this, I'd be very grateful for some suggestions (it could be a compilation issue). If it were a file system issue, why doesn't it happen with any other executable?

  Best wishes
  vlad

  -----------------input script --------------------
  # NVE production run
   &cntrl
    imin=0, ntx=5, irest=1, ntrx=1, ntxo=1,
    ioutfm=1, ntave=100000, iwrap=0,
    ntpr=250, ntwx=1000, ntwv=25000, ntwe=25000,
    ntf=1, ntb=2,
    dielc=1.0, scnb=2.0, scee=1.2, cut=10.0,
    nsnb=100, igb=0,
    ntr=0,
    nstlim=1000000,
    t=0.0, dt=0.001,
    ntt=3, gamma_ln=0.0, tempi=300.0, temp0=300.0,
    vlimit=15,
    ntp=1, taup=9999999, pres0=1.0, comp=44.6,
    ntc=2, tol=0.00000001,
   /


  Ross Walker wrote:
    Hi Vlad,

    This really does look to me like an issue with your file system - I have never seen this from PMEMD myself and I can't see how you would end up with this situation - it looks more to me like you have some kind of malfunctioning raid device or something.

    I have seen something similar to this on GPFS parallel file systems where one of the metadata servers had failed, such that you only see 4/5 of the striped data, for example. This can happen both in read and write mode, i.e. a perfectly good file on disk can be read by the user as being bad because of the striping issues, or alternatively, if the error occurs during a write, the data can get written to disk with chunks missing.

    How reproducible is the problem? Can you try running it and write to a local scratch disk on the master node instead of a network drive (if that is what you were doing) and see if the problem recurs.
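[Editorial sketch] One rough way to compare a local scratch directory with a network mount is to write a file of known content, closing and reopening it in append mode along the way, and then read it back. The Python sketch below assumes nothing about Amber; `write_and_verify`, the file name `io_check.txt`, and the record layout are hypothetical choices made for illustration.

```python
import os


def write_and_verify(directory, n_records=10000, reopen_every=1000):
    """Write numbered fixed-width records to a file in `directory`, closing
    and reopening it in append mode every `reopen_every` records, then read
    the file back. Returns a list of (line_number, problem) tuples; an empty
    list means every line came back as written."""
    path = os.path.join(directory, "io_check.txt")
    expected = [f"RECORD {i:08d} " + "-" * 60 + "\n" for i in range(n_records)]

    fh = open(path, "w")
    for i, line in enumerate(expected, start=1):
        fh.write(line)
        if i % reopen_every == 0:
            fh.close()
            fh = open(path, "a")  # continue after the existing content
    fh.close()

    with open(path) as fh:
        actual = fh.readlines()
    os.remove(path)

    problems = []
    if len(actual) != len(expected):
        problems.append(
            (0, f"expected {len(expected)} lines, read {len(actual)}"))
    for lineno, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got:
            problems.append((lineno, "content mismatch"))
    return problems
```

Running it once against a local directory and once against the network mount, and comparing the returned lists, gives a first indication. An empty result is only weak evidence, since a read immediately after a write may be served from cache rather than from disk; a proper stress test would be longer and would read the file back from a different node.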

    All the best

    Ross

    From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
    Sent: Friday, June 27, 2008 9:25 AM
    To: amber.scripps.edu
    Subject: Re: AMBER: pmemd 10 output

    Hi Ross,

    Yes, at some point the ---- lines are truncated and the "check COM velocity" phrase overflows into the data lines. VOLUME stops being printed, and towards 100000 steps I get lines where "check COM" appears after NSTEP ... and so on ... the output gets really messy.

    As for the input, I am well aware of the performance loss by running NVE this way. However this was a test run in which I wanted to follow the pressure of the system. Unfortunately ntp=0 does not allow that.

    Best
    vlad

    Ross Walker wrote:

Hi Vlad,

I assume you mean the truncated --- lines, missing data and the missing carriage returns. This looks to me like a file system issue where your machine is actually not writing to disk properly. If this is over an NFS mount then I would run some serious stress tests on the system to make sure things are working properly.

Also you may want to note that your input file is probably not optimum for performance. You have:

  ntp=1, taup=9999999, pres0=1.0, comp=44.6,

Which is effectively the same as running constant volume, with ntb=1. However, computationally it still runs NPT, which involves much more communication. This generally affects parallel scaling more than low processor count performance. Generally the performance goes as:

  NVE > NVT > NPT

And for thermostats:

  NTT=0 > NTT=1 >> NTT=3

Hence you are running an NVT calculation but paying the performance penalty for an NPT calculation.

All the best
Ross

-----Original Message-----
From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
Sent: Friday, June 27, 2008 8:49 AM
To: AMBER list
Subject: AMBER: pmemd 10 output

Dear Amber users,

The pmemd of AMBER 10 produces some really strange looking output (see attached, the three dot lines between NSTEP=250 and NSTEP=56500 are there to indicate that I truncated the output). What is actually strange is that the output looks fine till NSTEP=57500. Only after that, the output is messed up.

I haven't noticed this with any previous version of pmemd. Also not with sander.MPI from amber 10.

Thanks
vlad

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo.scripps.edu

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo.scripps.edu

Received on Sun Jul 13 2008 - 06:07:27 PDT