Re: AMBER: pmemd 10 output

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 10 Jul 2008 09:44:19 -0400

I would consider trying mpich 1 or 2, which I do support and which have been tested extensively with pmemd (if this is infiniband, then mvapich gives better performance than openmpi, I believe, based on at least one set of benchmarks I have seen). Given that this is random, it could be buffer problems somewhere, or heaven only knows what, but it is not likely a matter of the combination of the output params. I would also look at pgi and see what, if anything, they may have done to fortran i/o, just in case. This could very easily be a subtle "linked to the wrong stuff" problem, but I am just throwing out wild guesses here.

You are basically exploring all the corners of the hardware and software space here - amd64, openmpi, pgi - all stuff not routinely tested with pmemd (either due to availability, or because, given options, I test and recommend the stuff that has the best performance and is most reliable). The thing that worries me about all this is that it suggests a memory stomp on the file i/o buffer control variables at some point, and I am wondering what else might be getting stomped on.
Regards - Bob
  ----- Original Message -----
  From: Vlad Cojocaru
  To: amber.scripps.edu
  Sent: Thursday, July 10, 2008 9:26 AM
  Subject: Re: AMBER: pmemd 10 output


  Thanks Bob for the details

  I tested on 2 different machines. Not really different architectures, but different clusters (different generations of AMD64, 4 cores/node). The problem is reproducible. The time of appearance differs from run to run and looks random.

  What is weird is that I have never observed this with pmemd from AMBER 9, which made me think it might be a compilation issue (we compiled with pgi and openmpi), but of course that does not make too much sense. I could test without ntave (in the AMBER 9 runs I did not use ntave). I will also give it a try and modify the mdout_flush_interval. I'll let you know if something changes.
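  As an example of what that test could look like (just an illustrative value within the 0 to 3600 second range Bob describes below, not a recommendation), it would mean adding a single line to the &cntrl block of the input script:

    mdout_flush_interval=60,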

  Best
  vlad

  Robert Duke wrote:
    Hmmm. Two things to try, Vlad. Can you reproduce this on another (different type of) machine? Secondly, the only difference I can think of in how mdout is processed between amber 9 and amber 10 is the energy average sampling switch. But I just looked at your output, and because you are using ntave, energy average sampling is turned off, so we are not hitting some weird combination of events there. Also, really, looking at the output, the only way this can be happening is in the fortran i/o buffers for mdout. It looks to me as if the pointers to that buffer are somehow getting messed up - and that is below the application code.

    Now, what does pmemd do that is different from sander? Well, for one thing it flushes mdout at regular intervals, based on a timer, by closing and then reopening it. The frequency of this activity is controlled by the mdout_flush_interval namelist variable in &cntrl. It defaults to a close/open every 300 seconds, and can be set over the range of 0 to 3600. You can dink with this to see if your problem moves. I suspect some weird problem with the close/open calls on this machine - still being attached to stdout with some unexpected results, or some such - but I don't know. The reasons this mechanism exists in pmemd: 1) there really is no standard flush mechanism in fortran (at least last time I looked), and 2) on some really big machines (the xt3 comes to mind) flushing could be delayed for hours (at least as best I recollect), so it was possible for folks to run a simulation and not be able to see mdout until the run completed. I did not want constant flushing for performance reasons, but I did want some visibility into how the simulation was proceeding, so I put both mdinfo and mdout on flush timers, closing them and reopening in append mode. This has got to be a problem with the compile and/or the libraries for this specific build, or some idiosyncrasy ("feature") of this machine. The relevant code simply has not changed.

    By the way, is there a reason you get an NVE simulation by setting ntp=1 but then setting a very high value of taup? This has got to be extraordinarily inefficient (keeping track of all the pressure stuff has a really big cost, especially at high scaling).
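    To make the mechanism concrete, here is a minimal Fortran sketch of a timer-based flush done by closing a file and reopening it in append mode. This is only an illustration of the idea, not the actual pmemd source; the routine name, arguments, and unit handling below are invented for the example.

      ! flush a text output file by close/reopen-in-append once enough
      ! wall-clock time has passed since the last flush
      subroutine maybe_flush(out_unit, out_name, flush_interval, last_flush)
        implicit none
        integer, intent(in)          :: out_unit        ! fortran i/o unit
        character(len=*), intent(in) :: out_name        ! file name, e.g. the mdout path
        integer, intent(in)          :: flush_interval  ! seconds between flushes
        integer, intent(inout)       :: last_flush      ! time of last flush, in seconds
        integer :: ticks, ticks_per_sec, now

        call system_clock(ticks, ticks_per_sec)
        now = ticks / ticks_per_sec                     ! wall clock in whole seconds

        if (now - last_flush >= flush_interval) then
          ! closing forces any buffered records out to the file system;
          ! reopening with position='append' continues where the file left off
          close(out_unit)
          open(unit=out_unit, file=out_name, status='old', position='append')
          last_flush = now
        end if
      end subroutine maybe_flush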
    Regards - Bob Duke
      ----- Original Message -----
      From: Vlad Cojocaru
      To: amber.scripps.edu
      Sent: Thursday, July 10, 2008 6:19 AM
      Subject: Re: AMBER: pmemd 10 output


      Dear Amber users,

      Coming back to the pmemd 10 output problem I reported in the thread below, I did test different nodes (writing locally as well as via the network), with iwrap=1 and iwrap=0, and the problem is very reproducible. I get it every time I run pmemd 10, but not with sander.MPI from AMBER 10 or with AMBER 9. Attached is a sample of the output. This is very strange.

      If anybody is able to explain this, I'd be very grateful for some suggestions (it could be a compilation issue). If this were a file system issue, why does it not happen with any other executable?

      Best wishes
      vlad

      -----------------input script --------------------
      # NVE production run
       &cntrl
        imin=0, ntx=5, irest=1, ntrx=1, ntxo=1,
        ioutfm=1, ntave=100000, iwrap=0,
        ntpr=250, ntwx=1000, ntwv=25000, ntwe=25000,
        ntf=1, ntb=2,
        dielc=1.0, scnb=2.0, scee=1.2, cut=10.0,
        nsnb=100, igb=0,
        ntr=0,
        nstlim=1000000,
        t=0.0, dt=0.001,
        ntt=3, gamma_ln=0.0, tempi=300.0, temp0=300.0,
        vlimit=15,
        ntp=1, taup=9999999, pres0=1.0, comp=44.6,
        ntc=2, tol=0.00000001,
       /
       




      Ross Walker wrote:
        Hi Vlad,


        This really does look to me like an issue with your file system - I have never seen this from PMEMD myself and I can't see how you would end up with this situation - it looks more to me like you have some kind of malfunctioning raid device or something.


        I have seen something similar to this on GPFS parallel file systems where one of the metadata servers had failed, such that you only see 4/5 of the striped data, for example. This can happen in both read and write mode, i.e. a perfectly good file on disk can be read by the user as being bad because of the striping issues, or alternatively, if the error occurs during a write, the data can get written to disk with chunks missing.


        How reproducible is the problem? Can you try running it writing to a local scratch disk on the master node instead of a network drive (if that is what you were doing) and see if the problem recurs?


        All the best

        Ross


        From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
        Sent: Friday, June 27, 2008 9:25 AM
        To: amber.scripps.edu
        Subject: Re: AMBER: pmemd 10 output


        Hi Ross,

        Yes, at some point the ---- lines are truncated, the "check COM velocity" phrase overflows the data lines, VOLUME stops being printed, and towards 100000 steps I get lines where "check COM" appears after NSTEP, and so on; the output gets really messy.

        As for the input, I am well aware of the performance loss from running NVE this way. However, this was a test run in which I wanted to follow the pressure of the system. Unfortunately, ntp=0 does not allow that.

        Best
        vlad


        Ross Walker wrote:

          Hi Vlad,

          I assume you mean the truncated --- lines, missing data and the missing carriage returns. This looks to me like a file system issue where your machine is actually not writing to disk properly. If this is over an NFS mount then I would run some serious stress tests on the system to make sure things are working properly.

          Also you may want to note that your input file is probably not optimum for performance. You have:

            ntp=1, taup=9999999, pres0=1.0, comp=44.6,

          which is effectively the same as running constant volume, with ntb=1. However, computationally it still runs NPT, which involves much more communication. This generally affects parallel scaling more than low processor count performance. Generally the performance goes as:

            NVE > NVT > NPT

          and for thermostats:

            NTT=0 > NTT=1 >> NTT=3

          Hence you are running an NVT calculation but paying the performance penalty for an NPT calculation.

          All the best
          Ross

          -----Original Message-----
          From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Vlad Cojocaru
          Sent: Friday, June 27, 2008 8:49 AM
          To: AMBER list
          Subject: AMBER: pmemd 10 output

          Dear Amber users,

          The pmemd of AMBER 10 produces some really strange looking output (see attached; the three dot lines between NSTEP=250 and NSTEP=56500 are there to indicate that I truncated the output). What is actually strange is that the output looks fine till NSTEP=57500. Only after that is the output messed up.

          I haven't noticed this with any previous version of pmemd, nor with sander.MPI from amber 10.

          Thanks
          vlad
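          As a concrete illustration of the NPT-versus-constant-volume point Ross makes above, the equivalent constant-volume input would replace the barostat line in the &cntrl block with something like (a sketch only; note that it would no longer report the pressure that Vlad wanted to monitor):

            ntb=1, ntp=0,

          dropping the taup, pres0 and comp settings entirely, with everything else unchanged.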




    
-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo.scripps.edu
Received on Sun Jul 13 2008 - 06:07:28 PDT