Re: [AMBER] Could not read velocities from restart file

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 10 Apr 2012 13:59:26 -0700

Hi Dave et al,

I think the issue is that the restart file is never fully written to disk
while MD is running. That is, the job is not being killed while it is in the
middle of writing the restart file; rather, the OS holds approximately 8 KB
of the file in a buffer that it does not flush until the calculation
completes and the file is closed as pmemd exits. If you kill pmemd, or it
crashes in some way, the last 8 KB in the buffer never gets written and you
are left with a truncated restart file.
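
The effect of an unflushed buffer can be reproduced outside Amber. What
follows is a minimal Python sketch, not pmemd's actual I/O code: a child
process writes 20,000 bytes in small records through a file object with an
assumed 8192-byte buffer, then exits via os._exit(), which (like kill -9)
skips the normal close-and-flush. The names bytes_on_disk, BUFFER_SIZE,
RECORD_SIZE and N_RECORDS are hypothetical, and the buffer here lives in the
Python process rather than in the OS, so treat it as an analogy only.

```python
import os
import subprocess
import sys
import tempfile

BUFFER_SIZE = 8192   # assumed buffer size, chosen to mirror the ~8 KB above
RECORD_SIZE = 20     # bytes per write call (hypothetical)
N_RECORDS = 1000     # 1000 * 20 = 20,000 bytes in total

# Source of the child process. It writes small records through a buffered
# file object and then exits abruptly with os._exit(), which does not close
# the file or flush its buffer (a stand-in for kill -9 or a crash).
CHILD = """
import os, sys
path = sys.argv[1]
bufsize, recsize, nrec = (int(a) for a in sys.argv[2:5])
do_flush = sys.argv[5] == "1"
f = open(path, "wb", buffering=bufsize)
for _ in range(nrec):
    f.write(b"x" * recsize)
if do_flush:
    f.flush()   # push the buffered tail out of the process
os._exit(0)     # abrupt exit: no close(), no implicit flush
"""


def bytes_on_disk(flush_before_exit: bool) -> int:
    """Run the child and return the size of the file it left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "restrt_demo")
        subprocess.run(
            [sys.executable, "-c", CHILD, path,
             str(BUFFER_SIZE), str(RECORD_SIZE), str(N_RECORDS),
             "1" if flush_before_exit else "0"],
            check=True,
        )
        return os.path.getsize(path)


if __name__ == "__main__":
    total = RECORD_SIZE * N_RECORDS
    print("expected bytes:      ", total)
    print("without flush:       ", bytes_on_disk(False))
    print("with explicit flush: ", bytes_on_disk(True))
```

Without the flush, the file on disk comes up short by whatever was still
sitting in the buffer (at most one buffer's worth); with an explicit flush
before the abrupt exit, all 20,000 bytes are present.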

Here's a test I ran on my Red Hat EL6 machine:

[13:53][caffeine:0.05][rcw:GPU_PMEMD]$ ls -la
total 4320
drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
-rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
-rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
-rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop

[13:53][caffeine:0.04][rcw:GPU_PMEMD]$ /server-home/netbin/amber11/bin/pmemd.cuda -O -c inpcrd.rst &

[13:53][caffeine:0.03][rcw:GPU_PMEMD]$ ls -la
total 4336
drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
-rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
-rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
-rw-rw-r-- 1 rcw rcw 1191 Apr 10 13:53 mdinfo
-rw-rw-r-- 1 rcw rcw 9711 Apr 10 13:54 mdout
-rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop
-rw-rw-r-- 1 rcw rcw 0 Apr 10 13:53 restrt

So at this point we have a 0-byte restart file (which is fine, since I set it
to write every 10,000 steps).

If I wait until the mdout says 10,000 steps I then have:

[13:54][caffeine:0.61][rcw:GPU_PMEMD]$ ls -la
total 5976
drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
-rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
-rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
-rw-rw-r-- 1 rcw rcw 1191 Apr 10 13:53 mdinfo
-rw-rw-r-- 1 rcw rcw 14847 Apr 10 13:54 mdout
-rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop
-rw-rw-r-- 1 rcw rcw 1674065 Apr 10 13:54 restrt

So now the restrt file is the same size as inpcrd.rst (1674065 bytes),
exactly as expected. If I tail restrt I get:

   1.3885668 -0.1074871 -0.3129809 0.0681850 -0.2615662 0.0152306
  69.7703165 60.2059629 54.3607717 90.0000000 90.0000000 90.0000000

As expected. If I wait until 20,000 steps I get similar behavior:

   0.3031101 -0.1397434 0.4848370 -0.9246060 0.7415231 0.3254361
  69.7948383 60.1037068 54.4839923 90.0000000 90.0000000 90.0000000

And if I do kill -9 at 27,000 steps I get:

   0.3031101 -0.1397434 0.4848370 -0.9246060 0.7415231 0.3254361
  69.7948383 60.1037068 54.4839923 90.0000000 90.0000000 90.0000000

Which is exactly as expected: the restrt file still holds the complete frame
written at 20,000 steps. So my Red Hat EL6 system is working correctly. I
think, though, that some Linux systems are set up to be dangerously
aggressive with output buffering, so that tailing the restrt file in the
middle of a job will actually show a truncated file. The question is what
exactly is causing this.

It would be useful if the person who reported this problem could try my
input on their machine and see whether it behaves the way I see here, or
whether it shows the restart file not being flushed correctly.
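
For anyone repeating the test, the two checks used above (restrt has the same
size as inpcrd.rst, and the tail ends in a complete six-number box line) can
be scripted. This is only a rough Python sketch under stated assumptions: the
reference file is a complete restart for the same system in the same
formatted layout (velocities and box present), and the function names
last_line_fields and looks_complete are hypothetical, not part of Amber.

```python
import os


def last_line_fields(path: str, tail_bytes: int = 4096) -> list:
    """Return the whitespace-separated fields of the last non-empty line."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))   # only read the end of the file
        tail = f.read().decode("ascii", errors="replace")
    lines = [ln for ln in tail.splitlines() if ln.strip()]
    return lines[-1].split() if lines else []


def looks_complete(restart_path: str, reference_path: str) -> bool:
    """Heuristic check that a restart file was not truncated.

    Assumes reference_path is a known-good restart for the same system and
    format, so the two files should be byte-for-byte the same length, and
    that the last line is a periodic box line of six numbers.
    """
    if os.path.getsize(restart_path) != os.path.getsize(reference_path):
        return False
    fields = last_line_fields(restart_path)
    if len(fields) != 6:
        return False
    try:
        [float(x) for x in fields]
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # File names as in the directory listing above.
    print(looks_complete("restrt", "inpcrd.rst"))
```

A file that fails the size comparison while the job is still running, after
the step count in mdout has passed a restart write, would point to the
buffering behavior described above.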

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber



Received on Tue Apr 10 2012 - 14:30:03 PDT