Dear Xin,
> I do not know why the sander job crashed (is there a way to check?).
> I got an error message, "broken pipe". (I found it frequently happened
> after the job had been running for up to three days.) It seems something
> is wrong with the parallel run.
> I run sander on a Linux cluster with 16 dual nodes (AMD 1.6 GHz). My MD
> system includes a protein (580 residues) and up to 25,000 waters.
Are you using a queuing system like PBS? If you are, it is possible that
your job is being killed by the queue system after it reaches a specified
CPU time limit. If this is the case, check with whoever set up your cluster
and see if they can increase the time a job is allowed to run for.
Alternatively, you could split your job into sections that each fit within
the queue time limit and submit the next one as each finishes.
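For example, a minimal sketch of chaining two segments with restart files
(the file names here are just placeholders; md2.in would set irest=1 and
ntx=5 so that coordinates and velocities are read from the previous restart):

   # segment 1
   mpirun -np 32 $AMBERHOME/exe/sander -O -i md1.in -o md1.out -p prmtop \
       -c inpcrd -r md1.rst -x md1.mdcrd
   # segment 2 restarts from segment 1's restart file
   mpirun -np 32 $AMBERHOME/exe/sander -O -i md2.in -o md2.out -p prmtop \
       -c md1.rst -r md2.rst -x md2.mdcrd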
You might also be hitting a disk quota limit on your account, which would
prevent any further writes.
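To see whether a quota is the problem, you can check your usage and the free
space on the filesystem (standard Linux tools, assuming they are installed
on your cluster):

   quota -s    # per-user quota and current usage in human-readable units
   df -h .     # free space on the filesystem holding your run directory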
Another thing to check is whether you are hitting the 'infamous' 2 GB file
limit. This plagues 32-bit systems such as 32-bit AMD and Pentium processors
(although not Opterons or Itaniums, which are 64-bit). Essentially, once any
of your output files reaches 2 GB in size, every subsequent write will fail,
and this can manifest itself as the broken pipe error you are seeing. Check
your mdcrd file to see if it is almost 2 GB in size when your job fails. If
it is, then you will have to either write to your mdcrd file less frequently
or split your job into chunks, each writing to a separate mdcrd file. Note:
it is possible to compile a version of sander with large-file support on a
32-bit architecture, but it is significantly more involved than simply
splitting a job into pieces.
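A quick way to check when the job dies (the file name is just a placeholder):

   ls -lh prod.mdcrd    # a size just under 2.0G points to the 2 GB limit
   ulimit -f            # the shell's file-size limit; should normally be "unlimited"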
The mdcrd file size will be roughly:
(24N + 24)*S bytes
where N is the number of atoms in your system and S is the number of frames
in the mdcrd file (nstlim/ntwx): each atom contributes three 8-character
coordinates per frame, plus a short box-length record per frame for periodic
systems.
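As a rough illustration, assuming about 80,000 atoms for a 580-residue
protein in 25,000 waters (only a ballpark figure):

   24 * 80,000 bytes  = ~1.9 MB per frame
   2 GB / 1.9 MB      = ~1,100 frames before the 2 GB limit is reached

With, say, ntwx=500 and a 1 fs time step, that corresponds to only about
0.55 ns of trajectory in a single file.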
Note that mdcrd files are ASCII format and so compress very well (by a
factor of 6 to 7) with something like gzip. Thus, before you move on to the
next segment of a job, you probably want to compress the previous segment's
mdcrd file to save space.
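For example (again with a placeholder file name):

   gzip md1.mdcrd        # replaces md1.mdcrd with md1.mdcrd.gz
   gunzip md1.mdcrd.gz   # recovers the uncompressed file when you need it for analysis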
> The 2 ns of MD would take almost 2 weeks in general (without competition
> for the nodes). Is that normal?
> (I am using "mpirun -np 32 $AMBERHOME/exe/sander ..."). I feel it is
> kind of slow. Maybe something is wrong with the parallel setup, or maybe
> I need to find an optimal number of processors (I heard that more is not
> necessarily faster)?
This is a question for which there are many, many answers... Performance in
parallel can depend on a large number of factors: the speed of the
individual CPUs, the interconnect speed, the amount of memory available on
each node, the MPI implementation, the network buffer sizes, the size of
your system, the options you have selected, the size of your cutoff, and so
on. The scaling can also differ from cluster to cluster, since slower CPUs
will probably scale better than fast ones because they put less strain on a
slow interconnect.
The best option would be to first try running the JAC benchmark on 1 CPU to
see how your system compares to ours:
http://amber.scripps.edu/amber8.bench1.html. This will tell you whether
sander is running as it should. Then I would try running your own system
first on 1 CPU (just 500 steps will do) and then on 2, 4, 8, 16 and 32, and
see how the timings compare (a sketch of such a test is below). If you have
a slow interconnect (e.g. gigabit ethernet), you may find that your
calculation tops out at around 8 or 16 CPUs and that going to 32 actually
slows it down, because the code spends all its time communicating and not
much time calculating. There are no hard and fast rules here; the best thing
you can do is try it out and see what the optimal value is.
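A minimal sketch of such a scaling test (file names are placeholders and
md_short.in is a short input with nstlim=500; how you reserve the nodes will
depend on your queue system):

   for n in 1 2 4 8 16 32; do
       mpirun -np $n $AMBERHOME/exe/sander -O -i md_short.in -o bench_$n.out \
           -p prmtop -c inpcrd -r bench_$n.rst
   done
   # then compare the timing summaries at the end of the bench_*.out files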
All the best
Ross
/\
\/
|\oss Walker
| Department of Molecular Biology TPC15 |
| The Scripps Research Institute |
| Tel:- +1 858 784 8889 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk/ | PGP Key available on request |
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu