Hi Ping,
> Thank you for your input. I tested with an example downloaded from
> the Amber website: the cellulose system. The mdin and prmtop files
> are the same as the originals. The job finished correctly when using
> 16 cores (the output file is attached below). When 64 processors were
> used, rank 0 quit; both the output and error files from that
> calculation are listed. I tried 'profile_mpi' to see if I could get
> more information, but that keyword gave no additional information for
> this issue. I also compiled a debug version with the '-g' flag, but
> no core dump file was created either. Any suggestions for the next
> step?
This is really weird and I have not been able to reproduce it. I can run
sander.MPI with this input on up to 128 MPI tasks with no problems, and
pmemd on up to 512, so this will be very hard to debug. It could be
something very subtle, such as your machine running out of MPI buffer space.
One thing I would definitely suggest is setting the environment variable
OMP_NUM_THREADS=1 on all nodes; otherwise the MKL calls could each spin
out 8 threads and run you out of memory. Also make sure your stack size is
unlimited, since you may be blowing the stack in some way. You may need
root access to change this, so you may have to ask whoever administers the
machine to do it for you.
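In case it helps, here is a minimal, hypothetical check (plain Python,
standard library only, nothing Amber-specific) that you could run on each
compute node before launching sander.MPI to confirm that those two
settings actually took effect:

    import os
    import resource

    # OMP_NUM_THREADS controls how many threads the MKL/OpenMP runtime
    # spawns per process; "unset" means the runtime picks its own default.
    omp = os.environ.get("OMP_NUM_THREADS", "unset")
    print("OMP_NUM_THREADS =", omp)

    # RLIMIT_STACK is what 'ulimit -s' reports; RLIM_INFINITY means
    # the stack size is unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    print("stack limit: soft =", fmt(soft), " hard =", fmt(hard))

If it reports anything other than OMP_NUM_THREADS = 1 and an unlimited
stack on every node, that would be the first thing to fix.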
Those unfortunately are my only guesses right now, with the stack size issue
being the most likely candidate.
Good luck,
Ross
/\
\/
|\oss Walker
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |
Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.