Greetings,
Is there a hard limit on the number of processors implemented in Amber10? The code was compiled
with icc + mvapich + mkl and passed the test suite successfully.
A job submitted to the machine runs fine and finishes happily when
using 16, 32, or 48 processors. However, with 64 processors
and beyond (more than 8 nodes), rank 0 gets a segmentation fault and
stops at the step of dividing atoms among processors, leaving the
remaining ranks hanging. This happens with both sander.MPI and pmemd.
Could this issue be related to AmberTools, which was built as a serial version?
I tried to recompile AmberTools in parallel; however, the
configure script ignores the '-mpi' option when both '-mpi' and 'icc' are
provided. Did I miss something here? (I list more information at
the end of the email.) I'd appreciate your kind help and any insight on
this issue.
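For reference, this is roughly what I tried when rebuilding AmberTools in
parallel (the $AMBERHOME/src path and the bare make invocation below are from
memory and may not match the tree exactly, so please take this as a sketch of
my steps rather than an exact transcript):

   cd $AMBERHOME/src        # top of our Amber10/AmberTools source tree
   ./configure -mpi icc     # the -mpi option appears to be silently ignored
   make clean
   make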
Thanks much,
-Ping
****************
Details of the build environment
****************
intel/10.1.015
mvapich/1.0.1-2533
mkl/10.0.011
****************
Below are the last two lines of output from the unsuccessful job.
****************
"begin time read from input coords = 20.020 ps
Number of triangulated 3-point waters found: 105855"
****************
The corresponding error file contains:
****************
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread.so.0    000000353C70C4F0  Unknown            Unknown     Unknown
libc.so.6          000000353C0721E3  Unknown            Unknown     Unknown
sander.MPI         00000000009919AC  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967E65FE  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967BF582  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967BDC3A  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967B2BBC  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967CFCFE  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967A5F79  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A967A3730  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A9677B5C0  Unknown            Unknown     Unknown
libmpich.so.1.0    0000002A9677B773  Unknown            Unknown     Unknown
sander.MPI         00000000005234CA  Unknown            Unknown     Unknown
sander.MPI         00000000004CDA96  Unknown            Unknown     Unknown
sander.MPI         00000000004C9334  Unknown            Unknown     Unknown
sander.MPI         000000000041EE22  Unknown            Unknown     Unknown
libc.so.6          000000353C01C3FB  Unknown            Unknown     Unknown
sander.MPI         000000000041ED6A  Unknown            Unknown     Unknown
srun: error: cu04n81: task0: Exited with exit code 174
srun: Warning: first task terminated 60s ago
(A second, identical traceback follows in the error output, ending with:)
srun: error: cu02n104: task0: Exited with exit code 174
srun: Warning: first task terminated 60s ago
**************
__________________________________________________
Ping Yang
EMSL, Molecular Science Computing
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, MSIN K8-83
Richland, WA 99352 USA
Tel: 509-371-6405
Fax: 509-371-6110
ping.yang.pnl.gov
www.emsl.pnl.gov