Hi Ross,
Thank you for your input. I tested with an example downloaded from the Amber website: the cellulose system. The mdin and prmtop files are unchanged from the originals. The job finished correctly when using 16 cores (the output file is included below). When 64 processors were used, rank 0 quit. Both the output and error files from this calculation are listed below. I tried 'profile_mpi' to see if I could get more information, but this keyword gave no additional information on this issue. I also compiled a debug version with the '-g' flag; no core dump file was created either. Any suggestions for the next step?
Thanks,
-Ping
***************
***************
error file for 64-core job
***************
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 000000353C70C4F0 Unknown Unknown Unknown
libc.so.6 000000353C0721E3 Unknown Unknown Unknown
sander.MPI 00000000009998AC Unknown Unknown Unknown
libmpich.so.1.0 0000002A967E65FE Unknown Unknown Unknown
libmpich.so.1.0 0000002A967BF582 Unknown Unknown Unknown
libmpich.so.1.0 0000002A967BDC3A Unknown Unknown Unknown
libmpich.so.1.0 0000002A967B2BBC Unknown Unknown Unknown
libmpich.so.1.0 0000002A967CFCFE Unknown Unknown Unknown
libmpich.so.1.0 0000002A967A5F79 Unknown Unknown Unknown
libmpich.so.1.0 0000002A967A3730 Unknown Unknown Unknown
libmpich.so.1.0 0000002A9677B5C0 Unknown Unknown Unknown
libmpich.so.1.0 0000002A9677B773 Unknown Unknown Unknown
sander.MPI 00000000005246BA Unknown Unknown Unknown
sander.MPI 00000000004CE9C0 Unknown Unknown Unknown
sander.MPI 00000000004CA264 Unknown Unknown Unknown
sander.MPI 000000000041EDE2 Unknown Unknown Unknown
libc.so.6 000000353C01C3FB Unknown Unknown Unknown
sander.MPI 000000000041ED2A Unknown Unknown Unknown
srun: error: cu02n1: task0: Exited with exit code 174
srun: Warning: first task terminated 60s ago
****************
****************
output file for 64-core job
****************
-------------------------------------------------------
Amber 10 SANDER 2008
-------------------------------------------------------
| Run on 06/17/2009 at 08:51:08
[-O]verwriting output
File Assignments:
| MDIN: mdin
| MDOUT: md64.out
|INPCRD: eq200.x
| PARM: prmtop
|RESTRT: restrt
| REFC: refc
| MDVEL: mdvel
| MDEN: mden
| MDCRD: mdcrd
|MDINFO: mdinfo
|INPDIP: inpdip
|RSTDIP: rstdip
|INPTRA: inptraj
|
Here is the input file:
equilibration
&cntrl
nstlim=10, dt=0.001, nrespa=2,
ntc=2, ntf=2, tol=0.000001,
ntx=5, irest=1, ntpr=1,
ntt=0,
ntb=1,
ntwr=10000, ntwx=0,
profile_mpi=1,
/
--------------------------------------------------------------------------------
1. RESOURCE USE:
--------------------------------------------------------------------------------
| Flags: MPI
getting new box info from bottom of inpcrd
| INFO: Old style inpcrd file read
| peek_ewald_inpcrd: Box info found
|Largest sphere to fit in unit cell has radius = 61.751
| New format PARM file being parsed.
| Version = 1.000 Date = 05/13/05 Time = 14:32:09
NATOM = 408609 NTYPES = 8 NBONH = 360981 MBONA = 51840
NTHETH = 99576 MTHETA = 77652 NPHIH = 181764 MPHIA = 155196
NHPARM = 0 NPARM = 0 NNB = 976704 NRES = 110283
NBONA = 51840 NTHETA = 77652 NPHIA = 155196 NUMBND = 8
NUMANG = 14 NPTRA = 18 NATYP = 8 NPHB = 1
IFBOX = 1 NMXRS = 22 IFCAP = 0 NEXTRA = 0
NCOPY = 0
| Memory Use Allocated
| Real 20939375
| Hollerith 2561939
| Integer 15849307
| Max Pairs 2837562
| nblistReal 4903308
| nblist Int 16963937
| Total 351164 kbytes
| Duplicated 0 dihedrals
| Duplicated 0 dihedrals
BOX TYPE: RECTILINEAR
--------------------------------------------------------------------------------
2. CONTROL DATA FOR THE RUN
--------------------------------------------------------------------------------
General flags:
imin = 0, nmropt = 0
Nature and format of input:
ntx = 5, irest = 1, ntrx = 1
Nature and format of output:
ntxo = 1, ntpr = 1, ntrx = 1, ntwr = 10000
iwrap = 0, ntwx = 0, ntwv = 0, ntwe = 0
ioutfm = 0, ntwprt = 0, idecomp = 0, rbornstat= 0
Potential function:
ntf = 2, ntb = 1, igb = 0, nsnb = 25
ipol = 0, gbsa = 0, iesp = 0
dielc = 1.00000, cut = 8.00000, intdiel = 1.00000
scnb = 2.00000, scee = 1.20000
Frozen or restrained atoms:
ibelly = 0, ntr = 0
Molecular dynamics:
nstlim = 10, nscm = 1000, nrespa = 2
t = 0.00000, dt = 0.00100, vlimit = 20.00000
SHAKE:
ntc = 2, jfastw = 0
tol = 0.00000
Ewald parameters:
verbose = 0, ew_type = 0, nbflag = 1, use_pme = 1
vdwmeth = 1, eedmeth = 1, netfrc = 1
Box X = 259.230 Box Y = 124.558 Box Z = 123.502
Alpha = 90.000 Beta = 90.000 Gamma = 90.000
NFFT1 = 270 NFFT2 = 125 NFFT3 = 125
Cutoff= 8.000 Tol =0.100E-04
Ewald Coefficient = 0.34864
Interpolation order = 4
| MPI Timing options:
| profile_mpi = 1
--------------------------------------------------------------------------------
3. ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------
begin time read from input coords = 20.020 ps
Number of triangulated 3-point waters found: 105855
****************
****************
output file for 16-core job
****************
-------------------------------------------------------
Amber 10 SANDER 2008
-------------------------------------------------------
| Run on 06/16/2009 at 20:06:30
[-O]verwriting output
File Assignments:
| MDIN: mdin
| MDOUT: md16.out
|INPCRD: eq200.x
| PARM: prmtop
|RESTRT: restrt
| REFC: refc
| MDVEL: mdvel
| MDEN: mden
| MDCRD: mdcrd
|MDINFO: mdinfo
|INPDIP: inpdip
|RSTDIP: rstdip
|INPTRA: inptraj
|
Here is the input file:
equilibration
&cntrl
nstlim=10, dt=0.001, nrespa=2,
ntc=2, ntf=2, tol=0.000001,
ntx=5, irest=1, ntpr=1,
ntt=0,
ntb=1,
ntwr=10000, ntwx=0,
profile_mpi=1,
/
--------------------------------------------------------------------------------
1. RESOURCE USE:
--------------------------------------------------------------------------------
| Flags: MPI
getting new box info from bottom of inpcrd
| INFO: Old style inpcrd file read
| peek_ewald_inpcrd: Box info found
|Largest sphere to fit in unit cell has radius = 61.751
| New format PARM file being parsed.
| Version = 1.000 Date = 05/13/05 Time = 14:32:09
NATOM = 408609 NTYPES = 8 NBONH = 360981 MBONA = 51840
NTHETH = 99576 MTHETA = 77652 NPHIH = 181764 MPHIA = 155196
NHPARM = 0 NPARM = 0 NNB = 976704 NRES = 110283
NBONA = 51840 NTHETA = 77652 NPHIA = 155196 NUMBND = 8
NUMANG = 14 NPTRA = 18 NATYP = 8 NPHB = 1
IFBOX = 1 NMXRS = 22 IFCAP = 0 NEXTRA = 0
NCOPY = 0
| Memory Use Allocated
| Real 20939375
| Hollerith 2561939
| Integer 15849307
| Max Pairs 11350250
| nblistReal 4903308
| nblist Int 17170141
| Total 385222 kbytes
| Duplicated 0 dihedrals
| Duplicated 0 dihedrals
BOX TYPE: RECTILINEAR
--------------------------------------------------------------------------------
2. CONTROL DATA FOR THE RUN
--------------------------------------------------------------------------------
General flags:
imin = 0, nmropt = 0
Nature and format of input:
ntx = 5, irest = 1, ntrx = 1
Nature and format of output:
ntxo = 1, ntpr = 1, ntrx = 1, ntwr = 10000
iwrap = 0, ntwx = 0, ntwv = 0, ntwe = 0
ioutfm = 0, ntwprt = 0, idecomp = 0, rbornstat= 0
Potential function:
ntf = 2, ntb = 1, igb = 0, nsnb = 25
ipol = 0, gbsa = 0, iesp = 0
dielc = 1.00000, cut = 8.00000, intdiel = 1.00000
scnb = 2.00000, scee = 1.20000
Frozen or restrained atoms:
ibelly = 0, ntr = 0
Molecular dynamics:
nstlim = 10, nscm = 1000, nrespa = 2
t = 0.00000, dt = 0.00100, vlimit = 20.00000
SHAKE:
ntc = 2, jfastw = 0
tol = 0.00000
Ewald parameters:
verbose = 0, ew_type = 0, nbflag = 1, use_pme = 1
vdwmeth = 1, eedmeth = 1, netfrc = 1
Box X = 259.230 Box Y = 124.558 Box Z = 123.502
Alpha = 90.000 Beta = 90.000 Gamma = 90.000
NFFT1 = 270 NFFT2 = 125 NFFT3 = 125
Cutoff= 8.000 Tol =0.100E-04
Ewald Coefficient = 0.34864
Interpolation order = 4
| MPI Timing options:
| profile_mpi = 1
--------------------------------------------------------------------------------
3. ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------
begin time read from input coords = 20.020 ps
Number of triangulated 3-point waters found: 105855
| Atom division among processors:
| 0 25544 51086 76628 102153 127692 153228 178767
| 204306 229842 255381 280920 306456 331995 357534 383070
| 408609
Sum of charges from parm topology file = 0.00000000
Forcing neutrality...
| Running AMBER/MPI version on 16 nodes
--------------------------------------------------------------------------------
4. RESULTS
--------------------------------------------------------------------------------
| # of SOLUTE degrees of freedom (RNDFP): 864846.
| # of SOLVENT degrees of freedom (RNDFS): 0.
| NDFMIN = 864843. NUM_NOSHAKE = 0 CORRECTED RNDFP = 864843.
| TOTAL # of degrees of freedom (RNDF) = 864843.
---------------------------------------------------
APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
using 5000.0 points per unit in tabled values
TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
| CHECK switch(x): max rel err = 0.2738E-14 at 2.422500
| CHECK d/dx switch(x): max rel err = 0.8332E-11 at 2.782960
---------------------------------------------------
| Local SIZE OF NONBOND LIST = 6297209
| TOTAL SIZE OF NONBOND LIST = 87461669
NSTEP = 2 TIME(PS) = 20.022 TEMP(K) = 297.32 PRESS = 0.0
Etot = -442303.5426 EKtot = 255489.1640 EPtot = -697792.7066
BOND = 20811.6010 ANGLE = 55885.5884 DIHED = 23637.6939
1-4 NB = 21997.4171 1-4 EEL = 742112.9832 VDWAALS = 97671.1570
EELEC = -1659909.1473 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2220E-03
------------------------------------------------------------------------------
NSTEP = 4 TIME(PS) = 20.024 TEMP(K) = 292.48 PRESS = 0.0
Etot = -442264.1499 EKtot = 251326.5178 EPtot = -693590.6678
BOND = 21485.0936 ANGLE = 57212.8303 DIHED = 23656.3753
1-4 NB = 22052.8283 1-4 EEL = 742147.1564 VDWAALS = 97951.0328
EELEC = -1658095.9843 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2220E-03
------------------------------------------------------------------------------
NSTEP = 6 TIME(PS) = 20.026 TEMP(K) = 293.62 PRESS = 0.0
Etot = -442309.5876 EKtot = 252308.6243 EPtot = -694618.2119
BOND = 21035.4758 ANGLE = 55465.4631 DIHED = 23632.2956
1-4 NB = 21995.4631 1-4 EEL = 742160.0466 VDWAALS = 98135.2878
EELEC = -1657042.2439 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2191E-03
------------------------------------------------------------------------------
NSTEP = 8 TIME(PS) = 20.028 TEMP(K) = 297.69 PRESS = 0.0
Etot = -442381.6602 EKtot = 255807.1368 EPtot = -698188.7970
BOND = 20043.1357 ANGLE = 52749.9450 DIHED = 23593.5910
1-4 NB = 21903.4014 1-4 EEL = 742155.7318 VDWAALS = 98268.8159
EELEC = -1656903.4177 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2246E-03
------------------------------------------------------------------------------
NSTEP = 10 TIME(PS) = 20.030 TEMP(K) = 300.03 PRESS = 0.0
Etot = -442403.2922 EKtot = 257821.3786 EPtot = -700224.6708
BOND = 19439.5563 ANGLE = 51897.5339 DIHED = 23578.6361
1-4 NB = 21874.1521 1-4 EEL = 742146.5988 VDWAALS = 98396.6480
EELEC = -1657557.7960 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2320E-03
------------------------------------------------------------------------------
A V E R A G E S O V E R 5 S T E P S
NSTEP = 10 TIME(PS) = 20.030 TEMP(K) = 296.23 PRESS = 0.0
Etot = -442332.4465 EKtot = 254550.5643 EPtot = -696883.0108
BOND = 20562.9725 ANGLE = 54642.2721 DIHED = 23619.7184
1-4 NB = 21964.6524 1-4 EEL = 742144.5034 VDWAALS = 98084.5883
EELEC = -1657901.7178 EHBOND = 0.0000 RESTRAINT = 0.0000
Ewald error estimate: 0.2239E-03
------------------------------------------------------------------------------
R M S F L U C T U A T I O N S
NSTEP = 10 TIME(PS) = 20.030 TEMP(K) = 2.78 PRESS = 0.0
Etot = 51.8912 EKtot = 2390.7715 EPtot = 2435.9045
BOND = 730.4351 ANGLE = 1997.2851 DIHED = 28.9670
1-4 NB = 65.9391 1-4 EEL = 16.5688 VDWAALS = 254.0291
EELEC = 1088.2181 EHBOND = 0.0000 RESTRAINT = 0.0000
|E(PBS) = 2.4043
Ewald error estimate: 0.4374E-05
------------------------------------------------------------------------------
--------------------------------------------------------------------------------
5. TIMINGS
--------------------------------------------------------------------------------
|>>>>>>>>PROFILE of Average TIMES>>>>>>>>>
| Read coords time 0.05 ( 0.30% of Total)
| Fast Water setup 0.00 ( 0.01% of Total)
| Build the list 0.83 (73.34% of List )
| Other 0.30 (26.66% of List )
| List time 1.14 (18.36% of Nonbo)
| Short_ene time 2.83 (87.67% of Direc)
| Other 0.40 (12.33% of Direc)
| Direct Ewald time 3.23 (64.14% of Ewald)
| Adjust Ewald time 0.06 ( 1.09% of Ewald)
| Fill Bspline coeffs 0.10 (12.04% of Recip)
| Fill charge grid 0.08 ( 9.91% of Recip)
| Scalar sum 0.07 ( 8.43% of Recip)
| Grad sum 0.12 (14.53% of Recip)
| FFT back comm time 0.10 (25.08% of FFT t)
| Other 0.30 (74.92% of FFT t)
| FFT time 0.40 (48.26% of Recip)
| Other 0.06 ( 6.83% of Recip)
| Recip Ewald time 0.83 (16.56% of Ewald)
| Force Adjust 0.69 (13.64% of Ewald)
| Virial junk 0.11 ( 2.22% of Ewald)
| Other 0.12 ( 2.33% of Ewald)
| Ewald time 5.04 (81.43% of Nonbo)
| IPS excludes 0.01 ( 0.15% of Nonbo)
| Other 0.00 ( 0.07% of Nonbo)
| Nonbond force 6.19 (87.18% of Force)
| Bond/Angle/Dihedral 0.15 ( 2.08% of Force)
| FRC Collect time 0.58 ( 8.10% of Force)
| Other 0.19 ( 2.64% of Force)
| Force time 7.10 (84.63% of Runmd)
| Shake time 0.13 ( 1.54% of Runmd)
| Verlet update time 0.38 ( 4.57% of Runmd)
| CRD distribute time 0.55 ( 6.52% of Runmd)
| Other 0.23 ( 2.74% of Runmd)
| Runmd Time 8.38 (52.12% of Total)
| Other 7.65 (47.58% of Total)
| Total time 16.09 (99.63% of ALL )
| Number of list builds : 1
| Highest rstack allocated: 1878609
| Highest istack allocated: 47943
| Job began at 20:06:30.322 on 06/16/2009
| Setup done at 20:06:38.053 on 06/16/2009
| Run done at 20:06:49.852 on 06/16/2009
| wallclock() was called 477 times
**************
**************
**************
***************
-----Original Message-----
From: amber-bounces.ambermd.org on behalf of Ross Walker
Sent: Sat 6/20/2009 10:40 AM
To: 'AMBER Mailing List'
Subject: RE: [AMBER] Hard limit in Amber10?
Hi Ping,
The limits in Sander v10 are that nthreads < 128 and there must be more
residues than threads. For most periodic simulations there are lots of
waters so this does not present a problem since the residue count is large.
For implicit solvent simulations, however, it can cause issues. PMEMD has
slightly more relaxed restrictions. There should not, as far as I know, be an
upper limit on the number of threads except that you need to have 10x more
atoms than processors (for implicit solvent) and more residues than
processors (I think) for explicit solvent.
Note though that in either case the code should not segfault, it should quit
with an appropriate error. Thus it would be helpful if you could post an
example (prmtop, inpcrd, mdin) that shows this error so that we can try to
reproduce it.
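The limits described above can be written down as a quick pre-flight check. This is only a sketch of the rules as stated in this message, not code taken from sander or pmemd: the function name `check_rank_limits` and its arguments are hypothetical, and "10x more atoms than processors" is read here as NATOM >= 10 * ranks. The PMEMD rules are themselves stated with some uncertainty above, so treat the result as a hint rather than a guarantee.

```python
def check_rank_limits(program, nranks, natom, nres, implicit_solvent=False):
    """Return a list of violated limits (empty list means none were found).

    Hypothetical helper: encodes only the rules of thumb quoted in this
    thread, not the actual checks performed inside Amber.
    """
    problems = []
    if program == "sander":
        # Sander v10: fewer than 128 threads, and more residues than threads.
        if nranks >= 128:
            problems.append("sander: number of threads must be < 128")
        if nres <= nranks:
            problems.append("sander: need more residues than threads")
    elif program == "pmemd":
        if implicit_solvent:
            # Assumption: "10x more atoms than processors" means natom >= 10 * nranks.
            if natom < 10 * nranks:
                problems.append("pmemd: need about 10x more atoms than processors")
        elif nres <= nranks:
            problems.append("pmemd: need more residues than processors")
    else:
        raise ValueError("program must be 'sander' or 'pmemd'")
    return problems


# Cellulose benchmark values from the output in this thread:
# NATOM = 408609, NRES = 110283, run on 64 ranks.
print(check_rank_limits("sander", 64, 408609, 110283))
```

For the cellulose system at 64 ranks the list comes back empty, which agrees with the point that the run is within the documented limits and the segfault is not expected behaviour.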
All the best
Ross
> -----Original Message-----
> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> Behalf Of Yang, Ping
> Sent: Friday, June 19, 2009 5:21 PM
> To: amber.ambermd.org
> Subject: [AMBER] Hard limit in Amber10?
>
>
> Greetings,
>
> Is there any hard limit implemented in Amber10? The code was compiled
> using icc+mvapich+mkl and passed the tests successfully.
>
> A submitted job runs fine and finishes when using 16, 32, or 48
> processors. However, with 64 or more processors (more than 8 nodes),
> rank 0 gets a 'segmentation fault' and stops at the step of dividing
> atoms among processors, leaving the remaining ranks hanging. This
> happens with both sander.MPI and pmemd.
>
> Could this issue be related to AmberTools, which was built as a serial
> version? I tried to recompile AmberTools in parallel. However, the
> configure script ignores the '-mpi' option when '-mpi icc' is
> provided. Did I miss something here? (More information is listed at
> the end of the email.) I'd appreciate your help and any insight on
> this issue.
>
> Thanks much,
>
> -Ping
>
>
> ****************
> The details for building code
> ****************
> intel/10.1.015
> mvapich/1.0.1-2533
> mkl/10.0.011
>
> ****************
> Below are the last two lines from the unsuccessful job.
> ****************
> "begin time read from input coords = 20.020 ps
> Number of triangulated 3-point waters found: 105855"
>
>
> ****************
> The corresponding error file contains:
> ****************
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libpthread.so.0 000000353C70C4F0 Unknown Unknown Unknown
> libc.so.6 000000353C0721E3 Unknown Unknown Unknown
> sander.MPI 00000000009919AC Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967E65FE Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967BF582 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967BDC3A Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967B2BBC Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967CFCFE Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967A5F79 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967A3730 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A9677B5C0 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A9677B773 Unknown Unknown Unknown
> sander.MPI 00000000005234CA Unknown Unknown Unknown
> sander.MPI 00000000004CDA96 Unknown Unknown Unknown
> sander.MPI 00000000004C9334 Unknown Unknown Unknown
> sander.MPI 000000000041EE22 Unknown Unknown Unknown
> libc.so.6 000000353C01C3FB Unknown Unknown Unknown
> sander.MPI 000000000041ED6A Unknown Unknown Unknown
> srun: error: cu04n81: task0: Exited with exit code 174
> srun: Warning: first task terminated 60s ago
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libpthread.so.0 000000353C70C4F0 Unknown Unknown Unknown
> libc.so.6 000000353C0721E3 Unknown Unknown Unknown
> sander.MPI 00000000009919AC Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967E65FE Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967BF582 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967BDC3A Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967B2BBC Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967CFCFE Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967A5F79 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A967A3730 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A9677B5C0 Unknown Unknown Unknown
> libmpich.so.1.0 0000002A9677B773 Unknown Unknown Unknown
> sander.MPI 00000000005234CA Unknown Unknown Unknown
> sander.MPI 00000000004CDA96 Unknown Unknown Unknown
> sander.MPI 00000000004C9334 Unknown Unknown Unknown
> sander.MPI 000000000041EE22 Unknown Unknown Unknown
> libc.so.6 000000353C01C3FB Unknown Unknown Unknown
> sander.MPI 000000000041ED6A Unknown Unknown Unknown
> srun: error: cu02n104: task0: Exited with exit code 174
> srun: Warning: first task terminated 60s ago
>
> **************
> **************
>
>
>
>
> __________________________________________________
> Ping Yang
> EMSL, Molecular Science Computing
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, MSIN K8-83
> Richland, WA 99352 USA
> Tel: 509-371-6405
> Fax: 509-371-6110
> ping.yang.pnl.gov
> www.emsl.pnl.gov
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jul 06 2009 - 10:17:08 PDT