Hi all,
I have been trying to get some runs going on the supercomputer GPUs. The specs are below (I get a similar outcome on CPUs):
NVIDIA Maxwell K80 GPU Nodes
1. Node count: 36
2. CPU cores:GPUs/node: 24:4
3. CPU:GPU DRAM/node: 128 GB:40 GB
However, I get the following error message (truncated because of its length):
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:#0 0x2B71C0FD7337#1 0x2B71C0FD794E#2 0x3C3723269F#0 0x2B4D739D5337#0 0x2AD7BE157337#1 0x2B4D739D594E#0 #0 0x0x2B2362D503372B21910CD337
#1 #0 0x2AD7BE15794E0x2B224416D337#00x#2 #0 2B0A5CEBB3370x#0 0x3C3723269F#0 2B604CB313370x0x#0 2B54F1AF8337#1 #1
2B2AD10EA3370x0x0x2B880BD65337#22B21910CD94E0x2B2362D5094E#1
3C3723269F0x#02B224416D94E0x#1 2B8FEB9863370x2B0A5CEBB94E#1 #1 0x0x#0 2B604CB3194E2B54F1AF894E0x
[comet-25-70.sdsc.edu:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 24. MPI process died?
[comet-25-70.sdsc.edu:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 11, pid: 23565) terminated with signal 11 -> abort job
[comet-25-70.sdsc.edu:mpispawn_1][child_handler] MPI process (rank: 26, pid: 6930) terminated with signal 11 -> abort job
[comet-25-68.sdsc.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node comet-25-68 aborted: Error while reading a PMI socket (4)
/opt/amber/bin/pmemd.cuda.MPI: error while loading shared libraries: libcurand.so.8.0: cannot open shared object file: No such file or directory
/opt/amber/bin/pmemd.cuda.MPI: error while loading shared libraries: libcurand.so.8.0: cannot open shared object file: No such file or directory
/opt/amber/bin/pmemd.cuda.MPI: error while loading shared libraries: libcurand.so.8.0: cannot open shared object file: No such file or directory
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 23756) exited with status 127
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 6, pid: 23761) exited with status 127
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 2, pid: 23757) exited with status 127
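The output above appears to mix two different failures: some ranks are terminated with signal 11 (the SIGSEGV lines), while others exit with status 127 next to the libcurand.so.8.0 loader error. A minimal Python sketch for tallying them per rank is below. The log excerpt is copied from the output above; the names `summarize`, `RANK_RE` and `LIB_RE` are hypothetical helpers and not part of Amber or MVAPICH.

```python
import re
from collections import Counter

# Excerpt copied from the job output above (not the full log).
LOG = """\
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 11, pid: 23565) terminated with signal 11 -> abort job
[comet-25-70.sdsc.edu:mpispawn_1][child_handler] MPI process (rank: 26, pid: 6930) terminated with signal 11 -> abort job
/opt/amber/bin/pmemd.cuda.MPI: error while loading shared libraries: libcurand.so.8.0: cannot open shared object file: No such file or directory
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 23756) exited with status 127
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 6, pid: 23761) exited with status 127
[comet-25-68.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 2, pid: 23757) exited with status 127
"""

# Matches the child_handler lines that report how a rank ended.
RANK_RE = re.compile(
    r"MPI process \(rank: (\d+), pid: \d+\) "
    r"(terminated with signal|exited with status) (\d+)"
)
# Matches the dynamic loader message and captures the library name.
LIB_RE = re.compile(r"error while loading shared libraries: (\S+):")


def summarize(log):
    """Return ({rank: (kind, code)}, Counter of missing library names)."""
    failures = {}
    for match in RANK_RE.finditer(log):
        rank = int(match.group(1))
        kind = "signal" if match.group(2).startswith("terminated") else "exit"
        failures[rank] = (kind, int(match.group(3)))
    missing = Counter(LIB_RE.findall(log))
    return failures, missing


failures, missing = summarize(LOG)
for rank in sorted(failures):
    kind, code = failures[rank]
    print(f"rank {rank:2d}: {kind} {code}")
for lib, count in missing.items():
    print(f"missing library: {lib} (reported {count} time(s) in this excerpt)")
```

Running the same function over the complete, untruncated log would show whether every rank on a given node fails the same way.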
This is my minimization file (min1.in):
minimize structure
 &cntrl
  imin=1, maxcyc=20000, ntmin=1, ncyc=5000,
  ntb=1, cut=8, ntwx=500, ioutfm=1, iwrap=1,
  ntr=1, ntwprt=0, restraintmask=':1', restraint_wt=2.0,
 /
 &ewald
 /
This is my 2nd minimization file (min2.in):
minimize structure
 &cntrl
  imin=1, maxcyc=50000, ntmin=1, ncyc=5000,
  iwrap=1, ntwprt=0,
  ntb=1, cut=8, ntwx=500, ioutfm=1,
 /
 &ewald
 /
These are my equilibration files:
equil1.in
CZRA : equilibration
 &cntrl
  nstlim=100000, dt=0.002, ntx=1, irest=0, ntpr=500, ntwr=5000, ntwx=5000,
  tempi=0, temp0=300.0, ntt=3, ig=-1, imin=0,
  ntb=1, cut=8, iwrap=1, ntp=0,
  ntc=2, ntf=2, gamma_ln = 2.0, ioutfm=1,
  ntr=1, restraintmask=':1', restraint_wt=2.0, nmropt=1
 /
 &wt TYPE='TEMP0', istep1=0, istep2=100000, value1=0, value2=300.0, /
 &wt TYPE='END' /
equil2.in
CZRA : equilibration
 &cntrl
  nstlim=500000, dt=0.002, ntx=7, irest=1, ntpr=1000, ntwx=1000,
  tempi=300.0, temp0=300.0, ntt=3, imin=0, ntwv=-1,
  ntb=2, cut=8, ig=-1, ntwr=1000,
  pres0 = 1.0, ntp = 1, iwrap=1, taup = 2.0, ig=-1,
  ntc=2, ntf=2, gamma_ln = 2.0, ioutfm=1,
 /
 &ewald
 /
This is the production file:
CZRA : equilibration
 &cntrl
  nstlim=10000000, dt=0.002, ntx=5, irest=1, ntpr=1000, ntwx=10000,
  tempi=300.0, temp0=300.0, ntt=3, imin=0, ntwv=-1,
  ntb=2, cut=8, ig=-1, ntwr=1000, ntwprt=0,
  pres0 = 1.0, ntp=1, iwrap=1, taup = 2.0, barostat=2,
  ntc=2, ntf=2, gamma_ln = 2.0, ioutfm=1,
 /
 &ewald
 /
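For orientation, the simulated length of each MD stage follows from nstlim × dt (dt is in ps), and the number of trajectory frames from nstlim / ntwx. The short Python sketch below only does that arithmetic on the values listed in the input files above; the dictionary `STAGES` and the function `stage_summary` are hypothetical names used for illustration.

```python
# (nstlim, dt in ps, ntwx) copied from the input files above.
STAGES = {
    "equil1.in": (100000, 0.002, 5000),
    "equil2.in": (500000, 0.002, 1000),
    "production": (10000000, 0.002, 10000),
}


def stage_summary(nstlim, dt_ps, ntwx):
    """Return (simulated time in ps, number of trajectory frames written)."""
    time_ps = nstlim * dt_ps
    frames = nstlim // ntwx  # one frame every ntwx steps
    return time_ps, frames


for name, (nstlim, dt_ps, ntwx) in STAGES.items():
    time_ps, frames = stage_summary(nstlim, dt_ps, ntwx)
    print(f"{name:12s} {time_ps:10.1f} ps ({time_ps / 1000:6.2f} ns), {frames} frames")
```

With these settings the stages come to roughly 0.2 ns, 1 ns and 20 ns respectively.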
This is the mdinfo output:
 NSTEP =        0   TIME(PS) =       0.000  TEMP(K) =     0.00  PRESS =     0.0
 Etot   =   -139172.1824  EKtot   =         0.0000  EPtot      =   -139172.1824
 BOND   =         4.9660  ANGLE   =        16.7275  DIHED      =       162.8560
 1-4 NB =        38.1430  1-4 EEL =       -79.6157  VDWAALS    =     34613.3061
 EELEC  =   -173928.5652  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 Ewald error estimate:   0.3256E-03
 NMR restraints: Bond =    0.000   Angle =     0.000   Torsion =     0.000
===============================================================================
This is the final result of the 2nd minimization file:
FINAL RESULTS
   NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
   13397      -1.2770E+05     3.4370E-03     2.2172E-01     H15       131

 BOND    =    11477.9062  ANGLE   =       16.7275  DIHED      =      162.8560
 VDWAALS =    34613.3061  EEL     =  -173928.5652  HBOND      =        0.0000
 1-4 VDW =       38.1430  1-4 EEL =      -79.6157  RESTRAINT  =        0.0000
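As a quick consistency check on the numbers above, the individual energy terms can be summed and compared with the reported totals. The Python sketch below uses only values copied from the mdinfo block and the FINAL RESULTS block; the variable names are hypothetical. It shows that the two outputs share every term except BOND, and that the BOND difference accounts for the whole difference in total energy. A possible explanation (an assumption, not verified here) is that the MD inputs set ntf=2, which omits bond terms involving hydrogen, while the minimization inputs do not set it.

```python
import math

# Energy terms (kcal/mol) copied from the mdinfo block at NSTEP = 0.
mdinfo_terms = {
    "BOND": 4.9660, "ANGLE": 16.7275, "DIHED": 162.8560,
    "1-4 NB": 38.1430, "1-4 EEL": -79.6157,
    "VDWAALS": 34613.3061, "EELEC": -173928.5652,
    "EHBOND": 0.0, "RESTRAINT": 0.0,
}
mdinfo_eptot = -139172.1824  # EPtot as reported in mdinfo

# Energy terms (kcal/mol) copied from the FINAL RESULTS of the 2nd minimization.
min2_terms = {
    "BOND": 11477.9062, "ANGLE": 16.7275, "DIHED": 162.8560,
    "VDWAALS": 34613.3061, "EEL": -173928.5652, "HBOND": 0.0,
    "1-4 VDW": 38.1430, "1-4 EEL": -79.6157, "RESTRAINT": 0.0,
}
min2_energy = -1.2770e5  # ENERGY as reported (five significant figures)

mdinfo_sum = math.fsum(mdinfo_terms.values())
min2_sum = math.fsum(min2_terms.values())
bond_gap = min2_terms["BOND"] - mdinfo_terms["BOND"]
total_gap = min2_sum - mdinfo_sum

print(f"mdinfo: sum of terms = {mdinfo_sum:.4f}, reported EPtot = {mdinfo_eptot:.4f}")
print(f"min2:   sum of terms = {min2_sum:.4f}, reported ENERGY = {min2_energy:.4e}")
print(f"BOND difference = {bond_gap:.4f}, total difference = {total_gap:.4f}")
```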
--------------------------------------------------------------------------------
   5.  TIMINGS
--------------------------------------------------------------------------------

|    Build the list             0.11 ( 3.40% of List )
|    Other                      3.12 (96.60% of List )
| List time                  3.23 ( 2.49% of Nonbo)
|       Short_ene time            49.47 (79.59% of Direc)
|       Other                     12.69 (20.41% of Direc)
|    Direct Ewald time         62.16 (49.22% of Ewald)
|    Adjust Ewald time          0.42 ( 0.33% of Ewald)
|    Self Ewald time            0.02 ( 0.02% of Ewald)
|       Fill Bspline coeffs        3.78 ( 8.53% of Recip)
|       Fill charge grid           2.31 ( 5.21% of Recip)
|       Scalar sum                 2.91 ( 6.57% of Recip)
|       Grad sum                   3.58 ( 8.08% of Recip)
|          FFT back comm time        23.56 (75.81% of FFT t)
|          Other                      7.52 (24.19% of FFT t)
|       FFT time                  31.07 (70.07% of Recip)
|       Other                      0.69 ( 1.55% of Recip)
|    Recip Ewald time          44.34 (35.11% of Ewald)
|    Force Adjust              13.03 (10.31% of Ewald)
|    Virial junk                6.23 ( 4.93% of Ewald)
|    Start synchronizatio       0.02 ( 0.02% of Ewald)
|    Other                      0.08 ( 0.06% of Ewald)
| Ewald time               126.30 (97.49% of Nonbo)
| Other                      0.02 ( 0.01% of Nonbo)
| Nonbond force            129.55 (80.78% of Force)
| Bond/Angle/Dihedral        0.39 ( 0.24% of Force)
| FRC Collect time          21.28 (13.27% of Force)
| Other                      9.16 ( 5.71% of Force)
| Force time               160.37 (100.0% of Runmd)
| Runmd Time               160.37 (72.41% of Total)
| Other                     61.08 (27.58% of Total)
| Total time               221.46 (100.0% of ALL  )

| Highest rstack allocated:      89990
| Highest istack allocated:       2452
| Job began  at 05:48:29.038  on 04/21/2017
| Setup done at 05:48:29.281  on 04/21/2017
| Run   done at 05:52:10.501  on 04/21/2017
| wallclock() was called  589556 times
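The job timestamps can be cross-checked against the reported total time. The Python sketch below parses the three timestamps printed above and computes the elapsed wall-clock time; the format string `FMT` and the variable names are hypothetical choices for this illustration.

```python
from datetime import datetime

# Timestamps copied from the end of the timing section above.
FMT = "%H:%M:%S.%f %m/%d/%Y"
began = datetime.strptime("05:48:29.038 04/21/2017", FMT)
setup_done = datetime.strptime("05:48:29.281 04/21/2017", FMT)
run_done = datetime.strptime("05:52:10.501 04/21/2017", FMT)

setup_seconds = (setup_done - began).total_seconds()
elapsed_seconds = (run_done - began).total_seconds()
reported_total = 221.46  # "Total time" from the timing table

print(f"setup:   {setup_seconds:.3f} s")
print(f"elapsed: {elapsed_seconds:.3f} s (reported total: {reported_total:.2f} s)")
```

The elapsed time agrees with the reported total to within rounding, which suggests this minimization ran to completion in under four minutes.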
I appreciate any input. If anyone needs the .prmtop or .inpcrd (coordinate) files or other info, please let me know and I can email you a compressed file.
Thanks!
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Apr 21 2017 - 07:00:03 PDT