[AMBER] [Amber 14] multi-GPU calculation error occurs when specifying eight GPUs with Tesla K20Xm

From: Keisuke Aono <kaono.o.cc.titech.ac.jp>
Date: Mon, 9 Jun 2014 20:17:52 +0900

Dear Amber-users,

I have run into a problem with multi-GPU calculations in Amber 14.

An error occurs when I specify eight Tesla K20Xm GPUs (4 nodes x 2 GPUs/node).
With four GPUs, on the other hand, the calculation succeeds
(4 nodes x 1 GPU/node or 2 nodes x 2 GPUs/node).
This problem did not occur with Amber 12.

Could you give me some advice on this problem?


I tested several builds and got the following output on stderr.

- Build with MVAPICH2 1.8.1, Intel Compiler 2013.1.046, CUDA 5.0
    forrtl: severe (174): SIGSEGV, segmentation fault occurred

- Build with mpich2 3.0.3, Intel Compiler 2013.1.046, CUDA 5.5
    *** glibc detected *** pmemd.cuda.MPI: double free or corruption (out): 0x0000000007e8e830 ***

- Build with OpenMPI 1.4.2, Intel Compiler 2013.1.046, CUDA 5.5
    [t2a004110:28186] *** Process received signal ***
    [t2a004110:28186] Signal: Bus error (7)
    [t2a004110:28186] Signal code: (128)
    [t2a004110:28186] Failing at address: (nil)

- Build with OpenMPI 1.4.2, GNU Compiler 4.3.4, CUDA 5.5
    [t2a005039:05045] *** Process received signal ***
    [t2a005039:05045] Signal: Segmentation fault (11)
    [t2a005039:05045] Signal code: (128)
    [t2a005039:05045] Failing at address: (nil)


In each case, the mdout file shows TEMP = NaN and EPtot = **************.

    check COM velocity, temp: NaN NaN(Removed)

     NSTEP = 1000 TIME(PS) = 8.000 TEMP(K) = NaN PRESS = 0.0
     Etot = NaN EKtot = NaN EPtot = 974005909.5961
     BOND = 617477679.4037 ANGLE = 252874.7313 DIHED = 7138.6530
     1-4 NB = 0.0000 1-4 EEL = 0.0709 VDWAALS = 356801602.4831
     EELEC = -533385.7459 EHBOND = 0.0000 RESTRAINT = 0.0000
     ------------------------------------------------------------------------------

    check COM velocity, temp: NaN NaN(Removed)

     NSTEP = 2000 TIME(PS) = 10.000 TEMP(K) = NaN PRESS = 0.0
     Etot = NaN EKtot = NaN EPtot = **************
     BOND = 0.0000 ANGLE = 297187.8075 DIHED = 0.0000
     1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = **************
     EELEC = ************** EHBOND = 0.0000 RESTRAINT = 0.0000
     ------------------------------------------------------------------------------
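
Records like the ones above can also be screened automatically. The following is a minimal Python sketch, not part of Amber, which assumes that the energy records are printed as "NAME = value" pairs as in the excerpt; the function name find_bad_fields is hypothetical. It reports every field whose value is NaN or has overflowed to asterisks, together with the most recent NSTEP.

```python
import re

# Matches "NAME = value" pairs as they appear in the mdout energy records.
# Field names may contain digits, hyphens, parentheses and spaces
# (for example "1-4 NB" or "TIME(PS)").
FIELD_RE = re.compile(r"([A-Za-z0-9][A-Za-z0-9\-() ]*?)\s*=\s*(\S+)")


def find_bad_fields(text):
    """Return a list of (nstep, name, value) for NaN or overflowed fields.

    nstep is the value of the most recent NSTEP field, or None if no
    valid NSTEP has been seen yet.
    """
    bad = []
    nstep = None
    for line in text.splitlines():
        for name, value in FIELD_RE.findall(line):
            name = name.strip()
            if name == "NSTEP":
                # Remember the current step; ignore it if it is not a number.
                nstep = int(value) if value.isdigit() else None
            elif value.lower() == "nan" or set(value) == {"*"}:
                # NaN, or a Fortran field overflow printed as asterisks.
                bad.append((nstep, name, value))
    return bad


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_mdout.py mdout
    with open(sys.argv[1]) as handle:
        for nstep, name, value in find_bad_fields(handle.read()):
            print(f"NSTEP {nstep}: {name} = {value}")
```

Applied to the output above, this would flag TEMP(K), Etot and EKtot at step 1000, so a failed run can be detected from the first bad record. Lines without an "=" sign, such as the "check COM velocity" message, are not examined.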


Information about our cluster and the test case follows.

> ./update_amber -version
Version is reported as <version>.<patches applied>

        AmberTools version 14.02
             Amber version 14.00

- SUSE Linux Enterprise Server 11 SP1
- GPU: (Driver: 319.82, CUDA 5.5 or 5.0, K20Xm)
- Command: mpirun -np 8 -hostfile hostfile pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
- Test Case
    Explicit Solvent (PME) 3. DHFR NVE = 23,558 atoms
    http://ambermd.org/gpus/benchmarks.htm


Best Regards,
Keisuke Aono

------------------------
Keisuke Aono
Global Scientific Information and Computing Center(GSIC)
Tokyo Institute of Technology
Mail:kaono.o.cc.titech.ac.jp
2-12-1, Ookayama, Meguro-ku, TOKYO
152-8550 JAPAN
TEL:+81-3-5754-1375



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 09 2014 - 04:30:03 PDT