Hi Keisuke,
My first question is: why on earth are you trying to run a single calculation
across 8 GPUs? This is madness with AMBER 12 and doubly so with AMBER 14.
Did you even check the performance when running on 8 GPUs? It will be
horrendously slow compared to running on just a single GPU. AMBER 12 used to
scale on C2050-class GPUs, but the GPUs have since become much quicker while
the interconnect and the PCI-E bus stagnated, hence no more scaling. If you
have 8 GPUs I'd recommend running 8 individual calculations.
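For example, something along these lines (only a rough sketch; the per-run
directory layout is my own assumption, and it assumes the input files from
your command line sit in each directory):

    # Sketch only: launch 8 independent single-GPU pmemd.cuda runs, one per
    # device, by restricting each process to a single GPU via
    # CUDA_VISIBLE_DEVICES. Directory names (run0 ... run7) are placeholders.
    for i in 0 1 2 3 4 5 6 7; do
        ( cd run$i && CUDA_VISIBLE_DEVICES=$i \
          pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd ) &
    done
    wait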
AMBER 14 supports peer-to-peer if your motherboard supports it, but it really
needs PCI-E Gen 3 to work well, and at this time it is typically limited to
2 GPUs per node sitting on the same IOH controller. There is no way 8 K20s
could be used in this configuration.
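If you want to see what a given node actually supports, one quick (untested)
check is to build and run the simpleP2P sample that ships with the CUDA
toolkit; the path below assumes a default CUDA 5.5 install, so adjust it to
your setup:

    # Rough sketch, path is an assumption for a default CUDA 5.5 install.
    # simpleP2P reports whether GPUs on the node can enable peer-to-peer
    # access to each other over PCI-E.
    cp -r /usr/local/cuda-5.5/samples/0_Simple/simpleP2P .
    cd simpleP2P && make
    ./simpleP2P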
So while the code itself should not crash, and there is possibly a bug here,
8 GPUs with a system as small as DHFR PME is something we never test since
there simply isn't a valid use case, so it is unlikely to be investigated.
I suggest sticking to more realistic GPU counts and actually monitoring
the performance of your calculations.
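A quick (hypothetical) way to compare runs is to pull the ns/day estimate
that pmemd writes to each run's mdinfo file; the directory names below are
just placeholders for your own runs:

    # Sketch only: compare throughput of a 1-GPU and an 8-GPU run.
    grep "ns/day" run_1gpu/mdinfo run_8gpu/mdinfo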
All the best
Ross
On 6/9/14, 4:17 AM, "Keisuke Aono" <kaono.o.cc.titech.ac.jp> wrote:
>Dear Amber-users,
>
>I have run into a problem with multi-GPU calculations when using Amber 14.
>
>The error occurs when I specify eight Tesla K20Xm GPUs (4 nodes x
>2 GPUs/node).
>On the other hand, when I specify four GPUs, the calculation
>succeeds (4 nodes x 1 GPU/node or 2 nodes x 2 GPUs/node).
>This problem didn't occur when using Amber 12.
>
>Could you give me some advice on this problem?
>
>
>I checked several cases and got the following stderr output.
>
>- Build with MVAPICH2 1.8.1, Intel Compiler 2013.1.046, CUDA 5.0
>    forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
>- Build with MPICH2 3.0.3, Intel Compiler 2013.1.046, CUDA 5.5
>    *** glibc detected *** pmemd.cuda.MPI: double free or corruption
>(out): 0x0000000007e8e830 ***
>
>- Build with OpenMPI 1.4.2, Intel Compiler 2013.1.046, CUDA 5.5
>    [t2a004110:28186] *** Process received signal ***
>    [t2a004110:28186] Signal: Bus error (7)
>    [t2a004110:28186] Signal code:  (128)
>    [t2a004110:28186] Failing at address: (nil)
>
>- Build with OpenMPI 1.4.2, GNU Compiler 4.3.4, CUDA 5.5
>    [t2a005039:05045] *** Process received signal ***
>    [t2a005039:05045] Signal: Segmentation fault (11)
>    [t2a005039:05045] Signal code:  (128)
>    [t2a005039:05045] Failing at address: (nil)
>
>
>In each case, the mdout file shows TEMP = NaN and EPtot = ***********.
>
>    check COM velocity, temp:             NaN      NaN(Removed)
>
>     NSTEP =     1000   TIME(PS) =       8.000  TEMP(K) =      NaN  PRESS =     0.0
>     Etot   =            NaN  EKtot   =            NaN  EPtot      = 974005909.5961
>     BOND   = 617477679.4037  ANGLE   =    252874.7313  DIHED      =      7138.6530
>     1-4 NB =         0.0000  1-4 EEL =         0.0709  VDWAALS    = 356801602.4831
>     EELEC  =   -533385.7459  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>
> ------------------------------------------------------------------------------
>
>    check COM velocity, temp:             NaN      NaN(Removed)
>
>     NSTEP =     2000   TIME(PS) =      10.000  TEMP(K) =      NaN  PRESS =     0.0
>     Etot   =            NaN  EKtot   =            NaN  EPtot      = **************
>     BOND   =         0.0000  ANGLE   =    297187.8075  DIHED      =         0.0000
>     1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    = **************
>     EELEC  = **************  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>
> ------------------------------------------------------------------------------
>
>
>Information about our computer cluster and the test case is as follows.
>
>> ./update_amber -version
>Version is reported as <version>.<patches applied>
>
>        AmberTools version 14.02
>             Amber version 14.00
>
>- SUSE Linux Enterprise Server 11 SP1
>- GPU: (Driver: 319.82, CUDA 5.5 or 5.0, K20Xm)
>- Command: mpirun -np 8 -hostfile hostfile pmemd.cuda.MPI -O -i mdin -o
>mdout -p prmtop -c inpcrd
>- Test Case
>    Explicit Solvent (PME) 3. DHFR NVE = 23,558 atoms
>    http://ambermd.org/gpus/benchmarks.htm
>
>
>Best Regards,
>Keisuke Aono
>
>------------------------
>Keisuke Aono
>Global Scientific Information and Computing Center(GSIC)
>Tokyo Institute of Technology
>Mail:kaono.o.cc.titech.ac.jp
>2-12-1, Ookayama, Meguro-ku, TOKYO
>152-8550 JAPAN
>TEL:+81-3-5754-1375
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber