Re: [AMBER] [Amber 14] multi-GPU calculation error occurs when specifying eight GPUs with Tesla K20Xm

From: Scott Le Grand <varelse2005.gmail.com>
Date: Mon, 9 Jun 2014 09:50:56 -0700

I will examine this situation, but there is no way to get AMBER to scale
PME past 4 GPUs. It is also quite possible that this is an MPI bug of some
sort related to the relatively new RDMA code in MPI.
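
For what it's worth, a quick way to see where scaling stops is to time the
same benchmark at 1, 2, 4, and 8 ranks and compare the ns/day reported at
the end of each mdout. A minimal sketch reusing the mpirun command quoted
further down in this thread (the output file names are placeholders):

  for n in 1 2 4 8; do
    mpirun -np $n -hostfile hostfile pmemd.cuda.MPI \
        -O -i mdin -o mdout.np$n -p prmtop -c inpcrd -r restrt.np$n
    grep "ns/day" mdout.np$n
  done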
On Jun 9, 2014 5:45 AM, "Ross Walker" <ross.rosswalker.co.uk> wrote:

> Hi Keisuke,
>
> My first question is why on earth are you trying to run a calculation
> across 8 GPUs? This is madness with AMBER 12 and doubly so with AMBER
> 14. Did you even check the performance when running on 8 GPUs? It will
> be horrendously slow compared to running on just a single GPU. AMBER 12
> used to scale for C2050-class GPUs, but then the GPUs got much quicker
> while the interconnect and PCI-E bus stagnated, hence no more scaling. If
> you have 8 GPUs I'd recommend running 8 individual calculations.
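>
> For example, on each 2-GPU node you could launch two independent
> single-GPU runs, pinning each copy of the serial GPU engine to its own
> device with CUDA_VISIBLE_DEVICES. A minimal sketch (the directory and
> file names are placeholders, not taken from this thread):
>
>   for i in 0 1; do
>     mkdir -p run$i
>     ( cd run$i && \
>       CUDA_VISIBLE_DEVICES=$i pmemd.cuda -O -i ../mdin -o mdout \
>           -p ../prmtop -c ../inpcrd -r restrt ) &
>   done
>   wait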
>
> AMBER 14 supports peer-to-peer if your motherboard supports it, but it
> really needs PCI-E Gen 3 to function well, and at this time it is
> typically limited to 2 GPUs per node on the same IOH controller. There is
> no way 8 K20s could be used in this configuration.
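>
> To check whether a given pair of GPUs can actually use peer-to-peer, one
> option is the simpleP2P program shipped with the CUDA samples. A minimal
> sketch, assuming the samples are installed under $CUDA_HOME (the path
> varies by installation):
>
>   cd $CUDA_HOME/samples/0_Simple/simpleP2P
>   make
>   ./simpleP2P    # reports whether peer access is supported between GPU pairs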
>
> So while the code itself should not crash, and there is possibly a bug
> here, 8 GPUs with a system as small as DHFR PME is something we never
> test, since there simply isn't a valid use case, so it is unlikely to be
> investigated.
>
> I suggest sticking to more realistic GPU counts and actually monitoring
> the performance of your calculations.
>
> All the best
> Ross
>
>
> On 6/9/14, 4:17 AM, "Keisuke Aono" <kaono.o.cc.titech.ac.jp> wrote:
>
> >Dear Amber-users,
> >
> >I have run into a problem with multi-GPU calculations when using Amber 14.
> >
> >The error occurs when I specify eight GPUs with Tesla K20Xm (4 nodes x
> >2 GPUs/node). On the other hand, when I specify four GPUs, the
> >calculation succeeds (4 nodes x 1 GPU/node or 2 nodes x 2 GPUs/node).
> >This problem did not occur when using Amber 12.
> >
> >Could you give me some advice on this problem?
> >
> >
> >I checked several build combinations and got the following stderr output.
> >
> >- Build with MVAPICH2 1.8.1, Intel Compiler 2013.1.046, CUDA 5.0
> > forrtl: severe (174): SIGSEGV, segmentation fault occurred
> >
> >- Build with MPICH2 3.0.3, Intel Compiler 2013.1.046, CUDA 5.5
> > *** glibc detected *** pmemd.cuda.MPI: double free or corruption
> >(out): 0x0000000007e8e830 ***
> >
> >- Build with OpenMPI 1.4.2, Intel Compiler 2013.1.046, CUDA 5.5
> > [t2a004110:28186] *** Process received signal ***
> > [t2a004110:28186] Signal: Bus error (7)
> > [t2a004110:28186] Signal code: (128)
> > [t2a004110:28186] Failing at address: (nil)
> >
> >- Build with OpenMPI 1.4.2, GNU Compiler 4.3.4, CUDA 5.5
> > [t2a005039:05045] *** Process received signal ***
> > [t2a005039:05045] Signal: Segmentation fault (11)
> > [t2a005039:05045] Signal code: (128)
> > [t2a005039:05045] Failing at address: (nil)
> >
> >
> >In each case, the mdout file shows TEMP(K) = NaN and EPtot overflowing to ***********.
> >
> > check COM velocity, temp:            NaN          NaN(Removed)
> >
> > NSTEP =     1000   TIME(PS) =       8.000  TEMP(K) =        NaN  PRESS =     0.0
> > Etot   =            NaN  EKtot   =            NaN  EPtot      = 974005909.5961
> > BOND   = 617477679.4037  ANGLE   =    252874.7313  DIHED      =      7138.6530
> > 1-4 NB =         0.0000  1-4 EEL =         0.0709  VDWAALS    = 356801602.4831
> > EELEC  =   -533385.7459  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> > ------------------------------------------------------------------------------
> >
> > check COM velocity, temp:            NaN          NaN(Removed)
> >
> > NSTEP =     2000   TIME(PS) =      10.000  TEMP(K) =        NaN  PRESS =     0.0
> > Etot   =            NaN  EKtot   =            NaN  EPtot      = **************
> > BOND   =         0.0000  ANGLE   =    297187.8075  DIHED      =         0.0000
> > 1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    = **************
> > EELEC  = **************  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> > ------------------------------------------------------------------------------
> >
> >
> >Information about our computer cluster and the test case follows.
> >
> >> ./update_amber -version
> >Version is reported as <version>.<patches applied>
> >
> > AmberTools version 14.02
> > Amber version 14.00
> >
> >- SUSE Linux Enterprise Server 11 SP1
> >- GPU: Tesla K20Xm (driver 319.82, CUDA 5.0 or 5.5)
> >- Command (a hostfile sketch is given after this list): mpirun -np 8
> >  -hostfile hostfile pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
> >- Test Case
> > Explicit Solvent(PME) 3.DHFR NVE = 23,558 atoms
> > http://ambermd.org/gpus/benchmarks.htm
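> >
> >For reference, a minimal hostfile sketch for the 8-rank case above (host
> >names are placeholders; the exact syntax depends on the MPI stack, e.g.
> >MPICH/MVAPICH2 Hydra accepts host:slots, while OpenMPI expects
> >"host slots=N"):
> >
> >  node01:2
> >  node02:2
> >  node03:2
> >  node04:2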
> >
> >
> >Best Regards,
> >Keisuke Aono
> >
> >------------------------
> >Keisuke Aono
> >Global Scientific Information and Computing Center(GSIC)
> >Tokyo Institute of Technology
> >Mail:kaono.o.cc.titech.ac.jp
> >2-12-1, Ookayama, Meguro-ku, TOKYO
> >152-8550 JAPAN
> >TEL:+81-3-5754-1375
> >
> >
> >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 09 2014 - 10:00:02 PDT