Re: [AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks

From: Scott Le Grand <varelse2005.gmail.com>
Date: Tue, 7 Feb 2012 07:52:49 -0800

Could you try compiling with a relatively recent cut of MVAPICH2?

MPI 2.0 functionality is really flaky across vendors...
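
For what it's worth, a rough sketch of such a build with the Intel compilers
and CUDA support turned on (the version number, paths and options here are
only illustrative; check the notes for whatever release you grab):

  tar xzf mvapich2-1.8.tgz && cd mvapich2-1.8
  ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
      --enable-cuda --with-cuda=/usr/local/cuda \
      --prefix=$HOME/sw/mvapich2-1.8-cuda
  make -j4 && make install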


On Tue, Feb 7, 2012 at 6:45 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Martin,
>
> > No burst bubble; all I'm trying to do is put together some docs for our
> > users. It would be nice to show the poor scaling and stop others from
> > doing the same. I still see a pretty good speed-up on a single node, so
> > it's not all bad. It's just a little weird that the program runs fine on
> > a single node with two GPUs but seg faults when requesting two or four
> > nodes. I don't believe it is a memory issue, but I could be wrong.
>
> It is a driver issue with the MPI for sure. I have seen this before, and
> CUDA_NIC_INTEROP was the recommended fix. If that doesn't fix it (and make
> sure it is set on EVERY node), then we'll need all the specs: OFED driver
> version, MPI version, CUDA version and driver version, plus compiler info,
> etc., in order to escalate this with NVIDIA.
>
> It is related to issues with GPU Direct, which gets used when running over
> multiple nodes; there is some kind of incompatibility with the IB card /
> drivers. I would suggest trying CUDA 4.1, as well as updating the MVAPICH
> version and IB drivers, and seeing if that helps. I'd also put 'export
> CUDA_NIC_INTEROP=1' in /etc/bashrc on all nodes.
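>
> Something along these lines (the hostfile name, process count and input
> file names below are just placeholders):
>
>   # in /etc/bashrc (or ~/.bashrc) on every node:
>   export CUDA_NIC_INTEROP=1
>
>   # or pass it on the mpirun_rsh launch line, then double check it really
>   # is visible on every node:
>   mpirun_rsh -np 4 -hostfile ./hosts CUDA_NIC_INTEROP=1 \
>       $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd
>   mpirun_rsh -np 4 -hostfile ./hosts env | grep CUDA_NIC_INTEROP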
>
> > I was using CUDA/4.0; would 3.2 or 4.1 work any better?
> > Will AMBER 12 resolve this seg fault, or at least give the user more
> > informative debug messages?
>
> This is not an AMBER issue, so AMBER 12 won't resolve it. It is an
> incompatibility between the MPI library, GPU Direct, the IB card and the
> GPU drivers. I can escalate it to NVIDIA if I get ALL the version numbers,
> and they can suggest what versions should be used / what needs updating.
>
> > FLIBS= -L$(LIBDIR) -lsff_mpi -lpbsa $(LIBDIR)/arpack.a \
> >   $(LIBDIR)/libnetcdf.a -Wl,--start-group \
> >   /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a \
> >   /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a \
> >   /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group
>
> Try without MKL as well; that could be causing some issues. Unset MKL_HOME
> and then try doing a complete build again.
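>
> Roughly like this (assuming the AMBER 11 source layout and the Intel
> toolchain; the configure flags and make target may differ for your
> version, so treat it as illustrative):
>
>   unset MKL_HOME
>   cd $AMBERHOME/AmberTools/src
>   ./configure -cuda -mpi intel
>   cd $AMBERHOME/src
>   make clean
>   make cuda_parallel     # builds pmemd.cuda.MPI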
>
> > CXX=icpc
> > CPLUSPLUS=icpc
>
> I assume your MVAPICH was built using the Intel compilers, and the same
> version of the Intel compilers as is referenced here.
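>
> A quick way to check what the MPI wrappers were built with (mpiname is
> MVAPICH2-specific; the exact output varies by version):
>
>   mpicc -show     # shows the underlying C compiler and link line
>   mpif90 -show    # shows the underlying Fortran compiler
>   mpiname -a      # shows the MVAPICH2 version and its configure options
>   which icc icpc ifort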
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 07 2012 - 08:00:03 PST