Avoid CUDA 4.1 for AMBER...
On Feb 7, 2012 8:05 AM, "Martin Peters" <martin.b.peters.me.com> wrote:
> Thanks Ross/Scott for your suggestions,
>
> The mvapich2 (v1.5.1) library was built with the same intel compilers.
>
> I'm going to try the intel-mpi (v4.0.1.007) library with both CUDA 4.0 and
> 4.1 first. If they fail I'll get the latest mvapich2 (v1.8) installed.
> I'll report back on this thread with an update.
>
> All the best,
> Martin
>
> On 7 Feb 2012, at 15:52, Scott Le Grand wrote:
>
> > Could you try compiling with a relatively recent cut of MVAPICH2?
> >
> > MPI 2.0 functionality is really flaky across vendors...
> >
> >
> > On Tue, Feb 7, 2012 at 6:45 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Martin,
> >>
> >>> No burst bubble; all I'm trying to do is put together some docs for
> >>> our users. It would be nice to show the poor scaling and keep others
> >>> from doing the same. I still see a pretty good speedup on a single
> >>> node, so it's not all bad. It's just a little weird that the program
> >>> runs fine on a single node with two GPUs but segfaults when requesting
> >>> two or four nodes. I don't believe it is a memory issue, but I could
> >>> be wrong.
> >>
> >> It is a driver issue with the MPI for sure. I have seen this before, and
> >> CUDA_NIC_INTEROP was the recommended fix. If that doesn't fix it (and
> >> make sure it is set on EVERY node), then we'll need all the specs (OFED
> >> driver version, MPI version, CUDA version, driver version, compiler
> >> info, etc.) in order to escalate this with NVIDIA.
> >>
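A minimal sketch of gathering the version details requested above into one report, so it can be pasted into a reply. The command names and flags in VERSION_COMMANDS (ofed_info, mpiname, nvcc, nvidia-smi, icc) are assumptions about what is installed on the cluster and are not taken from this thread; swap in whatever applies locally.

```python
import shutil
import subprocess

# Assumed commands (not from the thread); adjust to the local installation.
VERSION_COMMANDS = {
    "OFED": ["ofed_info", "-s"],
    "MPI (MVAPICH2)": ["mpiname", "-a"],
    "CUDA toolkit": ["nvcc", "--version"],
    "GPU driver": ["nvidia-smi"],
    "Compiler": ["icc", "--version"],
}


def collect_versions(commands=VERSION_COMMANDS, timeout=30):
    """Run each command and return {label: output or a short error note}."""
    report = {}
    for label, argv in commands.items():
        # Skip tools that are not on PATH instead of raising.
        if shutil.which(argv[0]) is None:
            report[label] = "not found: " + argv[0]
            continue
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = "failed: %s" % exc
            continue
        # Some tools print their version on stderr, so keep both streams.
        report[label] = (done.stdout + done.stderr).strip()
    return report
```

Calling collect_versions() on each node and printing the label/output pairs gives one block of text per node; a tool that is missing shows up as "not found" rather than stopping the run.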
> >> It is related to issues with GPU Direct, which gets used when running
> >> over multiple nodes; there is some kind of incompatibility with the IB
> >> card / drivers. I would suggest trying CUDA 4.1 as well as updating the
> >> MVAPICH version and IB drivers to see if that helps. I'd also put
> >> 'export CUDA_NIC_INTEROP=1' in /etc/bashrc on all nodes.
> >>
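Since the setting has to be present on every node, a small check may help before launching a multi-node run. This is only a sketch: the host names in the usage note are hypothetical, and how each node's environment gets collected (for example, each MPI rank reporting its own os.environ) is left out.

```python
import os


def interop_enabled(env=None):
    """True when CUDA_NIC_INTEROP is set to 1 in the given environment."""
    env = os.environ if env is None else env
    return env.get("CUDA_NIC_INTEROP", "").strip() == "1"


def hosts_missing_interop(env_by_host):
    """Return the sorted hosts whose environment lacks CUDA_NIC_INTEROP=1."""
    return sorted(
        host for host, env in env_by_host.items() if not interop_enabled(env)
    )
```

For example, hosts_missing_interop({"node1": {"CUDA_NIC_INTEROP": "1"}, "node2": {}}) reports node2 as the node still needing the export.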
> >>> I was using CUDA/4.0; would 3.2 or 4.1 work any better?
> >>> Will AMBER12 resolve this seg fault, or give the user more debug
> >>> messages about it?
> >>
> >> This is not an AMBER issue, so AMBER 12 won't resolve it. It is an
> >> incompatibility between the MPI library, GPU Direct, the IB card and the
> >> GPU drivers. I can escalate it to NVIDIA if I get ALL the version
> >> numbers, and they can suggest what versions should be used / what needs
> >> updating.
> >>
> >>> FLIBS= -L$(LIBDIR) -lsff_mpi -lpbsa $(LIBDIR)/arpack.a
> >>> $(LIBDIR)/libnetcdf.a -Wl,--start-group
> >>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a
> >>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a
> >>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a-Wl,--end-
> >>
> >> Try without MKL as well; that could cause some issues. Unset MKL_HOME
> >> and then try doing a complete build again.
> >>
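After unsetting MKL_HOME and rebuilding, one way to confirm that MKL really dropped out of the link line is to scan the generated build configuration for leftover references. A rough sketch follows; it takes the file contents as a string because the location of the configuration file is not given in this thread.

```python
def mkl_references(config_text):
    """Return (line_number, line) pairs that mention MKL, ignoring case."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if "mkl" in line.lower():
            hits.append((number, line.strip()))
    return hits
```

An empty list suggests the rebuild no longer links against MKL; any remaining hits point at the lines still carrying libmkl paths.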
> >>> CXX=icpc
> >>> CPLUSPLUS=icpc
> >>
> >> I assume your MVAPICH was built using the Intel compilers, and the same
> >> version of the Intel compilers as is referenced here.
> >>
> >> All the best
> >> Ross
> >>
> >> /\
> >> \/
> >> |\oss Walker
> >>
> >> ---------------------------------------------------------
> >> | Assistant Research Professor |
> >> | San Diego Supercomputer Center |
> >> | Adjunct Assistant Professor |
> >> | Dept. of Chemistry and Biochemistry |
> >> | University of California San Diego |
> >> | NVIDIA Fellow |
> >> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> >> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >> ---------------------------------------------------------
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
Received on Tue Feb 07 2012 - 12:00:02 PST