Re: [AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 7 Feb 2012 06:45:53 -0800

Hi Martin,

> No bubble burst - all I'm trying to do is put together some docs for
> our users. It would be nice to show the poor scaling and stop others
> from doing the same. I still see a pretty good speed-up on a single
> node, so it's not all bad. It's just a little weird that the program
> runs fine on a single node with two GPUs but seg faults when
> requesting two or four nodes. I don't believe it is a memory issue,
> but I could be wrong.

It is a driver issue with the MPI for sure. I have seen this before, and
setting CUDA_NIC_INTEROP was the recommended fix. If that doesn't fix it
(and make sure it is set on EVERY node), then we'll need all the specs -
OFED driver version, MPI version, CUDA version, NVIDIA driver version,
compiler info etc. - in order to escalate this with NVIDIA.

It is related to issues with GPUDirect, which gets used when running over
multiple nodes; there is some kind of incompatibility with the IB card /
drivers. I would suggest trying CUDA 4.1 as well as updating the MVAPICH
version and IB drivers to see if that helps. I'd also put 'export
CUDA_NIC_INTEROP=1' in /etc/bashrc on all nodes - see the sketch below.
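
One way to push that out to every node might look like the following.
This is only a sketch: it assumes passwordless root ssh from the head
node and a hypothetical nodes.txt listing the compute nodes, so adapt it
to whatever cluster management tools you have (pdsh etc.).

  # Append the workaround to /etc/bashrc on every compute node listed
  # in nodes.txt (hypothetical file), skipping nodes that already have it.
  while read node; do
      ssh -n root@"$node" \
        'grep -q CUDA_NIC_INTEROP /etc/bashrc ||
         echo "export CUDA_NIC_INTEROP=1" >> /etc/bashrc'
      # Verify a login shell on that node now sees the variable.
      ssh -n root@"$node" 'bash -lc "echo CUDA_NIC_INTEROP=\$CUDA_NIC_INTEROP"'
  done < nodes.txt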

> I was using CUDA/4.0 - would 3.2 or 4.1 work any better?
> Will AMBER12 resolve this seg fault, or at least give the user more
> debug messages about it?

This is not an AMBER issue, and so AMBER 12 won't resolve it. It is an
incompatibility between the MPI library, GPUDirect, the IB card and the
GPU drivers. I can escalate it to NVIDIA if I get ALL the version
numbers, and they can suggest which versions should be used / what needs
updating.
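
If you want to gather everything in one go, something along these lines
(run on one of the compute nodes) would capture it. Note that ofed_info
and mpiname may not be on the PATH on every install, so treat this as a
sketch:

  # Collect the version info NVIDIA will ask for into one report.
  {
    echo "== GPU driver ==";      nvidia-smi
    echo "== CUDA toolkit ==";    nvcc --version
    echo "== OFED / IB stack =="; ofed_info -s
    echo "== MPI ==";             mpiname -a    # MVAPICH2-specific
    echo "== Compiler ==";        icc --version
    echo "== Kernel ==";          uname -a
  } > version_report.txt 2>&1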
 
> FLIBS= -L$(LIBDIR) -lsff_mpi -lpbsa $(LIBDIR)/arpack.a
> $(LIBDIR)/libnetcdf.a -Wl,--start-group
> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a
> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a
> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group

Try without MKL as well; it could be causing some of the issues. Unset
MKL_HOME and then do a complete build again.
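
Roughly along these lines - the exact configure flags depend on your
AMBER/AmberTools version, so treat the '-cuda -mpi intel' below as
illustrative and check the manual for your install:

  # Rebuild from scratch without MKL (sketch; flags are illustrative).
  unset MKL_HOME                # so configure does not pick up MKL
  cd $AMBERHOME
  make clean                    # discard the old MKL-linked objects
  ./configure -cuda -mpi intel
  make install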

> CXX=icpc
> CPLUSPLUS=icpc

I assume your MVAPICH was built using the Intel compilers - and with the
same version of the Intel compilers as is referenced here.
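
An easy way to confirm is to ask the MPI compiler wrappers what they
call under the hood (mpicc -show works for MVAPICH and other
MPICH-derived MPIs):

  # The wrappers should print icc/ifort command lines, not gcc/gfortran.
  mpicc -show
  mpif90 -show
  icc --version    # compare with the version MVAPICH was built against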

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 07 2012 - 07:00:03 PST