Thanks Ross/Scott for your suggestions,
The MVAPICH2 (v1.5.1) library was built with the same Intel compilers.
I'm going to try the Intel MPI (v4.0.1.007) library with both CUDA 4.0 and 4.1 first. If those fail, I'll get the latest MVAPICH2 (v1.8) installed. I'll report back on this thread with an update.
All the best,
Martin
On 7 Feb 2012, at 15:52, Scott Le Grand wrote:
> Could you try compiling with a relatively recent cut of MVAPICH2?
>
> MPI 2.0 functionality is really flaky across vendors...
>
>
> On Tue, Feb 7, 2012 at 6:45 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Martin,
>>
>>> No burst bubble; all I'm trying to do is put together some docs for our
>>> users. It would be nice to show the poor scaling and save others from
>>> doing the same. I still see pretty good speed-up on a single node, so
>>> it's not all bad. It's just a little weird that the program runs fine on
>>> a single node with two GPUs but seg faults when requesting two or four
>>> nodes. I don't believe it is a memory issue, but I could be wrong.
>>
>> It is a driver issue with the MPI for sure. I have seen this before, and
>> CUDA_NIC_INTEROP was the recommended fix. If that doesn't fix it (and make
>> sure it is set on EVERY node), then we'll need all the specs: OFED driver
>> version, MPI version, CUDA version and driver version, plus compiler info
>> etc., in order to escalate this with NVIDIA.
>>
>> It is related to issues with GPU Direct, which gets used when running over
>> multiple nodes; there is some kind of incompatibility with the IB card /
>> drivers. I would suggest trying CUDA 4.1 as well as updating the MVAPICH
>> version and IB drivers to see if that helps. I'd also put 'export
>> CUDA_NIC_INTEROP=1' in /etc/bashrc on all nodes.
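>>
>> A rough sketch of that on each compute node (assuming bash, and that
>> /etc/bashrc is read by the shells MPI starts on your cluster; node01 and
>> node02 below are placeholders for your hosts):
>>
>>    # append to /etc/bashrc (or equivalent) on every node, as root
>>    echo 'export CUDA_NIC_INTEROP=1' >> /etc/bashrc
>>
>>    # confirm the variable is actually visible to MPI-launched processes,
>>    # e.g. with MVAPICH2's mpirun_rsh
>>    mpirun_rsh -np 2 node01 node02 /usr/bin/env | grep CUDA_NIC_INTEROP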
>>
>>> I was using CUDA 4.0; would 3.2 or 4.1 work any better?
>>> Will AMBER 12 resolve this seg fault, or give the user more debug
>>> messages about it?
>>
>> This is not an AMBER issue, so AMBER 12 won't resolve it. It is an
>> incompatibility between the MPI library, GPU Direct, the IB card and the
>> GPU drivers. I can escalate it to NVIDIA if I get ALL the version numbers,
>> and they can suggest what versions should be used / what needs updating.
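>>
>> Something like the following, run on one of the GPU nodes, should cover
>> most of it (exact commands depend on what's installed; mpiname is the
>> MVAPICH2 tool):
>>
>>    ofed_info -s                      # OFED version
>>    mpiname -a                        # MVAPICH2 version and build settings
>>    nvcc --version                    # CUDA toolkit version
>>    cat /proc/driver/nvidia/version   # NVIDIA driver version
>>    ifort --version; icc --version    # Intel compiler versions
>>    uname -r                          # kernel, for the IB driver stack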
>>
>>> FLIBS= -L$(LIBDIR) -lsff_mpi -lpbsa $(LIBDIR)/arpack.a
>>> $(LIBDIR)/libnetcdf.a -Wl,--start-group
>>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a
>>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a
>>> /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-
>>
>> Try without MKL as well; that could cause some issues. Unset MKL_HOME and
>> then try doing a complete build again.
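>>
>> A minimal sketch of that, assuming a bash shell (the exact directories and
>> make targets depend on your AMBER / AmberTools layout, so repeat your own
>> configure and build commands):
>>
>>    unset MKL_HOME      # so configure no longer picks up MKL
>>    cd $AMBERHOME
>>    make clean          # or clean from the src directory your build uses
>>    # ...then re-run the same configure and make steps used for the
>>    # original pmemd.cuda.MPI build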
>>
>>> CXX=icpc
>>> CPLUSPLUS=icpc
>>
>> I assume your MVAPICH was built using the Intel compilers, and with the
>> same version of the Intel compilers as is referenced here.
>>
>> All the best
>> Ross
>>
>> /\
>> \/
>> |\oss Walker
>>
>> ---------------------------------------------------------
>> | Assistant Research Professor |
>> | San Diego Supercomputer Center |
>> | Adjunct Assistant Professor |
>> | Dept. of Chemistry and Biochemistry |
>> | University of California San Diego |
>> | NVIDIA Fellow |
>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> ---------------------------------------------------------
>>
>> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>> be read every day, and should not be used for urgent or sensitive issues.
>>
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 07 2012 - 08:30:03 PST