Re: [AMBER] pmemd.cuda segfaults

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Mon, 03 Mar 2014 17:44:29 +0100

Hello Pavel,

I think that not many of us are building pmemd.cuda with the Intel
compilers. Compared to using GCC, there is no performance gain to be
expected. And since the Intel compilers are not exactly known for
consistent behavior across releases, it may well be that you simply had
bad luck with your particular combination of compiler and CUDA toolkit
versions. Did you actually try CUDA toolkit 5.0 together with the older
Intel compilers as well?

We once had a build configured with

./configure -cuda intel

using icc 13.0.1 20121010 and CUDA 5.0.35. It ran fine on E5-2670 CPUs.
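
If it helps, a build along those lines looks roughly like this. The paths
below are just examples (not our actual setup), and AMBER's configure
expects CUDA_HOME to point at the toolkit you want to link against:

# Intel Composer XE 2013 environment (icc 13.x); adjust the path
source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
export AMBERHOME=/opt/amber12            # example install location
export CUDA_HOME=/usr/local/cuda-5.0     # toolkit to build against
cd $AMBERHOME
./configure -cuda intel
make install
make test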

I guess Ross will be able to tell us if the current pmemd.cuda code has
been successfully tested on CUDA 5.5 + recent Intel compilers.

In my opinion, the next step in diagnosing this would be to build
pmemd.cuda with GCC and see whether the problem persists.
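
Something along these lines should do it (again, the paths are
placeholders; I am assuming the CUDA 5.5 toolkit here):

export CUDA_HOME=/usr/local/cuda-5.5   # or cuda-5.0, whichever you want to test
cd $AMBERHOME
make clean                             # start from a clean tree
./configure -cuda gnu                  # GCC/gfortran instead of the Intel compilers
make install
make test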


Cheers,

Jan-Philip



On 03/03/2014 05:09 PM, pavel.banas.upol.cz wrote:
>
> Dear AMBER users,
>
> Recently we bought a cluster of GPU SuperWorkstations 7047GR-TPRF
> with Super X9DRG-QF motherboards, Intel E5-2620 CPUs, and four GTX Titan
> (GK110, 6 GB) cards per node. We are running NVIDIA driver 331.38 and the
> CUDA toolkit 5.5 (we also tried 5.0) on Debian 7 with kernel 3.11.8.
>
>
>
> While we were able to compile AMBER successfully with the 2013.2.146 Intel
> compilers (AMBER12 patch 21, AmberTools13 patch 22; all tests passed), GPU
> jobs that run for a few hours (pmemd.cuda_SPFP) are highly error-prone: they
> randomly, but quite often, end with segfaults (the crashes are apparently
> too rare to show up during the tests). We first suspected a hardware
> problem, but the segfaults occur randomly across all 64 Titans, so it is
> unlikely that ALL cards are faulty. We therefore suspect that something is
> wrong with our compilation, or with the way the compilers interpret the
> code. We observed these segfaults in NVT simulations (see attached input
> file) on several systems (protein, RNA, DNA, with various numbers of atoms),
> all of which are jobs that run without any problem on other GPU machines.
> From the error messages (there are several different ones, but all of the
> same kind; see the attached example) it seems to have something to do with
> memory errors in the CPU part of the code. Early in our troubleshooting we
> indeed found that the CPU code (pmemd) showed the same kind of problem, but
> we were able to solve that by suppressing the more aggressive SSE
> optimization during compilation and keeping the default SSE4.2. However, it
> seems that a problem remains somewhere between pmemd.cuda, the CUDA toolkit,
> and the driver (or maybe something else??). We tested an older compiler
> (2011.5.220) and an older CUDA toolkit (5.0), all without any effect.
>
>
>
> Do you have any idea what might be going on? Any comments or suggestions
> are highly welcome.
>
>
>
> Thank you very much, have a nice day,
>
> Pavel
>
>
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 03 2014 - 09:00:03 PST