Re: [AMBER] pmemd.cuda segfaults

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 03 Mar 2014 08:49:31 -0800

I never test the GPU code with the Intel compilers - way too many
combinations to test. That said, there is very little CPU code actually
used in a GPU calculation, so misvectorization by the Intel compiler
(the most likely failure I see with Intel compilers on the CPU side) is
unlikely to affect things.

That said, yes, please test with the GNU compilers first and see if the
problem still exists.
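
Something along these lines should do it (just a sketch - adjust
AMBERHOME, the CUDA location and the test step to your installation;
the test script name may differ between Amber versions):

  export AMBERHOME=/path/to/amber12     # adjust to your install
  export CUDA_HOME=/usr/local/cuda-5.5  # or 5.0
  cd $AMBERHOME
  make clean
  ./configure -cuda gnu
  make install
  # then rerun the GPU tests, e.g.
  cd test && ./test_amber_cuda.sh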

All the best
Ross


On 3/3/14, 8:44 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:

>Hello Pavel,
>
>I think that not many of us build pmemd.cuda with the Intel compilers.
>Compared to GCC, there is no performance gain to be expected, and
>since the Intel compilers are not exactly known for consistent
>behavior across releases, it may well be that you simply had bad luck
>with your particular combination. Did you really try CUDA toolkit 5.0
>together with the older Intel compilers?
>
>We once had a build configured with
>
>./configure -cuda intel
>
>using icc 13.0.1 20121010 and CUDA 5.0.35. It ran fine on E5-2670 CPUs.
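>
>For comparison, it might also help to record the exact toolchain with
>the usual version queries (nothing Amber-specific here):
>
>  icc --version      # Intel compiler release
>  nvcc --version     # CUDA toolkit release
>  gcc --version      # system GCC (nvcc's default host compiler)
>  nvidia-smi         # driver version and GPU model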
>
>I guess Ross will be able to tell us if the current pmemd.cuda code has
>been successfully tested on CUDA 5.5 + recent Intel compilers.
>
>In my opinion, the next step in diagnosing this would be to build
>pmemd.cuda with GCC and see whether the problem persists.
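>
>After such a rebuild it is also worth checking that the jobs really
>pick up the new binary, and which CUDA runtime it is linked against,
>for example:
>
>  which pmemd.cuda_SPFP
>  ldd $AMBERHOME/bin/pmemd.cuda_SPFP | grep -i cuda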
>
>
>Cheers,
>
>Jan-Philip
>
>
>
>On 03/03/2014 05:09 PM, pavel.banas.upol.cz wrote:
>>
>> Dear AMBER users,
>>
>> Recently we bought a cluster of GPU SuperWorkstations 7047GR-TPRF
>> with Super X9DRG-QF motherboards, Intel E5-2620 CPUs, and four GTX
>> Titan (GK110, 6 GB) cards per node. We are running driver 331.38 and
>> CUDA toolkit 5.5 (we also tried 5.0) on Debian 7 with kernel 3.11.8.
>>
>>
>>
>> While we were able to successfully compile AMBER with the 2013.2.146
>> Intel compilers (AMBER12 patch 21, AmberTools13 patch 22; all tests
>> passed), the few-hours-long GPU jobs (pmemd.cuda_SPFP) are highly
>> error-prone and quite often end in segfaults - randomly, but
>> apparently too rarely to show up during the tests. At first we
>> suspected a hardware problem, but the segfaults occur randomly on all
>> 64 Titans, so I find it unlikely that ALL cards are faulty. We
>> therefore suspect that something is wrong with our compilation, or
>> with the way the compilers interpret the code.
>>
>> We observe these segfaults in NVT simulations (see attached input
>> file) on several systems (protein, RNA, DNA, various numbers of
>> atoms), all of them jobs that run without any problem on other GPU
>> machines. Judging from the error messages (there are several
>> different ones, but all of the same kind; see the attached example),
>> it seems to have something to do with memory problems in the CPU
>> part. Early on we found that the CPU code (pmemd) showed the same
>> kind of problem, but we were able to solve that by suppressing the
>> SSE optimization during compilation and keeping the default SSE4.2.
>> However, the problem seems to remain somewhere between pmemd.cuda,
>> the CUDA toolkit and the driver (or maybe something else?). We also
>> tested the older 2011.5.220 compilers and the older CUDA toolkit 5.0,
>> all without any effect.
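>>
>> For completeness, these are plain single-GPU runs launched roughly
>> like this (generic file names):
>>
>>   export CUDA_VISIBLE_DEVICES=0        # one Titan per job
>>   $AMBERHOME/bin/pmemd.cuda_SPFP -O -i md.in -o md.out \
>>       -p system.prmtop -c system.inpcrd -r md.rst -x md.nc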
>>
>>
>>
>> Do you have any idea what is going on? Any comments or suggestions
>> are most welcome.
>>
>>
>>
>> Thank you very much, have a nice day,
>>
>> Pavel
>>
>>
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 03 2014 - 09:00:04 PST