Re: [AMBER] pmemd.cuda segfaults

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 03 Mar 2014 08:43:12 -0800

Hi Pavel,

Are these 'Titan' cards or Titan-Black cards?

All the best
Ross



On 3/3/14, 8:09 AM, "pavel.banas.upol.cz" <pavel.banas.upol.cz> wrote:

>
>Dear AMBER users,
>
>Recently we bought a cluster of GPU SuperWorkstations 7047GR-TPRF with
>Super X9DRG-QF motherboards, Intel E5-2620 CPUs, and four GTX Titan
>cards (GK110, 6 GB) per node. We run driver 331.38 and CUDA toolkit 5.5
>(we also tried 5.0) on Debian 7 with kernel 3.11.8.
>
>
>
>While we were able to compile AMBER successfully with the Intel
>compilers 2013.2.146 (AMBER12 patch 21, AmberTools13 patch 22; all tests
>passed), the few-hours-long GPU jobs (pmemd.cuda_SPFP) are highly
>error-prone and randomly, but quite often, end with segfaults (on the
>other hand, the segfaults seem to be too rare to occur during the tests).
>We suspected a hardware problem, but the segfaults occur randomly on all
>64 Titans, so I guess it is unlikely that ALL cards are faulty. Instead,
>we suspect that something is wrong with our compilation or with the way
>the compilers are interpreting the code. We observed these segfaults in
>NVT simulations (see attached input file) of several systems (protein,
>RNA, DNA, with any number of atoms), all of them jobs that run without
>any problem on other GPU machines. From the error messages (there are
>several different error messages, but all of the same kind; see the
>attached example of one of them) it seems to have something to do with
>memory leaks in the CPU part. Early in our troubleshooting we indeed
>found that the CPU code (pmemd) had the same kind of problem, but we were
>able to solve that by suppressing the SSE optimization during
>compilation and leaving the default SSE4.2. However, it seems that the
>problem remains somewhere between pmemd.cuda, the CUDA toolkit and the
>driver (or maybe something else?). We tested an older compiler,
>2011.5.220, and an older version of the CUDA toolkit, 5.0, all without
>any effect.
>
>
>
>Do you have any idea what is going on? Any comments or suggestions are
>very welcome.
>
>
>
>Thank you very much, have a nice day,
>
>Pavel
>
>--
>Pavel Banáš
>pavel.banas.upol.cz
>Department of Physical Chemistry,
>Palacky University Olomouc
>Czech Republic
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Mon Mar 03 2014 - 09:00:03 PST