Re: [AMBER] pmemd.cuda segfaults

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 03 Mar 2014 13:43:45 -0800

Hi Pavel,

I guarantee you this is a hardware issue. Either they are Titan-Black
cards or they are faulty Titan cards. It could also be an overheating
issue - the case you showed is not ducted. One way to check is to pull
all but one of the GPUs and see if it still works. It could be a
motherboard issue, but I doubt that if the machine is otherwise working
properly and CPU runs don't randomly crash.
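
If you want to watch for thermal trouble directly, here is a minimal
sketch (nvidia-smi ships with the NVIDIA driver; the 5 second interval
is arbitrary):

# print the GPU temperature report every 5 seconds while a job runs;
# crashes that coincide with high temperatures point to a cooling problem
watch -n 5 nvidia-smi -q -d TEMPERATURE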

FYI, this is what you should see in mdout for a Titan:

|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 0.88 GHz
|
|--------------------------------------------------------


and for a Titan-Black:

|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN Black
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 15
| CUDA Device Core Freq: 0.98 GHz
|
|--------------------------------------------------------
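
If you want to check without running pmemd, the deviceQuery sample that
ships with the CUDA toolkit reports the same numbers; a minimal sketch,
assuming the samples were installed with a default CUDA 5.0 toolkit:

# a plain Titan reports 14 multiprocessors, a Titan-Black reports 15
cd /usr/local/cuda-5.0/samples/1_Utilities/deviceQuery
make
./deviceQuery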


To test things, build AMBER with GCC and CUDA 5.0 (we recommend 5.0,
since 5.5 and 6.0 show roughly a 5 to 8% performance regression), then
download the following:

https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz

then

export AMBERHOME=/path_to_amber12
tar xvzf GPU_Validation_Test.tar.gz
cd GPU_Validation_Test
./run_test_4gpu.x

Let it run until it completes - about 5 hours - and then let us know
what the four log files contain.
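
For reference, a minimal sketch of the GCC + CUDA 5.0 configure step
(the CUDA install path is an assumption for a default install):

# point AMBER's configure at the CUDA 5.0 toolkit and build the GPU code
export CUDA_HOME=/usr/local/cuda-5.0
cd $AMBERHOME
./configure -cuda gnu
make install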

All the best
Ross



On 3/3/14, 12:44 PM, "pavel.banas.upol.cz" <pavel.banas.upol.cz> wrote:

>
>Dear all,
>
>Thanks for all the comments and suggestions. In order to simplify
>things, I will put all my answers together.
>
>
>
>> Are these 'Titan' cards or Titan-Black cards?
>
>
>
>Actually, I don't know exactly. nvidia-smi says just 'GeForce GTX
>TITAN' and I can't get anything conclusive from lspci, but given the
>core frequency of 0.88 GHz mentioned in the AMBER output, this would
>speak for a TITAN-Black. I will check, but not before tomorrow, as this
>is a remote machine.
>
>
>
>> I think that not many of us are building pmemd.cuda with Intel
>> compilers. Compared to using GCC, there is no performance gain to be
>> expected. And since Intel compilers are not exactly known for
>> equivalent behavior among different releases, it might very well be
>> that you had bad luck in your combination of things. Did you really
>> try CUDA toolkit 5.0 together with older Intel compilers?
>
>
>
>No, we only tested the older compiler with toolkit 5.5 and the older
>toolkit with the new compiler. If we do not have Titan-Black cards, we
>will definitely check GCC, and after that even the combination of the
>older toolkit and the Intel compiler.
>
>
>
>> @Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>> pmemd compiler defined in config.h (to add debugging symbols) and then
>> use idb/gdb to get the stack trace of the resulting core dump. That
>> should at least tell us where the segfault is occurring.
>
>
>
>We indeed compiled the code with "-g" and used gdb; attached you can
>find two of the error messages.
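>
>A minimal sketch of that gdb step (the binary and core-file names are
>assumptions):
>
># print the stack trace from the core dump non-interactively
>gdb -batch -ex bt $AMBERHOME/bin/pmemd.cuda core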
>
>
>
>Anyway, many thanks to all of you. I will definitely check whether we
>have Titans or Titan-Black cards, and if they are Titans, we will try
>the GNU compilers and let you know.
>
>Thank you very much,
>
>Pavel
>
>--
>Pavel Banáš
>pavel.banas.upol.cz
>Department of Physical Chemistry,
>Palacky University Olomouc
>Czech Republic
>
>
>
>---------- Original message ----------
>From: Ross Walker <ross.rosswalker.co.uk>
>To: AMBER Mailing List <amber.ambermd.org>
>Date: 3 Mar 2014 18:07:31
>Subject: Re: [AMBER] pmemd.cuda segfaults
>
>"Forget the debugging. It is a hardware issue for sure.
>
>The main thing to know is whether these are Titan or Titan-Black cards.
>If they are Titan-Black, then all bets are off. Titan-Black cards are
>currently not supported because they give incorrect results; a bug has
>been filed with NVIDIA. Until it is fixed they will remain unusable.
>
>All the best
>Ross
>
>
>
>
>On 3/3/14, 9:01 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>
>>On Mon, 2014-03-03 at 17:44 +0100, Jan-Philip Gehrcke wrote:
>>> Hello Pavel,
>>>
>>> I think that not many of us are building pmemd.cuda with Intel
>>> compilers.
>>
>>FWIW, I always use this combination on my machine with a GTX 680 and an
>>AMD 6-core processor; mainly out of habit. I've never seen any
>>problems.
>>
>>@Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>>pmemd compiler defined in config.h (to add debugging symbols) and then
>>use idb/gdb to get the stack trace of the resulting core dump. That
>>should at least tell us where the segfault is occurring.
>>
>>HTH,
>>Jason
>>
>>--
>>Jason M. Swails
>>BioMaPS,
>>Rutgers University
>>Postdoctoral Researcher
>>
>>
>>_______________________________________________
>>AMBER mailing list
>>AMBER.ambermd.org
>>http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 03 2014 - 15:30:02 PST