Re: [AMBER] pmemd.cuda segfaults

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 05 Mar 2014 12:49:26 -0800

Hi Pavel,

I can't help much until you run the validation suite I sent you the link
for. Note this is why I have the certified MD-Workstation program on the
AMBER website - to make sure the configuration works, and to test the cards
and software installation before shipping. We've used those Supermicro
X9DRG-QF motherboards before without issue - although you should make sure
you have the latest BIOS installed.

I still suspect bad GPUs - if they all came from the same batch, it is
perfectly feasible that all 4 could be flaky. Until you run the validation
suite, though, it is hard to know whether this is indeed hardware or just
something wrong with the actual simulation you are trying to run.

All the best
Ross



On 3/5/14, 12:09 PM, "pavel.banas.upol.cz" <pavel.banas.upol.cz> wrote:

>
>
>Dear all,
>
>we tried to compile AMBER with the gnu compilers and it does not help much.
>The segfaults are less frequent, but they persist and give the same error
>message. Overheating also does not seem to be the cause, as in most cases
>(but not all) the segfault occurs at the very beginning of the simulation,
>when the temperature of the GPU is still significantly below operating
>temperature. In the meantime, we have updated the drivers on a different
>cluster, so we will be able to move our Titans to that cluster and test
>their hardware. Anyway, it still seems to me that all 64 cards being
>faulty is not very likely.
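>
>For anyone who wants to reproduce this check, logging the GPU temperature
>alongside the run is enough. A minimal sketch, assuming nvidia-smi from a
>reasonably recent driver (the pmemd.cuda input file names below are only
>placeholders):
>
># record the GPU temperature every 5 seconds in the background
>nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv -l 5 > gpu_temp.log &
># start the run that segfaults
>pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o mdout
># stop the temperature logger afterwards
>kill %1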
>
>Do you have any other ideas about what might be wrong... BIOS, motherboard,
>drivers, etc.? Does anybody have the same architecture (GPU
>SuperWorkstations 7047GR-TPRF with Super X9DRG-QF motherboards)?
>
>thanks a lot, Pavel
>
>--
>Pavel Banáš
>pavel.banas.upol.cz
>Department of Physical Chemistry,
>Palacky University Olomouc
>Czech Republic
>
>
>
>---------- Original message ----------
>From: pavel.banas.upol.cz
>To: AMBER Mailing List <amber.ambermd.org>
>Date: 4 Mar 2014 9:13:03
>Subject: Re: [AMBER] pmemd.cuda segfaults
>
>"Hi Ross,
>thank you very much. Actually I have not seen nvidia-smi or amber output
>for
>titan-black card until now. So now it is clear that we have titan cards,
>not
>titan-black. Tomorrow we will be able to compile amber with gcc (as our
>admin has vacacy today), run all tests and I will let you know.
>thanks a lot,
>
>Pavel
>
>
>--
>Pavel Banáš
>pavel.banas.upol.cz
>Department of Physical Chemistry,
>Palacky University Olomouc
>Czech Republic
>
>
>
>---------- Original message ----------
>From: Ross Walker <ross.rosswalker.co.uk>
>To: AMBER Mailing List <amber.ambermd.org>
>Date: 3 Mar 2014 22:50:31
>Subject: Re: [AMBER] pmemd.cuda segfaults
>
>"Hi Pavel,
>
>I guarantee you this is a hardware issue. It is either that they are
>Titan-Black cards or they are faulty Titan cards. It could also be an
>overheating issue - the case you showed is not ducted. The way to check
>that is to pull all but one of the GPUs and see if it works. It could be a
>motherboard issue, but I doubt it if the machine is otherwise working
>properly and CPU runs don't randomly crash.
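>
>If pulling cards is inconvenient, a software-only way to exercise them one
>at a time is to pin the run to a single GPU. A minimal sketch (assuming a
>bash-like shell and placeholder input file names; note this does not reduce
>the thermal load of the other cards the way physically removing them does):
>
># run the same short job on each card in turn, e.g. card 0
>export CUDA_VISIBLE_DEVICES=0
>pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o mdout.gpu0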
>
>FYI this is what you should see in mdout for Titan
>
>|------------------- GPU DEVICE INFO --------------------
>|
>| CUDA Capable Devices Detected: 1
>| CUDA Device ID in use: 0
>| CUDA Device Name: GeForce GTX TITAN
>| CUDA Device Global Mem Size: 6143 MB
>| CUDA Device Num Multiprocessors: 14
>| CUDA Device Core Freq: 0.88 GHz
>|
>|--------------------------------------------------------
>
>
>and for Titan-Black
>
>|------------------- GPU DEVICE INFO --------------------
>|
>| CUDA Capable Devices Detected: 1
>| CUDA Device ID in use: 0
>| CUDA Device Name: GeForce GTX TITAN Black
>| CUDA Device Global Mem Size: 6143 MB
>| CUDA Device Num Multiprocessors: 15
>| CUDA Device Core Freq: 0.98 GHz
>|
>|--------------------------------------------------------
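>
>You can also check the card name and maximum SM clock directly from the
>driver, without running AMBER at all. A minimal sketch, assuming nvidia-smi
>from a reasonably recent driver (the Titan reports a lower SM clock than
>the Titan-Black, consistent with the mdout examples above):
>
># print the name and maximum SM clock of every card in the box
>nvidia-smi --query-gpu=index,name,clocks.max.sm --format=csv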
>
>
>To test things, build AMBER with GCC and CUDA 5.0 (we recommend 5.0 since
>5.5 and 6.0 show about a 5 to 8% performance regression), download the
>following:
>
>https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>
>then
>
>export AMBERHOME=/path_to_amber12
>tar xvzf GPU_Validation_Test.tar.gz
>cd GPU_Validation_Test
>./run_test_4gpu.x
>
>Let it run until it completes - about 5 hours - and then let us know what
>the 4 log files that you get contain.
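>
>For the rebuild itself, something along these lines should work - a sketch
>only, assuming AMBER 12 with the latest bugfixes applied and the CUDA 5.0
>toolkit installed under /usr/local/cuda-5.0 (adjust paths to your system):
>
>export AMBERHOME=/path_to_amber12
>export CUDA_HOME=/usr/local/cuda-5.0
>cd $AMBERHOME
>make clean
>./configure -cuda gnu
>make install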
>
>All the best
>Ross
>
>
>
>On 3/3/14, 12:44 PM, "pavel.banas.upol.cz" <pavel.banas.upol.cz> wrote:
>
>>
>>Dear all,
>>
>>Thanks for all the comments and suggestions. In order to simplify things,
>>I will put all my answers together.
>>
>>
>>
>>> Are these 'Titan' cards or Titan-Black cards?
>>
>>
>>
>>Actually, I don't know exactly. nvidia-smi says just 'GeForce GTX TITAN'
>>and I can't get anything conclusive from lspci, but given the core
>>frequency of 0.88 GHz mentioned in the AMBER output, this would speak for
>>TITAN-Black. I will check it, but not before tomorrow, as this is a remote
>>machine.
>>
>>
>>
>>> I think that not many of us are building pmemd.cuda with Intel
>>> compilers. Compared to using GCC, there is no performance gain to be
>>> expected. And since Intel compilers are not exactly known for equivalent
>>> behavior among different releases, it might very well be that you had
>>> bad luck in your combination of things. Did you really try CUDA toolkit
>>> 5.0 together with older Intel compilers?
>>
>>
>>
>>No, we just tested the older compiler with toolkit 5.5 and the older
>>toolkit with the new compiler. If we do not have Titan-Black cards, we
>>will definitely check gcc and, after that, even the combination of the
>>older toolkit and the Intel compiler.
>>
>>
>>
>>> Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>>> pmemd compiler defined in config.h (to add debugging symbols) and then
>>> use idb/gdb to get the stack trace of the resulting core dump. That
>>> should at least tell us where the segfault is occurring.
>>
>>
>>
>>We indeed compiled the code with "-g" and used gdb; attached you can find
>>two of the error messages.
>>
>>
>>
>>Anyway, many thanks to all of you. I will definitely check whether we have
>>Titan or Titan-Black cards and, if they are Titans, we will try the gnu
>>compilers and let you know.
>>
>>Thank you very much,
>>
>>Pavel
>>
>>--
>>Pavel Banáš
>>pavel.banas.upol.cz
>>Department of Physical Chemistry,
>>Palacky University Olomouc
>>Czech Republic
>>
>>
>>
>>---------- Original message ----------
>>From: Ross Walker <ross.rosswalker.co.uk>
>>To: AMBER Mailing List <amber.ambermd.org>
>>Date: 3 Mar 2014 18:07:31
>>Subject: Re: [AMBER] pmemd.cuda segfaults
>>
>>"Forget the debugging. It is a hardware issue for sure.
>>
>>Main thing to know is if these are Titan or Titan-Black. If they are
>>Titan-Black then all bets are off. Titan-Black are currently not
>>supported
>>due to the fact the cards give incorrect results. A bug is filed with
>>NVIDIA. Until it is fixed they will remain unusable.
>>
>>All the best
>>Ross
>>
>>
>>
>>
>>On 3/3/14, 9:01 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>
>>>On Mon, 2014-03-03 at 17:44 +0100, Jan-Philip Gehrcke wrote:
>>>> Hello Pavel,
>>>>
>>>> I think that not many of us are building pmemd.cuda with Intel
>>>> compilers.
>>>
>>>FWIW, I always use this combination on my machine with a GTX 680 and an
>>>AMD 6-core processor; mainly out of habit. I've never seen any
>>>problems.
>>>
>>>Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>>>pmemd compiler defined in config.h (to add debugging symbols) and then
>>>use idb/gdb to get the stack trace of the resulting core dump. That
>>>should at least tell us where the segfault is occurring.
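>>>
>>>Roughly like this - a sketch only, assuming core dumps are enabled and
>>>the core file is written to the working directory (its exact name varies
>>>by system), and using placeholder input file names:
>>>
>>># allow core files to be written, then reproduce the crash
>>>ulimit -c unlimited
>>>pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o mdout
>>># print the stack trace from the core file
>>>gdb --batch -ex bt $AMBERHOME/bin/pmemd.cuda core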
>>>
>>>HTH,
>>>Jason
>>>
>>>--
>>>Jason M. Swails
>>>BioMaPS,
>>>Rutgers University
>>>Postdoctoral Researcher
>>>
>>>
>>
>>
>>
>
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 05 2014 - 13:00:02 PST