Re: [AMBER] pmemd.cuda segfaults

From: <pavel.banas.upol.cz>
Date: Wed, 05 Mar 2014 21:09:15 +0100 (CET)

Dear all,

we tried to compile AMBER with the GNU compilers and it does not help much.
The segfaults are less frequent, but they persist and give the same error
message. Overheating also does not seem to be the cause, as in most cases
(though not all) the segfault occurs at the very beginning of the simulation,
when the GPU temperature is still well below operating temperature. In the
meantime, we have updated the drivers on a different cluster, so we will be
able to move our Titans there and test their hardware. Still, it seems to me
that all 64 cards being faulty is not very likely.

Do you have any other ideas about what might be wrong (BIOS, motherboard,
drivers, etc.)? And does anybody run the same architecture (GPU
SuperWorkstation 7047GR-TPRF with Super X9DRG-QF motherboards)?

thanks a lot, Pavel

-- 
Pavel Banáš
pavel.banas.upol.cz
Department of Physical Chemistry, 
Palacky University Olomouc 
Czech Republic 
---------- Original message ----------
From: pavel.banas.upol.cz
To: AMBER Mailing List <amber.ambermd.org>
Date: 4. 3. 2014 9:13:03
Subject: Re: [AMBER] pmemd.cuda segfaults
"Hi Ross,
thank you very much. Actually, I had not seen nvidia-smi or AMBER output for a
Titan-Black card until now, so it is now clear that we have Titan cards, not
Titan-Black. Tomorrow we will be able to compile AMBER with GCC (our admin is
on vacation today), run all the tests, and I will let you know.
thanks a lot,
Pavel
-- 
Pavel Banáš
pavel.banas.upol.cz
Department of Physical Chemistry, 
Palacky University Olomouc 
Czech Republic 
---------- Original message ----------
From: Ross Walker <ross.rosswalker.co.uk>
To: AMBER Mailing List <amber.ambermd.org>
Date: 3. 3. 2014 22:50:31
Subject: Re: [AMBER] pmemd.cuda segfaults
"Hi Pavel,
I guarantee you this is a hardware issue. Either they are Titan-Black cards,
or they are faulty Titan cards. It could also be an overheating issue; the
case you showed is not ducted, and the way to check that is to pull all but
one of the GPUs and see if it works. It could be a motherboard issue, but I
doubt it if the machine is otherwise working properly and CPU runs don't
randomly crash.
FYI this is what you should see in mdout for Titan
|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 0.88 GHz
|
|--------------------------------------------------------
and for Titan-Black
|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN Black
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 15
| CUDA Device Core Freq: 0.98 GHz
|
|--------------------------------------------------------
To test things, build AMBER with GCC and CUDA 5.0 (we recommend 5.0, since
5.5 and 6.0 show roughly a 5 to 8% performance regression), then download the
following:
https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
then
export AMBERHOME=/path_to_amber12
tar xvzf GPU_Validation_Test.tar.gz
cd GPU_Validation_Test
./run_test_4gpu.x
Let it run until it completes (about 5 hours) and then let us know what the
4 log files that you get contain.
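
(For reference, a GNU + CUDA 5.0 build of AMBER 12 typically looks something
like the sketch below; the CUDA toolkit path is illustrative and should match
your own installation.)

export CUDA_HOME=/usr/local/cuda-5.0   # illustrative path to the CUDA 5.0 toolkit
export AMBERHOME=/path_to_amber12
cd $AMBERHOME
./configure -cuda gnu                  # GPU build with the GNU compilers
make install                           # installs pmemd.cuda into $AMBERHOME/bin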
All the best
Ross
On 3/3/14, 12:44 PM, "pavel.banas.upol.cz" <pavel.banas.upol.cz> wrote:
>
>Dear all,
>
>Thanks for all comments and suggestions. In order to simplify things, I
>will
>put all my answers together.
>
> 
> 
>> Are these 'Titan' cards or Titan-Black cards?
>
> 
> 
>Actually, I don't know exactly. nvidia-smi reports just 'GeForce GTX TITAN'
>and I can't get anything conclusive from lspci, but judging by the 0.88 GHz
>core frequency mentioned in the AMBER output, this would speak for
>TITAN-Black. I will check it, but not before tomorrow, as this is a remote
>machine.
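>
>(For reference, something like the following should pin down the exact board;
>these are standard nvidia-smi/lspci invocations, nothing specific to our
>setup:)
>
>nvidia-smi -L              # lists each GPU with its full product name
>nvidia-smi -q -d CLOCK     # per-GPU clock report (Titan and Titan-Black differ in SM clock)
>lspci -nn | grep -i nvidia # shows the PCI device IDs of the boards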
>
> 
> 
>> I think that not many of us are building pmemd.cuda with Intel
>> compilers. Compared to using GCC, there is no performance gain to be
>> expected. And since Intel compilers are not exactly known for equivalent
>> behavior among different releases, it might very well be that you had
>> bad luck in your combination of things. Did you really try CUDA toolkit
>> 5.0 together with older Intel compilers?
>
> 
> 
>No, we only tested the older compiler with toolkit 5.5 and the older toolkit
>with the new compiler. If we do not have Titan-Black cards, we will
>definitely check GCC, and after that even the combination of the older
>toolkit and the Intel compiler.
>
> 
> 
>> Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>> pmemd compiler defined in config.h (to add debugging symbols) and then
>> use idb/gdb to get the stack trace of the resulting core dump. That
>> should at least tell us where the segfault is occurring.
>
> 
> 
>We indeed compiled the code with "-g" and used gdb; attached you can find two
>of the error messages.
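>
>(For anyone wanting to reproduce this, the core-dump workflow is roughly the
>following sketch; input and core file names are illustrative:)
>
>ulimit -c unlimited                  # allow a core file to be written on the segfault
>$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
>gdb $AMBERHOME/bin/pmemd.cuda core   # load the binary together with the core dump
>(gdb) bt                             # print the stack trace at the point of the crash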
>
> 
> 
>Anyway, many thanks to all of you. I will definitely check whether we have
>Titan or Titan-Black cards, and if they are Titans, we will try the GNU
>compilers and let you know.
>
>Thank you very much,
>
>Pavel
>
>-- 
>Pavel Banáš
>pavel.banas.upol.cz
>Department of Physical Chemistry,
>Palacky University Olomouc
>Czech Republic 
>
>
>
>---------- Original message ----------
>From: Ross Walker <ross.rosswalker.co.uk>
>To: AMBER Mailing List <amber.ambermd.org>
>Date: 3. 3. 2014 18:07:31
>Subject: Re: [AMBER] pmemd.cuda segfaults
>
>"Forget the debugging. It is a hardware issue for sure.
>
>The main thing to know is whether these are Titan or Titan-Black cards. If
>they are Titan-Black, then all bets are off: Titan-Black is currently not
>supported because the cards give incorrect results. A bug has been filed
>with NVIDIA, and until it is fixed they will remain unusable.
>
>All the best
>Ross
>
>
>
>
>On 3/3/14, 9:01 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>
>>On Mon, 2014-03-03 at 17:44 +0100, Jan-Philip Gehrcke wrote:
>>> Hello Pavel,
>>> 
>>> I think that not many of us are building pmemd.cuda with Intel
>>> compilers.
>>
>>FWIW, I always use this combination on my machine with a GTX 680 and an
>>AMD 6-core processor; mainly out of habit. I've never seen any
>>problems.
>>
>>Pavel: If you know how to use gdb/idb, you can add the "-g" flag to the
>>pmemd compiler defined in config.h (to add debugging symbols) and then
>>use idb/gdb to get the stack trace of the resulting core dump. That
>>should at least tell us where the segfault is occurring.
>>
>>HTH,
>>Jason
>>
>>-- 
>>Jason M. Swails
>>BioMaPS,
>>Rutgers University
>>Postdoctoral Researcher
>>
>>
>>_______________________________________________
>>AMBER mailing list
>>AMBER.ambermd.org
>>http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber"
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber"
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber"
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 05 2014 - 12:30:03 PST