Re: [AMBER] Amber 12 cuda test suite: some tests 'hang' from Ross Walker on 2012-08-22 (Amber Archive Aug 2012)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 22 Aug 2012 09:04:14 -0700

Hi Jan,

This really sounds to me like a hardware issue. Have you had any issues
with this specific card before? Do you have any other GPUs you could try
in this machine to see if they run fine? I use those Lian Li tower cases
with multiple GTX580s and C2075s as desktop machines in offices without
air conditioning, so the case is certainly not the issue.

85C is fine temperature-wise, so it isn't that. Also, the lockup you get
from an undercooled card is a hard lockup: you physically have to power
the machine off and back on to get the GPU working again, so that probably
isn't what is happening here. You could see if you can override the fan
setting in the nvidia control panel (nvidia-settings) and see if that
helps. Also make sure X11 is not running (init 3).
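
If it helps to tell a hung run from a slow one without watching it by
hand, one option is to check whether the mdout file is still being
written. A rough sketch only, assuming GNU stat is available; is_stalled
is a made-up helper name and not part of Amber:

```shell
#!/bin/sh
# is_stalled FILE SECONDS
# Hypothetical helper (not part of Amber). Exit status:
#   0  FILE was last modified at least SECONDS seconds ago
#   1  FILE was modified more recently than that
#   2  FILE could not be examined (e.g. it does not exist)
# Assumes GNU stat, where "stat -c %Y" prints the modification time
# as seconds since the epoch.
is_stalled() {
    file=$1
    limit=$2
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 2
    now=$(date +%s)
    [ $((now - mtime)) -ge "$limit" ]
}
```

For example, "is_stalled mdout 600 && echo stalled" would report a run
whose output has been idle for ten minutes or more. The file name mdout
here is a placeholder; each test writes its own output file.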

295.41 is the same driver version I am running without issue so it isn't
that either. Try turning off ECC on the card and see if that changes the
behavior at all.

as root:

nvidia-smi -g 0 --ecc-config=0

followed by a reboot.

All the best
Ross

On 8/22/12 8:25 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:

>Hey Ross,
>
>thanks for the quick response. Driver version:
>
>> 17:15:01 $ cat /proc/driver/nvidia/version
>> NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
>> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>
>This, for example, is the card's state while hanging during cd 4096wat/
>&& ./Run.pure_wat:
>
>> 17:05:14 $ nvidia-smi
>> Wed Aug 22 17:15:01 2012
>> +------------------------------------------------------+
>> | NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
>> |-------------------------------+----------------------+----------------------+
>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>> |===============================+======================+======================|
>> | 0.  Tesla C2075               | 0000:0C:00.0  On     |         0          0 |
>> |  38%   85 C  P0    95W / 225W |   3%  144MB / 5375MB |   99%     Default    |
>
>I don't know if 85 degrees Celsius counts as undercooled. The machine (it's a
>Lian Li tower) actually is stored in a decently cooled rack. That's why
>the GPU's fan does not really need to speed up.
>
>Regarding Intel vs. GCC: I've used the same Intel 12.1.3 for several
>Amber builds (Amber 11 CPU and GPU and Amber 12 CPU). Amber 11's
>pmemd.cuda ran fine with Intel 12.1.3, but that was on different
>hardware. Yes, I can try rebuilding with the system's GCC (4.6.3 in
>this case).
>
>Again regarding the undercooling issue: you said you normally have to
>restart the machine to get things working again. In my case, 3 of the
>last 4 tests have now hung:
>
>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>
>> Killed
>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>> ---------------------------------------------
>> Running Extended CUDA Implicit solvent tests.
>> Precision Model = SPDP
>> ---------------------------------------------
>> cd trpcage/ && ./Run_md_trpcage SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
>> PASSED
>> ==============================================================
>> cd myoglobin/ && ./Run_md_myoglobin SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>> Killed
>> ./Run_md_myoglobin: Program error
>> make[3]: *** [test.pmemd.cuda.gb] Error 1
>> ---------------------------------------------
>> Running Extended CUDA Explicit solvent tests.
>> Precision Model = SPDP
>> ---------------------------------------------
>> cd 4096wat/ && ./Run.pure_wat SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>> Killed
>> ./Run.pure_wat: Program error
>> make[3]: *** [test.pmemd.cuda.pme] Error 1
>> ------------------------------------
>> Running CUDA Explicit solvent tests.
>> Precision Model = SPDP
>> ------------------------------------
>
>and the next one (4096wat/ && ./Run.vrand) seems to also be hanging. So
>maybe I have the same issue here?
>
>All the best,
>
>JP
>
>
>
>
>On 08/22/2012 05:08 PM, Ross Walker wrote:
>> Hi Jan,
>>
>> Can you possibly try this with the GNU compilers? I've not tried the very
>> latest Intel so am not sure if that is the problem. I've only seen this
>> problem with undercooled M2090 cards where the card itself locks up but
>> that normally requires a reboot to get things working again.
>>
>> What driver are you running btw? cat /proc/driver/nvidia/version
>>
>> All the best
>> Ross
>>
>>
>>
>> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>>
>>> Hello,
>>>
>>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3 and
>>> NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>>
>>> The test below did not produce output for about 40 minutes (I then
>>> killed the pmemd.cuda process):
>>>
>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>
>>>> Killed
>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>
>>> By "did not produce output" I mean that the mdout file stopped
>>> changing. These are the last lines in the mdout file, both before and
>>> after killing the process:
>>>
>>>
>>>> | Intermolecular bonds treatment:
>>>> | no_intermolecular_bonds = 1
>>>>
>>>> | Energy averages sample interval:
>>>> | ene_avg_sampling = 1
>>>>
>>>>
>>>>
>>>>--------------------------------------------------------------------------------
>>>> 3. ATOMIC COORDINATES AND VELOCITIES
>>>>
>>>>
>>>>--------------------------------------------------------------------------------
>>>>
>>>> ACE
>>>> begin time read from input coords = 1050.000 ps
>>>
>>> The tests before this one (about 20 of them) PASSED. While I am writing
>>> this email, the test suite is continuing. However, it is currently
>>> stuck at
>>>
>>>> cd myoglobin/ && ./Run_md_myoglobin
>>>
>>> and has been for about 15 minutes, which is suspicious, right?
>>>
>>> If I have to kill this one too, and maybe others, I will rerun the
>>> tests to see whether the same tests hang again or whether the hangs
>>> are random. I'll then get back to you.
>>>
>>> Have you seen such behavior before?
>>>
>>> Any suggestion would be helpful.
>>>
>>> Thanks,
>>>
>>> Jan-Philip
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber

Received on Wed Aug 22 2012 - 09:30:07 PDT