Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Wed, 22 Aug 2012 20:35:23 +0200

Thanks Ross and Scott so far.

I have two GTX580s in the same machine. With CUDA_VISIBLE_DEVICES set to 0
or 2, which correspond to the two GTX580s (according to deviceQuery), the
test suite passed; how I pinned the tests to a single card is sketched a
bit further below. Hence, yes, the problem seems to be hardware dependent.
I tried two more times with the Tesla C2075 and saw this error both times:

> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
> cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
> ./Run.irest1_ntt2_igb1_ntc2: Program error
> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1

After that, the next test hung up again.
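
For reference, this is roughly how I pinned the test runs to a single card;
the device index and the test case below are just the ones from this
machine/run, so adjust as needed:

  # make only one physical GPU visible to pmemd.cuda
  # (index 0 as reported by deviceQuery, i.e. one of the GTX580s)
  export CUDA_VISIBLE_DEVICES=0

  # then rerun an individual test case, e.g. the one quoted above
  cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP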

Then, as you suggested, I deactivated ECC on the Tesla card, rebooted, and
reran the tests on it (the exact nvidia-smi calls are sketched after the
results below). This time all tests finished smoothly with the following
outcome:

> 78 file comparisons passed
> 8 file comparisons failed
> 0 tests experienced errors

This is exactly the same result as with both GTX580s. I then switched ECC
back on, rebooted, and reran the tests on the Tesla card once more.

No hangup this time, and the same result:

> 78 file comparisons passed
> 8 file comparisons failed
> 0 tests experienced errors
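
For completeness, the ECC switching was done essentially with the command
you gave, plus the corresponding re-enable and a state check, roughly like
this (GPU index 0 here refers to the nvidia-smi numbering on this machine):

  # as root: disable ECC on the Tesla (nvidia-smi index 0), then reboot
  nvidia-smi -g 0 --ecc-config=0

  # after the test run: re-enable ECC, then reboot again
  nvidia-smi -g 0 --ecc-config=1

  # check the current (and pending) ECC state
  nvidia-smi -q -g 0 | grep -i ecc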


It looks like a single reboot of the machine somehow 'healed' the Tesla
card. I had not tested this card with Amber before; someone else was
running BOINC jobs on it and never complained. This behavior worries me.
Is the card partly broken? Should we complain to the vendor? Or should we
just ignore what happened?


A few more issues/questions that came up along the way:

- Is it expected behavior that the 'CUDA Device ID in use:' line in the
mdout file always shows ID 0, independently of what has been set via
CUDA_VISIBLE_DEVICES?

- If an invalid GPU ID has been chosen via CUDA_VISIBLE_DEVICES, the last
information currently printed to the mdout file is the input file. The
program exits without an error message.

- The IDs given by nvidia-smi are not the same as the deviceQuery IDs. One
has to be careful there :-) A quick way to match them up is sketched below.
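
What I do at the moment to relate the two numbering schemes is to compare
PCI bus IDs: the nvidia-smi table already has a Bus Id column (see the
output I pasted earlier in this thread), and deviceQuery reports a PCI
bus/location ID per CUDA device index. Roughly along these lines; the grep
patterns are ad hoc and may need adjusting:

  # CUDA device indices plus their PCI bus/location IDs
  # (note: deviceQuery may print these in decimal, nvidia-smi shows hex)
  ./deviceQuery | grep -i -e '^Device ' -e 'pci'

  # nvidia-smi indices next to the corresponding bus IDs
  nvidia-smi -q | grep -i -e '^GPU' -e 'bus id'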


All the best,

Jan-Philip


On 08/22/2012 06:04 PM, Ross Walker wrote:
> Hi Jan,
>
> This really sounds to me like a hardware issue. Have you had any issues
> with this specific card before? Do you have any other GPUs you could try
> in this machine to see if they run fine? I use those Lian Li tower cases
> with multiple GTX580s and C2075s as desktop machines in unconditioned
> offices so that's certainly not the issue.
>
> 85C is fine temperature-wise, so it isn't that, and the lockup when
> undercooled is a hard lockup. You physically have to power the machine off
> and back on to get the GPU working again, so this probably isn't the issue
> here. You could see if you can override the fan setting in the nvidia
> control panel (nvidia-settings) and see if that helps. Also make sure X11
> is not running (init 3).
>
> 295.41 is the same driver version I am running without issue so it isn't
> that either. Try turning off ECC on the card and see if that changes the
> behavior at all.
>
> as root:
>
> nvidia-smi -g 0 --ecc-config=0
>
> followed by a reboot.
>
> All the best
> Ross
>
>
> On 8/22/12 8:25 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
>> Hey Ross,
>>
>> thanks for the quick response. Driver version:
>>
>>> 17:15:01 $ cat /proc/driver/nvidia/version
>>> NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6
>>> 23:18:58 PDT 2012
>>> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>>
>> This, for example, is the card's state while hanging during cd 4096wat/
>> && ./Run.pure_wat:
>>
>>> 17:05:14 $ nvidia-smi
>>> Wed Aug 22 17:15:01 2012
>>> +------------------------------------------------------+
>>> | NVIDIA-SMI 3.295.41   Driver Version: 295.41          |
>>> |-------------------------------+----------------------+----------------------+
>>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>>> |===============================+======================+======================|
>>> | 0.  Tesla C2075               | 0000:0C:00.0  On     |          0        0  |
>>> | 38%   85 C   P0   95W / 225W  |   3%  144MB / 5375MB |  99%       Default
>>
>> I don't know whether 85 degrees Celsius counts as undercooled. The machine
>> (it's a Lian Li tower) actually sits in a decently cooled rack; that's why
>> the GPU's fan does not really need to spin up.
>>
>> Regarding Intel vs. GCC: I've used the same Intel 12.1.3 for several
>> Amber builds (Amber 11 CPU and GPU, and Amber 12 CPU). For Amber 11's
>> pmemd.cuda everything ran fine with Intel 12.1.3, but that was on
>> different hardware. Yes, I can try rebuilding with the system's GCC
>> (4.6.3 in this case).
>>
>> Again regarding the undercooling issue: you said you normally have to
>> restart the machine to get things working again. For me, 3 out of 4 tests
>> have hung so far:
>>
>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>
>>> Killed
>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>> ---------------------------------------------
>>> Running Extended CUDA Implicit solvent tests.
>>> Precision Model = SPDP
>>> ---------------------------------------------
>>> cd trpcage/ && ./Run_md_trpcage SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
>>> PASSED
>>> ==============================================================
>>> cd myoglobin/ && ./Run_md_myoglobin SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>> Killed
>>> ./Run_md_myoglobin: Program error
>>> make[3]: *** [test.pmemd.cuda.gb] Error 1
>>> ---------------------------------------------
>>> Running Extended CUDA Explicit solvent tests.
>>> Precision Model = SPDP
>>> ---------------------------------------------
>>> cd 4096wat/ && ./Run.pure_wat SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>> Killed
>>> ./Run.pure_wat: Program error
>>> make[3]: *** [test.pmemd.cuda.pme] Error 1
>>> ------------------------------------
>>> Running CUDA Explicit solvent tests.
>>> Precision Model = SPDP
>>> ------------------------------------
>>
>> and the next one (4096wat/ && ./Run.vrand) also seems to be hanging. So
>> maybe I have the same issue here?
>>
>> All the best,
>>
>> JP
>>
>>
>>
>>
>> On 08/22/2012 05:08 PM, Ross Walker wrote:
>>> Hi Jan,
>>>
>>> Can you possibly try this with the GNU compilers? I've not tried the
>>> very
>>> latest Intel so am not sure if that is the problem. I've only seen this
>>> problem with undercooled M2090 cards where the card itself locks up but
>>> that normally requires a reboot to get things working again.
>>>
>>> What driver are you running btw? cat /proc/driver/nvidia/version
>>>
>>> All the best
>>> Ross
>>>
>>>
>>>
>>> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3 and
>>>> NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>>>
>>>> The test below did not produce output for about 40 minutes (I then
>>>> killed the pmemd.cuda process):
>>>>
>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>
>>>>> Killed
>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>
>>>> With "did not produce output" I mean that it did not change the mdout
>>>> file anymore. These are the last lines in the mdout file before and
>>>> after killing the process:
>>>>
>>>>
>>>>> | Intermolecular bonds treatment:
>>>>> | no_intermolecular_bonds = 1
>>>>>
>>>>> | Energy averages sample interval:
>>>>> | ene_avg_sampling = 1
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>>    3.  ATOMIC COORDINATES AND VELOCITIES
>>>>> --------------------------------------------------------------------------------
>>>>>
>>>>> ACE
>>>>> begin time read from input coords = 1050.000 ps
>>>>
>>>> The tests before this one (about 20 of them) PASSED. While I am writing
>>>> this email the test suite continues; however, it has now been hanging at
>>>>
>>>>> cd myoglobin/ && ./Run_md_myoglobin
>>>>
>>>> for about 15 minutes already, which is suspicious, right?
>>>>
>>>> If I have to kill this one too, and maybe others, I will rerun the
>>>> tests and see whether the same tests hang again or whether the hangs
>>>> occur randomly. I'll then get back to you.
>>>>
>>>> Have you seen such behavior before?
>>>>
>>>> Any suggestion would be helpful.
>>>>
>>>> Thanks,
>>>>
>>>> Jan-Philip
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 22 2012 - 12:00:02 PDT