Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 22 Aug 2012 12:20:22 -0700

Hi Jan,

This really does sound like a flaky card. You could try running some long
simulations on it, say ones that run for a day. Run each of them twice:
assuming ig /= -1, the two runs should produce absolutely identical outputs.
If the card hangs again or the outputs don't match, that confirms the card is
faulty and you may want to consider RMA'ing it. If you have GTX580s in the
same box it could also be a motherboard compatibility issue; I have seen such
weird things in the past, so if it hangs again try pulling the two GTX580s
and see if it then works okay. It could also be power: do you have a 1.2 kW
or better power supply? Again, pulling the two GTX580s will answer that for
you.
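
A minimal sketch of such a reproducibility check (the file names are just
placeholders; use one of your own inputs with ig set to a fixed value, e.g.
the default ig=71277):

  export CUDA_VISIBLE_DEVICES=0   # expose only the card under test
  $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o run1.out -r run1.rst
  $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o run2.out -r run2.rst
  diff run1.rst run2.rst          # should print nothing on a healthy card

The timing section of the two mdout files will always differ, so compare the
restart (or trajectory) files rather than the mdouts themselves.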

The BOINC stuff probably never really stressed the card very much. AMBER
is much more aggressive on GPUs. It is also possible the BOINC stuff was
churning out dodgy results and nobody noticed.

The 'CUDA Device ID in use' line will always be the same for a serial run:
the code simply picks the GPU with the most memory. It has no knowledge of
whether other runs are already using that GPU, so CUDA_VISIBLE_DEVICES should
always be set to expose only the GPU you wish to run on.

As for the nvidia-smi vs. deviceQuery numbering, this has driven me crazy and
I have pulled a LOT of hair out over it. Complaining to NVIDIA multiple times
has not helped at all. :-( They claim it can't be fixed since it depends on
how the OS enumerates the hardware. The definitive approach is to use
deviceQuery to work out the IDs of each of your GPUs and then use those IDs
for CUDA_VISIBLE_DEVICES.
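
For example, something along these lines (output abridged; the mapping shown
here is only illustrative, check what deviceQuery reports on your machine):

  # deviceQuery ships with the NVIDIA GPU Computing SDK samples; the exact
  # path depends on where you installed them.
  ./deviceQuery | grep -E 'Device [0-9]+:'
  Device 0: "GeForce GTX 580"
  Device 1: "Tesla C2075"
  Device 2: "GeForce GTX 580"

  # then expose only the card you want pmemd.cuda to see:
  export CUDA_VISIBLE_DEVICES=1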

With regard to CUDA_VISIBLE_DEVICES being set to an invalid GPU ID, I get the
following error printed to stderr:

cudaGetDeviceCount failed no CUDA-capable device is detected.

This seems like a reasonable error message to me. Do you not see this
error?

All the best
Ross



On 8/22/12 11:35 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:

>Thanks Ross and Scott so far.
>
>I have two GTX580s in the same machine and by setting CUDA_VISIBLE_DEVICES
>to 0 or 2, which correspond to the GTX580s (according to deviceQuery), the
>test suite passed. Hence, yes, the problem seems to be hardware-dependent. I
>tried two more times with the Tesla C2075 and saw this error both times:
>
> > cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
> > /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
> > cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
> > ./Run.irest1_ntt2_igb1_ntc2: Program error
> > make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>
>After that, the next test hung up again.
>
>Then, as you suggested, I deactivated ECC for the Tesla card, rebooted, and
>reran the tests on it. This time all tests finished smoothly, with the
>following outcome:
>
>> 78 file comparisons passed
>> 8 file comparisons failed
>> 0 tests experienced errors
>
>This is exactly the same result as with both of the GTX580s. I then switched
>ECC back on, rebooted, and reran the tests with the Tesla card.
>
>No hangup this time, and the same result again:
>
>> 78 file comparisons passed
>> 8 file comparisons failed
>> 0 tests experienced errors
>
>
>It looks like a single reboot of the machine kind of healed the Tesla card.
>I had not tested the card with Amber before. Someone else was running BOINC
>stuff on it and he never complained. This behavior worries me. Is the card
>somehow broken? Should we complain to the vendor? Or should we just ignore
>what has happened?
>
>
>What is following now are a few more issues/questions that just came up:
>
>- Is it expected behavior that the 'CUDA Device ID in use:' line in the
>mdout file always shows ID 0, independent of what is set via
>CUDA_VISIBLE_DEVICES?
>
>- If an invalid GPU ID has been chosen via CUDA_VISIBLE_DEVICES, the last
>information currently printed in the mdout file is the input file; the
>program exits without an error message.
>
>- The IDs given by nvidia-smi are not the same as the deviceQuery IDs.
>One has to be careful there :-)
>
>
>All the best,
>
>Jan-Philip
>
>
>On 08/22/2012 06:04 PM, Ross Walker wrote:
>> Hi Jan,
>>
>> This really sounds to me like a hardware issue. Have you had any issues
>> with this specific card before? Do you have any other GPUs you could try
>> in this machine to see if they run fine? I use those Lian Li tower cases
>> with multiple GTX580s and C2075s as desktop machines in unconditioned
>> offices, so that's certainly not the issue.
>>
>> 85C is fine temperature-wise, so it isn't that, and the lockup when
>> undercooled is a hard lockup: you physically have to power the machine off
>> and back on to get the GPU working again, so this probably isn't the issue
>> here. You could see if you can override the fan setting in the NVIDIA
>> control panel (nvidia-settings) and see if that helps. Also make sure X11
>> is not running (init 3).
>>
>> 295.41 is the same driver version I am running without issue so it isn't
>> that either. Try turning off ECC on the card and see if that changes the
>> behavior at all.
>>
>> as root:
>>
>> nvidia-smi -g 0 --ecc-config=0
>>
>> followed by a reboot.
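>>
>> After the reboot you can double check the ECC state with something like
>> (the exact formatting of the query output varies between driver versions):
>>
>> nvidia-smi -q | grep -i -A 2 'ecc mode'
>>
>> which should report the current and pending ECC mode for each card.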
>>
>> All the best
>> Ross
>>
>>
>> On 8/22/12 8:25 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>>
>>> Hey Ross,
>>>
>>> thanks for the quick response. Driver version:
>>>
>>>> 17:15:01 $ cat /proc/driver/nvidia/version
>>>> NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6
>>>> 23:18:58 PDT 2012
>>>> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>>>
>>> This, for example, is the card's state while hanging during cd 4096wat/
>>> && ./Run.pure_wat:
>>>
>>>> 17:05:14 $ nvidia-smi
>>>> Wed Aug 22 17:15:01 2012
>>>> +------------------------------------------------------+
>>>> | NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
>>>> |-------------------------------+----------------------+----------------------+
>>>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>>>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>>>> |===============================+======================+======================|
>>>> | 0.  Tesla C2075               | 0000:0C:00.0  On     |            0       0 |
>>>> | 38%   85 C  P0    95W / 225W  |   3%  144MB / 5375MB |  99%      Default    |
>>>
>>> I don't know whether 85 degrees Celsius counts as undercooled. The machine
>>> (it's a Lian Li tower) is actually housed in a decently cooled rack, which
>>> is why the GPU's fan does not really need to speed up.
>>>
>>> Regarding Intel vs. GCC: I've used the same Intel 12.1.3 for several
>>> Amber builds (Amber 11 CPU and GPU, and Amber 12 CPU). For Amber 11's
>>> pmemd.cuda everything ran fine with Intel 12.1.3, but that was on
>>> different hardware. Yes, I can try rebuilding with the system's GCC
>>> (4.6.3 in this case).
>>>
>>> Again regarding the undercooling issue: you said you normally have to
>>> restart the machine to get things working again. For me, 3 of 4 tests
>>> have now hung:
>>>
>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>
>>>> Killed
>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>>> ---------------------------------------------
>>>> Running Extended CUDA Implicit solvent tests.
>>>> Precision Model = SPDP
>>>> ---------------------------------------------
>>>> cd trpcage/ && ./Run_md_trpcage SPDP
>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
>>>> PASSED
>>>> ==============================================================
>>>> cd myoglobin/ && ./Run_md_myoglobin SPDP
>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>> Killed
>>>> ./Run_md_myoglobin: Program error
>>>> make[3]: *** [test.pmemd.cuda.gb] Error 1
>>>> ---------------------------------------------
>>>> Running Extended CUDA Explicit solvent tests.
>>>> Precision Model = SPDP
>>>> ---------------------------------------------
>>>> cd 4096wat/ && ./Run.pure_wat SPDP
>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>> Killed
>>>> ./Run.pure_wat: Program error
>>>> make[3]: *** [test.pmemd.cuda.pme] Error 1
>>>> ------------------------------------
>>>> Running CUDA Explicit solvent tests.
>>>> Precision Model = SPDP
>>>> ------------------------------------
>>>
>>> and the next one (4096wat/ && ./Run.vrand) seems to also be hanging. So
>>> maybe I have the same issue here?
>>>
>>> All the best,
>>>
>>> JP
>>>
>>>
>>>
>>>
>>> On 08/22/2012 05:08 PM, Ross Walker wrote:
>>>> Hi Jan,
>>>>
>>>> Can you possibly try this with the GNU compilers? I've not tried the very
>>>> latest Intel so am not sure if that is the problem. I've only seen this
>>>> problem with undercooled M2090 cards, where the card itself locks up, but
>>>> that normally requires a reboot to get things working again.
>>>>
>>>> What driver are you running btw? cat /proc/driver/nvidia/version
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>
>>>>
>>>> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3 and
>>>>> NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>>>>
>>>>> The test below did not produce output for about 40 minutes (I then
>>>>> killed the pmemd.cuda process):
>>>>>
>>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>>
>>>>>> Killed
>>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>>
>>>>> With "did not produce output" I mean that it did not change the mdout
>>>>> file anymore. These are the last lines in the mdout file before and
>>>>> after killing the process:
>>>>>
>>>>>
>>>>>> | Intermolecular bonds treatment:
>>>>>> | no_intermolecular_bonds = 1
>>>>>>
>>>>>> | Energy averages sample interval:
>>>>>> | ene_avg_sampling = 1
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>>    3.  ATOMIC COORDINATES AND VELOCITIES
>>>>>> ------------------------------------------------------------------------------
>>>>>>
>>>>>> ACE
>>>>>> begin time read from input coords = 1050.000 ps
>>>>>
>>>>> The tests before this one (about 20 of them) PASSED. While I am writing
>>>>> this email, the test suite is proceeding. However, it is currently
>>>>> hanging at
>>>>>
>>>>>> cd myoglobin/ && ./Run_md_myoglobin
>>>>>
>>>>> for about 15 minutes already, which is suspicious, right?
>>>>>
>>>>> If I have to kill this one too, and maybe others, I will rerun the tests
>>>>> and see whether the same tests hang or whether they hang randomly. I'll
>>>>> then get back to you.
>>>>>
>>>>> Have you seen such behavior before?
>>>>>
>>>>> Any suggestion would be helpful.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jan-Philip
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 22 2012 - 12:30:02 PDT