Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Robert Crovella <RCrovella.nvidia.com>
Date: Fri, 24 Aug 2012 06:58:32 -0700

Regarding this comment:

" I think we misunderstood each other :-) Let my try again: in the mdout file, I always see 'CUDA Device ID in use: 0'. It does not matter if I set CUDA_VISIBLE_DEVICES to 0, 1, or 2. This is not a severe problem, since pmemd.cuda runs on the correct device as given by CUDA_VISIBLE_DEVICES. Also the name of the device in use is always correct in the mdout file. It is just that this ID output line constantly shows 0."

I believe this is expected behavior. CUDA devices in a system have a logical enumeration, as has already been mentioned. CUDA_VISIBLE_DEVICES acts as a mask on the actual devices in the system, applied before they are logically enumerated from the point of view of user code. So if I had 6 CUDA devices in the system, without any mask they would be enumerated 0, 1, 2, 3, 4, 5. If I then set a CUDA_VISIBLE_DEVICES mask of "1,3,5", the new logical enumeration would be 0, 1, 2, where the new device 0 corresponds to the old device 1, the new device 1 corresponds to the old device 3, and the new device 2 corresponds to the old device 5.

AMBER reports the logical device(s) in use *after* the mask effect of CUDA_VISIBLE_DEVICES is taken into account. deviceQuery and all CUDA programs should behave similarly, because the intended effect of CUDA_VISIBLE_DEVICES is to make the masked devices "invisible", i.e. the system should behave, from a CUDA user's standpoint, as if those devices were not present. If those devices were not present, you would get the enumeration order I described above.
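
To make this concrete, here is a minimal sketch -- not AMBER code, just the plain CUDA runtime API -- of what a process sees under a given mask. The file and program names are illustrative only. Logical IDs always start at 0 and count only the unmasked devices, while the PCI location ties each logical ID back to a physical card:

    // enum_devices.cu -- illustrative sketch only; compile with nvcc.
    // Example:  CUDA_VISIBLE_DEVICES=1,3,5 ./enum_devices
    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            // This is also the branch you land in when CUDA_VISIBLE_DEVICES
            // names no valid device at all.
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // 'i' is the post-mask logical ID (what AMBER reports);
            // the PCI IDs identify the physical card.
            printf("logical device %d: %s (PCI bus %02x, device %02x)\n",
                   i, prop.name, (unsigned)prop.pciBusID,
                   (unsigned)prop.pciDeviceID);
        }
        return 0;
    }

With the "1,3,5" mask from the example above, this would list three devices numbered 0 through 2, and the reported PCI bus IDs are also the way to reconcile the logical numbering with the IDs shown by nvidia-smi.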

-----Original Message-----
From: Jan-Philip Gehrcke [mailto:jgehrcke.googlemail.com]
Sent: Wednesday, August 22, 2012 4:12 PM
To: AMBER Mailing List
Subject: Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

Ross,

thanks for your time. Some more comments below.

On 22.08.2012 21:20, Ross Walker wrote:
> Hi Jan,
>
> This really does sound like a flaky card. You could try running some
> long simulations on there, say ones that run for a day. Run them twice;
> assuming ig /= -1, you should get absolutely identical outputs. If the
> card hangs again or the outputs don't match, that confirms the card is faulty.
> You may want to consider RMA'ing the card. If you have GTX580s in the
> same box it could also be a motherboard compatibility issue. I have
> seen such weird things in the past, so if it hangs again try pulling
> the two GTX580s and see if it then works okay. It could also be power:
> do you have a 1.2 kW or better power supply? Again, pulling the two
> GTX580s will answer that for you.

The machine is built around an EVGA SR-2 mainboard and has a 1250 W power supply. Power should not be a problem, and compatibility is hopefully not one either. We will look into that and also into the reproducibility of trajectories. Btw: when setting ig to -1, is there any way to extract the actually used random seed from the mdout file, so that an ig=-1 run can be reproduced? I had a quick look and found nothing.

[...]
>
> The CUDA Device in use line will always be the same for a serial run.
> The code will pick the GPU with the most memory. It has no knowledge
> of whether other runs are using that GPU so CUDA_VISIBLE_DEVICES
> should always be set to only expose the GPU you wish to run on.
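
As an illustrative aside, here is a rough sketch of the kind of "most memory" selection described above. This is not pmemd.cuda's actual source, just the plain CUDA runtime API, and it shows why such a choice knows nothing about what other processes are already running on a card:

    // pick_gpu.cu -- illustrative sketch only, not AMBER source; compile with nvcc.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            fprintf(stderr, "no CUDA-capable device is detected\n");
            return 1;
        }
        int best = 0;
        size_t best_mem = 0;
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // Static property only: nothing here reflects whether another
            // process is already busy on this device.
            if (prop.totalGlobalMem > best_mem) {
                best_mem = prop.totalGlobalMem;
                best = i;
            }
        }
        cudaSetDevice(best);  // later allocations and kernels go to this logical device
        printf("selected logical device %d (%zu MB of global memory)\n",
               best, best_mem / (1024 * 1024));
        return 0;
    }

Setting CUDA_VISIBLE_DEVICES to a single GPU ID makes the loop trivial, which is exactly why it should be set explicitly for concurrent serial runs.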

I think we misunderstood each other :-) Let me try again: in the mdout file, I always see 'CUDA Device ID in use: 0'. It does not matter if I set CUDA_VISIBLE_DEVICES to 0, 1, or 2. This is not a severe problem, since pmemd.cuda runs on the correct device as given by CUDA_VISIBLE_DEVICES. Also the name of the device in use is always correct in the mdout file. It is just that this ID output line constantly shows 0.

> As for the nvidia-smi vs deviceQuery IDs, this has driven me crazy and
> I have pulled a LOT of hair out over it. Complaining to NVIDIA multiple
> times has not helped at all. :-( They claim it can't be fixed since it
> depends on how the OS enumerates the hardware. The definitive approach
> is to use deviceQuery to work out the IDs of each of your GPUs and
> then use these for CUDA_VISIBLE_DEVICES.

Sounds more like a "don't want to fix" than a "can't fix" :-) Anyway...

>
> With regards to the CUDA_VISIBLE_DEVICES being set to an invalid
> GPU_ID I get the following error printed to stderr:
>
> cudaGetDeviceCount failed no CUDA-capable device is detected.
>
> This seems like a reasonable error message to me. Do you not see this
> error?
>

You are right, I got this error. I agree that it is proper behavior to print the error message to stderr. However, I am not used to this with Amber programs, because most of the time they write error messages to the output file rather than to stderr :-)

Thanks a lot for your support,

Jan-Philip


> All the best
> Ross
>
>
>
> On 8/22/12 11:35 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
>> Thanks Ross and Scott so far.
>>
>> I have two GTX580s in the same machine, and by setting
>> CUDA_VISIBLE_DEVICES to 0 or 2, which correspond to the GTX580s
>> (according to deviceQuery), the test suite passed. Hence, yes, the
>> problem seems to be hardware-dependent. I tried two more times with
>> the Tesla C2075 and saw this error both times:
>>
>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>> cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>
>> After that, the next test hung up again.
>>
>> Then, as you suggested, I deactivated ECC for the Tesla card,
>> rebooted and reran the tests with the Tesla card. This time, all
>> tests finished smoothly with the following outcome:
>>
>>> 78 file comparisons passed
>>> 8 file comparisons failed
>>> 0 tests experienced errors
>>
>> This is exactly the same result as with both of the GTX580s. I
>> switched ECC on again, rebooted, and reran the tests with the Tesla card.
>>
>> No hangup this time, and the same result:
>>
>>> 78 file comparisons passed
>>> 8 file comparisons failed
>>> 0 tests experienced errors
>>
>>
>> It looks like a single reboot of the machine somehow healed the
>> Tesla card. I had not tested the card with Amber before. Someone else
>> was running BOINC jobs on it and never complained. This behavior worries me.
>> Is the card partially broken? Should we complain to the vendor? Or
>> should we just ignore what has happened?
>>
>>
>> What is following now are a few more issues/questions that just came up:
>>
>> - Is it expected behavior that the 'CUDA Device ID in use:' line in
>> the mdout file always shows ID 0, independent of the value of
>> CUDA_VISIBLE_DEVICES?
>>
>> - If an invalid GPU ID is chosen via CUDA_VISIBLE_DEVICES, the last
>> information currently printed to the mdout file is the input file.
>> The program exits without an error message.
>>
>> - The IDs given by nvidia-smi are not the same as the deviceQuery IDs.
>> One has to be careful there :-)
>>
>>
>> All the best,
>>
>> Jan-Philip
>>
>>
>> On 08/22/2012 06:04 PM, Ross Walker wrote:
>>> Hi Jan,
>>>
>>> This really sounds to me like a hardware issue. Have you had any
>>> issues with this specific card before? Do you have any other GPUs
>>> you could try in this machine to see if they run fine? I use those
>>> Lian Li tower cases with multiple GTX580s and C2075s as desktop
>>> machines in unconditioned offices so that's certainly not the issue.
>>>
>>> 85C is fine temperature-wise, so it isn't that, and the lockup when
>>> undercooled is a hard lockup: you physically have to power the
>>> machine off and back on to get the GPU working again, so that
>>> probably isn't the issue here. You could see if you can override the
>>> fan setting in the NVIDIA control panel (nvidia-settings) and see if
>>> that helps. Also make sure X11 is not running (init 3).
>>>
>>> 295.41 is the same driver version I am running without issue so it
>>> isn't that either. Try turning off ECC on the card and see if that
>>> changes the behavior at all.
>>>
>>> as root:
>>>
>>> nvidia-smi -g 0 --ecc-config=0
>>>
>>> followed by a reboot.
>>>
>>> All the best
>>> Ross
>>>
>>>
>>> On 8/22/12 8:25 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>> wrote:
>>>
>>>> Hey Ross,
>>>>
>>>> thanks for the quick response. Driver version:
>>>>
>>>>> 17:15:01 $ cat /proc/driver/nvidia/version NVRM version: NVIDIA
>>>>> UNIX x86_64 Kernel Module 295.41 Fri Apr 6
>>>>> 23:18:58 PDT 2012
>>>>> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>>>>
>>>> This, for example, is the card's state while hanging during cd
>>>> 4096wat/ && ./Run.pure_wat:
>>>>
>>>>> 17:05:14 $ nvidia-smi
>>>>> Wed Aug 22 17:15:01 2012
>>>>> +------------------------------------------------------+
>>>>> | NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
>>>>> |-------------------------------+----------------------+----------------------+
>>>>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>>>>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>>>>> |===============================+======================+======================|
>>>>> | 0.  Tesla C2075               | 0000:0C:00.0  On     |          0         0 |
>>>>> |  38%   85 C  P0   95W / 225W  |   3%  144MB / 5375MB |  99%      Default    |
>>>>
>>>> I don't know whether 85 degrees Celsius counts as undercooled. The
>>>> machine (it's a Lian Li tower) actually sits in a decently cooled
>>>> rack; that's why the GPU's fan does not really need to speed up.
>>>>
>>>> Regarding Intel vs. GCC: I've used the same Intel 12.1.3 for
>>>> several Amber builds (Amber 11 CPU and GPU, and Amber 12 CPU). For
>>>> Amber 11's pmemd.cuda everything ran fine with Intel 12.1.3, but
>>>> that was on different hardware. Yes, I can try rebuilding with the
>>>> system's GCC (4.6.3 in this case).
>>>>
>>>> Again regarding the undercooling issue: you said you normally have
>>>> to restart the machine to get things working again. For me, 3 out of
>>>> 4 tests have now hung:
>>>>
>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>
>>>>> Killed
>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>>>> ---------------------------------------------
>>>>> Running Extended CUDA Implicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ---------------------------------------------
>>>>> cd trpcage/ && ./Run_md_trpcage SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out PASSED
>>>>> ==============================================================
>>>>> cd myoglobin/ && ./Run_md_myoglobin SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> Killed
>>>>> ./Run_md_myoglobin: Program error
>>>>> make[3]: *** [test.pmemd.cuda.gb] Error 1
>>>>> ---------------------------------------------
>>>>> Running Extended CUDA Explicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ---------------------------------------------
>>>>> cd 4096wat/ && ./Run.pure_wat SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> Killed
>>>>> ./Run.pure_wat: Program error
>>>>> make[3]: *** [test.pmemd.cuda.pme] Error 1
>>>>> ------------------------------------
>>>>> Running CUDA Explicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ------------------------------------
>>>>
>>>> and the next one (4096wat/ && ./Run.vrand) seems to also be
>>>> hanging. So maybe I have the same issue here?
>>>>
>>>> All the best,
>>>>
>>>> JP
>>>>
>>>>
>>>>
>>>>
>>>> On 08/22/2012 05:08 PM, Ross Walker wrote:
>>>>> Hi Jan,
>>>>>
>>>>> Can you possibly try this with the GNU compilers? I've not tried
>>>>> the very latest Intel so am not sure if that is the problem. I've
>>>>> only seen this problem with undercooled M2090 cards where the card
>>>>> itself locks up but that normally requires a reboot to get things
>>>>> working again.
>>>>>
>>>>> What driver are you running btw? cat /proc/driver/nvidia/version
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>>
>>>>>
>>>>> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3
>>>>>> and NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>>>>>
>>>>>> The test below did not produce output for about 40 minutes (I
>>>>>> then killed the pmemd.cuda process):
>>>>>>
>>>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>>>
>>>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>>>
>>>>>>> Killed
>>>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>>>
>>>>>> With "did not produce output" I mean that it did not change the
>>>>>> mdout file anymore. These are the last lines in the mdout file
>>>>>> before and after killing the process:
>>>>>>
>>>>>>
>>>>>>> | Intermolecular bonds treatment:
>>>>>>> | no_intermolecular_bonds = 1
>>>>>>>
>>>>>>> | Energy averages sample interval:
>>>>>>> | ene_avg_sampling = 1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> 3. ATOMIC COORDINATES AND VELOCITIES
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>
>>>>>>> ACE
>>>>>>> begin time read from input coords = 1050.000 ps
>>>>>>
>>>>>> The tests before this one (about 20 of them) PASSED. While I am
>>>>>> writing this email, the test suite keeps going. However, it has
>>>>>> currently been hanging at
>>>>>>
>>>>>>> cd myoglobin/ && ./Run_md_myoglobin
>>>>>>
>>>>>> for about 15 minutes already, which is suspicious, right?
>>>>>>
>>>>>> If I have to kill this one too, and maybe others, I will rerun
>>>>>> the tests and see whether the same tests hang or whether they hang
>>>>>> randomly. I'll then get back to you.
>>>>>>
>>>>>> Have you seen such behavior before?
>>>>>>
>>>>>> Any suggestion would be helpful.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jan-Philip
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Aug 24 2012 - 07:00:02 PDT