Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Wed, 22 Aug 2012 23:11:31 +0200

Ross,

thanks for your time. Some more comments below.

On 22.08.2012 21:20, Ross Walker wrote:
> Hi Jan,
>
> This really does sound like a flaky card. You could try running some long
> simulations on there, say ones that run for a day. Run them twice; assuming
> ig /= -1, you should get absolutely identical outputs. If the card hangs
> again or the outputs don't match, then that confirms the card is faulty.
> You may want to consider RMA'ing the card. If you have GTX580s in the
> same box it could also be a motherboard compatibility issue. I have seen
> such weird things in the past, so if it hangs again try pulling the two
> GTX580s and see if it then works okay. It could also be power: do you have
> a 1.2 kW or better power supply? Again, pulling the two GTX580s will answer
> that for you.

The machine is built around an EVGA SR-2 mainboard and has a 1250 W
power supply. Power should not be a problem, and hopefully compatibility
is not one either. We will look into that and also into the reproducibility
of trajectories. By the way: when setting ig to -1, is there any way to
extract the actually used random seed from the mdout file, in order to be
able to reproduce an ig=-1 run? I had a quick look and found nothing.
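
If there is no such output, a workaround I am considering for our job
scripts is to pick the seed ourselves, log it, and substitute it for ig=-1
in the input file, so the run stays reproducible. Just a sketch; the file
names (mdin.template, run.log, prmtop, inpcrd) are placeholders for
whatever we actually use here:

# pick and record the seed explicitly instead of relying on ig=-1
SEED=$RANDOM                                  # bash: 0..32767, small enough for ig
echo "random seed for this run: $SEED" >> run.log
sed "s/ig *= *-1/ig = $SEED/" mdin.template > mdin
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd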

[...]
>
> The CUDA Device in use line will always be the same for a serial run. The
> code will pick the GPU with the most memory. It has no knowledge of
> whether other runs are using that GPU so CUDA_VISIBLE_DEVICES should
> always be set to only expose the GPU you wish to run on.

I think we misunderstood each other :-) Let me try again: in the mdout
file, I always see 'CUDA Device ID in use: 0'. It does not matter whether I
set CUDA_VISIBLE_DEVICES to 0, 1, or 2. This is not a severe problem,
since pmemd.cuda runs on the correct device as given by
CUDA_VISIBLE_DEVICES, and the name of the device in use is always
correct in the mdout file. It is just that this ID output line
always shows 0.
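
If I understand the CUDA runtime correctly, this may simply be a
consequence of CUDA_VISIBLE_DEVICES: the devices that remain visible are
renumbered starting from 0 inside the process, so the single exposed GPU
always ends up as device 0. Roughly like this (the device number 1 is just
an example for the C2075 on our box):

# only one GPU exposed; inside pmemd.cuda it is enumerated as device 0
export CUDA_VISIBLE_DEVICES=1
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
grep "CUDA Device" mdout      # the ID line reports 0, the device name is still correct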

> As for the nvidia-smi vs. deviceQuery IDs, this has driven me crazy and I
> have pulled a LOT of hair out over it. Complaining to NVIDIA multiple times
> has not helped at all. :-( They claim it can't be fixed since it depends on
> how the OS enumerates the hardware. The definitive approach is to use
> deviceQuery to work out the IDs of each of your GPUs and then use these
> for CUDA_VISIBLE_DEVICES.

Sounds more like a "don't want to fix" than a "can't fix" :-)
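
For what it's worth, matching the PCI bus IDs that both tools report should
be a safe way to map one numbering onto the other. Something along these
lines (the deviceQuery path is from our CUDA 4.2 SDK install and will
differ elsewhere):

# deviceQuery prints a "PCI Bus ID / PCI location ID" line per GPU
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery | grep -i "pci bus"
# nvidia-smi -q lists each GPU's product name and "Bus Id" in its own ordering
nvidia-smi -q | grep -iE "product name|bus id"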

>
> With regards to the CUDA_VISIBLE_DEVICES being set to an invalid GPU_ID I
> get the following error printed to stderr:
>
> cudaGetDeviceCount failed no CUDA-capable device is detected.
>
> This seems like a reasonable error message to me. Do you not see this
> error?
>

You are right, I got this error. I agree that it is proper behavior to print
the error message to stderr. However, I am not used to this with Amber
programs, because most of the time they write error messages to the output
file rather than to stderr :-)
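
To make sure such messages do not get lost in batch jobs, I will probably
just redirect stderr to a file next to the mdout from now on, e.g.:

# keep pmemd.cuda's stderr next to the mdout file (the file name is arbitrary)
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd 2> pmemd.cuda.stderr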

Thanks a lot for your support,

Jan-Philip


> All the best
> Ross
>
>
>
> On 8/22/12 11:35 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
>> Thanks Ross and Scott so far.
>>
>> I have two GTX580s in the same machine, and with CUDA_VISIBLE_DEVICES set
>> to 0 or 2, which correspond to the GTX580s (according to deviceQuery),
>> the test suite passed. Hence, yes, the problem seems to be hardware
>> dependent. I tried two more times with the Tesla C2075 and saw this error
>> both times:
>>
>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>> cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>
>> After that, the next test hung again.
>>
>> Then, as you suggested, I deactivated ECC for the Tesla card, rebooted
>> and reran the tests with the Tesla card. This time, all tests finished
>> smoothly with the following outcome:
>>
>>> 78 file comparisons passed
>>> 8 file comparisons failed
>>> 0 tests experienced errors
>>
>> This is exactly the same result as with both of the GTX580s. I then
>> switched ECC back on, rebooted, and reran the tests with the Tesla card.
>>
>> No hangup this time, the same result:
>>
>>> 78 file comparisons passed
>>> 8 file comparisons failed
>>> 0 tests experienced errors
>>
>>
>> It looks like a single reboot of the machine somehow healed the Tesla
>> card. I had not tested the card with Amber before. Someone else was
>> running BOINC jobs on it and never complained. This behavior worries me.
>> Is the card somehow broken? Should we complain to the vendor? Or should
>> we just ignore what has happened?
>>
>>
>> What follows now are a few more issues/questions that just came up:
>>
>> - Is it expected behavior that the 'CUDA Device ID in use:' line in the
>> mdout file always shows ID 0, independent of what is set via
>> CUDA_VISIBLE_DEVICES?
>>
>> - If an invalid GPU ID has been chosen via CUDA_VISIBLE_DEVICES, the last
>> information currently printed to the mdout file is the input file. The
>> program exits without an error message.
>>
>> - The IDs given by nvidia-smi are not the same as the deviceQuery IDs.
>> One has to be careful there :-)
>>
>>
>> All the best,
>>
>> Jan-Philip
>>
>>
>> On 08/22/2012 06:04 PM, Ross Walker wrote:
>>> Hi Jan,
>>>
>>> This really sounds to me like a hardware issue. Have you had any issues
>>> with this specific card before? Do you have any other GPUs you could try
>>> in this machine to see if they run fine? I use those Lian Li tower cases
>>> with multiple GTX580s and C2075s as desktop machines in unconditioned
>>> offices so that's certainly not the issue.
>>>
>>> 85C is fine temperature-wise, so it isn't that, and the lockup when
>>> undercooled is a hard lockup: you physically have to power the machine off
>>> and back on to get the GPU working again, so this probably isn't the issue
>>> here. You could see if you can override the fan setting in the NVIDIA
>>> control panel (nvidia-settings) and see if that helps. Also make sure X11
>>> is not running (init 3).
>>>
>>> 295.41 is the same driver version I am running without issue so it isn't
>>> that either. Try turning off ECC on the card and see if that changes the
>>> behavior at all.
>>>
>>> as root:
>>>
>>> nvidia-smi -g 0 --ecc-config=0
>>>
>>> followed by a reboot.
>>>
>>> All the best
>>> Ross
>>>
>>>
>>> On 8/22/12 8:25 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>> wrote:
>>>
>>>> Hey Ross,
>>>>
>>>> thanks for the quick response. Driver version:
>>>>
>>>>> 17:15:01 $ cat /proc/driver/nvidia/version
>>>>> NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6
>>>>> 23:18:58 PDT 2012
>>>>> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>>>>
>>>> This, for example, is the card's state while hanging during cd 4096wat/
>>>> && ./Run.pure_wat:
>>>>
>>>>> 17:05:14 $ nvidia-smi
>>>>> Wed Aug 22 17:15:01 2012
>>>>> +------------------------------------------------------+
>>>>> | NVIDIA-SMI 3.295.41    Driver Version: 295.41        |
>>>>> |-------------------------------+----------------------+----------------------+
>>>>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>>>>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>>>>> |===============================+======================+======================|
>>>>> | 0.  Tesla C2075               | 0000:0C:00.0  On     |        0           0 |
>>>>> |  38%   85 C  P0   95W / 225W  |   3%  144MB / 5375MB |  99%        Default  |
>>>>
>>>> I don't know whether 85 degrees Celsius counts as undercooled. The machine
>>>> (it's a Lian Li tower) actually sits in a decently cooled rack. That's why
>>>> the GPU's fan does not really need to speed up.
>>>>
>>>> Regarding Intel vs. GCC: I have used the same Intel 12.1.3 for several
>>>> Amber builds (Amber 11 CPU and GPU, and Amber 12 CPU). For Amber 11's
>>>> pmemd.cuda everything ran perfectly fine with Intel 12.1.3, but that
>>>> was on different hardware. Yes, I can try rebuilding with the system's
>>>> GCC (4.6.3 in this case).
>>>>
>>>> Again regarding the undercooling issue: you said you normally have to
>>>> restart the machine to get things working again. For me, 3 of 4 tests
>>>> have now hung:
>>>>
>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>
>>>>> Killed
>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
>>>>> ---------------------------------------------
>>>>> Running Extended CUDA Implicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ---------------------------------------------
>>>>> cd trpcage/ && ./Run_md_trpcage SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
>>>>> PASSED
>>>>> ==============================================================
>>>>> cd myoglobin/ && ./Run_md_myoglobin SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> Killed
>>>>> ./Run_md_myoglobin: Program error
>>>>> make[3]: *** [test.pmemd.cuda.gb] Error 1
>>>>> ---------------------------------------------
>>>>> Running Extended CUDA Explicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ---------------------------------------------
>>>>> cd 4096wat/ && ./Run.pure_wat SPDP
>>>>>
>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>> Killed
>>>>> ./Run.pure_wat: Program error
>>>>> make[3]: *** [test.pmemd.cuda.pme] Error 1
>>>>> ------------------------------------
>>>>> Running CUDA Explicit solvent tests.
>>>>> Precision Model = SPDP
>>>>> ------------------------------------
>>>>
>>>> and the next one (4096wat/ && ./Run.vrand) seems to be hanging as well.
>>>> So maybe I have the same issue here?
>>>>
>>>> All the best,
>>>>
>>>> JP
>>>>
>>>>
>>>>
>>>>
>>>> On 08/22/2012 05:08 PM, Ross Walker wrote:
>>>>> Hi Jan,
>>>>>
>>>>> Can you possibly try this with the GNU compilers? I've not tried the
>>>>> very latest Intel so am not sure if that is the problem. I've only seen
>>>>> this problem with undercooled M2090 cards, where the card itself locks
>>>>> up, but that normally requires a reboot to get things working again.
>>>>>
>>>>> What driver are you running btw? cat /proc/driver/nvidia/version
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>>
>>>>>
>>>>> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3 and
>>>>>> NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>>>>>
>>>>>> The test below did not produce output for about 40 minutes (I then
>>>>>> killed the pmemd.cuda process):
>>>>>>
>>>>>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
>>>>>>>
>>>>>>>
>>>>>>> /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>>>>>
>>>>>>> Killed
>>>>>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>>>>>
>>>>>> With "did not produce output" I mean that it did not change the mdout
>>>>>> file anymore. These are the last lines in the mdout file before and
>>>>>> after killing the process:
>>>>>>
>>>>>>
>>>>>>> | Intermolecular bonds treatment:
>>>>>>> | no_intermolecular_bonds = 1
>>>>>>>
>>>>>>> | Energy averages sample interval:
>>>>>>> | ene_avg_sampling = 1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>    3.  ATOMIC COORDINATES AND VELOCITIES
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>
>>>>>>> ACE
>>>>>>> begin time read from input coords = 1050.000 ps
>>>>>>
>>>>>> The tests before this one (about 20 of them) PASSED. While I am writing
>>>>>> this email, the test suite keeps going. However, it is currently hanging at
>>>>>>
>>>>>>> cd myoglobin/ && ./Run_md_myoglobin
>>>>>>
>>>>>> for about 15 minutes already, which is suspicious, right?
>>>>>>
>>>>>> If I have to kill this one too, and maybe others, I will rerun the
>>>>>> tests and see whether the same tests hang or whether they hang randomly.
>>>>>> I'll then get back to you.
>>>>>>
>>>>>> Have you seen such behavior before?
>>>>>>
>>>>>> Any suggestions would be helpful.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jan-Philip
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber