Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Wed, 22 Aug 2012 17:25:47 +0200

Hey Ross,

thanks for the quick response. Driver version:

> 17:15:01 $ cat /proc/driver/nvidia/version
> NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
> GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
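
For comparing driver versions across machines, the number can be pulled out of that file programmatically. This is only a minimal Python sketch, assuming the first line keeps the "Kernel Module <version>" layout shown above; the function name is made up for illustration and is not part of any NVIDIA or Amber tool.

```python
import re

def parse_nvidia_driver_version(text):
    """Return the driver version string (e.g. '295.41'), or None if absent."""
    # Assumption: the NVRM line has the form '... Kernel Module <version> ...'.
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None

# Sample text copied from the output quoted above.
sample = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 "
    "Fri Apr 6 23:18:58 PDT 2012\n"
    "GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)\n"
)
print(parse_nvidia_driver_version(sample))
```

On a real machine the text would come from reading /proc/driver/nvidia/version instead of the sample string.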

This, for example, is the card's state while hanging during cd 4096wat/
&& ./Run.pure_wat:

> 17:05:14 $ nvidia-smi
> Wed Aug 22 17:15:01 2012
> +------------------------------------------------------+
> | NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
> |-------------------------------+----------------------+----------------------+
> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
> |===============================+======================+======================|
> | 0.  Tesla C2075               | 0000:0C:00.0  On     |         0          0 |
> |  38%   85 C  P0    95W / 225W |   3%  144MB / 5375MB |   99%     Default    |

I don't know whether 85 degrees Celsius counts as undercooled. The machine
(a Lian Li tower) is actually kept in a decently cooled rack, which is why
the GPU's fan does not really need to speed up.

Regarding Intel vs. GCC: I've used the same Intel 12.1.3 for several
Amber builds (Amber 11 CPU and GPU, and Amber 12 CPU). Amber 11's
pmemd.cuda ran fine when built with Intel 12.1.3, but that was on
different hardware. Yes, I can try rebuilding with the system's
GCC (4.6.3 in this case).

Regarding the undercooling issue again: you said that a reboot is normally
needed to get things working again. In my case, three of the last four
tests have now hung:

> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>
> Killed
> ./Run.irest1_ntt2_igb1_ntc2: Program error
> make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
> ---------------------------------------------
> Running Extended CUDA Implicit solvent tests.
> Precision Model = SPDP
> ---------------------------------------------
> cd trpcage/ && ./Run_md_trpcage SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
> diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
> PASSED
> ==============================================================
> cd myoglobin/ && ./Run_md_myoglobin SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
> Killed
> ./Run_md_myoglobin: Program error
> make[3]: *** [test.pmemd.cuda.gb] Error 1
> ---------------------------------------------
> Running Extended CUDA Explicit solvent tests.
> Precision Model = SPDP
> ---------------------------------------------
> cd 4096wat/ && ./Run.pure_wat SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
> Killed
> ./Run.pure_wat: Program error
> make[3]: *** [test.pmemd.cuda.pme] Error 1
> ------------------------------------
> Running CUDA Explicit solvent tests.
> Precision Model = SPDP
> ------------------------------------

and the next one (4096wat/ && ./Run.vrand) seems to also be hanging. So
maybe I have the same issue here?
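
To check whether the same tests fail on every rerun, the make output above can be reduced to one status per test. The following is a rough Python sketch, not part of the Amber test suite: the function name is invented, and it assumes each test starts with a "cd <dir>/ && ./<script>" line and ends with either "PASSED" or "<script>: Program error", as in the log above. Note that "Program error" appears here only because the process was killed by hand; a test that is still hanging leaves no marker and is reported as "no result".

```python
import re

# Matches lines such as 'cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP ...'
TEST_START = re.compile(r"^cd\s+(\S+?)/?\s+&&\s+\./(\S+)")

def summarize_test_log(lines):
    """Map 'directory/script' to 'PASSED', 'Program error', or 'no result'."""
    results = {}
    current = None
    for raw in lines:
        line = raw.strip()
        start = TEST_START.match(line)
        if start:
            current = f"{start.group(1)}/{start.group(2)}"
            results[current] = "no result"
        elif current is None:
            continue  # header lines before the first test
        elif line == "PASSED":
            results[current] = "PASSED"
        elif line.endswith("Program error"):
            results[current] = "Program error"
    return results

# Abbreviated from the log above (the netcdf.mod path argument is omitted).
log = """\
cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP
Killed
./Run.irest1_ntt2_igb1_ntc2: Program error
make[3]: *** [test.pmemd.cuda.gb.serial] Error 1
cd trpcage/ && ./Run_md_trpcage SPDP
diffing trpcage_md.out.GPU_SPDP with trpcage_md.out
PASSED
cd myoglobin/ && ./Run_md_myoglobin SPDP
Killed
./Run_md_myoglobin: Program error
cd 4096wat/ && ./Run.vrand SPDP
""".splitlines()

summary = summarize_test_log(log)
for name, status in summary.items():
    print(f"{status:14s} {name}")
```

Comparing the dictionaries from two runs would show whether the failures are tied to particular tests or occur at random.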

All the best,

JP




On 08/22/2012 05:08 PM, Ross Walker wrote:
> Hi Jan,
>
> Can you possibly try this with the GNU compilers? I've not tried the very
> latest Intel so am not sure if that is the problem. I've only seen this
> problem with undercooled M2090 cards where the card itself locks up but
> that normally requires a reboot to get things working again.
>
> What driver are you running btw? cat /proc/driver/nvidia/version
>
> All the best
> Ross
>
>
>
> On 8/22/12 8:00 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
>> Hello,
>>
>> I have just built Amber 12 cuda on Ubuntu 12.04 with Intel 12.1.3 and
>> NVCC 4.2 V0.2.1221 and then invoked the test suite on a Tesla C2075.
>>
>> The test below did not produce output for about 40 minutes (I then
>> killed the pmemd.cuda process):
>>
>>> cd gb_ala3/ && ./Run.irest1_ntt2_igb1_ntc2 SPDP /apps11/bioinfp/amber12_centos58_intel1213_openmpi16/amber12/include/netcdf.mod
>>>
>>> Killed
>>> ./Run.irest1_ntt2_igb1_ntc2: Program error
>>
>> With "did not produce output" I mean that it did not change the mdout
>> file anymore. These are the last lines in the mdout file before and
>> after killing the process:
>>
>>
>>> | Intermolecular bonds treatment:
>>> | no_intermolecular_bonds = 1
>>>
>>> | Energy averages sample interval:
>>> | ene_avg_sampling = 1
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> 3. ATOMIC COORDINATES AND VELOCITIES
>>>
>>> --------------------------------------------------------------------------------
>>>
>>> ACE
>>> begin time read from input coords = 1050.000 ps
>>
>> The tests before this one (about 20 of them) PASSED. While I am writing
>> this email, the test suite proceeds. However, it currently hangs at
>>
>>> cd myoglobin/ && ./Run_md_myoglobin
>>
>> for already about 15 minutes which is suspicious, right?
>>
>> If I will have to kill this one, too, and maybe others, I will rerun the
>> tests and see if the same tests are hanging or if they are hanging
>> randomly. I'll then get back to you.
>>
>> Have you seen such behavior before?
>>
>> Any suggestion would be helpful.
>>
>> Thanks,
>>
>> Jan-Philip
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
>


Received on Wed Aug 22 2012 - 08:30:05 PDT