Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 31 Jan 2013 10:16:50 -0800

Hi Alessandro,

In addition to Jason's suggestion, and assuming bugfix.14 is correctly
applied, could you try compiling the SPDP version of the code?

./configure -cuda_SPDP gnu
make install

Then run the GPU tests with

cd $AMBERHOME/test
./test_amber_cuda.sh SPDP

And see if that works. The SPFP precision model is really only designed
for GPUs with hardware revision >=2.0 and has not been fully tested on
earlier cards such as the C1060 (I don't have access to anything below a
C2075). If SPDP works on your C1060 then my suggestion would be to use
that precision model going forward on anything pre-Fermi.
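The rule of thumb above can be sketched as a small helper. The compute
capabilities listed are NVIDIA's published numbers for the cards mentioned
in this thread; the function and dictionary names themselves are only
illustrative, not part of Amber:

```python
# Compute capability (major, minor) per NVIDIA's published specs for the
# cards discussed in this thread.
COMPUTE_CAPABILITY = {
    "Tesla C1060": (1, 3),  # GT200, pre-Fermi
    "Tesla C2075": (2, 0),  # Fermi
    "GTX 680":     (3, 0),  # Kepler
}

def recommended_precision_model(gpu):
    """SPFP needs hardware revision >= 2.0; fall back to SPDP otherwise."""
    major, minor = COMPUTE_CAPABILITY[gpu]
    return "SPFP" if (major, minor) >= (2, 0) else "SPDP"
```

So a C1060 would get SPDP, while a C2075 or GTX 680 would get SPFP.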

All the best
Ross



On 1/31/13 9:24 AM, "Jason Swails" <jason.swails.gmail.com> wrote:

>What bug fixes have been applied? You can get this from:
>
>cd $AMBERHOME && ./patch_amber.py --patch-level
>
>I've seen this occur with CUDA 5.0 on a pre-bugfix.14 version of the code.
> bugfix.14 fixed this for me (on a GTX 680).
>
>HTH,
>Jason
>
>On Thu, Jan 31, 2013 at 11:45 AM, Alessandro Contini <
>alessandro.contini.unimi.it> wrote:
>
>> Dear Amber community,
>> I've recently compiled amber12 with cuda support (ubuntu 10.04, x86_64,
>> 2.6.32-45-generic kernel, cuda 5.0, NVIDIA Driver Version: 310.32, GPU
>> Tesla C1060, intel composer XE 2013.1.117, dual intel xeon E5506) and
>> I'm experiencing "hanging" on some tests, in a way similar to what was
>> previously described by Jan-Philip Gehrck on this list. When the
>> "hanging" tests are run by "make test.cuda", the GPU hangs and no output
>> is produced. After killing the test and the corresponding job, no more
>> jobs can be run and the GPU is "unavailable" until the system is
>> rebooted. When I run a test manually without backgrounding it, it still
>> hangs, but it can be killed with ctrl-c and the GPU becomes available
>> again. The tests where I experience this problem are:
>>
>> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.min
>> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.md
>> tip4pew/ && ./Run.tip4pew_box_npt
>> tip4pew/ && ./Run.tip4pew_oct_npt
>> tip5p/ && ./Run.tip5p_oct_npt
>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min
>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.min
>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md
>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md
>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md
>>
>> All other tests run quite fine (in some cases acceptable differences
>> are observed). I also tested the system on a full-length simulation
>> (previously run on standard CPUs) of a tip3p-solvated protein (42032
>> atoms), including minimizations, restrained equilibrations (NVT and NPT)
>> and an unrestrained production run (4 ns), and it worked fine.
>> The card is not overheating (72°C on average during the run).
>> Summarizing, I'm experiencing errors with chamber "cmap" runs and with
>> tip4pew and tip5p NPT runs (however, not with tip3p NPT, which I've
>> tested on my protein).
>>
>> By running "cuda-memcheck pmemd.cuda -o mdout.tip5p_box_npt -r restrt
>> -x mdcrd -p tip5p_box.prmtop -c tip5p_box.inpcrd" I've obtained the
>> following output:
>>
>> ========= CUDA-MEMCHECK
>> Error: unspecified launch failure launching kernel
>> kCalculateCOMKineticEnergy
>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> ========= Invalid __shared__ read of size 4
>> ========= at 0x00003020 in kNLOrientForcesVirial_kernel(void)
>> ========= by thread (254,0,0) in block (2,0,0)
>> ========= Address 0x00001840 is out of bounds
>> ========= Saved host backtrace up to driver entry point at kernel
>> launch time
>> ========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc)
>> [0xc9d5c]
>> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
>> [0x13324]
>> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
>> (cudaLaunch + 0x182) [0x3ac62]
>> ========= Host Frame:pmemd.cuda [0x13572c]
>> ========= Host Frame:pmemd.cuda [0x1336bd]
>> ========= Host Frame:pmemd.cuda [0x1336c8]
>> ========= Host Frame:pmemd.cuda [0x1321fd]
>> ========= Host Frame:pmemd.cuda [0x11e71f]
>> ========= Host Frame:pmemd.cuda [0x4e11d]
>> ========= Host Frame:pmemd.cuda [0x71cd9]
>> ========= Host Frame:pmemd.cuda [0xab2ac]
>> ========= Host Frame:pmemd.cuda [0x42dc]
>> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
>> [0x1ec4d]
>> ========= Host Frame:pmemd.cuda [0x41d9]
>> =========
>> ========= Program hit error 4 on CUDA API call to cudaLaunch
>> ========= Saved host backtrace up to driver entry point at error
>> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
>> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
>> (cudaLaunch + 0x246) [0x3ad26]
>> ========= Host Frame:pmemd.cuda [0x13572c]
>> ========= Host Frame:pmemd.cuda [0x133597]
>> ========= Host Frame:pmemd.cuda [0x1335a2]
>> ========= Host Frame:pmemd.cuda [0x131c36]
>> ========= Host Frame:pmemd.cuda [0x11e734]
>> ========= Host Frame:pmemd.cuda [0x4e11d]
>> ========= Host Frame:pmemd.cuda [0x71cd9]
>> ========= Host Frame:pmemd.cuda [0xab2ac]
>> ========= Host Frame:pmemd.cuda [0x42dc]
>> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
>> [0x1ec4d]
>> ========= Host Frame:pmemd.cuda [0x41d9]
>> =========
>> ========= Program hit error 4 on CUDA API call to cudaGetLastError
>> ========= Saved host backtrace up to driver entry point at error
>> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
>> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
>> (cudaGetLastError + 0x1da) [0x4048a]
>> ========= Host Frame:pmemd.cuda [0x131c3b]
>> ========= Host Frame:pmemd.cuda [0x11e734]
>> ========= Host Frame:pmemd.cuda [0x4e11d]
>> ========= Host Frame:pmemd.cuda [0x71cd9]
>> ========= Host Frame:pmemd.cuda [0xab2ac]
>> ========= Host Frame:pmemd.cuda [0x42dc]
>> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
>> [0x1ec4d]
>> ========= Host Frame:pmemd.cuda [0x41d9]
>> =========
>> ========= Program hit error 4 on CUDA API call to cudaFree
>> ========= Saved host backtrace up to driver entry point at error
>> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
>> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
>> (cudaFree + 0x215) [0x39525]
>> ========= Host Frame:pmemd.cuda [0x12a50a]
>> ========= Host Frame:pmemd.cuda [0x12f4d5]
>> ========= Host Frame:pmemd.cuda [0x1015c6]
>> ========= Host Frame:pmemd.cuda [0x131c6a]
>> ========= Host Frame:pmemd.cuda [0x11e734]
>> ========= Host Frame:pmemd.cuda [0x4e11d]
>> ========= Host Frame:pmemd.cuda [0x71cd9]
>> ========= Host Frame:pmemd.cuda [0xab2ac]
>> ========= Host Frame:pmemd.cuda [0x42dc]
>> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
>> [0x1ec4d]
>> ========= Host Frame:pmemd.cuda [0x41d9]
>> =========
>> ========= ERROR SUMMARY: 4 errors
>>
>>
>> No errors were obtained by running "cuda-memcheck pmemd.cuda -O -o
>> mdout.tip5p_box_nvt -r restrt -x mdcrd -p tip5p_box.prmtop -c
>> tip5p_box.inpcrd"
>>
>> As usual, any suggestion would be greatly appreciated.
>>
>> Best regards
>>
>> Alessandro
>>
>>
>>
>>
>> --
>> Alessandro Contini, PhD
>> Dipartimento di Scienze Farmaceutiche
>> Sezione di Chimica Generale e Organica "A. Marchesini"
>> Via Venezian, 21 20133 Milano
>> tel. +390250314480
>> e-mail alessandro.contini.unimi.it
>> skype alessandrocontini
>>
>>
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
>
>--
>Jason M. Swails
>Quantum Theory Project,
>University of Florida
>Ph.D. Candidate
>352-392-4032



Received on Thu Jan 31 2013 - 10:30:04 PST