Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Alessandro Contini <alessandro.contini.unimi.it>
Date: Thu, 31 Jan 2013 21:11:54 +0100

Hi all, and thanks for your replies. Unfortunately, all suggestions had
already been tried: all patches were applied as the first compilation
step (although I'll verify this ASAP), and I tried both the SPFP and
SPDP versions. I also tried CUDA tools 4.2.9 coupled with either the
310 or 295 driver versions. An older version of the Intel compiler
(Composer XE 2012), which worked for Amber 11 + CUDA 3.2, produced
PBSA errors in serial Amber 12, so I dismissed it.

Best regards

Alessandro

Sent from my iPhone

On 31 Jan 2013, at 19:24, Scott Le Grand <varelse2005.gmail.com> wrote:

> SPFP should be a performance disaster on pre-Fermi cards... Avoid,
> avoid, avoid...
>
>
> On Thu, Jan 31, 2013 at 10:16 AM, Ross Walker
> <ross.rosswalker.co.uk> wrote:
>
>> Hi Alessandro,
>>
>> In addition to Jason's suggestion, and assuming bugfix.14 is
>> correctly applied, could you try compiling the SPDP version of the
>> code:
>>
>> ./configure -cuda_SPDP gnu
>> make install
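>>
>> (If an SPFP build is already in the tree, it may be worth doing a
>> "make clean" first; a sketch, assuming $AMBERHOME is set and the
>> top-level Makefile provides a clean target:
>>
>>   cd $AMBERHOME
>>   make clean   # drop objects left over from the previous SPFP build
>>
>> then repeat the configure and make install steps above.)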
>>
>> Then run the GPU tests with
>>
>> cd $AMBERHOME/test
>> ./test_amber_cuda.sh SPDP
>>
>> And see if that works. The SPFP precision model is really only
>> designed for GPUs with hardware revision >=2.0 and has not been
>> fully tested on earlier cards such as the C1060 (I don't have access
>> to anything below a C2075). If SPDP works on your C1060, then my
>> suggestion would be to use
>> that precision model going forward on anything pre-Fermi.
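>>
>> To confirm the hardware revision of a card, the deviceQuery sample
>> shipped with the CUDA toolkit can be used; a sketch, assuming the
>> CUDA 5.0 samples are installed under /usr/local/cuda/samples:
>>
>>   cd /usr/local/cuda/samples/1_Utilities/deviceQuery
>>   make                                    # build the sample
>>   ./deviceQuery | grep 'CUDA Capability'  # print compute capability
>>
>> A C1060 (GT200) should report capability 1.3, i.e. below the 2.0
>> (Fermi) hardware revision that SPFP targets.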
>>
>> All the best
>> Ross
>>
>>
>>
>> On 1/31/13 9:24 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>
>>> What bug fixes have been applied? You can get this from:
>>>
>>> cd $AMBERHOME && ./patch_amber.py --patch-level
>>>
>>> I've seen this occur with CUDA 5.0 on a pre-bugfix.14 version of
>>> the code.
>>> bugfix.14 fixed this for me (on a GTX 680).
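>>>
>>> If the reported patch level is below 14, the bugfix can be fetched
>>> from the Amber web site and applied with the standard patch
>>> utility; a sketch (the URL pattern is an assumption, so check
>>> http://ambermd.org and the instructions at the top of the bugfix
>>> file):
>>>
>>>   cd $AMBERHOME
>>>   wget http://ambermd.org/bugfixes/12.0/bugfix.14
>>>   patch -p0 -N < bugfix.14   # -N skips patches already applied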
>>>
>>> HTH,
>>> Jason
>>>
>>> On Thu, Jan 31, 2013 at 11:45 AM, Alessandro Contini <
>>> alessandro.contini.unimi.it> wrote:
>>>
>>>> Dear Amber community,
>>>> I've recently compiled amber12 with cuda support (ubuntu 10.04,
>>>> x86_64,
>>>> 2.6.32-45-generic kernel, cuda 5.0, NVIDIA Driver Version:
>>>> 310.32, GPU
>>>> Tesla C1060, intel composer XE 2013.1.117, dual intel xeon
>>>> E5506) and
>>>> I'm experiencing "hanging" on some tests in a way similar to what
>>>> was previously described by Jan-Philip Gehrcke on this list. When
>>>> the "hanging" tests are run by "make test.cuda", the GPU hangs and
>>>> no output is produced. After killing the test and the
>>>> corresponding job, no more jobs can be run and the GPU is
>>>> "unavailable" until the system is rebooted. When running a test
>>>> manually without backgrounding it, it still hangs, but it can be
>>>> killed with Ctrl-C and the GPU becomes available again. The tests
>>>> where I experience this problem are listed below (a sketch for
>>>> running them individually with a timeout follows the list):
>>>>
>>>> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.min
>>>> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.md
>>>> tip4pew/ && ./Run.tip4pew_box_npt
>>>> tip4pew/ && ./Run.tip4pew_oct_npt
>>>> tip5p/ && ./Run.tip5p_oct_npt
>>>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min
>>>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.min
>>>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md
>>>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md
>>>> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md
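>>>>
>>>> To reproduce these without wedging the GPU, each test can be
>>>> wrapped in a timeout so it is killed automatically instead of
>>>> being left hanging; a sketch, assuming coreutils timeout and the
>>>> Amber 12 cuda test layout:
>>>>
>>>>   cd $AMBERHOME/test/cuda/chamber/dhfr_cmap
>>>>   timeout 600 ./Run.dhfr_charmm.min   # kill the test after 10 minutes
>>>>   echo $?                             # exit status 124 means it timed out
>>>>
>>>> nvidia-smi can then show whether any process is still holding the
>>>> device.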
>>>>
>>>> All other tests run fine (in some cases acceptable differences
>>>> are observed). I also tested the system on a full-length
>>>> simulation (previously run on standard CPUs) of a tip3p-solvated
>>>> protein (42032 atoms), including minimizations, restrained
>>>> equilibrations (NVT and NPT) and an unrestrained production run
>>>> (4 ns), and it worked fine.
>>>> The card is not overheating (72°C on average during the run).
>>>> Summarizing, I'm experiencing errors with chamber "cmap" runs and
>>>> with tip4pew and tip5p NPT runs (but not with tip3p NPT, which
>>>> I've tested on my protein).
>>>>
>>>> By running "cuda-memcheck pmemd.cuda -o mdout.tip5p_box_npt
>>>> -r restrt -x mdcrd -p tip5p_box.prmtop -c tip5p_box.inpcrd"
>>>> I obtained the following output:
>>>>
>>>> ========= CUDA-MEMCHECK
>>>> Error: unspecified launch failure launching kernel kCalculateCOMKineticEnergy
>>>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>> ========= Invalid __shared__ read of size 4
>>>> =========     at 0x00003020 in kNLOrientForcesVirial_kernel(void)
>>>> =========     by thread (254,0,0) in block (2,0,0)
>>>> =========     Address 0x00001840 is out of bounds
>>>> =========     Saved host backtrace up to driver entry point at kernel launch time
>>>> =========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9d5c]
>>>> =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 [0x13324]
>>>> =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x182) [0x3ac62]
>>>> =========     Host Frame:pmemd.cuda [0x13572c]
>>>> =========     Host Frame:pmemd.cuda [0x1336bd]
>>>> =========     Host Frame:pmemd.cuda [0x1336c8]
>>>> =========     Host Frame:pmemd.cuda [0x1321fd]
>>>> =========     Host Frame:pmemd.cuda [0x11e71f]
>>>> =========     Host Frame:pmemd.cuda [0x4e11d]
>>>> =========     Host Frame:pmemd.cuda [0x71cd9]
>>>> =========     Host Frame:pmemd.cuda [0xab2ac]
>>>> =========     Host Frame:pmemd.cuda [0x42dc]
>>>> =========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
>>>> =========     Host Frame:pmemd.cuda [0x41d9]
>>>> =========
>>>> ========= Program hit error 4 on CUDA API call to cudaLaunch
>>>> =========     Saved host backtrace up to driver entry point at error
>>>> =========     Host Frame:/usr/lib/libcuda.so [0x26a070]
>>>> =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x246) [0x3ad26]
>>>> =========     Host Frame:pmemd.cuda [0x13572c]
>>>> =========     Host Frame:pmemd.cuda [0x133597]
>>>> =========     Host Frame:pmemd.cuda [0x1335a2]
>>>> =========     Host Frame:pmemd.cuda [0x131c36]
>>>> =========     Host Frame:pmemd.cuda [0x11e734]
>>>> =========     Host Frame:pmemd.cuda [0x4e11d]
>>>> =========     Host Frame:pmemd.cuda [0x71cd9]
>>>> =========     Host Frame:pmemd.cuda [0xab2ac]
>>>> =========     Host Frame:pmemd.cuda [0x42dc]
>>>> =========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
>>>> =========     Host Frame:pmemd.cuda [0x41d9]
>>>> =========
>>>> ========= Program hit error 4 on CUDA API call to cudaGetLastError
>>>> =========     Saved host backtrace up to driver entry point at error
>>>> =========     Host Frame:/usr/lib/libcuda.so [0x26a070]
>>>> =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaGetLastError + 0x1da) [0x4048a]
>>>> =========     Host Frame:pmemd.cuda [0x131c3b]
>>>> =========     Host Frame:pmemd.cuda [0x11e734]
>>>> =========     Host Frame:pmemd.cuda [0x4e11d]
>>>> =========     Host Frame:pmemd.cuda [0x71cd9]
>>>> =========     Host Frame:pmemd.cuda [0xab2ac]
>>>> =========     Host Frame:pmemd.cuda [0x42dc]
>>>> =========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
>>>> =========     Host Frame:pmemd.cuda [0x41d9]
>>>> =========
>>>> ========= Program hit error 4 on CUDA API call to cudaFree
>>>> =========     Saved host backtrace up to driver entry point at error
>>>> =========     Host Frame:/usr/lib/libcuda.so [0x26a070]
>>>> =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaFree + 0x215) [0x39525]
>>>> =========     Host Frame:pmemd.cuda [0x12a50a]
>>>> =========     Host Frame:pmemd.cuda [0x12f4d5]
>>>> =========     Host Frame:pmemd.cuda [0x1015c6]
>>>> =========     Host Frame:pmemd.cuda [0x131c6a]
>>>> =========     Host Frame:pmemd.cuda [0x11e734]
>>>> =========     Host Frame:pmemd.cuda [0x4e11d]
>>>> =========     Host Frame:pmemd.cuda [0x71cd9]
>>>> =========     Host Frame:pmemd.cuda [0xab2ac]
>>>> =========     Host Frame:pmemd.cuda [0x42dc]
>>>> =========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
>>>> =========     Host Frame:pmemd.cuda [0x41d9]
>>>> =========
>>>> ========= ERROR SUMMARY: 4 errors
>>>>
>>>>
>>>> No errors were obtained by running "cuda-memcheck pmemd.cuda -O -o
>>>> mdout.tip5p_box_nvt -r restrt -x mdcrd -p tip5p_box.prmtop -c
>>>> tip5p_box.inpcrd"
>>>>
>>>> As usual, any suggestion would be greatly appreciated.
>>>>
>>>> Best regards
>>>>
>>>> Alessandro
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alessandro Contini, PhD
>>>> Dipartimento di Scienze Farmaceutiche
>>>> Sezione di Chimica Generale e Organica "A. Marchesini"
>>>> Via Venezian, 21 20133 Milano
>>>> tel. +390250314480
>>>> e-mail alessandro.contini.unimi.it
>>>> skype alessandrocontini
>>>
>>>
>>>
>>> --
>>> Jason M. Swails
>>> Quantum Theory Project,
>>> University of Florida
>>> Ph.D. Candidate
>>> 352-392-4032

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 31 2013 - 12:30:03 PST