Re: [AMBER] Amber 12 cuda test suite: some tests 'hang' from Jason Swails on 2013-01-31 (Amber Archive Jan 2013)

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 31 Jan 2013 12:24:54 -0500

What bug fixes have been applied? You can get this from:

cd $AMBERHOME && ./patch_amber.py --patch-level

I've seen this occur with CUDA 5.0 on a pre-bugfix.14 version of the code.
bugfix.14 fixed this for me (on a GTX 680).
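
If the patch level turns out to be older than bugfix.14, rerunning one of the hanging tests after patching is the obvious next check. Below is a minimal sketch for doing that in the foreground with a time limit, so a hang is reported instead of blocking the shell. It assumes GNU coreutils `timeout` is installed; `run_with_limit` is a hypothetical helper name and the 300-second limit is an arbitrary choice, neither comes from Amber. It does not guarantee the GPU stays usable after a hung process is killed.

```shell
#!/usr/bin/env bash
# Sketch only: run a command in the foreground with a time limit.
# run_with_limit is a hypothetical helper, not part of Amber.

run_with_limit() {
    local limit="$1"   # time limit in seconds (assumed value, tune as needed)
    shift
    # GNU coreutils `timeout` sends SIGTERM when the limit expires
    # and then exits with status 124.
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMED OUT after ${limit}s: $*"
    elif [ "$rc" -ne 0 ]; then
        echo "FAILED (exit $rc): $*"
    else
        echo "OK: $*"
    fi
    return "$rc"
}

# Example use (commented out; run from the directory that holds the test):
#   cd $AMBERHOME && ./patch_amber.py --patch-level
#   run_with_limit 300 ./Run.tip4pew_box_npt
```

A test that prints "TIMED OUT" is still hanging; "FAILED" means the Run script itself exited with a non-zero status, which is a different problem from the hang.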

HTH,
Jason

On Thu, Jan 31, 2013 at 11:45 AM, Alessandro Contini <alessandro.contini.unimi.it> wrote:

> Dear Amber community,
> I've recently compiled amber12 with cuda support (ubuntu 10.04, x86_64,
> 2.6.32-45-generic kernel, cuda 5.0, NVIDIA Driver Version: 310.32, GPU
> Tesla C1060, intel composer XE 2013.1.117, dual intel xeon E5506) and
> I'm experiencing "hanging" on some tests, similar to what was
> previously described by Jan-Philip Gehrck on this list. When the
> "hanging" tests are run by "make test.cuda", the GPU hangs and no output
> is produced. If I kill the test and the corresponding job, no more jobs
> can be run and the GPU is "unavailable" until the system is rebooted. If I
> run the test manually without backgrounding it, it still hangs, but it
> can be killed with ctrl-c and the GPU becomes available again. The tests
> where I experience this problem are:
>
> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.min
> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.md
> tip4pew/ && ./Run.tip4pew_box_npt
> tip4pew/ && ./Run.tip4pew_oct_npt
> tip5p/ && ./Run.tip5p_oct_npt
> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min
> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.min
> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md
> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md
> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md
>
> All other tests run fine (in some cases acceptable differences
> are observed). I also tested the system on a full-length simulation
> (previously run on standard CPUs) of a tip3p-solvated protein (42032
> atoms) including minimizations, restrained equilibrations (NVT and NPT)
> and an unrestrained production run (4 ns), and it worked fine.
> The card is not overheating (72°C on average during the run).
> To summarize, I'm experiencing errors with chamber "cmap" runs and with
> tip4pew and tip5p npt runs (but not with tip3p npt, which I've
> tested on my protein).
>
> Running "cuda-memcheck pmemd.cuda -o mdout.tip5p_box_npt -r restrt
> -x mdcrd -p tip5p_box.prmtop -c tip5p_box.inpcrd" produced the
> following output:
>
> ========= CUDA-MEMCHECK
> Error: unspecified launch failure launching kernel kCalculateCOMKineticEnergy
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ========= Invalid __shared__ read of size 4
> ========= at 0x00003020 in kNLOrientForcesVirial_kernel(void)
> ========= by thread (254,0,0) in block (2,0,0)
> ========= Address 0x00001840 is out of bounds
> ========= Saved host backtrace up to driver entry point at kernel launch time
> ========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9d5c]
> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 [0x13324]
> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x182) [0x3ac62]
> ========= Host Frame:pmemd.cuda [0x13572c]
> ========= Host Frame:pmemd.cuda [0x1336bd]
> ========= Host Frame:pmemd.cuda [0x1336c8]
> ========= Host Frame:pmemd.cuda [0x1321fd]
> ========= Host Frame:pmemd.cuda [0x11e71f]
> ========= Host Frame:pmemd.cuda [0x4e11d]
> ========= Host Frame:pmemd.cuda [0x71cd9]
> ========= Host Frame:pmemd.cuda [0xab2ac]
> ========= Host Frame:pmemd.cuda [0x42dc]
> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
> ========= Host Frame:pmemd.cuda [0x41d9]
> =========
> ========= Program hit error 4 on CUDA API call to cudaLaunch
> ========= Saved host backtrace up to driver entry point at error
> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x246) [0x3ad26]
> ========= Host Frame:pmemd.cuda [0x13572c]
> ========= Host Frame:pmemd.cuda [0x133597]
> ========= Host Frame:pmemd.cuda [0x1335a2]
> ========= Host Frame:pmemd.cuda [0x131c36]
> ========= Host Frame:pmemd.cuda [0x11e734]
> ========= Host Frame:pmemd.cuda [0x4e11d]
> ========= Host Frame:pmemd.cuda [0x71cd9]
> ========= Host Frame:pmemd.cuda [0xab2ac]
> ========= Host Frame:pmemd.cuda [0x42dc]
> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
> ========= Host Frame:pmemd.cuda [0x41d9]
> =========
> ========= Program hit error 4 on CUDA API call to cudaGetLastError
> ========= Saved host backtrace up to driver entry point at error
> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaGetLastError + 0x1da) [0x4048a]
> ========= Host Frame:pmemd.cuda [0x131c3b]
> ========= Host Frame:pmemd.cuda [0x11e734]
> ========= Host Frame:pmemd.cuda [0x4e11d]
> ========= Host Frame:pmemd.cuda [0x71cd9]
> ========= Host Frame:pmemd.cuda [0xab2ac]
> ========= Host Frame:pmemd.cuda [0x42dc]
> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
> ========= Host Frame:pmemd.cuda [0x41d9]
> =========
> ========= Program hit error 4 on CUDA API call to cudaFree
> ========= Saved host backtrace up to driver entry point at error
> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaFree + 0x215) [0x39525]
> ========= Host Frame:pmemd.cuda [0x12a50a]
> ========= Host Frame:pmemd.cuda [0x12f4d5]
> ========= Host Frame:pmemd.cuda [0x1015c6]
> ========= Host Frame:pmemd.cuda [0x131c6a]
> ========= Host Frame:pmemd.cuda [0x11e734]
> ========= Host Frame:pmemd.cuda [0x4e11d]
> ========= Host Frame:pmemd.cuda [0x71cd9]
> ========= Host Frame:pmemd.cuda [0xab2ac]
> ========= Host Frame:pmemd.cuda [0x42dc]
> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
> ========= Host Frame:pmemd.cuda [0x41d9]
> =========
> ========= ERROR SUMMARY: 4 errors
>
>
> No errors were obtained by running "cuda-memcheck pmemd.cuda -O -o
> mdout.tip5p_box_nvt -r restrt -x mdcrd -p tip5p_box.prmtop -c
> tip5p_box.inpcrd".
>
> As usual, any suggestions would be greatly appreciated.
>
> Best regards
>
> Alessandro
>
>
>
>
> --
> Alessandro Contini, PhD
> Dipartimento di Scienze Farmaceutiche
> Sezione di Chimica Generale e Organica "A. Marchesini"
> Via Venezian, 21 20133 Milano
> tel. +390250314480
> e-mail alessandro.contini.unimi.it
> skype alessandrocontini
>
>
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>

-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032

Received on Thu Jan 31 2013 - 09:30:02 PST