Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Alessandro Contini <alessandro.contini.unimi.it>
Date: Thu, 31 Jan 2013 17:45:40 +0100

Dear Amber community,
I've recently compiled Amber 12 with CUDA support (Ubuntu 10.04, x86_64,
2.6.32-45-generic kernel, CUDA 5.0, NVIDIA driver version 310.32, GPU
Tesla C1060, Intel Composer XE 2013.1.117, dual Intel Xeon E5506) and
I'm experiencing "hanging" on some tests, in a way similar to what was
previously described by Jan-Philip Gehrck on this list. When the
"hanging" tests are run by "make test.cuda", the GPU hangs and no output
is produced. After killing the test and the corresponding job, no more
jobs can be run and the GPU is "unavailable" until the system is
rebooted. When I run a test manually without backgrounding it, it still
hangs, but it can be killed with ctrl-c and the GPU becomes available
again. The tests where I experience this problem are:

chamber/dhfr_cmap/ && ./Run.dhfr_charmm.min
chamber/dhfr_cmap/ && ./Run.dhfr_charmm.md
tip4pew/ && ./Run.tip4pew_box_npt
tip4pew/ && ./Run.tip4pew_oct_npt
tip5p/ && ./Run.tip5p_oct_npt
chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min
chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.min
chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md
chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md
chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md
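
Since the hanging tests can be interrupted with ctrl-c when run in the
foreground, one way to run them unattended is a small wrapper that
imitates ctrl-c after a time limit. The following is only a sketch for
POSIX systems: the function name, the time limit and the grace period
are illustrative assumptions, and it has not been verified that a
SIGINT sent this way frees the GPU the same way an interactive ctrl-c
does.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s, cwd=None, grace_s=10):
    """Run cmd and return "ok", "failed" or "timeout".

    Hypothetical helper: on timeout it sends SIGINT to the whole
    process group (what ctrl-c does for a foreground job) and only
    falls back to SIGKILL if the job ignores the interrupt.
    """
    # start_new_session=True makes the child the leader of its own
    # process group, so the group id equals proc.pid.
    proc = subprocess.Popen(cmd, cwd=cwd, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return "ok" if returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        pass

    try:
        # Interrupt the script and everything it started.
        os.killpg(proc.pid, signal.SIGINT)
    except ProcessLookupError:
        pass  # the group vanished just after the timeout fired
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        # Last resort; per the report above, a hard kill may leave the
        # GPU unavailable until reboot.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
    return "timeout"
```

For example, run_with_timeout(["./Run.tip5p_oct_npt"], 600, cwd="tip5p")
would give that test ten minutes before interrupting it; the 600 s limit
is an arbitrary choice and the working directory is assumed to be the
cuda test directory.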

All other tests run fine (in some cases acceptable differences are
observed). I also tested the system on a full-length simulation
(previously run on standard CPUs) of a tip3p-solvated protein (42032
atoms), including minimizations, restrained equilibrations (NVT and NPT)
and an unrestrained production run (4 ns), and it worked fine.
The card is not overheating (72°C on average during the run).
In summary, I'm experiencing errors with chamber "cmap" runs and with
tip4pew and tip5p NPT runs (but not with tip3p NPT, since I've
tested this on my protein).

Running "cuda-memcheck pmemd.cuda -o mdout.tip5p_box_npt -r restrt
-x mdcrd -p tip5p_box.prmtop -c tip5p_box.inpcrd" produced the
following output:

========= CUDA-MEMCHECK
Error: unspecified launch failure launching kernel kCalculateCOMKineticEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
========= Invalid __shared__ read of size 4
========= at 0x00003020 in kNLOrientForcesVirial_kernel(void)
========= by thread (254,0,0) in block (2,0,0)
========= Address 0x00001840 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9d5c]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 [0x13324]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x182) [0x3ac62]
========= Host Frame:pmemd.cuda [0x13572c]
========= Host Frame:pmemd.cuda [0x1336bd]
========= Host Frame:pmemd.cuda [0x1336c8]
========= Host Frame:pmemd.cuda [0x1321fd]
========= Host Frame:pmemd.cuda [0x11e71f]
========= Host Frame:pmemd.cuda [0x4e11d]
========= Host Frame:pmemd.cuda [0x71cd9]
========= Host Frame:pmemd.cuda [0xab2ac]
========= Host Frame:pmemd.cuda [0x42dc]
========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
========= Host Frame:pmemd.cuda [0x41d9]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x26a070]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x246) [0x3ad26]
========= Host Frame:pmemd.cuda [0x13572c]
========= Host Frame:pmemd.cuda [0x133597]
========= Host Frame:pmemd.cuda [0x1335a2]
========= Host Frame:pmemd.cuda [0x131c36]
========= Host Frame:pmemd.cuda [0x11e734]
========= Host Frame:pmemd.cuda [0x4e11d]
========= Host Frame:pmemd.cuda [0x71cd9]
========= Host Frame:pmemd.cuda [0xab2ac]
========= Host Frame:pmemd.cuda [0x42dc]
========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
========= Host Frame:pmemd.cuda [0x41d9]
=========
========= Program hit error 4 on CUDA API call to cudaGetLastError
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x26a070]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaGetLastError + 0x1da) [0x4048a]
========= Host Frame:pmemd.cuda [0x131c3b]
========= Host Frame:pmemd.cuda [0x11e734]
========= Host Frame:pmemd.cuda [0x4e11d]
========= Host Frame:pmemd.cuda [0x71cd9]
========= Host Frame:pmemd.cuda [0xab2ac]
========= Host Frame:pmemd.cuda [0x42dc]
========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
========= Host Frame:pmemd.cuda [0x41d9]
=========
========= Program hit error 4 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x26a070]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaFree + 0x215) [0x39525]
========= Host Frame:pmemd.cuda [0x12a50a]
========= Host Frame:pmemd.cuda [0x12f4d5]
========= Host Frame:pmemd.cuda [0x1015c6]
========= Host Frame:pmemd.cuda [0x131c6a]
========= Host Frame:pmemd.cuda [0x11e734]
========= Host Frame:pmemd.cuda [0x4e11d]
========= Host Frame:pmemd.cuda [0x71cd9]
========= Host Frame:pmemd.cuda [0xab2ac]
========= Host Frame:pmemd.cuda [0x42dc]
========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
========= Host Frame:pmemd.cuda [0x41d9]
=========
========= ERROR SUMMARY: 4 errors
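
Most of the output above is backtrace frames; the informative lines are
the invalid-access report (with the kernel it occurred in) and the
failing CUDA API calls. A minimal sketch of how such a log could be
condensed is given below. The function name is hypothetical, and the
patterns are inferred only from the lines shown in this message, so
other cuda-memcheck versions may need different ones.

```python
import re

PREFIX = "========= "  # marker cuda-memcheck puts in front of its own lines


def summarize_memcheck(log_text):
    """Return (kernel_faults, api_errors) found in a cuda-memcheck log.

    kernel_faults: list of (fault description, kernel name)
    api_errors:    list of (error code, CUDA API function name)
    """
    kernel_faults = []
    api_errors = []
    pending_fault = None
    for raw in log_text.splitlines():
        if not raw.startswith(PREFIX):
            continue  # program output or a bare separator line
        line = raw[len(PREFIX):].strip()

        m = re.match(r"Program hit error (\d+) on CUDA API call to (\w+)", line)
        if m:
            api_errors.append((int(m.group(1)), m.group(2)))
            continue

        if line.startswith("Invalid "):
            pending_fault = line  # the kernel name follows on a later line
            continue

        m = re.match(r"at (0x[0-9a-fA-F]+) in (\w+)", line)
        if m and pending_fault is not None:
            kernel_faults.append((pending_fault, m.group(2)))
            pending_fault = None
    return kernel_faults, api_errors
```

Applied to the log above, this should report one invalid __shared__
read in kNLOrientForcesVirial_kernel and three API calls failing with
error 4 (cudaLaunch, cudaGetLastError, cudaFree).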


No errors were obtained by running "cuda-memcheck pmemd.cuda -O -o
mdout.tip5p_box_nvt -r restrt -x mdcrd -p tip5p_box.prmtop -c
tip5p_box.inpcrd"

As usual, any suggestions would be greatly appreciated.

Best regards

Alessandro




-- 
Alessandro Contini, PhD
Dipartimento di Scienze Farmaceutiche
Sezione di Chimica Generale e Organica "A. Marchesini"
Via Venezian, 21 20133 Milano
tel. +390250314480
e-mail alessandro.contini.unimi.it
skype alessandrocontini
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 31 2013 - 09:00:05 PST