Re: [AMBER] Amber 12 cuda test suite: some tests 'hang'

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 31 Jan 2013 10:24:35 -0800

SPFP should be a performance disaster on pre-Fermi cards... Avoid, avoid,
avoid...


On Thu, Jan 31, 2013 at 10:16 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Alessandro,
>
> In addition to Jason's suggestion, assuming bugfix.14 is correctly applied,
> could you try compiling the SPDP version of the code?
>
> ./configure -cuda_SPDP gnu
> make install
>
> Then run the GPU tests with
>
> cd $AMBERHOME/test
> ./test_amber_cuda.sh SPDP
>
> And see if that works. The SPFP precision model is really only designed
> for GPUs with hardware revision >=2.0 and has not been fully tested on
> earlier cards such as the C1060 (I don't have access to anything below a
> C2075). If SPDP works on your C1060 then my suggestion would be to use
> that precision model going forward on anything pre-Fermi.
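>
> If you are not sure which hardware revision a given card reports, the
> deviceQuery sample that ships with the CUDA toolkit will tell you. A
> minimal, untested sketch along these lines (file name check_cc.cu is
> just an example) does the same via cudaGetDeviceProperties:
>
>   #include <cstdio>
>   #include <cuda_runtime.h>
>
>   int main() {
>     int n = 0;
>     if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
>       printf("No CUDA device found\n");
>       return 1;
>     }
>     for (int i = 0; i < n; ++i) {
>       cudaDeviceProp prop;
>       cudaGetDeviceProperties(&prop, i);
>       /* The C1060 reports 1.3; SPFP expects >= 2.0 (Fermi or later). */
>       printf("Device %d: %s, compute capability %d.%d\n",
>              i, prop.name, prop.major, prop.minor);
>     }
>     return 0;
>   }
>
> Compile with something like "nvcc -o check_cc check_cc.cu" and run it
> on the node with the C1060.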
>
> All the best
> Ross
>
>
>
> On 1/31/13 9:24 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>
> >What bug fixes have been applied? You can get this from:
> >
> >cd $AMBERHOME && ./patch_amber.py --patch-level
> >
> >I've seen this occur with CUDA 5.0 on a pre-bugfix.14 version of the code.
> > bugfix.14 fixed this for me (on a GTX 680).
> >
> >HTH,
> >Jason
> >
> >On Thu, Jan 31, 2013 at 11:45 AM, Alessandro Contini <
> >alessandro.contini.unimi.it> wrote:
> >
> >> Dear Amber community,
> >> I've recently compiled amber12 with cuda support (ubuntu 10.04, x86_64,
> >> 2.6.32-45-generic kernel, cuda 5.0, NVIDIA Driver Version: 310.32, GPU
> >> Tesla C1060, intel composer XE 2013.1.117, dual intel xeon E5506) and
> >> I'm experiencing "hanging" on some tests, similar to what was
> >> previously described by Jan-Philip Gehrck on this list. When the
> >> "hanging" tests are run by "make test.cuda", the GPU hangs and no output
> >> is produced. After killing the test and the corresponding job, no more
> >> jobs can be run and the GPU is "unavailable" until the system is
> >> rebooted. When I run the test manually without backgrounding it, it
> >> still hangs, but it can be killed with Ctrl-C and the GPU becomes
> >> available again. The tests where I experience this problem are:
> >>
> >> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.min
> >> chamber/dhfr_cmap/ && ./Run.dhfr_charmm.md
> >> tip4pew/ && ./Run.tip4pew_box_npt
> >> tip4pew/ && ./Run.tip4pew_oct_npt
> >> tip5p/ && ./Run.tip5p_oct_npt
> >> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min
> >> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.min
> >> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md
> >> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md
> >> chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md
> >>
> >> All other tests run fine (in some cases acceptable differences are
> >> observed). I also tested the system on a full-length simulation
> >> (previously run on standard CPUs) of a tip3p-solvated protein (42032
> >> atoms), including minimizations, restrained equilibrations (NVT and NPT)
> >> and an unrestrained production run (4 ns), and it worked fine.
> >> The card is not overheating (72°C on average during the run).
> >> In summary, I'm experiencing errors with chamber "cmap" runs and with
> >> tip4pew and tip5p NPT runs (but not with tip3p NPT, which I have tested
> >> on my protein).
> >>
> >> By running "cuda-memcheck pmemd.cuda -o mdout.tip5p_box_npt -r restrt
> >> -x mdcrd -p tip5p_box.prmtop -c tip5p_box.inpcrd" I've obtained the
> >> following output:
> >>
> >> ========= CUDA-MEMCHECK
> >> Error: unspecified launch failure launching kernel
> >> kCalculateCOMKineticEnergy
> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> >> ========= Invalid __shared__ read of size 4
> >> ========= at 0x00003020 in kNLOrientForcesVirial_kernel(void)
> >> ========= by thread (254,0,0) in block (2,0,0)
> >> ========= Address 0x00001840 is out of bounds
> >> ========= Saved host backtrace up to driver entry point at kernel
> >> launch time
> >> ========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc)
> >> [0xc9d5c]
> >> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
> >> [0x13324]
> >> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
> >> (cudaLaunch + 0x182) [0x3ac62]
> >> ========= Host Frame:pmemd.cuda [0x13572c]
> >> ========= Host Frame:pmemd.cuda [0x1336bd]
> >> ========= Host Frame:pmemd.cuda [0x1336c8]
> >> ========= Host Frame:pmemd.cuda [0x1321fd]
> >> ========= Host Frame:pmemd.cuda [0x11e71f]
> >> ========= Host Frame:pmemd.cuda [0x4e11d]
> >> ========= Host Frame:pmemd.cuda [0x71cd9]
> >> ========= Host Frame:pmemd.cuda [0xab2ac]
> >> ========= Host Frame:pmemd.cuda [0x42dc]
> >> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
> >> [0x1ec4d]
> >> ========= Host Frame:pmemd.cuda [0x41d9]
> >> =========
> >> ========= Program hit error 4 on CUDA API call to cudaLaunch
> >> ========= Saved host backtrace up to driver entry point at error
> >> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> >> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
> >> (cudaLaunch + 0x246) [0x3ad26]
> >> ========= Host Frame:pmemd.cuda [0x13572c]
> >> ========= Host Frame:pmemd.cuda [0x133597]
> >> ========= Host Frame:pmemd.cuda [0x1335a2]
> >> ========= Host Frame:pmemd.cuda [0x131c36]
> >> ========= Host Frame:pmemd.cuda [0x11e734]
> >> ========= Host Frame:pmemd.cuda [0x4e11d]
> >> ========= Host Frame:pmemd.cuda [0x71cd9]
> >> ========= Host Frame:pmemd.cuda [0xab2ac]
> >> ========= Host Frame:pmemd.cuda [0x42dc]
> >> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
> >> [0x1ec4d]
> >> ========= Host Frame:pmemd.cuda [0x41d9]
> >> =========
> >> ========= Program hit error 4 on CUDA API call to cudaGetLastError
> >> ========= Saved host backtrace up to driver entry point at error
> >> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> >> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
> >> (cudaGetLastError + 0x1da) [0x4048a]
> >> ========= Host Frame:pmemd.cuda [0x131c3b]
> >> ========= Host Frame:pmemd.cuda [0x11e734]
> >> ========= Host Frame:pmemd.cuda [0x4e11d]
> >> ========= Host Frame:pmemd.cuda [0x71cd9]
> >> ========= Host Frame:pmemd.cuda [0xab2ac]
> >> ========= Host Frame:pmemd.cuda [0x42dc]
> >> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
> >> [0x1ec4d]
> >> ========= Host Frame:pmemd.cuda [0x41d9]
> >> =========
> >> ========= Program hit error 4 on CUDA API call to cudaFree
> >> ========= Saved host backtrace up to driver entry point at error
> >> ========= Host Frame:/usr/lib/libcuda.so [0x26a070]
> >> ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0
> >> (cudaFree + 0x215) [0x39525]
> >> ========= Host Frame:pmemd.cuda [0x12a50a]
> >> ========= Host Frame:pmemd.cuda [0x12f4d5]
> >> ========= Host Frame:pmemd.cuda [0x1015c6]
> >> ========= Host Frame:pmemd.cuda [0x131c6a]
> >> ========= Host Frame:pmemd.cuda [0x11e734]
> >> ========= Host Frame:pmemd.cuda [0x4e11d]
> >> ========= Host Frame:pmemd.cuda [0x71cd9]
> >> ========= Host Frame:pmemd.cuda [0xab2ac]
> >> ========= Host Frame:pmemd.cuda [0x42dc]
> >> ========= Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd)
> >> [0x1ec4d]
> >> ========= Host Frame:pmemd.cuda [0x41d9]
> >> =========
> >> ========= ERROR SUMMARY: 4 errors
> >>
> >>
> >> No errors were obtained by running "cuda-memcheck pmemd.cuda -O -o
> >> mdout.tip5p_box_nvt -r restrt -x mdcrd -p tip5p_box.prmtop -c
> >> tip5p_box.inpcrd"
> >>
> >> As usual, any suggestion would be greatly appreciated.
> >>
> >> Best regards
> >>
> >> Alessandro
> >>
> >>
> >>
> >>
> >> --
> >> Alessandro Contini, PhD
> >> Dipartimento di Scienze Farmaceutiche
> >> Sezione di Chimica Generale e Organica "A. Marchesini"
> >> Via Venezian, 21 20133 Milano
> >> tel. +390250314480
> >> e-mail alessandro.contini.unimi.it
> >> skype alessandrocontini
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >
> >
> >
> >--
> >Jason M. Swails
> >Quantum Theory Project,
> >University of Florida
> >Ph.D. Candidate
> >352-392-4032
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 31 2013 - 10:30:05 PST