Re: [AMBER] amber12 (pmemd.cuda) on GeForce GTX 580 ?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 23 Oct 2012 09:21:14 -0700

Hi Tomasz,

The fact that a reboot fixes it and you see this problem on another
machine makes me suspect driver issues. Indeed, the driver being reported
by SMI is pretty ancient; you might want to update that.
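
As a rough sketch (not an official Amber or NVIDIA tool), the loaded driver
version can be pulled out of 'nvidia-smi -a' output so it is easy to compare
between machines. The helper name driver_version is made up for illustration,
and the pattern assumes the "Driver Version : 295.41" line format shown in the
log quoted below.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi -a` style text on stdin and print
# the value of the first "Driver Version : X" line.
driver_version() {
    sed -n 's/^[[:space:]]*Driver Version[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Only query the GPU if nvidia-smi is actually installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Loaded driver: $(nvidia-smi -a | driver_version)"
fi
```

Running this on both GTX 580 machines should show whether they really load the
same driver after the reinstall.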

On another note, please confirm you are using nvcc V4.2 and that Amber has
been compiled with it. Please also check that your LD_LIBRARY_PATH etc. point
to the correct (i.e. 4.2) CUDA runtime library. The errors you are seeing
imply to me that CUDA 5.0 might have been used to compile and/or your paths
point to the CUDA 5.0 runtime library.

A reboot probably clears out and cleans up your environment.
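
The toolkit and library-path checks above could be scripted roughly as
follows. This is only a sketch under assumptions: the helper name cuda_entries
is invented here, and the binary location $AMBERHOME/bin/pmemd.cuda (and the
idea that it links the CUDA runtime dynamically) is assumed rather than
confirmed for your install.

```shell
#!/bin/sh
# Hypothetical helper: print each entry of a colon-separated search path
# that mentions "cuda", one per line, in search order.
cuda_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i 'cuda' || true
}

# Which toolkit release does nvcc report? (skipped if nvcc is not on PATH)
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | grep -i 'release' || true
fi

# Which CUDA directories will the dynamic loader search, and in what order?
cuda_entries "${LD_LIBRARY_PATH:-}"

# Which libcudart does the binary actually resolve to?
# ASSUMPTION: binary path and dynamic linking; adjust for your install.
PMEMD="${AMBERHOME:-/nonexistent}/bin/pmemd.cuda"
if [ -x "$PMEMD" ]; then
    ldd "$PMEMD" | grep -i 'cudart' || echo "no libcudart dependency listed"
fi
```

If any of the three outputs mentions 5.0 rather than 4.2, that would fit the
mixed-toolkit explanation.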

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

On 10/23/12 4:40 AM, "Tomasz Borowski" <tomasz.borowski74.gmail.com> wrote:

>Dear Robert, Aron, Ross,
>
>many thanks for your helpful suggestions.
>
>I reinstalled the driver and after a restart
>the problem disappeared, i.e. pmemd.cuda
>works fine and passes all tests.
>
>Of note, the
>lspci | grep VGA
>still does not report the presence of the card.
>
>However, when I tried to repeat the same procedure
>on my second (and the last) machine with GeForce GTX 580 (the same
>hardware and software configuration), the same problem appeared.
>
>For the remaining problematic machine:
>
>1) I have run memtestCL-1.00-linux64 several times
>with no errors reported
>
>2) nvidia-smi shows the card and does not report too high temperatures:
>Tue Oct 23 13:11:40 2012
>+------------------------------------------------------+
>| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
>|-------------------------------+----------------------+----------------------+
>| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>|===============================+======================+======================|
>| 0.  GeForce GTX 580           | 0000:04:00.0  N/A    |       N/A        N/A |
>|  11%   28 C  N/A   N/A /  N/A |   0%    5MB / 1535MB |  N/A      Default    |
>|-------------------------------+----------------------+----------------------|
>| Compute processes:                                               GPU Memory |
>| GPU  PID     Process name                                        Usage      |
>|=============================================================================|
>|  0.           Not Supported                                                 |
>+-----------------------------------------------------------------------------+
>
>3) nvidia-smi -a gives the following output (which is basically
>identical to that produced on the machine where the problem disappeared)
>==============NVSMI LOG==============
>
>Timestamp : Tue Oct 23 13:12:45 2012
>Driver Version : 295.41
>
>Attached GPUs : 1
>GPU 0000:04:00.0
>    Product Name : GeForce GTX 580
>    Display Mode : N/A
>    Persistence Mode : Disabled
>    Driver Model
>        Current : N/A
>        Pending : N/A
>    Serial Number : N/A
>    GPU UUID : N/A
>    VBIOS Version : 70.10.17.00.00
>    Inforom Version
>        OEM Object : N/A
>        ECC Object : N/A
>        Power Management Object : N/A
>    PCI
>        Bus : 0x04
>        Device : 0x00
>        Domain : 0x0000
>        Device Id : 0x108010DE
>        Bus Id : 0000:04:00.0
>        Sub System Id : 0x83851043
>        GPU Link Info
>            PCIe Generation
>                Max : N/A
>                Current : N/A
>            Link Width
>                Max : N/A
>                Current : N/A
>    Fan Speed : 11 %
>    Performance State : N/A
>    Memory Usage
>        Total : 1535 MB
>        Used : 5 MB
>        Free : 1530 MB
>    Compute Mode : Default
>    Utilization
>        Gpu : N/A
>        Memory : N/A
>    Ecc Mode
>        Current : N/A
>        Pending : N/A
>    ECC Errors
>        Volatile
>            Single Bit
>                Device Memory : N/A
>                Register File : N/A
>                L1 Cache : N/A
>                L2 Cache : N/A
>                Total : N/A
>            Double Bit
>                Device Memory : N/A
>                Register File : N/A
>                L1 Cache : N/A
>                L2 Cache : N/A
>                Total : N/A
>        Aggregate
>            Single Bit
>                Device Memory : N/A
>                Register File : N/A
>                L1 Cache : N/A
>                L2 Cache : N/A
>                Total : N/A
>            Double Bit
>                Device Memory : N/A
>                Register File : N/A
>                L1 Cache : N/A
>                L2 Cache : N/A
>                Total : N/A
>    Temperature
>        Gpu : 28 C
>    Power Readings
>        Power Management : N/A
>        Power Draw : N/A
>        Power Limit : N/A
>    Clocks
>        Graphics : N/A
>        SM : N/A
>        Memory : N/A
>    Max Clocks
>        Graphics : N/A
>        SM : N/A
>        Memory : N/A
>    Compute Processes : Not Supported
>
>4) when running make test, the very same tests as before (on the
>first machine) failed, here are the fragments of the log file for tests
>that failed due to errors:
>
>==============================================================
>cd nucleosome/ && ./Run_md.1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kScaleVelocities
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run_md.1: Program error
>make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
>cd nucleosome/ && ./Run_md.2 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kClearForces
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run_md.2: Program error
>make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
>cd amd/rna_gb && ./Run.gb.amd2 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>diffing mdout.gb.amd2.GPU_SPFP with mdout.gb.amd2
>PASSED
>==============================================================
>==============================================================
>cd 4096wat_oct/ && ./Run.pure_wat_oct_NPT_NTT1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.pure_wat_oct_NPT_NTT1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd large_solute_count/ && ./Run.ntb2_ntt1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.ntb2_ntt1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd jac/ && ./Run.jac SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kNLSkinTest
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.jac: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kNLSkinTest
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.dhfr: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr.ntr1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kNLSkinTest
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.dhfr.ntr1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr.ntb2 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr.ntb2: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr.ntb2_ntt1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.dhfr.ntb2_ntt1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr.ntb2_ntt1_ntr1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
> ./Run.dhfr.ntb2_ntt1_ntr1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd dhfr/ && ./Run.dhfr.noshake SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>diffing mdout.dhfr.noshake.GPU_SPFP with mdout.dhfr.noshake
>possible FAILURE: check mdout.dhfr.noshake.dif
>==============================================================
>==============================================================
>cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.md SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr_pbc_charmm_noshake.md: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr_cmap_pbc_charmm.md: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr_cmap_pbc_charmm_NPT.md: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr_cmap_pbc_charmm_noshake.md: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd ips/ && ./Run.ips SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
> ./Run.ips: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd amd/dhfr_pme && ./Run.pme.amd1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.pme.amd1: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd amd/dhfr_pme && ./Run.pme.amd2 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.pme.amd2: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd amd/dhfr_pme && ./Run.pme.amd3 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.pme.amd3: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd amd/gact_ips && ./Run.ips.amd1 SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>diffing mdout.ips.amd1.GPU_SPFP with mdout.ips.amd1
>possible FAILURE: check mdout.ips.amd1.dif
>==============================================================
>==============================================================
>cd nmropt/pme/distance/ && ./Run.dist_pbc SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>Error: unspecified launch failure launching kernel kPMEGradSum
>cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ./Run.dist_pbc: Program error
>make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
>cd nmropt/pme/nmropt_1_torsion/ && ./Run.nmropt_1_torsion SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>diffing mdout.GPU_SPFP with mdout
>file ddtmp.mdout.GPU_SPFP is short
>possible FAILURE: check mdout.dif
>==============================================================
>==============================================================
>cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.min SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ./Run.dhfr_pbc_charmm_noshake.min: Program error
>make[3]: [test.pmemd.cuda.pme.serial] Error 1 (ignored)
>cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min SPFP
>/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
>diffing mdout.dhfr_charmm_pbc_min.GPU_SPFP with mdout.dhfr_charmm_pbc_min
>file ddtmp.mdout.dhfr_charmm_pbc_min.GPU_SPFP is short
>possible FAILURE: check mdout.dhfr_charmm_pbc_min.dif
>==============================================================
>
>...
>42 file comparisons passed
>22 file comparisons failed
>51 tests experienced errors
>
>Since the problem is duplicated on another machine, I guess it is not the
>hardware that fails, but rather the software. I tried reinstalling the
>drivers several times, yet with no progress at all.
>
>I would appreciate any hints on what to try next.
>
>all the best,
>Tomasz
>
>
>2012/10/19 Ross Walker <ross.rosswalker.co.uk>
>
>> Hi Tomasz,
>>
>> My gut instinct is to agree with Aron here and say this is likely a
>> broken card, most likely the memory. However, it would be useful to know
>> what you mean by half the tests fail. How do they fail exactly, and what
>> error messages do you get? Do you have the log files, especially the diff
>> log from running the tests?
>>
>> nvidia-smi -a
>>
>> should show the card. I would try reinstalling the NVIDIA driver first
>> though and then powering off the machine and starting it up again. You
>> might want to also take a look inside and make sure nothing is stopping
>> the GPU fan spinning as well, I've had issues like that before with an
>> internal USB cable that wasn't secured being ingested by the GPU fan.
>>
>> Try a reinstall of the drivers first and a cold power down / restart and
>> see if that helps. Run 'nvidia-smi' as root to get the driver loaded once
>> the machine comes back up. If the problem still persists then suspect the
>> hardware.
>>
>> All the best
>> Ross
>>
>>
>> On 10/19/12 1:13 AM, "Tomasz Borowski" <tomasz.borowski74.gmail.com>
>> wrote:
>>
>> >Dear Amber users,
>> >
>> >I have a problem with getting pmemd.cuda working on GeForce GTX 580
>> >card.
>> >
>> >With all the patches currently available applied to the code, it
>> >compiles with no problems (GNU compiler, openSUSE 11.2, no MKL libs).
>> >During tests, roughly half of the tests fail. However, the same binary
>> >code runs with no problems on another machine where I have a GeForce
>> >GTX 285. The same driver and CUDA toolkit are installed on the two
>> >machines. Does this mean the GTX 580 is faulty?
>> >
>> >Maybe it could be of relevance: on the machine with the GTX 580 I cannot
>> >get the info on the graphics card with the usual command:
>> >lspci | grep VGA
>> >
>> >Could you please help?
>> >
>> >
>> >all the best,
>> >Tomasz Borowski
>> >_______________________________________________
>> >AMBER mailing list
>> >AMBER.ambermd.org
>> >http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 23 2012 - 09:30:03 PDT