Dear Robert, Aron, Ross,
many thanks for your helpful suggestions.
I reinstalled the driver and after a restart
the problem disappeared, i.e. pmemd.cuda
works fine and passes all tests.
Of note, the
lspci | grep VGA
still does not report the presence of the card.
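One possibility (an assumption, not verified here) is that the card is simply listed under a device class other than VGA. A cross-check that does not depend on the class label is to search the lspci output for the NVIDIA PCI vendor ID, 10de, which is the low half of the Device Id 0x108010DE that nvidia-smi -a reports. A minimal Python sketch, assuming `lspci -nn` (pciutils) is available and prints vendor:device IDs in square brackets; the helper names are only illustrative:

```python
import re
import subprocess

# Matches "[10de:xxxx]" as printed by `lspci -nn`; 10de is NVIDIA's PCI vendor ID.
NVIDIA_ID = re.compile(r"\[10de:[0-9a-f]{4}\]", re.IGNORECASE)


def find_nvidia_devices(lspci_text):
    """Return the lspci lines for NVIDIA devices, whatever class they are listed under."""
    return [line for line in lspci_text.splitlines() if NVIDIA_ID.search(line)]


def list_nvidia_devices():
    """Run `lspci -nn` and filter its output (assumes lspci is on PATH)."""
    result = subprocess.run(["lspci", "-nn"], capture_output=True, text=True)
    return find_nvidia_devices(result.stdout)
```

Calling `list_nvidia_devices()` on the affected machine and printing the result would show whether the card is visible on the PCI bus at all.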
However, when I tried to repeat the same procedure
on my second (and last) machine with a GeForce GTX 580 (the same
hardware and software configuration), the same problem appeared.
For the remaining problematic machine:
1) I have run memtestCL-1.00-linux64 several times
with no errors reported
2) nvidia-smi shows the card and does not report excessively high temperatures:
Tue Oct 23 13:11:40 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  GeForce GTX 580           | 0000:04:00.0  N/A    |       N/A        N/A |
|  11%   28 C  N/A   N/A /  N/A |   0%    5MB / 1535MB |  N/A      Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID     Process name                                       Usage      |
|=============================================================================|
|  0.           Not Supported                                                 |
+-----------------------------------------------------------------------------+
3) nvidia-smi -a gives the following output (which is basically
identical to that produced on the machine where the problem disappeared)
==============NVSMI LOG==============
Timestamp : Tue Oct 23 13:12:45 2012
Driver Version : 295.41
Attached GPUs : 1
GPU 0000:04:00.0
    Product Name : GeForce GTX 580
    Display Mode : N/A
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : N/A
    VBIOS Version : 70.10.17.00.00
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 0x04
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x108010DE
        Bus Id : 0000:04:00.0
        Sub System Id : 0x83851043
        GPU Link Info
            PCIe Generation
                Max : N/A
                Current : N/A
            Link Width
                Max : N/A
                Current : N/A
    Fan Speed : 11 %
    Performance State : N/A
    Memory Usage
        Total : 1535 MB
        Used : 5 MB
        Free : 1530 MB
    Compute Mode : Default
    Utilization
        Gpu : N/A
        Memory : N/A
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : 28 C
    Power Readings
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A
    Max Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A
    Compute Processes : Not Supported
4) when running make test, the very same tests failed as before on the
first machine. Here are fragments of the log file for the tests
that failed due to errors:
==============================================================
cd nucleosome/ && ./Run_md.1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kScaleVelocities
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run_md.1: Program error
make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
cd nucleosome/ && ./Run_md.2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kClearForces
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run_md.2: Program error
make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
cd amd/rna_gb && ./Run.gb.amd2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.gb.amd2.GPU_SPFP with mdout.gb.amd2
PASSED
==============================================================
==============================================================
cd 4096wat_oct/ && ./Run.pure_wat_oct_NPT_NTT1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.pure_wat_oct_NPT_NTT1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd large_solute_count/ && ./Run.ntb2_ntt1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.ntb2_ntt1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd jac/ && ./Run.jac SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.jac: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.dhfr: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntr1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.dhfr.ntr1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr.ntb2: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2_ntt1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.dhfr.ntb2_ntt1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2_ntt1_ntr1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
./Run.dhfr.ntb2_ntt1_ntr1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.noshake SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.dhfr.noshake.GPU_SPFP with mdout.dhfr.noshake
possible FAILURE: check mdout.dhfr.noshake.dif
==============================================================
==============================================================
cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr_pbc_charmm_noshake.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr_cmap_pbc_charmm.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr_cmap_pbc_charmm_NPT.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr_cmap_pbc_charmm_noshake.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd ips/ && ./Run.ips SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
./Run.ips: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.pme.amd1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.pme.amd2: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd3 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.pme.amd3: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/gact_ips && ./Run.ips.amd1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.ips.amd1.GPU_SPFP with mdout.ips.amd1
possible FAILURE: check mdout.ips.amd1.dif
==============================================================
==============================================================
cd nmropt/pme/distance/ && ./Run.dist_pbc SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEGradSum
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
./Run.dist_pbc: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd nmropt/pme/nmropt_1_torsion/ && ./Run.nmropt_1_torsion SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.GPU_SPFP with mdout
file ddtmp.mdout.GPU_SPFP is short
possible FAILURE: check mdout.dif
==============================================================
==============================================================
cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.min SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
./Run.dhfr_pbc_charmm_noshake.min: Program error
make[3]: [test.pmemd.cuda.pme.serial] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.dhfr_charmm_pbc_min.GPU_SPFP with mdout.dhfr_charmm_pbc_min
file ddtmp.mdout.dhfr_charmm_pbc_min.GPU_SPFP is short
possible FAILURE: check mdout.dhfr_charmm_pbc_min.dif
==============================================================
...
42 file comparisons passed
22 file comparisons failed
51 tests experienced errors
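To see whether the failures cluster in a few kernels or memory operations, the error lines in such a log can be tallied. A minimal Python sketch, assuming the messages keep exactly the wording shown in the fragments above; the function name tally_failures is only illustrative and is not part of the Amber test suite:

```python
import re
from collections import Counter

# Patterns copied from the wording of the log fragments above.
KERNEL_RE = re.compile(r"^Error: unspecified launch failure launching kernel (\w+)")
BUFFER_RE = re.compile(r"^(cudaMemcpy|cudaFree) GpuBuffer::(\w+) failed")


def tally_failures(log_text):
    """Count failing kernels and failing GpuBuffer operations in a test log."""
    kernels, buffers = Counter(), Counter()
    for line in log_text.splitlines():
        line = line.strip()
        match = KERNEL_RE.match(line)
        if match:
            kernels[match.group(1)] += 1
            continue
        match = BUFFER_RE.match(line)
        if match:
            # Key looks like "cudaMemcpy Download" or "cudaFree Deallocate".
            buffers[f"{match.group(1)} {match.group(2)}"] += 1
    return kernels, buffers
```

Here log_text is the contents of whatever log file the test run wrote; `kernels.most_common()` then lists the kernels that fail most often.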
Since the problem is reproduced on a second machine, I guess it is not
the hardware that fails, but rather the software. I tried reinstalling
the drivers several times, with no progress at all.
I would appreciate any hint on what to try next.
all the best,
Tomasz
2012/10/19 Ross Walker <ross.rosswalker.co.uk>
> Hi Tomasz,
>
> My gut instinct is to agree with Aron here and say this is likely a broken
> card, most likely the memory. However, it would be useful to know what you
> mean by half the tests fail. How do they fail exactly and what error
> messages do you get. Do you have the log files, especially the diff log
> from running the tests?
>
> nvidia-smi -a
>
> should show the card. I would try reinstalling the NVIDIA driver first
> though and then powering off the machine and starting it up again. You
> might want to also take a look inside and make sure nothing is stopping
> the GPU fan spinning as well, I've had issues like that before with an
> internal USB cable that wasn't secured being ingested by the GPU fan.
>
> Try a reinstall of the drivers first and a cold power down / restart and
> see if that helps. Run 'nvidia-smi' as root to get the driver loaded once
> the machine comes back up. If the problem still persists then suspect the
> hardware.
>
> All the best
> Ross
>
>
> On 10/19/12 1:13 AM, "Tomasz Borowski" <tomasz.borowski74.gmail.com>
> wrote:
>
> >Dear Amber users,
> >
> >I have a problem with getting pmemd.cuda working on GeForce GTX 580
> >card.
> >
> >With all the patches currently available applied to the code it compiles
> >with no problems (gnu compiler, open suse 11.2, no MKL libs). During tests
> >roughly half of the tests fail. However, the same binary code runs with no
> >problems on
> >another machine where I have GeForce GTX 285. The same driver and CUDA
> >toolkit installed
> >on the two machines. Does it mean the GTX 580 is faulty ?
> >
> >Maybe it could be of relevance, on the machine with GTX 580 I cannot get
> >the info on the graphics card with the usual command:
> >lspci | grep VGA
> >
> >could you, please, help ?
> >
> >
> >all the best,
> >Tomasz Borowski
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 23 2012 - 05:00:03 PDT