Re: [AMBER] amber12 (pmemd.cuda) on GeForce GTX 580 ?

From: <peter.stauffert.boehringer-ingelheim.com>
Date: Tue, 23 Oct 2012 14:42:10 +0200

Hi Tomasz,

lspci relies on a list of known PCI IDs: /usr/share/hwdata/pci.ids and
/usr/share/hwdata/pci.ids.d/*.ids (on RHEL).
So update your lspci package, or at least your pci.ids file, so that lspci
can identify the latest hardware.
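
To check whether a given pci.ids file already knows a card, one can search it for the vendor and device IDs. Below is a minimal sketch, assuming the usual pci.ids layout (vendor lines unindented, device lines indented by one tab, ID and name separated by two spaces). The function name lookup_pci_name and the SAMPLE text are hypothetical; SAMPLE is an illustrative excerpt, not copied from a real file.

```python
def lookup_pci_name(pci_ids_text, vendor_id, device_id):
    """Return (vendor_name, device_name) from pci.ids-formatted text.

    Either name is None when the text has no matching entry.
    """
    vendor_id = vendor_id.lower()
    device_id = device_id.lower()
    vendor_name = None
    in_vendor = False
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith("\t"):
            # Unindented line: "<vendor id>  <vendor name>"
            if in_vendor:
                break  # left the vendor's block without finding the device
            ident, _, name = line.partition("  ")
            if ident.lower() == vendor_id:
                vendor_name = name.strip()
                in_vendor = True
        elif in_vendor and not line.startswith("\t\t"):
            # Single-tab line: "<device id>  <device name>"
            ident, _, name = line[1:].partition("  ")
            if ident.lower() == device_id:
                return vendor_name, name.strip()
    return vendor_name, None


# Illustrative excerpt in pci.ids layout (assumption, not a real file).
SAMPLE = (
    "# sample data\n"
    "10de  NVIDIA Corporation\n"
    "\t05e3  GT200b [GeForce GTX 285]\n"
    "\t1080  GF110 [GeForce GTX 580]\n"
    "1043  ASUSTeK Computer Inc.\n"
)

print(lookup_pci_name(SAMPLE, "10de", "1080"))
```

On a real system the text would be read from /usr/share/hwdata/pci.ids (or wherever the distribution keeps it). A device name of None suggests the file predates the card.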

Peter

Dr. Peter Stauffert
Boehringer Ingelheim Pharma GmbH & Co. KG
mailto:peter.stauffert.boehringer-ingelheim.com
-----Original Message-----
From: Tomasz Borowski [mailto:tomasz.borowski74.gmail.com]
Sent: Tuesday, 23 October 2012 13:40
To: AMBER Mailing List
Subject: Re: [AMBER] amber12 (pmemd.cuda) on GeForce GTX 580 ?

Dear Robert, Aron, Ross,

many thanks for your helpful suggestions.

I reinstalled the driver and after a restart
the problem disappeared, i.e. pmemd.cuda
works fine and passes all tests.

Of note, the
lspci | grep VGA
still does not report the presence of the card.

However, when I tried to repeat the same procedure
on my second (and last) machine with a GeForce GTX 580 (same
hardware and software configuration), the same problem appeared.

For the remaining problematic machine:

1) I have run memtestCL-1.00-linux64 several times
with no errors reported

2) nvidia-smi shows the card and does not report excessive temperatures:
Tue Oct 23 13:11:40 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  GeForce GTX 580           | 0000:04:00.0  N/A    |       N/A        N/A |
|  11%   28 C  N/A   N/A /  N/A |   0%    5MB / 1535MB |  N/A      Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID     Process name                                       Usage      |
|=============================================================================|
|  0.           Not Supported                                                 |
+-----------------------------------------------------------------------------+

3) nvidia-smi -a gives the following output (which is basically
identical to that produced on the machine where the problem disappeared)
==============NVSMI LOG==============

Timestamp : Tue Oct 23 13:12:45 2012

Driver Version : 295.41

Attached GPUs : 1

GPU 0000:04:00.0
    Product Name : GeForce GTX 580
    Display Mode : N/A
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : N/A
    VBIOS Version : 70.10.17.00.00
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 0x04
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x108010DE
        Bus Id : 0000:04:00.0
        Sub System Id : 0x83851043
        GPU Link Info
            PCIe Generation
                Max : N/A
                Current : N/A
            Link Width
                Max : N/A
                Current : N/A
    Fan Speed : 11 %
    Performance State : N/A
    Memory Usage
        Total : 1535 MB
        Used : 5 MB
        Free : 1530 MB
    Compute Mode : Default
    Utilization
        Gpu : N/A
        Memory : N/A
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : 28 C
    Power Readings
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A
    Max Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A
    Compute Processes : Not Supported
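
The Device Id and Sub System Id values in the log above appear to pack two 16-bit numbers into one word: the device (or subsystem) ID in the upper half and the vendor ID in the lower half (0x10DE is NVIDIA's PCI vendor ID). A minimal sketch of the split, assuming that layout; split_pci_id is a hypothetical helper name:

```python
def split_pci_id(packed):
    """Split a packed 32-bit hex ID into (upper 16 bits, lower 16 bits)."""
    value = int(packed, 16)
    return "%04x" % (value >> 16), "%04x" % (value & 0xFFFF)


# Values taken from the nvidia-smi -a log above.
device, vendor = split_pci_id("0x108010DE")
subsystem, subsystem_vendor = split_pci_id("0x83851043")

# lspci -nn prints numeric IDs in vendor:device order.
print("%s:%s" % (vendor, device))
```

With the numeric pair in hand, the card can be searched for by ID rather than by name, for example with lspci -nn -d 10de: , which does not depend on the card's name being present in pci.ids.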

4) when running make test, the very same tests as before (on the
first machine) failed. Here are the fragments of the log file for the
tests that failed due to errors:

==============================================================
cd nucleosome/ && ./Run_md.1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kScaleVelocities
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run_md.1: Program error
make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
cd nucleosome/ && ./Run_md.2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kClearForces
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run_md.2: Program error
make[3]: [test.pmemd.cuda.gb] Error 1 (ignored)
cd amd/rna_gb && ./Run.gb.amd2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.gb.amd2.GPU_SPFP with mdout.gb.amd2
PASSED
==============================================================
==============================================================
cd 4096wat_oct/ && ./Run.pure_wat_oct_NPT_NTT1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.pure_wat_oct_NPT_NTT1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd large_solute_count/ && ./Run.ntb2_ntt1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.ntb2_ntt1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd jac/ && ./Run.jac SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.jac: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.dhfr: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntr1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.dhfr.ntr1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr.ntb2: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2_ntt1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.dhfr.ntb2_ntt1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.ntb2_ntt1_ntr1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
  ./Run.dhfr.ntb2_ntt1_ntr1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd dhfr/ && ./Run.dhfr.noshake SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.dhfr.noshake.GPU_SPFP with mdout.dhfr.noshake
possible FAILURE: check mdout.dhfr.noshake.dif
==============================================================
==============================================================
cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr_pbc_charmm_noshake.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr_cmap_pbc_charmm.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_NPT.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr_cmap_pbc_charmm_NPT.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm_noshake.md SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr_cmap_pbc_charmm_noshake.md: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd ips/ && ./Run.ips SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
  ./Run.ips: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEScalarSumRCEnergy
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.pme.amd1: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd2 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.pme.amd2: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/dhfr_pme && ./Run.pme.amd3 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.pme.amd3: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd amd/gact_ips && ./Run.ips.amd1 SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.ips.amd1.GPU_SPFP with mdout.ips.amd1
possible FAILURE: check mdout.ips.amd1.dif
==============================================================
==============================================================
cd nmropt/pme/distance/ && ./Run.dist_pbc SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
Error: unspecified launch failure launching kernel kPMEGradSum
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.dist_pbc: Program error
make[3]: [test.pmemd.cuda.pme] Error 1 (ignored)
cd nmropt/pme/nmropt_1_torsion/ && ./Run.nmropt_1_torsion SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.GPU_SPFP with mdout
file ddtmp.mdout.GPU_SPFP is short
possible FAILURE: check mdout.dif
==============================================================
==============================================================
cd chamber/dhfr_pbc/ && ./Run.dhfr_pbc_charmm_noshake.min SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr_pbc_charmm_noshake.min: Program error
make[3]: [test.pmemd.cuda.pme.serial] Error 1 (ignored)
cd chamber/dhfr_cmap_pbc/ && ./Run.dhfr_cmap_pbc_charmm.min SPFP
/home/borowski/soft/amber12_trajkowitka_suse11.2/include/netcdf.mod
diffing mdout.dhfr_charmm_pbc_min.GPU_SPFP with mdout.dhfr_charmm_pbc_min
file ddtmp.mdout.dhfr_charmm_pbc_min.GPU_SPFP is short
possible FAILURE: check mdout.dhfr_charmm_pbc_min.dif
==============================================================

...
42 file comparisons passed
22 file comparisons failed
51 tests experienced errors
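
To see whether the "unspecified launch failure" errors cluster in particular kernels or calls, the log can be tallied. This is a rough sketch, assuming the message wording shown in the fragments above; summarize_failures is a hypothetical helper and SAMPLE_LOG is a short excerpt assembled from those fragments, not the full log.

```python
import re
from collections import Counter


def summarize_failures(log_text):
    """Count kernels named in launch failures and failed cuda* buffer calls."""
    kernels = Counter(re.findall(r"launching kernel (\w+)", log_text))
    calls = Counter(
        re.findall(r"^(cuda\w+ GpuBuffer::\w+) failed", log_text, re.M)
    )
    return kernels, calls


# Excerpt assembled from the log fragments above.
SAMPLE_LOG = """\
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.jac: Program error
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
  ./Run.dhfr.ntb2: Program error
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
  ./Run.dhfr: Program error
"""

kernels, calls = summarize_failures(SAMPLE_LOG)
for name, count in kernels.most_common():
    print(name, count)
for name, count in calls.most_common():
    print(name, count)
```

Running the same function over the complete test log would give the per-kernel counts for all failing tests.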

Since the problem is duplicated on another machine, I guess it is not the
hardware that fails, but rather the software. I tried reinstalling the
drivers several times, yet with no progress at all.

I would appreciate any hints on what to try next.

all the best,
Tomasz

2012/10/19 Ross Walker <ross.rosswalker.co.uk>

> Hi Tomasz,
>
> My gut instinct is to agree with Aron here and say this is likely a broken
> card, most likely the memory. However, it would be useful to know what you
> mean by half the tests fail. How do they fail exactly, and what error
> messages do you get? Do you have the log files, especially the diff log
> from running the tests?
>
> nvidia-smi -a
>
> should show the card. I would try reinstalling the NVIDIA driver first
> though and then powering off the machine and starting it up again. You
> might want to also take a look inside and make sure nothing is stopping
> the GPU fan spinning as well, I've had issues like that before with an
> internal USB cable that wasn't secured being ingested by the GPU fan.
>
> Try a reinstall of the drivers first and a cold power down / restart and
> see if that helps. Run 'nvidia-smi' as root to get the driver loaded once
> the machine comes back up. If the problem still persists then suspect the
> hardware.
>
> All the best
> Ross
>
>
> On 10/19/12 1:13 AM, "Tomasz Borowski" <tomasz.borowski74.gmail.com>
> wrote:
>
> >Dear Amber users,
> >
> >I have a problem with getting pmemd.cuda working on GeForce GTX 580
> >card.
> >
> >With all the patches currently available applied to the code, it compiles
> >with no problems (GNU compiler, openSUSE 11.2, no MKL libs). During tests
> >roughly half of the tests fail. However, the same binary runs with no
> >problems on another machine where I have a GeForce GTX 285. The same driver
> >and CUDA toolkit are installed on the two machines. Does this mean the
> >GTX 580 is faulty?
> >
> >It may be relevant that on the machine with the GTX 580 I cannot get
> >the info on the graphics card with the usual command:
> >lspci | grep VGA
> >
> >Could you please help?
> >
> >
> >all the best,
> >Tomasz Borowski
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 23 2012 - 06:00:03 PDT