Re: [AMBER] 4x Asus Titan boards :) GB/nucleosome passed

From: Tru Huynh <tru.pasteur.fr>
Date: Mon, 8 Jul 2013 16:08:29 +0200

On Mon, Jul 08, 2013 at 02:43:57PM +0200, Marek Maly wrote:
> Hi Tru,
>
> #1
> Did you observe any temperature difference between the 3 GPUs
> which failed and the one which passed the Cellulose test?
I don't think we can read the memory temperature, only the GPU temperature.
[tru.margy ~]$ nvidia-smi
Mon Jul 8 15:15:17 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.319.23   Driver Version: 319.23         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     N/A |                  N/A |
| 58%   80C   N/A    N/A /  N/A |     1037MB /  6143MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     N/A |                  N/A |
| 57%   80C   N/A    N/A /  N/A |     1037MB /  6143MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     N/A |                  N/A |
| 57%   80C   N/A    N/A /  N/A |     1037MB /  6143MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     N/A |                  N/A |
| 57%   80C   N/A    N/A /  N/A |     1037MB /  6143MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

All four report the same GPU temperature. Too bad the GTX cards don't expose
all the nice information that the Tesla/Fermi cards do...

[tru.oopy amber]$ nvidia-smi
Mon Jul 8 15:41:22 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.319.23   Driver Version: 319.23         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K10.G2.8GB    Off  | 0000:04:00.0     Off |                    0 |
| N/A   18C    P8    17W / 117W |        9MB /  3583MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K10.G2.8GB    Off  | 0000:05:00.0     Off |                    0 |
| N/A   21C    P8    17W / 117W |        9MB /  3583MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   43C    P0   130W / 225W |     1031MB /  4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   38C    P0   135W / 225W |     1031MB /  4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    2      6406  /c5/shared/amber/12/20130703/gnu/bin/pmemd.cuda     1016MB  |
|    3      6412  /c5/shared/amber/12/20130703/gnu/bin/pmemd.cuda     1016MB  |
+-----------------------------------------------------------------------------+

-> the fanless K20s are running at a much lower temperature
   (38-43C at 99% utilization, versus 80C on the TITANs).
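
For logging temperatures over a long benchmark run, the table view is awkward
to parse. A minimal sketch in Python, assuming the installed driver supports
the CSV query mode (nvidia-smi --query-gpu=index,name,temperature.gpu
--format=csv,noheader). The function name and the sample text are
illustrative only; the sample values are copied from the Tesla table above,
not captured from a real CSV query.

```python
import csv
import io


def parse_gpu_temps(csv_text):
    """Parse 'index, name, temperature.gpu' rows into (index, name, temp_C)."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        idx, name, temp = (field.strip() for field in rec)
        # A field the board does not report is kept as None instead of a number.
        rows.append((int(idx), name, int(temp) if temp.isdigit() else None))
    return rows


# Hypothetical sample text, using the temperatures from the Tesla table.
sample = """0, Tesla K10.G2.8GB, 18
1, Tesla K10.G2.8GB, 21
2, Tesla K20m, 43
3, Tesla K20m, 38
"""

for idx, name, temp in parse_gpu_temps(sample):
    print(idx, name, temp)
```

On a live node the text would come from running the nvidia-smi query above
(for example through subprocess.check_output) instead of the sample string.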

> #2
> If the cause of the TITAN problem might be overheating, wouldn't it
> be worth simply trying to downclock the TITANs to K20/K20x frequency?
Why not.
>
> Has anybody already tried this possibility?
> Another possibility might be simply to increase fan activity.
AFAIK you can't, since nvidia-smi is feature-limited on the GTX* cards.

Example, K20:
[tru.oopy amber]$ nvidia-smi -i 3 --query-supported-clocks=mem,gr --format=csv
memory [MHz], graphics [MHz]
2600 MHz, 758 MHz
2600 MHz, 705 MHz
2600 MHz, 666 MHz
2600 MHz, 640 MHz
2600 MHz, 614 MHz
324 MHz, 324 MHz

[tru.oopy amber]$ nvidia-smi -i 3 -q -d CLOCK

==============NVSMI LOG==============

Timestamp : Mon Jul 8 16:07:42 2013
Driver Version : 319.23

Attached GPUs : 4
GPU 0000:84:00.0
    Clocks
        Graphics : 705 MHz
        SM : 705 MHz
        Memory : 2600 MHz
    Applications Clocks
        Graphics : 705 MHz
        Memory : 2600 MHz
    Default Applications Clocks
        Graphics : 705 MHz
        Memory : 2600 MHz
    Max Clocks
        Graphics : 758 MHz
        SM : 758 MHz
        Memory : 2600 MHz

GTX TITAN:
[tru.margy amber]$ nvidia-smi -i 3 --query-supported-clocks=mem,gr --format=csv
memory [MHz], graphics [MHz]
[Not Supported], [Not Supported]
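
To make the difference concrete, here is a rough Python sketch that parses the
two --query-supported-clocks outputs above and, where clocks are reported,
builds an nvidia-smi -ac (application clocks, given as memory,graphics)
command for the lowest graphics clock at the full memory clock. The function
names are hypothetical, and actually applying the command would normally need
root; on the GTX TITAN there is nothing to choose from, so the sketch returns
None.

```python
def parse_supported_clocks(csv_text):
    """Return a list of (memory_MHz, graphics_MHz) pairs; empty if unsupported."""
    pairs = []
    for line in csv_text.strip().splitlines()[1:]:  # first line is the CSV header
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2 or not all(f.endswith(" MHz") for f in fields):
            continue  # e.g. "[Not Supported], [Not Supported]" on the GTX TITAN
        mem, gr = (int(f.split()[0]) for f in fields)
        pairs.append((mem, gr))
    return pairs


def lowest_clock_command(gpu_index, pairs):
    """Build the nvidia-smi -ac argument list, or None if no clocks are listed."""
    if not pairs:
        return None
    top_mem = max(mem for mem, _ in pairs)
    lowest_gr = min(gr for mem, gr in pairs if mem == top_mem)
    return ["nvidia-smi", "-i", str(gpu_index), "-ac",
            "{},{}".format(top_mem, lowest_gr)]


# Outputs copied from the K20 and GTX TITAN queries shown above.
k20_output = """memory [MHz], graphics [MHz]
2600 MHz, 758 MHz
2600 MHz, 705 MHz
2600 MHz, 666 MHz
2600 MHz, 640 MHz
2600 MHz, 614 MHz
324 MHz, 324 MHz
"""
titan_output = """memory [MHz], graphics [MHz]
[Not Supported], [Not Supported]
"""

print(lowest_clock_command(3, parse_supported_clocks(k20_output)))
print(lowest_clock_command(3, parse_supported_clocks(titan_output)))
```

The sketch only builds the command; it does not run it.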

> #5
> Did your one "good" Titan pass all the Amber benchmarks
> twice (100K steps)
So far I have only tested PME/Cellulose_production_NPT and GB/nucleosome.

> without any problems and with 100% reproducible results in each test
> (including JAC one) ?
I can do that next.

Tru

-- 
Dr Tru Huynh          | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:tru.pasteur.fr | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France  
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jul 08 2013 - 07:30:02 PDT