Thanks Tru,
What about point #3, any idea regarding this strange selectivity?
I guess you will be able to reproduce this as well (having CUDA 5.0 or
5.5 installed).
If I remember correctly (I am not sure now), ET had the same experience here.
Anyway, the K20m working temperature (ca. 40°C, even without a fan?) seems to me
much lower than the 80°C of the Titan. But it is partly understandable if we
compare the 758 MHz of your K20m(?) with the 928 MHz of my Titan SC.
So maybe downclocking might be a way to solve the problem. It's just a pity
that under Linux, in the Titan's case, you have no other choice than to edit
the GPU BIOS, which is why I am still holding off on this step (e.g. waiting
for NVIDIA's response to this issue). Anyway, if you are going to experiment
here, please report your results, especially if you succeed in flashing your
Titan with your K20m BIOS (I hope you will then share it) :))
In the worst case, fan behaviour should also be editable in the GPU BIOS, shouldn't it?
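(Just a side note: if I read the nvidia-smi documentation correctly, on the Tesla
cards the application clocks can apparently be lowered without any BIOS editing,
while on the Titan the same query returns [Not Supported], as your output below
shows. A minimal sketch, assuming root access and one of the supported clock
pairs from your list:

   nvidia-smi -i 3 -ac 2600,614    # set application clocks: memory 2600 MHz, graphics 614 MHz
   nvidia-smi -i 3 -rac            # reset back to the default application clocks

So on the K20m itself this kind of downclocking experiment should be easy.)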
BTW, can you report some Amber benchmark times obtained on your 758 MHz K20m?
I would be interested in how big a decrease in performance one may expect after
downclocking a Titan from 928 MHz to 758 MHz. Maybe it is simply
928/758 = 1.22x, but I am not sure here.
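To make my rough estimate explicit (assuming, as an upper bound, that the run
time scales purely with the graphics clock, since the memory clock would stay
untouched):

   928 / 758 ≈ 1.22  ->  at most ~22% longer run time,
   i.e. ns/day dropping to at most 758/928 ≈ 82% of the current value.

Since pmemd.cuda is presumably not purely compute-bound, the real slowdown
should be somewhat smaller, but your K20m numbers would show what it really is.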
Anyway, my current feeling is that the problems might be connected with too
high a frequency, but not necessarily with too high a temperature (see e.g. my
point #3). Temperature-based issues are rather random, but my experience (#3)
does not confirm truly random behavior here.
Another argument is that I never obtained any TITAN GPU memory errors during
testing with memtestG80, even with rather extensive runs like
"./memtestG80 2000 1000" (2 GB tested for 1000 iterations), and
I just verified that the working temperature during this test is still 80°C.
I will now do even more extensive testing. In my opinion, if the problem is
simply memory overheating, I should obtain some errors in this testing as well,
since it was designed specifically for deep GPU memory testing.
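For the deeper test I plan something along these lines (just a sketch; it
assumes memtestG80 honours CUDA_VISIBLE_DEVICES for device selection, otherwise
one has to use whatever GPU-selection flag the particular build provides):

   # deeper memtestG80 pass on each of the four Titans, one log per card
   for i in 0 1 2 3; do
       CUDA_VISIBLE_DEVICES=$i ./memtestG80 4000 2000 > memtestG80_gpu${i}.log 2>&1
   done

i.e. 4 GB tested for 2000 iterations per card, roughly four times the amount of
work of my previous run.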
Another "#3-like" arguments are GB cases myoglobin/TRPCage which you
perhaps can simulate on Titans without any issue how long you want and for
any length reproduce the results.
So my guess is that the main problem might be frequency specific (not
necessarily temperature one)
which somehow affect cuFFT.
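Regarding the reproducibility checks I keep referring to, what I do is
basically just run the identical input twice on the same card and compare the
outputs, roughly like this (sketch only; mdin/prmtop/inpcrd are of course
placeholders for the particular benchmark files):

   export CUDA_VISIBLE_DEVICES=0
   $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout.run1 -x mdcrd.run1
   $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout.run2 -x mdcrd.run2
   diff mdout.run1 mdout.run2   # apart from the date stamp and timing lines, the energies should be identical

If a Titan at 928 MHz cannot pass even this simple check on the PME cases while
the GB cases always pass, that would support the frequency/cuFFT idea for me.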
Best,
Marek
On Mon, 08 Jul 2013 16:08:29 +0200, Tru Huynh <tru.pasteur.fr> wrote:
> On Mon, Jul 08, 2013 at 02:43:57PM +0200, Marek Maly wrote:
>> Hi Tru,
>>
>> #1
>> Did you observe any temperature difference between the 3 GPUs
>> which failed and the one which passed the Cellulose test?
> I don't think that we get the memory temperature, just the gpu ones.
> [tru.margy ~]$ nvidia-smi
> Mon Jul 8 15:15:17 2013
> NVIDIA-SMI 5.319.23   Driver Version: 319.23
>
> GPU  Name               Bus-Id        Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Uncorr. ECC  Compute M.
> 0    GeForce GTX TITAN  0000:03:00.0  58%  80C   N/A   N/A / N/A      1037MB / 6143MB  N/A       N/A          Default
> 1    GeForce GTX TITAN  0000:04:00.0  57%  80C   N/A   N/A / N/A      1037MB / 6143MB  N/A       N/A          Default
> 2    GeForce GTX TITAN  0000:83:00.0  57%  80C   N/A   N/A / N/A      1037MB / 6143MB  N/A       N/A          Default
> 3    GeForce GTX TITAN  0000:84:00.0  57%  80C   N/A   N/A / N/A      1037MB / 6143MB  N/A       N/A          Default
> (Persistence-M: Off and Disp.A: N/A on all four cards)
>
> They all show the same GPU temp; too bad the GTX can't report all the nice
> information that the Tesla/Fermi cards can...
>
> [tru.oopy amber]$ nvidia-smi
> Mon Jul 8 15:41:22 2013
> NVIDIA-SMI 5.319.23   Driver Version: 319.23
>
> GPU  Name              Bus-Id        Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Uncorr. ECC  Compute M.
> 0    Tesla K10.G2.8GB  0000:04:00.0  N/A  18C   P8    17W / 117W     9MB / 3583MB     0%        0            Default
> 1    Tesla K10.G2.8GB  0000:05:00.0  N/A  21C   P8    17W / 117W     9MB / 3583MB     0%        0            Default
> 2    Tesla K20m        0000:83:00.0  N/A  43C   P0    130W / 225W    1031MB / 4799MB  99%       0            Default
> 3    Tesla K20m        0000:84:00.0  N/A  38C   P0    135W / 225W    1031MB / 4799MB  99%       0            Default
> (Persistence-M: Off and Disp.A: Off on all four cards)
>
> Compute processes:
> GPU  PID   Process name                                     GPU Memory
> 2    6406  /c5/shared/amber/12/20130703/gnu/bin/pmemd.cuda  1016MB
> 3    6412  /c5/shared/amber/12/20130703/gnu/bin/pmemd.cuda  1016MB
>
> -> the fanless K20s are running at a much lower temperature.
>
>> #2
>> If the cause of the TITAN problem is overheating, wouldn't it
>> be worth simply trying to downclock the TITANs to the K20/K20x frequency?
> why not
>>
>> Has anybody already tried this possibility?
>> Another possibility might be simply to increase fan activity.
> AFAIK, you can't, since nvidia-smi is feature-limited on the GTX*.
>
> example: K20
> [tru.oopy amber]$ nvidia-smi -i 3 --query-supported-clocks=mem,gr --format=csv
> memory [MHz], graphics [MHz]
> 2600 MHz, 758 MHz
> 2600 MHz, 705 MHz
> 2600 MHz, 666 MHz
> 2600 MHz, 640 MHz
> 2600 MHz, 614 MHz
> 324 MHz, 324 MHz
>
> [tru.oopy amber]$ nvidia-smi -i 3 -q -d CLOCK
>
> ==============NVSMI LOG==============
>
> Timestamp : Mon Jul 8 16:07:42 2013
> Driver Version : 319.23
>
> Attached GPUs : 4
> GPU 0000:84:00.0
> Clocks
> Graphics : 705 MHz
> SM : 705 MHz
> Memory : 2600 MHz
> Applications Clocks
> Graphics : 705 MHz
> Memory : 2600 MHz
> Default Applications Clocks
> Graphics : 705 MHz
> Memory : 2600 MHz
> Max Clocks
> Graphics : 758 MHz
> SM : 758 MHz
> Memory : 2600 MHz
>
> GTXTITAN:
> [tru.margy amber]$ nvidia-smi -i 3 --query-supported-clocks=mem,gr --format=csv
> memory [MHz], graphics [MHz]
> [Not Supported], [Not Supported]
>
>> #5
>> Did your one "good" Titan pass all the Amber benchmarks
>> twice (100K steps)
> I have only tested PME/Cellulose_production_NPT and GB/nucleosome
>
>> without any problems and with 100% reproducible results in each test
>> (including the JAC one)?
> I can do that next.
>
> Tru
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jul 08 2013 - 08:30:02 PDT