Re: [AMBER] GTX 780 Ti & GTX 980

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 26 Nov 2014 13:18:39 -0800

Hi Jordi,

So it looks like you have diagnosed things correctly: three of your 780TIs
are good and one is bad. This is consistent with the 30% or so failure rate
I have seen in large-scale deployments of 780TI cards. The three good cards
are likely to remain good and stable (and fast!), so you should be fine to
keep using them. I would rerun the validation test from time to time to be
sure, and whenever a card gives you random problems with runs, such as
kernel launch errors that do not occur consistently at the same point in
the calculation or on the other cards. In all likelihood, though, these
three cards will remain good as long as they receive adequate cooling. I
assume you ran the validation test on all 4 cards at once, to fully stress
them all temperature-wise?

The one that shows inconsistent results is most definitely faulty. If you
were gaming with this card you would most likely get random lockups or
graphical glitches, which you might not associate with a bad GPU, or which
might be infrequent enough that you just put up with them, but for MD this
is obviously not acceptable. Hopefully you will have no trouble returning
it since you have clear proof that it is faulty.
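
As a rough illustration of what "inconsistent results" means here, the
check can be sketched in Python. This assumes each line of the validation
output has the form seen in the extract quoted below
(`<gpu>.<run>: Etot = ... EKtot = ... EPtot = ...`), with the first number
read as the GPU index and the second as the run index. That reading, the
regular expression, and the function name `inconsistent_gpus` are
assumptions made for illustration; none of this is part of the Amber
validation suite.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the extract quoted in this thread:
#   "<gpu>.<run>: Etot = ... EKtot = ... EPtot = ..."
LINE_RE = re.compile(
    r"(\d+)\.(\d+):\s*Etot\s*=\s*(-?\d+\.\d+)\s*"
    r"EKtot\s*=\s*(-?\d+\.\d+)\s*EPtot\s*=\s*(-?\d+\.\d+)"
)

def inconsistent_gpus(text):
    """Return sorted GPU ids whose runs did not all give identical energies."""
    energies = defaultdict(set)
    for gpu, _run, etot, ektot, eptot in LINE_RE.findall(text):
        # Compare the printed strings: a healthy card is expected to
        # reproduce the same digits on every run of the same test.
        energies[int(gpu)].add((etot, ektot, eptot))
    return sorted(g for g, seen in energies.items() if len(seen) > 1)

sample = """
0.0: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
0.1: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
2.0: Etot = -58221.9441 EKtot = 14408.0352 EPtot = -72629.9793
2.1: Etot = -58222.5727 EKtot = 14293.9326 EPtot = -72516.5054
"""
print(inconsistent_gpus(sample))  # prints [2]
```

The sketch expects the raw log; a copy that has been line-wrapped and
prefixed with ">" by a mail client would need those markers removed first.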

As for what to replace it with, you have a few choices, as you mention,
all of which are good.

1) Order another 780TI - this is a gamble, but you have about a 3 in 4
chance it will be fine.
2) Order a 980 - this would be my primary advice. So far in ongoing
testing we have seen no problems with these cards. Note that this locks
you to CUDA 6.5 (5.0 is actually faster), but as you are already using 6.5
it should not be an issue. Running a multi-GPU job across a 780TI and a 980
is not recommended, though, so with this option the 980 would be limited to
single-GPU runs.
3) Order a regular 780 - if you can get a really good deal on the price of
one of these then I would recommend it over a 980, but if the price
difference is small go with the 980. The same issue applies to running a
single job across multiple GPUs here: it is not good to mix different GPU
models in a multi-GPU run.
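
For a rough sense of the odds behind option 1, the arithmetic can be
sketched as follows. It assumes, hypothetically, that each card fails
independently with the "30% or so" probability quoted above; the constant
`P_BAD` and the function `prob_exactly_k_bad` are illustrative names only.

```python
from math import comb

P_BAD = 0.30  # assumed per-card failure probability ("30% or so" above)

def prob_exactly_k_bad(n_cards, k_bad, p_bad=P_BAD):
    """Binomial probability that exactly k_bad of n_cards are faulty,
    assuming cards fail independently (an assumption, not a measurement)."""
    return comb(n_cards, k_bad) * p_bad**k_bad * (1 - p_bad) ** (n_cards - k_bad)

p_new_card_good = 1 - P_BAD
p_all_four_good = prob_exactly_k_bad(4, 0)
p_one_of_four_bad = prob_exactly_k_bad(4, 1)

print(f"replacement card good: {p_new_card_good:.2f}")
print(f"all four cards good:   {p_all_four_good:.2f}")
print(f"exactly one bad of 4:  {p_one_of_four_bad:.2f}")
```

Under these assumptions a single replacement 780TI is good about 70% of the
time, which is close to the 3 in 4 figure, and exactly one bad card out of
four (about 41%) is the most likely single outcome for a batch of four.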

Hope that helps. In all cases I would recommend rerunning the validation
suite (you can just set the test number to 40 or so and leave it going
over the weekend) on all cards once you install the new card.

All the best
Ross



On 11/26/14, 9:35 AM, "Jordi Bujons" <jordi.bujons.iqac.csic.es> wrote:

>Hello,
>
>Although I have seen several posts on the use of different models of Nvidia
>graphics cards with Amber, I would like to ask a couple of additional
>questions. A few months ago I bought four Asus GTX780TI-DC2OC-3GD5 cards to
>build an MD workstation. I know that these cards are not recommended now,
>but at that time (around May) the GTX780TI was on the list of supported
>cards, and although I tried to RMA them back to Asus I was told that since
>they were not broken, they could not be replaced. Therefore, I had to go
>ahead, mount them in the machine and run the Amber GPU validation test to
>see whether they were OK or not. With Amber 14 and CUDA 6.5 I got good and
>consistent results for three of the cards, but one of them gave erratic
>results, as can be seen in the following extract from the output:
>
>0.0: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>0.1: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>0.2: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>:
>1.0: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>1.1: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>1.2: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>:
>2.0: Etot = -58221.9441 EKtot = 14408.0352 EPtot = -72629.9793
>2.1: Etot = -58222.5727 EKtot = 14293.9326 EPtot = -72516.5054
>2.2: Etot = -58232.7907 EKtot = 14435.4941 EPtot = -72668.2848
>:
>3.0: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>3.1: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>3.2: Etot = -58224.7039 EKtot = 14401.6602 EPtot = -72626.3640
>:
>:
>
>Then, running the Amber GPU benchmark, I got the following output, which I
>think is not bad (the faulty card was still installed in the machine at
>this point):
>
>JAC_PRODUCTION_NVE - 23,558 atoms PME 4fs
>-----------------------------------------
>CPU code 6 cores: | ns/day = 19.43 seconds/ns = 4445.64
> [0] 1 x GPU: | ns/day = 297.41 seconds/ns = 290.51
> [1] 1 x GPU: | ns/day = 295.19 seconds/ns = 292.70
> [2] 1 x GPU: | ns/day = 295.65 seconds/ns = 292.24
> [3] 1 x GPU: | ns/day = 299.56 seconds/ns = 288.43
>Multiple Single GPU Run Performance
> [0] 1 x GPU: | ns/day = 297.31 seconds/ns = 290.60
> [1] 1 x GPU: | ns/day = 295.30 seconds/ns = 292.59
> [2] 1 x GPU: | ns/day = 295.35 seconds/ns = 292.54
> [3] 1 x GPU: | ns/day = 297.98 seconds/ns = 289.95
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>Multiple 2xGPU Run Performance
> [0,1] 2 x GPU: [2,3] 2 x GPU:
>
>JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs
>-----------------------------------------
>CPU code 6 cores: | ns/day = 19.71 seconds/ns = 4382.52
> [0] 1 x GPU: | ns/day = 287.96 seconds/ns = 300.04
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>JAC_PRODUCTION_NVE - 23,558 atoms PME 2fs
>-----------------------------------------
>CPU code 6 cores: | ns/day = 10.26 seconds/ns = 8423.74
> [0] 1 x GPU: | ns/day = 155.55 seconds/ns = 555.45
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>JAC_PRODUCTION_NPT - 23,558 atoms PME 2fs
>-----------------------------------------
>CPU code 6 cores: | ns/day = 10.12 seconds/ns = 8537.35
> [0] 1 x GPU: | ns/day = 147.06 seconds/ns = 587.50
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
>-------------------------------------------
>CPU code 6 cores: | ns/day = 2.62 seconds/ns = 32938.45
> [0] 1 x GPU: | ns/day = 44.50 seconds/ns = 1941.77
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
>-------------------------------------------
>CPU code 6 cores: | ns/day = 2.56 seconds/ns = 33698.26
> [0] 1 x GPU: | ns/day = 42.71 seconds/ns = 2023.15
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>--------------------------------------------
>CPU code 6 cores: | ns/day = 0.52 seconds/ns = 166247.69
> [0] 1 x GPU: | ns/day = 10.47 seconds/ns = 8255.69
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>CELLULOSE_PRODUCTION_NPT - 408,609 atoms PME
>--------------------------------------------
>CPU code 6 cores: | ns/day = 0.48 seconds/ns = 180968.43
> [0] 1 x GPU: | ns/day = 10.12 seconds/ns = 8533.78
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>TRPCAGE_PRODUCTION - 304 atoms GB
>---------------------------------
>CPU code 6 cores: | ns/day = 151.82 seconds/ns = 569.09
> [0] 1 x GPU: | ns/day = 860.80 seconds/ns = 100.37
>
>MYOGLOBIN_PRODUCTION - 2,492 atoms GB
>-------------------------------------
>CPU code 6 cores: | ns/day = 3.60 seconds/ns = 23981.72
> [0] 1 x GPU: | ns/day = 238.24 seconds/ns = 362.66
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>NUCLEOSOME_PRODUCTION - 25,095 atoms GB
>---------------------------------------
>CPU code 6 cores: | ns/day = 0.04 seconds/ns = 2443943.60
> [0] 1 x GPU: | ns/day = 4.43 seconds/ns = 19498.84
> 2 x GPU:
> 3 x GPU:
> 4 x GPU:
>
>Since then, I have rerun the validation tests and the GPU benchmarks
>several times on the 3 cards that seemed to be all right. The results have
>been consistent and did not show any obvious problem, so my first question
>is: can I be confident that these 3 cards are OK and that the results
>obtained with them can be trusted? Or are there any further
>tests/benchmarks that should be run to be 100% sure?
>
>After going back and forth for a while, I have finally been able to return
>the faulty card to the dealer for an exchange, and now I have to decide
>which other card I should get. I do not want to get another GTX780TI since,
>looking at other posts, I guess I have been lucky that only one of the four
>was bad. Therefore I am considering a GTX 980, since the comments about
>these on the list seem to be good. Is this the right choice, or should I
>get one of the fully supported (and older) GTX780 cards to be sure that
>there won't be any more issues? Could there be any problem with mixing
>different types of cards, other than maybe not being able to run 4xGPU
>parallel calculations? Thanks for any comments.
>
>
>
>Jordi
>
>--------------------------------------------------------------------------
>Jordi Bujons, PhD
>Dept. of Biological Chemistry and Molecular Modeling (QBMM)
>Institute of Advanced Chemistry of Catalonia (IQAC)
>National Research Council of Spain (CSIC)
>Address: Jordi Girona 18-26, 08034 Barcelona, Spain
>Phone: +34 934006100 ext. 1291
>FAX: +34 932045904
>jordi.bujons.iqac.csic.es
>jbujons1.gmail.com
>http://www.iqac.csic.es
>--------------------------------------------------------------------------
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Wed Nov 26 2014 - 13:30:02 PST