I can't remember right now whether I got this particular error with my TITAN
(there was such a variety of them), but I do know I'm having this exact same
problem with my GeForce GTX 780 too. I will post the detailed troubleshooting
later, as I am a bit pressed for time.

I never got any memtest errors with my TITANs either, so IMO it is a
combination of the hardware and the CUDA code, though others will be able to
pin this down better than I can.

Have a look at the beginning of this (insanely long) thread:
http://archive.ambermd.org/201306/0686.html

The TITAN GPUs are officially kaput and we are all waiting for a patch,
which I guess will be relevant for the 780s too. I guess your options are
to RMA or wait for the patch.
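
In the meantime, if you want to see at which step a given run goes bad without
eyeballing the whole mdout, something like the rough sketch below might help.
It is only an illustration: the function name first_bad_step and the 400 K
cutoff are my own assumptions (pick a cutoff that suits your thermostat
target), and it expects the usual one-line "NSTEP = ... TEMP(K) = ..." header
as written in the mdout file itself, not the line-wrapped copy in this email.

```python
import re

# Step header as it appears in an mdout file, for example:
#  NSTEP =    70000   TIME(PS) =     270.000  TEMP(K) =   301.25  PRESS =     0.0
# The TEMP(K) field is a run of '*' when the Fortran output field overflows.
STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")


def first_bad_step(mdout_text, max_temp=400.0):
    """Return (nstep, raw_temp) for the first suspicious step, or None.

    A step is flagged when TEMP(K) is not a number (overflow asterisks),
    is NaN, or exceeds max_temp. The 400 K default is an arbitrary cutoff.
    """
    for match in STEP_RE.finditer(mdout_text):
        nstep, raw = int(match.group(1)), match.group(2)
        try:
            # "not (x <= limit)" is also True for NaN, which we want to flag.
            bad = not (float(raw) <= max_temp)
        except ValueError:
            bad = True  # e.g. "*********"
        if bad:
            return nstep, raw
    return None
```

To use it, read the whole mdout into a string, e.g.
first_bad_step(open("mdout").read()), and compare the step it reports between
the good card and the suspect one.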
On 11 July 2013 08:35, iqtcub <iqtcub.gmail.com> wrote:
> Hi all,
>
> First of all, I'm just a sysadmin, so my technical amber knowledge is very
> limited.
>
> Here's the scenario:
>
> We have a machine with a Gigabyte GTX580 (driver 319.32), SLES11 OS and
> CUDA5. We're using Amber 12 with AmberTools 13, updated with the latest
> patches. The compiler used is Intel 11.1.072, but we've also tried the
> GNU compilers that come with SLES11 (gcc version 4.3.4).
>
> This machine works fine.
>
> Now we've bought another machine with four EVGA GTX TITANs (driver 319.32),
> SLES11 OS and CUDA5. Same Amber version and patches, compilers, etc.
>
> With the input I'm attaching, we're seeing wrong TEMP, Etot and EKtot
> values. The first time, it happens after NSTEP 100000; if I kill the job
> and start it again, it happens after NSTEP 50000 or so. It looks like the
> overheating memory issues I've read about on the list that happen with the
> GTX TITAN.
>
> The job has correct values when doing the same job on the GTX580. The
> output is as follows:
>
> #############################
>
>  NSTEP =    70000   TIME(PS) =     270.000  TEMP(K) =   301.25  PRESS =     0.0
>  Etot   =   -112943.3112  EKtot   =     47807.0625  EPtot      =   -160750.3737
>  BOND   =     24797.9158  ANGLE   =      2543.7279  DIHED      =      3079.0534
>  1-4 NB =      1035.0690  1-4 EEL =     11250.6176  VDWAALS    =     26954.3958
>  EELEC  =   -230411.1531  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000004 0.00(Removed)
> check COM velocity, temp: 0.000001 0.00(Removed)
> check COM velocity, temp: 0.000003 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000001 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
>
> #############################
>
> While the output in the GTX TITAN is:
>
> #############################
>
>  NSTEP =    70000   TIME(PS) =     270.000  TEMP(K) =   619.32  PRESS =     0.0
>  Etot   =    -61495.6874  EKtot   =     98281.8984  EPtot      =   -159777.5858
>  BOND   =     29075.5960  ANGLE   =      2397.2902  DIHED      =      3028.8070
>  1-4 NB =      1015.5372  1-4 EEL =     11294.9046  VDWAALS    =     26864.1996
>  EELEC  =   -233453.9203  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000004 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000001 0.00(Removed)
> check COM velocity, temp: 0.000002 0.00(Removed)
> check COM velocity, temp: 0.000003 0.00(Removed)
> check COM velocity, temp: 1112.636152*********(Removed)
> check COM velocity, temp: 1286.464585*********(Removed)
> check COM velocity, temp: 824.413845*********(Removed)
> check COM velocity, temp: 1106.406956*********(Removed)
>
>
>  NSTEP =    80000   TIME(PS) =     280.000  TEMP(K) =*********  PRESS =     0.0
>  Etot   = **************  EKtot   = **************  EPtot      = **************
>  BOND   =         0.0000  ANGLE   =    423951.1225  DIHED      =     14356.6179
>  1-4 NB =         0.0000  1-4 EEL =         0.0067  VDWAALS    = **************
>  EELEC  =   -188998.4079  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  ------------------------------------------------------------------------------
>
> check COM velocity, temp: 1163.680382*********(Removed)
> check COM velocity, temp: 750.040734*********(Removed)
> check COM velocity, temp: 629.104266*********(Removed)
> check COM velocity, temp: 1465.801815*********(Removed)
> check COM velocity, temp: 637.864373*********(Removed)
> check COM velocity, temp: 1888.864547*********(Removed)
> check COM velocity, temp: 1527.586226*********(Removed)
> check COM velocity, temp: 1659.560655*********(Removed)
> check COM velocity, temp: 953.381316*********(Removed)
> check COM velocity, temp: 1613.977188*********(Removed)
>
> #############################
>
> Is this the same issue the other people are having with the GTX TITAN and
> that's being investigated?
>
> By the way, running either memtestG80 or cudagpumemtest
> (http://sourceforge.net/projects/cudagpumemtest/) after the job starts
> giving those results returns 0 errors.
>
> Thanks in advance!
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
>
Received on Thu Jul 11 2013 - 05:30:03 PDT