Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Gould, Ian R <i.gould.imperial.ac.uk>
Date: Tue, 28 May 2013 15:50:53 +0000

Dear All,

I have two GTX Titans, one per box, running the 319.17 driver with Amber 12.2
on CentOS 6.3, and they are rock stable. They run for days and for 100 ns at a
time with no issues. In one of the boxes I simply dropped the Titan in as a
replacement for a 680 and updated the driver. I did have some initial issues
with jobs failing after a few minutes, but these disappeared when I moved to
the current driver.

Guess this may not help too much, but have you tried pulling three of the
cards out and just running with one? I would try running with one and, if
that works, with two, and so on. Whilst the new cards run a lot cooler, they
do require a lot of power. I have spent too many hours of my life tracking
down problems only to discover that a malfunctioning PSU was the issue, so I
am now particular about buying the biggest and best PSU I can afford, and I
have colleagues who can test that the PSUs are actually delivering the quoted
amps on the rails. Just because the machine can run four 680s doesn't mean
that the PSU can cope with four Titans.
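
As a quick way to confine a test job to a single card without physically
pulling the others (a sketch only; the device numbering is an assumption,
check it with deviceQuery or nvidia-smi first), you can hide the other GPUs
from the CUDA runtime:

  export CUDA_VISIBLE_DEVICES=0     # expose only card 0 to pmemd.cuda
  pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o card0.out -r card0.rst

Note this does not unload the PSU the way removing the cards does, since the
idle boards still draw some power, but it does let you check each card in turn.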

HTH
Cheers
Ian

Women love us for our defects. If we have enough of them, they will
forgive us everything, even our intellects.
Oscar Wilde
-- 
Dr Ian R Gould, FRSC.
Reader in Computational Chemical Biology
Department of Chemistry
Imperial College London
Exhibition Road
London
SW7 2AY
E-mail i.gould.imperial.ac.uk
http://www3.imperial.ac.uk/people/i.gould
Tel +44 (0)207 594 5809
On 28/05/2013 16:11, "Robert Konecny" <rok.ucsd.edu> wrote:
>Hi Scott,
>
>unfortunately we are seeing similar Amber instability on GTX Titans as
>Marek is. We have a box with four GTX Titans (not overclocked) running
>CentOS 6.3 with the NVIDIA 319.17 driver and Amber 12.2. Any Amber
>simulation longer than 10-15 min eventually crashes on these cards,
>including both JAC benchmarks (with extended run time). This is
>reproducible on all four cards.
>
>To eliminate a possible hardware error we ran extended GPU memory tests
>on all four Titans with memtestG80, cuda_memtest and also gpu_burn - all
>finished without errors. Since I agree that these programs may not test
>the GPU completely, we also set up simulations with NAMD. We can run four
>NAMD simulations simultaneously for many days without any errors on this
>hardware. For reference, we also have exactly the same server with the
>same hardware components but with four GTX 680s, and that setup works
>just fine for Amber. So all this leads me to believe that a hardware
>error is not very likely.
>
>I would appreciate your comments on this; perhaps there is something else
>causing these errors which we are not seeing.
>
>Thanks,
>
>Robert
>
>
>On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
>> I have two GTX Titans.  One is defective, the other is not.
>> Unfortunately, they both pass all standard GPU memory tests.
>> 
>> What the defective one doesn't do is generate reproducibly bit-accurate
>> outputs for simulations of Factor IX (90,986 atoms) or larger over 100K
>> or so iterations.
>> 
>> Which is yet another reason why I insist on MD algorithms (especially on
>> GPUs) being deterministic.  Besides helping to find software bugs and
>> fulfilling one of the most important tenets of science, determinism is a
>> great way to diagnose defective hardware with very little effort.
>> 
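>> A minimal way to use this as a hardware check (a sketch, assuming the
>> random seed ig in the mdin file is left at a fixed value rather than
>> ig=-1, and that both runs use the same card and the same input files) is
>> to run the identical job twice and compare the restart files:
>>
>>   pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o run1.out -r run1.rst
>>   pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o run2.out -r run2.rst
>>   cmp run1.rst run2.rst   # any difference points to a problem
>>
>> On healthy hardware the two restart files come out bit-identical; on the
>> defective card they eventually diverge.
>>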
>> 928 MHz?  That's 6% above the boost clock of a stock Titan.  Titan is
>> pushing the performance envelope as it is.  If you're going to pay the
>> premium for such chips, I'd send them back until you get one that runs
>> correctly.  I'm very curious how fast you can push one of these things
>> before it gives out.
>> 
>> 
>> On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>> 
>> > Dear all,
>> >
>> > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>> >
>> > I did the first calculations (pmemd.cuda in Amber 12) with systems of
>> > around 60K atoms without any problems (NPT, Langevin), but when I later
>> > tried bigger systems (around 100K atoms) I obtained the classic
>> > irritating error
>> >
>> > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >
>> > after just a few thousand MD steps.
>> >
>> > This was obviously the reason for the memtestG80 tests
>> > ( https://simtk.org/home/memtest ).
>> >
>> > So I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and
>> > then tested just a small part of the GPU memory (200 MiB) using 100
>> > iterations.
>> >
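>> > For reference, the invocation was of the form shown below (a sketch;
>> > the amount of memory in MiB and the iteration count are positional
>> > arguments, and the bundled README documents the option for selecting a
>> > particular card), run once per GPU:
>> >
>> >   ./memtestG80 200 100
>> >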
>> > On both cards I obtained a huge number of errors, but "just" on
>> > "Random blocks:" - 0 errors in all the remaining tests in all iterations.
>> >
>> > ------THE LAST ITERATION AND FINAL RESULTS-------
>> >
>> > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>> >         Moving Inversions (ones and zeros): 0 errors (6 ms)
>> >         Memtest86 Walking 8-bit: 0 errors (53 ms)
>> >         True Walking zeros (8-bit): 0 errors (26 ms)
>> >         True Walking ones (8-bit): 0 errors (26 ms)
>> >         Moving Inversions (random): 0 errors (6 ms)
>> >         Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>> >         Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>> >         Random blocks: 1369863 errors (27 ms)
>> >         Memtest86 Modulo-20: 0 errors (215 ms)
>> >         Logic (one iteration): 0 errors (4 ms)
>> >         Logic (4 iterations): 0 errors (8 ms)
>> >         Logic (shared memory, one iteration): 0 errors (8 ms)
>> >         Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>> >
>> > Final error count after 100 iterations over 200 MiB of GPU memory:
>> > 171106710 errors
>> >
>> > ------------------------------------------
>> >
>> > I have some questions and would be really grateful for any comments.
>> >
>> > Regarding overclocking: using deviceQuery I found out that under Linux
>> > both cards automatically run at the boost shader/GPU frequency, which is
>> > here 928 MHz (the base value for these factory-OC cards is 876 MHz).
>> > deviceQuery reported a Memory Clock rate of 3004 MHz although "it"
>> > should be 6008 MHz, but maybe the quantity reported by deviceQuery as
>> > "Memory Clock rate" is different from the product specification "Memory
>> > Clock". It seems that "Memory Clock rate" = "Memory Clock"/2. Am I
>> > right? Or is deviceQuery just not able to read this spec properly on
>> > the Titan GPU?
>> >
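>> > (That factor of two is most likely correct rather than a misread:
>> > deviceQuery prints the clock returned by cudaGetDeviceProperties, while
>> > vendors quote the effective GDDR5 data rate, which is twice that figure,
>> > i.e. 2 x 3004 MHz = 6008 MHz.)
>> >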
>> > Anyway, for the moment I assume that the problem might be due to the
>> > high shader/GPU frequency
>> > (see here: http://folding.stanford.edu/English/DownloadUtils ).
>> >
>> > To verify this hypothesis one should perhaps UNDERclock to the base
>> > frequency, which for this model is 876 MHz, or even to the Titan
>> > reference frequency, which is 837 MHz.
>> >
>> > Obviously I am working with these cards under Linux (CentOS, kernel
>> > 2.6.32-358.6.1.el6.x86_64) and, as far as I could find, the OC tools
>> > under Linux are in fact limited to the NVclock utility, which is
>> > unfortunately out of date (at least as far as the GTX Titan is
>> > concerned). I obtained this message when I just wanted NVclock to read
>> > and print the shader and memory frequencies of my Titans:
>> >
>> > -------------------------------------------------------------------
>> >
>> > [root.dyn-138-272 NVCLOCK]# nvclock  -s  --speeds
>> > Card:           Unknown Nvidia card
>> > Card number:    1
>> > Memory clock:   -2147483.750 MHz
>> > GPU clock:      -2147483.750 MHz
>> >
>> > Card:           Unknown Nvidia card
>> > Card number:    2
>> > Memory clock:   -2147483.750 MHz
>> > GPU clock:      -2147483.750 MHz
>> >
>> >
>> > -------------------------------------------------------------------
>> >
>> >
>> > I would be really grateful for some tips regarding NVclock alternatives,
>> > but after wasting some hours on googling it seems that there is no other
>> > Linux tool with NVclock's functionality. So the only possibility here is
>> > perhaps to edit the GPU BIOS with some Linux/DOS/Windows tools (Kepler
>> > BIOS Tweaker, NVflash), but obviously I would rather avoid such an
>> > approach, as using it probably also means voiding the warranty, even
>> > though I intend to underclock the GPUs, not overclock them.
>> > So before this eventual step (GPU BIOS editing) I would like to have
>> > some approximate estimate of the probability that the problems really
>> > are due to the overclocking (too high a default boost shader frequency).
>> >
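>> > (A small aside, offered only as a suggestion: the nvidia-smi utility
>> > shipped with the driver can at least query the clocks in use, e.g.
>> >
>> >   nvidia-smi -q -d CLOCK        # report current graphics/SM/memory clocks
>> >   nvidia-smi -q -d TEMPERATURE  # report GPU temperature
>> >
>> > although it cannot change them, and some readings may show as N/A on
>> > GeForce boards depending on the driver.)
>> >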
>> > I hope to estimate this probability from the responses of other
>> > Amber/Titan SC users, if I am not the only crazy guy who bought this
>> > model for Amber calculations :)) But of course any experiences with
>> > Titan cards regarding their memtestG80 results and UNDER/OVERclocking
>> > (if that is possible under Linux) are welcome as well!
>> >
>> > My HW/SW configuration
>> >
>> > motherboard: ASUS P9X79 PRO
>> > CPU: Intel Core i7-3930K
>> > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>> > CASE: CoolerMaster Dominator CM-690 II Advanced,
>> > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>> > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>> > cooler: Cooler Master Hyper 412 SLIM
>> >
>> > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>> > driver version: 319.17
>> > cudatoolkit_5.0.35_linux_64_rhel6.x
>> >
>> > The computer is in an air-conditioned room with a constant ambient
>> > temperature of around 18°C.
>> >
>> >
>> >    Thanks a lot in advance for any comment/experience !
>> >
>> >       Best wishes,
>> >
>> >            Marek
>> >
>> > --
>> > This message was created with Opera's revolutionary e-mail client:
>> > http://www.opera.com/mail/
>> >
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>> >
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 28 2013 - 09:00:04 PDT