Re: [AMBER] memtestG80 alternative ? - testing actual GPUs regarding soft errors

From: Dow Hurst <dphurst.uncg.edu>
Date: Mon, 15 Jul 2019 14:32:21 -0400

Ross,
I found that slightly modifying your GPU validation scripts to run two
STMV jobs at once on a Gigabyte RTX 2080 Ti fits perfectly in the card's
RAM, so the full memory gets tested. I really appreciate your providing the
scripts to the community; I run them every 6 to 12 months just to check our
GPUs for errors. The modification below should only be used with
simulations that still fit in GPU RAM when doubled up on a card. With two
simultaneous STMV runs on a single GPU, the performance shown below is
around half what you would normally expect. This system has Ubuntu 18.04
Server, CUDA 10.1, Amber18 with updates, AmberTools19, and two RTX cards. I
wanted two STMV runs on each card, with both cards tested
simultaneously. Attached is a PNG snapshot of the nvidia-smi output while
both cards were being tested. The temperature in the nvidia-smi output is
low because the runs had just been started. The cards got up to
~75C after running for several hours.
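
Before doubling up, it may be worth confirming that two copies of the job
really fit in the card's memory. The sketch below is a hypothetical helper,
not part of Ross's scripts: two_runs_fit takes the memory used by one run
and the card's total, both in MiB, and succeeds only if twice the per-run
figure fits. The numbers in the usage lines are placeholders, not
measurements. On a real system the total can be read with
"nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0",
and the per-run usage from nvidia-smi while a single run is active.

```shell
#!/bin/bash
# Hypothetical helper: succeed if two simultaneous runs fit in GPU memory.
# $1 = MiB used by a single run, $2 = total MiB on the card.
two_runs_fit() {
  local per_run_mib=$1 total_mib=$2
  [ $(( per_run_mib * 2 )) -le "$total_mib" ]
}

# Placeholder figures for illustration only.
if two_runs_fit 5000 11000; then
  echo "two runs fit"
else
  echo "two runs do not fit"
fi
```

The check is deliberately conservative: it ignores any per-process driver
overhead, so a result close to the limit should still be verified by
watching nvidia-smi during a short trial run.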

#How many GPUs in node
gpu_count=2

#How many tests to run
test_count=20

if [ "$run_1gpu_test" = true ] ; then

  # -p avoids an error on reruns; abort if the directory cannot be entered
  mkdir -p output_files
  cd output_files || exit 1

  i=0

  while [ $i -lt $test_count ]; do

    j=0

    while [ $j -lt $gpu_count ]; do
      export CUDA_VISIBLE_DEVICES=$j

      # Launch two runs (a and b) in the background on the same GPU
      $AMBERHOME/bin/pmemd.cuda -O -i ../input/mdin -p ../input/prmtop \
        -c ../input/inpcrd -o mdout.a.$j.$i -x mdcrd.a.$j.$i \
        -r restrt.a.$j.$i -inf mdinfo.a.$j.$i &
      $AMBERHOME/bin/pmemd.cuda -O -i ../input/mdin -p ../input/prmtop \
        -c ../input/inpcrd -o mdout.b.$j.$i -x mdcrd.b.$j.$i \
        -r restrt.b.$j.$i -inf mdinfo.b.$j.$i &

      let j=j+1

    done

    wait

    j=0

    while [ $j -lt $gpu_count ]; do
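      # The mail archive appears to have collapsed runs of spaces; " +" with
      # -E matches the padded "NSTEP =    20000" and "Etot   =" in real mdout.
      echo -n "a.$j.$i: " >> ../GPU_$j.log
      grep -A 1 -E "NSTEP = +20000" mdout.a.$j.$i > tmp
      grep -m 1 -E "Etot +=" tmp >> ../GPU_$j.log
      echo -n "b.$j.$i: " >> ../GPU_$j.log
      grep -A 1 -E "NSTEP = +20000" mdout.b.$j.$i > tmp
      grep -m 1 -E "Etot +=" tmp >> ../GPU_$j.log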
      let j=j+1
    done

    let i=i+1

  done
  cd ../
fi

Here is what the output looks like:
a.0.0: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.0: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.1: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.1: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.2: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.2: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.3: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.3: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.4: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.4: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.5: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.5: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.6: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.6: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.7: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.7: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.8: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.8: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.9: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.9: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.10: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.10: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.11: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.11: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.12: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.12: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.13: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.13: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.14: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.14: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.15: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.15: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.16: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.16: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.17: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.17: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.18: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.18: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.19: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.19: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
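
Every line in a GPU log should carry the same energies, so one way to scan
a log like this is to strip the run label and count the distinct lines that
remain. The following is a minimal sketch under that assumption;
count_distinct_energies is a hypothetical helper name, not part of the
validation scripts, and it assumes labels of the form "a.0.3: " as written
by the loop in this post.

```shell
#!/bin/bash
# Hypothetical helper: count distinct energy lines in one GPU log.
# Strips the leading "a.0.3: " style label, then counts unique remainders.
count_distinct_energies() {
  sed -E 's/^[ab]\.[0-9]+\.[0-9]+: *//' "$1" | grep "Etot" | sort -u \
    | wc -l | tr -d ' '
}

# Report on every GPU log in the current directory, if any exist.
for log in GPU_*.log; do
  [ -e "$log" ] || continue
  n=$(count_distinct_energies "$log")
  if [ "$n" -eq 1 ]; then
    echo "$log: all runs identical"
  else
    echo "$log: $n distinct results, investigate"
  fi
done
```

A count of 1 means every run on that card ended with the same energies.
Anything else, including a count of 0 from an empty log, deserves a closer
look at the individual mdout files.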


cat mdinfo.a.0.0

 NSTEP = 20000 TIME(PS) = 5161.003 TEMP(K) = 300.15 PRESS = -27.5
 Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
 BOND = 35006.6587 ANGLE = 86346.6259 DIHED = 119495.9931
 1-4 NB = 39728.0720 1-4 EEL = 215992.5737 VDWAALS = 326536.0574
 EELEC = -4195968.6452 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 269695.6679 VIRIAL = 276268.0553 VOLUME = 11079513.3253
 Density = 1.0035

| Final Performance Info:
| -----------------------------------------------------
| Average timings for last 1000 steps:
| Elapsed(s) = 59.86 Per Step(ms) = 59.86
| ns/day = 2.89 seconds/ns = 29929.91
|
| Average timings for all steps:
| Elapsed(s) = 1192.76 Per Step(ms) = 59.64
| ns/day = 2.90 seconds/ns = 29819.00
| -----------------------------------------------------
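
The ns/day figures follow directly from the per-step time. The sketch below
reproduces that arithmetic; the 0.002 ps timestep is an assumption inferred
from the reported seconds/ns (about 500,000 steps per ns), not something
stated in the mdin here, and ns_per_day is a hypothetical helper name.

```shell
#!/bin/bash
# ns/day from wall-clock milliseconds per step and the timestep in ps.
# $1 = per-step time in ms, $2 = timestep in ps (0.002 is an inferred value).
ns_per_day() {
  awk -v ms="$1" -v dt="$2" \
    'BEGIN { printf "%.2f\n", 86400 / (ms / 1000) * dt / 1000 }'
}

ns_per_day 59.64 0.002   # average over all steps, prints 2.90
ns_per_day 59.86 0.002   # average over the last 1000 steps, prints 2.89
```

Both values agree with the ns/day reported in the mdinfo above, which is
consistent with each doubled-up run getting roughly half the card.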

Sincerely,
Dow
⚛Dow Hurst, Research Scientist
       340 Sullivan Science Bldg.
       Dept. of Chem. and Biochem.
       University of North Carolina at Greensboro
       PO Box 26170 Greensboro, NC 27402-6170



On Tue, Jul 2, 2019 at 10:07 PM Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Marek,
>
> AMBER is still by far the most 'reliable' software I know of for
> 'breaking' GPUs. ;-) Just take the benchmark suite from the AMBER website.
> Take the STMV test case, make sure ig is set to a positive integer and
> adjust nstlim so it will run for about 2 hours - then set that to loop 12
> times so it runs over 24 hours. At the end diff the outputs against each
> other. They should all be identical. If they aren't then you have something
> wrong with your GPU - bad memory, overclocked etc.
>
> All the best
> Ross
>
> > On Jul 2, 2019, at 11:37, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> > Hello,
> >
> > for a long time there was a very good and useful tool for testing GPUs,
> > focused on "soft errors":
> >
> > memtestG80 ( https://simtk.org/projects/memtest ), but it seems that for
> > recent GPUs it is not usable.
> >
> > "memtest g80 is not compatible with gddr5 or later."
> >
> > ( see
> > https://forums.geforce.com/default/topic/1080529/rtx-strix-2080-errors-with-memtestg80-help/
> > )
> >
> > Does anybody know of a modern alternative to memtestG80 that
> > could be used for testing an RTX 2080 Ti (with GDDR6)?
> >
> > Or are "soft errors" on modern GPUs so rare that no such testing tool
> > exists?
> >
> > I tried to find some on the internet, but without success.
> >
> > Thanks in advance for any comments,
> >
> > Best wishes,
> >
> > Marek
> >
> >
> >
> >
> > --
> > Created with Opera's mail client: http://www.opera.com/mail/
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>



(image/png attachment: Nvidia-smi_RTX2080TI-Validation-start.png)

Received on Mon Jul 15 2019 - 12:00:03 PDT