Ross,
I found that a slight modification to your GPU validation scripts, allowing
two STMV runs at once on a Gigabyte RTX 2080 Ti, fits perfectly in the
card's RAM, so the full RAM gets tested. I really appreciate your providing
the scripts to the community; I run them every 6 months to a year just to
check our GPUs for errors. The modification below should only be used with
simulations that still fit in GPU RAM when doubled up on a card. With two
simultaneous STMV runs on a single GPU, the per-run performance shown below
is around half what you would normally expect. This system has Ubuntu 18.04
Server, CUDA 10.1, Amber18 with updates, AmberTools19, and two RTX cards. I
wanted two STMV runs on each card, with both cards tested simultaneously.
Attached is a PNG snapshot of the nvidia-smi output while both cards were
being tested. The temperature in the nvidia-smi output is low because the
runs had just been started. The cards got up to ~75C after running for
several hours.
#How many GPUs in node
gpu_count=2
#How many tests to run
test_count=20
if [ "$run_1gpu_test" = true ] ; then
  mkdir output_files
  cd output_files
  i=0
  while [ $i -lt $test_count ]; do
    # Start two STMV runs (a and b) on every GPU at the same time
    j=0
    while [ $j -lt $gpu_count ]; do
      export CUDA_VISIBLE_DEVICES=$j
      $AMBERHOME/bin/pmemd.cuda -O -i ../input/mdin -p ../input/prmtop \
        -c ../input/inpcrd -o mdout.a.$j.$i -x mdcrd.a.$j.$i \
        -r restrt.a.$j.$i -inf mdinfo.a.$j.$i &
      $AMBERHOME/bin/pmemd.cuda -O -i ../input/mdin -p ../input/prmtop \
        -c ../input/inpcrd -o mdout.b.$j.$i -x mdcrd.b.$j.$i \
        -r restrt.b.$j.$i -inf mdinfo.b.$j.$i &
      let j=j+1
    done
    # Block until all runs of this iteration have finished
    wait
    # Append the final-step energies of both runs to each GPU's log
    j=0
    while [ $j -lt $gpu_count ]; do
      echo -n "a.$j.$i: " >> ../GPU_$j.log
      grep -A 1 "NSTEP = 20000" mdout.a.$j.$i > tmp
      grep -m 1 "Etot =" tmp >> ../GPU_$j.log
      echo -n "b.$j.$i: " >> ../GPU_$j.log
      grep -A 1 "NSTEP = 20000" mdout.b.$j.$i > tmp
      grep -m 1 "Etot =" tmp >> ../GPU_$j.log
      let j=j+1
    done
    let i=i+1
  done
  cd ../
fi
Here is what the output looks like:
a.0.0: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.0: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.1: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.1: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.2: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.2: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.3: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.3: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.4: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.4: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.5: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.5: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.6: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.6: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.7: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.7: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.8: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.8: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.9: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.9: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.10: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.10: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.11: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.11: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.12: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.12: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.13: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.13: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.14: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.14: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.15: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.15: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.16: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.16: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.17: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.17: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.18: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.18: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
a.0.19: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
b.0.19: Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
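Since the whole point is that every line in a log should be identical, the comparison can be automated rather than done by eye. The following is only a sketch: check_gpu_log is a hypothetical helper name (it is not part of Ross's scripts), and it assumes the GPU_$j.log format shown above, where each line is a run label followed by the energies.

```shell
#!/bin/bash
# Sketch only: check_gpu_log is a hypothetical helper, not part of the
# original validation scripts. It assumes every log line looks like
# "a.0.0: Etot = ... EKtot = ... EPtot = ...".
check_gpu_log() {
    local log="$1"
    local distinct
    # Drop the "a.J.I: " run label, squeeze repeated spaces, then count
    # how many different energy lines remain.
    distinct=$(sed 's/^[^:]*: *//' "$log" | tr -s ' ' | sort -u | awk 'END { print NR }')
    if [ "$distinct" -eq 1 ]; then
        echo "$log: PASS"
    else
        echo "$log: FAIL ($distinct distinct results)"
        return 1
    fi
}
```

After sourcing the function, something like `for f in GPU_*.log; do check_gpu_log "$f"; done` would report each card. An empty log, or a run that produced no energy line, also counts as a failure here, which seemed the safer choice.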
cat mdinfo.a.0.0
NSTEP = 20000 TIME(PS) = 5161.003 TEMP(K) = 300.15 PRESS = -27.5
Etot = -2709811.4145 EKtot = 663051.2500 EPtot = -3372862.6645
BOND = 35006.6587 ANGLE = 86346.6259 DIHED = 119495.9931
1-4 NB = 39728.0720 1-4 EEL = 215992.5737 VDWAALS = 326536.0574
EELEC = -4195968.6452 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 269695.6679 VIRIAL = 276268.0553 VOLUME = 11079513.3253
Density = 1.0035
| Final Performance Info:
| -----------------------------------------------------
| Average timings for last 1000 steps:
| Elapsed(s) = 59.86 Per Step(ms) = 59.86
| ns/day = 2.89 seconds/ns = 29929.91
|
| Average timings for all steps:
| Elapsed(s) = 1192.76 Per Step(ms) = 59.64
| ns/day = 2.90 seconds/ns = 29819.00
| -----------------------------------------------------
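To relate the per-step timing to the ns/day figure, and to see what the two runs sharing a card deliver together, the arithmetic can be sketched as below. ns_per_day is a hypothetical helper, and the 2 fs time step is an assumption inferred from the reported numbers (29819 seconds/ns at 59.64 ms per step is about 500,000 steps per ns); it is not read from the mdin file.

```shell
#!/bin/bash
# Sketch only: ns_per_day is a hypothetical helper name.
# Arguments: wall time per step in ms, and the MD time step in fs
# (the time step is an assumption, see the note above).
ns_per_day() {
    local ms_per_step="$1" dt_fs="$2"
    # steps per day = 86400000 ms / ms_per_step; each step advances dt_fs * 1e-6 ns
    awk -v ms="$ms_per_step" -v dt="$dt_fs" \
        'BEGIN { printf "%.2f\n", (86400000 / ms) * dt * 1e-6 }'
}
```

With these assumptions, `ns_per_day 59.64 2` reproduces the 2.90 ns/day reported above for one run; doubling that gives the combined throughput of the two runs on the card.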
Sincerely,
Dow
⚛Dow Hurst, Research Scientist
340 Sullivan Science Bldg.
Dept. of Chem. and Biochem.
University of North Carolina at Greensboro
PO Box 26170 Greensboro, NC 27402-6170
On Tue, Jul 2, 2019 at 10:07 PM Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Marek,
>
> AMBER is still by far the most 'reliable' software I know of for
> 'breaking' GPUs. ;-) Just take the benchmark suite from the AMBER website.
> Take the STMV test case, make sure ig is set to a positive integer and
> adjust nstlim so it will run for about 2 hours - then set that to loop 12
> times so it runs over 24 hours. At the end diff the outputs against each
> other. They should all be identical. If they aren't then you have something
> wrong with your GPU - bad memory, overclocked etc.
>
> All the best
> Ross
>
> > On Jul 2, 2019, at 11:37, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> > Hello,
> >
> > for a long time there was a very good and useful tool for testing GPUs,
> > focused on "soft errors":
> >
> > memtestG80 ( https://simtk.org/projects/memtest ), but it seems that it
> > is not usable for recent GPUs.
> >
> > "memtest g80 is not compatible with gddr5 or later."
> >
> > ( see
> > https://forums.geforce.com/default/topic/1080529/rtx-strix-2080-errors-with-memtestg80-help/
> > )
> >
> > Does anybody know of a modern alternative to memtestG80 that
> > could be used for testing an RTX 2080 Ti (with GDDR6)?
> >
> > Or are "soft errors" on modern GPUs so rare that no such testing tool
> > exists?
> >
> > I tried to find some on the internet, but without success.
> >
> > Thanks in advance for any comments,
> >
> > Best wishes,
> >
> > Marek
> >
> >
> >
> >
> > --
> > Created with Opera's mail client: http://www.opera.com/mail/
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jul 15 2019 - 12:00:03 PDT