Re: [AMBER] RTX 2080 Super GPU Random Memory Errors

From: Dow Hurst <dphurst.uncg.edu>
Date: Mon, 20 Apr 2020 13:30:53 -0400

Have you validated your GPU to make sure it reproduces the benchmark
tests? I've found that some cards fail even with a tiny system like DHFR, or
later in life develop an error that affects larger systems. I've been
trying to re-validate the GPUs we have once a year.
Eventually cards can develop a memory error that would never affect a video
game, but will ruin MD runs. There is more than one thread on the AMBER list
about using the benchmark scripts to validate a GPU. Here is one you could
look at:

Re: [AMBER] memtestG80 alternative ? - testing actual GPUs regarding soft
errors

I'd start with the smallest benchmark and work up from there to test all the
VRAM. You can mix and match benchmarks to fill your card's RAM with
simultaneous runs. Just extend the run times and repeat them at least 20
times to really test the cards. The final EPtot values from the output
should ALWAYS match exactly. If the numbers don't match even once out of 20
to 40 repeated long runs, then you have a problem with your card.
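The "final EPtot must match exactly" check is easy to script once the repeats have finished. The following is a minimal sketch, not an Amber tool: it assumes the usual pmemd mdout layout, where each per-step energy record contains a field like `EPtot      =   -72050.2694` and the end-of-run summary starts with a header containing "A V E R A G E S" (the averages and RMS-fluctuation blocks after it also print EPtot, so they are skipped). The script name `check_eptot.py` and the mdout file names are hypothetical. Values are compared as printed strings, because "match exactly" means identical output on a healthy card.

```python
import re
import sys
from pathlib import Path
from typing import Dict, Optional

# Per-step energy records print "EPtot      =   <value>". Capturing the whole
# token also catches non-numeric output such as NaN or asterisk overflow.
EPTOT_RE = re.compile(r"EPtot\s*=\s*(\S+)")

# Assumed header of the end-of-run summary; everything after it is skipped.
SUMMARY_HEADER = "A V E R A G E S"


def final_eptot(mdout_text: str) -> Optional[str]:
    """Return the EPtot string from the last per-step record, or None."""
    per_step_part = mdout_text.split(SUMMARY_HEADER, 1)[0]
    values = EPTOT_RE.findall(per_step_part)
    return values[-1] if values else None


def check_repeats(mdout_texts: Dict[str, str]) -> bool:
    """Print each run's final EPtot; return True only if all are identical."""
    finals = {name: final_eptot(text) for name, text in mdout_texts.items()}
    for name, value in finals.items():
        print(f"{name}: final EPtot = {value}")
    if not finals or None in finals.values():
        print("FAIL: at least one output has no per-step EPtot record")
        return False
    distinct = set(finals.values())
    if len(distinct) == 1:
        print(f"OK: all {len(finals)} runs agree exactly")
        return True
    print(f"FAIL: {len(distinct)} different final EPtot values")
    return False


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python check_eptot.py run01.mdout run02.mdout ...
    texts = {p: Path(p).read_text() for p in sys.argv[1:]}
    sys.exit(0 if check_repeats(texts) else 1)
```

A nonzero exit status means at least one repeat diverged or never produced an energy record. The comparison is only meaningful when every repeat uses identical inputs, including a fixed random seed.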
Sincerely,
Dow
⚛Dow Hurst, Research Scientist
       340 Sullivan Science Bldg.
       Dept. of Chem. and Biochem.
       University of North Carolina at Greensboro
       PO Box 26170 Greensboro, NC 27402-6170



On Wed, Apr 1, 2020 at 5:28 AM Giorgos Lambrinidis <lambrinidis.pharm.uoa.gr>
wrote:

> Hello everyone,
>
> Some updates,
>
> I tried my simulation with a specific seed number, and the simulation
> crashes at the same step every time. So there must be a problem with my
> system, although it looks normal when I check it in VMD. Anyway, I am glad
> that there is no problem with my card.
> However, I am surprised that my old GPU (GTX 1060) does not produce this
> error whatever seed number I use.
>
> Best Regards
> George
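George's observation and David's remark further down the thread fit together. With a fixed seed, the random numbers a stochastic thermostat draws are identical on every run, so a crash at the same step points to the system rather than to a flaky card. With ig=-1, each launch gets a fresh clock-based seed, so an unstable system can fail at a different step every time even on a healthy card. The toy sketch below illustrates only this seeding principle with Python's `random` module; `thermostat_noise` is a hypothetical stand-in and is not Amber's random number generator.

```python
import random
import time


def thermostat_noise(seed, nsteps=5):
    """Hypothetical stand-in for the per-step random numbers a stochastic
    thermostat draws. This is NOT Amber's PRNG; it only shows seeding."""
    rng = random.Random(seed)  # independent generator, explicitly seeded
    return [rng.gauss(0.0, 1.0) for _ in range(nsteps)]


# Fixed seed (the analogue of setting ig to a fixed positive value):
# two runs see exactly the same random stream.
run_a = thermostat_noise(seed=12345)
run_b = thermostat_noise(seed=12345)
print("fixed seed, identical streams:", run_a == run_b)

# Clock-based seed (the analogue of ig=-1): every launch picks a new seed,
# so the trajectory, and any blow-up step, changes from launch to launch.
run_c = thermostat_noise(seed=time.time_ns())
print("clock-seeded stream:", run_c)
```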
>
> > Dave and Stephan
> >
> > Thank you for your suggestions.
> >
> > I will try using the same seed number to check the randomness.
> > Dave, I will send you the files you asked for.
> >
> > I forgot to tell you that all equilibration steps were run on the CPU
> > without errors or warnings, and only hold steps 1-10 and production were
> > run on the GPU.
> >
> > Since I am following the tutorial, I use skinnb=5 in all hold steps.
> >
> > George
> >
> >
> > On 22/3/20 1:44 a.m., Stephan Schott wrote:
> >> Hi George,
> >> Indeed those errors are not very informative, but they are also fairly
> >> common with membrane systems. Maybe David can find something wrong
> >> there, but one tip that usually helps is to run the minimization with
> >> the CPU code rather than the GPU code. Sometimes, and rather randomly,
> >> an atom could get "lost". Also, increasing the number of atoms included
> >> in the nonbonded pairlist with skinnb in the first equilibration steps
> >> helps in some cases. For that you can add something like this at the
> >> end of your input file (the default is 2 Å):
> >> &ewald
> >> skinnb = 5
> >> &end
> >>
> >>
> >> On Sat, 21 Mar 2020 at 23:58, David Cerutti (<dscerutti.gmail.com>)
> >> wrote:
> >>
> >>> If your random seed is set to -1, that is one possible source of the
> >>> randomness (the PRNG is then seeded from the wall-clock time). But I
> >>> suspect that there is something else amiss with your system. Perhaps a
> >>> strained bond or clash that is giving SHAKE problems. Can you reply
> >>> (just to me) with your topology and inpcrd?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Sat, Mar 21, 2020 at 3:43 PM Giorgos Lambrinidis
> >>> <lambrinidis.pharm.uoa.gr> wrote:
> >>>
> >>>> Dear Amber Users
> >>>>
> >>>> I am facing a strange problem with an MSI RTX 2080 Super GPU.
> >>>>
> >>>> I am working on a transmembrane GPCR protein system with 69385 atoms,
> >>>> including the lipids and water molecules. I created the system using
> >>>> Amber Tutorial 16 for the Lipid14 force field.
> >>>>
> >>>> I ran the equilibration protocol and the production simulation on a
> >>>> computer with the following characteristics:
> >>>>
> >>>> AMD Ryzen 7 2700 Eight-Core Processor, 24GB RAM, GeForce GTX 1060 with
> >>>> 6GB, NVIDIA driver 418.43, GNU compilers and CUDA 10.1. I am using
> >>>> Amber 18 with AmberTools19.
> >>>>
> >>>> A few days ago, I bought a new GPU, an MSI RTX 2080 Super 8GB, and
> >>>> installed it on the following system:
> >>>>
> >>>> Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16GB RAM, NVIDIA driver
> >>>> 435.21, GNU compilers and CUDA 10.1. I am using Amber 18 with
> >>>> AmberTools19.
> >>>>
> >>>> When I run the same 69385-atom GPCR system on the new GPU, I randomly
> >>>> get the following error:
> >>>>
> >>>> “cudaMemcpy GpuBuffer::Download failed an illegal memory access was
> >>>> encountered”
> >>>>
> >>>> The job terminates, but the subsequent steps (hold or production,
> >>>> following Amber Tutorial 16) run normally until the next error, and
> >>>> so on.
> >>>>
> >>>> I know that this kind of error is very general. I am open to
> >>>> suggestions on how to determine whether the error is caused by the
> >>>> hardware or by the compilation process.
> >>>>
> >>>> I tried a bigger system produced by CHARMM-GUI for Amber, and the
> >>>> equilibration and production ran without any error.
> >>>>
> >>>> As I said, the error occurs randomly. If I repeat the same job
> >>>> with the same parameters, I get the error at a different step (in a
> >>>> hold or production step).
> >>>>
> >>>> I can share input files if necessary.
> >>>>
> >>>> Thank you in advance
> >>>>
> >>>> Dr. George Lamprinidis
> >>>>
> >>>> --
> >>>> ---------------------------------------------
> >>>> Dr George Lambrinidis
> >>>> Researcher & Laboratory Assistant Staff
> >>>> School of Health Sciences
> >>>> Faculty of Pharmacy
> >>>> National & Kapodistrian University of Athens
> >>>> Greece
> >>>> tel: +30 2107274304
> >>>> +30 2107274521
> >>>> fax: +30 2107274747
> >>>> e-mail: lambrinidis.pharm.uoa.gr
> >>>> geolampr.gmail.com
> >>>> ---------------------------------------------
> >>>>
> >>>> _______________________________________________
> >>>> AMBER mailing list
> >>>> AMBER.ambermd.org
> >>>> http://lists.ambermd.org/mailman/listinfo/amber
> >>>>
> >>>
> >>
>
>
Received on Mon Apr 20 2020 - 11:00:02 PDT