Re: [AMBER] cuda test failing after installation

From: David Cerutti <dscerutti.gmail.com>
Date: Tue, 7 May 2019 14:02:48 -0400

I think to diagnose this I would need to see the actual outputs of the test
cases on those GTX-1080s. I don't have such a card (I do have a 1080Ti),
but if you go into ${AMBERHOME}/test/cuda/amd/dhfr_pme/, for example, and
show us the mdout.pme.amd2.dif file, that would be helpful.
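Something like this would do it, assuming AMBERHOME is set and the cuda tests
have already been run (the default install path and the fallback message are
illustrative, not part of the test harness):

```shell
# Sketch: print the .dif file from the failing AMD dhfr_pme test so it
# can be pasted into a reply. /usr/local/amber18 is an assumed default.
DIF="${AMBERHOME:-/usr/local/amber18}/test/cuda/amd/dhfr_pme/mdout.pme.amd2.dif"
if [ -f "$DIF" ]; then
    cat "$DIF"
else
    echo "No dif file at $DIF (test may have passed or not been run)"
fi
```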

Dave (Cerutti)


On Tue, May 7, 2019 at 1:27 PM Ravi Abrol <raviabrol.gmail.com> wrote:

> Dear Dave,
> Sorry it took a while to test this. Thanks for your suggestion to upgrade to
> Amber18, which resolved these errors on 2 out of 3 workstations.
>
> All three workstations have the same OS (Pop!_OS), gcc, mpich, CUDA-9.2, etc.
>
> Workstations where this issue is resolved have either a GTX970 or two
> RTX2080s.
> The workstation on which the issue persists has two GTX1080s.
>
> On this third workstation, other tests work fine (0 tests with errors), but
> test_amber_cuda_parallel tests all fail with messages like:
> ******
> cd trpcage/ && ./Run_md_trpcage DPFP /usr/local/amber18/include/netcdf.mod
> Note: The following floating-point exceptions are signalling:
> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> diffing trpcage_md.out.GPU_DPFP with trpcage_md.out
> possible FAILURE: check trpcage_md.out.dif
> *******
> Here are the example cases with the biggest maximum absolute/relative
> errors:
>
> possible FAILURE: check nucleosome_md1_ntt1.out.dif
> ### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
> possible FAILURE: check nucleosome_md2_ntt0.out.dif
> ### Maximum absolute error in matching lines = 1.32e+05 at line 248 field 4
> possible FAILURE: check mdout.gb.gamd2.dif
> ### Maximum absolute error in matching lines = 3.61e+06 at line 293 field 3
> ### Maximum relative error in matching lines = 8.75e+06 at line 309 field 3
> possible FAILURE: check FactorIX_NVE.out.dif
> ### Maximum absolute error in matching lines = 1.10e+06 at line 195 field 3
> possible FAILURE: check mdout.dhfr.noshake.dif
> ### Maximum absolute error in matching lines = 1.30e+05 at line 123 field 3
> possible FAILURE: check mdout.dhfr_charmm_pbc_noshake_md.dif
> ### Maximum absolute error in matching lines = 4.94e+05 at line 169 field 3
> possible FAILURE: check mdout.dhfr_charmm_pbc_noshake_md.dif
> ### Maximum absolute error in matching lines = 3.34e+05 at line 148 field 3
> possible FAILURE: check mdout.ips.dif
> ### Maximum absolute error in matching lines = 1.08e+05 at line 223 field 3
> ### Maximum relative error in matching lines = 5.93e+04 at line 255 field 3
> possible FAILURE: check mdout.pme.amd2.dif
> ### Maximum absolute error in matching lines = 1.64e+06 at line 225 field 3
> possible FAILURE: check mdout.dif
> ### Maximum absolute error in matching lines = 8.00e+07 at line 257 field 4
> possible FAILURE: check mdout.dif
> ### Maximum absolute error in matching lines = 8.00e+07 at line 260 field 4
> possible FAILURE: check mdout.dif
> ### Maximum absolute error in matching lines = 8.00e+07 at line 258 field 4
> possible FAILURE: check mdout.dif
> ### Maximum absolute error in matching lines = 8.81e+08 at line 233 field 3
> ### Maximum relative error in matching lines = 1.42e+04 at line 233 field 3
> possible FAILURE: check mdout.dif
> ### Maximum absolute error in matching lines = 3.45e+07 at line 209 field 3
> possible FAILURE: check mdout.cellulose_nvt.dif
> ### Maximum absolute error in matching lines = 4.59e+06 at line 193 field 3
> ### Maximum relative error in matching lines = 1.70e+05 at line 207 field 3
> possible FAILURE: check mdout.cellulose_npt.dif
> ### Maximum absolute error in matching lines = 4.59e+06 at line 234 field 3
> ### Maximum relative error in matching lines = 1.12e+05 at line 252 field 3
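A short shell sketch can pull such a ranking out of the full test log
automatically (the embedded excerpt is a stand-in for the real log; GNU
sort's -g flag reads the scientific notation in field 9):

```shell
# Sketch: rank "Maximum absolute error" lines by magnitude so the worst
# failures surface first. The here-doc is a stand-in for the real cuda
# test log; field 9 holds the error value, which sort -g parses (1.3e+05).
cat > /tmp/cuda_test_summary.log <<'EOF'
### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
### Maximum absolute error in matching lines = 3.61e+06 at line 293 field 3
### Maximum absolute error in matching lines = 8.81e+08 at line 233 field 3
EOF
grep 'Maximum absolute error' /tmp/cuda_test_summary.log | sort -g -r -k9
```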
>
> How do I diagnose this problem?
>
> Thanks,
> Ravi
>
>
> On Sun, Mar 24, 2019 at 10:35 PM Ravi Abrol <raviabrol.gmail.com> wrote:
>
> > Thanks Dave for your reply.
> >
> > We have GTX 1080 with 6GB memory.
> >
> > The default mode for GPU testing was originally DPFP, which flagged even
> > more tests with large errors.
> > The runs I mentioned in my email below were done with SPFP. Hope that this
> > helps.
> >
> > Ravi
> >
> > ---
> > On Sun, Mar 24, 2019 at 5:35 AM David Case <david.case.rutgers.edu> wrote:
> >
> >> On Wed, Mar 20, 2019, Ravi Abrol wrote:
> >> >
> >> >I installed amber16 on a new linux machine (running pop_os) and during the
> >> >cuda testing (for both pmemd.cuda and pmemd.cuda.MPI), one of the tests
> >> >failed:
> >> >
> >> >$AMBERHOME/test/cuda/large_solute_count/mdout.ntb2_ntt1.dif
> >> >shows:
> >> >### Maximum absolute error in matching lines = 7.44e+08 at line 112 field 3
> >> >### Maximum relative error in matching lines = 1.38e+07 at line 112 field 3
> >> >
> >> >How do I diagnose this error?
> >>
> >> Sorry for the slow reply. What model of GPU are you using? How much
> >> memory does it have? It's possible that you are overflowing memory in a
> >> way that is not caught.
> >>
> >> Also, which tests are you running? SPFP or DPFP?
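One quick way to gather the GPU model and memory in a single command, assuming
the NVIDIA driver (which ships nvidia-smi) is installed; the guard and message
are illustrative:

```shell
# Sketch: report GPU model and total memory for each installed card.
# nvidia-smi comes with the NVIDIA driver; fall back gracefully without it.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```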
> >>
> >> Problems like this can indeed be hard to track down. I'm hoping that
> >> this post will trigger memories of other users/developers, in case they
> >> might have seen similar test failures.
> >>
> >> ....dac
> >>
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >
Received on Tue May 07 2019 - 11:30:02 PDT