Re: [AMBER] cuda test failing after installation

From: Ravi Abrol <raviabrol.gmail.com>
Date: Mon, 13 May 2019 11:52:43 -0700

Hi,
Can someone help with the problem in the thread below?
Thanks,
Ravi

---
On Tue, May 7, 2019 at 12:43 PM Ravi Abrol <raviabrol.gmail.com> wrote:
> Hi David,
>
> During my testing, the specific file you asked for was lost due to a
> subsequent make clean, however, I am attaching
> test/cuda/myoglobin/myoglobin_md.out.dif file for your reference, which
> shows the same type of error for cuda_parallel mode testing.
>
> Thanks,
> Ravi
>
>
>
> On Tue, May 7, 2019 at 11:03 AM David Cerutti <dscerutti.gmail.com> wrote:
>
>> I think to diagnose this I would need to see the actual outputs of the
>> test
>> cases on those GTX-1080s.  I don't have such a card (I do have a 1080Ti),
>> but if you go into ${AMBERHOME}/test/cuda/amd/dhfr_pme/, for example, and
>> show us the mdout.pme.amd2.dif file that might be helpful.
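>>
>> For example, something like this (assuming AMBERHOME points at your
>> Amber18 installation) would print that summary after a test run:
>>
>>   cd ${AMBERHOME}/test/cuda/amd/dhfr_pme/
>>   cat mdout.pme.amd2.dif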
>>
>> Dave ( Cerutti )
>>
>>
>> On Tue, May 7, 2019 at 1:27 PM Ravi Abrol <raviabrol.gmail.com> wrote:
>>
>> > Dear Dave,
>> > Sorry it took a while to test this. Thanks for your suggestion to
>> > upgrade to Amber18, which resolved these errors on 2 out of 3
>> > workstations.
>> >
>> > All three workstations have the same OS (POP), gcc, mpich, CUDA-9.2,
>> > etc.
>> >
>> > The workstations where this issue is resolved have either a GTX 970 or
>> > two RTX 2080s. The workstation on which the issue persists has two
>> > GTX 1080s.
>> >
>> > On this third workstation, other tests work fine (0 tests with errors),
>> > but test_amber_cuda_parallel tests all fail with messages like:
>> > ******
>> > cd trpcage/ && ./Run_md_trpcage  DPFP /usr/local/amber18/include/netcdf.mod
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > diffing trpcage_md.out.GPU_DPFP with trpcage_md.out
>> > possible FAILURE:  check trpcage_md.out.dif
>> > *******
>> > Here are the example cases with the biggest maximum absolute/relative
>> > errors:
>> >
>> > possible FAILURE:  check nucleosome_md1_ntt1.out.dif
>> > ### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
>> > possible FAILURE:  check nucleosome_md2_ntt0.out.dif
>> > ### Maximum absolute error in matching lines = 1.32e+05 at line 248 field 4
>> > possible FAILURE:  check mdout.gb.gamd2.dif
>> > ### Maximum absolute error in matching lines = 3.61e+06 at line 293 field 3
>> > ### Maximum relative error in matching lines = 8.75e+06 at line 309 field 3
>> > possible FAILURE:  check FactorIX_NVE.out.dif
>> > ### Maximum absolute error in matching lines = 1.10e+06 at line 195 field 3
>> > possible FAILURE:  check mdout.dhfr.noshake.dif
>> > ### Maximum absolute error in matching lines = 1.30e+05 at line 123 field 3
>> > possible FAILURE:  check mdout.dhfr_charmm_pbc_noshake_md.dif
>> > ### Maximum absolute error in matching lines = 4.94e+05 at line 169 field 3
>> > possible FAILURE:  check mdout.dhfr_charmm_pbc_noshake_md.dif
>> > ### Maximum absolute error in matching lines = 3.34e+05 at line 148 field 3
>> > possible FAILURE:  check mdout.ips.dif
>> > ### Maximum absolute error in matching lines = 1.08e+05 at line 223 field 3
>> > ### Maximum relative error in matching lines = 5.93e+04 at line 255 field 3
>> > possible FAILURE:  check mdout.pme.amd2.dif
>> > ### Maximum absolute error in matching lines = 1.64e+06 at line 225 field 3
>> > possible FAILURE:  check mdout.dif
>> > ### Maximum absolute error in matching lines = 8.00e+07 at line 257 field 4
>> > possible FAILURE:  check mdout.dif
>> > ### Maximum absolute error in matching lines = 8.00e+07 at line 260 field 4
>> > possible FAILURE:  check mdout.dif
>> > ### Maximum absolute error in matching lines = 8.00e+07 at line 258 field 4
>> > possible FAILURE:  check mdout.dif
>> > ### Maximum absolute error in matching lines = 8.81e+08 at line 233 field 3
>> > ### Maximum relative error in matching lines = 1.42e+04 at line 233 field 3
>> > possible FAILURE:  check mdout.dif
>> > ### Maximum absolute error in matching lines = 3.45e+07 at line 209 field 3
>> > possible FAILURE:  check mdout.cellulose_nvt.dif
>> > ### Maximum absolute error in matching lines = 4.59e+06 at line 193 field 3
>> > ### Maximum relative error in matching lines = 1.70e+05 at line 207 field 3
>> > possible FAILURE:  check mdout.cellulose_npt.dif
>> > ### Maximum absolute error in matching lines = 4.59e+06 at line 234 field 3
>> > ### Maximum relative error in matching lines = 1.12e+05 at line 252 field 3
>> >
>> > How do I diagnose this problem?
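>> >
>> > (For reference, the summary lines above come from the individual .dif
>> > files; a sketch for regenerating the whole list in one pass, assuming the
>> > tests were run under ${AMBERHOME}/test, would be:
>> >
>> >   cd ${AMBERHOME}/test
>> >   grep -r --include='*.dif' '### Maximum' cuda/
>> > )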
>> >
>> > Thanks,
>> > Ravi
>> >
>> >
>> > On Sun, Mar 24, 2019 at 10:35 PM Ravi Abrol <raviabrol.gmail.com> wrote:
>> >
>> > > Thanks Dave for your reply.
>> > >
>> > > We have GTX 1080 with 6GB memory.
>> > >
>> > > The default mode for GPU testing was originally DPFP, which flagged
>> > > even more tests with large errors.
>> > > The runs I mentioned in my email below were done with SPFP. Hope that
>> > > this helps.
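>> > >
>> > > If it is useful, an individual case can be re-run in either precision
>> > > model straight from its test directory, mirroring the command the test
>> > > harness prints (sketch; assumes AMBERHOME is set):
>> > >
>> > >   cd ${AMBERHOME}/test/cuda/trpcage/
>> > >   ./Run_md_trpcage SPFP ${AMBERHOME}/include/netcdf.mod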
>> > >
>> > > Ravi
>> > >
>> > > ---
>> > > On Sun, Mar 24, 2019 at 5:35 AM David Case <david.case.rutgers.edu> wrote:
>> > >
>> > >> On Wed, Mar 20, 2019, Ravi Abrol wrote:
>> > >> >
>> > >> >I installed amber16 on a new linux machine (running pop_os) and
>> > >> >during the cuda testing (for both pmemd.cuda and pmemd.cuda.MPI),
>> > >> >one of the tests failed:
>> > >> >
>> > >> >$AMBERHOME/test/cuda/large_solute_count/mdout.ntb2_ntt1.dif
>> > >> >shows:
>> > >> >### Maximum absolute error in matching lines = 7.44e+08 at line 112 field 3
>> > >> >### Maximum relative error in matching lines = 1.38e+07 at line 112 field 3
>> > >> >
>> > >> >How do I diagnose this error?
>> > >>
>> > >> Sorry for the slow reply.  What model of GPU are you using?  How much
>> > >> memory does it have?  It's possible that you are overflowing memory
>> > >> in a way that is not caught.
>> > >>
>> > >> Also, which tests are you running?  SPFP or DPFP?
>> > >>
>> > >> Problems like this can indeed be hard to track down.  I'm hoping that
>> > >> this post will trigger memories of other users/developers, in case
>> > >> they might have seen similar test failures.
>> > >>
>> > >> ....dac
>> > >>
>> > >>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 13 2019 - 12:00:02 PDT