Re: [AMBER] cuda test failing after installation from Ravi Abrol on 2019-05-07 (Amber Archive May 2019)

From: Ravi Abrol <raviabrol.gmail.com>
Date: Tue, 7 May 2019 10:26:54 -0700

Dear Dave,
Sorry it took a while to test this. Thanks for your suggestion to upgrade to
Amber18, which resolved these errors on 2 out of 3 workstations.

All three workstations have the same OS (POP), gcc, mpich, CUDA-9.2, etc.

The workstations where this issue is resolved have either a GTX970 or two
RTX2080s.
The workstation on which the issue persists has two GTX1080s.

On this third workstation, other tests work fine (0 tests with errors), but
test_amber_cuda_parallel tests all fail with messages like:
******
cd trpcage/ && ./Run_md_trpcage DPFP /usr/local/amber18/include/netcdf.mod
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
diffing trpcage_md.out.GPU_DPFP with trpcage_md.out
possible FAILURE: check trpcage_md.out.dif
*******
Here are the example cases with the biggest maximum absolute/relative
errors:

possible FAILURE: check nucleosome_md1_ntt1.out.dif
### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
possible FAILURE: check nucleosome_md2_ntt0.out.dif
### Maximum absolute error in matching lines = 1.32e+05 at line 248 field 4
possible FAILURE: check mdout.gb.gamd2.dif
### Maximum absolute error in matching lines = 3.61e+06 at line 293 field 3
### Maximum relative error in matching lines = 8.75e+06 at line 309 field 3
possible FAILURE: check FactorIX_NVE.out.dif
### Maximum absolute error in matching lines = 1.10e+06 at line 195 field 3
possible FAILURE: check mdout.dhfr.noshake.dif
### Maximum absolute error in matching lines = 1.30e+05 at line 123 field 3
possible FAILURE: check mdout.dhfr_charmm_pbc_noshake_md.dif
### Maximum absolute error in matching lines = 4.94e+05 at line 169 field 3
possible FAILURE: check mdout.dhfr_charmm_pbc_noshake_md.dif
### Maximum absolute error in matching lines = 3.34e+05 at line 148 field 3
possible FAILURE: check mdout.ips.dif
### Maximum absolute error in matching lines = 1.08e+05 at line 223 field 3
### Maximum relative error in matching lines = 5.93e+04 at line 255 field 3
possible FAILURE: check mdout.pme.amd2.dif
### Maximum absolute error in matching lines = 1.64e+06 at line 225 field 3
possible FAILURE: check mdout.dif
### Maximum absolute error in matching lines = 8.00e+07 at line 257 field 4
possible FAILURE: check mdout.dif
### Maximum absolute error in matching lines = 8.00e+07 at line 260 field 4
possible FAILURE: check mdout.dif
### Maximum absolute error in matching lines = 8.00e+07 at line 258 field 4
possible FAILURE: check mdout.dif
### Maximum absolute error in matching lines = 8.81e+08 at line 233 field 3
### Maximum relative error in matching lines = 1.42e+04 at line 233 field 3
possible FAILURE: check mdout.dif
### Maximum absolute error in matching lines = 3.45e+07 at line 209 field 3
possible FAILURE: check mdout.cellulose_nvt.dif
### Maximum absolute error in matching lines = 4.59e+06 at line 193 field 3
### Maximum relative error in matching lines = 1.70e+05 at line 207 field 3
possible FAILURE: check mdout.cellulose_npt.dif
### Maximum absolute error in matching lines = 4.59e+06 at line 234 field 3
### Maximum relative error in matching lines = 1.12e+05 at line 252 field 3
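
The summary lines above follow a regular pattern, so they can be tabulated
and sorted with a short script instead of by eye. Below is a minimal Python
sketch, assuming the test log contains only the two line shapes shown above
("possible FAILURE: check ..." followed by one or two "### Maximum ..."
lines). The function names parse_failures and rank_by_absolute are made up
for illustration and are not part of Amber.

```python
import re

# Matches summary lines such as:
#   ### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
ERR_RE = re.compile(
    r"### Maximum (absolute|relative) error in matching lines = "
    r"([0-9.eE+-]+) at line (\d+) field (\d+)"
)
# Matches lines such as:
#   possible FAILURE: check nucleosome_md1_ntt1.out.dif
FAIL_RE = re.compile(r"possible FAILURE:\s+check (\S+)")


def parse_failures(log_text):
    """Return one record per 'possible FAILURE' entry with its error summaries."""
    records = []
    current = None
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            # Start a new record; the error lines that follow belong to it.
            current = {"file": m.group(1), "absolute": None, "relative": None}
            records.append(current)
            continue
        m = ERR_RE.search(line)
        if m and current is not None:
            current[m.group(1)] = float(m.group(2))
    return records


def rank_by_absolute(records):
    """Sort records so the largest maximum absolute error comes first."""
    return sorted(records, key=lambda r: r["absolute"] or 0.0, reverse=True)


# Sample input copied from the log excerpt above.
sample = """\
possible FAILURE: check nucleosome_md1_ntt1.out.dif
### Maximum absolute error in matching lines = 1.35e+05 at line 251 field 4
possible FAILURE: check mdout.gb.gamd2.dif
### Maximum absolute error in matching lines = 3.61e+06 at line 293 field 3
### Maximum relative error in matching lines = 8.75e+06 at line 309 field 3
"""

for rec in rank_by_absolute(parse_failures(sample)):
    print(rec["file"], rec["absolute"], rec["relative"])
```

Running the same script over the saved logs from a working and a failing
workstation would show whether the same tests and fields are affected on
each machine.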

How do I diagnose this problem?

Thanks,
Ravi

On Sun, Mar 24, 2019 at 10:35 PM Ravi Abrol <raviabrol.gmail.com> wrote:

> Thanks Dave for your reply.
>
> We have GTX 1080 with 6GB memory.
>
> The default mode for GPU testing was originally DPFP, which flagged even
> more tests with large errors.
> The runs I mentioned in my email below were done with SPFP. Hope that this
> helps.
>
> Ravi
>
> ---
> On Sun, Mar 24, 2019 at 5:35 AM David Case <david.case.rutgers.edu> wrote:
>
>> On Wed, Mar 20, 2019, Ravi Abrol wrote:
>> >
>> >I installed amber16 on a new linux machine (running pop_os) and during the
>> >cuda testing (for both pmemd.cuda and pmemd.cuda.MPI), one of the tests
>> >failed:
>> >
>> >$AMBERHOME/test/cuda/large_solute_count/mdout.ntb2_ntt1.dif
>> >shows:
>> >### Maximum absolute error in matching lines = 7.44e+08 at line 112 field 3
>> >### Maximum relative error in matching lines = 1.38e+07 at line 112 field 3
>> >
>> >How do I diagnose this error?
>>
>> Sorry for the slow reply. What model of GPU are you using? How much
>> memory does it have? It's possible that you are overflowing memory in a
>> way that is not caught.
>>
>> Also, which tests are you running? SPFP or DPFP?
>>
>> Problems like this can indeed be hard to track down. I'm hoping that
>> this post will trigger memories of other users/developers, in case they
>> might have seen similar test failures.
>>
>> ....dac
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
Received on Tue May 07 2019 - 10:30:02 PDT