Re: [AMBER] amber 14 tests errors

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 28 Aug 2014 15:15:47 -0400

On Thu, 2014-08-28 at 20:23 +0200, Jordi Bujons wrote:
>
> $ cat test_at_serial/2014-08-27_11-59-41.diffs
>
> possible FAILURE: check prmtop.new_type.dif
> /Programs/amber14/AmberTools/test/parmed/chamber_prmtop
> 5025,5028c5025,5028
> < 7.0320002812773935E+4 3.0991959086075940E+3 1.5730411891415534E+6
> < 1.2254480396417496E+6 1.5350319889737577E+6 1.8962122985815424E+6
> < 1.6150318076691339E+6 5.5913203238501586E+5 5.5913203238501586E+5
> < 7.5015433968630561E+5 1.2254480396417496E+6 1.5350319889737577E+6
> ---
> > 7.0320002812773819E+4 3.0991959086075940E+3 1.5730411891415534E+6
> > 1.2254480396417496E+6 1.5350319889737575E+6 1.8962122985815424E+6
> > 1.6150318076691339E+6 5.5913203238501656E+5 5.5913203238501656E+5
> > 7.5015433968630666E+5 1.2254480396417496E+6 1.5350319889737575E+6

These are tiny diffs and can be safely ignored. There is no problem
here.

> ---------------------------------------
> possible FAILURE: check mdout.pbsa_decres.dif
> /Programs/amber14/test/sander_pbsa_decres
> 1320c1320
> < Total surface charge -5.9357
> > Total surface charge -5.9270
> 1321c1321
> < Reaction field energy -2007.5763
> > Reaction field energy -1609.3322
> 1322c1322
> < Cavity solvation energy 38.8528
> > Cavity solvation energy 39.1489
> 1324c1324
> < 1 -3.0345E+3 1.6433E+1 8.6691E+1 C 1030
> > 1 -2.6359E+3 1.6488E+1 8.6861E+1 C 1030
> 1326c1326
> < VDWAALS = -546.2276 EEL = -4637.2832 EPB = -2007.5763
> > VDWAALS = -546.2276 EEL = -4637.2832 EPB = -1609.3322

These are worrying, and I would urge caution if you want to use this
functionality (PB energy decomposition). It is not very commonly used,
though. I believe these failures have already been reported, but I don't
know what the status is on fixing them.

>
> I guess that the first 4 possible failures could be just rounding errors,
> although sometimes they affect the last three decimal places (e.g.
> 7.0320002812773935E+4 vs 7.0320002812773819E+4). But the sander_pbsa_decres
> test shows larger errors. Are these a signal that something is really not
> right?

Yes, for the sander_pbsa_decres differences at least; the small rounding
differences in the other tests are fine.

> ==============================================================
> export TESTsander='../../bin/pmemd.MPI' && cd multid_remd && ./Run.multirem
>
> Running multipmemd version of pmemd Amber12
> Total processors = 8
> Number of groups = 4
>
> [mordor9:08289] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1
> mpirun noticed that job rank 1 with PID 8290 on node mordor9 exited on signal 15 (Terminated).

Not sure what's happening here. Is this with OpenMPI 1.2.8? That's a
very old version... What happens if you run it with a newer version (or
with mpich)? Is there an error message printed to any of the output
files?
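
If it helps, this is roughly how I would check which MPI the tests are
actually picking up and rerun just that case (a sketch only; the
DO_PARALLEL setting and the test command are taken from your log, and
AMBERHOME is assumed to point at /Programs/amber14):

  which mpirun && mpirun --version             # confirm which MPI is first in PATH
  ldd $AMBERHOME/bin/pmemd.MPI | grep -i mpi   # which MPI libs the binary links against (if dynamic)
  cd $AMBERHOME/test
  export DO_PARALLEL='mpirun -np 8'
  export TESTsander='../../bin/pmemd.MPI' && cd multid_remd && ./Run.multirem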

> ==============================================================
> export TESTsander=/Programs/amber14/bin/sander.MPI; cd cnstph_remd/TempRem && ./Run.cnstph_remd
>
> Running multisander version of sander Amber14
> Total processors = 2
> Number of groups = 2
>
> [mordor9:02807] *** Process received signal ***
> [mordor9:02807] Signal: Segmentation fault (11)
> [mordor9:02807] Signal code: Address not mapped (1)
> [mordor9:02807] Failing at address: 0x100
[snip]

Is this the same MPI again?

> ...and when using "DO_PARALLEL='mpirun -np 8'", 3 file comparisons failed
> and 4 tests experienced errors:
>

> ==============================================================
> cd qmmm2/xcrd_build_test/ && ./Run.ortho_qmewald0
>
> * NB pairs 145 185645 exceeds capacity ( 185750) 3
> SIZE OF NONBOND LIST = 185750
> SANDER BOMB in subroutine nonbond_list
> Non bond list overflow!
> check MAXPR in locmem.f

These I've known about, but I've never tried to fix them. The solution
to this problem is to run with fewer processors.
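
For example, something along these lines (just the command from your log
with a smaller DO_PARALLEL setting, run from $AMBERHOME/test as before):

  cd $AMBERHOME/test
  export DO_PARALLEL='mpirun -np 4'    # fewer ranks than the 8 that overflowed the list
  cd qmmm2/xcrd_build_test/ && ./Run.ortho_qmewald0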

> possible FAILURE: check mdout.emil.0.5.dif
> /Programs/amber14/test/emil/emil_sander_tip3p
> 100a101,104
> > ***** Processor 0
> > ***** System must be very inhomogeneous.
> > ***** Readjusting recip sizes.
> > In this slab, Atoms found: 310 Allocated: 309

These are also known and are a result of the system being a bit too small
for 8 processors. They can be ignored.

> ---------------------------------------
> possible FAILURE: check mdout.emil.0.9.dif
> /Programs/amber14/test/emil/emil_sander_tip3p
> 118a119,122
> > ***** Processor 0
> > ***** System must be very inhomogeneous.
> > ***** Readjusting recip sizes.
> > In this slab, Atoms found: 310 Allocated: 309
>
> I looked on the Amber list but could not find what these parallel execution
> errors mean. Do they imply some problem during compilation, a bug, or
> something else?

Not sure. Could be the MPI. Could be the Amber code. Without being
able to reproduce it we can't help. I'll try running the tests on my
computer. In the meantime, please see if a modern version of OpenMPI or
mpich helps.
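
If you do switch MPI stacks, remember that the parallel binaries have to be
rebuilt against the new mpicc/mpif90 before the tests mean anything. Roughly
(a sketch assuming a GNU compiler build; the MPI install path below is a
placeholder, and the configure flags should match whatever you used
originally):

  export PATH=/path/to/new-mpi/bin:$PATH   # put the new mpicc/mpif90 first
  cd $AMBERHOME
  ./configure -mpi gnu
  make install
  export DO_PARALLEL='mpirun -np 8'
  make test                                # with a -mpi build this runs the parallel test suite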

All the best,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 28 2014 - 12:30:02 PDT