[AMBER] H-REMD, pmemd fails to calculate potential energy of neighbor coordinates

From: Jiri Wiesner <wiesner.chemi.muni.cz>
Date: Sun, 16 Dec 2012 05:26:41 +0100

Dear Amber developers and users:
I use the Hamiltonian replica exchange method to calculate the free
energy of the perturbation between the first and the last replica
(REFEP). My Amber source tree is updated with the latest set of patches
(29 for AmberTools, 13 for Amber). I compiled the sources with the Intel
compiler 12 and I use OpenMPI 1.6.3. I have 16 replicas and I have
tested both sander and pmemd. I ran sander on 16, 32 and 64 CPUs (all
in a single machine) and pmemd on 32 and 64 CPUs (sander can use 1 or
more CPUs per replica, whereas pmemd requires 2 or more CPUs per
replica). My system is solvated and periodic (PBC), so PME is switched
on.

The problem is that whereas sander on 16 CPUs gives this result:

wiesner.iakchos:/wdraid5/computation/refep/sander_refep/g2h-g1h_16$ tail -n34 rem.log | sed 's/ \+/ /g'
# exchange 49
  1 16 298.15 -8442.73 -8275.05 -191.52 0.00 F 0.29
  2 3 298.15 -8428.52 -8503.43 0.00 -11.86 T 0.41
  3 2 298.15 -8492.29 -8416.72 11.63 0.00 T 0.24
  4 5 298.15 -8411.43 -8390.32 0.00 -9.12 F 0.37
  5 4 298.15 -8380.86 -8400.50 9.25 0.00 F 0.16
  6 7 298.15 -8366.25 -8364.34 0.00 -7.18 F 0.24
  7 6 298.15 -8358.71 -8356.96 6.42 0.00 F 0.20
  8 9 298.15 -8323.02 -8442.06 0.00 -4.03 T 0.45
  9 8 298.15 -8437.69 -8318.50 4.17 0.00 T 0.20
  10 11 298.15 -8430.10 -8379.29 0.00 -2.03 F 0.29
  11 10 298.15 -8379.07 -8428.27 1.89 0.00 F 0.33
  12 13 298.15 -8393.50 -8385.94 0.00 0.68 F 0.16
  13 12 298.15 -8388.04 -8394.36 -0.55 0.00 F 0.33
  14 15 298.15 -8391.56 -8407.08 0.00 2.72 T 0.29
  15 14 298.15 -8410.63 -8394.55 -3.38 0.00 T 0.33
  16 1 298.15 -8389.95 -8210.65 0.00 -49.97 F 0.00
# exchange 50
  1 2 298.15 -8492.15 -8424.92 0.00 -12.88 T 0.32
  2 1 298.15 -8411.68 -8479.52 12.87 0.00 T 0.40
  3 4 298.15 -8418.02 -8434.57 0.00 -10.05 F 0.24
  4 3 298.15 -8424.69 -8405.95 10.27 0.00 F 0.36
  5 6 298.15 -8357.35 -8412.95 0.00 -8.26 F 0.16
  6 5 298.15 -8405.30 -8349.30 7.72 0.00 F 0.24
  7 8 298.15 -8398.77 -8449.54 0.00 -5.67 F 0.20
  8 7 298.15 -8445.60 -8393.17 5.33 0.00 F 0.44
  9 10 298.15 -8326.59 -8368.79 0.00 -3.57 F 0.20
  10 9 298.15 -8366.01 -8323.15 2.92 0.00 F 0.28
  11 12 298.15 -8383.93 -8398.67 0.00 -0.37 T 0.36
  12 11 298.15 -8397.44 -8381.92 0.23 0.00 T 0.16
  13 14 298.15 -8465.69 -8341.91 0.00 2.22 F 0.32
  14 13 298.15 -8345.62 -8465.62 -2.21 0.00 F 0.28
  15 16 298.15 -8437.93 -8458.31 0.00 4.18 F 0.32
  16 15 298.15 -8463.48 -8441.62 -4.80 0.00 F 0.00


a pmemd run of the same system on 64 CPUs finishes with:

wiesner.iakchos:/wdraid5/computation/refep/pmemd_refep/g2h-g1h_64$ tail -n34 rem.log | sed 's/ \+/ /g'
# exchange 49
  1 16 298.15 -8406.64 -7414.10 -Infinity 0.00 F 0.00
  2 3 298.15 -8408.94 -7447.47 0.00 -Infinity F 0.00
  3 2 298.15 -8436.04 -7461.64 -Infinity 0.00 F 0.00
  4 5 298.15 -8422.93 -7144.37 0.00 -Infinity F 0.00
  5 4 298.15 -8409.71 -7314.05 -Infinity 0.00 F 0.00
  6 7 298.15 -8457.33 -7011.52 0.00 -Infinity F 0.00
  7 6 298.15 -8399.70 -7254.47 -Infinity 0.00 F 0.00
  8 9 298.15 -8343.03 -7365.53 0.00 -Infinity F 0.00
  9 8 298.15 -8404.97 -7270.11 -Infinity 0.00 F 0.00
  10 11 298.15 -8370.98 -7278.65 0.00 -Infinity F 0.00
  11 10 298.15 -8372.33 -7404.89 -Infinity 0.00 F 0.00
  12 13 298.15 -8387.86 -7316.22 0.00 -Infinity F 0.00
  13 12 298.15 -8385.47 -7569.65 -Infinity 0.00 F 0.00
  14 15 298.15 -8395.86 -7361.17 0.00 -Infinity F 0.00
  15 14 298.15 -8304.76 -7335.65 -Infinity 0.00 F 0.00
  16 1 298.15 -8405.69 -7429.89 0.00 -Infinity F 0.00
# exchange 50
  1 2 298.15 -8436.79 -7500.77 0.00 -Infinity F 0.00
  2 1 298.15 -8379.91 -7664.55 -Infinity 0.00 F 0.00
  3 4 298.15 -8403.21 -7456.17 0.00 -Infinity F 0.00
  4 3 298.15 -8384.24 -7344.17 -Infinity 0.00 F 0.00
  5 6 298.15 -8416.41 -7540.09 0.00 -Infinity F 0.00
  6 5 298.15 -8406.07 -6893.33 -Infinity 0.00 F 0.00
  7 8 298.15 -8386.76 -7231.86 0.00 -Infinity F 0.00
  8 7 298.15 -8420.87 -7456.75 -Infinity 0.00 F 0.00
  9 10 298.15 -8350.88 -7426.89 0.00 -Infinity F 0.00
  10 9 298.15 -8399.02 -7489.96 -Infinity 0.00 F 0.00
  11 12 298.15 -8382.66 -7271.90 0.00 -Infinity F 0.00
  12 11 298.15 -8344.58 -7327.62 -Infinity 0.00 F 0.00
  13 14 298.15 -8325.08 -7445.06 0.00 -Infinity F 0.00
  14 13 298.15 -8432.51 -7406.95 -Infinity 0.00 F 0.00
  15 16 298.15 -8447.00 -7279.87 0.00 -Infinity F 0.00
  16 15 298.15 -8373.27 -7136.48 -Infinity 0.00 F 0.00


Please note that the potential energy of the neighbor's coordinates
(fifth column) in the pmemd run is substantially higher than in the
sander run, by roughly 1000 kcal/mol in the excerpts above. In the
pmemd run the free energy columns also read -Infinity and no exchange
is accepted. Some more rem.log files are attached (sander on 16 CPUs,
sander on 64 CPUs, pmemd on 32 CPUs, pmemd on 64 CPUs). Exactly the
same issue occurs with the GNU 4.3.2 compiler, and using a greater
number of CPUs per replica essentially makes the calculation fail.
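
For anyone who wants to compare the attached rem.log files quickly, here
is a minimal Python sketch. It is not part of Amber. It assumes that
each data row has nine whitespace-separated fields in the order shown in
the excerpts above (replica, neighbor, temperature, own potential
energy, potential energy of the neighbor's coordinates, two free energy
columns, T/F flag, acceptance ratio). The function and key names are my
own.

```python
import math

def parse_rem_log(text):
    """Parse rem.log text shaped like the excerpts above into row dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are assumed to have nine fields; skip "# ..." lines and blanks.
        if len(fields) != 9 or fields[0].startswith("#"):
            continue
        rows.append({
            "replica": int(fields[0]),
            "neighbor": int(fields[1]),
            "e_own": float(fields[3]),
            "e_neighbor": float(fields[4]),
            "fe_left": float(fields[5]),    # float() accepts "-Infinity"
            "fe_right": float(fields[6]),
            "accepted": fields[7] == "T",
        })
    return rows

def summarize(rows):
    """Report the energy gap to the neighbor and rows with non-finite free energies."""
    if not rows:
        raise ValueError("no data rows found")
    gaps = [r["e_neighbor"] - r["e_own"] for r in rows]
    bad = [r["replica"] for r in rows
           if not (math.isfinite(r["fe_left"]) and math.isfinite(r["fe_right"]))]
    return {
        "mean_gap": sum(gaps) / len(gaps),
        "max_gap": max(gaps),
        "non_finite": bad,
    }
```

Calling summarize(parse_rem_log(open("rem.log").read())) on each run
should make the difference visible at a glance: a mean gap of about
1000 kcal/mol together with a non-empty "non_finite" list would match
the pmemd output above.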

I also tried to investigate the situation on my own and modified the
code of both sander and pmemd to dump some arrays: the forces in the
case of sander (the ftemp array in remd.F90, line 2447, produced by
call force(x,ix,ih,...)), and most of the arguments of the pme_force
subroutine in the case of pmemd (including the frc_temp array in
remd_exch.F90, line 523, call pme_force(atm_cnt, crd_temp, frc_temp,
...)). I am attaching the forces calculated by replica 1 the first time
the above-mentioned subroutines were executed; the (neighbor)
coordinates are those of replica 16. I think the values in the two
files should be the same, because the forces are from the very start of
the simulation, but they are not. Notably, the file
pmemd_run_64_forces.001 (pmemd run on 64 CPUs) contains circa 75% zero
components and 25% non-zero components. With 64 CPUs and 16 replicas
there are 4 processes per replica, so the non-zero quarter would
correspond to the part computed by the master process of that replica.
I have no knowledge of the internals of the pme_force subroutine, so I
am quite helpless at this point. I am not sure whether my findings
about the forces are relevant to the failure to calculate the potential
energy of neighbor coordinates.
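
The share of zero force components can be checked with a short Python
sketch like the one below. It assumes the dump is plain text containing
only whitespace-separated force components in a notation that Python's
float() can read (for example 1.5E+00, but not Fortran's 1.5D+00);
index columns, if present, would have to be stripped first. The function
name and the tolerance parameter are my own.

```python
def zero_fraction(path, tol=0.0):
    """Return the fraction of numeric tokens in the file with |value| <= tol."""
    total = 0
    zeros = 0
    with open(path) as fh:
        for line in fh:
            for token in line.split():
                try:
                    value = float(token)
                except ValueError:
                    continue  # ignore labels or other non-numeric text
                total += 1
                if abs(value) <= tol:
                    zeros += 1
    return zeros / total if total else 0.0
```

A result near 0.75 for pmemd_run_64_forces.001 and near 0.0 for the
sander dump would be consistent with only one of the four processes
contributing its forces.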

The file g2h-g1h_64.tar.gz contains the data of a sample run that
should allow you to reproduce the failure. It can be found here:
http://is.muni.cz/de/151570/g2h-g1h_64.tar.gz

Any help is appreciated.
Kind Regards,

Jiri Wiesner
Masaryk University
Czech Republic



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Sat Dec 15 2012 - 20:30:02 PST