On Mon, May 06, 2019, Charo del Genio wrote:
>
>So, I've tried running the simulations on two different workstations of
>mine, and I get no crashes there. The only difference from the cluster
>is that on the workstations I am using OpenMPI-2.0.2. So, I asked the
>cluster admin to install this specific version of OpenMPI and tried to
>run on the cluster again. The result is that I experience no segfault
>when running with OpenMPI-2.0.2.
It's worth noting that Amber is not currently compatible with a default
build of OpenMPI 4.x. It will compile if OpenMPI itself is configured
with the "--enable-mpi1-compatibility" flag, but there may still be
problems there.
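If you do need OpenMPI 4.x, a minimal sketch of such a build (the
install prefix and parallelism here are placeholders) would be:

    # from inside the unpacked OpenMPI 4.x source tree
    ./configure --prefix=$HOME/opt/openmpi-4.0 --enable-mpi1-compatibility
    make -j4 && make install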
>
>However, a new problem happens, which looks very much like a memory
>leak: the memory used slowly but continuously increases, until the node
>dies from running out of memory. Interestingly, this does NOT happen on
>the two workstations. To see if there is any difference in libraries at
>all, I checked with ldd, and I found that the only difference between
>the cluster and the workstations is that OpenMPI on the cluster is
>compiled without C++ support. In other words, on the cluster there is
>no libmpi_cxx.so. I have already asked the admin to recompile OpenMPI
>with C++ support, to make sure absolutely everything is identical, but
>in the meantime I'm pondering the following questions:
>
>- Does sander actually need to link against libmpi_cxx.so?
No, I don't think so. You could hand-edit your config.h file, remove
the references to libmpi_cxx, and try re-compiling just sander. My
suspicion is that only cpptraj.MPI needs this library.
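For example (a sketch: the exact pattern in config.h may differ on your
install, so keep a backup and check the file by hand first):

    cd $AMBERHOME
    cp config.h config.h.bak            # keep a backup
    sed -i 's/-lmpi_cxx//g' config.h    # drop the C++ MPI bindings
    make install                        # rebuild/relink sander.MPI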
Note that you don't have to rely on your sysadmin: you should be able
to run $AMBERHOME/AmberTools/src/configure_openmpi to get whatever kind
of MPI installation you want (no need for root access, and you won't
interfere with any other users' MPI codes). It might also be worth
trying configure_mpich instead, just to see whether the problem is
somehow related to which MPI stack you are using.
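Roughly (a sketch, assuming you have already downloaded an OpenMPI
source tarball; the version and compiler argument here are examples,
so check the comments at the top of the script for the exact usage):

    cd $AMBERHOME/AmberTools/src
    tar xjf ~/openmpi-2.0.2.tar.bz2     # unpack the OpenMPI source here
    ./configure_openmpi gnu             # build OpenMPI into $AMBERHOME
    cd $AMBERHOME
    ./configure -mpi gnu && make install   # rebuild Amber against it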
>- Could a memory leak be caused by its absence?
No idea....
>- Alternatively, could the bug actually be in sander? On my
>workstations I am using gcc stack-smashing protection by default,
>whereas on the cluster it is disabled.
There certainly could be a bug in sander, leading to a memory leak.
It's probably going to be tough to debug remotely, since it seems to
happen only on your cluster.
>Personally, I'm leaning towards a bug in sander quietly taken care of by
>ssp. If this is the case, how do I/we go about finding precisely where it
>is?
Finding memory leaks in parallel codes is above my pay grade.... :-(
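That said, one common (if slow) approach is to run each MPI rank under
valgrind's leak checker. A sketch, with hypothetical input file names;
note that OpenMPI itself is known to trigger some false positives, so
focus on records pointing into sander's own code:

    mpirun -np 2 valgrind --leak-check=full --log-file=vg.%p.log \
        $AMBERHOME/bin/sander.MPI -O -i mdin -o mdout -p prmtop -c inpcrd

    # then scan the per-rank logs for leak records
    grep "definitely lost" vg.*.log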
....dac