Are you sure you are not exceeding a file system size limit (say 2 GB) for
mdcrd or mdvel? Possibly file system quotas? Possibly full file systems?
It almost has to be the file system somehow, though I believe you said you
tried three different ones.
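As a quick sanity check, you could run something like this little Python
sketch in the job directory while a run is hung (the 2 GB cutoff and the
mdout/mdcrd/mdvel names just mirror what you described; nothing here is
PMEMD-specific, treat it as a rough illustration only):

    import os
    import resource
    import shutil

    TWO_GB = 2 * 1024**3   # the classic large-file limit on 32-bit builds

    # Per-process file size limit (what "ulimit -f" would show)
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    print("RLIMIT_FSIZE soft/hard:", soft, hard)

    # How close the output files are to 2 GB (names assumed from your run)
    for name in ("mdout", "mdcrd", "mdvel"):
        if os.path.exists(name):
            size = os.path.getsize(name)
            print("%-6s %12d bytes  (%.2f of 2 GB)" % (name, size, size / TWO_GB))

    # Free space on whatever file system holds the run directory
    total, used, free = shutil.disk_usage(".")
    print("free space: %d of %d bytes" % (free, total))

If the soft limit turns out to be finite, or a trajectory is sitting right
at the 2 GB mark, that would point straight at the file system.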
Actually, though, there is one other likely problem source, and that is flaky
InfiniBand cards. If one of the nodes becomes unreachable in the middle of a
run, everything will freeze up exactly as you observe. I have mostly seen this
kind of grief with Myrinet, but commodity clusters in general are not immune.
We had a lot of trouble at UNC for a while with our big InfiniBand P4 cluster,
with nodes becoming unreachable; I have no idea whether it was hardware
burn-in problems, configuration problems, loose connectors, or something else.
So I would check what kind of bombproofing there is on your cluster for this
sort of thing (we saw the problems mostly during startup, so that was a bit
different - but with Myrinet, as the gear got older and flakier, you would get
node hangs). There is a reason the heavy iron from the big guys sells for
more...
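If you want a crude way to catch an unreachable node while a job is spinning,
something like the following will probe every host in the machine list (again
just a Python sketch; the "machinefile" name and the ssh port are assumptions
about your MPICH setup, so adjust to whatever your scheduler hands you):

    import socket

    # Assumed: one hostname per line, as in a typical MPICH machinefile
    with open("machinefile") as f:
        nodes = [line.split()[0] for line in f if line.strip()]

    for node in nodes:
        try:
            # TCP connect to the ssh port as a cheap reachability test
            with socket.create_connection((node, 22), timeout=5):
                print(node, "reachable")
        except OSError as err:
            print(node, "NOT reachable:", err)

Any node that shows up as unreachable while pmemd is still at 99% CPU on the
others would be a prime suspect.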
Regards - Bob
----- Original Message -----
From: "Fabian Boes" <fabian.boes.itb.uni-stuttgart.de>
To: <amber.scripps.edu>
Sent: Thursday, June 29, 2006 11:58 AM
Subject: AMBER: Strange problems with PMEMD on Intel Xeons with Infiniband
>
> Dear Amber community,
>
> I am experiencing a strange problem with PMEMD (from the AMBER9 package) on
> a Xeon cluster with InfiniBand. I used the Intel Fortran compiler 9.0
> (as 9.1 breaks the compilation of PMEMD with an internal compiler
> error) together with InfiniBand MPICH. Compilation went fine without
> any errors.
>
> Jobs (on different numbers of nodes) start just fine. Output and
> trajectory files are written out. But suddenly, mostly after
> 50,000-150,000 steps, the writing of the output files stops. The PMEMD
> executable is still running on all nodes at 99% CPU consumption. This
> seems to go on forever, until one manually kills the mpirun command
> with CTRL-C or the PBS job reaches its walltime limit.
>
> I tried the following things:
>
> Compiled with/without Intel Math Kernel Lib
> Compiled with/without netcdf support
>
> Simulations with/without binary trajectory
>
> Used 3 different file systems (GPFS, local node scratch, global cluster
> scratch) to make sure it isn't an I/O problem.
>
> The molecular system under investigation runs fine on an 8-way dual-core
> Opteron machine with PMEMD (compiled with Pathscale and LAM-MPI).
>
> Has anyone experienced the same? Or does anyone have tips on how to debug
> this? I have heard from various sources that the Intel Fortran compiler
> has had problems with PMEMD in the past. Unfortunately, there are no other
> compilers installed on this particular cluster.
>
> Bye,
>
> Fabian
>
> --
>
> Fabian Bös
>
> Institute of Technical Biochemistry
> University of Stuttgart / Germany
>
> Phone: +49-711-685-65156
> Fax: +49-711-685-63196
> Email: fabian.boes.itb.uni-stuttgart.de
>
> http://www.itb.uni-stuttgart.de
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Jul 02 2006 - 06:07:12 PDT