> I am running a Temperature REMD job with 64 replicas. It's running on a
> CPU cluster where I have to split it into several Job steps. I am
> observing that some steps slow down to 50% (880 exchange attempts
> instead of 2000), even though they are just a continuation of the
> previous step. I turned iwrap on since I read that the expanding volume
> can cause a slowdown.
>
> I don't really know where to start looking for the reason here. I'd
> appreciate any advice.
We have seen this on the Blue Waters resource (an older Cray with AMD
CPUs and K20S GPUs) without a great explanation, except that it tends to
occur when we see degradation or overload of the parallel file system
each replica is writing to... I do not think it occurs routinely on
other machines (with more balanced or higher-performing I/O subsystems),
and there is really nothing you can do about it while it is happening:
the replicas synchronize at every exchange attempt, so if one MPI
process slows down, they all do. [Another possibility could be rogue
jobs on one of the nodes that are chewing up cycles.] So, look for I/O
contention, and/or see if the problem goes away when you do not write
trajectory output (and if you have lots of replicas, make sure they are
writing to a parallel file system if you have one).
--tec3
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Nov 10 2017 - 13:00:02 PST