Re: [AMBER] Parallel GPU calculation

From: Jason Swails <jason.swails.gmail.com>
Date: Fri, 27 Feb 2015 09:59:02 -0500

On Fri, 2015-02-27 at 15:14 +0100, Stefano Motta wrote:
> 2015-02-27 14:06 GMT+01:00 Jason Swails <jason.swails.gmail.com>:
>
> >
> > Why are you using 'nohup'? I would definitely recommend *against* doing
> > that in a PBS script. The only time that's really useful is if you want
> > to run a job interactively and don't want it to die if either your
> > terminal closes or your ssh session dies.
> >
>
> you are right, I copied the command from a previous calculation I've done
> on a remote machine.
>
>
> > Do you know if the two GPUs are connected via Peer-to-Peer? Check the
> > "Multi GPU" section of http://ambermd.org/gpus/ for more information.
> >
> I have not found information about the connection between 2 GPUs of the
> same node, the only information I've found is that: *"All the nodes are
> interconnected through a custom Infiniband network, allowing for a low
> latency/high bandwidth interconnection." *This is the machine I'm using:
>
> http://www.hpc.cineca.it/content/eurora-user-guide#systemarchitecture
>
> > It may be that there is nothing you can do to improve scaling, and that
> > you're better off just running 2 separate jobs (or using replica
> > exchange, for instance).
> >
> >
> I'm already using accelerated molecular dynamics (aMD), so I preferred not
> to combine it with replica exchange, but it is a possibility. It's a shame
> to have access to a machine with such a great number of GPUs and only be
> able to use one GPU at a time.

A good deal of this may be the fault of the Eurora cluster. If their
GPUs are not on the same PCIe bus, they can't communicate via
peer-to-peer (which is where you would expect to get a much better
speedup), and parallel scalability is "needlessly" limited. Regardless,
you wouldn't be able to scale off-node anyway.
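If you want to verify this yourself, the "Multi GPU" section of
http://ambermd.org/gpus/ describes how to test peer-to-peer
connectivity. As a rough stand-in (a minimal sketch of my own, not the
tool from that page), you can ask the CUDA runtime directly; the file
name is made up, but the API calls are standard:

// p2p_check.cu -- ask the CUDA runtime whether each pair of GPUs in
// this node can access each other's memory directly (peer-to-peer).
// Compile with: nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        printf("Fewer than two visible GPUs; nothing to check.\n");
        return 0;
    }
    // Check every ordered pair of devices.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n", i, j,
                   ok ? "POSSIBLE" : "NOT possible");
        }
    }
    return 0;
}

If every pair reports "NOT possible", the two GPUs you were assigned are
not on the same bus and pmemd.cuda.MPI will have to go through the host.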

In general, though, problems that inherently require a lot of thread
cooperation and communication (such as MD) are challenging to scale
effectively across a large number of parallel threads. The reason is
that communication between threads does not come for free. In SMP,
locks and mutexes have a cost, since some threads have to "wait" for a
lock to be released; in distributed processing (like with MPI), the data
transfer itself can be costly, particularly if the communication pathway
has either high latency (i.e., it takes a long time for messages to "get
through") or low bandwidth (not much information can be sent at once).
Once the time spent 'waiting' on the mechanics of parallelizing your
program becomes significant compared to the time spent actually running
the calculation, you cease to benefit from more processors.
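To put a rough number on that: if a step costs T_comp of computation and
splitting it over N processors adds a communication overhead T_comm per
step, then a crude back-of-the-envelope model (my own, ignoring load
imbalance) gives a speedup of roughly

  S(N) ~ T_comp / (T_comp/N + T_comm)

which can never exceed T_comp/T_comm no matter how large N gets. Once
T_comm is comparable to T_comp/N, adding more processors buys you almost
nothing.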

A single GPU at this point has become *blazingly* fast. GPUs are
getting faster at a much higher rate than the interconnects between
them, so communication is becoming more and more expensive relative to
the calculation. When this happens, the only way to "rescue" scaling
(assuming you can't appreciably reduce the communication requirements)
is to actually sabotage your algorithm so that it runs slower (that way,
it's not so fast compared to the communication); clearly a poor
approach.

The other alternative is to use a method that requires significantly
less communication. Replica exchange is the obvious candidate here:
each replica runs independently on its own GPU, and the replicas only
need to communicate when exchanges are attempted.

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 27 2015 - 07:00:03 PST