The optimizations in this iteration focused on the part of the code that
previously accounted for about 85% of an iteration's runtime. That code is
now roughly 2.3x faster than before. The remaining 15% is only 50% faster
than it used to be because it relies on FFTs, which are notoriously
difficult to scale on GPUs, especially at the grid dimensions used by a
typical molecular dynamics simulation.
So, doing the math, what once had a speed-of-light (SOL) scaling limit at
~6x single-GPU performance is now pinned at ~4.6x single-GPU performance.
This reduction in attainable scaling then translates down the chain to
all GPU counts. In reality, one never hits SOL anyway, because one is SOL
long before that due to the craptacularly GPU-unfriendly implementation
of MPI.
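For anyone curious where the ~6x and ~4.6x come from, here is a quick
Amdahl's-law-style back-of-the-envelope in Python. It is only a sketch: it
takes the 85/15 split and the 2.3x/1.5x speedups from above and assumes the
FFT work is the part that effectively does not scale across GPUs.

  # Back-of-the-envelope Amdahl's law check of the numbers above.
  # Assumption: the FFT (reciprocal-space) work is the non-scaling part.
  direct_frac, fft_frac = 0.85, 0.15      # split of an old iteration
  direct_speedup, fft_speedup = 2.3, 1.5  # single-GPU speedups per part

  # New iteration time relative to the old one (~0.47, i.e. ~2.1x overall)
  new_time = direct_frac / direct_speedup + fft_frac / fft_speedup

  # Fraction of the new iteration spent in the poorly scaling FFT (~0.21)
  new_fft_frac = (fft_frac / fft_speedup) / new_time

  print("old scaling ceiling ~ %.1fx single GPU" % (1.0 / fft_frac))      # ~6.7x
  print("new scaling ceiling ~ %.1fx single GPU" % (1.0 / new_fft_frac))  # ~4.7x

That lands roughly on the ~6x and ~4.6x quoted above; the small differences
are just the rounded inputs.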
So, in summary: don't worry, be happy, and hope that NVIDIA can improve
GPU-to-GPU communication in the future, because hitting SOL depends entirely
on that. And if they manage to make a scaling FFT possible, then the sky's
the limit.
Scott
Are you really complaining about only hitting 88 ns/day on JAC? Yeesh.
When I was young and walking uphill in the snow both ways to my grad
school laboratory, we'd be lucky to get 1 ns/day, and we *were*
*grateful*! :-)
On Thu, May 19, 2011 at 2:16 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Filip,
>
>> The only thing that is a bit confusing for me is the scaling. According
>> to the benchmark, the speedup between one and 2xM2090 is just 19%, but
>> between 2xM2090 and 4xM2090 it is 50%; what could be the reason? Actually
>
> I need to look more closely into these benchmarks to be sure. It is possible
> the 2xM2090 is artificially slow for some reason. I'll have more confirmed
> benchmarks shortly once the grant writing season is out of the way.
>
>> from these numbers the scaling for all systems is a bit worse compared to
>> the previous benchmark results. Is that due to the new hardware (the
>> new M2090 cards) or due to the new code changes?
>
> It is unfortunately the laws of physics. Making the code run 2x faster on a
> single GPU is ALWAYS going to come at the cost of scalability unless one
> also doubles the interconnect speed (and reduces the latency) at the same
> time, which of course does not happen. Also, the faster a single card gets
> (such as the M2090 vs the C2070) for a given interconnect speed (PCI-E x16
> and QDR IB), the more unbalanced the ratio of compute performance to
> interconnect performance becomes, and scaling falls off.
>
> We are hoping to take advantage of new features for communicating between
> GPUs in future releases of the NVIDIA toolkit that will help improve
> scalability further. My ultimate goal right now is something approaching
> half a microsecond a day for the JAC benchmark. Whether, and when, this
> ultimately happens will very much depend on whether the NSF chooses to
> keep funding this effort.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber