I'm forwarding what I received; I think it will be interesting for
everybody.
David Konerding wrote:
>
> Stephane writes:
> >2 103.0s (x1.7)
> >
> >Please note that gcc-2.96 is not so up to date, and that gcc-3.0
> >may be more efficient, because it contains optimisations for
> >Athlon-based processors.
>
> Be aware that you will get worse scaling with a more efficient
> compiler, because scaling depends much more on communication, which
> gcc does not optimize. This is Amdahl's law: if you speed up the
> computational part, a larger fraction of the run time is spent in
> the communication bottleneck, so the measured scaling gets worse.
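>
> A quick back-of-the-envelope illustration (the numbers here are
> made up, just to show the shape of the effect):
>
> #include <stdio.h>
>
> /* Toy Amdahl's-law arithmetic: a fixed communication cost plus a
>    compute cost that the compiler speeds up by 1.7x. */
> int main(void) {
>     double comm = 20.0;              /* seconds communicating */
>     double compute = 80.0;           /* seconds computing */
>     double fast = compute / 1.7;     /* after the compiler win */
>     printf("comm share before: %.0f%%\n", 100.0*comm/(comm+compute));
>     printf("comm share after:  %.0f%%\n", 100.0*comm/(comm+fast));
>     return 0;  /* 20% -> ~30%: same comm time, worse 'scaling' */
> }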
>
> As for gcc-3.0 vs. gcc-2.96, this is a tough question. You can
> even get gcc-3.0 to generate SSE and SSE2 instructions for
> floating point operations. This doesn't actually produce faster
> code in most situations, because the compiler is not smart enough
> to generate SSE2 code that beats what the regular floating point
> unit can do. Knowing how to generate SSE2 code that is fast is
> still very much an art.
>
> In the past, I used the Pentium Pro performance counters to observe
> AMBER performance on my PC. Basically, there were two things that
> AMBER made the chip do: a hell of a lot of very slow division
> operations and very many RAM-to-cache-line loads. This is because
> most of the AMBER force field is basically: load a vector into the
> cache and perform floating point operations on it. It's more
> complicated when you use PME, because PME is a more heterogeneous
> calculation (FFTs use lots of bit shifts and ALU operations); here
> I am just talking about standard nonbonded pairlist ops.
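>
> The access pattern looks roughly like this (a sketch of the shape
> of the computation, not actual AMBER source; all names are made up):
>
> /* For each neighbor j of atom i in the pairlist: one RAM-to-cache
>    load of j's coordinates, then division-heavy floating point work. */
> void pair_energy(int n, const int *pairs, const double *x,
>                  const double *y, const double *z,
>                  double xi, double yi, double zi, double *e) {
>     int k;
>     for (k = 0; k < n; k++) {
>         int j = pairs[k];                 /* RAM-to-cache-line load */
>         double dx = x[j]-xi, dy = y[j]-yi, dz = z[j]-zi;
>         double r2 = dx*dx + dy*dy + dz*dz;
>         *e += 1.0/(r2*r2*r2);             /* slow division (r^-6 term) */
>     }
> }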
>
> Now, I have written a few small SSE routines which show that
> 1/sqrt(r) can be calculated 10x as fast using SSE (hand-written
> assembly) as with regular floating point (hand-written C with
> unrolled loops) on my 1.4 GHz P4. However, the gcc-3.0 SSE code
> generator took the hand-written C with unrolled loops and produced
> slower code than the plain floating point version. Oops. The Intel
> C compiler did a slightly better job, but even it couldn't do any
> better than this routine I wrote:
>
> #define DATA_SIZE 1024  /* illustrative length; any multiple of 4 */
>
> /* Approximate 1/sqrt of four packed single-precision floats per
>    iteration using the SSE rsqrtps instruction (accurate to about
>    12 bits). d1 and d3 must be 16-byte aligned for movaps; d2 is
>    unused here. */
> void simd(float *d1, float *d2, float *d3) {
>     int i;
>     for (i = 0; i < DATA_SIZE; i += 4) {
>         __asm__ __volatile__("rsqrtps (%0), %%xmm0\n\t" :: "r" (&d1[i]));
>         __asm__ __volatile__("movaps %%xmm0, (%0)\n\t" :: "r" (&d3[i]));
>     }
> }
>
> This routine takes four single-precision (32-bit) floating point
> numbers at a time and computes their approximate reciprocal square
> roots. Obviously the vectors need to be padded to a multiple of 4
> in length.
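>
> A caller might look like the following (just a sketch; the
> posix_memalign allocation is my choice and error checking is
> omitted, but the 16-byte alignment really is required by movaps):
>
> #include <stdlib.h>
>
> int main(void) {
>     float *in, *out;
>     int i;
>     /* 16-byte-aligned buffers, length already a multiple of 4 */
>     posix_memalign((void **)&in,  16, DATA_SIZE * sizeof(float));
>     posix_memalign((void **)&out, 16, DATA_SIZE * sizeof(float));
>     for (i = 0; i < DATA_SIZE; i++) in[i] = (float)(i + 1);
>     simd(in, NULL, out);      /* now out[i] ~ 1/sqrt(in[i]) */
>     free(in); free(out);
>     return 0;
> }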
>
> >You see there that with a dual-Athlon, despite lower clock
> >frequencies (1.2 vs 1.33 GHz), you can obtain the same 'power'
> >as that of 4 nodes.
> >
> >The dual box costs about the same as one or two single boxes, so
> >for half the price you can think about having not far from twice
> >the 'power'!
>
> The Athlon systems have very good floating point and memory
> performance. The developers of the Athlon did a very good job making
> a CPU that gives good performance on AMBER-like jobs. Its floating
> point units, without being extremely super-scalar, can complete a
> floating point operation in very few clocks. The P4 developers, on
> the other hand, were tasked with building a CPU that could later be
> scaled to very high clock rates. They did so at least in part by
> dumping an entire floating point unit. They went instead for heavy
> pipelining and super-scalar execution, which means that to generate
> code with optimal throughput the compiler needs intimate knowledge
> of instruction scheduling to take advantage of the super-scalar
> units and the SSE2 units. Intel's compiler does this. The reason
> the new P4s get such good SPECfp scores (almost as good as the
> high-end Alphas) is that they use the Intel C/Fortran compilers with
> heavily tuned SSE2-based BLAS and LAPACK libraries. This is not
> "cheating", because they compute the same answer in less time, but I
> challenge you to do as good a tuning job as Intel's best performance
> gurus.
>
> >
> >For the second part, you may have a look at
> >:http://www.scl.ameslab.gov/Projects/MP_Lite/
>
> Very interesting, except that you can't use communicators other than
> MPI_COMM_WORLD, which heavily limits your code's ability to do
> optimal communications. I do like the approach of tuning for optimal
> user-space performance, which codes like MPICH don't take (MPICH is
> designed as a general library that others can borrow from, rather
> than as a high-performance MPI implementation).
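>
> For the record, this is the sort of thing you give up: a minimal
> sketch of splitting MPI_COMM_WORLD into sub-communicators, here one
> per group of 4 ranks (the grouping is just an example):
>
> #include <mpi.h>
>
> int main(int argc, char **argv) {
>     int rank;
>     MPI_Comm group;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     /* color = rank/4 puts every 4 consecutive ranks in one comm */
>     MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &group);
>     /* ... collectives on 'group' now touch only those 4 ranks ... */
>     MPI_Comm_free(&group);
>     MPI_Finalize();
>     return 0;
> }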
>
> Building a very high performance MPI implementation is the single
> most important thing for getting optimal performance out of large
> parallel systems. As yet there is no freely available high
> performance MPI implementation over Ethernet that I know of. If
> somebody were sufficiently motivated, they could build a free
> MPI-over-Ethernet tied into the Linux kernel (i.e. move much of the
> currently user-space implementation into kernel space, and work on
> process scheduling and user-to-kernel-space communication) that
> could give much better performance than current MPIs sitting on top
> of the TCP/IP layer. Then again, recent Linux developments in
> zero-copy networking (together with modifications to the user-space
> MPI code) could also give some performance improvements. With all
> that said, I'd rather just go with a vendor that has ultra-high
> performance networking hardware not based on Ethernet (a la Myrinet,
> etc.) and that has already written the optimal MPI library for its
> hardware. At least, that is the solution I would go with if I had
> the $$$.
>
> Dave
--
*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*
Teletchea Stephane - CNRS UMR 8601
Lab. de chimie et biochimie pharmacologiques et toxicologiques
45 rue des Saints-Peres 75270 Paris cedex 06
tel : (33) - 1 42 86 20 86 - fax : (33) - 1 42 86 83 87
email : steletch_at_biomedicale.univ-paris5.fr
*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*