[Gllug] code optimisations
Chris Ball
chris at void.printf.net
Wed Mar 23 12:23:33 UTC 2005
On Mon, Mar 21, 2005 at 07:38:57PM +0000, Pete Ryland wrote:
> And by cache efficiency, I mean things like organising memory usage to be
> more sequential if possible, e.g.:
>
>   for (i=0; i<1000000; i++)
>     for (j=1; j<10000000; j++)
>       a[i][j]++;
>
> is *significantly* faster than:
>
>   for (j=1; j<10000000; j++)
>     for (i=0; i<1000000; i++)
>       a[i][j]++;
>
> I don't think a compiler would optimise that type of thing.
As Nix points out, this kind of outer-loop vectorisation is happening in
gcc's autovect-branch at the moment, and is mentioned on:
http://gcc.gnu.org/wiki/Autovectorization%20Enhancements
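If anyone wants to measure the effect Pete describes, here's a minimal,
self-contained sketch (array sizes shrunk from the quoted example so the
whole thing fits in memory; the clock() timing is crude but enough to show
the gap):

#include <stdio.h>
#include <time.h>

#define ROWS 2048
#define COLS 2048

static int a[ROWS][COLS];

int main(void)
{
    int i, j;
    clock_t t0;

    /* Row-major order: the inner loop walks consecutive addresses,
       so each cache line that gets fetched is fully used. */
    t0 = clock();
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            a[i][j]++;
    printf("row-major:    %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major order: each access jumps COLS*sizeof(int) bytes,
       so nearly every access misses the cache. */
    t0 = clock();
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            a[i][j]++;
    printf("column-major: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}

On most machines the row-major version wins by a large factor, simply
because it makes full use of every cache line it pulls in.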
On Tue, Mar 22, 2005 at 01:16:25PM +0000, Rich Walker wrote:
> Also, on some of the Athlon-and-later systems, the memory controller can
> support multiple fetch streams, making
>
>   for (i=0; i<1000000/2; i++)
>     for (j=1; j<10000000; j++) {
>       a[i][j]++;
>       a[i+1000000/2][j]++;
>     }
> perform faster than the naive version.
This is standard autovectorization, and it's working in gcc-4.0 when you
use -O2 and -ftree-vectorize. There are examples of loops we can and can't
deal with at <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>,
together with a link to a PDF from last year's GCC Summit explaining the
technique in general.
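To give a flavour of what does and doesn't go through (these two loops are
illustrative sketches, not lifted from that page; compile with -std=c99 for
restrict):

#define N 1024

/* No iteration depends on any other, and restrict tells the compiler
   the arrays don't overlap, so this is the shape the vectorizer goes
   after. */
void add_arrays(int *restrict a, const int *restrict b,
                const int *restrict c)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Each iteration reads the result written by the previous one (a
   loop-carried dependence), so it can't be vectorized as written. */
void prefix_like(int *a)
{
    int i;
    for (i = 1; i < N; i++)
        a[i] = a[i - 1] + 1;
}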
> But it's ... interesting ... to communicate to a C compiler that the
> second optimisation is valid. If you did:
Quite. The compiler can't always know that there's no data dependency
inside the loop. The current plan for autovect-branch is to control
the vectorizer with a #pragma (which is apparently how icc does it).
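For reference, the icc pragma in question is, as far as I know, #pragma
ivdep, which is a promise from the programmer that the loop carries no
dependence the compiler can't see for itself. A sketch of the sort of loop
where it matters (function and parameter names made up for illustration):

/* With k unknown at compile time, the compiler has to assume that
   a[i] might be written before a[i - k] is read, i.e. a dependence.
   The pragma asserts there isn't one; whatever #pragma gcc's
   autovect-branch settles on would play the same role. */
void shift_add(float *a, const float *b, int n, int k)
{
    int i;
#pragma ivdep
    for (i = k; i < n; i++)
        a[i] = a[i - k] + b[i];
}

If k really were small, the dependence would be real and vectorising would
change the result, which is exactly why this has to be an explicit
programmer assertion rather than something the compiler can infer.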
- Chris.
--
Chris Ball <chris at void.printf.net>
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug