[Gllug] code optimisations
Chris Ball
chris at void.printf.net
Wed Mar 23 12:23:33 UTC 2005
On Mon, Mar 21, 2005 at 07:38:57PM +0000, Pete Ryland wrote:
> And by cache efficiency, I mean things like organising memory usage to be
> more sequential if possible, e.g.:
>
>   for (i=0; i<1000000; i++)
>     for (j=1; j<10000000; j++)
>       a[i][j]++;
>
> is *significantly* faster than:
>
>   for (j=1; j<10000000; j++)
>     for (i=0; i<1000000; i++)
>       a[i][j]++;
>
> I don't think a compiler would optimise that type of thing.
As Nix points out, this kind of outer-loop vectorisation is happening in
gcc's autovect-branch at the moment, and is mentioned on:
http://gcc.gnu.org/wiki/Autovectorization%20Enhancements
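If anyone wants to measure the effect Pete describes, here's a minimal,
self-contained sketch (array sizes shrunk from the quoted example so the
whole thing fits in memory; the clock() timing is crude but enough to show
the gap):

#include <stdio.h>
#include <time.h>

#define ROWS 2048
#define COLS 2048

static int a[ROWS][COLS];

int main(void)
{
    int i, j;
    clock_t t0;

    /* Row-major order: the inner loop walks consecutive addresses,
       so each cache line that gets fetched is fully used. */
    t0 = clock();
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            a[i][j]++;
    printf("row-major:    %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major order: each access jumps COLS*sizeof(int) bytes,
       so nearly every access misses the cache. */
    t0 = clock();
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            a[i][j]++;
    printf("column-major: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}

On most machines the row-major version wins by a large factor, simply
because it makes full use of every cache line it pulls in.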
On Tue, Mar 22, 2005 at 01:16:25PM +0000, Rich Walker wrote:
> Also, on some of the Athlon-and-later systems, the memory controller can
> support multiple fetch streams, making
>
>   for (i=0; i<1000000/2; i++)
>     for (j=1; j<10000000; j++) {
>       a[i][j]++;
>       a[i+1000000/2][j]++;
>     }
> perform faster than the naive version.
This is standard autovectorization, and it's working in gcc-4.0 when you
use -O2 and -ftree-vectorize. There are examples of loops we can and can't
deal with at <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>,
together with a link to a PDF from last year's GCC Summit explaining the
technique in general.
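To give a flavour of what does and doesn't go through (these two loops are
illustrative sketches, not lifted from that page; compile with -std=c99 for
restrict):

#define N 1024

/* No iteration depends on any other, and restrict tells the compiler
   the arrays don't overlap, so this is the shape the vectorizer goes
   after. */
void add_arrays(int *restrict a, const int *restrict b,
                const int *restrict c)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Each iteration reads the result written by the previous one (a
   loop-carried dependence), so it can't be vectorized as written. */
void prefix_like(int *a)
{
    int i;
    for (i = 1; i < N; i++)
        a[i] = a[i - 1] + 1;
}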
> But it's ... interesting ... to communicate to a C compiler that the
> second optimisation is valid. If you did:
Quite. The compiler can't always know that there's no data dependency
inside the loop. The current plan for autovect-branch is to control
the vectorizer with a #pragma (which is apparently how icc does it).
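For reference, the icc pragma in question is, as far as I know, #pragma
ivdep, which is a promise from the programmer that the loop carries no
dependence the compiler can't see for itself. A sketch of the sort of loop
where it matters (function and parameter names made up for illustration):

/* With k unknown at compile time, the compiler has to assume that
   a[i] might be written before a[i - k] is read, i.e. a dependence.
   The pragma asserts there isn't one; whatever #pragma gcc's
   autovect-branch settles on would play the same role. */
void shift_add(float *a, const float *b, int n, int k)
{
    int i;
#pragma ivdep
    for (i = k; i < n; i++)
        a[i] = a[i - k] + b[i];
}

If k really were small, the dependence would be real and vectorising would
change the result, which is exactly why this has to be an explicit
programmer assertion rather than something the compiler can infer.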
- Chris.
--
Chris Ball <chris at void.printf.net>
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug