[Nottingham] Compiler optimisation flags (gcc, g++)

Thu Nov 6 22:59:18 GMT 2003

Robert Davies wrote:
[...]
> 
> There's been a recent article in Gentoo news comparing some optimisation
> flags.  One of the issues was -O3 slowing code down (due to bloat). If you
> want non-explicit function inlining, then consider -finline-limit=N, where
> N is some size smaller than the default.  Altering alignments from the
> defaults also did not help, except when made much larger.
> 
> The problem with flags is that they are very processor- and
> program-specific. It's doubtful you get enough of a benefit over
> -O2 -march=athlon-xp (or -O3 with a low -finline-limit) to warrant the
> effort, and no one else can tell you the 'right' options to use for your
> software.

Lots of surfing later, the details for the optimisation flags still seem
rather confused!

The best of the comparisons that I've found thus far are appended,
although even these are flawed because no mention is made of the
supporting hardware, or even of whether the same users are simply
refining their own tweaks.

So far, the good options seem to be:

export CFLAGS="-march=athlon-xp -O3 -fexpensive-optimizations 
-funroll-loops -frerun-cse-after-loop -frerun-loop-opt 
-fomit-frame-pointer -fschedule-insns2 -minline-all-stringops 
-mfancy-math-387 -mfp-ret-in-387 -m3dnow -msse -mfpmath=sse -mmmx 
-malign-double -falign-functions=4 -mpreferred-stack-boundary=4
-fforce-addr -pipe"

Notes:

Should the alignment values be =5 rather than =4?

Inlining and loop unrolling can be detrimental if the expanded loops then
exceed the CPU cache size...

How extravagant is -funroll-loops when it comes to polluting the L1 cache?

-ffast-math can give significant speedups. However, I don't like the
idea of throwing away IEEE checks/conventions on floating-point results.
Good/bad/indifferent/inconsequential?

> Things like -msse and -m3dnow are meant to be switched on by -march anyway.
> If you check gcc in verbose mode, it actually seems to turn off these
> settings (gcc-3.{1,2,3}) immediately after you select them.

To be investigated...

All comments/advice welcome.

(About to compile gcc 3.3.2 with the presently installed gcc 3.2.2.)

Regards,
Martin

Optimisations roughly from fastest to slowest for AthlonXP CPUs:

http://www.freebench.org/cgi-bin/showdetails.pl?13=13&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Mason: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -finline-limit=100000 -fforce-addr -falign-functions=5 
-malign-double -fbranch-probabilities

Flags - PiFFT: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=4 -falign-jumps=4 
-mpreferred-stack-boundary=4 -fprefetch-loop-arrays -falign-functions=5

Flags - Neural: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -fprefetch-loop-arrays -falign-loops=4 -falign-jumps=4 
-falign-functions=5 -fforce-addr -fbranch-probabilities
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?35=35&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - FourInARow: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - Mason: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - pCompress2: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - PiFFT: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - DistRay: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr

Flags - Neural: -march=athlon-xp -O3 -fomit-frame-pointer -pipe 
-funroll-loops -falign-loops=5 -falign-jumps=5 -falign-functions=64 
-fforce-addr
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?38=38&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - FourInARow: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - Mason: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - pCompress2: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - PiFFT: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - DistRay: -march=athlon-xp -O3 -fomit-frame-pointer -pipe

Flags - Neural: -march=athlon-xp -O3 -fomit-frame-pointer -pipe
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?77=77&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - FourInARow: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - Mason: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - pCompress2: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - PiFFT: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - DistRay: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays

Flags - Neural: -march=athlon-xp -O3 -pipe -fomit-frame-pointer 
-fforce-addr -falign-functions=64 -maccumulate-outgoing-args -ffast-math 
-fprefetch-loop-arrays
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?40=40&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -O3 -pipe -march=athlon-xp -m3dnow -msse -mfpmath=sse 
-mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - FourInARow: -O3 -pipe -march=athlon-xp -m3dnow -msse 
-mfpmath=sse -mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - Mason: -O3 -pipe -march=athlon-xp -m3dnow -msse -mfpmath=sse 
-mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - pCompress2: -O3 -pipe -march=athlon-xp -m3dnow -msse 
-mfpmath=sse -mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - PiFFT: -O3 -pipe -march=athlon-xp -m3dnow -msse -mfpmath=sse 
-mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - DistRay: -O3 -pipe -march=athlon-xp -m3dnow -msse -mfpmath=sse 
-mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa

Flags - Neural: -O3 -pipe -march=athlon-xp -m3dnow -msse -mfpmath=sse 
-mmmx -fforce-addr -fomit-frame-pointer -funroll-loops 
-frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 
-maccumulate-outgoing-args -ffa
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?21=21&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - FourInARow: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - Mason: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - pCompress2: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - PiFFT: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - DistRay: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe

Flags - Neural: -O3 -march=athlon-xp -fmove-all-movables 
-fprefetch-loop-arrays -funroll-loops -fomit-frame-pointer -ffast-math 
-mmmx -msse -m3dnow -mfpmath=sse,387 -pipe
 >>>

http://www.freebench.org/cgi-bin/showdetails.pl?90=90&anabase=1&fourbase=1&masonbase=1&pcbase=1&pibase=1&distbase=1&neuralbase=1&fpmeanbase=1&intmeanbase=1&totmeanbase=1
 >>>
Flags - Analyzer: -O3

Flags - FourInARow: -O3

Flags - Mason: -O3

Flags - pCompress2: -O3

Flags - PiFFT: -O3

Flags - DistRay: -O3

Flags - Neural: -O3
 >>>

-- 
----------------
Martin Lomas
martin at ml1.co.uk
----------------