[Gllug] code optimisations

Nix nix at esperi.org.uk
Mon Mar 21 15:43:06 UTC 2005


On Sun, 20 Mar 2005, Minty moaned:
> Bit of a sunday morning ramble - I'm just sharing some experience and
> maybe people can add useful things to it.  I'm sure some will
> disagree and some of it is no doubt stating the obvious.  This isn't
> fact, just my experience!

:)

> I had a coursework due that basically involved taking an existing
> chunk of C code and making it run as fast as possible on a platform of
> my choice.  My understanding was that the particular code in question
> was neither here nor there - the point was generic(-ish) ways to make
> things go faster.  fwiw, it was an image edge detection algorithm.

A good example: both likely to be amenable to microoptimizations
(i.e. with small hot spots) and the sort of thing you're likely to
*want* to get every last bit of speed out of in real life.

> - Toshiba Libretto 100CT (166Mhz Pentium 1, MMX, 64 Mb Ram, 2Gb Disk)

L1 and L2 cache sizes?


Alas, this hardware is too old to run oprofile (it doesn't have
appropriate hardware monitoring counters), and has much too little
RAM to run cachegrind or any valgrind tool. A bigger machine is,
I'd say, critically important if you want to run either of these
(very useful) tools.

> - Nasty mix of Debian stable/testing/unstable.  Long story.
> - 40Mb of swap.  Longer story - I know this should be a lot higher.

If you're hitting swap you're dead anyway as far as speed is concerned.

> - Shut down all services, inc. kdm, apache, mysql, cron, atd. 
> Basically only ssh and the pcmcia card manager were running. This to
> avoid things screwing with my results.

Most of those things are unlikely to do anything much.

Calling sched_setscheduler() and switching to a realtime scheduling
priority is a more certain method --- but if your code might loop
forever, use schedtool or schedutils to start up a SCHED_RR shell at a
higher realtime priority, so you can kill your errant process if it
goes wrong!
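
For the record, a minimal sketch of the sched_setscheduler() call (the
priority value here is arbitrary, and you need root or CAP_SYS_NICE for
realtime policies):

  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
      struct sched_param sp;
      sp.sched_priority = 10;

      /* Move this process onto the SCHED_RR realtime policy, so
         ordinary timesharing processes can't preempt the benchmark. */
      if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
          perror("sched_setscheduler");
          return 1;
      }

      /* ... run the code being timed ... */
      return 0;
  }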

> - Task was to make it run fast & we're talking wall clock speed here
> 
> Marks came for a "systematic approach", not necessarily getting the
> very fastest speed.  Dunno how I've done, but my approach was:
> 
> - People costs are the most expensive.  Make the computer work for
> you.  Make others work for you.  Be lazy.

> - First look at generic compiler optimisations

You've missed a few stages:

- First look at algorithms. Going from O(n^2) to O(n log n) or something
  like that will have a larger effect than any microoptimization if n is
  likely to be large (see the sketch just after this list).

- Then make sure your working set is OK.

- Then look at GCC flags, by all means: -O2 is a good idea. If you have
  the time, the ACOVEA tool is very useful (it runs your program over
  and over again with different compiler flags to try to determine the
  best set).
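
To make the algorithmic point concrete, here's a hypothetical sketch
(nothing to do with the edge-detection code): n membership tests
against a linear scan cost O(n^2) in total, while sorting once and
binary-searching costs O(n log n), and no compiler flag will ever buy
you that difference.

  #include <stdlib.h>

  static int cmp_int(const void *a, const void *b)
  {
      int x = *(const int *)a, y = *(const int *)b;
      return (x > y) - (x < y);
  }

  /* O(n) per lookup: fine once, deadly inside a loop over n keys. */
  int contains_linear(const int *a, size_t n, int key)
  {
      size_t i;
      for (i = 0; i < n; i++)
          if (a[i] == key)
              return 1;
      return 0;
  }

  /* O(log n) per lookup; `a' must first be sorted once with
     qsort(a, n, sizeof *a, cmp_int), an O(n log n) step. */
  int contains_sorted(const int *a, size_t n, int key)
  {
      return bsearch(&key, a, n, sizeof *a, cmp_int) != NULL;
  }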

> - Then use gprof, identify key chunks of code to optimise.

There is one major cause of slowdowns that doesn't fall out of this
easily, and that's cache effects. If you're blowing your L1 or L2
caches, this might well show up as a general slowdown *throughout* the
code, as the slowdown will happen when the cache is *reloaded* but the
problem piece of code is the code that's blowing it.

cachegrind --- part of the valgrind toolset --- is *very* useful here,
as is oprofile if you have suitable hardware. But alas, as I mentioned
above, 64Mb is really too little RAM to run it on anything but toy
workloads (as valgrind and all associated tools cause your program to
use *far* more RAM than it uses when run outside valgrind; in some cases
seventeen times as much).
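
(On a machine with enough RAM, the basic invocation is simple; the
program name here is a stand-in for your own binary:

  valgrind --tool=cachegrind ./edge
  cg_annotate cachegrind.out.<pid>

and cg_annotate will break the miss counts down per function and, given
source files, per line.)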

> This was memory/cpu bound code.  There wasn't really any IO to speak of.

Of course, `memory' is questionable: some kinds of `memory' (registers
and cache in the CPU core) can be a hundred times faster than others
(e.g. main memory reached after an L2 cache miss).
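
Here's a hypothetical illustration of just how much this can matter: to
a complexity analysis these two functions are identical, but the second
walks the matrix against its memory layout, and once the data outgrows
L2 it can run several times slower.

  #define N 1024

  /* Row-major order: consecutive accesses are adjacent in memory,
     so every cache line fetched is used in full. */
  long sum_rows(int m[N][N])
  {
      long s = 0;
      int i, j;
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              s += m[i][j];
      return s;
  }

  /* Column-major order over the same data: each access touches a
     different cache line, so the caches thrash. */
  long sum_cols(int m[N][N])
  {
      long s = 0;
      int i, j;
      for (j = 0; j < N; j++)
          for (i = 0; i < N; i++)
              s += m[i][j];
      return s;
  }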

> *** gcc **********
> 
> I was using whatever came with Debian. Which was v 3.0.4 apparently. 

That's a very, very old GCC. You probably want to upgrade to v3.4.3 and
use profile-generated feedback (see the -fprofile-generate and
-fprofile-use switches).
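
The workflow is roughly as follows (the file names are illustrative,
not from the coursework):

  gcc -O2 -fprofile-generate -o edge edge.c
  ./edge                # run on representative input to collect a profile
  gcc -O2 -fprofile-use -o edge edge.c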

> I was using -O3 as my base level.  That gave me ~23 seconds per run.

-O3 is likely to slow down code on i386. In GCC<3.3 it will slow it
down a *lot*, as the inliners were pretty `wacked-out' (to quote the
author of the new inliner :) )

> -O3 does *not* appear to turn on ALL the compiler optimisations. 

No.

> There aren't many it misses, but there are some.  That said, it wasn't
> entirely clear and the extra ones I did turn on did not show much, if
> any, improvement.  I couldn't get my code running faster than ~22.5
> secs.  So far, so disappointing.
> 
> My reading suggested that in some cases -O2 may produce faster code
> than -O3.  Keep in mind I'm optimising for speed.  If I was after disk
> size, or memory footprint, it may be another matter entirely.

On modern CPUs, unless you have a very small code footprint, memory
footprint largely *is* speed: code that doesn't fit in cache runs
slowly.

-O3 aims for speed over *everything* else, sometimes causing vast code
size growth.

You probably want -O2 or maybe even -Os.

> Then I tried with the -march=pentium flag. 19 seconds.  Over 15%
> improvement. Yay!

-fomit-frame-pointer is also a very good idea: expect to see a 10%
speedup from that alone, as register pressure falls by 1/4 and stack
spills reduce accordingly.

-ffast-math also enables a number of optimizations that can greatly
increase speed if you use a lot of floating-point math and don't care
about strict IEEE compliance (if you don't use NaNs, you're probably
safe).
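
The NaN caveat is worth spelling out: -ffast-math lets the compiler
assume NaNs never occur, so the standard NaN test below (a sketch,
nothing from the code under discussion) may legitimately be folded to
`return 0':

  /* x != x is true only when x is NaN under IEEE 754 arithmetic;
     with -ffast-math the compiler may assume it is never true. */
  int is_nan(double x)
  {
      return x != x;
  }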

> This got me wondering : my gcc was over 2 years old.  It took about 6
> hours, but I grabbed the latest stable gcc and built it from source.

You could just have installed it from Debian testing :)

> This gave me access to the -march=pentium-mmx flag (my original
> version knew nothing about mmx).

This will have no effect to speak of: all it does is enable the MMX
builtins which you can call from your program if necessary. MMX builtins
are minimally useful anyway except for massive block copies, because the
overhead of switching to and from MMX is so high.

GCC-3.4+ with -mfpmath=sse *does* use SSE2 instructions to manipulate
floating point, giving a substantial speedup: alas, this changes the
ABI, so it is only enabled by default on x86-64 platforms.
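
If you do have such a CPU (not the Libretto, alas), you can ask for it
by hand with something like:

  gcc -O2 -march=pentium4 -mfpmath=sse ...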

> It turns out, that with the latest stable version of gcc compiled on
> the target platform, it didn't much matter what flags I used.  Just

Actually, it does; you've just not spotted the useful ones.

> using -O3 on its own got me to about 18.5 seconds.  Adding other
> flags helped some, but on the whole, out of the box it was about as
> fast as it would go.  It appears compiling the compiler on the
> target meant it already knew which target it was targeting.  If
> that makes sense...

Unless you explicitly said `--host=i586-pc-linux-gnu' at configure time,
GCC would continue to target i386 by default.

> Lessons
> * get the latest stable version of your compiler
> * build the compiler on your target platform
> * tell the compiler as much about your hardware as you can (aka,
> chipset, aka -march=pentium-mmx)

Use profile-generated feedback and -fomit-frame-pointer (which will
probably be the default in GCC-4.1 or -4.2, because by that point GDB
should be using DWARF location lists and so won't need frame pointers to
get intelligible backtraces).

In GCC-3.4, -fweb also provides a substantial speedup (but in 4.0,
that optimization has been removed because it was subsumed in other
more general passes operating on higher-level representations).

> I had the source.  So I ran gprof.  I spotted one function taking 65%
> of the time.  This particular code was a lot of churning through
> arrays.

Sounds like a hot spot.
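
For anyone following along, the gprof dance (with an illustrative
program name) is to compile and link with -pg, run the program once to
produce gmon.out, then ask for the report:

  gcc -O2 -pg -o edge edge.c
  ./edge
  gprof ./edge gmon.out | less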

> So I jumped into the code, and found my costly chunk of code.  I found
> this an interesting read
> 
>   http://www.kernelthread.com/publications/faq/optimize.html

This is a very old page; it documents a number of GCC options that
haven't existed for nearly six years!

> In my case, pre-calculating values (sometimes outside the inner for
> loops) was possible/easy in a few places
> 
> For example rather than three instances of 
> 
>   array[i-]

Um, is that i-- or i-1? I assume the latter...

> I did
> 
>   tmp = i-1
> 
> then three times 
> 
>   array[tmp]
> 
> (where "i" did not change between the three calls).

I'd expect CSE to be able to do this, even the local-block-specific CSE
in GCC versions below 4.0. This looks like the compiler's job to me.
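
That is, given a hypothetical fragment like

  out = (a[i-1] + b[i-1] + c[i-1]) / 3;

the compiler should already be computing i-1 once and reusing it for
all three accesses, with no hand-introduced temporary needed.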

(If the mainline GCC-4.0-to-be compiler *doesn't* optimize this,
*please* reduce the testcase and report it as an optimization-class bug
on the GCC bugzilla! Something this simple should be caught!)

> At this point gprof showed another chunk of code eating 90% of the
> time.  I looked at the code and couldn't see any obvious improvements.
>  So I stopped.
> 
> Even if I doubled the performance of some other section of the code,
> it was dominated by that 90% that I couldn't easily improve.  So it
> just wasn't worth the marks at this point.

Quite so. It's amazing how many people forget this and waste ages
shaving 90% off some chunk of code that's accounting for perhaps
1% of the runtime!

> * gprof is surprisingly easy

and very inaccurate. oprofile is *far* more accurate: use it if
possible.

> * Run it using *your* real data.  Run it on your hardware.

Yes.

> * Find a few costly sections.  Concentrate on those.  Think "return on
> investment" at all times.

Yes!

> I still don't really know what the code/algorithm was doing, and I
> don't care.  These techniques I could apply to other codebases.

Algorithmic optimization is probably the most important technique of
all, though: you can't expect order-of-magnitude speedups from anything
else.

> * Optimising for performance often makes the code harder to read.
> And/or less flexible.

This is true of microoptimizations, but algorithmic optimizations
often have the opposite effect as old bummed-to-death code is ripped
out and replaced with legible stuff :)

(GCC itself provides a major example of this: it's sped up between
3.4 and 4.0, got better at compiling, *and* more readable, as two
decades of crud gets partially ditched...)

> * If you spend a load of time tweaking for your hardware, what happens
> next time you change hardware?

You shouldn't be tweaking *that* much, I'd say. Cache-size reduction
optimizations are worth it no matter what: every machine these days
has an L1 and L2 cache, and no matter how big it is, blowing it
less is good.

> Something slow and working ain't great, but it's better than fast and
> borked.

This is not always true: the existence of -ffast-math (and use of the
equivalent by default in most non-GCC compilers, even though it violates
the C standard) indicates otherwise. :)

-- 
> ...Hires Root Beer...
What we need these days is a stable, fast, anti-aliased root beer
with dynamic shading. Not that you can let just anybody have root.
 --- John M. Ford