[Gllug] code optimisations

Minty mintywalker at gmail.com
Sun Mar 20 12:04:26 UTC 2005


Bit of a Sunday morning ramble - I'm just sharing some experience, and
maybe people can add useful things to it.  I'm sure some will
disagree, and some of it is no doubt stating the obvious.  This isn't
fact, just my experience!

*** background **********

I had a coursework due that basically involved taking an existing
chunk of C code and making it run as fast as possible on a platform of
my choice.  My understanding was that the particular code in question
was neither here nor there - the point was generic(-ish) ways to make
things go faster.  fwiw, it was an image edge detection algorithm.

The course, by the by, was about compiler optimisations, cpu design,
memory caching models and the likes.  I took it mostly because it
dropped in nicely to the timetable and was of background interest: by
nature I prefer higher levels of abstraction when I interact with my
hardware.

*** approach **********

- Toshiba Libretto 100CT (166MHz Pentium 1, MMX, 64MB RAM, 2GB disk)
- Nasty mix of Debian stable/testing/unstable.  Long story.
- 40MB of swap.  Longer story - I know this should be a lot higher.

- Shut down all services, inc. kdm, apache, mysql, cron, atd.
Basically only ssh and the pcmcia card manager were running.  This was
to avoid anything skewing my results.
- Only one user logged in
- The task was to make it run fast, and we're talking wall-clock speed here

Marks came for a "systematic approach", not necessarily getting the
very fastest speed.  Dunno how I've done, but my approach was:

- People costs are the most expensive.  Make the computer work for
you.  Make others work for you.  Be lazy.
- First look at generic compiler optimisations
- Then use gprof, identify key chunks of code to optimise.
- Then (maybe) tinker with the code

I ran the program 10 times in series and averaged the run times.  As
the box wasn't doing anything else, this was rough and ready, but
sufficient.
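
Something like the below, with the program and file names made up:

  for i in $(seq 10); do time ./edge input.pgm output.pgm; done

then an average of the "real" times by hand.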

This was memory/CPU-bound code.  There wasn't really any I/O to speak of.

*** gcc **********

I was using whatever came with Debian, which was v3.0.4 apparently.
I used -O3 as my base level.  That gave me ~23 seconds per run.

-O3 does *not* appear to turn on ALL the compiler optimisations.
There aren't many it misses, but there are some.  That said, it wasn't
entirely clear, and the extra ones I did turn on showed little, if
any, improvement.  I couldn't get the code running faster than ~22.5
secs.  So far, so disappointing.

My reading suggested that in some cases -O2 may produce faster code
than -O3.  Keep in mind I'm optimising for speed.  If I was after a
smaller binary, or a smaller memory footprint, it might be another
matter entirely.

Then I tried with the -march=pentium flag. 19 seconds.  Over 15%
improvement. Yay!
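
For what it's worth, the invocations at this point looked roughly like
the below.  The file names are made up, and the flag in the middle line
is just one example of an optimisation -O3 doesn't turn on:

  gcc -O3 -o edge edge.c                  # ~23 secs
  gcc -O3 -funroll-loops -o edge edge.c   # ~22.5 secs
  gcc -O3 -march=pentium -o edge edge.c   # ~19 secs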

This got me wondering: my gcc was over two years old.  It took about 6
hours, but I grabbed the latest stable gcc and built it from source.
This gave me access to the -march=pentium-mmx flag (my original
version knew nothing about MMX).
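
The build itself was the usual configure/make dance - roughly the
below, though the version number and prefix here are illustrative
rather than from my notes:

  tar xjf gcc-3.4.3.tar.bz2
  mkdir objdir && cd objdir
  ../gcc-3.4.3/configure --prefix=/opt/gcc
  make bootstrap
  make install

The three-stage "make bootstrap" is what eats the hours.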

It turns out that with the latest stable version of gcc compiled on
the target platform, it didn't much matter what flags I used.  Just
using -O3 on its own got me to about 18.5 seconds.  Adding other
flags helped a little, but on the whole, out of the box it was about
as fast as it would go.  It appears that building the compiler on the
target means it already knows which target it is targeting.  If that
makes sense...

Lessons:
* get the latest stable version of your compiler
* build the compiler on your target platform
* tell the compiler as much about your hardware as you can (i.e. the
CPU, e.g. -march=pentium-mmx)
* forget the rest

*** gprof **********

I had the source.  So I ran gprof.  I spotted one function taking 65%
of the time.  This particular code was a lot of churning through
arrays.
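
If you've not used gprof before, the dance is: recompile with -pg, run
the program once on your real data, then point gprof at the binary and
the gmon.out file the run leaves behind.  Program and file names below
are made up:

  gcc -O3 -pg -o edge edge.c
  ./edge input.pgm output.pgm    # drops gmon.out in the current dir
  gprof ./edge gmon.out | less

The flat profile at the top shows which functions eat the time.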

So I jumped into the code and found the costly chunk.  I found this an
interesting read:

  http://www.kernelthread.com/publications/faq/optimize.html

In my case, pre-calculating values (sometimes outside the inner for
loops) was possible/easy in a few places.

For example, rather than three instances of

  array[i-1]

I did

  tmp = i - 1;

then three times

  array[tmp]

(where "i" did not change between the three uses).

This is obviously code-specific, and will vary.  In my (OK, contrived)
case, about 10 lines of jigging and my run time was down to 7 seconds.
Almost a 70% saving.
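
To make that concrete, here's a stripped-down sketch of the sort of
transformation I mean.  The function and array names are invented -
this is not the actual coursework code:

  /* before: the row offsets (i-1)*w, i*w and (i+1)*w get
     recomputed for every pixel in the inner loop */
  void smooth_slow(const int *img, int *out, int w, int h)
  {
      int i, j;
      for (i = 1; i < h - 1; i++)
          for (j = 1; j < w - 1; j++)
              out[i*w + j] = img[(i-1)*w + j]
                           + img[i*w + j]
                           + img[(i+1)*w + j];
  }

  /* after: hoist the loop-invariant offsets out of the inner loop */
  void smooth_fast(const int *img, int *out, int w, int h)
  {
      int i, j;
      for (i = 1; i < h - 1; i++) {
          int above = (i - 1) * w;
          int here  = i * w;
          int below = (i + 1) * w;
          for (j = 1; j < w - 1; j++)
              out[here + j] = img[above + j]
                            + img[here + j]
                            + img[below + j];
      }
  }

A compiler can often hoist these itself, but judging by the numbers
mine wasn't doing it here.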

At this point gprof showed another chunk of code eating 90% of the
time.  I looked at the code and couldn't see any obvious improvements.
So I stopped.

Even if I doubled the performance of some other section of the code,
it was dominated by that 90% that I couldn't easily improve.  So it
just wasn't worth the marks at this point.
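
(That's Amdahl's law, more or less: if 90% of the run time is in code
you can't improve, making the other 10% infinitely fast still only
saves 10% overall, and merely doubling its speed saves 5%.)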

And yes, I checked the output of the code to make sure it was still
generating the same/correct results.
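
Nothing fancy there - keep a binary built from the untouched source,
run both on the same input, and compare the outputs.  Binary and file
names made up:

  ./edge_orig input.pgm out_orig.pgm
  ./edge_fast input.pgm out_fast.pgm
  cmp out_orig.pgm out_fast.pgm && echo same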

Lessons:
* gprof is surprisingly easy
* Run it using *your* real data.  Run it on your hardware.
* Find a few costly sections.  Concentrate on those.  Think "return on
investment" at all times.

I still don't really know what the code/algorithm was doing, and I
don't care.  These techniques I could apply to other codebases.

*** conclusion **********

This is subjective.  And relative.

* Set your goals, timescales & costs up front.  
* Optimising for performance often makes the code harder to read,
and/or less flexible.
* Human time is the most expensive.  Will more hardware be cheaper? 
Is it worth paying for the customised compiler?
* If you spend a load of time tweaking for your hardware, what happens
next time you change hardware?

I've been coding professionally for nearly 10 years.  I've had to
optimise stuff a few times, but not often.  Do it late.  Caching
layers can often be most easily added later.  Trying to build fast
optimised code from the start is honourable, but it's going to be
costly.  Is that the best use of your buck?

Something slow and working ain't great, but it's better than fast and borked.

All that said, I was really surprised that I got a ~70% improvement
from what seemed to be a well-written chunk of code.  And in
retrospect, it wasn't that hard - it took about 10 hours (including
writing it up).  Whether my tricks would work on a modern processor is
another matter.  If your code is mostly doing disk I/O, then a few
simple pre-calculated integer operations ain't going to do diddly.
Still.