[Gllug] Patterns of locality of reference in xterm

James Courtier-Dutton james.dutton at gmail.com
Mon Jun 20 11:35:12 UTC 2011


On 20 June 2011 11:19, Nix <nix at esperi.org.uk> wrote:
> On 20 Jun 2011, James Courtier-Dutton uttered the following:
>
>> On 20 June 2011 00:40, Nix <nix at esperi.org.uk> wrote:
>>> On 19 Jun 2011, Adrian McMenamin verbalised:
>>>
>>>> A few people expressed interest in this (honestly, they did).
>>>>
>>>> My first, rather crude, results can be seen here:
>>>> http://cartesianproduct.wordpress.com/2011/06/19/patterns-of-locality-in-xterm/
>>>
>>> Doing it for GCC might *very* well be useful, because locality of
>>> reference (particularly within single compilation passes) translates
>>> directly into speed: GCC is distinctly cache-unfriendly and fixing that
>>> is a long, long slog.
>>
>> Why not use a cache-oblivious algorithm.
>
> It does, sort of: structures are arranged to maximize cacheline reuse,
> and that sort of thing.
>
>> I used one for some crypto implementation recently and it sped it up
>> by about 20x.
>
> The problem is that a crypto implementation is one very heavily studied
> algorithm. A compiler has thousands of algorithms in it, most of which
> have been studied by one or perhaps two people (though some parts such
> as SSA representation have been heavily studied, plus we are privileged
> to have an original designer of SSA working on GCC :) ). Plus there are
> a large number of shared datastructures, and great quantities of churn
> in the shape of new structures appearing and vanishing all the time.
>
>> But for my optimization I only had to arrange the data to be cache
>> oblivious. The code itself was so small it would all fit in the Level
>> 1 cache of any modern CPU.
>
> Would that this were true of any nontrivial compiler. Hell, the only
> interpreters it's true of are some Forth interpreters and Lua.
>
>> I suppose you are wishing the code to be cache oblivious with the code
>> size being larger than the cache?
>
> Well, yes, but it's more data size. I don't know of anyone who's done it
> recently, but a few years ago someone did a profile and found that GCC
> was taking one cache miss per twenty clock cycles. That's pretty bad.
> (Some effort has gone into fixing this -- this was *way* back when
> Apple was still working on GCC, before it went GPLv3.)
>

For CPU performance, most people just throw hardware at the problem:
if one server is not quick enough, they use a cluster instead, or use
the cloud to have as many CPUs as they need, when they need them.
People now work on implementing algorithms so that they can run on
multiple CPUs/threads at the same time and scale well as the number of
CPUs increases, rather than implementing algorithms that work well on
one CPU.
So, as people move to optimize for a multi-CPU environment, not so
much effort is being spent on single-CPU optimization.

My crypto algorithm was initially designed to run on a quad-core CPU
and sized to make sure it kept up with the network bandwidth of the
design. At first the bottleneck was the CPU; I only needed to optimize
it until the bottleneck became the network bandwidth.
Even I was surprised at how much quicker my optimizations made it: the
result was only about 30% utilization on a single core of the
quad-core chip, when all I had needed was to get total CPU usage below
100% of all four cores.

Regarding code-locality optimizations, one has to somehow weigh the
cost of an extra JUMP instruction against the cost of a cache miss.
That is, to get the code that is executed in a loop packed together in
memory, one might need to add a JUMP between the loops so that the
code used by the loop fits into the local area.
I don't think there is anything in GCC that even attempts this apart
from the "inline" option.
--
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



