[Gllug] Programming for performance on Linux

Shlomi Fish shlomif at gmail.com
Fri May 14 20:35:03 UTC 2010


Hi,

On Tue, May 11, 2010 at 2:58 PM, James Courtier-Dutton
<james.dutton at gmail.com> wrote:
> Hi,
>
> I have an application where I receive 100-byte packets, one after the
> other. There can be 4 different types of packet, and for each type a
> different number crunching algorithm needs to be run.
> The number crunching algorithm only runs once 128 packets of the
> same type have arrived, processing them in a single batch; it then
> waits for the next 128 packets of the same type, processes that
> batch, and so on. But the packets of different types can arrive
> randomly in any order.
> The result is then written to disc in the same order the packets were received.
> I am trying to work out how best to utilize a quad-core x86 CPU for
> this processing task, so that as many packets as possible can be
> processed per second.
> I am not entirely sure how best to use the CPU caches, because 128 x
> 100 bytes can fit nicely in the Level 1 (L1) cache, so in theory my
> number crunching could work in the CPU using just the L1 cache, and
> then write out to disc.
> So, I am assuming that I will need to memcpy all 128 packets of the
> same type to a single memory location so they all sit next to each
> other in the L1 cache, run the algorithm on them, and then memcpy
> them back to where they came from. The memcpy is relatively
> inexpensive compared to the number crunching done in the algorithm.
> I am also unclear as to whether I can use the L1 cache like this
> with 4 threads, or whether I would have to use 4 processes instead to
> keep the data private in the local L1 cache.

Note that the L1 cache is not global to the entire machine: on typical
x86 CPUs each core has its own private L1 (and often L2) cache, while
the last-level cache is shared between the cores. So the
threads-versus-processes question doesn't affect L1 privacy; what
matters is which core each worker ends up running on, and you can pin
each thread to a core of its own.
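
For example, here's a minimal sketch (my own illustration, untested
against your setup) of pinning one worker thread per core with the
Linux-specific pthread_setaffinity_np(3), so that each worker's batch
buffer stays warm in its core's private L1 cache; worker() is just a
placeholder for the real per-type processing:

/* Compile with: gcc -O2 -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_WORKERS 4

static void *worker(void *arg)
{
    long core = (long)arg;
    /* ... receive packets, crunch batches (placeholder) ... */
    printf("worker pinned to core %ld\n", core);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];

    for (long i = 0; i < NUM_WORKERS; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);                  /* one core per worker */
        pthread_create(&tid[i], NULL, worker, (void *)i);
        /* bind the freshly created thread to core i */
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(tid[i], NULL);
    }
    return 0;
}

(You can also set the affinity before the thread starts, using
pthread_attr_setaffinity_np(), if you prefer.)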

> Can anyone think of a better or alternative approach?

This seems like a good approach: keeping data contiguous and accessing
it sequentially is generally a winning strategy. Of course, as others
have noted about premature optimisation, first get the program to work
correctly; only then check whether it is fast enough, and if it is
not, optimise it after profiling it and finding the bottlenecks (or
possibly by tweaking various parts of the algorithm and measuring
whether that yields an improvement). Given modern hardware, and if you
write the program in a fast language such as C or C++ with a modern
compiler (recent versions of gcc should be fine), your program will
likely be fast enough without your needing to tweak it to death.
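
To make the batching idea concrete, here's a rough sketch using the
figures from your post (4 types, 128 packets of 100 bytes per batch;
128 x 100 = 12,800 bytes, which fits comfortably in a typical 32 KB L1
data cache). crunch_batch() is a hypothetical stand-in for your real
algorithms:

#include <stddef.h>
#include <string.h>

#define PACKET_SIZE 100
#define BATCH_LEN   128
#define NUM_TYPES   4

struct batch {
    size_t        count;
    unsigned char data[BATCH_LEN * PACKET_SIZE]; /* contiguous batch */
};

static struct batch batches[NUM_TYPES];

/* hypothetical stand-in for the real per-type number crunching */
static void crunch_batch(int type, unsigned char *data, size_t len)
{
    (void)type; (void)data; (void)len;
    /* ... number-crunch one contiguous 12,800-byte batch, then    */
    /* write the results to disc in arrival order, e.g. by tagging */
    /* each packet with a sequence number on receipt ...           */
}

/* Copy each arriving packet next to its siblings of the same type;
 * once 128 of them have accumulated, process them as one batch. */
void handle_packet(int type, const unsigned char *pkt)
{
    struct batch *b = &batches[type];

    memcpy(b->data + b->count * PACKET_SIZE, pkt, PACKET_SIZE);
    if (++b->count == BATCH_LEN) {
        crunch_batch(type, b->data, sizeof(b->data));
        b->count = 0;
    }
}

And when you do reach the profiling stage, gprof (compile with "gcc
-pg") or "perf record" followed by "perf report" are convenient
starting points on Linux.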

Here are some documents that you may find of interest:

1. The first one is "What every programmer should know about memory":

http://lwn.net/Articles/250967/

2. The second one is, admittedly, mostly my own - a CC-by document
titled "Optimizing Code for Speed". The most recent version is
available on the English wikibooks:

http://en.wikibooks.org/wiki/Optimizing_Code_for_Speed

The first revision I was happy with at the time was placed on my
homepage and publicised on OSNews.com and other sites:

http://www.shlomifish.org/philosophy/computers/optimizing-code-for-speed/

See the bottom of the page for links to various comments.

3. The third resource is the book "Programming Pearls" by Jon
Bentley. I've read and enjoyed the second edition, which is the more
up-to-date one. Bentley discusses optimisation extensively there, and
also mentions an out-of-print book of his with some optimisation
patterns, which he lists in an appendix:

http://www.amazon.com/Programming-Pearls-2nd-Jon-Bentley/dp/0201657880

--------------

Hope that helps.

Regards,

-- Shlomi Fish

>
> Kind Regards
>
> James



-- 
------------------------------------------
Shlomi Fish http://www.shlomifish.org/

Electrical Engineering studies. In the Technion. Been there. Done
that. Forgot a lot. Remember too much.
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



