[Gllug] Programming for performance on Linux

Tue May 11 11:58:52 UTC 2010

Hi,

I have an application that receives 100-byte packets, one after the other.
There are 4 different types of packet, and each type needs a different
number-crunching algorithm to be run on it.
Each algorithm only works on a batch: once 128 packets of the same type
have arrived, they are processed together, then the algorithm waits for
the next 128 packets of that type, and so on. But packets of different
types can arrive in any order.
The result is then written to disc in the same order the packets were received.
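
To make the batching concrete, here is a minimal sketch of how I picture it,
assuming every packet is stored in an arrival-order buffer and is identified
by its position ("slot") in that buffer. The names note_arrival and
pending_by_type are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 128   /* packets per batch */
#define NUM_TYPES  4     /* packet types */

/* Slots (positions in the arrival-order buffer) waiting for one type. */
struct pending {
    size_t slot[BATCH_SIZE];
    size_t count;
};

static struct pending pending_by_type[NUM_TYPES];

/*
 * Record that the packet stored at arrival position `slot` has type `type`.
 * Returns the list of BATCH_SIZE slots when this packet completes a batch,
 * otherwise NULL. The returned list is only valid until the next arrival
 * of the same type, so the batch must be processed (or copied) before then.
 */
const size_t *note_arrival(unsigned type, size_t slot)
{
    assert(type < NUM_TYPES);
    struct pending *p = &pending_by_type[type];

    p->slot[p->count++] = slot;
    if (p->count < BATCH_SIZE)
        return NULL;

    p->count = 0;        /* the next arrival of this type starts a new batch */
    return p->slot;
}
```

Because the slots index the arrival-order buffer, results written back to
those slots stay in arrival order, and the disc writer only has to wait until
the oldest unwritten slots have been processed.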
I am trying to work out how best to use a quad-core x86 CPU for
this processing task so that as many packets as possible are
processed per second.
I am not entirely sure how to make the best use of the CPU caches.
128 x 100 bytes (12,800 bytes) can fit nicely in the Level 1 cache,
so in theory my number crunching could work in the CPU using just the
Level 1 cache, and then write out to disc.
So I am assuming that I will need to memcpy all 128 packets of the
same type to one contiguous memory area so that they sit next to each
other in the Level 1 cache, run the algorithm on them, and then memcpy
them back to where they came from. The memcpy is inexpensive relative
to the number crunching done by the algorithm.
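
The gather, crunch and copy-back steps could be sketched as below. This is
only an illustration under my own assumptions: process_batch, crunch_fn and
placeholder_crunch are made-up names, and placeholder_crunch merely stands in
for the real algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PACKET_SIZE 100
#define BATCH_SIZE  128

/* Signature assumed for a number-crunching routine (hypothetical). */
typedef void (*crunch_fn)(uint8_t batch[BATCH_SIZE][PACKET_SIZE]);

/* Stand-in for the real algorithm: inverts every byte in place. */
void placeholder_crunch(uint8_t batch[BATCH_SIZE][PACKET_SIZE])
{
    for (size_t i = 0; i < BATCH_SIZE; i++)
        for (size_t j = 0; j < PACKET_SIZE; j++)
            batch[i][j] = (uint8_t)~batch[i][j];
}

/*
 * Gather the BATCH_SIZE packets named by `slot` out of the arrival-order
 * buffer, run `crunch` on the contiguous copy, then scatter the results
 * back to the slots they came from.
 */
void process_batch(uint8_t (*arrivals)[PACKET_SIZE], size_t n_arrivals,
                   const size_t slot[BATCH_SIZE], crunch_fn crunch)
{
    uint8_t scratch[BATCH_SIZE][PACKET_SIZE];   /* 12,800 contiguous bytes */

    for (size_t i = 0; i < BATCH_SIZE; i++) {
        assert(slot[i] < n_arrivals);
        memcpy(scratch[i], arrivals[slot[i]], PACKET_SIZE);
    }

    crunch(scratch);

    for (size_t i = 0; i < BATCH_SIZE; i++)
        memcpy(arrivals[slot[i]], scratch[i], PACKET_SIZE);
}
```

Whether the scratch area really stays in the Level 1 cache for the whole run
depends on what else the algorithm touches, so this would need measuring.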
I am also unclear as to whether I can use the Level 1 cache like this
with 4 threads, or whether I would need 4 processes instead to keep
the data private in the local Level 1 cache.
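
As far as I understand it, the Level 1 cache belongs to a core rather than to
a thread or a process, so what matters with threads is that no two of them
write to the same cache line. Below is a sketch of a per-thread layout that
would keep each worker's data on its own cache lines. It assumes 64-byte
cache lines (common on x86, but an assumption here) and a compiler that
accepts C11's _Alignas; struct worker and worker_for_type are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKET_SIZE 100
#define BATCH_SIZE  128
#define NUM_TYPES   4
#define CACHE_LINE  64   /* assumed cache line size in bytes */

/*
 * Private working set for one worker thread. Aligning the first member to a
 * cache line aligns the whole struct and pads its size to a multiple of
 * CACHE_LINE, so neighbouring workers never share a line.
 */
struct worker {
    _Alignas(CACHE_LINE) uint8_t scratch[BATCH_SIZE][PACKET_SIZE];
    size_t batches_done;
};

static struct worker workers[NUM_TYPES];

/* One worker per packet type; each thread would only touch its own entry. */
struct worker *worker_for_type(unsigned type)
{
    assert(type < NUM_TYPES);
    return &workers[type];
}
```

Each of the 4 threads would then be given the pointer from worker_for_type
for its packet type and would never write to another thread's entry.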
Can anyone think of a better or alternative approach?

Kind Regards

James
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug