[Gllug] Programming for performance on Linux

Jasper Wallace jasper at pointless.net
Mon May 17 02:58:45 UTC 2010


On Tue, 11 May 2010, James Courtier-Dutton wrote:

> Hi,
> 
> I have an application where I receive 100 byte packets, one after the other.
> There can be 4 different types of packets, for each type a different
> number crunching algorithm needs to be run.
> The number crunching algorithm only works after 128 packets of the
> same type have arrived and processed in a single batch, then wait for
> the next 128 packets of the same type and process that batch etc.  But
> the packets of different types can arrive randomly in any order.
> The result is then written to disc in the same order the packets were received.
> I am trying to work out how best to utilize a Quad core x86 CPU for
> this processing task so that the most amount of packets can be
> processed per second.
> I am not entirely sure how best to use the CPU caches in the best way
> possible because 128 x 100 bytes can fit nicely in the Layer 1 cache,
> so in theory my number crunching could work in the CPU just using the
> Layer 1 cache, and then write out to disc.
> So, I am assuming that I will need to memcpy all 128 packets of the
> same type to a memory location so they can all sit next to each other
> in the Layer 1 cache, run the algorithm on them, and then memcpy them
> back where they came from. The memcpy is relatively inexpensive in
> relation to the number crunching done in the algorithm.
> I am also unclear as to whether I can use the Layer 1 cache like this
> with 4 threads, or would I have to have 4 processes instead to keep
> the data private in the local Layer 1 cache.
> Can anyone think of a better or alternative approach?

You don't memcpy stuff into caches - the CPU handles caching for you in 
hardware. You can influence what happens and how the CPU treats data 
in the caches with some instructions that are usually exposed as 
'intrinsics' by whatever compiler you're using. This page has a little 
info; I can't find a better one (I haven't looked very hard, there 
should be a good guide somewhere):

http://www.tommesani.com/SSECacheabilityControl.html
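For example, the non-temporal store intrinsics let you write data in a way 
that bypasses the cache, which can help when you won't read the result again 
soon. A minimal sketch using SSE2's _mm_stream_si128 - the function names and 
buffer sizes here are mine, purely for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Copy n bytes (n a multiple of 16, both pointers 16-byte aligned)
 * using non-temporal stores, so the writes bypass the cache. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_load_si128(s + i);  /* aligned load        */
        _mm_stream_si128(d + i, v);         /* non-temporal store  */
    }
    _mm_sfence();  /* make the streamed stores globally visible */
}

/* Tiny self-check: returns 1 if a 64-byte streamed copy round-trips. */
static int nt_copy_demo(void)
{
    _Alignas(16) unsigned char src[64], dst[64];
    for (int i = 0; i < 64; i++) src[i] = (unsigned char)i;
    copy_nontemporal(dst, src, sizeof src);
    return memcmp(src, dst, sizeof src) == 0;
}
```

Whether this actually helps depends on your access pattern - measure before 
committing to it.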

Will you be receiving these packets from one other machine or many? If 
many, you'll probably want to use libevent:

http://en.wikipedia.org/wiki/Libevent

Also worth reading about the C10K problem:

http://www.kegel.com/c10k.html

Are you more interested in throughput or latency? I.e. do you want to 
process as many 100x128 chunks per second as possible, or do you want to 
process each 100x128 chunk as quickly as possible?

Are the packets coming over TCP (with each 100x128 chunk being a single 
TCP connection, or multiple ones?), or are you using UDP?

I suspect thinking about cache control is rather overthinking your 
problem: unless you're trying to process a 10Gbit network connection 
you're going to be IO bound rather than CPU bound. I also suspect that 
writing the result to disk will be a limit as well.

As for how to structure your program:

If you're more interested in throughput:

At program startup pre-allocate the memory you're going to use to store 
the packets; don't malloc()/free() as you receive and process the 
packets, it's slow.

Make sure you have many 100 byte buffers (determine the best number to 
have by testing) and put them on a free queue.

Also allocate many arrays of 128 pointers to packets, and put them on a 
free queue as well.

Allocate 4 queues for the processing threads; these queues hold pointers
to the arrays of 128 pointers to packets.
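A sketch of what that pre-allocation might look like in C. All the sizes 
and names here are made up for illustration, and the free queues are shown 
single-threaded - when shared between threads they'd need a mutex or a 
lock-free stack:

```c
#include <stddef.h>

#define PKT_SIZE    100
#define NUM_PKTS    4096            /* tune by testing */
#define CHUNK_LEN   128
#define NUM_CHUNKS  (NUM_PKTS / CHUNK_LEN)

typedef struct { unsigned char data[PKT_SIZE]; } packet_t;
typedef struct { packet_t *pkts[CHUNK_LEN]; int filled; } chunk_t;

/* All storage allocated once, up front - no malloc()/free() per packet. */
static packet_t packet_pool[NUM_PKTS];
static chunk_t  chunk_pool[NUM_CHUNKS];

/* Simple LIFO free queues (single-threaded here; guard with a mutex
 * when the receive and processing threads share them). */
static packet_t *free_packets[NUM_PKTS];
static int n_free_packets;
static chunk_t *free_chunks[NUM_CHUNKS];
static int n_free_chunks;

static void pools_init(void)
{
    for (int i = 0; i < NUM_PKTS; i++)
        free_packets[n_free_packets++] = &packet_pool[i];
    for (int i = 0; i < NUM_CHUNKS; i++)
        free_chunks[n_free_chunks++] = &chunk_pool[i];
}

static packet_t *packet_alloc(void)
{
    return n_free_packets ? free_packets[--n_free_packets] : NULL;
}

static void packet_free(packet_t *p)
{
    free_packets[n_free_packets++] = p;
}
```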

Have a single receive thread. It looks at each packet, and if it's part 
of a new 128 packet chunk it takes an array of 128 pointers to packets 
off the free queue and sets the first pointer of that array to the 
packet you've just received.

If the packet is part of an existing but unfinished array, add the packet
to that array. If you're going to have many arrays of packets on the go at
any one time, find a good algorithm for working out which array a packet
should go into (a hash or tree or something).

If the packet is the last one in a 128 packet chunk, add the array of
pointers to packets to the processing queue for the right processing thread,
and then wake the thread up. You might want to not wake the thread up on
every chunk, but instead wake it when the queue has N packet chunks on it -
that way the processing thread will be able to process more than one chunk
at a time and things will be a bit more efficient, but at the cost of
latency.

Then grab some packet buffer pointers from the free queue and set them up
as receive buffers, and go back to waiting for more packets.
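The receive thread's chunk-assembly logic might look roughly like this. 
It's a simplified single-threaded sketch: with only 4 packet types a plain 
array indexed by type stands in for the hash/tree, and a counter stands in 
for the real enqueue-and-wake step:

```c
#include <stddef.h>

#define CHUNK_LEN 128
#define NUM_TYPES 4

/* One partially-filled chunk per packet type.  A hash or tree is only
 * needed if many chunks per type can be in flight at once. */
typedef struct {
    const unsigned char *pkts[CHUNK_LEN];
    int filled;
} chunk_t;

static chunk_t in_progress[NUM_TYPES];
static int chunks_dispatched[NUM_TYPES]; /* stand-in for per-type queues */

/* Called by the receive thread for every packet.  When a chunk reaches
 * 128 packets it would be pushed onto the processing queue for that
 * type and the worker woken; here we just count dispatches. */
static void on_packet(int type, const unsigned char *pkt)
{
    chunk_t *c = &in_progress[type];
    c->pkts[c->filled++] = pkt;
    if (c->filled == CHUNK_LEN) {
        chunks_dispatched[type]++;  /* enqueue + wake worker in real code */
        c->filled = 0;              /* start a fresh chunk from the free queue */
    }
}
```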

Then you have 4 processing threads, one for each of the number crunching
algorithms. They start out sleeping, and when woken they check their input
queues, take the arrays of 128 pointers to packets off the queue, run their
algorithms on the arrays, then write the result to disk (or add the results
to another queue and have another thread that takes multiple results and
writes them as a batch - this gets more complicated if you want the
results serialised).

If you still want to play about with the cache you might want to do data
prefetching before the number crunching stuff:

http://software.intel.com/en-us/articles/optimize-prefetch-on-32-bit-intel-architecture/
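A rough illustration of software prefetching with _mm_prefetch - the 
prefetch distance (4 cache lines of 64 bytes here) is a guess and would 
need tuning by measurement on your actual workload:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch */

/* Sum an array while prefetching a few cache lines ahead, so the data
 * is (hopefully) already in L1 by the time the loop reaches it. */
static long sum_with_prefetch(const unsigned char *buf, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 256 < n)  /* 4 x 64-byte lines ahead - tune this */
            _mm_prefetch((const char *)&buf[i + 256], _MM_HINT_T0);
        total += buf[i];
    }
    return total;
}
```

Note that modern hardware prefetchers already handle simple sequential 
scans like this well; explicit prefetching tends to pay off only for less 
predictable access patterns.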

After processing each 100x128 chunk, add the pointer to each 100 byte
buffer back to the free buffer queue, then null out the pointers in the
array of 128 pointers to packets and add that array of pointers to packets
to its free queue.

After a processing thread has emptied its queue, the thread sleeps.

You'll need to do some housekeeping - at least checking that no queue gets
full, and trying to handle that gracefully. You'll also need to do something
like keeping a timestamp with each array of 128 pointers to packets for when
it was created in the receive thread, and discarding the packets if that
chunk doesn't get all 128 packets within a certain time; otherwise missing
packets will mean you have partly complete packet chunks lying around that
can't be completed.

For the queues themselves this should be ok:

http://www.drdobbs.com/high-performance-computing/208801974

I hoped the boost libs would have a nice thread-safe queue class, but they
don't seem to have one (yet). You might look at:

http://www.boostcookbook.com/Recipe:/1234841
http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html
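For reference, a minimal thread-safe queue along those lines is just a 
mutex plus a condition variable, with the consumer blocking in 
pthread_cond_wait() when the queue is empty. A sketch (it ignores overflow 
for brevity; a real one would also wait on a "not full" condition):

```c
#include <pthread.h>

#define QUEUE_CAP 64

typedef struct {
    void *items[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue_t;

static void queue_init(queue_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

static void queue_push(queue_t *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    q->items[q->tail] = item;           /* assumes caller avoids overflow */
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->nonempty);  /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

static void *queue_pop(queue_t *q)      /* blocks until an item arrives */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return item;
}
```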

Disclaimer: I've done things like this in Python, but not in C/C++. You
might want to look at Erlang as well.

P.S. By a thread sleeping I'm pretty sure I mean a thread waiting on a
condition variable.

-- 
[http://pointless.net/]                                   [0x2ECA0975]

-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



