[Gllug] socket buffer overrun

Peter Grandi pg_gllug at gllug.for.sabi.co.UK
Wed Oct 19 14:23:41 UTC 2005


>>> On Wed, 19 Oct 2005 12:30:59 +0100, Ben Fitzgerald
>>> <ben_m_f at yahoo.co.uk> said:

ben_m_f> Hi, I'm looking into a problem where data transfer
ben_m_f> between two servers is slow.

>> And how long is that piece of string I have over there? :-)

ben_m_f> hi peter. thanks for responding. I'll do my best to
ben_m_f> supply the required information.

Except what do you mean by «is slow»? :-) And not just
quantitatively, but also whether it is a throughput or latency
issue (though your mention of insufficient buffering as a
possibility hinted it was more like throughput).

This matters because reaching 100% channel utilization as to
throughput on Gigabit Ethernet is somewhat unlikely: as the
benchmarks I gave some links for report, 600-700Mb/s is OK and
if one is very lucky it can go up to 800Mb/s; and latency on
Gigabit Ethernet can have several issues of its own.

Also, if one is expecting something like 70-80MiB/s and one gets
only 40MiB/s, some causes are rather more likely than if one is
getting 4MiB/s.
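
If in doubt about what the hardware and stack can actually do,
it may be worth measuring raw TCP throughput independently of
the application, e.g. with 'iperf' or 'netperf' (assuming one of
them is installed on both boxes; the hostname below is a
placeholder):

------------------------------------------------------------------------
# on the receiving server
iperf -s

# on the sending server: a 30 second test against the receiver
iperf -c receiver.example.com -t 30
------------------------------------------------------------------------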

>> But... How high is CPU load?

ben_m_f> less than 10% when the app is running.

But how many Mb/s or MiB/s does that 10% relate to?

>> Have you got a PCI-64 card and slot?
ben_m_f> yes + yes.

That removes a large number of possible causes.

>> What makes you think that your servers can process several
>> dozen MiB/s of TCP traffic?

ben_m_f> It's fairly hefty: [ ... ]

Ok, that sounds good. Has it really got 4 CPUs or has it got HT?
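
A rough way to check from '/proc' (field names vary a bit
between kernel versions, so treat this as a sketch):

------------------------------------------------------------------------
# logical processors the kernel sees
grep -c '^processor' /proc/cpuinfo

# on HyperThreading boxes the same 'physical id' appears more
# than once, and 'ht' shows up in the 'flags' line
grep -E '^(physical id|flags)' /proc/cpuinfo
------------------------------------------------------------------------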

[ ... ]

ben_m_f> TCP making around 300 sockets to a data feed.

In parallel or in series? Using a well-known protocol or one
created for the job? Part of the question here is: is it a
full-duplex or half-duplex protocol?
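
A crude way to tell whether those 300 sockets are open at the
same time (parallel) rather than one after another is to count
them while the application is running; the port number below is
just a placeholder for the feed's:

------------------------------------------------------------------------
# summary of TCP connection states
netstat -tan | awk 'NR > 2 { print $6 }' | sort | uniq -c

# or only the connections to the data feed port, say 5000
netstat -tan | grep ':5000 ' | wc -l
------------------------------------------------------------------------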

>> What's the latency between the two servers?

ben_m_f> 10 packets transmitted, 10 received, 0% packet loss, time 9089ms
ben_m_f> rtt min/avg/max/mdev = 1.875/2.031/2.783/0.278 ms, pipe 2

Good idea to show the 'ping' numbers... That's interesting,
because they seem a bit high from here; my home net is a
tuppenny 100Mb/s one with far less powerful systems than yours,
and I get:

------------------------------------------------------------------------
10 packets transmitted, 10 received, 0% packet loss, time 9004ms
rtt min/avg/max/mdev = 0.280/0.327/0.584/0.089 ms
------------------------------------------------------------------------

Something like a 2 millisecond roundtrip on a quad GHz Xeon over
Gigabit Ethernet seems a bit high, but I don't really know (I
don't have them here; are you willing to send them over for a
bit of benchmarking with Doom3? :->).

[ ... ]

>> However thanks for your inner confidence that people willing
>> to help you are psychic. :-)

ben_m_f> In my defense I did say: I am a tcp tuning novice so
ben_m_f> apologies if this makes little sense!  :)

I did take that into account -- the problem I see here is not
newbieness at TCP tuning, but at asking well-documented
questions.

All too many people who start with «is slow» eventually play a
game of ''guess it'' (''hot, hot'' or ''cold, cold'') with those
trying to help them... It can be fun, but not productive.

ben_m_f> Total RX packets: 109476765 -> 0.00039%. Yes this is
ben_m_f> small but it leaps up at least 1 per second when the
ben_m_f> application runs and otherwise stays static.

The absolute percentage is really insignificant. That it leaps
up once a second is probably just due to there being a sudden
surge of traffic.

But that the application is only active once per second may be
a rather more significant detail. If it runs once per second,
does it transmit in bursts? If so, how long is each burst, in
bytes and in time? Is it sent over 300 parallel or serial
connections, as mentioned above? Does the higher-level protocol
require acknowledgements?
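
If the answers are not obvious from the application itself, a
packet trace taken while it fires will show the burst length and
any application-level acknowledgements; the interface and port
below are placeholders:

------------------------------------------------------------------------
# '-ttt' prints the delay since the previous packet, which makes
# the shape and length of each burst easy to see
tcpdump -n -ttt -i eth0 tcp port 5000
------------------------------------------------------------------------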

Also: tuning for huge short bursts is a bit against the grain
for TCP, which is usually benchmarked for sustained throughput.

>> Raising the 'tcp_rmem' should help, as 1GHz can theoretically
>> do more than 100MiB/s (but see the figures in the links
>> below), and 0.17MiB of buffering is equivalent to 1.7ms of
>> buffering which is not a lot.

ben_m_f> Yes, after more investigation I agree. Increasing this
ben_m_f> has helped.

Also, consider that the 'ping' above shows a roundtrip of 2ms
while this was only 1.7ms of buffering; that combination was
rather likely not optimal.
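
The usual rule of thumb is that the socket buffer should cover
at least the bandwidth*delay product of the path, otherwise the
sender stalls waiting for ACKs. A back-of-the-envelope check for
the numbers above (illustrative arithmetic only):

------------------------------------------------------------------------
# 1Gb/s is about 125MB/s; with a 2ms RTT the pipe holds roughly
# 125MB/s * 0.002s = 250KB, so a 0.17MiB (~178KB) buffer cannot
# keep it full
awk 'BEGIN { print 125000000 * 0.002, "bytes in flight" }'
------------------------------------------------------------------------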

But then perhaps that 2ms is due to as-yet unrevealed details;
just conceivably the two ends of a connection are not on the
same switch, and there may be some Fast Ethernet bits in the
path, or routing, or whatever. Just guessing wildly here.

ben_m_f> The client was opening 300+ sockets and not closing the
ben_m_f> connection cleanly, leaving it in a CLOSE_WAIT with
ben_m_f> bytes in the Recv-Q. The total number of bytes in the
ben_m_f> Recv-Q was higher than rmem_max which I believe causes
ben_m_f> TCP to throttle the connection leading to a slowdown.
ben_m_f> Because CLOSE_WAIT takes a while to timeout it takes a
ben_m_f> good while to clear.

This is a bit ugly, but yes, there are a lot of anomalies
concerning attempts by the TCP/IP stack to automagically detect
network conditions, and sometimes it detects the wrong thing.
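
For the record, sockets stuck like that are easy to spot, since
Recv-Q is the second column of the 'netstat' output (a sketch,
IPv4 TCP only):

------------------------------------------------------------------------
# TCP sockets in CLOSE_WAIT with unread data still in Recv-Q
netstat -tan | awk '$6 == "CLOSE_WAIT" && $2 > 0'
------------------------------------------------------------------------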

Some of them are readily apparent in the pages linked to in my
previous article.

The second one, "Linux TCP Tuning", was probably quite on the
spot, for example the suggestion to set the max for '[rw]mem'
high. I think that 16MiB might be a bit excessive, even if your
system probably has memory to spare; then again, sometimes
excessive buffering does cause trouble too.
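
For reference, the knobs in question look something like this on
a 2.4/2.6 kernel (the 16MiB ceiling is the paper's figure, not a
recommendation; the triples are min/default/max, and the middle
values here are just common defaults):

------------------------------------------------------------------------
# ceiling for what applications may request via SO_RCVBUF/SO_SNDBUF
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# min / default / max buffer sizes used by TCP itself
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
------------------------------------------------------------------------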

It also has some interesting 2.4-specific notes, including
mention of some anomalies. But then RHAS 2.4 kernels are strange
beasts, and perhaps some of the notes for 2.6 apply to them too.

The third one, "How to achieve Gigabit speeds with Linux", has
some more specific and equally interesting notes, e.g. about
selective acknowledgements, which are indeed good for
long-distance lossy connections but may have some anomalies
otherwise.
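
Whether SACK is implicated can be tested per host without a
rebuild, e.g. (a diagnostic sketch, not a suggestion to leave it
off):

------------------------------------------------------------------------
# check, then disable and re-enable selective acknowledgements
sysctl net.ipv4.tcp_sack
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_sack=1
------------------------------------------------------------------------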

One paper is from LBL, the other from CERN, and I would guess
that they have plenty of Gigabit Ethernet networks with apps
doing huge transfers, sustained or not...


  I almost cried with envy at some of the problems these big
  institutions have:

   «Users can manually set this queue size using the ifconfig
    command on the required device. Eg.
      /sbin/ifconfig eth2 txqueuelen 2000  
    The default of 100 is inadequate for long distance, high
    throughput pipes. For example, on a network with a rtt of
    120ms and at Gig rates, a txqueuelen of at least 10000 is
    recommended.»

  I wish I had the issue of having to set 'txqueuelen' to 10,000
  (Gig rate satellite/transoceanic connections :->).
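
  For completeness, the current value can be read off without
  changing anything (the interface name is a placeholder):

------------------------------------------------------------------------
# 'txqueuelen' appears in the per-device summary
/sbin/ifconfig eth2 | grep -i txqueuelen
------------------------------------------------------------------------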

-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



