[Gllug] Disk wait processes, load averages, and {send,fetch,proc}mail

Thu Sep 26 13:53:33 UTC 2002

I have a problem. Our main file and print server has a number of processes
hung in a disk wait (uninterruptible sleep) state:

root     22716  0.0  0.0     0    0 ?        DW   Sep17   0:00 [lockd]
root     23069  0.0  0.0  1968  720 ?        D    Sep17   0:00 mount accuhost01:
root     23499  0.0  0.0  1972  724 ?        D    Sep17   0:00 mount /accucard/d
root     24426  0.0  0.0  1972  724 ?        D    Sep17   0:00 mount /accucard/d
root     25402  0.0  0.0     0    0 ?        DW   Sep17   0:00 [lockd]
root     25433  0.0  0.0  1972  724 ?        D    Sep17   0:00 mount /accucard/d
root     25647  0.0  0.0     0    0 ?        DW   Sep17   0:00 [lockd]
root     26055  0.0  0.0  1972  724 ?        D    Sep17   0:00 mount /accucard/d
root     26612  0.0  0.0  1968  720 ?        D    Sep17   0:00 mount accuhost01:
root     32215  0.0  0.0     0    0 ?        DW   Sep17   0:00 [lockd]
root     13787  0.0  0.0  1972  724 ?        D    Sep18   0:00 mount /accucard/d
root     18714  0.0  0.0     0    0 ?        DW   Sep25   0:00 [lockd]
root     18911  0.0  0.0     0    0 ?        DW   Sep25   0:00 [lockd]
root     19022  0.0  0.0  1972  908 ?        D    Sep25   0:00 mount -t nfs -a
root     19282  0.0  0.0     0    0 ?        DW   Sep25   0:00 [lockd]

No, I don't know why they're hanging. But once they're in that state,
I know of nothing short of a reboot that can clear them. Because it's
our main file and print server, rebooting is a politically non-viable
solution at the moment.

So until we get a suitably quiet time, we're stuck with them.
Processes in disk wait state aren't in themselves a problem.
However, because Linux counts them alongside the runnable processes
when it computes the load average, they inflate it. So even though the
actual load on the box is minimal, the load average is hovering around
the 15.2 mark.
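
As a rough cross-check, the listing above has 15 entries, which lines up
with a load average of about 15. A minimal Python sketch of that count
follows, assuming "ps aux" style output where STAT is the eighth column.
The function name is made up for illustration, and the sendmail line in
the sample is hypothetical, added only to show a non-D process being
skipped.

```python
def count_disk_wait(ps_lines):
    """Count processes whose STAT column starts with 'D' (uninterruptible sleep)."""
    count = 0
    for line in ps_lines:
        fields = line.split()
        # Assumed "ps aux" columns:
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        if len(fields) >= 8 and fields[7].startswith("D"):
            count += 1
    return count


sample = [
    "USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND",
    "root     22716  0.0  0.0     0    0 ?        DW   Sep17   0:00 [lockd]",
    "root     23069  0.0  0.0  1968  720 ?        D    Sep17   0:00 mount accuhost01:",
    # Hypothetical non-D line, not taken from the real listing:
    "root      1247  0.0  0.1  4000 1500 ?        S    Sep17   0:00 sendmail",
]

print(count_disk_wait(sample))  # prints 2
```

Feeding it the real output of "ps aux" should give a number close to the
idle load average if the stuck processes are the only contributors.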

Unfortunately, this adversely affects sendmail[1], which stops
accepting connections when the load average reaches a certain
threshold:

 1247 ?        S      0:00 sendmail: rejecting connections on daemon MTA: load average: 15

Now I've tried to configure this with the RefuseLA option. But for some
reason, it isn't working. I've also changed the QueueLA and QueueFactor
parameters, and those definitely *are* now working (verified with the
-d3.30 debugging option).
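
One thing that can at least be checked without a debugging flag is what
the configuration file itself says. The sketch below is only that: it
assumes the usual sendmail.cf convention that active options are lines
beginning with "O " and that lines beginning with "#" are comments, and
it reports the value in the file, not the value the running daemon is
using. The function name and the sample values are invented for
illustration.

```python
import re


def configured_options(cf_text, names=("RefuseLA", "QueueLA", "QueueFactor")):
    """Return {option: value} for active (uncommented) 'O Name=value' lines."""
    found = {}
    for line in cf_text.splitlines():
        # Commented-out defaults such as "#O RefuseLA=12" do not match,
        # because the pattern is anchored on a leading "O".
        m = re.match(r"^O\s+(\w+)\s*=\s*(\S+)", line)
        if m and m.group(1) in names:
            found[m.group(1)] = m.group(2)
    return found


# Hypothetical fragment, not copied from the real server's sendmail.cf.
sample_cf = """\
# load average at which we just queue messages
O QueueLA=20
# load average at which we refuse connections
#O RefuseLA=12
"""

print(configured_options(sample_cf))  # prints {'QueueLA': '20'}
```

In this made-up sample the RefuseLA line is still commented out, so only
QueueLA shows up as active.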

Furthermore, because sendmail isn't accepting connections, I get:

   fetchmail: SMTP connect to localhost failed
   fetchmail: can't raise the listener; falling back to /usr/bin/procmail -d %T

This works fine, with the caveat that each message is being converted to
have DOS-style CR/LF line endings, rather than just the traditional Unix LF.
This is causing problems for mh, which doesn't play well with the extra
characters.
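
As a stopgap, the extra characters can be stripped after delivery. This
is a minimal Python sketch of that conversion, a workaround rather than
an explanation of where the extra characters come from. The function
name and the sample message are made up for illustration.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Convert DOS-style CR/LF line endings to plain Unix LF."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical message used only to demonstrate the conversion.
msg = b"Subject: test\r\n\r\nbody line\r\n"

print(crlf_to_lf(msg))  # prints b'Subject: test\n\nbody line\n'
```

It works on bytes rather than text so that it doesn't have to guess at
the character set of the message.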

So I guess I have several questions:

1. Should disk wait processes contributing to the load average be
   considered a bug?
2. Is there anything I can do about them short of a reboot?
3. Is there any debugging option to sendmail that will show the
   current RefuseLA threshold value?
4. Is there any way I can get sendmail to accept connections when
   the load average is greater than 12 (either via RefuseLA or some
   other method)?
5. Does anyone know why fetchmail/procmail is adding extra CR characters,
   and what I can do to change it?

Thanks,

Tet

[1] Despite what the masses claim, the more I play with the newer
    versions of sendmail, the more I like it -- you'd have to
    have a *very* convincing argument to convince me to switch
    to exim/qmail/postfix/whatever.

-- 
Gllug mailing list  -  Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug