[dundee] analysing server failure

David R. Baird dundee at lists.lug.org.uk
Tue Jul 29 10:42:01 2003


Last night, sometime after 21:15, my web server (Red Hat 7.2) 
stopped serving web pages and stopped accepting ssh logins. Lots 
of other things stopped as well: the hourly logcheck emails, and a 
cron job that runs every 5 minutes to check whether the web server 
and other daemons are running, restarting them and emailing me if 
not. Unfortunately I didn't discover this until 9am today! A 
hardware reset brought the machine back up, but I'd like to find 
out what happened. 

In fact, I have a reasonable suspicion that the problem was an 
enormous log file created by Apache's mod_jk module. It was 125MB, 
last modified on Jul 26th. I need to reconfigure the server so 
that it no longer uses that module. 
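
One way to test the full-disk theory would be to look at how much 
space is left and which files are the largest. Below is a minimal 
Python sketch, assuming the logs live under /var/log and that 
Python 3.3 or later is available for shutil.disk_usage. The helper 
names largest_files and free_fraction are hypothetical, made up for 
illustration only.

```python
import os
import shutil


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: report the size of a symlink itself, not its target
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # file vanished or is unreadable; skip it
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


if __name__ == "__main__":
    log_dir = "/var/log"  # assumption: default log location
    if os.path.isdir(log_dir):
        print(f"free space on {log_dir}: {free_fraction(log_dir):.1%}")
        for size, path in largest_files(log_dir):
            print(f"{size / 1_000_000:10.1f} MB  {path}")
```

If the filesystem still has plenty of free space after the reboot, 
a 125MB file on its own may not be the whole story.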

What I'd like to know is whether there is anywhere else I should 
look for useful messages. I've checked all the log files in 
/var/log, but all I can get is an estimate of when the machine 
stopped. There doesn't seem to be anything unusual in the messages, 
secure, cron, maillog, or httpd/error_log files. 
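
To narrow down the time of the failure, the last timestamp written 
to each syslog-style file can be compared. This is a rough Python 
sketch under some assumptions: lines begin with the usual 
"Mon DD HH:MM:SS" prefix, the locale uses English month names, and 
the year has to be supplied because syslog lines do not carry one. 
The function names are hypothetical, and httpd/error_log is left 
out because it uses a different timestamp layout.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if the line has none."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def last_timestamp(lines, year):
    """Return the latest parsable timestamp among lines, or None if there is none."""
    latest = None
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is not None and (latest is None or stamp > latest):
            latest = stamp
    return latest


if __name__ == "__main__":
    year = 2003  # assumption: all entries of interest fall in this year
    for name in ("messages", "secure", "cron", "maillog"):
        path = "/var/log/" + name
        try:
            with open(path, errors="replace") as fh:
                print(path, last_timestamp(fh, year))
        except OSError as exc:
            print(path, "unreadable:", exc)
```

Since the reboot adds fresh entries, this is most useful on a copy 
of the logs trimmed to the lines written before the reset.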

Any suggestions?

d.

-- 
Dr. David R. Baird
ZeroFive Web Design
dave@zerofive.co.uk
+44 [0]1738 447780
http://www.zerofive.co.uk