[dundee] analysing server failure

Andrew Clayton dundee at lists.lug.org.uk
Tue Jul 29 18:41:00 2003


On Tue, 2003-07-29 at 10:41, David R. Baird wrote:
> Last night, sometime after 21:15, my web server (Redhat 7.2) 
> stopped serving web pages and stopped allowing ssh logins. Lots 
> of other things stopped as well - hourly logcheck emails, and a 5 
> minutely cron job that checks if the web server and other daemons 
> are running and restarts them and emails me if not. Unfortunately 
> I didn't discover this until 9am today! A hardware reset brought 
> the thing back up, but I'd like to find out what happened. 
> 

Things like this can happen if the machine has run out of RAM (OOM) or
out of disk space, or if there is some kind of hardware error, like
faulty RAM or the disk subsystem vanishing...


> In fact, I have a reasonable suspicion that the problem was an 
> enormous log file created by mod_jk from the Apache server. It 
> was 125MB, last modified on Jul 26th. I need to re-configure the

That's not really very large... worry as you approach the 2GB mark,
where programs built without large file support can no longer grow the
file.


>  
> server to not use that module. 
> 
> What I'd like to know is if I've missed anywhere to look for 
> useful messages. I've checked all the log files in /var/log, but 
> all I can get is an estimate of when the thing stopped. There 
> don't seem to be any unusual things going on in the messages, 
> secure, cron, maillog, or httpd/error_log files. 
> 

# 2.4 kernels log OOM kills as "Out of Memory: Killed process ..."
grep -i -e oom -e 'out of memory' /var/log/messages
grep -i oops /var/log/messages

Check your disk usage.
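
Something along these lines would do as a first look. It is only a rough
sketch: the disk_report name, the /var/log starting point and the
ten-entry cut-off are my own assumptions, so adjust to taste.

```shell
# Rough disk check; directory and top-10 cut-off are assumptions.
disk_report() {
    dir="${1:-/var/log}"
    # Free space and free inodes on the filesystem holding $dir
    df -h "$dir"
    df -i "$dir"
    # Ten largest entries directly under $dir, sizes in KB, biggest last
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n 10
}

disk_report /var/log
```

A filesystem at 100% in either df listing (blocks or inodes) would
explain daemons falling over; the du listing shows which files to blame.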

Run the sar command to see if it is set up; it can give lots of really
useful information.

If it's installed then we can go from there.
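
As a starting point, something like the following could be tried. It is
a sketch under assumptions: check_sar is just a name I made up, and sa28
assumes the hang was on the evening of the 28th and that sysstat was
already collecting into /var/log/sa (the usual Red Hat location) at the
time.

```shell
# Sketch only: the sa28 file name assumes the hang was on the 28th.
check_sar() {
    if ! command -v sar >/dev/null 2>&1; then
        echo "sar not found, sysstat is probably not installed"
        return 1
    fi
    safile="${1:-/var/log/sa/sa28}"
    if [ -r "$safile" ]; then
        sar -u -f "$safile"   # CPU utilisation through the day
        sar -r -f "$safile"   # memory and swap usage
        sar -q -f "$safile"   # run queue length and load averages
    else
        echo "no data file at $safile, sar was not collecting that day"
    fi
}

check_sar /var/log/sa/sa28 || true
```

The last few samples before the gap in the data are the interesting
ones: free memory and swap heading towards zero would point at OOM.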



If it's not there, then you should definitely install the sysstat
package. There should be an rpm on the Red Hat 7.2 media; if not, google
for it.

And make sure you have a file called sysstat in /etc/cron.d that
contains:

# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A


This will gather all kinds of system information.
 
> Any suggestions?
> 
> d.

Are you familiar with your server's normal resource usage patterns? It
may be a good idea to monitor it closely for a period, using things like
top, free and vmstat.
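
If you want to keep a record rather than watch by hand, a small script
like this could be run from cron. Treat it as a sketch: the snapshot
name and the log path are hypothetical, and each tool is simply skipped
if it is not installed.

```shell
# Sketch: append a timestamped memory/load snapshot to a log file.
# The log path is hypothetical; pick whatever suits.
snapshot() {
    log="${1:-/tmp/resource-watch.log}"
    {
        echo "=== $(date) ==="
        # Each tool is skipped quietly if it is not installed
        command -v free   >/dev/null 2>&1 && free -m
        command -v vmstat >/dev/null 2>&1 && vmstat 1 2
        command -v uptime >/dev/null 2>&1 && uptime
    } >> "$log"
    return 0
}

snapshot /tmp/resource-watch.log
```

After a week or so the log gives you a baseline, and if the box hangs
again the last entries show what memory and load looked like just before.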


--
Andrew