[Preston] Anyone any experiences with watchdog?

Dougie Nisbet plug at highmoor.co.uk
Sun Apr 11 10:39:47 BST 2004


On Friday I tried upgrading to kernel 2.6.4 (again). I was running 2.4.22 
before. (I wanted to have another go at using ALSA, and my thinking was that 
the most straightforward way would be to upgrade to a 2.6 kernel.)

After a day or two I remembered why I'd abandoned the upgrade last time. Every 
couple of days the server freezes up. Not in a useful, interesting or 
recoverable way, just an unusable way. This gives me a big problem. The 
server is pingable. Unfortunately, the server, 'nick', lives under the floor. 
nick is headless (except when I'm building or reconfiguring, when nick is 
not headless - just nearly headless ...). nick is also on a UPS. So cycling 
power involves lifting a trapdoor and prodding at the UPS with a pointy 
stick.

I thought I'd look into the watchdog package a bit more, thinking it might 
allow me to reboot in a more dignified manner. Unfortunately, I've had 
problems with watchdog too. It's another one I stopped using and 20 minutes 
ago I remembered why. In my /etc/watchdog.conf file I tried:

ping                    = 192.168.1.1
interface               = eth0
file                    = /var/log/messages

nick is 192.168.1.9. 192.168.1.1 is my broadband router. Unfortunately, and 
infuriatingly, this isn't working. Or, more depressingly, it is. No sooner 
had I restarted the watchdog daemon than nick rebooted. The messages in the 
log include:

Apr 11 10:20:16 nick rpc.statd[578]: Version 1.0.6 Starting
Apr 11 10:20:16 nick rpc.statd[578]: statd running as root. 
chown /var/lib/nfs/sm to choose different user 
Apr 11 10:20:17 nick watchdog[301]: network is unreachable (target: 
192.168.1.1)
Apr 11 10:20:17 nick watchdog[301]: shutting down the system because of error 
101
Apr 11 10:21:15 nick watchdog[310]: starting daemon (5.2):
Apr 11 10:21:15 nick watchdog[310]: int=10s realtime=yes sync=no soft=no 
mla=24 mem=0
Apr 11 10:21:15 nick watchdog[310]: ping: 192.168.1.1
Apr 11 10:21:15 nick watchdog[310]: file: /var/log/messages:0
Apr 11 10:21:15 nick watchdog[310]: pidfile: /var/run/syslogd.pid
Apr 11 10:21:15 nick watchdog[310]: interface: eth0
Apr 11 10:21:15 nick watchdog[310]: test=none(0) repair=none 
alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Apr 11 10:21:24 nick rgpsp[538]: warning: any host will be allowed to connect
Apr 11 10:21:24 nick rgpsp[538]: ready to answer queries
Apr 11 10:21:29 nick rpc.statd[582]: Version 1.0.6 Starting
Apr 11 10:21:29 nick rpc.statd[582]: statd running as root. 
chown /var/lib/nfs/sm to choose different user 
Apr 11 10:22:00 nick watchdog[310]: network is unreachable (target: 
192.168.1.1)
Apr 11 10:22:00 nick watchdog[310]: shutting down the system because of error 
101
Apr 11 10:22:57 nick watchdog[312]: starting daemon (5.2):
Apr 11 10:22:57 nick watchdog[312]: int=10s realtime=yes sync=no soft=no 
mla=24 mem=0
Apr 11 10:22:57 nick watchdog[312]: ping: 192.168.1.1
Apr 11 10:22:57 nick watchdog[312]: file: /var/log/messages:0
Apr 11 10:22:57 nick watchdog[312]: pidfile: /var/run/syslogd.pid
:


Having a server which does nothing but reboot living under your floor is not 
something a chap should really have to cope with on an Easter Sunday morning. 
I've stopped it now. The only tangible output from the log is the 'error 101' 
bit. Does anyone know what that means?
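
One way to check what a bare error number means is to look it up in the system's errno table. The following is only a minimal sketch, assuming the 101 that watchdog logs is an ordinary errno value (the "network is unreachable" line logged just before it suggests so, but the log itself doesn't say). The helper name describe_errno is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'ENETUNREACH'.
    name = errno.errorcode.get(code, "unknown")
    # os.strerror gives the C library's human-readable message for the value.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 101 is the number from the watchdog log; errno numbering is
    # platform-specific, so run this on the machine that produced the log.
    name, message = describe_errno(101)
    print(f"errno 101: {name} ({message})")
```

On a Linux box I would expect this to report ENETUNREACH, which matches the "network is unreachable" message, but the numbering differs between operating systems, so it is worth running on nick itself.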

More generally, the last time I upgraded to a 2.6 kernel (2.6.2 I think it 
was) I had the same freezing problem. (That was without watchdog, although I 
had trouble with watchdog then too.) Running a 2.6 kernel on nick causes a 
freeze every day or two. It's still pingable but I can't ssh to it. I think 
the last time I did this I ran a ps -ef in a cron job every minute to try to 
get a snapshot of what might be the problem, looping or rogue processes 
perhaps, but nothing came to light. 
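
For anyone wanting to try the same snapshot trick, here is a rough sketch of the kind of script cron could run once a minute. It is not the script I actually used; the function name append_snapshot and the log path /var/tmp/ps-snapshots.log are made up for illustration. It appends a timestamped copy of the ps -ef output to a file and forces it to disk, so the last entry before a freeze has a chance of surviving.

```python
import os
import subprocess
import time


def append_snapshot(log_path, command=("ps", "-ef")):
    """Append a timestamped copy of the command's output to log_path."""
    # check=False: a failing command is still recorded rather than raising.
    result = subprocess.run(list(command), capture_output=True, text=True,
                            check=False)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} exit={result.returncode} ===\n")
        log.write(result.stdout)
        # Flush and fsync so the snapshot reaches the disk even if the
        # machine locks up shortly afterwards.
        log.flush()
        os.fsync(log.fileno())
    return result.returncode


if __name__ == "__main__":
    # Hypothetical log location; pick somewhere that survives a reboot.
    append_snapshot("/var/tmp/ps-snapshots.log")
```

The fsync is the design choice that matters here: without it the most recent snapshots may still be sitting in the page cache when the box hangs, which are exactly the ones you want to read afterwards.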

Happy Easter!

Dougie



