[Preston] Anyone any experiences with watchdog?
Dougie Nisbet
plug at highmoor.co.uk
Sun Apr 11 10:39:47 BST 2004
On Friday I tried upgrading to kernel 2.6.4 (again). I was running 2.4.22
before. (I wanted to have another go at using ALSA and my thinking was the
most straightforward way would be to upgrade to a 2.6 kernel.)
After a day or two I remembered why I'd abandoned the upgrade last time. Every
couple of days the server freezes up. Not in a useful, interesting or
recoverable way, just an unusable way. This gives me a big problem. The
server is pingable. Unfortunately, the server, 'nick', lives under the floor.
nick is headless (except when I'm building or reconfiguring, when nick is
not headless - just nearly headless ...). nick is also on a UPS. So cycling
power involves lifting a trapdoor and prodding at the UPS with a pointy
stick.
I thought I'd look into the watchdog package a bit more, thinking it might
allow me to reboot in a more dignified manner. Unfortunately, I've had
problems with watchdog too. It's another one I stopped using and 20 minutes
ago I remembered why. In my /etc/watchdog.conf file I tried:
ping = 192.168.1.1
interface = eth0
file = /var/log/messages
nick is 192.168.1.9. 192.168.1.1 is my broadband router. Unfortunately, and
infuriatingly, this isn't working. Or, more depressingly, it is. No sooner
had I restarted the watchdog daemon than nick rebooted. The messages in the
log include:
Apr 11 10:20:16 nick rpc.statd[578]: Version 1.0.6 Starting
Apr 11 10:20:16 nick rpc.statd[578]: statd running as root.
chown /var/lib/nfs/sm to choose different user
Apr 11 10:20:17 nick watchdog[301]: network is unreachable (target:
192.168.1.1)
Apr 11 10:20:17 nick watchdog[301]: shutting down the system because of error
101
Apr 11 10:21:15 nick watchdog[310]: starting daemon (5.2):
Apr 11 10:21:15 nick watchdog[310]: int=10s realtime=yes sync=no soft=no
mla=24 mem=0
Apr 11 10:21:15 nick watchdog[310]: ping: 192.168.1.1
Apr 11 10:21:15 nick watchdog[310]: file: /var/log/messages:0
Apr 11 10:21:15 nick watchdog[310]: pidfile: /var/run/syslogd.pid
Apr 11 10:21:15 nick watchdog[310]: interface: eth0
Apr 11 10:21:15 nick watchdog[310]: test=none(0) repair=none
alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Apr 11 10:21:24 nick rgpsp[538]: warning: any host will be allowed to connect
Apr 11 10:21:24 nick rgpsp[538]: ready to answer queries
Apr 11 10:21:29 nick rpc.statd[582]: Version 1.0.6 Starting
Apr 11 10:21:29 nick rpc.statd[582]: statd running as root.
chown /var/lib/nfs/sm to choose different user
Apr 11 10:22:00 nick watchdog[310]: network is unreachable (target:
192.168.1.1)
Apr 11 10:22:00 nick watchdog[310]: shutting down the system because of error
101
Apr 11 10:22:57 nick watchdog[312]: starting daemon (5.2):
Apr 11 10:22:57 nick watchdog[312]: int=10s realtime=yes sync=no soft=no
mla=24 mem=0
Apr 11 10:22:57 nick watchdog[312]: ping: 192.168.1.1
Apr 11 10:22:57 nick watchdog[312]: file: /var/log/messages:0
Apr 11 10:22:57 nick watchdog[312]: pidfile: /var/run/syslogd.pid
:
Having a server which does nothing but reboot living under your floor is not
something a chap should really have to cope with on an Easter Sunday morning.
I've stopped it now. The only tangible output from the log is the 'error 101'
bit. Does anyone know what that means?
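The 101 looks like a plain errno value. On Linux/x86, errno 101 is ENETUNREACH
("Network is unreachable"), which matches the log line just before each
shutdown. A minimal sketch for looking up a number like this, assuming Python
is available on the box (describe_errno is only an illustrative name, not part
of watchdog, and the mapping is platform-specific):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this machine."""
    # errno.errorcode maps numbers to names such as 'ENETUNREACH';
    # unknown numbers fall back to a placeholder instead of raising.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 101 is the value taken from the watchdog log lines above.
    print(describe_errno(101))
```

If that reading is right, watchdog is reporting that there was no route to
192.168.1.1 at the moment it tried to ping, rather than that the router failed
to answer.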
More generally, the last time I upgraded to a 2.6 kernel (2.6.2 I think it
was) I had the same problem (the freezes, I mean, although watchdog misbehaved
then too). Running a 2.6 kernel on nick causes a freeze-up every day or two.
It's still pingable
but I can't ssh to it. I think the last time I did this I ran a ps -ef in a
cron job every minute to try and get a snapshot of what might be the problem,
looping or rogue processes perhaps, but nothing came to light.
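The once-a-minute ps -ef snapshot can be sketched along these lines. This is a
minimal sketch, not the original cron job: the snapshot helper and any log
path passed to it are hypothetical, and it assumes Python 3.7 or later. The
fsync is there so the last snapshot has a chance of reaching the disk before a
freeze.

```python
import os
import subprocess
import time


def snapshot(logfile, cmd=("ps", "-ef")):
    """Append a timestamped copy of the command's output to logfile."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as fh:
        fh.write(f"=== {stamp} ===\n")
        fh.write(result.stdout)
        # Push the data to disk now; after a freeze there is no second chance.
        fh.flush()
        os.fsync(fh.fileno())
    return result.returncode
```

A small script that calls snapshot() with a log path of your choosing can then
be run from cron every minute; after a freeze, the last block in the log is
the closest thing to a picture of what was running at the time.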
Happy Easter!
Dougie