[Gllug] Servers and other irritations...

Mike Brodbelt mike at coruscant.demon.co.uk
Wed Jul 31 23:14:23 UTC 2002


I'm posting my own present misery to the list, in the vague hope that
someone might be able to take pity on me...

I've been running a particular server at work for quite a while now.
It's only ever been taken down for maintenance, and has typically had an
uptime of 6 to 18 months on it when it's been rebooted. It runs a load
of services (samba, dns, imap mail, dhcp), and is generally an important
part of my infrastructure. It's got dual PSUs, ECC memory, and an ICP
vortex RAID controller, with IBM SCSI disks configured as two separate
RAID 5 arrays.

On Jul 15th, it suddenly stopped accepting IMAP and SMTP sessions for no
obvious reason. I tried to SSH in and find out what was going on, but it
wasn't having any of it - I typed in the password, and the session just
hung. I tried logging in on the console, with no greater luck. Samba and
NFS were still working fine, however. We couldn't live without mail, and
so I reluctantly rebooted it, by pulling the plug out of the UPS, and
letting it shut itself down. The subsequent nasty fsck's showed that it
hadn't quite managed this properly..... There were a few "Unable to load
interpreter: ld-linux.so.2" errors in the logs, and on the console.

The following morning, a disk had died. The controller had reconstructed
onto the hot spare, so I thought that was the cause of all the problems,
resolved to switch in a fresh hot spare over the next few days,
reinstalled ld-linux.so.2 from the clean glibc rpm, and relaxed.....

About 4 days later, the same thing happened. I hastened my maintenance
slot by a day, and switched out the buggered disk, configured the fresh
one, and upgraded the kernel (to 2.2.21) and bind for good measure.

25th July, it died again. Feeling less pleased with the state of
affairs, I started perusing the logs like a maniac. On this occasion, I
had a root session open on the machine when it "crashed". This was
entirely usable - I could happily do pretty much anything. I stopped and
restarted the failing services. They stopped and started, but just kept
right on failing...... I ran gdb on the unresponsive network processes,
and found all my imapd's hanging in send(), as were the new sshd
processes that were spawned when I tried to connect (though pre-existing
sshd's were fine). DNS was not pleased:-

$ host gate.demon.co uk
Error in looking up server name:
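
For anyone wanting to look for the same symptom without attaching gdb to
every process, here is a rough Python sketch that lists the state of
named processes from /proc. It assumes a Linux /proc where
/proc/<pid>/stat has the form "pid (comm) state ...", and the helper
names parse_stat and list_processes are made up for illustration. It
only reports the scheduler state (S for sleeping, D for uninterruptible
wait, and so on), not the call a process is blocked in, so it
complements gdb rather than replacing it.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so take everything between the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    pid = int(text[:start].strip())
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return pid, comm, state


def list_processes(names, proc_root="/proc"):
    """Return sorted (pid, comm, state) tuples for processes named in names."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away, or the entry was unreadable
        if comm in names:
            found.append((pid, comm, state))
    return sorted(found)


if __name__ == "__main__":
    # The process names are the ones mentioned in this post.
    for pid, comm, state in list_processes({"imapd", "sshd"}):
        print(pid, comm, state)
```

Run periodically, something like this might show new sshd or imapd
processes piling up, which would at least timestamp the start of a
failure.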

Looking through the logs, I have a few nasties like this at times:-

imapd[29117]: IOERROR: writing HK^Q^HàT"@èT"@èT"@ðT"@ðT"@
øT"@øT"@: Bad file descriptor
Jul 29 12:56:07 castor imapd[29117]: DBERROR: error fetching txn
Jul 29 12:56:21 castor imapd[29117]: IOERROR: writing
HK^Q^HàT"@èT"@èT"@ðT"@ðT"@
øT"@øT"@: Bad file descriptor
Jul 29 12:56:21 castor master[256]: process 29117 exited, signaled to
death by 11
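
To get a feel for how often these turn up, entries like the above could
be tallied with something along these lines. This is only a sketch: it
assumes the usual one-line syslog layout ("Mon DD HH:MM:SS host
prog[pid]: message", without the wrapping my mailer has added above),
and the function name summarise is made up for illustration. The two
sample lines in the script are copied from the log excerpt above.

```python
import re
from collections import Counter

# Classic syslog line: "Jul 29 12:56:07 castor imapd[29117]: message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
# Message logged by the master process when a child is killed by a signal.
DEATH_RE = re.compile(r"process (\d+) exited, signaled to death by (\d+)")


def summarise(lines):
    """Count IOERROR/DBERROR entries per program and collect signal deaths."""
    errors = Counter()
    deaths = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # wrapped or otherwise unparseable line
        msg = match.group("msg")
        tag = msg.split(":", 1)[0]
        if tag in ("IOERROR", "DBERROR"):
            errors[(match.group("prog"), tag)] += 1
        death = DEATH_RE.search(msg)
        if death:
            deaths.append(
                (match.group("stamp"), int(death.group(1)), int(death.group(2)))
            )
    return errors, deaths


if __name__ == "__main__":
    sample = [
        "Jul 29 12:56:07 castor imapd[29117]: DBERROR: error fetching txn",
        "Jul 29 12:56:21 castor master[256]: process 29117 exited, "
        "signaled to death by 11",
    ]
    errors, deaths = summarise(sample)
    for (prog, tag), count in sorted(errors.items()):
        print(prog, tag, count)
    for stamp, pid, signum in deaths:
        print(stamp, "pid", pid, "killed by signal", signum)
```

Signal 11 on Linux is SIGSEGV, so the last entry means that imapd
segfaulted rather than exiting cleanly.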

Having gathered some info, and heedful of the wails of users, I decided
to reboot. I did a shutdown, and logged off. Alas, shutdown failed. Back
to the power switch....

30th July, and it failed again. Thankfully, the Magic SysRq sync, umount,
reboot routine on my new kernel worked, so I was spared watching fsck
again. In the evening, I took the machine down and ran 3 passes of the
standard test suite in memtest86. 5 hours and zero memory errors later,
I'm still none the wiser.

My previously stable server seems to be going belly up every 4-5 days,
and the start of this coincided nicely with a disk failure. I've made no
significant software changes recently, and I'm not aware of anything
else that seems likely to be applicable. My next course of action is
going to be to run a "parity verify" on the disks, as suggested by ICP
tech support when I asked them how I could test the controller, but
that'll take the machine down for 8 hours or so, so will have to wait
until the weekend.

Has anyone seen anything similar to this in the past? Can anyone provide
me with any pointers? At this stage, I'm hoping for divine
inspiration...... I still suspect a hardware fault, but if I can't find
anything, it's a little hard to just start buying new kit on the
off chance. If it keeps up, I'm going to end up having to reinstall the
entire OS and associated services from scratch just to rule out software
corruption.....


Help.....


Mike.


-- 
Gllug mailing list  -  Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug



