[Gllug] disk problems

Nix nix at esperi.org.uk
Tue Mar 14 19:51:39 UTC 2006


On Tue, 14 Mar 2006, Sean Burlington gibbered uncontrollably:
> Nix wrote:
>> On Sun, 12 Mar 2006, Sean Burlington suggested tentatively:
>>
>>>Hi all,
>>>I have a failing hard disk (my home partition seems fubared, var is complaining, but the rest seems OK for now)
>> Join the club. After I lost two disks in a month-long period I went
>> all-out and have now RAIDed the lot. No more disk death for *me*.
> 
> It's my second disk failure in 2 years which is bad enough.
> 
> But 2 in a month !!!

Between May 1989 (my first PC) and Jan 2006 I had two fan failures.

In late Jan 2006 and early-to-mid Feb 2006 I had

- two disk failures (one whose motor died at spinup, one just from
  old age and bearing wear), leading to the decommissioning of an
  entire machine because Sun disks cost so much to replace
- one motherboard-and-network-card failure (static, oops)
- one overheating CPU (on the replacement for the static-death box)
- and some (very) bad RAM (on that replacement).

Everything seems to be stable now, but RAID it is, and because I want
actual *robustness* I'm LVM+RAID-5ing everything necessary for normal
function except for /boot, and RAID-1ing that.
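
For concreteness, the broad shape of that setup is something like this
(device names and sizes purely illustrative; substitute your own):

    # /boot on a small RAID-1, so the bootloader can read it without
    # needing to understand LVM
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/sda1 /dev/sdb1

    # everything else on RAID-5 with LVM on top, so things can be
    # resized later
    mdadm --create /dev/md1 --level=5 --raid-devices=3 \
          /dev/sda2 /dev/sdb2 /dev/sdc2
    pvcreate /dev/md1
    vgcreate vg0 /dev/md1
    lvcreate -L 10G -n root vg0
    lvcreate -L 2G  -n var  vg0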

Thanks to initramfs this really isn't actually all that hard :) my /init
script in the initramfs is 76 lines, and that includes enough error
checking that if / can't be mounted, or the RAID arrays are shagged, or
LVM has eaten itself, I get a terribly primitive shell with access to
mdadm, the lvm tools, and fsck, on a guaranteed-functioning FS which can
only go away if the kernel image itself has vanished.
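
Minus the error checking, the skeleton of such an /init is only
something like this (a sketch, assuming a busybox-ish userland and
made-up array/VG names; the real thing is mostly error handling):

    #!/bin/sh
    mount -t proc proc /proc
    # assemble the arrays (assumes an mdadm.conf baked into the image)
    mdadm --assemble --scan          || exec /bin/sh
    # activate the volume group and mount the real root read-only
    vgchange -ay vg0                 || exec /bin/sh
    mount -o ro /dev/vg0/root /root  || exec /bin/sh
    umount /proc
    # hand over to the real init on the real root
    exec switch_root /root /sbin/init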

(I'm an early adopter; eventually use of an initramfs will be mandatory,
and even now everyone running 2.6 has an initramfs of sorts built in,
although it's empty.)
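
Rolling your own image is nothing exotic either: it's just a gzipped
cpio archive, roughly

    cd /wherever/your/initramfs/tree/is   # /init, /bin/sh, mdadm, lvm, fsck...
    find . | cpio -o -H newc | gzip > /boot/initramfs.img

and then you point the bootloader's initrd line at the result.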

> Overheating is a possible cause with me - the box didn't really have
> enough room around it.

I hadn't noticed room *around* boxes being a problem except in tightly
packed racks or in machines with things stuck immediately in front of
them: all my home machines are stuffed in corners where I don't have to
listen to them or look at them, generally with not much airflow other
than up at the back and in at the front, and they've never had any
problems with airflow.

The pre-static-death machine had a 4Gb SCSI disk in it which ran at
~60C, a rather frightening temperature. The new one has more (larger)
fans and a disk cooler, so the disks run at temperatures between 20C and
40C; much better.
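
If you want to keep an eye on them, the drives' own sensors are
readable with smartmontools or hddtemp (assuming the drive reports a
temperature at all):

    smartctl -A /dev/hda | grep -i temperature
    hddtemp /dev/hda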

The CPU in this box is nice and cool for a P3/600, probably because I
overdid it and slung in a huge fan after the first overheating incident:

CPU Temp:  +44.2°C  (high =   +95°C, hyst =   +89°C)

:)

> I think I'm going to make more effort to switch off overnight in
> future - it seems to me that it's boxes that get left on for months
> which have problems.

I've left boxes on for years at a time with no problems at all. Most of
my failures have happened at poweron time (excepting the old-age disk,
which was fifteen years old when it died and had been running
constantly for all that time except for house moves; it had been a
huge and expensive disk in its day, 4Gb!).

>> What is `the system'? /etc/mtab (as used by df(1)) is maintained by
>> mount(8) and is terribly unreliable; it's confused by mount --bind,
>> per-process filesystems, chroot(8), mount --move, mount --rmove, subtree
>> sharing, you name it, it confuses it.
>> /proc/mounts is maintained by the kernel and is actually reliable. What
> 
> this seems bad - there was no /proc/mounts

Either you don't have /proc mounted or you're running a pre-2.4 kernel.
Both these things are generally bad signs.
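
If it's the former, the fix is trivial (and worth checking your boot
scripts for):

    mount -t proc proc /proc
    cat /proc/mounts        # the kernel's view, unlike /etc/mtab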

>>>I then used grub interactively to boot from (hd1, 1) and specified the
>>>kernel with parameter root=/dev/hdb1
>> That should have worked. I'd say you're using /dev/hdb1 unless you have
>> an initrd or initramfs that is ignoring root= (which would be terribly
>> bad form).
> 
> not using either of those

Your distro might be using one, I suppose.
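
For reference, the interactive incantation is roughly this (kernel
path illustrative; remember GRUB legacy counts partitions from zero,
so (hd1,1) is the second partition on the second BIOS drive):

    grub> root (hd1,1)
    grub> kernel /vmlinuz root=/dev/hdb1 ro
    grub> boot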

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



