[Gllug] disk problems

Nix nix at esperi.org.uk
Thu Mar 16 00:03:53 UTC 2006


On Tue, 14 Mar 2006, Sean Burlington whispered secretively:
> Nix wrote:
>> 
>> Between May 1989 (my first PC) and Jan 2006 I had two fan failures.
>> 
>> In late Jan 2006 and early-to-mid Feb 2006 I had
>> 
>> - two disk failures (one whose motor died at spinup, one just from
>>   old age and bearing wear), leading to the decommissioning of an
>>   entire machine because Sun disks cost so much to replace
>> - one motherboard-and-network-card failure (static, oops)
>> - one overheating CPU (on the replacement for the static-death box)
>> - and some (very) bad RAM (on that replacement).
> 
> How many machines do you have ?

At that point, three. Now, two (but one has about half a dozen virtual
machines on it). One of the disk failures was building for a couple
of months before that, though.

As a direct consequence of all this I have about, oh, five times as much
storage on *one machine* as I had on the *whole network* beforehand:
the rate at which disk tech advances continues to amaze me.

> It can't be so many that that isn't an appalling failure rate!!!

I think after sixteen years of no failures Murphy just came home to
roost. (Plus the static death thing was my damn fault. As a consequence,
I now refuse to touch hardware without rubber gloves on, and I'll go
nowhere near hardware as long as anyone else is nearby. They need not
be more competent, they just need to be more coordinated, and just
about any still-breathing human fits that criterion.)

>> Everything seems to be stable now, but RAID it is, and because I want
>> actual *robustness* I'm LVM+RAID-5ing everything necessary for normal
>> function except for /boot, and RAID-1ing that.
> 
> RAID 5 = 3 hard disks + controller...

Well, three block devices. If you're using IDE you should use separate
channels, and if you really care about robustness, separate controllers/
busses too.
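
For reference, creating such an array is a one-liner with mdadm; a
minimal sketch with made-up device names (/dev/hda and /dev/hdc on
separate IDE channels, /dev/sda on SCSI):

# three block devices, each hanging off a different channel/controller
mdadm --create /dev/md1 --level=5 --raid-devices=3 \
      /dev/hda1 /dev/hdc1 /dev/sda1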

> I can't really justify the expense of that even though I have had a
> couple of failures (and one or two learning experiences a while back)

Ah, well, the disks built up like this:

loki at purchase (1997): 1 SCSI, 4.2Gb
                  2000: added 1 IDE, 10Gb
                  2005: bought a 72Gb SCSI for the Sun
              Jan 2006: oops, Suns are SCA: get another disk
              Feb 2006: Sun dies. Bugger.
                        loki dies too; replacement comes with a
                        40Gb IDE disk.
              Mar 2006: 4Gb SCSI disk ditched: all SCSI disks slung
                        in. From 12Gb to 194Gb in one month. :)

So only the SCSI disks cost me anything, really.

RAIDing this has proved interesting because the disks have such
different sizes: in the end I have one 40Gb array covering the SCSI
disks and the bigger IDE disk, and a 10Gb one covering the SCSI disks
and the smaller IDE disk; the latter is dead slow by modern standards so
only stuff like news goes on it. Then there's 20Gb unRAIDed on each
disk --- oh, and the 4xRAID-1 /boot array.

This whole mess is covered by two LVM VGs, one covering all the RAIDed
storage and one covering the non-RAIDed storage. The reason for two VGs
is that if a PV dies, any VG using that PV is in serious trouble: so the
robust RAIDed storage gets its own VG, which can't be affected by the
failure of any one disk.
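
In mdadm/LVM terms that works out to something like this (a sketch
only; device and VG names are invented):

# the RAID arrays become PVs in one VG...
pvcreate /dev/md1 /dev/md2
vgcreate vg_raid /dev/md1 /dev/md2

# ...and the bare leftover partitions become PVs in another, so that
# one dead disk can never take the RAIDed VG down with it
pvcreate /dev/hda3 /dev/hdc3 /dev/sda3
vgcreate vg_plain /dev/hda3 /dev/hdc3 /dev/sda3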

> Pretty much everything important is under version control at work and
> backed up to tape nightly.

I do the VC-and-backup, too; the RAID is so that if a disk dies I can
come back up with one disk removal, and so that if a disk dies while I'm
hundreds of miles away I don't lose anything and everyone who has an
account here can still connect :) Restoring a dead box from backup is so
fantastically annoying that RAID just seemed a good move.
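
The swap itself is only a couple of mdadm commands; a sketch, with
made-up names:

mdadm /dev/md1 --fail /dev/hda1 --remove /dev/hda1
# ...replace the hardware, then:
mdadm /dev/md1 --add /dev/hda1

and the array rebuilds onto the new disk in the background.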

> I just need a more effective way of separating out stuff that needs to
> be regularly backed up from the rest of it...

I use a per-host per-backup-set exclusion list in my backup script, viz:

# file types that are already compressed (no point recompressing them)
my @uncompressed_files = ('*.bz2', '*.gz', '*.rar', '*.jar', '*.zip',
                          '*.mp3', '*.mpg', '*.ogg', '*.jpg', '*.gif', '*.png');
# editor droppings, logs and regenerable databases: never backed up
my @excluded_files = ('*~', '*.bak', '*.swp', '.Mail.log', '.X.log',
                      '.newsrc.dribble', '.locatedb', 'aquota.group', 'aquota.user');
# where each backup set starts
my %roots = ( hades => '/', loki => '/', packages => '/usr/packages' );
# directories to skip: 'all' applies everywhere, the rest are per-host
my %excluded_directories = ( all => [ 'lost+found', '/var/tmp', '/usr/local/tmp', '/usr/local/archive',
                                      '/var/cache', '/var/run', '/var/spool/wwwoffle', '/var/spool/locate/',
                                      '/usr/src/build', '/var/log', '/var/log.real', '/usr/spool/lpd', '/mnt' ],
                             hades => [ '/mirror', '/usr/share/dar/catalogues', '/usr/archive/music/.hades/archos',
                                        '/home/.hades.wkstn.nix/boinc', '/usr/share/clamav' ],
                             loki => [ and so on ... ]);
# filesystem types whose mount points are excluded wholesale
my @excluded_fsen = ( 'proc', 'sysfs', 'msdos', 'devpts', 'tmpfs', 'openpromfs',
                      'iso9660', 'udf', 'usbdevfs', 'minix', 'vfat', 'nfs', 'none' );

(Transforming the filesystem list into a list of paths to exclude is
a trivial parse-/proc/mounts job.)
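
By way of illustration only (the real thing is Perl), the whole job is
about one line of awk against /proc/mounts:

# print the mount point of every filesystem whose type is to be excluded
awk '$3 ~ /^(proc|sysfs|devpts|tmpfs|iso9660|vfat|nfs|none)$/ { print $2 }' /proc/mounts

(with the full list from @excluded_fsen in place of that abbreviated
one, of course).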

> My backups have been getting less frequent as I keep finding that the
> stuff I planned to backup is over DVD size!

Use a program that knows how to split things across CDs? (I use dar;
even though it's horrifically memory-inefficient, it does *work* and
I have scripts that know about it and I've done complete restores
with it before.)
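
For the record, the slicing is just dar's -s option; a sketch with
made-up paths:

# full backup of /, gzip-compressed, cut into CD-sized slices
dar -c /backup/full -R / -z -s 650M

Each slice then lands in /backup as full.1.dar, full.2.dar, and so on.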

[initramfs]
>> (I'm an early adopter; eventually use of an initramfs will be mandatory,
>> and even now everyone running 2.6 has an initramfs of sorts built in,
>> although it's empty.)
> 
> I try not to be an early adopter - or at least I try and stick with
> Debian stable where I can and cherry pick from more up to date stuff.

You're sane ;) me, I treat my systems like experimental testbeds in some
respects and like critical systems in others (i.e., breaking things is
all very well, but breaking uptime or function for other users without
advance notice is verboten; they've been very forgiving over the last
few months, but I don't want to try their patience; some of them have
large LARTs ;) ).

>> CPU Temp:  +44.2°C  (high =   +95°C, hyst =   +89°C)
>> 
>> :)
> 
> lmsensors has been on my todo list for a bit ....

I've had lm-sensors running ever since hades nearly melted its
motherboard. More recently I tweaked it so it stopped producing ALARMs;
now I just have to find something that can monitor the sensor output:
I hear tales of a sensorsd, but I'm not sure where to find it.
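
In the meantime a cron job along these lines does at a pinch (just a
sketch: the `CPU Temp' label, the threshold and the mail address all
depend on your setup):

# warn by mail if the CPU temperature climbs past 60 degrees
temp=$(sensors | awk '/CPU Temp/ { gsub(/[^0-9.]/, "", $3); print $3 }')
if [ -n "$temp" ] && [ "${temp%.*}" -gt 60 ]; then
    echo "CPU at ${temp}C" | mail -s 'CPU running hot' root
fi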

Also recommended are smartd (I got nice emails when those disks started
to die, a week or so before they *really* went; long enough in advance
to back up), and if you're running RAID, it's a really really good idea
to add

/sbin/mdadm --monitor --pid-file /var/run/mdadm.pid --daemonise --scan -y

to some startup script, and stick a suitable MAILADDR line in your
/etc/mdadm.conf. There are periodic horror stories on the linux-raid
list about people who forget to do this and notice that they've been
running their RAID-5 array in degraded mode only when they lose a
*second* disk... RAID makes disk failures nearly invisible, but that
doesn't mean you don't want to know about them!
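
The config for both is only a line or two apiece; for instance (the
address is a placeholder):

# /etc/mdadm.conf: where mdadm --monitor sends its mail
MAILADDR you@example.org

# /etc/smartd.conf: scan all disks, check SMART health, mail on trouble
DEVICESCAN -H -m you@example.org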

>> I've left boxes on for years at a time with no problems at all. Most of
>> my failures have happened at poweron time (excepting the old-age disk
>> and that was fifteen years old when it died and had been running
>> constantly for all that time, except for house moves; it had been a
>> huge and expensive disk in its day, 4Gb!)
> 
> Now I've got things up and running again I'm much happier and thinking I
> don't want to give up mythtv - so it's likely to get left switched on.
> 
> What I will do is make a full backup before shutting down such a machine!

Certainly do so if you're shutting it down for more than a few minutes
(moving house or repairing it or something).

I'd recommend simply being cautious with the init scripts: rebooting
naturally gets a bit fraught if you only do it once every 500 days (one
box I admin at work just wrapped its uptime counter), if only because
your init scripts are by that time pretty much completely untested :)

The rate of 2.6 development keeps me safe from that these days: I tend
to have to reboot at least once a month. (How *embarrassing*, even if
it *is* generally for a kernel upgrade.)

>>> this seems bad - there was no /proc/mounts
>> 
>> Either you don't have /proc mounted or you're running a pre-2.4 kernel.
>> Both these things are generally bad signs.
> 
> it was a 2.6 kernel compiled by me rather than ready distributed

You should still have a /proc/mounts!
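
A quick sanity check for next time, in case /proc simply wasn't
mounted:

test -e /proc/mounts || mount -t proc proc /proc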

> I'm not sure if proc was mounted or not - it was in /etc/fstab but since
> I didn't trust df and couldn't read proc .... I re-installed to a
> different disk.

It seems that something *really* strange was going on (chrooting? who
knows), but since you've reinstalled I guess we'll never know.

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'