[Gloucs] Help diagnose crash please? Suspected hard drive failure, but which drive?

Matthew Booth mbooth at redhat.com
Wed Jan 30 11:40:12 GMT 2008


Andrew Oakley wrote:
> 
> Can someone help me diagnose the following crash please?
> 
> http://aoakley.com/misc/crash-20080130.jpg
> 
> I strongly suspect that this means at least one of my four hard drives 
> has failed, but the question is, if so which one? I have two pairs of 
> RAID1 mirrors, both with ext3 filesystems (hence the kjournald errors, I 
> think).
> 
> Upon reboot, everything either works fine, or I see the hda1/hdc1 mirror 
> being rebuilt in /proc/mdstat , so my suspicion is on either hda or hdc. 
> The system then fails randomly in a similar manner in the next minutes 
> or hours.
> 
> I'm using Ubuntu 6.06.02 LTS on a home-built Athlon XP CPU with mdadm 
> software RAID. No GUI desktop was running at the time.
> 
> Any help much appreciated,
> 

The kernel shouldn't crash, so this is a bug. The bug might have been 
triggered by just about anything imaginable, but there's a good chance 
that an apparent crash in the journalling daemon is storage (block 
device or filesystem) related. However, I'd say it's a long shot to 
associate this with disk failure.

The information in the stack trace usually isn't enough to fix the
problem. The best way to track down kernel bugs is with a core dump.
Hopefully you won't reproduce this problem too frequently, but my
advice would be to be prepared in case you do. Unfortunately I don't
know what's available on Ubuntu, but on Red Hat/Fedora you have two
options: netdump and kdump. Of these, the latter is by far the easier
to set up and the more useful. Essentially it uses kexec to boot a
fresh kernel in pre-allocated memory when the main kernel crashes. In
doing so, it leaves the original kernel in place, so the new kernel can
dump it at will without having to worry about being corrupt itself. The
standard setup is to write a core to disk, although since the capture
environment is a full kernel, it's very flexible.
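
As a rough way to check whether a machine is actually armed for kdump,
something like the sketch below might help. It assumes a kernel that
takes the crashkernel= boot parameter and exposes
/sys/kernel/kexec_crash_loaded; older kernels may not have the latter.
The function names are made up for illustration and are not part of any
kdump tooling.

```python
from pathlib import Path
from typing import Optional


def crashkernel_reservation(cmdline: str) -> Optional[str]:
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def crash_kernel_loaded(
    sysfs_file: str = "/sys/kernel/kexec_crash_loaded",
) -> Optional[bool]:
    """Return True/False as reported by sysfs, or None if the file is unreadable."""
    try:
        return Path(sysfs_file).read_text().strip() == "1"
    except OSError:
        # Assumption: a missing file means the kernel does not expose this flag.
        return None


def read_cmdline(proc_file: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line, or "" if it cannot be read."""
    try:
        return Path(proc_file).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    reservation = crashkernel_reservation(read_cmdline())
    loaded = crash_kernel_loaded()
    print("crashkernel reservation:", reservation or "none on the command line")
    if loaded is None:
        print("capture kernel loaded: unknown (sysfs flag not available)")
    else:
        print("capture kernel loaded:", "yes" if loaded else "no")
```

If no memory is reserved, or no capture kernel is loaded, a panic will
not produce a core, so it is worth checking this before waiting for the
next crash.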

In short: configure your machine to dump core if the kernel panics
again, and wait. You're much more likely to work out exactly what
happened from a core dump than from a simple stack trace. Be careful
about who you give your core dump to, though, as it will contain a
complete memory image. Definitely don't post it to this list ;)

Matt
-- 
Matthew Booth, RHCA, RHCSS
Red Hat, Global Professional Services

M:       +44 (0)7977 267231
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
