[sclug] My disk esplode

David Given dg at cowlark.com
Mon Apr 14 23:05:04 UTC 2008

I recently had a major disk failure --- bad sectors galore. I thought
people would appreciate a brief writeup to tell how I managed to recover
(AFAICT) all my data off it.

The scenario: a Porsche branded LaCie external USB disk (yes, I crashed
my Porsche). Inside is a SATA Seagate ST3500 Barracuda 7200.10 500GB
disk. It was cheap. On the disk is a JFS partition.


What happened: the disk, running quite happily, started chattering to
itself. At first I thought it was actual activity until I noticed that
it still happened when not actually plugged in to the computer. At this
point I realised something was very wrong and prepared to back up my
data. The first thing I did was nuke a very, very big directory of
temporary data because I didn't need it and didn't want to spend time
backing it up.

Lesson #1: do not do this. When your disk starts acting funny, mount it
read-only and never try to write to it again.

When the disk started making rhythmic grinding noises and spewing I/O
errors, I killed it, remounted it, and, even though it still seemed
readable, fscked it. EPIC FAIL.

Lesson #2: see lesson #1.

fsck tried to write to the superblock. The sectors containing the
superblock curled up and died. At this point I now had an unmountable
filesystem. Further fsck attempts (read-only this time!) revealed that
while the backup superblock seemed to be fine, there was no way of
fixing things because the primary superblock was unreadable.

At this point I needed to take an image of the hard disk. dd would not
work, because by default dd aborts at the first I/O error. Luckily, there
are two tools that do handle this: dd_rescue and GNU ddrescue (these are
*different*); they're in Debian packages ddrescue and gddrescue
respectively. Both will image a disk and attempt to recover bad sectors,
but GNU ddrescue does the better job of it. The Debian version is,
unfortunately, very old; if you ever do this, you will want to compile
the most recent version yourself. The recent version supports sparse
files, which means that empty blocks in the image will consume no disk
space. Thanks to this feature, I managed to get the 500GB image to
occupy only about 300GB of real disk space on my other big drive. This
was a good thing, as otherwise I'd have had no room.
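For anyone wondering what "handles I/O errors" and "sparse" mean in
practice, here is a rough Python sketch of the idea. It is not how
dd_rescue or ddrescue are actually implemented; FlakySource,
rescue_copy, demo and the 4096-byte block size are all made up for
illustration. The copy skips any block that fails to read, records where
it was, and never writes all-zero blocks, so on a filesystem that
supports holes those regions should take up no space.

```python
import errno
import io
import os
import tempfile

BLOCK = 4096  # illustrative block size, an assumption of this sketch


class FlakySource(io.BytesIO):
    """Hypothetical stand-in for a failing disk: reads at bad offsets raise EIO."""

    def __init__(self, data, bad_offsets):
        super().__init__(data)
        self.bad_offsets = set(bad_offsets)

    def read(self, size=-1):
        if self.tell() in self.bad_offsets:
            raise OSError(errno.EIO, "Input/output error")
        return super().read(size)


def rescue_copy(src, dst_path, total_size, block_size=BLOCK):
    """Copy total_size bytes from src to dst_path, one block at a time.

    Blocks that raise OSError are skipped and reported; all-zero blocks
    are not written either, so both end up as holes in the output file.
    Returns a list of (offset, length) for the unreadable blocks.
    """
    bad = []
    with open(dst_path, "wb") as dst:
        offset = 0
        while offset < total_size:
            want = min(block_size, total_size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
            except OSError:
                bad.append((offset, want))  # carry on instead of aborting
                data = b""
            if data and data != bytes(len(data)):
                dst.seek(offset)  # seeking past unwritten space leaves a hole
                dst.write(data)
            offset += want
        dst.truncate(total_size)  # make the image the full size of the source
    return bad


def demo():
    """Copy a four-block fake disk whose second block is unreadable."""
    data = b"A" * BLOCK + b"B" * BLOCK + bytes(BLOCK) + b"C" * BLOCK
    src = FlakySource(data, bad_offsets=[BLOCK])
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        bad = rescue_copy(src, path, len(data))
        with open(path, "rb") as f:
            image = f.read()
    finally:
        os.remove(path)
    return bad, image
```

The real tools do a lot more than this (retrying bad areas and keeping a
log so an interrupted run can resume, for instance), so use them rather
than anything like the above on a disk you care about.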

Lesson #3: it's always worth having a spare of the biggest disk you've got.

At this point all I needed to do was fsck.jfs the disk image, mount it
loopback, and copy the data off. Success. (If there were any bad sectors
in spaces used by actual files, those files will now contain blocks of
zeros, but ddrescue said that there were only 419kB of unreadable
sectors on the entire 488386560kB disk, so chances are I was lucky.)
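To put those two numbers in proportion, a quick back-of-the-envelope
calculation in Python, using only the figures ddrescue reported (the
variable names are mine):

```python
unreadable_kb = 419        # unreadable data reported by ddrescue
disk_kb = 488386560        # total disk size reported by ddrescue

fraction = unreadable_kb / disk_kb

print(f"unreadable fraction: {fraction:.3e}")
print(f"as a percentage:     {fraction * 100:.7f}%")
print(f"one bad kB per {disk_kb / unreadable_kb:,.0f} kB read")
```

That works out to a little under one part in a million. It says nothing
about *where* the bad sectors fall, of course; if they all landed in one
file, that file is still toast.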


I then proceeded to do some forensics. Removing the disk from the caddy
and plugging it into a real SATA controller allowed me to run smartctl
(from smartmontools) to talk to the disk's diagnostics. It turns out the
disk knew it was failing and had been trying to warn me... but the USB
caddy wasn't passing the information on to me.

[smartmon is your friend. apt-get install smartmontools, and your disks
will tell you if they start to feel unhappy. You also get access to
their on-board diagnostic and self-test procedures. You can, for
example, tell the disks to run self-tests every week and smartd will
mail you the results. Good stuff. But it doesn't work on USB drives.]
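If you'd rather poll the disk yourself than rely on smartd, something
like the following Python sketch might do. It assumes smartctl is
installed and that its health report contains a line of the form "SMART
overall-health self-assessment test result: ..."; parse_health and
check_disk are names I've invented, and /dev/sda in the comment is only a
placeholder.

```python
import re
import subprocess

# Matches the overall-health line in smartctl's report.
HEALTH_RE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)


def parse_health(report):
    """Return the health verdict (e.g. 'PASSED' or 'FAILED'), or None if absent."""
    match = HEALTH_RE.search(report)
    if match is None:
        return None
    return match.group(1).rstrip("!")  # a failing disk shouts "FAILED!"


def check_disk(device):
    """Run 'smartctl -H' on device and return the parsed verdict.

    Usually needs root. Returns None if no verdict line was found, which
    is what I'd expect from a drive behind a USB caddy like mine.
    """
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_health(result.stdout)


# Example (placeholder device name):
#     verdict = check_disk("/dev/sda")
#     if verdict != "PASSED":
#         print("go and buy a new disk")
```

Treat a None result as "don't know", not as "fine".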

The disk, via 'smartctl --all', said:

SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.

[If anyone wants the entire report, if they want to see what a failing
disk looks like, let me know.]

Gee, ya think?

It added that it had run out of remappable sectors, which is what I
expected, and also that it thought the disk had gotten rather hot in
the past (it said 35404104400937 Centigrade, but I think that figure may
not be reliable).

Certainly, when I removed the disk from the caddy, it was almost too hot
to touch, whereas running loose on my desk it's merely warm. The caddy
itself is a sealed plastic box with one set of vents and no air
through-flow. I suspect that the caddy simply managed to bake the drive
and kill it. Porsche may know how to design cars, but I'm damned if I'm
going to buy any more computer kit they've had anything to do with.

The disk says it's 801 hours old. This is not a lot. I got it early this
year and it's been running ever since, so I assume this does not count
spindown time. One of my *other* disks, an elderly 20GB Fujitsu, says
it's been running for 17355 hours...
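For scale, here are those hour counts turned into days, in Python. The
purchase date is an assumption: I've used 1 January 2008 as a stand-in
for "early this year", so the comparison is only a rough upper bound.

```python
from datetime import date

power_on_hours_new = 801      # reported by the failed Seagate
power_on_hours_old = 17355    # reported by the elderly Fujitsu

post_date = date(2008, 4, 14)         # date of this post
assumed_purchase = date(2008, 1, 1)   # ASSUMPTION: earliest "early this year"

# Wall-clock hours the new disk could have been plugged in, at most.
elapsed_hours = (post_date - assumed_purchase).days * 24

new_days = power_on_hours_new / 24
old_days = power_on_hours_old / 24
share = power_on_hours_new / elapsed_hours

print(f"new disk: {new_days:.1f} days powered on")
print(f"old disk: {old_days:.1f} days powered on")
print(f"new disk counted {share:.0%} of {elapsed_hours} possible hours")
```

Under that assumption the counter covers only about a third of the time
the disk has been attached, which fits the guess that spun-down time
isn't counted.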

http://www.cowlark.com
"I have always wished for my computer to be as easy to use as my
telephone; my wish has come true because I can no longer figure out
how to use my telephone." --- Bjarne Stroustrup
