[Nottingham] Large disks and storage

Jim Moore jmthelostpacket at googlemail.com
Wed Sep 9 00:22:20 UTC 2009


There's a thing... a lot of my old clients were under the mistaken
impression that using RAID redundancy actually decreased their chances
of catastrophic hardware failure. I showed them that this isn't the
case, and that in fact the opposite is true; I'll try to reproduce
that demonstration for you here.

Say a disk has an MTBF of 1 million hours (I know, any manufacturer
claiming this would be insane, but it's an arbitrary number and useful
for this experiment). This does not mean that the disk will fail after
1 million hours; it means the chance of it failing in any given hour
is about 1 in a million spin-hours, and those odds shorten /on
average/ by one point per hour of service. So after one thousand hours
of operation, the odds of a failure are 1 in 999,000.
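
A quick Python sketch of that single-disk accounting (the constant
per-hour failure rate is an assumption of the model, not a vendor
spec):

  MTBF = 1000000             # claimed MTBF, in spin-hours
  rate = 1.0 / MTBF          # chance of failure in any single hour

  hours = 1000
  print(MTBF - hours)        # 999000 -- spin-hours left on the budget

  # probability the disk survives those 1000 hours at a constant rate
  print((1 - rate) ** hours) # ~0.999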
Two disks of identical make and model therefore double the chances of
failure, from 1:1,000,000 to 1:500,000, or one failure in five hundred
thousand spin-hours. After one thousand hours of operation, the odds
shorten to 1:498,000. This is not a typo: the failure odds are
calculated for each disk in the array and then summed for the array,
so every thousand hours in a two-disk array actually counts as two
thousand hours of service.
Similarly, for a ten-disk array, the initial chance of failure (CoF)
is set at 1 in 100,000 (or one failure every 11.4 years). Every hour
of use moves the array's CoF up another ten notches, one per disk. So
after a thousand hours, the ten-disk array is down to a 1 in 90,000
risk of failure, and at ten thousand hours (or 1.14 years) the whole
100,000-hour budget is spent: you're living on borrowed time.
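
The same accounting generalised in Python (this just follows the
simplified sum-the-odds model above, where each array hour burns one
spin-hour per disk):

  MTBF = 1000000                    # per-disk MTBF, in spin-hours
  HOURS_PER_YEAR = 24 * 365.25

  def remaining_odds(n_disks, hours):
      # array budget starts at MTBF / n_disks, and every array hour
      # consumes n_disks spin-hours of that budget
      return MTBF / n_disks - n_disks * hours

  print(remaining_odds(2, 1000))    # 498000.0
  print(remaining_odds(10, 1000))   # 90000.0
  print(remaining_odds(10, 10000))  # 0.0 -- borrowed time from here

  print(100000 / HOURS_PER_YEAR)    # ~11.4 years between failures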

Practically speaking, a single disk lasts, on average, about 4 years.
Taking this to be the MTBF, a ten-disk array would be expected to
suffer a unit loss every 146 days. Anybody planning or currently
running a RAID should either be aware of this situation or have
already planned for such an eventuality.
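
Checking that figure with the same divide-by-N rule:

  disk_mtbf_hours = 4 * 365.25 * 24        # a 4-year average life
  array_mtbf_hours = disk_mtbf_hours / 10  # ten disks share the budget
  print(array_mtbf_hours / 24)             # ~146.1 days between losses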


