[Nottingham] FYI: HDD disk failure statistics paper

Mon Feb 19 23:13:14 GMT 2007

Folks,

For anyone who may be interested:

A brief-ish summary of the main points from
http://216.239.37.132/papers/disk_failures.pdf

####
Our key findings are:

      • Contrary to previously reported results, we found very little
correlation between failure rates and either elevated temperature or
activity levels.

      • Some SMART parameters (scan errors, reallocation counts, offline
reallocation counts, and probational counts) have a large impact on
failure probability.

      • Given the lack of occurrence of predictive SMART signals on a
large fraction of failed drives, it is unlikely that an accurate
predictive failure model can be built based on these signals alone.

The observed range of AFRs [Annualized failure rates] (see Figure 2)
varies from 1.7%, for drives that were in their first year of operation,
to over 8.6%, observed in the 3-year old population. The higher baseline
AFR for 3 and 4 year old drives is more strongly influenced by the
underlying reliability of the particular models in that vintage than by
disk drive aging effects. It is interesting to note that our 3-month,
6-months and 1-year data points do seem to indicate a noticeable
influence of infant mortality phenomena, with 1-year AFR dropping
significantly from the AFR observed in the first three months.
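
[Aside, not from the paper: the excerpt above does not spell out how an
AFR is computed. Below is a minimal sketch, assuming the common
definition of failures per drive-year of exposure. The function name and
the example numbers are made up for illustration.]

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return failures per drive-year of exposure, as a percentage.

    Assumes the usual definition: AFR = failures / (drive_days / 365).
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical numbers, NOT from the paper: 1000 drives each observed
# for 90 days, with 6 failures in that window.
print(round(annualized_failure_rate(6, 1000 * 90), 2))
```

[Scaling by exposure is what lets a 3-month window be compared with a
1-year or 3-year population on the same footing.]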

Failure rates are known to be highly correlated with drive models,
manufacturers and vintages [18]. Our results do not contradict this
fact. For example, Figure 2 changes significantly when we normalize
failure rates per each drive model. Most age-related results are
impacted by drive vintages. However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage due to the
proprietary nature of these data. Interestingly, this does not change
our conclusions. In contrast to age-related results, we note that all
results shown in the rest of the paper are not affected significantly by
the population mix. None of our SMART data results change significantly
when normalized by drive model. The only exception is seek error rate,
which is dependent on one specific drive manufacturer...

...we expected to notice a very strong and consistent correlation between
high utilization and higher failure rates. However our results appear to
paint a more complex picture. First, only very young and very old age
groups appear to show the expected behavior. After the first year, the
AFR of high utilization drives is at most moderately higher than that of
low utilization drives. The three-year group in fact appears to have the
opposite of the expected behavior, with low utilization drives having
slightly higher failure rates than high utilization ones. One possible
explanation for this behavior is the survival of the fittest theory. It
is possible that the failure modes that are associated with higher
utilization are more prominent early in the drive’s lifetime.

...failures do not increase when the average temperature increases. In
fact, there is a clear trend showing that lower temperatures are
associated with higher failure rates. Only at very high temperatures is
there a slight reversal of this trend. Figure 5 looks at the average
temperatures for different age groups. The distributions are in sync
with Figure 4 showing a mostly flat failure rate at mid-range
temperatures and a modest increase at the low end of the temperature
distribution. What stands out are the 3 and 4-year old drives, where the
trend for higher failures with higher temperature is much more constant
and also more pronounced. [Min failures at "sweet spot" of 30 - 46 deg C]

...we see a drastic and quick decrease in survival probability after the
first scan error (left graph). A little over 70% of the drives survive
the first 8 months after their first scan error.

[Reallocation counts] ...survival probability after the first
reallocation. We truncate the graph to 8.5 months, due to a drastic
decrease in the confidence levels after that point. In general, the left
graph shows that about 85% of the drives survive past 8 months after the
first reallocation. The effect is more pronounced
(middle graph) for drives in the age ranges [10,20) and [20, 60] months,
while newer drives in the range [0,5) months suffer more than their next
generation. This could again be due to infant mortality effects,
although it appears to be less drastic in this case than for scan
errors. After their first reallocation, drives are over 14 times more
likely to fail within 60 days than drives without reallocation counts,
making the critical threshold for this parameter also one.

After the first offline reallocation, drives have over 21 times higher
chances of failure within 60 days than drives without offline
reallocations; an effect that is again more drastic than total
reallocations.
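
[Aside, not from the paper: the "N times more likely to fail within 60
days" figures read as relative risks. Below is a minimal sketch of that
ratio, assuming it is simply the 60-day failure fraction of drives with
the signal divided by that of drives without it. The function name and
the counts are invented for illustration and are not the paper's data.]

```python
def relative_risk(failed_with: int, total_with: int,
                  failed_without: int, total_without: int) -> float:
    """Ratio of failure fractions: drives with a signal vs. drives without."""
    if total_with <= 0 or total_without <= 0:
        raise ValueError("group sizes must be positive")
    if failed_without <= 0:
        raise ValueError("need at least one failure in the baseline group")
    risk_with = failed_with / total_with
    risk_without = failed_without / total_without
    return risk_with / risk_without


# Invented counts: 70 of 1000 drives with a reallocation fail within
# 60 days, against 50 of 10000 drives without one.
print(round(relative_risk(70, 1000, 50, 10000), 1))
```

[A large ratio can still go with a small absolute risk: in the invented
numbers above, 93% of the flagged drives are still alive after 60 days.]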

[Power cycles] ...for drives aged up to two years, this is true: there is no significant
correlation between failures and high power cycles count. But for drives
3 years and older, higher power cycle counts can increase the absolute
failure rate by over 2%. We believe this is due more to our population
mix than to aging effects. Moreover, this correlation could be the
effect (not the cause) of troubled machines that require many repair
iterations and thus many power cycles to be fixed.

Vibration: This is not a parameter that is part of the SMART set, but it
is one that is of general concern in designing drive enclosures as most
manufacturers describe how vibration can affect both performance and
reliability of disk drives. Unfortunately we do not have sensor
information to measure this effect directly for drives in service.

[Failure Prediction] ...even when we add all remaining SMART parameters
(except temperature) we still find that over 36% of all failed drives
had zero counts on all variables. This population includes seek error
rates, which we have observed to be widespread in our population (> 72%
of our drives have them), which further reduces the sample size of
drives without any errors.

We conclude that it is unlikely that SMART data alone can be effectively
used to build models that predict failures of individual drives. SMART
parameters still appear to be useful in reasoning about the aggregate
reliability of large disk populations.
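
[Aside, not from the paper: a quick arithmetic check on why the zero
count figure caps prediction accuracy. It assumes a predictor that can
only flag a drive once at least one SMART counter is non-zero; the
function name is made up for illustration.]

```python
def max_recall(fraction_failed_without_signal: float) -> float:
    """Upper bound on the share of failures a signal-only predictor can catch.

    Drives that fail with every counter at zero can never be flagged,
    so they are subtracted from the best achievable recall.
    """
    if not 0.0 <= fraction_failed_without_signal <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 - fraction_failed_without_signal


# "Over 36%" of failed drives had zero counts, so 0.36 is a lower bound
# on the missed share and the result is an upper bound on recall.
print(round(max_recall(0.36), 2))
```

[So even a predictor with no false alarms would miss more than a third
of the failures under that assumption.]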

In our study, we did not find much correlation between failure rate and
either elevated temperature or utilization. This is the most surprising
result of our study. Our annualized failure rates were generally higher
than those reported by vendors, and more consistent with other user
experience studies.

Conclusions

We find, for example, that after their first scan error, drives are 39
times more likely to fail within 60 days than drives with no such
errors. First errors in reallocations, offline reallocations, and
probational counts are also strongly correlated to higher failure
probabilities. Despite those strong correlations, we find that failure
prediction models based on SMART parameters alone are likely to be
severely limited in their prediction accuracy, given that a large
fraction of our failed drives have shown no SMART error signals whatsoever.
####

Sorry, big summary, but interesting points there. (For me at least!)

Cheers,
Martin

-- 
----------------
Martin Lomas
martin at ml1.co.uk
----------------