[Gllug] Software Raid 5 MD0 just stopped working

Ken Smith kens at kensnet.org
Sun Apr 22 15:50:00 UTC 2012


I'm helping a friend with an old FC6 system I set up for him ages ago.

It has a Logical Volume made from MD0 and MD1, which in turn are two 
three-disk RAID 5 sets.

One day MD0 decided not to play any more. When I looked at the system, 
MD0 was no longer mentioned in /proc/mdstat, and the VG was showing that 
it was made of an unknown device and MD1.
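
A check for an array dropping out of /proc/mdstat could be scripted so it 
gets noticed sooner. This is only a rough sketch in Python: the function 
names are made up for illustration, and it assumes array lines in 
/proc/mdstat start with the device name followed by a colon (for example 
"md1 : active raid5 ...").

```python
import re


def active_md_arrays(mdstat_text):
    """Return the set of md device names listed in /proc/mdstat-style text."""
    # Array lines begin at column 0 with the device name and a colon.
    return {m.group(1) for m in re.finditer(r"^(md\d+)\s*:", mdstat_text, re.MULTILINE)}


def missing_arrays(mdstat_text, expected):
    """Return the expected array names that are not present, sorted."""
    return sorted(set(expected) - active_md_arrays(mdstat_text))
```

Something like missing_arrays(open("/proc/mdstat").read(), ["md0", "md1"]) 
run from cron would have returned ["md0"] on this machine once the array 
had gone.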

I reassembled MD0 and the RAID appeared to be happy: it re-established 
the uuid of MD0 and the LV was found again, but the ext3 filesystem on 
the LV was a shambles. It's all backed up, so it can all be put back.

The machine runs smartctl -a on its disks daily and I have the records 
of that going back for over a year. MD0 is made of two Western Digital 
500G's and a Seagate 500G. All the smartctl data looks fine, except 
that the Seagate is showing:-

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       70478277
   3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       126904381
   9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       34946
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   069   060   045    Old_age   Always       -       554958879
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (Lifetime Min/Max 0/13)
195 Hardware_ECC_Recovered  0x001a   060   057   000    Old_age   Always       -       167912025
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

The Raw Read Error Rate drew my attention, but a year ago it showed:-

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       70478277
   3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       22
   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       112009888
   9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       26049
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       22
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   067   060   045    Old_age   Always       -       588513313
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (Lifetime Min/Max 0/14)
195 Hardware_ECC_Recovered  0x001a   060   057   000    Old_age   Always       -       18314776
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Pretty similar.
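
Comparing the two tables by eye is error prone, so here is a rough sketch 
in Python of how the daily records could be compared. It assumes the 
ten-column attribute layout shown above; the function names are invented 
for illustration and are not part of smartmontools.

```python
import re

# One attribute row as printed in the tables above: ID, name, flag,
# value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(text):
    """Return {id: (name, value, worst, thresh, raw)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the "ID# ATTRIBUTE_NAME" header and other lines do not match
            attrs[int(m.group(1))] = (
                m.group(2),
                int(m.group(4)),
                int(m.group(5)),
                int(m.group(6)),
                m.group(10).strip(),
            )
    return attrs


def changed_attributes(old_text, new_text):
    """Return (id, name, old, new) for attributes whose numbers differ."""
    old = parse_attributes(old_text)
    new = parse_attributes(new_text)
    changes = []
    for attr_id in sorted(new):
        # Compare value, worst, thresh and raw; skip the name at index 0.
        if attr_id in old and old[attr_id][1:] != new[attr_id][1:]:
            changes.append((attr_id, new[attr_id][0], old[attr_id], new[attr_id]))
    return changes


def at_or_below_threshold(text):
    """Return names of attributes whose normalized VALUE is <= THRESH."""
    return [a[0] for a in parse_attributes(text).values() if a[1] <= a[3]]
```

Feeding it the two saved outputs lists only the rows that moved, and 
at_or_below_threshold() picks out any attribute the drive itself would 
count as failed.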

I'm trying to fathom why MD0 just packed up and went home. Nothing in the 
log files to give a clue.


Any ideas/suggestions?


Thanks

Ken






--
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



