[Wolves] Software RAID reconstruction

chris procter chris-procter at talk21.com
Tue Aug 26 19:48:36 UTC 2008


I think you're wrong :)

Yes, sdd1 (scsi0: destination target 3, lun 0) is failing, but sdc1 *must* have been part of the
array, otherwise what is sdd rebuilding from?
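
You can check that directly, by the way: mdadm --examine reads the md superblock straight off a
member partition, so something like the following (just a sketch, using the device names from your
mail) should show whether sdc1 carries the same array UUID as md1:

# dump the superblock sdc1 is carrying
mdadm --examine /dev/sdc1

# compare the UUID it reports against the array itself
mdadm --detail /dev/md1 | grep UUID

If the two UUIDs match then sdc1 has been a member of md1, whatever state it is in right now.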

I don't understand the log messages you're getting about sdc not being fresh, but the mdadm
--detail output seems normal. I think the spare it's reporting is the rebuilding disk (sdd in your
case), because when I fail a device in my array I get (ignore the odd devices :)

[root at north ~]# mdadm --manage -f /dev/md0 /dev/gnbd0
mdadm: set /dev/gnbd0 faulty in /dev/md0
[root at north ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 gnbd0[2](F) loop0[1]
      2097088 blocks [2/1] [_U]

unused devices: <none>
[root at north ~]# mdadm /dev/md0 -r /dev/gnbd0
mdadm: hot removed /dev/gnbd0
[root at north ~]# mdadm /dev/md0 -a /dev/gnbd0
mdadm: re-added /dev/gnbd0
[root at north ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 gnbd0[0] loop0[1]
      2097088 blocks [2/1] [_U]
      [>....................]  recovery =  0.3% (8640/2097088) finish=12.0min speed=2880K/sec

unused devices: <none>
[root at north ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Aug 19 20:36:36 2008
     Raid Level : raid1
     Array Size : 2097088 (2048.28 MiB 2147.42 MB)
    Device Size : 2097088 (2048.28 MiB 2147.42 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 26 20:25:29 2008
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 2% complete

           UUID : 8e243d95:2bfd2c02:6fdcd544:e077fdcb
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0     252        0        0      spare rebuilding   /dev/gnbd0
       1       7        0        1      active sync   /dev/loop0


<time passes/>

[root at north ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Aug 19 20:36:36 2008
     Raid Level : raid1
     Array Size : 2097088 (2048.28 MiB 2147.42 MB)
    Device Size : 2097088 (2048.28 MiB 2147.42 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 26 20:39:26 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 8e243d95:2bfd2c02:6fdcd544:e077fdcb
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0     252        0        0      active sync   /dev/gnbd0
       1       7        0        1      active sync   /dev/loop0


Do you get the same log messages earlier in the logs? Could it just be a weird log message you get
while the array is starting up at boot time?
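
One more thing you could look at (again just a sketch with your device names): the "fresh"
decision is made from the event counters in the superblocks, so comparing the two members should
show whose counter had fallen behind at boot:

# compare the event counters md uses to decide which member is fresh
mdadm --examine /dev/sdc1 | grep Events
mdadm --examine /dev/sdd1 | grep Events

The member with the lower count is the one md kicks as non-fresh when it assembles the array.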

If it is just a boot-time quirk, then what you're seeing is sdd rebuilding and then failing again,
either when it writes to that block or during some sort of integrity check on finishing the
rebuild. Replace sdd and you'll be fine.
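
When the new disk turns up, the swap is the same fail/remove/add dance as my gnbd0 example above,
just with real hardware in the middle. A sketch, assuming the new sdd gets partitioned the same
way as sdc:

# fail and remove the dying member
mdadm --manage /dev/md1 --fail /dev/sdd1
mdadm --manage /dev/md1 --remove /dev/sdd1

# after physically swapping the disk, copy the partition layout across from sdc
sfdisk -d /dev/sdc | sfdisk /dev/sdd

# add the new partition back in and watch the resync
mdadm --manage /dev/md1 --add /dev/sdd1
watch cat /proc/mdstat

md should start rebuilding onto the new sdd1 straight away.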

Of course I could be wrong about you being wrong :)

chris



--- Adam Sweet <adam at adamsweet.org> wrote:

> Hi everyone
> 
> I'm having issues with a machine running software RAID and I'd like a
> second opinion as I'm not especially familiar with it when it doesn't
> work as expected.
> 
> Essentially what happened as far as I can tell, though I might be wrong,
> is the following.
> 
> The machine has 2 software RAID arrays, md0 and md1. md1 is made up of
> sdc1 and sdd1:
> 
> [root at jet ~]# cat /proc/scsi/scsi
> Attached devices:
> Host: scsi0 Channel: 00 Id: 00 Lun: 00
>   Vendor: MAXTOR   Model: ATLAS10K5_147SAS Rev: BP00
>   Type:   Direct-Access                    ANSI SCSI revision: 05
> Host: scsi0 Channel: 00 Id: 01 Lun: 00
>   Vendor: MAXTOR   Model: ATLAS10K5_147SAS Rev: BP00
>   Type:   Direct-Access                    ANSI SCSI revision: 05
> Host: scsi0 Channel: 00 Id: 02 Lun: 00
>   Vendor: MAXTOR   Model: ATLAS10K5_147SAS Rev: BP00
>   Type:   Direct-Access                    ANSI SCSI revision: 05
> Host: scsi0 Channel: 00 Id: 03 Lun: 00
>   Vendor: MAXTOR   Model: ATLAS10K5_147SAS Rev: BP00
>   Type:   Direct-Access                    ANSI SCSI revision: 05
> Host: scsi0 Channel: 00 Id: 08 Lun: 00
>   Vendor: DP       Model: BACKPLANE        Rev: 1.00
>   Type:   Enclosure                        ANSI SCSI revision: 05
> 
> [root at jet ~]# cat /proc/mdstat
> Personalities : [raid1]
> md1 : active raid1 sdc1[2] sdd1[1]
>       142327744 blocks [2/1] [_U]
>       [==>..................]  recovery = 12.1% (17309056/142327744)
> finish=109.2min speed=19076K/sec
> md0 : active raid1 sdb1[1] sda1[0]
>       142327744 blocks [2/2] [UU]
> 
> At around 5am one morning, a write to sdd1 failed:
> 
> Jun 30 04:57:54 jet kernel: mptscsi: ioc0: attempting task abort!
> (sc=f0afd840)
> Jun 30 04:57:54 jet kernel: scsi0 : destination target 3, lun 0
> Jun 30 04:57:54 jet kernel:         command = Write (10) 00 0a 9a c2 97
> 00 00 a8 00
> Jun 30 04:57:58 jet kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> Jun 30 04:57:58 jet kernel: mptscsi: ioc0: task abort: SUCCESS (sc=f0afd840)
> Jun 30 04:58:08 jet kernel: mptscsi: ioc0: attempting task abort!
> (sc=f0afd840)
> Jun 30 04:58:08 jet kernel: scsi0 : destination target 3, lun 0
> Jun 30 04:58:08 jet kernel:         command = Test Unit Ready 00 00 00 00 00
> Jun 30 04:58:12 jet kernel: mptbase: ioc0: LogInfo(0x31140000):
> Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> Jun 30 04:58:12 jet kernel: mptscsi: ioc0: task abort: SUCCESS (sc=f0afd840)
> Jun 30 04:58:12 jet kernel: mptscsi: ioc0: attempting task abort!
> (sc=f2570380)
> Jun 30 04:58:12 jet kernel: scsi0 : destination target 3, lun 0
> Jun 30 04:58:12 jet kernel:         command = Write (10) 00 0a b7 d2 6f
> 00 00 08 00
> Jun 30 04:59:03 jet kernel: ): rejecting I/O to dead device
> Jun 30 04:59:03 jet kernel: scsi): rejecti): reje): re): rejecting I/O
> to dead device
> Jun 30 04:59:04 jet kernel: scsi0 (3:0): rejecting I/O to dead device
> 
> The system became unresponsive and had to be rebooted. On reboot, sdc1
> was kicked from the array for not being 'fresh':
> 
> Jun 30 08:43:27 jet kernel: md: kicking non-fresh sdc1 from array!
> Jun 30 08:43:27 jet kernel: md: unbind<sdc1>
> Jun 30 08:43:27 jet kernel: md: export_rdev(sdc1)
> 
> The array could not rebuild as sdc1 could not be added:
> 
> Jun 30 08:43:27 jet kernel: md: md1: raid array is not clean -- starting
> background reconstruction
> Jun 30 08:43:27 jet kernel: raid1: raid set md1 active with 1 out of 2
> mirrors
> Jun 30 08:43:28 jet kernel: md: Autodetecting RAID arrays.
> Jun 30 08:43:28 jet kernel: md: autorun ...
> Jun 30 08:43:28 jet kernel: md: considering sdc1 ...
> Jun 30 08:43:28 jet kernel: md:  adding sdc1 ...
> Jun 30 08:43:28 jet kernel: md: md1 already running, cannot run sdc1
> Jun 30 08:43:28 jet kernel: md: export_rdev(sdc1)
> Jun 30 08:43:28 jet kernel: md: ... autorun DONE.
> Jun 30 08:43:31 jet kernel: md: Autodetecting RAID arrays.
> Jun 30 08:43:31 jet kernel: md: autorun ...
> Jun 30 08:43:31 jet kernel: md: considering sdc1 ...
> Jun 30 08:43:31 jet kernel: md:  adding sdc1 ...
> Jun 30 08:43:31 jet kernel: md: md1 already running, cannot run sdc1
> Jun 30 08:43:31 jet kernel: md: export_rdev(sdc1)
> Jun 30 08:43:31 jet kernel: md: ... autorun DONE.
> 
> I manually marked sdc1 as failed, then removed it from the array and
> then reinserted it:
> 
> [root at jet ~]# mdadm --manage --set-faulty /dev/md1 /dev/sdc1
> [root at jet ~]# mdadm /dev/md1 -r /dev/sdc1
> [root at jet ~]# mdadm /dev/md1 -a /dev/sdc1
> 
> Array md1 then resynced successfully as far as I can tell, but sdd1 has
> been throwing disk errors for the same disk block and the array has been
> continually rebuilding ever since. On completion, it detects the bad
> block again and rebuilds the array again:
> 
> Jul  3 12:28:24 jet kernel: md: md1: sync done.
> Jul  3 12:28:51 jet kernel: scsi0: ERROR on channel 0, id 3, lun 0, CDB:
> Read (10) 00 10 f7 7d bf 00 00 80 00
> Jul  3 12:28:51 jet kernel: Info fld=0x10f77e0e, Current sdd: sense key
> Medium Error
> Jul  3 12:28:51 jet kernel: Additional sense: Unrecovered read error -
> recommend rewrite the data
> Jul  3 12:28:51 jet kernel: end_request: I/O error, dev sdd, sector
> 284655104
> Jul  3 12:28:51 jet kernel: raid1: sdd: unrecoverable I/O read error for
> block 284654976
> Jul  3 12:28:51 jet kernel: md: syncing RAID array md1
> Jul  3 12:28:51 jet kernel: md: minimum _guaranteed_ reconstruction
> speed: 1000 KB/sec/disc.
> Jul  3 12:28:51 jet kernel: md: using maximum available idle IO bandwith
> (but not more than 200000 KB/sec) for reconstruction.
> Jul  3 12:28:51 jet kernel: md: using 128k window, over a total of
> 142327744 blocks.
> 
> Sample disk errors on sdd (note the same disk sector):
> 
> /var/log/messages.3:Jul  3 12:28:51 jet kernel: raid1: sdd:
> unrecoverable I/O read error for block 284654976
> /var/log/messages.3:Jul  3 16:27:44 jet kernel: raid1: sdd:
> unrecoverable I/O read error for block 284654976
> /var/log/messages.3:Jul  3 18:59:46 jet kernel: raid1: sdd:
> unrecoverable I/O read error for block 284654976
> /var/log/messages.3:Jul  3 21:58:31 jet kernel: raid1: sdd:
> unrecoverable I/O read error for block 284654976
> 
> We have some new disks arriving today or tomorrow, so I was looking at
> removing sdd so I could rebuild md1 against sdc; however, it appears that
> sdc1 never became an active part of the array:
> 
> [root at jet ~]# mdadm --detail /dev/md1
> /dev/md1:
>         Version : 00.90.01
>   Creation Time : Sat Jan 13 08:24:54 2007
>      Raid Level : raid1
>      Array Size : 142327744 (135.73 GiB 145.74 GB)
>     Device Size : 142327744 (135.73 GiB 145.74 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>     Update Time : Tue Aug 26 09:50:52 2008
>           State : clean, degraded, recovering
>  Active Devices : 1
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 1
> 
> 
>  Rebuild Status : 19% complete
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0       -1      removed
>        1       8       49        1      active sync   /dev/sdd1
>        2       8       33        0      spare   /dev/sdc1
>            UUID : cb59fcb5:d61a134b:3e799237:ab0c3b4f
>          Events : 0.25533468
> 
> It seems to have been a spare disk in the array. I just failed, removed
> and re-added the disk again; however, it is still a spare while md1
> continues to cycle through rebuilding and detecting a bad disk block.
> 
> Having read the Software RAID Howto, 2 things occur to me:
> 
> 1) The howto is a bit out of date. Nobody should be using raidtools
> these days as far as I am aware.
> 2) It doesn't take this scenario into account. There are no details on
> making a spare device active. I assume it should have happened
> automatically but didn't because md1 was already rebuilding itself. What
> it was rebuilding from, I don't know.
> 
> Can anybody shed some more light on what actually happened to get me
> here and how I should proceed? I assume the worst case is to get all of
> the data off and start again with a new pair of disks, but we'll have to
> wait a few weeks for hardware with enough capacity to take all of the
> virtual machines running on there.
> 
> Apologies for the large amount of logs pasted in.
> 
> Adam
> 
> _______________________________________________
> Wolves LUG mailing list
> Homepage: http://www.wolveslug.org.uk/
> Mailing list: Wolves at mailman.lug.org.uk
> Mailing list home: https://mailman.lug.org.uk/mailman/listinfo/wolves
> 

