[Durham] SMART errors

David Leggett david+lists+durhamlog at asguard.org.uk
Sun Nov 9 16:09:40 UTC 2014


Hi Olly

The disk is dead; you should replace it.
You should be very glad your OS had SMART monitoring turned on :)
In Debian the package is smartmontools.
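For reference, getting the monitoring daemon running on a Debian box is roughly 
this (the service name and defaults file can vary a little between releases):
apt-get install smartmontools
then set start_smartd=yes in /etc/default/smartmontools and
service smartmontools start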



You can pull out all the SMART details from a disk by doing
smartctl -a /dev/sd?
and you can run some tests by doing
smartctl -t long /dev/sd?
or
smartctl -t short /dev/sd?

The test will run in the background on the disk controller.
Once the test is finished you can get the results of the last few tests by 
doing
smartctl -l selftest /dev/sda
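
Since smartd is complaining about pending/uncorrectable sectors, the two raw 
attributes to watch are 197 (Current_Pending_Sector) and 198 
(Offline_Uncorrectable); something like this pulls just those out:
smartctl -A /dev/sdb | egrep -i 'pending|uncorrect'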


Looks like you are definitely running a 2-disk raid 1 here.
MD will have noticed the read errors from the broken disk and evicted it from 
the array.
I assume these are 2TB disks, in which case yes, you have a 2-disk raid 1; if 
it were a 3-disk raid5 the capacity would be showing ~4TB.
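You can see exactly what md thinks happened to the array with:
mdadm --detail /dev/md0
and the kernel log should show the read errors that got the disk kicked out:
dmesg | grep -i -e md0 -e sdb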

Assuming /dev/sdc isn't going to die any time soon you can leave it as a single 
disk in the raid 1 array, otherwise I suggest adding /dev/sdd to the array:
sfdisk -d /dev/sdc | sfdisk /dev/sdd
mdadm --add /dev/md0 /dev/sdd1
Give this a few (many?) hours to resync.
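If you would rather have a command that simply blocks until the resync is 
finished (handy in a script), mdadm can do that, assuming your version has 
--wait:
mdadm --wait /dev/md0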

When you have three working disks again you can migrate to a raid 5 array by 
pulling one of the disks from the raid 1 and making a degraded raid5 array.
Once you have copied the data across to this new array you can add the last 
disk to complete it.
Assuming /dev/sdc1 and /dev/sdd1 are the raid1 array (md0) and that /dev/sdb is 
the third disk:

Remove /dev/sdd1 from the raid1:
mdadm --fail /dev/md0 /dev/sdd1
mdadm --remove /dev/md0 /dev/sdd1
mdadm --zero-superblock /dev/sdd1

Create a 3 disk raid 5 array in degraded mode (if /dev/sdb is a fresh 
replacement disk, give it a partition first, e.g. with sfdisk as above):
mdadm -C /dev/md1 -l 5 -n 3 /dev/sd[bd]1 missing

Check the array status:
cat /proc/mdstat

You can format the new array and start copying data immediately if you like; 
however, since the array will still be initializing it is probably faster to 
wait for that to complete.

Format and copy data:
mkfs.whatever /dev/md1

mkdir /mnt/md1
mount /dev/md1 /mnt/md1

cp -a /mnt/md0/. /mnt/md1/    (assuming the old raid1 is mounted at /mnt/md0)
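If you prefer a copy that shows progress and can be restarted, rsync does the 
same job (assuming rsync 3.1+ for the progress option):
rsync -aHAX --info=progress2 /mnt/md0/ /mnt/md1/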
Alternatively, if your old raid1 is very full and is running a filesystem that 
can be extended (ext4, reiserfs, btrfs, others) it could be faster to do a 
block level copy and then extend the FS.

umount /dev/md0
sync
dd if=/dev/md0 of=/dev/md1 bs=1M
e2fsck -f /dev/md1
resize2fs /dev/md1    (or the grow tool for whichever filesystem you use)
mount /dev/md1 /mnt/md1
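
dd is silent while it runs; if you want to see throughput you can wedge pv 
into the pipeline (assuming the pv package is installed):
dd if=/dev/md0 bs=1M | pv | dd of=/dev/md1 bs=1M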

Now destroy the old raid1 and add the remaining single disk to the raid5:
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sdc1

mdadm --add /dev/md1 /dev/sdc1

The array will start rebuilding again, generating stripes and parity data from 
the other two disks.

watch cat /proc/mdstat
Wait for it to complete.
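
One last bit of housekeeping worth doing on Debian before you call it done: 
record the new array in mdadm.conf (removing the old md0 line if there is one), 
rebuild the initramfs so it assembles at boot, and fix up /etc/fstab if md0 was 
listed there:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u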

Job done!

You probably want to make a backup before doing this though. YMMV, and while I 
have done this process myself, there is always the possibility of giving mdadm 
the wrong device name and watching your data get annihilated.


HTH

David

On Sunday 09 Nov 2014 15:33:15 Oliver Burnett-Hall wrote:
> Does anyone know much about SMART?
> 
> I've got my backups going to a system which is running a three-disk
> RAID5 array. For the last few days I've been receiving warnings from
> smartd about problems with one of the disks and from mdadm about a
> degraded array. I'm a bit clueless about all of this and Google hasn't
> given me any idiot-friendly primers for what's happening.
> 
> First off, the SMART errors. I'm getting these two messages from smartd
> repeated in syslog every half hour:
>     Device: /dev/sdb [SAT], 14 Currently unreadable (pending) sectors
>     Device: /dev/sdb [SAT], 14 Offline uncorrectable sectors
> 
> That seems fairly clear: /dev/sdb is failing. However when I run any
> self-tests (both short and long) the drive passes them. Surely that
> can't be right?
> 
> The messages from mdadm are even less helpful. It says:
>     A DegradedArray event had been detected on md device /dev/md0.
>     Faithfully yours, etc.
>     P.S. The /proc/mdstat file currently contains the following:
>     Personalities : [raid1]
>     md0 : active raid1 sdc1[1]
>         1953279872 blocks super 1.2 [2/1] [_U]
>     unused devices: <none>
> 
> That is confusing me. I set up a RAID5 array across sdb, sdc and sdd
> but now it's talking about a RAID1 only on sdc. If sdb has died should
> it not still be showing a degraded RAID5 on sdc and sdd?
> 
> The drive is still well within its warranty period so there's the
> obvious solution of getting a replacement drive off WD, but I'm curious
> to understand what's happening here.
> 
> - olly
> 
> 
> 
> _______________________________________________
> Durham mailing list   -   Durham at mailman.lug.org.uk
> https://mailman.lug.org.uk/mailman/listinfo/durham
> http://www.nelug.org.uk/



