[GLLUG] Advice on expanding an existing RAID 1 installation

Andy Smith andy at bitfolk.com
Sat May 24 15:11:08 UTC 2014

Hi John,

On Sat, May 24, 2014 at 11:12:56AM +0100, John Winters wrote:
> The Linux RAID Wiki has a page on extending an existing RAID array,
> but says I should start by removing one of my existing drives.
> Presumably this isn't necessary if I have a spare physical
> slot/controller slot/power lead?  Rather than their proposed approach
> of remove=>add=>remove=>add, is there any reason I can't do
> add=>remove=>add=>remove?

That should work and would have the advantage of never having a
degraded array – you'd grow to a three-way RAID-1, which would not
be degraded when you took it down to two devices again.
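Sketched with mdadm, assuming your array is /dev/md0 and the new
disk's partition is /dev/sdc1 (those names are illustrative – check
yours in /proc/mdstat first):

```shell
# Add the new device -- it joins the array as a spare at first.
mdadm /dev/md0 --add /dev/sdc1

# Grow the array to three active devices; md rebuilds onto the new
# disk while the original two remain fully in sync.
mdadm --grow /dev/md0 --raid-devices=3

# Wait for the resync to finish before touching the old drives.
cat /proc/mdstat

# Now retire one of the original devices...
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

# ...and shrink back to a two-device (non-degraded) mirror.
mdadm --grow /dev/md0 --raid-devices=2
```

Then repeat the add/grow/fail/remove cycle for the second old drive.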

> Second question - I'm undecided about whether to put the 3rd drive
> into the system to start with (thus three drives up to date all the
> time) or keep it on a shelf for when it's needed.  Does anyone have
> strong views on which is better?

What do you want to achieve? It's really a question of convenience
and resilience versus the cost of the extra power usage.

If your drive is on the shelf and there's a failure of one of the
existing ones, can you remove the failed drive and insert this one
relatively quickly? Can you do it without turning the machine off?
Are either of those things important?

If the third drive was in the array making it a 3-way RAID-1 then it
would be very resilient – losing one device would have no real
effect. If the third drive was in the machine as a spare then it
would be pretty resilient – it would be immediately put into the
array and rebuilt onto if there were a failure. Both of those
scenarios cause it to use power though, which may or may not be an
issue for you.
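For the hot-spare variant, a minimal sketch (again assuming /dev/md0
and /dev/sdc1):

```shell
# On a two-device RAID-1 that already has its full complement of
# active members, --add leaves the new device as a hot spare; md
# only rebuilds onto it when an active member fails.
mdadm /dev/md0 --add /dev/sdc1

# The spare shows up with the "(S)" marker in /proc/mdstat and as
# "spare" in the detail output:
mdadm --detail /dev/md0
```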

The failure mode that you would be guarding against with a 3-way
RAID-1 would be additional failures during rebuild: in a conventional
two device RAID-1, when a fresh drive is inserted the entirety of
the other drive must be read and written to the new one. If there is
an unreadable sector discovered on the other device then data is
lost and the array is broken. You'd most likely be able to force it
to re-assemble, but the damaged sector(s) would be lost.
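The forced re-assembly would look something like this (device names
assumed, as before):

```shell
# Stop the broken array, then assemble with --force so md accepts
# members whose event counts no longer agree.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

# Check what came back; any sectors that were unreadable during the
# failed rebuild are still lost.
mdadm --detail /dev/md0
```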

The larger drive capacities are pushing against typical unrecoverable
read error rates, so this is a bit of a concern.
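As a rough back-of-the-envelope illustration of that concern – the
4 TB capacity and the commonly quoted consumer spec of one
unrecoverable read error per 10^14 bits are my assumptions, not
figures from this thread:

```python
import math

drive_bytes = 4 * 10**12   # 4 TB drive (assumed for illustration)
ure_per_bit = 1e-14        # typical consumer-class spec-sheet rate
bits_read = drive_bytes * 8

# Probability of hitting at least one URE over a full-drive read,
# modelling errors as independent per bit.
p_ure = 1 - math.exp(-ure_per_bit * bits_read)
print(f"P(at least one URE during rebuild) ~ {p_ure:.0%}")
```

At spec-sheet rates a full rebuild read of a 4 TB drive has roughly
a one-in-four chance of tripping over a bad sector, which is why
the extra copy matters.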

With three devices, obviously you've always got two copies of the
data even when one device fails, so it pushes back the probabilities
of data loss considerably.

You could investigate other RAID levels but none of them really make
that much sense with only three devices.

> The machine in question is only switched on for a couple of hours
> each day.  It switches itself on, backs up a selection of other
> systems, then switches itself off again.

It sounds like neither power usage nor the inconvenience of
replacing devices would be much of a factor for you. Maybe the extra
resilience is worth it then?

As James mentioned, if you were in the mood for a radical rethink
then you could use btrfs. There's also ZFS on Linux. Either of those
would help avoid the phantom read error problem by keeping checksums
of files and automatically reconstructing them from the good copies.
You still need to have enough good copies and to read the actual
files to discover the problem, though.
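Both filesystems expose that check explicitly as a scrub, which
reads and verifies every block rather than waiting for a file to be
read. A sketch, with the mount point and pool name assumed:

```shell
# btrfs: walk all data and metadata, verify checksums, and repair
# from the other mirror copy where one is still good.
btrfs scrub start /mnt/backup
btrfs scrub status /mnt/backup

# ZFS on Linux equivalent:
zpool scrub tank
zpool status tank
```

Running one of these periodically (many people cron it monthly)
finds latent bad sectors while there is still a good copy to repair
from.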


http://bitfolk.com/ -- No-nonsense VPS hosting
