[Nottingham] I knew I'd forget something last night (filesystems and partitions)

Andy Smith andy at bitfolk.com
Wed May 20 14:41:32 UTC 2020


Hello,

On Wed, May 20, 2020 at 01:56:47AM +0100, Martin via Nottingham wrote:
> btrfs for everything else for both SSD and HDD;

In early 2014 I decided to evaluate btrfs by using it on my home
file server.

On the whole it has been a mixed bag. The feature set is very nice.
By far my favourite thing is being able to repurpose mismatched
types and capacities of storage from my day job into my home file
server and trust that btrfs will mirror it across devices.
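For anyone who hasn't tried it, building that kind of mirrored pool
out of whatever disks are to hand really is pleasantly simple. A
rough sketch, with made-up device names and mount point (the devices
can all be different sizes):

    # Keep two copies of both data and metadata, spread across the
    # member devices (names are examples only).
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
    mount /dev/sdb /srv/tank

    # More odd-sized disks can be added later and the data rebalanced.
    btrfs device add /dev/sde /srv/tank
    btrfs balance start /srv/tank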

My main frustration with btrfs is that its approach to availability
absolutely sucks, and it still has some really bad bugs. That is why
I do not recommend its use, will not use it anywhere else, and at
times regret using it in my home file server.

Basically: yes, you can use multiple devices for redundancy, and at
home that is indeed my primary reason for doing so, because I don't
want a device failure to stop things from working while I source a
replacement.

But so many failure modes of btrfs currently require you to reboot,
or at least remount the filesystem, which is effectively the same
deal.

In 2014:

    An HDD died; I tried to delete it from the filesystem with the
    btrfs commands, but the process locked up and made no progress.
    I had to reboot, build a new kernel, and then move the data onto
    a new device.

    https://strugglers.net/~andy/blog/2014/08/08/whats-my-btrfs-doing/

2018:

    Another HDD failure rendered the system unavailable because the
    btrfs went read-only. It was not possible to fix without
    rebooting and mounting degraded (sketched below).

    https://strugglers.net/~andy/blog/2018/06/15/another-disappointing-btrfs-experience/
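For anyone unfamiliar, "mounting degraded" means something like this
after the reboot, because btrfs refuses a normal mount while a member
device is missing (device and mount point names are examples only):

    # Mount read-write with a device missing, so that the dead device
    # can then be removed and its data re-mirrored.
    mount -o degraded /dev/sdb /srv/tank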

This sort of thing has happened several more times in the six or so
years this system has been on btrfs.

Sometimes I have been lucky: a failing device gives me time to
insert another one and do a "btrfs device replace" to get data off
the failing one and onto the new one before anything gets too upset.
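In btrfs-progs that is the "replace" subcommand; with example device
names and mount point it looks something like:

    # Copy everything from the failing device onto its replacement
    # while the filesystem stays mounted and in use.
    btrfs replace start /dev/sdc /dev/sdf /srv/tank

    # Check how far it has got.
    btrfs replace status /srv/tank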

Most of the time, though, failures pile up very quickly and the
dying device gets disabled by Linux, and then it's going to be
reboot time at some point.

Today it has happened to me again. An HDD started failing; within a
couple of hours it was kicked out by Linux, and now it's not
possible to remove it from the btrfs without a remount. I'm lucky
that the entire fs hasn't gone read-only.

    https://twitter.com/grifferz/status/1263084531583782913
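For reference, the removal that's blocked right now is normally just
the following ("missing" names the absent device; the mount point is
an example):

    # Drop the dead device from the filesystem so its data gets
    # re-mirrored onto the remaining devices; in my experience this
    # tends to need the remount, often with -o degraded, first.
    btrfs device remove missing /srv/tank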

Frequently, when I mention these sorts of issues to btrfs fans, they
tell me that they have never encountered them; that I must be
unlucky or doing something wrong; or that my Linux distribution is
too old and I should be running the latest mainline kernel and
btrfs-tools, both self-compiled.

It's true that my home file server hasn't always been as up to date
as it could be. At the moment it does run Debian 10 though (the
latest stable release). If the latest stable release of Debian isn't
considered new enough, then the fs isn't ready for mainstream use
yet imho.

But it isn't just me. From the day I started using btrfs I have been
subscribed to the linux-btrfs list, and I don't recall a month going
by in the last six years where that list hasn't seen people
reporting availability problems, and even data loss, caused by bugs
on stable operating system releases. Just have a look:

    https://www.mail-archive.com/linux-btrfs@vger.kernel.org/

If I weren't massively lazy I would move this system to zfs. That's
probably what will happen the first time I suffer actual data loss.
I can't recommend btrfs for any use, unfortunately, especially not
when it's given more than one block device.

It would certainly be a shame to have to give up recycling my
years-old, odd-sized HDDs and SSDs, but one incident of having to
restore everything from backups will enrage me enough to do it.

LVM on top of MD is rather unexciting, but I will note that it's
been years since I have seen any kind of software bug reported on
linux-raid that actually caused loss of availability or data. There
are always reports of data loss, sure, but they are always down to
severe hardware failure or user error. The mdadm maintainers and
list regulars often help people recover from even the most bizarre
of these.
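For anyone who hasn't run that boring stack, it's only a handful of
commands to set up. A minimal sketch, with made-up device, VG and LV
names:

    # Mirror a pair of partitions with MD (names are examples only).
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

    # Put LVM on top so the space can be carved up and resized later.
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 500G -n data vg0

    # Any ordinary filesystem then goes on the logical volume.
    mkfs.ext4 /dev/vg0/data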

A recent example of the sort of thing they help with: I did not know
that apparently ASRock motherboards since at least 2015 like to
REWRITE YOUR PARTITION TABLE AT EVERY BOOT if they don't see a valid
one. Since you don't need a partition table to run RAID, that bites
ASRock owners from time to time.

    https://marc.info/?l=linux-raid&m=158910882519940&w=2

With some help they got it back.
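Presumably those arrays were on the bare disks, so the firmware saw
no partition table and wrote one over the start of them. One way to
sidestep that class of problem is to give each disk a partition
table and build the array on a partition instead, something like
(hypothetical devices):

    # Give each disk a GPT with a single partition covering it,
    # flagged as Linux RAID, then build the array on the partitions
    # (as in the earlier sketch) so firmware always sees a valid
    # partition table.
    parted -s /dev/sda mklabel gpt mkpart md0 1MiB 100% set 1 raid on
    parted -s /dev/sdb mklabel gpt mkpart md0 1MiB 100% set 1 raid on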

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting


