[Wolves] Storage advice.

Chris Ellis chris at intrbiz.com
Tue Jan 16 12:25:38 UTC 2024


>> > Hi,
>> >
>> > Odd position I'm in at day job.
>> >
>> > I have a VM (VMware) running Oracle Linux 9, with two 50TB VMDKs attached.
>> >
>> > The issue is that I need to span them, and it'll store a lot of tiny files (circa 80TB in 2-3MB files), with a high daily rate of change.
>> >
>> > Choosing the right method to achieve this is proving a challenge, as I've got to get this going this week, so testing is limited.
>>
>> What is proving challenging?
>>
>> >
>> > I can't have a single large VMDK due to a 62TB limit in VMware.
>>
>> Can you go around that and just iSCSI direct into the VM?
>>
>> >
>> > I can just use ext4 and LVM, or Oracle Linux likes Btrfs. ZFS is provisionally out unless I get a third 50TB disk for RAID-Z1.
>>
>> How come XFS is not an option?
>>
>> IMHO it would likely be the best fit for this situation.  Great
>> performance, especially for parallel allocations, metadata split into
>> allocation groups, and good handling of small files.  In my experience
>> it's very robust, with good recovery options, and no need for
>> periodic fsck.
>>
>> Also, XFS has really good trim and debug tracing support, and you can
>> align the allocation groups to the underlying RAID stripes.
>>
>> >
>> > Opinions?
>>
>> I assume the VMDKs are backed onto a SAN or similar?
>>
>> In which case they are essentially within the same failure domain,
>> therefore I would (sketches of each below):
>>   a) Just stripe them in RAID 0 with MDRAID
>>   b) Just use LVM to do a PV span
>>   c) Use BTRFS's built-in RAID and map both devices into a single BTRFS filesystem
>>
>> If you don't really trust the virtual disks, I'd suggest you mount
>> more, smaller volumes and RAID 5 them inside the VM.
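
To make (a) and (c) concrete, something like the below, assuming the two
VMDKs show up as /dev/sdb and /dev/sdc (device names here are made up):

  # (a) Stripe the two disks together with MDRAID (RAID 0)
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc

  # (c) One Btrfs across both devices: data striped, metadata mirrored
  mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc

  # If you went with more, smaller virtual disks instead, RAID 5 inside
  # the VM would be along these lines (four disks purely as an example):
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]

Option (b), the plain LVM span, is sketched further down where Simon
describes doing exactly that.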
>>
>> We used to run our databases on AWS ephemeral disks, all in XFS +
>> MDRAID 0, treating the VM as a single failure domain.
>>
>> BTRFS does at least have checksumming, which can be helpful for
>> detecting errors.  However, running out of space is a PITA,
>> and recovering from data corruption (caused by bad RAM) has been 50-50 for me.
>>
>> If performance is your main problem, I'd go XFS + MDRAID 0, increase
>> the number of allocation groups, and stripe-align them.
>>
>> >
>> > Or I can do some basic tests and report results if anyone is interested?
>> >
>>
>
> This is pretty much what I ended up doing: just spanning them via an LVM VG and using XFS.
> I don't necessarily have to worry about one of the disks failing, as they're effectively on the same storage (or at least the same SAN), so in 99% of situations both would fail at the same time.
> That plus the data on there is important, but not critical.
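
For the record, the LVM VG span + XFS route you describe comes out to
roughly this, again with made-up device, VG/LV and mount point names:

  # Put both disks into one volume group and span a single LV across them
  pvcreate /dev/sdb /dev/sdc
  vgcreate data_vg /dev/sdb /dev/sdc
  lvcreate -n data_lv -l 100%FREE data_vg

  # XFS on top, then mount it
  mkfs.xfs /dev/data_vg/data_lv
  mount /dev/data_vg/data_lv /srv/data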

If you're struggling with write performance:
  1) RAID 0 the two VMDKs with MDRAID
  2) Increase the number of allocation groups at XFS filesystem
creation; XFS can perform allocations in different AGs in parallel, so it
can often give a boost on lots-of-small-files workloads.
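
As a sketch, and not something to copy verbatim (chunk size, agcount and
device names are assumptions; check the array's real chunk size first):

  # RAID 0 the two VMDKs; chunk is in KiB, 512 is the current mdadm default
  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 /dev/sdb /dev/sdc

  # XFS with more AGs than the default, stripe-aligned to the array
  # (su = MD chunk size, sw = number of data disks in the stripe)
  mkfs.xfs -d agcount=200,su=512k,sw=2 /dev/md0

mkfs.xfs will normally pick the stripe geometry up from the MD device by
itself, so the su/sw above is mostly belt and braces; and since XFS caps
each AG at 1TiB, a ~100TB filesystem already gets roughly 100 AGs by
default, so agcount=200 just pushes that a bit further.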

>
> Doing weird and wonderful things like splitting it into smaller VMDKs, using ZFS or similar just seemed like an extra layer of complexity, especially as write speed is my only concern.
>
> Performance is as you'd expect for a normal NFS server, though the storage is all SSD, so at least it's not spinning-platter speeds. But testing has only been via a couple of test boxes, not under full load.

Are you bound by network IO or NFS server IO?
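
If it helps narrow that down, the standard sysstat and nfs-utils tools
are usually enough to see which side saturates first:

  # Per-device disk utilisation and latency on the NFS server
  iostat -xz 5

  # NIC throughput, to compare against the link speed
  sar -n DEV 5

  # Server-side NFS operation counts
  nfsstat -s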

>
> I've now got the joy of migrating the 70-odd clients and the existing 45TB of data over to it.
>
> Thank you everyone for your ideas though.
>
> Thanks,
> Simon.

Chris


