[Nottingham] RAID stripe size?

Thu Sep 29 11:01:18 UTC 2011

Sergiusz,

On 29 September 2011 11:10, Sergiusz Pawlowicz <sergiusz at pawlowicz.name> wrote:
> and another post, sorry :)
>
> stripe size is pretty well explained at
> http://tldp.org/HOWTO/Software-RAID-0.4x-HOWTO-8.html - par. 8

Thanks for that one... Long forgotten! Unfortunately, it is rather
dated in that its examples use ext2...

A good explanatory part is:

###
Assuming that the small files are statistically well distributed
around the filesystem, (and, with the ext2fs file system, they should
be), roughly N times more overlapping, concurrent reads should be
possible without significant collision between them. Conversely, if
very small stripes are used, and a large file is read sequentially,
then a read will be issued to all of the disks in the array. For the
read of a single large file, the latency will almost double, as the
probability of a block being 3/4'ths of a revolution or farther away
will increase. Note, however, the trade-off: the bandwidth could
improve almost N-fold for reading a single, large file, as N drives
can be reading simultaneously (that is, if read-ahead is used so that
all of the disks are kept active). But there is another,
counter-acting trade-off: if all of the drives are already busy
reading one file, then attempting to read a second or third file at
the same time will cause significant contention, ruining performance
as the disk ladder algorithms lead to seeks all over the platter.
Thus, large stripes will almost always lead to the best performance.
The sole exception is the case where one is streaming a single, large
file at a time, and one requires the top possible bandwidth, and one
is also using a good read-ahead algorithm, in which case small stripes
are desired.

Note that this HOWTO previously recommended small stripe sizes for
news spools or other systems with lots of small files. This was bad
advice, and here's why: news spools contain not only many small files,
but also large summary files, as well as large directories. ... If
this directory is spread across several stripes (several disks), the
directory read (e.g. due to the ls command) could get very slow.
Thanks to Steven A. Reisman < sar at pressenter.com> for this correction.
Steve also adds:

    I found that using a 256k stripe gives much better performance. I
suspect that the optimum size would be the size of a disk cylinder (or
maybe the size of the disk drive's sector cache). However, disks
nowadays have recording zones with different sector counts (and sector
caches vary among different disk models). There's no way to guarantee
stripes won't cross a cylinder boundary.

The tools accept the stripe size specified in KBytes. You'll want to
specify a multiple of the page size for your CPU (4KB on the x86).
###
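
As a rough illustration of the "multiple of the page size" rule quoted
above, here is a minimal sketch. The function name chunk_is_page_multiple
is made up for this example, and the 4096-byte page is passed explicitly
so the result does not depend on the machine it runs on (leave the second
argument off to use whatever getconf reports):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a chunk size given in KiB is a whole
# multiple of the page size in bytes (defaults to the running system's).
chunk_is_page_multiple() {
    chunk_kib=$1
    page_bytes=${2:-$(getconf PAGESIZE)}
    [ $(( (chunk_kib * 1024) % page_bytes )) -eq 0 ]
}

# Try a few candidate chunk sizes against a 4096-byte page.
for chunk in 4 6 64 256; do
    if chunk_is_page_multiple "$chunk" 4096; then
        echo "${chunk}k: multiple of a 4096-byte page"
    else
        echo "${chunk}k: NOT a multiple of a 4096-byte page"
    fi
done
```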

One point of interest here is that all new HDDs as of January this
year are manufactured with the new 4kByte physical sectors rather than
the old 512Byte sectors. They transparently maintain 512Byte sector
compatibility by performing a read-modify-write as needed, but at a
high performance cost...

See:
http://en.wikipedia.org/wiki/Advanced_format
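
To see what a given drive reports, a minimal sketch that reads the
logical and physical sector sizes from sysfs on Linux. The function name
sector_sizes is made up here, and the optional second argument (an
alternative sysfs root) exists only so the function can be tried against
a fake directory tree. Treat the reported physical size as a hint rather
than a guarantee, since it is only what the drive chooses to advertise:

```shell
#!/bin/sh
# Hypothetical helper: print the sector sizes the kernel reports for a
# block device name such as "sda".
sector_sizes() {
    dev=$1
    root=${2:-/sys/block}
    queue="$root/$dev/queue"
    if [ ! -r "$queue/logical_block_size" ]; then
        echo "no such device: $dev" >&2
        return 1
    fi
    logical=$(cat "$queue/logical_block_size")
    physical=$(cat "$queue/physical_block_size")
    echo "$dev: logical=$logical physical=$physical"
    # A physical size larger than the logical size means small or
    # misaligned writes may cost a read-modify-write inside the drive.
    if [ "$physical" -gt "$logical" ]; then
        echo "$dev: sub-${physical}-byte writes may need read-modify-write"
    fi
}

# Report on every sd* device present (does nothing if there are none).
for path in /sys/block/sd*; do
    [ -d "$path" ] || continue
    sector_sizes "${path##*/}" || true
done
```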

SSDs typically have write pages (the equivalent of HDD sectors) of
anything from 4kByte to 16kByte, or larger...

So... From that little lot:

My impression at the moment is that stripe size (chunk size) for
mechanical HDDs is a compromise to mitigate head (cylinder and
rotation) seek times and to lessen the effect of crossing cylinder
boundaries. That is not a concern for SSDs (they have minimal 'seek'
times)...

So... Is the read-modify-write the highest performance penalty now?
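
To make the read-modify-write case concrete, a small sketch assuming
4096-byte physical sectors. The function name needs_rmw is made up for
this example; the two starting points are the old DOS-style partition
start at sector 63 and a 1MiB-aligned start at sector 2048, both counted
in 512-byte sectors:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a write of LENGTH bytes at byte OFFSET
# would only partly cover some physical sector, so that the drive has to
# read the sector, modify it and write it back.
needs_rmw() {
    offset=$1
    length=$2
    phys=${3:-4096}
    [ $(( offset % phys )) -ne 0 ] || [ $(( length % phys )) -ne 0 ]
}

# A 4096-byte filesystem block written at the very start of a partition.
for start_sector in 63 2048; do
    if needs_rmw $(( start_sector * 512 )) 4096; then
        echo "start sector $start_sector: 4096-byte write needs read-modify-write"
    else
        echo "start sector $start_sector: 4096-byte write is aligned"
    fi
done
```

With these assumptions the sector 63 start is misaligned (63 * 512 is not
a multiple of 4096), while the sector 2048 start is aligned.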

Trying to guess some real-world numbers:

For HDDs, depending on application, chunks of
64kByte to 256kByte;

For SSDs I'd guess go right down to the write page size of 4kBytes.
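
For anyone wanting to try those guesses, a sketch of how the chunk size
might be passed to mdadm, which takes --chunk in kibibytes. Using mdadm,
RAID level 0, /dev/md0 and the /dev/sdX, /dev/sdY placeholders are all
assumptions for illustration; the helper (md_create_cmd, a made-up name)
only prints the command line and does not touch any disk:

```shell
#!/bin/sh
# Hypothetical helper: print, but do not run, an mdadm command line for a
# striped array with the given chunk size in KiB and member devices.
md_create_cmd() {
    chunk_kib=$1
    shift
    echo "mdadm --create /dev/md0 --level=0 --chunk=$chunk_kib --raid-devices=$# $*"
}

# The two ends of the HDD range guessed above, then the 4k SSD guess.
md_create_cmd 64 /dev/sdX /dev/sdY
md_create_cmd 256 /dev/sdX /dev/sdY
md_create_cmd 4 /dev/sdX /dev/sdY
```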

Note that for SSDs, the 4kByte read/write performance is usually given
as /the/ headline (best) spec. Also, in the performance charts I've
seen, SSDs max out on their read/write data rate at about the 1kByte
IO size in any case.

And then... For btrfs for SSDs, that suggests formatting with:

mkfs.btrfs -L SSD_label_name -s 4096 /dev/sdX

or for 16kByte write pages (assuming the kernel will accept a sector
size larger than the CPU page size):

mkfs.btrfs -L SSD_label_name -s 16384 /dev/sdX

(And btrfs can implement its own "RAID" at the filesystem level)

Thoughts?

Cheers,
Martin