[Gllug] Raw Partitions

Peter Grandi pg_gllug at gllug.for.sabi.co.UK
Wed Nov 16 20:06:31 UTC 2005


>>> On Wed, 16 Nov 2005 17:08:59 +0000, Steve Nelson
>>> <sanelson at gmail.com> said:

[ ... ]

sanelson> I find it unlikely that raw devices are deprecated [
sanelson> ... ]

It is quite recent, only from the end of May 2005, for 2.6:

  http://WWW.USSG.IU.edu/hypermail/linux/kernel/0505.2/1387.html
  http://WWW.LinuxHQ.COM/kernel/v2.6/13/Documentation/feature-removal-schedule.txt

   «+What:   RAW driver (CONFIG_RAW_DRIVER)
    +When:   December 2005
    +Why:   declared obsolete since kernel 2.6.3
    +   O_DIRECT can be used instead
    +Who:   Adrian Bunk <bunk at stusta.de>»

sanelson> The current situation is a GFS 6.0 cluster, and raw
sanelson> devices, wrapped as described above, are used for the
sanelson> quorum data.

[ ... ]

sanelson> I've got hold of sg_dd for the read/writes, but need
sanelson> to create a raw partition for experimentation.

>> It is hard for me to see the relationship between 'sg_dd' and
>> «raw partition»s...

sanelson> Well, due to 'bugs' in dd, according to the raw
sanelson> manual, one should use sg_dd to perform read/write
sanelson> operations on raw partitions.

The 'raw'(8) man page says more precisely and technically:

  http://WWW.die.net/doc/linux/man/man8/raw.8.html

   «The Linux dd (1) command does not currently align its
    buffers correctly, and so cannot be used on raw devices.»

Indeed the 'raw'(8) page does not mention 'sg_dd' at all (even
though the 'sg_dd'(8) page does mention 'raw').

Now, that's not a 'bug', just a limitation, and I have provided
several links to versions of 'dd' that do not have that
limitation and are not 'sg_dd'.

As to 'sg_dd', its primary purpose, like that of the rest of
'sgutils', is to do IO by issuing direct commands ('SGIO') to
SCSI[-like] devices, which also requires aligned buffers:

  http://sg.Torque.net/sg/sg_dd.html

   «The sg_dd utility is specialized for devices that use the
    SCSI command set in the Linux operating system.»

   «It is becoming common for non-SCSI devices (e.g. ATA disks)
    to appear as SCSI devices to an operating system via a
    protocol conversion in an external enclosure and via some
    transport such as USB or IEEE 1394. The sg_dd utility should
    work with most of these devices as it tends to use exactly
    the same SCSI commands that the normal block layer would
    use.»

   «Raw devices were introduced in the lk 2.4 series and are
    still available in the lk 2.6 series but opening a block
    device or a normal file with O_DIRECT (in sg_dd using the
    option 'odir=1') is preferred in the lk 2.6 series.»

   «If either the input or output file is a raw device, or
    'odir=1' is given then the internal buffers used by sg_dd
    are aligned to a memory page boundary. A memory page is 4
    kilobytes in the i386 architecture. This memory alignment is
    required by both raw devices and normal block devices that
    implement O_DIRECT.» 

In other words, there is no necessary relationship between
'sg_dd' and 'raw'(8) and not being telepathic I was uncertain
whether your question was more about 'sg_dd'/'SGIO' as such or
really about 'raw'(8), as they are entirely different subjects.

sanelson> My aim was to attempt to perform read/write
sanelson> operations in a like manner to those performed when
sanelson> starting clumanager on a cluster node.

Using something like 'dd'? Of course you know better, but I
reckon that the IO patterns of a cluster system and those of
something like 'dd' are nowhere alike, and any crashes in the
former case are likely due to timing-dependent bugs that
sequential reading is unlikely to uncover.

Also, it transpires below that the 'raw' device is most likely
bound to a GFS pool, and is perhaps used with AIO, which is
quite different from binding it to another type of block device
and doing sequential sync IO to it...

[ ... ]

sanelson> I'm not especially keen on repartitioning my test
sanelson> machine so I can create a raw device. [ ... ]

[ ... ]

sanelson> If by this you are (somewhat pedantically) arguing
sanelson> over whether these are 'real' raw devices, or simply
sanelson> wrapper-like bindings,

Note that in the above you used «repartitioning», thus sort of
implying that one has to create a special ''raw partition'',
whereas someone who had read 'raw'(8) would have known that one
can bind a raw device to _any_ existing block device.

So, for example, if the test system has a swap partition, you
can experiment with the raw device wrapper by binding one raw
device to such a partition (if you want to write to it too, just
disable swapping on that swap block device).
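For instance (a sketch only, which must be run as root, and which
assumes the swap partition is '/dev/hda2'):

```shell
# Stop the kernel from using the partition for swapping first.
swapoff /dev/hda2

# Bind raw device 1 to the (now idle) swap partition, and query
# all current bindings to check.
raw /dev/raw/raw1 /dev/hda2
raw -qa

# ... experiment with /dev/raw/raw1 here ...

# Unbind (by binding to major 0, minor 0) and re-enable swapping.
raw /dev/raw/raw1 0 0
swapon /dev/hda2
```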

>> Also, the 'raw' command and raw device wrappers are deprecated,
>> depending on the kernel release (and did not seem to have much
>> of an effect when I tried them, and seem to be little tested).

sanelson> Ah - ok - they may be unused on 2.6 kernels - I am
sanelson> using 2.4 kernels at present, and GFS 6.0 relies upon
sanelson> them. This is a heavily used and well-supported
sanelson> configuration, so perhaps deprecated is a slightly
sanelson> strong word.

Well, the administrator manual for GFS 6.0 is here:

  http://WWW.RedHat.com/docs/manuals/csgfs/pdf/rh-gfs-en-6_0.pdf

and there is no mention I can find of the raw device wrapper (or
of a quorum) therein, while in section 9.7.1 on page 103 there
is a mention of 'O_DIRECT' and a couple of related options.
From doing some web searching it looks like GFS has supported
'O_DIRECT' since 5.x, and 'O_DIRECT' has been available since
Linux 2.4.4 or whereabouts:

  http://WWW.USSG.IU.edu/hypermail/linux/kernel/0104.1/0810.html
  http://WWW.RedHat.com/docs/manuals/csgfs/admin-guide/5.2.1/s1-manage-direct-io.html

But probably I am missing something here... Indeed, it seems
likely that the quorum is the _Oracle_ quorum, as one can read
in section 2.6, page 16 of «Installing and Configuring Oracle9i
RAC with GFS 6.0»:

  http://WWW.RedHat.com/docs/manuals/csgfs/pdf/rh-gfsico-en-6_0.pdf

where the raw devices are bound _on top_ of GFS pools:

  http://WWW.RedHat.com/docs/manuals/csgfs/oracle-guide/s1-gfs-filesys.html

  «5. Bind the raw devices on each node to the GFS pool raw
      devices; that is, bind /dev/raw/raw1 to /dev/pool/oraraw1
      and bind /dev/raw/raw2 to /dev/pool/oraraw2.»

and in section 3.3, page 17 there are settings that suggest that
async IO is being done, perhaps to those raw devices:

  http://WWW.RedHat.com/docs/manuals/csgfs/oracle-guide/s1-ora-nodes.html

If so, it looks like the problem is really about a combination
of bleeding edge stuff, like (perhaps) AIO to raw devices
wrapping GFS pools.

BTW, interesting article about these issues, from the design and
performance point of view, here:

  http://WWW.VLDB2005.org/program/paper/wed/p1116-hall.pdf

sanelson> [ ... ] the idea behind sg_dd

As far as ''raw devices'' are concerned, the only relevant idea
in 'sg_dd' is that it does aligned-buffer IO, and that is just a
minor side effect of alignment being required for 'SGIO' too.

But then several other versions of 'dd' do aligned-buffer IO,
and mentioning 'sg_dd' is a bit confusing, as 'sg_dd' is more or
less the equivalent of 'cdrecord' or 'growisofs -Z', but for
SCSI/ATAPI hard discs instead of CD-R or DVD-R drives.

sanelson> - my only question was whether it was possible to
sanelson> create some kind of virtual device under an already
sanelson> used partition, and bind a raw device to it.

[ ... ]

>> As usual, perhaps it would be helpful if you gave some
>> context for your question, like the purpose behind the
>> question and the configuration of the system (e.g. kernel
>> version etc.).

[ ... ]

sanelson> crashing when it attempts to read quorum data from a
sanelson> raw device, [ ... ]

Then this seems to be your actual problem, not 'sg_dd' and
'raw'(8), and I am not that surprised -- the '/dev/raw/' devices
are a relatively obscure part of the kernel and I would not
expect them to be as thoroughly exercised as they should be;
never mind stacking 'raw'(8) on top of GFS pools...

[ ... ]

sanelson> and I do not believe the list would have been
sanelson> especially interested in the detailed background
sanelson> concerning an Oracle 10 Cluster with a failing node,
sanelson> which my investigations thus far lead me to believe is

You can lead yourself to believe what you prefer -- I reckon
however that it is useful to include small and apparently
irrelevant details like the distribution, the version of that
distribution, the kernel edition and minor version, the
applications causing issues, and the symptoms when doing
specific operations, when asking for help about devices and
drivers, as there are many device and driver issues that are
rather specific to such details.

Probably you should have asked something more useful like:

  ''I have a RHES 3 system with the usual heavily patched RH
    kernel 2.4.21 and when I run Oracle 10 on it over GFS 6.0
    and there are crashes in the Oracle quorum manager using
    'raw'(8) devices wrapping GFS pools, as described in this
    link: [link omitted]. What can I do?

    I am also considering checking whether there are issues with
    'raw'(8) using another PC using some form of 'dd', how can I
    set up that?''

No mention of something very special-purpose like 'sg_dd', which
hints at a completely different set of issues, related to
'SGIO', which are irrelevant here.

The answer would have been:

   * The 'raw' driver is likely to be somewhat unreliable, as
     well as being deprecated.
   * 'O_DIRECT' is an alternative to 'raw'(8), so one could
     check if Oracle can use that instead for the quorum volume,
     but there is the small matter of AIO. Situation murky.
   * Since both 'raw'(8)/'O_DIRECT' and AIO are optimizations,
     and sometimes of dubious value, using neither is probably
     a good fallback option if one has problems.
   * This is really a very particular situation, and I suspect
     that this is one of the cases where having an Oracle and/or
     RedHat support account is pretty useful.

   * Testing 'raw'(8) using 'dd' is probably pointless, as the
     usage pattern is going to be too different, but if you
     really want to do that, bind a 'raw'(8) to a newly created
     temporary GFS pool, and use a 'dd' [list omitted] that does
     aligned buffer IO. You could use a 'losetup' block device
     over a file to simulate a block device, but that would make
     the test even more unrealistic and risky.
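That last variant would look something like this (a sketch only,
needs root; the file name, sizes, and the '/dev/loop0' and
'/dev/raw/raw1' device names are placeholders, and any 'dd'
variant that aligns its buffers could replace 'sg_dd'):

```shell
# Create a backing file and wrap it in a loop block device.
dd if=/dev/zero of=/tmp/rawtest.img bs=1M count=64
losetup /dev/loop0 /tmp/rawtest.img

# Bind a raw device to the loop device and do some page-aligned
# IO through it.
raw /dev/raw/raw1 /dev/loop0
sg_dd if=/dev/zero of=/dev/raw/raw1 bs=4096 count=1024

# Tear it all down.
raw /dev/raw/raw1 0 0
losetup -d /dev/loop0
rm /tmp/rawtest.img
```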

I think that the above is more the level and style of direct,
detailed technical discourse that is useful to have when
discussing nontrivial kernel + application issues.

-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



