[Gllug] Combining SMB and NFS - was Embedded Linux & 1Gbps?
Nix
nix at esperi.org.uk
Wed Oct 17 22:04:58 UTC 2007
On 17 Oct 2007, Anthony Newman told this:
> What is the problem with it? It works for me for remote file access, and
> it even manages rudimentary locking if you need it to. I (amongst
> others) run a load of platforms at work that operate on terabytes of
> storage via NFS for email and web hosting and it never misses a beat :)
Problems, let me count them.
- Statelessness. This is the biggie. Virtually every major problem that
NFS has got can be attributed in some way to the stateless nature of
the protocol.
Because it's profoundly different from the way that essentially every
OS that can do NFS handles filesystems, it's terribly hard to make
NFS act like, well, *any* native filesystem. NFS was crippled to some
extent to allow it to support DOS, but even so it doesn't support even
DOS properly, let alone anything else. Notably:
- Lack of POSIX-compliant open()-and-unlink() semantics. The
`silly-renaming' .nfs-blah kludge is an awful hack with many
not-hard-to-hit corner cases that don't work. (There's a sketch of
the idiom in question after this list.)
- Lots of POSIX atomicity guarantees are broken over NFS. The only
atomic operation that survives is rename(): POSIX guarantees a lot
more. (Again, see the sketch after this list.)
- A nasty choice between `trigger EIOs all over the place that apps
aren't expecting' (soft mounts) and `lock up your apps if the server
goes down' (hard mounts). (This is really a POSIX problem, I suppose,
but if NFS were a distributed protocol, this would be less of an
issue.)
- Massive cache-coherence problems if multiple remote hosts are
accessing the same file simultaneously and one of them is writing to
it. (I'm not sure if this one is fully fixed in NFSv4; they made an
effort.)
- The ESTALE crock. See above.
- Total lack of security. NFSv4 fixes this to some degree by making it
harder to pretend to be someone else.
- Dysfunctional implementations. It's easy to count the NFS client and
server implementations that don't violate the protocol: there are so
very few of them. (This is better than it used to be.) If a server
goes down you're often in real trouble, because an awful lot of
clients don't support forced umount.
- Lack of fh security on Linux. This is another statelessness problem,
in part. Because NFS requires reduction of inode numbers to
`handles', and because NFSv3's handles are often shorter than the
inode numbers in many implementations, the server is left with the
interesting question `did we export this file to this client?'.
Because of NFS's stateless nature it's hit with this question
whenever *anything* happens: every read, every write. Verifying this
by checking the FS is surely way too inefficient (maybe a local cache
of fh validity would help, but this isn't implemented anywhere to my
knowledge) so most implementations simply reverse the inum->fh
reduction (often it's an identity transform, which makes it easy) and
check to see if that inode exists on that fs. Now an attacker can
just guess fhes: if the fh space is only 32 bits wide, an exhaustive
scan is trivial with modern network speeds, exposing the entire
filesystem to the attacker. The only files that aren't readable this
way on NFSv3 or unencrypted NFSv4 are root-owned files with
root_squash on.
Linux has a `fix' for this, subtree_check, which works by encoding
the *directory* into the fh and checking that (which is fast). But
that means a file hardlinked into two directories gets two fhes,
which breaks client apps that try to detect hardlinks by comparing
inums, and leads to even more horrible cache-coherency problems than
NFS already has. As such this fix is deprecated as of nfs-utils
1.1.0. (There's an example exports line after this list.)
- fh revalidation and heaps of upcalls to mountd to ask it if a given
fs is exported to a given host and to get its root handle, over and
over and *over* again. Ameliorated to some degree with caching in the
kernel, but still, ick.
- Locking that doesn't work. There are wide, wide race conditions in
the lockd protocol, which can't be closed without revising it.
(NFSv4 has fixed this, IIRC. There's a sketch of the locking calls
involved after this list.)
Of course lockd isn't stateless. Oops. So much for all the
compromises that were made in order to make NFS stateless: virtually
no current implementation (and certainly none of the useful ones) is
actually stateless.
- Lack of replication. I'm not asking for disconnected operation à la
Coda --- everything is on the Internet now --- but it would be nice
to not have quite so many single points of failure, especially for
things like home directories. Fine-grained replication (`make
lots of copies of *these* files, they're important') would be
even cooler.
- Lack of local caching of unchanging files. This is an implementation
thing, thankfully being fixed in the Linux implementation by cachefs.
It probably can't be done reliably *and* efficiently in NFS<4 without
another bag on the side of the protocol (like lockd is), because you
need to take out a lease on all locally cached files or in some other
way arrange to be notified when they are changed on the server or by
other clients.
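
To make the open()-and-unlink() complaint concrete, here's a minimal
sketch of the idiom that silly-renaming exists to emulate. The path is
an invented example; on a local fs the data just lives on anonymously
until the descriptor is closed, while an NFS client has to fake it
with a `.nfsXXXX' rename:

    /* Anonymous-tempfile idiom: open a file, unlink its name, keep using
     * the descriptor.  POSIX says this works; NFS clients emulate it by
     * silly-renaming the file to .nfsXXXX instead of removing it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/nfs/scratch/tmpfile";  /* invented path */
        char buf[32];

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Remove the name immediately: the open descriptor must keep
         * working regardless. */
        if (unlink(path) < 0) { perror("unlink"); return 1; }

        if (write(fd, "still here\n", 11) != 11) { perror("write"); return 1; }
        if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); return 1; }

        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n < 0) { perror("read"); return 1; }
        buf[n] = '\0';
        printf("read back: %s", buf);

        close(fd);  /* only now does the data (or the .nfsXXXX file) go away */
        return 0;
    }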
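
For the atomicity item, a sketch of the write-then-rename() trick that
relies on the one guarantee NFS does keep: readers see either the old
file or the new one, never a torn half-write. Paths are invented:

    /* Replace a file atomically: write a temporary, fsync it, then
     * rename() it over the old name.  rename() is atomic per POSIX and
     * is about the only atomicity that survives the trip over NFS. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *contents)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        size_t len = strlen(contents);
        if (write(fd, contents, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }  /* flush before the swap */
        close(fd);

        /* The atomic step: the old name points at either the old data or
         * the new data, with no in-between state visible to readers. */
        return rename(tmp, path);
    }

    int main(void)
    {
        if (replace_file("/mnt/nfs/home/example/.plan", "out to lunch\n") < 0) {
            perror("replace_file");
            return 1;
        }
        return 0;
    }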
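
For the subtree_check item, a hypothetical /etc/exports fragment (paths
and addresses invented) showing the option and its replacement; the
semantics are as described above:

    # subtree_check makes the server verify that a file handle really
    # refers to something under the exported subtree, by folding the
    # parent directory into the fh; no_subtree_check skips the check.
    /home/shared    192.168.1.0/24(rw,root_squash,subtree_check)
    /srv/www        192.168.1.0/24(rw,root_squash,no_subtree_check)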
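
For the locking item, a sketch of a plain POSIX byte-range lock. The
client code is identical on a local fs and on NFS, but on an NFSv3
mount the request is shipped off to the bolted-on lockd/statd (NLM)
machinery; the path is an invented example:

    /* Take an exclusive whole-file fcntl() lock, do some work, drop it.
     * Over NFSv3 this goes via the separate NLM protocol; NFSv4 folds
     * locking into the protocol proper. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/nfs/shared/counter", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {
            .l_type   = F_WRLCK,   /* exclusive write lock    */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,         /* 0 means the whole file  */
        };

        /* F_SETLKW blocks until the lock is granted -- assuming lockd,
         * statd, the server and the client all manage to agree. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            perror("fcntl(F_SETLKW)");
            close(fd);
            return 1;
        }

        /* ... critical section: read/modify/write the shared file ... */

        fl.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &fl) < 0) perror("fcntl(F_UNLCK)");
        close(fd);
        return 0;
    }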
NFSv4 fixes a lot of these problems, but is *way* too complex and has
even more crazy compromises to support Windows, including an ACL scheme
which doesn't map onto POSIX ACLs properly, and the crazy global
namespace thing. Perhaps *you* like doing one huge export from a host
and importing it in a single place on every client, but that's almost
wholly useless to me. Bind mounts make it merely annoying, but still,
ick. (As far as I can tell this is useless WebNFS historical crap from
the days when Sun thought that NFSv4 would be just great across the open
Internet and MS was renaming SMB to the Common Internet File System...
but perhaps I am wrong. I've not looked at NFSv4 in years: I'd be
overjoyed to learn that this ridiculous restriction had been lifted and
you could import and export hunks of filesystem freely across the tree,
as you can with NFSv3 and below.)
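
(For what it's worth, the single-namespace arrangement I'm complaining
about looked roughly like this on a Linux server of the day: a
pseudo-root exported with fsid=0 and everything else bind-mounted
underneath it before being exported in turn. Paths and hostnames are
invented; this is a sketch, not a recipe.

    # Hypothetical /etc/exports for the NFSv4 pseudo-root scheme.
    # The real filesystems are bind-mounted under /export first, e.g.:
    #   mount --bind /srv/home     /export/home
    #   mount --bind /srv/projects /export/projects
    /export           *.example.org(ro,fsid=0,no_subtree_check)
    /export/home      *.example.org(rw,no_subtree_check)
    /export/projects  *.example.org(rw,no_subtree_check)

Clients then mount server:/home and so on, relative to that single
root, rather than whatever path the data actually lives at on the
server.)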
At that, NFS does some things right. The biggest is *transparency*: you
can take a local filesystem, or part of one, and export it to remote
hosts without messing around making new filesystems or moving data about
or having to mount it somewhere special on the clients. This is bloody
hard to achieve with network filesystems (and NFSv4 last I checked had
thrown a lot of this advantage away).
Plus, well, it works and is just about Good Enough.
Also, it works better than basically everything else: it's far more
POSIX-compliant and far easier to set up than AFS, enormously more
scalable than Coda, more Unixlike than SMB... but this is more a comment
on the sad state of distributed filesystems than anything else.
(clusterfs is damn good too, although it had fairly high overhead for a
small network wanting replication of a few Gb last I looked. But the Sun
takeover and the rumbles about Sun-only servers I'm hearing these days
are disconcerting. Of course they're rumours and thus of little import,
and you'd hope Sun would have more clue, as the guys who freed NFS (and
the code!) and watched as it took over the world as a direct
consequence. So for now I'm assuming this won't happen, not least
because I suspect clusterfs would lose half its developers if it did.
GFS is good, but like clusterfs it doesn't really serve the same purpose
as a networked filesystem: it's more a SAN one-disk-many-direct-accessors
deal. However, it can be made to serve the same ends.
Both GFS and clusterfs are quite a lot harder to set up than NFS, but
the distributors should take most of the pain out of it.)
I was watching zumastor with interest, but it looks rather dead right
now :(
--
`Some people don't think performance issues are "real bugs", and I think
such people shouldn't be allowed to program.' --- Linus Torvalds