[Gllug] Combining SMB and NFS - was Embedded Linux & 1Gbps?
Nix
nix at esperi.org.uk
Wed Oct 17 22:04:58 UTC 2007
On 17 Oct 2007, Anthony Newman told this:
> What is the problem with it? It works for me for remote file access, and
> it even manages rudimentary locking if you need it to. I (amongst
> others) run a load of platforms at work that operate on terabytes of
> storage via NFS for email and web hosting and it never misses a beat :)
Problems, let me count them.
- Statelessness. This is the biggie. Virtually every major problem that
NFS has got can be attributed in some way to the stateless nature of
the protocol.
Because it's profoundly different from the way that essentially every
OS that can do NFS handles filesystems, it's terribly hard to make
NFS act like, well, *any* native filesystem. NFS was crippled to some
extent to allow it to support DOS, but even so it doesn't support even
DOS properly, let alone anything else. Notably:
- Lack of POSIX-compliant open()-and-unlink() semantics. The
`silly-renaming' .nfs-blah kludge is an awful hack with many
not-hard-to-hit corner cases that don't work. (There's a sketch of
the idiom in question after this list.)
- Lots of POSIX atomicity guarantees are broken over NFS. The only
atomic operation that survives is rename(): POSIX guarantees a lot
more. (Again, see the sketch after this list.)
- A nasty choice between `trigger EIOs all over the place that apps
aren't expecting' (soft mounts) and `lock up your apps if the server
goes down' (hard mounts). (This is really a POSIX problem, I suppose,
but if NFS were a distributed protocol, this would be less of an
issue.)
- Massive cache-coherence problems if multiple remote hosts are
accessing the same file simultaneously and one of them is writing to
it. (I'm not sure if this one is fully fixed in NFSv4; they made an
effort.)
- The ESTALE crock. See above.
- Total lack of security. NFSv4 fixes this to some degree by making it
harder to pretend to be someone else.
- Dysfunctional implementations. It's easy to count the NFS client and
server implementations that don't violate the protocol: there are so
very few of them. (This is better than it used to be.) If a server
goes down you're often in real trouble, because an awful lot of
clients don't support forced umount.
- Lack of fh security on Linux. This is another statelessness problem,
in part. Because NFS requires reduction of inode numbers to
`handles', and because NFSv3's handles are often shorter than the
inode numbers in many implementations, the server is left with the
interesting question `did we export this file to this client?'.
Because of NFS's stateless nature it's hit with this question
whenever *anything* happens: every read, every write. Verifying this
by checking the FS is surely way too inefficient (maybe a local cache
of fh validity would help, but this isn't implemented anywhere to my
knowledge) so most implementations simply reverse the inum->fh
reduction (often it's an identity transform, which makes it easy) and
check to see if that inode exists on that fs. Now an attacker can
just guess fhes: if the fh space is only 32 bits wide, an exhaustive
scan is trivial with modern network speeds, exposing the entire
filesystem to the attacker. The only files that aren't readable this
way on NFSv3 or unencrypted NFSv4 are root-owned files with
root_squash on.
Linux has a `fix' for this, subtree_check, which works by encoding
the *directory* into the fh and checking that (which is fast). But
that means a file hardlinked into two directories gets two fhes,
which breaks client apps that try to detect hardlinks by comparing
inums, and leads to even more horrible cache-coherency problems than
NFS already has. As such this fix is deprecated as of nfs-utils
1.1.0. (There's an example exports line after this list.)
- fh revalidation and heaps of upcalls to mountd to ask it if a given
fs is exported to a given host and to get its root handle, over and
over and *over* again. Ameliorated to some degree with caching in the
kernel, but still, ick.
- Locking that doesn't work. There are wide, wide race conditions in
the lockd protocol, which can't be closed without revising it.
(NFSv4 has fixed this, IIRC. There's a sketch of the locking calls
involved after this list.)
Of course lockd isn't stateless. Oops. So much for all the
compromises that were made in order to make NFS stateless: virtually
no current implementation (and certainly none of the useful ones) is
actually stateless.
- Lack of replication. I'm not asking for disconnected operation à la
Coda --- everything is on the Internet now --- but it would be nice
to not have quite so many single points of failure, especially for
things like home directories. Fine-grained replication (`make
lots of copies of *these* files, they're important') would be
even cooler.
- Lack of local caching of unchanging files. This is an implementation
thing, thankfully being fixed in the Linux implementation by cachefs.
It probably can't be done reliably *and* efficiently in NFS<4 without
another bag on the side of the protocol (like lockd is), because you
need to take out a lease on all locally cached files or in some other
way arrange to be notified when they are changed on the server or by
other clients.
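
To make the open()-and-unlink() complaint concrete, here's a minimal
sketch of the idiom that silly-renaming exists to emulate. The path is
an invented example; on a local fs the data just lives on anonymously
until the descriptor is closed, while an NFS client has to fake it
with a `.nfsXXXX' rename:

    /* Anonymous-tempfile idiom: open a file, unlink its name, keep using
     * the descriptor.  POSIX says this works; NFS clients emulate it by
     * silly-renaming the file to .nfsXXXX instead of removing it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/nfs/scratch/tmpfile";  /* invented path */
        char buf[32];

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Remove the name immediately: the open descriptor must keep
         * working regardless. */
        if (unlink(path) < 0) { perror("unlink"); return 1; }

        if (write(fd, "still here\n", 11) != 11) { perror("write"); return 1; }
        if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); return 1; }

        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n < 0) { perror("read"); return 1; }
        buf[n] = '\0';
        printf("read back: %s", buf);

        close(fd);  /* only now does the data (or the .nfsXXXX file) go away */
        return 0;
    }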
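
For the atomicity item, a sketch of the write-then-rename() trick that
relies on the one guarantee NFS does keep: readers see either the old
file or the new one, never a torn half-write. Paths are invented:

    /* Replace a file atomically: write a temporary, fsync it, then
     * rename() it over the old name.  rename() is atomic per POSIX and
     * is about the only atomicity that survives the trip over NFS. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *contents)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        size_t len = strlen(contents);
        if (write(fd, contents, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }  /* flush before the swap */
        close(fd);

        /* The atomic step: the old name points at either the old data or
         * the new data, with no in-between state visible to readers. */
        return rename(tmp, path);
    }

    int main(void)
    {
        if (replace_file("/mnt/nfs/home/example/.plan", "out to lunch\n") < 0) {
            perror("replace_file");
            return 1;
        }
        return 0;
    }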
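
For the subtree_check item, a hypothetical /etc/exports fragment (paths
and addresses invented) showing the option and its replacement; the
semantics are as described above:

    # subtree_check makes the server verify that a file handle really
    # refers to something under the exported subtree, by folding the
    # parent directory into the fh; no_subtree_check skips the check.
    /home/shared    192.168.1.0/24(rw,root_squash,subtree_check)
    /srv/www        192.168.1.0/24(rw,root_squash,no_subtree_check)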
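
For the locking item, a sketch of a plain POSIX byte-range lock. The
client code is identical on a local fs and on NFS, but on an NFSv3
mount the request is shipped off to the bolted-on lockd/statd (NLM)
machinery; the path is an invented example:

    /* Take an exclusive whole-file fcntl() lock, do some work, drop it.
     * Over NFSv3 this goes via the separate NLM protocol; NFSv4 folds
     * locking into the protocol proper. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/nfs/shared/counter", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {
            .l_type   = F_WRLCK,   /* exclusive write lock    */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,         /* 0 means the whole file  */
        };

        /* F_SETLKW blocks until the lock is granted -- assuming lockd,
         * statd, the server and the client all manage to agree. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            perror("fcntl(F_SETLKW)");
            close(fd);
            return 1;
        }

        /* ... critical section: read/modify/write the shared file ... */

        fl.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &fl) < 0) perror("fcntl(F_UNLCK)");
        close(fd);
        return 0;
    }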
NFSv4 fixes a lot of these problems, but is *way* too complex and has
even more crazy compromises to support Windows, including an ACL scheme
which doesn't map onto POSIX ACLs properly, and the crazy global
namespace thing. Perhaps *you* like doing one huge export from a host
and importing it in a single place on every client, but that's almost
wholly useless to me. Bind mounts make it merely annoying, but still,
ick. (As far as I can tell this is useless WebNFS historical crap from
the days when Sun thought that NFSv4 would be just great across the open
Internet and MS was renaming SMB to the Common Internet File System...
but perhaps I am wrong. I've not looked at NFSv4 in years: I'd be
overjoyed to learn that this ridiculous restriction had been lifted and
you could import and export hunks of filesystem freely across the tree,
as you can with NFSv3 and below.)
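
(For what it's worth, the single-namespace arrangement I'm complaining
about looked roughly like this on a Linux server of the day: a
pseudo-root exported with fsid=0 and everything else bind-mounted
underneath it before being exported in turn. Paths and hostnames are
invented; this is a sketch, not a recipe.

    # Hypothetical /etc/exports for the NFSv4 pseudo-root scheme.
    # The real filesystems are bind-mounted under /export first, e.g.:
    #   mount --bind /srv/home     /export/home
    #   mount --bind /srv/projects /export/projects
    /export           *.example.org(ro,fsid=0,no_subtree_check)
    /export/home      *.example.org(rw,no_subtree_check)
    /export/projects  *.example.org(rw,no_subtree_check)

Clients then mount server:/home and so on, relative to that single
root, rather than whatever path the data actually lives at on the
server.)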
At that, NFS does some things right. The biggest is *transparency*: you
can take a local filesystem, or part of one, and export it to remote
hosts without messing around making new filesystems or moving data about
or having to mount it somewhere special on the clients. This is bloody
hard to achieve with network filesystems (and NFSv4 last I checked had
thrown a lot of this advantage away).
Plus, well, it works and is just about Good Enough.
Also, it works better than basically everything else: it's far more
POSIX-compliant and far easier to set up than AFS, enormously more
scalable than Coda, more Unixlike than SMB... but this is more a comment
on the sad state of distributed filesystems than anything else.
(clusterfs is damn good too, although it had fairly high overhead for a
small network wanting replication of a few Gb last I looked. But the Sun
takeover and the rumbles about Sun-only servers I'm hearing these days
are disconcerting. Of course they're rumours and thus of little import,
and you'd hope Sun would have more clue, as the guys who freed NFS (and
the code!) and watched as it took over the world as a direct
consequence. So for now I'm assuming this won't happen, not least
because I suspect clusterfs would lose half its developers if it did.
GFS is good, but like clusterfs it doesn't really serve the same purpose
as a networked filesystem: it's more a SAN one-disk-many-direct-accessors
deal. However, it can be made to serve the same ends.
Both GFS and clusterfs are quite a lot harder to set up than NFS, but
the distributors should take most of the pain out of it.)
I was watching zumastor with interest, but it looks rather dead right
now :(
--
`Some people don't think performance issues are "real bugs", and I think
such people shouldn't be allowed to program.' --- Linus Torvalds