[Wolves] Tinkle tinkle little disk...

Andy Smith andy at lug.org.uk
Thu Jan 3 17:18:24 GMT 2008


Hi,

I have had some thoughts along these lines myself from time to time.

On Thu, Jan 03, 2008 at 04:53:15PM +0000, Stuart Langridge wrote:
> 1. rsyncness
> Because bandwidth is small (see point 3), the system should not back
> up all my tagged files every night: instead, it should only back up
> changes. The rsync program already knows how to do this; it can look
> at a file and transfer only the bits which changed since the last
> backup, which is great. However, the file on the backup "server"
> (which is actually, say, Adam's PC) is encrypted. A small change to
> the "real" file (on my machine) will result in a very big change to
> the encrypted version (on Adam's PC), which in practice means that
> rsync's clever "work out which bits have changed" algorithm is
> useless. This can be avoided by assuming that any change to a file
> means you have to back up the whole file again; this works OK if your
> files are all small (like, say /etc/* files on a Linux box) or if your
> large files never change (like, say, your Photos folder), but if you
> have a large file and it changes a lot (like, say, your mailbox, if
> it's one mbox file, or an Outlook PST) then you pay the long-time
> backup penalty every night. Comments on whether this is likely to be a
> problem are invited.

It will be a problem but it can be worked around, though perhaps not
by use of plain rsync.

Files could be split up into small blocks and a secure hash kept on
every node for each block.  If a block's hash no longer matches then
that block has changed and has to be retransmitted to all backup
nodes, but hopefully each block would be quite small, e.g. 64KiB
blocks each with a SHA-256 hash.
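
A rough sketch of the idea in Python (the 64KiB block size and the
function names are just made up for illustration):

    import hashlib

    BLOCK_SIZE = 64 * 1024  # 64KiB blocks, as suggested above

    def block_hashes(path):
        """Return one SHA-256 digest per 64KiB block of the file."""
        hashes = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                hashes.append(hashlib.sha256(block).hexdigest())
        return hashes

    def changed_blocks(path, old_hashes):
        """Indices of blocks whose hash differs from the stored list,
        i.e. the blocks that need retransmitting to backup nodes."""
        new_hashes = block_hashes(path)
        changed = [i for i, h in enumerate(new_hashes)
                   if i >= len(old_hashes) or h != old_hashes[i]]
        return changed, new_hashes

Only the changed blocks would then go out over the wire; everything
else stays where it is on the backup nodes.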

> 2. encryption key loss

[...]

> I can't think of a way around this without saying to the user
> "here is a file which you must keep safe somewhere other than on
> your computer", which is a total abject loss if you're my dad
> (because (a) he won't understand the request and (b) he has
> nowhere other than his computer to keep such a file safe! Yes, he
> could burn it to a CD, but then it's hardly an easy-to-use backup
> system, is it?) I can't think of a user-friendly way to get around
> this one other than "the central server stores your backup key",
> which translates to "the central server admin can read all your
> backups", which is clearly a bad idea.

Why can't it just be a password?  Okay, this is always going to be a
"quite hard" problem, but it's one that exists already with PGP
secret keys and people do manage.  You'd probably have the software
just remember it and only require it to be entered when the data
needs to be pulled down onto a clean install.
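
The actual encryption key could then be derived from the password
with something like PBKDF2, so the password (plus a salt that can be
stored, unencrypted, alongside the backups) is all the user ever has
to keep.  A rough Python sketch, with the parameters picked
arbitrarily:

    import hashlib, os

    def derive_key(password, salt=None):
        """Derive a 256-bit key from the user's password.  The salt
        is not secret and could live with the backed-up data."""
        if salt is None:
            salt = os.urandom(16)
        key = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'),
                                  salt, 200000, dklen=32)
        return key, salt

If the password is forgotten the backups are still gone, of course,
so this moves the "quite hard" problem rather than solving it.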

> 3. bandwidth
> If I want to back up 200MB of photos somewhere away from my home
> network, then I'm screwed. This isn't a problem with my backup system,
> it's just The Way It Is, because that's how fast internet connections
> are. This possibly makes any concept of off-site backup for home users
> basically useless. Yes, some people will just use it to back up their
> most vital documents, but some people want to make sure they don't
> lose their photos and their mp3s, and saying "here is a brilliant
> backup system; only use it for text documents" is pretty stupid. This,
> all by itself, might kill the entire concept of off-site home-user
> backup. Not a lot that can be done about that if that's the case; it
> can't be solved with software, obviously, no matter how clever the
> software is. Thoughts on whether this is an issue are gratefully
> welcomed.

I had an idea about making it P2P so that you would not necessarily
need to pre-arrange the backup relationships, and you would get a
potential bandwidth multiplication since your blocks could be
flooded out via multiple people, and downloaded from multiple people
when you need them.

You could set some degree of importance to certain files, plus a
default importance, which would determine how many copies of each
block should exist in the backup network at any time.

So, for example, if your default number of copies were 2, then 2
copies of every block of your data would have to exist in the backup
network at all times.  When people leave the network, your client
would begin contacting others to ask them to take copies of your
stuff.  Meanwhile it would be accepting similar requests from others.
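
Roughly, each client would periodically run something like this
(Python again; the names and data structures are all made up):

    def request_store(peer, block_id):
        """Stub: the real protocol would send the block to the peer
        and wait for an acknowledgement."""
        print("asking %s to store block %s" % (peer, block_id))

    def maintain_replication(blocks, reachable_peers, copies_wanted=2):
        """blocks maps block_id -> set of peers believed to hold it;
        ask extra peers to store any block that has fallen below the
        target number of copies."""
        for block_id, holders in blocks.items():
            live = holders & reachable_peers
            shortfall = copies_wanted - len(live)
            if shortfall <= 0:
                continue
            candidates = [p for p in reachable_peers if p not in live]
            for peer in candidates[:shortfall]:
                request_store(peer, block_id)
                holders.add(peer)

with copies_wanted coming from the per-file importance setting.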

This could be quite risky as it would be easy for a user holding a
large number of your blocks to disappear, never to be seen again,
leaving you with a long window where you are exposed to failure.  It
could, however, be mitigated by the idea that you yourself can be a
member of the backup network multiple times and have preferred
backup partners.  So if you have access to another machine somewhere,
you can join it to the backup network and tell it to preferentially
accept requests from your username on other nodes.  That way all
instances of you would try to provide one (or more) copies of each
other's blocks.
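
Picking which peers to ask could then prefer your own nodes, along
the lines of (again just a sketch, with a made-up peer structure):

    def choose_backup_peers(peers, my_username, needed):
        """peers is a list of (peer_id, owner) tuples; prefer nodes
        owned by the same user, then fall back to strangers."""
        own = [pid for pid, owner in peers if owner == my_username]
        others = [pid for pid, owner in peers if owner != my_username]
        return (own + others)[:needed]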

This also allows for some sort of economy to form where users can
negotiate deals with each other to take on backup requests, all
without knowing who the other party is, or what the data is.

That might be putting the cart before the horse though, since I
can't think of a good way to represent reliability in that
marketplace -- you don't want to pay someone with a modem connection
in Antarctica the same as a user on a gigabit server in a
well-connected datacentre.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting
Encrypted mail welcome - keyid 0x604DE5DB