[Nottingham] A quick bash updating table lookup

notlug notlug at pendinas.org.uk
Tue Feb 8 13:07:48 UTC 2011


On 07/02/11 16:16, Martin wrote:
> Folks,
>
> Now this is something that just must have already been done...
>
> I'm checking md5sums of files, tracked by filesystem _inode_ ... This is so
> that my system doesn't go checking the md5sum for data corruption on
> the same inode a gazillion times through the multiple hard links to
> that inode in the archive/snapshot copies that are kept. So... The
> problem is:
>
> How to look up a table of already seen inode/md5sum pairs quickly, for
> many millions of files?
>
> Also, how to efficiently add new entries to the table and yet maintain
> a fast lookup?
>
>
> I'm running a 'quick and dirty' but non-ideal solution at the moment.
> However, it isn't all that quick and will eventually suffer exponential
> degradation as the number of lookups increases... Also, I'm
> looking for something that will scale up to 100 million files or
> more.
>
> Any good ideas?
>

You need to define your problem a little better...

Which will be more expensive: creating and looking up some extra md5sums, or
managing the "have I seen this before" metadata?

It sounds like you need to distinguish at least 2 data sets:
     1. A permanent database of "inode,checksum" pairs, so you can check whether a file has changed or not.
     2. A per-run list of hardlinked inodes which have already been checked.

If the multiple paths to an inode are due to hard linking (where you can test the
inode hardlink count) rather than bind mounts (where you cannot), you can limit the
volume of "have I seen this before" metadata to inodes whose hardlink count
(stat -c %h) is greater than 1.


But how do you manage the "inode,checksum" database?
When is the checksum data generated for new files, and what proportion of files
are expected to change (and need their md5sum updating) between runs?
This will affect your "inode,checksum" database strategy - many or few inserts/updates?
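
If you don't already have a store for it, an indexed sqlite3 table is one way to
keep lookups fast at millions of rows. A rough sketch - the path, table and column
names are made up, and $inode / $hash are assumed to come from the scan loop:

     DB=/var/lib/scan/checksums.db       # hypothetical location

     # One-off: an indexed table keyed on inode number.
     sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS sums
                      (inode INTEGER PRIMARY KEY, md5 TEXT);'

     # Lookup: empty output means "not seen before".
     stored=$(sqlite3 "$DB" "SELECT md5 FROM sums WHERE inode = $inode;")

     # Insert or update after (re)checksumming.
     sqlite3 "$DB" "INSERT OR REPLACE INTO sums (inode, md5) VALUES ($inode, '$hash');"

If there are many inserts per run, batching them inside a single BEGIN/COMMIT
transaction (or feeding sqlite3 one generated SQL file) avoids paying a
transaction per file.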

It also affects your reporting strategy.
How do you distinguish/report genuine changes from data corruption?
At "100 million files" even a 0.01% genuine change = 10,000 files.
You cannot sensibly scan a 10,000-line change log by eye - do you need
"yet another tool" to test whether a change was benign or malign?

Also, if the likelihood of file data corruption is high enough to justify all this,
100 million files means a large "inode,md5sum" database - on the order of 2-3 GB
even as raw binary (an 8-byte inode number plus a 16-byte md5 per entry).
How do you protect that from corruption?
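
One cheap (if crude) safeguard: checksum the database file itself at the end of
each run and verify it before the next one - a sketch, reusing the hypothetical
path from above:

     DB=/var/lib/scan/checksums.db

     # After a successful run, record the database's own checksum.
     md5sum "$DB" > "$DB.md5"

     # Before the next run, verify it.
     md5sum -c --quiet "$DB.md5" || { echo "checksum database looks corrupt" >&2; exit 1; }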


Then there are all the warts of doing this on a running system.
Can files change while a scan is taking place?
If yes, then handling this in Perl or C would be ugly, but trying to do it in bash
would be bordering on insane ;)
Between any two _file_ operations by the scan
there is a window of opportunity for another process
to change the file and/or inode in question.
Think:
     scan:> inode=$(stat -c %i "$file")
     you:>  rm -f "$file" && touch "$file"   # You just changed $file and the inode number of $file!
     scan:> hash=$(md5sum "$file" | cut -d' ' -f1)
     scan:> insert_into_db "$inode" "$hash"

Not only did the file contents change, the scan recorded the new contents' hash
against the wrong (stale) inode number.
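
One way to narrow (not close) that window is to stat the file before and after
hashing and throw the result away if anything changed underneath you - a sketch,
reusing the hypothetical insert_into_db from above:

     before=$(stat -c '%i %Y' "$file")             # inode number and mtime
     hash=$(md5sum "$file" | cut -d' ' -f1)
     after=$(stat -c '%i %Y' "$file")

     if [[ $before == "$after" ]]; then
         insert_into_db "${before%% *}" "$hash"    # first field is the inode
     else
         echo "skipped $file: changed during scan" >&2
     fi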

(I'm assuming NFS isn't involved in any of this - that would make things even more interesting.)

> Already done even?

This is something existing intrusion detection / data integrity checkers already do
(though I don't know how well). Fedora and Ubuntu have some of them in their
package repositories:
AIDE        http://aide.sourceforge.net/
Samhain     http://www.la-samhna.de/


Better, lower-level, solutions are filesystem or block-device checksums
(e.g. RAID, or btrfs if you can wait for it to mature sufficiently) with regular
integrity scans.  But rolling these out onto an existing system may not be practical.
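
For what it's worth, those lower-level checks boil down to commands like these
(the mount point and md device names are just examples):

     # btrfs: scrub verifies all data and metadata checksums on a mounted fs
     btrfs scrub start /mnt/data
     btrfs scrub status /mnt/data

     # Linux md RAID: trigger a consistency check of /dev/md0
     echo check > /sys/block/md0/md/sync_action
     cat /proc/mdstat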

Have fun,
Duncan


