[Nottingham] A quick bash updating table lookup

Martin martin at ml1.co.uk
Tue Feb 8 14:47:03 UTC 2011


On 8 February 2011 13:03, notlug <notlug at pendinas.org.uk> wrote:
> On 07/02/11 16:16, Martin wrote:
>>
>> Folks,
>>
>> Now this is something that just must have already been done...
>>
>> I'm checking md5sums for files for filesystem _inodes_  ... This is so
[---]
>> looking for something that will scale up to a 100 million files or
>> more.
>>
>> Any good ideas?
>>
>
> You need to define your problem a little better...
>
> Which will be more expensive:  create and lookup some extra md5sum or
> managing the  "have I seen this before" metadata ?

For the present first-off test, I'm assuming the process is IO-bound, so I
simply test whether the file to be checked is smaller than the checksum
cache file. If it is smaller, don't waste IO searching through the cache
file; just re-hash it:

        # File size check: 'ls -S1' sorts largest first, so 'tail -n 1'
        # picks whichever of "$f" and "$md5cache" is the smaller file.
        i=$(ls -S1 "$f" "$md5cache" | tail -n 1)
        if [ "$i" != "$md5cache" ]
        then
                # Small file, calculate md5sum regardless!
                md5sum -b "$f"
        else
                # Use the cache:
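
As a sketch of one way the cache branch could look, assuming a plain-text
cache of "<inode> <md5sum> <path>" lines (the format and names here are
just for illustration, not the actual script):

        # Sketch only: assumes "$md5cache" holds "<inode> <md5sum> <path>" lines
        inode=$(stat -c %i "$f")
        cached=$(grep -m 1 "^$inode " "$md5cache" 2>/dev/null)
        if [ -n "$cached" ]
        then
                # This inode has already been hashed (e.g. via another
                # hard link or an earlier run): reuse the cached entry.
                echo "${cached#* }"
        else
                # Not seen before: hash it now and append to the cache.
                sum=$(md5sum -b "$f")
                echo "$inode $sum" >> "$md5cache"
                echo "$sum"
        fi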


> It sounds like you need to distinguish at least 2 data sets:
>    1. A permanent  database of "inode,checksum" so you can check if a file
> has changed or not.
>    2. A per run list of hardlinked inodes which have already been checked.
>
> If the multiple paths to an inode are due to hard linking (can test the
> inode hardlink count)
> or bind mounts (cannot test the inode hardlink count)  you can limit the
> volume of
> "have I seen this before" metadata by checking the inode hardlink count
> (stat -c %h) > 1

Yes, except I'm doing that in the find:

-type f \( -links 1 -exec md5sum -b "{}" \; \
           -o -exec "$md5sumc" "{}" "$md5cache" \; \)


> But how do you manage an "inode,checksum" database?
> When is the checksum data generated for new files and what proportion of
> files
> are expected to change (and need their md5sum updating) between runs ?
> This will affect your "inode,checksum" database strategy - many or few
> inserts/updates  ?

I'm guessing a large proportion of inserts (perhaps approaching 50%) on
each run.


> It also affects your reporting strategy.
> How do you distinguish/report genuine changes from data corruption ?
> At "100 million files" even 0.01% genuine change = 10,000 files.
> You cannot sensibly scan a 10,000 line change log file by eye - do you need
> "yet another tool" to test whether a change was benign or malign ?

That is quite an issue. The first pass is to assume cut-off dates for
trees of data, after which no further changes are expected. I may insist
that a date is included as part of the path name.
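
As a rough sketch of that idea (the .../YYYY-MM-DD/... path layout and the
names here are only illustrative, nothing is decided yet):

        # Sketch: assumes trees are laid out as .../YYYY-MM-DD/... where the
        # date is the cut-off after which no further changes are expected.
        path="$1"
        if [[ "$path" =~ /([0-9]{4}-[0-9]{2}-[0-9]{2})/ ]]
        then
                cutoff="${BASH_REMATCH[1]}"
                if [ "$(date -d "$cutoff" +%s)" -lt "$(date +%s)" ]
                then
                        # Change in a "frozen" tree: flag as suspicious.
                        echo "SUSPECT change: $path"
                else
                        # Change in a still-active tree: expected churn.
                        echo "benign change: $path"
                fi
        fi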


> Also, if the likelihood of file data corruption is high,
> 100 million files = a large "inode,md5sum" database (even as binary data).
> How do you protect this from corruption  ?

Very good point. The dump itself can be check-summed.
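
For example, something like (file names assumed):

        # Protect the checksum cache/dump with its own checksum...
        md5sum -b "$md5cache" > "$md5cache.md5"
        # ...and verify it before trusting it on the next run:
        md5sum -c "$md5cache.md5" || echo "WARNING: checksum cache corrupt!" >&2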



> Then there are all the warts of doing this on a running system.
> Can files change while a scan is taking place ?

Yes, except that I'm using an LVM snapshot for doing the checks.
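
Roughly along these lines (the volume group, names and sizes below are
placeholders rather than the real layout):

        # Snapshot the data LV so files cannot change mid-scan:
        lvcreate --snapshot --size 5G --name datasnap /dev/vg0/data
        mkdir -p /mnt/snap
        mount -o ro /dev/vg0/datasnap /mnt/snap
        # ... run the find/md5sum pass against /mnt/snap ...
        umount /mnt/snap
        lvremove -f /dev/vg0/datasnap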


> If yes then handling this in perl or C would be ugly but trying to do it in
> bash
> would be bordering on insane ;)

I thought you could do /any/ sysadmin task in less than five lines of bash! :-p


[---]
> (I'm assuming NFS isn't involved in any of this - that would make things
> even more interesting.)

NFS is part of the reason for doing this... It is too slow over the
network, so the md5sums are done locally on each machine and on the
backup/archiving machine, and then the timestamps and md5sums are
compared. I'm also trying to squeeze in the extra goodies of data checks
and change checks whilst I'm at it.
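
The comparison itself can then be a cheap local job on the resulting
lists, e.g. (file names assumed):

        # Each machine produces its own list of "md5sum  path" lines;
        # sort by path and diff them on the archive host:
        sort -k 2 local.md5 > local.sorted
        sort -k 2 remote.md5 > remote.sorted
        # Any lines unique to one side are changed, missing or corrupt files:
        diff local.sorted remote.sorted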


>> Already done even?
>
> This is something existing intrusion detection / data integrity checkers do
> (but I
> don't know if they do it well or not). Fedora and ubuntu have some of them
> in their
> package repositories:
> AIDE       http://aide.sourceforge.net/
> Samhain http://www.la-samhna.de/

Yes... I looked briefly at the Mandriva msec... However, I was lured in by
the 'simplicity' of a one-liner for a fully custom solution! It has since
expanded... :-)


> Better, lower level, solutions are  file system or block device checksums
>  (eg .RAID or btrfs (if you can wait for it to mature sufficiently)) with
> regular
> integrity scans.  But rolling these out onto an existing system may not be
> practical.

I'm using RAID5 for the backups array and RAID1 on the main machines.

I used DRBD for a while but, after zero failures, I'm now using a periodic
rsync to save on system bandwidth. The penalty is perhaps a few lost files
from the interval since the last sync if a machine spontaneously combusts.
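
The periodic rsync is nothing fancy, something along the lines of (host
and paths are placeholders):

        # Run nightly from cron; -a preserves metadata, --delete mirrors removals:
        rsync -a --delete /srv/data/ backuphost:/srv/archive/data/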


Thanks for some good thoughts,

Cheers,
Martin


