[Nottingham] A quick bash updating table lookup

Richard Hodgson rich at dearinternet.com
Mon Feb 7 17:59:53 UTC 2011


I'm all for Perl here too. If your inode lookups are sequential, you
could save the last looked-up inode in a variable, check whether the
inode you're currently looking at is greater than that value, and only
run an MD5 checksum if it is. That way you won't be storing a massive
list of inodes in memory, and you avoid costly lookups for existing
hash keys.
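
Something like this rough sketch, say - assuming the file paths arrive
on stdin already sorted by inode number (e.g. from find), and using the
core Digest::MD5 module rather than shelling out to md5sum:

#!/usr/bin/perl
# Rough sketch: skip any inode already checksummed, relying on the
# input being sorted by inode so a single scalar is enough state.
use strict;
use warnings;
use Digest::MD5;

my $last_inode = 0;    # highest inode checksummed so far

while (my $path = <STDIN>) {
    chomp $path;
    my $inode = (stat $path)[1];
    next unless defined $inode;
    next if $inode <= $last_inode;   # hard link to an inode already done

    open my $fh, '<', $path or next;
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    print "$inode  $md5  $path\n";
    $last_inode = $inode;
}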

This, of course, assumes that you're doing your inode lookups
sequentially; I'm not sure how your current solution operates.
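
If they aren't sequential, the tied hash Cam suggests below would be my
next stop. A minimal sketch, assuming the DB_File module and Berkeley
DB are installed (the filename seen_inodes.db is just an example):

#!/usr/bin/perl
# Rough sketch: persistent inode -> md5sum table tied to a Berkeley DB
# B-tree file, so it isn't all held in RAM and survives between runs.
use strict;
use warnings;
use Fcntl;
use DB_File;
use Digest::MD5;

my %seen;    # inode => md5sum, stored in the DB file
tie %seen, 'DB_File', 'seen_inodes.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "Cannot tie seen_inodes.db: $!";

while (my $path = <STDIN>) {
    chomp $path;
    my $inode = (stat $path)[1];
    next unless defined $inode;
    next if exists $seen{$inode};    # already checksummed this inode

    open my $fh, '<', $path or next;
    binmode $fh;
    $seen{$inode} = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    print "$inode  $seen{$inode}  $path\n";
}

untie %seen;

Lookups against the B-tree should stay roughly logarithmic, so it ought
to cope with millions of entries without the slowdown you describe.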

-Rich

On Mon, Feb 7, 2011 at 17:48, Camilo Mesias <camilo at mesias.co.uk> wrote:

> It sounds like a database problem; I probably wouldn't tackle it with
> anything less than Perl, although the result might look quite
> bash-like.
>
> A Perl hash (associative array) could map inodes to md5sums; the hash
> would also tell you whether an inode had already been summed.
>
> That would work for small numbers of entries (several thousand)
> effortlessly.
>
> To scale it to huge numbers, you could 'tie' the hash to a database
> file - the hash would then be implemented in, and persisted to, the DB
> file.
>
> If Perl is installed you might have the man page for DB_File, which
> should help, or you can search for some examples on the net.
>
> -Cam
>
>
> On Mon, Feb 7, 2011 at 4:16 PM, Martin <martin at ml1.co.uk> wrote:
> > Folks,
> >
> > Now this is something that just must have already been done...
> >
> > I'm checking md5sums for files per filesystem _inode_... This is so
> > that my system doesn't go re-checking the md5sum of the same inode
> > for data corruption a gazillion times through the multiple hard links
> > to it in the archive/snapshot copies that are kept. So... the
> > problem is:
> >
> > How to look up a table of already seen inode/md5sum pairs quickly, for
> > many millions of files?
> >
> > Also, how to efficiently add new entries to the table and yet maintain
> > a fast lookup?
> >
> >
> > I'm running a 'quick and dirty' but non-ideal solution at the moment.
> > However, it isn't all that quick and will eventually suffer
> > exponential degradation as the number of lookups increases... Also,
> > I'm looking for something that will scale up to 100 million files or
> > more.
> >
> > Any good ideas?
> >
> > Already done even?
> >
> > (And this was supposed to be just a "5-minute bash scripting
> > exercise"... :-( )
> >
> > Cheers,
> > Martin