I'm all for Perl here too. If your inode lookups are sequential, you could save the last looked-up inode in a variable and only run an MD5 checksum when the current inode is greater than that value. That way you avoid storing a massive list of inodes in memory, as well as the cost of checking for existing hash keys.<div>
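A minimal sketch of that idea in Perl (the "inode path" input format, and feeding it pre-sorted by inode from find, are my assumptions, not part of your setup):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumes input lines of "inode path", pre-sorted by inode, e.g. from:
#   find /archive -type f -printf '%i %p\n' | sort -n
my $last_inode = -1;

while (my $line = <STDIN>) {
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    # Skip hard links to an inode we have already summed.
    next unless $inode > $last_inode;
    $last_inode = $inode;

    print `md5sum '$path'`;
}
```

Only one integer is held in memory at any time, which is why this beats a hash when the sequential-input assumption holds.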
<br></div><div>This, of course, assumes that you're doing your inode lookups sequentially. I'm not sure how your current solution operates.</div><div><br></div><div>-Rich<br><div><br><div class="gmail_quote">
On Mon, Feb 7, 2011 at 17:48, Camilo Mesias <span dir="ltr"><<a href="mailto:camilo@mesias.co.uk">camilo@mesias.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
It sounds like a database, I probably wouldn't tackle the problem with<br>
anything less than Perl, although the result might look quite<br>
bash-like.<br>
<br>
A Perl hash (associative array) could map inodes to md5sums; the hash<br>
would also tell you whether an inode has already been summed.<br>
<br>
That would work for small numbers of entries (several thousand) effortlessly.<br>
<br>
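As a rough sketch of the in-memory version (again, the "inode path" input format and the md5sum call are illustrative assumptions):<br>
<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map inode => md5sum; exists() answers "already summed?" in O(1).
my %md5_by_inode;

while (my $line = <STDIN>) {    # lines of "inode path"
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    unless (exists $md5_by_inode{$inode}) {
        my ($sum) = split ' ', `md5sum '$path'`;
        $md5_by_inode{$inode} = $sum;
    }
    print "$md5_by_inode{$inode}  $path\n";
}
```
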
To scale it to huge numbers, you could 'tie' the hash to a<br>
database file - the hash would then be backed by, and persisted<br>
in, the DB file.<br>
<br>
If Perl is installed, you might have the man page for DB_File, which<br>
might help; or search for some examples on the net.<br>
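Tying only changes the setup; the lookup code stays the same. A sketch (the database file name is arbitrary):<br>
<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;      # O_CREAT, O_RDWR
use DB_File;    # exports $DB_BTREE

# Same hash as before, but backed by a B-tree on disk, so it
# persists between runs and scales past available RAM.
tie my %md5_by_inode, 'DB_File', 'inode_md5.db',
    O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "cannot tie inode_md5.db: $!";

while (my $line = <STDIN>) {    # lines of "inode path"
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    unless (exists $md5_by_inode{$inode}) {
        my ($sum) = split ' ', `md5sum '$path'`;
        $md5_by_inode{$inode} = $sum;
    }
    print "$md5_by_inode{$inode}  $path\n";
}

untie %md5_by_inode;
```

The B-tree flavour keeps lookups at O(log n), so it should hold up at the 100-million-file scale Martin mentions.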
<br>
-Cam<br>
<div><div></div><div class="h5"><br>
<br>
On Mon, Feb 7, 2011 at 4:16 PM, Martin <<a href="mailto:martin@ml1.co.uk">martin@ml1.co.uk</a>> wrote:<br>
> Folks,<br>
><br>
> Now this is something that just must have already been done...<br>
><br>
> I'm checking md5sums for files for filesystem _inodes_ ... This is so<br>
> that my system doesn't go checking the md5sum for data corruption for<br>
> the same inode a gazillion times through multiple hard links to the<br>
> same inode in the archive/snapshot copies that are kept. So... The<br>
> problem is:<br>
><br>
> How to look up a table of already seen inode/md5sum pairs quickly, for<br>
> many millions of files?<br>
><br>
> Also, how to efficiently add new entries to the table and yet maintain<br>
> a fast lookup?<br>
><br>
><br>
> I'm running a 'quick and dirty' but non-ideal solution at the moment.<br>
> However, it isn't all that quick and will eventually suffer<br>
> exponential degradation as the number of lookups increases... Also, I'm<br>
> looking for something that will scale up to 100 million files or<br>
> more.<br>
><br>
> Any good ideas?<br>
><br>
> Already done even?<br>
><br>
> (And this was supposed to be just a "5-minute bash scripting exercise"... :-( )<br>
><br>
> Cheers,<br>
> Martin<br>
><br>
> _______________________________________________<br>
> Nottingham mailing list<br>
> <a href="mailto:Nottingham@mailman.lug.org.uk">Nottingham@mailman.lug.org.uk</a><br>
> <a href="https://mailman.lug.org.uk/mailman/listinfo/nottingham" target="_blank">https://mailman.lug.org.uk/mailman/listinfo/nottingham</a><br>
><br>
<br>
</div></div></blockquote></div><br></div></div>