I'm all for Perl here too. If your inode lookups are sequential, you could save the last looked-up inode in a variable and only run an MD5 checksum when the current inode is greater than that value. That way you avoid storing a massive list of inodes in memory, as well as the cost of checking for existing hash keys.<div>
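A minimal sketch of that idea in Perl (the "inode path" input format, and feeding it pre-sorted by inode from find, are my assumptions, not part of your setup):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumes input lines of "inode path", pre-sorted by inode, e.g. from:
#   find /archive -type f -printf '%i %p\n' | sort -n
my $last_inode = -1;

while (my $line = <STDIN>) {
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    # Skip hard links to an inode we have already summed.
    next unless $inode > $last_inode;
    $last_inode = $inode;

    print `md5sum '$path'`;
}
```

Only one integer is held in memory at any time, which is why this beats a hash when the sequential-input assumption holds.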
<br></div><div>This, of course, assumes that you're doing your inode lookups sequentially. I'm not sure how your current solution operates.</div><div><br></div><div>-Rich<br><div><br><div class="gmail_quote">
On Mon, Feb 7, 2011 at 17:48, Camilo Mesias <span dir="ltr"><<a href="mailto:camilo@mesias.co.uk">camilo@mesias.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
It sounds like a database, I probably wouldn't tackle the problem with<br>
anything less than Perl, although the result might look quite<br>
bash-like.<br>
<br>
A Perl hash (associative array) could map inodes to md5sums; the hash<br>
would also tell you whether an inode has already been summed.<br>
<br>
That would work for small numbers of entries (several thousand) effortlessly.<br>
<br>
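As a rough sketch of the in-memory version (again, the "inode path" input format and the md5sum call are illustrative assumptions):<br>
<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map inode => md5sum; exists() answers "already summed?" in O(1).
my %md5_by_inode;

while (my $line = <STDIN>) {    # lines of "inode path"
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    unless (exists $md5_by_inode{$inode}) {
        my ($sum) = split ' ', `md5sum '$path'`;
        $md5_by_inode{$inode} = $sum;
    }
    print "$md5_by_inode{$inode}  $path\n";
}
```
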
To scale it to huge numbers, you could 'tie' the hash to a<br>
database file - the hash would then be backed by, and persisted<br>
in, the DB file.<br>
<br>
If Perl is installed, you might have the man page for DB_File, which<br>
might help; or search for some examples on the net.<br>
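Tying only changes the setup; the lookup code stays the same. A sketch (the database file name is arbitrary):<br>
<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;      # O_CREAT, O_RDWR
use DB_File;    # exports $DB_BTREE

# Same hash as before, but backed by a B-tree on disk, so it
# persists between runs and scales past available RAM.
tie my %md5_by_inode, 'DB_File', 'inode_md5.db',
    O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "cannot tie inode_md5.db: $!";

while (my $line = <STDIN>) {    # lines of "inode path"
    chomp $line;
    my ($inode, $path) = split ' ', $line, 2;

    unless (exists $md5_by_inode{$inode}) {
        my ($sum) = split ' ', `md5sum '$path'`;
        $md5_by_inode{$inode} = $sum;
    }
    print "$md5_by_inode{$inode}  $path\n";
}

untie %md5_by_inode;
```

The B-tree flavour keeps lookups at O(log n), so it should hold up at the 100-million-file scale Martin mentions.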
<br>
-Cam<br>
<div><div></div><div class="h5"><br>
<br>
On Mon, Feb 7, 2011 at 4:16 PM, Martin <<a href="mailto:martin@ml1.co.uk">martin@ml1.co.uk</a>> wrote:<br>
> Folks,<br>
><br>
> Now this is something that just must have already been done...<br>
><br>
> I'm checking md5sums for files for filesystem _inodes_ ... This is so<br>
> that my system doesn't go checking the md5sum for data corruption for<br>
> the same inode a gazillion times through multiple hard links to the<br>
> same inode in the archive/snapshot copies that are kept. So... The<br>
> problem is:<br>
><br>
> How to look up a table of already seen inode/md5sum pairs quickly, for<br>
> many millions of files?<br>
><br>
> Also, how to efficiently add new entries to the table and yet maintain<br>
> a fast lookup?<br>
><br>
><br>
> I'm running a 'quick and dirty' but non-ideal solution at the moment.<br>
> However, it isn't all that quick and will eventually suffer<br>
> exponential degradation as the number of lookups increases... Also, I'm<br>
> looking for something that will scale up to 100 million files or<br>
> more.<br>
><br>
> Any good ideas?<br>
><br>
> Already done even?<br>
><br>
> (And this was supposed to be just a "5-minute bash scripting exercise"... :-( )<br>
><br>
> Cheers,<br>
> Martin<br>
><br>
> _______________________________________________<br>
> Nottingham mailing list<br>
> <a href="mailto:Nottingham@mailman.lug.org.uk">Nottingham@mailman.lug.org.uk</a><br>
> <a href="https://mailman.lug.org.uk/mailman/listinfo/nottingham" target="_blank">https://mailman.lug.org.uk/mailman/listinfo/nottingham</a><br>
><br>
<br>
</div></div></blockquote></div><br></div></div>