[Nottingham] A quick bash updating table lookup

Martin martin at ml1.co.uk
Tue Feb 8 12:52:57 UTC 2011


On 7 February 2011 21:35, Richard Hodgson <rich at dearinternet.com> wrote:
> On Mon, Feb 7, 2011 at 21:23, Camilo Mesias <camilo at mesias.co.uk> wrote:
>>
>> Even if you have an external program to do the b-tree heavy lifting,
>> its performance will likely be crippled if it has to start up,
>> navigate the tree and interact with it. It will also be better to use
>> a scripting language that has extensions for md5 (for example) rather
>> than run up a new subprocess for every md5sum.

I suspect that this exercise is I/O-limited, both in reading the files
through md5sum and in seeking around the HDD chasing the inodes. *nix
(Linux) is supposed to be designed for low-cost process start-up...


> Which, it should be noted, Perl has an entire module dedicated to that has
> worked rather well for me in the past:
> http://search.cpan.org/~gaas/Digest-MD5-2.51/MD5.pm

To take this further as a one-pass operation, it looks like Perl or
a compiled language would be the way to go...

However, this was only ever intended as a "five-minute" hack to avoid
md5sum re-summing inode data already seen earlier...

Also, using a database, even temporarily, to store inode-to-md5sum
lookups is quite a large dependency for a small improvement to the
minuscule md5sum utility!


OK... So... How to improve the "find ... exec md5sum ..." combination
for a heavily hard-linked archive *without* resorting to b-trees and
databases, and without holding multi-gigabyte lookup tables in system
memory?...

My thoughts are to go old-school and run two passes through the file system:

1. List all files and their inodes (find ... -print ...);
2. Sort the list by inode (sort);
3. Run through the list, noting whether the present inode is the same as
   the last, and md5sum as needed (while read f ; do ... md5sum ...).

Eeee... That could even be piped as a one-liner ;-)
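
Roughly like this, perhaps (untested; "/archive" and GNU find's -printf
are assumptions here, and filenames containing newlines would break it):

  # list inode + path, sort numerically by inode, then md5sum only the
  # first path seen for each inode (further hard links are skipped)
  find /archive -type f -printf '%i %p\n' | sort -n |
  while read -r inode path ; do
      [ "$inode" != "$last" ] && md5sum "$path"
      last=$inode
  done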


In reality, I'll be scripting that so that I can add the option of
comparing md5sums against those seen on a previous run, to check for
any changes or corruption.
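
For the verify step, md5sum's own "-c" mode looks sufficient. A minimal
sketch, where "checksums.old" is only a placeholder name for a previous
run's saved output:

  # report only files whose current sum no longer matches the saved one
  md5sum --quiet -c checksums.old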

Which comes to the next question...

Is there any concern over which of these to use:

cksum                (1)  - checksum and count the bytes in a file
md5sum               (1)  - compute and check MD5 message digest
sha1sum              (1)  - compute and check SHA1 message digest
sha224sum            (1)  - compute and check SHA224 message digest
sha256sum            (1)  - compute and check SHA256 message digest
sha384sum            (1)  - compute and check SHA384 message digest
sha512sum            (1)  - compute and check SHA512 message digest
shasum               (1)  - Print or Check SHA Checksums
sum                  (1)  - checksum and count the blocks in a file

?

Speed vs collisions?
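
The speed half is easy enough to measure. A rough sketch, assuming some
suitably large test file is to hand ("big-test-file" is a placeholder),
though on an archive that is genuinely I/O-bound the differences may
well wash out:

  # wall-clock time of each digest tool over the same file; run it twice
  # so the second pass reads from the page cache rather than the disk
  for sum in cksum md5sum sha1sum sha256sum sha512sum ; do
      echo "== $sum"
      time "$sum" big-test-file > /dev/null
  done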


Cheers,
Martin


