[Nottingham] A quick bash updating table lookup
Martin
martin at ml1.co.uk
Tue Feb 8 12:52:57 UTC 2011
On 7 February 2011 21:35, Richard Hodgson <rich at dearinternet.com> wrote:
> On Mon, Feb 7, 2011 at 21:23, Camilo Mesias <camilo at mesias.co.uk> wrote:
>>
>> Even if you have an external program to do the b-tree heavy lifting,
>> its performance will likely be crippled if it has to start up,
>> navigate the tree and interact with it. It will also be better to use
>> a scripting language that has extensions for md5 (for example) rather
>> than run up a new subprocess for every md5sum.
I suspect that this exercise is I/O-limited, both in reading the files
through md5sum and in seeking around the HDD chasing the inodes. *nix
(Linux) is supposed to be designed for low-cost process start-up...
> Which, it should be noted, Perl has an entire module dedicated to that has
> worked rather well for me in the past:
> http://search.cpan.org/~gaas/Digest-MD5-2.51/MD5.pm
To take this further as a one-pass operation, it looks like Perl or a
compiled language would be the way to go...
However, this was only ever intended as a "five minutes" hack to avoid
md5sum re-summing inode data already seen earlier...
Also, using a database to store inode-to-md5sum lookups, even
temporarily, is quite a large dependency for a small improvement to the
minuscule md5sum utility!
OK... So... How to improve the "find ... exec md5sum ..." combination
for a heavily hard-linked archive *without* resorting to btrees and
databases and without having multi-Gigabyte lookup tables in system
memory?...
My thoughts are to go old-school and run two passes through the file system:
  1. List all files and their inodes (find ... -print ...);
  2. Sort the list by inode (sort);
  3. Run through the list, noting whether the present inode is the same as
     the last, and md5sum only as needed (while read f ; do ... md5sum ...).
Eeee... That could even be piped as a one-liner ;-)
In reality, I'll be scripting that so that I can add the option for
comparing md5sums to what was seen on a previous run to check for any
changes or corruptions.
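For the comparison against a previous run, a minimal sketch could look like this. It assumes both runs were saved as lists in md5sum output format (a 32-character sum, two separator characters, then the path). The function name changed_since is hypothetical.

```shell
#!/bin/bash
# Sketch only: print paths whose checksum differs between two saved runs.
# changed_since is a hypothetical helper; $1 is the old list, $2 the new one.
changed_since() {
    awk '
        # md5sum lines are "<32 hex chars><2 separator chars><path>".
        { sum = substr($0, 1, 32); path = substr($0, 35) }
        # First file: remember the old checksum for each path.
        NR == FNR { old[path] = sum; next }
        # Second file: report paths seen before whose checksum has changed.
        (path in old) && old[path] != sum { print path }
    ' "$1" "$2"
}
```

Files that are new or that have disappeared are deliberately ignored here. Only paths present in both lists with differing sums are reported.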
Which comes to the next question...
Does it matter much which of these I use:
cksum (1) - checksum and count the bytes in a file
md5sum (1) - compute and check MD5 message digest
sha1sum (1) - compute and check SHA1 message digest
sha224sum (1) - compute and check SHA224 message digest
sha256sum (1) - compute and check SHA256 message digest
sha384sum (1) - compute and check SHA384 message digest
sha512sum (1) - compute and check SHA512 message digest
shasum (1) - Print or Check SHA Checksums
sum (1) - checksum and count the blocks in a file
?
Speed vs collisions?
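One rough way to look at the speed half of that question is simply to time each tool on a representative file. The sketch below assumes GNU date (for %N nanoseconds), and the function name time_sums is made up. It says nothing about collision resistance, and on a cold cache the first tool in the list will mostly be measuring disk reads, so it is worth running twice.

```shell
#!/bin/bash
# Sketch only: rough wall-clock time of each checksum tool on one file.
# time_sums is a hypothetical helper; assumes GNU date for nanoseconds.
time_sums() {
    local f=$1 tool start end
    for tool in cksum md5sum sha1sum sha256sum sha512sum; do
        # Skip any tool that is not installed on this system.
        command -v "$tool" >/dev/null 2>&1 || continue
        start=$(date +%s%N)
        "$tool" "$f" >/dev/null
        end=$(date +%s%N)
        # Convert nanoseconds to whole milliseconds.
        printf '%s %d ms\n' "$tool" $(( (end - start) / 1000000 ))
    done
}
```

Something like "time_sums /path/to/big.file" then gives one line per tool, which should be enough to see whether the hash is the bottleneck or the disk is.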
Cheers,
Martin