[Nottingham] A quick bash updating table lookup

notlug notlug at pendinas.org.uk
Tue Feb 8 14:39:38 UTC 2011


On 08/02/11 12:48, Martin wrote:
>
> My thoughts are to go old-school and run two passes through the file system:
>
> List all files and their inodes (find ... -print ...);
> Sort the list by inode (sort) |

If you are going to go this way, add "-u" to the sort command.
That trivially solves your "multiple checking of the same inode" problem.
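The steps quoted above, with the "-u" added, might look something like this — a minimal sketch, assuming GNU find's -printf and that one checksum per inode is what you want (paths with embedded newlines are not handled):

```shell
# List every file with its inode number, keep only the first path seen
# for each inode (sort -u restricted to the inode field), then checksum
# that one representative per inode.
find . -type f -printf '%i %p\n' \
  | sort -un -k1,1 \
  | while read -r inode f; do
      md5sum "$f"
    done
```

Hard-linked files share an inode, so the `sort -un -k1,1` drops all but one name per inode before any checksumming happens.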

> Run through the list noting whether the present inode is the same as
> the last, md5sum as needed (while read f ; do ... md5sum ...).
>
You have two (long) lists sorted on the same key (inode).  If you do it properly you only
need to traverse each list once.
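That single merge pass is what join(1) does for you. A sketch, assuming two hypothetical files old.list and new.list, each holding "inode checksum" lines and both sorted (plain lexicographic sort, which is what join expects) on the inode field:

```shell
# Pair up matching inodes from the two sorted lists; awk then flags
# any inode whose checksum differs between the old and new runs.
# old.list / new.list are assumed names, not from the original post.
join -j 1 old.list new.list | awk '$2 != $3 { print $1, "changed" }'
```

Inodes present in only one list (created or deleted files) fall out of the join; comm(1) on the inode columns would recover those if you need them.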
> Eeee... That could even be piped as a one-liner ;-)
>
>
> In reality, I'll be scripting that so that I can add the option for
> comparing md5sums to what was seen on a previous run to check for any
> changes or corruptions.
>
> Which comes to the next question...
>
> Is there any concern for whether to use:
>
> cksum                (1)  - checksum and count the bytes in a file
> md5sum               (1)  - compute and check MD5 message digest
> sha1sum              (1)  - compute and check SHA1 message digest
> sha224sum            (1)  - compute and check SHA224 message digest
> sha256sum            (1)  - compute and check SHA256 message digest
> sha384sum            (1)  - compute and check SHA384 message digest
> sha512sum            (1)  - compute and check SHA512 message digest
> shasum               (1)  - Print or Check SHA Checksums
> sum                  (1)  - checksum and count the blocks in a file
>
> ?
>
> Speed vs collisions?
>
The volume of files is irrelevant.  What is relevant is the probability that
a deliberate or corrupting change to any one file will result in a file which
returns the same checksum.  Since you don't appear to be concerned about
malicious intent, cksum (CRC32) is probably enough.  If
you want to have fun you could use file size to select different algorithms:

if [ "$filesize" -lt "$small" ]; then
    cksum "$f"                  # cheap CRC32 for small files
elif [ "$filesize" -lt "$medium" ]; then
    md5sum "$f"
else
    sha256sum "$f"              # strongest hash only for the largest files
fi

Remember, every checksum you might compare needs a previous value to be
compared to.  Doing a cheap checksum before an expensive one works for
rsync because rsync is comparing two files.  In your case you are comparing one file
with a historical record of that file.  If the historical record doesn't contain the
expensive checksum you have no way of getting it - so you have to generate it
at some point.
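One way round that is to pay for the expensive checksum once, up front. A sketch, assuming the baseline file name and the cheap/expensive pairing are your choice (neither is from the original post); paths with embedded newlines are not handled:

```shell
# Hypothetical baseline recorder: store both a cheap CRC32 (cksum) and
# an expensive SHA-256 for every file now, so a later run can compare
# the cheap value first and only recompute SHA-256 on a CRC mismatch.
find . -type f | while IFS= read -r f; do
    crc=$(cksum "$f" | awk '{ print $1 }')
    sha=$(sha256sum "$f" | awk '{ print $1 }')
    printf '%s %s %s\n' "$crc" "$sha" "$f"
done > baseline.txt
```

The later run then only needs cksum in the common (unchanged) case, while the SHA-256 is already sitting in the historical record for the rare case where the CRC disagrees.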

Have fun,
Duncan
