[Nottingham] [Misc] Making a hash of it...

Martin martin at ml1.co.uk
Sun Jun 7 19:00:48 UTC 2020


Folks,

A brief update:


On 06/06/2020 23:40, Martin via Nottingham wrote:
[---]
> The idea was that this could then be used as a long term integrity check
> and also to check for duplicate files.

A very big speedup for the deduplication is to only consider files that
have the same size and that do not already share the same inode...

Also, for any inode comparison, check that you are staying within the
same device or the same btrfs volume/subvolume. (NB: inode numbers are
unique only within a single filesystem; on btrfs, only within a single
subvolume.)
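
As a rough sketch of that filter, in Python (an assumption for
illustration only; the script itself is not shown in this thread, and
duplicate_candidates is a hypothetical helper name). It groups files by
size and keeps one path per (device, inode) pair, so hard links are not
treated as duplicates and inodes from different filesystems are never
confused with one another:

```python
import os
from collections import defaultdict


def duplicate_candidates(paths):
    """Return groups of same-sized paths that do not share an inode.

    Hypothetical helper for illustration; not taken from the script.
    """
    by_size = defaultdict(dict)
    for path in paths:
        st = os.stat(path)
        # Hard links share (st_dev, st_ino); keep only the first path seen.
        by_size[st.st_size].setdefault((st.st_dev, st.st_ino), path)
    # Only sizes with two or more distinct inodes are worth hashing.
    return [sorted(group.values())
            for group in by_size.values() if len(group) > 1]
```

Only the groups returned here would need to be checksummed; files with
a unique size are skipped without ever being read.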

If you might (inadvertently) be comparing across filesystems and/or
across subvolumes, one check is to use:

stat -f "test_file1"
stat -f "test_file2"

and compare the "ID:" field of each output to see whether the
filesystem IDs are the same (or not)!
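
A similar check can be sketched in Python. Note that this is a
swapped-in technique rather than the stat -f approach above: it
compares the st_dev device number from os.stat() instead of the
filesystem ID, on the assumption that btrfs subvolumes report distinct
st_dev values. The name same_filesystem is hypothetical:

```python
import os


def same_filesystem(path_a, path_b):
    """True if both paths report the same device number (st_dev).

    Hypothetical helper; compares st_dev, not the ID from `stat -f`.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

Inode numbers for two files should only be compared when this returns
True.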


> For speed, the script is pipelined at every step and multiple
> checksums/hashes can run in parallel.

Still a good programming giggle/exercise :-)


[---]
> Next was to check what hash to use today and to benchmark for the
> fastest and/or most worthwhile...

HDDs are still too slow ;-)


[---]
> Pointless to go parallel!!
> 
> 
> In this case, md5sum may as well do.

For comparison, btrfs now includes the (selectable) choice of checksums:

crc32c, xxhash, sha256, blake2b

See:
https://btrfs.wiki.kernel.org/index.php/Main_Page#Major_Features_Currently_Implemented

For the background on why those were selected, see also:
https://kdave.github.io/selecting-hash-for-btrfs/

The xxHash looks spectacular for speed:
https://cyan4973.github.io/xxHash/


For the sake of future-proofing, and because I very easily can, I'm
using b2sum to produce a 512-bit cryptographic hash. Way OTT for this
application, but why not? :-)

https://www.gnu.org/software/coreutils/manual/html_node/b2sum-invocation.html

Which implements:

https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2

Fantastic!
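
For anyone wanting the same 512-bit BLAKE2b digest without shelling out
to b2sum, a minimal Python sketch follows. It uses the standard
library's hashlib.blake2b, whose default digest size is 64 bytes (512
bits), so the hex string should match the one b2sum prints by default.
The function name and the 1 MiB chunk size are illustrative choices,
not anything taken from the script:

```python
import hashlib


def blake2b_hex(path, chunk_size=1024 * 1024):
    """Return the 512-bit BLAKE2b digest of a file as a hex string.

    Hypothetical helper; chunk_size is an arbitrary illustrative value.
    """
    digest = hashlib.blake2b()  # default digest_size is 64 bytes (512 bits)
    with open(path, "rb") as f:
        # Read in chunks so large files are never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```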


> Still, 'twas good fun...

Indeed so :-)

And a good reminder of how filesystems work ;-)


The end result is that the runtime is too fast to enjoy a good brew...

Enjoy!
Martin




