[sclug] Getting rid of duplicate files

Roland Turner SCLUG raz.fpyht.bet.hx at raz.cx
Wed Sep 27 13:02:13 UTC 2006


On Wed, 2006-09-27 at 13:10 +0100, Sean Furey wrote:

> ( cat md5.txt | sort | awk '{ print $2" "$1 }' | uniq -f 1 -d ;
>   cat md5.txt | sort | awk '{ print $2" "$1 }' | uniq -f 1 -D ) |
> sort | uniq -u

That uniq -d/-D/-u trick is revolting; I love it!

There's some ugliness ("cat foo | ...") and some redundancy (you need
not run awk twice, and the input only needs sorting once) so, assuming
bash:


sort md5.txt | awk '{ print $2" "$1 }' |
  tee >(uniq -f 1 -d) >(uniq -f 1 -D) >/dev/null |
  sort | uniq -u


(The particular sequence of operations in pipeline construction means
that the two instances of uniq both write to the pipe which has the
final "sort | uniq -u" listening on it, while it is tee's own,
unprocessed output stream that gets dumped to /dev/null.)
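
For concreteness, here is how the trick plays out on a tiny, made-up
md5.txt (the four-character hashes below are just shortened stand-ins
for real 32-character md5sum output):


printf '%s\n' 'aaaa  a.txt' 'aaaa  b.txt' 'bbbb  c.txt' > md5.txt

sort md5.txt | awk '{ print $2" "$1 }' |
  tee >(uniq -f 1 -d) >(uniq -f 1 -D) >/dev/null |
  sort | uniq -u


"uniq -f 1 -d" emits one line per duplicate group (a.txt) and
"uniq -f 1 -D" emits every member of each group (a.txt and b.txt); the
kept copy turns up twice on the shared pipe and is cancelled by the
final "uniq -u", leaving just "b.txt aaaa" -- the copy that is safe to
delete.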

Unfortunately that awk command won't cope correctly with paths that
have spaces in them, e.g. "iPhoto Library". Also, the uniq trick is just a
little too revolting for my taste. If having a list of files to keep
(move them to safety and delete what remains) is OK, then how about the
somewhat simpler:


sort md5.txt | uniq -w 32
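
For completeness, a sketch of the whole round trip, assuming the files
live under a hypothetical ~/Pictures (any directory will do):


find ~/Pictures -type f -exec md5sum {} + > md5.txt
sort md5.txt | uniq -w 32 > keep.txt


Because -w 32 compares only the leading 32-character hash, paths with
spaces (such as "iPhoto Library") survive intact, and keep.txt ends up
with one md5sum line per distinct file content.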


- Raz


