[Nottingham] Re: Reading filesystem into MySQL query

Thu Oct 20 20:32:02 BST 2005

On Thu, 20 Oct 2005 12:08:35 +0100
Martin <martin at ml1.co.uk> wrote:
> 
> 
> As for finding duplicate files... I once did this as a one-off using 
> "du" to generate a file list with sizes and then "sort" to find the 
> duplicates on name and size. I guess "md5sum" could be used to avoid 
> file name dependence...
> 
> OK, who can do that as a one-liner?
> 

I wrote a script to do that, years ago. I'll transcribe it below. It's
not exactly a one-liner, though :D

It would need some modification for general use: it was intended for
images, and uses findimagedupes to confirm that image files are
identical. I've just modified it so that it won't actually remove
identical files, for safety's sake; it only prints the rm command.
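
For plain (non-image) files, the md5sum idea from the quoted message
gets close to a one-liner. This is only a sketch, not what my script
does: it assumes GNU find and coreutils (uniq's -w and --all-repeated
options are GNU extensions), and find_dupes is just a name made up
for illustration.

```shell
# Sketch: group files with identical content by MD5 digest.
# Assumes GNU find, md5sum and uniq; find_dupes is a hypothetical name.
find_dupes() {
    # md5sum prints "<32 hex digits>  <path>"; after sorting, identical
    # digests are adjacent, so uniq only needs to compare the first
    # 32 characters of each line.
    find "${1:-.}" -type f -exec md5sum {} + |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Called as "find_dupes ~/pictures" (or with no argument, for the current
directory), it prints each group of identical files separated by a
blank line. It hashes every file, so on a big tree it is likely to be
slower than weeding out the unique sizes first.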

James

#!/bin/bash
# eliminate unique filesizes first - much quicker this way
# (name and size of every regular file with a three-character extension)
find . -type f -name '*.???' -ls |awk '{print $NF, $7}' |sort -k 2.1 > /tmp/fsizes.out

echo file list assembled ..
# non-unique sizes 
uniq -d -f1 /tmp/fsizes.out |awk '{print $2}' > /tmp/uniq.out

# compare against the size field only, so numbers inside filenames can't match
while read -r size; do
   awk -v s="$size" '$2 == s {print $1}' /tmp/fsizes.out
done < /tmp/uniq.out > /tmp/nonuniqsize.out

echo "$(wc -l < /tmp/nonuniqsize.out) files have non-unique size"

# "sum" figure (checksum_blocks) for all non-unique size files
for i in $(cat /tmp/nonuniqsize.out) ; do
   echo -n "$i "
   sum "$i" |awk '{print $1"_"$2}'
done |sort -k 2.1 > /tmp/files.out

uniq -d -f1 /tmp/files.out |awk '{print $2}' > /tmp/uniq.out

rm -f /tmp/confirmeddups.out

for i in $(cat /tmp/uniq.out); do
   # all files sharing this checksum, as one space-separated list
   dfils=$(awk -v s="$i" '$2 == s {print $1}' /tmp/files.out | tr "\012" " ")
   if [[ $(echo $dfils | wc -w) -eq 2 ]]; then
      echo checking $dfils
      findimagedupes $dfils |grep -w 100.00% > /dev/null 2>&1 && echo $dfils >> /tmp/confirmeddups.out
   else
      echo getting combinations for $dfils
      all2coms $dfils > /tmp/dfil
      while read a b ; do
         echo checking $a $b
         findimagedupes $a $b |grep -w 100.00% > /dev/null 2>&1 && echo $a $b >> /tmp/confirmeddups.out
      done < /tmp/dfil
   fi
done

if [[ -e /tmp/confirmeddups.out ]]; then
   echo summary:
   sed "s/[0-9]*\.jpg//g ; s/\.\///g" /tmp/confirmeddups.out |sort -u >> ~/td
   surplus=$(awk '{print $2}' /tmp/confirmeddups.out)
   echo "duplicates (the rm command is only printed, not run):"
   echo $surplus |tr " " "\012"
   echo rm $surplus
else
   echo no confirmed duplicates
fi
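
all2coms isn't shown above. Judging by the "while read a b" loop, it is
expected to print every pair of its arguments, one pair per line. The
following is a guess at a minimal stand-in under that assumption, not
the original helper.

```shell
# Hypothetical stand-in for all2coms: print every unordered pair of the
# arguments, one pair per line, separated by a space.
all2coms() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        # pair the argument just removed with each one that remains
        for other in "$@"; do
            printf '%s %s\n' "$first" "$other"
        done
    done
}
```

For example, "all2coms a.jpg b.jpg c.jpg" lists a.jpg with b.jpg, a.jpg
with c.jpg, and b.jpg with c.jpg. Like the rest of the script, it
relies on filenames that contain no whitespace.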

-- 
Dig It : a forum for Euro Beatles fans - http://beatles.dyndns.org/