[sclug] Getting rid of duplicate files

Wed Sep 27 11:29:59 UTC 2006

Hi All

I'm trying to free up space on my hard disk. In particular, I'm trying
to get rid of duplicate images that don't have matching file names. I'm
using md5 in a simple script (the following examples were done in bash
on a Mac, but I need to do the same on Linux). I created a list of all
files under my Pictures dir:

find Pictures/ -type f >> /tmp/pictures.txt

Then I ran this little script to build md5 checksums:

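       PICFILE=/tmp/pictures$$.txt
       MD5FILE=/tmp/md5$$.txt
       # List every regular file under the Pictures directory
       find ~/Pictures/ -type f > "${PICFILE}"
       # IFS= and -r keep spaces and backslashes in file names intact
       while IFS= read -r LINE
       do
         md5sum "${LINE}" >> "${MD5FILE}"
       done < "${PICFILE}"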

I ran the resulting file through sort, which produces a very big
file like this:

      001ada07d60c6c3fd10cd5b17d3bdd69 /Users/timsutton/Pictures//iPhoto Library/Originals/2006/100OLYMP_10/P1010020.JPG
      001dec8a7571a375b659b5da8f292409 /Users/timsutton/Pictures//iPhoto Library/Originals/2006/NEW_FOLDER/P1010020.JPG
      001dec8a7571a375b659b5da8f292409 /Users/timsutton/Pictures//iPhoto Library/Originals/2006/New Folder/p1010020.jpg
      0024cdaf90e82c89df0e74089cada586 /Users/timsutton/Pictures//iPhoto Library/Originals/2006/2004_03_07/114_1415.JPG
      etc.

Now I'm trying to think of a neat way to get rid of the duplicates. I
want to keep at least one file for any given md5. Can anyone offer a
tasty bit of bash / awk / sed / grep etc. that will do that? Or should I
simply revert to a loop with a buffer holding the last md5, check
whether the current value is the same as the buffered value, and if so
delete the file?
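
The buffered-loop idea could be sketched roughly as follows. This is only a minimal sketch, assuming the sorted file has one "checksum path" pair per line and that no file name contains a newline or starts with a space. The function names list_dupes and list_dupes_awk are made up for illustration. Both only print the paths that would be removed, so the list can be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash

# list_dupes (hypothetical name): read SORTED "checksum path" lines on stdin.
# Print the path of every line whose checksum equals the previous line's
# checksum. The first file in each checksum group is never printed, so one
# copy is always kept.
list_dupes() {
  local prev="" sum path
  # "sum" takes the first word; "path" takes the rest of the line,
  # including any spaces inside the file name.
  while read -r sum path; do
    if [ "$sum" = "$prev" ]; then
      printf '%s\n' "$path"
    fi
    prev=$sum
  done
}

# list_dupes_awk (hypothetical name): same result using awk. The input does
# not need to be sorted, because awk remembers every checksum it has seen.
list_dupes_awk() {
  awk 'seen[$1]++ { sub(/^ *[^ ]+ +/, ""); print }'
}

# Possible usage, assuming MD5FILE is the checksum file built earlier:
#   sort "$MD5FILE" | list_dupes > /tmp/dupes.txt     # review this list first
#   while IFS= read -r f; do rm -- "$f"; done < /tmp/dupes.txt
```

Writing the candidates to a file first keeps the destructive step separate. Since two different files could in principle share an md5, comparing a pair with cmp before deleting would be a cautious extra check.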

Thanks

Regards

-- 
Tim Sutton

Visit http://qgis.org for a great Open Source GIS
Home Page: http://linfiniti.com
Skype: timlinux
MSN: tim_bdworld at msn.com
Yahoo: tim_bdworld at yahoo.com
Jabber: timlinux
Irc: timlinux on #qgis at freenode.net