[Nottingham] Re: Reading filesystem into MySQL query
James Gibbon
jg at jamesgibbon.com
Thu Oct 20 20:32:02 BST 2005
On Thu, 20 Oct 2005 12:08:35 +0100
Martin <martin at ml1.co.uk> wrote:
>
>
> As for finding duplicate files... I once did this as a one-off using
> "du" to generate a file list with sizes and then "sort" to find the
> duplicates on name and size. I guess "md5sum" could be used to avoid
> file name dependence...
>
> OK, who can do that as a one-liner?
>
I wrote a script to do that, years ago. I'll transcribe it below. It's
not exactly a one-liner, though :D
It would need a bit of modification for general use: it was intended
for images, and uses findimagedupes to confirm that image files are
identical. I've just modified it so that it won't actually remove
identical files, for safety's sake.
James
#!/bin/bash
# Find duplicate image files under the current directory.
# Limitation: file names containing whitespace are not handled.

# eliminate unique filesizes first - much quicker this way
find . -type f -name '*.???' -ls | awk '{print $NF, $7}' | sort -k2,2 > /tmp/fsizes.out
echo file list assembled ..

# non-unique sizes
uniq -d -f1 /tmp/fsizes.out | awk '{print $2}' > /tmp/uniq.out
for i in $(cat /tmp/uniq.out); do
    # compare the size column only; grep -w could also match digits in a file name
    awk -v s="$i" '$2 == s {print $1}' /tmp/fsizes.out
done > /tmp/nonuniqsize.out
echo $(wc -l < /tmp/nonuniqsize.out) files have non-unique size

# SUM figure for all non-unique size files
for i in $(cat /tmp/nonuniqsize.out); do
    echo -n "$i "
    sum "$i" | awk '{print $1"_"$2}'
done | sort -k2,2 > /tmp/files.out
uniq -d -f1 /tmp/files.out | awk '{print $2}' > /tmp/uniq.out

rm -f /tmp/confirmeddups.out
for i in $(cat /tmp/uniq.out); do
    dfils=$(awk -v c="$i" '$2 == c {print $1}' /tmp/files.out | tr "\012" " ")
    if [[ $(echo $dfils | wc -w) -eq 2 ]]; then
        echo checking $dfils
        findimagedupes $dfils | grep -wF '100.00%' > /dev/null 2>&1 && echo $dfils >> /tmp/confirmeddups.out
    else
        echo getting combinations for $dfils
        # all2coms is a helper that is not included in this post; from its use
        # here it appears to print every pair of its arguments, one pair per line
        all2coms $dfils > /tmp/dfil
        while read -r a b; do
            echo checking $a $b
            findimagedupes $a $b | grep -wF '100.00%' > /dev/null 2>&1 && echo $a $b >> /tmp/confirmeddups.out
        done < /tmp/dfil
    fi
done

if [[ -e /tmp/confirmeddups.out ]]; then
    echo summary:
    # note: the summary is appended to ~/td, not printed
    sed "s/[0-9]*\.jpg//g ; s/\.\///g" /tmp/confirmeddups.out | sort -u >> ~/td
    surplus=$(awk '{print $2}' /tmp/confirmeddups.out)
    echo deleting duplicates:
    echo $surplus | tr " " "\012"
    # safety: only print the rm command; drop the leading "echo" to really delete
    echo rm $surplus
else
    echo no confirmed duplicates
fi
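For general (non-image) use, the size pre-filter in the script could be combined with the md5sum idea from the quoted message. What follows is only a minimal sketch under stated assumptions, not the script transcribed in this post: the function name find_dupes is made up for illustration, and it assumes GNU find (-printf), GNU xargs (-r), md5sum and GNU uniq (-w, --all-repeated). File names containing newlines or backslashes are not handled.

```shell
#!/bin/bash
# find_dupes DIR: print groups of files under DIR that share an MD5 digest.
# Hypothetical helper for illustration; relies on GNU find, xargs, md5sum, uniq.
find_dupes() {
    local dir=${1:-.}
    # 1. list "size<TAB>path" for every regular file
    find "$dir" -type f -printf '%s\t%p\n' |
    # 2. keep only the paths whose size occurs more than once
    awk -F'\t' '{ size[NR] = $1; path[NR] = substr($0, length($1) + 2); count[$1]++ }
         END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }' |
    # 3. checksum the remaining candidates (NUL-separated so spaces survive)
    tr '\n' '\0' | xargs -0 -r md5sum |
    # 4. sort by digest and print only lines whose first 32 characters repeat
    sort | uniq -w32 --all-repeated=separate
}
```

Calling it as `find_dupes ~/pictures` would print one blank-line-separated group per set of identical files, each line being a digest followed by a path. Nothing is deleted; as with the script, removal is left as a separate, deliberate step.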
--
Dig It : a forum for Euro Beatles fans - http://beatles.dyndns.org/