[Nottingham] REading filesystem into MySQL query

Downes, David 02 david.downes02 at ntu.ac.uk
Thu Oct 20 02:12:09 BST 2005


This as a bit of a late reply to the MySQL query thread! I had a similar problem where I needed to sort a few gig of files.  There were lots of duplicates I wanted to delete, it was a one off job so I wanted something relatively quick and easy to automagically sort the files.  For quite a few files the name may change but the contents could be the same so I decided to capture the SHA1 hash as well as the other file details.  I've included the python program below in case it's of any use to ant one one else on the list.  I've only been using python for a relatively short time so it may not be the most elegantly coded ;-) however, it's relatively straight forward to understand it's operation.  I've also some small python utilities which based on the database: mark duplicates for deletion leaving one copy, mark all files of a certain type for deletion, delete all files marked for deletion and finally delete all empty directories.  An advantage of the two stage process is that I was able to review all files to be deleted before it actually happened.  If any one else wants the code they are welcome :-)

David Downes

#!/usr/bin/env python
# Simple python utility to create a list of files by recursively descending
# through a directory structure.  Stores SHA1 hash and other details in 'files'.
# 'hash' stores the number and aggregate sum of the hash values.
# 'type' stores the number and aggregate sum of each file type and an extra field
# used to indicate if the file type should be deleted.
#
#Create the database structure with the SQL statement below: 
#
#CREATE TABLE `files` (
#  `name` varchar(195) NOT NULL default '',
#  `path` varchar(255) NOT NULL default '',
#  `type` varchar(10) NOT NULL default '',
#  `size` int(10) unsigned NOT NULL default '0',
#  `time` int(10) unsigned NOT NULL default '0',
#  `hash` varchar(50) default NULL,
#  `deletefile` enum('y','n') NOT NULL default 'n',
#  PRIMARY KEY  (`name`,`path`));
#
#create table hash (
#  hash varchar(50) not null,
#  count int unsigned not null,
#  sum int unsigned not null,
#  primary key (hash));
#
#CREATE TABLE type (
#  type varchar(10) NOT NULL default '',
#  count int(11) default NULL,
#  deletefiles enum('y','n') default 'n',
#  sum int unsigned not null,
#  PRIMARY KEY  (type));

import string, os, sys
import MySQLdb

def test_walk(dbconn, dirname, names):
  global count
  for name in names:
    dirpath = os.path.abspath(dirname)
    path = dirpath + "/" + name
    if os.path.isfile(path):
      cmd = "sha1sum -b '" + path.replace("'","\'") + "'"
      for line in os.popen(cmd).readlines():
        sha1 = string.split(line)
        filetype = os.path.splitext(name)
        print "[%i] values ('%s','%s','%s',%s,%s,'%s')" % \
      (count, name, dirpath, filetype[1], os.path.getsize(path), os.path.getmtime(path), sha1[0])
        try:
          dbconn.execute("insert into files (name,path,type,size,time,hash) \
          values (%s,%s,%s,%s,%s,%s)", \
          (name, dirpath, filetype[1], os.path.getsize(path), os.path.getmtime(path), sha1[0]))
        except MySQLdb.IntegrityError, e:
          print "Database error: ", e
        count = count + 1

if len(sys.argv)==2:
  if os.path.exists(sys.argv[1]):
    count = 0
    # change to actual user and database name
    db=MySQLdb.connect(user="user", host="localhost", db="files")
    c=db.cursor()
    try:
      c.execute("delete from files")
    except MySQLdb.IntegrityError, e:
      print "Database error: ", e
    db.commit()
    try:
      c.execute("delete from hash")
    except MySQLdb.IntegrityError, e:
      print "Database error: ", e
    db.commit()
    try:
      c.execute("delete from type")
    except MySQLdb.IntegrityError, e:
      print "Database error: ", e
    db.commit()
    os.path.walk(sys.argv[1], test_walk, c)
    db.commit()
    try:
      c.execute("insert into hash (hash,count,sum) select hash,count(hash),sum(size) from files group by hash")
    except MySQLdb.IntegrityError, e:
      print "Database error: ", e
    db.commit()
    try:
      c.execute("insert into type (type,count,sum) select type,count(type),sum(size) from files group by type")
    except MySQLdb.IntegrityError, e:
      print "Database error: ", e
    db.commit()
  else:
    print "Error: %s is not a valid file or directory" % sys.argv[1]
else:
  print "Supply one argument, the diectory or path to be added to the database"


This email is intended solely for the addressee.  It may contain private and confidential information.  If you are not the intended addressee, please take no action based on it nor show a copy to anyone.  In this case, please reply to this email to highlight the error.  Opinions and information in this email that do not relate to the official business of Nottingham Trent University shall be understood as neither given nor endorsed by the University.
Nottingham Trent University has taken steps to ensure that this email and any attachments are virus-free, but we do advise that the recipient should check that the email and its attachments are actually virus free.  This is in keeping with good computing practice.




More information about the Nottingham mailing list