Title: Dupinator -- detect and delete duplicate files
Submitter: Bill Bumgarner
Last Updated: 2005/01/09
Version no: 1.0
Category: Files
Description:
Point this script at one or more folders and it will find and delete all duplicate files within them, leaving behind the first file found in each set of duplicates. It is designed to handle hundreds of thousands of files of any size at a time, and to do so quickly. It was written to eliminate duplicates across several photo libraries that had been shared between users. As the script was a one-off to solve a very particular problem, it has no options, nor has it been refactored into modules or reusable functions.
Source:
import os
import sys
import stat
import md5

filesBySize = {}

def walker(arg, dirname, fnames):
    d = os.getcwd()
    os.chdir(dirname)
    # Do not descend into 'Thumbs' directories.
    try:
        fnames.remove('Thumbs')
    except ValueError:
        pass
    for f in fnames:
        if not os.path.isfile(f):
            continue
        size = os.stat(f)[stat.ST_SIZE]
        # Ignore files smaller than 100 bytes.
        if size < 100:
            continue
        if filesBySize.has_key(size):
            a = filesBySize[size]
        else:
            a = []
            filesBySize[size] = a
        a.append(os.path.join(dirname, f))
    os.chdir(d)

for x in sys.argv[1:]:
    print 'Scanning directory "%s"....' % x
    os.path.walk(x, walker, filesBySize)

print 'Finding potential dupes...'
potentialDupes = []
potentialCount = 0
trueType = type(True)
sizes = filesBySize.keys()
sizes.sort()
for k in sizes:
    inFiles = filesBySize[k]
    outFiles = []
    hashes = {}
    if len(inFiles) == 1:
        continue
    print 'Testing %d files of size %d...' % (len(inFiles), k)
    for fileName in inFiles:
        if not os.path.isfile(fileName):
            continue
        # Open in binary mode so the bytes are hashed unmodified on every platform.
        aFile = file(fileName, 'rb')
        hasher = md5.new(aFile.read(1024))
        hashValue = hasher.digest()
        if hashes.has_key(hashValue):
            x = hashes[hashValue]
            # The first file seen with this hash is recorded once, then replaced by True.
            if type(x) is not trueType:
                outFiles.append(hashes[hashValue])
                hashes[hashValue] = True
            outFiles.append(fileName)
        else:
            hashes[hashValue] = fileName
        aFile.close()
    if len(outFiles):
        potentialDupes.append(outFiles)
        potentialCount = potentialCount + len(outFiles)
del filesBySize

print 'Found %d potential dupes in %d sets...' % (potentialCount, len(potentialDupes))
print 'Scanning for real dupes...'
dupes = []
for aSet in potentialDupes:
    hashes = {}
    for fileName in aSet:
        print 'Scanning file "%s"...' % fileName
        aFile = file(fileName, 'rb')
        hasher = md5.new()
        while True:
            r = aFile.read(4096)
            if not len(r):
                break
            hasher.update(r)
        aFile.close()
        hashValue = hasher.digest()
        # Group by full-file hash so that each set of dupes keeps its own original,
        # even when one size group contains several distinct sets.
        hashes.setdefault(hashValue, []).append(fileName)
    for outFiles in hashes.values():
        if len(outFiles) > 1:
            dupes.append(outFiles)

i = 0
for d in dupes:
    print 'Original is %s' % d[0]
    for f in d[1:]:
        i = i + 1
        print 'Deleting %s' % f
        os.remove(f)
    print
Discussion:
The script uses a multipass approach to finding duplicate files. First, it walks all of the directories passed in and groups all files by size. In the next pass, the script walks each set of files of the same size and checksums the first 1024 bytes. Finally, the script walks each set of files that have the same size and the same hash of the first 1024 bytes, and checksums each file in its entirety.
The very last step is to walk each set of files of the same length/hash and delete all but the first file in the set.
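The passes described above can be sketched in a more reusable form. This is only an illustration, not the recipe itself: it assumes Python 3, so it uses `hashlib` and `os.walk` in place of the `md5` module and `os.path.walk`, and the names `find_duplicates`, `delete_duplicates`, `min_size` and `dry_run` are hypothetical. It also does not skip `Thumbs` directories the way the script does.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path, limit=None, chunk_size=4096):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        remaining = limit
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(size)
            if not chunk:
                break
            hasher.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return hasher.digest()


def _split(paths, key):
    """Partition paths by key(path), keeping only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(roots, min_size=100):
    """Return lists of paths with identical content; the first entry is the one to keep."""
    paths = []
    for root in roots:
        for dirname, _subdirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirname, name)
                if os.path.isfile(path) and os.path.getsize(path) >= min_size:
                    paths.append(path)

    duplicates = []
    for same_size in _split(paths, os.path.getsize):  # pass 1: group by size
        # pass 2: within a size group, compare a hash of the first 1024 bytes
        for same_head in _split(same_size, lambda p: _digest(p, 1024)):
            # pass 3: within those, compare a hash of the whole file
            duplicates.extend(_split(same_head, _digest))
    return duplicates


def delete_duplicates(groups, dry_run=True):
    """Delete all but the first file of each group; only report when dry_run is true."""
    removed = []
    for _original, *extras in groups:
        for path in extras:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Calling `delete_duplicates(find_duplicates(['photos']))` with the default `dry_run=True` only returns the paths that would be removed, which is a cheap way to review the result before deleting anything.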
It ran against a 3.5 gigabyte set of about 120,000 files, of which about 50,000 were duplicates, most of them over 1 megabyte. The total run took about 2 minutes on a 1.33 GHz G4 PowerBook. That is fast enough for me, and fast enough without actually optimizing anything beyond the obvious.
Number of comments: 5
Hard links?, Martin Blais, 2005/01/13
This is really cool; I was going to write something very similar. My application: I have made backups for the last ten years, many of them complete backups, which have now been copied onto a single HDD for safety. Many of these files are identical. I wanted to hardlink them together to save disk space. I can now just modify your app! Thanks,
fslint, Drew Perttula, 2005/01/13
http://www.iol.ie/~padraiga/fslint/ does this, as well as other useful things (and in shorter code too, I believe). I haven't investigated the various file-compare optimizations of each system.
BTW, a common use of the fslint tools is to find dups on the same filesystem and replace them with hardlinks. If you don't care about the once-identical files being forever identical, you can avoid needless space waste.
Shortcuts In Windows, thattommyhall, 2007/08/29
I love this, freed up 60G from our stuffed file server
import os
import shutil
import string
import win32com.client

def mkshortcut(source, target):
    shell = win32com.client.Dispatch("WScript.Shell")
    shortcut = shell.CreateShortCut(source)
    shortcut.Targetpath = target
    shortcut.save()

def shortcutise(list_of_duplicates):
    #the "canonical" file is named the same as the first one in the list
    filename = list_of_duplicates[0].split('\\')[-1]
    #Just takes first 6 parts of the path (maps quite nicely to client folder in our setup, you will need to decide where you want them (may
    dupepath = string.join(list_of_duplicates[0].split('\\')[:6],'\\') + '\\DUPLICATED\\'
    if not os.path.isdir(dupepath):
        os.mkdir(dupepath)
    canonical = dupepath + filename
    for i in list_of_duplicates:
        #added in case duplicate list is calculated first and file no longer exists
        if not os.path.isfile(i):
            continue
        if not os.path.isfile(canonical):
            #creates the canonical file if it does not exist
            print "moving ",i, canonical
            shutil.move(i, canonical)
            print "linking ",i, canonical
            mkshortcut(i + '.lnk', canonical)
            continue
        print "linking ",i, canonical
        mkshortcut(i + '.lnk', canonical)
        print "deleting", i
        os.remove(i)
We used:
dupefile = open('duplicates.txt','r').read()
for duplicatelist in dupefile.split('******************\n'):
    duplicatelist = duplicatelist.split('\n')
    shortcutise(duplicatelist)
where duplicates.txt was created with a slight modification of the main script.
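The modification that writes duplicates.txt is not shown, so the file layout can only be inferred from the parsing loop: one path per line, with sets separated by a line of asterisks. A minimal sketch of that inferred layout, assuming Python 3 (the names `dump_groups` and `load_groups` are hypothetical):

```python
# Separator copied from the split() call in the snippet above.
SEPARATOR = '******************\n'


def dump_groups(groups):
    """Serialise duplicate sets: one path per line, sets divided by SEPARATOR."""
    return SEPARATOR.join('\n'.join(group) + '\n' for group in groups)


def load_groups(text):
    """Parse the same layout back, dropping the empty strings that split() leaves behind."""
    groups = []
    for block in text.split(SEPARATOR):
        paths = [line for line in block.split('\n') if line]
        if paths:
            groups.append(paths)
    return groups
```

Filtering out empty lines matters here because a plain `split('\n')` on a block that ends in a newline yields a trailing empty string, which would otherwise be passed along as if it were a path.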
Size of the scan, thattommyhall, 2007/08/30
We freed up 80G on a 1.6TB SAN in about 3 hours, memory usage was fine throughout (I ran another freeware duplicate finder and it crashed twice)
73 Gigs freed !, Benjamin Sergeant, 2007/10/04
Cool !
I had lots of duplicated MP3s, waiting forever to be correctly tagged and so kept in different directories but then merged, plus backups.
[bsergean@lisa1 ~]$ df -h /media # before
Filesystem            Size  Used Avail Use% Mounted on
/dev/md3              287G  285G  2.7G 100% /media
[bsergean@lisa1 ~]$ df -h /media # after
Filesystem            Size  Used Avail Use% Mounted on
/dev/md3              287G  212G   76G  74% /media
Here is the stupid patch to create hardlinks instead of deleting files.
[bsergean@lisa1 bin]$ svn diff find_dup.py
Index: find_dup.py
===================================================================
--- find_dup.py (revision 164)
+++ find_dup.py (working copy)
@@ -96,6 +96,7 @@
     print 'Original is %s' % d[0]
     for f in d[1:]:
         i = i + 1
-        print 'Deleting %s' % f
+        print 'Deleting %s and hardlinking it' % f
         os.remove(f)
+        os.link(d[0],f)
     print
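The patch removes the duplicate before creating the link, so a failure in `os.link` (for example when the two paths are on different filesystems) would leave the duplicate's name missing. A slightly more defensive variant can be sketched as follows, assuming Python 3.3 or later for `os.replace`; the function name and the temporary suffix are hypothetical:

```python
import os


def replace_with_hardlink(original, duplicate):
    """Turn `duplicate` into a hard link to `original` without first deleting it."""
    if os.path.samefile(original, duplicate):
        return  # already the same file; nothing to do
    temp_name = duplicate + ".dupinator-tmp"  # hypothetical temporary name
    # Create the link under a temporary name first; this raises OSError
    # (leaving the duplicate untouched) if the paths are on different filesystems.
    os.link(original, temp_name)
    try:
        os.replace(temp_name, duplicate)  # rename the link over the duplicate
    except OSError:
        os.remove(temp_name)
        raise
```

In the final loop this would be called as `replace_with_hardlink(d[0], f)` in place of the `os.remove(f)` and `os.link(d[0],f)` pair.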