Geohash: Image database system for detecting duplicate images

Wed Sep 26 11:40:39 EDT 2001

I have put on my webpage

     http://members.tripod.com/~edcjones/pycode.html

a collection of modules I call "geohash" (geometric hashing).
This is a set of programs for detecting duplicated images in
large image collections. The duplicated images may be
different sizes, cropped, etc.

The code is mostly written in Python / Numeric / PIL but there
is some C and C++. MySQL is used for the image database. The
code is written for a PC with Linux.

Some parts of geohash can stand alone. They are:

db.imagedb
     This is an image database program using MySQL. It stores
     basic information about images in a SQL table. It
     currently also stores a set of 10 well-spaced image
     features used by the image matching algorithm. It would be
     easy to change the program to store any hashable Python
     object of reasonable size (Search image_data.py for
     "features"). Therefore the program is useful for any
     situation where information is extracted from images for
     use later.

CVnoipl
     CV was originally written so that I could use Numeric, PIL,
     and the Intel OpenCV and IPL systems together in the same
     program.  CVnoipl is CV without the Intel code. Most of
     the work in extending CV to other systems is preparing a
     Python wrapper for the new system. If anyone is interested
     in the full CV code (very ugly), I will attach the Intel
     licensing documents and email it to you.

checkims
     Checks image files for corruption.

dumpit
     This is a simple, cute program for writing to log files.

filestuff
     relative_pathname(dirname, basedir)
         Find the pathname of "dirname" relative to "basedir".
     listsubdirs(dirname)
         List all the immediate sub-directories of "dirname".
     listfiles(directory, includes=None, excludes=[])
         List the files in a directory. Leave out directories
         and other non-files, UNIX hidden files, and file names
         ending with "~".  Exclude files with extensions listed
         in "excludes". Include only files with extensions
         listed in "includes". Returns list of full pathnames.
     join(directory, name)
         Create a pathname where file "name" is in directory
         "directory". For example join('/A/B', 'x/y') returns
         "/A/B/y".
     get_all_names(directory, includes=None, excludes=[])
         List all the files and subdirectories in or below
         "directory".  Exclude UNIX hidden files, hidden
         directories and their contents, and all files with
         extensions in the list "excludes". Include only files
         with extensions in the list "includes". Case is
         ignored for the includes and excludes. The directory
         "directory" is not included in the directory list. A
         tuple (files, subdirs) is returned.
     files_list(directory, includes=None, excludes=[])
         Return a list containing information about all the
         files in and below "directory".
     find_files(topdirs, filename, includes=None)
         Writes to "filename" a series of lines. Each line
         contains a filename. The names must end in one of the
         "includes".  If "includes" is None, any ending is OK.
         UNIX hidden directories and files are excluded. Fast
         because it uses "find". Lists only the filenames.
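
     As a rough illustration of the descriptions above, here is a
     minimal sketch of relative_pathname, join and listfiles built
     only on os.path. It is not the filestuff source. It assumes
     that extensions are written with the leading dot (".jpg") and
     compared exactly, and it uses an empty tuple as the default
     for "excludes".

```python
import os


def relative_pathname(dirname, basedir):
    """Return the pathname of "dirname" relative to "basedir"."""
    return os.path.relpath(dirname, basedir)


def join(directory, name):
    """Place the final component of "name" inside "directory"."""
    # Assumption: any directory part of "name" is discarded, which is
    # what the example join('/A/B', 'x/y') -> '/A/B/y' suggests.
    return os.path.join(directory, os.path.basename(name))


def listfiles(directory, includes=None, excludes=()):
    """Return the full pathnames of the regular files in "directory"."""
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # directories and other non-files
        if name.startswith(".") or name.endswith("~"):
            continue  # UNIX hidden files and editor backups
        ext = os.path.splitext(name)[1]
        if ext in excludes:
            continue
        if includes is not None and ext not in includes:
            continue
        result.append(path)
    return result
```

     For example, listfiles('/A/B', includes=['.jpg']) would return
     only the visible ".jpg" files directly inside "/A/B".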

numpil
     Converts Numeric arrays to PIL images and vice versa. Also
     explains the (row, col) and (x, y) indexing conventions.
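
     The two conventions can be shown without either library. The
     sketch below is not taken from numpil; it uses a plain list
     of rows to stand in for an array, and the helper names
     pixel_at and image_size are hypothetical. Arrays are indexed
     (row, col), while PIL addresses pixels as (x, y) and reports
     its size as (width, height), so the two orders are swapped.

```python
def pixel_at(rows, x, y):
    """Return the pixel at image coordinate (x, y) from row-major data."""
    # x counts columns and y counts rows, so the array index is [y][x].
    return rows[y][x]


def image_size(rows):
    """Return (width, height), the order PIL uses for an image size."""
    return (len(rows[0]), len(rows))


# A 2-row by 3-column image: the array shape is (2, 3),
# while the image size is (3, 2).
rows = [[10, 11, 12],
        [20, 21, 22]]
```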

MultiDict
     Like a dictionary but each key can occur more than once.
     Useful for working with the mathematical definition of a
     function as a set of ordered pairs or for manipulating
     SQL-like tables.
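
     One common way to build such a container is a dictionary that
     maps each key to a list of values. The sketch below follows
     that approach; it is not the MultiDict source, and the class
     name SimpleMultiDict and its methods add, getall and items
     are hypothetical.

```python
class SimpleMultiDict:
    """A mapping in which each key may occur more than once."""

    def __init__(self, pairs=()):
        self._data = {}
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        """Record one more (key, value) pair."""
        self._data.setdefault(key, []).append(value)

    def getall(self, key):
        """Return a list of every value stored under "key"."""
        return list(self._data.get(key, []))

    def items(self):
        """Return all (key, value) pairs, one per stored value."""
        return [(k, v) for k, values in self._data.items() for v in values]

    def __len__(self):
        """Count pairs, not distinct keys."""
        return sum(len(values) for values in self._data.values())
```

     Viewed as a set of ordered pairs, SimpleMultiDict([('a', 1),
     ('a', 2)]) holds two pairs that share the key 'a', much like
     two rows of a table with the same value in one column.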

external_sort
     Sorts files too large for memory. Uses "merge".
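
     The general technique can be sketched as follows: cut the
     input into sorted runs that each fit in memory, then merge
     the runs. This is not the external_sort source. It assumes a
     text file with one record per line, the chunk_lines parameter
     is hypothetical, and it uses the standard library function
     heapq.merge in place of the geohash "merge" function.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(in_path, out_path, chunk_lines=100000):
    """Sort the lines of in_path into out_path using bounded memory."""
    run_paths = []
    try:
        # Pass 1: write sorted runs of at most chunk_lines lines each.
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                # A final line without a newline must not fuse with
                # its neighbour once the lines are reordered.
                chunk = [s if s.endswith("\n") else s + "\n" for s in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: merge the runs, reading them lazily line by line.
        runs = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for f in runs:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)
```

     This version opens every run at once, so a very large input
     with a small chunk_lines could run into the limit on open
     files; merging the runs in several rounds would avoid that.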

merge
     A C function that merges a pair of sorted Python lists.
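
     For readers unfamiliar with the operation, here is the same
     idea as a pure Python sketch. It is not a translation of the
     C code, and the name merge_sorted is hypothetical. Both
     inputs are assumed to be sorted already.

```python
def merge_sorted(a, b):
    """Merge two sorted lists into one new sorted list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Take from "b" only when it is strictly smaller, so equal
        # items from "a" come first and the merge is stable.
        if b[j] < a[i]:
            result.append(b[j])
            j += 1
        else:
            result.append(a[i])
            i += 1
    # At most one of these still has items left.
    result.extend(a[i:])
    result.extend(b[j:])
    return result
```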

thumbs
     Makes a collection of thumbnails. Uses the ImageMagick
     program "convert".