Geohash: Image database system for detecting duplicate images
Edward C. Jones
edcjones at erols.com
Wed Sep 26 11:40:39 EDT 2001
I have put on my webpage
http://members.tripod.com/~edcjones/pycode.html
a collection of modules I call "geohash" (geometric hashing).
This is a set of programs for detecting duplicated images in
large image collections. The duplicated images may be
different sizes, cropped, etc.
The code is mostly written in Python / Numeric / PIL but there
is some C and C++. MySQL is used for the image database. The
code is written for a PC with Linux.
Some parts of geohash can stand alone. They are:
db.imagedb
This is an image database program using MySQL. It stores
basic information about images in a SQL table. It
currently also stores a set of 10 well-spaced image
features used by the image matching algorithm. It would be
easy to change the program to store any hashable Python
object of reasonable size (search image_data.py for
"features"). Therefore the program is useful for any
situation where information is extracted from images for
use later.
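The idea of keeping basic image information plus a set of extracted features in one SQL row can be sketched as follows. This is only an illustration, not the db.imagedb code: it swaps in the standard-library sqlite3 module for MySQL so that it runs without a server, and the table layout, the column names, and the functions open_db, store_image and load_features are all hypothetical. The features are pickled, which is one way to store an arbitrary Python object of reasonable size.

```python
import pickle
import sqlite3


def open_db(path=':memory:'):
    """Open a database and create a (hypothetical) images table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "pathname TEXT PRIMARY KEY, width INTEGER, height INTEGER, "
        "features BLOB)")
    return conn


def store_image(conn, pathname, width, height, features):
    """Store basic image information and a pickled features object."""
    blob = pickle.dumps(features)
    conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)",
                 (pathname, width, height, blob))
    conn.commit()


def load_features(conn, pathname):
    """Return the stored features object, or None if pathname is unknown."""
    row = conn.execute("SELECT features FROM images WHERE pathname = ?",
                       (pathname,)).fetchone()
    return None if row is None else pickle.loads(row[0])
```

For example, store_image(conn, 'a.jpg', 640, 480, features) followed by load_features(conn, 'a.jpg') gives back an equal features object.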
CVnoipl
Originally CV was written so I could use Numeric, PIL and
the Intel OpenCV and IPL systems at the same time in
programs. CVnoipl is CV without the Intel code. Most of
the work in extending CV to other systems is preparing a
Python wrapper for the new system. If anyone is interested
in the full CV code (very ugly), I will attach the Intel
licensing documents and email it to you.
checkims
Checks image files for corruption.
dumpit
This is a simple, cute program for writing to log files.
filestuff
relative_pathname(dirname, basedir)
Find the pathname of "dirname" relative to "basedir".
listsubdirs(dirname)
List all the immediate sub-directories of "dirname".
listfiles(directory, includes=None, excludes=[])
List the files in a directory. Leave out directories
and other non-files, UNIX hidden files, and file names
ending with "~". Exclude files with extensions listed
in "excludes". Include only files with extensions
listed in "includes". Returns list of full pathnames.
join(directory, name)
Create a pathname that places the file "name" in the directory
"directory". For example join('/A/B', 'x/y') returns
"/A/B/y".
get_all_names(directory, includes=None, excludes=[])
List all the files and subdirectories in or below
"directory". Exclude UNIX hidden files, hidden
directories and their contents, and all files with
extensions in the list "excludes". Include only files
with extensions in the list "includes". Case is
ignored for the includes and excludes. The directory
"directory" is not included in the directory list. A
tuple (files, subdirs) is returned.
files_list(directory, includes=None, excludes=[])
Return a list containing information about all the
files in and below "directory".
find_files(topdirs, filename, includes=None)
Writes to "filename" a series of lines. Each line
contains a filename. The names must end in one of the
"includes". If "includes" is None, any ending is OK.
UNIX hidden directories and files are excluded. Fast
because it uses "find". Lists only the filenames.
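Three of the filestuff functions described above can be approximated with the modern standard library. The following is a minimal sketch of the documented behaviour, not the original code. It assumes that extensions are given with a leading dot (for example ".jpg"), that listfiles ignores case the way get_all_names is said to, and that join keeps only the last component of "name", as the example in its description suggests.

```python
import os


def relative_pathname(dirname, basedir):
    """Pathname of dirname relative to basedir."""
    return os.path.relpath(dirname, basedir)


def join(directory, name):
    """Place the last component of name inside directory."""
    return os.path.join(directory, os.path.basename(name))


def listfiles(directory, includes=None, excludes=()):
    """List the regular files in directory as full pathnames.

    Skips non-files, UNIX hidden files, and names ending in "~".
    Extension matching ignores case (an assumption of this sketch).
    """
    if includes is not None:
        includes = {ext.lower() for ext in includes}
    excludes = {ext.lower() for ext in excludes}
    result = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if not os.path.isfile(full):
            continue
        if name.startswith('.') or name.endswith('~'):
            continue
        ext = os.path.splitext(name)[1].lower()
        if ext in excludes:
            continue
        if includes is not None and ext not in includes:
            continue
        result.append(full)
    return result
```

The default for excludes is an empty tuple rather than the [] in the signatures above, which avoids sharing one mutable default between calls.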
numpil
Converts Numeric arrays to PIL images and vice versa. Also
explains the (row, col) and (x, y) indexing conventions.
MultiDict
Like a dictionary but each key can occur more than once.
Useful for working with the mathematical definition of a
function as a set of ordered pairs or for manipulating
SQL-like tables.
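A dictionary in which a key can occur more than once can be sketched as a mapping from each key to a list of values. The class below is an illustration of the idea only; the method names add, getall and items are hypothetical and are not taken from the MultiDict module.

```python
from collections import defaultdict


class MultiDict:
    """Mapping in which one key may be paired with several values."""

    def __init__(self, pairs=()):
        self._data = defaultdict(list)
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        """Add one (key, value) pair; earlier pairs are kept."""
        self._data[key].append(value)

    def getall(self, key):
        """Return a list of all values stored under key."""
        return list(self._data.get(key, []))

    def items(self):
        """Yield every (key, value) pair, like rows of a two-column table."""
        for key, values in self._data.items():
            for value in values:
                yield key, value

    def __len__(self):
        """Number of stored pairs, not the number of distinct keys."""
        return sum(len(values) for values in self._data.values())
```

Seen this way, items() enumerates a set of ordered pairs, and getall(key) selects the rows of the table whose first column equals key.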
external_sort
Sorts files too large for memory. Uses "merge".
merge
A C function that merges a pair of sorted Python lists.
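The two entries above fit together roughly as follows. This is a pure-Python sketch under stated assumptions, not the original code: merge_pair stands in for the C function, and external_sort sorts the lines of a text file by writing sorted runs to temporary files and merging them two at a time. The names merge_pair, merge_files and the chunk_lines parameter are hypothetical.

```python
import itertools
import os
import shutil
import tempfile


def merge_pair(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    return out + a[i:] + b[j:]


def merge_files(path_a, path_b, path_out):
    """Merge two sorted text files line by line without loading them."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, 'w') as out:
        la, lb = fa.readline(), fb.readline()
        while la and lb:
            if lb < la:
                out.write(lb)
                lb = fb.readline()
            else:
                out.write(la)
                la = fa.readline()
        out.write(la or lb)   # the one line already read but not written
        out.writelines(fa)    # whatever remains of either file
        out.writelines(fb)


def external_sort(in_path, out_path, chunk_lines=100000):
    """Sort the lines of in_path into out_path using bounded memory."""
    tmpdir = tempfile.mkdtemp()
    names = (os.path.join(tmpdir, 'run%d' % n) for n in itertools.count())
    runs = []
    try:
        with open(in_path) as src:
            while True:
                # Read one chunk; make sure every line ends with a newline.
                chunk = [ln if ln.endswith('\n') else ln + '\n'
                         for ln in itertools.islice(src, chunk_lines)]
                if not chunk:
                    break
                chunk.sort()
                run = next(names)
                with open(run, 'w') as f:
                    f.writelines(chunk)
                runs.append(run)
        if not runs:
            open(out_path, 'w').close()
            return
        while len(runs) > 1:
            merged = next(names)
            merge_files(runs[0], runs[1], merged)
            runs = runs[2:] + [merged]
        shutil.copyfile(runs[0], out_path)
    finally:
        shutil.rmtree(tmpdir)
```

Only chunk_lines lines are held in memory at once; the merge step reads one line from each run at a time.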
thumbs
Makes a collection of thumbnails. Uses the ImageMagick
program "convert".
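Calling "convert" from Python to make one thumbnail might look like the sketch below. It is not the thumbs code: the function names and the default size of 128 pixels are hypothetical, and it assumes an ImageMagick installation that provides the "convert" command with its "-thumbnail" option.

```python
import subprocess


def thumbnail_command(src, dst, size=128):
    """Build the convert command line for one thumbnail.

    "-thumbnail WxH" shrinks the image to fit inside a size x size box
    while keeping its aspect ratio.
    """
    return ['convert', src, '-thumbnail', '%dx%d' % (size, size), dst]


def make_thumbnail(src, dst, size=128):
    """Run convert; raises CalledProcessError if convert fails."""
    subprocess.run(thumbnail_command(src, dst, size), check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names.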