Geohash: Image database system for detecting duplicate images
Edward C. Jones
edcjones at erols.com
Wed Sep 26 17:40:39 CEST 2001
I have put on my webpage
a collection of modules I call "geohash" (geometric hashing).
This is a set of programs for detecting duplicated images in
large image collections. The duplicated images may be
different sizes, cropped, etc.
The code is mostly written in Python / Numeric / PIL but there
is some C and C++. MySQL is used for the image database. The
code is written for a PC with Linux.
Some parts of geohash can stand alone. They are:
This is an image database program using MySQL. It stores
basic information about images in a SQL table. It
currently also stores a set of 10 well-spaced image
features used by the image matching algorithm. It would be
easy to change the program to store any hashable Python
object of reasonable size (search image_data.py for
"features"). Therefore the program is useful for any
situation where information is extracted from images for
Originally CV was written so I could use Numeric, PIL and
the Intel OpenCV and IPL systems at the same time in
programs. CVnoipl is CV without the Intel code. Most of
the work in extending CV to other systems is preparing a
Python wrapper for the new system. If anyone is interested
in the full CV code (very ugly), I will attach the Intel
licensing documents and email it to you.
Checks image files for corruption.
This is a simple, cute program for writing to log files.
Find the pathname of "dirname" relative to "basedir".
List all the immediate sub-directories of "dirname".
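
The two path helpers described just above could be approximated with the standard library roughly as follows. This is only a sketch in present-day Python, not the code from the package: the function names relative_path and immediate_subdirs are hypothetical, and returning bare directory names (rather than full paths) is an assumption.

```python
import os


def relative_path(dirname, basedir):
    # Hypothetical name: pathname of "dirname" relative to "basedir".
    return os.path.relpath(dirname, basedir)


def immediate_subdirs(dirname):
    # Hypothetical name: names of the directories directly inside "dirname".
    # Assumption: bare names are returned, sorted for a stable order.
    return sorted(entry.name for entry in os.scandir(dirname) if entry.is_dir())
```

For example, relative_path(os.path.join("A", "B", "C"), "A") gives the path "B/C" on a UNIX system.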
listfiles(directory, includes=None, excludes=)
List the files in a directory. Leave out directories
and other non-files, UNIX hidden files, and file names
ending with "~". Exclude files with extensions listed
in "excludes". Include only files with extensions
listed in "includes". Returns list of full pathnames.
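
The filtering rules of listfiles can be sketched as below. This is an illustration and not the original code. It assumes that extensions are written with a leading dot (".jpg"), that they are compared exactly as given, that "excludes" defaults to an empty sequence, and that "full pathname" means an absolute path.

```python
import os


def listfiles(directory, includes=None, excludes=()):
    # Assumption: extensions carry a leading dot and are compared exactly.
    directory = os.path.abspath(directory)
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # directories and other non-files
        if name.startswith(".") or name.endswith("~"):
            continue  # UNIX hidden files and "~" backup files
        ext = os.path.splitext(name)[1]
        if ext in excludes:
            continue
        if includes is not None and ext not in includes:
            continue
        result.append(path)
    return result
```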
Create a pathname where file "name" is in directory
"dir". For example join('/A/B', 'x/y') returns
get_all_names(directory, includes=None, excludes=)
List all the files and subdirectories in or below
"directory". Exclude UNIX hidden files, hidden
directories and their contents, and all files with
extensions in the list "excludes". Include only files
with extensions in the list "includes". Case is
ignored for the includes and excludes. The directory
"directory" is not included in the directory list. A
tuple (files, subdirs) is returned.
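
A recursive walk with the rules described for get_all_names might look like the sketch below. It is built on os.walk, which the original code may or may not use. Extensions with a leading dot and an empty default for "excludes" are assumptions.

```python
import os


def get_all_names(directory, includes=None, excludes=()):
    # Case is ignored for includes and excludes, as described above.
    inc = None if includes is None else {e.lower() for e in includes}
    exc = {e.lower() for e in excludes}
    files, subdirs = [], []
    for root, dirnames, filenames in os.walk(directory):
        # Prune hidden directories in place so os.walk never descends
        # into them; their contents are skipped as a consequence.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        # Only subdirectories are recorded, so "directory" itself is left out.
        subdirs.extend(os.path.join(root, d) for d in dirnames)
        for name in sorted(filenames):
            if name.startswith("."):
                continue  # UNIX hidden file
            ext = os.path.splitext(name)[1].lower()
            if ext in exc:
                continue
            if inc is not None and ext not in inc:
                continue
            files.append(os.path.join(root, name))
    return files, subdirs
```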
files_list(directory, includes=None, excludes=)
Return a list containing information about all the
files in and below "directory".
find_files(topdirs, filename, includes=None)
Writes to "filename" a series of lines. Each line
contains a filename. The names must end in one of the
"includes". If "includes" is None, any ending is OK.
UNIX hidden directories and files are excluded. Fast
because it uses "find". Lists only the filenames.
Converts Numeric arrays to PIL images and vice versa. Also
explains the (row, col) and (x, y) indexing conventions.
Like a dictionary but each key can occur more than once.
Useful for working with the mathematical definition of a
function as a set of ordered pairs or for manipulating
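
A dictionary-like container in which a key may occur more than once can be sketched as follows. The class name MultiDict and all of its methods are hypothetical and are not taken from the module described here; values are assumed to be hashable.

```python
class MultiDict:
    """Hypothetical sketch: a relation stored as a set of (key, value) pairs."""

    def __init__(self, pairs=()):
        self._data = {}
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        # The same key may be added any number of times.
        self._data.setdefault(key, []).append(value)

    def get_all(self, key):
        # Every value stored under "key", in insertion order.
        return list(self._data.get(key, []))

    def items(self):
        for key, values in self._data.items():
            for value in values:
                yield (key, value)

    def inverse(self):
        # Swap each ordered pair (k, v) to (v, k).
        return MultiDict((value, key) for key, value in self.items())

    def is_function(self):
        # A set of ordered pairs is a function when no key maps to
        # two different values.
        return all(len(set(values)) <= 1 for values in self._data.values())
```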
Sorts files too large for memory. Uses "merge".
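
The general technique of sorting a file that does not fit in memory is to sort chunks, write each as a temporary "run", and merge the runs. The sketch below shows that idea only. It uses heapq.merge from the standard library as a stand-in for the "merge" module, and the function name sort_large_file and the max_lines parameter are assumptions.

```python
import heapq
import itertools
import os
import tempfile


def sort_large_file(src, dst, max_lines=100000):
    # Sort the lines of "src" into "dst", reading at most max_lines at a time.
    run_paths = []
    try:
        with open(src) as infile:
            while True:
                chunk = list(itertools.islice(infile, max_lines))
                if not chunk:
                    break
                # Make sure every line ends with a newline before sorting.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)
        runs = [open(p) for p in run_paths]
        try:
            with open(dst, "w") as out:
                # heapq.merge reads the sorted runs lazily, line by line.
                out.writelines(heapq.merge(*runs))
        finally:
            for r in runs:
                r.close()
    finally:
        for p in run_paths:
            os.remove(p)
```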
A C function that merges a pair of sorted Python lists.
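
For reference, merging a pair of sorted lists works as in the pure-Python sketch below. The module itself is written in C; this version only illustrates the algorithm, and its signature is an assumption.

```python
def merge(a, b):
    # Merge two already-sorted lists into one new sorted list.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            result.append(b[j])
            j += 1
        else:
            # On ties the item from "a" goes first, so the merge is stable.
            result.append(a[i])
            i += 1
    # At most one of these two tails is non-empty.
    result.extend(a[i:])
    result.extend(b[j:])
    return result
```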
Makes a collection of thumbnails. Uses the ImageMagick