Fastest database solution

Curt Hash curt.hash at
Fri Feb 6 21:03:23 CET 2009

On Fri, Feb 6, 2009 at 2:12 AM, Roger Binns <rogerb at> wrote:
> Curt Hash wrote:
> > I started out using sqlite3, but was not satisfied with the performance
> > results. I then tried using psycopg2 with a local postgresql server, and
> > the performance got even worse.
> SQLite is in the same process.  Communication with postgres is via
> another process so marshalling the traffic and context switches will
> impose overhead as you found.
> > I don't think
> > my code/queries are inherently slow, but I'm not a DBA or a very
> > accomplished Python developer, so I could be wrong.
> It doesn't sound like a database is the best solution to your issue
> anyway.  A better solution would likely be some form of hashing the
> lines and storing something that gives quick hash lookups.  The hash
> would have to do things like not care what variable names are used etc.
> There are already lots of plagiarism detectors out there, so it may be
> more prudent to use one of them, or at least to learn how they do things
> so your own system could improve on them.

Currently, I am stripping extra whitespace and end-of-line characters
from each line of source code and storing that in addition to its hash
in a table. That table is used for exact-match comparisons. I am also
passing the source code through flex/bison to canonicalize identifiers
-- the resulting lines are also hashed and stored in a table. That
table is used for structural matching. Both tables are queried to find
matching hashes. I'm not sure how I could make the hash lookups any
faster.
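
The two-table scheme described above could be sketched roughly as
follows. This is only an illustration under several assumptions, not the
actual application: the table names (exact_lines, structural_lines), the
helper names, the small KEYWORDS set, and the use of SHA-1 are all
hypothetical, and a regular expression stands in for the flex/bison pass
that really canonicalizes identifiers.

```python
import hashlib
import re
import sqlite3

# Hypothetical keyword set; a real lexer would know the full grammar.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "char", "void"}


def normalize(line):
    """Collapse runs of whitespace and drop end-of-line characters."""
    return " ".join(line.split())


def canonicalize(line):
    """Replace every non-keyword identifier with the placeholder ID.

    A crude regex stand-in for a flex/bison canonicalization pass.
    """
    def repl(match):
        word = match.group(0)
        return word if word in KEYWORDS else "ID"
    return re.sub(r"\b[A-Za-z_]\w*\b", repl, normalize(line))


def digest(text):
    """Hex SHA-1 of a normalized line (the hash function is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_db():
    """Create an in-memory database with one indexed table per match type."""
    conn = sqlite3.connect(":memory:")
    for table in ("exact_lines", "structural_lines"):
        conn.execute(f"CREATE TABLE {table} (hash TEXT, source TEXT, lineno INTEGER)")
        # An index on the hash column lets lookups avoid a full table scan.
        conn.execute(f"CREATE INDEX {table}_hash ON {table} (hash)")
    return conn


def add_file(conn, name, text):
    """Hash every non-blank line of `text` into both tables."""
    exact, structural = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = normalize(raw)
        if not line:
            continue
        exact.append((digest(line), name, lineno))
        structural.append((digest(canonicalize(line)), name, lineno))
    conn.executemany("INSERT INTO exact_lines VALUES (?, ?, ?)", exact)
    conn.executemany("INSERT INTO structural_lines VALUES (?, ?, ?)", structural)
    conn.commit()


def shared_lines(conn, table, a, b):
    """Count distinct lines of source `a` whose hash also occurs in source `b`."""
    row = conn.execute(
        f"SELECT COUNT(DISTINCT x.lineno) FROM {table} x "
        f"JOIN {table} y ON x.hash = y.hash "
        "WHERE x.source = ? AND y.source = ?",
        (a, b),
    ).fetchone()
    return row[0]
```

With two files that differ only in variable names, shared_lines() on
exact_lines would report no matches while structural_lines would report
every renamed line. Batching the inserts with executemany() inside a
single commit, and indexing the hash column, are the two places where a
sketch like this tends to gain or lose the most time in sqlite3.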

On my small test dataset, this solution has detected all of the
plagiarism with high confidence.

It's also beneficial to me to use this Python application as I can
easily integrate it with other Python scripts I use to prepare code
for review.

> Roger
