ANN: equivalence 0.1

George Sakkis george.sakkis at gmail.com
Sun Jun 1 18:16:26 CEST 2008


Equivalence is a class that can be used to maintain a partition of
objects into equivalence sets, making sure that the equivalence
properties (reflexivity, symmetry, transitivity) are preserved. Two
objects x and y are considered equivalent either implicitly (through a
key function) or explicitly by calling merge(x,y).

Get it from PyPI: http://pypi.python.org/pypi/equivalence/

Example
=======
Say that you are given a bunch of URLs you want to download and
eventually process somehow. These URLs may contain duplicates, either
exact copies or URLs leading to pages with the same content (e.g.
redirects, plagiarized pages, etc.). What you'd like is to identify
duplicates in advance so that you can process only unique pages. More
formally, you want to partition the given URLs into equivalence sets
and pick a single representative from each set.

Getting rid of identical URLs is trivial. More generally, many
duplicates can be identified with simple regular-expression
heuristics, so that for instance 'http://python.org/doc/' and
'www.python.org/doc/index.html' are deemed equivalent. For this case
you may have a normalize(url) function that reduces a URL to its
"stem" (e.g. 'python.org/doc') and use this as the key for deciding
equivalence.
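
As a rough illustration, such a normalize(url) function might look
like the sketch below. The specific rules (lowercasing, dropping the
scheme, a leading "www." and a trailing index page) are assumptions
made up for this example, not something the package provides; real
heuristics would depend on your data.

```python
import re

def normalize(url):
    """Reduce a URL to a rough "stem" (hypothetical heuristics, illustration only)."""
    # Lowercasing the whole URL is crude: paths can be case-sensitive.
    stem = url.strip().lower()
    stem = re.sub(r'^[a-z][a-z0-9+.-]*://', '', stem)  # drop the scheme, if any
    stem = re.sub(r'^www\.', '', stem)                 # drop a leading "www."
    stem = re.sub(r'/index\.html?$', '', stem)         # drop a trailing index page
    return stem.rstrip('/')                            # drop trailing slashes
```

With these rules, both 'http://python.org/doc/' and
'www.python.org/doc/index.html' reduce to 'python.org/doc'.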

This is fine but it still leaves quite a few URLs that cannot be
recognized as duplicates with simple heuristics. For these harder
cases you may have one or more "oracles" (an external database, a page
comparison program, or ultimately a human) that decide whether pages
x and y are equivalent. You can integrate such oracles by explicitly
declaring objects as equivalent using Equivalence.merge(x,y).
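
One possible way to wire an oracle in is sketched below. The helper
apply_oracle and the same_page(x, y) callable are hypothetical names
invented for this example; the merge argument is meant to be something
like the merge method of an existing Equivalence instance. Note that
this naive version asks the oracle about every pair, which is only
practical for small inputs or cheap oracles.

```python
from itertools import combinations

def apply_oracle(urls, same_page, merge):
    """Call merge(x, y) for every pair that the oracle judges equivalent.

    same_page(x, y) -> bool is a hypothetical oracle; merge is any
    two-argument callable that records the equivalence.
    Returns the list of pairs that were merged.
    """
    merged = []
    for x, y in combinations(urls, 2):  # every unordered pair, once
        if same_page(x, y):
            merge(x, y)
            merged.append((x, y))
    return merged
```

Assuming dups is an Equivalence instance, a call could then look like
apply_oracle(urls, same_page, dups.merge).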

Both implicit (key-based) and explicit information are combined to
maintain the equivalence sets. For instance:

>>> from equivalence import Equivalence
>>> dups = Equivalence(normalize)   # for an appropriate normalize(url)
>>> dups.merge('http://python.org/doc/', 'http://pythondocs.com/')
>>> dups.are_equivalent('www.pythondocs.com/index.htm',
...                     'http://python.org/doc/index.html')
True
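
To make the idea of combining a key function with explicit merges
concrete, here is a minimal union-find sketch. It is NOT the package's
implementation or its full API; the class name TinyEquivalence is
invented for this example, and it only mirrors the two calls shown
above.

```python
class TinyEquivalence:
    """Illustrative stand-in, not the equivalence package's implementation."""

    def __init__(self, key=None):
        # Objects with the same key are implicitly equivalent.
        self._key = key if key is not None else (lambda obj: obj)
        self._parent = {}  # maps each seen key to its parent key

    def _find(self, obj):
        k = self._key(obj)
        self._parent.setdefault(k, k)  # unseen keys start as their own root
        root = k
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every key on the path straight at the root.
        while self._parent[k] != root:
            self._parent[k], k = root, self._parent[k]
        return root

    def merge(self, x, y):
        """Explicitly declare x and y equivalent."""
        rx, ry = self._find(x), self._find(y)
        if rx != ry:
            self._parent[ry] = rx

    def are_equivalent(self, x, y):
        return self._find(x) == self._find(y)
```

For example, with key=str.lower, merging 'A' with 'b' also makes 'a'
and 'B' equivalent, and a later merge of 'b' with 'c' extends the same
set to 'C' by transitivity.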

You can find more about the API in the included docs and the unittest
file.

Regards,
George


