Duplicate content filter
ldo at geek-central.gen.new_zealand
Fri Oct 5 06:45:37 CEST 2007
In message <1191428555.278268.253700 at g4g2000hsf.googlegroups.com>, Abandoned wrote:
> I want an idea for how I can find duplicate pages quickly?
Compute a hash based on a canonicalized version of the content? Disregard
white space, line wrapping, upper/lower case, possibly even punctuation, etc.,
so that you get the same hash in spite of these differences.
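A minimal sketch of that idea in Python (the choice of SHA-256 and the exact punctuation-stripping rule here are illustrative assumptions, not part of the original suggestion):

```python
import hashlib
import re

def canonical_hash(text):
    """Hash a canonicalized form of the content: lower-case,
    punctuation stripped, all whitespace (including line wraps)
    collapsed to single spaces."""
    # Lower-case and drop anything that is not a word character or whitespace
    canonical = re.sub(r"[^\w\s]", "", text.lower())
    # Collapse runs of whitespace (spaces, tabs, newlines) to one space
    canonical = " ".join(canonical.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two pages that differ only in case, wrapping, and punctuation
# produce the same hash:
a = "Hello, World!\nThis is a page."
b = "hello world   this is a page"
print(canonical_hash(a) == canonical_hash(b))  # True
```

Duplicate detection then reduces to grouping pages by this hash value, e.g. in a dict keyed on the digest.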