Duplicate content filter.

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Fri Oct 5 06:45:37 CEST 2007

In message <1191428555.278268.253700 at g4g2000hsf.googlegroups.com>, Abandoned wrote:

> I want an idea for how I can find duplicate pages quickly?

Compute a hash based on a canonicalized version of the content? Disregard
white space, line wrap, upper/lower case, possibly even punctuation etc., so
that you get the same hash in spite of these differences.
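A minimal sketch of that idea in Python: lowercase the text, strip punctuation, and collapse all whitespace before hashing, so pages that differ only in formatting produce the same digest. The exact canonicalization rules (and the choice of SHA-1) are assumptions to be tuned for the actual data.

```python
import hashlib
import re

def content_hash(text):
    """Hash a canonicalized version of the text so that trivial
    formatting differences (case, line wrap, whitespace, punctuation)
    do not change the hash."""
    canonical = text.lower()
    # Drop punctuation, keeping word characters and whitespace
    # (assumption: punctuation never distinguishes real duplicates).
    canonical = re.sub(r"[^\w\s]", "", canonical)
    # Collapse every run of whitespace, including newlines, to one space.
    canonical = " ".join(canonical.split())
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Two pages that differ only in case, punctuation, or line wrapping then hash identically, e.g. `content_hash("Hello,  World!") == content_hash("hello\nworld")`, so duplicates can be found with a single dictionary lookup per page.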

More information about the Python-list mailing list