> I want an idea for how I can find duplicate pages quickly.

Compute a hash of a canonicalized version of the content? Disregard
whitespace, line wrapping, upper/lower case, and possibly even punctuation,
so that pages differing only in those respects get the same hash. Pages
whose hashes match are then your duplicate candidates.
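
A minimal sketch of that idea in Python, assuming the pages are already
loaded as strings in a dict keyed by page name. The function names
(canonicalize, content_hash, find_duplicates) and the choice of SHA-256 are
only for illustration; any reasonable hash would probably do.

```python
import hashlib
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation (assumption: ASCII is enough here).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def canonicalize(text, strip_punctuation=True):
    """Reduce text to a form that ignores case, whitespace and line wrapping."""
    text = text.casefold()                    # ignore upper/lower case
    if strip_punctuation:
        text = text.translate(_PUNCT_TABLE)   # optionally ignore punctuation
    return " ".join(text.split())             # collapse whitespace and line breaks


def content_hash(text, strip_punctuation=True):
    """Hash the canonical form, so superficially different pages hash the same."""
    canonical = canonicalize(text, strip_punctuation)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group page names by content hash; return only groups with more than one page."""
    groups = defaultdict(list)
    for name, text in pages.items():
        groups[content_hash(text)].append(name)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    pages = {
        "a.html": "Hello,   World!\n",
        "b.html": "hello\nworld",
        "c.html": "Something else entirely",
    }
    print(find_duplicates(pages))
```

This is one pass over the pages plus a dictionary lookup per page, so it
should stay fast for large collections. It only catches pages that are
identical after canonicalization, not pages that are merely similar.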

