testing for uniqueness in a large list
ptmcg at austin.rr._bogus_.com
Wed Oct 20 21:19:34 CEST 2004
"Paul McGuire" <ptmcg at austin.rr._bogus_.com> wrote in message
news:sRydd.12955$sO5.12002 at fe1.texas.rr.com...
> "bearophile" <bearophileHUGS at lycos.com> wrote in message
> news:5c882bb5.0410200830.471a5ea9 at posting.google.com...
> > Alex Martelli's solution is *very* good, but there is a sampling
> > problem: putting a simple printing inside his program:
> >     if not (len(results) % 5000):
> >         print len(results)
> > You can see that it slows down a lot when the dictionary contains
> > about 100000-120000 different sequences, because there are many
> > collisions, and it keeps slowing down. Probably a little speed-up of
> > this code cannot help; a different algorithm would be needed.
> > I don't know... direct sequence generation doesn't seem possible.
> > Maybe a partially direct generation can be okay.
> > Hugs,
> > bearophile
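The slowdown bearophile describes is the classic rejection-sampling effect: as the result set fills up, more and more random draws collide with combinations already seen. A minimal Python 3 sketch of that approach (a reconstruction for illustration, not Alex Martelli's original code; the function name and parameters are assumptions):

```python
import random

def sample_unique_combinations(seq, picklen, rlen, seed=None):
    """Rejection-sample rlen distinct picklen-sized combinations of seq.

    Each draw is sorted into a canonical tuple so that the same subset
    always hashes identically; duplicates are silently rejected by the
    set, which is exactly what slows this down near saturation.
    """
    rng = random.Random(seed)
    results = set()
    while len(results) < rlen:
        combo = tuple(sorted(rng.sample(seq, picklen)))
        results.add(combo)  # no effect if combo was already drawn
    return results

combos = sample_unique_combinations(range(12), 3, 50, seed=1)
```

When rlen is a large fraction of the total number of combinations, almost every draw is a duplicate, so the loop makes less and less progress per iteration.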
> Is part of this slowdown the fact that the loop will exit only when rlen
> reaches 200,000, when you have just shown that it can never be greater
> than 125,970? Perhaps some heuristic would opt for using sets instead of
> random.sample as rlen approaches some threshold fraction of
> len(seq)!/(picklen!(len(seq)-picklen)!), perhaps .8 or so. There should be
> a crossover point at which it is cheaper to build a set of all 125,970
> combinations, and then select 100,000 of them using set.pop() (which
> returns an arbitrary element from the set, and removes it).
> -- Paul
... and of course, raising a MathError if rlen > 125,970.
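Paul's crossover heuristic can be sketched as follows (a Python 3 sketch under assumptions: the function name and the 0.8 threshold are illustrative, and ValueError stands in for the hypothetical MathError, which is not a built-in Python exception):

```python
import itertools
import math
import random

def pick_combinations(seq, picklen, rlen, threshold=0.8, seed=None):
    """Pick rlen distinct picklen-sized combinations of seq.

    Dense case (rlen near the total count): enumerate all combinations
    once and sample from them directly, avoiding rejection entirely.
    Sparse case: rejection-sample with a set, which is cheaper when
    collisions are rare.
    """
    seq = list(seq)
    total = math.comb(len(seq), picklen)
    if rlen > total:
        # Stand-in for the "MathError" suggested in the thread.
        raise ValueError("rlen exceeds the number of distinct combinations")
    rng = random.Random(seed)
    if rlen >= threshold * total:
        # Build all combinations and select rlen of them without replacement.
        return rng.sample(list(itertools.combinations(seq, picklen)), rlen)
    results = set()
    while len(results) < rlen:
        results.add(tuple(sorted(rng.sample(seq, picklen))))
    return list(results)

dense = pick_combinations(range(10), 4, 190, seed=2)   # 190 of C(10,4)=210
sparse = pick_combinations(range(10), 4, 20, seed=3)   # well below threshold
```

Both branches return rlen distinct combinations; only the cost profile differs, which is the crossover the thread is discussing.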