On 22 November 2012 00:06, Andrew Barnert <span dir="ltr"><<a href="mailto:abarnert@yahoo.com" target="_blank">abarnert@yahoo.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


From: Joshua Landau <<a href="mailto:joshua.landau.ws@gmail.com">joshua.landau.ws@gmail.com</a>><br>

Sent: Sat, November 17, 2012 11:38:22 AM<br>

<div class="im"><br>

>Surely the best choice is two have *two* caches; one for hashables and another<br>

>for the rest.<br>

<br>

</div>Your implementation does a try: hash() to decide whether to check the set or the<br>

list, instead of just doing a try: item in set except: item in list. Is there a<br>

reason for this? It's more complicated, and it's measurably slower.</blockquote><div><br></div><div>I did not realise that "[] in set()" raised an error! I'd just assumed it returned False.</div><div>


<br></div><div>Thank you, this does make small but significant difference.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div class="im">

>This might be improvable with a *third* chache if some non-hashables had total<br>

>ordering, but that could also introduce bugs I think. It'd also be a lot harder<br>

>and likely be slower anyway.<br>

<br>

</div>I agree that it's probably not worth adding to something in the standard<br>

library, or a recipe given in the documentation (in fact, I think I already said<br>

as much earlier in the thread), but I think you've got most of those facts<br>

wrong.<br>

<br>

It's not a lot harder. The same 4 lines you have to add to do a<br>

try-set-except-list, you just do again, so it's<br>

try-set-except-try-sortedlist-except-list.</blockquote><div><br></div><div>Well, I'd sort-of assumed that this included adding  sorted collection to the mix, as it isn't in the standard library.</div><div> </div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">And it won't introduce any bugs.</blockquote><div><br>


</div><div>This took me a while to prove, so I'm proud of this:</div><div><br></div><div>>>> from blist import sortedlist<br></div><div>>>> {2} in sortedlist([{1, 2}, {1, 3}, {2}])</div><div>False</div>


<div><br></div><div>You *cannot* assume that a data set has total ordering on the basis that it's working so far.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


And<br>

as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,<br>

which is obviously better, but probably slower for tiny M, and another 5-10%<br>

overhead for inappropriate values.<br></blockquote><div><br></div><div>Well yes... bar the fact that you may be using it on something with a non-even distribution of "things" where some types are not comparable to each-other:</div>


<div><br></div><div>[ {1, 2}, [3, 4], [1, 2], [7, 4], [2, 3], (5, 2), [2, 1] ... ]</div><div><br></div><div>Where you'll get nowhere near O(NlogM).</div><div><br></div><div>*And* then there's the fact that sorted collections have intrinsically more overhead, and so are likely to give large overhead.</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The problem is finding an appropriate sortedcollection type. There isn't one in<br>

the standard library. There's a link to an external SortedCollection reference<br>

in the bisect docs page, but that has O(N) insertion time, so it won't help. The<br>

most popular library I know of is blist.sortedlist, and that works, but it has<br>

quite a bit of overhead for searching tiny lists. As I understand it, the reason<br>

there isn't a standard sorted collection is the assumption that people dealing<br>

with huge sequences ought to expect  to have some searching, comparing, and<br>

profiling of algorithms in their  future, while those people dealing with len<10<br>

sequences shouldn't have to think at all.<br>

<br>

At any rate, I tried a few different sorted collections. The performance for<br>

single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the<br>

crossover was around M=100, and you get 10x faster by around M=100K. Deciding<br>

whether this is appropriate, and which implementation to use, and so on… well,<br>

that's exactly why there's no sorted list in the stdlib in the first place.</blockquote><div><br></div><div>Thank you for the numbers. May I ask what libraries you used?</div></div></div>