Re: [Python-Dev] collections module

Paul Moore wrote:
The only reason I can see for coding in C is performance. And performance just isn't that big a deal to me. And we don't want to give a message that "dictionaries and lists are slow" by offering "fast" versions of things which can be coded in terms of them.
I think we should be more concerned about how useful a new feature is than about what "message" it gives about existing features. Anyway, the message that will actually be given is that for some purposes lists and dictionaries are not the best data structures. That's obviously true, so why try to avoid saying it? [Bags:]
In terms of "making dictionaries look slow", how would you answer the question "Why use this rather than just dict.get(k, 0) + 1" without giving the impression that dictionaries "aren't fast enough"?
If the advantage *is* that bags are faster, then I don't think there's any reason not to say so. If it gives the impression that dictionaries "aren't fast enough" for the particular tasks bags are designed for, what's so terrible? The result would be that when people want bags they use bags instead of emulating them with dictionaries; is that bad? (I'm not convinced that there would, in fact, be much speed advantage for bags over dictionaries. I'd guess that the real advantage would just be that some things can be expressed more clearly in terms of bags than in terms of dictionaries. I suspect there isn't much clarity advantage either...)
The second type would be a high speed queue
This seems like a good idea. I have found myself on occasion wanting to use Queue.Queue in non-threaded programs, simply because a "queue" fits my mental model of what I am doing. But it's for code readability reasons - not performance.
I've often built my own queue using lists, with x.append(item) and x.pop() as the enqueue and dequeue operations. That's horribly inefficient, but makes for reasonably neat code. If Python grew nice fast queues with a nice interface, I'd use those instead. It wouldn't make much difference to the clarity of my code, but it would make it a lot faster. That seems like a good thing to me.
And the same issue here over performance impressions - how do you make this seem like a good alternative to list.pop, without making lists look slow?
Lists *are* slow when you use them for queues. What's wrong with admitting that?
This module could also serve as the "one obvious place to put it" for other high performance datatypes that might be added in the future (such as fibheaps or some such).
That is the biggest benefit I see to this proposal. There have recently been a number of discussions of useful library code, where much of the problem seemed to be about where to put the code, rather than about semantics, etc.
I concur.
Guido wanted this to be discussed by a number of people. While a rich set of collection classes have proven their worth in other contexts, it could be that more is not better for python. One the language's elegances is the ability to do just about anything with just lists and dicts. If you have too many mutable containers to choose from, the choice of which to use becomes less obvious.
My thoughts are that the existence of the array module, for example, has never impaired anyone's ability to learn or use lists and dicts.
That is true, but equally the array module has a feeling of being "specialised". I'm not sure I can quantify this, but your description of the collection module doesn't feel similarly "specialised".
Being less "specialized" is a matter of being more broadly useful, no? That seems like a good thing to me.
Likewise, queues are in a mental concept space distinct from dicts, sets, and such. My only worry with queues is differentiating them from Queue objects which have builtin synchronization for use in threads.
Queues currently "feel like" a threading concept in Python (as opposed to what they feel like in a more general data structures concept). Having a queue in a collections module would clash with that. I'm not against it, but I believe your worry is important - you're fiddling with people's mental models, something bound to generate controversy...
If Python's existing library has warped users' mental models so that they think of a queue primarily as "something you use with threads" rather than as "a data structure where you can add things to the front and pull them off the rear efficiently" then that's a *bad* thing about Python's existing library, and decoupling the two concepts by adding a fast queue implementation will be a win.
Apologies for going on about performance.
It makes a refreshing change to hear someone saying "No, don't do that, it would be too fast" :-).
I am in favour of the idea of a module containing implementations of standard container data structures, but I believe it should be in Python, and should be presented as the "best of breed" implementation in terms of handling corner cases, and covering subtle algorithmic issues, and providing good design, and offering subclassability where needed. Basically, "everyone can, and often does, reinvent this stuff in Python - but getting the details right is tricky, so we've done it for you."
If performance of a Python implementation is a problem, speed up the bottlenecks in the Python VM (likely to be function calls, I'd guess) rather than spending effort reimplementing in C...
If performance of a Python implementation is a problem, then it is unlikely that any plausible modifications to the VM will stop it being a problem. -- g
participants (1)
-
Gareth McCaughan