Re: [Python-Dev] I think my set module is ready for prime time; comments?

Guido van Rossum <guido@digicool.com>:
There's already a PEP on a set object type, and everybody and their aunt has already implemented a set datatype.
Tim mentioned that he had one, and he also claimed that every other coder had a set class, but the only one listed in the Vaults is kjBuckets, which I'm not sure is maintained any more. (Is Aaron Watters hereabouts?)
I've just read the PEP. Greg's proposal has a couple of problems. The biggest one is that the interface design isn't very Pythonic -- it's formally adequate, but doesn't exploit the extent to which sets naturally have common semantics with existing Python sequence types. This is bad; it means that a lot of code that could otherwise ignore the difference between lists and sets would have to be specialized one way or the other for no good reason.
IMO, Eric's Set interface is close to perfect. PEP 218 is interesting, but I'm not sure it's worth slogging through the inevitable uproar over an entirely new syntactic construct (the "{}" notation) before getting something as useful as a set class into the standard library.
If *your* set module is ready for prime time, why not publish it in the Vaults of Parnassus?
I suppose that's what I'll do if you don't bless it for the standard library. But here are the reasons I suggest you should do so:
For what it's worth, I'm +1 on adding this to the standard library. I've seen so many set hacks with dictionaries (memory ouch) and list hacks (speed ouch) in Python code out there that I'm convinced it would see much more common use than, say, zlib, xdr, or even expat. On this hacker list everyone's aunt might whip up set extensions on boring weekends, but I doubt this describes the overall Python populace.

-- Uche Ogbuji, Principal Consultant, Fourthought, Inc. <uche.ogbuji@fourthought.com>

uche.ogbuji@fourthought.com <uche.ogbuji@fourthought.com>:
I've seen so many set hacks with dictionaries (memory ouch) and list hacks (speed ouch) in Python code out there that I'm convinced it would see much more common use than, say, zlib, xdr, or even expat.
Uche brings up a point I meant to make in my reply to Guido. The dict-vs.-list choice in set representation is indeed a choice between memory ouch and speed ouch. I believe most uses of sets are small sets. That reduces the speed ouch of using a list representation and increases the proportional memory ouch of a dictionary implementation.

-- Eric S. Raymond <esr@thyrsus.com>

Eric S. Raymond [esr@thyrsus.com] wrote:
I believe most uses of sets are small sets. That reduces the speed ouch of using a list representation and increases the proportional memory ouch of a dictionary implementation.
The problem is that there are a lot of uses for large sets, especially once you begin to introduce intersections and unions. If an implementation is only useful for a few dozen (or a hundred) items in the set, that eliminates a lot of places where set types are really useful---optimizing large-scale manipulations. Zope, for example, regularly manipulates sets with 10,000 items when doing text index manipulation. The data structures are heavily optimized for this kind of behaviour, without a major sacrifice in space. I think Jim can perhaps speak to this.

Unfortunately, for me, a Python implementation of Sets is only interesting academically. Any time I've needed to work with them at a large scale, I've needed them *much* faster than Python could achieve without a C extension.

Perhaps the difference is in problem domain. In the "scripting" problem domain, I would agree that Sets would rarely reach large sizes, so an algorithm that runs in quadratic time might be fine, because the actual resulting time is small. However, in more full-blown applications this would be counterproductive, and the user would be forced to implement their own (or use Aaron's excellent kjBuckets).

Just my opinion, of course.

Chris -- Christopher Petrilli <petrilli@amber.org>

On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
Unfortunately, for me, a Python implementation of Sets is only interesting academically. Any time I've needed to work with them at a large scale, I've needed them *much* faster than Python could achieve without a C extension.
I think this argues that if sets are added to the core they should be implemented as an extension type with the speed of dictionaries and the memory usage of lists. Basically, we would use the implementation of PyDict but drop the values.

Neil
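[Illustration, not from the thread: a minimal pure-Python model of the interface such a type might expose -- a dict used only for its keys. The class and method names here are invented; the actual suggestion is a C extension type reusing PyDict's hash table with the value slots dropped, so this sketch models only the behavior, not the memory savings.]

    class Set:
        """Toy set type backed by a dict whose values are ignored."""

        def __init__(self, iterable=()):
            self._data = {}
            for x in iterable:
                self._data[x] = 1        # value is a throwaway placeholder

        def add(self, x):
            self._data[x] = 1

        def remove(self, x):
            del self._data[x]            # raises KeyError if x is absent

        def __contains__(self, x):
            return x in self._data       # O(1) hash lookup

        def __len__(self):
            return len(self._data)

        def union(self, other):
            result = Set(self._data)
            result._data.update(other._data)   # full C speed via dict.update()
            return result

        def intersection(self, other):
            result = Set()
            for x in self._data:
                if x in other._data:
                    result._data[x] = 1
            return result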

Neil Schemenauer [nas@arctrix.com] wrote:
On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
Unfortunately, for me, a Python implementation of Sets is only interesting academically. Any time I've needed to work with them at a large scale, I've needed them *much* faster than Python could achieve without a C extension.
I think this argues that if sets are added to the core they should be implemented as an extension type with the speed of dictionaries and the memory usage of lists. Basically, we would use the implementation of PyDict but drop the values.
This is effectively the implementation that Zope has for Sets. In addition we have "buckets" that have scores on them (which are implemented as a modified BTree). Unfortunately Jim Fulton (who wrote all the code for that level) is in a meeting, but I hope he'll comment on the implementation that was chosen for our software.

Chris -- Christopher Petrilli <petrilli@amber.org>

Christopher Petrilli wrote:
Neil Schemenauer [nas@arctrix.com] wrote:
On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
Unfortunately, for me, a Python implementation of Sets is only interesting academically. Any time I've needed to work with them at a large scale, I've needed them *much* faster than Python could achieve without a C extension.
I think this argues that if sets are added to the core they should be implemented as an extension type with the speed of dictionaries and the memory usage of lists. Basically, we would use the implementation of PyDict but drop the values.
This is effectively the implementation that Zope has for Sets.
Except we use sorted collections with binary search for sets. I think that a simple hash-based set would make a lot of sense.
In addition we have "buckets" that have scores on them (which are implemented as a modified BTree).
Unfortunately Jim Fulton (who wrote all the code for that level) is in a meeting, but I hope he'll comment on the implementation that was chosen for our software.
We have a number of special needs:

- Scalability is critical. We make some special optimizations, like sets of integers and mapping objects with integer keys and values. In these cases, data are stored using C int arrays, allowing very efficient data storage and manipulation, especially when using integer keys.

- We need to spread data over multiple database records. Our data structures may be hundreds of megabytes in size. We have ZODB-aware structures that use multiple independently stored database objects.

- Range searches are very common, and under some circumstances, sorted collections and BTrees can have very little overhead compared to dictionaries. For this reason, our mapping objects and sets have been based on BTrees and sorted collections.

Unfortunately, our current BTree implementation has a flaw that causes an excessive number of objects to be updated when items are added and removed. (Each BTree internal node keeps track of the number of objects contained in it.) Also, our current sets are limited to integers and cannot be spread over multiple database records.

We are completing a new BTree implementation that overcomes these limitations. In this implementation, we will provide sets as value-less BTrees.

Jim -- Jim Fulton, Digital Creations <jim@digicool.com>

Jim Fulton wrote:
We have a number of special needs:
- Scalability is critical. We make some special optimizations, like sets of integers and mapping objects with integer keys and values. In these cases, data are stored using C int arrays, allowing very efficient data storage and manipulation, especially when using integer keys.
- We need to spread data over multiple database records. Our data structures may be hundreds of megabytes in size. We have ZODB-aware structures that use multiple independently stored database objects.
- Range searches are very common, and under some circumstances, sorted collections and BTrees can have very little overhead compared to dictionaries. For this reason, our mapping objects and sets have been based on BTrees and sorted collections.
Unfortunately, our current BTree implementation has a flaw that causes an excessive number of objects to be updated when items are added and removed. (Each BTree internal node keeps track of the number of objects contained in it.) Also, our current sets are limited to integers and cannot be spread over multiple database records.
We are completing a new BTree implementation that overcomes these limitations. In this implementation, we will provide sets as value-less BTrees.
You may want to check out a soon-to-be-released new mx package: mxBeeBase. This is an on-disk B+tree implementation which supports data files up to 2GB on 32-bit platforms. Here's a preview: http://www.lemburg.com/python/mxBeeBase.html (The links on that page are not functional.)

-- Marc-Andre Lemburg (http://www.egenix.com/ / http://www.lemburg.com/)

Neil Schemenauer <nas@arctrix.com>:
Basically, we would use the implementation of PyDict but drop the values.
This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.

Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand <greg@cosc.canterbury.ac.nz>
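[A toy Python model of Greg's idea, for illustration only -- the names are invented, the equality test on 1 is a simplification of "a value other than 1", and the real change would live inside PyDict's C implementation, where splitting the key and value arrays is what actually saves memory.]

    class LazyValueDict:
        """Toy model: keys are always stored; value storage is allocated
        only once some value other than the set-idiom default (1) appears."""

        def __init__(self):
            self._keys = {}        # used as a set: key -> 1
            self._values = None    # allocated lazily

        def __setitem__(self, key, value):
            if self._values is None and value != 1:
                # First non-1 value: allocate value storage and backfill
                # the keys seen so far with the default 1.
                self._values = dict.fromkeys(self._keys, 1)
            self._keys[key] = 1
            if self._values is not None:
                self._values[key] = value

        def __getitem__(self, key):
            if key not in self._keys:
                raise KeyError(key)
            return 1 if self._values is None else self._values[key]

        def __contains__(self, key):
            return key in self._keys

        def __len__(self):
            return len(self._keys)

Used purely as a set (every value is 1), only the key table ever exists; the first "real" value triggers allocation of the value table, matching "the first time someone stores a value other than 1".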

Neil Schemenauer <nas@arctrix.com>:
Basically, we would use the implementation of PyDict but drop the values.
This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.
Not a bad idea! (But shouldn't the default value be something else, like None?)

--Guido van Rossum (home page: http://www.python.org/~guido/)

[Greg Ewing]
This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.
[Guido]
Not a bad idea!
In theory, but if Vladimir were here he'd bust a gut over the possibly bad cache effects on "real dicts" (by keeping everything together, simply accessing the cached hash code brings both the key and value pointers into L1 cache too). We would need to quantify the effect of breaking that connection.
(But shouldn't the default value be something else, like None?)
Bleech. I hate the idiom of using a false value to mean "present".

    d = {}
    for x in seq:
        d[x] = 1

runs faster too (None needs a LOAD_GLOBAL now).

Guido:
But shouldn't the default value be something else, like None?
It should really be whatever is the first value that gets stored after the dict is created. That way people can use whatever they want for their dummy value and it will Just Work. And it will probably catch most existing uses of a dict as a set as well.

Greg Ewing, Computer Science Dept, University of Canterbury <greg@cosc.canterbury.ac.nz>

On Wed, 24 Jan 2001, Greg Ewing <greg@cosc.canterbury.ac.nz> wrote:
This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.
Cool idea, but even cooler (would catch more idioms, that is) is "the first time someone stores something not 'is' something in the dict, allocate the values array". This would catch small numbers, None and identifier-looking strings, for the measly cost of one pointer/dict object.

-- Moshe Zadka <sig@zadka.site.co.il>

This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.
Cool idea, but even cooler (would catch more idioms, that is) is "the first time someone stores something not 'is' something in the ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dict, allocate the values array". This would catch small numbers, None and identifier-looking strings, for the measly cost of one pointer/dict object.
Sorry, but I don't understand what you mean by the ^^^ marked phrase. Can you please elaborate? Regarding storing one for "present", that's all well and fine, but it suggests to me that storing a false value could mean "not present". Do we really want that? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
...
Cool idea, but even cooler (would catch more idioms, that is) is "the first time someone stores something not 'is' something in the
Sorry, but I don't understand what you mean by the ^^^ marked phrase. Can you please elaborate?
I wasn't clear about that either. The idea is:

    def add(new_value):
        if not values_array:
            if self.magic_value is NULL:
                self.magic_value = new_value
            elif new_value is not self.magic_value:
                self.values_array = [self.magic_value, new_value, ...]
            else:
                # new_value is self.magic_value: do nothing
                pass

I am neutral on this proposal myself. I think that even if we optimize any code where you pass the same thing over and over again, we should document a convention for consistency. So I'm not sure there is much advantage.

Paul Prescod

On Wed, 24 Jan 2001 10:41:07 -0500, Guido van Rossum <guido@digicool.com> wrote:
Cool idea, but even cooler (would catch more idioms, that is) is "the first time someone stores something not 'is' something in the ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dict, allocate the values array". This would catch small numbers, None and identifier-looking strings, for the measly cost of one pointer/dict object.
Sorry, but I don't understand what you mean by the ^^^ marked phrase. Can you please elaborate?
I should really stop writing incomprehensible bits like that. Heck, I can't even understand it on second reading.

I meant that the dictionary would keep a slot for "the one and only value". First time someone puts a value in the dict, it puts it in the "one and only value" slot, and doesn't initialize the value array. The second time someone puts a value, it checks for pointer equality with that "one and only value". If it is the same, it still doesn't initialize the value array. The only time when the dictionary initializes the value array is when two pointer-different values are put in.

This would let me code

    a[key] = None

for my sets (but consistent in the same set!),

    a[key] = 1

when the timbot codes (again, consistent in the same set), and

    a[key] = 'present'

if you're really weird. (Identifier-like strings get interned.)

That's not *semantics*, that's *optimization* for a commonly used (I think) idiom with dictionaries -- you can't predict the value, but it will probably remain the same.

-- Moshe Zadka <sig@zadka.site.co.il>

I meant that the dictionary would keep a slot for "the one and only value". First time someone puts a value in the dict, it puts it in the "one and only value" slot, and doesn't initialize the value array. The second time someone puts a value, it checks for pointer equality with that "one and only value". If it is the same, it still doesn't initialize the value array. The only time when the dictionary initializes the value array is when two pointer-different values are put in.
This would let me code
a[key] = None
For my sets (but consistent in the same set!)
a[key] = 1
When the timbot codes (again, consistent in the same set)
and
a[key] = 'present'
If you're really weird.
(identifier-like strings get interned)
That's not *semantics*, that's *optimization* for a commonly used (I think) idiom with dictionaries -- you can't predict the value, but it will probably remain the same.
This I like! But note that a dict currently uses 12 bytes per slot in the hash table (on a 32-bit platform: long me_hash; PyObject *me_key, *me_value). The hash table's fill factor is typically between 50 and 67%. I think removing the hashes would slow down lookups too much, so optimizing identical values out would only save 6-8 bytes per existing key on average. Not clear if it's worth enough.

I think I have to agree with Tim's expectation that two (or three) separate parallel arrays will reduce the cache locality and thus slow things down. Once you start probing, you jump through the hash table at large random strides, causing bad cache performance (for largeish hash tables); but since often enough the first slot tried is right, you have the hash, key and value right next to each other, typically on the same cache line.

--Guido van Rossum (home page: http://www.python.org/~guido/)
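[A back-of-envelope check of the 6-8 bytes figure, not from the thread: only the 4-byte me_value pointer disappears from each slot, and slots outnumber keys by the inverse of the fill factor.]

    # 32-bit slot: 4-byte hash + 4-byte key ptr + 4-byte value ptr = 12 bytes.
    # Dropping only the value pointer saves 4 bytes per slot; per *key* the
    # saving is 4 / fill_factor.
    value_ptr_bytes = 4
    for fill in (0.50, 0.67):
        print("%.0f%% full: ~%.0f bytes saved per key" % (fill * 100, value_ptr_bytes / fill))
    # 50% full: ~8 bytes saved per key
    # 67% full: ~6 bytes saved per key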

Greg Ewing wrote:
Neil Schemenauer <nas@arctrix.com>:
Basically, we would use the implementation of PyDict but drop the values.
This could be incorporated into PyDict. Instead of storing keys and values in the same array, keep them in separate arrays and only allocate the values array the first time someone stores a value other than 1.
Very good idea. It also fits in my view of how dicts should be implemented: keep keys and values apart, since this information has different access patterns. I think (or at least hope) that dictionaries become faster when hashes, keys and values are in separate areas, giving more cache hits. Not sure if hashes and keys should be apart, but sure for values.

ciao - chris

-- Christian Tismer <tismer@tismer.com>

[Christian Tismer]
... Not sure if hashes and keys should be apart, but sure for values.
How so? That is, under what assumptions? Any savings from separation would appear to require that I look up keys a lot more than I access the associated values; while trivially true for dicts used as sets, it seems dubious to me for use of dicts as mappings (count[word] += 1, etc).

[Neil Schemenauer]
I think this argues that if sets are added to the core they should be implemented as an extension type with the speed of dictionaries and the memory usage of lists. Basically, we would use the implementation of PyDict but drop the values.
They'll be slower than dicts and take more memory than lists then. WRT memory, dicts cache the hash code with each entry for speed (so double the memory of a list even without the value field), and are never more than 2/3 full anyway. The dict implementation also gets low-level speed benefits out of using both the key and value fields to characterize the nature of a slot (the key field is NULL iff the slot is virgin; the value field is NULL iff the slot is available (virgin or dummy)).

Dummy slots can be avoided (and so also the need for runtime code to distinguish them from active slots) by using a hash table of pointers to linked lists -- or flex vectors, or linked lists of small vectors -- instead, and in most ways that leads to much simpler code (no more fiddling with dummies, no more probe-sequence hassles, no more boosting the size before the table is full). But without fine control over the internals of malloc, that takes even more memory in the end.

Interesting twist: "a dict" *is* "a set", but a set of (key, value) pairs further constrained so that no two elements have the same key. So any set implementation can be used as-is to implement a dict as a set of 2-tuples, customizing the hash and "is equal" functions to look at just the tuples' first elements. That was the view taken by SETL in 1969, although their "map" (dict) type was eventually optimized to get away from actually constructing 2-tuples.

Indeed, SETL eventually grew an elaborate optional type-declaration sublanguage, allowing the user to influence many details of its many internal set-storage schemes; e.g., from pg 399 of "Programming With Sets: An Introduction to SETL":

For example, we can declare [I'm putting their keywords in UPPERCASE for, umm, clarity]

    successors: LOCAL MMAP(ELMT b) REMOTE SET(ELMT b);

This declaration specifies that for each x in b the image set successors{x} is stored in the element block of x, and that this image set is always to be represented as a bit vector. Similarly, the declaration

    successors: LOCAL MMAP(ELMT b) SPARSE SET(ELMT b);

specifies that for each x in b the image set successors{x} is to be stored as a hash table containing pointers to elements of b. Note that the attribute LOCAL cannot be used for image sets of multivalued maps. This follows from the remarks in section 10.4.3 on the awkwardness of making local objects into subparts of composite objects.

Clear? Snort. Here are some citations lifted from the web for their experience in trying to make these kinds of decisions by magic:

    @article{dewar:79,
      title="Programming by Refinement, as Exemplified by the {SETL} Representation Sublanguage",
      author="Robert B. K. Dewar and Arthur Grand and Ssu-Cheng Liu and Jacob T. Schwartz and Edmond Schonberg",
      journal=toplas, year=1979, month=jul, volume=1, number=1, pages="27--49"
    }

    @article{schonberg:81,
      title="An Automatic Technique for Selection of Data Structures in {SETL} Programs",
      author="Edmond Schonberg and Jacob T. Schwartz and Micha Sharir",
      journal=toplas, year=1981, month=apr, volume=3, number=2, pages="126--143"
    }

    @article{freudenberger:83,
      title="Experience with the {SETL} Optimizer",
      author="Stefan M. Freudenberger and Jacob T. Schwartz and Micha Sharir",
      journal=toplas, year=1983, month=jan, volume=5, number=1, pages="26--45"
    }

If someone wanted to take sets seriously today, a better approach would be to define a minimal "set interface" ("abstract base class" in C++ terms), then supply multiple implementations of that interface, letting the user choose directly which implementation strategy they want for each of their sets. And people are doing just that in the C++ and Java worlds; e.g.,

    http://developer.java.sun.com/developer/onlineTraining/collections/Collection.html#SetInterface

Curiously, the newer Java Collections Framework (covering multiple implementations of list, set, and dict interfaces) gave up on thread-safety by default, because it cost too much at runtime. Just another thing to argue about <wink>.

we're-not-exactly-pioneers-here-ly y'rs - tim
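[A small illustration of the "interesting twist" above, not from the thread: a mapping built as a set of (key, value) entries whose hash and equality look only at the key. The class names are invented and it uses today's built-in set; the linear scan in __getitem__ is the naive part that SETL-style implementations optimize away by letting the set hand back the stored element matching a probe.]

    class _Entry:
        """A (key, value) pair that hashes and compares by key alone."""
        __slots__ = ("key", "value")

        def __init__(self, key, value):
            self.key = key
            self.value = value

        def __hash__(self):
            return hash(self.key)

        def __eq__(self, other):
            return self.key == other.key


    class PairSetDict:
        """Toy dict: a set of _Entry objects, at most one per distinct key."""

        def __init__(self):
            self._entries = set()

        def __setitem__(self, key, value):
            entry = _Entry(key, value)
            self._entries.discard(entry)   # removes any entry with an equal key
            self._entries.add(entry)

        def __getitem__(self, key):
            probe = _Entry(key, None)
            for entry in self._entries:    # naive O(N) retrieval; see note above
                if entry == probe:
                    return entry.value
            raise KeyError(key)

        def __contains__(self, key):
            return _Entry(key, None) in self._entries   # O(1), hashing by key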

[Christopher Petrilli]
.... Unfortunately, for me, a Python implementation of Sets is only interesting academically. Any time I've needed to work with them at a large scale, I've needed them *much* faster than Python could achieve without a C extension.
How do you know that? I've used large sets in Python happily without resorting to C or kjbuckets (which is really aiming at fast operations on *graphs*, in which area it has no equal). Everyone (except Eric <wink>) uses dicts to implement sets in Python, and "most" set operations can work at full C speed then; e.g., assuming both sets have N elements:

    membership testing    O(1) -- it's just dict.has_key()
    element insertion     O(1) -- dict[element] = 1
    element removal       O(1) -- del dict[element]
    union                 O(N), but at full C speed -- dict1.update(dict2)
    intersection          O(N), but at Python speed (the only 2.1 dog in the bunch!)
    choose some element
      and remove it       took O(N) time and additional space in 2.0, but is O(1)
                          in both since dict.popitem() was introduced
    iteration             O(N), with O(N) additional space using dict.keys(), or
                          O(1) additional space using dict.popitem() repeatedly

What are you going to do in C that's faster than using a Python dict for this purpose? Most key set operations are straightforward Python dict 1-liners then, and Python dicts are very fast. kjbuckets sets were slower last time I timed them (several years ago, but Python dicts have gotten faster since then while kjbuckets has been stagnant).

There's a long tradition in the Lisp world of using unordered lists to represent sets (when the only tool you have is a hammer ... <0.5 wink>), but it's been easy to do much better than that in Python almost since the start. Even in the Python list world, enormous improvements for large sets can be gotten by maintaining lists in sorted order (then most O(N) operations drop to O(log2(N)), and O(N**2) to O(N)). Curiously, though, in 2.1 we can still use a dict-set for complex numbers, but no longer a sorted-list-set! Requiring a total ordering can get in the way more than requiring hashability (and vice versa -- that's a tough one).

measurement-is-the-measure-of-all-measurable-things-ly y'rs - tim
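[The idioms listed above, spelled out as runnable Python for illustration -- modern spelling is used ("in" for has_key(), popitem() for choose-and-remove), and the helper names are invented, not anything proposed in the thread.]

    import bisect

    def make_set(iterable):
        """A set represented as a dict whose values are all 1."""
        d = {}
        for x in iterable:
            d[x] = 1
        return d

    def union(a, b):
        """Runs at full C speed via dict.update()."""
        result = dict(a)
        result.update(b)
        return result

    def intersection(a, b):
        """The Python-speed loop -- the one 'dog' in the bunch."""
        small, large = (a, b) if len(a) <= len(b) else (b, a)
        result = {}
        for x in small:
            if x in large:            # has_key() in 2.1 spelling
                result[x] = 1
        return result

    def sorted_list_contains(sorted_items, x):
        """Membership in a sorted-list set: O(log N) by binary search."""
        i = bisect.bisect_left(sorted_items, x)
        return i < len(sorted_items) and sorted_items[i] == x

    s = make_set([1, 2, 3, 4])
    t = make_set([3, 4, 5])
    assert 3 in s                       # membership: O(1)
    s[6] = 1                            # insertion: O(1)
    del s[6]                            # removal: O(1)
    assert sorted(union(s, t)) == [1, 2, 3, 4, 5]
    assert sorted(intersection(s, t)) == [3, 4]
    element, _ = s.popitem()            # choose-and-remove: O(1)
    assert sorted_list_contains([1, 3, 5, 7], 5)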

Tim Peters <tim.one@home.com>:
Requiring a total ordering can get in the way more than requiring hashability
Often it's useful to have *some* total ordering, and you don't really care what it is as long as it's consistent. Maybe all types should be required to support cmp(x, y) even if doing x < y via the rich comparison route raises a NotOrderable exception.

Greg Ewing, Computer Science Dept, University of Canterbury <greg@cosc.canterbury.ac.nz>
participants (12): Christian Tismer, Christopher Petrilli, Eric S. Raymond, Greg Ewing, Guido van Rossum, Jim Fulton, M.-A. Lemburg, Moshe Zadka, Neil Schemenauer, Paul Prescod, Tim Peters, uche.ogbuji@fourthought.com