Complexity documentation request

Hello all, Is it possible to include algorithm complexity information for the various built-in types (strings, sets, lists, dictionaries...)? This would ease the decision for choosing the correct type. The information could simply be added as a new column in the tables found on pages such as the following: http://docs.python.org/lib/types-set.html http://docs.python.org/lib/typesseq.html It took me a while to find some information for my purposes; however, I'm not sure whether it's outdated or incomplete. The best sources I found are the python-list archive and a PEP: http://www.python.org/dev/peps/pep-3128/ http://mail.python.org/pipermail/python-list/2007-June/446877.html Nevertheless, algorithm complexity is not always the answer. I remember some years ago I preferred using "x in list" rather than "x in set" for membership checking in a tight loop, because the list was rather small (10-20 elements) and the set's overhead outweighed the advantage of its better algorithmic complexity. So if this information is available, it would be nice to document the constant factors too. :-) Thanks in advance, Dimitris
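A rough way to check this trade-off on a given interpreter is a micro-benchmark along the following lines (modern Python syntax; the crossover size, if it exists at all, depends on the Python version, the element types, and the hardware):

    # Sketch: membership testing in a small collection.  For very small
    # collections a linear scan of a list can compete with a set lookup,
    # because hashing has its own constant overhead.  Numbers are only
    # indicative and vary between versions and machines.
    from timeit import timeit

    elements = list(range(15))          # roughly the 10-20 element case
    small_list = elements
    small_set = set(elements)

    t_list = timeit("13 in c", globals={"c": small_list}, number=1000000)
    t_set = timeit("13 in c", globals={"c": small_set}, number=1000000)
    print("list: %.3f s  set: %.3f s  (1e6 membership tests)" % (t_list, t_set))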

On Sun, Mar 09, 2008, Dimitrios Apostolou wrote:
Is it possible to include algorithm complexity information for the various built-in types (strings, sets, lists, dictionaries...)? This would ease the decision for choosing the correct type.
This has been discussed before and rejected for two reasons:
* Other Python implementations (Jython, PyPy, IronPython) may not be able to provide the same type implementations
* Algorithmic information does sometimes change between versions, and keeping the docs updated is not trivial
There probably would be some value in a wiki page on python.org that provides this information, particularly across versions. You may be able to find volunteers to help on comp.lang.python.
-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"All problems in computer science can be solved by another level of indirection." --Butler Lampson

On Mar 9, 2008, at 10:22 AM, Aahz wrote:
This has been discussed before and rejected for two reasons:
* Other Python implementations (Jython, PyPy, IronPython) may not be able to provide the same type implementations
* Algorithmic information does sometimes change between versions, and keeping the docs updated is not trivial
Also, given the significance of the constant factors and the fact that these are heavily dependent, especially for containers, on the objects passed in (either to be contained or tested), it's not clear that it often makes sense to worry about this at too detailed a level. Common sense, knowledge of the interpreter, and experience are often more valuable than easily-outdated documentation. -Fred -- Fred Drake <fdrake at acm.org>

Fred Drake wrote:
On Mar 9, 2008, at 10:22 AM, Aahz wrote:
This has been discussed before and rejected for two reasons:
* Other Python implementations (Jython, PyPy, IronPython) may not be able to provide the same type implementations
* Algorithmic information does sometimes change between versions, and keeping the docs updated is not trivial
Also, given the significance of the constant factors and the fact that these are heavily dependent, especially for containers, on the objects passed in (either to be contained or tested), it's not clear that it often makes sense to worry about this at too detailed a level. Common sense, knowledge of the interpreter, and experience are often more valuable than easily-outdated documentation.
Fair enough, but the fact is that this documentation already exists, unfortunately at random locations. Who would expect to find such valuable info in a rejected PEP (3128)! I will agree that experience and knowledge of the interpreter's internals are most valuable for choosing the right structure, but isn't that too much to ask of occasional Python programmers? Thanks, Dimitris
-Fred

Aahz wrote:
* Other Python implementations (Jython, PyPy, IronPython) may not be able to provide the same type implementations
* Algorithmic information does sometimes change between versions, and keeping the docs updated is not trivial
Still, I think there are some general expectations one should be able to have of any implementation, e.g. that accessing a list item is roughly O(1), and not O(n) as it would be if they were implemented as linked lists. Dict access should probably be documented as no worse than O(log n) to allow for tree implementations. -- Greg

On Sun, Mar 9, 2008 at 11:50 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Aahz wrote:
* Other Python implementations (Jython, PyPy, IronPython) may not be able to provide the same type implementations
* Algorithmic information does sometimes change between versions, and keeping the docs updated is not trivial
Still, I think there are some general expectations one should be able to have of any implementation, e.g. that accessing a list item is roughly O(1), and not O(n) as it would be if they were implemented as linked lists.
Dict access should probably be documented as no worse than O(log n) to allow for tree implementations.
Well, there you have hit the nail on the head -- should we document the actual or the guaranteed O() expression? I think this is a can of worms better left unopened. At best we should include some hints to clarify that random access to list and tuple elements is constant time in CPython, and that dicts and sets are implemented using a hash table with open addressing -- readers can draw their own conclusions from that. What other implementations do should be up to them -- it becomes a Quality of Implementation issue. Regarding the OP's need for this kind of information (deciding between the various standard data types), I would recommend a different approach -- pick the data type that produces the shortest code. In all likelihood it's also the most efficient. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sun, Mar 9, 2008 at 4:43 PM, Guido van Rossum <guido@python.org> wrote:
Well, there you have hit the nail on the head -- should we document the actual or the guaranteed O() expression?
Either. Both. Put a note at the bottom saying that factors of O(log n) have been dropped and they're basically the same thing (this is sometimes called "Soft O notation"). Big O is technically an upper-bound anyway. When was the last time a new version caused a function to become slower by more than a factor of O(log n)? As is, some operations *are* documented, but in odd places. For example, the documentation for deque describes the complexity of some of the list type's operations. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC

Hello again, Guido van Rossum wrote:
Well, there you have hit the nail on the head -- should we document the actual or the guaranteed O() expression? I think this is a can of worms better left unopened. At best we should include some hints to
I will open this can and say that average case complexity is the most important. For example, sorting is O(n log n) and dict lookup is O(1). But because the worst case is important too, it would be nice (but not necessary) if there was an extra column in the table with that information, or blank if not applicable.
clarify that random access to list and tuple elements is constant time in CPython, and that dicts and sets are implemented using a hash table with open addressing -- readers can draw their own conclusions from that.
Notes are nice and already exist at random places in the *huge* documentation archive. But I still believe that the best place for this is the already existing tables in the docs (I pointed to them in my initial post). One should trivially be able to find this information. On the other hand, notes could be added for various oddities according to experience. For example, for sets of up to 10(?) elements, a list would probably be better because of the hashing overhead.
What other implementations do should be up to them -- it becomes a Quality of Implementation issue.
I agree that CPython docs should document CPython behaviour. This would be the most helpful for someone writing a program in CPython. People who use other implementations should consult the appropriate docs. Implementations with worse algorithms should try to reach CPython's efficiency.
Regarding the OP's need for this kind of information (deciding between the various standard data types), I would recommend a different approach -- pick the data type that produces the shortest code. In all likelihood it's also the most efficient.
Hmmm, the first thing that comes to mind is prepending an item to a list, which is small and seems nice, but is O(n) I think! And besides that, for someone who *cares* about his code, good looks are not enough... Thanks for your answers, Dimitris
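For reference, a minimal sketch of the usual workaround: collections.deque gives O(1) appends at either end (at the cost of O(n) access in the middle), whereas list.insert(0, x) shifts every existing element. Timings are indicative only and vary by version:

    # Sketch: prepending to a list is O(n) per operation because all
    # existing elements are shifted; deque.appendleft is O(1).
    from timeit import timeit

    n = 100000
    t_list = timeit("lst.insert(0, None)", setup="lst = []", number=n)
    t_deque = timeit("dq.appendleft(None)",
                     setup="from collections import deque; dq = deque()",
                     number=n)
    print("list.insert(0, x):  %.3f s for %d prepends" % (t_list, n))
    print("deque.appendleft(): %.3f s for %d prepends" % (t_deque, n))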

I will open this can and say that average case complexity is the most important. For example sorting O(nlogn) and dict lookup O(1).
I would still debate the complexity of dict lookup. What are you averaging over? Without any assumptions about the distribution of the hash values, you can't say anything about average-case dictionary performance. I also disagree that the average complexity is "most important". I find the worst-case complexity most important. For average-case complexity, I just measure my application, and don't care about theoretical numbers.
Notes are nice and already exist at random places in the *huge* documentation archive. But I still believe that the best place for this are the already existing tables in the docs (I pointed them at my initial post). One should trivially be able to find this information.
Feel free to start a Wiki page then. With the right keywords on the page, it shouldn't take long for Google to pick up the page, and return it to people asking the right questions.
I agree that CPython docs should document CPython behaviour.
Actually, no. The "CPython documentation" documents Python, the language, and its standard library. It is a specification that CPython conforms to (hopefully). There might be fragments in it that are both worthwhile-to-mention and implementation-specific, but they should be marked as the latter.
Hmmm, the first thing that comes to mind is prepending an item to a list, which is small and seems nice, but is O(n) I think! And besides that, for someone who *cares* about his code, good looks are not enough...
Define "small". For a theoretically-reasonable definition of "small", prepending is O(1): namely, when a "small" list is one whose size is bounded by some upper bound (say, M). For such a list, prepending is O(1) (namely, not more than M copies). Of course, you can only prepend a finite number of times to such a list, unless you also delete in-between. Over a long series of insertions and deletions, prepending will be O(1) (if M is large, with a large factor). Regards, Martin

Martin v. Löwis wrote:
I would still debate the complexity of dict lookup. What are you averaging over?
All the Python programs I've ever run. :-)
I also disagree that the average complexity is "most important". I find the worst-case complexity most important.
While in theory the worst-case behaviour of a hash table is O(n), in practice the worst case is so vanishingly rare that nobody worries about it. Can you really say that you don't make any design decisions early on based on the assumption that dict lookup will almost certainly be a lot faster than searching a list?
Hmmm, the first thing that comes to mind is prepending an item to a list, which is small and seems nice, but is O(n) I think!
Define "small".
I think he was talking about the size of the code. In other words, just because the code is small doesn't necessarily mean the algorithm is efficient.
For a theoretically-reasonable definition of "small", prepending is O(1):
Big O-notation is all about the limit when things get big. So it doesn't make sense to talk about "O() when something is small". -- Greg

Can you really say that you don't make any design decisions early on based on the assumption that dict lookup will almost certainly be a lot faster than searching a list?
I follow the advice Guido gave: I use the data structure that will make my code shortest and easiest to read, regardless of performance consequences initially. Premature optimization is the root of all evil. Regards, Martin

"Martin v. Löwis" writes:
Premature optimization is the root of all evil.
Actually, Knuth's bon mot was "Premature optimization is the root of all error." Which is probably worse; it leaves the perpetrator the excuse "but I was only trying to help!" While we all know what to do to evildoers, it's hard to punish those who deliberately invite Murphy's attention as severely as they deserve.<wink>

On Wed, Mar 12, 2008, Stephen J. Turnbull wrote:
"Martin v. L?wis" writes:
Premature optimization is the root of all evil.
Actually, Knuth's bon mot was "Premature optimization is the root of all error."
From my .sig database:
"Premature optimization is the root of all evil in programming." --C.A.R. Hoare (often misattributed to Knuth, who was himself quoting Hoare) "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." --Knuth restates Hoare -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "All problems in computer science can be solved by another level of indirection." --Butler Lampson

Martin v. Löwis wrote:
I follow the advice Guido gave: I use the data structure that will make my code shortest and easiest to read,
But given a choice such as list vs. dict, the code is almost identical either way. What do you do then? Toss a coin? Or make a few educated guesses based on what you know about each data type?
Premature optimization is the root of all evil.
This phrase seems to be widely misused. Making your code bloated and convoluted without good evidence of the necessity is premature optimisation. Picking an appropriate data structure for the task based on experience is good design practice. -- Greg

I follow the advice Guido gave: I use the data structure that will make my code shortest and easiest to read,
But given a choice such as list vs. dict, the code is almost identical either way. What do you do then?
Why do you say that? Lists and dictionaries are *completely* different things, not interchangeable at all.
Toss a coin? Or make a few educated guesses based on what you know about each data type?
I look at my problem, and the choice of container falls out of that naturally. Regards, Martin

Dimitrios Apostolou wrote:
On the other hand, notes could be added for various oddities according to experience. For example, for sets of up to 10(?) elements, a list would probably be better because of the hashing overhead.
That's the sort of thing that tends to be *very* implementation dependent -- not just between CPython and others, but between different releases of CPython.
Hmmm, the first thing that comes to mind is prepending an item to a list, which is small and seems nice, but is O(n) I think!
Yeah, there's no substitute for having at least a rough idea of how the object you're using is implemented, e.g. array vs. linked list. This kind of very basic information is something that I think ought to be documented, and some guarantees made in the language definition. For example, I think a Python implementation that implemented lists as linked lists would make many people unhappy, as their algorithms suddenly went from O(n**m) to O(n**(m+1)) without anyone telling them. -- Greg
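To make that concrete, here is one sketch of code that silently bakes in the O(1)-indexing assumption: a bisect-based membership test is O(log n) comparisons only because list indexing is constant time; on a linked-list representation every probe would itself cost O(n).

    # Sketch: binary search over a sorted list assumes O(1) indexing.
    import bisect

    def contains_sorted(sorted_items, x):
        """Membership test on a sorted list via binary search."""
        i = bisect.bisect_left(sorted_items, x)
        return i < len(sorted_items) and sorted_items[i] == x

    data = list(range(0, 1000000, 3))   # already sorted
    print(contains_sorted(data, 999999))    # True
    print(contains_sorted(data, 1000000))   # False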

"Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote in message news:47D5C2C4.6090701@canterbury.ac.nz... || Yeah, there's no substitute for having at least a | rough idea of how the object you're using is implemented, | e.g. array vs. linked list. As I understand it, the new 2.6 docs include a new one on CPython specifically. A page there might be appropriate. But someone has to write and submit a patch for review. | This kind of very basic information is something that | I think ought to be documented, and some guarantees | made in the language definition. For example, I think | a Python implementation that implemented lists as | linked lists would make many people unhappy, as their | algorithms suddenly went from O(n**m) to O(n**(m+1)) | without anyone telling them. Such an implementation should document such a design decision, but I don't see that an interpreter that runs the test suite should be prohibited from calling itself a 'Python interpreter' tjr

Dict access should probably be documented as no worse than O(log n) to allow for tree implementations.
That should not be documented. The current dict implementation may use O(n) for lookup operations, where n is the number of keys in the dictionary, and counting comparison operations. Regards, Martin
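A contrived sketch of how that worst case can be reached in practice: if every key hashes to the same value, lookups degrade to a linear scan over the colliding entries (and building the dict becomes quadratic). Nothing here is CPython-specific beyond ordinary __hash__/__eq__ behaviour.

    # Sketch: keys that all collide force the dict to compare against
    # every colliding entry, so a lookup costs O(n) comparisons instead
    # of the usual near-constant time.
    class BadKey:
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42                    # every key collides
        def __eq__(self, other):
            return isinstance(other, BadKey) and self.value == other.value

    d = {BadKey(i): i for i in range(2000)}   # quadratic to build
    print(BadKey(-1) in d)   # False, but only after ~2000 comparisons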

On Sun, Mar 9, 2008 at 9:22 AM, Aahz <aahz@pythoncraft.com> wrote:
There probably would be some value in a wiki page on python.org that provides this information, particularly across versions. You may be able to find volunteers to help on comp.lang.python.
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show I'm not that familiar with the Wiki syntax, so the tables are kind of ugly at the moment. I wasn't sure about many of the set() operations, so I didn't include those. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC

On Mon, Mar 10, 2008, Daniel Stutzbach wrote:
On Sun, Mar 9, 2008 at 9:22 AM, Aahz <aahz@pythoncraft.com> wrote:
There probably would be some value in a wiki page on python.org that provides this information, particularly across versions. You may be able to find volunteers to help on comp.lang.python.
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
I'm not that familiar with the Wiki syntax, so the tables are kind of ugly at the moment.
Please specify which Python version you're using. Thanks! -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "All problems in computer science can be solved by another level of indirection." --Butler Lampson

I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
I just knew that this could be a field of endless nitpicking. I think the dict.copy classification is incorrect. A dict.copy can take unbounded time, if the dict has seen many recent deletions (which didn't shrink it). Copying will have to iterate over all slots of the dictionary, and the ratio of slots to keys can grow unbounded if you have just deletions without insertions. IOW, if you do

    d = {}
    for i in xrange(N):
        d[i] = i
    for i in xrange(N-1):
        del d[i]

then doing d.copy() takes O(N), not constant time (even though there is only one key in the dictionary).
I wasn't sure about many of the set() operations, so I didn't include those.
set is just implemented like a dictionary, without values. Regards, Martin
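A sketch of how one might observe the effect described above; the exact behaviour of copy() after deletions is a CPython implementation detail and differs between versions, so treat the comparison as indicative only:

    # Sketch: a dict that grew large and was then mostly emptied can keep
    # its big table, so copying it costs time proportional to the table,
    # not to the one remaining key.
    from timeit import timeit

    def make_hollow(n):
        d = {i: i for i in range(n)}
        for i in range(n - 1):
            del d[i]
        return d                          # one key left, large table

    small = {0: 0}
    hollow = make_hollow(1000000)
    print("copy 1-key dict:    %.6f s" % timeit(small.copy, number=100))
    print("copy hollowed dict: %.6f s" % timeit(hollow.copy, number=100))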

On Mon, Mar 10, 2008 at 5:06 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
I just knew that this could be a field of endless nitpicking.
Certainly. I am hoping that this thread will soon wind down and future nitpicking can come in the form of edits to the wiki page with footnotes or links to other pages for anything non-obvious.
I think the dict.copy classification is incorrect. A dict.copy can take unbounded time, if the dict has seen many recent deletions (which didn't shrink it). Copying will have to iterate over all slots of the dictionary, and the ratio of slots to keys can grow unbounded if you have just deletions without insertions.
I have updated the wiki page accordingly. I assume there is a reason that PyDict_DelItem never calls dictresize? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC

I assume there is a reason that PyDict_DelItem never calls dictresize?
Yes - the assumption is that more del calls will follow, so that the dictionary eventually ends up empty. Only when new inserts are made is that assumption proven wrong, and the shrinking can then be done in one sweep. Regards, Martin

Martin v. Löwis wrote:
Yes - the assumption is that more del calls will follow, so that the dictionary eventually ends up empty. Only when new inserts are made is that assumption proven wrong, and the shrinking can then be done in one sweep.
Hmmm. So the most efficient way to copy a dict that you've just deleted a lot of things from is to insert something, then delete it again, and then copy. :-) -- Greg

Daniel Stutzbach wrote:
On Sun, Mar 9, 2008 at 9:22 AM, Aahz <aahz@pythoncraft.com> wrote:
There probably would be some value in a wiki page on python.org that provides this information, particularly across versions. You may be able to find volunteers to help on comp.lang.python.
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
I'm not that familiar with the Wiki syntax, so the tables are kind of ugly at the moment.
I wasn't sure about many of the set() operations, so I didn't include those.
Thanks! I'm interested too in some operations I don't know about; I think I'll just add them with blank fields so that eventually people who know can fill them in. Dimitris P.S. This wiki is really bad at tables: I can't figure out how to draw a border for the tables, and every table element contains a <p> tag, making each cell span 2-3 lines of height... :-@

On Mon, Mar 10, 2008 at 12:05 PM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:
On Sun, Mar 9, 2008 at 9:22 AM, Aahz <aahz@pythoncraft.com> wrote:
There probably would be some value in a wiki page on python.org that provides this information, particularly across versions. You may be able to find volunteers to help on comp.lang.python.
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
I'm not that familiar with the Wiki syntax, so the tables are kind of ugly at the moment.
I wasn't sure about many of the set() operations, so I didn't include those.
For Python's purposes, I think it's simpler to classify an operation as either "linear" or "near constant", then have an explanation that "near constant" is only the typical performance (it doesn't make guarantees about worst-case behaviour), may include O(log n) implementations, etc. That suffices to distinguish use cases, and anything more specific may be dominated by constant factors anyway. Something like sort is a special case. I don't think the language needs to guarantee any particular performance, yet it's worth documenting that CPython has a rather good implementation. -- Adam Olsen, aka Rhamphoryncus

Daniel Stutzbach wrote:
I just created a very basic one at http://wiki.python.org/moin/TimeComplexity?action=show
Hi, Just one quick note. What exactly do you mean by "Amortized worst case"? Shouldn't it just be "Worst case"? I think that the word "amortized" better describes the time complexity of specific operations. For example, I think that insertion into a dictionary should be noted as "O(1) amortized" for the average case, meaning that over a long sequence of random insertions the time per insertion asymptotically tends to a constant. And the worst case is simply O(n), not amortized. Am I missing something? Thanks, Dimitris

On Wed, Mar 12, 2008 at 2:52 PM, Dimitrios Apostolou <jimis@gmx.net> wrote:
Just one quick note. What exactly do you mean by "Amortized worst case"? Shouldn't it just be "Worst case"? I think that the word "amortized" better describes the time complexity of specific operations.
http://en.wikipedia.org/wiki/Amortized_analysis -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC
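For a concrete illustration of the term as applied to Python, list.append is the classic amortized-O(1) case: most appends are cheap, a few trigger a reallocation, and the total work over n appends stays O(n). A rough way to see the occasional capacity jumps on a given build (the growth pattern itself is a CPython implementation detail):

    # Sketch: count how often appending actually reallocates the list.
    # sys.getsizeof changes only when the allocated capacity changes.
    import sys

    lst = []
    last_size = sys.getsizeof(lst)
    resizes = 0
    for i in range(100000):
        lst.append(i)
        size = sys.getsizeof(lst)
        if size != last_size:
            resizes += 1
            last_size = size
    print("appends: 100000, reallocations observed:", resizes)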

Daniel Stutzbach wrote:
On Wed, Mar 12, 2008 at 2:52 PM, Dimitrios Apostolou <jimis@gmx.net> wrote:
Just one quick note. What exactly do you mean by "Amortized worst case"? Shouldn't it just be "Worst case"? I think that the word "amortized" better describes the time complexity of specific operations.
Thanks for this, I understand now what it means. However given that for the list and deque types both columns have exactly the same values, wouldn't it be more useful if we simply mentioned the worst case in the second, or another, column? On another note which sorting algorithm is python using? Perhaps we can add this as a footnote. I always thought it was quicksort, with a worst case of O(n^2). Thanks, Dimitris

Dimitrios Apostolou <jimis@gmx.net> wrote:
On another note which sorting algorithm is python using? Perhaps we can add this as a footnote. I always thought it was quicksort, with a worst case of O(n^2).
See http://svn.python.org/projects/python/trunk/Objects/listsort.txt

On Thu, Mar 13, 2008 at 3:16 AM, Dimitrios Apostolou <jimis@gmx.net> wrote:
On another note which sorting algorithm is python using? Perhaps we can add this as a footnote. I always thought it was quicksort, with a worst case of O(n^2).
It's a highly optimized variant of mergesort, with some neat ideas to make the best-case O(n). I just made the word "Sort" into a hyperlink, pointing to the link that Duncan Booth pointed out in another response. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC
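A quick way to see that near-linear best case on a given build is to compare sorting already-ordered input against shuffled input of the same size; the timings below are indicative only:

    # Sketch: the sort detects pre-existing runs, so ordered input sorts
    # in close to linear time, while random input pays the full O(n log n).
    import random
    from timeit import timeit

    n = 200000
    ordered = list(range(n))
    shuffled = list(range(n))
    random.shuffle(shuffled)

    print("ordered input:  %.3f s" % timeit(lambda: sorted(ordered), number=20))
    print("shuffled input: %.3f s" % timeit(lambda: sorted(shuffled), number=20))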

Hi, I just dug into the source code looking for complexity of set operations. In the wiki page I documented an interesting finding, that it is different to do s-t and s.difference(t). It is also interesting that you can do the first only for sets, but the second for any iterable t. Are these portable characteristics of the Python language or just implementation-specific details? In addition, can someone explain to me the usefulness of the loop starting with 'if (PyDict_CheckExact(other))' in set_difference()? As I understand it set_difference() is always called with two sets as arguments (set_sub() does the actual call). I'm just trying to figure out the complexity of the other set operations, but things get more complicated. I'd appreciate your help. Thanks, Dimitris
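A minimal sketch of the language-level difference mentioned here: the binary operator insists that both operands be sets (or frozensets), while the named method accepts any iterable.

    # Sketch: set operators require set operands; set methods take any iterable.
    s = {1, 2, 3, 4}

    print(s.difference([2, 4]))              # {1, 3} -- a list argument is fine
    print(s.difference(x for x in (1,)))     # {2, 3, 4} -- any iterable works

    try:
        s - [2, 4]                           # the operator form needs a set
    except TypeError as exc:
        print("operator rejected a list:", exc)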

Correcting myself: Dimitrios Apostolou wrote:
Hi,
I just dug into the source code looking for complexity of set operations. In the wiki page I documented an interesting finding, that it is different to do s-t and s.difference(t). It is also interesting
it is different to do s-t than s.difference_update(t), as far as complexity is concerned. The first one is O(len(s)) while the second is O(len(t)) (I *think* so, I may have missed lots of things in the source code).
that you can do the first only for sets, but the second for any iterable t.
Are these portable characteristics of the Python language or just implementation-specific details? In addition, can someone explain to me the
I just found it documented in the library reference that s.method() can accept any iterable while s-t can't. So I guess it is a language characteristic.
usefulness of the loop starting with 'if (PyDict_CheckExact(other))' in set_difference()? As I understand it set_difference() is always called with two sets as arguments (set_sub() does the actual call).
I'm just trying to figure out the complexity of the other set operations, but things get more complicated. I'd appreciate your help.
Thanks, Dimitris
P.S. Who is the wiki admin? I'm desperately trying to improve the looks of the tables (add a border, remove the <p> element from every cell) but I can't. I think that the page stylesheet needs to be modified, for starters...
participants (11):
- "Martin v. Löwis"
- Aahz
- Adam Olsen
- Daniel Stutzbach
- Dimitrios Apostolou
- Duncan Booth
- Fred Drake
- Greg Ewing
- Guido van Rossum
- Stephen J. Turnbull
- Terry Reedy