PEP 0424: A method for exposing a length hint
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
Hi all, I've just submitted a PEP proposing making __length_hint__ a public API for users to define and other VMs to implement: PEP: 424 Title: A method for exposing a length hint Version: $Revision$ Last-Modified: $Date Author: Alex Gaynor <alex.gaynor@gmail.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 14-July-2012 Python-Version: 3.4 Abstract ======== CPython currently defines an ``__length_hint__`` method on several types, such as various iterators. This method is then used by various other functions (such as ``map``) to presize lists based on the estimated returned by ``__length_hint__``. Types can then define ``__length_hint__`` which are not sized, and thus should not define ``__len__``, but can estimate or compute a size (such as many iterators). Proposal ======== This PEP proposes formally documenting ``__length_hint__`` for other interpreter and non-standard library Python to implement. ``__length_hint__`` must return an integer, and is not required to be accurate. It may return a value that is either larger or smaller than the actual size of the container. It may raise a ``TypeError`` if a specific instance cannot have its length estimated. It may not return a negative value. Rationale ========= Being able to pre-allocate lists based on the expected size, as estimated by ``__length_hint__``, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present. Open questions ============== There are two open questions for this PEP: * Should ``list`` expose a kwarg in it's constructor for supplying a length hint. * Should a function be added either to ``builtins`` or some other module which calls ``__length_hint__``, like ``builtins.len`` calls ``__len__``. Copyright ========= This document has been placed into the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 Alex
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
On Sat, Jul 14, 2012 at 4:18 PM, Benjamin Peterson <benjamin@python.org>wrote:
ValueError, the same as with len.
Sounds reasonable to me! Should we just go ahead and strip those out now?
Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero
data:image/s3,"s3://crabby-images/4031c/4031ce5b823df99dfa8e6671025bb4b4c95251e7" alt=""
On Sat, Jul 14, 2012 at 4:21 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
I'm +1 on not having a public API for this. Ultimately the contract for a length hint will depend heavily upon what you need it for. Some applications would require a length hint to be an "at least" others an "at most" and others something else entirely. Given that the contract here appears to be >=0, I don't think the length hint is particularly useful to the public at large.
data:image/s3,"s3://crabby-images/102be/102be22b252fd381e2db44154ec267297556abaa" alt=""
On Sun, Jul 15, 2012 at 1:28 AM, Alexandre Zani <alexandre.zani@gmail.com> wrote:
Other possible related uses could be to get an approximate number of results for a query without having to actually go through the whole query, useful for databases and search engines. But then you *do* want __len__ as well, so that also doesn't fit with the current PEP. But maybe that's a completely different usecase, even though it seems related to me? //Lennart
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Sun, Jul 15, 2012 at 9:18 AM, Benjamin Peterson <benjamin@python.org> wrote:
Length hints are very useful for *any* container implementation, whether those containers are in the standard library or not. Just as we exposed operator.index when __index__ was added, we should expose an "operator.length_hint" function with the following semantics: def length_hint(obj): """Return an estimate of the number of items in obj. This is useful for presizing containers when building from an iterable. If the object supports len(), the result will be exact. Otherwise, it may over or underestimate by an arbitrary amount. The result will be an integer >= 0. """ try: return len(obj) except TypeError: try: get_hint = obj.__length_hint__ except AttributeError: return 0 hint = get_hint() if not isinstance(hint, int): raise TypeError("Length hint must be an integer, not %r" % type(hint)) if hint < 0: raise ValueError("Length hint (%r) must be >= 0" % hint) return hint There's no reason to make pure Python container implementations reimplement all that for themselves. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
On Sat, Jul 14, 2012 at 10:16 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Sounds reasonable to me, the only issue with your psuedocode (err... I mean Python ;)), is that there's no way for the __lenght_hint__ to specify that that particular instance can't have a length hint computed. e.g. imagine some sort of lazy stream that cached itself, and only wanted to offer a length hint if it had already been evaluated. Without an exception to raise, it has to return whatever the magic value for length_hint is (in your impl it appears to be 0, the current _PyObject_LengthHint method in CPython has a required `default` parameter). The PEP proposes using TypeError for that. Anyways that code looks good, do you want to add it to the PEP? Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Alex Gaynor, 15.07.2012 07:20:
Yes, that's a major issue. I've been planning to add a length hint to Cython's generator expressions for a while, but the problem is really that in most cases it is only known at runtime if the underlying iterable has a length hint, so propagating it needs a way to say "sorry, I thought I might know, but I don't". It would be even better if this way was efficient. Since we're at a point of making this an official protocol, why not change the current behaviour and return -1 (or even just 0) to explicitly state that "we don't know"? The problem with an exception here is that it might have been raised accidentally inside of the __length_hint__() implementation that is being asked. Swallowing it just because it happened to be a TypeError rather than something else may end up covering bugs. We had a similar issue with hasattr() in the past. Also, it would be nice if this became a type slot rather than requiring a dict lookup and Python function call. Stefan
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Nick Coghlan wrote:
As given, length_hint gives no way of distinguishing between iterables and non-iterables: py> length_hint([]) 0 py> length_hint(42) 0 nor does it give iterable objects a way to indicate that either they don't know their length, or that they are infinite. I suggest: * object (and hence all other types that don't explicitly override it) should have a __length_hint__ that raises TypeError; * __length_hint__ should be allowed to return None to indicate "don't know" or -1 to indicate "infinite". Presumably anything that wishes to create a list or other sequence from an object with a hint of -1 could then raise an exception immediately. -- Steven
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Sun, Jul 15, 2012 at 6:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
We can keep it simpler than that just by changing the order of the checks.
I'm not seeing the value in returning None over 0 for the don't know case - it just makes the API harder to use. Declaring negative results as meaning "I'm infinite" sounds reasonable, though: def length_hint(obj): """Return an estimate of the number of items in obj. This is useful for presizing containers when building from an iterable. If the object supports len(), the result will be exact. Otherwise, it may over or underestimate by an arbitrary amount. """ try: get_hint = obj.__length_hint__ except AttributeError: return len(obj) hint = get_hint() if not isinstance(hint, int): msg = "Length hint must be an integer, not %r" raise TypeError(msg % type(hint)) if hint < 0: raise ValueError("%r is an infinite iterator" % (obj,)) return hint Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 18:47:38 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
The point is that 0 is a legitimate value for a length hint. Simple implementations of __length_hint__ will start returning 0 as a legitimate value and you will wrongly interpret that as "don't know", which kinds of defeat the purpose of __length-hint__ ;) That said, I don't think a special value for "is infinite" is useful. Just make -1 mean "I don't know". Regards Antoine.
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Antoine Pitrou wrote:
That said, I don't think a special value for "is infinite" is useful. Just make -1 mean "I don't know".
You've obviously never accidentally called list on an infinite iterator *wink* It's not the (eventual) MemoryError that is the problem. On some systems, this can cause the PC to become unresponsive as the OS tries to free an ever-increasing amount of memory. Been there, done that, on a production system. I had to do a hard reboot to fix it. I think having a hint that says "there's no way this can succeed, fail immediately" is more useful than caring about the difference between a hint of 0 and a hint of 1. -- Steven
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
Right, I agree on the value in being able to return something to say "this cannot be converted to a concrete container". I still haven't seen a use case where the appropriate response to "I don't know" differs from the appropriate response to a hint of zero - that is, you don't preallocate, you just start iterating. Cheers, Nick. -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 00:08:41 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, I agree on the value in being able to return something to say "this cannot be converted to a concrete container".
Who would be able to return that, apart from trivial cases like itertools.cycle()? Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/42e2c/42e2c427e050a4c1eb1e98390a96fab00e060c7a" alt=""
Am 15.07.2012 16:22, schrieb Antoine Pitrou:
For example most numerical sequence iterators like Fibonacci generator, prime number sequence generator and even trivial cases like even natural number generator. IMO it's a good idea to have a notation for infinitive iterators that can't be materialized as finite containers. +1 Christian
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 16:33:23 +0200 Christian Heimes <lists@cheimes.de> wrote:
First, you can't implement __length_hint__ for a generator, which is the preferred (the most practical) way of writing iterators in pure Python. Second, not all iterators will implement __length_hint__ (because it's optional and, really, of rather little use). So, as a user, you cannot hope that `list(some_iterator)` will always raise instead of filling your memory with an infinite stream of values: you have to be careful anyway. Even if __length_hint__ is implemented, its result may be wrong. That's the whole point: it's a *hint*; an iterator might tell you it's finite while it's infinite, or the reverse. My conclusion is that an infinite iterator is a documentation issue. Just tell the user that it doesn't stop, and let them shoot themselves in the foot in they want to. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Antoine Pitrou wrote:
First, you can't implement __length_hint__ for a generator, which is the preferred (the most practical) way of writing iterators in pure Python.
Limitations of generators are no reason for not improving iterators which are not generators. __length_hint__ already exists; this proposal simply proposes making it documented and officially supported. py> iter([]).__length_hint__ <built-in method __length_hint__ of list_iterator object at 0xb7bcf98c>
If it claims to be infinite, I see no reason to disbelieve it on the off-chance that it is actually both finite and small enough to fit into memory without crashing my system. If it claims to be finite, but is actually infinite, well that's not much of a hint, is it? There's an implied promise that the hint will be close to the real value, not infinitely distant.
Buffer overflows are a documentation issue. Just tell the user not to overwrite memory they don't mean to, and let them shoot themselves in the foot if they want. *wink* -- Steven
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 02:21:20 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
No, buffer overflows are bugs and they get fixed. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Antoine Pitrou, 15.07.2012 17:06:
It can be implemented for generator expressions without a conditional, though, including the case of comprehensions. I wanted to do this in Cython for a while, but the protocol wasn't very well adapted to that use case. The "don't know" case was just too common and inefficient. For the other points, I agree with the already presented counterarguments. Being able to prevent some obvious traps is a good thing, even if you can't prevent all of them. Stefan
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Jul 15, 2012, at 7:22 AM, Antoine Pitrou wrote:
FWIW, here are the notes from the docstring in Lib/test/test_iterlen.py: """ Test Iterator Length Transparency Some functions or methods which accept general iterable arguments have optional, more efficient code paths if they know how many items to expect. For instance, map(func, iterable), will pre-allocate the exact amount of space required whenever the iterable can report its length. The desired invariant is: len(it)==len(list(it)). A complication is that an iterable and iterator can be the same object. To maintain the invariant, an iterator needs to dynamically update its length. For instance, an iterable such as xrange(10) always reports its length as ten, but it=iter(xrange(10)) starts at ten, and then goes to nine after it.next(). Having this capability means that map() can ignore the distinction between map(func, iterable) and map(func, iter(iterable)). When the iterable is immutable, the implementation can straight-forwardly report the original length minus the cumulative number of calls to next(). This is the case for tuples, xrange objects, and itertools.repeat(). Some containers become temporarily immutable during iteration. This includes dicts, sets, and collections.deque. Their implementation is equally simple though they need to permanently set their length to zero whenever there is an attempt to iterate after a length mutation. The situation slightly more involved whenever an object allows length mutation during iteration. Lists and sequence iterators are dynamically updatable. So, if a list is extended during iteration, the iterator will continue through the new items. If it shrinks to a point before the most recent iteration, then no further items are available and the length is reported at zero. Reversed objects can also be wrapped around mutable objects; however, any appends after the current position are ignored. Any other approach leads to confusion and possibly returning the same item more than once. The iterators not listed above, such as enumerate and the other itertools, are not length transparent because they have no way to distinguish between iterables that report static length and iterators whose length changes with each call (i.e. the difference between enumerate('abc') and enumerate(iter('abc')). """ Raymond
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Nick Coghlan wrote:
There seem to be 5 possible classes values of __length_hint__ that an iterator object can provide: 1. Don't implement it at all. 2. Implement __length_hint__() but don't want to return any value. Either raise an exception (TypeError) -- As suggested in the PEP. or return NotImplemented -- my preferred option. 3. Return a "don't know" value: Returning 0 would be fine for this, but the VM might want to respond differently to "don't know" and 0. __length_hint__() == 0 container should be minimum size. __length_hint__() == "unknown" container starts at default size. 4. Infinite iterator: Could return float('inf'), but given this is a "hint" then returning sys.maxsize or sys.maxsize + 1 might be OK. Alternatively raise an OverflowError 5. A meaningful length. No problem :) Also, what are the allowable return types? 1. int only 2. Any number (ie any type with a __int__() method)? 3. Or any integer-like object (ie a type with a __index__() method)? My suggestion: a) Don't want to return any value or "don't know": return NotImplemented b) For infinite iterators: raise an OverflowError c) All other cases: return an int or a type with a __index__() method. Cheers, Mark.
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Sun, Jul 15, 2012 at 10:39 AM, Mark Shannon <mark@hotpy.org> wrote:
I am really having a hard time differentiating infinity with "I don't know" since they are both accurate from the point of view of __length_hint__ and its typical purpose of allocation. You have no clue how many values will be grabbed from an infinite iterator, so it's the same as just not knowing upfront how long the iterator will be, infinite or not, and thus not worth distinguishing.
I'm fine with (a), drop (b), and for (c) use what we allow for __len__() since, as Nick's operator.length_hint pseudo-code suggests, people will call this as a fallback if __len__ isn't defined. -Brett
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 16:08:00 +0100 Mark Shannon <mark@hotpy.org> wrote:
Why should it? AFAIK it's not a common complaint. You said it yourself: it's a silly thing to do. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/4031c/4031ce5b823df99dfa8e6671025bb4b4c95251e7" alt=""
On Sun, Jul 15, 2012 at 8:08 AM, Mark Shannon <mark@hotpy.org> wrote:
The PEP so far says: "It may raise a ``TypeError`` if a specific instance cannot have its length estimated." In many ways, "I don't know" is the same as this "specific instance cannot have its length estimated". Why not just raise a TypeError? Also, regarding the code Nick posted above, I'm a little concerned about calling len as the first thing to try. That means that if I implement both __len__ and __len_hint__ (perhaps because __len__ is very expensive) __len_hint__ will never be used. It's relatively easy to say: try: hint = len_hint(l) except TypeError: hint = len(l)
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Mark Shannon wrote:
So how does an iterator express infinite length?
The suggestion was it should return -1.
That depends on your OS. I've just tested it now on Linux Mint, and the Python process was terminated within seconds. I've also inadvertently done it on a Fedora system, which became completely unresponsive to user-input (including ctrl-alt-delete) within a few minutes. I let it run overnight (16 hours) before literally pulling the plug. (I expect the difference in behaviour is due to the default ulimit under Debian/Mint and RedHat/Fedora systems.) Ignoring OS-specific features, the promise[1] of the language is that list will try to allocate enough space for every item yielded by the iterator, or fail with a MemoryError. No promise is made as to how long that will take: it could take hours, or days, depending on how badly memory allocation performance drops when faced with unreasonably large requests. You can't expect it to fail either quickly or with an exception. With a length hint, we could strengthen that promise: "if __length_hint__ returns a negative number, list, tuple and set will fail immediately with MemoryError" which I think is a good safety feature for some things which cannot possibly succeed, but risk DOSing your system. Does it prevent every possible failure mode? No, of course not. But just because you can't prevent *every* problem doesn't mean you should prevent the ones which you can. [1] I think. I'm sure I read this somewhere in the docs, but I can't find it now. -- Steven
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Mon, Jul 16, 2012 at 1:55 AM, Steven D'Aprano <steve@pearwood.info> wrote:
(I expect the difference in behaviour is due to the default ulimit under Debian/Mint and RedHat/Fedora systems.)
Possibly also virtual memory settings. Allocating gobs of memory with a huge page file slows everything down without raising an error. And since it's possible to have non-infinite but ridiculous-sized iterators, I'd not bother putting too much effort into protecting infinite iterators - although the "huge but not infinite" case is, admittedly, rather rarer than either "reasonable-sized" or "actually infinite". ChrisA
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 02:00:58 +1000 Chris Angelico <rosuav@gmail.com> wrote:
In the real world, I'm sure "huge but not infinite" is much more frequent than "actually infinite". Trying to list() an infinite iterator is a programming error, so it shouldn't end up in production code. However, data that grows bigger than expected (or that gets disposed of too late) is quite a common thing. <hint> When hg.python.org died of OOM two weeks ago, it wasn't because of an infinite iterator: http://mail.python.org/pipermail/python-committers/2012-July/002084.html </hint> Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/42e2c/42e2c427e050a4c1eb1e98390a96fab00e060c7a" alt=""
Am 15.07.2012 16:39, schrieb Mark Shannon:
How is this different from "don't know"? What's the use case for knowing that the object doesn't want to say anything or doesn't know its possible length.
How about None? It's the logical choice, simple and easy to test for in Python and C code. 0 is a valid number for "I know that's I'll return nothing".
Too complex, hard to remember and even harder to check for. Since a length is always positive or zero, -1 is a good return value for infinite.
a) Don't want to return any value or "don't know": return NotImplemented
+1
b) For infinite iterators: raise an OverflowError
-1, I'm for -1. ;) I'm not a fan of using exception for valid and correct return values.
c) All other cases: return an int or a type with a __index__() method.
+1 Christian
data:image/s3,"s3://crabby-images/b96f7/b96f788b988da8930539f76bf56bada135c1ba88" alt=""
Nick Coghlan writes:
Why wouldn't one just believe the hint and jump past the iteration? What about an alternative API such as length_hint(iter, bound) returning 'cannot say' (if no hint is available), 'small' (if the estimated length is less than bound), and 'large' (if it's greater than the bound or infinite)? (Or None, True, False which would give the boolean interpretation "do I know I'm small enough to be converted to a concrete container?") The point is that I don't really see the value in returning a precise estimate that cannot be relied on to be accurate. OK, Python is a "consenting adults" language, but returning an integer here seems like invitation to abuse.
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Jul 16, 2012 1:52 PM, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Because preallocating memory is ridiculously faster than doing multiple resizes. That's all this API is for: how many objects should a container constructor preallocate space for when building from an iterable. It's an important optimisation in CPython when using itertools, and PyPy is planning to adopt it as well. Alex is doing the right thing in attempting to standardise it rather than risk the two implementations using subtly incompatible definitions. Skipping the iteration in the zero case is a pointless micro-optimisation that just makes the API more complex for no good reason. Allowing a negative hint to mean "infinite", on the other hand, avoids certain categories of errors without making the API any harder to use (since negatives have to be rejected anyway). -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 7/14/2012 6:11 PM, Alex Gaynor wrote: ... Various thoughts: "This method is then used by various other functions (such +as ``map``) to presize lists" -- map no longer produces lists. This only makes sense in 3.x if you mean that map can pass along the value of its inputs. "Types can then define ``__length_hint__`` which are not +sized, and thus should not define ``__len__``," is awkwardly phrased. I think you mean "Types that are not sized and should not define __len__ can then define __length_hint__. What do 'sized' and 'should' mean? Some iterators know exactly how many items they have yet to yield. The main implication of having a __len__ versus __length_hint__ methods seems to be it bool() value when empty. If lists were to get a new keyword arg, so should the other classes based on one internal array. I see this has been removed. Generator functions are the nicest way to define iterators in Python. Generator instances returned from generator functions cannot be given a length hint. They are not directly helped. However ... Not addressed in the PEP: do consumers of __length_hint look for it (and __length__ before or after calling iter(input), or both? If before, then the following should work. class gwlh: # generator with length hint def __init__(self, gen, len): self.gen = gen self.len = len def __iter__(self): return self.gen def __length_hint__(self): return len Do transformation iterators pass through hints from inputs? Does map(f, iterable) look for len or hint on iterable? Ditto for some itertools, like chain (add lengths). Any guidelines in the PEP -- Terry Jan Reedy
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Alex Gaynor wrote:
These seems back-to-front. __length_hint__ is *used* by the VM, not provided by it. It should be part of the object model, rather than the API.
Don't use "map" as an example. map returns an iterator so it doesn't need __length_hint__
Rather than raising a TypeError, why not return NotImplemented?
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Mark Shannon, 15.07.2012 16:14:
Right. It's a good example for something else, though. As I mentioned before, iterators should be able to propagate the length hint of an underlying iterator, e.g. in generator expressions or map(). I consider that an important feature that the protocol must support. Stefan
data:image/s3,"s3://crabby-images/ace53/ace533b21046d86d2169ea1667d42fedf3f39e15" alt=""
On 7/16/2012 9:54 AM, Stefan Behnel wrote:
map() is quite problematic in this matter, and may actually benefit from the existence of __length_hint__. It is very easy to create an infinite loop currently by doing stuff like x=[1]; x+=map(str,x) [61081 refs]
Obviously, this won't cause an infinite loop in Python2 where map is non-lazy. Also, this won't work for all mutable containers, because not all of them permit adding elements while iterating:
If map objects were to disallow changing the size of the container while iterating (I can't really think of an use-case in which such a limitation would be harmful), it might as well be with __length_hint__. Also, what would iter([1,2,3]).__length_hint__() return? 3 or unknown? If 3, then the semantics of l=[1,2,3]; l += iter(l) will change (infinite loop without __length_hint__ vs. list of 6 elements with __length_hint__). If unknown, then it doesn't seem like there are very many places where __length_hint__ can return anything but unknown. Regards, Stefan M
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Alex Gaynor, 15.07.2012 00:11:
I'd like to more visibly repeat my suggestion to make this a slot method "tp_length_hint()" in extension types that returns a Py_ssize_t. That suggests that a negative return value would have a special meaning instead of relying on return values like NotImplemented. The Python wrapper of that slot method could still implement a mapping for this. Return values could be -1 for "don't know" and -2 for "infinite" at the C level, and NotImplemented for "don't know" at the Python level. Not sure about a good Python value for "infinite". Maybe return -1 for "infinite" at both levels and -2/NotImplemented for "don't know" in C/Python? That would suggest -3 to propagate exceptions at the C level. Stefan
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
Proposing anything substantially more complicated than what is currently implemented in CPython will just get the idea rejected. The substantial immediate gain for PyPy is in skipping the memory resizing when building containers from itertools iterators, not anywhere else. -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Nick Coghlan, 16.07.2012 10:26:
The same applies to Cython, where the extension types that implement generator expressions can benefit from propagating the length hint of the underlying iterator. A type slot would help in making this more efficient overall, also for CPython itself. Stefan
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
While the idea behind PEP 424 is sound, the text of the PEP is rather vague and missing a lot of details. There was extended discussion on the details, but none of that has appeared in the PEP yet. So Alex, how about adding those details? Also the rationale is rather poor. Given that CPython is the reference implementation, PyPy should be compared to CPython, not vice-versa. Reversing PyPy and CPython in the rationale gives: ''' Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. PyPy has been observed to run some code slower than CPython, purely because this optimization is absent. ''' Which is a PyPy bug report, not a rationale for a PEP ;) Perhaps a better rationale would something along the lines of: ''' Adding a __length_hint__ method to the iterator protocol allows sequences, notably lists, to be initialised from iterators with only a single resize operation. This allows sequences to be intialised quickly, yet have a small growth factor, reducing memory use. ''' Cheers, Mark.
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Wed, Aug 1, 2012 at 10:46 AM, Mark Shannon <mark@hotpy.org> wrote:
Hi Mark. It's not about saving memory. It really is about speed. Noone bothered measuring cpython with length hint disabled to compare, however we did that for pypy hence the rationale contains it. It's merely to state "this seems like an important optimization". Since the C-level code involved is rather similar (it's mostly runtime anyway), it seems reasonable to draw a conclusion that removing length hint from cpython would cause slowdown. Cheers, fijal
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Maciej Fijalkowski wrote:
It is not about making it faster *or* saving memory, but *both*. Without __length_hint__ there is a trade off between speed and memory use. You can have speed at the cost of memory by increasing the resize factor. With __length_hint__ you can get both speed and good memory use. Cheers, Mark
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Wed, Aug 1, 2012 at 12:06 PM, Mark Shannon <mark@hotpy.org> wrote:
No, you cannot. if you allocate a huge region, you're not gonna make much of speed, because at the end you need to copy stuff anyway. Besides large allocations are slow. With length hint that is correct (sometimes you can do that) you have a zero-copy scenario
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Aug 1, 2012, at 1:46 AM, Mark Shannon wrote:
Alex's rationale is correct and well expressed. Your proposed revision reflects fuzzy thinking about why __length_hint__ is useful. Regardless of resizing growth factors, it is *always* helpful to know how much memory to allocate. Calls to the allocators (especially for large blocks) and possible the recopying of data should be avoided when possible. Raymond
data:image/s3,"s3://crabby-images/98c42/98c429f8854de54c6dfbbe14b9c99e430e0e4b7d" alt=""
On 16.07.12 10:36, Stefan Behnel wrote:
Return values could be -1 for "don't know" and -2 for "infinite" at the C level, and NotImplemented for "don't know" at the Python level.
PY_SSIZE_T_MAX is better value for "infinite". In any case no difference for consumer between PY_SSIZE_T_MAX and a real infinity.
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Mon, Jul 16, 2012 at 3:36 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Gods no. Making the return value different in C vs. Python code is just asking for trouble in terms of having to remember that specific difference while coding. Plus asking for people to check for an explicit negative values instead of just >= 0 would be problematic and prone to error.
See above. This is another reason why I don't think the infinite iterator concept is worth expressin. It's just mucking things up for no good reason. "infinite" == "I don't know" in the case of pre-allocation of a list. Just raise an exception or return None and be done with it. Nice and simple. And my vote is for an exception as EAFP. -Brett
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
I've updated the PEP to reflect the discussion. There are two major changes: 1) NotImplemented may be used by __length_hint__ to indicate that there is no finite length hint available. 2) callers of operator.length_hint() must provide their own default value, this is also required by the current C _PyObject_LengthHint implementation. There are no provisions for infinite iterators, that is not within the scope of this proposal. Alex
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Tue, Jul 17, 2012 at 1:03 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
I've been thinking about this a bit more, and I think it does provide good scope for eventually adding __length_hint__ to more iterators (including map, filter and enumerate).
2) callers of operator.length_hint() must provide their own default value, this is also required by the current C _PyObject_LengthHint implementation.
And this makes it explicit that API users need to deal with the AttributeError/NotImplemented case, whilst making it easy to do so. Good call.
There are no provisions for infinite iterators, that is not within the scope of this proposal.
I'll repeat my observation that remaining silent on this point is effectively identical to blessing the practice of raising an exception in __length_hint__ to force fast failure of attempts to convert an infinite iterator to a concrete container. Rather than leaving people to figure this out on their own, we may as well make it official that TypeError can be raised in __length_hint__ to block conversion to concrete containers that use a preallocation strategy. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Tue, 17 Jul 2012 13:19:55 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
And I'll repeat that it is false ;) Being silent is certainly not the same thing as blessing a non-existent practice. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
To quote from PEP 424:
Why is it a significant optimisation? How much slower is it? Where is the data? *If* resizing list is so slow, then why not make it faster? To quote wikipedia (http://en.wikipedia.org/wiki/Dynamic_array) """ As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. The value of this proportion a leads to a time-space tradeoff: the average time per insertion operation is about a/(a-1), while the number of wasted cells is bounded above by (a-1)n. The choice of a depends on the library or application: some textbooks use a = 2, but Java's ArrayList implementation uses a = 3/2 and the C implementation of Python's list data structure uses a = 9/8. """ If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code. Cheers, Mark.
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Mark Shannon, 16.07.2012 10:37:
If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code.
The thing is that the performance is platform specific. On systems with a fast memory allocator, especially on Linux, the difference is negligible. However, with a slow memory allocator, especially a slow realloc(), e.g. on Windows or Solaris, this can substantially hurt the performance, up to a quadratically increasing runtime in the worst case. The length hint was implemented specifically to work around this problem. Stefan
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Mon, Jul 16, 2012 at 11:02 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
It's not the actual allocation (typically), it's the copying that's the problem. As far as data goes - preallocation can make 4x difference (on PyPy, although the dominant operation is not different on CPython) on ''.join(some-iterable). It depends grossly on the sizes of the list, so you can't claim that there is a precise speedup of a constant factor, however, there are cases where it *can* be significant (as in the copying is by far the dominating operation), most notable giant templates with iterators written in C. Speaking of which - I find this bikeshed disgusting. The purpose of the PEP is to codify whatever is already written in code in CPython. If you guys don't want it, we'll just implement it anyway and try to follow the CPython current implementation from 2.7. Cheers, fijal
data:image/s3,"s3://crabby-images/53627/53627fe72d6b8623e1f52a67d3358e3ac4d10e21" alt=""
Speaking of which - I find this bikeshed disgusting.
Disgusting? Irritating, perhaps, but why should a PEP -- even one whose purpose is to codify existing practice -- not result in discussions about its subject matter? The final P stands for Proposal, not for Pronouncement. TJG
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Mon, Jul 16, 2012 at 7:21 PM, Tim Golden <mail@timgolden.me.uk> wrote:
Indeed - I'd be worried if any PEP sailed through python-dev review without a thorough kicking of the tires. Yes, it can be annoying having to bring people up to speed on issues that they aren't familiar with, but that's generally a sign that there is relevant background information *missing from the PEP*. PEP's aren't supposed to be written just for people that are already intimately familiar with a problem - they're supposed to provide enough background that they stand on their own. In this case, the key points that I think need to be added: - more background on why the __length_hint__ API exists in CPython in the first place: to minimise potentially expensive data copies (due to memory reallocation) when creating a concrete container from an iterator. This includes when creating them from another concrete container via an intermediate iterator. This is why at least the following produce objects that define __length_hint__ in CPython: reversed itertools.repeat iter(dict()) iter(list()) iter(tuple()) iter(str()) iter(bytes()) iter(bytearray()) iter(set()) iter(frozenset()) dict.values() dict.keys() As well as any user defined sequence that relies on the default sequence iterator: >>> class MySeq(): ... def __getitem__(self, idx): ... return idx ... def __len__(self): ... return 10 ... >>> hasattr(iter(MySeq()), "__length_hint__") True - clarification on the implications of it only being a "hint": specifically, as it may be an over- or underestimate, you *cannot* rely on the hint to decide whether or not to iterate over the object when a valid length is returned (as a value of zero may be an underestimate). However, it does allow you to presize your container more appropriately than just starting at zero as usual, thus achieving the aim of reducing the risk of unnecessary memory copies. That's the basic proposal. Separate from that, there are a few suggestions for *enhancement* beyond what CPython currently uses (and has demonstrated a clear need for): - adding operator.length_hint(). This makes sense to me, as it makes it much easier to use the API when implementing a container type in Python. It's also a pretty simple addition. - adding a C level type slot. I'm personally -1 on this one in the context of the PEP. I don't think the current PEP (which is really aimed at standardisation for PyPy's benefit) should be weighed down with this CPython specific implementation detail. As a separate proposal, independent of this PEP, from someone that cares specifically about micro-optimising this API for CPython, and (preferably) can produce benchmark numbers to show the additional implementation complexity is worthwhile, then I wouldn't object. I just don't want this orthogonal concern to derail the standardisation effort. - distinguishing between different reasons for saying "don't preallocate any space" (i.e. returning zero). I still haven't heard a convincing rationale for this one - it seems to be based on the notion that the case of skipping the iteration step for a genuinely empty iterable is worth optimising. This idea just doesn't make sense when any legitimate length value that is returned can't be trusted to be completely accurate - you have to iterate to confirm the actual number. - making it possible to fail *fast* when a known infinite iterator (like itertools.cycle or itertools.count) is passed to a concrete container. I think this is best covered in the PEP by explicitly stating that some types may implement __length_hint__ to always raise TypeError rather than defining a magic return value that means "I'm infinite". - making it possible for iterators like enumerate, map and filter to delegate __length_hint__ to their underlying iterator. This seems potentially worth doing, but requires resolving the problem that Raymond noted with handling the difference in internal behaviour between enumerate("abc") and enumerate(iter("abc")). Again, it would be unfortunate to see the PEP held up over this. - making it possible to define __length_hint__ for generator-iterator objects. While this is a nice idea, again, I don't think it's something that this particular PEP should be burdened with. My main point is that the current __length_hint__ behaviour has already proven its value in the real world. The PyPy team would like that behaviour codified, so they can be reasonably sure both implementations are doing the same thing. Many of the enhancements I have listed above may be suitable material for future enhancement proposals, but I'm not seeing any requested functionality that would be actively *blocked* by simply codifying the status quo. The PEP itself already has this general tone, but I think that it should be even more emphatic that it's about codifying the status quo, *not* about modifying it with ideas haven't been proven useful through past experience. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 23:23:18 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
The point is that zero is a valid value for a length hint. By making it a special value for "don't know", you are making the protocol potentially confusing, and you are also departing from the current semantics. (and, yes, I think distinguishing between zero and "don't know" is useful: imagine a container that would preallocate 256 entries by default when the answer is "don't know")
Then the PEP shouldn't address infinite iterators at all. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Tue, Jul 17, 2012 at 12:01 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
No, it just means the default estimate is always zero. If you don't do that, then *every* client of __length_hint__ has to check for the magic value. It's making the API harder to use for no good reason.
Such a container has to already deal with the case when it overestimates severely. The only cost of using zero as a default estimate is that such hypothetical containers will overallocate when they technically didn't need to, which will already happen for any empty iterator that doesn't provide __length_hint__ at all. Given that all standard library containers default to assuming a size of zero (absent information indicating otherwise), and a special value would need to be special cased by *every* client of the API (and almost always defaulted to zero), it's simply not a good trade-off.
Noting that infinite iterators are free to define __length_hint__ to always throw an exception *is* the status quo. We just haven't done it for the stdlib ones. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Tue, 17 Jul 2012 00:18:55 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Actually, dict and set default to a non-zero internal size, but I agree making containers harder to implement is not a good thing.
Being "free" to do unexpected or unconventional things is not the same thing as codifying those behaviours in a PEP, especially when noone is actually doing them. __length_hint__ is supposed to be informative, it shouldn't error out on you. So still -1 from me. Regards Antoine.
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Maciej Fijalkowski, 16.07.2012 11:14:
Note that a call to realloc() includes that part and can avoid copying if possible. A good realloc() implementation can make this use case run in amortised linear time, at least on a system with sufficient memory. A bad one can result in quadratic runtime, which is way more than a change by a constant factor. Thus my above comment that it's platform specific.
Absolutely. This is particularly visible at the C level because C implemented iterators have a very low overhead overall.
The idea behind this bikeshedding is that at the moment where we make this an official protocol rather than an implementation detail, it should be able to communicate the different states on the callee side of such a protocol. I.e. it needs a better design than the current one. Stefan
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Jul 16, 2012, at 1:37 AM, Mark Shannon wrote:
Unless pre-sized by with a length prediction, a growing list periodically needs to call realloc() which can move all the data to a new location in memory. Pre-sizing avoids that entirely.
If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code.
A great deal of thought and care went into the current design. It has already been "tweaked". Raymond P.S. The dictionary code also uses presizing for copies, updates, set conversion, etc. It is a perfectly reasonable technique to pre-allocate the correct size container when the ultimate length is knowable in advance.
data:image/s3,"s3://crabby-images/b3d87/b3d872f9a7bbdbbdbd3c3390589970e6df22385a" alt=""
*If* resizing list is so slow, then why not make it faster?
A simple solution to speed up such problem is to change the overallocation factor, but it may waste memory. Python tries to be fast and to not waste too much memory.
I worked recently on optimizing str%args and str.format(args). Handle correctly the memory allocation is just critical for performances, especially for str with the PEP 393, because we have to shrink the buffer to the exact string length with the formatting function is done. I tried various overallocation factors and I chose 1.25 (5/4) because it was the fastest. See for example this issue for benchmark numbers: http://bugs.python.org/issue14687 The PyUnicodeWriter internal object uses various options to choose how many bytes should be allocated: * an overallocation flag to disable overallocation when we know that we are writing the last character/string into be buffer * a "minimal length" used for the first allocation * an hardcoded overallocation factor (1.25) PyUnicodeWriter is a little bit different than the __length_hint__ issue because PyUnicodeWriter has to shrink the buffer when it is done, but I can say that overallocation is very useful for speed. Victor
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
On Sat, Jul 14, 2012 at 4:18 PM, Benjamin Peterson <benjamin@python.org>wrote:
ValueError, the same as with len.
Sounds reasonable to me! Should we just go ahead and strip those out now?
Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero
data:image/s3,"s3://crabby-images/4031c/4031ce5b823df99dfa8e6671025bb4b4c95251e7" alt=""
On Sat, Jul 14, 2012 at 4:21 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
I'm +1 on not having a public API for this. Ultimately the contract for a length hint will depend heavily upon what you need it for. Some applications would require a length hint to be an "at least" others an "at most" and others something else entirely. Given that the contract here appears to be >=0, I don't think the length hint is particularly useful to the public at large.
data:image/s3,"s3://crabby-images/102be/102be22b252fd381e2db44154ec267297556abaa" alt=""
On Sun, Jul 15, 2012 at 1:28 AM, Alexandre Zani <alexandre.zani@gmail.com> wrote:
Other possible related uses could be to get an approximate number of results for a query without having to actually go through the whole query, useful for databases and search engines. But then you *do* want __len__ as well, so that also doesn't fit with the current PEP. But maybe that's a completely different usecase, even though it seems related to me? //Lennart
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Sun, Jul 15, 2012 at 9:18 AM, Benjamin Peterson <benjamin@python.org> wrote:
Length hints are very useful for *any* container implementation, whether those containers are in the standard library or not. Just as we exposed operator.index when __index__ was added, we should expose an "operator.length_hint" function with the following semantics: def length_hint(obj): """Return an estimate of the number of items in obj. This is useful for presizing containers when building from an iterable. If the object supports len(), the result will be exact. Otherwise, it may over or underestimate by an arbitrary amount. The result will be an integer >= 0. """ try: return len(obj) except TypeError: try: get_hint = obj.__length_hint__ except AttributeError: return 0 hint = get_hint() if not isinstance(hint, int): raise TypeError("Length hint must be an integer, not %r" % type(hint)) if hint < 0: raise ValueError("Length hint (%r) must be >= 0" % hint) return hint There's no reason to make pure Python container implementations reimplement all that for themselves. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
On Sat, Jul 14, 2012 at 10:16 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Sounds reasonable to me, the only issue with your psuedocode (err... I mean Python ;)), is that there's no way for the __lenght_hint__ to specify that that particular instance can't have a length hint computed. e.g. imagine some sort of lazy stream that cached itself, and only wanted to offer a length hint if it had already been evaluated. Without an exception to raise, it has to return whatever the magic value for length_hint is (in your impl it appears to be 0, the current _PyObject_LengthHint method in CPython has a required `default` parameter). The PEP proposes using TypeError for that. Anyways that code looks good, do you want to add it to the PEP? Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Alex Gaynor, 15.07.2012 07:20:
Yes, that's a major issue. I've been planning to add a length hint to Cython's generator expressions for a while, but the problem is really that in most cases it is only known at runtime if the underlying iterable has a length hint, so propagating it needs a way to say "sorry, I thought I might know, but I don't". It would be even better if this way was efficient. Since we're at a point of making this an official protocol, why not change the current behaviour and return -1 (or even just 0) to explicitly state that "we don't know"? The problem with an exception here is that it might have been raised accidentally inside of the __length_hint__() implementation that is being asked. Swallowing it just because it happened to be a TypeError rather than something else may end up covering bugs. We had a similar issue with hasattr() in the past. Also, it would be nice if this became a type slot rather than requiring a dict lookup and Python function call. Stefan
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Nick Coghlan wrote:
As given, length_hint gives no way of distinguishing between iterables and non-iterables: py> length_hint([]) 0 py> length_hint(42) 0 nor does it give iterable objects a way to indicate that either they don't know their length, or that they are infinite. I suggest: * object (and hence all other types that don't explicitly override it) should have a __length_hint__ that raises TypeError; * __length_hint__ should be allowed to return None to indicate "don't know" or -1 to indicate "infinite". Presumably anything that wishes to create a list or other sequence from an object with a hint of -1 could then raise an exception immediately. -- Steven
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Sun, Jul 15, 2012 at 6:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
We can keep it simpler than that just by changing the order of the checks.
I'm not seeing the value in returning None over 0 for the don't know case - it just makes the API harder to use. Declaring negative results as meaning "I'm infinite" sounds reasonable, though: def length_hint(obj): """Return an estimate of the number of items in obj. This is useful for presizing containers when building from an iterable. If the object supports len(), the result will be exact. Otherwise, it may over or underestimate by an arbitrary amount. """ try: get_hint = obj.__length_hint__ except AttributeError: return len(obj) hint = get_hint() if not isinstance(hint, int): msg = "Length hint must be an integer, not %r" raise TypeError(msg % type(hint)) if hint < 0: raise ValueError("%r is an infinite iterator" % (obj,)) return hint Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 18:47:38 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
The point is that 0 is a legitimate value for a length hint. Simple implementations of __length_hint__ will start returning 0 as a legitimate value and you will wrongly interpret that as "don't know", which kinds of defeat the purpose of __length-hint__ ;) That said, I don't think a special value for "is infinite" is useful. Just make -1 mean "I don't know". Regards Antoine.
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Antoine Pitrou wrote:
That said, I don't think a special value for "is infinite" is useful. Just make -1 mean "I don't know".
You've obviously never accidentally called list on an infinite iterator *wink* It's not the (eventual) MemoryError that is the problem. On some systems, this can cause the PC to become unresponsive as the OS tries to free an ever-increasing amount of memory. Been there, done that, on a production system. I had to do a hard reboot to fix it. I think having a hint that says "there's no way this can succeed, fail immediately" is more useful than caring about the difference between a hint of 0 and a hint of 1. -- Steven
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
Right, I agree on the value in being able to return something to say "this cannot be converted to a concrete container". I still haven't seen a use case where the appropriate response to "I don't know" differs from the appropriate response to a hint of zero - that is, you don't preallocate, you just start iterating. Cheers, Nick. -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 00:08:41 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, I agree on the value in being able to return something to say "this cannot be converted to a concrete container".
Who would be able to return that, apart from trivial cases like itertools.cycle()? Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/42e2c/42e2c427e050a4c1eb1e98390a96fab00e060c7a" alt=""
Am 15.07.2012 16:22, schrieb Antoine Pitrou:
For example most numerical sequence iterators like Fibonacci generator, prime number sequence generator and even trivial cases like even natural number generator. IMO it's a good idea to have a notation for infinitive iterators that can't be materialized as finite containers. +1 Christian
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 16:33:23 +0200 Christian Heimes <lists@cheimes.de> wrote:
First, you can't implement __length_hint__ for a generator, which is the preferred (the most practical) way of writing iterators in pure Python. Second, not all iterators will implement __length_hint__ (because it's optional and, really, of rather little use). So, as a user, you cannot hope that `list(some_iterator)` will always raise instead of filling your memory with an infinite stream of values: you have to be careful anyway. Even if __length_hint__ is implemented, its result may be wrong. That's the whole point: it's a *hint*; an iterator might tell you it's finite while it's infinite, or the reverse. My conclusion is that an infinite iterator is a documentation issue. Just tell the user that it doesn't stop, and let them shoot themselves in the foot in they want to. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Antoine Pitrou wrote:
First, you can't implement __length_hint__ for a generator, which is the preferred (the most practical) way of writing iterators in pure Python.
Limitations of generators are no reason for not improving iterators which are not generators. __length_hint__ already exists; this proposal simply proposes making it documented and officially supported. py> iter([]).__length_hint__ <built-in method __length_hint__ of list_iterator object at 0xb7bcf98c>
If it claims to be infinite, I see no reason to disbelieve it on the off-chance that it is actually both finite and small enough to fit into memory without crashing my system. If it claims to be finite, but is actually infinite, well that's not much of a hint, is it? There's an implied promise that the hint will be close to the real value, not infinitely distant.
Buffer overflows are a documentation issue. Just tell the user not to overwrite memory they don't mean to, and let them shoot themselves in the foot if they want. *wink* -- Steven
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 02:21:20 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
No, buffer overflows are bugs and they get fixed. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Antoine Pitrou, 15.07.2012 17:06:
It can be implemented for generator expressions without a conditional, though, including the case of comprehensions. I wanted to do this in Cython for a while, but the protocol wasn't very well adapted to that use case. The "don't know" case was just too common and inefficient. For the other points, I agree with the already presented counterarguments. Being able to prevent some obvious traps is a good thing, even if you can't prevent all of them. Stefan
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Jul 15, 2012, at 7:22 AM, Antoine Pitrou wrote:
FWIW, here are the notes from the docstring in Lib/test/test_iterlen.py: """ Test Iterator Length Transparency Some functions or methods which accept general iterable arguments have optional, more efficient code paths if they know how many items to expect. For instance, map(func, iterable), will pre-allocate the exact amount of space required whenever the iterable can report its length. The desired invariant is: len(it)==len(list(it)). A complication is that an iterable and iterator can be the same object. To maintain the invariant, an iterator needs to dynamically update its length. For instance, an iterable such as xrange(10) always reports its length as ten, but it=iter(xrange(10)) starts at ten, and then goes to nine after it.next(). Having this capability means that map() can ignore the distinction between map(func, iterable) and map(func, iter(iterable)). When the iterable is immutable, the implementation can straight-forwardly report the original length minus the cumulative number of calls to next(). This is the case for tuples, xrange objects, and itertools.repeat(). Some containers become temporarily immutable during iteration. This includes dicts, sets, and collections.deque. Their implementation is equally simple though they need to permanently set their length to zero whenever there is an attempt to iterate after a length mutation. The situation slightly more involved whenever an object allows length mutation during iteration. Lists and sequence iterators are dynamically updatable. So, if a list is extended during iteration, the iterator will continue through the new items. If it shrinks to a point before the most recent iteration, then no further items are available and the length is reported at zero. Reversed objects can also be wrapped around mutable objects; however, any appends after the current position are ignored. Any other approach leads to confusion and possibly returning the same item more than once. The iterators not listed above, such as enumerate and the other itertools, are not length transparent because they have no way to distinguish between iterables that report static length and iterators whose length changes with each call (i.e. the difference between enumerate('abc') and enumerate(iter('abc')). """ Raymond
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Nick Coghlan wrote:
There seem to be 5 possible classes values of __length_hint__ that an iterator object can provide: 1. Don't implement it at all. 2. Implement __length_hint__() but don't want to return any value. Either raise an exception (TypeError) -- As suggested in the PEP. or return NotImplemented -- my preferred option. 3. Return a "don't know" value: Returning 0 would be fine for this, but the VM might want to respond differently to "don't know" and 0. __length_hint__() == 0 container should be minimum size. __length_hint__() == "unknown" container starts at default size. 4. Infinite iterator: Could return float('inf'), but given this is a "hint" then returning sys.maxsize or sys.maxsize + 1 might be OK. Alternatively raise an OverflowError 5. A meaningful length. No problem :) Also, what are the allowable return types? 1. int only 2. Any number (ie any type with a __int__() method)? 3. Or any integer-like object (ie a type with a __index__() method)? My suggestion: a) Don't want to return any value or "don't know": return NotImplemented b) For infinite iterators: raise an OverflowError c) All other cases: return an int or a type with a __index__() method. Cheers, Mark.
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Sun, Jul 15, 2012 at 10:39 AM, Mark Shannon <mark@hotpy.org> wrote:
I am really having a hard time differentiating infinity with "I don't know" since they are both accurate from the point of view of __length_hint__ and its typical purpose of allocation. You have no clue how many values will be grabbed from an infinite iterator, so it's the same as just not knowing upfront how long the iterator will be, infinite or not, and thus not worth distinguishing.
I'm fine with (a), drop (b), and for (c) use what we allow for __len__() since, as Nick's operator.length_hint pseudo-code suggests, people will call this as a fallback if __len__ isn't defined. -Brett
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Sun, 15 Jul 2012 16:08:00 +0100 Mark Shannon <mark@hotpy.org> wrote:
Why should it? AFAIK it's not a common complaint. You said it yourself: it's a silly thing to do. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/4031c/4031ce5b823df99dfa8e6671025bb4b4c95251e7" alt=""
On Sun, Jul 15, 2012 at 8:08 AM, Mark Shannon <mark@hotpy.org> wrote:
The PEP so far says: "It may raise a ``TypeError`` if a specific instance cannot have its length estimated." In many ways, "I don't know" is the same as this "specific instance cannot have its length estimated". Why not just raise a TypeError? Also, regarding the code Nick posted above, I'm a little concerned about calling len as the first thing to try. That means that if I implement both __len__ and __len_hint__ (perhaps because __len__ is very expensive) __len_hint__ will never be used. It's relatively easy to say: try: hint = len_hint(l) except TypeError: hint = len(l)
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
Mark Shannon wrote:
So how does an iterator express infinite length?
The suggestion was it should return -1.
That depends on your OS. I've just tested it now on Linux Mint, and the Python process was terminated within seconds. I've also inadvertently done it on a Fedora system, which became completely unresponsive to user-input (including ctrl-alt-delete) within a few minutes. I let it run overnight (16 hours) before literally pulling the plug. (I expect the difference in behaviour is due to the default ulimit under Debian/Mint and RedHat/Fedora systems.) Ignoring OS-specific features, the promise[1] of the language is that list will try to allocate enough space for every item yielded by the iterator, or fail with a MemoryError. No promise is made as to how long that will take: it could take hours, or days, depending on how badly memory allocation performance drops when faced with unreasonably large requests. You can't expect it to fail either quickly or with an exception. With a length hint, we could strengthen that promise: "if __length_hint__ returns a negative number, list, tuple and set will fail immediately with MemoryError" which I think is a good safety feature for some things which cannot possibly succeed, but risk DOSing your system. Does it prevent every possible failure mode? No, of course not. But just because you can't prevent *every* problem doesn't mean you should prevent the ones which you can. [1] I think. I'm sure I read this somewhere in the docs, but I can't find it now. -- Steven
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Mon, Jul 16, 2012 at 1:55 AM, Steven D'Aprano <steve@pearwood.info> wrote:
(I expect the difference in behaviour is due to the default ulimit under Debian/Mint and RedHat/Fedora systems.)
Possibly also virtual memory settings. Allocating gobs of memory with a huge page file slows everything down without raising an error. And since it's possible to have non-infinite but ridiculous-sized iterators, I'd not bother putting too much effort into protecting infinite iterators - although the "huge but not infinite" case is, admittedly, rather rarer than either "reasonable-sized" or "actually infinite". ChrisA
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 02:00:58 +1000 Chris Angelico <rosuav@gmail.com> wrote:
In the real world, I'm sure "huge but not infinite" is much more frequent than "actually infinite". Trying to list() an infinite iterator is a programming error, so it shouldn't end up in production code. However, data that grows bigger than expected (or that gets disposed of too late) is quite a common thing. <hint> When hg.python.org died of OOM two weeks ago, it wasn't because of an infinite iterator: http://mail.python.org/pipermail/python-committers/2012-July/002084.html </hint> Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/42e2c/42e2c427e050a4c1eb1e98390a96fab00e060c7a" alt=""
Am 15.07.2012 16:39, schrieb Mark Shannon:
How is this different from "don't know"? What's the use case for knowing that the object doesn't want to say anything or doesn't know its possible length.
How about None? It's the logical choice, simple and easy to test for in Python and C code. 0 is a valid number for "I know that's I'll return nothing".
Too complex, hard to remember and even harder to check for. Since a length is always positive or zero, -1 is a good return value for infinite.
a) Don't want to return any value or "don't know": return NotImplemented
+1
b) For infinite iterators: raise an OverflowError
-1, I'm for -1. ;) I'm not a fan of using exception for valid and correct return values.
c) All other cases: return an int or a type with a __index__() method.
+1 Christian
data:image/s3,"s3://crabby-images/b96f7/b96f788b988da8930539f76bf56bada135c1ba88" alt=""
Nick Coghlan writes:
Why wouldn't one just believe the hint and jump past the iteration? What about an alternative API such as length_hint(iter, bound) returning 'cannot say' (if no hint is available), 'small' (if the estimated length is less than bound), and 'large' (if it's greater than the bound or infinite)? (Or None, True, False which would give the boolean interpretation "do I know I'm small enough to be converted to a concrete container?") The point is that I don't really see the value in returning a precise estimate that cannot be relied on to be accurate. OK, Python is a "consenting adults" language, but returning an integer here seems like invitation to abuse.
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Jul 16, 2012 1:52 PM, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Because preallocating memory is ridiculously faster than doing multiple resizes. That's all this API is for: how many objects should a container constructor preallocate space for when building from an iterable. It's an important optimisation in CPython when using itertools, and PyPy is planning to adopt it as well. Alex is doing the right thing in attempting to standardise it rather than risk the two implementations using subtly incompatible definitions. Skipping the iteration in the zero case is a pointless micro-optimisation that just makes the API more complex for no good reason. Allowing a negative hint to mean "infinite", on the other hand, avoids certain categories of errors without making the API any harder to use (since negatives have to be rejected anyway). -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 7/14/2012 6:11 PM, Alex Gaynor wrote: ... Various thoughts: "This method is then used by various other functions (such +as ``map``) to presize lists" -- map no longer produces lists. This only makes sense in 3.x if you mean that map can pass along the value of its inputs. "Types can then define ``__length_hint__`` which are not +sized, and thus should not define ``__len__``," is awkwardly phrased. I think you mean "Types that are not sized and should not define __len__ can then define __length_hint__. What do 'sized' and 'should' mean? Some iterators know exactly how many items they have yet to yield. The main implication of having a __len__ versus __length_hint__ methods seems to be it bool() value when empty. If lists were to get a new keyword arg, so should the other classes based on one internal array. I see this has been removed. Generator functions are the nicest way to define iterators in Python. Generator instances returned from generator functions cannot be given a length hint. They are not directly helped. However ... Not addressed in the PEP: do consumers of __length_hint look for it (and __length__ before or after calling iter(input), or both? If before, then the following should work. class gwlh: # generator with length hint def __init__(self, gen, len): self.gen = gen self.len = len def __iter__(self): return self.gen def __length_hint__(self): return len Do transformation iterators pass through hints from inputs? Does map(f, iterable) look for len or hint on iterable? Ditto for some itertools, like chain (add lengths). Any guidelines in the PEP -- Terry Jan Reedy
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Alex Gaynor wrote:
These seems back-to-front. __length_hint__ is *used* by the VM, not provided by it. It should be part of the object model, rather than the API.
Don't use "map" as an example. map returns an iterator so it doesn't need __length_hint__
Rather than raising a TypeError, why not return NotImplemented?
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Mark Shannon, 15.07.2012 16:14:
Right. It's a good example for something else, though. As I mentioned before, iterators should be able to propagate the length hint of an underlying iterator, e.g. in generator expressions or map(). I consider that an important feature that the protocol must support. Stefan
data:image/s3,"s3://crabby-images/ace53/ace533b21046d86d2169ea1667d42fedf3f39e15" alt=""
On 7/16/2012 9:54 AM, Stefan Behnel wrote:
map() is quite problematic in this matter, and may actually benefit from the existence of __length_hint__. It is very easy to create an infinite loop currently by doing stuff like x=[1]; x+=map(str,x) [61081 refs]
Obviously, this won't cause an infinite loop in Python2 where map is non-lazy. Also, this won't work for all mutable containers, because not all of them permit adding elements while iterating:
If map objects were to disallow changing the size of the container while iterating (I can't really think of an use-case in which such a limitation would be harmful), it might as well be with __length_hint__. Also, what would iter([1,2,3]).__length_hint__() return? 3 or unknown? If 3, then the semantics of l=[1,2,3]; l += iter(l) will change (infinite loop without __length_hint__ vs. list of 6 elements with __length_hint__). If unknown, then it doesn't seem like there are very many places where __length_hint__ can return anything but unknown. Regards, Stefan M
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Alex Gaynor, 15.07.2012 00:11:
I'd like to more visibly repeat my suggestion to make this a slot method "tp_length_hint()" in extension types that returns a Py_ssize_t. That suggests that a negative return value would have a special meaning instead of relying on return values like NotImplemented. The Python wrapper of that slot method could still implement a mapping for this. Return values could be -1 for "don't know" and -2 for "infinite" at the C level, and NotImplemented for "don't know" at the Python level. Not sure about a good Python value for "infinite". Maybe return -1 for "infinite" at both levels and -2/NotImplemented for "don't know" in C/Python? That would suggest -3 to propagate exceptions at the C level. Stefan
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
Proposing anything substantially more complicated than what is currently implemented in CPython will just get the idea rejected. The substantial immediate gain for PyPy is in skipping the memory resizing when building containers from itertools iterators, not anywhere else. -- Sent from my phone, thus the relative brevity :)
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Nick Coghlan, 16.07.2012 10:26:
The same applies to Cython, where the extension types that implement generator expressions can benefit from propagating the length hint of the underlying iterator. A type slot would help in making this more efficient overall, also for CPython itself. Stefan
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
While the idea behind PEP 424 is sound, the text of the PEP is rather vague and missing a lot of details. There was extended discussion on the details, but none of that has appeared in the PEP yet. So Alex, how about adding those details? Also the rationale is rather poor. Given that CPython is the reference implementation, PyPy should be compared to CPython, not vice-versa. Reversing PyPy and CPython in the rationale gives: ''' Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. PyPy has been observed to run some code slower than CPython, purely because this optimization is absent. ''' Which is a PyPy bug report, not a rationale for a PEP ;) Perhaps a better rationale would something along the lines of: ''' Adding a __length_hint__ method to the iterator protocol allows sequences, notably lists, to be initialised from iterators with only a single resize operation. This allows sequences to be intialised quickly, yet have a small growth factor, reducing memory use. ''' Cheers, Mark.
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Wed, Aug 1, 2012 at 10:46 AM, Mark Shannon <mark@hotpy.org> wrote:
Hi Mark. It's not about saving memory. It really is about speed. Noone bothered measuring cpython with length hint disabled to compare, however we did that for pypy hence the rationale contains it. It's merely to state "this seems like an important optimization". Since the C-level code involved is rather similar (it's mostly runtime anyway), it seems reasonable to draw a conclusion that removing length hint from cpython would cause slowdown. Cheers, fijal
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
Maciej Fijalkowski wrote:
It is not about making it faster *or* saving memory, but *both*. Without __length_hint__ there is a trade off between speed and memory use. You can have speed at the cost of memory by increasing the resize factor. With __length_hint__ you can get both speed and good memory use. Cheers, Mark
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Wed, Aug 1, 2012 at 12:06 PM, Mark Shannon <mark@hotpy.org> wrote:
No, you cannot. if you allocate a huge region, you're not gonna make much of speed, because at the end you need to copy stuff anyway. Besides large allocations are slow. With length hint that is correct (sometimes you can do that) you have a zero-copy scenario
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Aug 1, 2012, at 1:46 AM, Mark Shannon wrote:
Alex's rationale is correct and well expressed. Your proposed revision reflects fuzzy thinking about why __length_hint__ is useful. Regardless of resizing growth factors, it is *always* helpful to know how much memory to allocate. Calls to the allocators (especially for large blocks) and possible the recopying of data should be avoided when possible. Raymond
data:image/s3,"s3://crabby-images/98c42/98c429f8854de54c6dfbbe14b9c99e430e0e4b7d" alt=""
On 16.07.12 10:36, Stefan Behnel wrote:
Return values could be -1 for "don't know" and -2 for "infinite" at the C level, and NotImplemented for "don't know" at the Python level.
PY_SSIZE_T_MAX is better value for "infinite". In any case no difference for consumer between PY_SSIZE_T_MAX and a real infinity.
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Jul 16, 2012, at 12:36 AM, Stefan Behnel wrote:
I'd like to more visibly repeat my suggestion to make this a slot method "tp_length_hint()" in extension types that returns a Py_ssize_t.
That is merely an implementation detail, but it would be a nice improvement. Raymond
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Mon, Jul 16, 2012 at 3:36 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Gods no. Making the return value different in C vs. Python code is just asking for trouble in terms of having to remember that specific difference while coding. Plus asking for people to check for an explicit negative values instead of just >= 0 would be problematic and prone to error.
See above. This is another reason why I don't think the infinite iterator concept is worth expressin. It's just mucking things up for no good reason. "infinite" == "I don't know" in the case of pre-allocation of a list. Just raise an exception or return None and be done with it. Nice and simple. And my vote is for an exception as EAFP. -Brett
data:image/s3,"s3://crabby-images/c9c6e/c9c6eac43d1767be7719cd707451187176ad2431" alt=""
I've updated the PEP to reflect the discussion. There are two major changes: 1) NotImplemented may be used by __length_hint__ to indicate that there is no finite length hint available. 2) callers of operator.length_hint() must provide their own default value, this is also required by the current C _PyObject_LengthHint implementation. There are no provisions for infinite iterators, that is not within the scope of this proposal. Alex
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Tue, Jul 17, 2012 at 1:03 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
I've been thinking about this a bit more, and I think it does provide good scope for eventually adding __length_hint__ to more iterators (including map, filter and enumerate).
2) callers of operator.length_hint() must provide their own default value, this is also required by the current C _PyObject_LengthHint implementation.
And this makes it explicit that API users need to deal with the AttributeError/NotImplemented case, whilst making it easy to do so. Good call.
There are no provisions for infinite iterators, that is not within the scope of this proposal.
I'll repeat my observation that remaining silent on this point is effectively identical to blessing the practice of raising an exception in __length_hint__ to force fast failure of attempts to convert an infinite iterator to a concrete container. Rather than leaving people to figure this out on their own, we may as well make it official that TypeError can be raised in __length_hint__ to block conversion to concrete containers that use a preallocation strategy. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Tue, 17 Jul 2012 13:19:55 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
And I'll repeat that it is false ;) Being silent is certainly not the same thing as blessing a non-existent practice. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/90304/903046437766f403ab1dffb9b01cc036820247a4" alt=""
To quote from PEP 424:
Why is it a significant optimisation? How much slower is it? Where is the data? *If* resizing list is so slow, then why not make it faster? To quote wikipedia (http://en.wikipedia.org/wiki/Dynamic_array) """ As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. The value of this proportion a leads to a time-space tradeoff: the average time per insertion operation is about a/(a-1), while the number of wasted cells is bounded above by (a-1)n. The choice of a depends on the library or application: some textbooks use a = 2, but Java's ArrayList implementation uses a = 3/2 and the C implementation of Python's list data structure uses a = 9/8. """ If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code. Cheers, Mark.
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Mark Shannon, 16.07.2012 10:37:
If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code.
The thing is that the performance is platform specific. On systems with a fast memory allocator, especially on Linux, the difference is negligible. However, with a slow memory allocator, especially a slow realloc(), e.g. on Windows or Solaris, this can substantially hurt the performance, up to a quadratically increasing runtime in the worst case. The length hint was implemented specifically to work around this problem. Stefan
data:image/s3,"s3://crabby-images/c3481/c3481638263af7c93d4c8a1f7b28d1cd5a9e96ff" alt=""
On Mon, Jul 16, 2012 at 11:02 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
It's not the actual allocation (typically), it's the copying that's the problem. As far as data goes - preallocation can make 4x difference (on PyPy, although the dominant operation is not different on CPython) on ''.join(some-iterable). It depends grossly on the sizes of the list, so you can't claim that there is a precise speedup of a constant factor, however, there are cases where it *can* be significant (as in the copying is by far the dominating operation), most notable giant templates with iterators written in C. Speaking of which - I find this bikeshed disgusting. The purpose of the PEP is to codify whatever is already written in code in CPython. If you guys don't want it, we'll just implement it anyway and try to follow the CPython current implementation from 2.7. Cheers, fijal
data:image/s3,"s3://crabby-images/53627/53627fe72d6b8623e1f52a67d3358e3ac4d10e21" alt=""
Speaking of which - I find this bikeshed disgusting.
Disgusting? Irritating, perhaps, but why should a PEP -- even one whose purpose is to codify existing practice -- not result in discussions about its subject matter? The final P stands for Proposal, not for Pronouncement. TJG
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Mon, Jul 16, 2012 at 7:21 PM, Tim Golden <mail@timgolden.me.uk> wrote:
Indeed - I'd be worried if any PEP sailed through python-dev review without a thorough kicking of the tires. Yes, it can be annoying having to bring people up to speed on issues that they aren't familiar with, but that's generally a sign that there is relevant background information *missing from the PEP*. PEP's aren't supposed to be written just for people that are already intimately familiar with a problem - they're supposed to provide enough background that they stand on their own. In this case, the key points that I think need to be added: - more background on why the __length_hint__ API exists in CPython in the first place: to minimise potentially expensive data copies (due to memory reallocation) when creating a concrete container from an iterator. This includes when creating them from another concrete container via an intermediate iterator. This is why at least the following produce objects that define __length_hint__ in CPython: reversed itertools.repeat iter(dict()) iter(list()) iter(tuple()) iter(str()) iter(bytes()) iter(bytearray()) iter(set()) iter(frozenset()) dict.values() dict.keys() As well as any user defined sequence that relies on the default sequence iterator: >>> class MySeq(): ... def __getitem__(self, idx): ... return idx ... def __len__(self): ... return 10 ... >>> hasattr(iter(MySeq()), "__length_hint__") True - clarification on the implications of it only being a "hint": specifically, as it may be an over- or underestimate, you *cannot* rely on the hint to decide whether or not to iterate over the object when a valid length is returned (as a value of zero may be an underestimate). However, it does allow you to presize your container more appropriately than just starting at zero as usual, thus achieving the aim of reducing the risk of unnecessary memory copies. That's the basic proposal. Separate from that, there are a few suggestions for *enhancement* beyond what CPython currently uses (and has demonstrated a clear need for): - adding operator.length_hint(). This makes sense to me, as it makes it much easier to use the API when implementing a container type in Python. It's also a pretty simple addition. - adding a C level type slot. I'm personally -1 on this one in the context of the PEP. I don't think the current PEP (which is really aimed at standardisation for PyPy's benefit) should be weighed down with this CPython specific implementation detail. As a separate proposal, independent of this PEP, from someone that cares specifically about micro-optimising this API for CPython, and (preferably) can produce benchmark numbers to show the additional implementation complexity is worthwhile, then I wouldn't object. I just don't want this orthogonal concern to derail the standardisation effort. - distinguishing between different reasons for saying "don't preallocate any space" (i.e. returning zero). I still haven't heard a convincing rationale for this one - it seems to be based on the notion that the case of skipping the iteration step for a genuinely empty iterable is worth optimising. This idea just doesn't make sense when any legitimate length value that is returned can't be trusted to be completely accurate - you have to iterate to confirm the actual number. - making it possible to fail *fast* when a known infinite iterator (like itertools.cycle or itertools.count) is passed to a concrete container. I think this is best covered in the PEP by explicitly stating that some types may implement __length_hint__ to always raise TypeError rather than defining a magic return value that means "I'm infinite". - making it possible for iterators like enumerate, map and filter to delegate __length_hint__ to their underlying iterator. This seems potentially worth doing, but requires resolving the problem that Raymond noted with handling the difference in internal behaviour between enumerate("abc") and enumerate(iter("abc")). Again, it would be unfortunate to see the PEP held up over this. - making it possible to define __length_hint__ for generator-iterator objects. While this is a nice idea, again, I don't think it's something that this particular PEP should be burdened with. My main point is that the current __length_hint__ behaviour has already proven its value in the real world. The PyPy team would like that behaviour codified, so they can be reasonably sure both implementations are doing the same thing. Many of the enhancements I have listed above may be suitable material for future enhancement proposals, but I'm not seeing any requested functionality that would be actively *blocked* by simply codifying the status quo. The PEP itself already has this general tone, but I think that it should be even more emphatic that it's about codifying the status quo, *not* about modifying it with ideas haven't been proven useful through past experience. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Mon, 16 Jul 2012 23:23:18 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
The point is that zero is a valid value for a length hint. By making it a special value for "don't know", you are making the protocol potentially confusing, and you are also departing from the current semantics. (and, yes, I think distinguishing between zero and "don't know" is useful: imagine a container that would preallocate 256 entries by default when the answer is "don't know")
Then the PEP shouldn't address infinite iterators at all. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Tue, Jul 17, 2012 at 12:01 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
No, it just means the default estimate is always zero. If you don't do that, then *every* client of __length_hint__ has to check for the magic value. It's making the API harder to use for no good reason.
Such a container has to already deal with the case when it overestimates severely. The only cost of using zero as a default estimate is that such hypothetical containers will overallocate when they technically didn't need to, which will already happen for any empty iterator that doesn't provide __length_hint__ at all. Given that all standard library containers default to assuming a size of zero (absent information indicating otherwise), and a special value would need to be special cased by *every* client of the API (and almost always defaulted to zero), it's simply not a good trade-off.
Noting that infinite iterators are free to define __length_hint__ to always throw an exception *is* the status quo. We just haven't done it for the stdlib ones. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Tue, 17 Jul 2012 00:18:55 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Actually, dict and set default to a non-zero internal size, but I agree making containers harder to implement is not a good thing.
Being "free" to do unexpected or unconventional things is not the same thing as codifying those behaviours in a PEP, especially when noone is actually doing them. __length_hint__ is supposed to be informative, it shouldn't error out on you. So still -1 from me. Regards Antoine.
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Maciej Fijalkowski, 16.07.2012 11:14:
Note that a call to realloc() includes that part and can avoid copying if possible. A good realloc() implementation can make this use case run in amortised linear time, at least on a system with sufficient memory. A bad one can result in quadratic runtime, which is way more than a change by a constant factor. Thus my above comment that it's platform specific.
Absolutely. This is particularly visible at the C level because C implemented iterators have a very low overhead overall.
The idea behind this bikeshedding is that at the moment where we make this an official protocol rather than an implementation detail, it should be able to communicate the different states on the callee side of such a protocol. I.e. it needs a better design than the current one. Stefan
data:image/s3,"s3://crabby-images/32b67/32b67145b0fe3069a1de27c1ec5fc1c9428e9b97" alt=""
On Jul 16, 2012, at 1:37 AM, Mark Shannon wrote:
Unless pre-sized by with a length prediction, a growing list periodically needs to call realloc() which can move all the data to a new location in memory. Pre-sizing avoids that entirely.
If resizing of lists is too slow, then we should reconsider the 9/8 factor and/or look to tweak the resizing code.
A great deal of thought and care went into the current design. It has already been "tweaked". Raymond P.S. The dictionary code also uses presizing for copies, updates, set conversion, etc. It is a perfectly reasonable technique to pre-allocate the correct size container when the ultimate length is knowable in advance.
data:image/s3,"s3://crabby-images/b3d87/b3d872f9a7bbdbbdbd3c3390589970e6df22385a" alt=""
*If* resizing list is so slow, then why not make it faster?
A simple solution to speed up such problem is to change the overallocation factor, but it may waste memory. Python tries to be fast and to not waste too much memory.
I worked recently on optimizing str%args and str.format(args). Handle correctly the memory allocation is just critical for performances, especially for str with the PEP 393, because we have to shrink the buffer to the exact string length with the formatting function is done. I tried various overallocation factors and I chose 1.25 (5/4) because it was the fastest. See for example this issue for benchmark numbers: http://bugs.python.org/issue14687 The PyUnicodeWriter internal object uses various options to choose how many bytes should be allocated: * an overallocation flag to disable overallocation when we know that we are writing the last character/string into be buffer * a "minimal length" used for the first allocation * an hardcoded overallocation factor (1.25) PyUnicodeWriter is a little bit different than the __length_hint__ issue because PyUnicodeWriter has to shrink the buffer when it is done, but I can say that overallocation is very useful for speed. Victor
participants (22)
-
Alex Gaynor
-
Alexandre Zani
-
Antoine Pitrou
-
Benjamin Peterson
-
Brett Cannon
-
Chris Angelico
-
Christian Heimes
-
Ethan Furman
-
Lennart Regebro
-
M Stefan
-
Maciej Fijalkowski
-
Mark Dickinson
-
Mark Shannon
-
Nick Coghlan
-
Raymond Hettinger
-
Serhiy Storchaka
-
Stefan Behnel
-
Stephen J. Turnbull
-
Steven D'Aprano
-
Terry Reedy
-
Tim Golden
-
Victor Stinner