Make len() usable on a generator

Hi! I have just come across some code that counts the items matching a condition by building a throwaway list: len([foo for foo in bar if some_condition]). I realized it might be better if we could just use len(foo for foo in bar if some_condition), as it would avoid allocating a list in memory. Another possibility is to write sum(1 for foo in bar if some_condition), but that is not optimal either, as it generates a lot of intermediate additions which should not be needed. Sure, len(generator) might lead to an infinite loop, but since sum(generator) is allowed in Python I see no reason why len(generator) isn't. What do you think?
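For concreteness, a minimal runnable sketch of the two existing idioms (bar and some_condition are hypothetical stand-ins; the condition is written as a predicate call here so the snippet runs):

    bar = range(10)
    some_condition = lambda foo: foo % 2 == 0

    n_list = len([foo for foo in bar if some_condition(foo)])  # allocates a throwaway list
    n_sum = sum(1 for foo in bar if some_condition(foo))       # no list, but one addition per item
    assert n_list == n_sum == 5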

On Sat, Oct 4, 2014 at 1:09 AM, Thomas Chaumeny <t.chaumeny@gmail.com> wrote:
I think len() would be confusing, as it has to exhaust the generator to find out the length. The easiest would be to roll your own:

    def genlen(gen):
        len = 0
        for len, _ in enumerate(gen, 1):
            pass
        return len

At least then it's clear that it'll iterate over it completely. ChrisA

Yes, it has to exhaust the generator to find the length, but that is also what list(generator) or sum(generator) have to do, and yet they are allowed constructions. Actually I don't think that calling len(generator) would be very useful with a generator variable, but with an "anonymous" generator using a comprehension like:

    valid_customers = len(customer for customer in customers if customer.is_valid())

that would be useful. On Fri, Oct 3, 2014 at 5:17 PM, Chris Angelico <rosuav@gmail.com> wrote:

On 4 October 2014 01:21, Thomas Chaumeny <t.chaumeny@gmail.com> wrote:
It's appropriate to look at the big-O algorithmic complexity when deciding whether an iterator is an appropriate input to an operation. list(itr) and sum(itr) are both O(n) operations - the time they take is proportional to the length of the input. len(container), by contrast, is an O(1) operation - the time it takes is independent of the length of the input. There's no length-independent way of calculating the length of an arbitrary generator - it's an inherently O(n) operation. "sum(1 for x in seq)" conveys that accurately, "len(x for x in seq)" does not. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Why do you think len is an inherently O(1) operation? Sure, on array-based things it is, but for an arbitrary collection the only safe assumption is that it's O(n). (For some containers, like arrays, it might be O(1) as well, but you can't assume that in general.)

On Fri, Oct 3, 2014 at 10:36 AM, Mark Young <marky1991@gmail.com> wrote:
Why do you think len is an inherently O(1) operation?
Almost all containers in Python (dicts and sets look a bit different at a casual glance, but are still O(1)) store their current length as the ob_size attribute. It's thus O(1) to "calculate" it. You need not actually count the elements in the container. For example, for lists (this is from 2.7-ish, so 3.x is perhaps subtly different), omitting comments:

    typedef struct {
        PyObject_VAR_HEAD
        PyObject **ob_item;
        Py_ssize_t allocated;
    } PyListObject;

where PyObject_VAR_HEAD is:

    #define PyObject_VAR_HEAD \
        PyObject_HEAD \
        Py_ssize_t ob_size;

All len() does is return the ob_size attribute. Again, except for dicts and sets, which do things slightly differently, but are still O(1) in this regard. Skip
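A Python-level sketch of the same design, using a hypothetical container class: the tally is updated on every mutation, so __len__ is a constant-time attribute read, just like ob_size:

    class Bag:
        """Toy container that tallies its length as it is (de)populated."""
        def __init__(self):
            self._items = []
            self._size = 0              # the Python-level analogue of ob_size

        def add(self, obj):
            self._items.append(obj)
            self._size += 1

        def pop(self):
            self._size -= 1
            return self._items.pop()

        def __len__(self):              # O(1): just return the tally
            return self._size

    b = Bag()
    b.add("spam"); b.add("eggs")
    assert len(b) == 2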

Well, he said for _containers_, for which we assume a length attribute has been tallied all along; calling len() on a container should just access this tally, making it minimally intensive. For iteration-enabled classes, you can implement a __len__() magic method which returns whatever you like, so if you can make predictions about the length of an iterable class, even if the values are generated on the fly, then you can implement this to make it return the value without much computation. Some builtin generators appear to do just this:

    In [10]: %timeit len(range(100))
    1000000 loops, best of 3: 1.27 µs per loop
    In [11]: %timeit len(range(1000))
    1000000 loops, best of 3: 1.51 µs per loop
    In [12]: %timeit len(range(100000))
    1000000 loops, best of 3: 1.59 µs per loop
    In [13]: %timeit len(range(10000000))
    1000000 loops, best of 3: 1.58 µs per loop

On 03/10/14 16:36, Mark Young wrote:
-- Twitter: @onetruecathal, @formabiolabs Phone: +353876363185 Blog: http://indiebiotech.com miniLock.io: JjmYYngs7akLZUjkvFkuYdsZ3PyPHSZRBKNm6qTYKZfAM
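A sketch of why those timings stay flat: a range-like length is pure arithmetic on (start, stop, step), so no iteration is needed (illustrative only, not CPython's actual implementation):

    def range_len(start, stop, step):
        # number of values in range(start, stop, step), computed in O(1)
        if step > 0:
            return max(0, (stop - start + step - 1) // step)
        return max(0, (start - stop - step - 1) // (-step))

    assert range_len(0, 10, 3) == len(range(0, 10, 3)) == 4
    assert range_len(10, 0, -2) == len(range(10, 0, -2)) == 5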

On Fri, Oct 3, 2014 at 11:20 AM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
It is on Python 3 (but not Python 2):

    $ python3
    >>> range(100000)
    range(0, 100000)
And I assumed that most discussion on Python-ideas pertained to features for Python 3; am I incorrect in that?

On Oct 3, 2014, at 19:13, Joshua Landau <joshua@landau.ws> wrote:
In a sense, it's like the dict views, or any other sequence, set, or similar container that doesn't directly store its objects but instead creates them lazily. It might be nice if we had a generic term for such things. On the other hand, maybe it's not so critical. I think most of the confusion comes from people who are Python 3 novices but not Python 2 novices (who expect that if range, dict.keys, etc. no longer produce lists, they must instead be producing iterators; new programmers, transplants from Java or Perl, etc. won't have such expectations), and the number of those is dwindling and will continue to do so.

On 4 October 2014 02:45, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
Python 3 range isn't a lazy iterator either - it's a full sequence type (specifically, a calculated tuple: https://docs.python.org/3/library/stdtypes.html#ranges). The only immutable sequence operations it doesn't support are concatenation and repetition (since concatenating or repeating a range can't be represented using the simple "start + n*step" calculation that range uses internally). We'd have gotten a *lot* more complaints if we just dropped xrange in as a substitute for the list objects returned by Python 2's range builtin :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Oct 4, 2014, at 13:48, Nick Coghlan <ncoghlan@gmail.com> wrote:
For some reason, most of the 3.x early adopters understood this, but later switchers seem to often believe that this is exactly what Python 3 did. For example, at least half a dozen times, I've written an answer on StackOverflow showing someone that `n in range(start, stop)` is what they were trying to write, and explaining that it means the same thing as `start <= n < stop` (with various caveats about Python 2, and `step`, etc.) only for someone to edit or comment on my answer claiming that range.__contains__ has to be linear and should never be used. I explain that it's a sequence, and can be indexed, and show them a quick timing test, and they still insist that's just an accidental optimization that shouldn't be relied on, because it's really just meant to be Python 2's xrange. Once someone insisted that if you want this functionality you should use a multi-interval class off PyPI, even though the docs for that class explicitly say that a single interval is just an empty subclass of range in Py 3. People who stick with 2.x just really want to believe you got this wrong, I guess. I've seen similar things with, e.g., people not believing that dict.keys() returns a set that's a view into the dict's storage rather than some kind of magical iterator that knows how to be used multiple times. (I'm not even sure how that would work; how could __iter__ know when you were trying to get self and when you were trying to start over?)
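For the record, the view behaviour described above is easy to check in plain Python 3:

    d = {'a': 1}
    ks = d.keys()
    d['b'] = 2                        # mutate after taking the view
    assert sorted(ks) == ['a', 'b']   # the view sees the new key
    assert ks & {'a'} == {'a'}        # set operations work
    assert list(ks) == list(ks)       # re-iterable: not a one-shot iterator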

On 4 October 2014 01:36, Mark Young <marky1991@gmail.com> wrote:
It's not a hard requirement (it can't be), but it *is* a rather strong convention that checking something's length should be an O(1) operation. A non-specialised container that doesn't implement len() as an O(1) operation in Python would quite likely end up with users asking for the length to be cached to make it O(1). In particular, it's generally expected that "if obj:" should be O(1), and that's likely to fall back to __len__() for most container types. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
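A minimal illustration of the fallback Nick describes, with a toy class: absent __bool__, truth testing consults __len__:

    class NoBool:
        def __init__(self, n):
            self.n = n
        def __len__(self):      # "if obj:" falls back to this
            return self.n

    assert not NoBool(0)        # len() == 0 -> falsy
    assert bool(NoBool(3))      # len() > 0  -> truthy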

On 2014-10-03 11:32 AM, Nick Coghlan wrote:
What about collections.ChainMap (computes a union)? https://hg.python.org/cpython/file/3.4/Lib/collections/__init__.py#l795 What about mailbox.Maildir (calls os.listdir)? https://hg.python.org/cpython/file/3.4/Lib/mailbox.py#l408 I think there are plenty of virtual containers that fail your rule. -Scott -- Scott Dial scott@scottdial.com

Nick Coghlan writes:
len(container) by contrast, is an O(1) operation - the time it takes is independent of the length of the input.
I think a better way to put this is that len(container) is *designed* to return an attribute that is computed as the container is (de)populated, so is O(1) for access. The fact that some "containers" that wrap external objects (e.g., Scott's example of Maildir) have to use O(N) algorithms for accuracy (since these objects are shared with other processes) doesn't really contradict the argument from "design"; it just suggests one should be careful when giving those containers a __len__. Also, at least for the Maildir example, len(maildir) doesn't access the Maildir itself; it accesses the OS's directory listing facility, whose constant factor (the "C") is so small that I suppose it's effectively O(1) compared to sum(1 for m in maildir). I don't know if that (pseudo?) logic applies to other such examples, though.
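For concreteness, a sketch of such a wrapper (a hypothetical class, not the stdlib's Maildir code): the length lives in the filesystem, so __len__ defers to the OS on every call:

    import os

    class DirListing:
        def __init__(self, path):
            self.path = path
        def __len__(self):
            # O(n) in directory entries, but the constant factor is tiny
            return len(os.listdir(self.path))

    # len(DirListing(".")) == number of entries in the current directory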

On Oct 3, 2014 5:22 PM, "Thomas Chaumeny" <t.chaumeny@gmail.com> wrote:
Yes, it has to exhaust the generator to find the length, but that also is
what list(generator) or sum(generator) have to do and yet they are allowed constructions.
Actually I don't think that calling len(generator) would be very useful
with a generator variable, but with an "anonymous" generator using a comprehension like:
You could just do:

    sum(customer.is_valid() for customer in customers)

In that case it works because boolean values will be converted to 0/1 but you have to force the construction to generate booleans. Also, that generates intermediate additions at each step which might make it slower than a fast iteration until exhaustion. On Fri, Oct 3, 2014 at 5:40 PM, Todd <toddrjen@gmail.com> wrote:
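The 0/1 conversion works because bool is literally a subclass of int; a quick check in plain Python:

    assert issubclass(bool, int) and True == 1 and False == 0
    assert sum([True, False, True]) == 2   # counting via boolean addition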

Thomas Chaumeny writes:
sum(1 for x in generator)

I'm on the fence about whether I prefer that to len(generator), on the grounds that I normally would not expect len to use up the sequence, while I sort of expect sum to do so (for values of "sort of expect" that acknowledge that

    s = [1, 2, 3]
    print(sum(s))

doesn't "use up" s in fact, but it does "use" all the values).

I searched this list before I asked and couldn't find anything. Anyway, I understand from Nick's response that len() is conceived for collections which have a size computable in constant time (I suppose structures that maintain some inner attribute to store their length). I also believe that "length" sounds more like an attribute name than a name for something which really does some computation ("count" sounds more appropriate for that). On the other hand, it doesn't seem to be a general concern not to provide constructions which can lead to hidden complexity costs (like "in" used on a list vs a set). On Fri, Oct 3, 2014 at 7:05 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:

On Oct 3, 2014, at 19:43, Thomas Chaumeny <t.chaumeny@gmail.com> wrote:
I searched this list before I asked and couldn't find anything.
Anyway, I understand from Nick's response that len() is conceived for collections which have a size computable in constant time (I suppose structures that maintain some inner attribute to store their length). I also believe that "length" sounds more like an attribute name than a name for something which really does some computation ("count" sounds more appropriate for that). On the other hand, it doesn't seem to be a general concern not to provide constructions which can lead to hidden complexity costs (like "in" used on a list vs a set).
The collections.abc module attempted to retroactively codify the standards for these things. All containers support in, which implies that you can't expect better than linear (even though some subtypes might provide logarithmic or amortized constant). Only sized containers support len, which implies that it should be constant time. It might have been better if the 1.x designers thought through all of these issues ahead of time (as the designers of C++'s STL did), but what they came up with semi-accidentally (of course not really accidental, because Guido and friends had experience to guide their choices even when they didn't rigorously think things through) seems pretty good to me. At any rate, it seems far too late to change things that have been consistent since before 1.0 even if alternatives _would_ be more logical a priori. If you design a new language, you might want to provide separate operators/methods/functions for a log-or-better contains and a may-be-linear contains, and likewise you might want an operator to directly do may-be-linear indexing instead of having to write next(islice(…)), or you may not, but I don't think you can expect such things in Python.
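What that codification looks like in practice; a short check against the real collections.abc module (generator expressions define no __len__, so they are not Sized):

    from collections.abc import Container, Sized

    assert isinstance([1, 2, 3], Sized)                    # len() expected, O(1)-ish
    assert isinstance([1, 2, 3], Container)                # "in" expected, may be linear
    assert not isinstance((x for x in range(3)), Sized)    # generators promise no length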

On Fri, Oct 03, 2014 at 05:09:20PM +0200, Thomas Chaumeny wrote:
I don't understand this reasoning. Why do you think that they are unnecessary? I believe that, in the general case of an arbitrary generator expression, there are only two ways to tell what the length will be. The first is to produce a list (or other sequence), then return the length of the list: len([obj for obj in generator if condition]). The second is to count each item as it is produced, but without storing them all: sum(1 for obj in generator if condition). The first is optimized for readability, the second for memory. If I have missed a third way, please tell me. I don't believe that there is any other general way to work out the length of an arbitrary generator. (Apart from trivial, or obfuscated, variations on the above two, of course.) How would you tell what the length of this generator should be, apart from actually running it to exhaustion?

    def generator():
        while True:
            if random.random() < 0.5:
                return
            yield "spam"

Since there is no general way to know what the length of an arbitrary generator will be, it is better to be explicit that it has to be calculated by running through the generator and exhausting it. -- Steven

On Oct 4, 2014, at 15:27, Steven D'Aprano <steve@pearwood.info> wrote:
Note also that the itertools recipes include a function for doing exactly this:

    def quantify(iterable, pred=bool):
        "Count how many times the predicate is true"
        return sum(map(pred, iterable))

The more-itertools library on PyPI has this, plus an even simpler `ilen` function that just does, IIRC, sum(1 for _ in iterable). Of course they're both trivial one-liners, but maybe calling more_itertools.ilen is a nice way to clarify that you're intentionally consuming an iterator just to get its length.
Would teeing the iterator and consuming the extra copy count as an obfuscated variation? In practice, it's kind of the worst of both worlds--you're constructing a sequence (a deque instead of a list) in memory, but can only access it as a one-shot iterator. Conceptually, it looks somewhat nice--you're copying the iterator to count it without destroying it--but I think that's only an illusion for those who don't understand lists as iterables.
Obviously we just need to rewrite Python around lazy dataflow variables, so whatever you assign len(generator()) to doesn't consume the generator until necessary, meaning you could still use the iterator's values elsewhere in the meantime. :)
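Spelled out, the tee variation looks like this (a sketch; note that counting the copy forces tee to buffer everything for the surviving iterator, which is the memory cost described above):

    from itertools import tee

    gen = (x * x for x in range(5))
    gen, spare = tee(gen)
    n = sum(1 for _ in spare)              # consuming the copy buffers items for gen
    assert n == 5
    assert list(gen) == [0, 1, 4, 9, 16]   # original values still available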

On Sat, Oct 4, 2014 at 3:27 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I don't understand this reasoning. Why do you think that they are unnecessary?
Sorry, that was a bad formulation. Obviously you need to increment a counter at some point when iterating over the generator. Now, if you have a builtin function len (or other), you can define it to run a fast iterating loop using a C integer to maintain the number of elements during the loop. If you use sum(1 for a in ...), I suppose there is some overhead due to the fact that you have to deal with Python objects when doing the additions.

Support for len() on iterators isn't going to happen. The "gut feeling" reason is that len() shouldn't "consume" anything. It would lure people into thinking they can first call len() on the iterator and then iterate over it -- while, if it is an actual iterator (like an I/O stream), after calling len(), everything is consumed. Yes, it's possible that you only want to count the number of lines, but that's unusual, and the idiom for that is readily available (e.g. sum(1 for ...)). If the idiom occurs frequently in your code you can define a helper function count(). The speed benefit of doing the loop in C would be minimal in most cases. -- --Guido van Rossum (python.org/~guido)
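One common way to write the count() helper Guido mentions (the name is his suggestion; this particular deque-based recipe is an assumption, chosen because it keeps the counting loop in C):

    from collections import deque
    from itertools import count as _counter

    def count(iterable):
        c = _counter()
        deque(zip(iterable, c), maxlen=0)   # exhaust the iterable without storing anything
        return next(c)                      # c has advanced once per item

    assert count(x for x in range(100) if x % 3 == 0) == 34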

I don't think it makes much sense for len() to work on generators, and the fact that sum() works isn't a good argument. Summing the contents of a generator can make sense, whereas attempting to obtain the length of something which specifically does not define a length seems a little nonsensical to me... On 5 October 2014 02:37, Guido van Rossum <guido@python.org> wrote:

On Sat, Oct 11, 2014 at 5:06 AM, <random832@fastmail.us> wrote:
For a start, you'd have to preclude anything with an 'if' in it. What's the length of this generator?

    gen = (sin(x*.0314159) for x in range(100) if f(x))

Without actually exhausting the iterator, there's no way to predict how many values it'll produce. The only way to do it would be to restrict this to generators that basically map a function/expression over an iterable of known length. ChrisA

On Oct 10, 2014, at 11:06, random832@fastmail.us wrote:
When I first started playing with Swift, I was impressed by the fact that they tried to make as many things as possible into views instead of iterators. (I'm going to use Python terminology here, because for some reason they came up with new terms for everything, making sure that every word was a false cognate with something in either Python, C++, or Haskell...) Map returns a view which can be indexed and in general acts like a sequence. Filter returns a view that can be indexed by non-random-accessible indices (you can start with begin or end and ++ or -- from there). And so on. But when you start trying to build your own views, iterators, and sequences, or just trying to chain together the existing ones, everything gets a lot harder. It's nowhere near as nice as Python, C++, or Haskell in practice, even though it combines the best of all three in toy examples. That's not to say that a good design for this stuff isn't possible, just that it's obviously not trivial. The if example is a good illustration of this. If a genexpr is a sequence iff it has exactly 1 for clause and 0 if clauses, becoming a non-random-accessible view if it has a second for or any if, that makes it a lot harder to read, document, or teach code that uses genexprs (unless that code just ignores all view functionality and treats everything as a generic iterable).

On Fri, Oct 10, 2014, at 19:34, Andrew Barnert wrote:
The if example is a good illustration of this. If a genexpr is a sequence iff it has exactly 1 for clause and 0 if clauses
Why not 2 for clauses? Won't its length always be the product of the two inputs' lengths, then? (And you forgot: only if its inputs are sequences.) More to the point, since Python isn't statically typed, there's no reason that user code would have to produce a view (though I'd suggest it should at least produce an "iterable" that next can't be called on directly without iter - a paradigm that has worked well for Java and C#) if it doesn't want to.

On Fri, Oct 10, 2014 at 02:06:20PM -0400, random832@fastmail.us wrote:
The important question is not "Why *doesn't* it define a length?" but "Why *should* it define a length?". What advantage does it give you? Let's take the case where you are both the producer and consumer of the generator. You might like to write something like this:

    it = (func(x) for x in some_list)
    n = len(it)
    consume(it, n)

But that's no better or easier than:

    it = (func(x) for x in some_list)
    n = len(some_list)
    consume(it, n)

so there is no benefit to having the generator have a length. It does no harm either. However, it does require a very specific special case. It only works when you walk directly over a fixed-length sequence, and can't be used in cases like these:

    it = (func(x) for x in some_list if condition(x))
    it = (func(x) for x in some_iterator_of_unpredictable_length)

So from the perspective of the producer, generators cannot always be given a length, and if they can, since Python can determine the length, so can you. There's no advantage to having the generator type do so that I can see. Now consider from the perspective of a consumer of an iterator. You don't know where it comes from or how it is produced, so you don't know if it has a predictable length or not. Since you can't rely on it having a length, it doesn't actually give you any benefit. Perhaps you're already writing code that supports sequences and iterators, with a separate branch for sequences based on the fact that they have a known length:

    def function(it):
        try:
            n = len(it)
        except TypeError:
            ...  # process the cases with no known length
        else:
            ...  # process cases with a known length

It might be nice if some generators will be processed by the "known length" branch, but it doesn't save you from having to write the "unknown length" branch. Again, the benefit is minimal or zero. There may be cases where you *require* a predictable length, in which case you probably should support only sequences. So there is no real benefit as far as I can see why generators should support len() even when they could. -- Steven

On 10/12/2014 12:24 AM, random832@fastmail.us wrote:
Iterators, including generators, can optionally have a __length_hint__ method, which can be used by list(), etc., for better initial space allocation.

    object.__length_hint__(self)
        Called to implement operator.length_hint(). Should return an
        estimated length for the object (which may be greater or less than
        the actual length). The length must be an integer >= 0. This method
        is purely an optimization and is never required for correctness.

I believe this was added when Raymond H. noticed that some of the itertools generators could sometimes give accurate length information. Guido did not want to add __len__ just sometimes, lest it incorrectly become regarded as part of the iterator protocol. -- Terry Jan Reedy
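A sketch of the hook in use: operator.length_hint() and the __length_hint__ protocol are real; the CountDown iterator is purely illustrative:

    import operator

    class CountDown:
        def __init__(self, n):
            self.n = n
        def __iter__(self):
            return self
        def __next__(self):
            if self.n <= 0:
                raise StopIteration
            self.n -= 1
            return self.n
        def __length_hint__(self):
            return self.n            # an estimate, never required for correctness

    assert operator.length_hint(CountDown(4)) == 4
    assert list(CountDown(4)) == [3, 2, 1, 0]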

Well, I'm already convinced actually :) The thing is, I brought that question based on the fact that "sum(1 for ...)" was slower than "len([foo for ...])", and I thought it was a shame there was no builtin way to quickly count an iterable. I realized afterwards that this is no longer true, as "sum(1 for ...)" has become faster in some recent Python version (it looks like it will not use Python object addition if all the operands are numbers). On Fri, Oct 10, 2014 at 5:09 PM, Adam Jorgensen <adam.jorgensen.za@gmail.com> wrote:

Participants (20): Adam Jorgensen, Alexander Belopolsky, Andrew Barnert, Cathal Garvey, Chris Angelico, Guido van Rossum, Ian Cordasco, Joshua Landau, Mark Young, Nick Coghlan, random832@fastmail.us, Scott Dial, Skip Montanaro, Stefan Behnel, Stephen J. Turnbull, Steven D'Aprano, Terry Reedy, Thomas, Thomas Chaumeny, Todd