[Python-ideas] Re: Adding slice Iterator to Sequences (was: islice with actual slices)

10 May 2020

      On May 10, 2020, at 11:09, Christopher Barker  wrote:

Is there any way you can fix the reply quoting on your mail client, or manually work around it? I keep reading paragraphs and saying “why is he saying the same thing I said” only to realize that you’re not, that’s just a quote from me that isn’t marked, up until the last line where it isn’t…
...
On Sat, May 9, 2020 at 9:11 PM Andrew Barnert  wrote:
...
That’s no more of a problem for a list slice view than for any of the existing views. The simplest way to implement a view is to keep a reference to the underlying object and delegate to it, which is effectively what the dict views do.
Fair enough. Though you still could get potentially surprising behavior if the original sequence's length is changed.
I don’t think it’s surprising. When you go out of your way to ask for a dynamic view instead of the default snapshot copy, and then you change the list, you’d expect the view to change.

If you don’t keep views around, because you’re only using them for more efficient one-shot iteration, you might never think about that, but then you’ll never notice it to be surprised by it. The dynamic behavior of dict views presumably hasn’t ever surprised you in the 12 years it’s worked that way.
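
A minimal sketch of that dynamic behavior, using the existing dict views. The names `d`, `keys`, and `snapshot` are made up for illustration:

```python
# Illustrative names only; nothing here comes from the proposal itself.
d = {"a": 1, "b": 2}
keys = d.keys()        # a dynamic view that delegates to d
snapshot = list(d)     # an independent copy of the keys

d["c"] = 3             # mutate the dict after creating both

print(list(keys))      # ['a', 'b', 'c']: the view follows the dict
print(snapshot)        # ['a', 'b']: the copy does not
```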
...
And you probably don't want to lock the "host" anyway -- that could be very confusing if the view is kept alive somewhere far from the code trying to change the sequence.
Yes. I think memoryview’s locking behavior is a special case, not something we’d want to emulate here. I’m guessing many people just never use memoryview at all, but when you do, you’re generally thinking about raw buffers rather than abstract behavior. (It’s right there in the name…) And when you need something more featureful than an invisible hard lock on the host, it’s time for numpy. :)
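
For anyone who hasn't run into that lock, here is a small sketch of what it looks like with a bytearray. The variable names are arbitrary:

```python
buf = bytearray(b"abc")
view = memoryview(buf)      # exporting the buffer locks buf's size

try:
    buf.append(0x64)        # resizing while a view is exported fails
except BufferError as exc:
    resize_error = exc
else:
    resize_error = None

view.release()              # drop the export explicitly
buf.append(0x64)            # now the resize is allowed

print(type(resize_error).__name__)   # BufferError
print(buf)                           # bytearray(b'abcd')
```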
...
I'm still a bit confused about what a dict.* view actually is
The docs explain it reasonably well. See https://docs.python.org/3/glossary.html#term-dictionary-view for the basic idea,  https://docs.python.org/3/library/stdtypes.html#dict-views for the details on the concrete types, and I think the relevant ABCs and data model entries are linked from there.
...
-- for instance, a dict_keys object pretty much acts like a set, but it isn't a subclass of set, and it has an isdisjoint() method, but not .union or any of the other set methods. But it does have what at a glance looks like a pretty complete set of dunders....
The point of collections.abc.Set, and ABCs in general, and the whole concept of protocols, is that the set protocol can be implemented by different concrete types (set, frozenset, dict_keys, third-party types like sortedcontainers.SortedSet or pyobjc.Foundation.NSSet, etc.) that are generally completely unrelated to each other and implemented in different ways: a dict_keys is a link to the keys table in a dict somewhere, a set or frozenset has its own hash table, a SortedSet has a wide-B-tree-like structure, an NSSet is a proxy to an ObjC object, etc. If they all had to be subclasses of set, they’d be carrying around a set’s hash table but never using it; they’d have to be careful to override every method to make sure it never accidentally got used (and what would frozenset or dict_keys override add with?), etc.

And if you look at the ABC, union isn’t part of the protocol, but __or__ is, and so on.
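
A quick way to check this yourself. The variable names are just for illustration:

```python
from collections.abc import Set

d = {"a": 1, "b": 2}
keys = d.keys()

is_set_subclass = isinstance(keys, set)      # False: unrelated concrete type
follows_protocol = isinstance(keys, Set)     # True: it implements the Set ABC
has_union_method = hasattr(keys, "union")    # False: union is not in the protocol
abc_has_union = hasattr(Set, "union")        # False: the ABC doesn't define it
abc_has_isdisjoint = hasattr(Set, "isdisjoint")  # True: this one is a mixin method
combined = keys | {"c"}                      # __or__ is in the protocol

print(is_set_subclass, follows_protocol, has_union_method)
print(sorted(combined))                      # ['a', 'b', 'c']
```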
...
Anyway, a Sequence view is simpler, because it could probably simply be an immutable sequence -- not much need for contemplating every bit of the API.
It’s really the same thing, it’s just the Sequence protocol rather than the Set protocol.

If anything, it’s _less_ simple, because for sequences you have to decide whether indexing should work with negative indices, extended slices, etc., which the protocol is silent about. But the answer there is pretty easy—unless there’s a good reason not to support those things, you want to support them. (The only open question is when you’re designing a sequence that you expect to be subclassed, but I don’t think we’re designing for subclassing here.)
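
To make that concrete, here is a rough sketch of a read-only slice view built on collections.abc.Sequence. `SliceView` is a hypothetical name, and this is not the code from the slices repo or from the islice-pep draft; it only shows that negative indices and extended slices fall out fairly cheaply once you delegate index arithmetic to range:

```python
from collections.abc import Sequence


class SliceView(Sequence):
    """Hypothetical read-only, dynamic view of seq[sl]."""

    def __init__(self, seq, sl=slice(None)):
        self._seq = seq     # keep a reference and delegate to it
        self._slice = sl    # stored unresolved so the view stays dynamic

    def _indices(self):
        # Resolve the slice against the *current* length on every access.
        return range(*self._slice.indices(len(self._seq)))

    def __len__(self):
        return len(self._indices())

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of a view is just a view of this view.
            return SliceView(self, index)
        # range handles negative indices and raises IndexError for us.
        return self._seq[self._indices()[index]]


data = list(range(10))
view = SliceView(data, slice(2, 8))
print(list(view))         # [2, 3, 4, 5, 6, 7]
print(view[-1])           # 7
print(list(view[::-2]))   # [7, 5, 3]
data[3] = 99
print(view[1])            # 99: the view sees the change
```

Sequence supplies __iter__, __contains__, __reversed__, index, and count as mixin methods on top of __len__ and __getitem__, which is why the sketch can stay this short.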
...
I do see a possible objection here though. Making a small view of a large sequence would keep that sequence alive, which could be a memory issue. Which is one reason why slices don't do that by default.
Yes. When you just want to iterate something once, non-lazily, you don’t care whether it’s a view or a snapshot, but when you want to keep it around, you do care, and you have to decide which one you want. So we certainly can’t change the default; that would be a huge but subtle change that would break all kinds of code.

But I don’t think it’s a problem for offering an alternative that people have to explicitly ask for.

Also, notice that this is true for all of the existing views, and none of them try to be un-featureful to avoid it.
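
A small sketch of an existing view keeping its host alive. It assumes CPython's reference counting for the timing, and `BigDict` is a made-up subclass that exists only because plain dicts can't be weakly referenced:

```python
import gc
import weakref


class BigDict(dict):
    """Trivial subclass so the dict can be the target of a weak reference."""


big = BigDict.fromkeys(range(100_000))
probe = weakref.ref(big)    # lets us observe whether the dict is still alive
keys = big.keys()           # the view holds a strong reference to the dict

del big
gc.collect()
alive_while_view_exists = probe() is not None    # the view keeps it alive

del keys
gc.collect()
alive_after_view_dropped = probe() is not None   # nothing refers to it now

print(alive_while_view_exists, alive_after_view_dropped)
```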
...
And it could simply be a buyer beware issue. But the more featureful you make a view, the more likely it is that they will get used and passed around and kept alive without the programmer realizing the implications of that.
I think it is worth mentioning in the docs.
...
Now I need to think about how to write this all up -- which is why I wasn't sure I was ready to bring this up but now I have, so more to do!
Feel free to borrow whatever you want (and discard whatever you don’t want) from the slices repo I posted. (It’s MIT-licensed, but I can relicense it to remove the copyright notice if you want.)

I think the biggest question is actually the API. Making this a function (or a class that most people think of as a function, like most of itertools) is easy, but as soon as you say it should be a method or property of sequences, that’s trickier. You can add it to all the builtin sequence types, but should other sequences in the stdlib have it? Should Sequence provide it as a mixin? Should it be part of the sequence protocol, and therefore checked by Sequence as an ABC (even though that could be a breaking change)?
...
PR's accepted on my draft!
https://github.com/PythonCHB/islice-pep/blob/master/islice.py
    >>> d[7] = 8
    >>> next(i1)
    RuntimeError: dictionary changed size during iteration
    >>> i3 = iter(k)
    >>> next(i3)
That's probably a feature we'd want to emulate.
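
A self-contained sketch of the behavior that fragment shows. The names `d`, `k`, `i1`, and `i3` follow the fragment, but the initial contents of `d` are an assumption, since the start of the session is not quoted here:

```python
d = {1: 2, 3: 4}      # assumed starting contents
k = d.keys()
i1 = iter(k)
first = next(i1)      # 1

d[7] = 8              # change the dict's size mid-iteration

try:
    next(i1)          # the live iterator notices and refuses to continue
except RuntimeError as exc:
    error = exc
else:
    error = None

i3 = iter(k)          # a fresh iterator from the same view works fine
fresh = list(i3)

print(error)          # dictionary changed size during iteration
print(fresh)          # [1, 3, 7]
```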
...
Basically, views are not like iterators at all, except in that they save time and space by being lazy.
Well, this is a vocabulary issue -- an "iterable" and "iterator" is anything that follows the protocol, so yes, they very much ARE iterables (and iterators) even though they also have some additional behavior.
...
Which is why it's not wrong to say that a range object is an iterator, but it IS wrong to say that it's just an iterator ...
No, they’re not iterators. You’ve got it backward—every iterator is an iterable, but most iterables are not iterators.

An iterator is an iterable that has a __next__ method and returns self from __iter__. Lists, tuples, dicts, etc. are not iterators, and neither are ranges, or the dict views.

You can test this easily:

    >>> import collections.abc
    >>> isinstance(range(10), collections.abc.Iterator)
    False

A lot of people get this confused. I think the problem is that we don’t have a word for “iterable that’s not an iterator”, or for the refinement “iterable that’s not an iterator and is reusable”, much less the further refinement “iterable that’s reusable, providing a distinct iterator that starts from the head each time, and allows multiple such iterators in parallel”. But that last thing is exactly the behavior you expect from “things like list, dict, etc.”, and it’s hard to explain, and therefore hard to document. The closest word for that is “collection”, but Collection is also a protocol that adds being a Container and being Sized on top of being Iterable, so it’s misleading unless you’re really careful. So the docs don’t clearly tell people that range, dict_keys, etc. are exactly that “like list, dict, etc.” thing, so people are confused about what they are. People know they’re lazy, they know iterators are lazy, so they think they’re a kind of iterator, and the docs don’t ever make it clear why that’s wrong.
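
Here is a sketch that runs the same kind of check across a few types, which may make the split easier to see. The `candidates` and `report` names are just for illustration:

```python
from collections.abc import Collection, Iterable, Iterator

candidates = {
    "list": [1, 2],
    "range": range(3),
    "dict_keys": {"a": 1}.keys(),
    "list iterator": iter([1, 2]),
    "generator": (x for x in ()),
}

# For each object, record (is Iterable, is Iterator, is Collection).
report = {
    name: (
        isinstance(obj, Iterable),
        isinstance(obj, Iterator),
        isinstance(obj, Collection),
    )
    for name, obj in candidates.items()
}

for name, flags in report.items():
    print(f"{name:14} {flags}")
# list, range, and dict_keys: (True, False, True)
# list iterator and generator: (True, True, False)
```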
...
Such a resettable-iterator thing (which would have some precedent in file objects, I suppose) would actually be harder to implement, on top of being less powerful and potentially confusing. And the same is true for slices.
but the dict_keys iterator does seem to do that ...
In [48]: dk
Out[48]: dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
In [49]: list(dk)
Out[49]: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
In [50]: list(dk)
Out[50]: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
You just picked an example where “resettable iterator” and “collection” would do the same thing. Try the same test with list and it also passes, because list is a collection. You can only distinguish the two cases by partially using an iterator and then asking for another one. And if you do that, you will see that, just like list, dict_keys gives you a brand new, completely independent iterator, initialized from the start, every time you call iter() on it. Because, like list, dict_keys is a collection, not an iterator. There are no types in Python’s stdlib that have the behavior you suggested of being an iterator but resetting each time you iterate. (The closest thing is file objects, but you have to manually reset them with seek(0).)
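
A sketch of the test that does distinguish the two cases. `dk` is rebuilt here to match the session above, and io.StringIO stands in for a real file object, which is an assumption made to keep the example self-contained:

```python
import io

dk = {str(n): n for n in range(10)}.keys()

it1 = iter(dk)
consumed = [next(it1), next(it1)]     # partially use one iterator
it2 = iter(dk)                        # then ask for another
second_start = next(it2)              # '0': it2 starts from the head
first_continues = next(it1)           # '2': it1 is unaffected by it2
dk_is_own_iterator = iter(dk) is dk   # False: dk is not an iterator

f = io.StringIO("a\nb\nc\n")
f_is_own_iterator = iter(f) is f      # True: a file object is an iterator
first_line = next(iter(f))            # 'a\n'
next_line = next(iter(f))             # 'b\n': no reset, iteration just continues
f.seek(0)                             # the manual reset
after_seek = next(iter(f))            # 'a\n'

print(consumed, second_start, first_continues, dk_is_own_iterator)
print(repr(first_line), repr(next_line), repr(after_seek))
```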