zip(x, y, z, strict=True)
Here's something that would have saved me some debugging yesterday:

    >>> zipped = zip(x, y, z, strict=True)

I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise. This is useful in cases where you're assuming that the iterables all have the same lengths. When your assumption is wrong, you currently just get a shorter result, and it could take you a while to figure out why it's happening. What do you think?
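To make the failure mode concrete, here is a minimal sketch. `zip_strict` is a hypothetical pure-Python helper written only for illustration; it is not the proposed builtin implementation, and the choice of ValueError is an assumption.

```python
def zip_strict(*iterables):
    """Hypothetical helper: like zip(), but raise ValueError on a length mismatch."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    sentinel = object()  # unique marker meaning "this iterator is exhausted"
    while True:
        items = [next(it, sentinel) for it in iterators]
        exhausted = [item is sentinel for item in items]
        if all(exhausted):
            return
        if any(exhausted):
            raise ValueError("zip_strict: iterables have different lengths")
        yield tuple(items)


names = ["x", "y", "z"]
values = [1, 2]

# Plain zip() silently drops the unmatched "z".
truncated = list(zip(names, values))

# The strict variant surfaces the mismatch instead.
try:
    list(zip_strict(names, values))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With the plain call the shorter result simply flows onward; with the strict sketch the error appears at the point where the assumption breaks.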
On Apr 20, 2020, at 10:42, Ram Rachum <ram@rachum.com> wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
This is definitely sometimes useful, but I think less often than zip_longest, which we already decided long ago isn’t important enough to push into zip but instead should be a separate function living in itertools. I’ll bet there’s a zip_strict (or some other name for the same idea) in the more-itertools library. (If not, it’s probably worth submitting.) Whether it’s important enough to bring into itertools, add as a recipe, or call out as an example of what more-itertools can do in the itertools docs, I’m not sure. But I don’t think it needs to be added as a flag on the builtin.
This is definitely sometimes useful, but I think less often than zip_longest, which we already decided long ago isn’t important enough to push into zip but instead should be a separate function living in itertools.
I disagree. In my own personal experience, ~80% of the time when I use `zip` there is an assumption that all of the iterables are the same length. Often the context of the surrounding code proves it to be correct, but I've been bitten several times by refactorings that start silently throwing tail-end data away. It often feels just too heavy to set up a sentinel value and use zip_longest, just as a sort of assertion against unexpected results. At the same time, I don't think that the current default behavior should be changed, since that would likely break more code than it "fixes" (and the only way to get the old behavior would be to wrap every usage in a try-except-pass). I would very much welcome this change, especially since it shouldn't have any overhead for non-strict-users, and minimal overhead for those who want to enforce the check. I'd be happy to open a BPO and PR if this gets any traction here.
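For reference, the sentinel-plus-zip_longest workaround described above might look roughly like this. The name `checked_pairs` and the error message are made up for the illustration.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel that cannot collide with real data


def checked_pairs(xs, ys):
    """Yield pairs from xs and ys, raising ValueError if one runs out first."""
    for x, y in zip_longest(xs, ys, fillvalue=_MISSING):
        if x is _MISSING or y is _MISSING:
            raise ValueError("inputs have different lengths")
        yield x, y
```

It works, but every call site needs the sentinel, the comparison and the raise, which is the "heaviness" being objected to.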
On Mon, 20 Apr 2020 18:25:08 -0000 "Brandt Bucher" <brandtbucher@gmail.com> wrote:
This is definitely sometimes useful, but I think less often than zip_longest, which we already decided long ago isn’t important enough to push into zip but instead should be a separate function living in itertools.
I disagree. In my own personal experience, ~80% of the time when I use `zip` there is an assumption that all of the iterables are the same length.
FWIW, that's also my experience. Regards Antoine.
On Apr 20, 2020, at 11:25, Brandt Bucher <brandtbucher@gmail.com> wrote:
I disagree. In my own personal experience, ~80% of the time when I use `zip` there is an assumption that all of the iterables are the same length.
Sure, but I think cases where you want that assumption _checked_ are a lot less common. There are lots of postconditions that you assume just as often as “x, y, and z are fully consumed” and just as rarely want to check, so we don’t need to make it easy to check every possible one of them. As I said, wanting to check does come up sometimes—I know I have written this myself at least once, and I’d be a little surprised if it’s not in more-itertools. But often enough to be a (flag on a) builtin? I’ve also written a zip that uses the length of the first rather than the shortest or longest, and a zip that skips rather than filling past the end of short inputs, and there are probably other variations that come up occasionally. But if they don’t come up that often, and are easy to write yourself, is there really a problem that needs to be fixed? And even if checking is the most common option after the default, it seems like a weird API to have some options for what to do at the end as keyword parameter flags and other options as entirely separate functions. Maybe a flag for longest (or a single at_end parameter with an enum of different end-behaviors truncate, check, fill, skip where the signature can immediately show you that the default is truncate) would be a better design if you were doing Python from scratch, but I think the established existence of zip_longest pushes us the other way.
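The single `at_end` parameter floated above could be sketched as follows. Everything here is hypothetical: the `AtEnd` enum, the function name `zip_at_end`, and in particular the reading of "skip" as "leave exhausted inputs out of the tuple" are assumptions made for the illustration.

```python
from enum import Enum
from itertools import zip_longest


class AtEnd(Enum):
    TRUNCATE = "truncate"  # stop at the shortest input (today's zip)
    CHECK = "check"        # raise if the inputs differ in length
    FILL = "fill"          # pad short inputs (zip_longest)
    SKIP = "skip"          # omit exhausted inputs from each tuple


_MISSING = object()


def zip_at_end(*iterables, at_end=AtEnd.TRUNCATE, fillvalue=None):
    """Hypothetical zip variant whose end-of-input behaviour is selectable."""
    if at_end is AtEnd.TRUNCATE:
        yield from zip(*iterables)
        return
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        if at_end is AtEnd.CHECK:
            if any(item is _MISSING for item in row):
                raise ValueError("iterables have different lengths")
            yield row
        elif at_end is AtEnd.FILL:
            yield tuple(fillvalue if item is _MISSING else item for item in row)
        else:  # AtEnd.SKIP
            yield tuple(item for item in row if item is not _MISSING)
```

The signature makes the default visible at a glance, which is the design point being made; whether that outweighs the existing zip_longest precedent is the open question.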
On 4/20/2020 3:39 PM, Andrew Barnert via Python-ideas wrote:
On Apr 20, 2020, at 11:25, Brandt Bucher <brandtbucher@gmail.com> wrote:
I disagree. In my own personal experience, ~80% of the time when I use `zip` there is an assumption that all of the iterables are the same length. Sure, but I think cases where you want that assumption _checked_ are a lot less common. There are lots of postconditions that you assume just as often as “x, y, and z are fully consumed” and just as rarely want to check, so we don’t need to make it easy to check every possible one of them.
As I said, wanting to check does come up sometimes—I know I have written this myself at least once, and I’d be a little surprised if it’s not in more-itertools.
Interestingly, it looks like it might be more_itertools.zip_equal, which is listed at https://github.com/more-itertools/more-itertools, but is linked to https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.zip_... which is missing. Maybe it's new? Eric
On Apr 20, 2020, at 13:03, Eric V. Smith <eric@trueblade.com> wrote:
On 4/20/2020 3:39 PM, Andrew Barnert via Python-ideas wrote:
As I said, wanting to check does come up sometimes—I know I have written this myself at least once, and I’d be a little surprised if it’s not in more-itertools.
Interestingly, it looks like it might be more_itertools.zip_equal, which is listed at https://github.com/more-itertools/more-itertools, but is linked to https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.zip_... which is missing. Maybe it's new?
Yeah, it is new. See PR 415 (https://github.com/more-itertools/more-itertools/pull/415) 21 days ago. There must be something in the air that’s made people suddenly want this more. :) The PR does a great job linking to other discussions about this, including an -ideas thread from two years ago. I haven’t read through everything yet, but I notice that the first objection last time around was David Mertz pointing out that it’s not even in more-itertools, so maybe that more-itertools PR means it’s the perfect time to reopen this discussion? Or maybe it means we should wait a few months and see if people seem to be using the one in more-itertools? (And also maybe to wait for it to stabilize—there are a few bug fix commits to it after the initial merge.)
On Mon, 20 Apr 2020 12:39:05 -0700 Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Apr 20, 2020, at 11:25, Brandt Bucher <brandtbucher@gmail.com> wrote:
I disagree. In my own personal experience, ~80% of the time when I use `zip` there is an assumption that all of the iterables are the same length.
Sure, but I think cases where you want that assumption _checked_ are a lot less common.
Depends what you call "want". In most cases, the implemented logic should indeed ensure that the iterables are the same length. However, implemented logic can be buggy and it's always good to not let errors pass silently. Currently, zip() will ignore data silently if one iterable is longer than the other. Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it. Regards Antoine.
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it.
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored.
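The transitional behaviour described above could be prototyped along these lines. `zip_warning` is a hypothetical name and the warning text is invented; note also that, unlike the real zip(), this sketch pulls one item from every input on each round in order to detect leftovers, which is itself a behaviour change.

```python
import warnings

_MISSING = object()


def zip_warning(*iterables):
    """Hypothetical: behave like zip(), but warn when leftover data is dropped."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        # Assumption: probing every iterator each round is acceptable here.
        items = [next(it, _MISSING) for it in iterators]
        exhausted = [item is _MISSING for item in items]
        if any(exhausted):
            if not all(exhausted):
                warnings.warn(
                    "zip() ignored some data; use zip_shortest() to keep this behavior",
                    PendingDeprecationWarning,
                    stacklevel=2,
                )
            return
        yield tuple(items)
```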
On Tue, Apr 21, 2020 at 12:09:44PM +0300, Serhiy Storchaka wrote:
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it.
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored.
Wait, are you suggesting that zip should raise by default (after a suitable deprecation period)? I think that would be horribly inconvenient. Unless the deprecation period was "for ever". -- Steven
On Wed, 22 Apr 2020 08:34:56 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:09:44PM +0300, Serhiy Storchaka wrote:
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it.
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored.
Wait, are you suggesting that zip should raise by default (after a suitable deprecation period)?
"if some data is ignored" Regards Antoine.
On 4/21/2020 7:07 PM, Antoine Pitrou wrote:
On Wed, 22 Apr 2020 08:34:56 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:09:44PM +0300, Serhiy Storchaka wrote:
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it. A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored. Wait, are you suggesting that zip should raise by default (after a suitable deprecation period)? "if some data is ignored"
Even so, I don't think we could ever change the default to be to "raise unless all of the iterables are the same length". There's just too much code that would break. Eric
On Wed, Apr 22, 2020 at 01:07:07AM +0200, Antoine Pitrou wrote:
On Wed, 22 Apr 2020 08:34:56 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:09:44PM +0300, Serhiy Storchaka wrote:
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it.
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored.
Wait, are you suggesting that zip should raise by default (after a suitable deprecation period)?
"if some data is ignored"
Yes, I saw that and I understood it. It goes without saying that it would only raise if there was data remaining in one or more iterables after some iterable was exhausted. I didn't think I needed to spell out every fine detail to ask such a simple question. Are you suggesting that in the future, after a suitable deprecation period, the default zip() builtin function should raise an exception "if some data is ignored"? Is this a serious proposal that you would like to see happen, or are you just mentioning it for the sake of mentioning all the options? -- Steven
On Wed, 22 Apr 2020 12:52:01 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Apr 22, 2020 at 01:07:07AM +0200, Antoine Pitrou wrote:
On Wed, 22 Apr 2020 08:34:56 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:09:44PM +0300, Serhiy Storchaka wrote:
21.04.20 11:15, Antoine Pitrou wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a `strict` argument is added, few people will bother adding it.
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long period of deprecation make it raise an exception if some data is ignored.
Wait, are you suggesting that zip should raise by default (after a suitable deprecation period)?
"if some data is ignored"
Yes, I saw that and I understood it. It goes without saying that it would only raise if there was data remaining in one or more iterables after some iterable was exhausted. I didn't think I needed to spell out every fine detail to ask such a simple question.
Are you suggesting that in the future, after a suitable deprecation period, the default zip() builtin function should raise an exception "if some data is ignored"?
Is this a serious proposal that you would like to see happen, or are you just mentioning it for the sake of mentioning all the options?
Ideally, that's what it would do. Whether it's desirable to transition to that behaviour is an open question. But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing. Regards Antoine.
22.04.20 11:20, Antoine Pitrou wrote:
Ideally, that's what it would do. Whether it's desirable to transition to that behaviour is an open question.
But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing.
I concur with Antoine. Ideally we should have several functions: zip_shortest(), zip_equal(), zip_longest(). In most cases (80% or 90% or more) they are equivalent, because the input iterators have the same length, but it is safer to use zip_equal() to catch bugs. In other cases you would use zip_shortest() or zip_longest(). And it would be natural to rename the most popular variant to just zip(). Now it is a breaking change. We had a chance to do it in 3.0, when another breaking change was made to zip(). I do not know if it is worth doing now. But when we plan any changes in zip() we should take into account possible future changes and make them simpler, not harder.
On Apr 22, 2020, at 01:52, Serhiy Storchaka <storchaka@gmail.com> wrote:
22.04.20 11:20, Antoine Pitrou wrote:
Ideally, that's what it would do. Whether it's desirable to transition to that behaviour is an open question. But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing.
I concur with Antoine. Ideally we should have several functions: zip_shortest(), zip_equal(), zip_longest(). In most cases (80% or 90% or more) they are equivalent, because the input iterators have the same length, but it is safer to use zip_equal() to catch bugs. In other cases you would use zip_shortest() or zip_longest(). And it would be natural to rename the most popular variant to just zip().
Now it is a breaking change. We had a chance to do it in 3.0, when another breaking change was made to zip(). I do not know if it is worth doing now. But when we plan any changes in zip() we should take into account possible future changes and make them simpler, not harder.
If that is your long-term goal, I think you could do it in three steps.

First, just add itertools.zip_equal. Ideally the docs should replace the usual “Added in 3.9” with something like “Added in 3.9; if you need the same function in earlier versions see more-itertools” (linked to the more-itertools blurb at the top of the page). It seems like there’s a lot of support for this step even from people who don’t want your big goal.

Second, add itertools.zip_shortest. And change zip’s docs to say that it’s the same as zip_shortest and mention the other two choices, and maybe even to try to nudge people to explicitly decide which of the three they want. And find some places in the tutorial that use zip and change them to use zip_equal and zip_shortest as appropriate. I think that gets you about as much as you can get without backward compatibility issues, and it also gets you closer to being able to deprecate zip or change it to alias zip_equal, rather than making it harder.

Third, do the deprecation. By that point, everyone maintaining existing code will have an easy way to defensively prepare for it: as long as they can require 3.10+ or more-itertools, they can just change all uses of zip to zip_shortest and they’re done. Still not painless, but about as painless as a backward compatibility break could ever be.

And of course after the first two steps you can proselytize for the next one. If you can convince lots of people that they should care about the choice more often and get them using the explicit functions, it’ll be a lot harder to argue that everyone is happy with today’s behavior.
On Wed, Apr 22, 2020 at 10:33:24AM -0700, Andrew Barnert via Python-ideas wrote:
If that is your long-term goal, I think you could do it in three steps.
I think the first step is a PEP. This is not a small change that can be just done on a whim.
And of course after the first two steps you can proselytize for the next one. If you can convince lots of people that they should care about the choice more often and get them using the explicit functions, it’ll be a lot harder to argue that everyone is happy with today’s behavior.
If they need to be *convinced* to use the new function, then they don't really need it and didn't want it. -- Steven
On 04/22/2020 02:03 PM, Steven D'Aprano wrote:
If they need to be *convinced* to use the new function, then they don't really need it and didn't want it.
Sadly, it is not human nature to just accept that one needs something when they have been getting along "just fine" without it. -- ~Ethan~
On Apr 22, 2020, at 14:09, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Apr 22, 2020 at 10:33:24AM -0700, Andrew Barnert via Python-ideas wrote:
If that is your long-term goal, I think you could do it in three steps.
I think the first step is a PEP. This is not a small change that can be just done on a whim.
Yes, I agree. Each of the three steps will very likely require a PEP. And not only that, the PEP for this first step has to make it clear that it’s useful on its own—not just to people like Serhiy who eventually want to replace zip and see it as a first step, but also to people who do not want zip to ever change but do want a convenient way to opt in to checking zips (and don’t find more-itertools convenient enough) and see this as the _only_ step.
And of course after the first two steps you can proselytize for the next one. If you can convince lots of people that they should care about the choice more often and get them using the explicit functions, it’ll be a lot harder to argue that everyone is happy with today’s behavior.
If they need to be *convinced* to use the new function, then they don't really need it and didn't want it.
I had to be convinced that I wanted str.format. (The guy who convinced me was enthusiastic enough that he went through the effort of writing a __format__ method for my Fixed1616 class to show how easily extensible it is.) But really, I did want it, and just didn’t know it yet. Hell, I had to be convinced to use Python instead of sticking with Perl and Tcl, but it turned out I did want it. Let’s assume that the proponents of adding zip_strict are right that using it will often give you early failures on some common uses that are today painful to debug. If so, most people don’t know that today, and aren’t going to think of it just because a new function shows up in itertools, or a new flag on a builtin, or whatever. Someone will have to convince them to use it. But then, one evening, they’ll get an exception and realize, “Whoa, that would have taken me hours to debug otherwise, if I’d even spotted the bug…”, and they’ll realize they needed it, just as much as the handful who noticed the need in advance and went looking. The proponents of the bigger, longer-term change of eventually making this the default behavior for zip may be right too. If so, many of the people who were convinced to use zip_strict will find it helpful so often, and zip_shortest so unusual in their code, that they start asking why the hell strict isn’t the default instead of shortest. And then it’ll be a lot easier for Serhiy or whoever to sell such a big change. Of course if that doesn’t ever happen, it’ll be a lot harder to sell the change—but in that case, the change would be a mistake, so that’s good too.
24.04.20 07:58, Andrew Barnert via Python-ideas wrote:
And not only that, the PEP for this first step has to make it clear that it’s useful on its own—not just to people like Serhiy who eventually want to replace zip and see it as a first step, but also to people who do not want zip to ever change but do want a convenient way to opt in to checking zips (and don’t find more-itertools convenient enough) and see this as the _only_ step.
Don't consider me an apologist. I just think that might be a good idea. But now we do not have enough information to decide. We should wait several months or years. And even if it turns out that most users prefer zip_equal(), the cost of changing zip() may be too high. But we should not reject this possibility. While we discuss zip(), we should not forget about map() and other map-like functions (like Executor.map()). All that was said about zip() is applied to map() too, so after adding zip_equal() and/or zip_shortest() we will need to add corresponding map variants.
+1 on almost always expecting my iterators to be the same length when I pass them to zip. I struggle to think of a time when I haven't had that expectation. To people asking whether I would catch the error that zip_strict would raise, almost certainly not. I rarely catch ValueError other than to log or raise a different exception. I don't really care about the state of the iterators post zip_strict (as I would generally not be catching that exception) but I suppose it should be the same as zip, evaluate left to right. Seems to me that deprecating the current zip behavior is more trouble than it's worth, just add zip_strict to itertools and call it a day. If zip_strict turns out to be super popular then we could revisit changing the behavior of zip. I don't think we need to have new versions of map, there isn't map_longest in itertools. Also building variant maps is trivial:

    def map_strict(f, *iters):
        return starmap(f, zip_strict(*iters))

- Caleb Donovick

On Fri, Apr 24, 2020 at 10:57 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
24.04.20 07:58, Andrew Barnert via Python-ideas wrote:
And not only that, the PEP for this first step has to make it clear that it’s useful on its own—not just to people like Serhiy who eventually want to replace zip and see it as a first step, but also to people who do not want zip to ever change but do want a convenient way to opt in to checking zips (and don’t find more-itertools convenient enough) and see this as the _only_ step.
Don't consider me an apologist. I just think that might be a good idea. But now we do not have enough information to decide. We should wait several months or years. And even if it turns out that most users prefer zip_equal(), the cost of changing zip() may be too high. But we should not reject this possibility.
While we discuss zip(), we should not forget about map() and other map-like functions (like Executor.map()). All that was said about zip() is applied to map() too, so after adding zip_equal() and/or zip_shortest() we will need to add corresponding map variants.
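A runnable take on the map_strict idea from this message, assuming some zip_strict is available. The `zip_strict` defined here is only a hypothetical stand-in built on zip_longest; note that the iterables are unpacked with `*` when they are passed along.

```python
from itertools import starmap, zip_longest

_MISSING = object()


def zip_strict(*iterables):
    """Hypothetical stand-in: zip that raises ValueError on a length mismatch."""
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in row):
            raise ValueError("iterables have different lengths")
        yield row


def map_strict(func, *iterables):
    # Unpack the iterables so each one becomes its own argument stream.
    return starmap(func, zip_strict(*iterables))
```

Like map(), the result is lazy, so a mismatch is only reported once iteration reaches the point where one input runs out.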
On Wed, Apr 22, 2020, 4:24 AM Antoine Pitrou
But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing.
Your experience is very different from mine. The number of times I zip differently "sized" iterators is almost surely more than 50%. The pattern Steven points to of one or more iterators being infinite is very common. But so is the case of one merely being large enough to (hopefully) match with everything in the others, with no harm in discarding extras.
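Two small illustrations of the patterns mentioned above, using made-up data: pairing with an infinite counter, and pairing with a pool that is merely "large enough".

```python
from itertools import count
from string import ascii_lowercase

labels = ["alpha", "beta", "gamma"]

# Infinite iterator: count(1) never ends, so truncation is required.
numbered = list(zip(count(1), labels))

# "Large enough" iterator: 26 letters for 3 labels, extras are discarded.
lettered = list(zip(ascii_lowercase, labels))
```

A strict zip would raise on the second call and never terminate normally on the first, which is why these uses rely on the current truncating behaviour.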
On Wed, Apr 22, 2020 at 12:23 PM David Mertz <mertz@gnosis.cx> wrote:
On Wed, Apr 22, 2020, 4:24 AM Antoine Pitrou
But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing.
Your experience is very different from mine.
I'm in Antoine's camp on this one. A lot of our work is data analysis, where we get for example simulation results as X, Y, Z components then zip them up into coordinate triples, so any mismatch is a bug. Having zip_equal as a first-class function would replace zip in easily 90% of our use cases, but it needs to be fast as we often do this sort of thing in an inner loop...
On Thu, Apr 23, 2020 at 3:50 PM Eric Fahlgren <ericfahlgren@gmail.com> wrote:
On Wed, Apr 22, 2020 at 12:23 PM David Mertz <mertz@gnosis.cx> wrote:
On Wed, Apr 22, 2020, 4:24 AM Antoine Pitrou
But, as far as I'm concerned, the number of times where I took advantage of zip()'s current acceptance of heterogeneously-sized inputs is extremely small. In most of my uses of zip(), a size difference would have been a logic error that deserves noticing and fixing.
Your experience is very different from mine.
I'm in Antoine's camp on this one. A lot of our work is data analysis, where we get for example simulation results as X, Y, Z components then zip them up into coordinate triples, so any mismatch is a bug. Having zip_equal as a first-class function would replace zip in easily 90% of our use cases, but it needs to be fast as we often do this sort of thing in an inner loop...
+1 I write a lot of standalone data-munging scripts, and expecting zipped inputs to have equal length is a common pattern. How, for example, to collate lines from 3 potentially large files while ensuring they match in length (without an external dependency)? The best I can think of is rather ugly:

    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip(a, b, c):
            do_something_with(lineA, lineB, lineC)
        assert next(a, None) is None
        assert next(b, None) is None
        assert next(c, None) is None

Changing the zip() call to zip(a, b, c, strict=True) would remove the necessity of the asserts. Moreover, the concept of strict zip or zip_equal should be intuitive to beginners, whereas my solution of next() with a sentinel is not. (Oh, an alternative would be checking if a.readline(), b.readline(), and c.readline() are nonempty, but that's not much better and wouldn't generalize to non-file iterators.) Nathan
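One caveat with checking leftovers via next() after the loop, shown here with in-memory stand-ins for the files (io.StringIO; the helper name is invented): zip() reads from its inputs left to right, so it has already consumed one extra item from an earlier input by the time it finds a later one exhausted.

```python
import io


def leftovers_after_zip(first_text, second_text):
    """Zip two in-memory 'files' and report what next() finds afterwards."""
    first, second = io.StringIO(first_text), io.StringIO(second_text)
    rows = list(zip(first, second))
    return rows, next(first, None), next(second, None)


# Longer input passed first: zip() reads "3\n" before noticing that the
# second input is exhausted, so both next() checks come back empty.
rows, left_a, left_b = leftovers_after_zip("1\n2\n3\n", "x\ny\n")

# Longer input passed second: zip() stops as soon as the first input is
# exhausted, so the extra line is still there for next() to find.
_, left_c, left_d = leftovers_after_zip("x\ny\n", "1\n2\n3\n")
```

So the manual check can miss a mismatch of exactly one item when the longer input comes first, which is one more argument for doing the check inside zip itself.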
On Thu, Apr 23, 2020 at 09:10:16PM -0400, Nathan Schneider wrote:
How, for example, to collate lines from 3 potentially large files while ensuring they match in length (without an external dependency)? The best I can think of is rather ugly:
    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip(a, b, c):
            do_something_with(lineA, lineB, lineC)
        assert next(a, None) is None
        assert next(b, None) is None
        assert next(c, None) is None

Changing the zip() call to zip(a, b, c, strict=True) would remove the necessity of the asserts.
I think that the "correct" (simplest, easiest, most obvious, most flexible) way is:

    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip_longest(a, b, c, fillvalue=''):
            do_something_with(lineA, lineB, lineC)

and have `do_something_with` handle the empty string case, either by raising, or more likely, doing something sensible like treating it as a blank line rather than dying with an exception. Especially if the files differ in how many newlines they end with. E.g. file a.txt and c.txt end with a newline, but b.txt ends without one, or ends with an extra blank line at the end.

File handling code ought to be resilient in the face of such meaningless differences, but zip_strict encourages us to be the opposite of resilient: an extra newline at the end of the file will kill the application with an unnecessary exception.

An alternate way to handle it:

    for t in zip_longest(a, b, c, fillvalue=''):
        if '' in t:
            process()  # raise if you insist
        else:
            do_something_with(*t)

So my argument is that anything you want zip_strict for is better handled with zip_longest -- including the case of just raising. -- Steven
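A self-contained version of the zip_longest approach, using io.StringIO in place of real files. The `collate` helper and its choice to raise are illustrative assumptions. The empty string is a safe fill value here because iterating a text file never yields '' (a blank line arrives as '\n').

```python
import io
from itertools import zip_longest


def collate(*files):
    """Return rows of stripped lines, raising if the files differ in length."""
    rows = []
    for lines in zip_longest(*files, fillvalue=""):
        if "" in lines:
            # One file ran out early; this sketch raises, but a more
            # forgiving policy (e.g. treat as blank) could go here instead.
            raise ValueError("files have different numbers of lines")
        rows.append(tuple(line.rstrip("\n") for line in lines))
    return rows


matched = collate(io.StringIO("a\nb\n"), io.StringIO("1\n2"))
```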
On Sat, Apr 25, 2020 at 10:41 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Apr 23, 2020 at 09:10:16PM -0400, Nathan Schneider wrote:
How, for example, to collate lines from 3 potentially large files while ensuring they match in length (without an external dependency)? The best I can think of is rather ugly:
    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip(a, b, c):
            do_something_with(lineA, lineB, lineC)
        assert next(a, None) is None
        assert next(b, None) is None
        assert next(c, None) is None

Changing the zip() call to zip(a, b, c, strict=True) would remove the necessity of the asserts.
I think that the "correct" (simplest, easiest, most obvious, most flexible) way is:
    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip_longest(a, b, c, fillvalue=''):
            do_something_with(lineA, lineB, lineC)
and have `do_something_with` handle the empty string case, either by raising, or more likely, doing something sensible like treating it as a blank line rather than dying with an exception.
This is the sentinel pattern with zip_longest() rather than next(). Sure, it works, but I'm not sure it's the most obvious—conceptually zip_longest() is saying "I want to have as many items as the max of the iterables", but then the loop short-circuits if the fillvalue is used. More natural to say "I expect these iterables to have the same length from the beginning" (if that is what the application demands).
Especially if the files differ in how many newlines they end with. E.g. file a.txt and c.txt end with a newline, but b.txt ends without one, or ends with an extra blank line at the end.
Well, this depends on the application and the assumptions about where the files come from. I can see that zip_longest() will technically work with the sentinel pattern. If there is consensus that it should be a builtin, I might start using this instead of zip() with separate checks. But to enforce length-matching, it still requires an extra check, plus a decision about what the sentinel value should be (for direct file reading '' is fine, but not necessarily for other iterables like collections or file-loading wrappers). IOW, the pattern has some conceptual and code overhead as a solution to "make sure the number of items matches". Given that length-matching is a need that many of us frequently encounter, adding strict=True to zip() seems like a very useful and intuitive option to have, without breaking any existing code. Nathan
On Sat, Apr 25, 2020 at 7:43 AM Steven D'Aprano <steve@pearwood.info> wrote:
I think that the "correct" (simplest, easiest, most obvious, most flexible) way is:
with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
    for lineA, lineB, lineC in zip_longest(a, b, c, fillvalue=''):
        do_something_with(lineA, lineB, lineC)
...
Especially if the files differ in how many newlines they end with. E.g. file a.txt and c.txt end with a newline, but b.txt ends without one, or ends with an extra blank line at the end.
File handling code ought to be resilient in the face of such meaningless differences,
sure. But what difference is "meaningless" depends on the use case. For instance, comments or blank lines in the middle of a file may be a meaningless difference. And you'd want to handle that before zipping anyway. The way I've solved these types of issues in the past is to filter the files first, maybe something like:
with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
    for lineA, lineB, lineC in zip(filtered(a), filtered(b), filtered(c), strict=True):
        do_something_with(lineA, lineB, lineC)
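The message does not say what filtered() does, so the following is only a guess at one reasonable shape for it: a generator that drops blank lines and '#' comments before the streams are zipped. Treating '#' as the comment marker is an assumption made for this sketch.

```python
import io


def filtered(lines):
    """Yield stripped lines, skipping blank lines and '#' comments."""
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            yield text


# In-memory stand-ins for two files that differ only in "meaningless" ways.
a = io.StringIO("1\n2\n\n")
b = io.StringIO("# header\nx\ny")
pairs = list(zip(filtered(a), filtered(b)))
```

With the noise removed first, a strict length check afterwards only fires on differences that actually matter to the application.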
So my argument is that anything you want zip_strict for is better
handled with zip_longest -- including the case of just raising.
That is quite the leap! You make a decent case about handling empty lines in files, but extending that to "anything" is unwarranted.
I honestly do not understand the resistance here. Yes, any change to the standard library should be carefully considered, and any change IS a disruption, and this proposed change may not be worth it. But arguing that it wouldn't ever be useful, I just don't get.
Entirely anecdotal evidence here, but I think this is borne out by the comments in this thread.
* Many people are surprised when they first discover that zip() stops at the shortest, and silently ignores the rest -- I know I was.
* Many uses (most?) do expect the iterators to be of equal length.
  - The main exception to this may be when one of them is infinite, but how common is that, really? Remember that when zip was first created (py2) it was a list builder, not an iterator, and Python itself was much less iterable-focused.
* However, many uses work fine without any length-checking -- that is often taken care of elsewhere in the code -- this is kinda-sorta analogous to a lack of type checking: sure, you COULD get errors, but you usually don't.
We've done fine for years with zip's current behavior, but that doesn't mean it couldn't be a little better and safer for a lot of use cases, and a number of folks on this thread have said that they would use it.
So: if this were added, it would get some use. How much? Hard to know. Is it critically important? Absolutely not. But it's fully backward compatible and not a language change, so the barrier to entry is not all that high.
However, I agree with (I think) Brandt that the lack of a critical need means that a zip_strict() in itertools would get a LOT less use than a flag on zip itself -- so I advocate for that. If folks think extending zip() is not worth it, then I don't think it would be worth bothering with adding a zip_strict to itertools at all.
-CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
I have no objection to adding a zip_strict() or zip_exact() to itertools. I am used to the current behavior, and am apparently in minority in not usually assuming common length iterators. Say +0 on a new function. But I'm definitely -1 on adding a mode switch to the built-in. This is not the way Python is usually done. zip_longest() is a clear example, but so is the recent cut_suffix (or whatever final spelling was chosen). Some folks wanted a mode switch on .rstrip(), and that was appropriately rejected. If zip_strict() is genuinely what you want to do, an import from stdlib is not much effort to get it. My belief is that usually people who think they want this actually want zip_longest(), but that's up to them.
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/2X74JU... Code of Conduct: http://python.org/psf/codeofconduct/
On Sat, Apr 25, 2020 at 10:11 AM David Mertz <mertz@gnosis.cx> wrote:
I have no objection to adding a zip_strict() or zip_exact() to itertools. I am used to the current behavior, and am apparently in minority in not usually assuming common length iterators. Say +0 on a new function.
I'd say I'm +0 also -- I don't think it'll get used much.
But I'm definitely -1 on adding a mode switch to the built-in. This is not the way Python is usually done. zip_longest() is a clear example,
Yes, indeed. But there is a difference here. If you want zip_longest capability, your code simply cannot work with zip(). So folks will make the effort to go import it from itertools.
But most code that could benefit from "strict" behavior will run fine with zip() -- just with less safety, or with a separate safety check. So I don't think a function in itertools would get used much.
Granted, that doesn't say anything about the API consistency (or lack thereof) of adding a "mode flag" -- if that's important, then, well, I guess it's not an option. Are there really no mode flags in Python? (I do recall a quote attributed to Guido about not using a two-mode flag when another function will do.)
but so is the recent cut_suffix (or whatever final spelling was chosen). Some folks wanted a mode switch on .rstrip(), and that was appropriately rejected.
In my mind that was more about consistency with the rest of the string API than about a global "that's not how we do things in Python" -- but it may have been both.
- CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
I'm sure you can find examples of modes. For example open() obviously has a string mode and binary mode which are importantly different. I myself use Pandas a lot, and it is absolutely thick with modes (but is very unpythonic in numerous ways). Even print() with sep and end options has a kind of mode. Nonetheless, I think it is a clear guiding principle in Python that a new function name is better than a mode when there are only two or three behaviors... Other things being equal (which they often are not).
On Sat, Apr 25, 2020 at 8:11 PM David Mertz <mertz@gnosis.cx> wrote:
I have no objection to adding a zip_strict() or zip_exact() to itertools. I am used to the current behavior, and am apparently in minority in not usually assuming common length iterators. Say +0 on a new function.
But I'm definitely -1 on adding a mode switch to the built-in. This is not the way Python is usually done. zip_longest() is a clear example, but so is the recent cut_suffix (or whatever final spelling was chosen). Some folks wanted a mode switch on .rstrip(), and that was appropriately rejected.
Does Python use such an approach (separate functions) because there are real flaws in the mode-switching approach? Or just for historical reasons? As for me, the mode-switching approach looks reasonable in the current situation, because the question is how boundary conditions should be treated. I still prefer a three-case switch like `zip(..., mode=('equal' | 'shortest' | 'longest'))`... but I am also ok with the `strict=True` variant.
Also, I don't think the comparison with the .rstrip() discussion is fair, because in that case it was proposed to switch between two completely different approaches (treat the argument as a string vs. treat it as a set of chars), which is too much for just a switch through an argument. In the zip case it is just how boundaries are treated.
If zip_strict() is genuinely what you want to do, an import from stdlib is not much effort to get it. My belief is that usually people who think they want this actually want zip_longest(), but that's up to them.
On Sun, Apr 26, 2020 at 3:53 AM Kirill Balunov <kirillbalunov@gmail.com> wrote:
On Sat, Apr 25, 2020 at 8:11 PM David Mertz <mertz@gnosis.cx> wrote:
I have no objection to adding a zip_strict() or zip_exact() to itertools. I am used to the current behavior, and am apparently in minority in not usually assuming common length iterators. Say +0 on a new function.
But I'm definitely -1 on adding a mode switch to the built-in. This is not the way Python is usually done. zip_longest() is a clear example, but so is the recent cut_suffix (or whatever final spelling was chosen). Some folks wanted a mode switch on .rstrip(), and that was appropriately rejected.
Python uses such an approach (separate functions) because there are real flaws in the mode switching approach? Or just historically? As for me the mode switching approach in the current situation looks reasonable, because the question is how boundary conditions should be treated. I still prefer three cases switch like `zip(..., mode=('equal' | 'shortest' | 'longest'))`... but also ok with `strict=True` variant.
Separate functions mean you can easily and simply make a per-module decision:
    from itertools import zip_strict as zip
Tada! Now this module treats zip as strict mode. To do that with a mode-switch parameter, you have to put the parameter on every single call to zip (and if you forget one, it's not obvious), or create a wrapper function.
ChrisA
But you also can always make such a switch with functools.partial. -gdg
On Sat, Apr 25, 2020, 1:50 PM Kirill Balunov <kirillbalunov@gmail.com> wrote:
Python uses such an approach (separate functions) because there are real flaws in the mode switching approach? Or just historically?
Well, I think I'd say "philosophically." It's not an historical curiosity like some of the pass-through sys or os functions with signatures that are much more like C than Python. There are definitely pros and cons to different approaches, but striving for relative consistency within one language feels worthwhile. But Python chose to usually avoid mode switches in function signatures. I myself use Pandas, as I mentioned, and it's basically a DSL with a very different philosophy than Python. It encourages fluent style and has tons of mode switches. It's not wrong, but it's absolutely different. Or Python uses snake_case for functions (usually, not always). Which doesn't make CamelCase languages wrong, just different.
Kirill Balunov writes:
Also I don't think that comparison with .rstrip() discussion is fair - because in that case, it was proposed to switch two completely different approaches (to treat as string vs to treat as set of chars) which is too much for just a switch thorugh argument. While in zip case it is just how boundaries are treated.
You're missing the topic of the comparison. It is that we could have
    removeaffix(source, affix, suffix=True)
    strip(self, chars, where='both') # where in {both, left, right}
rather than
    removesuffix(source, suffix)
    removeprefix(source, prefix)
    strip(self, chars)
    lstrip(self, chars)
    rstrip(self, chars)
I'm not sure why David compares this to the case of print(), where end and sep are arbitrary strings. Ditto open(), where the mode argument is not 2-valued, but (text, binary) X (read, write, append, create_no_truncate) X (update, not) X (universal newline, not). Not only that, but there are other 'mode' arguments such as encoding, errors, and closefd.
The convention is that rather than a function foo(..., mode) where mode takes values from a "very small" enum, say {bar, baz, default}, it's preferable to have foo, foo_bar, and foo_baz. I'd guess #enum=4 (including default) is about the limit, which is why open has all those enumerated mode arguments: even errors has more than 4 options. There are just too many of them to define separate functions for all of them, even for those like encoding and errors that the great majority of the time take literal strings from a well-defined set.
There's also the qualification that the majority of calls should take a specific value rather than a variable, and this is the case here. The use cases for zip_strict and zip_longest that I can think of *always* want mode='strict' or mode='longest'; I can't think of any cases where I'd call it like
    zip(x, y, mode=choose_mode(x, y), fillvalue=foo)
Once again, this is a convention. You could have mode arguments, and many languages prefer them. There's a certain consistency to that practice, since the boundary in Python of where it's appropriate to have separate functions and where a mode argument is better style is quite fuzzy. I happen to like the Python practice, but certainly, tastes can differ in this matter.
Steve
On Sun, Apr 26, 2020 at 4:42 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
You're missing the topic of the comparison. It is that we could have
    removeaffix(source, affix, suffix=True)
    strip(self, chars, where='both') # where in {both, left, right}
Oh, that's not the point I was making, but it is also good. During the .removeprefix() discussion, several people seriously suggested something like `mystring.lstrip('foo', string=True)` to alter the behavior between removing from a set of characters and removing a specific subsequence. In any case, that was not what won out, which was my point. I agree that a `where=left|right|both` would be the wrong design for similar reasons.
I'm not sure why David compares this to the case of print(), where end and sep are arbitrary strings.
Yeah, OK, that is weaker. I guess I was just thinking of the fact that my sep is almost always either the default space or the empty string. And my end is similarly almost always either newline or space. But of course I *could* use anything else (and occasionally do).
Ditto open(), where the mode argument is not 2-valued, but (text, binary) X (read, write, append, create_no_truncate) X (update, not) X (universal newline, not). Not only that, but there are other 'mode' arguments such as encoding, errors, and closefd.
Oh... I'll stick with the open() example. Even though the mode is this compound string, it's really conceptually separate elements (and the documentation makes this point). Text|Binary is a two-value switch, even though it is encoded as a character of a potentially multi-character string. That slightly cryptic string code is legacy IMO (but it's not going to change). Likewise, `newline` is an actual separate argument that takes a small number of options.
The convention is that rather than a function foo(..., mode) where mode takes values from a "very small" enum, say {bar, baz, default}, it's preferable to have foo, foo_bar, and foo_baz. I'd guess #enum=4 (including default) is about the limit, which is why open has all those enumerated mode arguments: even errors has more than 4 options.
Yeah, I agree. Even if the individual aspects of mode are each few enough that separate functions would make sense, the cross product of their combinations becomes large, and we wouldn't want separate functions even if we were starting today.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Sat, Apr 25, 2020 at 10:50 AM Kirill Balunov <kirillbalunov@gmail.com> wrote:
...the mode switching approach in the current situation looks reasonable, because the question is how boundary conditions should be treated. I still prefer three cases switch like `zip(..., mode=('equal' | 'shortest' | 'longest'))`
I like this -- it certainly makes a lot more sense than having zip(), zip(..., strict=True), and zip_longest().
So I'm proposing that we have three options on the table:
    zip(..., mode=('equal' | 'shortest' | 'longest'))
or
    zip() zip_longest() zip_equal()
or, of course, keep it as it is.
... but also ok with `strict=True` variant.
Chris Angelico wrote:
Separate functions mean you can easily and simply make a per-module decision:
from itertools import zip_strict as zip
Tada! Now this module treats zip as strict mode.
This is a nifty feature of multiple functions in general, but I'm having a really hard time coming up with a use case for these particular functions: you're using zip() multiple times in one module, and you want them all to be the same "version", but you want to be able to easily change that version on a module-wide basis?
As for the string methods examples: one big difference is that the string methods are all in the same namespace. This is different because zip() is a built-in, but zip_longest() and zip_equal() would not be. I don't think anyone is suggesting adding both of those to builtins. So adding a mode switch is the only way to "unify" them -- maybe not a huge deal, but I think a big enough deal that zip_equal wouldn't see much use.
and changing map and friends to iterators is a big part of why you can write all kinds of things naturally in Python 3.9 that were clumsy, complicated, or even impossible.
Sure, and I think we're all happy about that, but I also find that we've lost a bit of the nice "sequence-oriented" behavior too. Not sure that's relevant to this discussion though. But it is in one way: back in the 1.5 days, you would always use zip() on sequences, so checking their length was trivial, if you really wanted to do that -- but while checking that your iterators are in fact the same length is possible, it's pretty klunky, and certainly not newbie-friendly.
I've lost track of who said it, but I think someone proposed that most people really want zip_longest most often (sorry if I'm misinterpreting that). I think this is kinda true, in the sense that if you only had one, then zip_longest would be able to cover the most use-cases (you can build a zip_shortest out of zip_longest, but not the other way around), but particularly as zip_longest() is going to fail with infinite iterators, it can't be the only option after all.
One other comment about the modes vs multiple functions:
It makes a difference with implementation -- with multiple functions, you have to re-implement the core functionality three times (DRY) -- or have a hidden common function underneath -- that seems like code-smell to me.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Let me try to explain why I believe that people who think they want zip_strict() actually want zip_longest().
I've already mentioned that I myself usually want what zip() does now (i.e. zip_shortest()) ... but indeed not always.
If I have two or more "sequences" there are basically two cases of that.
(1) The sequences are something like "options" where I expect a small number of them (say 5, or 50). If that is the case, code like this is perfectly fine:
    stuff1, stuff2 = map(list, (stuff1, stuff2))  # concretize iterators
    if len(stuff1) == len(stuff2):
        for pair in zip(stuff1, stuff2):
            process(pair)
    else:
        raise UnequalLengthError("uh oh")
(2) The sequences are either infinite or very large. I.e. they are "data", perhaps even streaming data that only arrives over time into the iterator from some external source. If this is the case, obviously we cannot concretize them. So here we either use the current tool:
    for pair in zip_longest(stuff1, stuff2, fillvalue=_sentinel):
        if _sentinel in pair:
            raise UnequalLengthError("uh oh")
        process(pair)
Or alternately, we have a new function/mode that instead formulates this as:
    try:
        for pair in zip_strict(stuff1, stuff2):
            process(pair)
    except ZipLengthError:
        raise UnequalLengthError("uh oh")
The hypothetical new style is fine. To me it looks slightly less good, but the difference is minimal.
It almost feels like the proponents of the new mode/function are hoping to avoid the processing that might need to be "rolled back" in some manner if there is a synchronization problem. But that simply is not an option. If we have a billion events, or indefinitely many events that arrive over time, we simply cannot know before we get to the end that synchronization messed up. I mean, sure, if some characteristic of the intermediate data can indicate the mismatch, that's great... but it's not affected by which style is used, it's a separate test.
Approach (1) is nice where available because it avoids processing altogether.
But it is only possible for "small data" (and "ready data") no matter what.
On Sun, Apr 26, 2020 at 10:52 AM David Mertz <mertz@gnosis.cx> wrote:
Let me try to explain why I believe that people who think they want zip_strict() actually want zip_longest().
Thanks for laying it out so clearly. However, reading your post makes it clear to me that I DO still want zip_strict() :-) It comes down to this:
If I have two or more "sequences" there are basically two cases of that.
so you need to write different code, depending on which case? that seems not very "there's only one way to do it" to me. Or alternately, we have a new function/mode that instead formulates this as:
    try:
        for pair in zip_strict(stuff1, stuff2):
            process(pair)
    except ZipLengthError:
        raise UnequalLengthError("uh oh")
The hypothetical new style is fine. To me it looks slightly less good, but the difference is minimal.
To me it looks better than both of the other options -- and much better (particularly for beginners) than the _sentinel approach. If folks think that it really won't be used often, fine -- but I'm surprised that you think that writing the extra, has-to-be-thought-out checking code is actually just as good, or better, API. In fact, if I found myself writing either of those more than once, I'd write a utility function that did it (probably with the second version, as it is reasonable in all cases). And if I, or others, are writing little utility functions for common uses, maybe it DOES make sense to put it in the std library.
It almost feels like the proponents of the new mode/function are hoping to avoid the processing that might need to be "rolled back" in some manner if there is a synchronization problem.
Not me for one, I think it's a good idea because it would prevent all of us from writing those little utilities, and particularly for newbies, would provide an easy and obvious way to do it.
-CHB
On Sun, Apr 26, 2020 at 11:56 PM Christopher Barker <pythonchb@gmail.com> wrote:
If I have two or more "sequences" there are basically two cases of that.
so you need to write different code, depending on which case? that seems not very "there's only one way to do it" to me.
This difference is built into the problem itself. There CANNOT be only one way to do these fundamentally different things. With iterators, there is at heart a difference between "sequences that one can (reasonably) concretize" and "sequences that must be lazy." And that difference means that for some versions of a seemingly similar problem it is possible to ask len() before looping through them while for others that is not possible (and hence we may have done some work that we want to "roll-back" in some sense).
Exactly what not-reasonable means might vary by context. Infinite sequences are always a no-go. But slow iterators could go either way perhaps. I.e. I'm waiting on a slow wire for data, but when it arrives it will be moderate sized. Should I wait? I dunno, it depends. Or maybe it is fast, but there are a billion items. Do I want to use the memory? Maybe. Whatever decision I make has to decide whether bounds can be checked in advance.
However, the mismatched length feels like such a small concern in what can go wrong. For example, I have some time series data I was working with yesterday. The same timestamps are meant to match up with several different measurements. However, *sometimes* a measurement is missing. I might therefore wind up with two or more sequences of the same length, but with "teeth" of the zipper that don't actually match up always. Neither checking len() nor zip_strict() nor a zip_longest() sentinel are going to catch this problem.
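To make the time-series point concrete, here is a minimal sketch with made-up data. The names `temps`, `pressures` and `align_by_timestamp` are hypothetical and not from the thread. Both series have the same length, so neither a len() check nor a strict zip would complain, yet positional zipping pairs readings from different timestamps. Joining on the timestamp is one way to expose the gap.

```python
def align_by_timestamp(*series):
    """Join (timestamp, value) series on timestamp; a missing value becomes None."""
    tables = [dict(s) for s in series]
    all_stamps = sorted(set().union(*tables))  # union of every timestamp seen
    return [(t, *(table.get(t) for table in tables)) for t in all_stamps]


# Hypothetical measurements: each series is missing one reading.
temps = [(1, 20.5), (2, 20.7), (4, 21.0)]       # nothing recorded at t=3
pressures = [(1, 1012), (3, 1011), (4, 1010)]   # nothing recorded at t=2

# Same length, so positional zipping "succeeds" but misaligns the middle row.
positional = list(zip(temps, pressures))

# Keyed alignment shows where each series has a hole.
aligned = align_by_timestamp(temps, pressures)
```

The second element of `positional` pairs the reading at t=2 with the reading at t=3, which is exactly the kind of silent mismatch a length check cannot see.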
Or alternately, we have a new function/mode that instead formulates this as:
    try:
        for pair in zip_strict(stuff1, stuff2):
            process(pair)
    except ZipLengthError:
        raise UnequalLengthError("uh oh")
The hypothetical new style is fine. To me it looks slightly less good, but the difference is minimal.
To me it looks better than both of the other options -- and much better (particularly for beginners) than the _sentinel approach.
Sure. That's fine. I'm +0 or even higher on adding itertools.zip_strict(). My taste prefers the other style I showed, but as I say, this version is perfectly fine. De gustibus non disputandum est.
On Apr 26, 2020, at 21:23, David Mertz <mertz@gnosis.cx> wrote:
On Sun, Apr 26, 2020 at 11:56 PM Christopher Barker <pythonchb@gmail.com> wrote:
If I have two or more "sequences" there are basically two cases of that.
so you need to write different code, depending on which case? that seems not very "there's only one way to do it" to me.
This difference is built into the problem itself. There CANNOT be only one way to do these fundamentally different things.
With iterators, there is at heart a difference between "sequences that one can (reasonably) concretize" and "sequences that must be lazy." And that difference means that for some versions of a seemingly similar problem it is possible to ask len() before looping through them while for others that is not possible (and hence we may have done some work that we want to "roll-back" in some sense).
Agreed. But here’s a different way to look at it: The Python iteration protocol hides the difference between different kinds of iterables; every iterator is just a dumb next-only iterator. So any distinction between things you can pre-check and things you can post-check has to be made at a higher level, up wherever the code knows what’s being iterated (probably the application level). That isn’t inherent to the idea of iteration, as demonstrated by C++ (and later languages like Swift), where you can have reversible or random-accessible iterators and write tools that switch on those features, so you wouldn’t be forced to make the decision at the application level. You could write a generic C++ zip_equal function that pre-checks random-accessible iterators but post-checks other iterators. But when would you want that generic function? When you’re writing that application code, you know whether you have sequences, inherently lazy iterators, or generic iterables as input, and you know whether you want no check, a pre-check, or a post-check on equal lengths, and those aren’t independent questions: when you want a pre-check, it’s because you’re thinking in sequence terms, not general iteration terms. Pre-checking sequences is so trivial that you don’t need any helpers. The only piece Python is (arguably) missing is a way to do that post-check easily when you’ve decided you need it, and that’s what the proposals in this thread are trying to solve. The fact that asking for post-checking on the zip iterator won’t look the same as manually pre-checking the input sequences isn’t a violation of TOOWTDI because the “it” you’re doing is a different thing, different in a way that’s meaningful to your code, and there doesn’t have to be one obvious way to do two different things. 
Just like slicing doesn’t have to look the same as islice, and a find method doesn’t have to look the same as a generic iterable find function, and so on; they only look the same when the distinction between thinking about sequences and thinking about lazy iterables is irrelevant to the problem.
On Sun, Apr 26, 2020 at 9:21 PM David Mertz <mertz@gnosis.cx> wrote:
On Sun, Apr 26, 2020 at 11:56 PM Christopher Barker <pythonchb@gmail.com> wrote:
If I have two or more "sequences" there are basically two cases of that.
so you need to write different code, depending on which case? that seems not very "there's only one way to do it" to me.
This difference is built into the problem itself. There CANNOT be only one way to do these fundamentally different things.
Isn't there? There are many cases where you CANNOT (or don't want to, for performance reasons) "consume" the entirety of the input iterators, and many cases where it would be fine to do that. But are there many (any?) cases where you couldn't use the "sentinel approach"? To me, having a zip_equal that iterates through the inputs on demand, and checks when one is exhausted, rather than pre-determining the lengths ahead of time will solve almost all (or all? I can't think of an example where it wouldn't) use cases, and it is completely consistent with all the other things that are iterators in Py3 that were sequences in py2: zip, map, dict.items() (and friends), and ... There is a pretty consistent philosophy in py3 that anything that can be an iterator, and be lazy-evaluated is done that way, and for the time when you need an actual sequence, you can wrap list() around it. So I see no downside to having a zip_equal that doesn't pre-compute the lengths, when it could.
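A minimal sketch of the lazy zip_equal described above, built on itertools.zip_longest with a private sentinel. The exception name `UnequalLengthError` is borrowed from the example earlier in the thread. Subclassing ValueError is an assumption, not something the thread settled.

```python
from itertools import zip_longest


class UnequalLengthError(ValueError):
    """Raised when the zipped iterables turn out to have different lengths."""


_MISSING = object()  # private sentinel that no caller data can be identical to


def zip_equal(*iterables):
    """Yield tuples like zip(), but raise once a length mismatch is detected."""
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in combo):
            raise UnequalLengthError("iterables have different lengths")
        yield combo
```

Because the check happens on demand, any tuples yielded before the mismatch have already been handed to the caller. That is the "post-check" behavior discussed in this thread, not a pre-check.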
With iterators, there is at heart a difference between "sequences that one can (reasonably) concretize" and "sequences that must be lazy." And that difference means that for some versions of a seemingly similar problem it is possible to ask len() before looping through them while for others that is not possible (and hence we may have done some work that we want to "roll-back" in some sense).
Sure: but that is a distinction that is, as far as I know, never made in the standard library with all the "iterator related" code. There are some things that require proper sequences, but as far as I know, nothing that expects a "concretizable" iterator -- and frankly, I don't think there is a clear definition of that anyway -- some things clearly aren't, but others it would depend on how big they are, and the memory available to the machine, etc. In fact, the reason we have as many iterator-related tools is exactly so programmers DON'T have to make that decision. Can you think of a single case where a zip_equal() (either pre-existing or roll your own) would not work, but the concretizing version would?
There is one "downside" to this in that it potentially leaves the iterators passed in in an undetermined state -- partially exhausted, and with a longer one having had one more item removed than was used. But that exists with "zip_shortest" behavior anyway. But it would be a minor reason to do the concretizing approach -- at least then you'd know your iterators were fully exhausted.
SIDE NOTE: this is reminding me that there have been calls in the past for an optional __len__ protocol for iterators that are not proper sequences, but DO know their length -- maybe one more place to use that if it existed.
However, the mismatched length feels like such a small concern in what can go wrong.
Agreed -- but I think everyone agrees -- this is not a huge deal (or it would have been done years ago), but it's a nice convenience, and minimally disruptive.
Sure. That's fine. I'm +0 or even higher on adding itertools.zip_strict(). My taste prefers the other style I showed, but as I say, this version is perfectly fine.
If there were a zip_equal() in itertools, would you ever write the code to use zip_longest and check the sentinel? For my part, I wouldn't, and indeed once I had a second need for it, I'd write zip_equal for my own toolbox anyway :-)
-CHB
On 27/04/2020 21:39, Christopher Barker wrote:
To me, having a zip_equal that iterates through the inputs on demand, and checks when one is exhausted, rather than pre-determining the lengths ahead of time will solve almost all (or all? I can't think of an example where it wouldn't) use cases
Except for those cases where either the whole dataset needs to be processed or none of it, which is what people were thinking might be behind some of the desire for zip_equal(). That you can't do it in the general case would be a later "well, bugger" stage of the design :-) -- Rhodri James *-* Kynesim Ltd
On Mon, Apr 27, 2020 at 4:39 PM Christopher Barker <pythonchb@gmail.com> wrote:
Isn't there? There are many cases where you CANNOT (or don't want to, for performance reasons) "consume" the entirety of the input iterators, and many cases where it would be fine to do that. But are there many (any?) cases where you couldn't use the "sentinel approach"?
It depends what you mean by "cannot." Algorithmically of course you can use sentinel. But the issue is computation cost and rollback of side-effects. E.g. change my example code just slightly:

    for p, q, m in zip_longest(bigints1, bigints2, bigints3, fillvalue=_sentinel):
        if _sentinel in (p, q, m):
            raise UnequalLengthError("uh oh")
        result = p**q % m
        store_to_db(result)

If we have 1000-digit numbers, but only a couple thousand of them, we would be vastly better off checking the lengths in advance (if that is possible, and if generating the numbers in the first place is comparatively cheap).
Sure: but that is a distinction that is, as far as I know, never made in the standard library with all the "iterator related" code. There are some things that require proper sequences, but as far as I know, nothing that expects a "concretizable" iterator -- and frankly, I don't think there is a clear definition of that anyway
Oh... absolutely. "Concretizable" is very task-specific. Other than infinite iterators, any iterator could be wrapped in list() to *eventually* get a concrete sequence. This isn't a Python language distinction but a "what do you want to do?" distinction.
If there were a zip_equal() in itertools, would you ever write the code to use zip_longest and check the sentinel? For my part, I wouldn't, and indeed once I had a second need for it, I'd write zip_equal for my own toolbox anyway :-)
I dunno. I guess I might use zip_equal() in the case where I didn't want to bother with an `except` and just let the program crash on mismatch (or maybe catch with a generic "something went wrong" kind of status). Whenever I really want specific remediation action, I think I'd still prefer the sentinel. Often enough I still can do *something* with the non-exhausted elements from the other iterators that it feels like a more general pattern. FWIW, if it is added, I like the name zip_strict() better than zip_equal(). But someone else is building the bikeshed, so I'm not that worried about spelling.
On Apr 27, 2020, at 13:41, Christopher Barker <pythonchb@gmail.com> wrote:
SIDE NOTE: this is reminding me that there have been calls in the past for an optional __len__ protocol for iterators that are not proper sequences, but DO know their length -- maybe one more place to use that if it existed.
But __len__ doesn’t really make sense on iterators. And no iterator is a proper sequence, so I think you meant _iterables_ that aren’t proper sequences anyway -- and that’s already there:

    xs = {1, 2, 3}
    len(xs) # 3
    isinstance(xs, collections.abc.Sized) # True

I think the issue is that people don’t actually want zip to be an Iterator, they want it to be a smarter Iterable that preserves (at least) Sized from its inputs. The same way, e.g., dict.items or memoryview does. The same way range is lazy but not an Iterator. And it’s not just zip; the same thing is true for map, enumerate, islice, etc. And it’s also not just Sized. It would be just as cool if zip, enumerate, etc. preserved Reversible. In fact, “how do I both enumerate and reverse” comes up often enough that I’ve got a reverse_enumerate function in my toolbox to work around it. And, for that matter, why do they have to be only one-shot-iterable unless their input is? Again, dict.items and range come to mind, and there’s no real reason zip, map, islice, etc. couldn’t preserve as much of their input behavior as possible:

    xs = [1, 2, 3]
    ys = map(lambda x: x*3, xs)
    len(ys) # 3
    reversed(enumerate(ys))[-1] # (0, 3)

Of course it’s not always possible to preserve all behavior:

    xs = [1, 2, 3]
    ys = filter(lambda x: x%2, xs)
    len(ys) # still a TypeError even though xs is sized

… but the cases where it is or isn’t possible can all be worked out for each function and each ABC: filter can _never_ preserve Sized but can _always_ preserve Reversible, etc. This is clearly feasible -- Swift does it, and C++ is trying to do it in their next version, and Python already does it in a few special cases (as mentioned earlier), just not in all (or even most) of the potentially useful cases. The only really hard part of this is designing a framework that makes it possible to write all those views simply.
You don’t want to have to write five different map view classes for all the ways a map can act based on its inputs, and then repeat 80% of that same work again for filter, and again for islice and so on. The boilerplate would be insane. (See the Swift 1.0 stdlib for an example of how horrible it could be, and they only implemented a handful of the possibilities.) And, except for a couple of things (notably genexprs), most of this could be written as a third-party library today. (And if it existed and people were using it widely, it would be pretty easy to argue that it should come with Python, so that it _could_ handle those last few things like genexprs, and also to serve as an example to encourage third-party libraries like toolz to similarly implement smart views instead of dumb iterators, and also as helpers to make that easier for them. That argument might or might not win the day, but at least it’s obvious what it would look like.) So I suspect the only reason nobody’s done so is that you don’t actually run into a need for it very often. How often do you actually need the result of zip to be Sized anyway? At least for me, it’s not very often. Whenever I run into any of these needs, I start thinking about the fully general solution, but put it off until I run into a second good use for it and meanwhile write a simple 2-minute workaround for my immediate use (or add a new special case like reversed_enumerate to my toolbox), and then by the time I run into another need for it, it’s been so long that I’ve almost forgotten the idea… But maybe there would be a lot more demand for this if people knew the idea was feasible? Maybe there are people who have tons of real-life examples where they could use a Sized zip or a Reversible enumerate or a Sequence map, and they just never thought they could have it so they never tried or asked?
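A rough sketch of the "smarter Iterable" idea for zip only. The class name `ZipView` is hypothetical. This is not how any existing library does it, and it deliberately sidesteps the hard part (a framework covering every function and ABC combination).

```python
from collections.abc import Sized


class ZipView:
    """Re-iterable zip over its inputs; len() works when every input is Sized."""

    def __init__(self, *iterables):
        self._iterables = iterables

    def __iter__(self):
        # A fresh zip each time, so the view is not one-shot if its inputs aren't.
        return zip(*self._iterables)

    def __len__(self):
        if not all(isinstance(it, Sized) for it in self._iterables):
            raise TypeError("len() requires every input to be Sized")
        return min((len(it) for it in self._iterables), default=0)
```

One known wart in this sketch: because the class defines `__len__`, `isinstance(view, Sized)` is always True even when `len(view)` would raise. Avoiding that requires choosing among several view classes at construction time, which is the boilerplate problem described above.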
On Mon, Apr 27, 2020 at 01:39:19PM -0700, Christopher Barker wrote:
Can you think of a single case where a zip_equal() (either pre-existing or roll your own) would not work, but the concretizing version would?
That's easy: if the body of your zip-handling function has side-effects which must be atomic (or at least as atomic as Python code will allow). An atomic function has to either LBYL (e.g. check the lengths of the iterables before starting to zip them), or needs to be able to roll-back if a mismatch is found at the end. In the most general case, we can't roll-back easily, or at all, so if your requirements are to avoid partial operations, then you must concretize the input streams and check their lengths. But I don't think that's a problem for this proposal. Nobody is saying that zip_strict can solve all problems, and we shouldn't hold it against it that it doesn't solve the atomicity problem (which is very hard to solve in Python). -- Steven
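The look-before-you-leap variant can be sketched as follows, assuming every input is finite and small enough to hold in memory. The function name `process_all_or_nothing` and the `process` callback are hypothetical.

```python
def process_all_or_nothing(process, *iterables):
    """Concretize the inputs, verify equal lengths, and only then run side effects."""
    columns = [list(it) for it in iterables]  # fully consumes every input
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        # Nothing has been processed yet, so there is nothing to roll back.
        raise ValueError(f"unequal lengths: {sorted(lengths)}")
    for row in zip(*columns):
        process(*row)
```

This only protects against a length mismatch. If `process` itself fails halfway through, the partial side effects remain, so it does not solve the general atomicity problem.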
On Tue, May 5, 2020 at 7:20 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 27, 2020 at 01:39:19PM -0700, Christopher Barker wrote:
Can you think of a single case where a zip_equal() (either pre-existing or roll your own) would not work, but the concretizing version would?
That's easy: if the body of your zip-handling function has side-effects which must be atomic (or at least as atomic as Python code will allow). An atomic function has to either LBYL (e.g. check the lengths of the iterables before starting to zip them), or needs to be able to roll-back if a mismatch is found at the end.
Good point. But the current "shortest" behavior would be even worse. At least if it raised you'd get a warning that you made a mess of your data :-) And yes, that's not an argument against this idea.
-CHB
On Mon, 27 Apr 2020 13:39:19 -0700 Christopher Barker <pythonchb@gmail.com> wrote:
There is one "downside" to this in that it potentially leaves the iterators passed in in an undetermined state -- partially exhausted, and with a longer one having had one more item removed than was used. But that exists with "zip_shortest" behavior anyway. But it would be a minor reason to do the concretizing approach -- at least then you'd know your iterators were fully exhausted.
SIDE NOTE: this is reminding me that there have been calls in the past for an optional __len__ protocol for iterators that are not proper sequences, but DO know their length -- maybe one more place to use that if it existed.
The C standard library has an ungetc function that, well, "ungets" one character from the end of a character stream. The next time the stream is read, it returns that ungotten character first, and then goes back to the stream. Such a feature could solve this problem on infinite streams without having to concretize them. (Unless, of course, zip's *caller* tried to unget an element of the iterator immediately after zip did, but that's not a likely occurrence.) Dan -- “Atoms are not things.” – Werner Heisenberg Dan Sommers, http://www.tombstonezero.net/dan
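A sketch of what an ungetc-style wrapper might look like in Python. Both `PushbackIterator` and `zip_unget` are hypothetical names. Nothing like this exists in the builtins, and the zip variant only works on inputs already wrapped this way.

```python
class PushbackIterator:
    """Iterator wrapper with an unget() method, loosely modelled on C's ungetc."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []  # stack of items handed back by unget()

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        return next(self._it)

    def unget(self, item):
        self._pushed.append(item)


def zip_unget(*pushbacks):
    """Like zip(), but when one input runs out, return the items already
    fetched in that round to the inputs they came from."""
    if not pushbacks:
        return
    while True:
        fetched = []
        for source in pushbacks:
            try:
                fetched.append(next(source))
            except StopIteration:
                for earlier, item in zip(pushbacks, fetched):
                    earlier.unget(item)
                return
        yield tuple(fetched)
```

With this, the longer input is left exactly where the last complete tuple ended, so the "one more item removed than was used" problem does not arise for wrapped inputs.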
On 04/26/2020 08:56 PM, Christopher Barker wrote:
that seems not very "there's only one way to do it" to me.
The quote is "one obvious way".
It almost feels like the proponents of the new mode/function are hoping to avoid the processing that might need to be "rolled back" in some manner if there is a synchronization problem.
There is no way to "roll back" an iterator, unless you are writing custom ones -- in which case you'll need a custom zip to do the rolling back. -- ~Ethan~
Here's an idea for combining zip_longest with zip strict. Define zip like so:

    def zip(*iterables, strict=False, fill=object())

With no keyword arguments, it's the regular old zip. With `strict=True`, it ensures the iterables are of equal length. If you pass in an argument for `fill`, it becomes zip_longest.
On Sun, Apr 26, 2020 at 7:37 PM Christopher Barker <pythonchb@gmail.com> wrote:
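One way the combined signature could be prototyped in pure Python. It is named `zip_modes` here to avoid shadowing the builtin. The module-level sentinel replaces the inline `fill=object()` default so the function can tell whether `fill` was passed, and rejecting `strict` together with `fill` is an assumption the proposal does not address.

```python
from itertools import zip_longest

_NO_FILL = object()  # marks "fill was not supplied"


def _zip_checked(iterables):
    """Generator that raises ValueError when the inputs differ in length."""
    marker = object()
    for combo in zip_longest(*iterables, fillvalue=marker):
        if any(item is marker for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


def zip_modes(*iterables, strict=False, fill=_NO_FILL):
    """Dispatch to shortest, strict, or longest behavior based on keywords."""
    if fill is not _NO_FILL:
        if strict:
            raise TypeError("strict and fill cannot be combined")
        return zip_longest(*iterables, fillvalue=fill)
    if strict:
        return _zip_checked(iterables)
    return zip(*iterables)
```

Note that `fill=None` counts as "fill supplied" here, which is why a dedicated sentinel is needed instead of None as the default.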
On Sat, Apr 25, 2020 at 10:50 AM Kirill Balunov <kirillbalunov@gmail.com> wrote: ...the mode switching approach in the current situation looks reasonable, because the question is how boundary conditions should be treated. I still prefer three cases switch like `zip(..., mode=('equal' | 'shortest' | 'longest'))`
I like this -- it certainly makes a lot more sense than having zip(), zip(...,strict=True), and zip_longest()
So I'm proposing that we have three options on the table:
zip(..., mode=('equal' | 'shortest' | 'longest'))
or
zip() zip_longest() zip_equal()
or, of course, keep it as it is.
... but also ok with `strict=True` variant.
Chris Angelico wrote:
Separate functions mean you can easily and simply make a per-module decision:
from itertools import zip_strict as zip
Tada! Now this module treats zip as strict mode.
this is a nifty feature of multiple functions in general, but I'm having a really hard time coming up with a use case for these particular functions: you're using zip() multiple times in one module, and you want them all to be the same "version", but you want to be able to easily change that version on a module-wide basis?
As for the string methods examples: one big difference is that the string methods are all in the same namespace. This is different because zip() is a built in, but zip_longest() and zip_equal() would not be. I don't think anyone is suggesting adding both of those to builtins. So adding a mode switch is the only way to "unify" them -- maybe not a huge deal, but I think a big enough deal that zip_equal wouldn't see much use.
and changing map and friends to iterators is a big part of why you can write all kinds of things naturally in Python 3.9 that were clumsy, complicated, or even impossible.
Sure, and I think we're all happy about that, but I also find that we've lost a bit of the nice "sequence-oriented" behavior too. Not sure that's relevant to this discussion though. But it is in one way: back in the 1.5 days, you would always use zip() on sequences, so checking their length was trivial, if you really wanted to do that -- but while checking that your iterators are in fact the same length is possible, it's pretty klunky, and certainly not newbie-friendly.
I've lost track of who said it, but I think someone proposed that most people really want zip_longest most often (sorry if I'm misinterpreting that). I think this is kinda true, in the sense that if you only had one, then zip_longest would be able to cover the most use-cases (you can build a zip_shortest out of zip_longest, but not the other way around), but particularly as zip_longest() is going to fail with infinite iterators, it can't be the only option after all.
One other comment about the modes vs multiple functions:
It makes a difference with implementation -- with multiple functions, you have to re-implement the core functionality three times (DRY) -- or have a hidden common function underneath -- that seems like code-smell to me.
-CHB
David Mertz writes:
If zip_strict() is genuinely what you want to do, an import from stdlib is not much effort to get it.
+1. I'm definitely in the camp wanting this to be a new function.
My belief is that usually people who think they want this actually want zip_longest(), but that's up to them.
I don't see how that would be true, to be honest. "zip" implies that the data in the streams are in fact synchronized. Any difference in lengths implies that the synchronization is lost "somewhere". In general, there's no way of knowing where the data generating process broke down so they have to throw away the whole computation. I would guess that in a quite large set of cases, desynchronization implies a logic error somewhere, so they have to fix the algorithm. Of course there are cases where zip is used to *create* the synchronization, such as zip(tasks, volunteers) or zip(shuffled(cards), players), but in those cases zip_longest rarely makes sense to me. In fact you probably want a "circular" zip_longest, where you keep "dealing", starting over at the beginning of an iterable whenever it is exhausted until all the tasks, volunteers, or cards have been dealt. Or you just want zip, as in a case like zip(rooms, students), and you deal with the remainder(s) of the longer iterable(s) in a separate process. So it seems to me that zip_shortest, zip_strict, and zip_longest all have valid use cases, and it's clear they don't exhaust all the cases of zipping. It's unclear to me which is the most common case, but I think the folks who want a zip_strict have a good case for a function in the stdlib, though I see no need to change the builtin.
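The "circular" dealing variant mentioned above could look roughly like this. The name `zip_dealing` is hypothetical, and the sketch assumes the inputs are sized sequences (so the longest length is known up front) instead of arbitrary iterators.

```python
from itertools import cycle, islice


def zip_dealing(*sequences):
    """Zip to the longest input, restarting shorter ones whenever they run out."""
    if not sequences or any(len(seq) == 0 for seq in sequences):
        return []  # nothing can be dealt if any pile is empty
    longest = max(len(seq) for seq in sequences)
    return list(zip(*(islice(cycle(seq), longest) for seq in sequences)))


# Hypothetical example: five tasks dealt out to two volunteers in turn.
assignments = zip_dealing(["t1", "t2", "t3", "t4", "t5"], ["ann", "bob"])
```

Here `assignments` pairs the tasks with "ann", "bob", "ann", "bob", "ann" in that order.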
On Sat, Apr 25, 2020 at 09:39:14AM -0700, Christopher Barker wrote: [...]
File handling code ought to be resilient in the face of such meaningless differences,
sure. But what difference is "meaningless" depends on the use case. For instance, comments or blank lines in the middle of a file may be a meaningless difference. And you'd want to handle that before zipping anyway.
Probably. But once you get past that simple zip(*files) pattern, the whole processing loop becomes more complex and it isn't so obvious that you'll end up using zip *at all* let alone the proposed strict version. [...]
So my argument is that anything you want zip_strict for is better handled with zip_longest -- including the case of just raising.
That is quite the leap! You make a decent case about handling empty lines in files, but extending that to "anything" is unwarranted.
Okay, fair point -- I should have said "just as well or better" rather than just better. And it's not an unwarranted leap, because you can easily implement zip_strict from zip_longest. zip_longest provides all the functionality of zip_strict, plus more:
* zip_strict can *only* raise if there is a mismatch;
* zip_longest can raise, or truncate, or pad the data with a default; you can transform the short data in any way you want.
* A few days ago, I needed a version of zip that simply ignored missing values:
      zip_shrinking(['ab', '1234', 'xyz']) --> a1x b2y 3z 4
  and I knocked up one using zip_longest in about thirty seconds.
If we could only have one version of zip, it would be a no-brainer to choose zip_longest.
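The zip_shrinking function mentioned above was not posted, so the following is only a guess at how it could be built on zip_longest. One assumed difference: this sketch takes the iterables as separate arguments, not as a single list.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel marking exhausted inputs


def zip_shrinking(*iterables):
    """Zip to the longest input, dropping slots whose input has run out."""
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        yield tuple(item for item in combo if item is not _MISSING)


rows = ["".join(row) for row in zip_shrinking("ab", "1234", "xyz")]
```

For the inputs shown, `rows` comes out as `['a1x', 'b2y', '3z', '4']`, matching the output quoted in the message.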
I honestly do not understand the resistance here. Yes, any change to the standard library should be carefully considered, and any change IS a disruption, and this proposed change may not be worth it. But arguing that it wouldn't ever be useful, I just don't get.
I am sorry if I have given you the impression that I believe that there is *never* any reason to validate equal lengths. That is not my position, and I apologise if I said anything that gave you that idea. Of course I don't believe that there is no code anywhere in the world that could make use of a zip_strict, that would be silly, but I do have serious doubts that the use-cases are all three of sufficiently important, common, and performance sensitive to justify making it a builtin.
We have five options here:
- Status quo wins a draw: do nothing.
- Add a new builtin.
- Add a flag to zip().
- Add a zip_strict to itertools.
- Add a recipe to itertools showing how to do it.
If Raymond agrees, I wouldn't oppose a version in itertools, even though I have my doubts about its usefulness and I think that it will more often be an attractive nuisance than an actual help. But the barrier to entry for the stdlib is lower than for builtins. I also dislike the proposed builtin API: bool flag arguments are, in my opinion, a code-smell. Now I could be convinced to change my mind by a sufficiently compelling set of use-cases, but so far they've been disappointingly weak to my mind.
I also think that we don't yet have a good design for what it should do. Is the intent to make an assertion about program logic, as Brandt says? In this case, it should raise AssertionError and it should be disabled when other assertions are disabled. Or is the intent to have an exception we intend to catch and (somehow?) recover from? In this case, the most likely exception is ValueError. I know some people don't care what exception code raises. I've seen lots of people raise AssertionError for bad user data, or missing files. I've seen people raise TypeError for things that have nothing to do with types. But for me, choosing the right exception type and behaviour is important. It's about communicating intent. [...]
* Many uses (most?) do expect the iterators to be of equal length. - The main exception to this may be when one of them is infinite, but how common is that, really?
Common and useful! Really. But plain old zip isn't going to go away, so let's leave this. [...]
So: if this were added, it would get some use. How much? hard to know. Is it critically important? absolutely not. But it's fully backward compatible and not a language change, the barrier to entry is not all that high.
Of course it's a language change. If we add this to zip, other Python interpreters will have to follow once they catch up to version 3.9 or 3.10.
However, I agree with (I think Brandt) in that the lack of a critical need means that a zip_strict() in itertools would get a LOT less use than a flag on zip itself
So you and Brandt are arguing that the *less* useful this is, the more we should prefer to make it a builtin? For everything else, it goes the other way: aside from maybe the odd builtin left over from Python 1.0, things become builtin only if they are *more* useful, not less. -- Steven
On Apr 25, 2020, at 09:40, Christopher Barker <pythonchb@gmail.com> wrote:
- The main exception to this may be when one of them is infinite, but how common is that, really? Remember that when zip was first created (py2) it was a list builder, not an iterator, and Python itself was much less iterable-focused.
Well, yes, and improvements like that are why Python 3.9 is a better language than Python 2.0 (when zip was first added). Python wasn’t just much less iterable-focused, it didn’t even have the concept of “iterable”. While it did have map and filter, the tutorial taught you to loop over range(len(xs)), only mentioning map and filter as “good candidates to pass to lambda forms” for people who really want to pretend Python is Lisp rather than using it properly. Adding the iterator protocol and more powerful for loop; functions like zip, enumerate, and iter; generators, comprehensions, and generator expressions; itertools; yield from; and changing map and friends to iterators is a big part of why you can write all kinds of things naturally in Python 3.9 that were clumsy, complicated, or even impossible. Sure, you can use it as if it were Python 2.0 but with Unicode, but it’s a lot more than that. But also, why was zip added with “shortest” behavior in 2.0 in the first place? It wasn’t to support infinite or otherwise lazy lists, because those didn’t exist. And it wasn’t chosen on a whim. In Python 1.x, if you knew your lists were the same length, you used map with None as the function. (Well, usually you just looped over range(len(first_list)), but if you wanted to be all Lispy, you used map.) But if you didn’t know the lists were the same length, you couldn’t (because map had “longest” behavior, with an unchangeable fillvalue of None, until 3.0). If that didn’t actually come up for people even in Python 1.x, nobody would have asked for it in 2.0.
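For readers who have not used both behaviours side by side, here is a minimal Python 3 sketch of the "shortest" versus "longest" distinction described above. It uses `itertools.zip_longest` as the modern stand-in for the old `map(None, ...)` padding; the sample data is invented for illustration.

```python
from itertools import zip_longest

numbers = [1, 2, 3]
letters = "ab"

# zip() stops at the shortest input: the trailing 3 is silently dropped.
shortest = list(zip(numbers, letters))

# zip_longest() pads the shorter input instead (fillvalue defaults to None).
longest = list(zip_longest(numbers, letters))

print(shortest)  # [(1, 'a'), (2, 'b')]
print(longest)   # [(1, 'a'), (2, 'b'), (3, None)]
```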
+1. I implemented my own zip (because exceptions[1]) and it's so easy to accidentally have length-related bugs everywhere because your tests are just stopping short with no error.
[1] https://creationix.github.io/http-clone/?https://soniex2.autistic.space/git-...
On 2020-04-20 2:42 p.m., Ram Rachum wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
This is useful in cases where you're assuming that the iterables all have the same lengths. When your assumption is wrong, you currently just get a shorter result, and it could take you a while to figure out why it's happening.
What do you think?
On Apr 20, 2020, at 10:42, Ram Rachum <ram@rachum.com> wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
One quick bikeshedding question (which also gets to the heart of how you’d want to implement it); apologies if this came up in the thread from 2 years ago or the discussion in the more-itertools PR that I just suggested everyone should read before commenting, but I wanted to get this down before I forget it.
    x = iter(range(5))
    y = [0]
    try:
        zipped = zip(x, y, strict=True)
    except ValueError: # assuming that’s the exception you want?
        print(next(x))
Should this print 1 or 2 or raise StopIteration or be a don’t-care? Should it matter if you zip(y, x, strict=True) instead?
Good point. It would have to be dependent on position. In other words, you would never pass an iterator into zip with any expectation that it would be in a usable condition by the time it's done. Actually, I can't think of any current scenario in which someone would want to do this, with the existing zip logic.
On Mon, Apr 20, 2020, 23:34 Andrew Barnert <abarnert@yahoo.com> wrote:
On Apr 20, 2020, at 10:42, Ram Rachum <ram@rachum.com> wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
One quick bikeshedding question (which also gets to the heart of how you’d want to implement it); apologies if this came up in the thread from 2 years ago or the discussion in the more-itertools PR that I just suggested everyone should read before commenting, but I wanted to get this down before I forget it.
    x = iter(range(5))
    y = [0]
    try:
        zipped = zip(x, y, strict=True)
    except ValueError: # assuming that’s the exception you want?
        print(next(x))
Should this print 1 or 2 or raise StopIteration or be a don’t-care?
Should it matter if you zip(y, x, strict=True) instead?
On Apr 20, 2020, at 13:49, Ram Rachum <ram@rachum.com> wrote:
Good point. It would have to be dependent on position. In other words, you would never pass an iterator into zip with any expectation that it would be in a usable condition by the time it's done.
Actually, I can't think of any current scenario in which someone would want to do this, with the existing zip logic.
Admittedly, such cases are almost surely not that common, but I actually have some line-numbering code that did something like this (simplified a bit from real code):
    yield from enumerate(itertools.chain(headers, [''], body, ['']))
… but then I needed to know how many lines I yielded, and there’s no way to get that from enumerate, so instead I had to do this:
    counter = itertools.count()
    yield from zip(counter, itertools.chain(headers, [''], body, ['']))
    lines = next(counter)
(Actually, at the same time I did that, I also needed to add some conditional bits to the chain, and it got way too messy for one line, so I ended up rewriting it as a sequence of separate `yield from zip(counter, things)` statements. But that’s just a more complicated demonstration of the same idea.)
But again, this probably isn’t very common. And also, while you were asking about the existing zip logic, the more important question is the new logic you’re proposing. I can’t imagine a case where you’d want to check for non-empty and _then_ use it, which is what’s relevant here. There probably are such cases, but if so, they’re even rarer, enough so that the fact that you have to wrap something in itertools.tee or more_itertools.peekable to pull it off (or just not use the new strict=True/zip_strict/zip_equal) is probably not a great tragedy.
On Mon, Apr 20, 2020 at 03:28:09PM -0700, Andrew Barnert via Python-ideas wrote:
Admittedly, such cases are almost surely not that common, but I actually have some line-numbering code that did something like this (simplified a bit from real code):
    yield from enumerate(itertools.chain(headers, [''], body, ['']))
… but then I needed to know how many lines I yielded, and there’s no way to get that from enumerate, so instead I had to do this:
Did you actually need to "yield from"? Unless your caller was sending values into the enumerate iterable, which as far as I know enumerate doesn't support, "yield from" isn't necessary.
    for t in enumerate(itertools.chain(headers, [''], body, [''])):
        yield t
    lines = t[0]
    counter = itertools.count()
    yield from zip(counter, itertools.chain(headers, [''], body, ['']))
    lines = next(counter)
That gives you one more than the number of lines yielded. -- Steven
On Apr 20, 2020, at 17:22, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 20, 2020 at 03:28:09PM -0700, Andrew Barnert via Python-ideas wrote:
Admittedly, such cases are almost surely not that common, but I actually have some line-numbering code that did something like this (simplified a bit from real code):
    yield from enumerate(itertools.chain(headers, [''], body, ['']))
… but then I needed to know how many lines I yielded, and there’s no way to get that from enumerate, so instead I had to do this:
Did you actually need to "yield from"? Unless your caller was sending values into the enumerate iterable, which as far as I know enumerate doesn't support, "yield from" isn't necessary.
True. Using yield from is more efficient, more composable, and usually (but not here) more concise and readable, but none of those are relevant to my example (or the real code). I suppose it’s just a matter of habit to reach for yield from before a loop over yield even in cases where it doesn’t matter much.
    counter = itertools.count()
    yield from zip(counter, itertools.chain(headers, [''], body, ['']))
    lines = next(counter)
That gives you one more than the number of lines yielded.
Yeah, I screwed that up in simplifying the real code without testing the result. And your version gives one _less_ than the number yielded. (With either enumerate(xs) or zip(counter, xs) the last element will be (len(xs)-1, xs[-1]). Your version has the additional problem that if the iterable is empty, t is not off by one but unbound (or bound to some stale old value)—but that’s not possible in my example, and probably not in most similar examples. Both are easy to fix in practice, but both (as we just demonstrated) even easier to get wrong the first time, like all fencepost errors. Maybe it would be better to use an undoable/peekable/tee wrapper after all, but without writing it out I’m not sure that wouldn’t be just as fencepostable… Anyway, that’s exactly why I want to make sure the fencepost behavior is actually defined for this new proposal. Any reasonable answer is probably fine; people probably won’t run into wanting the leftovers, but if they ever do, as long as the docs say what should be there, they’ll work it out. That, and the implementation constraint. If everyone were convinced that the only reasonable answer is to fully consume all inputs on error, that would be a bit of a problem, so it’s worth making sure nobody is convinced of that.
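The off-by-one in the `counter` version comes from the order in which zip pulls from its arguments. A small sketch, using made-up sample lines rather than the real code discussed here, shows how the position of the counter changes what is left in it afterwards:

```python
import itertools

lines = ["header", "", "body", ""]  # invented sample data

# Counter first: zip() advances the counter before it discovers that
# `lines` is exhausted, so one extra number is consumed and discarded.
counter = itertools.count()
pairs = list(zip(counter, lines))
after_counter_first = next(counter)

# Lines first: zip() stops as soon as `lines` is exhausted and never
# touches the counter again, so the next number equals the line count.
counter = itertools.count()
pairs_swapped = list(zip(lines, counter))
after_lines_first = next(counter)

print(len(pairs), after_counter_first, after_lines_first)  # 4 5 4
```

Swapping the arguments also swaps the order inside each yielded tuple, so a caller that needs (number, line) pairs would have to reorder them.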
On Mon, Apr 20, 2020 at 07:47:51PM -0700, Andrew Barnert wrote:
    counter = itertools.count()
    yield from zip(counter, itertools.chain(headers, [''], body, ['']))
    lines = next(counter)
That gives you one more than the number of lines yielded.
Yeah, I screwed that up in simplifying the real code without testing the result. And your version gives one _less_ than the number yielded.
No, my version repeats the last number yielded, which is precisely what you wanted (as I understand it). See below.
    py> def test():
    ...     headers = body = ''
    ...     for t in enumerate(itertools.chain(headers, [''], body, [''])):
    ...         yield t
    ...     print(t[0])
    ...
    py> list(test())
    1
    [(0, ''), (1, '')]
(With either enumerate(xs) or zip(counter, xs) the last element will be (len(xs)-1, xs[-1]).
Um, yes? That's because both enumerate and counter start from zero by default. I would have asked you why you were counting your lines starting from zero instead of using `enumerate(xs, 1)` but I thought that was intentional.
Your version has the additional problem that if the iterable is empty, t is not off by one but unbound (or bound to some stale old value)—but that’s not possible in my example, and probably not in most similar examples.
But the iterable is never empty, because you always yield at least two blanks.
Anyway, that’s exactly why I want to make sure the fencepost behavior is actually defined for this new proposal. Any reasonable answer is probably fine; people probably won’t run into wanting the leftovers, but if they ever do, as long as the docs say what should be there, they’ll work it out.
Like you worked out the behaviour of counter and zip? *wink* I think you were overthinking it. The simplest, foolproof way to get the number of items yielded is to count them with enumerate starting with 1:
    count = 0
    for count, item in enumerate(something, 1):
        yield item
    print(count)
I don't believe this zip_strict proposal would help you in this situation. I think it will make it worse, because it will encourage people to use this anti-pattern:
    seq = list(something)
    for count, item in zip_strict(range(1, len(seq)+1), seq):
        yield item
    print(count)
"just to be sure".
-- Steven
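As a runnable version of the enumerate-from-1 pattern above, here is a small sketch. The names `count_items` and `totals` are hypothetical, and the count is appended to a list instead of printed so that it can be inspected afterwards:

```python
totals = []  # hypothetical place to record each final count


def count_items(something):
    """Yield every item, then record how many were yielded."""
    count = 0  # stays 0 when `something` is empty
    for count, item in enumerate(something, 1):
        yield item
    totals.append(count)


print(list(count_items("abc")))  # ['a', 'b', 'c']
print(list(count_items("")))     # []
print(totals)                    # [3, 0]
```

Initialising `count` before the loop is what keeps the empty case from leaving the name unbound.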
On Apr 21, 2020, at 19:35, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 20, 2020 at 07:47:51PM -0700, Andrew Barnert wrote:
    counter = itertools.count()
    yield from zip(counter, itertools.chain(headers, [''], body, ['']))
    lines = next(counter)
That gives you one more than the number of lines yielded.
Yeah, I screwed that up in simplifying the real code without testing the result. And your version gives one _less_ than the number yielded.
No, my version repeats the last number yielded, which is precisely what you wanted (as I understand it).
No, I wanted the number of lines yielded. You not only quoted that, but directly claimed that you were giving the number of lines yielded. But you’re not; you’re giving me the number of the last line, which is 1 less than that.
    py> def test():
    ...     headers = body = ''
    ...     for t in enumerate(itertools.chain(headers, [''], body, [''])):
    ...         yield t
    ...     print(t[0])
    ...
    py> list(test())
    1
    [(0, ''), (1, '')]
Right. The number of pairs yielded is 2. Your code prints 1.
(With either enumerate(xs) or zip(counter, xs) the last element will be (len(xs)-1, xs[-1]).
Um, yes? That's because both enumerate and counter start from zero by default. I would have asked you why you were counting your lines starting from zero instead of using `enumerate(xs, 1)` but I thought that was intentional.
You were right, counting from 0 was intentional. Just as it is almost everywhere in Python. The caller needs those line numbers; otherwise I wouldn’t be yielding them in the first place. And that’s why your solution is wrong: you correctly left it counting from 0, but then incorrectly assumed that the last number equals the count, which is only true when counting from 1. If that’s not a classic fencepost error, I don’t know what is. And my originally-posted version has a different fencepost error, as you pointed out. And my real code doesn’t, but I may well have made one and had to spend a minute debugging it. Nontrivial counting code often has fencepost errors, and Python only eliminates the sources that come up often, not every possible one that might come up rarely, which is fine. And this proposal doesn’t change that in any way, nor is it meant to.
Your version has the additional problem that if the iterable is empty, t is not off by one but unbound (or bound to some stale old value)—but that’s not possible in my example, and probably not in most similar examples.
But the iterable is never empty, because you always yield at least two blanks.
Yes; I said “but that’s not possible in my example”, as you quoted directly above.
I don't believe this zip_strict proposal would help you in this situation. I think it will make it worse,
Well, of course. Since it wasn’t an argument for the proposal, but an example pointing out a potential hole in the proposal that needed to be thought through, why would you expect the proposal to help it? To recap: Someone had said that it doesn’t matter what state the iterables are left in, because nobody ever looks at an iterator after zip. So I gave an example of (simplified) real code that looks at an iterator after zip. So people thought through what state the iterables should be left in by this new zip_strict function, and there is a reasonable answer. Even if your arguments about this example were correct, they wouldn’t be relevant to the thread, because the entire purpose of giving the example has been fulfilled.
    x = iter(range(5))
    y = [0]
    try:
        zipped = zip(x, y, strict=True)
    except ValueError: # assuming that’s the exception you want?
        print(next(x))
Should this print 1 or 2 or raise StopIteration or be a don’t-care?
Surely no exception is raised because zip is lazy? Doesn't it still have to be even with strict=True?
Alex Hall wrote:
Surely no exception is raised because zip is lazy?
Ack, you're right. The same problem would come up wherever you actually _use_ the zip, of course, but it's harder to demonstrate and reason about. So change that toy example to `zipped = list(zip(x, y, strict=True))`. (Fortunately, it looks like Ram got what I intended despite my mistake.)
Doesn't it still have to be even with strict=True?
Well, I suppose technically it doesn't _have_ to be, but it certainly _should_ be. (Although it's a bit weird to say "it should be lazy even with `strict=True`" out loud; maybe that's a mild argument for using a different qualifier like `equal`, as in more-itertools?)
20.04.20 23:33, Andrew Barnert via Python-ideas пише:
Should this print 1 or 2 or raise StopIteration or be a don’t-care?
Should it matter if you zip(y, x, strict=True) instead?
It should print 2 in both cases. The only way to determine whether the iterator ends is to try to get its next value. And this value (1) will be lost, because there is no way to return it or "unput" it to the iterator. There is no reason to consume more values, so StopIteration is irrelevant.
There is a more interesting example:
    x = iter(range(5))
    y = [0]
    z = iter(range(5))
    try:
        zipped = list(zip(x, y, z, strict=True))
    except ValueError: # assuming that’s the exception you want?
        assert zipped == [(0, 0, 0)]
        assert next(x) == 2
        print(next(z))
Should this print 1 or 2? The simple implementation using zip_longest() would print 2, but a more optimal implementation can print 1.
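To make the consumption question concrete, here is a minimal pure-Python sketch of one possible strict zip. The name `zip_strict` and the error messages are hypothetical; this is not the proposed builtin, only one way the semantics could work. It stops as soon as a later input runs dry, which corresponds to the "print 1" outcome for `z`, and the demonstration collects results in an explicit loop so that `zipped` exists when the error is caught.

```python
def zip_strict(*iterables):
    """Hypothetical sketch of a strict zip: raise ValueError on a length mismatch."""
    iterators = [iter(iterable) for iterable in iterables]
    if not iterators:
        return
    while True:
        items = []
        for position, iterator in enumerate(iterators):
            try:
                items.append(next(iterator))
            except StopIteration:
                if position > 0:
                    # Earlier inputs already produced a value this round,
                    # so this input is too short. Stop immediately.
                    raise ValueError("an iterable ended early") from None
                # The first input is done; each remaining input must be
                # done too, which costs one next() call per input.
                for other in iterators[1:]:
                    try:
                        next(other)
                    except StopIteration:
                        continue
                    raise ValueError("an iterable is too long") from None
                return
        yield tuple(items)


# Building the object raises nothing, because generators are lazy.
lazy = zip_strict(iter(range(5)), [0])

# The three-input example, consumed with an explicit loop.
x, y, z = iter(range(5)), [0], iter(range(5))
zipped = []
try:
    for item in zip_strict(x, y, z):
        zipped.append(item)
except ValueError:
    leftovers = (next(x), next(z))

print(zipped, leftovers)  # [(0, 0, 0)] (2, 1)
```

With two inputs this sketch leaves `next(x)` at 2 in either argument order, since the value 1 is always pulled and discarded before the mismatch is detected.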
On Tue, Apr 21, 2020 at 11:36 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
20.04.20 23:33, Andrew Barnert via Python-ideas пише:
Should this print 1 or 2 or raise StopIteration or be a don’t-care?
Should it matter if you zip(y, x, strict=True) instead?
It should print 2 in both cases. The only way to determine whether the iterator ends is to try to get its next value. And this value (1) will be lost, because there is no way to return it or "unput" it to the iterator. There is no reason to consume more values, so StopIteration is irrelevant.
There is a more interesting example:
    x = iter(range(5))
    y = [0]
    z = iter(range(5))
    try:
        zipped = list(zip(x, y, z, strict=True))
    except ValueError: # assuming that’s the exception you want?
        assert zipped == [(0, 0, 0)]
        assert next(x) == 2
        print(next(z))
Should this print 1 or 2?
The simple implementation using zip_longest() would print 2, but a more optimal implementation can print 1.
Your first assert is wrong. I think it should print 1 (i.e. raise the exception immediately when the first iterator is too short) but I don't feel too strongly about this, so if people are pushing towards 2, I'm okay with that. If there's agreement to move forward on this suggestion, here's what I volunteer to do: 1. Write tests. 2. Write documentation. I'm gonna need someone else to write the implementation.
21.04.20 11:49, Ram Rachum пише:
There is a more interesting example:
    x = iter(range(5))
    y = [0]
    z = iter(range(5))
    try:
        zipped = list(zip(x, y, z, strict=True))
    except ValueError: # assuming that’s the exception you want?
        assert zipped == [(0, 0, 0)]
        assert next(x) == 2
        print(next(z))
Should this print 1 or 2?
The simple implementation using zip_longest() would print 2, but a more optimal implementation can print 1.
Your first assert is wrong.
Oh, right. zipped is not set when an exception is raised. It would be correct if the code were rewritten:
    zipped = []
    for item in zip(x, y, z, strict=True):
        zipped.append(item)
Sure, but I think cases where you want that assumption _checked_ are a lot less common. There are lots of postconditions that you assume just as often as “x, y, and z are fully consumed” and just as rarely want to check, so we don’t need to make it easy to check every possible one of them.
It seems that our experiences differ rather significantly. This isn't a "rare" assumption for me, and it's unique because it's one that `zip` already handles internally.
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a strict argument is added, few people will bother adding it.
I know that I, and everyone on my team, would use it frequently!
A possible solution is to introduce zip_shortest() with the current behavior of zip(), make zip() emit a pending deprecation warning when some data is ignored, and after a long deprecation period make it raise an exception if some data is ignored.
Unlike some on this thread, I think the default behavior for `zip` is fine. It's not broken, and it *should* be able to handle infinite iterables by default. This isn't just a "band-aid" fix; it's a feature that allows many (if not most) call sites to check an important assumption that's easy to check inside zip (there's literally logic already handling this case) but heavy to check at every call site.
I'm gonna need someone else to write the implementation.
I'll take care of that. Feel free to reach out to me off list to coordinate. :) Since this has received some degree of support here, I'll go ahead and open a BPO issue. For anyone wondering about semantics, iterator consumption *should* be the same as any old `zip` usage... it seems obvious to me just to raise a ValueError instead of a StopIteration when the option is enabled. If there are arguments against it, let's take it up on BPO!
On Tue, 21 Apr 2020 at 15:49, Brandt Bucher <brandtbucher@gmail.com> wrote:
Of course, the fact that zip() is the shorter form that everyone is used to means that, even if a strict argument is added, few people will bother adding it.
I know that I, and everyone on my team, would use it frequently!
To be clear - would you catch the error in your code? What would you do when it was raised? Or are you simply wanting, in effect, an assert when some iterables remain unexhausted? Because I can imagine that wanting builtins to assert when their preconditions are untrue is something that might be considered a more common desire - but it's a much wider change in design philosophy than just zip(). Paul
On Tue, 21 Apr 2020 16:05:54 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
To be clear - would you catch the error in your code? What would you do when it was raised? Or are you simply wanting, in effect, an assert when some iterables remain unexhausted? Because I can imagine that wanting builtins to assert when their preconditions are untrue is something that might be considered a more common desire - but it's a much wider change in design philosophy than just zip().
Picking a nit, this has to be a post condition rather than a precondition, because you can't tell how long an iterator is without consuming it. But at that point, why would zip have a post condition involving the caller's iterables? -- “Atoms are not things.” – Werner Heisenberg Dan Sommers, http://www.tombstonezero.net/dan
On Tue, 21 Apr 2020 at 16:28, Dan Sommers <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
On Tue, 21 Apr 2020 16:05:54 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
To be clear - would you catch the error in your code? What would you do when it was raised? Or are you simply wanting, in effect, an assert when some iterables remain unexhausted? Because I can imagine that wanting builtins to assert when their preconditions are untrue is something that might be considered a more common desire - but it's a much wider change in design philosophy than just zip().
Picking a nit, this has to be a post condition rather than a precondition, because you can't tell how long an iterator is without consuming it.
No, I view it as a precondition. All of the supplied iterables must be the same length. And yes, that's precisely the issue, you can't actually check that precondition, so you have to assert later when you find that it was violated. My assumption is that people wanting "strict" behaviour are actually just looking for a way to error out if that precondition is violated, and because they can't check in advance, they need the zip function to complain if one of the inputs ends early. Hence my suggestion that maybe it's not so much an (actionable) exception that people want as an assertion. Paul
On Tue, 21 Apr 2020 at 17:53, Serhiy Storchaka <storchaka@gmail.com> wrote:
21.04.20 19:35, Paul Moore пише:
Hence my suggestion that maybe it's not so much an (actionable) exception that people want as an assertion.
What do you mean by assertion? Raising an AssertionError? Crashing the program?
See the original question I asked - I suspect that people asking for this feature don't ever imagine catching the exception (or at least, not for any other reason than to terminate the program saying "this should not have happened") so it's more of a "cannot happen" type of check than an exception in the more general sense that something like a ValueError would be intended. But it's not a big distinction, merely one of intent. And as yet, the person who claimed that wanting an exception from zip was an overwhelmingly common case for them hasn't replied yet to my question - so I may be completely wrong. Paul
But it's not a big distinction, merely one of intent. And as yet, the person who claimed that wanting an exception from zip was an overwhelmingly common case for them hasn't replied yet to my question - so I may be completely wrong.
Well, hopefully I get more than 2 working hours to reply to these threads! ;)
I suspect that people asking for this feature don't ever imagine catching the exception (or at least, not for any other reason than to terminate the program saying "this should not have happened")...
Honestly, it will likely be a blend of both. Most will be "this should not have happened", but sometimes I actually want it handled at runtime, where I either:
- Get the data from my caller, and want to propagate the `ValueError`.
- Want to handle differently-sized data as a special case (returning `False`, for example). This is rare, and probably better handled by `zip_longest`.
I also don't think that using `zip_longest` is as "obvious" as it seems to many of us. In researching this, I found many cases in the stdlib `ast` module where `zip` is throwing away input that should be raising. And I'm sure I'm not the only one who has used it without setting a `fillvalue`, and forgotten that my iterables could contain `None`.
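The `fillvalue` pitfall mentioned above is usually avoided with a unique sentinel object. A minimal sketch of that workaround follows; the names `zip_equal` and `_MISSING` are hypothetical, chosen only for this illustration.

```python
from itertools import zip_longest

_MISSING = object()  # unique sentinel; cannot collide with real data such as None


def zip_equal(*iterables):
    """Hypothetical helper: zip() that raises ValueError on unequal lengths."""
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in row):
            raise ValueError("iterables have different lengths")
        yield row


values = [1, None, 3]  # a legitimate None in the data is not mistaken for padding
print(list(zip_equal("abc", values)))  # [('a', 1), ('b', None), ('c', 3)]
```

Because it relies on `zip_longest`, this version pulls one more item from every input that is still running when a shorter one ends, so it is a sketch of the "heavy" wrapper rather than of the proposed flag.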
...so it's more of a "cannot happen" type of check than an exception in the more general sense that something like a ValueError would be intended.
Not sure I agree with this. There's more than one way to look at it of course, but I see this as rejecting malformed input. This is an opportunity for a very simple, lightweight change (only a handful of lines) which gives us a clear usability win. And there's no need to maintain a whole new zip object.
On Wed, Apr 22, 2020 at 12:47 AM Brandt Bucher <brandtbucher@gmail.com> wrote:
Unlike some on this thread, I think the default behavior for `zip` is fine. It's not broken, and it *should* be able to handle infinite iterables by default. This isn't just a "band-aid" fix; it's a feature that allows many (if not most) call sites to check an important assumption that's easy to check inside zip (there's literally logic already handling this case) but heavy to check at every call site.
In terms of shed colour, I think the proposed function would fit very nicely alongside zip_longest in itertools. If people don't like the default behaviour, it's easy to then change it: "from itertools import zip_shortest as zip". (Or "zip_strict" or whatever the name ends up being.) Easier that way than if it's a mode-switch keyword argument. ChrisA
Have you written your own version that does this check for input iterable equality?
No, unfortunately. As I've mentioned before, it just feels too heavy to wrap all of these nice builtin zip objects to check things that "must" be true. If this option were available, I would certainly enable it in the places I didn't care enough to be defensive before. Maybe this is an argument against adding the functionality, but I see it as an argument for. We already perform most of the logic for this check internally, so why force users to make a (possibly buggy) wrapper that effectively does the same thing?
On Tue, Apr 21, 2020 at 02:33:04PM -0000, Brandt Bucher wrote:
Sure, but I think cases where you want that assumption _checked_ are a lot less common. There are lots of postconditions that you assume just as often as “x, y, and z are fully consumed” and just as rarely want to check, so we don’t need to make it easy to check every possible one of them.
It seems that our experiences differ rather significantly. This isn't a "rare" assumption for me,
Great! Since this is not a rare assumption for you, then you should have no trouble coming up with some concrete examples of how and when you would use it. Because so far in this thread, I've seen no concrete examples, and I cannot think of any. It's not like this has obvious usefulness.
and it's unique because it's one that `zip` already handles internally.
I don't think it does. Here is an obvious case where zip does not check that the iterators are of equal length. Which it could only do by consuming items *after* it hits an empty iterator. Which it does not do:
    py> a = iter([])
    py> b = iter([1])
    py> L = list(zip(a, b))
    py> next(b)
    1
So it seems to me that you are incorrect, zip does not already handle the case of unequal iterators internally.
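The same experiment with the arguments swapped shows that what plain zip leaves behind depends on position. This is a small sketch using the same two toy iterators:

```python
# Empty iterator first: zip() stops before it ever touches b.
a = iter([])
b = iter([1])
first_result = list(zip(a, b))
left_in_b = next(b)

# Empty iterator second: zip() pulls the 1 out of b, then finds a empty,
# and the 1 is discarded.
a = iter([])
b = iter([1])
second_result = list(zip(b, a))
left_in_b_swapped = next(b, "exhausted")

print(first_result, left_in_b)           # [] 1
print(second_result, left_in_b_swapped)  # [] exhausted
```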
I know that I, and everyone on my team, would use it frequently!
Use it frequently for what? "We would use it!" is not a use-case. How would you use it? Reading ahead in this thread, I get the impression that you want this to *verify* that your inputs are the same length. If I'm right, then you aren't actually planning on catching and recovering from the exception raised. It's hard to see how you could recover from an exception other than to log the "failure" and then
- kill the application; or
- proceed as if the truncated zip was all the data you had.
The first option, it seems to me, makes for a user-hostile experience. The second, it seems to me, is precisely what zip() does now, without the logging of the "failure". Either way, I don't see what logging the failure gives you. But maybe I don't understand what you intend to do with this feature.
By the way, with zero use-cases for this feature so far, I think it is way too early to be opening a bpo issue.
-- Steven
Since this is not a rare assumption for you, then you should have no trouble coming up with some concrete examples of how and when you would use it.
Well, my use cases are "anytime the iterables should be the same length", which several on this thread have agreed is easily the majority of the time they use `zip`. So coming up with examples is more like being asked "When do you use `zip` with what should be equal-length iterables?", or "When do you use the `int` constructor with its default `base=10`?", or "When do you want the RHS values of a `dict.__ior__` to overwrite the LHS on conflicting keys?". Uh... most of the time I use each of those functions? :)
As I've said several times, zip's default behavior is a Good Thing, but silently omitting data is a big part of that, and it's dangerous enough that I think a straightforward check for "valid" input would be very, very valuable. I understand that your experience differs, though, so here's a handful of situations that would be excellent candidates for the new feature:
1. Likely the most common case, for me, is when I have some data and want to iterate over both it and a calculated pairing:
>>> x = ["a", "b", "c", "d"]
>>> y = iter_apply_some_transformation(x)
>>> for a, b in zip(x, y):
...     ...  # Do something.
...
This can be extrapolated to many more cases, where x and/or y are constants, iterators, calculated from each other, calculated individually, passed as arguments, etc. I've written most of them in production code, and in every case, mismatched lengths are logic errors that should "never" happen. A `ValueError` which (a) kills the job, (b) logs the bad data and proceeds to the next job, or (c) alerts me to my error in an interactive session or unit/property test is always better than a silently incomplete result. 2. This is less-well-known, but you can lazily unzip/"transpose" nested iterables by unpacking into `zip`. I've seen it suggested many times on StackOverflow:
>>> x = iter((iter((0, 1, 2)), iter((3, 4, 5)), iter((6, 7, 8))))
>>> y = zip(*x)
>>> tuple(y)
((0, 3, 6), (1, 4, 7), (2, 5, 8))
It's clearly a logic error if one of the tuples in `x` is longer/shorter than the others, but this move would silently toss the data instead. 3. Just to show that this has visible effects in the stdlib: below is the AST equivalent of `eval("{'KEY WITH NO VALUE': }")`. The use of `zip` to implement `ast.literal_eval` silently throws away the bad key, instead of complaining with a `ValueError` (as it typically does for malformed or invalid input).
>>> from ast import Constant, Dict, literal_eval
>>> malformed = Dict(keys=[Constant("KEY WITH NO VALUE")], values=[])
>>> literal_eval(malformed)
{}
So it's not a difficult mistake to make.
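The silent truncation in the second case is easy to reproduce. The following is a minimal sketch with made-up data; `ragged` and `checked_transpose` are illustrative names, not anything proposed in this thread, and the check uses the sentinel-plus-`zip_longest` approach mentioned earlier:

```python
from itertools import zip_longest

# Hypothetical data: the middle row is one item short.
ragged = [(0, 1, 2), (3, 4), (6, 7, 8)]

# Plain zip() stops at the shortest row, so 2 and 8 vanish without any error.
truncated = list(zip(*ragged))


def checked_transpose(rows):
    """Transpose rows, raising ValueError if they differ in length."""
    missing = object()  # unique sentinel that cannot appear in the data
    columns = []
    for column in zip_longest(*rows, fillvalue=missing):
        if any(item is missing for item in column):
            raise ValueError("rows have different lengths")
        columns.append(column)
    return columns


print(truncated)
try:
    checked_transpose(ragged)
except ValueError as exc:
    print("caught:", exc)
```

Note that the checked version has to build the sentinel, the loop, and the error by hand, which is the friction being complained about.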
...it seems to me that you are incorrect, zip does not already handle the case of unequal iterators internally.
Yeah, I misspoke here. In my defense, though, I've made this point several times, and in those cases I was careful to note that it handles *most of* the logic (you're right that one additional iterator pull is required if the first iterable is the "short" one). My point was not that this change is a one-liner or something, but rather that it doesn't require significant changes to the `zip.__next__` logic and should be effectively no-overhead for strict and non-strict users alike in all but the most pathological cases. Even more important: it's dead-simple to read, write, understand, and maintain. I don't feel the same can be said of sentinels and zip_longest.
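To make the "one additional iterator pull" point concrete, here is a pure-Python sketch of a zip-like generator with an optional strict check. It is an illustration under my own assumptions (the name `zip_sketch` and the error messages are made up), not the actual C implementation of `zip.__next__`:

```python
def zip_sketch(*iterables, strict=False):
    """Pure-Python model of zip() with an optional equal-length check."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        items = []
        for i, iterator in enumerate(iterators):
            try:
                items.append(next(iterator))
            except StopIteration:
                if not strict:
                    return  # ordinary zip(): stop at the shortest input
                if i > 0:
                    # An earlier iterator already produced an item this round,
                    # so the lengths are known to differ without extra pulls.
                    raise ValueError(
                        f"argument {i + 1} is shorter than earlier arguments"
                    ) from None
                # The first iterator ran out: pull once from each of the
                # others to confirm that they are exhausted as well.
                for j, other in enumerate(iterators[1:], start=2):
                    try:
                        next(other)
                    except StopIteration:
                        continue
                    raise ValueError(
                        f"argument {j} is longer than argument 1"
                    ) from None
                return
        yield tuple(items)
```

The non-strict path is unchanged; the strict path only does extra work on the final round, when the first iterable runs out.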
I know that I, and everyone on my team, would use it frequently! Use it frequently for what? "We would use it!" is not a use-case. How would you use it?
Well, the comment I was replying to wasn't asking for a use-case, or even arguing against the proposal. They were just asserting that few would use it if it wasn't the default behavior. I felt the need to speak up that, yes, *most* of the engineers I interact with regularly certainly would. But now it's been quoted more times than any of my actual comments on the proposal, so shame on me for the poor wording which doesn't quote well. For use-cases, see above.
On Apr 24, 2020, at 11:07, Brandt Bucher <brandtbucher@gmail.com> wrote:
1. Likely the most common case, for me, is when I have some data and want to iterate over both it and a calculated pairing:
>>> x = ["a", "b", "c", "d"]
>>> y = iter_apply_some_transformation(x)
>>> for a, b in zip(x, y):
...     ...  # Do something.
...
Your other examples are a lot more compelling. I can easily imagine actually being bitten by zip(*ragged_iterables_that_I_thought_were_rectangular) and having a hard time debugging that, and the other one is an actual bug in actual code, which is even harder to dismiss. I think this one, on the other hand, is exactly what I think doubters are imagining. I can easily imagine cases where you want to zip together two obviously-equal iterables, but when they’re obviously equal, adding a check for that is hardly the first thing I’d think about defending against. (For example, things like using “spam eggs cheese”.strip() instead of .split() as the input are more common logic errors and even less fun to debug…) And that’s why people keep asking for examples—because the proponents of the change keep talking as if there are examples like your 2 and 3 where everyone would agree that there’s a significant benefit to making it easier to be defensive, but the wary conservatives are only imagining examples like your 1. Anyway, if I’m right, I think you just solved that problem, and now everyone can stop talking past each other. (Although the couple of people who suggested wanting to _handle_ the error as a normal case rather than treating it as a logic error to debug like your examples still need to give use cases if they want anything different than what you want.)
On 25/04/20 6:07 am, Brandt Bucher wrote:
Well, my use cases are "anytime the iterables should be the same length", which several on this thread have agreed is easily the majority of the time they use `zip`.
If zip were being designed today, I'm sure most people wouldn't mind if it were always strict or strict by default. But given how it is, how many people would care enough to go out of their way to pass an extra argument or import a special version of zip to get strict behaviour? -- Greg
If zip were being designed today, I'm sure most people wouldn't mind if it were always strict or strict by default.
Again, I'd like to stress that I think the current default behavior is fine. I have no desire to change it, even over an extended deprecation period.
But given how it is, how many people would care enough to go out of their way to pass an extra argument...
In response to this specific question, I'll again say that I know that I, and everyone on my team, would use it. ;)
...or import a special version of zip to get strict behaviour?
Honestly, I would be much less likely to use this. Passing a boolean keyword argument is much lighter than importing something from somewhere to wrap a builtin purely as a defensive measure. As proof, I don't have any "toolbox" version of a strict zip, even though it's not very hard to make.
For example, I often pass `sep='\t'` to the built-in `print` function. I'd probably sooner just use `print('\t'.join(map(str, args)))` than import `print_tab_sep` from somewhere, even if it makes the call site cleaner. Not 100% sure why, but I think it just comes down to friction. I like using the builtins and don't like importing (especially in interactive sessions)... but maybe (probably) that's just me. :)
I can't speak for any larger group, but I'm almost certain that users in general would be much more enthusiastic to use *either* option than to roll their own using a sentinel and `zip_longest`.
On Fri, Apr 24, 2020 at 06:07:06PM -0000, Brandt Bucher wrote:
1. Likely the most common case, for me, is when I have some data and want to iterate over both it and a calculated pairing:
>>> x = ["a", "b", "c", "d"]
>>> y = iter_apply_some_transformation(x)
>>> for a, b in zip(x, y):
...     ...  # Do something.
...
This can be extrapolated to many more cases, where x and/or y are constants, iterators, calculated from each other, calculated individually, passed as arguments, etc.
It won't work with iterators unless you use tee().
py> x = iter('abc')
py> y = map(str.upper, x)
py> for t in zip(x, y):
...     print(*t)
...
a B
I've written most of them in production code, and in every case, mismatched lengths are logic errors that should "never" happen.
Sounds like this ought to be an assertion that can be disabled. And ValueError is, semantically, the wrong exception: it's a *logic error* in your code, as you say, so it ought to be AssertionError.
People who know me, or read my code, know that I love `assert`. I use it a lot. According to some people, too much. But even I would struggle to justify using an assert for the code snippet you have above:
    x = ["a", "b", "c", "d"]
    y = iter_apply_some_transformation(x)
    # Figurative, not literal:
    assert len(x) == len(y)
I suppose it might be justified to put a post-condition into `iter_apply_some_transformation` to check that it returns the same number of items that it had been fed in, and if it were a complex transformation that might be justified. But for a straight-forward map-like transformation:
    def iter_apply_some_transformation(x):
        for item in x:
            yield something(item)
then it is *obviously true* that the length of the output is the length of the input and you don't need an assertion to check it. This is especially obvious when written as a comprehension:
    x = ["a", "b", "c", "d"]
    y = (transform(z) for z in x)
So I think this is a weak example. It might justify a function in itertools, which could be a simple wrapper around zip_longest, but not a flag argument on builtin zip.
2. This is less-well-known, but you can lazily unzip/"transpose" nested iterables by unpacking into `zip`. I've seen it suggested many times on StackOverflow:
>>> x = iter((iter((0, 1, 2)), iter((3, 4, 5)), iter((6, 7, 8))))
>>> y = zip(*x)
>>> tuple(y)
((0, 3, 6), (1, 4, 7), (2, 5, 8))
Yes, that's a moderately common functional idiom, unzip(). It's also sometimes called demuxing.
It's clearly a logic error if one of the tuples in `x` is longer/shorter than the others, but this move would silently toss the data instead.
It's not clear to me that it is a logic error rather than bad data (the caller's responsibility, not zip's). And if it's bad data, then truncating the data is at least as reasonable an approach as killing the application. (In my opinion, a much better approach.)
3. Just to show that this has visible effects in the stdlib: below is the AST equivalent of `eval("{'KEY WITH NO VALUE': }")`. The use of `zip` to implement `ast.literal_eval` silently throws away the bad key, instead of complaining with a `ValueError` (as it typically does for malformed or invalid input).
>>> from ast import Constant, Dict, literal_eval
>>> malformed = Dict(keys=[Constant("KEY WITH NO VALUE")], values=[])
>>> literal_eval(malformed)
{}
Okay, that's a good example. I too expect that it ought to complain rather than silently drop malformed code. -- Steven
On Apr 21, 2020, at 01:36, Serhiy Storchaka <storchaka@gmail.com> wrote:
20.04.20 23:33, Andrew Barnert via Python-ideas пише:
Should this print 1 or 2 or raise StopIteration or be a don’t-care? Should it matter if you zip(y, x, strict=True) instead?
It should print 2 in both cases. The only way to determine whether the iterator ends is to try to get its next value. And this value (1) will be lost, because there is no way to return it or "unput" it to the iterator. There is no reason to consume more values, so StopIteration is irrelevant.
There is more interesting example:
    x = iter(range(5))
    y = [0]
    z = iter(range(5))
    try:
        zipped = list(zip(x, y, z, strict=True))
    except ValueError:  # assuming that’s the exception you want?
        assert zipped == [(0, 0, 0)]
        assert next(x) == 2
        print(next(z))
Should this print 1 or 2?
The simple implementation using zip_longest() would print 2, but more optimal implementation can print 1.
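The two possible answers can be demonstrated side by side. Below is a sketch with two hypothetical pure-Python strict zips (`strict_via_longest` and `strict_short_circuit` are names I made up): one wraps `zip_longest`, the other stops pulling as soon as one iterator runs out. The helper collects rows in a loop so that they are still available after the exception:

```python
from itertools import zip_longest


def strict_via_longest(*iterables):
    """Strict zip built on zip_longest: always completes the failing round."""
    missing = object()
    for row in zip_longest(*iterables, fillvalue=missing):
        if any(item is missing for item in row):
            raise ValueError("iterables have different lengths")
        yield row


def strict_short_circuit(*iterables):
    """Strict zip that stops pulling once a later iterator comes up short."""
    missing = object()
    iterators = [iter(it) for it in iterables]
    while iterators:
        row = []
        for position, iterator in enumerate(iterators):
            try:
                row.append(next(iterator))
            except StopIteration:
                # Clean finish only if the first iterator ran out and every
                # other iterator turns out to be exhausted too.
                if position == 0 and all(
                    next(other, missing) is missing for other in iterators[1:]
                ):
                    return
                raise ValueError("iterables have different lengths") from None
        yield tuple(row)


def remaining_after(strict_zip):
    """Run the example above; return the rows plus the next items of x and z."""
    x = iter(range(5))
    y = [0]
    z = iter(range(5))
    rows = []
    try:
        for row in strict_zip(x, y, z):
            rows.append(row)
    except ValueError:
        pass
    return rows, next(x), next(z)


print(remaining_after(strict_via_longest))    # z has lost its 1
print(remaining_after(strict_short_circuit))  # z still has its 1
```

In both versions the 1 from `x` is lost; they differ only in whether `z` is also advanced during the failing round.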
You’re right; that’s the question I should have asked; thanks. As I said, I think either answer is probably acceptable as long as it’s documented (and, therefore, it’s also clear that the consequences have been thought through).
On Apr 21, 2020, at 01:36, Serhiy Storchaka <storchaka@gmail.com> wrote:
except ValueError: # assuming that’s the exception you want?
For what it’s worth, more_itertools.zip_equal raises an UnequalIterablesError, which is a subclass of ValueError. I’m not sure whether having a special error class is worth it, but that’s because nobody’s providing any examples of code where they’d want to handle this error. Presumably there are cases where something else in the expression could raise a ValueError for a different reason, and being able to catch this one instead of that one would be worthwhile. But how often? No idea. At a guess, I’d say that if this has to be a builtin (whether flag-switchable behavior in zip or a new builtin function) it’s probably not worth adding a new builtin exception, but if it’s going to go into itertools it probably is worth it.
On Tue, Apr 21, 2020 at 12:25:06PM -0700, Andrew Barnert via Python-ideas wrote:
On Apr 21, 2020, at 01:36, Serhiy Storchaka <storchaka@gmail.com> wrote:
except ValueError: # assuming that’s the exception you want?
For what it’s worth, more_itertools.zip_equal raises an UnequalIterablesError, which is a subclass of ValueError.
I’m not sure whether having a special error class is worth it, but that’s because nobody’s providing any examples of code where they’d want to handle this error. Presumably there are cases where something else in the expression could raise a ValueError for a different reason, and being able to catch this one instead of that one would be worthwhile. But how often? No idea.
At a guess, I’d say that if this has to be a builtin (whether flag-switchable behavior in zip or a new builtin function) it’s probably not worth adding a new builtin exception, but if it’s going to go into itertools it probably is worth it.
Why? I know that the Python community has a love-affair with more-itertools, but I don't think that it is a well-designed library offering good APIs. It's a grab-bag of "everything including the kitchen sink". Just because they use a distinct exception doesn't mean we should follow them. -- Steven
On Apr 21, 2020, at 16:02, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:25:06PM -0700, Andrew Barnert via Python-ideas wrote:
On Apr 21, 2020, at 01:36, Serhiy Storchaka <storchaka@gmail.com> wrote: except ValueError: # assuming that’s the exception you want? For what it’s worth, more_itertools.zip_equal raises an UnequalIterablesError, which is a subclass of ValueError. I’m not sure whether having a special error class is worth it, but that’s because nobody’s providing any examples of code where they’d want to handle this error. Presumably there are cases where something else in the expression could raise a ValueError for a different reason, and being able to catch this one instead of that one would be worthwhile. But how often? No idea.
At a guess, I’d say that if this has to be a builtin (whether flag-switchable behavior in zip or a new builtin function) it’s probably not worth adding a new builtin exception, but if it’s going to go into itertools it probably is worth it.
Why?
Well, you quoted the answer above, but I’ll repeat it:
Presumably there are cases where something else in the expression could raise a ValueError for a different reason, and being able to catch this one instead of that one would be worthwhile. But how often? No idea.
For a little more detail: A few people (like Soni) keep trying to come up with general-purpose ways to differentiate exceptions better. The strong consensus is always that we don’t need any such thing, because in most cases, Python gives you just enough to differentiate what you actually need in most code. (That wasn’t quite true in Python 2, but it is now.) We have LookupError with subclasses KeyError and IndexError, but not additional subclasses IndexTooBigError and IndexTooSmallError, and so on. For the IOError subclasses, Python does kind of lean on C/POSIX, but that’s still good enough that it’s fine. The question in every case is: do you often need to distinguish this case? In this case: will the zip_strict postcondition violation be used in a lot of places where there are other likely sources of ValueError that need to be distinguished? If so, it should be a separate subclass. If that will be rare, it shouldn’t. As I said, I don’t know the answer to that question, because none of the people saying they need an exception here have given any examples where they’d want to handle the exception, and it’s hard to guess how people want to handle an exception when you don’t even know where and when they want to handle it. So I took a guess to start the discussion. If you have a different guess, fine. But really, we need the people who have code in mind that would actually use this to show us that code or tell us about it.
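As a hypothetical illustration of the kind of code in question, here is a sketch where another `ValueError` source sits in the same expression. The class `UnequalLengthsError`, the function `zip_equal_sketch`, and `parse_fields` are all made-up names for this example; only the idea of a `ValueError` subclass comes from more-itertools:

```python
from itertools import zip_longest


class UnequalLengthsError(ValueError):
    """Hypothetical dedicated error for a length mismatch."""


def zip_equal_sketch(*iterables):
    """Strict zip that raises the dedicated subclass on a mismatch."""
    missing = object()
    for row in zip_longest(*iterables, fillvalue=missing):
        if any(item is missing for item in row):
            raise UnequalLengthsError("iterables have different lengths")
        yield row


def parse_fields(names, raw_values):
    """Pair names with integer values, reporting which failure occurred."""
    try:
        return {name: int(raw) for name, raw in zip_equal_sketch(names, raw_values)}
    except UnequalLengthsError:
        return "length mismatch"  # must come first: it is also a ValueError
    except ValueError:
        return "bad integer"      # raised by int() on malformed text


print(parse_fields(["a", "b"], ["1", "2"]))
print(parse_fields(["a", "b"], ["1"]))
print(parse_fields(["a"], ["x"]))
```

With a plain `ValueError` the two failure branches could only be told apart by inspecting the message, which is the trade-off being weighed here.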
I know that the Python community has a love-affair with more-itertools, but I don't think that it is a well-designed library offering good APIs. It's a grab-bag of "everything including the kitchen sink". Just because they use a distinct exception doesn't mean we should follow them.
If I thought we should just do what more-itertools does without thinking, I would have said “more-itertools has a separate exception, so we should”, rather than saying “For what it’s worth, more-itertools has a separate exception” and then concluding that I don’t know if we actually need one and we need to look at actual examples to decide. When all else is equal, I think it’s worth being consistent with more-itertools just because that way we get an automatic backport. But that’s not a huge win, and quite often, all else isn’t equal, so looking at what more-itertools does and why isn’t the answer, it’s just one piece of information to throw into the discussion. And I think that’s the case here: their design raises a question for us to answer, but it doesn’t answer it for us.
On Wed, Apr 22, 2020 at 01:26:02PM -0700, Andrew Barnert wrote:
On Apr 21, 2020, at 16:02, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Apr 21, 2020 at 12:25:06PM -0700, Andrew Barnert via Python-ideas wrote:
On Apr 21, 2020, at 01:36, Serhiy Storchaka <storchaka@gmail.com> wrote: except ValueError: # assuming that’s the exception you want? For what it’s worth, more_itertools.zip_equal raises an UnequalIterablesError, which is a subclass of ValueError.
I’m not sure whether having a special error class is worth it, but that’s because nobody’s providing any examples of code where they’d want to handle this error. Presumably there are cases where something else in the expression could raise a ValueError for a different reason, and being able to catch this one instead of that one would be worthwhile. But how often? No idea.
At a guess, I’d say that if this has to be a builtin (whether flag-switchable behavior in zip or a new builtin function) it’s probably not worth adding a new builtin exception, but if it’s going to go into itertools it probably is worth it.
Why?
Well, you quoted the answer above, but I’ll repeat it: [...]
I saw that, so let me rephrase my question: why is it worth a subclass if this is in itertools but not if it's a builtin? Sorry for being so terse in my question. You haven't really given an answer to that, as such, but I think that it is no longer relevant given the below.
For a little more detail: [...] The question in every case is: do you often need to distinguish this case?
Indeed.
As I said, I don’t know the answer to that question, because none of the people saying they need an exception here have given any examples where they’d want to handle the exception, and it’s hard to guess how people want to handle an exception when you don’t even know where and when they want to handle it. So I took a guess to start the discussion.
And that's the missing piece I was looking for, thank you. You weren't so much advocating for this as just exploring the options. Fair enough. -- Steven
On Mon, Apr 20, 2020 at 08:42:00PM +0300, Ram Rachum wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
Here you go, add this to your personal toolbox:
    from itertools import zip_longest

    def zip_strict(*iterables):
        sentinel = object()
        for t in zip_longest(*iterables, fillvalue=sentinel):
            if sentinel in t:
                p = t.index(sentinel)
                msg = "argument %d exhausted"
                raise ValueError(msg % p)
            yield t
I added that to my personal toolbox sometime, oh, eight years or so ago, and have never used it since, so it's brand new :-)
This is useful in cases where you're assuming that the iterables all have the same lengths. When your assumption is wrong, you currently just get a shorter result, and it could take you a while to figure out why it's happening.
That assumes that you care that the assumption is wrong, rather than just saying "... if they aren't the same length, truncate at the shortest iterable". Can you give a little more detail on your use-case where the consumer of the data needs to care that all the iterables are the same length?
This approach has (at least) one **big** problem as far as I am concerned:
- it rules out the use of infinite iterators like itertools.count()
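A short sketch of that problem, reusing the `zip_strict` recipe given earlier in this message; the `names` list is made-up sample data:

```python
from itertools import count, zip_longest


def zip_strict(*iterables):
    """The toolbox recipe: raise as soon as any iterable runs out early."""
    sentinel = object()
    for t in zip_longest(*iterables, fillvalue=sentinel):
        if sentinel in t:
            p = t.index(sentinel)
            msg = "argument %d exhausted"
            raise ValueError(msg % p)
        yield t


names = ["a", "b", "c"]

# Plain zip() happily pairs an infinite counter with a finite list.
numbered = list(zip(count(), names))

# A strict zip can never accept this pairing: count() is never exhausted,
# so the finite list always looks "too short".
try:
    list(zip_strict(count(), names))
except ValueError as exc:
    error = str(exc)

print(numbered)
print(error)
```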
What do you think?
I'm tempted to say YAGNI, which would be a pretty brave thing to say given that you just said you did need it :-) -- Steven
The more I read the discussion, the more zip_strict() feels like an anti-pattern to me. I've used zip_longest occasionally, but never really hit a need for zip_strict() ... which obviously I could have written in a few lines if I wanted it.
Since I never know how many elements an iterator has left (including perhaps infinity), any such function has to be leap-before-you-look. But the effect is that I wind up consuming more than I want of some iterators. Let's say I have this code:
>>> it1 = iter([1,2,3])
>>> it2 = iter([4,5])
>>> it3 = iter([6,7,8, 9])
>>> list(zip(it1, it2, it3))
[(1, 4, 6), (2, 5, 7)]
>>> next(it3)
8
That seems fine. If I had used zip_longest() I could get some extra tuples, and indeed I'd have to check whether there were sentinel values inside of them. Depending on what I was doing, the non-sentinels might still be useful for my processing though.
But this hypothetical zip_strict() (or I guess actual very recently in itertools, but I haven't checked the semantics) would raise an exception of some kind instead. It's not easier to check for an exception than it is for a sentinel, so that really doesn't get us anything at all in terms of saving code or clarity. If anything, the check-for-sentinel feels slightly cleaner to me.
But worse is that most versions being discussed here seem to consume the 8 from it3 before raising the exception (perhaps sticking it in the exception object). I guess if it's stashed in the exception object it's not entirely lost. Still, in my code it3 remains a perfectly good iterator that I can keep around to pull more values from. Under the zip_strict approach, I have to dig a value out of the exception object before proceeding on normal iteration of it3 later in my code. That just feels awkward to me. Not un-doable, but certainly not easier.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 04/21/2020 12:58 PM, David Mertz wrote:
>>> it1 = iter([1,2,3])
>>> it2 = iter([4,5])
>>> it3 = iter([6,7,8, 9])
>>> list(zip(it1, it2, it3))
[(1, 4, 6), (2, 5, 7)]
>>> next(it3)
8
[...] worse is that most versions being discussed here seem to consume the 8 from it3 before raising the exception [...] in my code it3 remains a perfectly good iterator that I can keep around to pull more values from.
it3 may be still usable, but it1 is not:
>>> next(it1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
-- ~Ethan~
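Both observations can be checked in one small script. This sketch only uses the builtin `zip()`; the `"exhausted"` default passed to `next()` is just a marker chosen for the example, and the second half shows that the outcome depends on argument order:

```python
it1 = iter([1, 2, 3])
it2 = iter([4, 5])
it3 = iter([6, 7, 8, 9])

pairs = list(zip(it1, it2, it3))

# zip() pulled 3 from it1 before discovering that it2 was empty, so it1 is
# now exhausted; it3 was never asked for a third item and still yields 8.
leftover_it1 = next(it1, "exhausted")
leftover_it3 = next(it3, "exhausted")
print(pairs, leftover_it1, leftover_it3)

# With the shortest iterator first, nothing extra is consumed from the others.
jt1 = iter([1, 2, 3])
jt2 = iter([4, 5])
jt3 = iter([6, 7, 8, 9])
reordered = list(zip(jt2, jt1, jt3))
print(reordered, next(jt1), next(jt3))
```

So even today's `zip()` can silently consume a value from an earlier argument; a strict variant changes how many values are lost, not whether any can be.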
This idea is something I could have used many times. I agree with many people here that the strict=True API is at least "unusual" in Python. I was thinking of 2 different API approaches that could be used for this and I think no one has mentioned:
- we could add a callable filler_factory keyword argument to zip_longest. That would allow passing a function that raises an exception if I want "strict" behaviour, and also has some other uses (for example, if I want to use [] as a filler value, but not the *same* empty list for all fillers)
- we could add methods to the zip() type that provide different behaviours. That way you could use zip(seq1, seq2).shortest(), zip(seq1, seq2).equal(), zip(seq1, seq2).longer(filler="foo"); zip(...).shortest() would be equivalent to zip(...). Other names might work better with this API, I can think of zip(...).drop_tails(), zip(...).consume_all() and zip(...).fill(). This also allows adding other possible behaviours (I wouldn't say it's common, but at least once I've wanted to zip lists of different length, but get shorter tuples on the tails instead of fillers).
On Mon, 20 Apr 2020 at 18:44, Ram Rachum <ram@rachum.com> wrote:
Here's something that would have saved me some debugging yesterday:
>>> zipped = zip(x, y, z, strict=True)
I suggest that `strict=True` would ensure that all the iterables have been exhausted, raising an exception otherwise.
This is useful in cases where you're assuming that the iterables all have the same lengths. When your assumption is wrong, you currently just get a shorter result, and it could take you a while to figure out why it's happening.
What do you think?
On Apr 26, 2020, at 14:36, Daniel Moisset <dfmoisset@gmail.com> wrote:
This idea is something I could have used many times. I agree with many people here that the strict=True API is at least "unusual" in Python. I was thinking of 2 different API approaches that could be used for this and I think no one has mentioned: we could add a callable filler_factory keyword argument to zip_longest. That would allow passing a function that raises an exception if I want "strict" behaviour, and also has some other uses (for example, if I want to use [] as a filler value, but not the *same* empty list for all fillers)
This could be useful, and doesn’t seem too bad. I still think an itertools.zip_equal would be more discoverable and more easily understandable than something like itertools.zip_longest(fill_factory=lambda: throw(ValueError)), especially since you have to write that thrower function yourself. But if there really are other common uses like zip_longest(fill_factory=list), that might make up for it.
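For concreteness, a pure-Python sketch of what such a factory argument might look like. `zip_longest_with_factory` and `raise_unequal` are hypothetical names; `itertools.zip_longest` itself only accepts `fillvalue`:

```python
_MISSING = object()  # marks "this iterator had nothing left" for one round


def zip_longest_with_factory(*iterables, fill_factory):
    """Like itertools.zip_longest, but call fill_factory() for each gap."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        row = [next(iterator, _MISSING) for iterator in iterators]
        if all(item is _MISSING for item in row):
            return  # every input is exhausted
        # The factory is only called once we know the round is not empty, so
        # a raising factory does not fire for equal-length inputs.
        yield tuple(fill_factory() if item is _MISSING else item for item in row)


def raise_unequal():
    """The hand-written 'thrower' needed to get strict behaviour."""
    raise ValueError("iterables have different lengths")


# A fresh list for every gap, rather than one shared filler object.
padded = list(zip_longest_with_factory("abc", "x", fill_factory=list))
print(padded)

# Strict behaviour falls out of passing a factory that raises.
print(list(zip_longest_with_factory([1, 2], [3, 4], fill_factory=raise_unequal)))
```

This sketch assumes well-behaved iterators that keep raising `StopIteration` once exhausted, since it calls `next()` on them again in later rounds.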
we could add methods to the zip() type that provide different behaviours. That way you could use zip(seq1, seq2).shortest(), zip(seq1, seq2).equal(), zip(seq1, seq2).longer(filler="foo"); zip(...).shortest() would be equivalent to zip(...). Other names might work better with this API, I can think of zip(...).drop_tails(), zip(...).consume_all() and zip(...).fill(). This also allows adding other possible behaviours (I wouldn't say it's common, but at least once I've wanted to zip lists of different length, but get shorter tuples on the tails instead of fillers).
This second one is a cool idea, but your argument for it seems to be an argument against it.
If we stick with separate functions in itertools, and then we add a new one for your zip_skip (or whatever you’d call it) in 3.10, the backport is trivial. Either more-itertools adds zip_skip, or someone writes an itertools310 library with the new functions in 3.10, and then people just do this:
    try:
        from itertools import zip_skip
    except ImportError:
        from more_itertools import zip_skip
But if we add methods on zip objects, and then we add a new skip() method in 3.10, how does the backport work? It can’t monkeypatch the zip type (unless we both make the type public and specifically design it to be monkeypatchable, which C builtins usually aren’t). So more-itertools or zip310 or whatever has to provide a full implementation of the zip type, with all of its methods, and probably twice (in Python for other implementations plus a C accelerator for CPython). Sure, maybe it could delegate to a real zip object for the methods that are already there, but that’s still not trivial (and adds a performance cost).
Also, what exactly do these methods return? Do they set some flag and return self? If so, that goes against the usual Python rule that mutator methods return None rather than self. Plus, it opens the question of what zip(xs, ys).equal().shortest() should do. I think you’d want that to be an AttributeError, but the only sensible way to get that is if equal() actually returns a new object of a new zip_equal type rather than self. So, that solves both problems, but it means you have to implement four different builtin types. (Also, while the C implementation of those types, and constructing them from the zip type’s methods, seems trivial, I think the pure Python version would have to be pretty clunky.)
On Sun, Apr 26, 2020 at 04:13:27PM -0700, Andrew Barnert via Python-ideas wrote:
But if we add methods on zip objects, and then we add a new skip() method in 3.10, how does the backport work? It can’t monkeypatch the zip type (unless we both make the type public and specifically design it to be monkeypatchable, which C builtins usually aren’t).
Depends on how you define monkey-patching.
I'm not saying this because I see the need for a plethora of methods on zip (on the contrary); but I do like the methods-on-function API, like itertools.chain has. Functions are namespaces, and we under-utilise that fact in our APIs.
Namespaces are one honking great idea -- let's do more of those!
Here is a sketch of how you might do it:
    # Untested.
    import builtins

    class MyZipBackport():
        real_zip = builtins.zip

        def __call__(self, *args):
            return self.real_zip(*args)

        def __getattr__(self, name):
            # Delegation is another under-utilised technique.
            return getattr(self.real_zip, name)

        def skip(self, *args):
            # insert implementation here...
            raise NotImplementedError

    builtins.zip = MyZipBackport()
I don't know what "zip.skip" is supposed to do, but I predict that (like all the other variants we have discussed) it will end up being a small wrapper around zip_longest.
So more-itertools or zip310 or whatever has to provide a full implementation of the zip type, with all of its methods, and probably twice (in Python for other implementations plus a C accelerator for CPython). Sure, maybe it could delegate to a real zip object for the methods that are already there, but that’s still not trivial (and adds a performance cost).
I dunno, a two-line method (including the `def` signature line) seems pretty trivial to me. Nobody has established that *any* use of zip_whatever is performance critical. In what sort of real-world code is the bottleneck going to be the performance of zip_whatever *itself* rather than the work done on the zipped up tuples? I don't especially want zip_whatever to be slow, but the stdlib has no obligation to provide a super-fast highly optimized C accelerated version of **everything**. Especially not backports. It is perfectly acceptable to say: "Here's a functionally equivalent version that works in Python 3.old, if you want speed then provide your own C version or upgrade to 3.new"
Also, what exactly do these methods return?
An iterator. What kind of iterator is an implementation detail. The type of the zip objects is not part of the public API, only the functional behaviour. -- Steven
On Mon, Apr 27, 2020 at 9:57 AM Steven D'Aprano <steve@pearwood.info> wrote:
I don't especially want zip_whatever to be slow, but the stdlib has no obligation to provide a super-fast highly optimized C accelerated version of **everything**. Especially not backports. It is perfectly acceptable to say:
"Here's a functionally equivalent version that works in Python 3.old, if you want speed then provide your own C version or upgrade to 3.new"
True, but if taking the backport causes ALL your zip() objects to underperform, then that's a cost. It's not just "here's a slower version that works on 3.old", it's "here's a more functional version but it slows down other stuff". Still, performance of backported code is a lower consideration than getting the API right. (For the record, I still prefer the separate-functions option, but what Steven's described is a very reasonable zip-gets-methods option.) ChrisA
On Apr 26, 2020, at 16:58, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Apr 26, 2020 at 04:13:27PM -0700, Andrew Barnert via Python-ideas wrote:
But if we add methods on zip objects, and then we add a new skip() method in 3.10, how does the backport work? It can’t monkeypatch the zip type (unless we both make the type public and specifically design it to be monkeypatchable, which C builtins usually aren’t).
Depends on how you define monkey-patching.
I'm not saying this because I see the need for a plethora of methods on zip (on the contrary); but I do like the methods-on-function API, like itertools.chain has. Functions are namespaces, and we under-utilise that fact in our APIs.
Namespaces are one honking great idea -- let's do more of those!
Here is a sketch of how you might do it:
    # Untested.
    import builtins

    class MyZipBackport():
        real_zip = builtins.zip

        def __call__(self, *args):
            return self.real_zip(*args)

        def __getattr__(self, name):
            # Delegation is another under-utilised technique.
            return getattr(self.real_zip, name)

        def skip(self, *args):
            # insert implementation here...
            raise NotImplementedError

    builtins.zip = MyZipBackport()
But this doesn’t do what the OP suggested; it’s a completely different proposal. They wanted to write this:
    zipped = zip(xs, ys).skip()
… and you’re offering this:
    zipped = zip.skip(xs, ys)
That’s a decent proposal, arguably better than the one being discussed, but it’s definitely not the same one.
I don't know what "zip.skip" is supposed to do,
I quoted it in the email you’re responding to: it’s supposed to yield short tuples that skip the iterables that ran out early. But from the wording you quoted it should be obvious that isn’t an issue here anyway. As long as you understand their point that they want to leave things open for expansion to new forms of zipping in the future, you can understand my point that their design makes that harder rather than easier.
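To make the "skip" behaviour concrete, here is a minimal sketch. It assumes "skip" means: keep yielding tuples until every iterable is exhausted, dropping each iterable as it runs out so that later tuples get shorter. The name `zip_skip` is hypothetical and exists nowhere in the standard library.

```python
def zip_skip(*iterables):
    """Yield tuples of next items, dropping iterables as they run out.

    Hypothetical helper illustrating the "skip" behaviour; not a stdlib API.
    """
    iterators = [iter(it) for it in iterables]
    while iterators:
        items = []
        alive = []
        for it in iterators:
            try:
                items.append(next(it))
            except StopIteration:
                continue  # exhausted: leave it out of this and later tuples
            alive.append(it)
        if not items:
            return  # every remaining iterator ran out on this pass
        iterators = alive
        yield tuple(items)


print(list(zip_skip("abc", [1], (True, False))))
# [('a', 1, True), ('b', False), ('c',)]
```

Written as a plain function like this, it could live in a module or be attached to a callable as `zip.skip(xs, ys)`; spelling it `zip(xs, ys).skip()` would instead require the zip iterator type itself to carry the method.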
Also, what exactly do these methods return?
An iterator. What kind of iterator is an implementation detail.
The type of the zip objects is not part of the public API, only the functional behaviour.
Now go back and do what the OP actually asked for, with the zip iterator type having shortest(), equal(), and longest() methods in 3.9 and a skip() method added in 3.10. It’s no longer just “some iterator type, doesn’t matter”, it has specific methods on it, documented as part of the public API, and you need to either subclass it or emulate it. That’s exactly the problem I’m pointing out. The fact that it’s not true in 3.8, it’s not required by the problem, it’s not true of other designs proposed in this thread like just having more separate functions in itertools, it’s specifically a flaw with this design. So the fact that you can come up with a different design without that flaw isn’t an argument against my point, it’s just a probably-unnecessary further demonstration of my point. Your design looks like a pretty good one at least at first glance, and I think you should propose it seriously. You should be showing why it’s better than adding methods to zip objects—and also better than adding more functions to itertools or builtins, or flags to zip, or doing nothing—not pretending it’s the same as one of those other proposals and then trying to defend that other proposal by confusing the problems with it.
On Mon, Apr 27, 2020 at 09:21:41AM -0700, Andrew Barnert wrote:
But this doesn’t do what the OP suggested; it’s a completely different proposal. They wanted to write this:
zipped = zip(xs, ys).skip()
… and you’re offering this:
zipped = zip.skip(xs, ys)
That’s a decent proposal—arguably better than the one being discussed—but it’s definitely not the same one.
So he did. I misread his comment, sorry. Perhaps I read it as I would have written it rather than as he wrote it :-( [...]
Your design looks like a pretty good one at least at first glance, and I think you should propose it seriously. You should be showing why it’s better than adding methods to zip objects—and also better than adding more functions to itertools or builtins, or flags to zip, or doing nothing—not pretending it’s the same as one of those other proposals and then trying to defend that other proposal by confusing the problems with it.
Last time I got volunteered into writing a PEP I wasn't in favour of, and (initially at least) thought I was writing to have the rejection reason documented, it ended up getting approved :-) -- Steven
On Sun, Apr 26, 2020 at 10:34:27PM +0100, Daniel Moisset wrote:
- we could add methods to the zip() type that provide different behaviours. That way you could use zip(seq1, seq2).shortest(), zip(seq1, seq2).equal(), zip(seq1, seq2).longer(filler="foo"); zip(...).shortest() would be equivalent to zip(...). Other names might work better with this API, I can think of zip(...).drop_tails(), zip(...).consume_all() and zip(...).fill(). This also allows adding other possible behaviours (I wouldn't say it's common, but at least once I've wanted to zip lists of different length, but get shorter tuples on the tails instead of fillers).
Each of those behaviours can be handled by a simple wrapper function around zip_longest. -- Steven
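As a rough illustration of that claim, the three behaviours might be sketched as thin wrappers over `itertools.zip_longest`. The function names below are hypothetical, chosen only to mirror the method names in the quoted proposal, and the sentinel is compared by identity so that user data can never be mistaken for a filler.

```python
from itertools import zip_longest

_MISSING = object()  # private filler marker, only ever compared by identity


def zip_shortest(*iterables):
    """Stop before the first tuple that would need a filler."""
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in combo):
            return
        yield combo


def zip_equal(*iterables):
    """Like zip, but raise ValueError if the iterables differ in length."""
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


def zip_short_tails(*iterables):
    """Yield shorter tuples on the tails instead of fillers."""
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        yield tuple(item for item in combo if item is not _MISSING)
```

The `longer(filler=...)` case needs no wrapper at all, since it is `zip_longest(..., fillvalue=...)` itself.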
Thanks for weighing in, everybody. Over the course of the last week, it has become surprisingly clear that this change is controversial enough to require a PEP. With that in mind, I've started drafting one summarizing the discussion that took place here, and arguing for the addition of a boolean flag to the `zip` constructor. Antoine Pitrou has agreed to sponsor, and I've chatted with another core developer who shares my view that such a flag wouldn't violate Python's existing design philosophies. I'll be watching this thread, and should have a draft posted to the list for feedback this week. Brandt
On 28/04/2020 15:46, Brandt Bucher wrote:
Thanks for weighing in, everybody.
Over the course of the last week, it has become surprisingly clear that this change is controversial enough to require a PEP.
With that in mind, I've started drafting one summarizing the discussion that took place here, and arguing for the addition of a boolean flag to the `zip` constructor. Antoine Pitrou has agreed to sponsor, and I've chatted with another core developer who shares my view that such a flag wouldn't violate Python's existing design philosophies.
I'll be watching this thread, and should have a draft posted to the list for feedback this week.
-1 on the flag. I'd be happy to have a separate zip_strict() (however you spell it), but behaviour switches just smell wrong. -- Rhodri James *-* Kynesim Ltd
On 28 Apr 2020, at 16:12, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 28/04/2020 15:46, Brandt Bucher wrote:
Thanks for weighing in, everybody. Over the course of the last week, it has become surprisingly clear that this change is controversial enough to require a PEP. With that in mind, I've started drafting one summarizing the discussion that took place here, and arguing for the addition of a boolean flag to the `zip` constructor. Antoine Pitrou has agreed to sponsor, and I've chatted with another core developer who shares my view that such a flag wouldn't violate Python's existing design philosophies. I'll be watching this thread, and should have a draft posted to the list for feedback this week.
-1 on the flag. I'd be happy to have a separate zip_strict() (however you spell it), but behaviour switches just smell wrong.
Also -1 on the flag.
1. A new name can be searched for.
2. You do not force an if on the flag for every single call to zip.
Barry
-- Rhodri James *-* Kynesim Ltd _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/BZUJUT... Code of Conduct: http://python.org/psf/codeofconduct/
On Apr 29, 2020, at 07:08, Barry Scott <barry@barrys-emacs.org> wrote:
On 28 Apr 2020, at 16:12, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 28/04/2020 15:46, Brandt Bucher wrote: Thanks for weighing in, everybody. Over the course of the last week, it has become surprisingly clear that this change is controversial enough to require a PEP. With that in mind, I've started drafting one summarizing the discussion that took place here, and arguing for the addition of a boolean flag to the `zip` constructor. Antoine Pitrou has agreed to sponsor, and I've chatted with another core developer who shares my view that such a flag wouldn't violate Python's existing design philosophies. I'll be watching this thread, and should have a draft posted to the list for feedback this week.
-1 on the flag. I'd be happy to have a separate zip_strict() (however you spell it), but behaviour switches just smell wrong.
Also -1 on the flag.
1. A new name can be searched for.
2. You do not force an if on the flag for every single call to zip.
Agreed on both Rhodri’s and Barry’s reasons, and more below. I also prefer the name zip_equal to zip_strict, because what we’re being strict about isn’t nearly as obvious as what’s different between shortest vs. equal vs. longest, but that’s just a mild preference, not a -1 like the flag.
In addition to the three points above: Having one common zip variant spelled as a different function and the other as a flag seems really bad for learning and remembering the language. And zip_longest has a solidly established precedent. And I don’t think you want to add multiple bool flags to zip?
Also, just look at these:
    zip_strict(xs, ys)
    zip(xs, ys, strict=True)
The first one is easier to read because it doesn’t have the extra 5 characters to skim over that don’t really add anything to the meaning, and it puts the important distinction up front. It’s also shorter, and a lot easier to type with auto-complete, which isn’t nearly as big of a deal, but if this is really meant to be used often it does add up. And it’s obviously more extensible, if it really is at all possible that we might want to eventually deprecate shortest or add new end behaviors like yielding partial tuples or Soni’s thing of stashing the leftovers somehow (none of which I find very convincing, but others apparently do, and picking a design that rules them out means explicitly rejecting them).
A string or enum flag instead of a bool solves half of those problems (as long as “longest” is one of the options), but it makes others even worse. The available strings aren’t even discoverable as part of the signature, auto-complete won’t help at all, and the result is even longer and even more deemphasizes the important thing.
Andrew Barnert via Python-ideas writes:
Also -1 on the flag.
Also -1 on the flag, for the same set of reasons. I have to dissent somewhat from one of the complaints, though:
auto-complete won’t help at all,
Many (most?) people use IDEs that will catch up more or less quickly, though. Such catchup could be automated to some extent by using an Enum, although folks who would use the flag might prefer the string API. You could handle both, but that would add even more complexity to the function's initialization. I think that the issue of searchability and signature are pretty compelling reasons for such a simple feature to be part of the function name. Steve
I think that the issue of searchability and signature are pretty compelling reasons for such a simple feature to be part of the function name.
I would absolutely agree with that if all three functions were in the same namespace (like the string methods referred to earlier), but in this case, one is a built in and the others will not be, which makes a huge difference in discoverability.
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
If it’s in itertools, they have to think to look there. Granted, some googling will probably lead them there, and the zip() docstring can point them there, but it’s still a heavier lift.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 04/30/2020 07:58 AM, Christopher Barker wrote:
On 04/29/2020 10:51 PM, Stephen J. Turnbull wrote:
I think that the issue of searchability and signature are pretty compelling reasons for such a simple feature to be part of the function name.
I would absolutely agree with that if all three functions were in the same namespace (like the string methods referred to earlier), but in this case, one is a built in and the others will not be, which makes a huge difference in discoverability.
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
So update the `zip` docstring with a reference to `zip_longest`, `zip_equal`, and `zip_whatever`. -1 on using a flag. -- ~Ethan~
On 2020-04-30 1:07 p.m., Ethan Furman wrote:
On 04/30/2020 07:58 AM, Christopher Barker wrote:
On 04/29/2020 10:51 PM, Stephen J. Turnbull wrote:
I think that the issue of searchability and signature are pretty compelling reasons for such a simple feature to be part of the function name.
I would absolutely agree with that if all three functions were in the same namespace (like the string methods referred to earlier), but in this case, one is a built in and the others will not be, which makes a huge difference in discoverability.
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
So update the `zip` docstring with a reference to `zip_longest`, `zip_equal`, and `zip_whatever`.
-1 on using a flag.
what about letting `zip` take a `leftover_func` with arguments `partial_results` and `remaining_iterators`, and then provide `zip_longest`, `zip_equal` and `zip_shortest` (default) as functions you can use with it?
an iteration of `zip(a, b, c, leftover_func=foo)` would:
1. call next on the first iterator (internal iter(a))
2. if it fails, call leftover_func with the () tuple as first arg and the (internal iter(b), internal iter(c)) tuple as second arg
3. call next on the second iterator (internal iter(b))
4. if it fails, call leftover_func with the (result from a,) tuple as the first arg and the (internal iter(a), internal iter(c)) tuple as second arg
5. call next on the third iterator (internal iter(c))
6. if it fails, call leftover_func with the (result from a, result from b) tuple as the first arg and the (internal iter(a), internal iter(b)) tuple as second arg
7. yield the (result from a, result from b, result from c) tuple
the leftover_func should return an iterator that replaces the zip, or None. (zip_shortest would be the no_op function)
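The steps above could be sketched roughly as follows. This is only one reading of the proposal: `zip_with_leftovers` is a hypothetical stand-in for a `zip` that accepts `leftover_func`, and the `zip_shortest` and `zip_equal` callbacks are guesses at how the suggested helpers might behave.

```python
def zip_with_leftovers(*iterables, leftover_func=None):
    """Hypothetical zip that hands control to leftover_func on exhaustion."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        results = []
        for index, it in enumerate(iterators):
            try:
                results.append(next(it))
            except StopIteration:
                if leftover_func is None:
                    return  # default: behave like plain zip
                # Pass along every iterator except the one that just failed.
                remaining = tuple(iterators[:index] + iterators[index + 1:])
                replacement = leftover_func(tuple(results), remaining)
                if replacement is not None:
                    yield from replacement  # the returned iterator takes over
                return
        yield tuple(results)


def zip_shortest(partial_results, remaining_iterators):
    """No-op callback: just stop."""
    return None


def zip_equal(partial_results, remaining_iterators):
    """Callback that raises if anything was left over anywhere."""
    if partial_results:
        raise ValueError("iterables have different lengths")
    for it in remaining_iterators:
        for _ in it:
            raise ValueError("iterables have different lengths")
    return None
```

With these definitions, `list(zip_with_leftovers([1, 2], [3], leftover_func=zip_equal))` raises `ValueError`, while the default and `zip_shortest` both truncate.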
On Apr 30, 2020, at 07:58, Christopher Barker <pythonchb@gmail.com> wrote:
I think that the issue of searchability and signature are pretty compelling reasons for such a simple feature to be part of the function name.
I would absolutely agree with that if all three functions were in the same namespace (like the string methods referred to earlier), but in this case, one is a built in and the others will not be, which makes a huge difference in discoverability.
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
If it’s in itertools, they have to think to look there. Granted, some googling will probably lead them there, and the zip() docstring can point them there, but it’s still a heavier lift.
I don’t understand. You’re arguing that being discoverable in the docstring is sufficient for the flag, but being discoverable in the docstring is a heavier lift for the function. Why would this be true, unless you intentionally write the docstring badly?
To make this more concrete, let’s say we want to just add on to the existing doc string (even though it seems aimed more at reminding experts of the exact details than at teaching novices) and stick to the same style. We’re then talking about something like this:
Return a zip object whose .__next__() method returns a tuple where the i-th element comes from the i-th iterable argument. The .__next__() method continues until the shortest iterable in the argument sequence is exhausted and then it raises StopIteration, or, if equal is true, it checks that the remaining iterables are exhausted and otherwise raises ValueError.
… vs. this:
Return a zip object whose .__next__() method returns a tuple where the i-th element comes from the i-th iterable argument. The .__next__() method continues until the shortest iterable in the argument sequence is exhausted and then it raises StopIteration. If you need to check that all iterables are exhausted, use itertools.zip_equal, which raises ValueError if they aren’t.
If they can figure out that equal=True is what they’re looking for from the first one, it’ll be just as easy to figure out that zip_equal is what they’re looking for from the second. Of course it might be better to rewrite the whole thing to be more novice-friendly and to describe what zip iterates at a higher level instead of describing how its __next__ method operates, but that applies to both versions.
On 1/05/20 2:58 am, Christopher Barker wrote:
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
Why would they look at the docs for zip? The bug wasn't caused by incorrect use of zip. And using the flag isn't going to fix it. -- Greg
On Thu, Apr 30, 2020 at 07:58:16AM -0700, Christopher Barker wrote:
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
Their problem is not solved. All they have is an exception. Now what are they going to do with it?
This is why I am still unconvinced that this functionality is anywhere near as useful as the proponents seem to think. Brandt has found one good example of a parsing bug in the ast library, but if he has shown how this zip_strict function will solve the bug, I haven't seen it.
In any case, even giving Brandt the benefit of the doubt that this will solve the ast bug, it's hard for me to generalise from that. If I'm expecting equal length inputs, and don't get them, what am I supposed to do with the exception as the consumer of the inputs?
As the consumer of the inputs, I can pass the buck to the producer, make it their responsibility, and merely promise to truncate the inputs if they're not the same length. Otherwise, what do I do once I've caught the exception?
The most common use for this I have seen in the discussion is:
"I have generated two inputs which I expect are equal, and I'd like to be notified if they aren't"
which to me is an assertion about program correctness. So this ought to be an assert that gets disabled under -O, not a raise that the caller might catch.
So this suggests *two* new functions:
- zip_equal for Brandt's parsing bug use-case, guaranteed to raise
- zip_assert_equal for the more common use case of checking program correctness, and disabled under -O
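A minimal sketch of what the second function might look like, assuming "disabled under -O" means it degrades to plain `zip` when `__debug__` is false. The name comes from the message above; the implementation is only a guess at the intent.

```python
from itertools import zip_longest

_MISSING = object()  # filler marker, compared by identity only


def zip_assert_equal(*iterables):
    """Check lengths with an assert; fall back to plain zip under -O."""
    if not __debug__:
        # Under -O the assert below would be compiled away, so skip the
        # zip_longest machinery entirely and truncate like plain zip.
        yield from zip(*iterables)
        return
    for combo in zip_longest(*iterables, fillvalue=_MISSING):
        assert not any(item is _MISSING for item in combo), (
            "iterables have different lengths"
        )
        yield combo
```

The always-raising `zip_equal` would be the same loop with `raise ValueError(...)` in place of the assert and no `__debug__` branch.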
If it’s in itertools, they have to think to look there.
And this is a problem, why? Should *everything* be a builtin? Heaven forbid that somebody has to read the docs and learn about modules, let's have one giant global namespace with everything in it! Because that's good for the beginners! (Not.) -- Steven
On Sat, May 2, 2020 at 3:50 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Apr 30, 2020 at 07:58:16AM -0700, Christopher Barker wrote:
Imagine someone that uses zip() in code that works for a while, and then discovers a bug triggered by unequal length inputs.
If it’s a flag, they look at the zip docstring, and find the flag, and their problem is solved.
Their problem is not solved. All they have is an exception. Now what are they going to do with it?
I *think* Christopher was saying they have a logical bug which existed silently and led to some confusing debugging, and they'd like to be notified of the unequal lengths in the future. So they want to find the strict feature (whatever the API may be) which they've either guessed might exist or vaguely remember seeing before. In that case the zip docstring is likely the first place they'd look.
If what he meant was that the flag raised an exception, then to answer your question "what are they going to do with it?", they should either fix the bug that led to malformed inputs or remove the flag if they realise unequal lengths aren't such a problem in this case.
This is why I am still unconvinced that this functionality is anywhere near as useful as the proponents seem to think. Brandt has found one good example of a parsing bug in the ast library, but if he has shown how this zip_strict function will solve the bug, I haven't seen it.
[The bug](https://bugs.python.org/issue40355) is titled "The ast module fails to reject certain malformed nodes". The function would cause the nodes to be rejected with an exception.
In any case, even giving Brandt the benefit of the doubt that this will solve the ast bug, it's hard for me to generalise from that. If I'm expecting equal length inputs, and don't get them, what am I supposed to do with the exception as the consumer of the inputs?
As the consumer of the inputs, I can pass the buck to the producer, make it their responsibility, and merely promise to truncate the inputs if they're not the same length. Otherwise, what do I do once I've caught the exception?
I would say that in pretty much all cases you wouldn't catch the exception. It's the producer's responsibility to produce correct inputs, and if they don't, tell them that they failed in their responsibility. The underlying core principle is that programs should fail loudly when users make mistakes to help them find those mistakes. I'm strongly reminded of when I was advocating for a warning/exception when iterating directly over a string and some people here didn't understand what the point was. Do some people not agree with this core principle?
The most common use for this I have seen in the discussion is:
"I have generated two inputs which I expect are equal, and I'd like to be notified if they aren't"
If there's a different use case I'm not aware of it, can someone share?
which to me is an assertion about program correctness. So this ought to be an assert that gets disabled under -O, not a raise that the caller might catch.
That's a pretty decent idea. But are there any other examples in the standard library of functions behaving differently under -O? I think if you want that kind of balance between performance and robustness, your best option is zip(x, y, strict=__debug__). Nice and explicit.
So this suggests *two* new functions:
- zip_equal for Brandt's parsing bug use-case, guaranteed to raise
- zip_assert_equal for the more common use case of checking program correctness, and disabled under -O
Again, I think Brandt's case is still just about checking program correctness.
If it’s in itertools, they have to think to look there.
And this is a problem, why? Should *everything* be a builtin?
Heaven forbid that somebody has to read the docs and learn about modules, let's have one giant global namespace with everything in it! Because that's good for the beginners! (Not.)
The problem is not that they have to look there, it's that they have to *think to look there*. itertools might not occur to them. They might not even know it exists. Note that adding a flag is essentially adding to the (empty) namespace that is zip's named arguments. Adding a new function is adding to a much larger namespace, probably itertools.
On Sat, May 02, 2020 at 09:54:46AM +0200, Alex Hall wrote:
I would say that in pretty much all cases you wouldn't catch the exception. It's the producer's responsibility to produce correct inputs, and if they don't, tell them that they failed in their responsibility.
The underlying core principle is that programs should fail loudly when users make mistakes to help them find those mistakes.
Maybe. It depends on whether it is a meaningful mistake, and the cost of the loud failure versus the usefulness of silent truncation.
    py> x = 123456789.0000000001
    py> x == 123456789
    True
Should that float literal raise or just truncate the value? How about arithmetic?
    py> 1e50 + 1 == 1e50
    True
I guess once in a while it would be useful to know that arithmetic was throwing away data, and the IEEE floating point standard allows that as an optional trap, but can you imagine how obnoxious it would be to have it happen all the time? Sometimes silently throwing away data is the right thing to do.
"Errors should never pass silently" depends on what we mean by "error". Is it an error to call str.upper() on a string that contains no letters? Perhaps upper() should raise an exception if it doesn't actually convert anything, rather than silently doing nothing. If I'm expecting a string of alphabetical letters, but get digits instead, it might be useful for upper() to raise.
    name.upper(strict=True)
Would I write my own upper() to do this? No. Should it become a builtin? Probably not.
So bringing it back to zip... I don't think I ever denied that, in principle at least, somebody might need to raise on mismatched lengths. (If I did give that impression, I apologise.) I did say I never needed it myself, and my own zip_strict function in my personal toolbox remains unused after many years. But somebody needs it? Sure, I'll accept that.
But I question whether *enough* people need it *often enough* to make it a builtin, or to put a flag on plain zip.
Rolling your own on top of zip_longest is easy. It's half a dozen lines. It could be a recipe in itertools, or a function.
It has taken years for it to be added to more-itertools, suggesting that the real world need for this is small.
"Not every two line function needs to be a builtin" -- this is six lines, not two, which is in the proposal's favour, but the principle still applies.
Before this becomes a builtin, there are a number of hurdles to pass:
- Is there a need for it? Granted.
- Is it complicated to get right? No.
- Is performance critical enough that it has to be written in C? Probably not.
- Is there agreement on the functionality? Somewhat.
- Could that need be met by your own personal toolbox?
- or a recipe in itertools?
- or by a third-party library?
- or a function in itertools?
We've heard from people who say that they would like a strict version of zip which raises on unequal inputs. How many of them like this enough to add a six line function to their code?
My personal opinion is that given that Brandt has found one concrete use for this in the stdlib, it is probably justifiable to add it to itertools. Whether it even needs a C accelerated version, or just a pure Python version, I don't care, but then I'm not doing the work :-)
The most common use for this I have seen in the discussion is:
"I have generated two inputs which I expect are equal, and I'd like to be notified if they aren't"
If there's a different use case I'm not aware of it, can someone share?
Sorry for the confusion, I intended to distinguish between the two cases:
1. I have generated two inputs which I expect are equal, and I want to assert that they are equal when I process them.
2. I consume data generated by someone else, and it is *their* responsibility to ensure that they are equal in length.
Sorry that this was not clear. In the second case it is (in my opinion) perfectly acceptable to put the responsibility on the producer of the data, and silently truncate any excess data, rather than raise. Just as converting strings to floats silently truncates any extra digits. Let the producer check the lengths, if they must. [...]
The problem is not that they have to look there, it's that they have to *think to look there*. itertools might not occur to them. They might not even know it exists.
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Note that adding a flag is essentially adding to the (empty) namespace that is zip's named arguments. Adding a new function is adding to a much larger namespace, probably itertools.
I don't agree with that description. A function signature is not a namespace. -- Steven
On Sat, May 2, 2020 at 1:19 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 09:54:46AM +0200, Alex Hall wrote:
I would say that in pretty much all cases you wouldn't catch the exception. It's the producer's responsibility to produce correct inputs, and if they don't, tell them that they failed in their responsibility.
The underlying core principle is that programs should fail loudly when users make mistakes to help them find those mistakes.
Maybe. It depends on whether it is a meaningful mistake, and the cost of the loud failure versus the usefulness of silent truncation.
I'm not sure what the point of this long spiel about floats and str.upper was. No one thinks that zip should always be strict. The feature would be optional and let people choose conveniently between loud failure and silent truncation.
So bringing it back to zip... I don't think I ever denied that, in principle at least, somebody might need to raise on mismatched lengths. (If I did give that impression, I apologise.) I did say I never needed it myself, and my own zip_strict function in my personal toolbox remains unused after many years. But somebody needs it? Sure, I'll accept that.
But I question whether *enough* people need it *often enough* to make it a builtin, or to put a flag on plain zip.
Well, let's add some data about people needing it. Here is a popular question on the topic:
https://stackoverflow.com/questions/32954486/zip-iterators-asserting-for-equ...
Here are previous threads asking for it:
https://mail.python.org/archives/list/python-ideas@python.org/thread/UXX3FGO... (In that one you yourself say "Indeed. The need is real, and the question has come up many times on Python-List as well.")
https://mail.python.org/archives/list/python-ideas@python.org/thread/OM3ETID...
https://mail.python.org/archives/list/python-ideas@python.org/thread/K54NG74...
Here are similar requests for Rust:
https://internals.rust-lang.org/t/non-truncating-more-usable-zip/5205
https://mail.mozilla.org/pipermail/rust-dev/2013-May/004039.html (which mentions that Erlang's zip is strict)
Rolling your own on top of zip_longest is easy. It's half a dozen lines. It could be a recipe in itertools, or a function.
It has taken years for it to be added to more-itertools, suggesting that the real world need for this is small.
"Not every two line function needs to be a builtin" -- this is six lines, not two, which is in the proposal's favour, but the principle still applies. Before this becomes a builtin, there are a number of hurdles to pass:
- Is there a need for it? Granted. - Is it complicated to get right? No.
I would say yes. Look at the SO question for example. The asker wrote a long, slow, complicated solution and had to ask if it was good enough. Martijn (who is a prolific answerer) gave two solutions. The top comment says that the second solution is very nice. Months later someone pointed out that the second solution is actually buggy, so it was edited out. The remaining solution still has an issue which is mentioned in a comment but is not addressed. So we know that many people (including me, btw) have copy-pasted this buggy code and it's now sitting in their codebases. Here are some examples from github:
https://github.com/search?q=%22if+sentinel+in+combo%22&type=Code
- Is performance critical enough that it has to be written in C? Probably not.
No, probably not, but I don't see why this is a hurdle. This can be implemented in any way by different implementations of Python, but for CPython, I don't see how else this would play out. Performance isn't really the reason this should be in the language.
- Is there agreement on the functionality? Somewhat. - Could that need be met by your own personal toolbox? - or a recipe in itertools? - or by a third-party library? - or a function in itertools?
We've heard from people who say that they would like a strict version of zip which raises on unequal inputs. How many of them like this enough to add a six line function to their code?
I think a major factor here is laziness. I'm pretty sure that sometimes I've wanted this kind of strict check, just for better peace of mind, but the thought of one of the solutions above feels like too much effort. I don't want to add a third party dependency just for this. I don't want to read someone else's solution (e.g. on SO) which doesn't have tests and try to evaluate if it's correct. I certainly don't want to reimplement it myself. I brush it off thinking "it'll probably be fine", which is bad behaviour. The problem is that no one really *needs* this check. You *can* do without it. The same doesn't apply well to other functions in itertools or more-itertools. If you need the functionality of itertools.permutations(), you can't just dismiss that problem. But sometimes you may seriously regret not having checked the lengths. That's usually the point when someone (e.g. Ram) comes to python-ideas, or more-itertools, or stackoverflow, wishing they had been more disciplined in the past. Adding a function to itertools will mostly solve that problem, but not entirely. Adding `strict=True` is so easy that people will be encouraged to use it often and keep their code safe. That to me is the biggest argument for this feature and for this specific API.
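To make the failure mode concrete, here is a minimal sketch of the silent truncation under discussion, next to the kind of up-front check people tend to skip. The helper name `zip_same_len` is hypothetical and it assumes sized sequences (anything supporting `len()`), so it is not a general replacement for a strict zip over arbitrary iterators.

```python
def zip_same_len(*seqs):
    """Hypothetical helper: zip sized sequences, raising if lengths differ.

    Assumes every argument supports len(); plain iterators would need
    a different approach.
    """
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        raise ValueError(f"length mismatch: {sorted(lengths)}")
    return zip(*seqs)


names = ["ann", "bob", "cy"]
scores = [90, 85]  # one score is missing

# Plain zip stops at the shortest input and says nothing about "cy".
truncated = list(zip(names, scores))

# The checked version fails loudly instead.
try:
    list(zip_same_len(names, scores))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The check is only a few lines, which is rather the point being argued: it is easy to write and just as easy to leave out.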
The problem is not that they have to look there, it's that they have to
*think to look there*. itertools might not occur to them. They might not
even know it exists.
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Putting math.sin or whatever in builtins makes builtins bigger. Adding a flag to zip does not. <http://python.org/psf/codeofconduct/> I think I've missed what harm you think it will do to add a flag to zip. Can you point me to your objection?
That was a good email Alex. Besides the relevant examples, you've put into words things that I wanted to say but didn't realize it. Good job :)
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Putting math.sin or whatever in builtins makes builtins bigger. Adding a flag to zip does not. <http://python.org/psf/codeofconduct/>
Excuse me, why are you aiming the CoC at me? -- Steven
On Sat, May 2, 2020 at 4:34 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Putting math.sin or whatever in builtins makes builtins bigger. Adding a flag to zip does not. <http://python.org/psf/codeofconduct/>
Excuse me, why are you aiming the CoC at me?
I have no idea how or why that happened. It doesn't show in the sent message in my GMail. Something went wrong with quoting.
On 5/2/2020 10:40 AM, Alex Hall wrote:
On Sat, May 2, 2020 at 4:34 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Putting math.sin or whatever in builtins makes builtins bigger. Adding a flag to zip does not. <http://python.org/psf/codeofconduct/>
Excuse me, why are you aiming the CoC at me?
I have no idea how or why that happened. It doesn't show in the sent message in my GMail. Something went wrong with quoting.
And it doesn't look that way in my inbox, either. In fact, there's a whole other sentence before the footer "I think I've missed what harm you think it will do to add a flag to zip. Can you point me to your objection?". Eric
On Sat, May 02, 2020 at 04:40:49PM +0200, Alex Hall wrote:
On Sat, May 2, 2020 at 4:34 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Yes? Is it our responsibility to put everything in builtins because people might not think to look in math, or functools, or os, or sys?
Putting math.sin or whatever in builtins makes builtins bigger. Adding a flag to zip does not. <http://python.org/psf/codeofconduct/>
Excuse me, why are you aiming the CoC at me?
I have no idea how or why that happened. It doesn't show in the sent message in my GMail. Something went wrong with quoting.
Okay. The glitch shows up in web-archive too: https://mail.python.org/archives/list/python-ideas@python.org/message/C5E6GV... Very strange. -- Steven
On Sat, May 2, 2020 at 2:58 PM Alex Hall <alex.mojaki@gmail.com> wrote:
On Sat, May 2, 2020 at 1:19 PM Steven D'Aprano <steve@pearwood.info> wrote:
Rolling your own on top of
zip_longest is easy. It's half a dozen lines. It could be a recipe in
itertools, or a function.
It has taken years for it to be added to more-itertools, suggesting that the real world need for this is small.
"Not every two line function needs to be a builtin" -- this is six lines, not two, which is in the proposal's favour, but the principle still applies. Before this becomes a builtin, there are a number of hurdles to pass:
- Is there a need for it? Granted. - Is it complicated to get right? No.
- Is performance critical enough that it has to be written in C?
Probably not.
No, probably not
I take it back, performance is a problem worth considering. Here is the more-itertools implementation: https://github.com/more-itertools/more-itertools/blob/master/more_itertools/...
```
def zip_equal(*iterables):
    """``zip`` the input *iterables* together, but throw an
    ``UnequalIterablesError`` if any of the *iterables* terminate before
    the others.
    """
    for combo in zip_longest(*iterables, fillvalue=_marker):
        for val in combo:
            if val is _marker:
                raise UnequalIterablesError(
                    "Iterables have different lengths."
                )
        yield combo
```
I didn't think carefully about this implementation and thought that there was only a performance cost in the error case. That's obviously not true - there's an `if` statement executed in Python for every item in every iterable. The overhead is O(len(iterables) * len(iterables[0])). Given that zip is used a lot and most uses of zip should probably be strict, this is a significant problem. Therefore:
- Rolling your own on top of zip_longest in six lines is not a solution.
- Using more-itertools is not a solution.
- It's complicated to get right.
- Performance is critical enough to do it in C.
On Sat, May 02, 2020 at 04:58:43PM +0200, Alex Hall wrote:
I didn't think carefully about this implementation and thought that there was only a performance cost in the error case. That's obviously not true - there's an `if` statement executed in Python for every item in every iterable.
Sorry, this does not demonstrate that the performance cost is significant.
This adds one "if" per loop, terminating on (one more than) the shortest input. So O(N) on the length of the input. That's usually considered reasonable, provided the per item cost is low.
The test in the "if" is technically O(N) on the number of input iterators, but since that's usually two, and rarely more than a handful, it's close enough to a fixed cost.
On my old and slow PC `sentinel in combo` is quite fast:
py> from timeit import Timer
py> t = Timer('sentinel in combo', setup='sentinel=object(); combo=tuple(range(10))')
py> t.repeat()  # default is 1000000 loops
[1.6585235428065062, 1.6372932828962803, 1.6347543047741055, 1.6457603527233005, 1.6405461430549622]
So that's about 1.6 microseconds extra per loop on my PC. (For the sake of comparison, unpacking the tuple into separate variables costs about 0.6µs on my machine; so does calling len().) I would expect most people running this on a newer PC to get one tenth of that, or even 1/100, but let's assume a machine even slower and older than mine, and call it 3µs to be safe.
What are you doing inside the loop with the zipped up items that 3µs is a serious performance bottleneck for your application?
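As a rough cross-check of the arithmetic, `Timer.repeat()` reports the total seconds for all loops of each run, so the per-loop cost is that total divided by the loop count. The sketch below only redoes that division on the first figure quoted above; the variable names are illustrative and nothing is re-measured.

```python
# Total seconds for one run, copied from the timing output quoted above.
total_seconds = 1.6585235428065062
loops = 1_000_000  # timeit's default number of loops per run

per_loop_seconds = total_seconds / loops
per_loop_microseconds = per_loop_seconds * 1e6

print(f"{per_loop_microseconds:.2f} microseconds per loop")
```

Whether that per-loop cost matters still depends on how much work the loop body does, which is the question being asked here.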
The overhead is O(len(iterables) * len(iterables[0])). Given that zip is used a lot and most uses of zip should probably be strict,
That's not a given. I would say that most uses of zip should not be strict.
this is a significant problem.
Without actual measurements, this is a classic example of premature micro-optimization. Let's see some real benchmarks proving that a Python version is too slow in real-life code first. -- Steven
On Sat, May 2, 2020 at 6:09 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 04:58:43PM +0200, Alex Hall wrote:
there's an `if` statement executed in Python for every item in every iterable.
I didn't think carefully about this implementation and thought that there was only a performance cost in the error case. That's obviously not true
Sorry, this does not demonstrate that the performance cost is significant.
This adds one "if" per loop, terminating on (one more than) the shortest input. So O(N) on the length of the input. That's usually considered reasonable, provided the per item cost is low.
The test in the "if" is technically O(N) on the number of input iterators, but since that's usually two, and rarely more than a handful, it's close enough to a fixed cost.
On my old and slow PC `sentinel in combo` is quite fast:
`sentinel in combo` is problematic if some values have overridden `__eq__`. I referred to this problem in a previous email to you, saying that people had copied this buggy implementation from SO and that it still hadn't been fixed after being pointed out. The fact that you missed this helps to prove my point. Getting this right is hard. Fortunately, more_itertools avoids this bug by not using `in`, which you seem to not have noticed even though I copied its implementation in the email you're responding to.
Without actual measurements, this is a classic example of premature micro-optimization.
Let's see some real benchmarks proving that a Python version is too slow in real-life code first.
Here is a comparison of the current zip with more-itertools' zip_equal:
```
import timeit
from collections import deque
from itertools import zip_longest

_marker = object()


class UnequalIterablesError(Exception):
    pass


def zip_equal(*iterables):
    """``zip`` the input *iterables* together, but throw an
    ``UnequalIterablesError`` if any of the *iterables* terminate before
    the others.
    """
    for combo in zip_longest(*iterables, fillvalue=_marker):
        for val in combo:
            if val is _marker:
                raise UnequalIterablesError(
                    "Iterables have different lengths."
                )
        yield combo


x1 = list(range(1000))
x2 = list(range(1000, 2000))


def my_timeit(stmt):
    print(timeit.repeat(stmt, globals=globals(), number=10000, repeat=3))


def consume(iterator):
    deque(iterator, maxlen=0)


my_timeit("consume(zip(x1, x2))")
my_timeit("consume(zip_equal(x1, x2))")
```
<http://python.org/psf/codeofconduct/>
Output:
[0.15032896999999995, 0.146724568, 0.14543148299999997]
[2.039809026, 2.060877259, 2.0211361649999997]
So the Python version is about 13 times slower, and 10 million iterations (quite plausible) adds about 2 seconds. That's not disastrous, but I think it's significant enough that someone working with large amounts of data and concerned about performance might choose to risk accidental malformed input.
On Sat, May 02, 2020 at 07:43:44PM +0200, Alex Hall wrote:
On Sat, May 2, 2020 at 6:09 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 04:58:43PM +0200, Alex Hall wrote:
there's an `if` statement executed in Python for every item in every iterable.
I didn't think carefully about this implementation and thought that there was only a performance cost in the error case. That's obviously not true
Sorry, this does not demonstrate that the performance cost is significant.
This adds one "if" per loop, terminating on (one more than) the shortest input. So O(N) on the length of the input. That's usually considered reasonable, provided the per item cost is low.
The test in the "if" is technically O(N) on the number of input iterators, but since that's usually two, and rarely more than a handful, it's close enough to a fixed cost.
On my old and slow PC `sentinel in combo` is quite fast:
`sentinel in combo` is problematic if some values have overridden `__eq__`. I referred to this problem in a previous email to you, saying that people had copied this buggy implementation from SO and that it still hadn't been fixed after being pointed out. The fact that you missed this helps to prove my point. Getting this right is hard.
I didn't miss it, I ignored it as YAGNI. Seriously, if some object defines a weird `__eq__` then half the standard library, including builtins, stops working "correctly". See for example the behaviour of float NANs in lists. My care factor for this is negligible, until such time that it is proven to be an issue for real objects in real code. Until then, YAGNI.
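For readers who have not run into the NaN behaviour mentioned here, a short sketch of what it looks like in practice. It uses only builtins; the comments describe standard CPython behaviour, where containment and list comparison test identity before equality.

```python
nan = float("nan")

print(nan == nan)              # False: NaN never compares equal, even to itself
print(nan in [nan])            # True: containment checks identity before equality
print(float("nan") in [nan])   # False: a different NaN object fails both checks
print([nan] == [nan])          # True: list comparison also short-circuits on identity
```

So the same value can appear to be both in and not in a list, depending on whether the very same object is used for the lookup.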
Fortunately, more_itertools avoids this bug by not using `in`, which you seem to not have noticed even though I copied its implementation in the email you're responding to.
Which by my testing on my machine is nearly ten times slower than the more obvious use of `in`.
Without actual measurements, this is a classic example of premature
micro-optimization.
Let's see some real benchmarks proving that a Python version is too slow in real-life code first.
Here is a comparison of the current zip with more-itertools' zip_equal: [...]
my_timeit("consume(zip_equal(x1, x2))") ``` <http://python.org/psf/codeofconduct/>
Huh, there's that weird link to the CoC again.
So the Python version is about 13 times slower, and 10 million iterations (quite plausible) adds about 2 seconds.
Adds two seconds to *what* though? That's why I care more about benchmarks than micro benchmarks. In real-world code, you are going to be processing the data somehow. Adds two seconds to an hour's processing time? I couldn't care less. Adds two seconds to a second? Now I'm interested. To be clear here, I'm not arguing *against* a C accelerated version. I'm arguing against the *necessity* of a C version, based only on micro benchmarks. If the PEP is accepted, and this goes into itertools, then whether it is implemented in C or Python should be a matter for the implementer. We shouldn't argue that this *must* be a builtin because otherwise it will be too slow. That's a bogus argument.
That's not disastrous, but I think it's significant enough that someone working with large amounts of data and concerned about performance might choose to risk accidental malformed input.
That's their choice to make, not ours. If they are worried about unequal input lengths, they can always truncate the data to make them equal *wink* [Oh no, I have a sudden image in my head of people using zip to truncate their data to equal lengths, before passing it on to zip_strict "to be sure".] -- Steven
Here is a comparison of the current zip with more-itertools' zip_equal:
So the Python version is about 13 times slower, and 10 million iterations (quite plausible) adds about 2 seconds.
Adds two seconds to *what* though? That's why I care more about benchmarks than micro benchmarks. In real-world code, you are going to be processing the data somehow. Adds two seconds to an hour's processing time? I couldn't care less. Adds two seconds to a second? Now I'm interested.
It seems to me that a Python implementation of zip_equals() shouldn't do the check in a loop like a version shows (I guess from more-itertools). More obvious is the following, and this has only a small constant speed penalty.

    def zip_equal(*its):
        yield from zip(*its)
        if any(_sentinel == next(o, _sentinel) for o in its):
            raise ZipLengthError

I still like zip_strict() better as a name, but whatever. And I don't care what the exception is, or what the sentinel is called.
... and yes, I realize you can quibble about exactly how many of the multiple iterators passed in might consume an extra element, which is possibly more with this code. But honestly, if your use case is "unequal lengths are an error" then you just simply do not care about that. *MY* use, to the contrary seems like it's more like Steven's. I.e. I do this fairly often: zip(a_few_things, lots_available). Not necessarily infinite, but where I expect "more than enough" of the longer iterator.
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
Oops. I don't mean any(), but rather 'not all()'. Or alternatively, != instead of ==. Same point though.
On Sun, May 03, 2020 at 11:13:58PM -0400, David Mertz wrote:
It seems to me that a Python implementation of zip_equals() shouldn't do the check in a loop like a version shows (I guess from more-itertools). More obvious is the following, and this has only a small constant speed penalty.
def zip_equal(*its): yield from zip(*its) if any(_sentinel == next(o, _sentinel) for o in its): raise ZipLengthError
Alas, that doesn't work, even with your correction of `any` to `not all`.

    py> list(zip_equal("abc", "xy"))
    [('a', 'x'), ('b', 'y')]

The problem here is that zip consumes the "c" from the first iterator, exhausting it, so your check at the end finds that all the iterators are exhausted. Here's the function I used:

    def zip_equal(*its):
        _sentinel = object()
        its = tuple(map(iter, its))
        yield from zip(*its)
        if not all(_sentinel == next(o, _sentinel) for o in its):
            raise RuntimeError
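The consumption order described above can be seen directly with plain `zip` and explicit iterators. This is a small sketch with illustrative variable names: `zip` pulls from its arguments left to right, so an item taken from an earlier iterator is discarded when a later one turns out to be exhausted.

```python
first = iter("abc")
second = iter("xy")

pairs = list(zip(first, second))
# zip took "c" from `first` before finding `second` empty, so "c" is gone.
leftover = next(first, None)

# With the shorter iterator first, zip stops before touching "c".
first = iter("abc")
second = iter("xy")
pairs_reversed = list(zip(second, first))
leftover_reversed = next(first, None)

print(pairs, leftover)
print(pairs_reversed, leftover_reversed)
```

This is why a check placed after `yield from zip(*its)` cannot tell "abc" and "xy" apart from two equal-length inputs when the longer one comes first.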
I still like zip_strict() better as a name, but whatever. And I don't care what the exception is, or what the sentinel is called.
The sentinel is a local variable (or at least it ought to be) -- there is no need to make it a global. -- Steven
On Mon, 4 May 2020 at 12:41, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, May 03, 2020 at 11:13:58PM -0400, David Mertz wrote:
It seems to me that a Python implementation of zip_equals() shouldn't do the check in a loop like a version shows (I guess from more-itertools). More obvious is the following, and this has only a small constant speed penalty.
def zip_equal(*its): yield from zip(*its) if any(_sentinel == next(o, _sentinel) for o in its): raise ZipLengthError
Alas, that doesn't work, even with your correction of `any` to `not all`.
py> list(zip_equal("abc", "xy")) [('a', 'x'), ('b', 'y')]
The problem here is that zip consumes the "c" from the first iterator, exhausting it, so your check at the end finds that all the iterators are exhausted.
This got me thinking, what if we were to wrap (or as it turned out, `chain` on to the end of) each of the individual iterables instead, thereby performing the relevant check before `zip` fully exhausted them, something like the following:
```python
import itertools


def zip_equal(*iterables):
    return zip(*_checked_simultaneous_exhaustion(*iterables))


def _checked_simultaneous_exhaustion(*iterables):
    if len(iterables) <= 1:
        return iterables

    def check_others():
        # first iterable exhausted, check the others are too
        sentinel = object()
        if any(next(i, sentinel) is not sentinel for i in iterators):
            raise ValueError('unequal length iterables')
        if False:
            yield  # unreachable; makes this function a generator

    def throw():
        # one of iterables[1:] exhausted first, therefore it must be shorter
        raise ValueError('unequal length iterables')
        if False:
            yield  # unreachable; makes this function a generator

    iterators = tuple(map(iter, iterables[1:]))
    return (
        itertools.chain(iterables[0], check_others()),
        *(itertools.chain(it, throw()) for it in iterators),
    )
```
This has the advantage that, if desired, the `_checked_simultaneous_exhaustion` function could also be reused to implement a previously mentioned length checking version of `map`.
Going further, if `checked_simultaneous_exhaustion` were to become a public function (with a better name), it could be used to impose same-length checking to the iterable arguments of any function, providing those iterables are consumed in a compatible way. Additionally, it would allow one to be specific about which iterables were checked, rather than being forced into the option of checking either all or none by `zip_equal` / `zip` respectively, thus allowing us to have our cake and eat it in terms of mixing infinite and checked-length finite iterables, e.g.
```python
zip(i_am_infinite, *checked_simultaneous_exhaustion(*but_we_are_finite))

# or, if they aren't contiguous
checked1, checked2 = checked_simultaneous_exhaustion(it1, it2)
zip(checked1, infinite, checked2)
```
However, as I previously alluded to, this relies upon the assumption that each of the given iterators is advanced in turn, in the order they were provided to `checked_simultaneous_exhaustion`. So -- while this function would be suitable for use with `zip`, `map`, and any others which do the same -- if we wanted a more general `checked_equal_length` function that extended to cases in which the iterable-consuming function may consume the iterables in some haphazard order, we'd need something more involved, such as keeping a running tally of the current length of each iterable and, even then, we could still only guarantee raising on unequal lengths if the said function advanced all the given iterators by at least the length of the shortest.
On Mon, May 4, 2020 at 2:33 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 2, 2020 at 6:09 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 04:58:43PM +0200, Alex Hall wrote:
I didn't think carefully about this implementation and thought that
was only a performance cost in the error case. That's obviously not
-
there's an `if` statement executed in Python for every item in every iterable.
Sorry, this does not demonstrate that the performance cost is significant.
This adds one "if" per loop, terminating on (one more than) the shortest input. So O(N) on the length of the input. That's usually considered reasonable, provided the per item cost is low.
The test in the "if" is technically O(N) on the number of input iterators, but since that's usually two, and rarely more than a handful, it's close enough to a fixed cost.
On my old and slow PC `sentinel in combo` is quite fast:
On Sat, May 02, 2020 at 07:43:44PM +0200, Alex Hall wrote:
`sentinel in combo` is problematic if some values have overridden `__eq__`. I referred to this problem in a previous email to you, saying that people had copied this buggy implementation from SO and that it still hadn't been fixed after being pointed out. The fact that you missed this helps to prove my point. Getting this right is hard.
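One way to sidestep the `__eq__` problem is to test for the sentinel by identity rather than with `in`. The following is a minimal sketch, not an implementation proposed in this thread; `zip_equal_identity` and the `Awkward` class are hypothetical names used only for illustration.

```python
from itertools import zip_longest


def zip_equal_identity(*iterables):
    """Like zip(), but raise ValueError if the iterables differ in length."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        # Compare by identity so that no element's __eq__ is ever called.
        if any(item is sentinel for item in combo):
            raise ValueError("Iterables have different lengths")
        yield combo


class Awkward:
    """Hypothetical stand-in for an object whose __eq__ misbehaves."""

    def __eq__(self, other):
        raise ValueError("truth value is ambiguous")


# Equal lengths: works even though Awkward.__eq__ always raises.
pairs = list(zip_equal_identity([Awkward(), Awkward()], "ab"))
print(len(pairs))

# Unequal lengths: the error comes from the length check itself.
try:
    list(zip_equal_identity([Awkward()], "ab"))
except ValueError as exc:
    print(exc)
```

The trade-off is a generator expression per tuple instead of a single `in` test, which is likely to be slower in pure Python.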
I didn't miss it, I ignored it as YAGNI.
Seriously, if some object defines a weird `__eq__` then half the standard library, including builtins, stops working "correctly". See for example the behaviour of float NANs in lists.
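The NaN behaviour referred to here can be seen directly. Containers check identity before equality, so the result of `in` depends on whether the very same NaN object is stored in the list. A short illustration:

```python
nan = float("nan")

# NaN never compares equal, not even to itself.
print(nan == nan)             # False

# Membership tests try identity first, so the same object is "found"...
print(nan in [nan])           # True

# ...while a different NaN object is not.
print(float("nan") in [nan])  # False

# List equality uses the same identity shortcut.
print([nan] == [nan])         # True
```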
My care factor for this is negligible, until such time that it is proven to be an issue for real objects in real code. Until then, YAGNI.
Here is an example:
```
import numpy as np
from itertools import zip_longest


def zip_equal(*iterables):
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if sentinel in combo:
            raise ValueError('Iterables have different lengths')
        yield combo


arr = np.arange(8).reshape((2, 2, 2))
print(arr)
print(list(zip(*arr)))
print(list(zip_equal(*arr)))
```
The output:
```
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]
[(array([0, 1]), array([4, 5])), (array([2, 3]), array([6, 7]))]
Traceback (most recent call last):
  File "/home/alex/.config/JetBrains/PyCharm2020.1/scratches/scratch_666.py", line 15, in <module>
    print(list(zip_equal(*arr)))
  File "/home/alex/.config/JetBrains/PyCharm2020.1/scratches/scratch_666.py", line 8, in zip_equal
    if sentinel in combo:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
I know for a fact that this would confuse people badly because I've seen multiple people who know what this error message generally refers to incorrectly identify where exactly it's coming from in a similar case: https://stackoverflow.com/questions/60780328/python-valueerror-the-truth-val...
On Mon, May 04, 2020 at 09:20:28PM +0200, Alex Hall wrote:
Seriously, if some object defines a weird `__eq__` then half the standard library, including builtins, stops working "correctly". See for example the behaviour of float NANs in lists.
My care factor for this is negligible, until such time that it is proven to be an issue for real objects in real code. Until then, YAGNI.
Here is an example:
Alex, I understand the point you are trying to make, and I got the reference to numpy the first time you referenced it. I just don't care about it. As far as I am concerned, numpy arrays' equality behaviour is even more broken than float NANs, and it's not the stdlib's responsibility to guarantee "correctness" (for some definition thereof) if you use broken classes in your data -- especially not for something of such marginal value as "zip_strict", as you admitted yourself:
"The problem is that no one really *needs* this check. You *can* do without it."
Right. So it's a "nice-to-have", not an essential function, and it can go into itertools. The itertools implementer can decide for themselves whether they care to provide a C accelerated version as well as a Python version from Day 1, or even whether a recipe is enough.
My point here is entirely that we shouldn't feel ourselves forced into *preemptively* providing a C version, let alone making this a builtin, just because `x in y` breaks if one of the elements of y is a numpy array. numpy itself doesn't need this function, they do their own length checks.
-- Steven
Steven: I understand that Alex said he thought that putting "strict" in as a flag would make it a bit more likely that people would use it, and that he thinks that's a good thing, and you think that's a bad thing, but...
Unless we were to make it the default behavior, very few people are going to be adding this flag "just in case". And the fact that you think that making it a flag will make it more likely that folks will use it is an argument for making it a flag. Unless you don't like the idea at all, and want it to be more obscure and hard to find.
In any case, you seem to be making the argument that a few of us are putting forward: yes, a flag on zip() is likely to get more use than a function in itertools. Thanks for the support :-)
However, you also seem to be making the argument that this feature would do more harm than good. I disagree, but if that's what you think, then it shouldn't be added at all, so please make that case, rather than arguing for adding it, but making it harder to find.
-CHB
On Mon, May 4, 2020 at 6:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, May 04, 2020 at 09:20:28PM +0200, Alex Hall wrote:
Seriously, if some object defines a weird `__eq__` then half the standard library, including builtins, stops working "correctly". See for example the behaviour of float NANs in lists.
My care factor for this is negligible, until such time that it is proven to be an issue for real objects in real code. Until then, YAGNI.
Here is an example:
Alex, I understand the point you are trying to make, and I got the reference to numpy the first time you referenced it. I just don't care about it. As far as I am concerned, numpy arrays' equality behaviour is even more broken than float NANs, and it's not the stdlib's responsibility to guarantee "correctness" (for some definition thereof) if you use broken classes in your data -- especially not for something of such marginal value as "zip_strict", as you admitted yourself:
"The problem is that no one really *needs* this check. You *can* do without it."
Right. So it's a "nice-to-have", not an essential function, and it can go into itertools. The itertools implementer can decide for themselves whether they care to provide a C accelerated version as well as a Python version from Day 1, or even whether a recipe is enough.
My point here is entirely that we shouldn't feel ourselves forced into *preemptively* providing a C version, let alone making this a builtin, just because `x in y` breaks if one of the elements of y is a numpy array. numpy itself doesn't need this function, they do their own length checks.
-- Steven _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/ZIQZLT... Code of Conduct: http://python.org/psf/codeofconduct/
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Tue, 5 May 2020 at 07:22, Christopher Barker <pythonchb@gmail.com> wrote:
In any case, you seem to be making the argument that a few of us are putting forward: yes, a flag on zip() is likely to get more use than a function in itertools. Thanks for the support :-)
I'd like to add my voice to the people saying that if someone isn't willing to go and find the correct function, and import it from the correct module, to implement the behaviour that they want, then I have no interest in making it easier for them to write their code correctly, because they seem to have very little interest in correctness. Can someone come up with any sort of credible argument that someone who's trying to write their code correctly would be in any way inconvenienced by having to get the functionality from itertools?
It seems like we're trying to design a way for people to "accidentally" write correct code without trying to, and without understanding what could go wrong if they use the current zip function. I'm OK with "make it easy to do the right thing", but "make it easy to do the right thing by accident" is a step too far IMO.
Paul
I feel like that argument is flawed. I cannot think of another good example (I am sure there are plenty!) but there is a big difference in discoverability between:
A function that does something *different*, whose functionality does not exist in a built-in (or whatever namespace you are considering). For example, zip_longest vs. zip: if you know/expect one of your iterators to run out early, but do not wish the zipping to end, normal zip won't do and so you will end up searching for an alternative.
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I did not phrase that particularly well, but I am hoping people get the gist/can rephrase it better.
On Tue, 5 May 2020 at 11:34, Paul Moore <p.f.moore@gmail.com> wrote:
On Tue, 5 May 2020 at 07:22, Christopher Barker <pythonchb@gmail.com> wrote:
In any case, you seem to be making the argument that a few of us are putting forward: yes, a flag on zip() is likely to get more use than a function in itertools. Thanks for the support :-)
I'd like to add my voice to the people saying that if someone isn't willing to go and find the correct function, and import it from the correct module, to implement the behaviour that they want, then I have no interest in making it easier for them to write their code correctly, because they seem to have very little interest in correctness. Can someone come up with any sort of credible argument that someone who's trying to write their code correctly would be in any way inconvenienced by having to get the functionality from itertools?
It seems like we're trying to design a way for people to "accidentally" write correct code without trying to, and without understanding what could go wrong if they use the current zip function. I'm OK with "make it easy to do the right thing", but "make it easy to do the right thing by accident" is a step too far IMO.
Paul
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
-- Rhodri James *-* Kynesim Ltd
Brandt's example with ast in the stdlib I think is a pretty good example of this.
On Tue, 5 May 2020 at 13:27, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
-- Rhodri James *-* Kynesim Ltd
On 05/05/2020 13:53, Henk-Jaap Wagenaar wrote:
Brandt's example with ast in the stdlib I think is a pretty good example of this.
On Tue, 5 May 2020 at 13:27, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
Interesting, because I'd call it a counterexample to your point. The bug's authors should have cared about their input, but didn't.
-- Rhodri James *-* Kynesim Ltd
On Wed, May 6, 2020 at 1:44 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:53, Henk-Jaap Wagenaar wrote:
Brandt's example with ast in the stdlib I think is a pretty good example of this.
On Tue, 5 May 2020 at 13:27, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
Interesting, because I'd call it a counterexample to your point. The bug's authors should have cared about their input, but didn't.
Should they? I'm not sure how well-supported this actually is. If you hand-craft an AST and then compile it, is it supposed to catch every possible malformation? Has Python ever made any promises about *anything* regarding manual creation of AST nodes? Maybe it would be *nice* if it noticed the bug for you, but if you're messing around with this sort of thing, it's not that unreasonable to expect you to get your inputs right.
If you're creating a language from scratch and want to have separate "strict" and "truncating" forms of zip, then by all means, go ahead. But I think the advantage here is marginal and the backward compatibility break large.
ChrisA
This is a straw man in regards to backwards compatibility. This particular (sub)thread is about whether this zip-is-strict, either as a separate name or a Boolean flag or some other flag of zip, should be a built-in or be in e.g. itertools. It is not about breaking backwards compatibility (presumably by making it the default behaviour of zip).
On Tue, 5 May 2020, 17:17 Chris Angelico, <rosuav@gmail.com> wrote:
On Wed, May 6, 2020 at 1:44 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:53, Henk-Jaap Wagenaar wrote:
Brandt's example with ast in the stdlib I think is a pretty good example of this.
On Tue, 5 May 2020 at 13:27, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
Interesting, because I'd call it a counterexample to your point. The bug's authors should have cared about their input, but didn't.
Should they? I'm not sure how well-supported this actually is. If you hand-craft an AST and then compile it, is it supposed to catch every possible malformation? Has Python ever made any promises about *anything* regarding manual creation of AST nodes? Maybe it would be *nice* if it noticed the bug for you, but if you're messing around with this sort of thing, it's not that unreasonable to expect you to get your inputs right.
If you're creating a language from scratch and want to have separate "strict" and "truncating" forms of zip, then by all means, go ahead. But I think the advantage here is marginal and the backward compatibility break large.
ChrisA
On 05/05/2020 17:26, Henk-Jaap Wagenaar wrote:
This is a straw man in regards to backwards compatibility. This particular (sub)thread is about whether this zip-is-strict, either as a separate name or a Boolean flag or some other flag of zip, should be a built-in or be in e.g. itertools.
It is not about breaking backwards compatibility (presumably by making it the default behaviour of zip).
Except that that's part of the thinking involved in choosing a flag instead of the usual new function. No one (I think) is claiming that we should break backwards compatibility and default to strict=True, but having the flag is a strong statement that length-checking is an intrinsic part of zipping. I don't believe that's true, and in consequence I think adding a flag is a mistake.
-- Rhodri James *-* Kynesim Ltd
On Tue, May 05, 2020 at 05:26:02PM +0100, Henk-Jaap Wagenaar wrote:
This is a straw man in regards to backwards compatibility. This particular (sub)thread is about whether this zip-is-strict, either as a separate name or a Boolean flag or some other flag of zip, should be a built-in or be in e.g. itertools.
Please don't misuse "strawman" in that fashion. A strawman argument is a logical fallacy where you attack a weaker position your opponent didn't make in order to make your own position stronger. That's not what Chris did, and frankly accusing him of strawmanning is a form of "poisoning the well".
What Chris did was to propose a counterfactual to express his opinion on this proposal. To paraphrase:
"If this were true (we were designing zip from scratch for the first time) then I would agree with the proposal, but since we aren't, I disagree because of these reasons."
That is a perfectly legitimate position to take.
"If we weren't in lockdown, I would take you out for dinner at a restaurant, but since we are in quarantine, I don't think we should go out."
Personally, I don't think Chris' backwards-compatibility argument is strong. Technically adding a new keyword argument to a function is backwards-incompatible, but we normally exclude that sort of change. Who writes this?
    # This behaviour will be changed by the proposed new parameter.
    zip('', strict=1)  # Raise a type error.
So I think the *backwards incompatibility* argument is weak in that regard. But maybe Chris has got a different perspective on this that I haven't thought of.
[Chris]
Should they? I'm not sure how well-supported this actually is. If you hand-craft an AST and then compile it, is it supposed to catch every possible malformation?
I would expect that the ast library should accept anything which could come from legal Python, and nothing that doesn't.
-- Steven
On Tue, 5 May 2020, 18:24 Steven D'Aprano, <steve@pearwood.info> wrote:
On Tue, May 05, 2020 at 05:26:02PM +0100, Henk-Jaap Wagenaar wrote:
This is a straw man in regards to backwards compatibility. This particular (sub)thread is about whether this zip-is-strict, either as a separate name or a Boolean flag or some other flag of zip, should be a built-in or be in e.g. itertools.
Please don't misuse "strawman" in that fashion. A strawman argument is a logical fallacy where you attack a weaker position your opponent didn't make in order to make your own position stronger. That's not what Chris did, and frankly accusing him of strawmanning is a form of "poisoning the well".
What Chris did was to propose a counterfactual to express his opinion on this proposal. To paraphrase:
"If this were true (we were designing zip from scratch for the first time) then I would agree with the proposal, but since we aren't, I disagree because of these reasons."
That is a perfectly legitimate position to take.
I agree on the face of it (in regards to strawmanning and your paraphrasing). I wasn't disagreeing with anything you go into detail about above; I disagreed with one of the reasons listed, namely "the backward compatibility break large", and thought that one was strawmanning (see further down for why).
"If we weren't in lockdown, I would take you out for dinner at a restaurant, but since we are in quarantine, I don't think we should go out."
Personally, I don't think Chris' backwards-compatibility argument is strong. Technically adding a new keyword argument to a function is backwards-incompatible, but we normally exclude that sort of change. Who writes this?
    # This behaviour will be changed by the proposed new parameter.
    zip('', strict=1)  # Raise a type error.
So I think the *backwards incompatibility* argument is weak in that regard. But maybe Chris has got a different perspective on this that I haven't thought of.
I cannot interpret that as a "large" break as Chris says, so I must assume he meant something else (changing the default is my assumption) unless somebody (Chris or otherwise) can tell me why adding a keyword argument would be a large incompatible change?
[Chris]
Should they? I'm not sure how well-supported this actually is. If you hand-craft an AST and then compile it, is it supposed to catch every possible malformation?
I would expect that the ast library should accept anything which could come from legal Python, and nothing that doesn't.
-- Steven
On Wed, May 6, 2020 at 3:25 AM Steven D'Aprano <steve@pearwood.info> wrote:
Personally, I don't think Chris' backwards-compatibility argument is strong. Technically adding a new keyword argument to a function is backwards-incompatible, but we normally exclude that sort of change. Who writes this?
    # This behaviour will be changed by the proposed new parameter.
    zip('', strict=1)  # Raise a type error.
So I think the *backwards incompatibility* argument is weak in that regard. But maybe Chris has got a different perspective on this that I haven't thought of.
Adding the flag isn't a problem, but merely adding the flag is useless. (Ditto if a new function is created.) The assumption is that the flag will be used, changing existing code from zip(x, y) to zip_strict(x, y) or zip(x, y, strict=True). Either way, it's not the creation of zip_strict or the addition of the kwonly arg that breaks backward compat, but the change to (say) the ast module, making use of this, that will cause problems.
[Chris]
Should they? I'm not sure how well-supported this actually is. If you hand-craft an AST and then compile it, is it supposed to catch every possible malformation?
I would expect that the ast library should accept anything which could come from legal Python, and nothing that doesn't.
It absolutely should accept anything which could come from legal Python. The question is, to what extent should it attempt to flag invalid forms? For example, if you mess up the lineno fields, attempting to compile that to a code object won't give you a nice exception. It'll most likely just work, and give weird results in tracebacks - but there is no legal code that could have produced that. And what code could produce this?
>>> from ast import *
>>> eval(compile(fix_missing_locations(Expression(body=Set(elts=[]))), "-", "eval"))
set()
There is no valid code that can create a Set node with an empty element list, yet it's perfectly sensible, and when executed, it produces... an empty set. Exactly like you'd expect. Should it be an error just because there's no Python code that can create it?
I'm of the opinion that it's okay for it to accept things that technically can't result from any parse, just as long as there's a reasonable interpretation of them. Which means that both raising and compiling silently are valid results here; it's just a question of whether you consider "ignore the spares" to be a reasonable interpretation of an odd AST, or if you consider "mismatched lengths" to be a fundamental error.
ChrisA
But if you care about your input, you can do so by setting strict=True (if that's the road we go down), and unlike what others have said, the IDE I use (pycharm) would tell me that flag exists as I type "zip" and so I'd be more likely to use it than if it was in itertools/...
On Tue, 5 May 2020, 16:41 Rhodri James, <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:53, Henk-Jaap Wagenaar wrote:
Brandt's example with ast in the stdlib I think is a pretty good example of this.
On Tue, 5 May 2020 at 13:27, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 05/05/2020 13:12, Henk-Jaap Wagenaar wrote:
A function that is a "safer" version in some "edge case" (not extra functionality but better error handling basically) but that does otherwise work as expected is not something one will search for automatically. This is zip versus zip-with-strict-true.
I'm sorry, I don't buy it. This isn't an edge case, it's all about whether you care about what your input is. In that sense, it's exactly like the relationship between zip and zip_longest.
Interesting, because I'd call it a counterexample to your point. The bug's authors should have cared about their input, but didn't.
-- Rhodri James *-* Kynesim Ltd
On Tue, May 05, 2020 at 05:28:08PM +0100, Henk-Jaap Wagenaar wrote:
But if you care about your input, you can do so by setting strict=True (if that's the road we go down), and unlike what others have said, the IDE I use (pycharm) would tell me that flag exists as I type "zip" and so I'd be more likely to use it than if it was in itertools/...
We keep coming to this same old argument over and over.
"If it's builtin people will be more likely to use it, so we need to make it builtin."
This argument will apply to **literally** every function and class in the standard library. Pick any arbitrary module, any function from that module: `imghdr.what`. I literally didn't even know that function existed until 30 seconds ago. If it had been a builtin, I would have known about it years ago, and would have been more likely to use it.
All this is true. But it's not an argument in favour of making it a builtin. (Or at least not a *good* argument.)
Firstly, we would have to agree that "maximizing the number of people using the strict version of zip" is our goal. I don't think it is. We don't try to maximize the number of people using `imghdr.what` -- it is there for those who need it, but we're not trying to push people to use it whether they need it or not.
And secondly, that assumes that the benefit gained is greater than the cost in making the builtins more complicated. It now has two functions with the same name, `zip`, distinguished by a runtime flag, even though that flag will nearly always be given as a compile-time constant:
    # Almost always specified as a compile-time constant:
    zip(..., strict=True)
    # Almost never as a runtime variable:
    flag = settings['zip'].tolerant
    zip(..., strict=not flag)
That "compile-time constant" suggests that, absent some compelling reason, the two functions ought to be split into separate named functions. "But my IDE..." is not a compelling reason.
This is not a hard law, but it is a strong principle. Compile-time constants to swap from two distinct modes often make for poor APIs, and we should be reluctant to design our functions that way if we can avoid it. (Sometimes we can't -- but this is not one of those times.)
Think about the strange discrepancy between the three (so far...) kinds of zip:
- zip (shortest) is builtin, controlled by a flag;
- zip strict is builtin, controlled by a flag;
- zip longest is in a module, with a distinct name.
Why is zip_longest different? What if we want to add a fourth or fifth flavour of zip? Do we then have three flags on zip and have to deal with eight combinations of them?
-- Steven
"If it's builtin people will be more likely to use it, so we need to make it builtin."
This argument will apply to **literally** every function and class in the standard library.
But we are not talking about adding a new builtin.
Firstly, we would have to agree that "maximizing the number of people using the strict version of zip" is our goal. I don't think it is.
Neither do I. But I am suggesting that the goal be "maximizing the number of people who need a strict version of zip and actually use it", rather than, say, checking the length of the inputs before calling zip, or writing their own version.
Think about the strange discrepancy between the three (so far...) kinds of zip:
- zip (shortest) is builtin, controlled by a flag;
- zip strict is builtin, controlled by a flag;
- zip longest is in a module, with a distinct name.
Why is zip_longest different? What if we want to add a fourth or fifth flavour of zip? Do we then have three flags on zip and have to deal with eight combinations of them?
no -- but we could (and I think should) have a ternary flag, so that zip_longest becomes unnecessary. And we'd never get to eight combinations: you can't have longest and shortest behavior at the same time!
But if we did, then would it be better to have eight separate functions in itertools?
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
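The three-valued flag suggested here could be sketched as a thin wrapper over the existing tools. This is only an illustration: `zip_mode`, its `mode` parameter, and the string values are hypothetical names, not an API that exists in Python or that the thread settled on.

```python
from itertools import zip_longest


def _zip_equal(*iterables):
    """Yield tuples like zip(), raising ValueError on a length mismatch."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        # Identity test, so element __eq__ methods are never involved.
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


def zip_mode(*iterables, mode="shortest", fillvalue=None):
    """Hypothetical zip with a three-valued mode: shortest, longest or equal."""
    if mode == "shortest":
        return zip(*iterables)
    if mode == "longest":
        return zip_longest(*iterables, fillvalue=fillvalue)
    if mode == "equal":
        return _zip_equal(*iterables)
    raise ValueError(f"unknown mode: {mode!r}")


print(list(zip_mode("abc", [1, 2])))
# [('a', 1), ('b', 2)]
print(list(zip_mode("abc", [1, 2], mode="longest", fillvalue=0)))
# [('a', 1), ('b', 2), ('c', 0)]
```

Note that `fillvalue` only has meaning for one of the three modes, which is the kind of parameter interaction the other side of this discussion objects to.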
On 05/05/2020 21:03, Christopher Barker wrote:
"If it's builtin people will be more likely to use it, so we need to make it builtin."
This argument will apply to **literally** every function and class in the standard library.
But we are not talking about adding a new builtin.
Well, actually we are. As Steven pointed out further down the post, adding a flag to a function that is pretty much always going to be set at compile time is equivalent to (and IMHO would be better expressed as) a new function.
-- Rhodri James *-* Kynesim Ltd
Christopher's quoting is kinda messed up and I can't be bothered fixing it, sorry, so you'll just have to guess who said what :-)
On Tue, May 05, 2020 at 01:03:30PM -0700, Christopher Barker wrote:
"If it's builtin people will be more likely to use it, so we need to make it builtin."
This argument will apply to **literally** every function and class in the standard library.
But we are not talking about adding a new builtin.
I didn't say a *new* builtin. You are talking about having this related but distinct functionality piggy-back on top of the existing tolerant zip function, distinguishing them by a flag. I trust you wouldn't try to argue that `int(string, base)` is not a builtin function? :-)
Firstly, we would have to agree that "maximizing the number of people using the strict version of zip" is our goal. I don't think it is.
Neither do I. But I am suggesting that the goal be "maximizing the number of people who need a strict version of zip and actually use it", rather than, say, checking the length of the inputs before calling zip, or writing their own version.
Okay, but a function in the std lib is sufficient for that. If you want to argue against the alternatives:
- use more-itertools
- make it a recipe in the docs
then "more people will use it" is a good argument for putting it into the stdlib. But why should we fear that there will be people doing without, or rolling their own, because it's not builtin?
`zip_longest` has been in the stdlib for at least a decade. We know it has use-cases, and unlike this strict version of zip the need for this was established and proven long ago. If there are people doing without, or rolling their own, zip_longest because they either don't know about it, or cannot be bothered importing from itertools, should it be in builtins too?
Why is zip_longest different? What if we want to add a fourth or fifth flavour of zip? Do we then have three flags on zip and have to deal with eight combinations of them?
no -- but we could (and I think should) have a ternary flag, so that zip_longest becomes unnecessary. And we'd never get to eight combinations: you can't have longest and shortest behavior at the same time!
A ternary flag of strict = True, False or what? This demonstrates why the "constant flag" is so often an antipattern. It doesn't scale past two behaviours. Or you end up with a series of flags:
    zip(*iterators, strict=False, longest=False, fillvalue=None)
and then you have to check for incompatible combinations:
    if longest and strict:
        raise TypeError('cannot combine these two options')
and it becomes worse the more flags you have. Or you end up with deprecated parameters:
    def zip(*iterators, strict=_sentinel, mode=ZIP_MODES.SHORTEST):
        if strict is not _sentinel:
            raise DeprecationWarning
But if we did, then would it be better to have eight separate functions in itertools?
You wouldn't have eight separate functions. You would have four. But to distinguish four independent modes in a single function, you need three flags, and that gives you 2**3 = 8 combinations to deal with, all of which have to be checked, and exceptions raised if the combination is invalid. -- Steven
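The arithmetic in the message above can be checked with a short enumeration. This is only an illustrative sketch: the three flag names are hypothetical placeholders (nobody in the thread proposed exactly these), and the validity rule assumed here is simply "at most one flag may be switched on".

```python
from itertools import product

# Hypothetical flag names, chosen only for illustration.
FLAGS = ("strict", "longest", "other")


def is_valid(combo):
    """Assume a combination is usable only if at most one flag is True."""
    return sum(combo) <= 1


# Every possible assignment of True/False to the three flags.
all_combos = list(product((False, True), repeat=len(FLAGS)))
valid = [combo for combo in all_combos if is_valid(combo)]

print(len(all_combos))  # 8
print(len(valid))       # 4
```

Under that assumption, four of the eight combinations map to a mode (the all-False default plus one per flag) and the other four would have to be rejected with an exception.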
On Tue, May 5, 2020 at 5:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Christopher's quoting is kinda messed up and I can't be bothered fixing it, sorry, so you'll just have to guess who said what :-)
Ideally, we are evaluating ideas independently of who expressed them, so I'll pretend I did that on purpose :-) First: really people, it's all been said. I think we all (and I DO include myself in that) have fallen into the trap that "if folks don't agree with me, I must not have explained myself well enough" -- but in this case, we actually do disagree. And not really on the facts, just on the relative importance. But since, I apparently did not explain myself well enough in this case:
no -- but we could (and I think should) have a ternary flag, so that
zip_longest becomes unnecessary. And we'd never get to eight combinations: you can't have longest and shortest behavior at the same time!
A ternary flag of strict = True, False or what?
Come on:
ternary: having three elements, parts, or divisions (https://www.merriam-webster.com/dictionary/ternary)
did you really not know that? (and "flag" does not always mean "boolean flag", even though it often does (https://techterms.com/definition/flag) )
(by the way, I'm posting those references because I looked them up to make sure I wasn't using terms incorrectly)
This has been proposed multiple times on this list:
a flag that takes three possible values: "shortest" | "longest" | "equal" (defaulting to shortest of course). Name to be bikeshed later :-) (and enum vs string also to be bikeshed later)
This demonstrates why the "constant flag" is so often an antipattern. It
doesn't scale past two behaviours. Or you end up with a series of flags:
zip(*iterators, strict=False, longest=False, fillvalue=None)
I don't think anyone proposed an API like that -- yes, that would be horrid.
There are all sorts of reasons why a ternary flag would not be good, but I do think it should be mentioned in the PEP, even if only as a rejected idea.
But I still like it, 'cause the "flag for two behaviors and another function for the third" seems like the worst of all options.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 2020-05-06 12:48 p.m., Christopher Barker wrote:
On Tue, May 5, 2020 at 5:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Christopher's quoting is kinda messed up and I can't be bothered fixing it, sorry, so you'll just have to guess who said what :-)
Ideally, we are evaluating ideas independently of who expressed them, so I'll pretend I did that on purpose :-)
First: really people, it's all been said. I think we all (and I DO include myself in that) have fallen into the trap that "if folks don't agree with me, I must not have explained myself well enough" -- but in this case, we actually do disagree. And not really on the facts, just on the relative importance.
But since, I apparently did not explain myself well enough in this case:
no -- but we could (and I think should) have a ternary flag, so that
zip_longest becomes unnecessary. And we'd never get to eight combinations: you can't have longest and shortest behavior at the same time!
A ternary flag of strict = True, False or what?
Come on:
ternary: having three elements, parts, or divisions (https://www.merriam-webster.com/dictionary/ternary)
did you really not know that? (and "flag" does not always mean "boolean flag", even though it often does (https://techterms.com/definition/flag) )
(by the way, I'm posting those references because I looked them up to make sure I wasn't using terms incorrectly)
This has been proposed multiple times on this list:
a flag that takes three possible values: "shortest" | "longest" | "equal" (defaulting to shortest of course). Name to be bikeshed later :-) (and enum vs string also to be bikeshed later)
how about "length"?
    length=True   # longest
    length=False  # shortest (default)
    length=None   # equal
(altho I still think the "YAGNI function" system would be better >.>)
This demonstrates why the "constant flag" is so often an antipattern. It doesn't scale past two behaviours. Or you end up with a series of flags:
zip(*iterators, strict=False, longest=False, fillvalue=None)
I don't think anyone proposed an API like that -- yes, that would be horrid.
There are all sorts of reasons why a ternary flag would not be good, but I do think it should be mentioned in the PEP, even if only as a rejected idea.
But I still like it, 'cause the "flag for two behaviors and another function for the third" seems like the worst of all options.
-CHB
On Wed, May 6, 2020 at 6:49 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Tue, May 5, 2020 at 5:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Christopher's quoting is kinda messed up and I can't be bothered fixing it, sorry, so you'll just have to guess who said what :-)
Ideally, we are evaluating ideas independently of who expressed them, so I'll pretend I did that on purpose :-)
First: really people, it's all been said. I think we all (and I DO include myself in that) have fallen into the trap that "if folks don't agree with me, I must not have explained myself well enough" -- but in this case, we actually do disagree. And not really on the facts, just on the relative importance.
But since, I apparently did not explain myself well enough in this case:
no -- but we could (and I think should) have a ternary flag, so that
zip_longest becomes unnecessary. And we'd never get to eight combinations: you can't have longest and shortest behavior at the same time!
A ternary flag of strict = True, False or what?
...
a flag that takes three possible values: "shortest" | "longest" | "equal" (defaulting to shortest of course). Name to be bikeshed later :-) (and enum vs string also to be bikeshed later)
This demonstrates why the "constant flag" is so often an antipattern. It
doesn't scale past two behaviours. Or you end up with a series of flags:
zip(*iterators, strict=False, longest=False, fillvalue=None)
I don't think anyone proposed an API like that -- yes, that would be horrid.
There are all sorts of reasons why a ternary flag would not be good, but I do think it should be mentioned in the PEP, even if only as a rejected idea.
But I still like it, 'cause the "flag for two behaviors and another function for the third" seems like the worst of all options.
I totally agree with everything you said here. From my perspective, comparing three main cases:
1. zip(*iters, strict= (False | True))
2. zip(*iters, mode = ('shortest' | 'equal' | 'longest'))
3. zip_equal(*iters)
The first case looks like a pretty bad idea (though maybe it is practical). But every one of these cases tries to solve the same problem, so from a practical point of view they are all the same. So, just as you said, the first case is merely "flag for two behaviors and another function for the third", which is a solid -1 from my side.
I like how the second case looks and feels, but obviously the proposed signature (solution) will not be enough. To plug in the existing functionality from zip_longest you will also need to provide a fill= kwarg, like zip(*iters, mode = ('shortest' | 'equal' | 'longest'), fill=None). This fill kwarg will be ignored for the other two cases, 'shortest' and 'equal'. Heh, it is not perfect. I will give +0.1 for the second case.
The third case is a rather usual solution for the real problem (just one more function in the stdlib), so +0.5 for this case from my side.
-gdg
On Wed, May 6, 2020 at 1:42 PM Kirill Balunov <kirillbalunov@gmail.com> wrote:
I totally agree with everything you said here. From my perspective, comparing three main cases:
1. zip(*iters, strict= (False | True))
2. zip(*iters, mode = ('shortest' | 'equal' | 'longest'))
3. zip_equal(*iters)
Thanks for enumerating these. I think that's helpful so I'll flesh it out a bit more. I *think* these are the options on the table:
(note, I'm keeping different names for things as the same option, and in no particular order)
1) No change
    zip(*iters)
    itertools.zip_longest(*iters, fillvalue=None)
2) Add boolean strict flag to zip
    zip(*iters, strict= (False | True))
    itertools.zip_longest(*iters, fillvalue=None)
3) Add a ternary mode flag to zip
    zip(*iters, mode = ('shortest' | 'equal' | 'longest'), fillvalue=None)
4) Add a new function to itertools
    zip(*iters)
    itertools.zip_longest(*iters, fillvalue=None)
    itertools.zip_equal(*iters)
Brandt: this might be helpful for the PEP.
For my part, seeing it this way makes me think that (2) adding a strict flag to zip, while keeping zip_longest on its own in itertools, is the worst option.
For me:
+1 on the ternary flag
+0.5 on a new function in itertools
-0 on the boolean flag to zip()
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
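To make option 3 a little more concrete, here is a minimal pure-Python sketch of what a ternary mode parameter might look like. The names zip_mode and _zip_equal are hypothetical and exist only for this illustration; the mode strings are the ones floated in the thread, and the ValueError choice is an assumption rather than anything the thread settled on.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel marking "this iterable ran out early"


def _zip_equal(*iterables):
    """Yield rows like zip, but raise if the inputs differ in length."""
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in row):
            raise ValueError("iterables have different lengths")
        yield row


def zip_mode(*iterables, mode="shortest", fillvalue=None):
    """Hypothetical wrapper; not a real builtin or itertools function."""
    if mode == "shortest":
        return zip(*iterables)
    if mode == "longest":
        return zip_longest(*iterables, fillvalue=fillvalue)
    if mode == "equal":
        return _zip_equal(*iterables)
    raise ValueError(f"unknown mode: {mode!r}")


print(list(zip_mode([1, 2, 3], "ab")))
print(list(zip_mode([1, 2, 3], "ab", mode="longest", fillvalue="-")))
```

Note that fillvalue is silently ignored unless mode is "longest", which is the imperfection Kirill pointed out for this signature.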
On Thu, 7 May 2020 at 01:07, Christopher Barker <pythonchb@gmail.com> wrote:
3) Add a ternary mode flag to zip zip(*iters, mode = ('shortest' | 'equal' | 'longest'), fillvalue=None)
You missed itertools.zip_longest(*iters, fillvalue=None) from this one. Unless you're proposing to drop itertools.zip_longest, the fact that there's now two ways to do zip_longest seems like an important wart to point out for this proposal.
For me:
+0.1 on a new function in itertools
+0 on no change
-1 on the remainder
Paul
On Thu, May 7, 2020 at 3:07 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, May 6, 2020 at 1:42 PM Kirill Balunov <kirillbalunov@gmail.com> wrote:
I totally agree with everything you said here. From my perspective, comparing three main cases:
1. zip(*iters, strict= (False | True))
2. zip(*iters, mode = ('shortest' | 'equal' | 'longest'))
3. zip_equal(*iters)
Thanks for enumerating these. I think that's helpful so I'll flesh it out a bit more. I *think* these are the options on the table:
(note, I'm keeping different names for things as the same option, and in no particular order)
1) No change zip(*iters) itertools.zip_longest(*iters, fillvalue=None)
2) Add boolean strict flag to zip zip(*iters, strict= (False | True)) itertools.zip_longest(*iters, fillvalue=None)
3) Add a ternary mode flag to zip zip(*iters, mode = ('shortest' | 'equal' | 'longest'), fillvalue=None)
4) Add a new function to itertools zip(*iters) itertools.zip_longest(*iters, fillvalue=None) itertools.zip_equal(*iters)
I think there are two more cases which can be added to the list:
0) No change
    zip(*iters)
    itertools.zip_longest(*iters, fillvalue=None)
    + add a recipe about zip_equal in itertools docs
6) Add functionality through zip methods (as proposed in a separate thread, but maybe it is off topic for the current thread).
-gdg
On Wed, May 06, 2020 at 08:48:51AM -0700, Christopher Barker wrote: [I asked]
A ternary flag of strict = True, False or what?
Come on:
ternary: having three elements, parts, or divisions ( https://www.merriam-webster.com/dictionary/ternary)
did you really not know that?
Of course I know what ternary is.
(and "flag" does not always mean "boolean flag", even though it often does (https://techterms.com/definition/flag) )
I am arguing against the proposal being discussed in this part of the thread, namely to add a **boolean** flag "strict=True|False". Then if we want to extend the API in the future, you say "Oh well that's easy, let's just turn it into a ternary flag" (paraphrasing, not a direct quote).
Okay, so what will the third flag be? Standard ternary flags are:
    True, False, Maybe
    True, False, Unknown
Neither Maybe nor Unknown are Python builtins. Should they be? I doubt it. So what do we use? Whenever I've needed ternary logic, I've used None. But in this case, that doesn't work:
    zip(*iters, strict=None)
What a very non-self-explanatory API. So if we start off with a `strict=bool` API, we're stuck with it.
This has been proposed multiple times on this list:
a flag that takes three possible values: "shortest" | "longest" | "equal" (defaulting to shortest of course). Name to be bikeshed later :-) (and enum vs string also to be bikeshed later)
*Named modes* are not typically called flags. For example, the second parameter to `open` is called *mode*, not *flag*. Whether or not a named mode parameter has been proposed before, that's not what is being discussed here and now, where we have been explicitly debating the two APIs:
- a named function;
- piggy-backing on zip() with a **boolean parameter** taking True and False as the switch to control behaviour.
so in the context of the discussion about bool parameters, other APIs aren't really relevant. I'm arguing against the specific `strict=bool` API, not other APIs in general.
Revamping zip to give it a named mode parameter is not my first preference, but it's better than a `strict=bool` flag. However the parameter would have to change:
    zip(*iters, strict="shortest")
simply doesn't work.
This demonstrates why the "constant flag" is so often an antipattern. It doesn't scale past two behaviours. Or you end up with a series of flags:
zip(*iterators, strict=False, longest=False, fillvalue=None)
I don't think anyone proposed an API like that -- yes, that would be horrid.
I do recall someone proposing something similar to that, but I don't care enough to trawl through the thread to find it :-)
There are all sorts of reasons why a ternary flag would not be good, but I do think it should be mentioned in the PEP, even if only as a rejected idea.
But I still like it, 'cause the "flag for two behaviors and another function for the third" seems like the worst of all options.
*blink*
But that's precisely the option on the table right now!
1. zip_longest remains in itertools;
2. zip remains the default behaviour;
3. zip_strict be implemented as a boolean True/False parameter on zip.
I trust you don't actually mean what you seem to be saying: "this is the worst of all options, I'm in favour of it!"
But in any case, I see from a later part of the discussion we're now considering a different option:
- treat zip as a namespace, with named callable attributes
which I don't dislike.
-- Steven
I have no idea whether a flag on zip() or a function in itertools would get MORE USE. I *ABSOLUTELY* think it is an anti-goal to get more use for its own sake though. I'm +1 on a new function in itertools, maybe +0 or maybe -0 on a flag. But I only want APPROPRIATE USE in any case. The API conventions of Python in general very strongly favor a new function. Yes, not everything in Python naming is consistent, and counter examples for any idea can surely be found. But it just is a lot less surprising to users to follow the predominant pattern that zip_longest(), for example, follows. The real point, to me, is that users who use itertools.zip_strict() will use it for exactly the reason that they want that semantics. In contrast, a flag for `strict` or `truncate` or `equal` or whatever is a LOT more likely to be used in the "just in case" code where the programmer has not thought carefully about the semantics they want. The sky isn't falling, I certainly don't think everyone, nor even most developers, would use the flag wrong. But a separate function just provides a better, more consistent, API. I don't think anyone in the huge discussion of the walrus operator, for example, tried to make the case that the goal should be encouraging it to be used AS MUCH AS POSSIBLE. Nor likewise for any other new feature. A feature should be used *where appropriate*, and the design should not vacantly simply try to make it more common. -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
On Tue, May 5, 2020 at 9:20 AM David Mertz <mertz@gnosis.cx> wrote:
I have no idea whether a flag on zip() or a function in itertools would get MORE USE. I *ABSOLUTELY* think it is an anti-goal to get more use for its own sake though.
I'm not sure anyone was suggesting that -- *maybe* Alex, but I think his statement was over-interpreted.
I only want APPROPRIATE USE in any case.
I can't imagine we all don't agree with that.
The real point, to me, is that users who use itertools.zip_strict() will use it for exactly the reason that they want that semantics. In contrast, a flag for `strict` or `truncate` or `equal` or whatever is a LOT more likely to be used in the "just in case" code where the programmer has not thought carefully about the semantics they want.
There's no way to really know, but I think this is being overblown -- folks generally don't go use extra flags just for the heck of it -- particularly one that won't be documented in anything but the latest documents for years :-)
The sky isn't falling, I certainly don't think everyone, nor even most developers, would use the flag wrong. But a separate function just provides a better, more consistent, API.
Well, THAT IS the point of discussion, yes. I disagree, but can see both points. But I do want folks to consider that zip() would be a builtin, while zip_strict() and zip_longest() would be in itertools, which is different than if they were all in the same namespace (like the various string methods, for example).
Another key point is that if you want zip_longest() functionality, you simply can not get it with the builtin zip() -- you are forced to look elsewhere. Whereas most code that might want "strict" behavior will still work, albeit less safely, with the builtin.
These points should be weighed when evaluating the API options. And this is why I personally think if we add a flag to zip, we should add one for longest functionality as well, unifying the API.
I don't think anyone in the huge discussion of the walrus operator, for
example, tried to make the case that the goal should be encouraging it to be used AS MUCH AS POSSIBLE.
Indeed, the opposite was true: there was a lot of concern that it would be overused, though I think that's a much bigger concern there than in this case.
A feature should be used *where appropriate*, and the design should not
vacantly simply try to make it more common.
Agreed, but discoverability is still something to be considered in the API. And it seems that there are folks arguing that we specifically want this to be less discoverable due to concerns of overuse, which does not seem like a good API design approach to me.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Tue, May 05, 2020 at 12:49:05PM -0700, Christopher Barker wrote:
Agreed, but discoverability is still something to be considered in the API. And it seems that there are folks arguing that we specifically want this to be less discoverable due to concerns of overuse, which does not seem like a good API design approach to me.
itertools is one of the most popular and well-known libraries in the stdlib. Saying that something is "less discoverable" in itertools compared to the builtins is a bit like saying that the Marvel superhero franchise is less well-known than the Star Wars franchise. Even if it's objectively true, it's a difference that makes no difference.
In fact I'd even say that there are builtin functions that are less well known than itertools, like iter(seq, value) :-)
-- Steven
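For readers who have not met the two-argument form of iter mentioned above, here is a small sketch. In the actual builtin the first argument is a callable and the second is a sentinel value, so "iter(seq, value)" in the message reads as shorthand. The in-memory stream and its contents are made up for the example.

```python
import io

# A made-up text stream; the blank line in the middle acts as a separator.
stream = io.StringIO("alpha\nbeta\n\ngamma\n")

# iter(callable, sentinel) calls stream.readline repeatedly and stops
# as soon as a call returns the sentinel, here a bare newline.
lines = list(iter(stream.readline, "\n"))

print(lines)  # ['alpha\n', 'beta\n']
```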
On May 5, 2020, at 12:50, Christopher Barker <pythonchb@gmail.com> wrote:
Another key point is that if you want zip_longest() functionality, you simply can not get it with the builtin zip() -- you are forced to look elsewhere. Whereas most code that might want "strict" behavior will still work, albeit less safely, with the builtin.
I think this is a key point, but I think you’ve got it backward. You _can_ build zip_longest with zip, and before 2.6, people _did_. (Well, they built izip_longest with izip.) I’ve still got my version in an old toolbox. You chain a repeat(None) onto each iterable, izip, and you get an infinite iterator that you have to read until all(is None). You can just takewhile that into exactly the same thing as izip_longest, but unfortunately that’s a lot slower than filtering when you iterate, so I had both _longest and _infinite variants, and I think I used the latter more even though it was usually less convenient. That sounds like a silly way to do it, and it’s certainly easier to get subtly wrong than just writing a generator function like the “as if” code in the (i)zip_longest docs, but a comment in my code assures me that this is almost 4x as fast, and half the speed of a custom C implementation, so I’m pretty sure that’s why I did it. And I doubt I’m the only person who solved it that way. In fact, I’ll bet I copied it from an ActiveState recipe or a colleague or an open source project. So, most likely, izip_longest wasn’t added because you can’t build it on top of izip, but because building it on top of izip is easy to get subtly wrong (especially if you need it to be fast—or don’t need it to be fast but micro optimize it anyway, for that matter), and often people punt and do something clunkier (use _infinite instead of _longest and make the final for loop more complicated). Which is actually a pretty good parallel for the current proposal. You can write your own zip_strict on top of zip, and at least a few people do—but, as people have shown in this thread, the obvious solution is too slow, the obvious fast solution is very easy to get subtly wrong, and often people punt and do something clunkier (listify and compare len). That’s why I’m +1 on this proposal in some form. 
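The padding approach described above can be sketched roughly as follows, using Python 3's zip in place of izip. The function names zip_infinite and zip_longest_via_zip are hypothetical, and this is a reconstruction from the prose, not the author's actual toolbox code.

```python
from itertools import chain, repeat, takewhile


def zip_infinite(*iterables):
    """Pad every input with an endless run of None, then zip the results."""
    return zip(*(chain(it, repeat(None)) for it in iterables))


def zip_longest_via_zip(*iterables):
    """Read the padded stream until a row consists only of None."""
    def has_data(row):
        return any(item is not None for item in row)

    return takewhile(has_data, zip_infinite(*iterables))


print(list(zip_longest_via_zip("ab", [1, 2, 3])))
# [('a', 1), ('b', 2), (None, 3)]
```

This also shows one way such code goes subtly wrong: if the inputs legitimately contain None in the same position, the all-None test stops early, which is why a private sentinel object is usually the safer padding value.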
Assuming zip_strict would be useful at least as often as zip_longest (and I’ve been sold on that part, and I think most people on all sides of this discussion agree?), it calls out for a good official solution. The fact that the ecosystem is different nowadays (pip install more-itertools or copying off StackOverflow is a lot simpler, and more common, than finding a recipe on ActiveState) does make it a little less compelling, but at most that means the official solution should be a docs link to more-itertools, still not that we should do nothing. But that’s also part of the reason I’m -1 on it being a flag. Just like zip_longest, it’s a different function, one you shouldn’t think of as being built on zip even if it could be. Maybe strict really is needed so much more often than longest that “import itertools” is too onerous, but if that’s really true, that different function should be another builtin. I think nobody is arguing for that, because it’s just obvious that it isn’t needed enough to reach the high bar of adding another function to builtins. But that means it belongs in itertools. Trying to make it a flag (which will always be passed a constant value) is a clever way to try to get the best of both worlds—and so is the chain.from_iterable style. But if either of those really did get the best of both worlds and the problems of neither, it would be used all over the place, rather than as sparingly as possible. And of course it doesn’t get the best of both worlds. A flag is hiding code as data, and it looks misleadingly like the much more common uses of flags where you actually do often set the flag with a runtime value. It’s harder to type (and autocomplete makes the difference worse, not better). It’s a tiny bit harder to read, because you’re adding as much meaningless boilerplate (True) as important information (strict). It’s increasing the amount of stuff to learn in builtins just as much as another function would. And so on. 
So it’s only worth doing for really special cases, like open.
On Fri, May 8, 2020 at 11:22 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
Trying to make it a flag (which will always be passed a constant value) is a clever way to try to get the best of both worlds—and so is the chain.from_iterable style.
At this point it sounds like you're saying that zip(..., strict=True) and zip.strict(...) are equally bad.
But if either of those really did get the best of both worlds and the problems of neither, it would be used all over the place, rather than as sparingly as possible. And of course it doesn’t get the best of both worlds. A flag is hiding code as data, and it looks misleadingly like the much more common uses of flags where you actually do often set the flag with a runtime value. It’s harder to type (and autocomplete makes the difference worse, not better). It’s a tiny bit harder to read, because you’re adding as much meaningless boilerplate (True) as important information (strict).
But all of this just applies to a flag, not to zip.strict(...).
It’s increasing the amount of stuff to learn in builtins just as much as another function would. And so on.
This applies to both, but it's not true. Both zip.strict() and zip(strict=True) are at least somewhat more hidden and encapsulated than a top level builtin zip_strict().
I also think it's worth questioning something that is being taken for granted. What exactly is the cost of adding a builtin? It's not entirely obvious, at least not to me. Clarifying the precise disadvantages would let us see how well they apply here.
You mention increasing the amount of stuff to learn, but I'm guessing 80% of Python coders don't know all the functions in builtins, and that doesn't really hurt them. I wouldn't recommend that anyone read through all of https://docs.python.org/3/library/functions.html just for the sake of learning it all; do we want to support people doing that? People should just google the builtins they're not familiar with when they come across them. I can see several builtins that I've never or almost never used.
When people read the docs for zip, it points them to zip_longest. If zip_strict was in itertools, it would point to that too. Is that significantly better than pointing to zip.strict or even a builtin zip_strict a little further down the same page? I think if someone is reading the docs for zip, it's worth their time to learn all its flavours, and where exactly they read about those doesn't matter.
The only time I'm ever annoyed by something being a builtin is when I have to avoid using it as a variable name. This often happens with id, type, all, list, etc. That wouldn't even be a significant argument against a builtin zip_strict, and it doesn't apply at all to zip.strict.
On May 9, 2020, at 04:30, Alex Hall <alex.mojaki@gmail.com> wrote:
On Fri, May 8, 2020 at 11:22 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Trying to make it a flag (which will always be passed a constant value) is a clever way to try to get the best of both worlds—and so is the chain.from_iterable style.
At this point it sounds like you're saying that zip(..., strict=True) and zip.strict(...) are equally bad.
You’re right, it did sound like that, and I don’t mean that. Sorry.
zip.strict has _some_ of the same problems as zip(strict=True), but definitely not _all_ of them. And I definitely prefer zip.strict to the flag.
At the time I wrote this (I don’t know why it took a few days to get delivered…), zip.strict had come up the first time and been roundly shouted down, and it seemed like nobody but me (and the proposer, of course) had found it at all acceptable, and I was trying to make the point that if people don’t like zip.strict, the same things and more apply to passing an always-constant flag, so it should be even more acceptable.
Then, over the last few days, a bunch of people came around on zip.strict. And that seems to be at least in part because people came up with better arguments than the first time around. (For example, I forget who it was that pointed out that you don’t really have to start thinking of zip as a class and zip.strict as an alternate constructor, because plenty of people don’t realize that’s true for chain.from_iterable and they still have no more problem using it than they do for datetime.now.) So now, rather than it being a +0 for me and a distant second choice behind an itertools function, I think I’m pretty close to evenly torn between the two.
I do think that if we add zip.strict, we should also probably add zip.longest, not just think about maybe adding it some day. And it might even be worth adding zip.shortest, even if we have no intention of ever eliminating zip() itself or changing it to mean zip.strict. But I don’t have good arguments for these; I’ll have to think about it a bit more to explain why I think consistency easily trumps the costs for this variant of the proposal but probably fails for other variants.
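As a rough illustration of the "zip as a namespace" idea with strict, longest and shortest side by side, here is a sketch that uses a stand-in class. The class name zips is hypothetical (the builtin zip itself cannot be given attributes from Python code), and raising ValueError on a mismatch is an assumption made for the example.

```python
from itertools import zip_longest


class zips:
    """Hypothetical namespace standing in for zip with named variants."""

    shortest = staticmethod(zip)         # today's builtin behaviour
    longest = staticmethod(zip_longest)  # today's itertools behaviour

    @staticmethod
    def strict(*iterables):
        """Yield rows like zip, raising if any input ends before the others."""
        sentinel = object()
        for row in zip_longest(*iterables, fillvalue=sentinel):
            if any(item is sentinel for item in row):
                raise ValueError("iterables have different lengths")
            yield row


print(list(zips.shortest("ab", [1, 2, 3])))  # [('a', 1), ('b', 2)]
print(list(zips.longest("ab", [1, 2, 3])))   # [('a', 1), ('b', 2), (None, 3)]
```

Calls then read much like chain.from_iterable or datetime.now: the variant is chosen by name at the call site, with no constant flag to pass.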
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Adding a function to itertools will mostly solve that problem, but not entirely. Adding `strict=True` is so easy that people will be encouraged to use it often and keep their code safe. That to me is the biggest argument for this feature and for this specific API.
The last thing I want is to encourage people to unnecessarily enforce a rule "data streams must be equal just to be safe" when they don't actually need to be equal. What you are calling the biggest argument for this feature is, for me, a strong argument against it. If I know that consumers of my data truncate on the shortest input, then as the producer of data I don't have to care about making them equal. I can say: process_user_ids(usernames, generate_user_ids()) and pass an infinite stream of user IDs and know that the function will just truncate on the shortest stream. Yay! Life is good. But now this zip flag comes along, and the author of process_user_ids decides to protect me from myself and "fail loudly", and I will curse them onto the hundredth generation for making my life more difficult. If I'm the producer and consumer of the data, then I can pick and choose between versions, and that's all well and good. If I'm the producer of the data, and I want it to be equal in length, then I control the data and can make it equal. I don't need the consumer's help, and I don't need zip to have a flag. But if I'm only the consumer of the data, I have no business failing "just to be safe". That's an anti-feature that makes life more difficult, not less, for the producer of the data, akin to excessive runtime type checking (I'm sometimes guilty of that myself) or in other languages flagging every class in sight as final "just to be safe". It is possible to be *too* defensive, and if making the strict version of zip a builtin encourages consumers of the data to "be safe", then that is a mark against it in my strong opinion. -- Steven
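The producer and consumer scenario above can be sketched as follows. Only the two function names come from the message; their bodies, the starting ID, and the sample usernames are invented for illustration.

```python
from itertools import count


def generate_user_ids(start=1000):
    """Hypothetical producer: an endless stream of numeric IDs."""
    return count(start)


def process_user_ids(usernames, user_ids):
    """Hypothetical consumer relying on zip stopping at the shortest input."""
    return dict(zip(usernames, user_ids))


result = process_user_ids(["ann", "bob"], generate_user_ids())
print(result)  # {'ann': 1000, 'bob': 1001}
```

If the consumer instead insisted on equal lengths, this call would raise rather than return, because the ID stream never runs out.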
On Sat, May 2, 2020 at 5:10 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:
Adding a function to itertools will mostly solve that problem, but not entirely. Adding `strict=True` is so easy that people will be encouraged to use it often and keep their code safe. That to me is the biggest argument for this feature and for this specific API.
The last thing I want is to encourage people to unnecessarily enforce a rule "data streams must be equal just to be safe" when they don't actually need to be equal. What you are calling the biggest argument for this feature is, for me, a strong argument against it.
If I know that consumers of my data truncate on the shortest input, then as the producer of data I don't have to care about making them equal. I can say:
process_user_ids(usernames, generate_user_ids())
and pass an infinite stream of user IDs and know that the function will just truncate on the shortest stream. Yay! Life is good.
But now this zip flag comes along, and the author of process_user_ids decides to protect me from myself and "fail loudly", and I will curse them onto the hundredth generation for making my life more difficult.
My guess is that this kind of situation is rare and unusual. The example looks made up, is it based on something real? Do you have any examples based on reality? I've given examples of functions that check the lengths of their arguments, so it's conceivable you or someone else could have had this exact problem. The fact that those checks are there shows people thought it was a good idea and no one has complained enough to change their minds. And we have examples of people cursing the lack of a check.
If I'm the producer and consumer of the data, then I can pick and choose between versions, and that's all well and good.
FWIW I do think this use case is much more common. An external API which requires you to pass parallel iterables instead of pairs is unusual and confusing. For example, the real source of the ast.unparse problem is that ast.Dict.{keys,values} is weird. Every consumer of it such as compile and ast.unparse has to check the lengths, a strategy that has failed and will probably continue to fail. I'm not saying that bad APIs are rare, but that this kind of API is both bad and rare.
We've argued a lot about what kinds of uses of zip are most common, so I did a little survey of code that I had written or worked with. The uses of zip that I found could be roughly categorised as follows:
strict (if it existed) should be False: 3
Lengths need to be equal...
    ...but that's not checked, although that's probably OK: 11
    ...but that's not checked, and that's a problem: 3
    ...so there's an assert len(x) == len(y): 2
Based on that data, adding strict=True:
- in the vast majority of cases would not hurt.
- is significantly helpful more often than strict should be False
- would ensure correctness in currently unsafe code as often as strict should be False
In most cases it wouldn't ensure the correctness of code, but it could give some peace of mind and might help readers. But also in those cases if the user decides it's redundant and not worth using:
- The user had better be confident of their judgement, which will inevitably sometimes be wrong.
- Even if the context of the code makes it obvious that it's redundant, that context could change in the future and introduce a silent regression, and people are likely to not think to add strict=True to the zip call somewhere down the line from their changes. Adding strict=True is more future-proof to such changes.
If I'm the producer of the data, and I want it to be equal in length, then I control the data and can make it equal.
But in your example you complain about having to do that. Is it a problem or not?
But if I'm only the consumer of the data, I have no business failing
"just to be safe". That's an anti-feature that makes life more
difficult, not less, for the producer of the data, akin to excessive runtime type checking (I'm sometimes guilty of that myself) or in other languages flagging every class in sight as final "just to be safe".
It is possible to be *too* defensive, and if making the strict version of zip a builtin encourages consumers of the data to "be safe", then that is a mark against it in my strong opinion.
I think you make this situation sound worse than it is. "I will curse them onto the hundredth generation for making my life more difficult" is pretty melodramatic.
If you get an exception because you tried:
process_user_ids(usernames, generate_user_ids())
then you can pretty easily change it to:
process_user_ids(usernames, (user_id for user_id, username in zip(generate_user_ids(), usernames)))
or if you can generate user IDs one at a time:
process_user_ids(usernames, (generate_user_id() for _ in usernames))
It's a bit inconvenient, but:
- It's pretty easy to solve the problem, certainly compared to implementing zip_equal, and nothing like your example of final classes which are pretty much insurmountable.
- An exception has clearly pointed out a problem to solve, which is much better than trying to find a silent logic error. You might complain in this case but you'd be grateful if you accidentally passed a user_ids list that was too short.
- It's not a problem you can ignore, so it doesn't require you to be disciplined in case of future mistakes.
On Sat, May 02, 2020 at 10:26:18PM +0200, Alex Hall wrote:
If I know that consumers of my data truncate on the shortest input, then as the producer of data I don't have to care about making them equal. I can say:
process_user_ids(usernames, generate_user_ids())
and pass an infinite stream of user IDs and know that the function will just truncate on the shortest stream. Yay! Life is good.
But now this zip flag comes along, and the author of process_user_ids decides to protect me from myself and "fail loudly", and I will curse them onto the hundredth generation for making my life more difficult.
My guess is that this kind of situation is rare and unusual. The example looks made up, is it based on something real?
Yes, it's made up, but yes, it is based on real code. I haven't had to generate user IDs yet, but I have often passed in (for example) an infinite stream of prime numbers, or some other infinite sequence.
Not necessarily infinite either: it might be finite but huge, such as combinations or permutations of something.
At the consumer end, the main one that comes to mind off the top of my head (apart from code I wrote myself) is numpy, for example their correlation coefficient function:
py> np.corrcoef([1, 2, 3, 4, 5, 6, 7, 8], [7, 6, 5, 4, 3, 2])
Traceback (most recent call last):
  ...
ValueError: array dimensions must agree except for d_0
Now I'm still keeping an open mind whether this check is justified for stats functions. (There have been some requests for XY stats in the statistics module, so I may have to make a decision some day.)
But my point is, regardless of whether that check is necessary or not, *I still have to check it when I produce the data* or else I get a potentially spurious exception that will eat my data. In the general case where I have iterators not lists, no recovery is possible. I can't catch the error, trim the data, and resubmit. I have to preemptively trim the data.
Of course I recognise the right of each developer to choose for themselves whether to enforce the rule that input streams are equal. And sometimes that will be the right thing to do.
But you are arguing that putting zip_strict as a builtin will encourage people to do this, and I am saying that *encouraging people to do this* is a point against it, because that will lead to the "things which aren't errors should never pass silently" Just In Case it might be an error.
If you need to enforce equal length input, then one extra line:
from itertools import zip_strict
is no burden.
But if someone is on the fence about checking for equal lengths, and importing would be too much trouble so they don't bother, but making it builtin is enough for them to tip them over into using it "just to be safe", then they probably shouldn't be using it.
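To make the two points above concrete, here is a rough sketch of what a zip_strict helper might look like. Note that zip_strict is a hypothetical name used in this thread, not an existing itertools function; this version is an assumption built on itertools.zip_longest with a private sentinel. The second half illustrates the "no recovery is possible" remark: once the check fires on iterators, an item has already been consumed.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel that cannot appear in caller data


def zip_strict(*iterables):
    """Hypothetical helper: like zip(), but raise ValueError on unequal lengths."""
    for items in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in items):
            raise ValueError("zip_strict() arguments have different lengths")
        yield items


# Equal lengths behave exactly like zip().
pairs = list(zip_strict("abc", [1, 2, 3]))

# Unequal iterators: the error arrives only after data has been consumed.
xs = iter([1, 2, 3, 4])
ys = iter([10, 20])
seen = []
error = None
try:
    for pair in zip_strict(xs, ys):
        seen.append(pair)
except ValueError as exc:
    error = str(exc)

leftover = list(xs)  # the item 3 was pulled before the check failed and is gone
print(pairs, seen, error, leftover)
```

With lists the caller could catch the exception, trim, and resubmit; with one-shot iterators the value pulled just before the failure cannot be put back, so trimming has to happen up front.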
An external API which requires you to pass parallel iterables instead of pairs is unusual and confusing.
I don't know about that. numpy's correlation coefficient supports both a single array of data pairs or a pair of separate X, Y values. So does R. Spreadsheets likewise put the X, Y data in separate cells, not in cells with two data points. If your data is coming from a CSV file, as it often does, the most natural way to get it is as two separate columns.
But for APIs that do require a single stream of pairs, then this entire discussion is irrelevant, since I, the producer of the data, can choose whatever behaviour makes sense for me:
- `zip` to truncate the data on shortest input;
- `zip_longest` to pad it;
- `zip_strict` (whether I write my own or get it from the stdlib)
or anything else I like, since I'm the producer.
So we shouldn't be talking about APIs that require a stream of pairs, but only APIs that require separate streams. [...]
In most cases it wouldn't ensure the correctness of code, but it could give some peace of mind and might help readers. But also in those cases if the user decides it's redundant and not worth using:
- The user had better be confident of their judgement, which will inevitably sometimes be wrong.
That's not our problem to solve. It isn't up to us to force people to use zip_strict "just in case your judgement that it isn't needed is wrong". I think you and I have a severe disagreement on the relationship between the stdlib and the end developers. Between your comment above, and the one below, you seem to believe that it is the job of the stdlib to protect the developer from themselves. And yet here you are using Python, where we have the ability to monkey-patch the builtins namespace, shadow functions, reach into functions and manipulate their data, including their code, rebind any name, remove attributes of anything, even change the class of (some) instances on the fly. Are you sure you are using the right language?
- Even if the context of the code makes it obvious that it's redundant, that context could change in the future and introduce a silent regression, and people are likely to not think to add strict=True to the zip call somewhere down the line from their changes. Adding strict=True is more future-proof to such changes.
That's an awfully condescending thing to say. You are effectively saying that Python developers ought to use strict zip by default just in case their requirements change in the future and they are too dim-witted to remember to change the implementation to match. I will try to remember that there are legitimate users for a strict zip, but this advocacy of *preemptive and unnecessary length checks* just makes me more hostile to adding this to the stdlib at all, let alone making it a builtin.
If I'm the producer of the data, and I want it to be equal in length, then I control the data and can make it equal.
But in your example you complain about having to do that. Is it a problem or not?
It is a problem if I have to do it for no good reason, to satisfy some arbitrary "just in case" length check. I don't mind doing so if there is a genuine good reason for it, like in the ast case. Ultimately, the data is going to be truncated somewhere. At the moment, zip does the truncation. If people switch to strict zip, then I have to do the truncation. That's a usability regression. Unless it is justified by some functional reason, or fixing a bug, it's more harmful than helpful: "Function do_useful_stuff() no longer ignores excess stuff, but raises instead. Callers must now trim their To Do list before passing it to the function. No bugs were fixed by this change." sort of thing.
I think you make this situation sound worse than it is. "I will curse them onto the hundredth generation for making my life more difficult" is pretty melodramatic.
Hyperbole is supposed to be read as humour, not literally :-)
If you get an exception because you tried:
process_user_ids(usernames, generate_user_ids())
then you can pretty easily change it to:
process_user_ids(usernames, (user_id for user_id, username in zip(generate_user_ids(), usernames)))
or if you can generate user IDs one at a time:
process_user_ids(usernames, (generate_user_id() for _ in usernames))
So I have to write ugly, inefficient code to make up for you encouraging people to add unnecessary length checks in their code? Yay, I feel so happy about this change!!! Also, it doesn't work if usernames is an iterator.
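A small sketch of why the first workaround misbehaves when usernames is an iterator. The function bodies are hypothetical (the thread never shows them); the assumption is that the consumer simply zips its two arguments.

```python
from itertools import count


def generate_user_ids():
    """Hypothetical producer: an infinite stream of IDs 1, 2, 3, ..."""
    return count(1)


def process_user_ids(usernames, user_ids):
    """Hypothetical consumer that pairs its two inputs."""
    return list(zip(usernames, user_ids))


# With a list, the trimming generator iterates over usernames independently.
names_list = ["alice", "bob", "carol"]
ok = process_user_ids(
    names_list,
    (user_id for user_id, username in zip(generate_user_ids(), names_list)),
)

# With an iterator, the consumer and the trimming generator share one
# stream of names and steal items from each other.
names_iter = iter(["alice", "bob", "carol"])
broken = process_user_ids(
    names_iter,
    (user_id for user_id, username in zip(generate_user_ids(), names_iter)),
)

print(ok)
print(broken)
```

In the iterator case the trimming generator swallows "bob" while producing the first ID, and "carol" is lost when it runs dry, so only one pair survives and no error is raised.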
It's a bit inconvenient, but:
- It's pretty easy to solve the problem, certainly compared to implementing zip_equal,
You haven't solved the problem, and judging by your attempted workarounds, this could easily lead to a reduction in code quality, not an improvement.
and nothing like your example of final classes which are pretty much insurmountable.
I don't remember raising final classes in this thread. Am I losing my short-term memory or are you confusing me with someone else? (I have raised final classes in other discussions, but they're not relevant here as far as I can see.)
- An exception has clearly pointed out a problem to solve
An exception *might* have pointed out a problem to solve, but since the solution is usually to just truncate the data, and zip already does that, raising an exception may be a regression, not an enhancement.
Can we agree to meet half-way?
- there are legitimate, genuine uses for zip_strict;
- but encouraging people to change zip to zip_strict "just in case" in the absence of a genuine reason is a bad thing.
-- Steven
One small comment on this part of the thread: Yes, using an API that produces an infinite iterator with a "strict" version of zip() would be, well, bad. But it would fail the very first time it was used. I can't see how encouraging people to use a strict version of zip() would require that folks not create APIs that return infinite iterators -- some users might get a failure the first time they try to use it, and then they'd fix their code. No problem there. And "shortest" would remain the default, so I doubt it would be common for folks to use the "strict" version without thinking -- this is really a non-issue. -CHB
On Mon, May 4, 2020 at 1:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 10:26:18PM +0200, Alex Hall wrote:
If I know that consumers of my data truncate on the shortest input, then as the producer of data I don't have to care about making them equal. I can say:
process_user_ids(usernames, generate_user_ids())
and pass an infinite stream of user IDs and know that the function will just truncate on the shortest stream. Yay! Life is good.
But now this zip flag comes along, and the author of process_user_ids decides to protect me from myself and "fail loudly", and I will curse them onto the hundredth generation for making my life more difficult.
My guess is that this kind of situation is rare and unusual. The example looks made up, is it based on something real?
Yes, it's made up, but yes, it is based on real code. I haven't had to generate user IDs yet, but I have often passed in (for example) an infinite stream of prime numbers, or some other infinite sequence.
Not necessarily infinite either: it might be finite but huge, such as combinations or permutations of something.
Right, but have you had an actual situation where you've passed two parallel streams of different lengths to an external library that zipped them?
At the consumer end, the main one that comes to mind off the top of my head (apart from code I wrote myself) is numpy, for example their correlation coefficient function:
py> np.corrcoef([1, 2, 3, 4, 5, 6, 7, 8], [7, 6, 5, 4, 3, 2])
Traceback (most recent call last):
  ...
ValueError: array dimensions must agree except for d_0
Now I'm still keeping an open mind whether this check is justified for stats functions. (There have been some requests for XY stats in the statistics module, so I may have to make a decision some day.)
Are you saying you're annoyed that they enforce the same length here, and you'd like the freedom to pass different lengths? Because to me the check seems like an extremely good idea.
But my point is, regardless of whether that check is necessary or not, *I still have to check it when I produce the data* or else I get a potentially spurious exception that will eat my data. In the general case where I have iterators not lists, no recovery is possible. I can't catch the error, trim the data, and resubmit. I have to preemptively trim the data.
Of course I recognise the right of each developer to choose for themselves whether to enforce the rule that input streams are equal. And sometimes that will be the right thing to do.
But you are arguing that putting zip_strict as a builtin will encourage people to do this, and I am saying that *encouraging people to do this* is a point against it, because that will lead to the "things which aren't errors should never pass silently" Just In Case it might be an error.
Yes. We're starting to go in circles here, but I'm arguing that it's OK for people to be mildly inconvenienced sometimes having to preemptively trim their inputs in exchange for less confusing, invisible, frustrating bugs. I'd like people to use this feature as often as possible, and I think the benefits easily outweigh the problem you describe. Going crazy trying to debug something is probably the thing programmers complain about the most, I'd like to reduce that.
If you need to enforce equal length input, then one extra line:
from itertools import zip_strict
is no burden. But if someone is on the fence about checking for equal lengths, and importing would be too much trouble so they don't bother, but making it builtin is enough for them to tip them over into using it "just to be safe", then they probably shouldn't be using it.
I don't know why you put "just to be safe" in mocking quotes. I see safety as good. If an API accepts some iterables intending to zip them, I feel pretty safe guessing that 90% of the users of that API will pass iterables that they intend to be of equal length. Occasionally someone might want to pass an infinite stream or something, but really most users will just use lists constructed in a boring manner. I can't imagine ever designing an API thinking "I'd better not make this strict, I'm sure this particular API will be used quite differently from most other similar APIs and users will want to pass different lengths unusually often". But even if I grant that such occasions exist, I see no reason to believe that they will occur most often when a user is feeling too lazy to import itertools. The correlation you propose is highly suspect.
An external API which requires you to pass parallel iterables instead of pairs is unusual and confusing.
I don't know about that. numpy's correlation coefficient supports both a single array of data pairs or a pair of separate X, Y values. So does R.
Spreadsheets likewise put the X, Y data in separate cells, not in cells with two data points. If your data is coming from a CSV file, as it often does, the most natural way to get it is as two separate columns.
But for APIs that do require a single stream of pairs, then this entire discussion is irrelevant, since I, the producer of the data, can choose whatever behaviour makes sense for me:
- `zip` to truncate the data on shortest input;
- `zip_longest` to pad it;
- `zip_strict` (whether I write my own or get it from the stdlib)
or anything else I like, since I'm the producer.
So we shouldn't be talking about APIs that require a stream of pairs, but only APIs that require separate streams.
My point is that the APIs you're talking about are rare, therefore the problem you describe is rare.
[...]
In most cases it wouldn't ensure the correctness of code, but it could give some peace of mind and might help readers. But also in those cases if the user decides it's redundant and not worth using:
- The user had better be confident of their judgement, which will inevitably sometimes be wrong.
That's not our problem to solve. It isn't up to us to force people to use zip_strict "just in case your judgement that it isn't needed is wrong".
I think you and I have a severe disagreement on the relationship between the stdlib and the end developers. Between your comment above, and the one below, you seem to believe that it is the job of the stdlib to protect the developer from themselves.
And yet here you are using Python, where we have the ability to monkey-patch the builtins namespace, shadow functions, reach into functions and manipulate their data, including their code, rebind any name, remove attributes of anything, even change the class of (some) instances on the fly. Are you sure you are using the right language?
Python has plenty of things to protect users from mistakes. See exist_ok, keyword only arguments, private attributes, etc. All that matters is that you can work around them.
- Even if the context of the code makes it obvious that it's redundant, that context could change in the future and introduce a silent regression, and people are likely to not think to add strict=True to the zip call somewhere down the line from their changes. Adding strict=True is more future-proof to such changes.
That's an awfully condescending thing to say. You are effectively saying that Python developers ought to use strict zip by default just in case their requirements change in the future and they are too dim-witted to remember to change the implementation to match.
You see it as condescending, I see it as empathetic and cautious. Coders are humans and make mistakes. Code is full of bugs. Changes often cause regressions because someone wasn't aware of the impact changes in one place would have elsewhere. There's a common joke "99 little bugs in the code, 99 little bugs in the code. Take one down, patch it around 117 little bugs in the code". If someone does accidentally cause iterables to become unequal length, it is FAR better that this triggers an exception which alerts them to the problem immediately than them having to figure out what new thing has gone wrong, or worse, them not noticing at all.
I will try to remember that there are legitimate users for a strict zip, but this advocacy of *preemptive and unnecessary length checks* just makes me more hostile to adding this to the stdlib at all, let alone making it a builtin.
If I'm the producer of the data, and I want it to be equal in length, then I control the data and can make it equal.
But in your example you complain about having to do that. Is it a problem or not?
It is a problem if I have to do it for no good reason, to satisfy some arbitrary "just in case" length check. I don't mind doing so if there is a genuine good reason for it, like in the ast case.
Ultimately, the data is going to be truncated somewhere. At the moment, zip does the truncation. If people switch to strict zip, then I have to do the truncation. That's a usability regression. Unless it is justified by some functional reason, or fixing a bug, it's more harmful than helpful:
"Function do_useful_stuff() no longer ignores excess stuff, but raises instead. Callers must now trim their To Do list before passing it to the function. No bugs were fixed by this change."
sort of thing.
I think you make this situation sound worse than it is. "I will curse them onto the hundredth generation for making my life more difficult" is pretty melodramatic.
Hyperbole is supposed to be read as humour, not literally :-)
Hyperbole aside, I still think everything you're saying makes the problem sound worse than it is. It's a mild inconvenience.
If you get an exception because you tried:
process_user_ids(usernames, generate_user_ids())
then you can pretty easily change it to:
process_user_ids(usernames, (user_id for user_id, username in zip(generate_user_ids(), usernames)))
or if you can generate user IDs one at a time:
process_user_ids(usernames, (generate_user_id() for _ in usernames))
So I have to write ugly, inefficient code to make up for you encouraging people to add unnecessary length checks in their code? Yay, I feel so happy about this change!!!
Also, it doesn't work if usernames is an iterator.
I already consider this an unlikely situation, that just makes it even more unlikely. It's even more unlikely that you can't just convert the usernames to a list. Even then, it's pretty solvable. If you want to use iterators so badly, consider this: most people currently enforce this check using something like assert len(x) == len(y) - I've given several examples of that, including within CPython. That makes it completely impossible to pass any kind of iterator. The best way to get rid of that kind of code is to make strict zip as widely known as possible. It will be more widely known if it's widely used and if it's in the builtin. If it's in itertools, you can expect that plenty of people will never hear about it, and they will continue to use len().
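As a sketch of the point about len(): an up-front length assertion of the kind mentioned above only works for sized inputs. The function check_lengths is a hypothetical example, not code from CPython or any library.

```python
def check_lengths(x, y):
    """Hypothetical consumer using the common `assert len(x) == len(y)` guard."""
    assert len(x) == len(y), "inputs must have equal length"
    return list(zip(x, y))


# Lists are sized, so the guard works.
ok = check_lengths([1, 2], [3, 4])

# A generator has no len(), so the guard rejects it before any zipping,
# even though its length would have matched.
squares = (i * i for i in range(2))
error_type = None
try:
    check_lengths(squares, [3, 4])
except TypeError as exc:
    error_type = type(exc).__name__

print(ok, error_type)
```

A check performed while iterating, rather than before it, is the only form that can accept arbitrary iterators.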
It's a bit inconvenient, but:
- It's pretty easy to solve the problem, certainly compared to implementing zip_equal,
You haven't solved the problem, and judging by your attempted workarounds, this could easily lead to a reduction in code quality, not an improvement.
and nothing like your example of final classes which are pretty much insurmountable.
I don't remember raising final classes in this thread. Am I losing my short-term memory or are you confusing me with someone else?
I quoted you mentioning them before I wrote this bit: But if I'm only the consumer of the data, I have no business failing
"just to be safe". That's an anti-feature that makes life more difficult, not less, for the producer of the data, akin to excessive runtime type checking (I'm sometimes guilty of that myself) *or in other languages flagging every class in sight as final "just to be safe".*
Back to the current email:
- An exception has clearly pointed out a problem to solve
An exception *might* have pointed out a problem to solve, but since the solution is usually to just truncate the data, and zip already does that, raising an exception may be a regression, not an enhancement.
I don't mean to assert that there was a bug to be fixed, I mean the exception has made it fairly clear that you need to truncate the data, which may be annoying but is better than the frustrating debugging which could have happened had things gone differently.
Can we agree to meet half-way?
- there are legitimate, genuine uses for zip_strict;
- but encouraging people to change zip to zip_strict "just in case" in the absence of a genuine reason is a bad thing.
No, I stand by my position for now that "just in case" is a genuine reason and that safety outweighs convenience and efficiency. I haven't been given a reason to believe that your concerns would be significant.
On Tue, May 5, 2020 at 5:11 AM Alex Hall <alex.mojaki@gmail.com> wrote:
On Mon, May 4, 2020 at 1:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
Can we agree to meet half-way?
- there are legitimate, genuine uses for zip_strict;
- but encouraging people to change zip to zip_strict "just in case" in the absence of a genuine reason is a bad thing.
No, I stand by my position for now that "just in case" is a genuine reason and that safety outweighs convenience and efficiency. I haven't been given a reason to believe that your concerns would be significant.
If we were at the very beginning of the zip() function's life, and could guide its future without any baggage from the past, then "just in case" might be a valid justification. But that's not where we are. Is "just in case" worth the likely break of backward compatibility? I say "likely" because, technically, the Python language and standard library would be backward compatible; but in order to get any benefit from this change, there would need to be places where the strict mode is used, and that's going to mean people change code to be more strict, "just to be safe". Is THAT worth it? ChrisA
On Mon, May 4, 2020 at 9:18 PM Chris Angelico <rosuav@gmail.com> wrote:
On Tue, May 5, 2020 at 5:11 AM Alex Hall <alex.mojaki@gmail.com> wrote:
On Mon, May 4, 2020 at 1:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
Can we agree to meet half-way?
- there are legitimate, genuine uses for zip_strict;
- but encouraging people to change zip to zip_strict "just in case" in the absence of a genuine reason is a bad thing.
No, I stand by my position for now that "just in case" is a genuine reason and that safety outweighs convenience and efficiency. I haven't been given a reason to believe that your concerns would be significant.
If we were at the very beginning of the zip() function's life, and could guide its future without any baggage from the past, then "just in case" might be a valid justification. But that's not where we are. Is "just in case" worth the likely break of backward compatibility? I say "likely" because, technically, the Python language and standard library would be backward compatible; but in order to get any benefit from this change, there would need to be places where the strict mode is used, and that's going to mean people change code to be more strict, "just to be safe". Is THAT worth it?
ChrisA
I imagine there are few people who are too lazy to copy code from SO right now, and would be too lazy to import from itertools when the feature becomes available, but if it's a builtin then they're willing to make changes to some old code to make it a bit safer, even though that would break backward compatibility in their library. That's a weird combination. And if that happens, the user still gets a clear exception which tells them what to do. That's not a bad experience when making an upgrade. There's likely to be other breaking changes too, the library just dropped support for Python 3.9.
On 05/04/2020 12:35 PM, Alex Hall wrote:
I imagine there are few people who are too lazy to copy code from SO right now, and would be too lazy to import from itertools when the feature becomes available, but if it's a builtin then they're willing to make changes
Quite frankly, I have zero concern for people who are unwilling to import the correct function from the correct module. -- ~Ethan~
On Mon, 4 May 2020 21:07:26 +0200 Alex Hall <alex.mojaki@gmail.com> wrote:
Yes. We're starting to go in circles here, but I'm arguing that it's OK for people to be mildly inconvenienced sometimes having to preemptively trim their inputs in exchange for less confusing, invisible, frustrating bugs. I'd like people to use this feature as often as possible, and I think the benefits easily outweigh the problem you describe. Going crazy trying to debug something is probably the thing programmers complain about the most, I'd like to reduce that.
[...]
If an API accepts some iterables intending to zip them, I feel pretty safe guessing that 90% of the users of that API will pass iterables that they intend to be of equal length. Occasionally someone might want to pass an infinite stream or something, but really most users will just use lists constructed in a boring manner. I can't imagine ever designing an API thinking "I'd better not make this strict, I'm sure this particular API will be used quite differently from most other similar APIs and users will want to pass different lengths unusually often". But even if I grant that such occasions exist, I see no reason to believe that they will occur most often when a user is feeling too lazy to import itertools. The correlation you propose is highly suspect.
Is a Warning the right compromise? Turn it on by default, and let that 10% (where did that number come from?) turn it off because they actually do know better. Dan -- “Atoms are not things.” – Werner Heisenberg Dan Sommers, http://www.tombstonezero.net/dan
On 05/04/2020 12:07 PM, Alex Hall wrote:
No, I stand by my position for now that "just in case" is a genuine reason
"just in case" is boiler-plate. One of the huge wins in Python is its low boiler-plate requirements. It's okay if programmers have to think about their code and what's required, and it's even more okay to import the correct functions from a module -- especially if that module is already in the stdlib. -- ~Ethan~