PEP 618: Add Optional Length-Checking To zip
I have pushed a second draft of PEP 618: https://www.python.org/dev/peps/pep-0618 Please let me know what you think – I'd love to hear any new feedback that hasn't yet been addressed in the PEP! Brandt
On 10/05/2020 17:04, Brandt Bucher wrote:
I still don't buy your dismissal of the new function alternative. In particular:
But zip_equals() is also another beast entirely; it takes on the responsibility of raising an exception, a problem neither of the other variants even have. -- Rhodri James *-* Kynesim Ltd
On 05/10/2020 09:04 AM, Brandt Bucher wrote:
Many Python users find that most of their zip usage
I don't think you have enough data to make that claim, unless by "many" you mean five or more.
but silently start producing shortened, mismatched results if items is refactored by the caller to be a consumable iterator
This seems like a weak argument; static type checking could catch it.
the author has counted dozens of other call sites in Python's standard library
References, please.
A good rule of thumb is that "mode-switches" which change return types or significantly alter functionality are indeed an anti-pattern,
Source?
while ones which enable or disable complementary checks or behavior are not.
None of the listed examples change behavior between "working" and "raising exceptions", and none of the listed examples are for "complementary checks".
At most one additional item may be consumed from one of the iterators when compared to normal zip usage.
How, exactly?
However, zip_longest is really another beast entirely
No, it isn't.
so it makes sense that it would live in itertools while zip grows in-place.
No, it doesn't
Importing necessary functions is not an anti-pattern.
Another proposed idiom, per-module shadowing of the built-in zip with some subtly different variant from itertools, is an anti-pattern that shouldn't be encouraged.
Source?
How is this any different from calling zip the wrong way?
This proposal is further complicated by the fact that CPython's actual zip type is an undocumented implementation detail.
That actually gives us complete freedom to redesign as long as we keep the API backward-compatible. Really, this entire section (alternate constructor type) seems like rubbish. While I agree that mismatched iterators is a problem worth solving, I don't think this PEP comes close to making the case for this particular solution. -1 -- ~Ethan~
Thanks for all of your feedback.
Antoine Pitrou wrote:
I'm not sure what the iters bring here. The snippet would be more readable without, IMHO.
Good point. I was trying to demonstrate that it works with iterators, but I agree it's clearer to just use the lists here.
Ethan Furman wrote:
Many Python users find that most of their zip usage
I don't think you have enough data to make that claim, unless by "many" you mean five or more.
It's based on a combination of my own experience, the experiences of several others, and a survey of the CPython repo. I can dial back the wording, though, since this isn't necessarily representative of the larger userbase...
but silently start producing shortened, mismatched results if items is refactored by the caller to be a consumable iterator
This seems like a weak argument; static type checking could catch it.
Well, that's why I go on to make stronger, non-toy ones immediately after. :) This is mainly just to introduce the problem in an easy-to-understand way.
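To make the toy scenario quoted above concrete, here is a minimal sketch of how a plain `zip` call can silently truncate once a caller passes a one-shot iterator instead of a list. The helper name `pair_with_squares` is hypothetical and is not taken from the PEP.

```python
def pair_with_squares(items):
    """Pair each item with its square (illustrative helper, not from the PEP)."""
    # Building the list walks `items` once; a one-shot iterator is now exhausted.
    squares = [x * x for x in items]
    # With a list, both arguments have equal length. With an exhausted
    # iterator, zip stops immediately and reports nothing.
    return list(zip(items, squares))


print(pair_with_squares([1, 2, 3]))        # [(1, 1), (2, 4), (3, 9)]
print(pair_with_squares(iter([1, 2, 3])))  # []
```

No exception is raised in the second call, which is the kind of silent result the length check is meant to surface.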
the author has counted dozens of other call sites in Python's standard library
References, please.
Here are two dozen:
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
I'll go ahead and link these in the PEP.
A good rule of thumb is that "mode-switches" which change return types or significantly alter functionality are indeed an anti-pattern,
Source?
This was based on a chat with someone who has chosen not to become involved in the larger discussion, and it was lifted almost verbatim from my notes into the draft. Looking at it again, though, I don't think this sentence belongs in the PEP... it really shouldn't be prescribing design philosophies like this.
while ones which enable or disable complementary checks or behavior are not.
None of the listed examples change behavior between "working" and "raising exceptions", and none of the listed examples are for "complementary checks".
Thanks for pointing this out. If I keep this bit, I'll include some other examples from the stdlib that specifically behave this way.
At most one additional item may be consumed from one of the iterators when compared to normal zip usage.
How, exactly?
I'm actually considering just leaving this line out, too. We don't currently make any promises about how "extra" items are drawn (just that they are drawn left-to-right, which is still true here), so I don't think this needs to be in the spec.
However, zip_longest is really another beast entirely
No, it isn't.
It has a completely independent implementation, a different interface, lives in a separate namespace, and doesn't even reference zip in its documentation. So it seems to me that it is indeed another beast entirely.
so it makes sense that it would live in itertools while zip grows in-place.
No, it doesn't.
See above for why I think it does.
The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using zip at a call site with this property.
Importing necessary functions is not an anti-pattern.
Um, agreed?
Another proposed idiom, per-module shadowing of the built-in zip with some subtly different variant from itertools, is an anti-pattern that shouldn't be encouraged.
Source?
Point taken. I probably went a bit far labeling this a straight-up "anti-pattern", but it is certainly annoying to find that someone has added `from pprint import pprint as print` at the top of a module, for example (which has actually happened to me before). Very hard to figure out what's happening.
It's not obvious which one will succeed, or how the other will fail. If zip.strict is implemented as a method, zm will succeed, but zd will fail in one of several confusing ways:
How is this any different from calling zip the wrong way?
Because there are now twice as many ways to get it wrong, no matter how it's spelled. And in either case, we now have a brand new class of silently wrong results.
This proposal is further complicated by the fact that CPython's actual zip type is an undocumented implementation detail.
That actually gives us complete freedom to redesign as long as we keep the API backward-compatible.
I should have phrased this line better. My point wasn't that we don't currently have the freedom to do whatever we want with the implementation, it was that making a decision like method/classmethod/staticmethod pretty much locks us into a particular implementation going forward. I'll make that clearer here.
Rhodri James wrote:
But zip_equals() is also another beast entirely; it takes on the responsibility of raising an exception, a problem neither of the other variants even have.
Would you consider `os.makedirs(...)` and `os.makedirs(..., exist_ok=True)` to be entirely different beasts? I certainly don't. Brandt
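The `os.makedirs` comparison can be tried directly. The sketch below uses only the standard library and a throwaway temporary directory; it shows the same function either raising or staying quiet depending on one keyword argument.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "a", "b")
    os.makedirs(target)                 # first call creates the directories
    try:
        os.makedirs(target)             # default exist_ok=False: raises
        outcome = "no error"
    except FileExistsError:
        outcome = "FileExistsError"
    os.makedirs(target, exist_ok=True)  # same call with the check disabled
    still_there = os.path.isdir(target)

print(outcome)
```

Whether this analogy carries over to `zip` is exactly what the thread disputes; the snippet only illustrates the precedent being cited.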
On Tue, 12 May 2020 at 07:53, Brandt Bucher <brandtbucher@gmail.com> wrote:
... so it's another beast because (among other reasons) it lives in a separate namespace, and it should live in a separate namespace because it's another beast? That's circular logic. If we were to put zip_strict into itertools, you could use *precisely* this logic to argue that it was the right thing to do.
So importing zip_strict from itertools is an entirely reasonable way for users to enable the check, then.
Also irrelevant. It's very easy to suggest bad ways of using a feature. That doesn't make the feature bad. You seem to be arguing that zip_strict is bad because people can misuse it. We could probably remove 99% of the Python language by that argument... Paul
On Tue, May 12, 2020 at 5:20 PM Paul Moore <p.f.moore@gmail.com> wrote:
And considering that "from __future__ import print_function" is an officially-sanctioned way to cause a semantic change to print, I don't think it's really that strong an argument. Python is *deliberately* designed so that you can shadow things. I am most in favour of the separate-functions option *because* it makes shadowing easy. Not an anti-pattern at all. ChrisA
I fear that my comment on some text in the PEP was lost amidst the voting, so I'm repeating it here. This will probably screw up some threading, but this is the oldest message I have to reply to. The PEP says "At most one additional item may be consumed from one of the iterators when compared to normal zip usage." I think this should be prefaced with "If ValueError is raised ...". Also, why does it say "at most one additional item". How could it ever be less than one? And I'm not sure I'd say "normal zip usage", maybe "the existing builtin zip function". Eric
On Fri, 15 May 2020 at 21:50, Eric V. Smith <eric@trueblade.com> wrote:
It seems to me, looking at the Python implementation in the PEP (not the current or C implementation) that the crux is here:

    except StopIteration:
        if not strict:
            return
    if items:
        i = len(items) + 1
        raise ValueError(f"zip() argument {i} is too short")

So if it is not strict, it will return/stop consuming iterators. If it is strict but it runs out *not* on the first iterator it will also not consume from another iterator?
And I'm not sure I'd say "normal zip usage", maybe "the existing builtin zip function".
Depends on where we end up I guess, if we go with what Brandt's PEP says (makes sense to keep internally consistent) I'd say "zip without the strict=True flag" or similar.
On 05/11/2020 11:48 PM, Brandt Bucher wrote:
On 05/10/2020 14:39 PM, Ethan Furman wrote:
On 05/10/2020 09:04 AM, Brandt Bucher wrote:
- both take an unknown number of iterables
- both return tuples
- both names start with `zip`
- both stop at exhaustion
  - one as soon as possible
  - the other as late as possible
- one has one extra parameter
Those seem like very similar beasts to me.
and doesn't even reference zip in its documentation.
So update the docs. -- ~Ethan~
On 05/11/2020 11:48 PM, Brandt Bucher wrote:
On 05/10/2020 14:39, Ethan Furman wrote:
On 05/10/2020 09:04 PM, Brandt Bucher wrote:
These are all after a function that ensures the iterables are the same length -- hardly seems a good idea to slow them down with an extra check for each digit.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
This one already has a check.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Already has a check.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Unsure if mismatch can happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Wow -- I don't even know how to parse that!
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Maybe.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Definitely.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
So half of your examples are actually counter-examples. Did you vet them, or just pick matches against `zip(` ?
Also, if a flag is used, won't that slow down every call to zip even when the flag is False? I know in many cases it probably won't matter, but I can see where it could in _pydecimal.
-- ~Ethan~
On Wed, May 13, 2020 at 4:38 AM Ethan Furman <ethan@stoneleaf.us> wrote:
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Wow -- I don't even know how to parse that!
Wow, that's quite an example. Of something, I'm not sure what, but definitely an example. Based on two booleans, entries is either None or a list. If it's None, this loops over just the directory names; if it's a list, then it's been populated in perfect parallel to dirs (see the preceding loop), thus guaranteeing that the two lists are perfectly parallel. But in that case, "name" actually gets a tuple of (name,entry), and then inside the loop, it does a three-way branch that is guaranteed (and asserted) to split out the name and entry ONLY when there actually will be one. Definitely an odd piece of code. But it can never zip over things of different lengths. ChrisA
On Wed, May 13, 2020 at 03:04:02PM +0200, Antoine Pitrou wrote:
I'm not an expert on Python's argument passing code in fine detail, but I'm reasonably sure that it takes longer to pass arguments by keyword than by position, and it takes time to fill in missing arguments with the default (they have to be read from the function defaults, which takes time) and these things don't happen for free. But I do think that's probably a spurious point based on a micro- or even nano-optimization. I don't think that the extra nanosecond or whatever it takes for zip to receive an extra argument, fill in the default if missing, and branch is significant. I don't think that `zip(*args, strict=flag)` is the wrong choice because it's a nanosecond slower, I think it's the wrong choice because it makes for a poorer API when there are better, more future-proof choices available. -- Steven
On 05/13/2020 06:04 AM, Antoine Pitrou wrote:
Not the call itself, but the running of zip. Absent some clever programming it seems to me that there are two choices if we have a flag:
- have two independent branches (basically `zip` and `zip_strict` functions inside `zip` itself) and have the flag select the branch; or
- have one branch which has extra logic that checks for at least one StopIteration and at least one item in each iteration to see if it should raise.
If the first method is chosen then we may as well have two different functions; if the second method is chosen that would seem to be a performance hit, even when the flag is False.
-- ~Ethan~
Ethan Furman wrote:
So half of your examples are actually counter-examples.
I claimed to have found "dozens of other call sites in Python's standard library and tooling where it would be appropriate to enable this new feature". You asked for references, and I provided two dozen cases of zipping what must be equal length iterables. I said they were "appropriate", not "needed" or even "recommended". These are call sites where unequal-length iterables, if encountered, would be an error that I would hope wouldn't pass silently. Besides, I don't think it's beyond the realm of imagination for a future refactoring of several of the "Mismatch cannot happen." cases to introduce a bug of this kind.
Did you vet them, or just pick matches against `zip(`?
Of course. I spent hours vetting them, to the point of researching the GNU tar extended sparse header and Apple property list formats (and trying to figure out what the hell was happening in `os._fwalk`) just to make sure my understanding was correct.
Ethan Furman wrote:
Not the call itself, but the running of zip. Absent some clever programming it seems to me that there are two choices if we have a flag:
I wouldn't call my implementation "clever", but it differs from both of these options. We only need to check if we're strict when an error occurs in one of our iterators, which is a situation the C code for `zip` already needs to explicitly handle with a branch. So this condition is only hit on the "last" `__next__` call, not on every single iteration.

As a reminder, the actual C implementation is linked in the PEP (there's no PR yet but branch reviews are welcome), though I'd prefer if the PEP discussion didn't get bogged down in those specifics. The pure-Python implementation in the PEP is *very* close to it, but it uses different abstractions for some of the details regarding error handling and argument parsing.[0]

However, for those who are interested, there is no measurable performance regression (and no additional parsing overhead for no-keyword-argument calls). Parsing the keyword argument (if present) adds <0.2us of overhead at creation time on my machine. I went ahead and ran some rough PGO/LTO benchmarks:

Creation time:

```
$ ./python-master -m pyperf timeit 'zip()'
Mean +- std dev: 79.4 ns +- 4.3 ns
$ ./python-zip-strict -m pyperf timeit 'zip()'
Mean +- std dev: 79.0 ns +- 1.9 ns
$ ./python-zip-strict -m pyperf timeit 'zip(strict=True)'
Mean +- std dev: 240 ns +- 8 ns
```

Creation time + iteration time:

```
$ ./python-master -m pyperf timeit -s 'r = range(10)' '[*zip(r, r)]'
Mean +- std dev: 577 ns +- 35 ns
$ ./python-zip-strict -m pyperf timeit -s 'r = range(10)' '[*zip(r, r)]'
Mean +- std dev: 565 ns +- 16 ns
$ ./python-zip-strict -m pyperf timeit -s 'r = range(10)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 756 ns +- 27 ns
$ ./python-master -m pyperf timeit -s 'r = range(100)' '[*zip(r, r)]'
Mean +- std dev: 3.54 us +- 0.14 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(100)' '[*zip(r, r)]'
Mean +- std dev: 3.49 us +- 0.07 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(100)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 3.73 us +- 0.13 us
$ ./python-master -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r)]'
Mean +- std dev: 44.1 us +- 2.0 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r)]'
Mean +- std dev: 45.2 us +- 2.0 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 45.2 us +- 1.4 us
```

Additionally, the size of a `zip` instance has not changed. Pickles for non-strict `zip` instances are unchanged as well.

Brandt

[0] And zip's current tuple caching, which is *very* clever.
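As a rough illustration of the point that strictness is consulted only when an iterator runs out, here is a generator-based sketch. It is not the PEP's reference implementation or the C code; the names `zip_sketch` and `_check_lengths` and the error messages are assumptions made for this example.

```python
def zip_sketch(*iterables, strict=False):
    """Illustrative generator version of zip with an optional length check."""
    if not iterables:
        return
    iterators = [iter(iterable) for iterable in iterables]
    sentinel = object()
    while True:
        items = []
        for index, iterator in enumerate(iterators):
            item = next(iterator, sentinel)
            if item is sentinel:
                # Exhaustion path: the only place `strict` is consulted.
                if strict:
                    _check_lengths(index, iterators, sentinel)
                return
            items.append(item)
        yield tuple(items)


def _check_lengths(index, iterators, sentinel):
    """Raise ValueError if the iterators did not run out together."""
    if index > 0:
        # Earlier iterators already produced an item in this round.
        raise ValueError(f"argument {index + 1} is shorter than the ones before it")
    # The first iterator ran out, so every later one must be empty as well.
    for position, iterator in enumerate(iterators[1:], start=2):
        if next(iterator, sentinel) is not sentinel:
            raise ValueError(f"argument {position} is longer than argument 1")


print(list(zip_sketch("ab", [1, 2], strict=True)))  # [('a', 1), ('b', 2)]
```

In this sketch the per-item loop is identical for both settings, so any difference between them is confined to the final, exhausting step.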
On 05/14/2020 11:13 AM, Brandt Bucher wrote:
Very good point.
Which seems beside the point. As you say, if the lengths are mismatched then a bug has appeared and if the check is nearly free there's no reason not to do it.
Glad I'm not the only one that didn't immediately get that os._fwalk code.
Ah, so this is why the strict version may consume an extra element -- it has to check if any remaining iterators have elements, while the non-strict version can just quit as soon as any of the iterators are exhausted.

    >>> one = iter([1, 2])
    >>> six = iter([6, 7, 8])
    >>> zip(one, six)
    (stuff)
    >>> next(six)
    8

vs

    >>> zip_strict(one, six)
    (stuff)
    >>> next(six)
    (crickets)
I went ahead and ran some rough PGO/LTO benchmarks...
Can you do those with _pydecimal? If performance were an issue anywhere I would expect to see it with number crunching.
---
Paul Moore and Chris Angelico have made good arguments in favor of an itertools addition which haven't been answered yet. Regardless, I think you've made the point that /a/ solution is very desirable. So the real debate is whether it should be a flag, a mode, or a separate function. I am still -1 on the flag.
-- ~Ethan~
Ethan Furman wrote:
Can you do those with _pydecimal? If performance were an issue anywhere I would expect to see it with number crunching.
No difference, probably because those methods look like they spend most of their time doing string manipulation:

```
$ export PYPERFSETUP='from _pydecimal import Decimal; from random import getrandbits; l = Decimal(bin(getrandbits(28))[2:]); r = Decimal(bin(getrandbits(28))[2:])'
$ export PYPERFRUN='l.logical_and(r); l.logical_or(r); l.logical_xor(r)'
$ ./python-master -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
Mean +- std dev: 53.4 us +- 2.8 us
$ ./python-zip-strict -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
Mean +- std dev: 53.8 us +- 2.5 us
$ ./python-zip-strict -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"  # This time, with strict=True in each method.
Mean +- std dev: 53.6 us +- 3.0 us
```

I would encourage those who are still curious to pull the branch and experiment for themselves. Let's try to keep this a design discussion, since we've established that performance isn't a problem (and there is plenty of time for code review later).
Paul Moore and Chris Angelico have made good arguments in favor of an itertools addition which haven't been answered yet.
I don't consider their arguments particularly strong, but yeah, I was getting to those. I wanted to address your points first since you weren't part of the Ideas discussion!
Paul Moore wrote:
... so it's another beast because (among other reasons) it lives in a separate namespace, and it should live in a separate namespace because it's another beast? That's circular logic.
Sorry, that's on me for trying to respond to two questions with one answer right before bed. Strike the namespace argument, then. The rest stands.
So importing zip_strict from itertools is an entirely reasonable way for users to enable the check, then.
Still agreed. But I think they would be *better* served by the proposed keyword argument. This whole sub-thread of discussion has left me very confused. Was anything unclear in the PEP's phrasing here? If so, I'd like to improve it. The original quote is: "The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using `zip` at a call site with this property."
It's very easy to suggest bad ways of using a feature. That doesn't make the feature bad. You seem to be arguing that zip_strict is bad because people can misuse it.
Well, I addressed this "irrelevant" point because right out of the gate people started suggesting that they want a separate function *because* it makes shadowing easy. Which brings me to my next quote:
Chris Angelico wrote:
I am most in favour of the separate-functions option *because* it makes shadowing easy. Not an anti-pattern at all.
I *really* hope this isn't how people use this (and I don't *think* it would be predominantly used this way), but at least it's clear to me now why you want it to be a separate function. It would still be quite simple to follow this pattern, though, with `functools.partial` or a custom wrapper.
Python is *deliberately* designed so that you can shadow things.
I wouldn't confuse "can" and "should" here. Python is deliberately designed to make *many* design patterns possible, good and bad.
And considering that "from __future__ import print_function" is an officially-sanctioned way to cause a semantic change to print, I don't think it's really that strong an argument.
Well that's a parser directive that is just there for 2/3 compatibility (I'm pretty sure - I've never used Python 2). I see it as very, very different from my `from pprint import pprint as print` headache that was quoted two levels up. Brandt
On Fri, 15 May 2020 at 07:10, Brandt Bucher <brandtbucher@gmail.com> wrote:
This whole sub-thread of discussion has left me very confused. Was anything unclear in the PEP's phrasing here? If so, I'd like to improve it. The original quote is: "The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using `zip` at a call site with this property."
It's not unclear, I'm just not sure I agree with the goal, and I'm not sure the proposal achieves that goal:

I note that the PEP makes no mention in the rationale of the goal to make it "tempting" to use the flag. It's *only* mentioned as a reason to reject the itertools option. If you want to use that argument, you should explicitly state (and justify) the goal of making use of the flag "tempting" in the rationale for the feature. If it's not part of the rationale, your argument against an itertools function is weak (arguably flawed).

My problems with the argument for rejection are:

1. Why do we want to "tempt" people to error when handling mismatched lengths? Certainly letting people catch errors easily is worthwhile, but rejecting arguments of different lengths may well *not* be an error ("be lenient in what you accept" is a well-known principle, even if not something that everyone agrees on in all cases).
2. I find "mode switch" arguments ugly, and I could even argue difficult to maintain (I can't easily use grep to check whether I missed any cases that I wanted to make strict). So I'm not tempted to use one - rather the opposite, it puts me off. (Note that I'm *not* arguing "mode switches are wrong", but rather that a mode switch makes the functionality "more tempting").
3. I'm not even that sure it's easy to discover - a key factor in making it something people will use when needed. People who know of zip and zip_longest would naturally look for a zip_strict, not for a mode argument. (Yes, this is not a strong point - *nobody* can really tell what people in general will find "easy" - but it does at least reflect *my* thought processes).

I do agree that a builtin is more "tempting" to use than a stdlib function (it's not logical, but I can see that people think that way - I do myself). What I don't agree with is that "tempting" is a goal that we want, or that being a builtin is sufficiently important to justify the downsides of a mode flag.

We may just have to agree to differ, and leave the final decision to the SC. But let's at least be clear about the goals up front in the rationale section.

Paul

PS Despite my reservations, this is a well-reasoned and well presented PEP - you've put a lot of work into it, and it shows. Thanks!
On Fri, May 15, 2020 at 08:17:12AM +0100, Paul Moore wrote:
I concur with Paul here. There may be cases where mismatched lengths are an error, but there are also cases where they are not an error. It is patronising to be talking about tempting people into using the strict version of zip. People who need it will find it, whether it is builtin or not.
2. I find "mode switch" arguments ugly
I think we need to distinguish between *modes* and *flags*. This proposed functionality is a mode, not a flag.

Mode switches are extensible: zip(*args, mode='strict') can easily be extended in the future to support new modes, such as zip_longest. There are at least two other modes that have been mentioned briefly in the Python-Ideas thread:

1. a variety of zip that returns information in the StopIteration exception identifying the argument that was empty and those that weren't;

2. a variety of zip that simply skips the missing arguments:

    zip_skip('abc', 'ef', 'g') => (a, e, g), (b, f), (c,)

I don't even understand the point of the first one, or how it would operate in practice, or why anyone would need it, but at least one person thinks it would be useful. Take that as you will. But I have written my own version of the second one, and used it.

A mode parameter naturally implies that only one mode can apply at a time, and it naturally enforces that restriction without any additional programmer effort. If there are (let's say...) four different modes:

    mode = shortest|longest|strict|skip

you can only supply one at a time.

Using *named modes* (whether as strings or enums) to switch modes is quite reasonable. It's not my preferred API for this, but I don't hate it.

But a *flag* parameter is difficult to extend without making both the API and the implementation complicated, since any extension will use multiple parameters:

    zip(*args, strict=True, longest=False, skip=True)

and the caller can supply any combination, each of which has to be tested for, with exceptions raised for the incompatible combinations. With three flags, there are eight such combinations, but only four valid ones. With four flags, there are 16 combinations and only five meaningful combinations.

Flags work tolerably well when the parameters are independent and orthogonal, so you can ask for any combination of flags. But flags to switch modes are *not* independent and orthogonal. We can't combine them in arbitrary combinations. Using *boolean flags* to switch modes in this way makes for a poor API.

For the record, my preferred APIs for this are in order:

1. +1 itertools.zip_strict function
2. +1 zip.strict(*args)
3. +1 zip(*args, mode='strict')  # mode='shortest' by default
4. +0 zip(*args, strict=True)

and even that is being generous for option 4.

Note that options 1 and 2 have an important advantage over options 3 and 4: the strict version of zip is a first-class callable object that can be directly passed around and used indirectly in a functional way. E.g. for testing using unittest:

    assertRaises(Exception, zip.strict, 'a', '')

Yes, assertRaises is also usable as a context manager, but this is just an illustration of the functional style.

--
Steven
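For readers who have not met the second variant, the skipping behaviour can be sketched in a few lines. This is only an illustration: zip_skip is the hypothetical name used in the message above, not an existing itertools function, and building it on itertools.zip_longest with a private sentinel is an assumption about how one might write it, not Steven's own implementation.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel; cannot collide with real data


def zip_skip(*iterables):
    """Hypothetical sketch: like zip_longest, but exhausted inputs are
    dropped from each output tuple instead of being padded."""
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        yield tuple(item for item in row if item is not _MISSING)


print(list(zip_skip('abc', 'ef', 'g')))
# [('a', 'e', 'g'), ('b', 'f'), ('c',)]
```

Note that the output tuples shrink as inputs run out, which is one reason this behaviour is hard to express as an extra boolean on zip itself.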
I'm a little frustrated by the tone in which the PEP dismisses the option that is most supported in the discussion. It's fine for Brandt to have a different preference himself, but I think it ought to be presented more neutrally.

On Fri, May 15, 2020, 10:20 AM Steven D'Aprano
Mostly I agree with Steven on relative preference:

    itertools.zip_strict() +1
    zip.strict() +0.5
    zip(mode='strict') +0
    zip(strict=True) -0.5

Fwiw, I don't think it changes my order, but 'strict' is a better word than 'equal' in all those places. I'd subtract 0.1 from each of those votes if they used "equal".
On Fri, 15 May 2020 at 16:01, David Mertz <mertz@gnosis.cx> wrote:
I'm a little frustrated by the tone in which the PEP dismisses the option that is most supported in the discussion. It's fine for Brandt to have a different preference himself, but I think it ought to be presented more neutrally.
Agreed. The PEP gives the impression that consensus was reached, but I don't think that's the case. My feeling is that opinions are rather evenly split between the approach in the PEP and an itertools function. I also feel that proposing an itertools function would have been a *lot* less controversial, so the tone in the PEP feels a little like it's defending a weak position by aggressively opposing alternatives. After all, one of the benefits of an itertools function is that it probably wouldn't have needed a PEP in the first place! Actually, looking at the reasons for rejection of the itertools option in the PEP:
It seems that a great deal of the motivation driving this alternative is that zip_longest already exists in itertools.
Nope, the biggest motivation is that an itertools addition would have been *significantly less controversial*. Paul
On 15/05/2020 16:56, Chris Angelico wrote:
Well, if it's what all the cool kids are doing...

* itertools.zip_strict() +1
* zip.strict() +0
* zip(mode='strict') -0
* zip(strict=True) -1

The middle two would be weird if zip_longest doesn't get folded in eventually, which might push them (more) negative.

--
Rhodri James *-* Kynesim Ltd
On 5/15/2020 11:56 AM, Chris Angelico wrote:
    itertools.zip_strict() +1
    zip.strict() +0
    zip(strict=True) -0
    zip(mode='strict') -1

I don't particularly care for "strict", though. It doesn't seem specific enough, and doesn't say "the iterators must return the same number of items" to me. I sort of liked "equal" better, but not so much to make a big stink about it.

Also: The PEP says "At most one additional item may be consumed from one of the iterators when compared to normal zip usage." I think this should be prefaced with "If ValueError is raised ...". Also, why does it say "at most one additional item"? How could it ever be less than one?

Eric
[Cut the previous votes because someone's quoting didn't survive my email client and I can't be bothered fixing it]

If everyone else is doing it...

    itertools.zip_strict() +1
    zip(strict=True) -0
    zip.strict() -0.5
    zip(mode='strict') -1

Paul
These negative votes surprise me. Given that it's clear that a generic strict-mode zip is non-trivial to write, and that there is significant demand for it, are people saying "+0 Python would not be a better programming environment if itertools.zip_strict() were adopted," and "-1 Python would be a worse programming environment if zip.strict() were adopted"?

I can see why folks would say the latter about zip.strict(), but even though I really dislike the mode switches, I'm still positive about adding them if one of them ranks highest among those who care. I'm not going to give them negative votes, they don't make Python worse.

I don't mind hyperbole ("I'm +1000 on this feature!" or "-10 on the worst proposal I've seen since <potentially controversial example removed>!") But I would like it if "0" meant "indifferent", "+1" meant "no-brainer, add it", and "-1" meant "no-brainer, just don't".

FWIW,

    +1   itertools.zip_strict(*iterables)
    +0.5 zip(*iterables, mode)    # mode is 3-way, default "shortest"
    +0.4 zip(*iterables, strict)  # strict is boolean, default False
    +0   zip.strict(*iterables)
I used the same convention as you, and my vote was thus as the record will show (note the negatives):

    zip(strict=True) +1
    itertools.zip_strict() +0
    zip(mode='strict') -1
    zip.strict() -1

And I stand by that: I think Python would be better off without the 3rd and 4th option, even if no alternative was implemented.

To take one of the examples (not exactly, you'll understand why...) that was given in the other thread: if we suggested asdfasfa.kjllasdfa.asdf() as the name(space) for the zip(strict=True) functionality, I would:

- vote -1 (or hyperbole -10^googol)
- still use the feature when I wanted to use 'zip(strict=True)'
- think Python would be a worse programming environment for allowing this to be introduced

Obviously the 3rd and 4th option are not as insane/illogical as the above example (apologies, I would attribute, but the nature of this example makes it hard for me to search for it!) but I do not like functionality exposed in this way and I think the lack of this functionality in the stdlib does not weigh up against the precedent/bad example this would set.

You, and anyone else, can and some definitely will disagree with me but it's my vote, and I don't think it matters that much anyway. There has been a lot of discussion, and these straw polls, from my limited understanding, are often taken to see whether there is a clear consensus to short-circuit/end the discussion. I would say in this case it has shown inconclusiveness, which is fine; the final decision on this PEP will (fortunately) not be decided by our votes.

I understand your 'pain' Stephen: I still think it is weird that people on these lists don't want "for x in some_iterable if x is not None:" as valid syntax, but I have, almost, made my peace with it.

On Sun, 17 May 2020 at 18:42, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
On Fri, May 15, 2020 at 11:55 AM Henk-Jaap Wagenaar <wagenaarhenkjaap@gmail.com> wrote:
Agreed. The best way to reduce accidental incorrect use of the builtin is to make the builtin capable of doing what people want directly without having to go discover something in a module somewhere.

OTOH so long as zip's docstring that shows up to people from interactive help, pydoc, and hover/alt/meta text in fancier IDEs mentions the caveat and the way to get the strict behavior, either of these two should be sufficient.

I'm pushing forward with a pure documentation update in https://github.com/python/cpython/pull/20118 suitable for 3.8 - it doesn't mention the way to get the other behavior as that isn't settled or short yet, just makes it more obvious what the actual behavior is.

-gps
Gregory P. Smith writes:
Executive summary: My argument (and one of Steven d'Aprano's) against a "strict" mode to zip is precisely that it's *extremely* likely that if I use a facility that zips together things I provide, the last thing I want is for it to choose "strict" for me, because that *would likely be incorrect*. I do not want people using strict *for any facility I might use* "because it's there."

I'm not saying strict mode is useless. I am saying the "encourage use by making it easier to use" argument cuts both ways: it can create problems as well as solve them.

A couple of concrete examples:

1. In activities like constructing data arrays, which we expect to be rectangular, I'm still likely to use sequences of unequal length, including infinite sequences. As an economist, I often use lagged data, which can easily be constructed for an equation like

    y[t] = a + b x[t] + c x[t-1]

with

    zip(y[1:], const(), x[1:], x[0:])

where

    def const():
        while True:
            yield 1

(Here I'm using zip() as a proxy for somebody's generic facility such as a function to compute OLS estimates given a sequence of data series. Obviously for zip itself, I would just not use strict mode.)

Note that y[0], not y[-1], needs to be left out. This is the critical point that I need to concentrate on when constructing this data frame. If I have to "even out" the columns, though, I need *also* to think about the lengths, a distraction which for me makes this more bug-prone. Ie, I might accidentally write

    zip(y[:-1], const(len(x) - 1), x[:-1], x[1:])

where

    def const(n):
        return (1 for _ in range(n))

which is not only asymmetric but wrong, as the regressor x[1:] is "future x"! More opportunities for bugs arise in the replacement for const(). Even if you don't agree about the bugs (and there is a weak argument that some fraction of the potential bugs will be caught by strict-mode zip, such as a wrong argument to const()), it's pretty clear which style is more readable.

2. My programming style is such that if I want couples that are related to each other, I will almost certainly generate those couples, not generate them separately in the right orders and then zip as needed. For example, in one of the test suites two lists are generated something like this:

    c_int_types = [...]    # list display
    c_int_type_ranges = [construct_range(t) for t in c_int_types]

and in many tests the two lists are zipped to produce appropriately matched couples. But I would certainly do

    c_int_types = [...]    # as above
    c_int_types_with_ranges = [(t, construct_range(t)) for t in c_int_types]

Of course I understand that sometimes you might very well care about the space cost of doing this, but I suspect that if I cared about the 2X cost of c_int_types_with_ranges, I wouldn't pregenerate a list of ranges at all.

My point is that given my style, this particular use case will *almost never* occur, so is unlikely to provide an excuse for strict mode if I'm providing the data. I suspect this applies to a lot of claimed use cases.

Of course if I only provide c_int_types, and your function constructs c_int_type_ranges and zips them, it's fine if you use strict mode -- that doesn't impact me at all. You probably *should* use strict mode. But if you claim to be providing a general facility, I think it's on you to think about whether I might want to feed sequences of unequal length to the function, even though you never would. That's quite a burden to assume, though, unless you simply provide a strict mode flag in your functions (which you can default to strict!) and let me choose.

Steve
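To make the lagged-data example concrete, here is a small runnable version. The sample values are invented for illustration, and itertools.repeat(1) stands in for the const() generator from the message; both are assumptions, not part of the original post.

```python
from itertools import repeat

# Made-up sample series, purely for illustration.
y = [1.0, 2.0, 3.0, 4.0]
x = [10.0, 20.0, 30.0, 40.0]

# Each row is (y[t], 1, x[t], x[t-1]) for t = 1, 2, 3.
# repeat(1) is an infinite stream of ones, like const() above.
# x[0:] is one element longer than x[1:]; zip simply stops at the
# shortest input, so no manual trimming of the columns is needed.
rows = list(zip(y[1:], repeat(1), x[1:], x[0:]))

for row in rows:
    print(row)
# (2.0, 1, 20.0, 10.0)
# (3.0, 1, 30.0, 20.0)
# (4.0, 1, 40.0, 30.0)
```

Under the strict checking proposed in the PEP, this same call would be expected to raise ValueError, because the inputs deliberately have different lengths.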
On Fri, May 15, 2020 at 12:55 PM Eric V. Smith <eric@trueblade.com> wrote:
This struck me as strange also. I mean, the wording can be improved to clarify "if error." But more significantly, it seems like it cannot conceivably be true. It might be "At most one additional item from EACH of the iterators." If I do zip_strict(a, b, c, d, e) and "e" is the one that is shorter, how could any algorithm ever avoid consuming one extra item of a, b, c, and d each?! -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
On 2020-05-15 20:36, David Mertz wrote:
Well, it does say "when compared to normal zip usage". The normal zip would consume an item of a, b, c, and d. If e is exhausted, then zip would just stop, but zip_strict would raise ValueError. There would be no difference in the number of items consumed but not used.
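The point about ordinary zip already consuming an item from the earlier iterators can be checked directly. A minimal sketch, using two made-up iterators named after the ones in the discussion:

```python
a = iter([1, 2, 3])
e = iter([10])          # the shorter input, listed last

pairs = list(zip(a, e)) # on the second round zip pulls 2 from a,
                        # then finds e exhausted and stops
leftover = list(a)      # whatever a can still produce

print(pairs)     # [(1, 10)]
print(leftover)  # [3]
```

The item 2 was taken from a and discarded by plain zip, so a version that raises at that point instead of stopping would not have consumed anything more.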
David Mertz wrote:
I would say that 'equal' is worse than 'strict', but 'strict' is also wrong. Zipping to a potentially infinite sequence -- like a manual enumerate -- isn't wrong. It may be the less common case, but it isn't wrong. Using 'strict' implies that there is something sloppy about the data in, for example, cases like Stephen J. Turnbull's lagged time series.

Unfortunately, the best I can come up with is 'same_length', or possibly 'equal_len' or 'equal_length'. While those are better semantically, they are also slightly too long or awkward. I would personally still consider 'same_length' the least bad option.

-jJ
On Wed, May 20, 2020 at 11:09 AM Jim J. Jewett <jimjjewett@gmail.com> wrote:
As we've come down to naming things... if you want it to read more like English, `zip(vorpal_rabbits, holy_hand_grenades, lengths_must_match=True)` or another chosen variation of that such as `len_must_match=` or `length_must_match=` reads nicely and is pretty self explanatory that an error can be expected if the condition implied by the "must" is found untrue without really feeling a need to look it up in documentation. It is also harder to type or fit on a line. Which is one advantage to a short thing like `strict=`. I don't care so much about the particular spelling here to argue among any of those, I primarily want the feature to exist. I expect we're entering steering council territory for a decision soon... -gps
Python has always preferred full-word over old-school C/Perl/PHP-style abbreviated names. Clarity is paramount. (Or this whole discussion wouldn't even be happening.) I think this is *more* of a zip_shortest than zip_strict, but since you can never have total clarity without a method name that doubles as a docstring, whatever works will work as long as it's documented. Em On Wed, May 20, 2020 at 3:33 PM Joseph Jenne via Python-Dev < python-dev@python.org> wrote:
I'd like to suggest "len_eq" as a short but (rather) self-explanatory option.
On Thu., 21 May 2020, 4:09 am Jim J. Jewett, <jimjjewett@gmail.com> wrote:
Reading this thread and the current PEP, the main question I had was whether it might be better to flip the sense of the flag and call it "truncate". So the status quo would be "truncate=True", while the ValueError could be requested by passing an explicit "truncate=False".

Draft documentation paragraph:

======
zip() can be used to combine iterables of different lengths, including combining finite iterables with infinite iterators. By default, the output iterator is implicitly truncated to produce the same number of items as the shortest input iterable. Setting *truncate* to false disables this implicit truncation and raises ValueError instead. Note that if this ValueError is raised an additional item will have been consumed from any iterators listed before the shortest iterator (or from the second listed iterator if the first iterator is the shortest one). To pad shorter input iterables rather than truncating the output or raising ValueError, see itertools.zip_longest.
======

The conceptual idea here is that the "truncate" flag name would technically be a shorter mnemonic for "truncate_silently", so clearing it gives you an exception rather than enabling padding behaviour.

Flipping the sense of the flag also means that "truncate=True" will appear in IDE tooltips as part of the function signature, providing significantly more information than "strict=False" would.

That improved self-documentation then becomes what I would consider the strongest argument in favour of the flag-based approach: providing more information up-front to users regarding the actual behaviour of the builtin, rather than having them incorrectly assume that mismatched input iterator lengths will raise an exception.

Side note: this idea pairs nicely with the "zip(itr, itr, itr)" idiom for non-overlapping data windows, as it makes it straightforward to request an exception if the last data tuple has values missing (without the flag, the idiom silently discards incomplete trailing data).

Cheers,
Nick.
P.S. I had the opportunity to read the thread from beginning to end after belatedly catching some of the messages out of context, and FWIW, I started out assuming I would strongly favour the itertools function option, and surprised myself by favouring the flag option (albeit inverted) by the time I reached the end.
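The windowing idiom from the side note relies on passing the same iterator object several times. A short sketch with made-up data shows the silent loss of an incomplete trailing window under today's zip:

```python
data = range(7)              # seven items: not a multiple of three
it = iter(data)

# The same iterator appears three times, so each output tuple takes
# three consecutive items: non-overlapping windows of width 3.
windows = list(zip(it, it, it))

print(windows)  # [(0, 1, 2), (3, 4, 5)]
```

The final item, 6, is pulled from the iterator and then dropped without any signal, which is the case a length-checking option would turn into an exception.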
On 06/01/2020 04:36 AM, Nick Coghlan wrote:
Reading this thread and the current PEP, the main question I had was whether it might be better to flip the sense of the flag and call it "truncate".
So the status quo would be "truncate=True", while the ValueError could be requested by passing an explicit "truncate=False".
I like this a lot. +1 -- ~Ethan~
On Mon, Jun 01, 2020 at 09:36:40PM +1000, Nick Coghlan wrote:
It's not really *implicit* if there's an explicit flag controlling the behaviour, even with a default value. We don't use that sort of language elsewhere. For example, help(sorted) doesn't say: "Return a new list containing all items from the iterable implicitly in ascending order. Pass reverse=True to disable this implicit order." help(int) doesn't say that the base is implicitly decimal; help(print) doesn't talk about "implicit spaces between items, implicit newline at the end of the output" etc. It just states the behaviour controlled by the parameter. This is accurate, non-judgemental, and avoids being over-wordy: "By default, the output iterator is truncated at the shortest input iterable."
"Significantly" more? I don't think so. Truncate at what? - some maximum length; - some specific element; - at the shortest input. At some point people have to read the docs, not just the tooltips. If you didn't know what zip does, seeing truncate=True won't mean anything to you. If you do know what zip does, then the parameter names are mnemonics, and strict=False and truncate=True provide an equal hint for the default behaviour: * if it's not strict, it is tolerant, stopping at the shortest; * if it truncates, it truncates at the shortest input. For the default case, strict=False and truncate=True are pretty much equal in information. But for the case of non-default behaviour, strict=True is a clear winner. It can pretty much only mean one thing: raise an exception. Whereas truncate=False is ambiguous: - pad the output; - skip items as they become empty; - raise an exception. All three of these are useful behaviour, and while the middle one is not part of this PEP, it was requested in the discussions on Python-Ideas.
That improved self-documentation then becomes what I would consider the strongest argument in favour of the flag-based approach:
I don't think that "truncate=False" (which can mean three different things) is more self-documenting than `zip(*items, mode='strict')` or `zip_strict()` (either of which can only mean one thing). -- Steven
On Tue., 2 Jun. 2020, 11:23 am Steven D'Aprano, <steve@pearwood.info> wrote:
Given that the only input parameters are the iterables themselves, it's a stretch to even consider the first two as possibilities.
"strict=False" doesn't tell you whether the tolerant behaviour is truncation or padding. "truncate=True" does.
For the default case, strict=False and truncate=True are pretty much equal in information.
Nope. If you don't already know that zip truncates the output by default, "truncate=True" gives you that information, while "strict=False" doesn't.
But for the case of non-default behaviour, strict=True is a clear winner. It can pretty much only mean one thing: raise an exception.
But raise an exception when? In the context of this discussion, we know we mean "strict length checking, raising an exception for inconsistent lengths". But "strict" on its own doesn't convey that - we could be requesting strict runtime type checking, for example, where each iterable is expected to keep producing items of the same type as was produced for the first tuple. Or we could be requesting a check that the values in the tuple aren't "None".
As noted above, "strict" just means "check more constraints" - it's at least as ambiguous as "don't truncate the output". I do agree that the ambiguity of "truncate=False" is the biggest downside of that spelling, but learning that it means "raise an exception on a length mismatch instead of truncating the output iterator" isn't going to be any harder than learning what strict mode means. Cheers, Nick.
On Tue, Jun 2, 2020 at 8:55 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
Why? I can conceivably imagine that zip(iter1, iter2, truncate=5) would consume at most 5 elements from each iterable. It's not much of a stretch. It doesn't happen to be what's proposed, but it's a reasonable interpretation. (Though then the default would probably be truncate=None to not truncate.) ChrisA
On Tue, Jun 2, 2020 at 8:07 AM Chris Angelico <rosuav@gmail.com> wrote:
This was exactly my thought, that Chris wrote very well. I can easily imagine a 'truncate=5' behavior. In fact, if it existed, it is something I would have used multiple times. As is, I use islice() or a break inside a loop, but that hypothetical parameter might be a helpful convenience. However, it is indeed NOT the current proposal or discussion. -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
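For reference, the islice() spelling of "stop after five tuples" looks like this today. The sample inputs are invented; itertools.islice and itertools.count are standard library functions.

```python
from itertools import count, islice

names = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# Take at most five tuples from the zipped stream. zip is lazy, so
# only five items are pulled from each input.
first_five = list(islice(zip(count(), names), 5))

print(first_five)
# [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
```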
On Tue, Jun 2, 2020, 9:41 AM Steve Dower
Oh yeah. I've done that too. For whatever reason, I think I used to use the extra range, and nowadays I'm more likely to use islice(). I have absolutely no argument why one style or the other is better, just my habit has changed. In any case, I'm not advocating for truncate=5 behavior. Merely agreeing that the word truncate is not less ambiguous than the word strict. That's not even saying I prefer strict to truncate; itertools.zip_strict() remains my preference. But I could learn either parameter choice easily enough.
On Tue, Jun 02, 2020 at 08:52:46PM +1000, Nick Coghlan wrote:
And then:
"strict=False" doesn't tell you whether the tolerant behaviour is truncation or padding. "truncate=True" does.
You can't have it both ways Nick -- if the lack of additional parameters is enough for the user to predict that the only reasonable behaviour is to truncate, then the lack of additional parameters is also enough for them to predict that the only reasonable non-strict (tolerant) behaviour is to truncate at the shortest input. [...]
If you are going to propose that users might imagine a hypothetical check that raises if any item is None, well, isn't that *precisely* the sentinel check I gave above that you blithely dismissed as "a stretch"? If it's a stretch for me, it's a stretch for you too.

Ultimately, bikeshedding on the name truncate versus strict versus equal versus shortest versus ... is quibbling. Everyone who reads the tooltips, assuming they even see them, is going to take something different from it. Some will think "truncate what, the tuples?" and some will think "strict about what?". Ultimately the tooltips are no substitute for reading the docs. If you don't know what zip does, you cannot interpret what it means for zip to truncate or be strict.

No one single word is going to communicate everything we need to communicate. Function and parameter names are mnemonics, not documentation.

So on that note, and in regard only to the choice between "strict" versus "truncate" etc, I'm going to bow out: call it what you will. I've got a bigger problem with the use of a boolean flag than the name.

--
Steven
Brandt Bucher writes:
I thought it was quite clear. Those of us who disagree simply disagree. We prefer to provide it as a separate function. Just move on, please; you're not going to convince us, and we're not going to convince you. Leave it to the PEP Delegate or Steering Council.
I wouldn't confuse "can" and "should" here.
You do exactly that in arguing for your preferred design, though. We could implement the strictness test with an argument to the zip builtin function, but I don't think we should. I still can't think of a concrete use case for it from my own experience. Of course I believe concrete use cases exist, but that introspection makes me suspicious of the claim that this should be a builtin feature, with what is to my taste an ugly API. Again, I don't expect to convince you, and you shouldn't expect to convince me, at least not without more concrete and persuasive use cases than I've seen so far. Steve
I'm on the fence about using a separate function vs. a keyword argument (I think there is merit to both), but one thing to note about the separate function suggestion is that it makes it easier to write backwards compatible code that doesn't rely on version checking. With `itertools.zip_strict`, you can do some graceful degradation like so:

    try:
        from itertools import zip_strict
    except ImportError:
        zip_strict = zip

Or provide fallback easily:

    try:
        from itertools import zip_strict
    except ImportError:
        def zip_strict(*args):
            yield from zip(*args)
            for arg in args:
                if next(arg, None):
                    raise ValueError("At least one input terminated early.")

There's an alternate pattern for the kwarg-only approach, which is to just try it and see:

    try:
        zip(strict=True)
        HAS_ZIP_STRICT = True
    except TypeError:
        HAS_ZIP_STRICT = False

But I would say it's considerably less idiomatic.

Just food for thought here. In the long run this doesn't matter, because eventually 3.9 will fall out of everyone's support matrices and these workarounds will become obsolete anyway.

Best,
Paul

On 5/15/20 5:20 AM, Stephen J. Turnbull wrote:
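A fallback that checks for leftovers after zip() has finished is easy to get subtly wrong: zip pulls an item from earlier inputs before noticing a later one is exhausted, next() needs a real iterator rather than a list, and a truthiness test misses falsy items such as 0. One way around all three, sketched here as an assumption rather than as the PEP's reference implementation, is to build on itertools.zip_longest with a private sentinel. The import of zip_strict from itertools is the hypothetical name from this thread and is expected to fail.

```python
from itertools import zip_longest

try:
    # Hypothetical name proposed in this thread; not a real itertools
    # function, so the fallback below is what actually gets used.
    from itertools import zip_strict
except ImportError:
    _SENTINEL = object()  # unique marker that real data cannot equal

    def zip_strict(*iterables):
        """Fallback sketch: yield tuples like zip, but raise ValueError
        as soon as one input runs out before the others."""
        for row in zip_longest(*iterables, fillvalue=_SENTINEL):
            if any(item is _SENTINEL for item in row):
                raise ValueError("zip_strict() arguments have different lengths")
            yield row


print(list(zip_strict([0, 0], 'ab')))  # [(0, 'a'), (0, 'b')]
```

Because the check happens row by row, the error surfaces at the first mismatched position no matter which argument is the short one.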
On Fri, May 15, 2020 at 09:56:03AM -0400, Paul Ganssle wrote:
This is just a special case of a much broader case: a separate function, or method, is a first class object that can be passed around to other functions, used in lists, etc. https://softwareengineering.stackexchange.com/questions/39742/when-is-a-feat... Using a mode switch or flag makes the zip strict a second class citizen. -- Steven
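The first-class-object point can be illustrated with the two zip variants that already exist. In this sketch pair_up is a made-up helper; zip and itertools.zip_longest are passed in directly as ordinary callables, with no wrapper needed to select the behaviour.

```python
from itertools import zip_longest


def pair_up(zipper, left, right):
    """Combine two iterables with whichever zip-like callable is given."""
    return list(zipper(left, right))


short = pair_up(zip, 'abc', 'de')
padded = pair_up(zip_longest, 'abc', 'de')

print(short)   # [('a', 'd'), ('b', 'e')]
print(padded)  # [('a', 'd'), ('b', 'e'), ('c', None)]
```

A behaviour that is only reachable through a keyword argument would need a lambda or functools.partial before it could be passed around in the same way.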
Here’s another advantage of having a separate function that I didn’t see acknowledged in the PEP: If strict behavior is a better default for a zip-like function than non-strict, then choosing a new function would let you realize that better default. In contrast, by adding a new argument to the existing function, the function you use will forever have the less preferred default. In terms of what is a better default, I would say strict is better because errors can’t pass silently: If errors occur, you can always change the flag. But you would be doing that explicitly. —Chris On Fri, May 15, 2020 at 6:57 AM Paul Ganssle <paul@ganssle.io> wrote:
On Fri, 15 May 2020 06:06:00 -0000 "Brandt Bucher" <brandtbucher@gmail.com> wrote:
And in any case, people who are concerned about performance should use the C decimal accelerator, which is the default.

Here is your micro-benchmark with _pydecimal (which is the pure Python fallback):

    $ python3.8 -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
    .....................
    Mean +- std dev: 35.4 us +- 1.1 us

Here is the same micro-benchmark with decimal (which loads the C accelerator by default):

    $ python3.8 -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
    .....................
    Mean +- std dev: 471 ns +- 12 ns

Even if you were losing performance on those 35.4us it wouldn't make sense to complain about it.

Regards

Antoine.
In the last 24 hours, this thread has grown a bit beyond my capacity to continue several different lines of discussion with each individual. I count 22 messages from 14 different people since my last reply, and I assure you that I've carefully read each response and am considering them as I work on the next draft. I'd like to thank everyone who took the time to read the PEP and provide thoughtful, actionable feedback here! Brandt
data:image/s3,"s3://crabby-images/abc12/abc12520d7ab3316ea400a00f51f03e9133f9fe1" alt=""
On 10/05/2020 17:04, Brandt Bucher wrote:
I still don't buy your dismissal of the new function alternative. In particular:
But zip_equals() is also another beast entirely; it takes on the responsibility of raising an exception, a problem neither of the other variants even have. -- Rhodri James *-* Kynesim Ltd
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 05/10/2020 09:04 AM, Brandt Bucher wrote:
Many Python users find that most of their zip usage
I don't think you have enough data to make that claim, unless by "many" you mean five or more.
but silently start producing shortened, mismatched results if items is refactored by the caller to be a consumable iterator
This seems like a weak argument; static type checking could catch it.
the author has counted dozens of other call sites in Python's standard library
References, please.
A good rule of thumb is that "mode-switches" which change return types or significantly alter functionality are indeed an anti-pattern,
Source?
while ones which enable or disable complementary checks or behavior are not.
None of the listed examples change behavior between "working" and "raising exceptions", and none of the listed examples are for "complementary checks".
At most one additional item may be consumed from one of the iterators when compared to normal zip usage.
How, exactly?
However, zip_longest is really another beast entirely
No, it isn't.
so it makes sense that it would live in itertools while zip grows in-place.
No, it doesn't
Importing necessary functions is not an anti-pattern.
Another proposed idiom, per-module shadowing of the built-in zip with some subtly different variant from itertools, is an anti-pattern that shouldn't be encouraged.
Source?
How is this any different from calling zip the wrong way?
This proposal is further complicated by the fact that CPython's actual zip type is an undocumented implementation detail.
That actually gives us complete freedom to redesign as long as we keep the API backward-compatible. Really, this entire section (alternate constructor type) seems like rubbish. While I agree that mismatched iterators is a problem worth solving, I don't think this PEP comes close to making the case for this particular solution. -1 -- ~Ethan~
data:image/s3,"s3://crabby-images/9dc20/9dc20afcdbd45240ea2b1726268727683af3f19a" alt=""
Thanks for all of your feedback. Antoine Pitrou wrote:
I'm not sure what the iters bring here. The snippet would be more readable without, IMHO.
Good point. I was trying to demonstrate that it works with iterators, but I agree it's clearer to just use the lists here. Ethan Furman wrote:
Many Python users find that most of their zip usage I don't think you have enough data to make that claim, unless by "many" you mean five or more.
It's based on a combination of my own experience, the experiences of several others, and a survey of the CPython repo. I can dial back the wording, though, since this isn't necessarily representative of the larger userbase...
but silently start producing shortened, mismatched results if items is refactored by the caller to be a consumable iterator This seems like a weak argument; static type checking could catch it.
Well, that's why I go on to make stronger, non-toy ones immediately after. :) This is mainly just to introduce the problem in an easy-to-understand way.
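To make the toy scenario concrete, here is a minimal sketch of how such a refactoring can go wrong. The function name `pair_with_squares` and the sample data are made up for illustration; they are not taken from the PEP.

```python
def pair_with_squares(items):
    # Building the list walks `items` once. For a list that is harmless;
    # for a one-shot iterator it uses up every element.
    squares = [x * x for x in items]
    # zip() stops at the shortest input, so an exhausted `items`
    # yields an empty result instead of an error.
    return list(zip(items, squares))


from_list = pair_with_squares([1, 2, 3])
from_iterator = pair_with_squares(iter([1, 2, 3]))

print(from_list)      # [(1, 1), (2, 4), (3, 9)]
print(from_iterator)  # []
```

Nothing fails in the second call; the caller simply gets fewer pairs than expected, which is the silent behaviour under discussion.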
the author has counted dozens of other call sites in Python's standard library References, please.
Here are two dozen:
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
I'll go ahead and link these in the PEP.
A good rule of thumb is that "mode-switches" which change return types or significantly alter functionality are indeed an anti-pattern, Source?
This was based on a chat with someone who has chosen not to become involved in the larger discussion, and it was lifted almost verbatim from my notes into the draft. Looking at it again, though, I don't think this sentence belongs in the PEP... it really shouldn't be prescribing design philosophies like this.
while ones which enable or disable complementary checks or behavior are not. None of the listed examples change behavior between "working" and "raising exceptions", and none of the listed examples are for "complementary checks".
Thanks for pointing this out. If I keep this bit, I'll include some other examples from the stdlib that specifically behave this way.
At most one additional item may be consumed from one of the iterators when compared to normal zip usage. How, exactly?
I'm actually considering just leaving this line out, too. We don't currently make any promises about how "extra" items are drawn (just that they are drawn left-to-right, which is still true here), so I don't think this needs to be in the spec.
However, zip_longest is really another beast entirely No, it isn't.
It has a completely independent implementation, a different interface, lives in a separate namespace, and doesn't even reference zip in its documentation. So it seems to me that it is indeed another beast entirely.
so it makes sense that it would live in itertools while zip grows in-place. No, it doesn't
See above for why I think it does.
The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using zip at a call site with this property. Importing necessary functions is not an anti-pattern.
Um, agreed?
Another proposed idiom, per-module shadowing of the built-in zip with some subtly different variant from itertools, is an anti-pattern that shouldn't be encouraged. Source?
Point taken. I probably went a bit far labeling this a straight-up "anti-pattern", but it is certainly annoying to find that someone has added `from pprint import pprint as print` at the top of a module, for example (which has actually happened to me before). Very hard to figure out what's happening.
It's not obvious which one will succeed, or how the other will fail. If zip.strict is implemented as a method, zm will succeed, but zd will fail in one of several confusing ways: How is this any different from calling zip the wrong way?
Because there are now twice as many ways to get it wrong, no matter how it's spelled. And in either case, we now have a brand new class of silently wrong results.
This proposal is further complicated by the fact that CPython's actual zip type is an undocumented implementation detail. That actually gives us complete freedom to redesign as long as we keep the API backward-compatible.
I should have phrased this line better. My point wasn't that we don't currently have the freedom to do whatever we want with the implementation, it was that making a decision like method/classmethod/staticmethod pretty much locks us into a particular implementation going forward. I'll make that clearer here. Rhodri James wrote:
But zip_equals() is also another beast entirely; it takes on the responsibility of raising an exception, a problem neither of the other variants even have.
Would you consider `os.makedirs(...)` and `os.makedirs(..., exist_ok=True)` to be entirely different beasts? I certainly don't. Brandt
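For readers less familiar with that flag, a small sketch of the comparison being drawn: the same `os.makedirs` call either raises or stays quiet depending on `exist_ok`. The temporary directory and path names here are only illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "a", "b")
    os.makedirs(target)  # first call creates the directories

    try:
        os.makedirs(target)  # default exist_ok=False: an existing leaf is an error
        raised = False
    except FileExistsError:
        raised = True

    os.makedirs(target, exist_ok=True)  # same call with the check switched off
    still_there = os.path.isdir(target)

print(raised, still_there)  # True True
```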
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On Tue, 12 May 2020 at 07:53, Brandt Bucher <brandtbucher@gmail.com> wrote:
... so it's another beast because (among other reasons) it lives in a separate namespace, and it should live in a separate namespace because it's another beast? That's circular logic. If we were to put zip_strict into itertools, you could use *precisely* this logic to argue that it was the right thing to do.
So importing zip_strict from itertools is an entirely reasonable way for users to enable the check, then.
Also irrelevant. It's very easy to suggest bad ways of using a feature. That doesn't make the feature bad. You seem to be arguing that zip_strict is bad because people can misuse it. We could probably remove 99% of the Python language by that argument... Paul
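To picture the separate-function alternative, here is one possible sketch built on `itertools.zip_longest`. The name `zip_strict` and the error message are hypothetical; no such function exists in itertools, and this is not the implementation proposed in the PEP.

```python
from itertools import zip_longest


def zip_strict(*iterables):
    """Hypothetical helper: like zip(), but raise if the lengths differ."""
    missing = object()  # unique fill value that no caller can supply
    for combo in zip_longest(*iterables, fillvalue=missing):
        if any(item is missing for item in combo):
            raise ValueError("zip_strict() arguments have different lengths")
        yield combo


pairs = list(zip_strict("ab", [1, 2]))

try:
    list(zip_strict("abc", [1, 2]))
    error = None
except ValueError as exc:
    error = str(exc)

print(pairs)  # [('a', 1), ('b', 2)]
print(error)  # zip_strict() arguments have different lengths
```

Because it is a generator, the mismatch is only reported once iteration reaches the end of the shortest input.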
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, May 12, 2020 at 5:20 PM Paul Moore <p.f.moore@gmail.com> wrote:
And considering that "from __future__ import print_function" is an officially-sanctioned way to cause a semantic change to print, I don't think it's really that strong an argument. Python is *deliberately* designed so that you can shadow things. I am most in favour of the separate-functions option *because* it makes shadowing easy. Not an anti-pattern at all. ChrisA
data:image/s3,"s3://crabby-images/ab219/ab219a9dcbff4c1338dfcbae47d5f10dda22e85d" alt=""
I fear that my comment on some text in the PEP was lost amidst the voting, so I'm repeating it here. This will probably screw up some threading, but this is the oldest message I have to reply to. The PEP says "At most one additional item may be consumed from one of the iterators when compared to normal zip usage." I think this should be prefaced with "If ValueError is raised ...". Also, why does it say "at most one additional item". How could it ever be less than one? And I'm not sure I'd say "normal zip usage", maybe "the existing builtin zip function". Eric
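As background for the "additional item" wording, the existing builtin already discards an item in some argument orders. A small sketch using only today's `zip` (the variable names are illustrative):

```python
short = iter([1, 2])
long_ = iter([6, 7, 8, 9])

# Short iterator first: zip() notices exhaustion before touching `long_`
# a third time, so 8 is still available afterwards.
list(zip(short, long_))
left_when_short_first = next(long_)

short = iter([1, 2])
long_ = iter([6, 7, 8, 9])

# Long iterator first: zip() has already pulled 8 from `long_` when it
# discovers that `short` is empty, and that item is discarded.
list(zip(long_, short))
left_when_long_first = next(long_)

print(left_when_short_first)  # 8
print(left_when_long_first)   # 9
```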
data:image/s3,"s3://crabby-images/46816/4681669242991dedae422eef48216ca51bd611c5" alt=""
On Fri, 15 May 2020 at 21:50, Eric V. Smith <eric@trueblade.com> wrote:
It seems to me, looking at the Python implementation in the PEP (not the current or C implementation) that the crux is here:

    except StopIteration:
        if not strict:
            return
    if items:
        i = len(items) + 1
        raise ValueError(f"zip() argument {i} is too short")

So if it is not strict, it will return/stop consuming iterators. If it is strict but it runs out *not* on the first iterator it will also not consume from another iterator?
And I'm not sure I'd say "normal zip usage", maybe "the existing builtin zip function".
Depends on where we end up I guess, if we go with what Brandt's PEP says (makes sense to keep internally consistent) I'd say "zip without the strict=True flag" or similar.
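The question about which iterator gets probed can be explored with a runnable sketch. This is only one reading of the fragment quoted above, not the PEP's reference implementation: the name `zip_sketch`, the helper `outcome`, and the "too long" message are assumptions made for illustration.

```python
def zip_sketch(*iterables, strict=False):
    """Approximation of zip() with an optional length check."""
    if not iterables:
        return
    iterators = [iter(iterable) for iterable in iterables]
    while True:
        items = []
        try:
            for iterator in iterators:
                items.append(next(iterator))
        except StopIteration:
            if not strict:
                return
            if items:
                # A later argument ran dry after earlier ones produced an item.
                i = len(items) + 1
                raise ValueError(f"zip() argument {i} is too short")
            # The first argument ran dry: probe the others for leftovers.
            unset = object()
            for i, iterator in enumerate(iterators[1:], 2):
                if next(iterator, unset) is not unset:
                    raise ValueError(f"zip() argument {i} is too long")
            return
        yield tuple(items)


def outcome(*iterables):
    """Return the zipped list, or the error message if the check fails."""
    try:
        return list(zip_sketch(*iterables, strict=True))
    except ValueError as exc:
        return str(exc)


longer = iter([6, 7, 8])
print(outcome([1, 2], longer))      # zip() argument 2 is too long
print(next(longer, "exhausted"))    # exhausted
print(outcome([1, 2, 3], [6, 7]))   # zip() argument 2 is too short
```

In this sketch the extra item is only drawn when the first iterator is the one that runs out; when a later iterator runs out, the error is raised immediately and nothing further is consumed.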
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 05/11/2020 11:48 PM, Brandt Bucher wrote:
On 05/10/2020 14:39 PM, Ethan Furman wrote:
On 05/10/2020 09:04 AM, Brandt Bucher wrote:
- both take an unknown number of iterables
- both return tuples
- both names start with `zip`
- both stop at exhaustion
  - one as soon as possible
  - the other as late as possible
- one has one extra parameter

Those seem like very similar beasts to me.
and doesn't even reference zip in its documentation.
So update the docs. -- ~Ethan~
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 05/11/2020 11:48 PM, Brandt Bucher wrote:
On 05/10/2020 14:39, Ethan Furman wrote:
On 05/10/2020 09:04 PM, Brandt Bucher wrote:
These are all after a function that ensures the iterables are the same length -- hardly seems a good idea to slow them down with an extra check for each digit.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
This one already has a check.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Already has a check.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Unsure if mismatch can happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Wow -- I don't even know how to parse that!
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Maybe.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Definitely.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a... - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Reasonable.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Mismatch cannot happen. So half of your examples are actually counter-examples. Did you vet them, or just pick matches against `zip(` ? Also, if a flag is used, won't that slow down every call to zip even when the flag is False? I know in many cases it probably won't matter, but I can see where it could in _pydecimal. -- ~Ethan~
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Wed, May 13, 2020 at 4:38 AM Ethan Furman <ethan@stoneleaf.us> wrote:
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a...
Wow -- I don't even know how to parse that!
Wow, that's quite an example. Of something, I'm not sure what, but definitely an example. Based on two booleans, entries is either None or a list. If it's None, this loops over just the directory names; if it's a list, then it's been populated in perfect parallel to dirs (see the preceding loop), thus guaranteeing that the two lists are perfectly parallel. But in that case, "name" actually gets a tuple of (name,entry), and then inside the loop, it does a three-way branch that is guaranteed (and asserted) to split out the name and entry ONLY when there actually will be one. Definitely an odd piece of code. But it can never zip over things of different lengths. ChrisA
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Wed, May 13, 2020 at 03:04:02PM +0200, Antoine Pitrou wrote:
I'm not an expert on Python's argument passing code in fine detail, but I'm reasonably sure that it takes longer to pass arguments by keyword than by position, and it takes time to fill in missing arguments with the default (they have to be read from the function defaults, which takes time) and these things don't happen for free. But I do think that's probably a spurious point based on a micro- or even nano-optimization. I don't think that the extra nanosecond or whatever it takes for zip to receive an extra argument, fill in the default if missing, and branch is significant. I don't think that `zip(*args, strict=flag)` is the wrong choice because it's a nanosecond slower, I think it's the wrong choice because it makes for a poorer API when there are better, more future-proof choices available. -- Steven
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 05/13/2020 06:04 AM, Antoine Pitrou wrote:
Not the call itself, but the running of zip. Absent some clever programming it seems to me that there are two choices if we have a flag:

- have two independent branches (basically `zip` and a `zip_strict` functions inside `zip` itself) and have the flag select the branch; or
- have one branch which has extra logic that checks for at least one StopIteration and at least one item in each iteration to see if it should raise.

If the first method is chosen then we may as well have two different functions; if the second method is chosen that would seem to be a performance hit, even when the flag is False.

-- ~Ethan~
data:image/s3,"s3://crabby-images/9dc20/9dc20afcdbd45240ea2b1726268727683af3f19a" alt=""
Ethan Furman wrote:
So half of your examples are actually counter-examples.
I claimed to have found "dozens of other call sites in Python's standard library and tooling where it would be appropriate to enable this new feature". You asked for references, and I provided two dozen cases of zipping what must be equal length iterables. I said they were "appropriate", not "needed" or even "recommended". These are call sites where unequal-length iterables, if encountered, would be an error that I would hope wouldn't pass silently. Besides, I don't think it's beyond the realm of imagination for a future refactoring of several of the "Mismatch cannot happen." cases to introduce a bug of this kind.
Did you vet them, or just pick matches against `zip(`?
Of course. I spent hours vetting them, to the point of researching the GNU tar extended sparse header and Apple property list formats (and trying to figure out what the hell was happening in `os._fwalk`) just to make sure my understanding was correct. Ethan Furman wrote:
Not the call itself, but the running of zip. Absent some clever programming it seems to me that there are two choices if we have a flag:
I wouldn't call my implementation "clever", but it differs from both of these options. We only need to check if we're strict when an error occurs in one of our iterators, which is a situation the C code for `zip` already needs to explicitly handle with a branch. So this condition is only hit on the "last" `__next__` call, not on every single iteration.

As a reminder, the actual C implementation is linked in the PEP (there's no PR yet but branch reviews are welcome), though I'd prefer if the PEP discussion didn't get bogged down in those specifics. The pure-Python implementation in the PEP is *very* close to it, but it uses different abstractions for some of the details regarding error handling and argument parsing.[0]

However, for those who are interested, there is no measurable performance regression (and no additional parsing overhead for no-keyword-argument calls). Parsing the keyword argument (if present) adds <0.2us of overhead at creation time on my machine. I went ahead and ran some rough PGO/LTO benchmarks:

Creation time:

```
$ ./python-master -m pyperf timeit 'zip()'
Mean +- std dev: 79.4 ns +- 4.3 ns
$ ./python-zip-strict -m pyperf timeit 'zip()'
Mean +- std dev: 79.0 ns +- 1.9 ns
$ ./python-zip-strict -m pyperf timeit 'zip(strict=True)'
Mean +- std dev: 240 ns +- 8 ns
```

Creation time + iteration time:

```
$ ./python-master -m pyperf timeit -s 'r = range(10)' '[*zip(r, r)]'
Mean +- std dev: 577 ns +- 35 ns
$ ./python-zip-strict -m pyperf timeit -s 'r = range(10)' '[*zip(r, r)]'
Mean +- std dev: 565 ns +- 16 ns
$ ./python-zip-strict -m pyperf timeit -s 'r = range(10)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 756 ns +- 27 ns
$ ./python-master -m pyperf timeit -s 'r = range(100)' '[*zip(r, r)]'
Mean +- std dev: 3.54 us +- 0.14 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(100)' '[*zip(r, r)]'
Mean +- std dev: 3.49 us +- 0.07 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(100)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 3.73 us +- 0.13 us
$ ./python-master -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r)]'
Mean +- std dev: 44.1 us +- 2.0 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r)]'
Mean +- std dev: 45.2 us +- 2.0 us
$ ./python-zip-strict -m pyperf timeit -s 'r = range(1000)' '[*zip(r, r, strict=True)]'
Mean +- std dev: 45.2 us +- 1.4 us
```

Additionally, the size of a `zip` instance has not changed. Pickles for non-strict `zip` instances are unchanged as well.

Brandt

[0] And zip's current tuple caching, which is *very* clever.
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 05/14/2020 11:13 AM, Brandt Bucher wrote:
Very good point.
Which seems beside the point. As you say, if the lengths are mismatched then a bug has appeared and if the check is nearly free there's no reason not to do it.
Glad I'm not the only one that didn't immediately get that os._fwalk code.
Ah, so this is why the strict version may consume an extra element -- it has to check if any remaining iterators have elements, while the non-strict version can just quit as soon as any of the iterators are exhausted.

    >>> one = iter([1, 2])
    >>> six = iter([6, 7, 8])
    >>> zip(one, six)
    (stuff)
    >>> next(six)
    8

vs

    >>> zip_strict(one, six)
    (stuff)
    >>> next(six)
    (crickets)
I went ahead and ran some rough PGO/LTO benchmarks...
Can you do those with _pydecimal? If performance were an issue anywhere I would expect to see it with number crunching. --- Paul Moore and Chris Angelico have made good arguments in favor of an itertools addition which haven't been answered yet. Regardless, I think you've made the point that /a/ solution is very desirable. So the real debate is whether it should be a flag, a mode, or a separate function. I am still -1 on the flag. -- ~Ethan~
data:image/s3,"s3://crabby-images/9dc20/9dc20afcdbd45240ea2b1726268727683af3f19a" alt=""
Ethan Furman wrote:
Can you do those with _pydecimal? If performance were an issue anywhere I would expect to see it with number crunching.
No difference, probably because those methods look like they spend most of their time doing string manipulation:

```
$ export PYPERFSETUP='from _pydecimal import Decimal; from random import getrandbits; l = Decimal(bin(getrandbits(28))[2:]); r = Decimal(bin(getrandbits(28))[2:])'
$ export PYPERFRUN='l.logical_and(r); l.logical_or(r); l.logical_xor(r)'
$ ./python-master -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
Mean +- std dev: 53.4 us +- 2.8 us
$ ./python-zip-strict -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"
Mean +- std dev: 53.8 us +- 2.5 us
$ ./python-zip-strict -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN"  # This time, with strict=True in each method.
Mean +- std dev: 53.6 us +- 3.0 us
```

I would encourage those who are still curious to pull the branch and experiment for themselves. Let's try to keep this a design discussion, since we've established that performance isn't a problem (and there is plenty of time for code review later).
Paul Moore and Chris Angelico have made good arguments in favor of an itertools addition which haven't been answered yet.
I don't consider their arguments particularly strong, but yeah, I was getting to those. I wanted to address your points first since you weren't part of the Ideas discussion! Paul Moore wrote:
... so it's another beast because (among other reasons) it lives in a separate namespace, and it should live in a separate namespace because it's another beast? That's circular logic.
Sorry, that's on me for trying to respond to two questions with one answer right before bed. Strike the namespace argument, then. The rest stands.
So importing zip_strict from itertools is an entirely reasonable way for users to enable the check, then.
Still agreed. But I think they would be *better* served by the proposed keyword argument. This whole sub-thread of discussion has left me very confused. Was anything unclear in the PEP's phrasing here? If so, I'd like to improve it. The original quote is: "The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using `zip` at a call site with this property."
It's very easy to suggest bad ways of using a feature. That doesn't make the feature bad. You seem to be arguing that zip_strict is bad because people can misuse it.
Well, I addressed this "irrelevant" point because right out of the gate people started suggesting that they want a separate function *because* it makes shadowing easy. Which brings me to my next quote: Chris Angelico wrote:
I am most in favour of the separate-functions option *because* it makes shadowing easy. Not an anti-pattern at all.
I *really* hope this isn't how people use this (and I don't *think* it would be predominantly used this way), but at least it's clear to me now why you want it to be a separate function. It would still be quite simple to follow this pattern, though, with `functools.partial` or a custom wrapper.
Python is *deliberately* designed so that you can shadow things.
I wouldn't confuse "can" and "should" here. Python is deliberately designed to make *many* design patterns possible, good and bad.
And considering that "from __future__ import print_function" is an officially-sanctioned way to cause a semantic change to print, I don't think it's really that strong an argument.
Well that's a parser directive that is just there for 2/3 compatibility (I'm pretty sure - I've never used Python 2). I see it as very, very different from my `from pprint import pprint as print` headache that was quoted two levels up. Brandt
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On Fri, 15 May 2020 at 07:10, Brandt Bucher <brandtbucher@gmail.com> wrote:
This whole sub-thread of discussion has left me very confused. Was anything unclear in the PEP's phrasing here? If so, I'd like to improve it. The original quote is: "The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using `zip` at a call site with this property."
It's not unclear, I'm just not sure I agree with the goal, and I'm not sure the proposal achieves that goal: I note that the PEP makes no mention in the rationale of the goal to make it "tempting" to use the flag. It's *only* mentioned as a reason to reject the itertools option. If you want to use that argument, you should explicitly state (and justify) the goal of making use of the flag "tempting" in the rationale for the feature. If it's not part of the rationale, your argument against an itertools function is weak (arguably flawed).

My problems with the argument for rejection are:

1. Why do we want to "tempt" people to error when handling mismatched lengths? Certainly letting people catch errors easily is worthwhile, but rejecting arguments of different lengths may well *not* be an error ("be lenient in what you accept" is a well-known principle, even if not something that everyone agrees on in all cases).
2. I find "mode switch" arguments ugly, and I could even argue difficult to maintain (I can't easily use grep to check whether I missed any cases that I wanted to make strict). So I'm not tempted to use one - rather the opposite, it puts me off. (Note that I'm *not* arguing "mode switches are wrong", but rather that a mode switch makes the functionality "more tempting").
3. I'm not even that sure it's easy to discover - a key factor in making it something people will use when needed. People who know of zip and zip_longest would naturally look for a zip_strict, not for a mode argument. (Yes, this is not a strong point - *nobody* can really tell what people in general will find "easy" - but it does at least reflect *my* thought processes).

I do agree that a builtin is more "tempting" to use than a stdlib function (it's not logical, but I can see that people think that way - I do myself). What I don't agree with is that "tempting" is a goal that we want, or that being a builtin is sufficiently important to justify the downsides of a mode flag.
We may just have to agree to differ, and leave the final decision to the SC. But let's at least be clear about the goals up front in the rationale section. Paul PS Despite my reservations, this is a well-reasoned and well presented PEP - you've put a lot of work into it, and it shows. Thanks!
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Fri, May 15, 2020 at 08:17:12AM +0100, Paul Moore wrote:
I concur with Paul here. There may be cases where mismatched lengths are an error, but there are also cases where they are not an error. It is patronising to be talking about tempting people into using the strict version of zip. People who need it will find it, whether it is builtin or not.
2. I find "mode switch" arguments ugly
I think we need to distinguish between *modes* and *flags*. This proposed functionality is a mode, not a flag.

Mode switches are extensible:

    zip(*args, mode='strict')

can easily be extended in the future to support new modes, such as zip_longest. There are at least two other modes that have been mentioned briefly in the Python-Ideas thread:

1. a variety of zip that returns information in the StopIteration exception identifying the argument that was empty and those that weren't;
2. a variety of zip that simply skips the missing arguments:

       zip_skip('abc', 'ef', 'g') => (a, e, g), (b, f), (c,)

I don't even understand the point of the first one, or how it would operate in practice, or why anyone would need it, but at least one person thinks it would be useful. Take that as you will. But I have written my own version of the second one, and used it.

A mode parameter naturally implies that only one mode can apply at a time, and it naturally enforces that restriction without any additional programmer effort. If there are (let's say...) four different modes:

    mode = shortest|longest|strict|skip

you can only supply one at a time. Using *named modes* (whether as strings or enums) to switch modes is quite reasonable. It's not my preferred API for this, but I don't hate it.

But a *flag* parameter is difficult to extend without making both the API and the implementation complicated, since any extension will use multiple parameters:

    zip(*args, strict=True, longest=False, skip=True)

and the caller can supply any combination, each of which has to be tested for, exceptions raised for the incompatible combinations. With three flags, there are eight such combinations, but only four valid ones. With four flags, there are 16 combinations and only five meaningful combinations.

Flags work tolerably well when the parameters are independent and orthogonal, so you can ask for any combination of flags. But flags to switch modes are *not* independent and orthogonal. We can't combine them in arbitrary combinations. Using *boolean flags* to switch modes in this way makes for a poor API.

For the record, my preferred APIs for this are in order:

1. +1 itertools.zip_strict function
2. +1 zip.strict(*args)
3. +1 zip(*args, mode='strict')  # mode='shortest' by default
4. +0 zip(*args, strict=True)

and even that is being generous for option 4.

Note that options 1 and 2 have an important advantage over options 3 and 4: the strict version of zip is a first-class callable object that can be directly passed around and used indirectly in a functional way. E.g. for testing using unittest:

    assertRaises(Exception, zip.strict, 'a', '')

Yes, assertRaises is also usable as a context manager, but this is just an illustration of the functional style.

-- Steven
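The combination counts quoted above can be checked with a short enumeration. This sketch assumes that a combination is meaningful when at most one mode flag is switched on (none switched on being the default behaviour); the fourth flag name, `report`, is invented purely to reach four flags.

```python
from itertools import product


def count_flag_combinations(flag_names):
    """Return (all combinations, combinations with at most one flag set)."""
    combos = list(product([False, True], repeat=len(flag_names)))
    valid = [combo for combo in combos if sum(combo) <= 1]
    return len(combos), len(valid)


three = count_flag_combinations(["strict", "longest", "skip"])
four = count_flag_combinations(["strict", "longest", "skip", "report"])

print(three)  # (8, 4)
print(four)   # (16, 5)
```

Under that assumption, each extra boolean doubles the inputs a flag-based signature has to validate while adding only one meaningful case, whereas a single `mode` argument admits exactly the valid choices.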
data:image/s3,"s3://crabby-images/c437d/c437dcdb651291e4422bd662821948cd672a26a3" alt=""
I'm a little frustrated by the tone in which the PEP dismisses the option that is most supported in the discussion. It's fine for Brandt to have a different preference himself, but I think it ought to be presented more neutrally. On Fri, May 15, 2020, 10:20 AM Steven D'Aprano
Mostly I agree with Steven on relative preference:

itertools.zip_strict() +1
zip.strict() +0.5
zip(mode='strict') +0
zip(strict=True) -0.5

Fwiw, I don't think it changes my order, but 'strict' is a better word than 'equal' in all those places. I'd subtract 0.1 from each of those votes if they used "equal".
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On Fri, 15 May 2020 at 16:01, David Mertz <mertz@gnosis.cx> wrote:
I'm a little frustrated by the tone in which the PEP dismisses the option that is most supported in the discussion. It's fine for Brandt to have a different preference himself, but I think it ought to be presented more neutrally.
Agreed. The PEP gives the impression that consensus was reached, but I don't think that's the case. My feeling is that opinions are rather evenly split between the approach in the PEP and an itertools function. I also feel that proposing an itertools function would have been a *lot* less controversial, so the tone in the PEP feels a little like it's defending a weak position by aggressively opposing alternatives. After all, one of the benefits of an itertools function is that it probably wouldn't have needed a PEP in the first place! Actually, looking at the reasons for rejection of the itertools option in the PEP:
It seems that a great deal of the motivation driving this alternative is that zip_longest already exists in itertools.
Nope, the biggest motivation is that an itertools addition would have been *significantly less controversial*. Paul
data:image/s3,"s3://crabby-images/abc12/abc12520d7ab3316ea400a00f51f03e9133f9fe1" alt=""
On 15/05/2020 16:56, Chris Angelico wrote:
Well, if it's what all the cool kids are doing...

* itertools.zip_strict() +1
* zip.strict() +0
* zip(mode='strict') -0
* zip(strict=True) -1

The middle two would be weird if zip_longest doesn't get folded in eventually, which might push them (more) negative. -- Rhodri James *-* Kynesim Ltd
data:image/s3,"s3://crabby-images/ab219/ab219a9dcbff4c1338dfcbae47d5f10dda22e85d" alt=""
On 5/15/2020 11:56 AM, Chris Angelico wrote:
itertools.zip_strict() +1
zip.strict() +0
zip(strict=True) -0
zip(mode='strict') -1

I don't particularly care for "strict", though. It doesn't seem specific enough, and doesn't say "the iterators must return the same number of items" to me. I sort of liked "equal" better, but not so much as to make a big stink about it.

Also: The PEP says "At most one additional item may be consumed from one of the iterators when compared to normal zip usage." I think this should be prefaced with "If ValueError is raised ...". Also, why does it say "at most one additional item"? How could it ever be less than one?

Eric
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
[Cut the previous votes because someone's quoting didn't survive my email client and I can't be bothered fixing it] If everyone else is doing it...

itertools.zip_strict() +1
zip(strict=True) -0
zip.strict() -0.5
zip(mode='strict') -1

Paul
data:image/s3,"s3://crabby-images/d1d84/d1d8423b45941c63ba15e105c19af0a5e4c41fda" alt=""
These negative votes surprise me. Given that it's clear that a generic strict-mode zip is non-trivial to write, and that there is significant demand for it, are people saying "+0 Python would not be a better programming environment if itertools.zip_strict() were adopted," and "-1 Python would be a worse programming environment if zip.strict() were adopted"? I can see why folks would say the latter about zip.strict(), but even though I really dislike the mode switches, I'm still positive about adding them if one of them ranks highest among those who care. I'm not going to give them negative votes, they don't make Python worse.

I don't mind hyperbole ("I'm +1000 on this feature!" or "-10 on the worst proposal I've seen since <potentially controversial example removed>!") But I would like it if "0" meant "indifferent", "+1" meant "no-brainer, add it", and "-1" meant "no-brainer, just don't".

FWIW,

+1 itertools.zip_strict(*iterables)
+0.5 zip(*iterables, mode) # mode is 3-way, default "shortest"
+0.4 zip(*iterables, strict) # strict is boolean, default False
+0 zip.strict(*iterables)
data:image/s3,"s3://crabby-images/46816/4681669242991dedae422eef48216ca51bd611c5" alt=""
I used the same convention as you, and my vote was thus as the record will show (note the negatives):

zip(strict=True) +1
itertools.zip_strict() +0
zip(mode='strict') -1
zip.strict() -1

And I stand by that: I think Python would be better off without the 3rd and 4th option, even if no alternative was implemented. To take one of the examples (not exactly, you'll understand why...) that was given in the other thread: if we suggested asdfasfa.kjllasdfa.asdf() as the name(space) for the zip(strict=True) functionality, I would:

- vote -1 (or hyperbole -10^googol)
- still use the feature when I wanted to use 'zip(strict=True)'
- think Python would be a worse programming environment for allowing this to be introduced

Obviously the 3rd and 4th option are not as insane/illogical as the above example (apologies, I would attribute, but the nature of this example makes it hard for me to search for it!) but I do not like functionality exposed in this way, and I think the lack of this functionality in the stdlib does not weigh up against the precedent/bad example this would set.

You, and anyone else, can and some definitely will disagree with me, but it's my vote, and I don't think it matters that much anyway. There has been a lot of discussion, and these straw polls, from my limited understanding, are often taken to see whether there is a clear consensus to short-circuit/end the discussion. I would say in this case it has shown inconclusiveness, which is fine: the final decision on this PEP will (fortunately) not be decided by our votes.

I understand your 'pain' Stephen: I still think it is weird that people on these lists don't want "for x in some_iterable if x is not None:" as valid syntax, but I have, almost, made my peace with it. On Sun, 17 May 2020 at 18:42, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
data:image/s3,"s3://crabby-images/f81c3/f81c349b494ddf4b2afda851969a1bfe75852ddf" alt=""
On Fri, May 15, 2020 at 11:55 AM Henk-Jaap Wagenaar < wagenaarhenkjaap@gmail.com> wrote:
Agreed. The best way to reduce accidental incorrect use of the builtin is to make the builtin capable of doing what people want directly without having to go discover something in a module somewhere. OTOH so long as zip's docstring that shows up to people from interactive help, pydoc, and hover/alt/meta text in fancier IDEs mentions the caveat and the way to get the strict behavior, either of these two should be sufficient. I'm pushing forward with a pure documentation update in https://github.com/python/cpython/pull/20118 suitable for 3.8 - it doesn't mention the way to get the other behavior as that isn't settled or short yet, just makes it more obvious what the actual behavior is. -gps
data:image/s3,"s3://crabby-images/d1d84/d1d8423b45941c63ba15e105c19af0a5e4c41fda" alt=""
Gregory P. Smith writes:
Executive summary: My argument (and one of Steven d'Aprano's) against a "strict" mode to zip is precisely that it's *extremely* likely that if I use a facility that zips together things I provide, the last thing I want is for it to choose "strict" for me, because that *would likely be incorrect*. I do not want people using strict *for any facility I might use* "because it's there."

I'm not saying strict mode is useless. I am saying the "encourage use by making it easier to use" argument cuts both ways: it can create problems as well as solve them.

A couple of concrete examples:

1. In activities like constructing data arrays, which we expect to be rectangular, I'm still likely to use sequences of unequal length, including infinite sequences. As an economist, I often use lagged data, which can easily be constructed for an equation like

    y[t] = a + b x[t] + c x[t-1]

with

    zip(y[1:], const(), x[1:], x[0:])

where

    def const():
        while True:
            yield 1

(Here I'm using zip() as a proxy for somebody's generic facility such as a function to compute OLS estimates given a sequence of data series. Obviously for zip itself, I would just not use strict mode.)

Note that y[0], not y[-1], needs to be left out. This is the critical point that I need to concentrate on when constructing this data frame. If I have to "even out" the columns, though, I need *also* to think about the lengths, a distraction which for me makes this more bug-prone. Ie, I might accidentally write

    zip(y[:-1], const(len(x) - 1), x[:-1], x[1:])

where

    def const(n):
        return (1 for _ in range(n))

which is not only asymmetric but wrong, as the regressor x[1:] is "future x"! More opportunities for bugs arise in the replacement for const(). Even if you don't agree about the bugs (and there is a weak argument that some fraction of the potential bugs will be caught by strict-mode zip, such as a wrong argument to const()), it's pretty clear which style is more readable.

2. My programming style is such that if I want couples that are related to each other, I will almost certainly generate those couples, not generate them separately in the right orders and then zip as needed. For example, in one of the test suites two lists are generated something like this:

    c_int_types = [...] # list display
    c_int_type_ranges = [construct_range(t) for t in c_int_types]

and in many tests the two lists are zipped to produce appropriately matched couples. But I would certainly do

    c_int_types = [...] # as above
    c_int_types_with_ranges = [(t, construct_range(t)) for t in c_int_types]

Of course I understand that sometimes you might very well care about the space cost of doing this, but I suspect that if I cared about the 2X cost of c_int_types_with_ranges, I wouldn't pregenerate a list of ranges at all. My point is that given my style, this particular use case will *almost never* occur, so is unlikely to provide an excuse for strict mode if I'm providing the data. I suspect this applies to a lot of claimed use cases.

Of course if I only provide c_int_types, and your function constructs c_int_type_ranges and zips them, it's fine if you use strict mode -- that doesn't impact me at all. You probably *should* use strict mode. But if you claim to be providing a general facility, I think it's on you to think about whether I might want to feed sequences of unequal length to the function, even though you never would. That's quite a burden to assume, though, unless you simply provide a strict mode flag in your functions (which you can default to strict!) and let me choose.

Steve
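The lagged-data construction in example 1 can be run on toy numbers to see how the rows line up. The sample values for `y` and `x` below are invented purely for illustration; `const()` is the infinite generator from the message, and the "evened out" variant uses a finite generator expression in place of `const(n)`.

```python
def const():
    """Infinite stream of 1s for the intercept column."""
    while True:
        yield 1


# Invented sample series, indexed t = 0..3.
y = [10, 11, 12, 13]
x = [1, 2, 3, 4]

# Rows are (y[t], 1, x[t], x[t-1]) for t = 1..3; zip stops at the shortest input.
lagged_rows = list(zip(y[1:], const(), x[1:], x[0:]))

# The "evened out" variant has equal-length columns, yet pairs y[t] with x[t+1].
future_rows = list(zip(y[:-1], (1 for _ in range(len(x) - 1)), x[:-1], x[1:]))

print(lagged_rows)
print(future_rows)
```

Both calls produce three rows, so a length check alone would not distinguish the intended lag from the accidental lead.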
data:image/s3,"s3://crabby-images/c437d/c437dcdb651291e4422bd662821948cd672a26a3" alt=""
On Fri, May 15, 2020 at 12:55 PM Eric V. Smith <eric@trueblade.com> wrote:
This struck me as strange also. I mean, the wording can be improved to clarify "if error." But more significantly, it seems like it cannot conceivably be true. It might be "At most one additional item from EACH of the iterators." If I do zip_strict(a, b, c, d, e) and "e" is the one that is shorter, how could any algorithm ever avoid consuming one extra item of a, b, c, and d each?! -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
data:image/s3,"s3://crabby-images/2eb67/2eb67cbdf286f4b7cb5a376d9175b1c368b87f28" alt=""
On 2020-05-15 20:36, David Mertz wrote:
Well, it does say "when compared to normal zip usage". The normal zip would consume an item of a, b, c, and d. If e is exhausted, then zip would just stop, but zip_strict would raise ValueError. There would be no difference in the number of items consumed but not used.
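The consumption behaviour of ordinary zip described here can be observed directly with explicit iterators. This is a small sketch using only the existing builtin; the variable names are arbitrary.

```python
# Shorter input listed last: zip pulls one more item from `a` before it
# discovers that `e` is exhausted, and that item is dropped.
a = iter([1, 2, 3])
e = iter([10])
pairs = list(zip(a, e))
next_from_a = next(a)

# Shorter input listed first: zip stops before touching `a2` again.
a2 = iter([1, 2, 3])
e2 = iter([10])
pairs_reversed = list(zip(e2, a2))
next_from_a2 = next(a2)

print(pairs, next_from_a)
print(pairs_reversed, next_from_a2)
```

In the first case the value 2 has been consumed and is gone; in the second it is still available.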
data:image/s3,"s3://crabby-images/e94e5/e94e50138bdcb6ec7711217f439489133d1c0273" alt=""
David Mertz wrote:
I would say that 'equal' is worse than 'strict', but 'strict' is also wrong. Zipping to a potentially infinite sequence -- like a manual enumerate -- isn't wrong. It may be the less common case, but it isn't wrong. Using 'strict' implies that there is something sloppy about the data in, for example, cases like Stephen J. Turnbull's lagged time series. Unfortunately, the best I can come up with is 'same_length', or possibly 'equal_len' or 'equal_length'. While those are better semantically, they are also slightly too long or awkward. I would personally still consider 'same_length' the least bad option. -jJ
data:image/s3,"s3://crabby-images/f81c3/f81c349b494ddf4b2afda851969a1bfe75852ddf" alt=""
On Wed, May 20, 2020 at 11:09 AM Jim J. Jewett <jimjjewett@gmail.com> wrote:
As we've come down to naming things... if you want it to read more like English, `zip(vorpal_rabbits, holy_hand_grenades, lengths_must_match=True)` or another chosen variation of that such as `len_must_match=` or `length_must_match=` reads nicely and is pretty self explanatory that an error can be expected if the condition implied by the "must" is found untrue without really feeling a need to look it up in documentation. It is also harder to type or fit on a line. Which is one advantage to a short thing like `strict=`. I don't care so much about the particular spelling here to argue among any of those, I primarily want the feature to exist. I expect we're entering steering council territory for a decision soon... -gps
data:image/s3,"s3://crabby-images/b957e/b957eb3ce8f7e4648537689d41c333147b87e8c2" alt=""
Python has always preferred full-word over old-school C/Perl/PHP-style abbreviated names. Clarity is paramount. (Or this whole discussion wouldn't even be happening.) I think this is *more* of a zip_shortest than zip_strict, but since you can never have total clarity without a method name that doubles as a docstring, whatever works will work as long as it's documented. Em On Wed, May 20, 2020 at 3:33 PM Joseph Jenne via Python-Dev < python-dev@python.org> wrote:
I'd like to suggest "len_eq" as a short but (rather) self-explanatory option.
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Thu., 21 May 2020, 4:09 am Jim J. Jewett, <jimjjewett@gmail.com> wrote:
Reading this thread and the current PEP, the main question I had was whether it might be better to flip the sense of the flag and call it "truncate". So the status quo would be "truncate=True", while the ValueError could be requested by passing an explicit "truncate=False".

Draft documentation paragraph:

======
zip() can be used to combine iterables of different lengths, including combining finite iterables with infinite iterators. By default, the output iterator is implicitly truncated to produce the same number of items as the shortest input iterable. Setting *truncate* to false disables this implicit truncation and raises ValueError instead. Note that if this ValueError is raised an additional item will have been consumed from any iterators listed before the shortest iterator (or from the second listed iterator if the first iterator is the shortest one). To pad shorter input iterables rather than truncating the output or raising ValueError, see itertools.zip_longest.
======

The conceptual idea here is that the "truncate" flag name would technically be a shorter mnemonic for "truncate_silently", so clearing it gives you an exception rather than enabling padding behaviour.

Flipping the sense of the flag also means that "truncate=True" will appear in IDE tooltips as part of the function signature, providing significantly more information than "strict=False" would. That improved self-documentation then becomes what I would consider the strongest argument in favour of the flag-based approach: providing more information up-front to users regarding the actual behaviour of the builtin, rather than having them incorrectly assume that mismatched input iterator lengths will raise an exception.

Side note: this idea pairs nicely with the "zip(itr, itr, itr)" idiom for non-overlapping data windows, as it makes it straightforward to request an exception if the last data tuple has values missing (without the flag, the idiom silently discards incomplete trailing data).

Cheers, Nick.
P.S. I had the opportunity to read the thread from beginning to end after belatedly catching some of the messages out of context, and FWIW, I started out assuming I would strongly favour the itertools function option, and surprised myself by favouring the flag option (albeit inverted) by the time I reached the end.
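The windowing idiom from the side note can be sketched as below. `windows` is the plain idiom; `complete_windows` is a hypothetical helper that emulates the requested exception with itertools.zip_longest and a sentinel, since the flag itself is still only a proposal in this thread.

```python
from itertools import zip_longest


def windows(iterable, n):
    """Non-overlapping n-tuples; an incomplete trailing window is silently dropped."""
    itr = iter(iterable)
    return zip(*[itr] * n)


def complete_windows(iterable, n):
    """Like windows(), but raise ValueError if the last window has values missing."""
    missing = object()
    itr = iter(iterable)
    for window in zip_longest(*[itr] * n, fillvalue=missing):
        if any(item is missing for item in window):
            raise ValueError("trailing window is incomplete")
        yield window


print(list(windows(range(7), 3)))           # the item 6 is discarded
print(list(complete_windows(range(6), 3)))  # lengths divide evenly, no error
```

Calling `list(complete_windows(range(7), 3))` raises ValueError once the partial third window is reached.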
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
On 06/01/2020 04:36 AM, Nick Coghlan wrote:
Reading this thread and the current PEP, the main question I had was whether it might be better to flip the sense of the flag and call it "truncate".
So the status quo would be "truncate=True", while the ValueError could be requested by passing an explicit "truncate=False".
I like this a lot. +1 -- ~Ethan~
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Mon, Jun 01, 2020 at 09:36:40PM +1000, Nick Coghlan wrote:
It's not really *implicit* if there's an explicit flag controlling the behaviour, even with a default value. We don't use that sort of language elsewhere. For example, help(sorted) doesn't say: "Return a new list containing all items from the iterable implicitly in ascending order. Pass reverse=True to disable this implicit order." help(int) doesn't say that the base is implicitly decimal; help(print) doesn't talk about "implicit spaces between items, implicit newline at the end of the output" etc. It just states the behaviour controlled by the parameter. This is accurate, non-judgemental, and avoids being over-wordy: "By default, the output iterator is truncated at the shortest input iterable."
"Significantly" more? I don't think so. Truncate at what? - some maximum length; - some specific element; - at the shortest input. At some point people have to read the docs, not just the tooltips. If you didn't know what zip does, seeing truncate=True won't mean anything to you. If you do know what zip does, then the parameter names are mnemonics, and strict=False and truncate=True provide an equal hint for the default behaviour: * if it's not strict, it is tolerant, stopping at the shortest; * if it truncates, it truncates at the shortest input. For the default case, strict=False and truncate=True are pretty much equal in information. But for the case of non-default behaviour, strict=True is a clear winner. It can pretty much only mean one thing: raise an exception. Whereas truncate=False is ambiguous: - pad the output; - skip items as they become empty; - raise an exception. All three of these are useful behaviour, and while the middle one is not part of this PEP, it was requested in the discussions on Python-Ideas.
That improved self-documentation then becomes what I would consider the strongest argument in favour of the flag-based approach:
I don't think that "truncate=False" (which can mean three different things) is more self-documenting than `zip(*items, mode='strict')` or `zip_strict()` (either of which can only mean one thing). -- Steven
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Tue., 2 Jun. 2020, 11:23 am Steven D'Aprano, <steve@pearwood.info> wrote:
Given that the only input parameters are the iterables themselves, it's a stretch to even consider the first two as possibilities.
"strict=False" doesn't tell you whether the tolerant behaviour is truncation or padding. "truncate=True" does.
For the default case, strict=False and truncate=True are pretty much equal in information.
Nope. If you don't already know that zip truncates the output by default, "truncate=True" gives you that information, while "strict=False" doesn't.
But for the case of non-default behaviour, strict=True is a clear winner. It can pretty much only mean one thing: raise an exception.
But raise an exception when? In the context of this discussion, we know we mean "strict length checking, raising an exception for inconsistent lengths". But "strict" on its own doesn't convey that - we could be requesting strict runtime type checking, for example, where each iterable is expected to keep producing items of the same type as was produced for the first tuple. Or we could be requesting a check that the values in the tuple aren't "None".
As noted above, "strict" just means "check more constraints" - it's at least as ambiguous as "don't truncate the output". I do agree that the ambiguity of "truncate=False" is the biggest downside of that spelling, but learning that it means "raise an exception on a length mismatch instead of truncating the output iterator" isn't going to be any harder than learning what strict mode means. Cheers, Nick.
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, Jun 2, 2020 at 8:55 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
Why? I can conceivably imagine that zip(iter1, iter2, truncate=5) would consume at most 5 elements from each iterable. It's not much of a stretch. It doesn't happen to be what's proposed, but it's a reasonable interpretation. (Though then the default would probably be truncate=None to not truncate.) ChrisA
data:image/s3,"s3://crabby-images/c437d/c437dcdb651291e4422bd662821948cd672a26a3" alt=""
On Tue, Jun 2, 2020 at 8:07 AM Chris Angelico <rosuav@gmail.com> wrote:
This was exactly my thought, that Chris wrote very well. I can easily imagine a 'truncate=5' behavior. In fact, if it existed, it is something I would have used multiple times. As is, I use islice() or a break inside a loop, but that hypothetical parameter might be a helpful convenience. However, it is indeed NOT the current proposal or discussion. -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
data:image/s3,"s3://crabby-images/c437d/c437dcdb651291e4422bd662821948cd672a26a3" alt=""
On Tue, Jun 2, 2020, 9:41 AM Steve Dower
Oh yeah. I've done that too. For whatever reason, I think I used to use the extra range, and nowadays I'm more likely to use islice(). I have absolutely no argument why one style or the other is better, just my habit has changed. In any case, I'm not advocating for truncate=5 behavior. Merely agreeing that the word truncate is not less ambiguous than the word strict. That's not even saying I prefer strict to truncate; itertools.zip_strict() remains my preference. But I could learn either parameter choice easily enough.
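The two habits mentioned here, an extra range() argument and islice(), can be compared side by side. This is a small sketch with made-up inputs; neither spelling depends on any new zip parameter.

```python
from itertools import count, islice

letters = "abcdefgh"

# Style 1: slice the zipped stream after the fact.
with_islice = list(islice(zip(letters, count(100)), 5))

# Style 2: add a bounded range as an extra input and discard its values.
# Listing the range first means zip stops before pulling from the other inputs.
with_range = [(letter, number)
              for _, letter, number in zip(range(5), letters, count(100))]

print(with_islice)
print(with_range)
```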
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Jun 02, 2020 at 08:52:46PM +1000, Nick Coghlan wrote:
And then:
"strict=False" doesn't tell you whether the tolerant behaviour is truncation or padding. "truncate=True" does.
You can't have it both ways Nick -- if the lack of additional parameters is enough for the user to predict that the only reasonable behaviour is to truncate, then the lack of additional parameters is also enough for them to predict that the only reasonable non-strict (tolerant) behaviour is to truncate at the shortest input. [...]
If you are going to propose that users might imagine a hypothetical check that raises if any item is None, well, isn't that *precisely* the sentinel check I gave above that you blithely dismissed as "a stretch"? If it's a stretch for me, it's a stretch for you too. Ultimately, bikeshedding on the name truncate versus strict versus equal versus shortest versus ... is quibbling. Everyone who reads the tooltips, assuming they even see them, is going to take something different from it. Some will think "truncate what, the tuples?" and some will think "strict about what?". Ultimately the tooltips are no substitute for reading the docs. If you don't know what zip does, you cannot interpret what it means for zip to truncate or be strict. No one single word is going to communicate everything we need to communicate. Function and parameter names are mnemonics, not documentation. So on that note, and in regard only to the choice between "strict" versus "truncate" etc, I'm going to bow out: call it what you will. I've got a bigger problem with the use of a boolean flag than the name. -- Steven
data:image/s3,"s3://crabby-images/d1d84/d1d8423b45941c63ba15e105c19af0a5e4c41fda" alt=""
Brandt Bucher writes:
I thought it was quite clear. Those of us who disagree simply disagree. We prefer to provide it as a separate function. Just move on, please; you're not going to convince us, and we're not going to convince you. Leave it to the PEP Delegate or Steering Council.
I wouldn't confuse "can" and "should" here.
You do exactly that in arguing for your preferred design, though. We could implement the strictness test with an argument to the zip builtin function, but I don't think we should. I still can't think of a concrete use case for it from my own experience. Of course I believe concrete use cases exist, but that introspection makes me suspicious of the claim that this should be a builtin feature, with what is to my taste an ugly API. Again, I don't expect to convince you, and you shouldn't expect to convince me, at least not without more concrete and persuasive use cases than I've seen so far. Steve
data:image/s3,"s3://crabby-images/edc98/edc9804a1e6f2ca62f3236419f69561516e5074d" alt=""
I'm on the fence about using a separate function vs. a keyword argument (I think there is merit to both), but one thing to note about the separate function suggestion is that it makes it easier to write backwards compatible code that doesn't rely on version checking. With `itertools.zip_strict`, you can do some graceful degradation like so:

    try:
        from itertools import zip_strict
    except ImportError:
        zip_strict = zip

Or provide fallback easily:

    try:
        from itertools import zip_strict
    except ImportError:
        def zip_strict(*args):
            yield from zip(*args)
            for arg in args:
                if next(arg, None):
                    raise ValueError("At least one input terminated early.")

There's an alternate pattern for the kwarg-only approach, which is to just try it and see:

    try:
        zip(strict=True)
        HAS_ZIP_STRICT = True
    except TypeError:
        HAS_ZIP_STRICT = False

But I would say it's considerably less idiomatic. Just food for thought here. In the long run this doesn't matter, because eventually 3.9 will fall out of everyone's support matrices and these workarounds will become obsolete anyway.

Best, Paul On 5/15/20 5:20 AM, Stephen J. Turnbull wrote:
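A somewhat more defensive version of the import-or-fallback pattern might look like the sketch below. It assumes the hypothetical name `itertools.zip_strict` discussed in this thread, and swaps in a different technique for the fallback: itertools.zip_longest with a private sentinel rather than probing the inputs after a plain zip.

```python
from itertools import zip_longest


def _zip_strict_fallback(*iterables):
    """Yield like zip, but raise ValueError if the inputs differ in length."""
    missing = object()  # cannot collide with real data such as None or 0
    for row in zip_longest(*iterables, fillvalue=missing):
        if any(item is missing for item in row):
            raise ValueError("zip_strict() arguments have different lengths")
        yield row


try:
    # Hypothetical name from this discussion; the import is expected to fail
    # wherever no such function exists, in which case the fallback is used.
    from itertools import zip_strict
except ImportError:
    zip_strict = _zip_strict_fallback

print(list(zip_strict([1, 2], "ab")))
```

The sentinel approach accepts plain iterables as well as iterators, treats falsy items as real data, and still reports a mismatch when an earlier argument is exactly one item longer, a case where zip has already consumed the extra item by the time it stops.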
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Fri, May 15, 2020 at 09:56:03AM -0400, Paul Ganssle wrote:
This is just a special case of a much broader case: a separate function, or method, is a first class object that can be passed around to other functions, used in lists, etc. https://softwareengineering.stackexchange.com/questions/39742/when-is-a-feat... Using a mode switch or flag makes the zip strict a second class citizen. -- Steven
data:image/s3,"s3://crabby-images/2ffc5/2ffc57797bd7cd44247b24896591b7a1da6012d6" alt=""
Here’s another advantage of having a separate function that I didn’t see acknowledged in the PEP: If strict behavior is a better default for a zip-like function than non-strict, then choosing a new function would let you realize that better default. In contrast, by adding a new argument to the existing function, the function you use will forever have the less preferred default. In terms of what is a better default, I would say strict is better because errors can’t pass silently: If errors occur, you can always change the flag. But you would be doing that explicitly. —Chris On Fri, May 15, 2020 at 6:57 AM Paul Ganssle <paul@ganssle.io> wrote:
data:image/s3,"s3://crabby-images/fef1e/fef1ed960ef8d77a98dd6e2c2701c87878206a2e" alt=""
On Fri, 15 May 2020 06:06:00 -0000 "Brandt Bucher" <brandtbucher@gmail.com> wrote:
And in any case, people who are concerned about performance should use the C decimal accelerator, which is the default. Here is your micro-benchmark with _pydecimal (which is the pure Python fallback): $ python3.8 -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN" ..................... Mean +- std dev: 35.4 us +- 1.1 us Here is the same micro-benchmark with decimal (which loads the C accelerator by default): $ python3.8 -m pyperf timeit -s "$PYPERFSETUP" "$PYPERFRUN" ..................... Mean +- std dev: 471 ns +- 12 ns Even if you were losing performance on those 35.4us it wouldn't make sense to complain about it. Regards Antoine.
data:image/s3,"s3://crabby-images/9dc20/9dc20afcdbd45240ea2b1726268727683af3f19a" alt=""
In the last 24 hours, this thread has grown a bit beyond my capacity to continue several different lines of discussion with each individual. I count 22 messages from 14 different people since my last reply, and I assure you that I've carefully read each response and am considering them as I work on the next draft. I'd like to thank everyone who took the time to read the PEP and provide thoughtful, actionable feedback here! Brandt
participants (23)
- Antoine Pitrou
- Brandt Bucher
- Chris Angelico
- Chris Jerdonek
- David Mertz
- Dennis Sweeney
- Emily Bowman
- Eric V. Smith
- Ethan Furman
- Gregory P. Smith
- Guido van Rossum
- Henk-Jaap Wagenaar
- Jim J. Jewett
- Joseph Jenne
- MRAB
- Nick Coghlan
- Paul Ganssle
- Paul Moore
- Rhodri James
- Stephen J. Turnbull
- Steve Dower
- Steve Holden
- Steven D'Aprano