On Mon, May 4, 2020 at 1:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 10:26:18PM +0200, Alex Hall wrote:

> > If I know that consumers of my data truncate on the shortest input, then
> > as the producer of data I don't have to care about making them equal. I
> > can say:
> >
> >     process_user_ids(usernames, generate_user_ids())
> >
> > and pass an infinite stream of user IDs and know that the function will
> > just truncate on the shortest stream. Yay! Life is good.
> >
> > But now this zip flag comes along, and the author of process_user_ids
> > decides to protect me from myself and "fail loudly", and I will curse
> > them onto the hundredth generation for making my life more difficult.
> >
>
> My guess is that this kind of situation is rare and unusual. The example
> looks made up, is it based on something real?

Yes, it's made up, but yes, it is based on real code. I haven't had to
generate user IDs yet, but I have often passed in (for example) an
infinite stream of prime numbers, or some other infinite sequence.

Not necessarily infinite either: it might be finite but huge, such as
combinations or permutations of something.
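
To make the scenario concrete, here is a minimal sketch of the pattern being described. Both `generate_user_ids` and `process_user_ids` are hypothetical stand-ins that mirror the names in the quoted example, and the consumer is assumed to use plain `zip`:

```python
from itertools import count

def generate_user_ids():
    """Hypothetical producer: an endless stream of user IDs."""
    return count(1000)

def process_user_ids(usernames, user_ids):
    """Hypothetical consumer that truncates on the shortest input."""
    return dict(zip(usernames, user_ids))

# The infinite stream is cut off as soon as the usernames run out.
result = process_user_ids(["alice", "bob", "carol"], generate_user_ids())
print(result)  # {'alice': 1000, 'bob': 1001, 'carol': 1002}
```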

Right, but have you had an actual situation where you've passed two parallel streams of different lengths to an external library that zipped them?
 
At the consumer end, the main one that comes to mind off the top of my
head (apart from code I wrote myself) is numpy, for example their
correlation coefficient function:

    py> np.corrcoef([1, 2, 3, 4, 5, 6, 7, 8], [7, 6, 5, 4, 3, 2])
    Traceback (most recent call last):
      ...
    ValueError: array dimensions must agree except for d_0

Now I'm still keeping an open mind whether this check is justified for
stats functions. (There have been some requests for XY stats in the
statistics module, so I may have to make a decision some day.)

Are you saying you're annoyed that they enforce the same length here, and you'd like the freedom to pass different lengths? Because to me the check seems like an extremely good idea.
 
But my point is, regardless of whether that check is necessary or not,
*I still have to check it when I produce the data* or else I get a
potentially spurious exception that will eat my data. In the general
case where I have iterators not lists, no recovery is possible. I can't
catch the error, trim the data, and resubmit. I have to preemptively
trim the data.
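
The "no recovery is possible" point can be seen in a small sketch. `zip_strict` below is a hypothetical stand-in written only for illustration (itertools has no such function), and `consume` is an invented consumer:

```python
from itertools import zip_longest

def zip_strict(*iterables):
    """Hypothetical strict zip: raises ValueError on a length mismatch."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo

def consume(xs, ys):
    """Invented consumer that insists on equal lengths."""
    return [x * y for x, y in zip_strict(xs, ys)]

xs = iter([1, 2, 3, 4, 5])
ys = iter([10, 20, 30])

remaining = None
try:
    consume(xs, ys)
except ValueError:
    # Both iterators were advanced inside consume(); the element 4 was
    # pulled from xs before the mismatch was detected and is gone.
    remaining = (list(xs), list(ys))

print(remaining)  # ([5], [])
```

With lists the caller could slice and retry; with iterators the already-consumed items cannot be replayed, so trimming has to happen before the call.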

Of course I recognise the right of each developer to choose for
themselves whether to enforce the rule that input streams are equal. And
sometimes that will be the right thing to do.

But you are arguing that putting zip_strict as a builtin will encourage
people to do this, and I am saying that *encouraging people to do this*
is a point against it, because that will lead to a culture of "things
which aren't errors should never pass silently", Just In Case it might
be an error.

Yes. We're starting to go in circles here, but I'm arguing that it's OK for people to be mildly inconvenienced sometimes by having to preemptively trim their inputs in exchange for fewer confusing, invisible, frustrating bugs. I'd like people to use this feature as often as possible, and I think the benefits easily outweigh the problem you describe. Going crazy trying to debug something is probably the thing programmers complain about the most, and I'd like to reduce that.
 
If you need to enforce equal length input, then one extra line:

    from itertools import zip_strict

is no burden. But if someone is on the fence about checking for equal
lengths, and importing would be too much trouble so they don't bother,
but making it builtin is enough to tip them over into using
it "just to be safe", then they probably shouldn't be using it.

I don't know why you put "just to be safe" in mocking quotes. I see safety as good.

If an API accepts some iterables intending to zip them, I feel pretty safe guessing that 90% of the users of that API will pass iterables that they intend to be of equal length. Occasionally someone might want to pass an infinite stream or something, but really most users will just use lists constructed in a boring manner. I can't imagine ever designing an API thinking "I'd better not make this strict, I'm sure this particular API will be used quite differently from most other similar APIs and users will want to pass different lengths unusually often". But even if I grant that such occasions exist, I see no reason to believe that they will occur most often when a user is feeling too lazy to import itertools. The correlation you propose is highly suspect.
 
> An external API which
> requires you to pass parallel iterables instead of pairs is unusual and
> confusing.

I don't know about that. numpy's correlation coefficient supports both a
single array of data pairs or a pair of separate X, Y values. So does R.

Spreadsheets likewise put the X, Y data in separate cells, not in cells
with two data points. If your data is coming from a CSV file, as it
often does, the most natural way to get it is as two separate columns.

But for APIs that do require a single stream of pairs, then this entire
discussion is irrelevant, since I, the producer of the data, can choose
whatever behaviour makes sense for me:

- `zip` to truncate the data on shortest input;
- `zip_longest` to pad it;
- `zip_strict` (whether I write my own or get it from the stdlib)

or anything else I like, since I'm the producer.
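
The three behaviours listed above can be compared side by side. `zip` and `itertools.zip_longest` are the real functions; `zip_strict` is a hypothetical helper sketched here for illustration, not an existing stdlib function:

```python
from itertools import zip_longest

def zip_strict(*iterables):
    """Hypothetical strict zip: raises ValueError on a length mismatch."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo

xs = [1, 2, 3]
ys = ["a", "b"]

truncated = list(zip(xs, ys))        # stop at the shortest input
padded = list(zip_longest(xs, ys))   # pad the shorter input with None

try:
    strict = list(zip_strict(xs, ys))
except ValueError as exc:
    strict = f"ValueError: {exc}"    # a mismatch is treated as an error

print(truncated)  # [(1, 'a'), (2, 'b')]
print(padded)     # [(1, 'a'), (2, 'b'), (3, None)]
print(strict)     # ValueError: iterables have different lengths
```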

So we shouldn't be talking about APIs that require a stream of pairs,
but only APIs that require separate streams.

My point is that the APIs you're talking about are rare, therefore the problem you describe is rare.
 
[...]
> In most cases it wouldn't ensure the correctness of code, but it could give
> some peace of mind and might help readers. But also in those cases if the
> user decides it's redundant and not worth using:
>
> - The user had better be confident of their judgement, which will
> inevitably sometimes be wrong.

That's not our problem to solve. It isn't up to us to force people to
use zip_strict "just in case your judgement that it isn't needed is
wrong".

I think you and I have a severe disagreement on the relationship between
the stdlib and the end developers. Between your comment above, and the
one below, you seem to believe that it is the job of the stdlib to
protect the developer from themselves.

And yet here you are using Python, where we have the ability to
monkey-patch the builtins namespace, shadow functions, reach into
functions and manipulate their data, including their code, rebind any
name, remove attributes of anything, even change the class of (some)
instances on the fly. Are you sure you are using the right language?

Python has plenty of things to protect users from mistakes. See exist_ok, keyword-only arguments, private attributes, etc. All that matters is that you can work around them.
 
> - Even if the context of the code makes it obvious that it's redundant,
> that context could change in the future and introduce a silent regression,
> and people are likely to not think to add strict=True to the zip call
> somewhere down the line from their changes. Adding strict=True is more
> future-proof to such changes.

That's an awfully condescending thing to say. You are effectively saying
that Python developers ought to use strict zip by default just in case
their requirements change in the future and they are too dim-witted to
remember to change the implementation to match.

You see it as condescending, I see it as empathetic and cautious. Coders are humans and make mistakes. Code is full of bugs. Changes often cause regressions because someone wasn't aware of the impact changes in one place would have elsewhere. There's a common joke: "99 little bugs in the code, 99 little bugs in the code. Take one down, patch it around, 117 little bugs in the code".

If someone does accidentally cause iterables to become unequal length, it is FAR better that this triggers an exception which alerts them to the problem immediately than them having to figure out what new thing has gone wrong, or worse, them not noticing at all.
 
I will try to remember that there are legitimate users for a strict zip,
but this advocacy of *preemptive and unnecessary length checks* just
makes me more hostile to adding this to the stdlib at all, let alone
making it a builtin.


> > If I'm the producer of the data, and I want it to be equal in length,
> > then I control the data and can make it equal.
>
> But in your example you complain about having to do that. Is it a problem
> or not?

It is a problem if I have to do it for no good reason, to satisfy some
arbitrary "just in case" length check. I don't mind doing so if there is
a genuine good reason for it, like in the ast case.

Ultimately, the data is going to be truncated somewhere. At the moment,
zip does the truncation. If people switch to strict zip, then I have to
do the truncation. That's a usability regression. Unless it is justified
by some functional reason, or fixing a bug, it's more harmful than
helpful:

"Function do_useful_stuff() no longer ignores excess stuff, but raises
instead. Callers must now trim their To Do list before passing it to the
function. No bugs were fixed by this change."

sort of thing.


> I think you make this situation sound worse than it is. "I will curse them
> onto the hundredth generation for making my life more difficult" is pretty
> melodramatic.

Hyperbole is supposed to be read as humour, not literally :-)

Hyperbole aside, I still think everything you're saying makes the problem sound worse than it is. It's a mild inconvenience. 
 
> If you get an exception because you tried:
>
>     process_user_ids(usernames, generate_user_ids())
>
> then you can pretty easily change it to:
>
>     process_user_ids(usernames, (user_id for user_id, username in
>                                  zip(generate_user_ids(), usernames)))
>
> or if you can generate user IDs one at a time:
>
>     process_user_ids(usernames, (generate_user_id() for _ in usernames))

So I have to write ugly, inefficient code to make up for you encouraging
people to add unnecessary length checks in their code? Yay, I feel so
happy about this change!!!

Also, it doesn't work if usernames is an iterator.
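
A small sketch of why the second workaround breaks for iterators: the generator expression and the consumer both pull from the same `usernames` iterator, so they steal items from each other. The helper names are hypothetical and the consumer is assumed to use plain `zip`:

```python
from itertools import count

_ids = count(1000)

def generate_user_id():
    """Hypothetical one-at-a-time ID generator."""
    return next(_ids)

def process_user_ids(usernames, user_ids):
    """Hypothetical consumer that zips its two inputs."""
    return list(zip(usernames, user_ids))

usernames = iter(["alice", "bob", "carol", "dave"])

# zip() takes "alice", then the generator expression takes "bob" just to
# count it, and so on: every second username is silently swallowed.
result = process_user_ids(usernames, (generate_user_id() for _ in usernames))
print(result)  # [('alice', 1000), ('carol', 1001)]
```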

I already consider this an unlikely situation; that just makes it even more unlikely. It's even more unlikely that you can't just convert the usernames to a list. Even then, it's pretty solvable.
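
One possible way to handle the iterator case, as a sketch only: pair the streams with plain `zip` first, then split the pairs back into two equal-length iterators with `itertools.tee`. The helper `trim_to_match` and the producer `generate_user_ids` are hypothetical names introduced for this example:

```python
from itertools import count, tee

def generate_user_ids():
    """Hypothetical producer: an endless stream of user IDs."""
    return count(1000)

def trim_to_match(finite, endless):
    """Return two iterators of equal length, cut at the end of `finite`."""
    # zip() pulls from `finite` first, so it stops cleanly when that ends.
    left, right = tee(zip(finite, endless))
    return (a for a, _ in left), (b for _, b in right)

usernames = iter(["alice", "bob", "carol"])
names, ids = trim_to_match(usernames, generate_user_ids())

names_list = list(names)
ids_list = list(ids)
print(names_list)  # ['alice', 'bob', 'carol']
print(ids_list)    # [1000, 1001, 1002]
```

A consumer that walks both streams in lockstep keeps the `tee` buffer small; consuming one stream far ahead of the other, as the two `list()` calls here do, buffers the difference in memory.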

If you want to use iterators so badly, consider this: most people currently enforce this check using something like `assert len(x) == len(y)`. I've given several examples of that, including within CPython. That makes it completely impossible to pass any kind of iterator. The best way to get rid of that kind of code is to make strict zip as widely known as possible. It will be more widely known if it's widely used and if it's a builtin. If it's in itertools, you can expect that plenty of people will never hear about it, and they will continue to use len().
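
To illustrate the difference, here is a sketch contrasting a `len()`-based check with a strict zip. `zip_strict` is a hypothetical helper written for this example, and the two `pairs_*` functions are invented consumers:

```python
from itertools import zip_longest

def zip_strict(*iterables):
    """Hypothetical strict zip: raises ValueError on a length mismatch."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo

def pairs_checked_with_len(xs, ys):
    # Same spirit as `assert len(x) == len(y)`: requires sized inputs.
    if len(xs) != len(ys):
        raise ValueError("lengths differ")
    return list(zip(xs, ys))

def pairs_checked_with_strict(xs, ys):
    # Works for any iterable, including generators.
    return list(zip_strict(xs, ys))

def squares():
    return (n * n for n in range(3))

try:
    pairs_checked_with_len(squares(), squares())
    len_outcome = "ok"
except TypeError:
    len_outcome = "TypeError"  # generators have no len()

strict_outcome = pairs_checked_with_strict(squares(), squares())
print(len_outcome)     # TypeError
print(strict_outcome)  # [(0, 0), (1, 1), (4, 4)]
```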
 
> It's a bit inconvenient, but:
>
> - It's pretty easy to solve the problem, certainly compared to implementing
> zip_equal,

You haven't solved the problem, and judging by your attempted work-
arounds, this could easily lead to a reduction in code quality, not an
improvement.


> and nothing like your example of final classes which are pretty
> much insurmountable.

I don't remember raising final classes in this thread. Am I losing my
short-term memory or are you confusing me with someone else?

I quoted you mentioning them before I wrote this bit:

But if I'm only the consumer of the data, I have no business failing
"just to be safe". That's an anti-feature that makes life more
difficult, not less, for the producer of the data, akin to excessive
runtime type checking (I'm sometimes guilty of that myself) or in other
languages flagging every class in sight as final "just to be safe".

 
Back to the current email:

> - An exception has clearly pointed out a problem to solve

An exception *might* have pointed out a problem to solve, but since the
solution is usually to just truncate the data, and zip already does
that, raising an exception may be a regression, not an enhancement.

I don't mean to assert that there was a bug to be fixed; I mean the exception has made it fairly clear that you need to truncate the data, which may be annoying but is better than the frustrating debugging which could have happened had things gone differently.
 
Can we agree to meet half-way?

- there are legitimate, genuine uses for zip_strict;

- but encouraging people to change zip to zip_strict "just in case"
  in the absence of a genuine reason is a bad thing.

No, I stand by my position for now that "just in case" is a genuine reason and that safety outweighs convenience and efficiency. I haven't been given a reason to believe that your concerns would be significant.