On Sat, May 2, 2020 at 5:10 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, May 02, 2020 at 02:58:57PM +0200, Alex Hall wrote:

> Adding a function to itertools will mostly solve that problem, but not
> entirely. Adding `strict=True` is so easy that people will be encouraged to
> use it often and keep their code safe. That to me is the biggest argument
> for this feature and for this specific API.

The last thing I want is to encourage people to unnecessarily enforce a
rule "data streams must be equal just to be safe" when they don't
actually need to be equal. What you are calling the biggest argument for
this feature is, for me, a strong argument against it.

If I know that consumers of my data truncate on the shortest input, then
as the producer of data I don't have to care about making them equal. I
can say:

    process_user_ids(usernames, generate_user_ids())

and pass an infinite stream of user IDs and know that the function will
just truncate on the shortest stream. Yay! Life is good.

But now this zip flag comes along, and the author of process_user_ids
decides to protect me from myself and "fail loudly", and I will curse
them onto the hundredth generation for making my life more difficult.

My guess is that this kind of situation is rare and unusual. The example looks made up. Is it based on something real? I've given examples of functions that check the lengths of their arguments, so it's conceivable that you or someone else could have had this exact problem. But the fact that those checks are there shows that people thought they were a good idea, and no one has complained enough to change their minds. And we have examples of people cursing the lack of a check.
 
If I'm the producer and consumer of the data, then I can pick and
choose between versions, and that's all well and good.

FWIW I do think this use case is much more common. An external API which requires you to pass parallel iterables instead of pairs is unusual and confusing. For example, the real source of the ast.unparse problem is that ast.Dict.{keys,values} is weird. Every consumer of it, such as compile and ast.unparse, has to check the lengths, a strategy that has failed and will probably continue to fail. I'm not saying that bad APIs are rare, but that this kind of API is both bad and rare.
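To make the "parallel iterables" shape concrete, here is a minimal sketch using the standard ast module. The helper name dict_pairs and the hand-built node are made up for illustration; the only point is that a plain zip over keys and values drops the unmatched key without complaint.

```python
import ast


def dict_pairs(node):
    """Pair up the parallel keys/values lists of an ast.Dict node (hypothetical helper)."""
    return [(ast.literal_eval(k), ast.literal_eval(v))
            for k, v in zip(node.keys, node.values)]


# A node produced by the parser always has matching lengths.
parsed = ast.parse("{'a': 1, 'b': 2}", mode="eval").body
parsed_pairs = dict_pairs(parsed)

# A hand-built node can have mismatched lengths; zip silently drops key 'b'.
broken = ast.Dict(
    keys=[ast.Constant(value="a"), ast.Constant(value="b")],
    values=[ast.Constant(value=1)],
)
broken_pairs = dict_pairs(broken)
```

With a list of (key, value) pairs instead of two parallel lists, the mismatched node could not be built in the first place.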

We've argued a lot about what kinds of uses of zip are most common, so I did a little survey of code that I had written or worked with. The uses of zip that I found could be roughly categorised as follows:

- strict (if it existed) should be False: 3
- Lengths need to be equal...
    - ...but that's not checked, although that's probably OK: 11
    - ...but that's not checked, and that's a problem: 3
    - ...so there's an assert len(x) == len(y): 2

Based on that data, adding strict=True:

- would not hurt in the vast majority of cases (16 of 19).
- would be significantly helpful more often than strict should be False.
- would ensure correctness in currently unsafe code as often as strict should be False (3 cases each).

In most cases it wouldn't ensure the correctness of the code, but it could give some peace of mind and might help readers. And if the user decides in those cases that it's redundant and not worth using:

- The user had better be confident of their judgement, which will inevitably sometimes be wrong.
- Even if the context of the code makes it obvious that it's redundant, that context could change in the future and introduce a silent regression, and people are unlikely to think of adding strict=True to a zip call somewhere down the line from their changes. Adding strict=True up front is more robust against such changes.
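A minimal sketch of the silent regression described in the points above. The function names and data are invented for illustration, and the explicit len() check stands in for what the proposed strict=True would do inside zip.

```python
def label_scores(names, scores):
    """Unchecked version: relies on the caller passing equal-length lists."""
    return dict(zip(names, scores))


def label_scores_checked(names, scores):
    """Checked version: an explicit length check, roughly what strict=True would replace."""
    if len(names) != len(scores):
        raise ValueError(f"got {len(names)} names but {len(scores)} scores")
    return dict(zip(names, scores))


names = ["ann", "bob", "cat"]
scores = [90, 85]  # imagine a later change upstream dropped one score

silent = label_scores(names, scores)  # 'cat' vanishes without any error

try:
    label_scores_checked(names, scores)
except ValueError as exc:
    error_message = str(exc)
```

Note that the len() check only works for sized inputs such as lists; a check built into zip would not have that limitation.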
 
If I'm the producer of the data, and I want it to be equal in length,
then I control the data and can make it equal.

But in your example you complain about having to do that. Is it a problem or not?
 
But if I'm only the consumer of the data, I have no business failing 
"just to be safe". That's an anti-feature that makes life more
difficult, not less, for the producer of the data, akin to excessive
runtime type checking (I'm sometimes guilty of that myself) or in other
languages flagging every class in sight as final "just to be safe".

It is possible to be *too* defensive, and if making the strict version
of zip a builtin encourages consumers of the data to "be safe", then
that is a mark against it in my strong opinion.

I think you make this situation sound worse than it is. "I will curse them onto the hundredth generation for making my life more difficult" is pretty melodramatic. If you get an exception because you tried:

    process_user_ids(usernames, generate_user_ids())

then you can pretty easily change it to:

    process_user_ids(usernames, (user_id for user_id, username in zip(generate_user_ids(), usernames)))

or if you can generate user IDs one at a time:

    process_user_ids(usernames, (generate_user_id() for _ in usernames))

It's a bit inconvenient, but:

- It's pretty easy to solve the problem, certainly compared to implementing zip_equal, and it's nothing like your example of final classes, which are pretty much insurmountable.
- An exception has clearly pointed out a problem to solve, which is much better than trying to find a silent logic error. You might complain in this case but you'd be grateful if you accidentally passed a user_ids list that was too short.
- It's not a problem you can ignore, so it doesn't require you to be disciplined in case of future mistakes.
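Here is a rough, runnable sketch of the whole scenario. Everything in it is an assumption made for illustration: zip_strict is a stand-in for the proposed zip(..., strict=True) built on itertools.zip_longest, and the bodies of generate_user_ids and process_user_ids are invented.

```python
import itertools


def zip_strict(*iterables):
    """Stand-in for the proposed zip(..., strict=True): raise on unequal lengths."""
    sentinel = object()
    for combo in itertools.zip_longest(*iterables, fillvalue=sentinel):
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


def generate_user_ids():
    """Hypothetical infinite stream of user IDs."""
    return itertools.count(1000)


def process_user_ids(usernames, user_ids):
    """Hypothetical consumer that insists on equal lengths."""
    return dict(zip_strict(usernames, user_ids))


usernames = ["ann", "bob", "cat"]

# Passing the infinite stream directly now fails loudly.
try:
    process_user_ids(usernames, generate_user_ids())
except ValueError:
    failed = True
else:
    failed = False

# The first workaround from above: trim the stream to the usernames first.
trimmed = (user_id for user_id, _ in zip(generate_user_ids(), usernames))
result = process_user_ids(usernames, trimmed)
```

When usernames is a list, itertools.islice(generate_user_ids(), len(usernames)) would be another way to do the trimming.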