On Sat, May 02, 2020 at 10:26:18PM +0200, Alex Hall wrote:
If I know that consumers of my data truncate on the shortest input, then as the producer of data I don't have to care about making them equal. I can say:
process_user_ids(usernames, generate_user_ids())
and pass an infinite stream of user IDs and know that the function will just truncate on the shortest stream. Yay! Life is good.
But now this zip flag comes along, and the author of process_user_ids decides to protect me from myself and "fail loudly", and I will curse them onto the hundredth generation for making my life more difficult.
My guess is that this kind of situation is rare and unusual. The example looks made up, is it based on something real?
Yes, it's made up, but yes, it is based on real code. I haven't had to generate user IDs yet, but I have often passed in (for example) an infinite stream of prime numbers, or some other infinite sequence. Not necessarily infinite either: it might be finite but huge, such as combinations or permutations of something.

At the consumer end, the main one that comes to mind off the top of my head (apart from code I wrote myself) is numpy, for example their correlation coefficient function:

    py> np.corrcoef([1, 2, 3, 4, 5, 6, 7, 8], [7, 6, 5, 4, 3, 2])
    Traceback (most recent call last):
      ...
    ValueError: array dimensions must agree except for d_0

Now I'm still keeping an open mind about whether this check is justified for stats functions. (There have been some requests for XY stats in the statistics module, so I may have to make a decision some day.)

But my point is, regardless of whether that check is necessary or not, *I still have to check it when I produce the data* or else I get a potentially spurious exception that will eat my data. In the general case where I have iterators not lists, no recovery is possible. I can't catch the error, trim the data, and resubmit. I have to preemptively trim the data.

Of course I recognise the right of each developer to choose for themselves whether to enforce the rule that input streams are equal. And sometimes that will be the right thing to do. But you are arguing that putting zip_strict as a builtin will encourage people to do this, and I am saying that *encouraging people to do this* is a point against it, because that will lead to the "things which aren't errors should never pass silently" Just In Case it might be an error.

If you need to enforce equal length input, then one extra line:

    from itertools import zip_strict

is no burden.
But if someone is on the fence about checking for equal lengths, and importing would be too much trouble so they don't bother, but making it a builtin is enough to tip them over into using it "just to be safe", then they probably shouldn't be using it.
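A minimal sketch of the point above about iterators, assuming a hypothetical zip_strict helper and a hypothetical consume() function (neither is an existing stdlib API). It only illustrates why catching the error and resubmitting does not work once the inputs are iterators: by the time the length mismatch is detected, the items have already been consumed.

```python
from itertools import count, islice


def zip_strict(a, b):
    """Hypothetical strict zip: raise if the two inputs differ in length."""
    sentinel = object()
    ita, itb = iter(a), iter(b)
    while True:
        x = next(ita, sentinel)
        y = next(itb, sentinel)
        if x is sentinel and y is sentinel:
            return
        if x is sentinel or y is sentinel:
            raise ValueError("inputs have different lengths")
        yield x, y


def consume(names, ids):
    # Hypothetical consumer that insists on equal-length inputs.
    return list(zip_strict(names, ids))


names = iter(["ann", "bob", "cy"])
ids = islice(count(1), 5)  # finite, but longer than names

try:
    consume(names, ids)
    failed = False
except ValueError:
    failed = True

# Both iterators were drained while the check ran, so there is
# nothing sensible left to trim and resubmit.
leftover_names = list(names)
leftover_ids = list(ids)
```

With lists the caller could slice and retry; with iterators the pairs that were already produced are gone along with the exception.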
An external API which requires you to pass parallel iterables instead of pairs is unusual and confusing.
I don't know about that. numpy's correlation coefficient supports either a single array of data pairs or a pair of separate X, Y values. So does R. Spreadsheets likewise put the X, Y data in separate cells, not in cells with two data points. If your data is coming from a CSV file, as it often does, the most natural way to get it is as two separate columns.

But for APIs that do require a single stream of pairs, then this entire discussion is irrelevant, since I, the producer of the data, can choose whatever behaviour makes sense for me:

- `zip` to truncate the data on shortest input;

- `zip_longest` to pad it;

- `zip_strict` (whether I write my own or get it from the stdlib)

or anything else I like, since I'm the producer.

So we shouldn't be talking about APIs that require a stream of pairs, but only APIs that require separate streams.

[...]
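For concreteness, the three producer-side choices listed above might look like this. The zip_strict here is a hand-written, hypothetical helper built on zip_longest, not an existing itertools function; the sample data is made up.

```python
from itertools import zip_longest

xs = [1, 2, 3, 4]
ys = [10, 20]

# 1. Truncate on the shortest input.
truncated = list(zip(xs, ys))

# 2. Pad the shorter input.
padded = list(zip_longest(xs, ys, fillvalue=None))


# 3. Refuse mismatched lengths (hypothetical helper).
def zip_strict(*iterables):
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if any(item is sentinel for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


try:
    list(zip_strict(xs, ys))
    strict_result = "ok"
except ValueError:
    strict_result = "error"
```

Whichever of the three the producer picks, the consumer of the resulting stream of pairs never sees a length mismatch.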
In most cases it wouldn't ensure the correctness of code, but it could give some peace of mind and might help readers. But also in those cases if the user decides it's redundant and not worth using:
- The user had better be confident of their judgement, which will inevitably sometimes be wrong.
That's not our problem to solve. It isn't up to us to force people to use zip_strict "just in case your judgement that it isn't needed is wrong". I think you and I have a severe disagreement on the relationship between the stdlib and the end developers. Between your comment above, and the one below, you seem to believe that it is the job of the stdlib to protect the developer from themselves. And yet here you are using Python, where we have the ability to monkey-patch the builtins namespace, shadow functions, reach into functions and manipulate their data, including their code, rebind any name, remove attributes of anything, even change the class of (some) instances on the fly. Are you sure you are using the right language?
- Even if the context of the code makes it obvious that it's redundant, that context could change in the future and introduce a silent regression, and people are likely to not think to add strict=True to the zip call somewhere down the line from their changes. Adding strict=True is more future-proof to such changes.
That's an awfully condescending thing to say. You are effectively saying that Python developers ought to use strict zip by default just in case their requirements change in the future and they are too dim-witted to remember to change the implementation to match. I will try to remember that there are legitimate uses for a strict zip, but this advocacy of *preemptive and unnecessary length checks* just makes me more hostile to adding this to the stdlib at all, let alone making it a builtin.
If I'm the producer of the data, and I want it to be equal in length, then I control the data and can make it equal.
But in your example you complain about having to do that. Is it a problem or not?
It is a problem if I have to do it for no good reason, to satisfy some arbitrary "just in case" length check. I don't mind doing so if there is a genuine good reason for it, like in the ast case.

Ultimately, the data is going to be truncated somewhere. At the moment, zip does the truncation. If people switch to strict zip, then I have to do the truncation. That's a usability regression. Unless it is justified by some functional reason, or fixing a bug, it's more harmful than helpful:

    "Function do_useful_stuff() no longer ignores excess stuff, but raises instead. Callers must now trim their To Do list before passing it to the function. No bugs were fixed by this change."

sort of thing.
I think you make this situation sound worse than it is. "I will curse them onto the hundredth generation for making my life more difficult" is pretty melodramatic.
Hyperbole is supposed to be read as humour, not literally :-)
If you get an exception because you tried:
process_user_ids(usernames, generate_user_ids())
then you can pretty easily change it to:
process_user_ids(usernames, (user_id for user_id, username in zip(generate_user_ids(), usernames)))
or if you can generate user IDs one at a time:
process_user_ids(usernames, (generate_user_id() for _ in usernames))
So I have to write ugly, inefficient code to make up for you encouraging people to add unnecessary length checks in their code? Yay, I feel so happy about this change!!! Also, it doesn't work if usernames is an iterator.
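A sketch of why the suggested workaround breaks when usernames is an iterator. The bodies of generate_user_ids() and process_user_ids() below are stand-ins invented for illustration (the thread never defines them), and the tee-based variant at the end is just one possible way around it, not something proposed in the thread.

```python
from itertools import count, tee


def generate_user_ids():
    # Stand-in for the infinite ID stream in the example.
    return count(1000)


def process_user_ids(usernames, user_ids):
    # Hypothetical consumer; here it simply pairs the two inputs.
    return list(zip(usernames, user_ids))


# With a list, the workaround trims the IDs as intended.
usernames = ["ann", "bob", "cy"]
trimmed = (uid for uid, _ in zip(generate_user_ids(), usernames))
with_list = process_user_ids(usernames, trimmed)

# With an iterator, both arguments draw from the same underlying
# stream, so names are silently eaten by the trimming generator.
usernames = iter(["ann", "bob", "cy"])
trimmed = (uid for uid, _ in zip(generate_user_ids(), usernames))
with_iterator = process_user_ids(usernames, trimmed)

# One possible fix: split the iterator with tee (which buffers items).
names_a, names_b = tee(iter(["ann", "bob", "cy"]))
trimmed = (uid for uid, _ in zip(generate_user_ids(), names_b))
with_tee = process_user_ids(names_a, trimmed)
```

In the iterator case only one pair survives and no exception is raised, which is arguably worse than either plain truncation or a loud failure.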
It's a bit inconvenient, but:
- It's pretty easy to solve the problem, certainly compared to implementing zip_equal,
You haven't solved the problem, and judging by your attempted workarounds, this could easily lead to a reduction in code quality, not an improvement.
and nothing like your example of final classes which are pretty much insurmountable.
I don't remember raising final classes in this thread. Am I losing my short-term memory or are you confusing me with someone else? (I have raised final classes in other discussions, but they're not relevant here as far as I can see.)
- An exception has clearly pointed out a problem to solve
An exception *might* have pointed out a problem to solve, but since the solution is usually to just truncate the data, and zip already does that, raising an exception may be a regression, not an enhancement.

Can we agree to meet half-way?

- there are legitimate, genuine uses for zip_strict;

- but encouraging people to change zip to zip_strict "just in case" in the absence of a genuine reason is a bad thing.

--
Steven