[Python-ideas] Support WHATWG versions of legacy encodings

Sun Jan 21 11:36:58 EST 2018

> The question to my mind is whether or not this "latin1replace" handler,
> in conjunction with existing codecs, will do the same thing as the
> WHATWG codecs. If I have understood you correctly, I think it will. Have
> I missed something?

It won't do the same thing, and neither will the "chaining coders"
proposal. It's easy to miss details like this in all the counterproposals.

The difference between WHATWG encodings and the ones in Python is, in all
but one case, *only* in the C1 control character range (0x80 to 0x9F), a
range of Unicode characters that have historically evaded standardization
because they never had a clear purpose even before Unicode. Filling in all
the gaps with Latin-1 would do the right thing for, I think, 3 of the
encodings, and the wrong thing in the other 5 cases. (In the anomalous case
of Windows-1255, it would do a more explicitly wrong thing.)
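
To make the C1-only difference concrete, here is a rough sketch, not the proposed implementation. The helper name `decode_c1_filled` is hypothetical. It assumes that the WHATWG table maps each unassigned byte in 0x80 to 0x9F to the control character with the same number and leaves every other gap unassigned, which does not cover the Windows-1255 anomaly. It only makes sense for single-byte encodings.

```python
def decode_c1_filled(data, encoding="cp1252", errors="strict"):
    """Decode single-byte data, filling only C1-range gaps (hypothetical helper).

    Assumption: unassigned bytes in 0x80-0x9F become the control character
    with the same code point; any other unassigned byte is left to the
    ordinary error handler named by `errors`.
    """
    pieces = []
    for value in data:
        single = bytes([value])
        try:
            pieces.append(single.decode(encoding))
        except UnicodeDecodeError:
            if 0x80 <= value <= 0x9F:
                # Gap inside the C1 range: pass the byte through as U+0080..U+009F.
                pieces.append(chr(value))
            else:
                # Gap elsewhere: defer to the usual error handler. With
                # errors="strict" this re-raises (reporting offset 0 of the
                # one-byte slice, not the offset within `data`).
                pieces.append(single.decode(encoding, errors))
    return "".join(pieces)


# 0x80 is assigned in Python's cp1252; 0x81 is a gap in the C1 range.
filled = decode_c1_filled(b"\x80\x81")
```

Decoding byte by byte is slow and loses the original error offset; the point is only to show that the error handler still has a job to do outside the C1 range.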

Let's take Windows-1253 (Greek) as an example. Windows-1253 has a bunch of
gaps in the 0x80 to 0x9F range, like most of the others. It also has gaps
for 0xAA, 0xD2, and 0xFF. WHATWG does _not_ recommend decoding these as the
letters "ª", "Ò", and "ÿ", the characters in the equivalent positions in
Latin-1. They are simply unassigned. Other software sometimes maps them to
the Private Use Area, but this is not standardized at all, and it seems
clear that Python should handle them with its usual error handler for
unassigned bytes. (Which is one of the reasons not to replace the error
handler with something different: we still need the error handler.)
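
A small check of this against Python's existing `cp1253` codec, as a sketch: the helper `unassigned_in` is made up for illustration, and the three byte values are the ones named above.

```python
# The three gaps outside the C1 range mentioned above.
GAPS = bytes([0xAA, 0xD2, 0xFF])


def unassigned_in(encoding, candidates):
    """Return the candidate byte values that the codec refuses to decode."""
    missing = []
    for value in candidates:
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# All three bytes are unassigned in cp1253 under the default strict handler.
gaps_found = unassigned_in("cp1253", GAPS)

# Latin-1 assigns letters to the same positions...
as_latin1 = GAPS.decode("latin-1")

# ...while the usual error handlers keep working on top of cp1253.
replaced = GAPS.decode("cp1253", errors="replace")
escaped = GAPS.decode("cp1253", errors="surrogateescape")
```

Here `as_latin1` holds the three Latin-1 letters, while `replaced` and `escaped` hold whatever the chosen error handler produces for unassigned bytes.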

Of course, you could define an encoding that's Windows-1253 plus the
letters "ª", "Ò", and "ÿ", filling in all the gaps with Latin-1. It would
be weird and new (who ever heard of an encoding that has a mapping for "Ò"
but not "ò"?). One point I hope to have agreement on is that we do not want
to create _new_ legacy encodings that are not used anywhere else.
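
For comparison, this is roughly what such a fill-with-Latin-1 handler would look like. It is a hypothetical sketch: the name "latin1replace" comes from the thread, but no such handler exists in Python, and this version handles decoding only.

```python
import codecs


def latin1replace(exc):
    """Hypothetical decode-only handler: pass undefined bytes through as Latin-1."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin-1"), exc.end
    raise exc


codecs.register_error("latin1replace", latin1replace)

# 0xD2 is a gap in cp1253, so the handler turns it into U+00D2 (capital O
# with grave). 0xF2 is already assigned to a Greek letter, so the lowercase
# counterpart U+00F2 has no byte left to live in.
decoded = b"\xd2\xf2".decode("cp1253", errors="latin1replace")

try:
    "\u00f2".encode("cp1253")
    lowercase_encodable = True
except UnicodeEncodeError:
    lowercase_encodable = False
```

The result pairs U+00D2 with the Greek final sigma (U+03C2), and U+00F2 cannot be encoded at all, which is the lopsided mapping described above.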

The reason I was proposing to move ahead with a PR was not that I thought
it would be automatically accepted -- it was to have a point of reference
for exactly what I'm proposing, so we can discuss exactly what the
functional difference is between this and counterproposals without getting
lost. But I can see how writing the point of reference in PEP form instead
of PR form can be the right way to focus discussion.

Thanks for the recommendation there, and I'd like a little extra
information -- I don't know _mechanically_ how to write a PEP. (Where do I
submit it to, for example?)

-- Rob Speer

On Sun, 21 Jan 2018 at 05:44 Steven D'Aprano <steve at pearwood.info> wrote:

> On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote:
> > > It depends on what you want to achieve. You may want to fail, assign a
> > > code point from a private area or use a surrogate escape approach.
> >
> > And the way to express that is with errors='replace',
> > errors='surrogateescape', or whatever, which Python already does. We do
> > not need an explosion of error handlers. This problem can be very
> > straightforwardly solved with encodings, and error handlers can keep
> > doing their usual job on top of encodings.
> >
> > > You could also add a "latin1replace" error handler which simply passes
> > > through everything that's undefined as-is.
> >
> > Nobody asked for this.
>
> Actually, Soni L. seems to have suggested a similar idea in the thread
> titled "Chaining coders" (codecs).
>
> But what does it matter whether someone asked for it? Until this thread,
> nobody had asked for support for WHATWG encodings either.
>
> The question to my mind is whether or not this "latin1replace" handler,
> in conjunction with existing codecs, will do the same thing as the
> WHATWG codecs. If I have understood you correctly, I think it will. Have
> I missed something?
>
>
> > > I just don't want to have people start using "web-1252" as encoding
> > > simply because they are writing out text for a web application -
> > > they should use "utf-8" instead.
> >
> > I did ask for input on the name. If the problem is that you think my
> > working name for the encoding is misleading, you could help with that
> > instead of constantly trying to replace the proposal with something
> > different.
>
> Rob, you've come here with a proposal based on an actual problem (web
> pages with mojibake and broken encodings), an existing solution (a third
> party library) you dislike, and a suggested new solution you will like
> (move the encodings into the std lib). That's great, and we need more
> suggestions like this: concrete use-cases and concrete solutions.
>
> But you cannot expect that we're going to automatically agree that:
>
> - the problem is something that Python the language has to solve
>   (it seems to be a *browser* problem, not a general programming
>   problem);
>
> - the existing solution is not sufficient; and
>
> - your proposal is the right solution.
>
>
> All of these things need to be justified, and counter-proposals are part
> of that.
>
> When we make a non-trivial proposal on Python-Ideas, it is very rare
> that it is so clearly the right solution for the right problem that
> it gets instant approval and you can go straight to the PR. Often there
> are legitimate questions about all three steps. That's why I suggested
> earlier that (in my opinion) there needs to be a PEP to summarise the
> issue, justify the proposal, and counter the arguments against it.
>
> (Even if the proposal is agreed upon by everyone, if it is sufficiently
> non-trivial, we sometimes require a PEP summarising the issue for future
> reference.)
>
> As the author of one PEP myself, I know how frustrating this process can
> seem when you think that this is a bloody obvious proposal with no
> downside that all right-thinking people ought to instantly recognise as
> a great idea *wink* but nevertheless, in *my opinion* (I don't speak for
> anyone else) I think a PEP would be a good idea.
>
>
> > Guido had some very sensible feedback just a moment ago. I am wondering
> > now if we lost Guido because I broke python-ideas etiquette (is a pull
> > request not the next step, for example? I never got a good answer on the
> > process), or because this thread is just constantly being derailed.
>
> I don't speak for Guido, but it might simply be he isn't invested
> enough in *this specific issue* to spend the time wading through a
> long thread. (That's another reason why a PEP is sometimes valuable.)
> Perhaps he's still on holiday and only has limited time to spend on
> this.
>
> If I were in your position, my next step would be to write a new
> post summarising the thread so far:
>
> - a brief summary of the nature of the problem;
> - why you think a solution (whatever that solution turns out to be)
>   should be in the stdlib rather than a third-party library;
> - what you think the solution should be;
> - and give a fair critique of the alternatives suggested so far and
>   why you think that they aren't suitable.
>
> That's the same sort of information given in a PEP, but without having
> to go through the formal PEP process. That might be enough to gain
> consensus on what happens next -- and maybe even agreement that a formal
> and more detailed PEP is not needed.
>
> Oh, and in case you're thinking this is all a great PITA, it might help
> if you read these to get an understanding of why things are as they are:
>
>
> https://www.curiousefficiency.org/posts/2011/02/status-quo-wins-stalemate.html
>
>
> https://www.curiousefficiency.org/posts/2011/04/musings-on-culture-of-python-dev.html
>
>
> Good luck!
>
>
> --
> Steve