[Python-ideas] Support WHATWG versions of legacy encodings
Steven D'Aprano
steve at pearwood.info
Fri Feb 2 01:52:39 EST 2018
On Thu, Feb 01, 2018 at 10:20:00AM +0100, M.-A. Lemburg wrote:
> In general, we have only added new encodings when there was an encoding
> missing which a lot of people were actively using. We asked for
> official documentation defining the mappings, references showing
> usage and IANA or similar standard names to use for the encoding
> itself and its aliases.
[...]
> Now the OP comes proposing to add a whole set of encodings which
> only differ slightly from our existing ones. Backing is their
> use and definition by WHATWG, a consortium of browser vendors
> who are interested in showing web pages to users in a consistent
> way.
That gives us a defined mapping, references showing usage, but (alas)
not standard names, due to the WHATWG's (foolish and arrogantly
obnoxious, in my opinion) decision to re-use the standard names for the
non-standard usages.
Two out of three seems like a reasonable start to me.
But one thing we haven't really discussed is: why is this an issue for
Python? Everything I've seen so far suggests that these standards are
only for browsers and/or web scrapers. That seems fairly niche to me. If
you're writing a browser in Python, surely it isn't too much to ask that
you import a set of codecs from a third party library?
If I've missed something, please say so.
> We also have the naming issue, since WHATWG chose to use
> the same names as the standard mappings. Anything we'd
> define will neither match WHATWG nor any other encoding
> standard name, so we'd be creating a new set of encoding
> names - which is really not what the world is after,
> including WHATWG itself.
I hear you, but I think this is a comparatively minor objection. I don't
think it is a major problem for usability if we were to call these
encodings "spam-whatwg" instead of "spam". It isn't difficult for
browser authors to write:
    encoding = get_document_encoding()
    if config.USE_WHATWG_ENCODINGS:
        encoding += '-whatwg'
or otherwise look the encodings up in a mapping. We could even provide
that mapping in the codecs module:
    encoding = codecs.whatwg_mapping.get(encoding, encoding)
So the naming issue shouldn't be more than a minor nuisance, and one we
can entirely place in the lap of the WHATWG for misusing standard names.
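To make the idea concrete, here is a minimal sketch of such a lookup. The `whatwg_mapping` name, the `-whatwg` suffix and the specific entries are all hypothetical: nothing like this exists in the stdlib today, and the exact set of mappings would need to follow the WHATWG Encoding Standard's label tables.

```python
# Hypothetical mapping from standard encoding labels to WHATWG-variant
# codec names. Neither this dict nor the "-whatwg" codecs exist today;
# this only illustrates the lookup a browser author would do.
whatwg_mapping = {
    "windows-1252": "windows-1252-whatwg",
    "iso-8859-1": "windows-1252-whatwg",  # WHATWG treats this label as cp1252
    "koi8-r": "koi8-r-whatwg",
}

def resolve_encoding(name, use_whatwg=False):
    """Return the codec name to use for a document's declared encoding."""
    if use_whatwg:
        # Fall back to the standard name when no WHATWG variant exists.
        return whatwg_mapping.get(name, name)
    return name
```

With that in place, browser code only ever deals in standard labels and opts into the WHATWG variants at one well-defined point.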
Documentation-wise, I'd argue for placing these in a separate
sub-section of the codecs docs, with a strong notice that they should
only be used for decoding web documents and not for creating new
documents (except for testing purposes).
> People would start creating encoded text using these new
> encoding names, resulting in even more mojibake out there
> instead of fixing the errors in the data and using Unicode
> or UTF-8 for interchange.
We can't stop people from doing that: so long as the encodings exist as
a third-party package, people who really insist on creating such
abominable documents can do so. Just as they currently can accidentally
create mojibake in their own documents by misunderstanding encodings, or
as they can create new documents using legacy encodings like MacRoman
instead of UTF-8 like they should.
(And very occasionally, they might even have a good reason for doing so
-- while we can and should *discourage* such uses, we cannot and should
not expect to prohibit them.)
If it were my decision, I'd have these codecs raise a warning (not an
error) when used for encoding. But I guess some people will consider
that either going too far or not far enough :-)
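For what it's worth, nothing new would be needed in the interpreter to do that: the existing `codecs` machinery already supports wrapping a codec so that encoding emits a warning. The helper name and codec name below are my invention for this sketch; it simply wraps `cp1252` to show the mechanism.

```python
import codecs
import warnings

def register_decode_preferred(base_name, new_name):
    """Register a wrapper codec whose encoder works but emits a warning.

    A sketch of the idea only; no such codec or helper exists in the
    stdlib.
    """
    base = codecs.lookup(base_name)

    def encode(input, errors="strict"):
        warnings.warn(
            "%s is intended for decoding legacy web content, "
            "not for producing new documents" % new_name,
            UserWarning,
            stacklevel=2,
        )
        return base.encode(input, errors)

    info = codecs.CodecInfo(encode=encode, decode=base.decode, name=new_name)
    codecs.register(lambda name: info if name == new_name else None)

# Hypothetical codec name; underscores because codec lookup normalises
# names that way before calling search functions.
register_decode_preferred("cp1252", "cp1252_whatwg_demo")
```

Decoding through the wrapper stays silent; only `str.encode()` triggers the warning.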
> As I mentioned before, we could disable encoding in the new
> mappings to resolve this concern, but the OP wasn't interested
> in such an approach. As alternative approach we proposed error
> handlers, which are the normal technology to use when dealing
> with encoding errors. Again, the OP wasn't interested.
Be fair: it isn't that the OP (Rob Speer) merely isn't interested, he
does make some reasonable arguments that error handlers are the wrong
solution. He's convinced me that an error handler isn't the right way to
do this.
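For reference, the error-handler approach under discussion would look roughly like this. WHATWG's windows-1252 maps the five bytes that Python's cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control characters with the same ordinals, which a decode error handler can emulate; the handler name "c1fallback" is mine, not anything registered by default.

```python
import codecs

def c1_control_fallback(exc):
    """On an undecodable byte, substitute the C1 control character with
    the same ordinal, as the WHATWG windows-1252 table does."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(byte), exc.start + 1)
    raise exc

# Handler name is hypothetical, chosen for this sketch.
codecs.register_error("c1fallback", c1_control_fallback)

# b"\x81" is undefined in Python's cp1252 but decodes to U+0081 here.
text = b"abc\x81def".decode("cp1252", errors="c1fallback")
```

The catch, as Rob Speer points out, is that every call site has to remember to pass the handler, and it silently changes behaviour for bytes that genuinely are errors in other charmap codecs.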
He *hasn't* convinced me that the stdlib needs to solve this problem,
but if it does, I think some new encodings are the right way to do it.
> Please also note that once we start adding, say
> "whatwg-<original name>" encodings (or rather decodings :-),
> going for the simple charmap encodings first, someone
> will eventually also request addition of the more complex
> Asian encodings which WHATWG defines. Maintaining these
> is hard, since they require writing C code for performance
> reasons and to keep the mapping tables small.
YAGNI -- we can deal with that when and if it gets requested. This is
not the camel's nose: adding a handful of 8-bit WHATWG encodings does
not oblige us to add more.
[...]
> There are quite a few downsides to consider
Indeed -- this isn't a "no-brainer". That's why I'm still hoping to see
a fair and balanced PEP.
> and since the OP
> is not interested in going for a compromise as described above,
> I don't see a way forward.
Status quo wins a stalemate. Sometimes that's better than a broken
solution that won't satisfy anyone.
--
Steve