[Web-SIG] WSGI configuration and character encoding.
Phillip J. Eby
pje at telecommunity.com
Tue Nov 30 19:35:32 CET 2004
At 06:03 PM 11/30/04 +0000, Alan Kennedy wrote:
>[Phillip J. Eby]
> > As long as the encoding is restricted to basically the same set of
> > encodings that work for Python source code, it should only be
> > necessary to have the encoding specified as a configuration variable
> > in the file.
> > However, if it's considered desirable to also detect a BOM, we can
> > implement that by reading the first four bytes of the file, and then
> > either backing up if there's no BOM, or wrapping the file object with
> > the appropriate decoding wrapper before passing it to ConfigParser.
> > Of course, at that point we could just as well implement the exact
> > same detection algorithm as PEP 263, except that we could also support
> > wide encodings as long as there's a BOM.
>I'm really, really, really, really, *really* against us trying to come up
>with our own solution to the encoding problem. There are just too many
>pitfalls and special cases.
You've lost me here. I was suggesting that we use PEP 263 or a subset
thereof. I've seen the patches for PEP 263, and they're pretty darn
simple, even in C!
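For concreteness, here is a rough sketch of the BOM check described above: read the first four bytes, then either back up or wrap the file in a decoder. It is written for present-day Python 3 and is not taken from the PEP 263 patches. The helper name open_with_bom_detection and the "ascii" fallback are assumptions made for illustration, and it expects a seekable binary file object.

```python
import codecs
import io

# Checked longest-first so the UTF-32 LE BOM (ff fe 00 00) is not
# mistaken for the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def open_with_bom_detection(raw, default="ascii"):
    """Hypothetical helper: return a text reader for a seekable binary file."""
    start = raw.tell()
    head = raw.read(4)                      # the first four bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            raw.seek(start + len(bom))      # skip the BOM, keep the rest
            return io.TextIOWrapper(raw, encoding=encoding)
    raw.seek(start)                         # no BOM: back up
    return io.TextIOWrapper(raw, encoding=default)
```

The returned object is an ordinary text file, so it could be handed to a config parser's file-reading method without the parser knowing anything about encodings.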
>Take XML 1.1, for example. XML 1.0 omitted the use of the IBM EBCDIC NEL
>character 0x85 as a permitted line terminator. XML 1.1 tried to rectify
>that omission, and despite the fact that dozens of clever people (i.e. the
>W3C XML working group) worked on the problem, and the spec was reviewed by
>literally thousands of eyeballs worldwide, they *all* *still* got it wrong!
>XML 1.1: Dead on Arrival
>I strongly urge that we adopt a solution that already has built-in
>encoding support, e.g. python or XML.
If by "solution" you mean "implementation", it doesn't really solve
anything in the XML case, because for example a specific XML library could
be completely broken with respect to Unicode... and many of them are!
If by "solution" you mean "detection algorithm", then I'm fine with that;
the PEP 263 algorithm can be easily coded in Python as a front-end to
ConfigParser.
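A minimal sketch of what a PEP 263-style front-end might look like, assuming Python 3's configparser module. The function names detect_encoding and read_config are hypothetical, the "ascii" default is an assumption, and this is only a subset of PEP 263: it honours a UTF-8 BOM and a coding comment in the first two lines, but does not check that the two agree.

```python
import codecs
import configparser
import re

# The pattern given in PEP 263: a comment containing "coding:" or "coding=".
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding(data, default="ascii"):
    """Hypothetical helper: return (encoding, bom_length) for raw bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    # PEP 263 only inspects the first two lines of the file.
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii"), 0
    return default, 0


def read_config(data):
    """Decode the bytes first, then hand plain text to ConfigParser."""
    encoding, skip = detect_encoding(data)
    parser = configparser.ConfigParser()
    parser.read_string(data[skip:].decode(encoding))
    return parser
```

The point of the design is that ConfigParser itself stays untouched; all of the encoding logic lives in front of it.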
>Failing that, if we want to use ConfigParser, I see three ways forward
>1. Make the user specify the encoding of the config file *outside* the
>config file itself.
I think that will lead to significant complications for simple servers
(e.g. ones that just want to publish files in a directory)
>2. Approach ein den deutsche-enkoding-bots on python-dev, e.g. MAL or MvL,
>and ask their advice.
Sure, although I also assume that their input has already gone into PEP
263, so using it as-is should be fine.
>3. Spend days or weeks bending our brains about how to make ConfigParser
>also do encodings, and about whether the proposed approach works or not.
What's wrong with PEP 263?
>Lastly, here's a wild suggestion: How about a hybrid approach? We use
>ConfigParser and the nice .ini syntax, but we wrap it in a simple XML
>wrapper, just so that we don't have to worry about encodings. For example
><?xml version="1.0" encoding="windows-1252"?>
>webmaster: aláin_ó_cinnéide at spam.org
>Ugly, but perfectly functional and trivial to implement too.
And it introduces all the problems of < > & ", too. It'd be
simpler to just use:
# -*- coding: windows-1252 -*-
webmaster: aláin_ó_cinnéide at spam.org
Which is one of several valid ways to spell an encoding declaration under
PEP 263, that is also a valid comment in ConfigParser .ini format.
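To illustrate the claim, the sketch below checks a few comment spellings that the PEP 263 pattern accepts and confirms that Python 3's configparser skips each of them as a comment. The "[wsgi]" section name, the sample address, and the check_spelling helper are assumptions made up for this example.

```python
import configparser
import re

# The pattern given in PEP 263, applied here to text rather than bytes.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

SPELLINGS = [
    "# -*- coding: windows-1252 -*-",
    "# vim: set fileencoding=windows-1252 :",
    "# This file uses the following encoding: windows-1252",
]


def check_spelling(first_line):
    """Return the declared encoding plus what ConfigParser saw in the file."""
    match = CODING_RE.match(first_line)
    parser = configparser.ConfigParser()
    # The declaration sits on the first line, ahead of any section header.
    parser.read_string(first_line + "\n[wsgi]\nwebmaster: someone at example.org\n")
    return match.group(1), parser.sections(), dict(parser["wsgi"])


for line in SPELLINGS:
    print(check_spelling(line))
```

In each case the declaration is visible to the encoding detector but invisible to the parser, which only reports the one section and its one option.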