[Web-SIG] WSGI configuration and character encoding.

Tue Nov 30 19:35:32 CET 2004

At 06:03 PM 11/30/04 +0000, Alan Kennedy wrote:
>[Phillip J. Eby]
> > As long as the encoding is restricted to basically the same set of
> > encodings that work for Python source code, it should only be
> > necessary to have the encoding specified as a configuration variable
> > in the file.
> >
> > However, if it's considered desirable to also detect a BOM, we can
> > implement that by reading the first four bytes of the file, and then
> > either backing up if there's no BOM, or wrapping the file object with
> > the appropriate decoding wrapper before passing it to ConfigParser.
> >
> > Of course, at that point we could just as well implement the exact
> > same detection algorithm as PEP 263, except that we could also support
> > wide encodings as long as there's a BOM.
>
>I'm really, really, really, really, *really* against us trying to come up 
>with our own solution to the encoding problem. There are just too many 
>pitfalls and special cases.

You've lost me here.  I was suggesting that we use PEP 263 or a subset 
thereof.  I've seen the patches for PEP 263, and they're pretty darn 
simple, even in C!

>Take XML 1.1, for example. XML 1.0 omitted the use of the IBM EBCDIC NEL 
>character 0x85 as a permitted line terminator. XML 1.1 tried to rectify 
>that omission, and despite the fact that dozens of clever people (i.e. the 
>W3C XML working group) worked on the problem, and the spec was reviewed by 
>literally thousands of eyeballs worldwide, they *all* *still* got it wrong!
>
>XML 1.1: Dead on Arrival
>http://norman.walsh.name/2004/09/30/xml11
>
>I strongly urge that we adopt a solution that already has built-in 
>encoding support, e.g. python or XML.

If by "solution" you mean "implementation", it doesn't really solve 
anything in the XML case, because for example a specific XML library could 
be completely broken with respect to Unicode...  and many of them are!

If by "solution" you mean "detection algorithm", then I'm fine with that; 
the PEP 263 algorithm can be easily coded in Python as a front-end to 
ConfigParser.
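
A rough sketch of such a front-end, for illustration only. It assumes Python 3, and the names `detect_encoding` and `decode_config` are made up here, not part of any library. The regular expression is the one given in PEP 263; the ASCII default and the UTF-8-only BOM check are simplifying assumptions (PEP 263 also has rules for a BOM combined with a declaration, which this skips).

```python
import codecs
import re

# Declaration pattern from PEP 263: a comment containing "coding:" or
# "coding=" followed by an encoding name.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding(raw, default="ascii"):
    """Return (encoding, bom_length) for the raw bytes of a config file."""
    # A UTF-8 BOM wins outright; the caller must skip its three bytes.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    # As in PEP 263, only the first two lines may carry the declaration.
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii"), 0
    return default, 0


def decode_config(raw):
    """Decode the raw bytes of a config file into text."""
    encoding, skip = detect_encoding(raw)
    return raw[skip:].decode(encoding)
```

The decoded text could then be handed to the .ini parser unchanged, since the declaration line is just a comment as far as the parser is concerned.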

>Failing that, if we want to use ConfigParser, I see three ways forward
>
>1. Make the user specify the encoding of the config file *outside* the 
>config file itself.

I think that will lead to significant complications for simple servers 
(e.g. ones that just want to publish files in a directory).

>2. Approach ein den deutsche-enkoding-bots on python-dev, e.g. MAL or MvL, 
>and ask their advice.

Sure, although I also assume that their input has already gone into PEP 
263, so using it as-is should be fine.

>3. Spend days or weeks bending our brains about how to make ConfigParser 
>also do encodings, and about whether the proposed approach works or not.

What's wrong with PEP 263?

>Lastly, here's a wild suggestion: How about a hybrid approach? We use 
>ConfigParser and the nice .ini syntax, but we wrap it in a simple XML 
>wrapper, just so that we don't have to worry about encodings. For example
>
>#----begin----
><?xml version="1.0" encoding="windows-1252"?>
><wsgi_config>
>
>[server]
>webmaster: aláin_ó_cinnéide at spam.org
>
></wsgi_config>
>#-----end-----
>
>Ugly, but perfectly functional and trivial to implement too.

And it introduces all the problems of &lt; &gt; &amp; &quot;, too.  It'd be 
simpler to just use:

# encoding:windows-1252
[server]
webmaster: aláin_ó_cinnéide at spam.org

This is one of several valid ways to spell an encoding declaration under 
PEP 263, and it is also a valid comment in ConfigParser's .ini format.
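
As a quick check of the second point, here is a small sketch assuming Python 3's `configparser` module (the successor to the `ConfigParser` module discussed here). The sample bytes are built in memory rather than read from a file, and the encoding is passed by hand to keep the example short; detecting it from the first line is a separate step.

```python
import configparser

# The example file above, as the windows-1252 bytes it would have on disk.
SAMPLE = (
    "# encoding:windows-1252\n"
    "[server]\n"
    "webmaster: aláin_ó_cinnéide at spam.org\n"
).encode("windows-1252")

parser = configparser.ConfigParser()
# The first line starts with "#", so the parser treats it as a comment
# and skips it; only the [server] section is read.
parser.read_string(SAMPLE.decode("windows-1252"))

webmaster = parser["server"]["webmaster"]
```

The declaration costs nothing on the parsing side, and the non-ASCII characters in the value survive the round trip.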