[Web-SIG] WSGI adoption

Tue Nov 30 17:00:49 CET 2004

At 08:59 AM 11/30/04 +0000, Alan Kennedy wrote:
>[Phillip J. Eby]
> > Yes, I meant decode-after-load, specifying the encoding as one of the
> > configuration variables.
>
>I'm a little confused. When you say "specifying the encoding as one of the 
>configuration variables", do you mean a configuration variable that is 
>specified inside or outside the ConfigParser .ini configuration file? Or 
>somewhere else?
>
>Obviously, if you put the encoding declaration inside the config file 
>itself, then you face the chicken and egg problem of needing to know what 
>encoding the file is in before you can decode it to find out what its 
>contents are, including what encoding it is in .......

That was why I said it would only work for encodings that don't require 
escaping [, ], #, ;, =, and whitespace.

>XML solves this problem with the "<?xml" declaration: it is a fixed set of 
>characters at the very beginning of the file from which you can guess the 
>character encoding of the file. More here
>
>http://www.w3.org/TR/REC-xml/#sec-guessing

Reading this section makes it seem to me that we can easily support:

"""UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any 
other 7-bit, 8-bit, or mixed-width encoding which ensures that the 
characters of ASCII have their normal positions, width, and values """

...as long as the configuration keys (and the string specifying the 
encoding) are guaranteed to be ASCII.  It seems to me that most of the 
Asian codecs use unusual characters for escaping, such as $, \, and the 
ASCII escape character, so it shouldn't be too hard to steer clear of these 
in our keys.
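
A rough way to screen a codec for this property, as a sketch only: the
helper name keeps_ascii_structure and the exact character set are my own
assumptions, not part of any proposal. It checks only that the structural
characters encode to their plain ASCII bytes. It does not prove that those
bytes never turn up inside a multi-byte sequence, which some of the
mixed-width encodings allow.

```python
# Characters ConfigParser treats as structure: section brackets, comment
# markers, the key/value delimiter, and whitespace.
STRUCTURAL = "[]#;= \t\r\n"


def keeps_ascii_structure(encoding):
    """Return True if `encoding` leaves the structural characters as ASCII bytes."""
    try:
        return STRUCTURAL.encode(encoding) == STRUCTURAL.encode("ascii")
    except (LookupError, UnicodeError):
        # Unknown codec, or one that cannot represent these characters.
        return False
```

Codecs such as utf-8 and latin-1 pass this check; utf-16 does not, since
it emits a BOM and two bytes per character.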

I would also recommend that application authors inform their users if their 
deployment files are in an encoding that is not bundled with Python.

>So if we're going to use ConfigParser *and* support encodings, then we 
>need to either
>
>A: Make the user specify the encoding *outside* the configuration file
>B: Require some form of "magic string" at the top of the file so that we 
>can guess the encoding. And write the guessing algorithm.

As long as the encoding is restricted to basically the same set of 
encodings that work for Python source code (that is, ASCII-compatible 
ones), it should only be necessary to have the encoding specified as a 
configuration variable in the file.
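
To make that concrete, here is a minimal sketch using the modern
configparser module. It is a two-pass variant rather than a literal
decode-after-load: the first pass reads the bytes as latin-1 just to find
the encoding name, and the second pass parses the properly decoded text.
The function name load_config and the key name "encoding" (looked up in
[DEFAULT]) are assumptions made for illustration.

```python
import configparser


def load_config(data, default_encoding="ascii"):
    """Parse ini-style bytes whose encoding is named by a key inside them."""
    # Pass 1: latin-1 maps every byte to one code point, so decoding never
    # fails, and the ASCII keys and the encoding name stay readable.
    probe = configparser.ConfigParser(interpolation=None)
    probe.read_string(data.decode("latin-1"))
    encoding = probe.defaults().get("encoding", default_encoding)

    # Pass 2: decode the whole file with the declared encoding and parse it
    # again, so values come back as correctly decoded text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(data.decode(encoding))
    return parser
```

This only works when the section names, keys, and the encoding name are
pure ASCII and the encoding keeps ASCII in its normal positions, which is
the restriction described above.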

However, if it's considered desirable to also detect a BOM, we can 
implement that by reading the first four bytes of the file, and then either 
backing up if there's no BOM, or wrapping the file object with the 
appropriate decoding wrapper before passing it to ConfigParser.
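
The BOM check could look roughly like the following. This is a sketch
under my own assumptions: the names wrap_with_bom and read_config are
hypothetical, the fallback encoding of "ascii" is just a placeholder, and
it uses io.TextIOWrapper as the decoding wrapper.

```python
import codecs
import configparser
import io

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def wrap_with_bom(raw, fallback="ascii"):
    """Wrap a seekable binary file in a text decoder chosen from its BOM."""
    head = raw.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            raw.seek(len(bom))  # position just past the BOM
            break
    else:
        raw.seek(0)  # no BOM: back up to the start of the file
        encoding = fallback
    return io.TextIOWrapper(raw, encoding=encoding)


def read_config(raw, fallback="ascii"):
    """Parse a binary file object, honoring a leading BOM if present."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_file(wrap_with_bom(raw, fallback))
    return parser
```

With a BOM present, even the wide encodings work, because ConfigParser
only ever sees decoded text.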

Of course, at that point we could just as well implement the exact same 
detection algorithm as PEP 263, except that we could also support wide 
encodings as long as there's a BOM.
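
For reference, a PEP 263-style declaration check adapted to ini files
might be sketched as below. The pattern follows the one PEP 263 gives for
source files; accepting ";" as well as "#" as the comment marker, the
function name detect_encoding, and the "ascii" default are my assumptions.

```python
import codecs
import re

# PEP 263 looks for "coding[:=]" in a comment on the first or second line.
# Allowing ";" comments is an adaptation for ini files.
CODING_RE = re.compile(rb"^[ \t\f]*[#;].*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding(data, default="ascii"):
    """Guess the encoding of ini-style bytes from a BOM or a coding comment."""
    if data.startswith(codecs.BOM_UTF8):
        # A UTF-8 BOM implies UTF-8; "utf-8-sig" also strips the BOM on decode.
        return "utf-8-sig"
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default
```

A file starting with "# -*- coding: latin-1 -*-" would then be decoded as
latin-1 before being handed to ConfigParser.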