[Web-SIG] WSGI adoption

Tue Nov 30 09:59:25 CET 2004

[Alan Kennedy]
 >>> 3. Even if we did use ConfigParser, it still doesn't solve the lack
 >>> of encoding support.

[Phillip J. Eby]
 >> True, but entirely manageable for any 8-bit encoding that doesn't
 >> require escaping for the characters (such as #, ; , [, =, ], :, and
 >> whitespace) that ConfigParser uses for syntax.  IOW, the various
 >> Latin codings and UTF-8 are all fine.

[Ian Bicking]
 > Well, it returns text as 8-bit strings, not as unicode strings.  I
 > think Alan wants unicode.

Well, not exactly.

I have no problem with using 8-bit strings, as long as they can be in 
encodings other than ASCII or ISO-8859-1. If we can support UTF-8, then 
the only problem for Indian, Korean, Chinese, Greek, Russian, etc. WSGI 
users is that their configuration files use more bytes than they would 
in a local encoding: they can still specify the full range of Unicode 
characters using UTF-8.
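
To put a rough number on "more bytes", here is a small sketch comparing 
a short Greek string in UTF-8 and in a local 8-bit encoding. The sample 
text and the choice of ISO-8859-7 are only my illustration, and the 
syntax is that of current Python.

```python
# Sample value a Greek user might put in a configuration file
# (illustrative text only).
text = "αβγδ"

utf8_bytes = text.encode("utf-8")         # two bytes per Greek letter
local_bytes = text.encode("iso-8859-7")   # one byte per Greek letter

print(len(text), len(utf8_bytes), len(local_bytes))

# Both encodings round-trip to the same text; only the size differs.
assert utf8_bytes.decode("utf-8") == local_bytes.decode("iso-8859-7")
```

So the cost is file size only; nothing is lost by using UTF-8.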

[Ian Bicking]
 >> I imagine it would be easy enough to add -- maybe
 >> enough just to open the file with an encoding specified (though being
 >> able to detect the encoding would be better).
 >> Or, if applied as a wrapper, you could decode all the strings after
 >> they've been loaded.  Maybe that's what you were thinking?

[Phillip J. Eby]
 > Yes, I meant decode-after-load, specifying the encoding as one of the
 > configuration variables.

I'm a little confused. When you say "specifying the encoding as one of 
the configuration variables", do you mean a configuration variable that 
is specified inside or outside the ConfigParser .ini configuration file? 
Or somewhere else?

Obviously, if you put the encoding declaration inside the config file 
itself, then you face the chicken-and-egg problem of needing to know 
what encoding the file is in before you can decode it to find out what 
its contents are, including what encoding it is in ...

XML solves this problem with the "<?xml" declaration: it is a fixed set 
of characters at the very beginning of the file from which you can guess 
the character encoding of the file. More here:

http://www.w3.org/TR/REC-xml/#sec-guessing

Python solves the problem using a similar trick, as described in PEP 263 
(i.e. using the magic comment string "# -*- coding: <encoding name> -*-"). 
However, Python is not able to use encodings that need two or more bytes 
for every character, for example UTF-16, because that would make the 
guessing algorithm too complex. From the PEP:

"""
Any encoding which allows processing the first two lines in the
way indicated above is allowed as source code encoding, this
includes ASCII compatible encodings as well as certain
multi-byte encodings such as Shift_JIS. It does not include
encodings which use two or more bytes for all characters like
e.g. UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple.
"""
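
For illustration, current Python exposes this detection step as 
tokenize.detect_encoding(). The sketch below wraps it in a helper, 
declared_encoding(), which is a name I made up for this example. It 
shows an ASCII-compatible declaration being found and a UTF-16 file 
being rejected.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the PEP 263 encoding of some source bytes, or None when
    the tokenizer cannot read a declaration from them."""
    readline = io.BytesIO(source_bytes).readline
    try:
        encoding, _lines_read = tokenize.detect_encoding(readline)
    except SyntaxError:
        return None
    return encoding

# An ASCII-compatible encoding: the magic comment is readable as-is.
greek_source = b"# -*- coding: iso-8859-7 -*-\nname = 'x'\n"

# The same kind of declaration stored as UTF-16 (with a byte order mark):
# the first line can no longer be read byte-for-byte.
utf16_source = "# -*- coding: utf-16 -*-\n".encode("utf-16")

print(declared_encoding(greek_source))
print(declared_encoding(utf16_source))
```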

So if we're going to use ConfigParser *and* support encodings, then we 
need to either:

A: Make the user specify the encoding *outside* the configuration file, or
B: Require some form of "magic string" at the top of the file so that we 
can guess the encoding, and write the guessing algorithm ourselves.
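
A minimal sketch of both options, using the configparser module of 
current Python. The function names guess_encoding() and load_config(), 
the UTF-8 default, and the reuse of the PEP 263 comment as the "magic 
string" are all my assumptions; nothing like this exists in 
ConfigParser itself. As with PEP 263, the guess only works for 
ASCII-compatible encodings.

```python
import configparser
import re

# Hypothetical magic comment, borrowed from PEP 263. It must be the first
# line of the file and start with one of ConfigParser's comment characters.
_CODING_RE = re.compile(rb"^[ \t]*[#;].*?coding[:=][ \t]*([-\w.]+)")

def guess_encoding(raw, default="utf-8"):
    """Option B: read a coding declaration from the first line of the
    raw bytes, falling back to `default` when there is none."""
    first_line = raw.split(b"\n", 1)[0]
    match = _CODING_RE.match(first_line)
    if match:
        return match.group(1).decode("ascii")
    return default

def load_config(raw, encoding=None):
    """Option A when `encoding` is passed in from outside the file,
    option B (guess from the file itself) when it is not."""
    if encoding is None:
        encoding = guess_encoding(raw)
    parser = configparser.ConfigParser()
    parser.read_string(raw.decode(encoding))
    return parser

# Option A: the caller supplies the encoding.
raw_a = "[app]\ntitle = αβγ\n".encode("iso-8859-7")
config_a = load_config(raw_a, encoding="iso-8859-7")

# Option B: the file declares its own encoding on the first line.
raw_b = "# -*- coding: iso-8859-7 -*-\n[app]\ntitle = αβγ\n".encode("iso-8859-7")
config_b = load_config(raw_b)

print(config_a["app"]["title"], config_b["app"]["title"])
```

Option A keeps the loader trivial but pushes the problem onto whoever 
invokes it; option B keeps the file self-describing at the cost of the 
guessing code.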

Apart from encoding issues, I have no big problem with ConfigParser. My 
#1 choice is Python syntax, but I understand that it may be overly 
complex for WSGI requirements.

However, I do *not* like the java.util.Properties solution to the 
encoding problem: i.e. any character that isn't 8-bit must be specified 
with a Unicode escape. Which would mean that end-users would have to go 
looking up Unicode hex/decimal character codes one by one. For 
non-technical users, this is unacceptable.
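
To show what that means in practice, here is a rough sketch of what a 
user would have to type. The helper to_properties_escapes() is a name I 
made up; it is not part of Java or Python. It writes every character 
above 0xFF as a \uXXXX escape, one escape per UTF-16 code unit.

```python
def to_properties_escapes(text):
    """Rewrite `text` so every character that does not fit in 8 bits
    becomes one or more \\uXXXX escapes (hypothetical helper)."""
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            out.append(ch)
            continue
        # Characters outside the 8-bit range: emit each UTF-16 code unit,
        # so characters beyond U+FFFF become a surrogate pair.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%02X%02X" % (units[i], units[i + 1]))
    return "".join(out)

escaped = to_properties_escapes("title=αβγ")
print(escaped)
```

Three Greek letters already turn into eighteen characters of hex codes, 
none of which the user can read back at a glance.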

Regards,

Alan.