[Web-SIG] WSGI adoption
py-web-sig at xhaus.com
Tue Nov 30 09:59:25 CET 2004
>>> 3. Even if we did use ConfigParser, it still doesn't solve the lack
>>> of encoding support.
[Phillip J. Eby]
>> True, but entirely manageable for any 8-bit encoding that doesn't
>> require escaping for the characters (such as #, ; , [, =, ], :, and
>> whitespace) that ConfigParser uses for syntax. IOW, the various
>> Latin codings and UTF-8 are all fine.
> Well, it returns text as 8-bit strings, not as unicode strings. I
> think Alan wants unicode.
Well, not exactly.
I have no problem with using 8-bit strings, as long as they can be in
encodings other than ASCII or ISO-8859-1. If we can support UTF-8, then
the only problem for Indian, Korean, Chinese, Greek, Russian, etc., WSGI
users is that their configuration files use more bytes than they would
in a local encoding: they can still specify the full range of Unicode
characters using UTF-8.
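
As a rough illustration of both points (the size cost of UTF-8, and why
it does not collide with ConfigParser's syntax characters), here is a
minimal sketch in present-day Python 3, which postdates this thread. The
sample word and the choice of ISO-8859-7 as the "local encoding" are
assumptions made only for the example.

```python
# Sample text and the "local" encoding are illustrative assumptions.
sample = "Ελληνικά"

utf8_bytes = sample.encode("utf-8")
local_bytes = sample.encode("iso-8859-7")  # a Greek-specific 8-bit encoding

# UTF-8 needs more bytes than the local encoding for the same text.
size_cost = len(utf8_bytes) - len(local_bytes)

# In UTF-8, every byte of a non-ASCII character has its high bit set,
# so none of them can be mistaken for an ASCII syntax character.
syntax_bytes = b"#;[=]: \t\r\n"
high_bit_only = all(b >= 0x80 for b in utf8_bytes)
collides = any(b in syntax_bytes for b in utf8_bytes)

print(len(local_bytes), len(utf8_bytes), high_bit_only, collides)
```

The same check would not hold for an encoding such as UTF-16, where
ordinary bytes below 0x80 appear inside multi-byte characters.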
>> I imagine it would be easy enough to add -- maybe
>> enough just to open the file with an encoding specified (though being
>> able to detect the encoding would be better).
>> Or, if applied as a wrapper, you could decode all the strings after
>> they've been loaded. Maybe that's what you were thinking?
[Phillip J. Eby]
> Yes, I meant decode-after-load, specifying the encoding as one of the
> configuration variables.
I'm a little confused. When you say "specifying the encoding as one of
the configuration variables", do you mean a configuration variable that
is specified inside or outside the ConfigParser .ini configuration file?
Or somewhere else?
Obviously, if you put the encoding declaration inside the config file
itself, then you face the chicken and egg problem of needing to know
what encoding the file is in before you can decode it to find out what
its contents are, including what encoding it is in ...
XML solves this problem with the "<?xml" declaration: it is a fixed set
of characters at the very beginning of the file from which you can guess
the character encoding of the file.
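
The XML trick can be sketched roughly as follows. This is a simplified
sketch, not a conforming XML implementation: the function name
sniff_xml_encoding is hypothetical, UTF-32 and EBCDIC are deliberately
ignored, and only the first 200 bytes are inspected.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration.
DECL = re.compile(
    r"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def sniff_xml_encoding(raw):
    """Guess the encoding of an XML document from its first bytes."""
    # A byte order mark settles the question immediately.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM: the known characters "<?" reveal the encoding family.
    if raw.startswith(b"\x00<\x00?"):
        family = "utf-16-be"
    elif raw.startswith(b"<\x00?\x00"):
        family = "utf-16-le"
    else:
        family = "ascii"  # any ASCII-compatible encoding
    # The declaration itself is ASCII-only, so it can now be read safely.
    head = raw[:200].decode(family, errors="replace")
    match = DECL.match(head)
    if match:
        return match.group(1)
    return "utf-8" if family == "ascii" else family
```

For example, sniff_xml_encoding(b'<?xml version="1.0"
encoding="ISO-8859-7"?><a/>') returns the declared name, while a document
with no declaration falls back to UTF-8.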
Python solves the problem using a similar trick, as described in PEP 263
(i.e. using the magic comment string "# -*- coding: <encoding name>
-*-"). However, Python is not able to use 2-byte encodings, for example
UTF-16, because that would make the guessing algorithm too complex. From
the PEP:

    Any encoding which allows processing the first two lines in the
    way indicated above is allowed as source code encoding, this
    includes ASCII compatible encodings as well as certain
    multi-byte encodings such as Shift_JIS. It does not include
    encodings which use two or more bytes for all characters like
    e.g. UTF-16. The reason for this is to keep the encoding
    detection algorithm in the tokenizer simple.

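
For reference, present-day Python 3 (which postdates this thread) exposes
the PEP 263 detection through tokenize.detect_encoding. A small sketch,
with made-up source bytes as the input:

```python
import io
import tokenize

# Source bytes with a PEP 263 magic comment on the first line.
source = b"# -*- coding: iso-8859-7 -*-\nname = 'x'\n"
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)

# Without a magic comment (or BOM), the Python 3 default applies.
plain = b"name = 'x'\n"
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(plain).readline)

print(encoding, default_encoding)
```

detect_encoding reads at most two lines, which is why the declaration has
to be readable before the real encoding is known.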
So if we're going to use ConfigParser *and* support encodings, then we
need to either
A: Make the user specify the encoding *outside* the configuration file, or
B: Require some form of "magic string" at the top of the file so that we
can guess the encoding, and write the guessing algorithm.
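
Both options can be sketched in a few lines of present-day Python 3. This
is only a sketch under assumptions: load_config is a hypothetical helper,
the magic line borrows the PEP 263 comment form (no such convention was
agreed in this thread), and it only works for ASCII-compatible encodings.

```python
import configparser
import re

# Option B: a PEP 263 style comment on the first line (assumed convention).
MAGIC = re.compile(rb"^[ \t]*[#;].*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def load_config(raw, encoding=None, default="utf-8"):
    """Parse raw .ini bytes into a ConfigParser holding unicode strings."""
    if encoding is None:
        # Option B: guess from the magic string on the first line.
        first_line = raw.split(b"\n", 1)[0]
        match = MAGIC.match(first_line)
        encoding = match.group(1).decode("ascii") if match else default
    # Option A: the caller supplied the encoding from outside the file.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(raw.decode(encoding))
    return parser

text = "# -*- coding: iso-8859-7 -*-\n[app]\ntitle = Ελληνικά\n"
guessed = load_config(text.encode("iso-8859-7"))               # option B
explicit = load_config(text.encode("utf-8"), encoding="utf-8")  # option A
print(guessed["app"]["title"] == explicit["app"]["title"])
```

Because ConfigParser already treats lines starting with # or ; as
comments, the magic line needs no change to the file syntax itself.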
Apart from encoding issues, I have no big problem with ConfigParser. My
#1 choice is Python syntax, but I understand that it may be overly
complex for WSGI requirements.
However, I do *not* like the java.util.Properties solution to the
encoding problem: i.e. any character outside ISO-8859-1 must be written
as a Unicode escape (\uXXXX). That would mean that end-users would have
to go looking up Unicode hex character codes one by one. For
non-technical users, this is unacceptable.
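
To make the burden concrete, here is a small sketch of what such a file
would have to contain. The helper name to_properties_escapes and the
sample entry are assumptions for illustration; the sketch escapes
everything outside ASCII, which is stricter than the format requires.

```python
def to_properties_escapes(text):
    """Rewrite non-ASCII characters as \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # Escapes are per UTF-16 code unit, so characters outside the
        # Basic Multilingual Plane become two escapes (a surrogate pair).
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)

# An illustrative entry: readable as typed, opaque once escaped.
escaped = to_properties_escapes("title=Ελληνικά")
print(escaped)
```

An eight-letter Greek word turns into eight six-character escapes, none
of which a user could reasonably type or proofread by hand.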