[Python-Dev] String encoding

M.-A. Lemburg mal@lemburg.com
Tue, 23 May 2000 16:47:40 +0200


Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> > [...]
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> 
> agreed.
> 
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> 
> agreed.
> 
> note that parsing LANG (etc) variables on a POSIX platform is
> easy enough to do in Python (either in site.py or in locale.py).
> no need for external support modules for Unix, in other words.

Agreed... locale.py (together with the _locale builtin module) is
probably the right place to put such a parser.
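
A minimal sketch of what such a parser could look like, assuming the
usual language[_territory][.codeset][@modifier] layout of the locale
variables. The function name encoding_from_env and the choice to
return None when no codeset is present are illustrative assumptions,
not an existing locale.py API:

```python
import os

def encoding_from_env(environ=None):
    """Return the codeset named in the POSIX locale variables, or None."""
    if environ is None:
        environ = os.environ
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if not value:
            continue
        # Drop an optional "@modifier" suffix such as "@euro".
        value = value.split("@", 1)[0]
        if "." in value:
            codeset = value.split(".", 1)[1]
            return codeset.lower() or None
        # The first variable that is set wins, even if it names no
        # codeset (e.g. "C" or "de_DE").
        return None
    return None
```

The result is only the raw codeset string; mapping it to a codec name
that Python knows would still be up to the caller.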
 
> for windows, I suggest adding GetACP() to the _locale module,
> and let the glue layer (site.py or locale.py) do:
> 
>     if sys.platform == "win32":
>         sys.setstringencoding("cp%d" % GetACP())
> 
> on mac, I think you can determine the encoding by inspecting the
> system font, and fall back to "macroman" if that doesn't work out.
> but figuring out the right way to do that is best left to anyone who
> actually has access to a Mac.  in the meantime, just make it:
> 
>     elif sys.platform == "mac":
>         sys.setstringencoding("macroman")
> 
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis.
> 
> Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
> that the vast majority of non-Mac platforms are either modern Unixes
> or Windows boxes, that makes a lot more sense than US ASCII...
> 
> in other words:
> 
>     else:
>         # try to determine encoding from POSIX locale environment
>         # variables
>         ...
>         # if no encoding could be determined:
>         sys.setstringencoding("iso-latin-1")

That's a different topic which I don't want to revive ;-)

With the above tools you can easily code the latin-1 default
into your site.py.
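
As a rough sketch of how the glue code in site.py might choose the
name to hand to sys.setstringencoding(): the helper below is
hypothetical, and its parameters (acp for the value the proposed
GetACP() would return, locale_encoding for whatever the locale parser
found, fallback for the site-wide default) are assumptions made for
illustration only.

```python
def pick_default_encoding(platform, acp=None, locale_encoding=None,
                          fallback="ascii"):
    """Choose an encoding name using the per-platform rules quoted above."""
    if platform == "win32" and acp is not None:
        # acp stands in for the result of the proposed GetACP() call
        return "cp%d" % acp
    if platform == "mac":
        # stop-gap until someone with a Mac works out the proper check
        return "macroman"
    if locale_encoding:
        # whatever was derived from the POSIX locale variables
        return locale_encoding
    # nothing platform-specific known: use the site-wide default
    return fallback
```

A site that prefers the Latin-1 default would simply pass
fallback="iso-8859-1" here.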

> > Furthermore, the encoding should be settable on a per thread basis
> > inside the interpreter (Python threads do not seem to inherit any
> > per-thread globals, so the encoding would have to be set for all
> > new threads).
> 
> is the C/POSIX locale setting thread specific?

Good question -- I don't know.

> if not, I think the default encoding should be a global setting, just
> like the system locale itself.  otherwise, you'll just be addressing a
> real problem (thread/module/function/class/object specific locale
> handling), but not really solving it...
>
> better use unicode strings and explicit encodings in that case.

Agreed.
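
To illustrate why a per-thread setting would have to be repeated for
every new thread, here is a small sketch. It relies on
threading.local, which is an assumption on my part since it belongs
to later Python versions than the one discussed here; the attribute
name encoding is purely illustrative.

```python
import threading

settings = threading.local()
settings.encoding = "latin-1"   # set in the main thread only

seen = []

def worker():
    # A new thread starts with an empty thread-local namespace, so the
    # value set by the main thread is not visible here.
    seen.append(getattr(settings, "encoding", None))

t = threading.Thread(target=worker)
t.start()
t.join()
```

After the join, seen holds what the worker observed, while the main
thread still sees its own value in settings.encoding.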
 
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8.
> 
> can this be fixed?  or rather, what changes to the buffer api
> are required if we want to work around this problem?

The problem is that "s" and "t" return C pointers to some
internal data structure of the object. It has to be assured
that this data remains intact at least as long as the object
itself exists.

AFAIK, this cannot be fixed without creating a memory leak.
 
The "es" parser marker uses a different strategy, BTW: the
data is copied into a buffer, thus detaching the object
from the data.
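
The difference between the two strategies can be mimicked in pure
Python. This is only an analogy and not the C API itself: a
memoryview plays the role of the pointer handed out by "s" and "t",
and a bytes copy plays the role of the buffer filled by "es".

```python
data = bytearray(b"abc")

borrowed = memoryview(data)   # shares memory with data, like a raw pointer
copied = bytes(data)          # independent copy, detached from data

data[0] = ord("x")            # the owner modifies its internal storage

borrowed_now = bytes(borrowed)   # reflects the change made to data
copied_now = copied              # unaffected by the change
```

The borrowed view is only valid while the owner keeps its data
intact; the copy is not tied to the owner at all.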

> > C APIs which want to support Unicode should be fixed to use
> > "es" or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> 
> for convenience, it might be a good idea to have a "wide system
> encoding" too, and special parser markers for that purpose.
> 
> or can we assume that all wide system API's use unicode all the
> time?

At least in all references I've seen (e.g. ODBC, wchar_t
implementations, etc.) "wide" refers to Unicode.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/