[Python-Dev] String encoding

Tue, 23 May 2000 13:38:41 +0200

M.-A. Lemburg wrote:
> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.

before proceeding down this (not very slippery but slightly
unfortunate, imho) slope, I think we should decide whether

    assert eval(repr(s)) == s

should be true for strings.
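
to make the question concrete, here is a minimal sketch of that round-trip check. it runs on a modern interpreter and uses bytes objects as a stand-in for 8-bit strings (that substitution is my assumption, as is the helper name `roundtrips`):

```python
# check that repr() output can be parsed back to an equal object.
# bytes objects stand in for 8-bit strings here (an assumption).

def roundtrips(s):
    """Return True if evaluating repr(s) gives back an equal object."""
    return eval(repr(s)) == s

# every single-byte value, printable or not
all_bytes = [bytes([i]) for i in range(256)]
failures = [b for b in all_bytes if not roundtrips(b)]
```

if 'repr' starts emitting locale-dependent characters, this is the property the parser has to keep up with.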

if this isn't important, nothing stops you from changing 'repr'
to use isprint, without having to make sure that you can still
parse the resulting string.

but if it is important, you cannot really change 'repr' without
addressing the big issue.

so assuming that the assertion must hold, and that changing
'repr' to be locale-dependent is a good idea, let's move on:

> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.

agreed.

> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.

agreed.

note that parsing LANG (etc) variables on a POSIX platform is
easy enough to do in Python (either in site.py or in locale.py).
no need for external support modules for Unix, in other words.
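
a rough sketch of what that parsing could look like, assuming the usual language[_territory][.codeset][@modifier] layout and the LC_ALL, LC_CTYPE, LANG precedence. the function name `encoding_from_environ` is made up for illustration, and real systems have locale aliases this does not handle:

```python
import os

def encoding_from_environ(environ=None):
    """Return the codeset named in the POSIX locale variables, or None."""
    if environ is None:
        environ = os.environ
    # LC_ALL overrides LC_CTYPE, which overrides LANG
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if not value:
            # unset or empty: look at the next variable
            continue
        # drop an "@modifier" suffix, then look for ".codeset"
        value = value.split("@", 1)[0]
        if "." in value:
            return value.split(".", 1)[1].lower()
        # the first variable that is set decides, even without a codeset
        return None
    return None
```

a value like "C" or "sv_SE" carries no codeset, so the caller still needs a fallback for that case.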

for windows, I suggest adding GetACP() to the _locale module,
and letting the glue layer (site.py or locale.py) do:

    if sys.platform == "win32":
        sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the
system font, and fall back to "macroman" if that doesn't work out.
but figuring out the right way to do that is best left to anyone who
actually has access to a Mac.  in the meantime, just make it:

    elif sys.platform == "mac":
        sys.setstringencoding("macroman")

> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis.

Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
that the vast majority of non-Mac platforms are either modern Unixes
or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

    else:
        # try to determine encoding from POSIX locale environment
        # variables
        ...
        # and if that doesn't work out:
        sys.setstringencoding("iso-latin-1")
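
putting the platform cases together, the selection logic might look something like the sketch below. it only returns a name instead of calling the proposed sys.setstringencoding() hook, so it runs anywhere. the function and its parameters are hypothetical, and the code page and POSIX codeset are passed in rather than looked up:

```python
def choose_default_encoding(platform, ansi_codepage=None, posix_codeset=None):
    """Pick a default string encoding name for the given platform.

    ansi_codepage: what GetACP() would report on Windows (assumed input).
    posix_codeset: codeset taken from the locale environment, if any.
    """
    if platform == "win32" and ansi_codepage is not None:
        return "cp%d" % ansi_codepage
    if platform == "mac":
        # placeholder until someone works out the system font approach
        return "macroman"
    if posix_codeset:
        return posix_codeset
    # same default that Tcl uses on non-Mac platforms
    return "iso-8859-1"
```

e.g. choose_default_encoding("win32", 1252) gives "cp1252", and an unknown platform with nothing in the environment ends up with "iso-8859-1".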

> Furthermore, the encoding should be settable on a per thread basis
> inside the interpreter (Python threads do not seem to inherit any
> per-thread globals, so the encoding would have to be set for all
> new threads).

is the C/POSIX locale setting thread specific?

if not, I think the default encoding should be a global setting, just
like the system locale itself.  otherwise, you'll just be addressing a
real problem (thread/module/function/class/object specific locale
handling), but not really solving it...

better use unicode strings and explicit encodings in that case.
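
for illustration, a small sketch of what "explicit encodings" means in practice, written for a modern interpreter where str is the unicode string type (the sample text and variable names are mine):

```python
# the same unicode text, encoded explicitly for two different consumers;
# nothing here depends on a global or per-thread default encoding
text = "sm\u00f6rg\u00e5sbord"

as_latin1 = text.encode("iso-8859-1")   # one byte per character
as_utf8 = text.encode("utf-8")          # two bytes for each non-ASCII character

# decoding with the matching encoding gives the original text back
roundtrip_ok = (
    as_latin1.decode("iso-8859-1") == text
    and as_utf8.decode("utf-8") == text
)
```

each call site names its encoding, so two threads can use different encodings without any shared state.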

> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8.

can this be fixed?  or rather, what changes to the buffer api
are required if we want to work around this problem?

> C APIs which want to support Unicode should be fixed to use
> "es" or query the object directly and then apply proper, possibly
> OS dependent conversion.

for convenience, it might be a good idea to have a "wide system
encoding" too, and special parser markers for that purpose.

or can we assume that all wide system APIs use unicode all the
time?

unproductive-ly yrs /F