
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.

A support module for querying the encoding used in the current locale, together with the experimental hook to set the string encoding, could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.

The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile-time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.

The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per-site basis. Furthermore, the encoding should be settable on a per-thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).

E.g. a site.py module could look like this:

"""
import locale, sys

# Get encoding, defaulting to 'ascii' in case it cannot be
# determined
defenc = locale.get_encoding('ascii')

# Set main thread's string encoding
sys.setstringencoding(defenc)

This would cause the Unicode implementation to assume defenc as the encoding of strings.
"""

Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding, which remains UTF-8. C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply a proper, possibly OS-dependent conversion.

Before starting off into implementing the above, I'd like to hear some comments...

Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

I still think that having any kind of global setting is going to be troublesome. Whether it is per-thread or not, it still means that Module Foo cannot alter the value without interfering with Module Bar.

Cheers,
-g

On Tue, 23 May 2000, M.-A. Lemburg wrote:
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.
A support module for querying the encoding used in the current locale together with the experimental hook to set the string encoding could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.
The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.
The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per site basis. Furthermore, the encoding should be settable on a per thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).
E.g. a site.py module could look like this:
""" import locale,sys
# Get encoding, defaulting to 'ascii' in case it cannot be # determined defenc = locale.get_encoding('ascii')
# Set main thread's string encoding sys.setstringencoding(defenc)
This would result in the Unicode implementation to assume defenc as encoding of strings. """
Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding which still is UTF-8. C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
Before starting off into implementing the above, I'd like to hear some comments...
Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://www.python.org/mailman/listinfo/python-dev
-- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
I still think that having any kind of global setting is going to be troublesome. Whether it is per-thread or not, it still means that Module Foo cannot alter the value without interfering with Module Bar.
True. The only reasonable place to alter the setting is in site.py for the main thread. I think the setting should be inherited by child threads, but I'm not sure whether this is possible or not.

Modules that would need to change the settings are better (re)designed in a way that doesn't rely on the setting at all, e.g. work on Unicode exclusively, which doesn't introduce the need in the first place. And then, no one is forced to alter the ASCII default to begin with :-)

The good thing about exposing this mechanism in Python is that it gets user attention...
Cheers, -g
On Tue, 23 May 2000, M.-A. Lemburg wrote:
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.
A support module for querying the encoding used in the current locale together with the experimental hook to set the string encoding could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.
The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.
The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per site basis. Furthermore, the encoding should be settable on a per thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).
E.g. a site.py module could look like this:
""" import locale,sys
# Get encoding, defaulting to 'ascii' in case it cannot be # determined defenc = locale.get_encoding('ascii')
# Set main thread's string encoding sys.setstringencoding(defenc)
This would result in the Unicode implementation to assume defenc as encoding of strings. """
Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding which still is UTF-8. C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
Before starting off into implementing the above, I'd like to hear some comments...
Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
-- Greg Stein, http://www.lyra.org/
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.
before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether

    assert eval(repr(s)) == s

should be true for strings.

if this isn't important, nothing stops you from changing 'repr' to use isprint, without having to make sure that you can still parse the resulting string.

but if it is important, you cannot really change 'repr' without addressing the big issue.

so assuming that the assertion must hold, and that changing 'repr' to be locale-dependent is a good idea, let's move on:
A support module for querying the encoding used in the current locale together with the experimental hook to set the string encoding could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.
agreed.
The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.
agreed. note that parsing LANG (etc) variables on a POSIX platform is easy enough to do in Python (either in site.py or in locale.py). no need for external support modules for Unix, in other words.

for windows, I suggest adding GetACP() to the _locale module, and let the glue layer (site.py or locale.py) do:

    if sys.platform == "win32":
        sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the system font, and fall back to "macroman" if that doesn't work out. but figuring out the right way to do that is best left to anyone who actually has access to a Mac. in the meantime, just make it:

    elif sys.platform == "mac":
        sys.setstringencoding("macroman")
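[For reference, a runnable sketch of the querying side. sys.setstringencoding() never shipped; on modern Pythons, locale.getpreferredencoding() plays roughly the role the proposed _locale additions would, wrapping nl_langinfo(CODESET) on POSIX and GetACP() on Windows. The helper name below is hypothetical:]

```python
import locale

def guess_default_encoding(fallback="ascii"):
    # Best-effort guess at the locale's encoding, mirroring the
    # platform-specific glue sketched above. guess_default_encoding
    # is a made-up helper, not part of any shipped API.
    enc = locale.getpreferredencoding(False)
    return enc or fallback

print(guess_default_encoding())
```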
The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per site basis.
Tcl defaults to "iso-8859-1" on all platforms except the Mac. assuming that the vast majority of non-Mac platforms are either modern Unixes or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

    else:
        # try to determine encoding from POSIX locale environment
        # variables
        ...
    else:
        sys.setstringencoding("iso-latin-1")
Furthermore, the encoding should be settable on a per thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).
is the C/POSIX locale setting thread specific? if not, I think the default encoding should be a global setting, just like the system locale itself.

otherwise, you'll just be addressing a real problem (thread/module/function/class/object specific locale handling), but not really solving it... better use unicode strings and explicit encodings in that case.
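[The inheritance caveat raised above can be made concrete with thread-local storage. This is only a sketch; setstringencoding/getstringencoding are hypothetical names echoing the proposal, not a real API:]

```python
import threading

_state = threading.local()  # per-thread storage: NOT inherited by child threads

def setstringencoding(enc):
    _state.encoding = enc

def getstringencoding(default="ascii"):
    return getattr(_state, "encoding", default)

setstringencoding("latin-1")  # set in the main thread only

seen = []
t = threading.Thread(target=lambda: seen.append(getstringencoding()))
t.start()
t.join()
# the child thread falls back to the default -- exactly the
# inheritance problem noted in the proposal
print(seen[0])  # -> ascii
```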
Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding which still is UTF-8.
can this be fixed? or rather, what changes to the buffer api are required if we want to work around this problem?
C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
for convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose. or can we assume that all wide system API's use unicode all the time?

unproductive-ly yrs /F

Hi Fredrik! you wrote:
before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether
assert eval(repr(s)) == s
should be true for strings. [...]
What's the problem with this one? I've played around with several locale settings here and I observed no problems, while doing:
    import string
    s = string.join(map(chr, range(128, 256)), "")
    assert eval('"' + s + '"') == s
What do you fear here if 'repr' outputs characters from the upper half of the charset without quoting them as octal sequences? I don't understand.

Regards,
Peter

Peter wrote:
assert eval(repr(s)) == s
What's the problem with this one? I've played around with several locale settings here and I observed no problems, while doing:
what if the default encoding for source code is different from the locale? (think UTF-8 source code)

(no, that's not supported by 1.6. but if we don't consider that case now, we won't be able to support source encodings in the future -- unless the above assertion isn't important, of course).

</F>
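[In modern Python the distinction is directly visible: repr() may emit raw non-ASCII characters, while ascii() escapes them so the literal survives any source encoding. A latter-day illustration of the concern, not something 1.6 offered:]

```python
s = "caf\xe9"  # 'café'

# repr() may embed the raw character; whether eval() can read the
# result back then depends on the encoding the literal is stored in
print(repr(s))

# ascii() always produces a pure-ASCII literal, so the round trip
# holds regardless of the surrounding source encoding
assert eval(ascii(s)) == s
```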

Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.
before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether
assert eval(repr(s)) == s
should be true for strings.
if this isn't important, nothing stops you from changing 'repr' to use isprint, without having to make sure that you can still parse the resulting string.
but if it is important, you cannot really change 'repr' without addressing the big issue.
This is a different discussion which I don't really want to get into... I don't have any need for repr() being locale dependent, since I only use it for debugging purposes and never to rebuild objects (marshal and pickle are much better at that).

BTW, repr(unicode) is not affected by the string encoding: it always returns unicode-escape.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg <mal@lemburg.com> wrote:
before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether
assert eval(repr(s)) == s
should be true for strings.
footnote: as far as I can tell, the language reference says it should: http://www.python.org/doc/current/ref/string-conversions.html
This is a different discussion which I don't really want to get into... I don't have any need for repr() being locale dependent, since I only use it for debugging purposes and never to rebuild objects (marshal and pickle are much better at that).
in other words, you leave it to 'pickle' to call 'repr' for you ;-) </F>

Fredrik Lundh wrote:
M.-A. Lemburg <mal@lemburg.com> wrote:
before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether
assert eval(repr(s)) == s
should be true for strings.
footnote: as far as I can tell, the language reference says it should: http://www.python.org/doc/current/ref/string-conversions.html
This is a different discussion which I don't really want to get into... I don't have any need for repr() being locale dependent, since I only use it for debugging purposes and never to rebuild objects (marshal and pickle are much better at that).
in other words, you leave it to 'pickle' to call 'repr' for you ;-)
Ooops... now this gives a totally new ring to changing repr(). Hehe, perhaps we need a string.encode() method too ;-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The recent discussion about repr() et al. brought up the idea of a locale based string encoding again. [...]
A support module for querying the encoding used in the current locale together with the experimental hook to set the string encoding could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.
agreed.
The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.
agreed.
note that parsing LANG (etc) variables on a POSIX platform is easy enough to do in Python (either in site.py or in locale.py). no need for external support modules for Unix, in other words.
Agreed... the locale.py (and _locale builtin module) are probably the right place to put such a parser.
for windows, I suggest adding GetACP() to the _locale module, and let the glue layer (site.py or locale.py) do:
if sys.platform == "win32": sys.setstringencoding("cp%d" % GetACP())
on mac, I think you can determine the encoding by inspecting the system font, and fall back to "macroman" if that doesn't work out. but figuring out the right way to do that is best left to anyone who actually has access to a Mac. in the meantime, just make it:
elif sys.platform == "mac": sys.setstringencoding("macroman")
The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per site basis.
Tcl defaults to "iso-8859-1" on all platforms except the Mac. assuming that the vast majority of non-Mac platforms are either modern Unixes or Windows boxes, that makes a lot more sense than US ASCII...
in other words:
    else:
        # try to determine encoding from POSIX locale environment
        # variables
        ...
    else:
        sys.setstringencoding("iso-latin-1")
That's a different topic which I don't want to revive ;-) With the above tools you can easily code the latin-1 default into your site.py.
Furthermore, the encoding should be settable on a per thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).
is the C/POSIX locale setting thread specific?
Good question -- I don't know.
if not, I think the default encoding should be a global setting, just like the system locale itself. otherwise, you'll just be addressing a real problem (thread/module/function/class/object specific locale handling), but not really solving it...
better use unicode strings and explicit encodings in that case.
Agreed.
Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding which still is UTF-8.
can this be fixed? or rather, what changes to the buffer api are required if we want to work around this problem?
The problem is that "s" and "t" return C pointers to some internal data structure of the object. It has to be assured that this data remains intact at least as long as the object itself exists. AFAIK, this cannot be fixed without creating a memory leak. The "es" parser marker uses a different strategy, BTW: the data is copied into a buffer, thus detaching the object from the data.
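[The lifetime issue can be sketched in Python terms. UnicodeLike is a made-up class; it only mimics the borrow-versus-copy distinction between the "s"/"t" and "es" strategies described above:]

```python
class UnicodeLike:
    # Toy stand-in for the Unicode object, illustrating why a borrowed
    # pointer forces one fixed cached encoding, while the "es" strategy
    # hands the caller an independent copy in any encoding.

    def __init__(self, text):
        self.text = text
        self._utf8 = None  # cached, fixed-encoding buffer

    def borrow_utf8(self):
        # like "s"/"t": the result is owned by self, so it must stay
        # intact (and in one fixed encoding) as long as self lives
        if self._utf8 is None:
            self._utf8 = self.text.encode("utf-8")
        return self._utf8

    def copy_as(self, encoding):
        # like "es": the data is copied out, detaching it from self
        return self.text.encode(encoding)

obj = UnicodeLike("caf\xe9")
print(obj.borrow_utf8())        # -> b'caf\xc3\xa9'
print(obj.copy_as("latin-1"))   # -> b'caf\xe9'
```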
C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
for convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose.
or can we assume that all wide system API's use unicode all the time?
At least in all references I've seen (e.g. ODBC, wchar_t implementations, etc.) "wide" refers to Unicode.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Tue, 23 May 2000, M.-A. Lemburg wrote:
The problem is that "s" and "t" return C pointers to some internal data structure of the object. It has to be assured that this data remains intact at least as long as the object itself exists.
AFAIK, this cannot be fixed without creating a memory leak.
The "es" parser marker uses a different strategy, BTW: the data is copied into a buffer, thus detaching the object from the data.
C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
for convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose.
or can we assume that all wide system API's use unicode all the time?
At least in all references I've seen (e.g. ODBC, wchar_t implementations, etc.) "wide" refers to Unicode.
On Linux, wchar_t is 4 bytes; that's not just Unicode. Doesn't ISO 10646 require a 32-bit space?

I recall a fair bit of discussion about wchar_t when it was introduced to ANSI C, and the character set and encoding were specifically not made part of the specification. Making a requirement that wchar_t be Unicode doesn't make a lot of sense, and opens up potential portability issues.

-1 on any assumption that wchar_t is usefully portable.

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
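[Fred's point about differing wchar_t widths is easy to check from Python via ctypes; the widths are platform facts (glibc uses a 4-byte wchar_t holding a full ISO 10646 code point, Windows a 2-byte one holding UTF-16 units):]

```python
import ctypes

# wchar_t width differs per platform: typically 4 bytes on Linux/glibc
# and 2 bytes on Windows -- exactly the portability hazard noted above
width = ctypes.sizeof(ctypes.c_wchar)
print(width)
```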

"Fred L. Drake" wrote:
On Tue, 23 May 2000, M.-A. Lemburg wrote:
The problem is that "s" and "t" return C pointers to some internal data structure of the object. It has to be assured that this data remains intact at least as long as the object itself exists.
AFAIK, this cannot be fixed without creating a memory leak.
The "es" parser marker uses a different strategy, BTW: the data is copied into a buffer, thus detaching the object from the data.
C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
for convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose.
or can we assume that all wide system API's use unicode all the time?
At least in all references I've seen (e.g. ODBC, wchar_t implementations, etc.) "wide" refers to Unicode.
On Linux, wchar_t is 4 bytes; that's not just Unicode. Doesn't ISO 10646 require a 32-bit space?
It does; Unicode is definitely moving in the 32-bit direction.
I recall a fair bit of discussion about wchar_t when it was introduced to ANSI C, and the character set and encoding were specifically not made part of the specification. Making a requirement that wchar_t be Unicode doesn't make a lot of sense, and opens up potential portability issues.
-1 on any assumption that wchar_t is usefully portable.
Ok... so it could be that Fredrik has a point there, but I'm not deep enough into this to be able to comment.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
That's a different topic which I don't want to revive ;-)
in a way, you've already done that -- if you're setting the system encoding in the site.py module, lots of people will end up with the encoding set to ISO Latin 1 or its windows superset.

one might of course the system encoding if the user actually calls setlocale, but there's no way for python to trap calls to that function from a submodule (e.g. readline), so it's easy to get out of sync. hmm.

(on the other hand, I'd say it's far more likely that americans are among the few who don't know how to set the locale, so defaulting to us ascii might be best after all -- even if their computers really use iso-latin-1, we don't have to cause unnecessary confusion...)

... but I guess you're right: let's be politically correct and pretend that this really is a completely different issue ;-)

</F>

On Wed, 24 May 2000, Fredrik Lundh wrote:
one might of course the system encoding if the user actually calls setlocale,
I think that was supposed to be:
one might of course SET the system encoding ONLY if the user actually calls setlocale,
or something...
Bleh. Global switches are bogus. Since you can't depend on the setting, and you can't change it (for fear of busting something else), then you have to be explicit about your encoding all the time. Since you're never going to rely on a global encoding, then why keep it?

This global encoding (per thread or not) just reminds me of the single hook for import, all over again.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Greg Stein wrote:
On Wed, 24 May 2000, Fredrik Lundh wrote:
one might of course the system encoding if the user actually calls setlocale,
I think that was supposed to be:
one might of course SET the system encoding ONLY if the user actually calls setlocale,
or something...
Bleh. Global switches are bogus. Since you can't depend on the setting, and you can't change it (for fear of busting something else),
Sure you can: in site.py before any other code using Unicode gets executed.
then you have to be explicit about your encoding all the time. Since you're never going to rely on a global encoding, then why keep it?
For the same reason you use setlocale() in C (and Python): to make programs portable to other locales without too much fuzz.
This global encoding (per thread or not) just reminds me of the single hook for import, all over again.
Think of it as a configuration switch which is made settable via a Python interface -- much like the optimize switch or the debug switch (which are settable via Python APIs in mxTools).

The per-thread implementation is mainly a design question: I think globals should always be implemented on a per-thread basis.

Hmm, I wish Guido would comment on the idea of keeping the runtime settable encoding...

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
participants (6)
- Fred L. Drake
- Fredrik Lundh
- Fredrik Lundh
- Greg Stein
- M.-A. Lemburg
- pf@artcom-gmbh.de