[I18n-sig] Re: Unicode debate
Fri, 28 Apr 2000 14:09:36 +0200
[Diving off into the Great Unknown... perhaps we'll end up with
a useful proposal ;-)]
Just van Rossum wrote:
> At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> [ encoding attr for 8 bit strings ]
> >This would indeed solve some issues... it would cost sizeof(short)
> >per string object though (the integer would map into a table
> >of encoding names).
> >I'm not sure what to do with the attribute when strings with
> >differing encodings meet. UTF-8 + ASCII will still be UTF-8,
> >but e.g. UTF-8 + Latin will not result in meaningful data. Two
> >ideas for coercing strings with different encodings:
> > 1. the encoding of the resulting string is set to 'undefined'
> > 2. coerce both strings to Unicode and then apply the action
> 1, because 2 can lead to surprises when two strings containing binary goop
> are added and only one was a literal in a source file with an explicit
> encoding.
> (Would "undefined" be the same as "default"? It would still be nice to be
> able to set the global default encoding.)
I should have been more precise:
2. provided both strings have encodings which can be converted
to Unicode, coerce them to Unicode and then apply the action;
otherwise proceed as in 1., i.e. the result has an undefined
encoding.
If 2. does try to convert to Unicode, conversion errors should
be raised (just like they are now for Unicode coercion errors).
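The refined rule 2 could be sketched roughly like this in modern Python. The function name and the `'undefined'` tag are invented for illustration; no such API exists, this only models the proposed coercion behaviour:

```python
# Hypothetical sketch of coercion rule 2 for encoding-tagged strings.
# coerce_concat() and the 'undefined' marker are invented names, not a
# real Python API.
def coerce_concat(data1, enc1, data2, enc2):
    """Concatenate two encoded byte strings.

    If both encodings can be decoded to Unicode, decode, concatenate
    and re-encode (UTF-8 chosen here as the common target); otherwise
    fall back to rule 1: concatenate the raw bytes and tag the result
    as 'undefined'.
    """
    try:
        u = data1.decode(enc1) + data2.decode(enc2)
    except (LookupError, UnicodeDecodeError):
        # Rule 1: unknown or incompatible encodings -> undefined result
        return data1 + data2, 'undefined'
    return u.encode('utf-8'), 'utf-8'
```

A conversion error during the decode step would surface just like today's Unicode coercion errors; the `LookupError` branch covers tags such as `'binary'` that name no real codec.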
Some more tricky business:
How should str('bla', 'enc1') and str('bla', 'enc2') compare ?
What about the hash values of the two ?
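The comparison question is genuinely tricky because the very same bytes denote different characters under different encodings. A small illustration (standard codecs only; the `str('\xe9', 'latin-1')` form from the proposal is hypothetical):

```python
# The same byte sequence decodes to different characters under
# different encodings, so hypothetical tagged strings
# str('\xe9', 'latin-1') and str('\xe9', 'cp1251') would be equal
# byte-wise yet denote different text.
raw = b'\xe9'
as_latin1 = raw.decode('latin-1')   # 'é'
as_cp1251 = raw.decode('cp1251')    # 'й'
print(as_latin1 == as_cp1251)       # False: textually distinct
```

Whichever answer is chosen for equality, the hash would have to agree with it, since equal objects must hash equally.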
> >Also, how would one create a string having a specific encoding ?
> >str(object, encname) would match unicode(object, encname)...
> Dunno. Is such a high level interface needed? I'm not proposing to make
> 8-bit strings almost as powerful as unicode strings: unicode strings are
> just fine for those kinds of operations... Hm, I just realized that the
> encoding attr can't be mutable (doh!), so maybe your suggestion isn't so
> bad at all.
That's why I was proposing str(obj, encname)... because the
encoding can't be changed after creation. Default encoding
would be 'undefined' for strings created dynamically using
just "..." and the source code encoding in case the strings
were defined in a Python source file (the compiler would set
the encoding attribute accordingly).
Hmm, we'd still lose big in case someone puts a raw data
string into a Python source file without changing the encoding
to e.g. 'binary'.
We'd then have to write:
s = "...bla..." # source code encoding
data = str("...data...","binary") # binary data
Although binary data should really use:
data = buffer("...data...")
Side note: "...bla..." + buffer("...data...") currently returns
"...bla......data..." -- not very useful: I would have expected
a new buffer object instead. With string encoding attribute
this could be remedied to produce a string having 'binary'
encoding (at least).
Some more issues:
How should str(obj,encname) extract the information from the
object: via getcharbuf or getreadbuf ? Should it take the
encoding of the obj into account (in case it is a string object) ?
What should str(unicode, encname) return (the same as
unicode.encode(encname) ?)
What would file.read() return (a string with 'undefined'
encoding ?) ? An extra parameter to open() could be added
to have it return strings with a predefined encoding.
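For what it's worth, later Python versions adopted essentially this idea: the builtin open() grew an encoding parameter, and .read() returns decoded text. A minimal round-trip sketch:

```python
import os
import tempfile

# Modern open() accepts an encoding argument; .read() then returns
# decoded (Unicode) text rather than raw bytes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w', encoding='latin-1') as f:
    f.write('caf\xe9')
with open(path, 'r', encoding='latin-1') as f:
    text = f.read()
print(text == 'caf\xe9')   # True
os.remove(path)
```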
> Off-topic, what's the idea behind this behavior?:
> >>> unicode(u"abc")
Hmm, I get:
This was fixed upon Guido's request some weeks ago.
> >> Can you open a file *with* an explicit encoding?
> >You can specify the encoding by means of using codecs.open()
> >instead of open(), but the interface will currently only
> >accept (.write) and return (.read) Unicode objects.
> Thanks, I wasn't aware of that. Can't the builtin open() function get an
> additional encoding argument?
That would probably be an option after some rounds of refinement
of the interface.
Python Pages: http://www.lemburg.com/python/