[I18n-sig] Re: Unicode debate

M.-A. Lemburg mal@lemburg.com
Fri, 28 Apr 2000 12:28:48 +0200

Just van Rossum wrote:
> At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote:
> >Where does the current approach require work?
> >
> >- We need a way to indicate the encoding of Python source code.
> >(Probably a "magic comment".)
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!

This would indeed solve some issues... it would cost sizeof(short)
per string object though (the integer would map into a table
of encoding names).

I'm not sure what to do with the attribute when strings with
differing encodings meet. UTF-8 + ASCII will still be UTF-8
(ASCII is a subset of UTF-8), but e.g. UTF-8 + Latin-1 will not
result in meaningful data. Two ideas for coercing strings with
different encodings:

 1. the encoding of the resulting string is set to 'undefined'

 2. coerce both strings to Unicode and then apply the action
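
As a rough illustration of the two ideas, here is a minimal sketch in
present-day Python. TaggedBytes, concat_undefined and concat_unicode
are hypothetical names invented for this example; they are not part
of any Python release, and None stands in for 'undefined'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedBytes:
    """Hypothetical 8-bit string carrying an optional encoding tag."""
    data: bytes
    encoding: Optional[str] = None  # None plays the role of 'undefined'


def concat_undefined(a: TaggedBytes, b: TaggedBytes) -> TaggedBytes:
    """Idea 1: keep the raw bytes, drop the tag when encodings differ."""
    if a.encoding == b.encoding:
        return TaggedBytes(a.data + b.data, a.encoding)
    # ASCII is a subset of UTF-8, so the result is still valid UTF-8.
    if {a.encoding, b.encoding} == {"ascii", "utf-8"}:
        return TaggedBytes(a.data + b.data, "utf-8")
    return TaggedBytes(a.data + b.data, None)


def concat_unicode(a: TaggedBytes, b: TaggedBytes) -> str:
    """Idea 2: decode both operands to Unicode, then concatenate."""
    if a.encoding is None or b.encoding is None:
        raise ValueError("cannot decode a string without an encoding tag")
    return a.data.decode(a.encoding) + b.data.decode(b.encoding)


utf8 = TaggedBytes("\u00e9".encode("utf-8"), "utf-8")
latin1 = TaggedBytes("\u00e9".encode("latin-1"), "latin-1")
ascii_ = TaggedBytes(b"abc", "ascii")
```

With idea 1 the mixed result stays an 8-bit string but loses its tag;
with idea 2 the result is a Unicode object and no tag is needed.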

Also, how would one create a string having a specific encoding?
str(object, encname) would match unicode(object, encname)...

> >- We need a way to indicate the encoding of input and output data
> >files, and we need shortcuts to set the encoding of stdin, stdout and
> >stderr (and maybe all files opened without an explicit encoding).
> Can you open a file *with* an explicit encoding?

You can specify the encoding by using codecs.open() instead of
open(), but the interface currently only accepts (.write) and
returns (.read) Unicode objects.
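
A small sketch of that round trip, written for present-day Python
where Unicode objects are plain str. The scratch file path and the
choice of Latin-1 are assumptions made only for the example.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# .write takes Unicode; the codec encodes it to Latin-1 bytes on disk.
with codecs.open(path, "w", encoding="latin-1") as f:
    f.write("caf\u00e9")

# .read returns Unicode; the codec decodes the bytes while reading.
with codecs.open(path, "r", encoding="latin-1") as f:
    text = f.read()

# Reading in binary mode shows the encoded bytes that were stored.
with open(path, "rb") as f:
    raw = f.read()
```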

We'll probably have to make these a little more convenient,
e.g. by accepting both 8-bit strings and Unicode objects. The
needed machinery is there; we'd only need to define a suitable
interface on top of the classic file interface.
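
One possible shape for such an interface, as a hedged sketch in
present-day Python: FlexibleWriter and its str_encoding parameter are
hypothetical, and only codecs.getwriter() is an existing call. 8-bit
strings are decoded with an assumed encoding before being handed to
the codec's stream writer.

```python
import codecs
import io


class FlexibleWriter:
    """Hypothetical wrapper accepting 8-bit strings and Unicode."""

    def __init__(self, stream, encoding, str_encoding="ascii"):
        # codecs.getwriter() returns the StreamWriter class for the codec.
        self._writer = codecs.getwriter(encoding)(stream)
        # Assumed encoding of any 8-bit strings passed to write().
        self._str_encoding = str_encoding

    def write(self, data):
        if isinstance(data, bytes):
            # Coerce the 8-bit string to Unicode first.
            data = data.decode(self._str_encoding)
        self._writer.write(data)


buf = io.BytesIO()
out = FlexibleWriter(buf, "utf-8", str_encoding="latin-1")
out.write(b"caf\xe9 ")     # 8-bit string, assumed to be Latin-1
out.write("na\u00efve")    # Unicode object
result = buf.getvalue()    # everything ends up as UTF-8 bytes
```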

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/