[I18n-sig] Python and Unicode == Britain and the Euro?

M.-A. Lemburg mal@lemburg.com
Sun, 11 Feb 2001 15:33:48 +0100


Brian Takashi Hooper wrote:
> 
> Hi there, Brian in Tokyo again,
> 
> On Sat, 10 Feb 2001 11:17:19 -0800
> Paul Prescod <paulp@ActiveState.com> wrote:
> 
> > Andy, I think that part of the reason that Westerners push harder for
> > Unicode than Japanese is because we are pressured (rightly) to write
> > software that works world-wide and it is simply not sane to try to do
> > that by supporting multiple character sets. Multiple encodings maybe.
> > Multiple character sets? Forget it.
> I think this is a true and valid point (that Westerners are more likely
> to want to make internationalized software), but it sounds here like
> because Westerners want to make it easier to internationalize software,
> that that is a valid reason to make it harder to make software that has
> no particular need for internationalization, in non-Western languages,
> and change the _meaning_ of such a basic data type as the Python string.
> 
> If in fact, as the proposal proposes, usage of open() without an
> encoding, for example, is at some point deprecated, then if I am
> manipulating non-Unicode data in "" strings, then I think I _do_ at some
> point have to port them over.  b"<blob of binary data>" then becomes
> different from "<blob of binary data>", because "<blob of binary data>"
> is now automatically being interpreted behind the scenes into an
> internal Unicode representation.  If the blob of binary data actually
> happened to be in Unicode, or some Unicode-favored representation (like
> UTF-8), then I might be happy about this - but if it wasn't, I think
> that this result would instead be rather dismaying.

We are certainly not going to make the encoding parameter
mandatory for open(). What type the .read() method returns for
a file opened using an encoding depends on the codec in
use, e.g. a Unicode codec would return Unicode, but other codecs
may choose to return an encoded 8-bit string instead (with the
encoding attribute set accordingly).
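
To make the distinction concrete, here is a rough sketch using the
stream readers from the existing codecs module. It is not the proposed
open() API, whose details are still undecided. The sketch is written in
the syntax of later Python versions that have b"" style bytes objects,
an in-memory io.BytesIO buffer stands in for a real file, and the
sample text and the choice of EUC-JP are illustrative assumptions.

```python
import codecs
import io

# Illustrative sample: the bytes as they would sit in an EUC-JP file.
sample = "日本語"
raw = sample.encode("euc_jp")

# Without a codec, read() hands back the stored bytes untouched.
plain = io.BytesIO(raw)          # stand-in for a file opened in binary mode
untouched = plain.read()

# With a codec's StreamReader wrapped around the same bytes,
# read() returns decoded (Unicode) text instead.
reader = codecs.getreader("euc_jp")(io.BytesIO(raw))
text = reader.read()

print(type(untouched).__name__, type(text).__name__)
```

The point is only that the codec wrapped around the stream, not the
file itself, decides what type read() returns.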

There's still much to do down that road and I wouldn't take the
current proposals too seriously yet. We are still in the idea
gathering phase...

> The current Unicode support is more explicit about this - the meaning of
> the string literal itself has not changed, so I can continue to ignore
> Unicode in cases where it serves no useful purpose.  I realize that it
> would be nicer from a design perspective, more consistent, to have
> Python string mean only character data, but right now, it does sometimes
> mean binary and sometimes mean characters. The only one who can
> distinguish which is the programmer - if at some point "" means only
> Unicode character strings, then the programmer _does_, I think, have to
> go through all their programs looking for places where they are using
> strings to hold non-Unicode character data, or binary data, and
> explicitly convert them over.  I have difficulty seeing how we would be
> able to provide a smooth upgrade path - maybe a command-line backwards
> compatibility option?  Maybe defaults?  I've heard a lot of people
> voicing dislike for default encodings, but from my perspective,
> something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are,
> strictly speaking, not supersets of ASCII because the ASCII ranges are
> usually interpreted as JIS-Roman, which contains about 4 different
> characters) is functionally a default encoding...  Requiring encoding
> declarations, as the proposal suggests, is nice for people working in
> the i18n domain, but is an unnecessary inconvenience for those who are
> not.

First, I think that most string literals in programs are
in fact text data, so switching to a text data type for ""
wouldn't be such a big change. For those few cases where
these literals are used for binary data, switching to b""
doesn't really hurt.
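
A minimal sketch of what the split could look like in practice. It
assumes semantics in which "" is always text and b"" is always binary
data, which is how later Python versions ended up behaving. Nothing
here is settled by the current proposal, and all names are
illustrative.

```python
# Text literal: a sequence of characters, independent of any encoding.
text = "caf\u00e9"

# Binary literal: a sequence of bytes with no character meaning attached.
blob = b"\x89PNG\r\n"

# Moving between the two worlds is always an explicit step.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# Mixing the two types by accident is an error rather than a silent guess.
try:
    text + blob
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(len(text), len(encoded), mixed_ok)
```

The length difference between text and encoded is the kind of detail
the programmer has to think about once text and binary data are
separate types.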

Of course, the programmer will have to rethink text vs. binary
data, but this is what we are aiming at after all. 

Since this step can be too much of a burden for the programmer,
we'll have to come up with a way that lets Python keep the
old-style behaviour, e.g. by telling Python to use a codec which
returns a normal 8-bit string object instead of Unicode...

#?encoding="old-style-strings"

at the top of the source code would then do the trick.
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/