[I18n-sig] Python and Unicode == Britain and the Euro?

Brian Takashi Hooper brian@tomigaya.shibuya.tokyo.jp
Sun, 11 Feb 2001 14:58:44 +0900

Hi there, Brian in Tokyo again,

On Sat, 10 Feb 2001 11:17:19 -0800
Paul Prescod <paulp@ActiveState.com> wrote:

> Andy, I think that part of the reason that Westerners push harder for
> Unicode than Japanese is because we are pressured (rightly) to write
> software that works world-wide and it is simply not sane to try to do
> that by supporting multiple character sets. Multiple encodings maybe.
> Multiple character sets? Forget it.
I think this is a true and valid point (that Westerners are more likely
to want to make internationalized software), but it sounds here as
though, because Westerners want to make it easier to internationalize
software, that is a valid reason to make it harder to write software
that has no particular need for internationalization, in non-Western
languages, and to change the _meaning_ of such a basic data type as the
Python string.

If, as the proposal suggests, usage of open() without an encoding is at
some point deprecated, and I am manipulating non-Unicode data in ""
strings, then I think I _do_ at some point have to port that code over.
b"<blob of binary data>" then becomes different from "<blob of binary
data>", because "<blob of binary data>" is now automatically
interpreted behind the scenes into an internal Unicode representation.
If the blob of binary data actually happened to be in Unicode, or in
some Unicode-favored representation (like UTF-8), then I might be happy
about this - but if it wasn't, I think this result would instead be
rather dismaying.
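To make the worry concrete, here is a short sketch (assuming the
proposed semantics, where b"..." stays raw bytes and a bare "..."
literal is Unicode text; the byte values are invented for
illustration, not from any real program):

```python
# Under the proposal, b"..." is an uninterpreted byte sequence,
# while "..." is decoded into an internal Unicode representation.
blob = b"\xc5\xbd"            # raw bytes; happens to be UTF-8 for Z-caron
assert isinstance(blob, bytes)

# If the blob really is UTF-8, the implicit Unicode view is harmless:
assert blob.decode("utf-8") == "\u017d"

# But if the bytes are, say, Shift-JIS, an interpretation that assumes
# a Unicode-favored encoding simply blows up - the dismaying case:
sjis = "\u65e5\u672c".encode("shift_jis")   # 'Nihon' in Shift-JIS
try:
    sjis.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

The interpreter has no way to guess what a binary blob "means"; only
the programmer knows.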

The current Unicode support is more explicit about this - the meaning
of the string literal itself has not changed, so I can continue to
ignore Unicode in cases where it serves no useful purpose.  I realize
that it would be nicer from a design perspective, and more consistent,
to have the Python string mean only character data, but right now it
sometimes means binary data and sometimes means characters.  The only
one who can distinguish which is the programmer - if at some point ""
means only Unicode character strings, then the programmer _does_, I
think, have to go through all their programs looking for places where
they are using strings to hold non-Unicode character data, or binary
data, and explicitly convert them over.  I have difficulty seeing how
we would be able to provide a smooth upgrade path - maybe a
command-line backwards compatibility option?  Maybe defaults?  I've
heard a lot of people voicing dislike for default encodings, but from
my perspective, something like ISO-Latin-1, or UTF-8, or even ASCII is
functionally a default encoding...  (EUC-JP and SJIS are, strictly
speaking, not supersets of ASCII, because their ASCII ranges are
usually interpreted as JIS-Roman, which differs from ASCII in a few
characters.)  Requiring encoding declarations, as the proposal
suggests, is nice for people working in the i18n domain, but is an
unnecessary inconvenience for those who aren't.
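As an illustration of why the choice of default matters (a made-up
example - the bytes below are the word 'Nihon' encoded in Shift-JIS,
chosen only for this sketch):

```python
# The same bytes viewed through three candidate "default" encodings.
data = "\u65e5\u672c".encode("shift_jis")   # b'\x93\xfa\x96{'

# ISO-Latin-1 maps every byte to some character, so it never raises -
# it just silently produces mojibake instead of the intended text:
assert data.decode("latin-1") == "\x93\xfa\x96{"

# ASCII and UTF-8 reject the bytes outright:
for enc in ("ascii", "utf-8"):
    try:
        data.decode(enc)
    except UnicodeDecodeError:
        print(enc, "refused the data")

# Only the correct encoding recovers the original characters:
assert data.decode("shift_jis") == "\u65e5\u672c"
```

A silent wrong answer (Latin-1) and a hard failure (ASCII, UTF-8) are
both "defaults" of a sort; neither knows what the programmer knows.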

> I don't know of any commercial software written in Japan but used in the
> west so I think that they probably have less I18N pressure than we do.
> Unicode is only interesting when you want the same software to run in
> multiple character set environments!
That's exactly true.  The point I would like to make is that a lot -
probably the majority - of the Python software and libraries out there
today has no need to run in multiple character set environments.
Python is useful for a lot more things than commercial development of
products designed for international markets.

> Andy Robinson wrote:
> > 
> > ...
> > 
> > 2. I have been told that there are angry mumblings on the
> > Python-Japan mailing list that such a change would break all
> > their existing Python programs; I'm trying to set up my tools to
> > ask out loud in that forum.
> I don't think it is possible to say in the abstract that a move to
> Unicode would break code. Depending on implementation strategy it might.
> But I can't imagine there is really a ton of code that would break
> merely from widening the character.
See above.  I think there is, at least outside of Europe.  Is it a
higher priority for Python to make it easier for Western users to
internationalize, or to save people who currently use Python strings to
manipulate binary data the trouble of having to port their applications
to support the new conventions?  I guess my own personal preference is
not to change things too much, because from my perspective, the Unicode
support is fine - if it's not broken, don't fix it.

Maybe it would be instructive to take the current proposal, and any
others that come out, and without actually implementing them,
pretend-apply the changes to parts of the existing code base to see how
big the effect would be?  That way, neither of us has to accept just on
faith that changing so-and-so would or would not break existing code...