[I18n-sig] Pre-PEP: Proposed Python Character Model

Hooper Brian brian@tomigaya.shibuya.tokyo.jp
Wed, 7 Feb 2001 15:01:06 +0900 (JST)


Hi there, this is Brian Hooper from Japan,

--- Paul Prescod <paulp@ActiveState.com> wrote:
> If Guido is philosophically opposed to Unicode as
> some people were the
> last time we discussed it, then I do not have time
> to work out details
> and then later find out that the project was doomed
> from the start
> because of the philosophical issue.

As someone who uses Python with Japanese from day to day,
I'd just like to offer that I think most Japanese users
are not philosophically opposed to Unicode; they would
just like Unicode support to have as little impact as
possible on older, pre-Unicode code.  One fairly extended
discussion on this list concerned how to allow a default
encoding other than UTF-8, since a lot of programs here
are written to handle EUC and SJIS directly as byte-string
literals.
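
For example, a lot of existing code looks roughly like
this sketch (the hex escapes spell a three-character
Japanese word in Shift JIS) and depends on byte strings
staying byte strings:

    # A Shift JIS byte-string literal: 3 characters, 6 bytes.
    title = '\x93\xfa\x96\x7b\x8c\xea'
    print len(title)    # 6 -- counts bytes, not characters
    print title         # fine, if the terminal is Shift JIS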

The best thing, at least from the point of view of
supporting old code, would be for Python to continue to
handle SJIS and EUC (which, in spite of Unicode support in
Windows, etc., are still by far the dominant encodings for
information interchange in Japan) without trying to help
out by converting them into characters.  If my input is a
blob of binary data, then having the bytes of that data
automatically grouped into two or four bytes per
character, or automatically converted into Unicode, isn't
so nice if what I actually wanted was the binary data as
is.  What about adding an optional encoding argument to
the existing open(), with 'raw' (what it does now) as the
default?
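
Something along these lines, say (just a sketch of the
idea as a wrapper function -- the name open_with_encoding
is made up, and the 'euc-jp' codec assumes an add-on codec
package such as JapaneseCodecs, which isn't in the
standard library):

    import codecs

    def open_with_encoding(filename, mode='r', encoding='raw'):
        # 'raw' keeps today's behaviour: untouched byte strings.
        if encoding == 'raw':
            return open(filename, mode)
        # Otherwise return a file-like object whose read()
        # returns Unicode objects decoded from that encoding.
        return codecs.open(filename, mode, encoding)

    raw_bytes = open_with_encoding('page.html').read()
    text = open_with_encoding('page.html', 'r', 'euc-jp').read()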

As one example of this, Java (unless you give the compiler
an -encoding flag) assumes that string literals and file
input are in Unicode, but in web programming, for example,
where almost all the clients are using SJIS or EUC, and
the designers of the web sites are also using SJIS or EUC,
none of the input is in Unicode.  This is also kind of a
pain with JSP, where pages are compiled into servlets by
the server, again in the "wrong" encoding.  Unicode
_support_ is already here, on many fronts, but
compatibility is important, because the old encodings will
take a long time to go away, I think.

I agree that Unicode is where we want to go - being able
to do things like cleanly slice double-byte strings
without having to worry about breaking the encoding would
be a refreshing change from the current state of things,
and it would be nice to have a useful string length
measure, too!  I do however think that some things _will_
break in the process of getting there... the question is
just how much will break, and when.  In this sense, adding
new functions like fopen() seems like a reasonable
solution to me, since it doesn't change the way already
existing constructs work.
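
To make the slicing point concrete (again just a sketch;
unicode() with a 'shift_jis' codec assumes an installed
codec package like JapaneseCodecs):

    # Slicing a Shift JIS byte string can cut a character in half:
    s = '\x93\xfa\x96\x7b\x8c\xea'   # 3 characters, 6 bytes
    broken = s[:3]                   # splits the second character

    # With a Unicode object, len() and slices work per character:
    u = unicode(s, 'shift_jis')
    print len(u)                     # 3
    first = u[0]                     # a whole character, cleanly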

Sorry that this message is kind of a ramble, but I hope it
adds to the discussion.

Cheers,
-Brian
