[I18n-sig] Python and Unicode == Britain and the Euro?
Brian Takashi Hooper
brian@tomigaya.shibuya.tokyo.jp
Mon, 12 Feb 2001 10:27:23 +0900
Thanks for the clarifications, Marc-Andre.
I have no problem with following new conventions for new programs, once
they are decided upon - I just don't want old programs to break _too_
much. Measures like the encoding directives, if they are implemented
properly, are needed, I feel, to ease the transition to a new paradigm.
Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP,
which I just now _did_ read. If the change is gradual and provides
warning messages for deprecated constructs, then this proposal seems
less scary. (Does this mean that it might also be time to start thinking
about the workings of a "deprecation and warning facility" as described
in that document?)
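
To make the idea concrete, here is a rough sketch of what a warning for a
deprecated construct could look like. It assumes the warnings module that
ships with later Python releases; open_without_encoding() is a
hypothetical name made up for this example, not part of any proposal.

```python
import warnings


def open_without_encoding(path):
    """Hypothetical legacy entry point, kept only to illustrate the warning."""
    warnings.warn(
        "open_without_encoding() is deprecated; pass an explicit encoding",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
    return path


# Capture warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = open_without_encoding("data.txt")

categories = [w.category for w in caught]
messages = [str(w.message) for w in caught]
```

The old call keeps working and only produces a message, which is the kind
of gradual transition I would be comfortable with.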
--Brian
On Sun, 11 Feb 2001 15:33:48 +0100
"M.-A. Lemburg" <mal@lemburg.com> wrote:
> Brian Takashi Hooper wrote:
> >
> > Hi there, Brian in Tokyo again,
> >
> > On Sat, 10 Feb 2001 11:17:19 -0800
> > Paul Prescod <paulp@ActiveState.com> wrote:
> >
> > > Andy, I think that part of the reason that Westerners push harder for
> > > Unicode than Japanese is because we are pressured (rightly) to write
> > > software that works world-wide and it is simply not sane to try to do
> > > that by supporting multiple character sets. Multiple encodings maybe.
> > > Multiple character sets? Forget it.
> > I think this is a true and valid point (that Westerners are more likely
> > to want to make internationalized software), but it sounds here like
> > because Westerners want to make it easier to internationalize software,
> > that that is a valid reason to make it harder to make software that has
> > no particular need for internationalization, in non-Western languages,
> > and change the _meaning_ of such a basic data type as the Python string.
> >
> > If in fact, as the proposal proposes, usage of open() without an
> > encoding, for example, is at some point deprecated, then if I am
> > manipulating non-Unicode data in "" strings, then I think I _do_ at some
> > point have to port them over. b"<blob of binary data>" then becomes
> > different from "<blob of binary data>", because "<blob of binary data>"
> > is now automatically being interpreted behind the scenes into an
> > internal Unicode representation. If the blob of binary data actually
> > happened to be in Unicode, or some Unicode-favored representation (like
> > UTF-8), then I might be happy about this - but if it wasn't, I think
> > that this result would instead be rather dismaying.
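
A small sketch of the concern above, assuming the bytes/str split and the
euc_jp codec of later Python versions; decode_or_none() is a hypothetical
helper used only for illustration. The same legacy bytes either fail to
decode or decode to the wrong characters when the encoding is guessed
wrongly.

```python
# Text in a legacy Japanese encoding, as it might sit in an old "" string.
text = "\u65e5\u672c\u8a9e"          # three kanji, written as escapes
legacy_blob = text.encode("euc_jp")  # two bytes per character here


def decode_or_none(blob, encoding):
    """Return the decoded text, or None if the bytes are invalid for it."""
    try:
        return blob.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = decode_or_none(legacy_blob, "utf-8")      # wrong guess: invalid bytes
as_latin1 = decode_or_none(legacy_blob, "latin-1")  # "succeeds", but is garbage
as_euc_jp = decode_or_none(legacy_blob, "euc_jp")   # the encoding actually used
```

Only the programmer knows which of these interpretations is the intended
one, which is why an implicit conversion is the worrying part.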
>
> We are certainly not going to make the encoding parameter
> mandatory for open(). What type the .read() method returns for
> a file opened using an encoding is dependent on the codec in
> use, e.g. a Unicode codec would return Unicode, but other codecs
> may choose to return an encoded 8-bit string instead (with encoding
> attribute set accordingly).
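
As a rough illustration of the distinction, assuming an open() that
accepts an optional encoding argument (as later Python versions do): the
same file read without an encoding yields raw bytes, and read with one
yields decoded text. The file name and contents are made up for the
example, and this is not the proposal's actual design.

```python
import os
import tempfile

text = "\u65e5\u672c\u8a9e"  # sample text, written as escapes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Store the text as EUC-JP bytes.
    with open(path, "wb") as f:
        f.write(text.encode("euc_jp"))

    # Binary mode: no encoding is involved, read() returns the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with an explicit encoding: read() returns decoded text.
    with open(path, "r", encoding="euc_jp") as f:
        decoded = f.read()
```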
>
> There's still much to do down that road and I wouldn't take the
> current proposals too seriously yet. We are still in the idea
> gathering phase...
>
> > The current Unicode support is more explicit about this - the meaning of
> > the string literal itself has not changed, so I can continue to ignore
> > Unicode in cases where it serves no useful purpose. I realize that it
> > would be nicer from a design perspective, more consistent, to have
> > Python string mean only character data, but right now, it does sometimes
> > mean binary and sometimes mean characters. The only one who can
> > distinguish which is the programmer - if at some point "" means only
> > Unicode character strings, then the programmer _does_, I think, have to
> > go through all their programs looking for places where they are using
> > strings to hold non-Unicode character data, or binary data, and
> > explicitly convert them over. I have difficulty seeing how we would be
> > able to provide a smooth upgrade path - maybe a command-line backwards
> > compatibility option? Maybe defaults? I've heard a lot of people
> > voicing dislike for default encodings, but from my perspective,
> > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are,
> > strictly speaking, not supersets of ASCII because the ASCII ranges are
> > usually interpreted as JIS-Roman, which contains about 4 different
> > characters) is functionally a default encoding... Requiring encoding
> > declarations, as the proposal suggests, is nice for people working in
> > the i18n domain, but is an unnecessary inconvenience for those who are
> > not.
>
> First, I think that most string literals in programs are
> in fact text data, so switching to a text data type for ""
> wouldn't be such a big change. For those few cases, where
> these literals are used for binary data, switching to b""
> doesn't really hurt.
>
> Of course, the programmer will have to rethink text vs. binary
> data, but this is what we are aiming at after all.
>
> Since this step can be too much of a burden for the programmer,
> we'll have to come up with a way which allows Python to maintain the
> old style behaviour, e.g. by telling Python to use a codec which
> returns a normal 8-bit string object instead of Unicode...
>
> #?encoding="old-style-strings"
>
> at the top of the source code would then do the trick.
>
> --
> Marc-Andre Lemburg
> ______________________________________________________________________
> Company: http://www.egenix.com/
> Consulting: http://www.lemburg.com/
> Python Pages: http://www.lemburg.com/python/
--
Brian Takashi Hooper <brian@tomigaya.shibuya.tokyo.jp>