[I18n-sig] Python and Unicode == Britain and the Euro?
Mon, 12 Feb 2001 11:24:32 +0100
Brian Takashi Hooper wrote:
> Thanks for the clarifications, Marc-Andre.
> I have no problem with following new conventions, when they are decided
> upon, for new programs - I just don't want old programs to break _too_ much;
> measures like the encoding directives, if they are implemented properly,
> are needed I feel to ease a transition to a new paradigm.
Good to have you back on board :-)
> Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP,
> which I just now _did_ read. If the change is gradual and provides
> warning messages for deprecated constructs, then that makes this
> proposal seem less scary. (Does this mean that it might also be time to
> start thinking about the workings of a "deprecation and warning facility"
> as described in that document?)
Right. The warning facility is already in place in 2.1: Guido added
a complete warning framework which is currently used to warn about
deprecated module usage, e.g. the regex and regsub modules.
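As a rough illustration of how such a framework is used (an editor's sketch in terms of the modern `warnings` module API; `old_api` is a hypothetical name, not something from the regex/regsub deprecation itself):

```python
import warnings

def old_api():
    """A stand-in for a deprecated function (hypothetical name)."""
    # Emit a DeprecationWarning, pointing at the *caller* via stacklevel=2,
    # which is how module deprecations direct users to their own code.
    warnings.warn("old_api() is deprecated; use new_api() instead",
                  DeprecationWarning, stacklevel=2)
    return 42

# Record warnings instead of printing them, so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

print(result)                       # 42
print(caught[0].category.__name__)  # DeprecationWarning
```

The same machinery lets a user escalate deprecations into errors with `warnings.simplefilter("error", DeprecationWarning)`, which is handy for finding deprecated constructs during a port.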
> On Sun, 11 Feb 2001 15:33:48 +0100
> "M.-A. Lemburg" <email@example.com> wrote:
> > Brian Takashi Hooper wrote:
> > >
> > > Hi there, Brian in Tokyo again,
> > >
> > > On Sat, 10 Feb 2001 11:17:19 -0800
> > > Paul Prescod <paulp@ActiveState.com> wrote:
> > >
> > > > Andy, I think that part of the reason that Westerners push harder for
> > > > Unicode than Japanese is because we are pressured (rightly) to write
> > > > software that works world-wide and it is simply not sane to try to do
> > > > that by supporting multiple character sets. Multiple encodings maybe.
> > > > Multiple character sets? Forget it.
> > > I think this is a true and valid point (that Westerners are more likely
> > > to want to make internationalized software), but it sounds here as if,
> > > because Westerners want to make it easier to internationalize software,
> > > that is a valid reason to make it harder to write software that has
> > > no particular need for internationalization in non-Western languages,
> > > and to change the _meaning_ of such a basic data type as the Python string.
> > >
> > > If in fact, as the proposal proposes, usage of open() without an
> > > encoding, for example, is at some point deprecated, then if I am
> > > manipulating non-Unicode data in "" strings, then I think I _do_ at some
> > > point have to port them over. b"<blob of binary data>" then becomes
> > > different from "<blob of binary data>", because "<blob of binary data>"
> > > is now automatically being interpreted behind the scenes into an
> > > internal Unicode representation. If the blob of binary data actually
> > > happened to be in Unicode, or some Unicode-favored representation (like
> > > UTF-8), then I might be happy about this - but if it wasn't, I think
> > > that this result would instead be rather dismaying.
> > We are certainly not going to make the encoding parameter
> > mandatory for open(). What type the .read() method returns for
> > a file opened using an encoding depends on the codec in
> > use: a Unicode codec would return Unicode, but other codecs
> > may choose to return an encoded 8-bit string instead (with the
> > encoding attribute set accordingly).
> > There's still much to do down that road and I wouldn't take the
> > current proposals too seriously yet. We are still in the idea
> > gathering phase...
> > > The current Unicode support is more explicit about this - the meaning of
> > > the string literal itself has not changed, so I can continue to ignore
> > > Unicode in cases where it serves no useful purpose. I realize that it
> > > would be nicer from a design perspective, more consistent, to have
> > > Python string mean only character data, but right now, it does sometimes
> > > mean binary and sometimes mean characters. The only one who can
> > > distinguish which is the programmer - if at some point "" means only
> > > Unicode character strings, then the programmer _does_, I think, have to
> > > go through all their programs looking for places where they are using
> > > strings to hold non-Unicode character data, or binary data, and
> > > explicitly convert them over. I have difficulty seeing how we would be
> > > able to provide a smooth upgrade path - maybe a command-line backwards
> > > compatibility option? Maybe defaults? I've heard a lot of people
> > > voicing dislike for default encodings, but from my perspective,
> > > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are,
> > > strictly speaking, not supersets of ASCII, because the ASCII range is
> > > usually interpreted as JIS-Roman, which differs from ASCII in a few
> > > characters) is functionally a default encoding... Requiring encoding
> > > declarations, as the proposal suggests, is nice for people working in
> > > the i18n domain, but is an unnecessary inconvenience for those who are
> > > not.
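[Brian's point that any implicit decoding is a de-facto default encoding can be made concrete: the same byte sequence is valid under several encodings but means different things under each, so whichever decoding is silently applied wins. An editor's sketch in Python 3 syntax:]

```python
# Two bytes that are a single valid Shift_JIS character...
data = b"\x83\x65"

# ...decode to completely different text under different assumptions:
print(data.decode("shift_jis"))  # the katakana character テ
print(repr(data.decode("latin-1")))  # a C1 control char followed by 'e'

# ASCII, the strictest candidate default, rejects the bytes outright:
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii failed:", exc.reason)
```

There is no neutral choice: whatever the interpreter picks when no encoding is declared is, functionally, a default encoding.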
> > First, I think that most string literals in programs are
> > in fact text data, so switching to a text data type for ""
> > wouldn't be such a big change. For those few cases, where
> > these literals are used for binary data, switching to b""
> > doesn't really hurt.
> > Of course, the programmer will have to rethink text vs. binary
> > data, but this is what we are aiming at after all.
> > Since this step can be too much of a burden on the programmer,
> > we'll have to come up with a way that allows Python to maintain the
> > old-style behaviour, e.g. by telling Python to use a codec which
> > returns a normal 8-bit string object instead of Unicode...
> > #?encoding="old-style-strings"
> > at the top of the source code would then do the trick.
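[The mechanism sketched here — scanning the top of a source file for an encoding declaration — is easy to illustrate. An editor's sketch; the `#?encoding="..."` syntax is only the proposal's strawman (the idea later took a different shape as the `# -*- coding: ... -*-` comment of PEP 263):]

```python
import re

# Match the proposed directive, e.g.:  #?encoding="old-style-strings"
DIRECTIVE = re.compile(r'^#\?encoding="([^"]+)"')

def source_encoding(source, default="ascii"):
    """Return the encoding declared at the top of a source file,
    falling back to a default when no directive is present."""
    # Only the first couple of lines count, mirroring how such
    # declarations are typically restricted to the file header.
    for line in source.splitlines()[:2]:
        m = DIRECTIVE.match(line)
        if m:
            return m.group(1)
    return default

print(source_encoding('#?encoding="old-style-strings"\nx = "data"\n'))
print(source_encoding('x = "data"\n'))  # falls back to the default
```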
> > --
> > Marc-Andre Lemburg
> > ______________________________________________________________________
> > Company: http://www.egenix.com/
> > Consulting: http://www.lemburg.com/
> > Python Pages: http://www.lemburg.com/python/
> Brian Takashi Hooper <firstname.lastname@example.org>