[Python-Dev] Python 1.5.2 modules need porting to 2.0 because of
unicode - comments please
M.-A. Lemburg
mal@lemburg.com
Tue, 19 Sep 2000 23:29:06 +0200
Guido van Rossum wrote:
>
> > > I doubt that we can fix all Unicode related bugs in the 2.0
> > > stdlib before the final release... let's make this a project
> > > for 2.1.
> >
> > Exactly my feelings. Since we cannot possibly fix all problems, we may
> > need to change the behaviour later.
> >
> > If we now silently do the wrong thing, silently changing it to the
> > then-right thing in 2.1 may break people's code. So I'm asking that
> > cases where it does not clearly do the right thing produce an
> > exception now; we can later fix it to accept more cases, should the
> > need arise.
> >
> > In the specific case, dropping support for Unicode output in binary
> > files is the right thing. We don't know what the user expects, so it
> > is better to produce an exception than to silently put incorrect bytes
> > into the stream - that is a bug that we still can fix.
> >
> > The easiest way with the clearest impact is to drop the buffer
> > interface in unicode objects. Alternatively, not supporting them
> > for s# also appears reasonable. Users experiencing the problem in
> > testing will then need to make an explicit decision how they want to
> > encode the Unicode objects.
> >
> > If any expedition of the issue is necessary, I can submit a bug report,
> > and propose a patch.
>
> Sounds reasonable to me (but I haven't thought of all the issues).
>
> For writing binary Unicode strings, one can use
>
> f.write(u.encode("utf-16")) # Adds byte order mark
> f.write(u.encode("utf-16-be")) # Big-endian
> f.write(u.encode("utf-16-le")) # Little-endian
Right.
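
To make the difference between the three codecs concrete, here is a
small sketch. It is written for a current Python 3 interpreter rather
than 2.0, and the sample string and the io.BytesIO object (standing in
for a file opened in binary mode) are assumptions chosen purely for
illustration:

```python
import io

# Sample text (an assumption for illustration); any Unicode string works.
u = u"h\u00e9llo"

# io.BytesIO stands in for a file opened in binary mode.
buf = io.BytesIO()
buf.write(u.encode("utf-16-be"))   # explicit big-endian, no byte order mark
be = buf.getvalue()

le = u.encode("utf-16-le")         # explicit little-endian, no byte order mark
with_bom = u.encode("utf-16")      # platform byte order, prefixed with a BOM

# The BOM tells a reader which byte order follows it.
bom = with_bom[:2]
payload = with_bom[2:]
```

The point is that the caller picks the encoding explicitly, so the bytes
that end up in the stream are never a surprise.
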
Possible ways to fix this:

1. disable Unicode's getreadbuf slot

   This would effectively make Unicode objects unusable for
   all APIs which use "s#"... and probably give people a lot
   of headaches. OTOH, it would probably motivate lots of
   users to submit patches for the stdlib which make it
   Unicode aware (hopefully ;-)

2. same as 1., but also make "s#" fall back to getcharbuf
   in case getreadbuf is not defined

   This would make Unicode objects compatible with "s#", but
   still prevent writing of binary data: getcharbuf returns
   the Unicode object encoded using the default encoding, which
   is ASCII by default.

3. special case "s#" in some way to handle Unicode or to
   raise an exception pointing explicitly to the problem
   and its (possible) solution
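
As a rough illustration of what option 2 would mean for callers, the
sketch below mimics the fallback at the Python level. It is not the C
implementation: as_char_buffer is a hypothetical helper made up for this
example, and the code targets a current Python 3 interpreter. It only
shows the consequence of encoding with an ASCII default:

```python
def as_char_buffer(text, default_encoding="ascii"):
    """Hypothetical stand-in for the bytes a getcharbuf fallback would
    hand to an "s#" consumer: the text in the default encoding."""
    return text.encode(default_encoding)

# Pure ASCII text passes through unchanged.
ok = as_char_buffer(u"abc")

# Non-ASCII text cannot be encoded as ASCII, so the caller gets an
# exception instead of raw internal bytes landing in the stream.
try:
    as_char_buffer(u"caf\u00e9")
    failed = False
except UnicodeEncodeError:
    failed = True
```

So plain ASCII text keeps working with "s#" APIs, while anything else
fails loudly until the user encodes it explicitly.
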
I'm not sure which of these paths to take. Perhaps solution 2
is the most feasible compromise between "exceptions everywhere"
and "encoding confusion".
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/