[Python-Dev] Python 1.5.2 modules need porting to 2.0 because of unicode - comments please

M.-A. Lemburg mal@lemburg.com
Tue, 19 Sep 2000 23:29:06 +0200


Guido van Rossum wrote:
> 
> > > I doubt that we can fix all Unicode related bugs in the 2.0
> > > stdlib before the final release... let's make this a project
> > > for 2.1.
> >
> > Exactly my feelings. Since we cannot possibly fix all problems, we may
> > need to change the behaviour later.
> >
> > If we now silently do the wrong thing, silently changing it to the
> > then-right thing in 2.1 may break people's code. So I'm asking that
> > cases where it does not clearly do the right thing produce an
> > exception now; we can later fix it to accept more cases, should the
> > need arise.
> >
> > In the specific case, dropping support for Unicode output in binary
> > files is the right thing. We don't know what the user expects, so it
> > is better to produce an exception than to silently put incorrect bytes
> > into the stream - that is a bug that we still can fix.
> >
> > The easiest way with the clearest impact is to drop the buffer
> > interface in unicode objects. Alternatively, not supporting them
> > for s# also appears reasonable. Users experiencing the problem in
> > testing will then need to make an explicit decision how they want to
> > encode the Unicode objects.
> >
> > If any expedition of the issue is necessary, I can submit a bug report,
> > and propose a patch.
> 
> Sounds reasonable to me (but I haven't thought of all the issues).
> 
> For writing binary Unicode strings, one can use
> 
>   f.write(u.encode("utf-16"))           # Adds byte order mark
>   f.write(u.encode("utf-16-be"))        # Big-endian
>   f.write(u.encode("utf-16-le"))        # Little-endian

Right.
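
A small sketch of what those three calls produce, written so that it
also runs on a current Python (the sample string and the variable names
are assumptions made up for the demo, not anything from the stdlib):

```python
# Sample text: 'A' followed by EURO SIGN (U+20AC). Chosen only for the demo.
u = u"A\u20ac"

# Explicit byte order, no byte order mark.
be = u.encode("utf-16-be")   # bytes 00 41 20 AC
le = u.encode("utf-16-le")   # bytes 41 00 AC 20

# Plain "utf-16" prepends a byte order mark and then uses the
# platform's native byte order, so the result is machine dependent.
with_bom = u.encode("utf-16")

# The payload after the 2-byte BOM matches one of the explicit variants.
payload = with_bom[2:]
```

Each result is a plain byte string, so it can be handed to f.write() on
a file opened in binary mode without going through "s#" with a Unicode
object at all.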

Possible ways to fix this:

1. disable Unicode's getreadbuf slot

   This would effectively make Unicode objects unusable for
   all APIs which use "s#"... and probably give people a lot
   of headaches. OTOH, it would probably motivate lots of
   users to submit patches for the stdlib which make it
   Unicode-aware (hopefully ;-)

2. same as 1., but also make "s#" fall back to getcharbuf
   in case getreadbuf is not defined

   This would make Unicode objects compatible with "s#", but
   still prevent writing of binary data: getcharbuf returns
   the Unicode object encoded using the default encoding, which
   is ASCII by default.

3. special case "s#" in some way to handle Unicode or to
   raise an exception pointing explicitly to the problem
   and its (possible) solution

I'm not sure which of these paths to take. Perhaps solution
2. is the most feasible compromise between "exceptions everywhere"
and "encoding confusion".
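
To make the effect of solution 2. a bit more concrete, here is a rough
Python-level imitation. It is only a sketch: char_buffer() and
DEFAULT_ENCODING are hypothetical names invented for this example (the
real change would live in the C code behind "s#"), and the syntax is
that of a current Python rather than 2.0.

```python
# Assumption for this sketch: the default encoding is ASCII, as stated above.
DEFAULT_ENCODING = "ascii"

def char_buffer(u):
    """Hypothetical stand-in for what "s#" would receive under solution 2.

    Returns the text encoded with the default encoding, or raises
    UnicodeError if the text cannot be represented in it.
    """
    return u.encode(DEFAULT_ENCODING)

def describe(u):
    """Report whether a value would pass through or be rejected."""
    try:
        return ("ok", char_buffer(u))
    except UnicodeError:
        # Non-ASCII text is refused instead of leaking internal bytes.
        return ("error", None)

ascii_result = describe(u"abc")
euro_result = describe(u"\u20ac")
```

Pure ASCII text passes through unchanged, while anything else produces
an exception, so the user has to pick an encoding explicitly before
writing.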

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/