[Python-Dev] Python 1.5.2 modules need porting to 2.0 because of unicode - comments please

M.-A. Lemburg mal@lemburg.com
Tue, 19 Sep 2000 11:02:46 +0200


"Martin v. Loewis" wrote:
> 
> > The smtplib problem may be easily explained -- AFAIK, the SMTP
> > protocol doesn't support Unicode, and the module isn't
> > Unicode-aware, so it is probably writing garbage to the socket.
> 
> I've investigated this somewhat, and noticed the cause of the problem.
> The send method of the socket passes the raw memory representation of
> the Unicode object to send(2). On i386, this comes out as UTF-16LE.

The send method probably parses its argument with the "s#" format
code. Since "s#" maps to the getreadbuf buffer slot, the Unicode
object returns a pointer to its internal buffer.
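
To make the effect concrete, here is a rough sketch of what ends up on
the wire in that case. It is written for a current Python 3 interpreter
rather than 2.0, and it only approximates the internal buffer of a
little-endian build by encoding to UTF-16LE; the helper name
raw_buffer_le is made up for illustration.

```python
def raw_buffer_le(text):
    """Approximate the bytes a little-endian build would expose.

    Assumption: for characters in the Basic Multilingual Plane, the
    internal two-byte representation matches UTF-16LE.
    """
    return text.encode("utf-16-le")


# Every ASCII character is followed by a NUL byte, which is not what
# an SMTP server expects to read.
print(raw_buffer_le("HELO"))  # b'H\x00E\x00L\x00O\x00'
```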
 
> It appears that this behaviour is not documented anywhere (where is
> the original specification of the Unicode type, anyway).

Misc/unicode.txt has it all. Documentation for PyArg_ParseTuple()
et al. is in Doc/ext/ext.tex.
 
> I believe this behaviour is a bug, on the grounds of being
> confusing. The same holds for writing a Unicode string to a file in
> binary mode. Again, it should not write out the internal
> representation. Or else, why doesn't file.write(42) work? I want that
> it writes the internal representation in binary :-)

This was discussed on python-dev at length earlier this year.
The outcome was that files opened in binary mode should write
raw object data to the file (using getreadbuf) while files opened
in text mode should write character data (using getcharbuf).
 
Note that Unicode objects are the first to make a difference
between getcharbuf and getreadbuf.

IMHO, the bug really is in getargs.c: "s" uses getcharbuf while
"s#" uses getreadbuf. Ideally, "t" and "t#" would be used exclusively
for getcharbuf and "s" and "s#" exclusively for getreadbuf, but I guess
common usage prevents this.

> So in essence, I suggest that the Unicode object does not implement
> the buffer interface. If that has any undesirable consequences (which
> ones?), I suggest that 'binary write' operations (sockets, files)
> explicitly check for Unicode objects, and either reject them, or
> invoke the system encoding (i.e. ASCII).

It's too late for any generic changes in the Unicode area.

The right thing to do is to make the *tools* Unicode-aware, since
you can't really expect the Unicode-string integration mechanism
to get things right in every possible case out there.

E.g. in the above case it is clear that 8-bit text is being sent over
the wire, so the smtplib module should explicitly call the .encode()
method to encode the data into whatever encoding is suitable.
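
A minimal sketch of that idea, again in current Python 3 syntax. The
helper encode_for_wire and its "ascii" default are assumptions made
for illustration; this is not what smtplib actually does.

```python
def encode_for_wire(data, encoding="ascii"):
    """Return bytes that are safe to hand to socket.send().

    Text is encoded explicitly, so characters the encoding cannot
    represent raise UnicodeEncodeError instead of leaking the
    internal representation onto the wire.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    # Already binary data: pass it through unchanged.
    return bytes(data)


# Hypothetical usage, where sock is an already connected socket:
#     sock.sendall(encode_for_wire("HELO example.org\r\n"))
```

With the ASCII default, a string containing characters outside ASCII
fails at the call site rather than being sent in some unintended form.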

> In the case of smtplib, this would do the right thing: the protocol
> requires ASCII commands, so if anybody passes a Unicode string with
> characters outside ASCII, you'd get an error.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/