[Python-Dev] just say no...

M.-A. Lemburg mal@lemburg.com
Tue, 16 Nov 1999 14:06:39 +0100


Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> >...
> > > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > > designed to be passed cleanly through processing steps that handle
> > > single-byte character data, as long as they are 8-bit clean and don't
> > > do too much processing.
> >
> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8 bits long, i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.
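
To make the concern above concrete, here is a minimal sketch of how byte
length and character count diverge once the data is multi-byte. It assumes
a modern Python 3 interpreter (where str is Unicode), not the 1999 C API
under discussion; the sample string and variable names are illustrative
only.

```python
# Sample text with one 2-byte and one 3-byte UTF-8 character.
text = "na\u00efve \u20ac"          # 'naive' with a diaeresis, a space, a euro sign

utf8 = text.encode("utf-8")         # variable-width: 1 to 4 bytes per character
utf16 = text.encode("utf-16-le")    # 2 bytes per character for this sample

n_chars = len(text)                 # number of characters
n_utf8 = len(utf8)                  # number of bytes a byte-oriented caller would see
n_utf16 = len(utf16)

print(n_chars, n_utf8, n_utf16)
```

Code that treats the returned byte length as a character count would be
off for any non-ASCII input in this sketch.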

FYI, the next version of the proposal now says "s#" gives you
UTF-16 and "t#" gives you UTF-8. File objects opened in text mode
will use "t#" and binary ones will use "s#".

I'll just use explicit u.encode('utf-8') calls if I want to write
UTF-8 to binary files -- perhaps everyone else should too ;-)
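
A rough sketch of that explicit approach, written for modern Python 3
rather than the interpreter of the time. The helper names write_utf8 and
read_utf8 and the temporary file path are hypothetical, chosen only for
illustration; they are not part of the proposal.

```python
import os
import tempfile

def write_utf8(path, text):
    """Encode text explicitly and write the bytes to a binary-mode file."""
    data = text.encode("utf-8")     # the explicit encode step
    with open(path, "wb") as f:     # binary mode: no implicit conversion
        f.write(data)
    return len(data)                # bytes written, not characters

def read_utf8(path):
    """Read the raw bytes back and decode them explicitly."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")

# Usage: round-trip a short string through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = write_utf8(demo_path, "h\u00e9llo")
restored = read_utf8(demo_path)
```

Doing the encoding at the call site keeps the choice of encoding visible,
instead of depending on which parser marker the file object happens to use.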

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/