[Python-Dev] just say no...

Mon, 15 Nov 1999 20:04:59 +0100

Guido van Rossum wrote:
> 
> [Misunderstanding in the reasoning behind "t#" and "s#"]
> 
> Thanks for not picking an argument.  Multibyte encodings typically
> have ASCII as a subset (in such a way that an ASCII string is
> represented as itself in bytes).  This is the characteristic that's
> needed in my view.
> 
> > It was my understanding that "t#" refers to single byte character
> > data. That's where the above arguments were aiming at...
> 
> t# refers to byte-encoded data.  Multibyte encodings are explicitly
> designed to be passed cleanly through processing steps that handle
> single-byte character data, as long as they are 8-bit clean and don't
> do too much processing.

Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
"8-bit clean" as you obviously did.
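
To make the "ASCII as a subset" property concrete, here is a minimal
sketch in present-day Python 3 syntax (not the API under discussion in
this thread). The helper name ascii_is_preserved is made up for the
illustration. It checks that pure-ASCII text is represented as itself
under two ASCII-superset encodings, and shows that a non-ASCII
character becomes multibyte in UTF-8:

```python
def ascii_is_preserved(text, encoding):
    """Return True if encoding pure-ASCII text gives the plain ASCII bytes."""
    return text.encode(encoding) == text.encode("ascii")


ascii_text = "just say no"
preserved = {enc: ascii_is_preserved(ascii_text, enc)
             for enc in ("utf-8", "latin-1")}

# One non-ASCII character (i with diaeresis) takes two bytes in UTF-8,
# while the ASCII characters around it stay single bytes.
non_ascii = "na\u00efve"
utf8_bytes = non_ascii.encode("utf-8")
```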

> > Perhaps I'm missing something...
> 
> The idea is that (1)/s# disallows any translation of the data, while
> (2)/t# requires translation of the data to an ASCII superset (possibly
> multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
> contains text and that if the text consists of only ASCII characters
> they are represented as themselves.  (1)/s# makes no such assumption.
> 
> In terms of implementation, Unicode objects should translate
> themselves to the default encoding for t# (if possible), but they
> should make the native representation available for s#.
> 
> For example, take an encryption engine.  While it is defined in terms
> of byte streams, there's no requirement that the bytes represent
> characters -- they could be the bytes of a GIF file, an MP3 file, or a
> gzipped tar file.  If we pass Unicode to an encryption engine, we want
> Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> encrypt UTF-8, we should have fed it UTF-8.)
> 
> > > Note that the definition of the 's' format was left alone -- as
> > > before, it means you need an 8-bit text string not containing null
> > > bytes.
> >
> > This definition should then be changed to "text string without
> > null bytes" dropping the 8-bit reference.
> 
> Aha, I think there's a confusion about what "8-bit" means.  For me, a
> multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
> (As far as I know, C uses char* to represent multibyte characters.)
> Maybe we should disambiguate it more explicitly?

There should be some definition for the two markers and the
ideas behind them in the API guide, I guess.

> > Hmm, I would strongly object to making "s#" return the internal
> > format. file.write() would then default to writing UTF-16 data
> > instead of UTF-8 data. This could result in strange errors
> > due to the UTF-16 format being endian dependent.
> 
> But this was the whole design.  file.write() needs to be changed to
> use s# when the file is open in binary mode and t# when the file is
> open in text mode.

Ok, that would make the situation a little clearer (even though
I expect the two different encodings to produce some FAQs). 

I still don't feel very comfortable about the fact that all
existing APIs using "s#" will suddenly receive UTF-16 data if
being passed Unicode objects: this probably won't get us the
"magical" Unicode integration we envision, since "t#" usage is not
very widespread and character handling code will probably not
work well with UTF-16 encoded strings.

Anyway, we should probably try out both methods...

> > It would also break the symmetry between file.write(u) and
> > unicode(file.read()), since the default encoding is not used as
> > internal format for other reasons (see proposal).
> 
> If the file is encoded using UTF-16 or UCS-2, you should open it in
> binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
> app should read the first 2 bytes and check for a BOM and then
> choose between 'utf-16-be' and 'utf-16-le'.)

Right, that's the idea (there is a note on this in the Standard
Codec section of the proposal).
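
The BOM check described above could look roughly like this in
present-day Python 3 (a sketch, not the proposal's API). The function
name decode_utf16 is hypothetical, and raising ValueError when no BOM
is present is an assumption of this sketch rather than something the
thread specifies:

```python
import codecs


def decode_utf16(data):
    """Decode UTF-16 bytes, choosing the byte order from a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # Assumption for this sketch: refuse to guess the byte order.
    raise ValueError("no UTF-16 byte order mark found")


big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
```

The data would come from a file opened in binary mode, so that no
other translation is applied before the byte order is known.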

> > > Any of the following choices is acceptable (from the point of view of
> > > not breaking the intended t# semantics; we can now start deciding
> > > which we like best):
> >
> > I think we have already agreed on using UTF-8 for the default
> > encoding. It has quite a few advantages. See
> >
> >       http://czyborra.com/utf/
> >
> > for a good overview of the pros and cons.
> 
> Of course.  I was just presenting the list as an argument that if
> we changed our mind about the default encoding, t# should follow the
> default encoding (and not pick an encoding by other means).

Ok.

> > > - utf-8
> > > - latin-1
> > > - ascii
> > > - shift-jis
> > > - lower byte of unicode ordinal
> > > - some user- or os-specified multibyte encoding
> > >
> > > As far as t# is concerned, for encodings that don't encode all of
> > > Unicode, untranslatable characters could be dealt with in any number
> > > of ways (raise an exception, ignore, replace with '?', make best
> > > effort, etc.).
> >
> > The usual Python way would be: raise an exception. This is what
> > the proposal defines for Codecs in case an encoding/decoding
> > mapping is not possible, BTW. (UTF-8 will always succeed on
> > output.)
> 
> Did you read Andy Robinson's case study?  He suggested that for
> certain encodings there may be other things you can do that are more
> user-friendly than raising an exception, depending on the application.
> I am proposing to leave this a detail of each specific translation.
> There may even be translations that do the same thing except they have
> a different behavior for untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version that replaces bad
> characters with '?'.  I think this is one of the powers of having an
> extensible set of encodings.

Agreed, the Codecs should decide for themselves what to do. I'll
add a note to the next version of the proposal.
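
As an illustration of a strict and a non-strict treatment of
untranslatable characters, here is how the codecs in present-day
Python 3 behave (a sketch; the error-handler names are those of
today's str.encode, not necessarily what the proposal will end up
using):

```python
text = "caf\u00e9"  # the last character has no ASCII equivalent

# Strict handling: the untranslatable character raises an exception.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Non-strict alternatives: substitute '?' or drop the character.
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")

# UTF-8 can represent the character, so no error handling is needed here.
utf8 = text.encode("utf-8")
```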

> > > Given the current context, it should probably be the same as the
> > > default encoding -- i.e., utf-8.  If we end up making the default
> > > user-settable, we'll have to decide what to do with untranslatable
> > > characters -- but that will probably be decided by the user too (it
> > > would be a property of a specific translation specification).
> > >
> > > In any case, I feel that t# could receive a multi-byte encoding,
> > > s# should receive raw binary data, and they should correspond to
> > > getcharbuffer and getreadbuffer, respectively.
> >
> > Why would you want to have "s#" return the raw binary data for
> > Unicode objects?
> 
> Because file.write() for a binary file, and other similar things
> (e.g. the encryption engine example I mentioned above) must have
> *some* way to get at the raw bits.

What for? Any lossless encoding should do the trick... UTF-8
is just as good as UTF-16 for binary files; plus it's more compact
for ASCII data. I don't really see a need to get explicitly
at the internal data representation because both encodings are
in fact "internal" with respect to Unicode objects.

The only argument I can come up with is that using UTF-16 for
binary files could (possibly) eliminate the UTF-8 conversion step
which is otherwise always needed.
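
A quick way to check the "lossless" and "more compact for ASCII"
claims, sketched in present-day Python 3. The helper roundtrip_sizes
is made up for this example, and "utf-16-le" is chosen only to keep
the BOM out of the byte counts:

```python
def roundtrip_sizes(text):
    """Encode text both ways, verify the round trip, return byte counts."""
    sizes = {}
    for encoding in ("utf-8", "utf-16-le"):
        data = text.encode(encoding)
        if data.decode(encoding) != text:
            raise ValueError("round trip failed for " + encoding)
        sizes[encoding] = len(data)
    return sizes


ascii_sizes = roundtrip_sizes("hello")
cjk_sizes = roundtrip_sizes("\u65e5\u672c\u8a9e")  # three CJK characters
```

For the ASCII string UTF-8 needs half the bytes of UTF-16; for the
CJK sample the relation is reversed, so the compactness argument
holds for ASCII-heavy data only.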

> > Note that it is not mentioned anywhere that "s#" and "t#"
> > necessarily have to return different things
> > (binary being a superset of text). I'd opt for "s#" and "t#" both
> > returning UTF-8 data. This can be implemented by delegating the
> > buffer slots to the <defencstr> object (see below).
> 
> This would defeat the whole purpose of introducing t#.  We might as
> well drop t# then altogether if we adopt this.

Well... yes ;-)

> > > > Now Greg would chime in with the buffer interface and
> > > > argue that it should make the underlying internal
> > > > format accessible. This is a bad idea, IMHO, since you
> > > > shouldn't really have to know what the internal data format
> > > > is.
> > >
> > > This is for C code.  Quite likely it *does* know what the internal
> > > data format is!
> >
> > C code can use the PyUnicode_* APIs to access the data. I
> > don't think that argument parsing is powerful enough to
> > provide the C code with enough information about the data
> > contents, e.g. it can only state the encoding length, not the
> > string length.
> 
> Typically, all the C code does is pass multibyte encoded strings on to
> other library routines that know what to do to them, or simply give
> them back unchanged at a later time.  It is essential to know the
> number of bytes, for memory allocation purposes.  The number of
> characters is totally immaterial (and multibyte-handling code knows
> how to calculate the number of characters anyway).
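
The difference between the number of bytes and the number of
characters can be seen in a small present-day Python 3 sketch (not
the C API discussed here). The function utf8_char_count is a
hypothetical helper showing how multibyte-aware code can recover the
character count from UTF-8 bytes without decoding them:

```python
def utf8_char_count(data):
    """Count characters in valid UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


text = "gr\u00fc\u00dfe"          # 5 characters, 2 of them non-ASCII
encoded = text.encode("utf-8")

byte_count = len(encoded)         # what memory allocation needs
char_count = utf8_char_count(encoded)
```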

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/