[Python-Dev] marshal (was: Buffer interface in abstract.c?)

M.-A. Lemburg mal@lemburg.com
Sat, 14 Aug 1999 14:30:45 +0200


Greg Stein wrote:
> 
> M.-A. Lemburg wrote:
> >
> > Greg Stein wrote:
> > >
> > > On Tue, 10 Aug 1999, Fredrik Lundh wrote:
> > > > maybe the unicode class shouldn't implement the
> > > > buffer interface at all?  sure looks like the best way
> > >
> > > It is needed for fp.write(unicodeobj) ...
> > >
> > > It is also very handy for C functions to deal with Unicode strings.
> >
> > Wouldn't a special C API be (even) more convenient ?
> 
> Why? Accessing the Unicode values as a series of bytes matches the
> semantics of the buffer interface exactly. Why throw in Yet Another
> Function?

I meant PyUnicode_* style APIs for dealing with all aspects of
Unicode objects -- much like the PyString_* APIs that are already
available.
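
Just to illustrate the idea, here is a rough sketch of what such an
API could look like. The prototypes are purely illustrative (they
are modelled on PyString_FromStringAndSize() and friends and are not
an existing header):

#include "Python.h"

/* Build a Unicode object from a buffer of 16-bit code units. */
PyObject *PyUnicode_FromUnicode(const unsigned short *u, int size);

/* Return a pointer to the object's internal 16-bit buffer. */
unsigned short *PyUnicode_AsUnicode(PyObject *unicode);

/* Number of 16-bit code units stored in the object. */
int PyUnicode_GetSize(PyObject *unicode);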
 
> Your abstract.c functions make it quite simple.

BTW, do we need an extra set of those that take a buffer segment
index, or not ? They would really be one-liners whose only purpose
is to hide the type slots from applications.
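
Something like this is what I mean by one-liner (a sketch only; the
name and the explicit segment argument are just for illustration):

#include "Python.h"

/* Sketch of an abstract wrapper hiding the bf_getreadbuffer type
   slot from applications.  A variant without the extra segment
   argument would simply hardcode segment 0. */
static int
PyObject_AsReadBufferSegment(PyObject *obj, int segment,
                             void **pp, int *lenp)
{
    PyBufferProcs *pb = obj->ob_type->tp_as_buffer;

    if (pb == NULL || pb->bf_getreadbuffer == NULL) {
        PyErr_SetString(PyExc_TypeError,
                        "object does not support the buffer interface");
        return -1;
    }
    *lenp = (*pb->bf_getreadbuffer)(obj, segment, pp);
    return (*lenp < 0) ? -1 : 0;
}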

> > > > to avoid trivial mistakes (the current behaviour of
> > > > fp.write(unicodeobj) is even more serious than the
> > > > marshal glitch...)
> > >
> > > What's wrong with fp.write(unicodeobj)? It should write the unicode value
> > > to the file. Are you suggesting that it will need to be done differently?
> > > Icky.
> >
> > Would this also write some kind of Unicode encoding header ?
> > [Sorry, this is my Unicode ignorance shining through... I only
> >  remember lots of talk about these things on the string-sig.]
> 
> Absolutely not. Placing the Byte Order Mark (BOM) into an output stream
> is an application-level task. It should never be done by any subsystem.
> 
> There are no other "encoding headers" that would go into the output
> stream. The output would simply be UTF-16 (2-byte values in host byte
> order).

Ok.

> > Since fp.write() uses "s#" this would use the getreadbuffer
> > slot in 1.5.2... I think what it *should* do is use the
> > getcharbuffer slot instead (see my other post), since dumping
> > the raw unicode data would lose too much information. Again,
> 
> I very much disagree. To me, fp.write() is not about writing characters
> to a stream. I think it makes much more sense as "writing bytes to a
> stream" and the buffer interface fits that perfectly.

This is perfectly ok, but shouldn't the behaviour of fp.write()
mimic that of previous Python versions ? How does JPython
write the data ?

Inlining a different subject here:
I think the internal semantics of "s#" (which uses the getreadbuffer
slot) and "t#" (which uses the getcharbuffer slot) should be
switched; see my other post. In previous Python versions "s#" had
the semantics of string data with possibly embedded NULL bytes.
Now it suddenly means raw binary data, and you can't simply change
extensions to use the new "t#", because people are still using
them with older Python versions.
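
To illustrate how the two codes behave in 1.5.2 (a sketch; the
function names are made up):

#include "Python.h"

/* "s#" goes through the getreadbuffer slot: raw bytes, embedded
   NULLs allowed, and any buffer-aware object -- including a Unicode
   object -- is accepted. */
static PyObject *
example_write(PyObject *self, PyObject *args)
{
    char *data;
    int len;

    if (!PyArg_ParseTuple(args, "s#", &data, &len))
        return NULL;
    /* ... treat data/len as binary ... */
    Py_INCREF(Py_None);
    return Py_None;
}

/* "t#" goes through the getcharbuffer slot: only objects that can
   provide 8-bit character data are accepted. */
static PyObject *
example_writetext(PyObject *self, PyObject *args)
{
    char *text;
    int len;

    if (!PyArg_ParseTuple(args, "t#", &text, &len))
        return NULL;
    /* ... treat text/len as character data ... */
    Py_INCREF(Py_None);
    return Py_None;
}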
 
> There is no loss of data. You could argue that the byte order is lost,
> but I think that is incorrect. The application defines the semantics:
> the file might be defined as using host-order, or the application may be
> writing a BOM at the head of the file.

The problem here is that many applications were not written to
handle this kind of object. Previously they could only be passed
strings; now they suddenly accept any object implementing the
buffer interface, and then fail when the data gets read back in.

> > such things should be handled by extra methods, e.g. fp.rawwrite().
> 
> I believe this would be a needless complication of the interface.

It would clarify things and make the interface 100% backward
compatible again.
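
Roughly what I have in mind (a module-level sketch for illustration
only; a real implementation would of course live in the file object
itself):

#include <stdio.h>
#include "Python.h"

/* rawwrite(file, obj): dump whatever the raw buffer interface
   provides straight to the file.  fp.write() itself would keep its
   old character/string semantics. */
static PyObject *
file_rawwrite(PyObject *self, PyObject *args)
{
    PyObject *fileobj;
    char *data;
    int len;

    /* "s#" here deliberately means: fetch raw bytes via getreadbuffer. */
    if (!PyArg_ParseTuple(args, "Os#", &fileobj, &data, &len))
        return NULL;
    if (!PyFile_Check(fileobj)) {
        PyErr_SetString(PyExc_TypeError, "expected a builtin file object");
        return NULL;
    }
    if (fwrite(data, 1, len, PyFile_AsFile(fileobj)) != (size_t)len)
        return PyErr_SetFromErrno(PyExc_IOError);

    Py_INCREF(Py_None);
    return Py_None;
}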
 
> > Hmm, I guess the philosophy behind the interface is not
> > really clear.
> 
> I didn't design or implement it initially, but (as you may have guessed)
> I am a proponent of its existence.
> 
> > Binary data is fetched via getreadbuffer and then
> > interpreted as character data... I always thought that the
> > getcharbuffer should be used for such an interpretation.
> 
> The former is bad behavior. That is why getcharbuffer was added (by me,
> for 1.5.2). It was a preventative measure for the introduction of
> Unicode strings. Using getreadbuffer for characters would break badly
> given a Unicode string. Therefore, "clients" that want (8-bit)
> characters from an object supporting the buffer interface should use
> getcharbuffer. The Unicode object doesn't implement it, implying that it
> cannot provide 8-bit characters. You can get the raw bytes thru
> getreadbuffer.

I agree 100%, but why did you add "t#" instead of having "s#" use
the getcharbuffer interface ? E.g. my mxTextTools package uses
"s#" in many of its APIs. Now someone could pass in a Unicode
object and get pretty strange results, without any notice that
mxTextTools and Unicode are incompatible. You could argue that I
should change to "t#", but that doesn't work, since many people
out there still use Python versions <1.5.2, which don't have
"t#" -- mxTextTools would then fail completely for them.
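
One possible workaround would be an ugly compile-time switch,
roughly like this (just a sketch: the function name is made up, and
it assumes that the presence of PY_VERSION_HEX can serve as the
1.5.2 test, since it and "t#" appeared around the same release):

#include "Python.h"

/* On 1.5.2 and later use "t#" (getcharbuffer); on older releases
   fall back to the old "s#" behaviour. */
#ifdef PY_VERSION_HEX
#define TEXT_FMT "t#"
#else
#define TEXT_FMT "s#"
#endif

/* Made-up example function standing in for an mxTextTools API. */
static PyObject *
example_api(PyObject *self, PyObject *args)
{
    char *text;
    int len;

    if (!PyArg_ParseTuple(args, TEXT_FMT, &text, &len))
        return NULL;
    /* ... operate on the 8-bit text ... */
    Py_INCREF(Py_None);
    return Py_None;
}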
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                   139 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/