Fwd: [XML-SIG] xmlpickle.py ?!

M.-A. Lemburg mal@lemburg.com
Tue, 08 Aug 2000 14:38:12 +0200

Jim Fulton wrote:
> "M.-A. Lemburg" wrote:
> > I wonder how well SOAP would handle pickling arbitrary
> > Python objects...
> In particular, I wonder if it tries to be complete. I haven't
> really looked at SOAP lately. In my experience, RPC mechanisms
> don't really need or try to handle arbitrarily complex objects.
> OTOH, lots of applications don't need complete transfer.

True, but instead of "rolling your own" every time, I think
a simple extensible definition would help. E.g. we could add
a callback mechanism (__xml__ method) to aid in converting
objects in a way defined by the object itself.
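
To make the idea concrete, here is a rough sketch of how such a hook could work. Nothing below exists yet: the function name to_xml, the element names and the Point class are made up for illustration, and only the __xml__ method name comes from the proposal above.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(obj):
    """Return an XML fragment for obj (hypothetical helper, not an existing API)."""
    # Give the object the first chance to describe itself.
    hook = getattr(obj, "__xml__", None)
    if callable(hook):
        return hook()
    # Fallbacks for a couple of built-in types; element names are assumptions.
    if type(obj) is int:
        return "<int>%d</int>" % obj
    if type(obj) is str:
        return "<string>%s</string>" % escape(obj)
    raise TypeError("don't know how to convert %r" % type(obj))

class Point:
    """Example class that supplies its own XML representation."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __xml__(self):
        return "<point x=%s y=%s/>" % (quoteattr(str(self.x)),
                                       quoteattr(str(self.y)))

print(to_xml(Point(1, 2)))
print(to_xml(42))
```

Types that don't define __xml__ and aren't known to the converter raise TypeError here; a real design would probably consult a registry similar to copy_reg instead.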
> > I've had a look at ppml.py in Zope, but didn't really
> > grok the idea behind it -- it's completely undocumented
> > and contains some really weird callbacks :-/
> Yes, well, if you're interested in pursuing it, I'll provide more
> info.

I'm not sure whether I'll use ppml.py as template or just
as source of ideas. 

The design doesn't look clear to me, e.g. it seems as if you
are first pickling an object using the standard Python pickle
mechanism and then pass the pickled string to the XML converter.
Is that so ?  ...I think I'd rather go the direct way.
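
By "the direct way" I mean walking the object and emitting XML without an intermediate pickle string. A minimal sketch, assuming only ints, strings and lists need handling; the function names and the element names (apart from <pickle> and <int>, which appear in this thread) are placeholders:

```python
import xml.etree.ElementTree as ET

def build(obj):
    """Turn a Python object straight into an XML element (illustrative only)."""
    if type(obj) is int:
        el = ET.Element("int")
        el.text = str(obj)          # literal, human-readable form
    elif type(obj) is str:
        el = ET.Element("string")
        el.text = obj
    elif type(obj) is list:
        el = ET.Element("list")
        for item in obj:
            el.append(build(item))  # recurse into the container
    else:
        raise TypeError("unsupported type: %r" % type(obj))
    return el

def dumps(obj):
    """Wrap the converted object in a <pickle> root and serialize it."""
    root = ET.Element("pickle")
    root.append(build(obj))
    return ET.tostring(root, encoding="unicode")

print(dumps([1, "two", [3]]))
```

This skips cyclic references and class instances entirely; those would need the id/reference scheme discussed further down.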

Hmm, I should really try to get the module to run outside of
Zope... for now I've just been looking at the code.

> > My general idea for xmlpickle is to come up with a format that
> > is human readable and editable, i.e. literal representations
> > should be used in favour of binary ones (size is not a problem;
> > speed can later be added via a C extension).
> OK. Obviously, a gif image needs to be encoded.  We could certainly
> modify the algorithm that decides between repr and base64 to give
> more preference to repr.

I'd say that any string containing at least one \000 character
should be considered binary (and encoded in base64 or some other
standard format).

This should get all typical text strings into readable format.
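
Sketched in code, the rule could look like this. The function name and the returned labels are invented for the example, and the sketch uses a modern Python bytes object where the text above says "string":

```python
import base64

def encode_string(data: bytes):
    """Return (encoding_name, text) using the 'contains NUL means binary' rule."""
    if b"\x00" in data:
        # At least one \000 byte: treat as binary and use base64.
        return "base64", base64.b64encode(data).decode("ascii")
    # Otherwise keep it readable: take repr() and strip the b'...' wrapper.
    return "repr", repr(data)[2:-1]

print(encode_string(b"hello world"))
print(encode_string(b"GIF89a\x00\x01")[0])
```

Text with the odd control character (a newline, a tab) still comes out in escaped but readable form, which is the point of the rule.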

> > > > I've looked at pickle.py a bit and realized that the extensible
> > > > nature of the pickle mechanism would probably cause trouble
> > > > because the DTD would have to be generated as well (not a good
> > > > idea).
> > >
> > > Why would a DTD have to be generated?
> >
> > If you take the first path (see below; one element per pickle'able
> > type), then you'd have to regenerate the DTD in case new types
> > were registered through copy_reg.
> Yes, but why do you need a DTD? Lots of people don't
> seem to use DTDs and DTDs don't work very well with namespaces.

Good question ;-) I just thought that having a DTD around would
be good to validate input data and perhaps help the XML editor.

> >). The only part I don't
> > like about ppml.py's approach is that it pickles e.g.
> > integers to a binary format.
> Nah:
> <pickle> <int>123</int> </pickle>

Ah, I was seeing all these binary formatting APIs in ppml.py
especially for 64-bit ints. Looks as if these are not used anywhere
in the code though...

> > > Note the id attributes
> > > and reference tags, which allow cyclical data structures.
> >
> > Way cool, yes :-)
> >
> > > (I recently discovered that there is a problem with my id
> > > values. Does anyone know what it is? ;)
> > >
> > > One other note. I found the XML spec to be a little
> > > ambiguous (or maybe I'm just too dense) wrt binary data
> > > and newlines, so I decided to punt and escape newlines and
> > > binary data.  I encode strings as either "repr" which is a
> > > repr like encoding that escapes things in a way that is
> > > just a tad more terse than repr. I switch to base64 when
> > > the escaping penalty exceeds 40%.
> >
> > I don't really care about size... my goal is keeping data editable
> > and human readable -- this also makes writing backends in
> > other languages a lot easier.
> So we could add some tuning to this. Note that the goal is not to
> reduce size, but to detect "binary" data. Python doesn't make
> a distinction between binary and text, but base64 is probably
> a much better way to encode truly binary data.

See above. I'd rather use the definition above for deciding on
binary or not.
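
For comparison, the penalty-based rule quoted above might be sketched roughly as follows. How ppml.py actually measures the "escaping penalty" is not spelled out in this thread, so the formula (relative growth of the escaped form over the raw length) and the function name are assumptions:

```python
import base64

def choose_encoding(data: bytes, threshold: float = 0.40):
    """Pick repr or base64 by how much escaping inflates the data (assumed formula)."""
    if not data:
        return "repr", ""
    escaped = repr(data)[2:-1]                        # repr-like escaped form
    penalty = (len(escaped) - len(data)) / len(data)  # relative size increase
    if penalty > threshold:
        return "base64", base64.b64encode(data).decode("ascii")
    return "repr", escaped

# A long text with a single NUL stays "repr" under this rule,
# whereas the NUL-based definition would call it binary.
print(choose_encoding(b"some longer text\x00")[0])
```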

> > > Since a lot of our pickles
> > > have marked up text, I automatically use CDATA sections when
> > > I can and where it would help. See the example above.
> >
> > How robust is this CDATA wrapping ? What if the data itself
> > is XML and contains a CDATA section ?
> Then it's not used. We will only use CDATA if we can.
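
If I read that correctly, the decision amounts to something like the sketch below. This is a guess at the logic, not code from ppml.py; the function name is made up, and the check relies on the fact that a CDATA section cannot contain the sequence "]]>":

```python
from xml.sax.saxutils import escape

def text_node(s: str) -> str:
    """Emit s as a CDATA section when that is safe and useful, else escape it."""
    has_markup = "<" in s or "&" in s
    if has_markup and "]]>" not in s:
        # Safe: the data cannot terminate the CDATA section early.
        return "<![CDATA[" + s + "]]>"
    # Either nothing needs protecting, or the data itself contains "]]>".
    return escape(s)

print(text_node("<b>bold</b> text"))
print(text_node("nested <![CDATA[x]]> section"))
```

So data that already contains a CDATA section falls back to ordinary entity escaping, which stays valid XML at the cost of readability.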


Thanks for the feedback,
Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/