[ANN] pyxser-1.2r --- Python-Object to XML serialization module
Daniel Molina Wegener
dmw at coder.cl
Tue Aug 25 06:03:55 CEST 2009
-----BEGIN PGP SIGNED MESSAGE-----
Stefan Behnel <stefan_ml at behnel.de>
on Monday 24 August 2009 09:00
wrote in comp.lang.python:
> Daniel Molina Wegener wrote:
>> unicode objects are encoded into the
>> encoding that the XML document encoding has, and as you say, the whole
>> XML document has one encoding. There is no mixing of byte encoded strings
>> with different encodings in the output document.
> Ok, that's what I hoped anyway. It just wasn't clear from your
>> When the object is restored, by using pyxser.unserialize:
>> pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
> But this is XML, right? What do you need to pass the encoding for at this
The user may want an encoding other than utf-8; it can be any
encoding supported by libxml2.
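
As a rough illustration of "one document, one encoding", here is a minimal
sketch using the standard library's xml.etree.ElementTree under Python 3.
It is not pyxser code; the element name and sample text are made up for the
example.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document holding a Unicode string.
root = ET.Element("value")
root.text = "caf\u00e9"

# Serialize the whole document in a single, non-UTF-8 encoding.
doc = ET.tostring(root, encoding="iso-8859-1")

# The text was encoded once, into the document encoding:
# U+00E9 becomes the single byte 0xE9.
assert b"caf\xe9" in doc

# Parsing the bytes back yields the original Unicode string.
restored = ET.fromstring(doc)
print(restored.text == "caf\u00e9")
```

The serialized bytes carry their own encoding declaration, so the parser
here does not need to be told the encoding separately.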
>> Another issue is that if you have mixed several encodings in the byte
>> string objects of your object tree, such as iso-8859-1 and utf-8, and
>> you try to serialize that object, pyxser will print serialization
>> errors to stdout as it tries to handle those mixed encodings, which do
>> not match the document encoding.
> There shouldn't be any serialisation errors (unless you try to recode byte
> strings on the way out, which is a no-no for arbitrary user input). All
> you have to do is properly escape the byte string so that it passes the
> XML encoding step.
Yup, but if the encodings are mixed inside Python byte strings, I think
that there is no way to know which encoding each of them uses. This may
cause XML serialization errors, because their encoding differs from the
one the user has set as the document encoding.
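
A small sketch of why a byte string alone does not say which encoding it
uses. This is plain Python 3 with made-up sample strings, not pyxser code:
the same bytes either fail to decode or decode silently into the wrong
characters, depending on the encoding that is guessed.

```python
# Two byte strings with different, unrecorded encodings.
latin1_bytes = "\u00e9".encode("iso-8859-1")   # e with acute accent
big5_bytes = "\u4e2d".encode("big5")           # a CJK character

# Guessing utf-8 for the iso-8859-1 bytes fails outright.
try:
    latin1_bytes.decode("utf-8")
    utf8_guess_failed = False
except UnicodeDecodeError:
    utf8_guess_failed = True

# Guessing iso-8859-1 for the big5 bytes never fails, but the result
# is two unrelated characters instead of the original one.
wrong_text = big5_bytes.decode("iso-8859-1")

print(utf8_guess_failed, len(wrong_text), wrong_text != "\u4e2d")
```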
> One trick to do that is to decode the byte string as ISO-8859-1 and
> serialise the result as a normal Unicode string. Then you can re-encode
> the unicode string on input back to ISO-8859-1.
> I choose ISO-8859-1 here because it has the well-defined side-effect of
> mapping byte values directly to Unicode characters with an identical code
> point value. So you do not risk any failures or data loss.
Sure, but if there are Python byte strings (not Unicode strings) inside
the object tree, some encoded in big5 and others in iso-8859-1, the
XML serialization would throw errors on the encoding conversion when
those bytes are placed inside the document...
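
For reference, the ISO-8859-1 trick quoted above can be sketched like this
in Python 3, using xml.etree.ElementTree rather than pyxser. The helper
names and sample values are hypothetical. The bytes are never interpreted:
each byte value is mapped to the code point with the same number, so big5
and iso-8859-1 byte strings can sit in the same UTF-8 document.

```python
import xml.etree.ElementTree as ET

def bytes_to_xml_text(raw):
    # Map each byte 0x00-0xFF to the code point with the same value.
    return raw.decode("iso-8859-1")

def xml_text_to_bytes(text):
    # Inverse mapping; only valid for text produced by bytes_to_xml_text.
    return text.encode("iso-8859-1")

# Byte strings in two different encodings, as in the object tree above.
samples = ["\u4e2d\u6587".encode("big5"), "caf\u00e9".encode("iso-8859-1")]

root = ET.Element("items")
for raw in samples:
    ET.SubElement(root, "item").text = bytes_to_xml_text(raw)

# One document, one encoding (UTF-8), regardless of the source encodings.
doc = ET.tostring(root, encoding="utf-8")

restored = [xml_text_to_bytes(e.text) for e in ET.fromstring(doc).iter("item")]
print(restored == samples)
```

This sketch only covers bytes that map to characters allowed in XML 1.0.
Most control bytes below 0x20 are not allowed, and carriage returns are
normalized by XML parsers, so arbitrary binary data would need an extra
escaping step.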
Thanks for commenting, and sorry for the late answer. This day was
.O. | Daniel Molina Wegener | FreeBSD & Linux
..O | dmw [at] coder [dot] cl | Open Standards
OOO | http://coder.cl/ | FOSS Developer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (FreeBSD)
-----END PGP SIGNATURE-----