Hello Fredrik,

Wednesday, May 10, 2006, 5:07:10 PM, you wrote:

> what's "resource saving" by using a slower serialization model that
> needs more memory?

In the first place, I was thinking lxml would be able to return a unicode object directly in the Python internal format, and that's where the resource saving was expected to come from. If it cannot handle that, there is no point in implementing it, indeed.
> why would you do this on the serialized document, rather than on the
> infoset? how would you generalize the above to handle arbitrary
> strings? what about surrogates?

For any reason the user wants - that was just an example. A text editor handling unicode is one.
As I said, I just wanted to avoid an extra .encode() call, which would hold two buffers in memory at once.
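To make the two-buffer point concrete, here is a minimal sketch using the standard-library serializer (xml.etree.ElementTree, assuming a modern Python where tostring() accepts encoding="unicode"; the element name "doc" is just an illustration):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc>h\u00e9llo</doc>")

# One step: ask the serializer for a text (unicode) string directly;
# no intermediate byte buffer is exposed to Python code.
text = ET.tostring(root, encoding="unicode")

# Two steps: serialize to UTF-8 bytes, then decode back to text.
# The decode allocates a second buffer holding the same document again,
# which is exactly the waste I want to avoid.
data = ET.tostring(root, encoding="utf-8")   # bytes, with an XML declaration
text_again = data.decode("utf-8")
```

The one-step form never materializes the byte string on the Python side; the two-step form keeps both `data` and `text_again` alive at the same time.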
> that's not portable, of course. Python cannot print arbitrary Unicode
> to stdout on all platforms. it has no trouble printing ASCII to
> stdout...

"Not portable" is not an argument. Python supports lots of other non-portable APIs.
> according to the DBXML documentation, it expects well-formed XML, not
> necessarily "UTF-8", and definitely not "unicode". have you tried the
> above with non-ASCII data? with latin-1 data serialized as
> "iso-8859-1"? what does sys.getdefaultencoding() return on your
> machine?

I can't do those tests right now, sorry, but it should be 'ascii'.
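For the record, the checks you ask for would look roughly like this (a sketch with the standard-library serializer; the "doc" element and the e-acute test character are my own illustration, not anything from DBXML):

```python
import sys
import xml.etree.ElementTree as ET

# The encoding the interpreter assumes when mixing bytes and unicode.
# On the Pythons of that era this was 'ascii'; modern Python 3 reports 'utf-8'.
print(sys.getdefaultencoding())

root = ET.fromstring("<doc>h\u00e9llo</doc>")

# The same non-ASCII data serialized as ISO-8859-1 versus UTF-8:
# e-acute is the single byte 0xE9 in Latin-1, but 0xC3 0xA9 in UTF-8.
latin1 = ET.tostring(root, encoding="iso-8859-1")
utf8 = ET.tostring(root, encoding="utf-8")
```

So a consumer that requires UTF-8 (or plain ASCII) input really does care which of these byte strings it is handed.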
DBXML expects NodeStorage containers to be UTF-8 (or plain ASCII), and the XQuery interfaces support only UTF-8. Anyway, as I have pointed out several times, I just want to avoid holding one string in memory and then creating a second UTF-8 object from it - that step is unnecessary if you wanted unicode from the start. I'm sure you understand it's important to have encoding support, since .tostring() already supports it - just through an inefficient path, due to implementation issues.

-- 
Best regards,
 Steve                            mailto:howe@carcass.dhs.org