Hello Fredrik,

Wednesday, May 10, 2006, 5:07:10 PM, you wrote:

> what's "resource saving" by using a slower serialization model that
> needs more memory?

In the first place, I was thinking lxml would be able to return a unicode object directly in the Python internal format, and that's where the resource saving was expected to come from. If it cannot handle that, there is no point in implementing it, indeed.
> why would you do this on the serialized document, rather than on the
> infoset? how would you generalize the above to handle arbitrary
> strings? what about surrogates?

For any reason the user wants - that was just an example. A text editor handling unicode is one.
As I said, I just wanted to avoid an extra .encode() call, which would hold two buffers in memory at once.
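To make the two-buffer point concrete, here is a minimal sketch using the standard-library serializer (xml.etree.ElementTree, assuming a modern Python where tostring() accepts encoding="unicode"; the element name "doc" is just an illustration):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc>h\u00e9llo</doc>")

# One step: ask the serializer for a text (unicode) string directly;
# no intermediate byte buffer is exposed to Python code.
text = ET.tostring(root, encoding="unicode")

# Two steps: serialize to UTF-8 bytes, then decode back to text.
# The decode allocates a second buffer holding the same document again,
# which is exactly the waste I want to avoid.
data = ET.tostring(root, encoding="utf-8")   # bytes, with an XML declaration
text_again = data.decode("utf-8")
```

The one-step form never materializes the byte string on the Python side; the two-step form keeps both `data` and `text_again` alive at the same time.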
> that's not portable, of course. Python cannot print arbitrary Unicode
> to stdout on all platforms. it has no trouble printing ASCII to
> stdout...

"Not portable" is not an argument. Python supports lots of other non-portable APIs.
> according to the DBXML documentation, it expects well-formed XML, not
> necessarily "UTF-8", and definitely not "unicode". have you tried the
> above with non-ASCII data? with latin-1 data serialized as
> "iso-8859-1"? what does sys.getdefaultencoding() return on your
> machine?

I can't do those tests right now, sorry, but it should be 'ascii'.
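For the record, the checks you ask for would look roughly like this (a sketch with the standard-library serializer; the "doc" element and the e-acute test character are my own illustration, not anything from DBXML):

```python
import sys
import xml.etree.ElementTree as ET

# The encoding the interpreter assumes when mixing bytes and unicode.
# On the Pythons of that era this was 'ascii'; modern Python 3 reports 'utf-8'.
print(sys.getdefaultencoding())

root = ET.fromstring("<doc>h\u00e9llo</doc>")

# The same non-ASCII data serialized as ISO-8859-1 versus UTF-8:
# e-acute is the single byte 0xE9 in Latin-1, but 0xC3 0xA9 in UTF-8.
latin1 = ET.tostring(root, encoding="iso-8859-1")
utf8 = ET.tostring(root, encoding="utf-8")
```

So a consumer that requires UTF-8 (or plain ASCII) input really does care which of these byte strings it is handed.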
DBXML expects NodeStorage containers to be UTF-8 (or plain ASCII), and the XQuery interfaces support only UTF-8. Anyway, as I have pointed out several times, I just want to avoid holding one string in memory and then creating a second UTF-8 object from it - that step is unnecessary if you wanted unicode from the start. I'm sure you understand it's important to have encoding support, since .tostring() already supports it - just through an inefficient path, due to implementation issues.

-- 
Best regards,
 Steve                            mailto:howe@carcass.dhs.org