Re: [lxml-dev] Re: Re: Python unicode string support in lxml

10 May 2006

Hi Steve,

Steve Howe wrote:
...
Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
...
Careful, this is more or less how tounicode() is currently implemented
(although at the libxml2 level). It currently serializes to UTF-8 (which, at
least, is pretty fast in libxml2, as all strings are already UTF-8) and then
calls the Python API function to convert from UTF-8 to Python unicode in one
run (which is also pretty efficient). It's difficult to do otherwise, as
libxml2 and Python have independent memory management, so we can't just pass
pointers between them here.
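
The two-step procedure described above (serialize to UTF-8, then decode to a
Python string in a single pass) can be sketched roughly as follows. This is
only an illustration: it uses the standard library's xml.etree.ElementTree as
a stand-in for lxml.etree, and the helper name to_unicode_via_utf8 is made up
for this example. It is not how lxml implements tounicode() internally.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; the thread is about lxml.etree


def to_unicode_via_utf8(element):
    """Hypothetical helper: serialize to UTF-8 bytes, then decode in one pass."""
    # Step 1: the serializer writes UTF-8 bytes into its own buffer.
    utf8_bytes = ET.tostring(element, encoding="utf-8")
    # Step 2: one decode call copies the data into a new Python string.
    return utf8_bytes.decode("utf-8")


root = ET.fromstring("<doc><p>Grüße</p></doc>")
two_step = to_unicode_via_utf8(root)

# For comparison: ask the serializer for a Python string directly.
direct = ET.tostring(root, encoding="unicode")
```

Both paths should produce the same string; the question in the thread is only
where the copy and the conversion happen.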
...
Note also that libxml2 uses a dynamically adapted output buffer, so it likely
uses more memory during serialization than absolutely necessary.
...
So, while the idea of the API is that it's more efficient (which it still is),
the gain may not be as big as expected. But since tostring uses the same
mechanism (and thus suffers from the same problem), the saving in overhead is
still about 1/3 if the result is required as unicode.
...
I was thinking lxml would return the data encoded as unicode, in the
same format Python uses, and thus the gain would be more dramatic.

I guess you mean libxml2 here, not lxml. Given the above procedure, I don't
think it's a big difference in speed if libxml2 encodes to native Python (from
internal UTF-8 data) or if Python does that from libxml2 serialized UTF-8
data. In any case, we'd have to copy the buffer to get it into Python.

I assume that the libxml2 -> UTF-8 -> Python approach is already the most
memory-friendly order in most cases, as UTF-8 tends to be (much) shorter than
32-bit unicode (which the Python interpreter *may* use, although it *may* also
be 16-bit). So generating everything in UTF-8 and then expanding it to unicode
actually saves RAM compared to copying from unicode to unicode.
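
To make the size argument concrete, here is a small sketch that compares the
byte length of the same markup in UTF-8, UTF-16 and UTF-32. The function name
encoded_sizes and the sample string are invented for illustration, and the
fixed-width encodings only approximate the 16-bit and 32-bit internal
representations mentioned above.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants avoid counting a byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Mostly-ASCII markup, which is the common case for serialized XML.
markup = "<doc><p>Hello, world</p></doc>"
sizes = encoded_sizes(markup)

for name, size in sizes.items():
    print(name, size)
```

For ASCII-only markup like this, the UTF-8 form takes one byte per character,
while the 16-bit and 32-bit forms take two and four bytes per character, so
the intermediate UTF-8 buffer is the smallest of the three.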
...
In this case, I think you should judge how much more efficient that is than
calling .tostring(encoding) and implement it if the gain is reasonable.

Sorry, I don't understand what you mean here. This is all done at the C-level:
serialization and conversion. If you did the same at the Python level, it
cannot be faster or less memory intensive. But you would still have to copy
the string before you pass it back through the API. So doing the conversion
/as/ the copy operation is the most efficient way.

Stefan

Stefan Behnel