Re[2]: [lxml-dev] Re: Re: Python unicode string support in lxml

May 10, 2006


Hello Stefan,

Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
...
Careful, this is more or less how tounicode() is currently implemented
(although at the libxml2 level). It currently serializes to UTF-8 (which, at
least, is pretty fast in libxml2, as all strings are already UTF-8) and then
calls the Python API function to convert from UTF-8 to Python unicode in one
run (which is also pretty efficient). It's difficult to do otherwise, as
libxml2 and Python have independent memory management, so we can't just manage
pointers here.
...
Note also that libxml2 uses a dynamically adapted output buffer, so it likely
uses more memory during serialization than absolutely necessary.
...
So, while the idea of the API is that it's more efficient (which it still is),
the gain may not be as big as expected. But since tostring uses the same
mechanism (and thus suffers from the same problem), the gain in overhead is
still about 1/3 if the result is required as unicode.
I was thinking lxml would return the data encoded as unicode, in the
same format Python uses, and thus the gain would be more dramatic. In
this case, I think you should judge how much more efficient that is than
calling .tostring(encoding), and implement it if the gain is reasonable.
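
To make the comparison concrete, here is a rough sketch of the two routes
being discussed. It is only an illustration under stated assumptions: it uses
the standard library's xml.etree.ElementTree as a stand-in so that it runs
without lxml installed, the helper names are hypothetical, and it says nothing
about what lxml or libxml2 do internally.

```python
import xml.etree.ElementTree as ET


def to_unicode_via_utf8(element):
    """Serialize to UTF-8 bytes first, then decode the whole buffer at once.

    This mirrors the two-step route described in the quoted text:
    an intermediate UTF-8 byte string exists before the final str.
    """
    utf8_bytes = ET.tostring(element, encoding="utf-8")
    return utf8_bytes.decode("utf-8")


def to_unicode_direct(element):
    """Ask the serializer for a text (str) result directly.

    Stand-in for a hypothetical "return unicode natively" API; whether a
    given library avoids the intermediate buffer is an assumption.
    """
    return ET.tostring(element, encoding="unicode")


# Small sample document with a non-ASCII character.
root = ET.fromstring("<doc><p>caf\u00e9</p></doc>")

via_utf8 = to_unicode_via_utf8(root)
direct = to_unicode_direct(root)

# Both routes should produce the same text; only the cost differs.
print(via_utf8 == direct)
```

Timing both helpers on a large document (for example with the timeit module)
would be one way to judge whether the difference is worth implementing.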
-- 
Best regards,
 Steve                            mailto:howe@carcass.dhs.org

Steve Howe