Hello Stefan,

Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just pass pointers around here.
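For illustration, the two-step approach described above (serialize to UTF-8 bytes, then decode to a unicode string in one pass) can be sketched roughly as follows. This is not lxml's actual implementation, which works at the libxml2 level; it uses the standard library's xml.etree.ElementTree as a stand-in, and the helper name tounicode_via_utf8 is made up for this sketch.

```python
import xml.etree.ElementTree as ET


def tounicode_via_utf8(element):
    """Hypothetical helper: serialize to UTF-8 bytes, then decode once."""
    # Step 1: serialize the tree into a UTF-8 encoded byte string.
    utf8_bytes = ET.tostring(element, encoding="utf-8")
    # Step 2: convert the whole buffer to a unicode string in one run.
    return utf8_bytes.decode("utf-8")


root = ET.fromstring("<doc><item>caf\u00e9</item></doc>")
text = tounicode_via_utf8(root)
print(type(text).__name__, len(text))
```

Note that the intermediate byte string and the final unicode string both exist in memory at the same time, which is the copy that cannot be avoided here.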
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary.
So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode.

I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic. In this case, I think you should judge how much more efficient that is than calling .tostring(encoding), and implement it if the gain is reasonable.
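A quick timing comparison might be one way to judge that. The sketch below is only a rough outline under some assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, the function names via_bytes and direct_unicode are made up, and the document size and repeat count are arbitrary. The numbers it prints say nothing about lxml itself; the same pattern would have to be pointed at the real functions.

```python
import timeit
import xml.etree.ElementTree as ET

# Build a small test document with some non-ASCII text (sizes are arbitrary).
root = ET.Element("root")
for i in range(1000):
    child = ET.SubElement(root, "item")
    child.text = "caf\u00e9 %d" % i


def via_bytes():
    # Serialize to an encoded byte string, then decode it afterwards.
    return ET.tostring(root, encoding="utf-8").decode("utf-8")


def direct_unicode():
    # Ask the serializer for a unicode string directly.
    return ET.tostring(root, encoding="unicode")


t_bytes = timeit.timeit(via_bytes, number=20)
t_direct = timeit.timeit(direct_unicode, number=20)
print("bytes + decode: %.4fs, direct unicode: %.4fs" % (t_bytes, t_direct))
```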
-- Best regards, Steve mailto:howe@carcass.dhs.org