Hi Steve,

Steve Howe wrote:
Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one pass (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just manage pointers here.
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary.
So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring() uses the same mechanism (and thus suffers from the same problem), tounicode() still saves about 1/3 of the overhead if the result is required as unicode.
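The two-step path described above (serialize to UTF-8, then decode to unicode in a single pass) can be sketched roughly as follows. This is only an illustration using the standard library's xml.etree.ElementTree as a stand-in; it is not the lxml or libxml2 code, and the helper names serialize_utf8() and to_unicode() are made up for this example.

```python
import xml.etree.ElementTree as ET


def serialize_utf8(element):
    """Step 1: serialize the tree to a UTF-8 encoded byte string."""
    return ET.tostring(element, encoding="utf-8")


def to_unicode(element):
    """Step 2: decode the UTF-8 bytes into a unicode string in one pass.

    The decode step creates a new string object, so it doubles as the
    copy from the serializer's buffer into a Python-owned object.
    """
    return serialize_utf8(element).decode("utf-8")


# Hypothetical sample document with one non-ASCII character.
root = ET.fromstring("<root><item>caf\u00e9</item></root>")
utf8_bytes = serialize_utf8(root)
text = to_unicode(root)
```

Both results come from the same serialization step; the unicode variant only adds the decode, which is why the two calls share most of their cost.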
I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic.
I guess you mean libxml2 here, not lxml. Given the above procedure, I don't think it makes a big difference in speed whether libxml2 encodes to native Python unicode (from its internal UTF-8 data) or Python does that from the UTF-8 data that libxml2 serialized. In any case, we'd have to copy the buffer to get it into Python. I assume that the libxml2->UTF-8->Python approach is already the most memory-friendly order in most cases, as UTF-8 tends to be (much) shorter than 32-bit unicode (which the Python interpreter *may* use, although it *may* also be 16-bit). So generating everything in UTF-8 and then expanding it to unicode actually saves RAM compared to copying from unicode to unicode.
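To put rough numbers on the size argument, here is a small sketch that compares the byte length of one string in UTF-8, 16-bit and 32-bit encodings. It measures explicit encodings only, not the interpreter's internal string representation, and the sample markup and the helper name encoded_sizes() are made up for this example.

```python
def encoded_sizes(text):
    """Return the byte length of text in three encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Mostly-ASCII markup, which is the typical case for serialized XML.
sample = "<root><item>caf\u00e9</item></root>"
sizes = encoded_sizes(sample)

for name, size in sizes.items():
    print(name, size)
```

For this 30-character sample, the UTF-8 form takes 31 bytes, against 60 bytes at 16 bits per character and 120 bytes at 32 bits per character, so an intermediate UTF-8 buffer is the smallest of the three.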
In this case, I think you should judge how much more efficient that is than calling .tostring(encoding), and implement it if the gain is reasonable.
Sorry, I don't understand what you mean here. This is all done at the C level: serialization and conversion. If you did the same at the Python level, it could not be faster or less memory-intensive. But you would still have to copy the string before you pass it back through the API. So doing the conversion /as/ the copy operation is the most efficient way.

Stefan