[lxml-dev] Python unicode string support in lxml
Hi all, I had a discussion with Fredrik lately and it convinced me (though not Fredrik) that it would be a good idea to improve the support for Python unicode strings in lxml.etree. I think that unicode strings are the most comfortable way of doing XML I/O from/to strings in Python, so I added support for simply calling unicode() on XML nodes and ElementTrees. It behaves just like tostring(), but always returns Python unicode strings:
>>> from lxml import etree
>>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
>>> root = etree.XML(uxml)
>>> unicode(root)
u'<test> \uf8d1 + \uf8d2 </test>'
>>> el = etree.Element("test")
>>> unicode(el)
u'<test/>'
>>> subel = etree.SubElement(el, "subtest")
>>> unicode(el)
u'<test><subtest/></test>'
>>> unicode( etree.ElementTree(el) )
u'<test><subtest/></test>'
Note that ElementTree does not support this at all. It will raise a parser exception in the XML() call above and return the same generic strings for unicode() as it does for str(). There is a longer doctest in http://codespeak.net/svn/lxml/trunk/doc/api.txt that explains this in more detail. As usual: any comments appreciated. Stefan
Hello Stefan, Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
I had a discussion with Fredrik lately and it convinced me (though not Fredrik) that it would be a good idea to improve the support for Python unicode strings in lxml.etree. I think that unicode strings are the most comfortable way of doing XML I/O from/to strings in Python, so I added support for simply calling unicode() on XML nodes and ElementTrees. It behaves just like tostring(), but always returns Python unicode strings:
>>> from lxml import etree
>>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
>>> root = etree.XML(uxml)
>>> unicode(root)
u'<test> \uf8d1 + \uf8d2 </test>'
>>> el = etree.Element("test")
>>> unicode(el)
u'<test/>'
>>> subel = etree.SubElement(el, "subtest")
>>> unicode(el)
u'<test><subtest/></test>'
>>> unicode( etree.ElementTree(el) )
u'<test><subtest/></test>'
Note that ElementTree does not support this at all. It will raise a parser exception in the XML() call above and return the same generic strings for unicode() as it does for str().
There is a longer doctest in http://codespeak.net/svn/lxml/trunk/doc/api.txt that explains this in more detail.
As usual: any comments appreciated.
Shouldn't this be implemented as etree.tounicode() or something like that instead ? That would be more intuitive, since there is already a tostring() method. And since str(root) returns something like '<Element a at 8413144>', I would also expect unicode(root) to behave like that.
Whatever the calling method gets named, it's a great feature, thanks. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
simply calling unicode() on XML nodes and ElementTrees. It behaves just like tostring(), but always returns Python unicode strings:
el = etree.Element("test") unicode(el) u'<test/>'
As usual: any comments appreciated.
Shouldn't this be implemented as etree.tounicode() or something like that instead ? That would be more intuitive, since there is already a tostring() method. And since str(root) returns something like '<Element a at 8413144>', I would also expect unicode(root) to behave like that.
Actually, I had first implemented it as "etree.tounicode()" and then switched to plain "unicode()" as I thought /that/ would be more intuitive... Note that _XSLTResultTree already supports str() and now also supports unicode() for the same thing (but returning unicode). I may let myself be convinced that this case is different, though. I'm not sure which is better. Maybe "tounicode()" really prevents people from thinking it should behave as str() - as you do. It's trivial to change, but let me wait to see if other people have similar feelings on this. So, don't consider this feature stable for now. Stefan
Hi again, Stefan Behnel wrote:
Steve Howe wrote:
Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
simply calling unicode() on XML nodes and ElementTrees. It behaves just like tostring(), but always returns Python unicode strings:
>>> el = etree.Element("test")
>>> unicode(el)
u'<test/>'
As usual: any comments appreciated.
Shouldn't this be implemented as etree.tounicode() or something like that instead ? That would be more intuitive, since there is already a tostring() method.
I think there's one good argument that fits independent of the question which is more intuitive: extensibility. The unicode() function is fixed and does not allow us to extend the call parameters to support things like "prettyprint=True" keyword arguments. And that's already a difference to "_XSLTResultTree.__unicode__": that API will never need to be extended as it is configured through xsl:output. Ok, so I'm convinced that our home-grown tounicode() is better. I'll fix it on the trunk. Stefan
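(To make the extensibility point concrete - a sketch only; the prettyprint keyword is hypothetical, taken from the paragraph above, not an existing lxml option:)

unicode(el)                             # no way to pass serialization options
etree.tounicode(el)                     # current behaviour
etree.tounicode(el, prettyprint=True)   # the kind of extension tounicode() leaves room for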
Hello Stefan, Wednesday, May 10, 2006, 7:49:59 AM, you wrote:
Stefan Behnel wrote:
Steve Howe wrote:
Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
simply calling unicode() on XML nodes and ElementTrees. It behaves just like tostring(), but always returns Python unicode strings:
>>> el = etree.Element("test")
>>> unicode(el)
u'<test/>'
As usual: any comments appreciated.
Shouldn't this be implemented as etree.tounicode() or something like that instead ? That would be more intuitive, since there is already a tostring() method.
I think there's one good argument that fits independent of the question which is more intuitive: extensibility. The unicode() function is fixed and does not allow us to extend the call parameters to support things like "prettyprint=True" keyword arguments.
And that's already a difference to "_XSLTResultTree.__unicode__": that API will never need to be extended as it is configured through xsl:output.
Ok, so I'm convinced that our home-grown tounicode() is better. I'll fix it on the trunk.
Well thought. I think it would not hurt, however, if unicode() called .tounicode() with default params, *if* str() behaved the same way. In fact, there is no point in printing <Element a at 8413144> at all - is there ? That would bring the best of both worlds: the extensibility/compatibility of .tostring()/.tounicode(), and the intuitive str()/unicode() call.
Besides, repr(root) would still print the same as str(root) does today, in case someone really wants that... -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
I think it would not hurt, however, if unicode() called .tounicode() with default params, *if* str() behaved the same way. In fact, there is no point in printing <Element a at 8413144> at all - is there ? That would bring the best of both worlds: the extensibility/compatibility of .tostring()/.tounicode(), and the intuitive str()/unicode() call.
You know, that was exactly the first thing that came to my mind when I thought about it: who needs those stupid str() results anyway? :) I then, however, ducked away from the holy cow. So, your proposal is to change the current behaviour of str()/unicode() into this:
str(element)     == etree.tostring(element)
unicode(element) == etree.tounicode(element)
and to make repr(element) the obvious replacement. The problem is that current code (for lxml or ElementTree) may rely on the fact that str() is a simple thing to call on an element and that it does *not* do anything recursively. The above modification changes the runtime complexity of these calls, and that can really make a difference. Imagine debugging (or logging) output where someone adds a str(element) to see what is currently being dealt with or to trace the way some element takes through a processing chain. So, since the above change is only a minor improvement compared to calling tounicode/tostring directly (as few as 2 characters if you do the respective import), I'm -0.5 on breaking ElementTree compatibility in these cases. Stefan
Stefan Behnel wrote: [snip]
So, since the above change is only a minor improvement compared to calling tounicode/tostring directly (as few as 2 characters if you do the respective import), I'm -0.5 on breaking ElementTree compatibility in these cases.
-1 on breaking ElementTree compatibility. tounicode()'s explicit behavior seems better to me than __unicode__. Let's be very careful with implicit behavior in the area of string creation - explicit is better here. With implicit behavior, it's just too easy for a developer to do something wrong with encodings and then get very confused. Regards, Martijn
Steve Howe wrote: [snip]
Well thought. I think it would not hurt, however, if unicode() called .tounicode() with default params, *if* str() behaved the same way. In fact, there is no point in printing <Element a at 8413144> at all - is there ? That would bring the best of both worlds: the extensibility/compatibility of .tostring()/.tounicode(), and the intuitive str()/unicode() call.
I'm not sure what behavior exactly is being proposed here, but I strongly urge lxml not to print serialized XML instead of an element automatically. It's too implicit, and during debugging it might get very annoying to see massive amounts of XML if you just want to see what elements you currently have in, say, a list. Anyway, that concerns repr() more than str(), but I'm still worried. I'd suggest sticking to whatever behavior ElementTree has in this area. Regards, Martijn
Steve Howe wrote:
there is no point in printing <Element a at 8413144> at all - is there ?
depends on how large XML files you work with, of course. I prefer an API that forces me to be a bit more explicit than a plain "print" before dumping 10 megabytes of stuff to the console... </F>
Hello Fredrik,
Wednesday, May 10, 2006, 11:58:09 AM, you wrote:
depends on how large XML files you work with, of course. I prefer an API that forces me to be a bit more explicit than a plain "print" before dumping 10 megabytes of stuff to the console...
I understand, but I think a programmer would expect str(root) to print the string representation of the tree, just as he calls str() on an int and sees "1" instead of "<int object at xxxxx>". That breaks Pythonic behaviour. Large dumps will also happen when printing *any* large text to the screen, and just as Python won't "protect" you from such a dump, I don't see a point in doing it here. If a programmer does that dump once, I think he should be smart enough to press Ctrl+C and change his code. Anyway, I don't see that as bad design, just as a matter of taste, and I'm happy with either way - and I agree it would probably be bad to break ElementTree compatibility even if we disagree about something. I would have designed it to have both str() and .tostring() support, with the first calling the second. As I said, other Python types use str() to convert from their native form to strings, so I think that should also be used with Elements and ElementTrees instead of the repr() output - when people want that, they should use repr(). From the Python documentation:
repr(object)
    Return a string containing a printable representation of an object. (...)
str([object])
    Return a string containing a nicely printable representation of an object. (...)
-- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
I understand, but I think a programmer would expect str(root) to print the string representation of the tree, just as he calls str() on an int and sees "1" instead of "<int object at xxxxx>". [...] Anyway, I don't see that as bad design, just as a matter of taste, and I'm happy with either way - and I agree it would probably be bad to break ElementTree compatibility even if we disagree about something. I would have designed it to have both str() and .tostring() support, with the first calling the second. [...] From the Python documentation:
repr(object)
    Return a string containing a printable representation of an object. (...)
str([object])
    Return a string containing a nicely printable representation of an object. (...)
It's definitely a matter of taste. There's also the question of what exactly is meant by "object" here: the Element itself? The entire tree below the Element? Sadly, that makes a huge difference... Regarding unicode() vs. tounicode(), I think both ideas are intuitive in a way and neither of them has clear advantages. So it's the holy cow that makes the difference. (Also: there should be one - and preferably only one - way of doing it) Regarding tounicode() or not: I can understand that Fredrik has objections to using Unicode to carry XML in general. But I really don't see why we should try to actively prevent users from efficiently getting a straight unicode string out of the API if they want it. Python distinguishes between str and unicode - we can't just go "oh well, it shouldn't have been that way, so we won't support it". We can always add a docstring to "tostring" and "tounicode" saying that the first is preferable for serialization to files. But then, what is "write" for? Stefan
On Wed, 2006-05-10 at 12:23 +0200, Stefan Behnel wrote:
Actually, I had first implemented it as "etree.tounicode()" and then switched to plain "unicode()" as I thought /that/ would be more intuitive...
That would be intuitive if str(Element) returned the string representation of the XML rather than '<Element aa at 2aaaac826210>'; otherwise that is inconsistent behavior, which is worse than just the lack of a shortcut.
Steve Howe wrote:
Whatever the calling method gets named, its a great feature, thanks.
so what's your use case? (I hope you're aware that the XML file format is defined in terms of encoded data, not as sequences of Unicode code points, and that XML encoding involves more than just character sets; there's no such thing as an "XML document in a Unicode string") Stefan's argument is basically "we should add it because we can", which is a rather lousy way to design software. </F>
Hi Fredrik, Fredrik Lundh wrote:
there's no such thing as an "XML document in a Unicode string"
Well, there's things like "XML documents in files", "XML documents in HTTP" and "XML documents in SMTP", so why not "XML documents in Unicode strings"? Stefan
Stefan Behnel wrote:
Fredrik Lundh wrote:
there's no such thing as an "XML document in a Unicode string"
Well, there's things like "XML documents in files", "XML documents in HTTP" and "XML documents in SMTP", so why not "XML documents in Unicode strings"?
files, HTTP, and SMTP all deal with bytes (or if you prefer, octets). a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space. a Python Unicode string doesn't have an encoding. XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode write XML applications that will break when exposed to non-ASCII text. </F>
Fredrik Lundh wrote:
Stefan Behnel wrote:
Fredrik Lundh wrote:
there's no such thing as an "XML document in a Unicode string"
Well, there's things like "XML documents in files", "XML documents in HTTP" and "XML documents in SMTP", so why not "XML documents in Unicode strings"?
files, HTTP, and SMTP all deal with bytes (or if you prefer, octets).
a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space.
a Python Unicode string doesn't have an encoding.
The XML specification does not restrict how parsed entities encode code points into bit patterns. In the end the Python unicode string *does* encode code points into bit patterns in memory (in a way which has nice properties for indexing characters). It's very clear how to get to unicode code points from bit patterns given a unicode string, as they're more or less identical, and that's why normally we don't care about how Python stores unicode internally. The Python unicode string therefore seems to me to be a legitimate source of XML data. See also my reply to your previous mail where I actually quote the XML spec to back this up. :)
XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode write XML applications that will break when exposed to non-ASCII text.
XML is more than just a file format. The XML spec is careful to talk about 'entities'. It recognizes that an entity can exist in numerous encodings, and that the encoding information can be in the entity (the encoding declaration), but that it can also be externally specified. I agree that it is a valid argument against this API that people who do not understand unicode are going to make even more mistakes when using this. Having reviewed the API, I think the chances that people will get even more confused are relatively minor, though. The API as defined now refuses to guess in all cases:
* when XML() is presented with a unicode string, it's clear what to do, unless that string also contains an encoding declaration. In that case, the system refuses to guess and an exception is raised (see the sketch below).
* .tounicode() needs to be called explicitly in order to get unicode form of XML.
and that's it. A naive user would just open an XML file and pass that into the XML() function (or use the file access functions), and that will work (if the encoding declaration in the XML is correct). A naive user would also use tostring() as they don't know all that unicode stuff. I think naive users therefore aren't any worse off than before. Regards, Martijn
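(A small sketch of the refuse-to-guess behaviour described above - the exact exception type and message are assumptions, not verified output:)

>>> from lxml.etree import XML
>>> XML(u"<test/>")                     # no encoding declaration: unambiguous, parsed
<Element test at ...>
>>> XML(u"<?xml version='1.0' encoding='ASCII'?><test/>")
Traceback (most recent call last):
  ...
ValueError: Unicode strings with encoding declaration are not supported.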
Martijn Faassen wrote:
* .tounicode() needs to be called explicitly in order to get unicode form of XML.
and the user must then make sure to treat the output carefully, if he's doing anything with it at all. because it's not really Unicode; it just looks as if it is.
and that's it.
far from it. the real problem appears when you want to write the resulting bytes-encoded-in-Unicode string to a file, socket, or some other byte-oriented output device. what do you need to do to make this work, and what happens if you don't ? </F>
Fredrik Lundh wrote:
Martijn Faassen wrote:
* .tounicode() needs to be called explicitly in order to get unicode form of XML.
and the user must then make sure to treat the output carefully, if he's doing anything with it at all. because it's not really Unicode; it just looks as if it is.
Why is it "not really Unicode"?
and that's it.
far from it. the real problem appears when you want to write the resulting bytes-encoded-in-Unicode string to a file, socket, or some other byte-oriented output device. what do you need to do to make this work, and what happens if you don't ?
When you try to write a unicode string to a byte-oriented device, you'll have to encode, like always when you write a unicode string. Possibly you're pointing out the issue of the encoding header - if you would encode the string to latin-1 and save it, say, there'd be a problem as the XML does not carry along its encoding information in any encoding header. Regards, Martijn
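(A small sketch of that pitfall - the names and values are illustrative only:)

uxml = etree.tounicode(root)          # e.g. u'<test>f\xf6\xf6</test>', no encoding declaration
f = open("out.xml", "wb")
f.write(uxml.encode("iso-8859-1"))    # the bytes are latin-1, but nothing in the file says so
f.close()                             # a parser will assume UTF-8 and choke on the non-ASCII bytes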
Martijn Faassen wrote:
Fredrik Lundh wrote:
the real problem appears when you want to write the resulting bytes-encoded-in-Unicode
It's not "bytes-encoded-in-unicode". It's Python unicode. That's well defined. Everything inside the Python interpreter knows how to deal with that.
string to a file, socket, or some other byte-oriented output device. what do you need to do to make this work, and what happens if you don't ?
When you try to write a unicode string to a byte-oriented device, you'll have to encode, like always when you write a unicode string.
Possibly you're pointing out the issue of the encoding header - if you would encode the string to latin-1 and save it, say, there'd be a problem as the XML does not carry along its encoding information in any encoding header.
But then, that's just like taking a letter out of an envelope and saying "Hey! How did that get here?" Stefan
Martijn Faassen wrote:
Fredrik Lundh wrote:
the real problem appears when you want to write the resulting bytes-encoded-in-Unicode string to a file, socket, or some other byte-oriented output device. what do you need to do to make this work, and what happens if you don't ?
Possibly you're pointing out the issue of the encoding header - if you would encode the string to latin-1 and save it, say, there'd be a problem as the XML does not carry along its encoding information in any encoding header.
Hmmm, I just noticed that we don't do that anyway (we rely on libxml2 here):
>>> from lxml.etree import XML, tostring, tounicode
>>> tostring( XML("<test/>") )
'<test/>'
>>> tostring( XML("<test/>"), encoding="UTF-8" )
'<test/>'
>>> tounicode( XML("<test/>") )
u'<test/>'
That's very consistent as far as lxml is concerned. ElementTree handles this a bit different, though:
>>> from elementtree.ElementTree import XML, tostring
>>> tostring(XML("<test/>"))
'<test />'
>>> tostring(XML("<test/>"), encoding="UTF-8")
"<?xml version='1.0' encoding='UTF-8'?>\n<test />"
This admittedly makes sense when you have the intention of handing that string to someone else. Stefan
Stefan Behnel wrote:
That's very consistent as far as lxml is concerned. ElementTree handles this a bit different, though:
>>> from elementtree.ElementTree import XML, tostring
>>> tostring(XML("<test/>"))
'<test />'
>>> tostring(XML("<test/>"), encoding="UTF-8")
"<?xml version='1.0' encoding='UTF-8'?>\n<test />"
This admittedly makes sense when you have the intention of handing that string to someone else.
that's a wart, though: the current ET serializer outputs the <?xml?> header as soon as you use a non-default encoding (and the default encoding is us-ascii).
iirc, ET 1.3 adds an "xml_declaration" option which can be set to
None (the default): old "it depends on the encoding" behaviour
a true value: always include, with version and encoding
a false value (except None): never include
feel free to emulate ET 1.3 here. </F>
Hi Fredrik, Fredrik Lundh wrote:
the current ET serializer outputs the <?xml?> header as soon as you use a non-default encoding (and the default encoding is us-ascii).
Yup, I understood that from the tests. :)
iirc, ET 1.3 adds an "xml_declaration" option which can be set to
None (the default): old "it depends on the encoding" behaviour
a true value: always include, with version and encoding
a false value (except None): never include
feel free to emulate ET 1.3 here.
Ok, we do that now. Would you know about any other API changes that we should take into consideration? Stefan
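(For reference, a sketch of how the emulated option might look from the caller's side - treat the exact keyword name and output as assumptions:)

>>> from lxml.etree import XML, tostring
>>> tostring(XML("<test/>"), encoding="UTF-8", xml_declaration=False)
'<test/>'
>>> tostring(XML("<test/>"), encoding="UTF-8", xml_declaration=True)
"<?xml version='1.0' encoding='UTF-8'?>\n<test/>"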
Hello Fredrik, Wednesday, May 10, 2006, 1:07:47 PM, you wrote:
far from it. the real problem appears when you want to write the resulting bytes-encoded-in-Unicode string to a file, socket, or some other byte-oriented output device. what do you need to do to make this work, and what happens if you don't ?
The same thing will happen that happens to any unicode object in Python: you should encode it to some str form before doing such a thing, or you should have called .tostring() in the first place. Your argument seems to be the same as "do not support unicode at all, it will end up as a byte sequence on disk anyway". There are cases for using str, and cases for using unicode. Not all data will be promptly serialized; some will be processed (as unicode) or even printed to a unicode console.
-- Best regards, Steve mailto:howe@carcass.dhs.org
Fredrik Lundh wrote:
files, HTTP, and SMTP all deal with bytes (or if you prefer, octets).
a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space.
Ok, so then that means that unicode strings are completely unparsable. A standards-compliant XML API should raise an error when it is asked to parse a sequence of unicode code points. Let's see...
>>> from elementtree.ElementTree import XML
>>> XML(u"<test/>")
<Element test at 2ad6c0771bd8>
What? I didn't put any bytes in there? Where did the element come from?
a Python Unicode string doesn't have an encoding.
Well, it does, internally. And it's even well-defined across the whole platform.
XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode write XML applications that will break when exposed to non-ASCII text.
You're definitely the first to tell me that using unicode makes people write programs that break for non-ascii text... Stefan
Stefan Behnel wrote:
a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space.
Ok, so then that means that unicode strings are completely unparsable. A standards-compliant XML API should raise an error when it is asked to parse a sequence of unicode code points. Let's see...
>>> from elementtree.ElementTree import XML
>>> XML(u"<test/>")
<Element test at 2ad6c0771bd8>
What? I didn't put any bytes in there? Where did the element come from?
the CPython interpreter uses a default encoding, and attempts to *encode* Unicode strings using this encoding when you pass them to an interface that expects bytes. if that doesn't work, the function won't even get called; instead, you'll get a "can't encode" exception:
XML(u"<föö/>") Traceback (most recent call last): File "<stdin>", line 1, in ? File "<string>", line 67, in XML UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
still think that XML supports Unicode ? or are you saying that the subset of Unicode that happens to be ASCII is a good enough subset ?
a Python Unicode string doesn't have an encoding.
Well, it does, internally. And it's even well-defined across the whole platform.
that's an implementation detail. a Python implementation may use whatever representation it wants on the inside. on the outside, there's no encoding (in the traditional sense); all there is is a sequence of Unicode code points.
XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode write XML applications that will break when exposed to non-ASCII text.
You're definitely the first to tell me that using unicode makes people write programs that break for non-ascii text...
using Unicode with interfaces that expect bytes will break, if the Unicode string contains the wrong things. for example,
XML(u"<föö/>") Traceback (most recent call last): File "<stdin>", line 1, in ? File "<string>", line 67, in XML UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
and
f = open("file", "wb") f.write(u"föö") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
and so on. which means that
f = open("file.xml", "wb") f.write(ET.tounicode(tree))
will sometimes work, and sometimes fail, and sometimes generate broken XML files, depending on the data. while
f = open("file.xml", "wb") f.write(ET.tostring(tree))
will always do the right thing. </F>
Hello Fredrik, Wednesday, May 10, 2006, 2:01:30 PM, you wrote: [...]
and so on. which means that
f = open("file.xml", "wb") f.write(ET.tounicode(tree))
will sometimes work, and sometimes fail, and sometimes generate broken XML files, depending on the data. while
f = open("file.xml", "wb") f.write(ET.tostring(tree))
will always do the right thing.
...agreed, *if* the "right thing" is serializing. If I want to process that unicode data, I would have to encode it as unicode, then process it. And for a large string, that would eat a lot of resources, not to mention all the trouble involved. As I pointed out: there are places for using .tounicode(), and places for using .tostring().
-- Best regards, Steve mailto:howe@carcass.dhs.org
Steve Howe wrote:
...agreed, *if* the "right thing" is serializing. If I want to process that unicode data, I would have to encode it as unicode
Unicode is not an encoding, it's a text model used by, among others, XML's infoset, and Python's Unicode string type. Encoding Unicode as Unicode doesn't make sense, unless you're confusing encoding with serialization. And *any* conversion between XML infoset (which is the XML information model) and the XML file representation is serialization; Stefan's "tounicode" function serializes to UTF-16 or UCS-4, depending on platform, and stuffs the result into the internal buffer of a Python Unicode string.
then process it.
As I just pointed out, if you want to process serialized XML, nothing keeps you from doing that on the byte stream (be it UTF-8 or ASCII or whatever). *Why* you would want to process the serialized form of an XML infoset instead of the actual infoset is still an open question. (I know people who've written XML-to-SGML post-processors for ET, but they don't count ;-) The right way to solve that kind of problems is to use a custom serializer, like the one in Kid).
And for a large string, that would eat a lot of resources, not to mention all the trouble involved. As I pointed out: there are places for using .tounicode(), and places for using .tostring().
You keep saying this, but Martijn is the only one who's attempted to list some use cases. I'm pretty sure he made them all up on the spot; I'm still waiting for some real-life cases. </F>
Hello Fredrik, Wednesday, May 10, 2006, 3:46:28 PM, you wrote:
Unicode is not an encoding, it's a text model used by, among others, XML's infoset, and Python's Unicode string type. Encoding Unicode as Unicode doesn't make sense, unless you're confusing encoding with serialization.
I mean I would have to str.encode() the string. I know what Unicode is.
As I just pointed out, if you want to process serialized XML, nothing keeps you from doing that on the byte stream (be it UTF-8 or ASCII or whatever).
Doesn't "resource saving" have a consideration here ?
*Why* you would want to process the serialized form of an XML infoset instead of the actual infoset is still an open question. (I know people who've written XML-to-SGML post-processors for ET, but they don't count ;-) The right way to solve that kind of problems is to use a custom serializer, like the one in Kid).
You keep saying this, but Martijn is the only one who's attempted to list some use cases. I'm pretty sure he made them all up on the spot; I'm still waiting for some real-life cases.
Ok:
1) say you want to search "föö" on the XML, but it could have any case sense, such as FÖÖ. uxml = etree.tounicode(root).lower() if uxml.find('föö') > -1: print 'found' 2) To print unicode directly into the console instead of a string: print etree.tounicode(root) 3) Provide unicode data directly to a native unicode database such as Berkeley DBXML, which uses UTF-8 for all its operations: uxml = etree.tounicode(root) mgr = XmlManager() uc = mgr.createUpdateContext() container = mgr.createContainer("test.dbxml") container.putDocument('mydoc', uxml, uc) In general, on all situations where you will have to encode() the output from etree.tostring(), its much better to have that value given already as unicode. The point of etree.tounicode() is avoiding an unnecessary, resource-wasting .encode() call. And if you don't want unicode, use etree.tostring(). What is so mysterious here ? -- Best regards, Steve mailto:howe@carcass.dhs.org
Steve Howe wrote:
As I just pointed out, if you want to process serialized XML, nothing keeps you from doing that on the byte stream (be it UTF-8 or ASCII or whatever).
Doesn't "resource saving" have a consideration here ?
what's "resource saving" by using a slower serialization model that needs more memory ?
1) say you want to search "föö" on the XML, but it could have any case sense, such as FÖÖ.
uxml = etree.tounicode(root).lower()
if uxml.find('föö') > -1:
    print 'found'
why would you do this on the serialized document, rather than on the infoset ? how would you generalize the above to handle arbitrary strings ? what about surrogates ?
2) To print unicode directly into the console instead of a string:
print etree.tounicode(root)
that's not portable, of course. Python cannot print arbitrary Unicode to stdout on all platforms. it has no trouble printing ASCII to stdout...
3) Provide unicode data directly to a native unicode database such as Berkeley DBXML, which uses UTF-8 for all its operations:
uxml = etree.tounicode(root)
mgr = XmlManager()
uc = mgr.createUpdateContext()
container = mgr.createContainer("test.dbxml")
container.putDocument('mydoc', uxml, uc)
according to the DBXML documentation, it expects well-formed XML, not necessarily "UTF-8", and definitely not "unicode". have you tried the above with non-ASCII data? with latin-1 data serialized as "iso-8859-1" ? what does sys.getdefaultencoding() return on your machine ? </F>
what's "resource saving" by using a slower serialization model that needs more memory ? In the first place, I was thinking lxml would be able to return an unicode object directly in the Python internal format, and that's where
Hello Fredrik, Wednesday, May 10, 2006, 5:07:10 PM, you wrote: the resource saving was expect from. If it cannot handle that, there is no point in implementing it, indeed.
why would you do this on the serialized document, rather than on the infoset ? how would you generalize the above to handle arbitrary strings ? what about surrogates ?
For any reason the user wants. That was just an example. A text editor handling unicode is an example.
As I said, I just wanted to avoid an extra .encode() call which would work with two buffers in memory.
that's not portable, of course. Python cannot print arbitrary Unicode to stdout on all platforms. it has no trouble printing ASCII to stdout...
"Not portable" is not an argument. Python supports lots of other non-portable APIs.
according to the DBXML documentation, it expects well-formed XML, not necessarily "UTF-8", and definitely not "unicode". have you tried the above with non-ASCII data? with latin-1 data serialized as "iso-8859-1" ? what does sys.getdefaultencoding() return on your machine ?
I can't do those tests right now, sorry, but it should be 'ascii'.
DBXML expects NodeStorage containers to be UTF-8 (or plain ASCII), and the XQuery interfaces support only UTF-8. Anyway, as I pointed out several times, I just want to avoid having a string in memory, then creating another UTF-8 object - it's unnecessary if you wanted unicode in the first place. I'm sure you understand it's important to have encoding support, since .tostring() supports it - but through an inefficient way due to implementation issues. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
1) say you want to search "föö" on the XML, but it could have any case sense, such as FÖÖ.
uxml = etree.tounicode(root).lower()
if uxml.find('föö') > -1:
    print 'found'
That's maybe not the best example, as serialization already involves a tree traversal. There is not much point in serializing just to search for a string in the .text fields. Stefan
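(A sketch of the tree-based alternative, using the ElementTree-style getiterator() traversal - illustrative only, not a benchmark:)

found = False
for el in root.getiterator():
    if el.text and u'föö' in el.text.lower():
        found = True
        break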
Fredrik Lundh wrote:
Steve Howe wrote:
Whatever the calling method gets named, its a great feature, thanks.
so what's your use case?
(I hope you're aware that the XML file format is defined in terms of encoded data, not as sequences of Unicode code points, and that XML encoding involves more than just character sets; there's no such thing as an "XML document in a Unicode string")
For fun let's look at the XML spec and see whether we can get some answers there. The spec says:

    The mechanism for encoding character code points into bit patterns MAY vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1

It also says:

    In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: ...
    [...]
    In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

Confusingly, in the first part it talks about 'stored in an encoding other than..' and later on it talks about "information provided by an external transport protocol". Still, my interpretation would be that in the case of Python unicode strings, we *do* have a form of 'external character encoding information'. So, in the presence of such external information, this means that the encoding declaration is *not* necessary in the document (and in fact I'd say it shouldn't be there in case of XML in unicode strings).

Whether it's useful in practical applications to have the ability to store XML in Python unicode strings is an interesting debate. In the case of in-memory XML processors it might simplify matters if you can just treat any text everywhere as unicode. At least, it'd simplify combining XML text with non-XML text somehow. (You'd prefer to use the ElementTree API for such manipulation though. :) On the other hand, in the lxml implementation it'll be slower than actually dealing with XML as UTF-8, as that's what libxml2 will be able to parse most quickly. So we could argue that encouraging the above usage pattern is going to lead to less than optimal performance. I don't consider that a big problem as fast performance is still available, though.

I'm fine with a tounicode() output function (I'd be more worried about the unicode(), but I'm glad that idea got revoked already). I also don't see harm in accepting unicode input into the XML() function. I see that it fails in case an encoding is expressed in the XML itself, so that's good. So, +1 to the current set of changes. Regards, Martijn
Martijn Faassen wrote:
Whether it's useful in practical applications to have the ability to store XML in Python unicode strings is an interesting debate. In the case of in-memory XML processors it might simplify matters if you can just treat any text everywhere as unicode.
the mapping between Unicode text in the infoset and serialized XML involves more things than just the Unicode-to-byte encoding, so that "simplification" is far from obvious.
At least, it'd simplify combining XML text with non-XML text somehow.
what exactly is "XML text", and why would you want to combine that with non-XML text? again, what's the use case ? </F>
Fredrik Lundh wrote:
Martijn Faassen wrote:
Whether it's useful in practical applications to have the ability to store XML in Python unicode strings is an interesting debate. In the case of in-memory XML processors it might simplify matters if you can just treat any text everywhere as unicode.
the mapping between Unicode text in the infoset and serialized XML involves more things than just the Unicode-to-byte encoding, so that "simplification" is far from obvious.
You mean escaped unicode entities? If you want to turn it back into the infoset, you pass it into XML(), right?
At least, it'd simplify combining XML text with non-XML text somehow.
what exactly is "XML text", and why would you want to combine that with non-XML text? again, what's the use case ?
I can come up with a few:
* quick and dirty applications that mess about with the XML text on a textual level. I agree there are usually better ways to do the same thing in a clear way (XSLT, ElementTree API).
* web applications that use unicode inside (Zope 3, Silva on Zope 2) that want to present XML in a web page. In Zope 3, HTTP response text is initially a unicode string before it's encoded to UTF-8 and sent out to the network. (Request variables are converted to unicode automatically as well.) In Zope 3, I'd need the XML encoded as a unicode string in order to put it on a web page. Putting something on a web page typically means combining it with HTML. Normally you'd need an extra escaping run for the <, > and such first, of course (see the sketch after this mail), which is in fact an excellent candidate for the 'quick and dirty' application above that's not easily solved another way.
* more generally, any application that uses a user interface framework that's unicode-aware. (it might be worthwhile to investigate that Java UI toolkits do. Java uses unicode strings everywhere and presumably also in the UI api, so how do they display XML text?)
I think the main use cases are in the area of XML being displayed in the context of a UI environment that's unicode native. This means that support for unicode in XML() is less necessary than 'tounicode()' (though I'm probably missing use cases), but since you already support the former in ElementTree as Stefan pointed out, we're following suit. :) Regards, Martijn
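(A small sketch of the web page case from the second bullet - xml.sax.saxutils.escape keeps unicode input as unicode output; the surrounding template handling is assumed:)

from xml.sax.saxutils import escape

snippet = escape(etree.tounicode(root))    # escape &, < and > for embedding in HTML
fragment = u'<pre>%s</pre>' % snippet      # stays unicode until the framework encodes the response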
Martijn Faassen wrote:
* quick and dirty applications that mess about with the XML text on a textual level. I agree there are usually better ways to do the same thing in a clear way (XSLT, ElementTree API).
or messing with the XML encoded text on the textual level. UTF-8 is carefully designed to allow things like this, of course...
* web applications that use unicode inside (Zope 3, Silva on Zope 2) that want to present XML in a web page. In Zope 3, HTTP response text is initially a unicode string before it's encoded to UTF-8 and sent out to the network.
so how do you return images from Zope 3 ? I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.
* more generally, any application that uses a user interface framework that's unicode-aware. (it might be worthwhile to investigate that Java UI toolkits do. Java uses unicode strings everywhere and presumably also in the UI api, so how do they display XML text?)
I doubt the set of applications that displays XML files as text is even noticeable compared to the set of applications that displays text from the XML infoset...
I think the main use cases are in the area of XML being displayed in the context of a UI environment that's unicode native.
which, frankly, means that the use case is almost nonexistent.
This means that support for unicode in XML() is less necessary than 'tounicode()' (though I'm probably missing use cases), but since you already support the former in ElementTree as Stefan pointed out
he's confused: the XML() function does not support Unicode (see my followup mail). </F>
Hello Fredrik, Wednesday, May 10, 2006, 2:33:29 PM, you wrote: [...]
I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.
I still don't understand the "forcing" thing here. Does anyone want to get rid of the .tostring() call ? Or just provide an alternative .tounicode() call for those who want unicode returned ?
-- Best regards, Steve mailto:howe@carcass.dhs.org
Steve Howe wrote:
I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.
I still don't understand the "forcing" thing here. Does anyone want to get rid of the .tostring() call ? Or just provide an alternative .tounicode() call for those who want unicode returned ?
It helps if you read the posts you quote: Martijn's example was a web framework that used Unicode strings for HTTP responses, and encoded it as UTF-8 on the way out. Such a framework won't be able to handle output from the current ET serializer, but it won't work with images, external templating systems, resources read from disk, preformatted resources, etc, either. </F>
Fredrik Lundh wrote:
Steve Howe wrote:
I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.
I still don't understand the "forcing" thing here. Does anyone want to get rid of the .tostring() call ? Or just provide an alternative .tounicode() call for those who want unicode returned ?
It helps if you read the posts you quote: Martijn's example was a web framework that used Unicode strings for HTTP responses, and encoded it as UTF-8 on the way out.
Right. Zope3 does this for any "text" (i.e. Unicode) responses. If the response body is "bytes" (an encoded string of some sort), it doesn't do that processing: it is then the application's job to have set the correct encoding into the 'Content-type' header.
Such a framework won't be able to handle output from the current ET serializer, but it won't work with images, external templating systems, resources read from disk, preformatted resources, etc, either.
The values so obtained are all "bytes" and not "text" in the Zope3 world. Tres.
--
Tres Seaver +1 202-558-7113 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Fredrik Lundh wrote: [snip]
Such a framework won't be able to handle output from the current ET serializer, but it won't work with images, external templating systems, resources read from disk, preformatted resources, etc, either.
Most of these are resources by themselves. I'm talking about the use case where XML content is mixed with template content, for instance in a form or for display of some XML on a web page. Obviously the Zope 3 publisher is capable of handling images, but I wouldn't want to have to change my whole web application to work with encoded strings just because I want to display an XML snippet on my web pages. Regards, Martijn
Fredrik Lundh wrote:
Martijn Faassen wrote:
* quick and dirty applications that mess about with the XML text on a textual level. I agree there are usually better ways to do the same thing in a clear way (XSLT, ElementTree API).
or messing with the XML encoded text on the textual level. UTF-8 is care- fully designed to allow things like this, of course...
* web applications that use unicode inside (Zope 3, Silva on Zope 2) that want to present XML in a web page. In Zope 3, HTTP response text is initially a unicode string before it's encoded to UTF-8 and sent out to the network.
so how do you return images from Zope 3 ?
It's a slightly different case; the image is a binary object by itself, and Zope 3 must take special measures somewhere so it doesn't try to encode. I'm talking about inclusion in a template (for instance a form).
I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.
I think it's a really great idea that Zope translates human-readable text (which most strings are) to unicode and back to UTF-8 again. It makes applications in Zope 3 unicode-aware without the user having to take special action, and remarkably free of unicode errors. It is possible (though I don't know the details) to force output of the whole page to be non-unicode already, and that's useful in special cases if you want XML over HTTP. In that case I wouldn't want to spit out unicode and have it recoded, for efficiency reasons.
* more generally, any application that uses a user interface framework that's unicode-aware. (it might be worthwhile to investigate that Java UI toolkits do. Java uses unicode strings everywhere and presumably also in the UI api, so how do they display XML text?)
I doubt the set of applications that displays XML files as text is even noticable compared to the set of applications that displays text from the XML infoset...
I think the main use cases are in the area of XML being displayed in the context of a UI environment that's unicode native.
which, frankly, means that the use case is almost nonexistent.
It's almost nonexistent, but not quite, and I have run into that very use case a number of times in the last five years, most recently with SilvaFlexibleXML. I'm sure web UIs for XML databases also have this problem, and I've seen one for eXist (in Java) among other things. The question would be whether this use case is strong enough to weigh against the drawbacks of tounicode().
This means that support for unicode in XML() is less necessary than 'tounicode()' (though I'm probably missing use cases), but since you already support the former in ElementTree as Stefan pointed out
he's confused: the XML() function does not support Unicode (see my followup mail).
Ah, good point. Perhaps it'd be worthwhile in ElementTree to do an assert for plain strings when that function is called then, as it gives the appearance that passing unicode strings into XML() works sometimes, and doesn't work other times. It demonstrates the same behavior you show with f.write(etree.tounicode(..)); it sometimes appears to work but sometimes doesn't, depending on the contents of the string. You point this out as a problem in another mail. Regards, Martijn
Hello Fredrik, Wednesday, May 10, 2006, 9:40:07 AM, you wrote:
so what's your use case?
I think it's obvious: any place where there is XML data represented as unicode and not as plain ASCII.
(I hope you're aware that the XML file format is defined in terms of encoded data, not as sequences of Unicode code points, and that XML encoding involves more than just character sets; there's no such thing as an "XML document in a Unicode string")
Yes, I am aware of the XML spec, thank you.
Stefan's argument is basically "we should add it because we can", which is a rather lousy way to design software.
I don't think that was his argument, and I didn't find your comment very elegant either...
-- Best regards, Steve mailto:howe@carcass.dhs.org
Steve Howe wrote:
so what's your use case?
I think it's obvious: any place where there is XML data represented as unicode and not as plain ASCII.
Huh? Have you noticed that tostring takes an optional encoding argument? </F>
Hello Fredrik, Wednesday, May 10, 2006, 4:25:20 PM, you wrote:
Steve Howe wrote:
I think it's obvious: any place where there is XML data represented as unicode and not as plain ASCII.
Huh? Have you noticed that tostring takes an optional encoding argument?
Won't that waste exactly the same resources as this ?
xml = etree.tostring(element).encode(encoding)
For a large XML document, this would more than double the memory requirements of that processing, when it could be returned directly as a unicode object. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
Wednesday, May 10, 2006, 4:25:20 PM, you wrote:
Have you noticed that tostring takes an optional encoding argument?
Won't that waste exactly the same resources as this ?
xml = etree.tostring(element).encode(encoding)
For a large XML document, this would more than double the memory requirements of that processing, when it could be returned directly as a unicode object.
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just mange pointers here. Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary. So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode. Stefan
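(Roughly the Python-level equivalent of the procedure described above - the real work happens at the C level, so this is only a sketch:)

utf8_bytes = etree.tostring(root, encoding="UTF-8")   # libxml2 serializes to UTF-8
result = utf8_bytes.decode("utf-8")                   # one more copy, into a Python unicode object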
Hello Stefan, Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just mange pointers here.
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary.
So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode.
I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic. In this case, I think you should judge how much more efficient that is than calling .tostring(encoding) and implement it if the gain is reasonable.
-- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
Wednesday, May 10, 2006, 5:01:43 PM, you wrote:
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just mange pointers here.
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary.
So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode.
I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic.
I guess you mean libxml2 here, not lxml. Given the above procedure, I don't think it's a big difference in speed if libxml2 encodes to native Python (from internal UTF-8 data) or if Python does that from libxml2 serialized UTF-8 data. In any case, we'd have to copy the buffer to get it into Python. I assume that the libxml2->UTF8->Python approach is already the most memory friendly order in most cases, as UTF-8 tends to be (much) shorter than 32bit unicode (which the Python interpreter *may* use, although it *may* also be 16bit). So generating everything in UTF-8 and then expanding it to unicode actually saves RAM compared to copying from unicode to unicode.
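(A rough size comparison to illustrate this - the in-memory figure depends on whether the interpreter was built with 16-bit or 32-bit unicode:)

>>> import sys
>>> u = u'\uf8d1' * 1000                              # 1000 non-ASCII characters
>>> len(u.encode('utf-8'))                            # UTF-8 serialization: 3 bytes per character here
3000
>>> len(u) * (sys.maxunicode > 0xffff and 4 or 2)     # internal buffer size (on a UCS-4 build)
4000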
In this case, I think you should judge how much more efficient that is than calling .tostring(encoding) and implement it if the gain is reasonable.
Sorry, I don't understand what you mean here. This is all done at the C-level: serialization and conversion. If you did the same at the Python level, it cannot be faster or less memory intensive. But you would still have to copy the string before you pass it back through the API. So doing the conversion /as/ the copy operation is the most efficient way. Stefan
Hello Stefan, Wednesday, May 10, 2006, 6:00:11 PM, you wrote:
Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just mange pointers here.
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary.
So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode.
I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic.
I guess you mean libxml2 here, not lxml. Given the above procedure, I don't think it's a big difference in speed if libxml2 encodes to native Python (from internal UTF-8 data) or if Python does that from libxml2 serialized UTF-8 data. In any case, we'd have to copy the buffer to get it into Python.
I assume that the libxml2->UTF8->Python approach is already the most memory friendly order in most cases, as UTF-8 tends to be (much) shorter than 32bit unicode (which the Python interpreter *may* use, although it *may* also be 16bit). So generating everything in UTF-8 and then expanding it to unicode actually saves RAM compared to copying from unicode to unicode.
In this case, I think you should judge how much more efficient that is than calling .tostring(encoding) and implement it if the gain is reasonable.
Sorry, I don't understand what you mean here. This is all done at the C-level: serialization and conversion. If you did the same at the Python level, it cannot be faster or less memory intensive. But you would still have to copy the string before you pass it back through the API. So doing the conversion /as/ the copy operation is the most efficient way. I meant lxml. I thought it could serialize the input stream from lxml into a Python unicode object without having the whole string in memory, doing it in chunks instead of retrieving a huge buffer, then converting it to unicode - just that.
-- Best regards, Steve mailto:howe@carcass.dhs.org
Hello Steve, Wednesday, May 10, 2006, 6:08:41 PM, you wrote:
I meant lxml. I thought it could serialize the input stream from lxml into a Python unicode object without having the whole string in memory, doing it in chunks instead of retrieving a huge buffer, then converting it to unicode - just that. Sorry, I meant "it could serialize the input stream from libxml2..."
-- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
I meant lxml. I thought it could serialize the input stream from lxml into a Python unicode object without having the whole string in memory, doing it in chunks instead of retrieving a huge buffer, then converting it to unicode - just that. Sorry, I meant "it could serialize the input stream from libxml2..."
Ok, I get it now. Yes, it *could* do that. But that's much more work than the way it is now. That would involve writing a libxml2 I/O writer ourselves, accepting UTF-8 data from the traversal process and then passing it to Python's converter step by step. I guess that's really for a future version. If anyone finds out that this is really needed, we may decide to implement it that way - under the same API. Note that the current implementation is efficient for the way it works (it's likely, though unverified, even a bit faster than the approach above if RAM is there). The above would be a different optimisation, for space. Stefan
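To illustrate the chunked idea in pure Python terms - this is not how lxml works today; write_chunks is a hypothetical callback standing in for the libxml2 I/O writer Stefan describes, and the incremental decoder requires Python 2.5 or later:

import codecs

def serialize_to_unicode(write_chunks):
    # write_chunks is a hypothetical function that calls our sink with
    # successive UTF-8 byte chunks as the tree is serialized.
    decoder = codecs.getincrementaldecoder('utf-8')()
    parts = []
    def sink(utf8_chunk):
        # decode each chunk as it arrives; the decoder keeps state across
        # chunk boundaries, so split multi-byte sequences are handled
        parts.append(decoder.decode(utf8_chunk))
    write_chunks(sink)
    parts.append(decoder.decode('', final=True))
    # note: the join still needs the chunks and the result in memory at
    # once, which is exactly the copying problem discussed below
    return u''.join(parts)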
Hi Steve, Stefan Behnel wrote:
Steve Howe wrote:
I meant lxml. I thought it could serialize the input stream from lxml into a Python unicode object without having the whole string in memory, doing it in chunks instead of retrieving a huge buffer, then converting it to unicode - just that. Sorry, I meant "it could serialize the input stream from libxml2..."
Ok, I get it now. Yes, it *could* do that. But that's much more work than the way it is now. That would involve writing a libxml2 I/O writer ourselves, accepting UTF-8 data from the traversal process and then passing it to Python's converter step by step.
I just noticed that it's even worse. The PyUnicode_DecodeUTF8Stateful function I had in mind (which is also only available in Python 2.4) doesn't allow us to grow the unicode string, so it would also require copying. So the only way I currently see to get the memory consumption down to about the size of the result string is letting libxml2 do the conversion, writing the resulting chunks into a Python memory buffer through a custom libxml2 I/O writer, growing the buffer ourselves as needed (which likely also involves allocating more memory than necessary), and then somehow tricking a unicode object into using it (PyUnicode_FromUnicode seems to allow creating an empty unicode object). So I really think it's worth waiting for a use case that shows how doubling the memory for unicode string serialization keeps someone from using lxml. Stefan
Hello Stefan, Thursday, May 11, 2006, 2:59:44 AM, you wrote:
I just noticed that it's even worse. The PyUnicode_DecodeUTF8Stateful function I had in mind (which is also only available in Python 2.4) doesn't allow us to grow the unicode string, so it would also require copying.
So the only way I currently see to get the memory consumption down to about the size of the result string is letting libxml2 do the conversion, writing the resulting chunks into a Python memory buffer through a custom libxml2 I/O writer, growing the buffer ourselves as needed (which likely also involves allocating more memory than necessary), and then somehow tricking a unicode object into using it (PyUnicode_FromUnicode seems to allow creating an empty unicode object).
So I really think it's worth waiting for a use case that shows how doubling the memory for unicode string serialization keeps someone from using lxml. Although it is not urgent, there is a common case where scripts run on servers with limited memory - it's typical to see 32 or 64 MB on some VPS servers. If the source is around 11 MB, we'll spend, on my FreeBSD 6.1 system (monitoring with "top"):
3260K - python interpreter
7988K - above + from lxml import etree
36812K - above + loaded tree
47528K - above + str
90084K - above + unicode

The commands I ran were:
import cElementTree
a = cElementTree.parse('a.xml')
b = cElementTree.tostring(a.getroot())
c = unicode(b)
By the way, the Element -> str conversion is *really* slow; it took almost a minute. And 11 MB is not such a huge size, and there is nothing else loaded in Python. The test case XML source is ASCII only; I would expect larger sizes with more non-ASCII chars. For web servers processing documents, many times in threads, this could be a huge memory waster. At least the str() step could be avoided if what you mentioned could be implemented. Just for fun, let's see how cElementTree behaves on the same system:

3260K - python interpreter
6360K - above + import ElementTree
32804K - above + loaded tree
75920K - above + str
116M - above + unicode

With cElementTree, the Element -> str operation is *much* faster, about 10s, but I did not benchmark them. It is interesting to see that it uses much more memory, however. So, if I did everything right, loading this test 11 MB XML file as a unicode string will at some point use 90 MB of memory under lxml or 116 MB under cElementTree. -- Best regards, Steve mailto:howe@carcass.dhs.org
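As an aside for anyone reproducing these numbers: instead of eyeballing "top", the peak resident size can also be read from inside the process via the standard resource module. A minimal sketch; ru_maxrss is reported in kilobytes on Linux/FreeBSD (other platforms may differ), and 'a.xml' stands in for the test file:

import resource
from lxml import etree

def peak_kb():
    # maximum resident set size of this process so far
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print 'baseline:      ', peak_kb()
tree = etree.parse('a.xml')
print 'after parse:   ', peak_kb()
s = etree.tostring(tree.getroot())
print 'after tostring:', peak_kb()
u = unicode(s, 'UTF-8')
print 'after unicode: ', peak_kb()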
Hi Steve, Steve Howe wrote:
servers with limited memory - it's typical to see 32 or 64 MB on some VPS servers. If the source is around 11 MB, we'll spend, on my FreeBSD 6.1 system (monitoring with "top"):
3260K - python interpreter
7988K - above + from lxml import etree
36812K - above + loaded tree
47528K - above + str
90084K - above + unicode
The commands I ran were:
import cElementTree
a = cElementTree.parse('a.xml')
b = cElementTree.tostring(a.getroot())
c = unicode(b)
Good idea, although not necessarily the absolute benchmark setup. I ran this on my machine:

3948K - Python interpreter
5532K - + from lxml import parse, tounicode
137M  - + a = parse("big.xml")               - [max: 156M]
180M  - + c = tounicode(a.getroot())         - [max: 190M]

3948K - Python interpreter
5544K - + from lxml import parse, tostring
137M  - + a = parse("big.xml")               - [max: 156M]
148M  - + b = tostring(a.getroot(), 'UTF-8') - [max: 153M]
190M  - + c = unicode(b, 'UTF-8')
180M  - + del b

Ok, well, that actually looks like both were exactly identical in terms of memory usage. I also tried that with cElementTree:

3948K - Python interpreter
6352K - + from cElementTree import parse, tostring
92M   - + a = parse("big.xml")
137M  - + b = tostring(a.getroot(), 'UTF-8') - [max: 150M]
180M  - + c = unicode(b, 'UTF-8')
170M  - + del b

The main reason for the big difference is that I'm on a 64bit machine (I assume you're on 32bit?). That doubles the size of pointers, and libxml2 uses tons of them (char*, double-linked trees, hash-tables, ...).
By the way, the Element -> str conversion is *really* slow; it took almost a minute.
I hope you meant (c)ElementTree, right? I posted some pretty interesting benchmark results on that lately. You can really watch memory usage increase MB by MB... If you meant lxml, you should redo the test and make sure there was no swapping involved. These kinds of benchmarks should always read from RAM.
And 11 MB is not such a huge size, and there is nothing else loaded in Python. The test case XML source is ASCII only; I would expect larger sizes with more non-ASCII chars. For web servers processing documents, many times in threads, this could be a huge memory waster. At least the str() step could be avoided if what you mentioned could be implemented.
Hmm, not really. The main memory hog is the unicode string itself. If you spend 32 bits on an ASCII character, that's 25 empty bits per character!
So, if I did everything right, loading this test 11 MB XML file as a unicode string will at some point use 90 MB of memory under lxml or 116 MB under cElementTree.
It's a little closer on my side. Still, what do we learn? Unicode strings are bad for large amounts of ASCII data, and huge serializations should be done to files. Anything else? Changing the way serialization works will only change the results marginally. The in-memory tree itself is so huge that UTF-8 serialization only takes about an eighth of its size in additional memory (a fourth on your side). That's not really something to worry about, I'd say. Stefan
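A minimal sketch of the file-based route suggested here for large documents; the write() call with an encoding argument follows the benchmark code quoted later in this thread, and the file names are placeholders:

from lxml import etree

tree = etree.parse('big.xml')
# Serializing straight to a file keeps the whole result string out of
# Python memory; the encoding is handled during the write.
f = open('big-out.xml', 'wb')
tree.write(f, 'UTF-8')
f.close()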
Hello Stefan, Thursday, May 11, 2006, 5:05:28 AM, you wrote: [...]
The main reason for the big difference is that I'm on a 64bit machine (I assume you're on 32bit?). That doubles the size of pointers, and libxml2 uses tons of them (char*, double-linked trees, hash-tables, ...). Yes, 32 bits, FreeBSD 6.1.
I hope you meant (c)ElementTree, right? I posted some pretty interesting benchmark results on that lately. You can really watch memory usage increase MB by MB... If you meant lxml, you should redo the test and make sure there was no swapping involved. These kinds of benchmarks should always read from RAM. No, I meant lxml, and yes, I could have made it read from RAM, but I think it did swap. It was not a very controlled test, I admit, just something quick I did at my Python prompt. I just ran the test again and the results were similar. There is plenty of RAM available, however.
It's a little closer on my side. Still, what do we learn? Unicode strings are bad for large amounts of ASCII data, and huge serializations should be done to files. Anything else? Changing the way serialization works will only change the results marginally. The in-memory tree itself is so huge that UTF-8 serialization only takes about an eighth of its size in additional memory (a fourth on your side). That's not really something to worry about, I'd say. That is not so important, indeed. It would be nice if it were easy to implement, but not otherwise. This was the main reason I was interested in .tounicode().
-- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
Thursday, May 11, 2006, 5:05:28 AM, you wrote:
I hope you meant (c)ElementTree, right? I posted some pretty interesting benchmark results on that lately. You can really watch memory usage increase MB by MB... If you meant lxml, you should redo the test and make sure there was no swapping involved. These kinds of benchmarks should always read from RAM.
No, I meant lxml, and yes, I could have made it read from RAM, but I think it did swap. It was not a very controlled test, I admit, just something quick I did at my Python prompt. I just ran the test again and the results were similar. There is plenty of RAM available, however.
Hmm, interesting. Could you run the I/O tests from the benchmark suite (trunk version) and post the results? My results here are that lxml is about 20-50 times faster on serialization than cET or ET. I would be surprised if that was so much different on your machine. Try:

cd lxml
python bench.py -i -a tostring_utf8 tostring_utf16 tostring_utf8_unicode_XML write_utf8_parse_stringIO

(the latter all in one line, '-i' adds 'src' to the PYTHONPATH, '-a' runs with lxml, cET and ET if installed)

It's gonna take a while and the output is rather lengthy. The benchmarks run this, which is more or less what we talk about here:

----------------------------------
@with_text(text=True, utext=True)
def bench_tostring_utf8(self, root):
    self.etree.tostring(root, 'UTF-8')

@with_text(text=True, utext=True)
def bench_tostring_utf16(self, root):
    self.etree.tostring(root, 'UTF-16')

@with_text(text=True, utext=True)
def bench_tostring_utf8_unicode_XML(self, root):
    xml = unicode(self.etree.tostring(root, 'UTF-8'), 'UTF-8')
    self.etree.XML(xml)

@with_text(text=True, utext=True)
def bench_write_utf8_parse_stringIO(self, root):
    f = StringIO()
    self.etree.ElementTree(root).write(f, 'UTF-8')
    f.seek(0)
    self.etree.parse(f)
----------------------------------

Thanks, Stefan
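For anyone who wants to sanity-check a single case outside the bench harness, a quick hand-rolled loop along these lines should do. This is purely illustrative ('big.xml' is a placeholder); the comparable numbers should still come from bench.py:

import time
from lxml import etree

root = etree.parse('big.xml').getroot()

times = []
for i in range(3):
    t = time.time()
    etree.tostring(root, 'UTF-8')
    times.append((time.time() - t) * 1000.0)
print 'tostring UTF-8: %.1f msec/pass, best of %r' % (min(times), times)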
Hello Stefan, Thursday, May 11, 2006, 9:55:47 AM, you wrote:
Hmm, interesting. Could you run the I/O tests from the benchmark suite (trunk version) and post the results? My results here are that lxml is about 20-50 times faster on serialization than cET or ET. I would be surprised if that was so much different on your machine. [...]
The results are attached, supporting that lxml is faster, but I suspect the slowdown happens only on very large XML files - the larger, the worse. How large is the XML stream in this test? Remember I tested on an 11 MB file. This is probably related to the way Python allocates and handles strings - appending is slow and expensive. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
The results are attached, supporting that lxml is faster
Just like on my side.
but I suspect the slowdown happens only on very large XML files - the larger, the worse. How large is the XML stream in this test? Remember I tested on an 11 MB file. This is probably related to the way Python allocates and handles strings - appending is slow and expensive.
Admittedly, the largest was only about 1M. Otherwise, the benchmarks would take too long to run, especially on ET. I changed bench.py to use longer strings now; that should not make a difference in most tests but will give us better numbers for tree copying and serialization. You can also now pass the options -l and -L (large or LARGE trees). Anyway, it can't be related to Python. Python just gets a char* and a size, and can then happily allocate its final buffer and memcpy the data into it. No appending at all. Maybe it's libxml2 then, but I really wouldn't know why... If you want, you can run the modified bench script again, and if you have enough RAM, you can pass the -L option to see if that makes a difference. Stefan
Hello Stefan, Thursday, May 11, 2006, 5:37:11 PM, you wrote:
Admittedly, the largest was only about 1M. Otherwise, the benchmarks would take too long to run, especially on ET. I changed bench.py to use longer strings now; that should not make a difference in most tests but will give us better numbers for tree copying and serialization. You can also now pass the options -l and -L (large or LARGE trees).
Anyway, it can't be related to Python. Python just gets a char* and a size, and can then happily allocate its final buffer and memcpy the data into it. No appending at all.
Maybe it's libxml2 then, but I really wouldn't know why...
If you want, you can run the modified bench script again, and if you have enough RAM, you can pass the -L option to see if that makes a difference. Sure, anything I can help with. I've run the tests with "-L", and they took quite a while to perform and even crashed after a point. Since I'm in a hurry lately I did not have the time to see why, but the results are attached. See that some test results were really slow compared to ET and cET. Ex:
lxe: tostring_utf16 (SA T3 )   80626.3903 msec/pass, best of ( 81157.4800 80631.1504 80626.3903 )
cET: tostring_utf16 (SA T3 )    3305.6618 msec/pass, best of (  3305.6618  3332.7984  3310.7507 )
ET : tostring_utf16 (SA T3 )    3413.7482 msec/pass, best of (  3418.9271  3413.7482  3415.6650 )

lxe: tostring_utf8 (UA T3 )    37834.8396 msec/pass, best of ( 37834.8396 37970.8700 37908.7146 )
cET: tostring_utf8 (UA T3 )     2880.5753 msec/pass, best of (  2880.5753  2886.5763  2885.0215 )
ET : tostring_utf8 (UA T3 )     2981.1059 msec/pass, best of (  3000.1362  2981.1059  2988.5129 )

The server is at your disposal if you want to use it. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hello, Sorry, I forgot the attachment - here it is. -- Best regards, Steve mailto:howe@carcass.dhs.org
Hi Steve, Steve Howe wrote:
Thursday, May 11, 2006, 5:37:11 PM, you wrote:
If you want, you can run the modified bench script again, and if you have enough RAM, you can pass the -L option to see if that makes a difference.
Sure, anything I can help with. I've run the tests with "-L", and they took quite a while to perform and even crashed after a point.
I think the crash might be related to a bug I fixed lately. Maybe your version didn't have that (you passed the '-i' option to run it against the working directory version, right?)
Since I'm in a hurry lately I did not have the time to see why, but the results are attached. See that some test results were really slow compared to ET and cET. Ex:
lxe: tostring_utf16 (SA T3 )   80626.3903 msec/pass, best of ( 81157.4800 80631.1504 80626.3903 )
cET: tostring_utf16 (SA T3 )    3305.6618 msec/pass, best of (  3305.6618  3332.7984  3310.7507 )
ET : tostring_utf16 (SA T3 )    3413.7482 msec/pass, best of (  3418.9271  3413.7482  3415.6650 )

lxe: tostring_utf8 (UA T3 )    37834.8396 msec/pass, best of ( 37834.8396 37970.8700 37908.7146 )
cET: tostring_utf8 (UA T3 )     2880.5753 msec/pass, best of (  2880.5753  2886.5763  2885.0215 )
ET : tostring_utf8 (UA T3 )     2981.1059 msec/pass, best of (  3000.1362  2981.1059  2988.5129 )
That absolutely looks like your system hit the hard disk. So, it would be interesting to have some hints about memory usage during these two benchmarks.
The server is at your disposal if you want to use it.
Thanks for the offer. I may ask Martijn first whether they have similar facilities at Infrae. Might be easier. I'll be away until Thursday, but I may still come back to the offer, thanks. Stefan
Steve Howe wrote:
Huh? Have you noticed that tostring takes an optional encoding argument?
Won't that waste exactly the same resources as this ?
xml = etree.tostring(element).encode(encoding)
tostring returns encoded data. did you mean xml = etree.tounicode(element).encode(encoding) ? if so, the answer is no -- the serializer encodes the infoset piece by piece, using different approaches for different parts of the infoset (at least that's what the ET serializer does; not sure about lxml). there's some overhead from cStringIO, though, but that should be far from the 3x/5x worst-case overhead in your example. (and for western users, the worst case is quite often the typical case) </F>
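To make the distinction concrete: tostring() with an encoding hands back an already-encoded byte string, while building the full unicode string first and then encoding it produces two large objects instead of one, which is the doubled work Steve asked about. A sketch only; whether the unicode path is spelled unicode(element) or etree.tounicode(element) depends on the lxml version discussed in this thread, and the tostring() output may additionally carry an XML declaration:

from lxml import etree

el = etree.XML(u'<test> \uf8d1 </test>')

# one step: the serializer produces UTF-8 bytes directly
utf8_direct = etree.tostring(el, 'UTF-8')

# two steps: the full unicode string exists first, then a second
# full-size encode pass produces the byte string
utf8_via_unicode = unicode(el).encode('UTF-8')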