Return type of text_content()

I noticed that the text_content() method of lxml.html elements returns a _ElementUnicodeResult, i.e. a 'smart' string. However, its getparent(), attrname are None, and is_tail, is_text, is_attribute are False. This is the case even if the element contains a single text node. The XPath "string()" used in text_content()'s implementation never returns an existing text node, but always a new string.

Wouldn't it make more sense for text_content() to return a normal str? E.g. by adding smart_strings=False to _collect_string_content.

I am not aware of any real issues caused by text_content() returning a 'smart' string -- for example, I don't think it can cause any memory leaks, because it doesn't seem to have a reference to the original document. But it still seems unexpected and perhaps unintentional.

In theory this might be a breaking change, if anyone expects elem.text_content().getparent() to exist and return None. But https://lxml.de/lxmlhtml.html doesn't mention that text_content() returns a 'smart' string. 'Smart' strings are only documented at https://lxml.de/xpathxslt.html.

Given lxml 6.0.0 is in the works, now seemed like a good time to suggest this change.

Thanks for reading, and thank you for all your work on lxml.

Tomi
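
A minimal sketch of the observation above, assuming lxml is installed. The sample markup is made up for illustration, and the check is written so that it runs both on releases that return a 'smart' string and on releases that return a plain str.

```python
from lxml import html

# Hypothetical sample: an element that contains a single text node.
elem = html.fromstring("<p>Hello world</p>")
text = elem.text_content()

# Either way the result is a str (a 'smart' string is a str subclass).
print(type(text).__name__)

# A 'smart' string carries getparent() and the is_* flags; a plain str does not.
if hasattr(text, "getparent"):
    print("getparent():", text.getparent())
    print("is_text:", text.is_text, "is_tail:", text.is_tail)
else:
    print("plain str without origin attributes")
```

Code that only treats the result as text behaves the same in both cases; only code that calls getparent() on it would notice the difference.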

Hi,

tomi.belan--- via lxml - The Python XML Toolkit wrote on 15.03.25 at 01:15:
> I noticed that the text_content() method of lxml.html elements returns a _ElementUnicodeResult, i.e. a 'smart' string.
> However, its getparent(), attrname are None, and is_tail, is_text, is_attribute are False. This is the case even if the element contains a single text node. The XPath "string()" used in text_content()'s implementation never returns an existing text node, but always a new string.
> Wouldn't it make more sense for text_content() to return a normal str? E.g. by adding smart_strings=False to _collect_string_content.

Yes, that seems useless and unintended. I'll change it for lxml 6.0.

Thanks for reporting this.

Stefan

On Wed, Mar 19, 2025 at 9:17 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
> Hi,
> Yes, that seems useless and unintended. I'll change it for lxml 6.0.
> Thanks for reporting this.
> Stefan

Awesome. Thanks Stefan. I see you went with my proposed fix. The commit looks good.

The issue is solved, but for completeness, here is another possible fix I thought of later. You could change .xpath() and etree.XPath() itself so that the expression "string(...)" always returns a plain str. 'Smart' strings would only be returned (as elements of a Python list) when the XPath result is a node set containing text/cdata/attribute nodes. This could be implemented by removing the _elementStringResultFactory() call in _unwrapXPathObject() when xpathObj.type == xpath.XPATH_STRING.

I'm not certain if this alternative fix is a good idea. On one hand, you could argue that a smart string is only meaningful when it has additional information about its origin node, which an XPATH_STRING result does not have, and hence that every user of "string(...)" is receiving 'smart' strings by accident. On the other hand, it's a bigger change and it would require updating the documentation, which does currently say a 'smart' string is returned whenever the XPath expression has a string result.

What do you think?

Tomi
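
The two kinds of XPath result discussed here can be compared with a short sketch. It assumes lxml is installed and uses a made-up document; it shows the default behaviour as documented for .xpath(), not the proposed change.

```python
from lxml import etree

# Hypothetical sample: one text node ("hello") and one tail node ("tail").
root = etree.fromstring("<a>hello<b/>tail</a>")

# Node-set result: a list of text nodes, each wrapped as a 'smart' string.
nodes = root.xpath("//text()")
for node in nodes:
    print(repr(str(node)), "parent:", node.getparent().tag,
          "is_text:", node.is_text, "is_tail:", node.is_tail)

# String result: string() builds a new value that has no single origin node.
whole = root.xpath("string()")
origin = whole.getparent() if hasattr(whole, "getparent") else None
print(repr(str(whole)), "parent:", origin)
```

The list entries know which element they came from, while the string() value reports no parent, which is the asymmetry the alternative fix would turn into a type difference.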

Tomi Belan wrote on 20.03.25 at 23:53:
> You could change .xpath() and etree.XPath() itself so that the expression "string(...)" always returns a plain str. 'Smart' strings will only be returned (as elements of a Python list) when the XPath result is a node set containing text/cdata/attribute nodes. This could be implemented by removing the _elementStringResultFactory() call in _unwrapXPathObject() when xpathObj.type == xpath.XPATH_STRING.

A simple string can still originate from the text content of a single node, in which case we'd want to report the origin by returning a smart string. Distinguishing between "string()" and the text or tail of a node might seem arbitrary and difficult to handle for user code.

If users request smart strings, they should get them for all text results and not just for some, with plain string objects mixed in that lack the expected attributes.

Stefan
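
For callers who do not want smart strings at all, the existing smart_strings option already applies uniformly to every text result. A minimal sketch, assuming lxml is installed and using a made-up document:

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring("<a>hello<b/>tail</a>")

# Compile both kinds of expression with smart strings switched off.
string_value = etree.XPath("string()", smart_strings=False)
text_nodes = etree.XPath("//text()", smart_strings=False)

s = string_value(root)
texts = text_nodes(root)

# With the option off, neither result carries getparent() or the is_* flags.
print(type(s).__name__, [type(t).__name__ for t in texts])
print(hasattr(s, "getparent"), [hasattr(t, "getparent") for t in texts])
```

This keeps the choice all-or-nothing per expression, so user code never has to handle a mix of smart and plain strings from the same evaluator.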