How to control the processing of newlines in etree.html xpath text() function?

Hi List,

I recently upgraded from Fedora 17 to Fedora 18, and I am facing a change in the behaviour of the lxml XPath text() function. Here's an example:

    from io import BytesIO
    from lxml import etree

    myHtmlString = \
        '<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'+\
        '<html>\r\n'+\
        '<head>\r\n'+\
        ' <title> a b c </title>\r\n'+\
        '</head>\r\n'+\
        '<body/>\r\n'+\
        '</html>\r\n'

    myFile = BytesIO(myHtmlString)
    myTree = etree.parse(myFile, etree.HTMLParser())
    myTextElements = myTree.xpath("//text()")
    myFullText = ''.join([myEl for myEl in myTextElements])
    print repr(myFullText)

Under F17 that piece of code prints

    ' a b c '

whereas F18 produces

    '\r\n\r\n a b c \r\n\r\n\r\n'

The version specifications are as follows:

f17:
    Python           : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
    lxml.etree       : (2, 3, 5, 0)
    libxml used      : (2, 7, 8)
    libxml compiled  : (2, 7, 8)
    libxslt used     : (1, 1, 26)
    libxslt compiled : (1, 1, 26)

f18:
    Python           : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
    lxml.etree       : (2, 3, 5, 0)
    libxml used      : (2, 9, 1)
    libxml compiled  : (2, 9, 0)
    libxslt used     : (1, 1, 28)
    libxslt compiled : (1, 1, 26)

I.e. neither Python nor lxml.etree changed versions, so I presume the difference is due to the underlying libraries having changed versions.

I wrote an application under F17 that is now broken under F18. I can imagine three ways to make it work under F18, in increasing order of impact on my own code:

1. Change a flag in an lxml call, recovering the F17 behaviour.
2. Wrap the F18 xpath function, to try and reproduce the F17 xpath behaviour.
3. Port my downstream code, which searches and transforms the output of xpath.

Does anybody know whether solution 1 is possible, and if not, does anybody have a suggestion for the implementation of (2)?

Bye,
Olivier

P.S. I initially put up this question on stackoverflow, but there is no satisfactory answer there yet: http://stackoverflow.com/questions/16123277/how-to-control-newline-processin...

On 30/04/2013 09:56, Olivier de Mirleau wrote:
Hi,

I think this is not related to XPath. I’ve seen a similar difference regarding whitespace in the libxml2 HTML parser when upgrading libxml2 from 2.8.x to 2.9.x. I think that the old behavior was a bug that was fixed. (Whitespace was ignored where it shouldn’t have been.)

Try looking at the parsed data structure without going through XPath:

    print([(el.text, el.tail) for el in myTree.iter()])

(This will not be in proper source order, but you get the idea.)

Unfortunately I don’t know of a way to get consistent behavior, other than upgrading libxml2 everywhere.

In this particular case, you might want to select a more specific set of elements. For example:

    myTree.xpath('//title/text()')

--
Simon Sapin
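To make the two suggestions above concrete, here is a minimal sketch written for Python 3 (the original post uses Python 2). The HTML literal is the one from the question; the names `tree`, `pairs` and `title_text` are only illustrative. What the whitespace-only `.text`/`.tail` values look like will depend on the installed libxml2, which is exactly the difference being discussed, so no particular output is claimed for the inspection loop.

```python
from io import BytesIO
from lxml import etree

HTML = (
    b'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'
    b'<html>\r\n'
    b'<head>\r\n'
    b' <title> a b c </title>\r\n'
    b'</head>\r\n'
    b'<body/>\r\n'
    b'</html>\r\n'
)

tree = etree.parse(BytesIO(HTML), etree.HTMLParser())

# Look at what the parser stored, without going through XPath.
# Whitespace-only entries here may or may not appear, depending on libxml2.
pairs = [(el.tag, el.text, el.tail) for el in tree.iter()]
for tag, text, tail in pairs:
    print(tag, repr(text), repr(tail))

# Selecting a specific element avoids the whitespace between other elements.
title_text = tree.xpath('//title/text()')
print(title_text)
```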

Hi,

thanks for your answer. I was fearing as much (that this was a bug fix).

The <title> example was just for illustration. The content that my application is trying to parse tends to be spread all over the body. My first implementation actually used the .text and .tail attributes, and then I found that text() was cutting out stuff that I would never be interested in (it has no effect on what is shown in a browser), and it also fixed the ordering issue that you mentioned. So I switched to using text().

Then that one broke, so I guess I'll be heading back to the drawing board... I'll still try to use the text() function to retain the right ordering, and deal with those spurious newlines in a more explicit way.

Thanks for your time :-)
Olivier

On 04/30/2013 11:15 AM, Simon Sapin wrote:
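One possible shape for that more explicit handling, sketched under the assumption that whitespace-only text nodes are the only thing to drop: keep using `//text()` for document order and filter on the Python side. The helper name `significant_text` and the sample document are hypothetical, not part of lxml or of the original application.

```python
from io import BytesIO
from lxml import etree


def significant_text(tree):
    """Return text nodes in document order, skipping whitespace-only ones.

    Hypothetical helper: it does not reproduce the old parser behaviour
    exactly, it only removes nodes that consist entirely of whitespace.
    """
    return [str(t) for t in tree.xpath('//text()') if t.strip()]


HTML = (
    b'<html>\r\n<head>\r\n<title> a b c </title>\r\n</head>\r\n'
    b'<body>\r\n<p>one</p>\r\n<p>two</p>\r\n</body>\r\n</html>\r\n'
)
tree = etree.parse(BytesIO(HTML), etree.HTMLParser())
print(repr(''.join(significant_text(tree))))
```

Because the filter only looks at each text node on its own, whitespace inside a node (such as the spaces around "a b c") is kept as-is; collapsing that would need a separate step.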

Hi,

please don't top-post.

Olivier de Mirleau, 30.04.2013 16:07:

There are many ways to do these things, just like there are tons of ways people want to process text in their documents. You can use iterwalk() to get text content in the right order (start=>text, end=>tail), you can call tostring() to serialise specific elements to plain text, you can ''.join() over XPath text() matches, or you can use XPath's normalize-space() function to get rid of redundant line endings etc.

I personally like iterwalk() as it additionally allows you to (manually) convert intermediate elements like <p> or <br> into newlines in order to properly retrieve "some<br/>text" as "some\ntext" instead of "sometext".

But it really depends on what you want to do, i.e. on exactly what parts of your document are relevant to you and which aren't, and in which way you want to process those that are.

Stefan
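As a rough illustration of the iterwalk() approach and of normalize-space(), here is a small sketch for Python 3. The helper `text_with_breaks` and the set `BREAK_TAGS` are made-up names, and which tags should become newlines is an assumption that would have to be adapted to the documents at hand.

```python
from lxml import etree

# Assumption for this sketch: these elements should end with a newline.
BREAK_TAGS = {'br', 'p'}


def text_with_breaks(root):
    """Collect text in document order: .text on 'start', .tail on 'end'."""
    parts = []
    for event, el in etree.iterwalk(root, events=('start', 'end')):
        if event == 'start':
            if el.text:
                parts.append(el.text)
        else:
            if el.tag in BREAK_TAGS:
                parts.append('\n')
            if el.tail:
                parts.append(el.tail)
    return ''.join(parts)


parser = etree.HTMLParser()

root = etree.fromstring('<div>some<br/>text</div>', parser)
div = root.find('.//div')
print(repr(text_with_breaks(div)))

# normalize-space() strips leading/trailing whitespace and collapses runs.
doc = etree.fromstring(
    '<html><head><title>\r\n a  b   c \r\n</title></head></html>', parser)
title = doc.xpath('normalize-space(//title)')
print(repr(title))
```

Note that normalize-space() applied to a node set only looks at the first node, so it suits single elements such as the title better than a whole-document `//text()` query.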

Top-posting because this is a general answer.

I had this kind of problem until I read that lxml.html is the better choice for dealing with HTML, so I changed to lxml.html:

    import lxml.html

    # content and pcoding are not defined in this snippet; presumably the
    # raw HTML and its character encoding, set elsewhere.
    hparser = lxml.html.HTMLParser(encoding=pcoding, remove_comments=True)
    html_document = lxml.html.fromstring(content, parser=hparser)
    html_document.resolve_base_href()

On Tue, 2013-04-30 at 09:56 +0200, Olivier de Mirleau wrote:
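For readers who want to try the lxml.html route, a self-contained variant might look as follows. The sample `content` and the 'utf-8' encoding are assumptions standing in for the undefined `content` and `pcoding` above. As far as I know lxml.html sits on top of the same libxml2 HTML parser, so it should not be expected to change the whitespace behaviour by itself; what it adds are conveniences such as text_content().

```python
import lxml.html

# Assumed sample input; the original post does not show its content.
content = (
    b'<html><head><title> a b c </title></head>'
    b'<body><!-- a comment --><p>some<br>text</p></body></html>'
)

parser = lxml.html.HTMLParser(encoding='utf-8', remove_comments=True)
doc = lxml.html.fromstring(content, parser=parser)

# Comments were dropped at parse time because of remove_comments=True.
print(doc.xpath('//comment()'))

# text_content() concatenates all text below the element, in document order.
print(repr(doc.text_content()))
print(repr(doc.findtext('.//title')))
```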

participants (4)
- Olivier de Mirleau
- Simon Sapin
- Stefan Behnel
- Sérgio Basto