How to control the processing of newlines in etree.html xpath text() function? - lxml - The Python XML Toolkit

April 30, 2013

Hi List,

I recently upgraded from Fedora 17 to Fedora 18 and am now seeing a
change in the behaviour of the lxml xpath text() function. Here's an
example:

from io import BytesIO
from lxml import etree

# Bytes literals with CRLF line endings; this form runs on Python 2.7 and 3
myHtmlString = (
    b'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'
    b'<html>\r\n'
    b'<head>\r\n'
    b'   <title> a b c </title>\r\n'
    b'</head>\r\n'
    b'<body/>\r\n'
    b'</html>\r\n'
)
myFile = BytesIO(myHtmlString)
myTree = etree.parse(myFile, etree.HTMLParser())
# Collect every text node in the document, in document order
myTextElements = myTree.xpath("//text()")
myFullText = ''.join(myTextElements)

print(repr(myFullText))

Under F17 this piece of code prints
' a b c '
whereas F18 produces
'\r\n\r\n    a b c \r\n\r\n\r\n'

The version specifications are as follows:
f17:
Python              : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree          : (2, 3, 5, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)

f18:
Python              : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree          : (2, 3, 5, 0)
libxml used         : (2, 9, 1)
libxml compiled     : (2, 9, 0)
libxslt used        : (1, 1, 28)
libxslt compiled    : (1, 1, 26)

I.e. neither Python nor lxml.etree changed versions, so I presume the
difference is due to the underlying libraries having changed versions
(libxml2 from 2.7.8 to 2.9.x, libxslt from 1.1.26 to 1.1.28).
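
In case anybody wants to compare against their own setup, a listing in
the same format can be produced with a snippet along these lines. This
is only a sketch: the labels and the formatting are my own choice, the
version attributes are the ones lxml.etree exposes.

```python
import sys
from lxml import etree

# Labels are arbitrary; the values come from sys and lxml.etree.
versions = [
    ("Python", sys.version_info),
    ("lxml.etree", etree.LXML_VERSION),
    ("libxml used", etree.LIBXML_VERSION),
    ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
    ("libxslt used", etree.LIBXSLT_VERSION),
    ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
]

for name, value in versions:
    # Left-align the label in a 20-character column, as in the listing
    print("%-20s: %s" % (name, value))
```

"used" is the library version loaded at run time, "compiled" is the one
lxml was built against.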

I wrote an application under F17 that is now broken under F18. I can
imagine three ways to make it work under F18, in increasing order of
impact on my own code:

1. Change a flag in an lxml call, recovering the F17 behaviour.
2. Wrap the F18 xpath function to try and reproduce the F17 xpath
behaviour.
3. Port my downstream code, which searches and transforms the output
of xpath.

Does anybody know whether solution 1 is possible, and if not, does
anybody have a suggestion for the implementation of (2)?
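
To make (2) concrete, here is a naive sketch of the kind of wrapper I
mean. It assumes that the only difference between the two outputs is
the whitespace-only text nodes, which I have not verified beyond the
example above. The function name xpath_text_f17 is made up for
illustration; the second variant pushes the same filter into the XPath
expression itself using the standard normalize-space() function.

```python
from io import BytesIO
from lxml import etree


def xpath_text_f17(tree, expr="//text()"):
    """Hypothetical wrapper: evaluate a text query, drop whitespace-only results."""
    return [text for text in tree.xpath(expr) if text.strip()]


html = (
    b'<html>\r\n'
    b'<head>\r\n'
    b'   <title> a b c </title>\r\n'
    b'</head>\r\n'
    b'<body/>\r\n'
    b'</html>\r\n'
)
tree = etree.parse(BytesIO(html), etree.HTMLParser())

# Variant A: filter in Python after the query
filtered = ''.join(xpath_text_f17(tree))

# Variant B: filter inside XPath; the predicate keeps only text nodes
# that still contain something after whitespace normalisation
via_xpath = ''.join(tree.xpath("//text()[normalize-space()]"))

print(repr(filtered))
print(repr(via_xpath))
```

Neither variant touches whitespace inside a text node that has real
content, so ' a b c ' keeps its leading and trailing spaces.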

Bye, Olivier

P.S. I initially put up this question on stackoverflow, but no
satisfactory answer there yet:

http://stackoverflow.com/questions/16123277/how-to-control-newline-processin...

Olivier de Mirleau
