[lxml-dev] Some problem with an xpath
Hi,
I'd like to extract a string from an HTML document without caring where it is in the tree. However, somehow my XPath expression returns all nodes :-(
I tried to reproduce the problem with a mini script, but
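>>> from lxml import etree
>>> xml = etree.XML('<a><b>Test</b><b>Test</b><b>Tf<e />Test</b><b>dgfd</b></a>')
>>> xml.xpath('//b[fn:contains(self::text(),\'Test\')]')
[<Element b at ...>, <Element b at ...>, <Element b at -4833fdcc>, <Element b at ...>]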
works as expected, however for my html page I get
>>> html = etree.parse('/home/andreas/public_html/batman_dvd.html', etree.HTMLParser())
>>> html.xpath('//font[fn:contains(self::text(),\'Minuten\')]/text()')
['\n ', '\n ', '\n ', '\n ', u' Die folgenden Daten wurden noch nicht redaktionell \xfcberpr\xfcft.\n ', '\n ', '\n ', '\n ', '\n ', 'Erscheinungsart:\n ', '\n ', '\n ', 'Label:\n ', '\n ', '\n ', u'V\xd6-Termin:\n
The page can be seen at http://www.ofdb.de/view.php?page=fassung&fid=1130&vid=148784
Is this a problem of lxml or of my XPath expression? Even if I provide a more appropriate "start path", i.e. select a table deep in the hierarchy that contains the looked-for element, I get a lot of text nodes back.
Andreas
--
You definitely intend to start living sometime soon.
Andreas Pakulat (AP) wrote:
AP> Hi,
AP> I'd like to extract a string from an html document without caring where
AP> it is in the tree. However somehow my xpath expression returns all
AP> nodes :-(
AP> I tried to reproduce the problem with a mini script, but
AP> >>> xml=etree.XML('<a><b>Test</b><b>Test</b><b>Tf<e />Test</b><b>dgfd</b></a>')
AP> >>> xml.xpath('//b[fn:contains(self::text(),\'Test\')]')
AP> [<Element b at ...>, <Element b at ...>, <Element b at -4833fdcc>, <Element b at ...>]
AP> works as expected,
No, it doesn't. It gives you 4 nodes while there are only 3 with 'Test'. And try replacing 'Test' with 'Tf' or 'dgfd'.
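A minimal sketch for checking counts like these, assuming lxml is installed. It uses XPath's built-in contains() without any prefix, applied to the string value of each <b> element (the concatenated text of the element and its descendants). The helper name count_b_containing is made up for illustration.

```python
from lxml import etree

# The mini document from the original message.
xml = etree.XML('<a><b>Test</b><b>Test</b><b>Tf<e />Test</b><b>dgfd</b></a>')


def count_b_containing(needle):
    # Hypothetical helper: counts <b> elements whose string value contains
    # `needle`. `$needle` is an XPath variable that lxml fills in from the
    # keyword argument of the same name.
    return len(xml.xpath('//b[contains(., $needle)]', needle=needle))


counts = {needle: count_b_containing(needle) for needle in ('Test', 'Tf', 'dgfd')}
print(counts)  # {'Test': 3, 'Tf': 1, 'dgfd': 1}
```

A predicate that really tests the element content gives a different count for each search string, which is a quick way to tell whether the predicate is doing anything at all.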
AP> however for my html page I get
AP> >>> html = etree.parse('/home/andreas/public_html/batman_dvd.html',etree.HTMLParser())
AP> >>> html.xpath('//font[fn:contains(self::text(),\'Minuten\')]/text()')
[snip]
AP> Is this a problem of lxml or my xpath expression? Even if I provide a
AP> more appropriate "start path", i.e. select a table deep in the hierarchy
AP> that contains the looked-for element I get a lot of text nodes back.
use:
html.xpath('//font[contains(.,"Minuten")]/text()')
or
html.xpath('//font/text()[contains(.,"Minuten")]')
depending on whether you want the whole <font> contents or only the part with 'Minuten'.
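The difference between the two expressions can be seen on a small stand-in document. This is only a sketch: the markup below is invented for illustration and is not taken from the real page.

```python
from lxml import etree

# Hypothetical stand-in for the real page: one <font> with two text nodes
# (separated by <br/>), and a second <font> without the search string.
doc = etree.HTML(
    '<html><body>'
    '<font>Laufzeit:<br/>120 Minuten</font>'
    '<font>Label:</font>'
    '</body></html>'
)

# Predicate on the element: pick every <font> whose string value contains
# "Minuten", then return ALL of its text nodes.
whole = doc.xpath('//font[contains(., "Minuten")]/text()')

# Predicate on the text node: return only those text nodes that contain
# "Minuten" themselves.
part = doc.xpath('//font/text()[contains(., "Minuten")]')

print(whole)  # ['Laufzeit:', '120 Minuten']
print(part)   # ['120 Minuten']
```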
(fn: is just a namespace prefix, and you probably haven't set one up)
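For comparison, here is one way a prefix such as fn: could be set up in lxml. This is a sketch under assumptions: the namespace URI and the Python function are hypothetical and written only for this example. The built-in contains() needs none of this.

```python
from lxml import etree

# Hypothetical namespace URI, chosen only for this example.
FN_URI = 'http://example.com/hypothetical-functions'


def contains(context, haystack, needle):
    # Hypothetical extension function. lxml passes an evaluation context
    # first, then the XPath arguments (plain strings in this example).
    return needle in haystack


# Register the Python function under the URI ...
ns = etree.FunctionNamespace(FN_URI)
ns['contains'] = contains

root = etree.XML('<a><b>Test</b><b>Test</b><b>Tf<e />Test</b><b>dgfd</b></a>')

# ... and bind the prefix "fn" to that URI for this one xpath() call.
matches = root.xpath(
    '//b[fn:contains(string(.), "Test")]',
    namespaces={'fn': FN_URI},
)
print(len(matches))  # 3
```

The prefix itself carries no meaning. Only the URI it is bound to decides which function is called.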
--
Piet van Oostrum
On 16.06.06 00:25:47, Piet van Oostrum wrote:
> No, it doesn't. It gives you 4 nodes while there are only 3 with 'Test'. And try replacing 'Test' with 'Tf' or 'dgfd'.
Looks like I cannot count, not even using my hands ;-)
> use:
> html.xpath('//font[contains(.,"Minuten")]/text()')
> or
> html.xpath('//font/text()[contains(.,"Minuten")]')
> depending on whether you want the whole <font> contents or only the part with 'Minuten'
> (fn: is just a namespace prefix, and you probably haven't set one up)
Damn, another one that got me. I had the same problem already, but at that time it was somehow more obvious to me. Thanks.
Andreas
--
Your talents will be recognized and suitably rewarded.
Participants (2): Andreas Pakulat, Piet van Oostrum