XPath issue: same structure, different result
I am having issues with the urllib and lxml.html modules. Here is my original code:

    import urllib
    import lxml.html

    down='http://v.163.com/special/visualizingdata/'
    file=urllib.urlopen(down).read()
    root=lxml.html.document_fromstring(file)
    xpath_str="//div[@class='down s-fc3 f-fl']/a"
    urllist=root.xpath(xpath_str)
    for url in urllist:
        print url.get("href")

When run, it returns this output:

    http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4

But when I change the line

    xpath_str='//div[@class="down s-fc3 f-fl"]//a'

into

    xpath_str='//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a'

that is to say,

    urllist=root.xpath('//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a')

I do not receive any output. What is the flaw in this code? It is strange that the shorter expression works while the longer one does not, even though they have the same XPath structure!
Hi ...

The expression '//div[@class="col f-cb"]', with a single space inside, selects a single element in line 296, which has no <a> in it. The elements you are probably looking for have a double space inside: "col  f-cb". Try this:

    root.xpath('//div[@class="col  f-cb"]//div[@class="down s-fc3 f-fl"]//a')

Note the double space between "col" and "f-cb". This returns only the white rows; the gray rows have the class "col col2 f-cb". If you want to get all the elements, you can select those whose class attribute contains both "col" and "f-cb", regardless of whether they also have col2:

    root.xpath('//div[contains(@class,"col") and contains(@class,"f-cb")]/div[@class="down s-fc3 f-fl"]/a')

Regards,
Piotr
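The difference between an exact @class comparison and the contains() test can be reproduced on a small inline document. The following is only a sketch: the SAMPLE markup and the example.com URLs are made up to mimic the situation described above (they are not the real page), and it is written for Python 3 rather than the Python 2 used in the thread.

```python
import lxml.html

# Hypothetical markup (an assumption, not the real page): one row whose class
# has a single space and no link, one with a double space, one "col2" row.
SAMPLE = """
<html><body>
  <div class="col f-cb"><span>row without a link</span></div>
  <div class="col  f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/1.mp4">1</a></div>
  </div>
  <div class="col col2 f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/2.mp4">2</a></div>
  </div>
</body></html>
"""

root = lxml.html.document_fromstring(SAMPLE)
inner = '/div[@class="down s-fc3 f-fl"]/a'


def hrefs(xpath_expr):
    """Evaluate the expression and return the href of every matched <a>."""
    return [a.get("href") for a in root.xpath(xpath_expr)]


# @class="..." compares the whole attribute string, whitespace included.
single_space = hrefs('//div[@class="col f-cb"]' + inner)
double_space = hrefs('//div[@class="col  f-cb"]' + inner)

# contains() is a substring test, so it accepts any spacing and extra classes.
by_contains = hrefs(
    '//div[contains(@class,"col") and contains(@class,"f-cb")]' + inner
)

print(single_space)
print(double_space)
print(by_contains)
```

One caveat with contains(): because it tests substrings, it would also accept a class such as "column", so it is looser than a real CSS class match.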
On 22.02.2013, at 02:15, python <mailtomanage@163.com> wrote:
> I am having issues with the urllib and lxml.html modules.

You are trying to select CSS classes with XPath, so you may want to have a look at http://lxml.de/cssselect.html, which creates (working) XPath from CSS selector expressions.
    >>> import urllib
    >>> import lxml.html
    >>> down='http://v.163.com/special/visualizingdata/'
    >>> file=urllib.urlopen(down).read()
    >>> root=lxml.html.document_fromstring(file)
    >>> from lxml import cssselect
    >>> xpath_str = cssselect.CSSSelector('div.col.f-cb div.down.s-fc3.f-fl a').path
    >>> len(root.xpath(xpath_str))
    6
    >>> print [x.get('href') for x in root.xpath(xpath_str)]
    ['http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4', 'http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4', 'http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4', 'http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4', 'http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4', 'http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4']
    >>> xpath_str
    u"descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' col ') and (contains(concat(' ', normalize-space(@class), ' '), ' f-cb '))]/descendant-or-self::*/div[contains(concat(' ', normalize-space(@class), ' '), ' down ') and (contains(concat(' ', normalize-space(@class), ' '), ' s-fc3 ')) and (contains(concat(' ', normalize-space(@class), ' '), ' f-fl '))]/descendant-or-self::*/a"
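The generated expression relies on one idiom: pad the whitespace-normalized class attribute with spaces, then look for the class name padded the same way. A minimal sketch of that idiom written by hand, without the cssselect module, might look like the following. The helper names has_class and with_classes, the SAMPLE markup, and the example.com URLs are all hypothetical, and the code targets Python 3.

```python
import lxml.html


def has_class(name):
    """XPath 1.0 predicate: true when `name` is one whitespace-separated
    token of @class (hypothetical helper, not part of lxml)."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


def with_classes(tag, *names):
    """Build e.g. div[<has col> and <has f-cb>] for the given class names."""
    return "%s[%s]" % (tag, " and ".join(has_class(n) for n in names))


# Hypothetical markup: a double-space row, a "col2" row, and a decoy row
# whose class "column" merely contains the substring "col".
SAMPLE = """
<html><body>
  <div class="col  f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/1.mp4">1</a></div>
  </div>
  <div class="col col2 f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/2.mp4">2</a></div>
  </div>
  <div class="column f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/other.mp4">x</a></div>
  </div>
</body></html>
"""

root = lxml.html.document_fromstring(SAMPLE)

xpath_str = "//%s//%s//a" % (
    with_classes("div", "col", "f-cb"),
    with_classes("div", "down", "s-fc3", "f-fl"),
)
token_hrefs = [a.get("href") for a in root.xpath(xpath_str)]
print(xpath_str)
print(token_hrefs)
```

normalize-space() collapses the double space, so the row with two spaces matches, while the padded comparison keeps "column" from being mistaken for "col".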
participants (3)
- jens quade
- Piotr Owcarz
- python