XPath issue: same structure, different result
I am having issues with the urllib and lxml.html modules. Here is my original code:

    import urllib
    import lxml.html

    down='http://v.163.com/special/visualizingdata/'
    file=urllib.urlopen(down).read()
    root=lxml.html.document_fromstring(file)
    xpath_str="//div[@class='down s-fc3 f-fl']/a"
    urllist=root.xpath(xpath_str)
    for url in urllist:
        print url.get("href")

When run, it returns this output:

    http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
    http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4

But when I change the line

    xpath_str='//div[@class="down s-fc3 f-fl"]//a'

into

    xpath_str='//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a'

that is to say,

    urllist=root.xpath('//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a')

I do not receive any output. What is the flaw in this code? It is strange that the shorter expression works and the longer one does not, even though they have the same XPath structure.
Hi ...

The expression '//div[@class="col f-cb"]', with a single space inside, selects a single element in line 296 that has no <a> in it. The elements you are probably looking for have a double space inside: "col  f-cb". Try this:

    root.xpath('//div[@class="col  f-cb"]//div[@class="down s-fc3 f-fl"]//a')

Note the double space between "col" and "f-cb". This will return only the white rows; the gray rows have class "col col2 f-cb". If you want to get all the elements, you may want to select the elements whose class attribute contains both "col" and "f-cb", regardless of whether they have col2 or not:

    root.xpath('//div[contains(@class,"col") and contains(@class,"f-cb")]/div[@class="down s-fc3 f-fl"]/a')

Regards,
Piotr

2013/2/22 python <mailtomanage@163.com>
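To make the whitespace point concrete, here is a minimal sketch run against a made-up HTML fragment. The markup and the example.com URLs are invented for illustration and only mimic the class names discussed in this thread; this is not the real page. It assumes Python 3 with lxml installed.

```python
import lxml.html

# Hypothetical fragment: the class values mimic those discussed in the thread,
# including one row whose class attribute contains a double space.
HTML = """
<div>
  <div class="col f-cb"><span>row without a link</span></div>
  <div class="col  f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/white.mp4">w</a></div>
  </div>
  <div class="col col2 f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/gray.mp4">g</a></div>
  </div>
</div>
"""

root = lxml.html.fromstring(HTML)
inner = '//div[@class="down s-fc3 f-fl"]//a'


def hrefs(xpath):
    """Evaluate an XPath expression and collect the href of each match."""
    return [a.get("href") for a in root.xpath(xpath)]


# @class="..." compares the whole attribute string, character for character.
single = hrefs('//div[@class="col f-cb"]' + inner)   # single space
double = hrefs('//div[@class="col  f-cb"]' + inner)  # double space

# Substring tests ignore the exact spacing and also pick up "col col2 f-cb".
both = hrefs('//div[contains(@class,"col") and contains(@class,"f-cb")]'
             '/div[@class="down s-fc3 f-fl"]/a')

print(single)
print(double)
print(both)
```

In this fragment the single-space expression only reaches the row that holds no link, the double-space expression reaches one row, and the contains() version reaches both rows that hold links.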
On 22.02.2013, at 02:15, python <mailtomanage@163.com> wrote:
I am having issues with the urllib and lxml.html modules.
You are trying to select CSS classes with XPath, so you may want to have a look at http://lxml.de/cssselect.html, which creates (working) XPath from CSS selector expressions.
    xpath_str = (
        u"descendant-or-self::div["
        u"contains(concat(' ', normalize-space(@class), ' '), ' col ')"
        u" and (contains(concat(' ', normalize-space(@class), ' '), ' f-cb '))]"
        u"/descendant-or-self::*/div["
        u"contains(concat(' ', normalize-space(@class), ' '), ' down ')"
        u" and (contains(concat(' ', normalize-space(@class), ' '), ' s-fc3 '))"
        u" and (contains(concat(' ', normalize-space(@class), ' '), ' f-fl '))]"
        u"/descendant-or-self::*/a"
    )
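The generated expression above treats the class attribute as a list of whitespace-separated tokens: normalize-space() collapses runs of spaces, and the padding added by concat() lets contains() match only whole class names. The sketch below rebuilds the same kind of predicate by hand using plain lxml XPath. The helper name has_class and the HTML fragment (including the "column" row and the example.com URLs) are assumptions made up for illustration; Python 3 with lxml is assumed.

```python
import lxml.html


def has_class(name):
    """Build an XPath predicate body matching one whole class token.

    Hypothetical helper, not part of lxml.
    """
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


# Hypothetical fragment: a double-space row, a "col2" row, and a row whose
# class merely starts with "col" ("column") to show the difference.
HTML = """
<div>
  <div class="col  f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/a.mp4">a</a></div>
  </div>
  <div class="col col2 f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/b.mp4">b</a></div>
  </div>
  <div class="column f-cb">
    <div class="down s-fc3 f-fl"><a href="http://example.com/c.mp4">c</a></div>
  </div>
</div>
"""

root = lxml.html.fromstring(HTML)

outer = "//div[%s and %s]" % (has_class("col"), has_class("f-cb"))
inner = "//div[%s and %s and %s]" % (
    has_class("down"), has_class("s-fc3"), has_class("f-fl"))

# Token-based match: tolerant of extra spaces, but "column" is not "col".
token_hrefs = [a.get("href") for a in root.xpath(outer + inner + "//a")]

# Plain substring match for comparison: "column" also contains "col".
substring_hrefs = [a.get("href") for a in root.xpath(
    '//div[contains(@class,"col") and contains(@class,"f-cb")]//a')]

print(token_hrefs)
print(substring_hrefs)
```

The token-based form skips the "column" row, while the plain contains() form includes it. That difference is the main reason to prefer the longer predicate when class names share prefixes.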
participants (3)
- jens quade
- Piotr Owcarz
- python