There is a problem in the middle of a web page: how do I parse it?

contro opinion contropinion at gmail.com
Wed Jan 18 04:56:52 EST 2012


Here is my code:

import urllib
import lxml.html

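# The URL must sit on one line; a plain quoted string cannot span lines.
down = "http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
# Download the raw page and parse it into an lxml tree.
page = urllib.urlopen(down).read()
root = lxml.html.document_fromstring(page)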

data1 = root.xpath('//tr[@class="tr_normal"  and  .//img]')
print "the row which contains img  :"
for u in data1:
    print  u.text_content()

data2 = root.xpath('//tr[@class="tr_normal"  and  not(.//img)]')
print "the row which do not contain img  :"
for u in data2:
    print  u.text_content()


The output is (I have omitted many lines):

the row which contains img  :
00329
the row which do not contain img  :
00001长江实业1,000#HOF
................many lines omitted
00327百富环球1,000#H
00328ALCO HOLDINGS2,000#

I wonder why there are so many rows that I can't get, such as the following
(you can see them on the web page
http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm
):


00330思捷环球<http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00330&WidCoAbbName=&Month=&langcode=c> 100#HOF
00331春天百货<http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00331&WidCoAbbName=&Month=&langcode=c> 2,000#H
00332NGAI LIK IND<http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00332&WidCoAbbName=&Month=&langcode=c> 4,000#
................many lines omitted

I want to know: how can I get these rows?
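
One thing that might be worth trying, as a rough sketch only: the output stops right at row 00329, so perhaps the markup around that row is broken and the tree parser nests or drops everything after it. That cause is a guess and is not confirmed anywhere in this post. The sketch below avoids building a tree and reads the rows from the tag events of the standard library's HTMLParser, so a missing </tr> cannot swallow the rows that follow. It is written for Python 3 (the module is html.parser there, HTMLParser in Python 2). RowCollector, collect_rows and SAMPLE are made-up names, and SAMPLE is an invented fragment, not the real page.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <tr class="tr_normal"> without building a tree."""

    def __init__(self, row_class="tr_normal"):
        super().__init__(convert_charrefs=True)
        self.row_class = row_class
        self.rows = []         # list of (text, has_img) tuples
        self._parts = None     # text pieces of the row being read, or None
        self._has_img = False

    def _flush(self):
        # Store the row read so far, if there is one.
        if self._parts is not None:
            self.rows.append(("".join(self._parts).strip(), self._has_img))
        self._parts = None
        self._has_img = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> always ends the previous row, even if its </tr> is missing.
            self._flush()
            classes = (dict(attrs).get("class") or "").split()
            if self.row_class in classes:
                self._parts = []
        elif tag == "img" and self._parts is not None:
            self._has_img = True

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data.strip())

    def close(self):
        super().close()
        self._flush()


def collect_rows(html_text):
    """Return (text, has_img) for each matching row in html_text."""
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Invented fragment: the 00329 row has no closing </tr> (hypothetical damage).
SAMPLE = """
<table>
<tr class="tr_normal"><td>00328</td><td>ALCO HOLDINGS</td><td>2,000</td><td>#</td></tr>
<tr class="tr_normal"><td>00329</td><td><img src="x.gif"></td>
<tr class="tr_normal"><td>00330</td><td>思捷环球</td><td>100</td><td>#HOF</td></tr>
</table>
"""

for text, has_img in collect_rows(SAMPLE):
    print("img   " if has_img else "no img", text)
```

To run it against the real page, the downloaded bytes would first have to be decoded to text; the page's encoding is not stated in this post, so that step is left out. If this approach returns the rows after 00329 while the XPath version does not, that would point to broken markup in the page rather than to the XPath expressions.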