In the middle of a web page there is a problem: how to parse it?
contro opinion
contropinion at gmail.com
Wed Jan 18 04:56:52 EST 2012
Here is my code:
import urllib
import lxml.html

# Keep the URL on a single line; a quoted string cannot span lines.
down = "http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
page = urllib.urlopen(down).read()  # Python 2 API
root = lxml.html.document_fromstring(page)

# Rows of class "tr_normal" that contain an <img> somewhere below them.
data1 = root.xpath('//tr[@class="tr_normal" and .//img]')
print "the row which contains img :"
for u in data1:
    print u.text_content()

# Rows of class "tr_normal" with no <img> below them.
data2 = root.xpath('//tr[@class="tr_normal" and not(.//img)]')
print "the row which do not contain img :"
for u in data2:
    print u.text_content()
The output is (I omit many lines):
the row which contains img :
00329
the row which do not contain img :
00001长江实业1,000#HOF
................many lines omitted
00327百富环球1,000#H
00328ALCO HOLDINGS2,000#
I wonder why there are so many rows that I can't get, such as the following
(you can see them on the page
http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm):
00330 思捷环球 <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00330&WidCoAbbName=&Month=&langcode=c> 100 #HOF
00331 春天百货 <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00331&WidCoAbbName=&Month=&langcode=c> 2,000 #H
00332 NGAI LIK IND <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00332&WidCoAbbName=&Month=&langcode=c> 4,000 #
................many lines omitted
How can I get these rows?
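
One way to narrow this down, as a minimal sketch only: compare how often the row marker appears in the raw page source with how many rows the parser actually returns, and look at each matched row cell by cell. If the raw count is much larger than the parsed count, the rows were lost while parsing and not by the XPath expressions. The names SAMPLE and row_report below are hypothetical, and SAMPLE is a small hand-written stand-in, not the real HKEx page, so this does not show what is actually wrong around row 00329.

```python
from __future__ import print_function  # lets the sketch run on Python 2 and 3

import lxml.html

# Hand-written stand-in for the real page (assumption: rows use class="tr_normal").
SAMPLE = """
<html><body><table>
<tr class="tr_normal"><td>00001</td><td>AAA</td><td>1,000</td></tr>
<tr class="tr_normal"><td>00002</td><td><img src="x.gif"></td><td>2,000</td></tr>
<tr class="tr_normal"><td>00003</td><td>CCC</td><td>4,000</td></tr>
</table></body></html>
"""


def row_report(html_text, row_class="tr_normal"):
    """Summarise what the parser found for rows of the given class."""
    root = lxml.html.document_fromstring(html_text)
    # $cls is an XPath variable; lxml fills it from the keyword argument.
    rows = root.xpath('//tr[@class=$cls]', cls=row_class)
    details = []
    for row in rows:
        details.append({
            # One entry per direct <td> child instead of one concatenated string.
            "cells": [td.text_content().strip() for td in row.xpath('./td')],
            "has_img": bool(row.xpath('.//img')),
            # Non-zero means other rows ended up nested inside this one.
            "nested_rows": len(row.xpath('.//tr')),
        })
    return {
        # Plain substring count in the raw source, independent of the parser.
        "raw_marker_count": html_text.count('class="%s"' % row_class),
        "parsed_row_count": len(rows),
        "rows": details,
    }


if __name__ == "__main__":
    report = row_report(SAMPLE)
    print(report["raw_marker_count"], report["parsed_row_count"])
    for entry in report["rows"]:
        print(entry)
```

To try it on the real page, pass the downloaded source to row_report in place of SAMPLE. The raw count only matches rows whose attribute is written exactly as class="tr_normal", so treat it as a rough check.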