scrape text from web
守株待兔
1248283536 at qq.com
Fri Aug 5 11:21:05 EDT 2011
python-list at python.org:
Hi everyone,
I want to scrape some text from
http://search.dangdang.com/search_pub.php?key=python
My code is:
import urllib
import lxml.html
down='http://search.dangdang.com/search_pub.php?key=python'
file=urllib.urlopen(down).read()
root=lxml.html.fromstring(file)
tnodes = root.xpath("//div[@class='listitem detail']//li[@class='maintitle']//a")
for i,x in enumerate(tnodes):
    print i," ",x.get('name'),x.get('href'),x.get('onclick'),x.text,"\n"
The output is:
0 p_name http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','',''); None
1 p_name http://product.dangdang.com/product.aspx?product_id=20255354&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20255354_2_12605_p','','',''); None
2 p_name http://product.dangdang.com/product.aspx?product_id=20836565&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20836565_3_2361_p','','',''); None
3 p_name http://product.dangdang.com/product.aspx?product_id=21004615&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21004615_4_3387_p','','',''); None
4 p_name http://product.dangdang.com/product.aspx?product_id=21063086&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21063086_5_18815_p','','',''); None
5 pr_name http://product.dangdang.com/product.aspx?product_id=20678461&ref=search-1-pub s('click','python','01.54.04.03,01.54.06.18','','86_1_25','','','20678461_6_3967_p','','','RECO'); None
6 pr_name http://product.dangdang.com/product.aspx?product_id=20650363&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20650363_7_62_p','','','RECO'); 黑客之道:漏洞发掘的艺术(原书第二版)(赠1CD)(电子制品CD-ROM)(
7 pr_name http://product.dangdang.com/product.aspx?product_id=20767932&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20767932_8_4475_p','','','RECO'); Binary Hacks――黑客秘笈100选
8 p_name http://product.dangdang.com/product.aspx?product_id=20596189&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20596189_9_639_p','','',''); None
9 p_name http://product.dangdang.com/product.aspx?product_id=20947680&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','20947680_10_7295_p','','',''); None
10 p_name http://product.dangdang.com/product.aspx?product_id=21050368&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','21050368_11_7039_p','','',''); None
11 p_name http://product.dangdang.com/product.aspx?product_id=20667966&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20667966_12_383_p','','',''); None
12 p_name http://product.dangdang.com/product.aspx?product_id=21022493&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21022493_13_5183_p','','',''); None
13 pr_name http://product.dangdang.com/product.aspx?product_id=479654&ref=search-1-pub s('click','python','01.54.06.08,01.54.06.18','','86_1_25','','','479654_14_2095_p','','','RECO'); Perl语言编程(第三版)
14 pr_name http://product.dangdang.com/product.aspx?product_id=20999855&ref=search-1-pub s('click','python','01.54.10.00','','86_1_25','','','20999855_15_6715_p','','','RECO'); 程序员的思维修炼:开发认知潜能的九堂课
15 pr_name http://product.dangdang.com/product.aspx?product_id=20696203&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20696203_16_31615_p','','','RECO'); Perl语言入门(第五版)(原书名:Learning Perl,5/e)
16 p_name http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','',''); 可爱的
17 p_name http://product.dangdang.com/product.aspx?product_id=20362210&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20362210_18_32_p','','',''); 学习
18 p_name http://product.dangdang.com/product.aspx?product_id=9053236&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','9053236_19_4_p','','',''); 学习
19 p_name http://product.dangdang.com/product.aspx?product_id=20850780&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20850780_20_1055_p','','',''); None
20 pr_name http://product.dangdang.com/product.aspx?product_id=20449068&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20449068_21_38_p','','','RECO'); 精通Perl
21 p_name http://product.dangdang.com/product.aspx?product_id=21127816&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','21127816_22_12545_p','','',''); None
22 p_name http://product.dangdang.com/product.aspx?product_id=21107633&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21107633_23_19245_p','','',''); Hadoop权威指南(第2版)修订升级版
23 None http://bang.dangdang.com/product_redirect.php?product_id=9317290 None None
24 p_name http://product.dangdang.com/product.aspx?product_id=9317290&ref=search-1-pub s('click','python','01.54.06.06,01.49.01.11,01.54.26.00','','86_1_25','','','9317290_24_81727_p','','',''); Java编程思想(第4版)
25 p_name http://product.dangdang.com/product.aspx?product_id=20773186&ref=search-1-pub s('click','python','01.54.06.17','','86_1_25','','','20773186_25_80479_p','','',''); Android应用开发揭秘
The problem is x.text. For example:
1.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','','');">
<font class="skcolor_ljg">Python</font>
基础教程(第2版)
</a>
What I want to get is "Python 基础教程(第2版)", but the output is None.
2.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','','');">
可爱的
<font class="skcolor_ljg">Python</font>
</a>
What I want to get is "可爱的python", but the output is 可爱的.
Could you tell me how to revise my code?
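
A possible explanation, sketched under assumptions: in the ElementTree model that lxml follows, x.text holds only the text that comes before the element's first child, so text inside or after the <font> child is not included. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml.html so that it runs without extra packages. The two markup strings are simplified stand-ins for the links shown above (attributes trimmed, whitespace assumed), and full_text is a hypothetical helper name. It is written for Python 3, unlike the Python 2 code above.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two <a> elements in the question.
# Assumption: attributes are trimmed and the whitespace is guessed.
SNIPPETS = [
    '<a name="p_name"><font class="skcolor_ljg">Python</font> 基础教程(第2版)</a>',
    '<a name="p_name">可爱的<font class="skcolor_ljg">Python</font></a>',
]


def full_text(element):
    """Join every text fragment under element, in document order.

    itertext() yields the element's own text, the text of each
    descendant, and each descendant's tail text. Runs of whitespace
    are collapsed to a single space.
    """
    return " ".join("".join(element.itertext()).split())


for snippet in SNIPPETS:
    a = ET.fromstring(snippet)
    # a.text stops at the first child element; full_text() does not.
    print(repr(a.text), "->", full_text(a))
```

lxml elements also provide itertext(), and lxml.html elements additionally offer text_content(), so in the loop above replacing x.text with x.text_content() is probably the smallest change.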