scrape text from web

守株待兔 1248283536 at qq.com
Fri Aug 5 17:21:05 CEST 2011


python-list at python.org:
Hi everyone,
I want to scrape some text from
http://search.dangdang.com/search_pub.php?key=python
My code is:

# Python 2 code
import urllib
import lxml.html

down = 'http://search.dangdang.com/search_pub.php?key=python'
file = urllib.urlopen(down).read()
root = lxml.html.fromstring(file)
# select the title link of every search result
tnodes = root.xpath("//div[@class='listitem detail']//li[@class='maintitle']//a")
for i, x in enumerate(tnodes):
    print i, "  ", x.get('name'), x.get('href'), x.get('onclick'), x.text, "\n"

The output is:
0    p_name http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','',''); None

1    p_name http://product.dangdang.com/product.aspx?product_id=20255354&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20255354_2_12605_p','','',''); None

2    p_name http://product.dangdang.com/product.aspx?product_id=20836565&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20836565_3_2361_p','','',''); None

3    p_name http://product.dangdang.com/product.aspx?product_id=21004615&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21004615_4_3387_p','','',''); None

4    p_name http://product.dangdang.com/product.aspx?product_id=21063086&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21063086_5_18815_p','','',''); None

5    pr_name http://product.dangdang.com/product.aspx?product_id=20678461&ref=search-1-pub s('click','python','01.54.04.03,01.54.06.18','','86_1_25','','','20678461_6_3967_p','','','RECO'); None

6    pr_name http://product.dangdang.com/product.aspx?product_id=20650363&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20650363_7_62_p','','','RECO'); 黑客之道:漏洞发掘的艺术(原书第二版)(赠1CD)(电子制品CD-ROM)(

7    pr_name http://product.dangdang.com/product.aspx?product_id=20767932&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20767932_8_4475_p','','','RECO'); Binary Hacks――黑客秘笈100选

8    p_name http://product.dangdang.com/product.aspx?product_id=20596189&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20596189_9_639_p','','',''); None

9    p_name http://product.dangdang.com/product.aspx?product_id=20947680&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','20947680_10_7295_p','','',''); None

10    p_name http://product.dangdang.com/product.aspx?product_id=21050368&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','21050368_11_7039_p','','',''); None

11    p_name http://product.dangdang.com/product.aspx?product_id=20667966&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20667966_12_383_p','','',''); None

12    p_name http://product.dangdang.com/product.aspx?product_id=21022493&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21022493_13_5183_p','','',''); None

13    pr_name http://product.dangdang.com/product.aspx?product_id=479654&ref=search-1-pub s('click','python','01.54.06.08,01.54.06.18','','86_1_25','','','479654_14_2095_p','','','RECO'); Perl语言编程(第三版)

14    pr_name http://product.dangdang.com/product.aspx?product_id=20999855&ref=search-1-pub s('click','python','01.54.10.00','','86_1_25','','','20999855_15_6715_p','','','RECO'); 程序员的思维修炼:开发认知潜能的九堂课

15    pr_name http://product.dangdang.com/product.aspx?product_id=20696203&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20696203_16_31615_p','','','RECO'); Perl语言入门(第五版)(原书名:Learning Perl,5/e)

16    p_name http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','',''); 可爱的

17    p_name http://product.dangdang.com/product.aspx?product_id=20362210&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20362210_18_32_p','','',''); 学习

18    p_name http://product.dangdang.com/product.aspx?product_id=9053236&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','9053236_19_4_p','','',''); 学习

19    p_name http://product.dangdang.com/product.aspx?product_id=20850780&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20850780_20_1055_p','','',''); None

20    pr_name http://product.dangdang.com/product.aspx?product_id=20449068&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20449068_21_38_p','','','RECO'); 精通Perl

21    p_name http://product.dangdang.com/product.aspx?product_id=21127816&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','21127816_22_12545_p','','',''); None

22    p_name http://product.dangdang.com/product.aspx?product_id=21107633&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21107633_23_19245_p','','',''); Hadoop权威指南(第2版)修订升级版

23    None  http://bang.dangdang.com/product_redirect.php?product_id=9317290 None None

24    p_name http://product.dangdang.com/product.aspx?product_id=9317290&ref=search-1-pub s('click','python','01.54.06.06,01.49.01.11,01.54.26.00','','86_1_25','','','9317290_24_81727_p','','',''); Java编程思想(第4版)

25    p_name http://product.dangdang.com/product.aspx?product_id=20773186&ref=search-1-pub s('click','python','01.54.06.17','','86_1_25','','','20773186_25_80479_p','','',''); Android应用开发揭秘

The problem is x.text. For example:

1.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','','');">
<font class="skcolor_ljg">Python</font>
基础教程(第2版)
</a>
What I want to get is "Python 基础教程(第2版)", but the output is None.

2.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','','');">
可爱的
<font class="skcolor_ljg">Python</font>
</a>
What I want to get is "可爱的Python", but the output is 可爱的.

Could you tell me how to revise my code?
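
For reference, both results match the ElementTree data model that lxml follows: an element's .text holds only the text that comes before its first child element, and text that follows a child is stored in that child's .tail. In example 1 nothing precedes <font>, so .text is None. In example 2 only "可爱的" precedes it. The sketch below is a minimal illustration, not a tested fix for the live page. It is written for Python 3 and uses the standard-library xml.etree.ElementTree so that it runs without a network connection. The two snippets are simplified copies of the anchors above (attributes trimmed, no whitespace between nodes), and full_text is a hypothetical helper name. lxml.html elements also support itertext(), and additionally offer text_content(), so in the original loop x.text_content() is one option to try in place of x.text.

```python
import xml.etree.ElementTree as ET

# Simplified copies of the two anchors from the post (assumption: attributes
# trimmed and no whitespace between nodes, so the real page may differ).
SNIPPET_1 = '<a name="p_name"><font class="skcolor_ljg">Python</font>基础教程(第2版)</a>'
SNIPPET_2 = '<a name="p_name">可爱的<font class="skcolor_ljg">Python</font></a>'


def full_text(element):
    """Join every text fragment inside the element, in document order.

    full_text is a hypothetical helper name; itertext() yields the element's
    own .text, then each descendant's text and tail.
    """
    return "".join(element.itertext()).strip()


for snippet in (SNIPPET_1, SNIPPET_2):
    a = ET.fromstring(snippet)
    # a.text alone stops at the first child element; full_text() does not.
    print(repr(a.text), "->", full_text(a))
```

For these two snippets the loop shows None and '可爱的' for a.text, while full_text returns "Python基础教程(第2版)" and "可爱的Python". No space is inserted between the fragments; if one is wanted, " ".join over the stripped fragments would be a possible variation.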