[Tutor] retrieve URLs and text from web pages

Tino Dai oberoc at gmail.com
Tue Jun 29 04:34:26 CEST 2010


On Sun, Jun 27, 2010 at 12:15 PM, Khawla Al-Wehaibi <kwehaibi at yahoo.com> wrote:

> Hi,
>
> I’m new to programming. I’m currently learning python to write a web
> crawler to extract all text from a web page, in addition to, crawling to
> further URLs and collecting the text there. The idea is to place all the
> extracted text in a .txt file with each word in a single line. So the text
> has to be tokenized. All punctuation marks, duplicate words and non-stop
> words have to be removed.
>

Welcome to Python! This is best done as a multi-step process so that you
understand each piece as you build it. To get the most out of Python, there
are a couple of things you should read first.

http://docs.python.org/library/stdtypes.html   (the section on strings). In
Python, everything is an object, so every value has methods attached to it.
For instance, a string has a find method that returns the position of a
substring, or -1 if the substring is not there. Pretty handy if you ask me.
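
As a rough sketch of what find does (the sample text and variable names are
made up for illustration, and this is written for Python 3):

```python
# Illustrative only: the sample text is invented for this sketch.
text = "Welcome to Python! Python strings have many methods."

first = text.find("Python")              # index of the first match
second = text.find("Python", first + 1)  # search again, starting past the first match
missing = text.find("Perl")              # find returns -1 when nothing matches

print(first, second, missing)
```

Because find returns -1 rather than raising an error, check for -1 before
using the result as an index.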

Also, I would read up on sets in Python. They can reduce the size of your
code significantly, since a set drops duplicates for you.
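
A minimal sketch of two places where sets might help in a crawler, assuming
Python 3. The word list, the tiny stop-word set, and the should_visit helper
are all hypothetical names made up for this example:

```python
# Removing duplicates: a set keeps only one copy of each item.
words = ["the", "cat", "saw", "the", "other", "cat"]
unique_words = set(words)

# Set difference drops every word that also appears in another set.
stop_words = {"the", "a", "an"}  # hypothetical, far from a complete list
content_words = unique_words - stop_words

# Remembering which URLs were already crawled, so none is fetched twice.
visited = set()

def should_visit(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in visited:
        return False
    visited.add(url)
    return True

print(sorted(content_words))
print(should_visit("http://example.com/"), should_visit("http://example.com/"))
```

Note that a set does not keep the original order of the words. If order
matters, keep a list alongside the set.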

>
> The program should crawl the web to a certain depth and collect the URLs
> and text from each depth (level). I decided to choose a depth of 3. I
> divided the code to two parts. Part one to collect the URLs and part two to
> extract the text. Here is my problem:
>
> 1.    The program is extremely slow.
>

The best way to find out where the time goes is to use a profiler:

 http://docs.python.org/library/profile.html
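
A small sketch of profiling one function with cProfile, which is described
on the same documentation page (Python 3 assumed). count_words and the
sample data are hypothetical stand-ins for whatever part of the crawler
turns out to be slow:

```python
import cProfile
import io
import pstats

def count_words(lines):
    """Hypothetical stand-in for a slow part of the crawler."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

sample = ["the quick brown fox", "jumps over the lazy dog"] * 1000

# Record timing data only while the code of interest runs.
profiler = cProfile.Profile()
profiler.enable()
result = count_words(sample)
profiler.disable()

# Print the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

For a crawler, the report will often show that most of the time is spent
waiting on the network rather than in your own code, but measure before
assuming that.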

> 2.    I'm not sure if it functions properly.
>

To debug your code, you may want to read up on the Python debugger.
 http://docs.python.org/library/pdb.html
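
One possible way to drop into the debugger at a chosen point, sketched for
Python 3. The tokenize function and its debug flag are hypothetical; the
flag is only there so the script runs normally unless you ask for the
debugger:

```python
import pdb

def tokenize(text, debug=False):
    """Hypothetical helper: lower-case the text and split it on whitespace."""
    if debug:
        # Execution pauses here; at the (Pdb) prompt, "p text" prints the
        # variable, "n" runs the next line, and "c" continues.
        pdb.set_trace()
    return text.lower().split()

tokens = tokenize("Hello Web Crawler")
print(tokens)
```

You can also run a whole script under the debugger without editing it, with
python -m pdb yourscript.py.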

> 3.    Is there a better way to extract text?
>

Look at the string and list methods in the standard types documentation. I
think you will be pleasantly surprised by how much they already do.
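
Beyond plain string methods, one option is the standard library's HTML
parser. This is a sketch only, assuming Python 3 (html.parser); the class
name TextAndLinkParser and the sample page are made up, and downloading the
page is left out so the example runs without a network connection:

```python
from html.parser import HTMLParser

class TextAndLinkParser(HTMLParser):
    """Hypothetical helper: collects visible text and link targets."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

html = (
    "<html><head><style>p {color: red}</style></head><body>"
    '<p>Hello, <a href="http://example.com/next">next page</a>!</p>'
    "<script>var x = 1;</script></body></html>"
)

parser = TextAndLinkParser()
parser.feed(html)
parser.close()

page_text = " ".join(parser.text_parts)
print(page_text)
print(parser.links)
```

The collected links are what the crawler would follow at the next depth, and
the collected text is what would go on to tokenizing and cleaning.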


> 4.    Are there any available modules to help clean the text i.e. removing
> duplicates, non-stop words ...
>

Read up on sets and the string functions/methods. They are your friends.
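
A rough sketch of such a cleaning step using only string methods and a set,
assuming Python 3 and assuming the goal is to drop stop words. STOP_WORDS is
a hypothetical, deliberately tiny list, and clean_words is a made-up name:

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # hypothetical list

def clean_words(text, stop_words=STOP_WORDS):
    """Strip punctuation, lower-case, drop stop words and duplicates."""
    # string.punctuation covers ASCII punctuation only.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()

    seen = set()   # fast membership test for words already kept
    result = []    # keeps the words in their original order
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result

sample = "The cat, the dog, and the bird: a tale of the cat!"
cleaned = clean_words(sample)

# One word per line, ready to be written to a .txt file.
output = "\n".join(cleaned)
print(output)
```

Typographic quotes and other non-ASCII punctuation are not in
string.punctuation, so they would need to be handled separately.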

> 5.    Any suggestions or feedback is appreciated.
>
>
-Tino

PS: Please don't send HTML-laden emails; they are harder to work with.
Thanks