extract news article from web

Steve Holden steve at holdenweb.com
Wed Dec 22 15:21:37 EST 2004


Zhang Le wrote:

> Hello,
> I'm writing a little Tkinter application to retrieve news from
> various news websites such as http://news.bbc.co.uk/, and display them
> in a TK listbox. All I want are news title and url information. Since
> each news site has a different layout, I think I need some
> template-based techniques to build news extractors for each site,
> ignoring information such as table, image, advertise, flash that I'm
> not interested in.
> 
> So far I have built a simple GUI using Tkinter, a link extractor
> using HTMLlib to extract HREFs from web page. But I really have no idea
> how to extract news from web site. Is anyone aware of general
> techniques for extracting web news? Or can point me to some familiar
> projects.
> I have seen some search engines doing this, for
> example: http://news.ithaki.net/, but do not know the technique used.
> Any tips?
> 
> Thanks in advance,
> 
> Zhang Le
> 
Well, for Python-related news I suck stuff from O'Reilly's Meerkat
service using XML-RPC. Once upon a time I used to update
www.holdenweb.com every four hours, but until my current hosting 
situation changes I can't be arsed.

However, the code to extract the news is pretty simple. Here's the whole
program, modulo newsreader wrapping. It would be shorter if I weren't
stashing the extracted links in a relational database:

#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
#                       and update the appropriate section database
#
import xmlrpclib
server = xmlrpclib.Server(
    "http://www.oreillynet.com/meerkat/xml-rpc/server.php")

from db import conn, pmark
import mx.DateTime as dt
curs = conn.cursor()

pyitems = server.meerkat.getItems(
         {'search':'/[Pp]ython/','num_items':10,'descriptions':100})

sqlinsert = ("INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
             "VALUES(%s, %s, %s)" % (pmark, pmark, pmark))
for itm in pyitems:
         description = itm['description'] or itm['title']
         if itm['link'] and not ("<" in description):
                 curs.execute("""SELECT COUNT(*) FROM PyLink
                    WHERE pylURL=%s""" % pmark, (itm['link'], ))
                 newlink = curs.fetchone()[0] == 0
                 if newlink:
                         print "Adding", itm['link']
                         curs.execute(sqlinsert,
                             (dt.DateTimeFromTicks(int(dt.now())),
                              itm['link'], description))

conn.commit()
conn.close()
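
The db module imported above is not shown in this post. As a rough sketch only, assuming it supplies a DB-API connection (conn) and the driver's parameter marker (pmark), the check-then-insert step can be tried against an in-memory SQLite database. The use of sqlite3, the "?" marker, the add_if_new helper, the table layout and the example.com URL are all assumptions made for illustration, not the original setup. It is written for a current Python 3 rather than the Python 2 of the script above.

```python
import sqlite3
import time

# Assumption: stand-ins for the conn and pmark that the unshown db module provides.
conn = sqlite3.connect(":memory:")
pmark = "?"  # sqlite3 uses the "qmark" parameter style
curs = conn.cursor()
curs.execute(
    "CREATE TABLE PyLink (pylWhen REAL, pylURL TEXT, pylDescription TEXT)")

# Build the statement once; only the parameter markers are interpolated,
# the actual values are passed separately to execute().
sqlinsert = ("INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
             "VALUES(%s, %s, %s)" % (pmark, pmark, pmark))


def add_if_new(link, description):
    """Insert the link unless it is already stored; return True if added."""
    curs.execute("SELECT COUNT(*) FROM PyLink WHERE pylURL=%s" % pmark,
                 (link,))
    if curs.fetchone()[0] == 0:
        curs.execute(sqlinsert, (time.time(), link, description))
        return True
    return False


# Hypothetical sample data: the second call finds the existing row and skips it.
first = add_if_new("http://example.com/a", "Example item")
second = add_if_new("http://example.com/a", "Example item")
conn.commit()
```

Keeping the marker in a variable is what lets the same SQL text work with drivers that expect "%s" instead of "?".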

Similar techniques can be used on many other sites, and you will find 
that (some) RSS feeds are a fruitful source of news.
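
For the RSS route, a minimal sketch of pulling just the title and URL out of a feed might look like the following. It assumes a plain RSS 2.0 document with no XML namespaces and a current Python 3 with xml.etree.ElementTree; the sample feed, the extract_items name and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:  # skip items that carry no URL
            pairs.append((title, link))
    return pairs


items = extract_items(SAMPLE_RSS)
for title, link in items:
    print(title, link)
```

For a live feed the text could come from urllib.request.urlopen(url).read(), and the resulting pairs are what would go into the Tk listbox. Feeds in other formats (RSS 1.0, Atom) use namespaces and would need different tag names.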

regards
  Steve
-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119


