extract news article from web
Steve Holden
steve at holdenweb.com
Wed Dec 22 15:21:37 EST 2004
Zhang Le wrote:
> Hello,
> I'm writing a little Tkinter application to retrieve news from
> various news websites such as http://news.bbc.co.uk/, and display them
> in a TK listbox. All I want are news title and url information. Since
> each news site has a different layout, I think I need some
> template-based techniques to build news extractors for each site,
> ignoring information such as table, image, advertise, flash that I'm
> not interested in.
>
> So far I have built a simple GUI using Tkinter, a link extractor
> using HTMLlib to extract HREFs from web page. But I really have no idea
> how to extract news from web site. Is anyone aware of general
> techniques for extracting web news? Or can point me to some similar
> projects.
> I have seen some search engines doing this, for
> example:http://news.ithaki.net/, but do not know the technique used.
> Any tips?
>
> Thanks in advance,
>
> Zhang Le
>
Well, for Python-related news I suck stuff from O'Reilly's Meerkat
service using XML-RPC. Once upon a time I used to update
www.holdenweb.com every four hours, but until my current hosting
situation changes I can't be arsed.
However, the code to extract the news is pretty simple. Here's the whole
program, modulo newsreader wrapping. It would be shorter if I weren't
stashing the extracted links in a relational database:
#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
#             and update the appropriate section database
#
import xmlrpclib
server = xmlrpclib.Server("http://www.oreillynet.com/meerkat/xml-rpc/server.php")

# conn is the database connection, pmark its parameter marker
from db import conn, pmark
import mx.DateTime as dt

curs = conn.cursor()

# Ask Meerkat for the ten most recent items matching "Python" or "python"
pyitems = server.meerkat.getItems(
    {'search':'/[Pp]ython/','num_items':10,'descriptions':100})

sqlinsert = """INSERT INTO PyLink (pylWhen, pylURL, pylDescription)
               VALUES(%s, %s, %s)""" % (pmark, pmark, pmark)

for itm in pyitems:
    # Fall back to the title when the item has no description
    description = itm['description'] or itm['title']
    # Skip items without a link, or whose description contains markup
    if itm['link'] and not ("<" in description):
        curs.execute("""SELECT COUNT(*) FROM PyLink
                        WHERE pylURL=%s""" % pmark, (itm['link'], ))
        newlink = curs.fetchone()[0] == 0
        # Only store links that are not already in the table
        if newlink:
            print "Adding", itm['link']
            curs.execute(sqlinsert,
                (dt.DateTimeFromTicks(int(dt.now())), itm['link'], description))

conn.commit()
conn.close()
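
The script above depends on a local db module and on the Meerkat service being reachable. As a rough, self-contained sketch of just the "store a link only if it is new" step, here is the same idea in current Python using the standard sqlite3 module. The in-memory database, the store_new_links name, the integer timestamp and the sample items are assumptions made for illustration; only the PyLink table and column names come from the script.

```python
import sqlite3
import time

def store_new_links(conn, items):
    """Insert items whose link is not yet in PyLink; return the URLs added."""
    curs = conn.cursor()
    added = []
    for itm in items:
        # Fall back to the title when the description is empty
        description = itm.get('description') or itm.get('title')
        link = itm.get('link')
        # Need a link and some text, and skip descriptions containing markup
        if not link or not description or "<" in description:
            continue
        curs.execute("SELECT COUNT(*) FROM PyLink WHERE pylURL = ?", (link,))
        if curs.fetchone()[0] == 0:
            curs.execute(
                "INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
                "VALUES (?, ?, ?)",
                (int(time.time()), link, description))
            added.append(link)
    conn.commit()
    return added

# Hypothetical stand-ins for the database and for the items a feed returns
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PyLink (pylWhen INTEGER, pylURL TEXT, pylDescription TEXT)")
sample = [
    {'title': 'Python 2.4 released', 'link': 'http://example.com/a',
     'description': ''},
    {'title': 'Markup', 'link': 'http://example.com/b',
     'description': '<b>bold</b>'},
    {'title': 'No link', 'link': '', 'description': 'plain text'},
]
first = store_new_links(conn, sample)
second = store_new_links(conn, sample)
```

Running it twice over the same items should add the usable link on the first pass and nothing on the second, which is what makes it safe to rerun the check on a schedule.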
Similar techniques can be used on many other sites, and you will find
that (some) RSS feeds are a fruitful source of news.
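
Since all you want is the title and URL, an RSS feed gives you both without any per-site template. A minimal sketch, assuming an RSS 2.0 document with plain (un-namespaced) item elements and using only the standard library; the rss_items name and the sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ET

def rss_items(xml_text):
    """Return a list of (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter('item'):
        title = (item.findtext('title') or '').strip()
        link = (item.findtext('link') or '').strip()
        # Keep only items that have both pieces of information
        if title and link:
            pairs.append((title, link))
    return pairs

# Hypothetical feed content, standing in for what a news site would serve
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
    <item>
      <title>Story without a link</title>
    </item>
  </channel>
</rss>"""

headlines = rss_items(SAMPLE_FEED)
```

In the real application the text would come from fetching the feed URL (for example with urllib.request.urlopen), and each pair would go into the listbox. Feeds in the RDF-based RSS 1.0 format put item in an XML namespace, so this sketch would need adjusting for those.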
regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119