a text processing problem: are regular expressions necessary?

Sandy Norton sandskyfly at hotmail.com
Sun Mar 17 08:29:18 EST 2002


Hi,

I thought I'd share this problem which has just confronted me.

The problem

Automatically transforming urls of articles on various news sites into
their printer-friendly counterparts is complicated by the fact that
different sites use different schemes for doing this. (see examples
below)
 
Now, given two examples for each site (a regular link to an article and
its printer-friendly counterpart), is there a way to automatically
generate transformation code that is specific to each site but
generalizes across all article urls within that site?
 
Here are a few examples from several online publications:

http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm
    
http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688
    
http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
http://www.nationalreview.com/ponnuru/ponnuruprint031502.html
    
http://www.thenation.com/doc.mhtml?i=20020204&s=said
http://www.thenation.com/docPrint.mhtml?i=20020204&s=said

I'm kinda heading in the direction of generating a regular expression
for each site... but I'm a bit apprehensive about doing this. Is there a
more pythonic way to approach this problem?
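
To make the idea concrete, here is a rough (and untested) sketch of what I
mean by "generating transformation code" from a single example pair. It just
diffs the two urls into a common prefix, a differing middle and a common
suffix, and treats the middle as a per-site substitution; the function names
are made up. It would handle sites like the Economist and The Nation, but it
clearly breaks down on the nationalreview.com example, where the varying date
sits inside the differing region.

def common_prefix(a, b):
    # length of the longest common leading substring of a and b
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i = i + 1
    return i

def common_suffix(a, b):
    # length of the longest common trailing substring of a and b
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[len(a) - 1 - i] == b[len(b) - 1 - i]:
        i = i + 1
    return i

def make_rule(normal_url, print_url):
    # diff one example pair into a (search, replace) pair
    p = common_prefix(normal_url, print_url)
    s = common_suffix(normal_url[p:], print_url[p:])
    return normal_url[p:len(normal_url) - s], print_url[p:len(print_url) - s]

def apply_rule(rule, url):
    # naive: replaces the first occurrence of the differing fragment,
    # so it misfires if that fragment also appears earlier in the url
    search, replace = rule
    return url.replace(search, replace, 1)

# example: the Economist pair from above
rule = make_rule(
    "http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688",
    "http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688")
print rule          # ('displayStor', 'PrinterFriendl')
print apply_rule(rule,
    "http://www.economist.com/agenda/displayStory.cfm?Story_ID=9999999")

Obviously that is pretty fragile, which is why I'm wondering whether
generated regexes (or something else entirely) would be the better route.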

Any advice would be appreciated.

regards,

Sandy


