a text processing problem: are regular expressions necessary?

Sandy Norton sandskyfly at hotmail.com
Sun Mar 17 08:29:18 EST 2002


Hi,

I thought I'd share this problem which has just confronted me.

The problem

Automatically transforming urls of articles on various news sites into
their printer-friendly counterparts is complicated by the fact that
different sites use different schemes for doing this. (see examples
below)
 
Now, given two examples for each site (a regular link to an article and
its printer-friendly counterpart), is there a way to automatically
generate transformation code that is specific to each site but
generalizes across all article urls within that site?
 
Here are a few examples from several online publications:

http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm
    
http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688
    
http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
http://www.nationalreview.com/ponnuru/ponnuruprint031502.html
    
http://www.thenation.com/doc.mhtml?i=20020204&s=said
http://www.thenation.com/docPrint.mhtml?i=20020204&s=said

I'm kinda heading in the direction of generating a regular expression
for each site... but I'm a bit apprehensive about doing this. Is there a
more pythonic way to approach this problem?
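
To make the idea concrete, here is a rough (and untested) sketch of what I
mean by "generating transformation code" from a single example pair. It just
diffs the two urls into a common prefix, a differing middle and a common
suffix, and treats the middle as a per-site substitution; the function names
are made up. It would handle sites like the Economist and The Nation, but it
clearly breaks down on the nationalreview.com example, where the varying date
sits inside the differing region.

def common_prefix(a, b):
    # length of the longest common leading substring of a and b
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i = i + 1
    return i

def common_suffix(a, b):
    # length of the longest common trailing substring of a and b
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[len(a) - 1 - i] == b[len(b) - 1 - i]:
        i = i + 1
    return i

def make_rule(normal_url, print_url):
    # diff one example pair into a (search, replace) pair
    p = common_prefix(normal_url, print_url)
    s = common_suffix(normal_url[p:], print_url[p:])
    return normal_url[p:len(normal_url) - s], print_url[p:len(print_url) - s]

def apply_rule(rule, url):
    # naive: replaces the first occurrence of the differing fragment,
    # so it misfires if that fragment also appears earlier in the url
    search, replace = rule
    return url.replace(search, replace, 1)

# example: the Economist pair from above
rule = make_rule(
    "http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688",
    "http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688")
print rule          # ('displayStor', 'PrinterFriendl')
print apply_rule(rule,
    "http://www.economist.com/agenda/displayStory.cfm?Story_ID=9999999")

Obviously that is pretty fragile, which is why I'm wondering whether
generated regexes (or something else entirely) would be the better route.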

Any advice would be appreciated.

regards,

Sandy


