[Tutor] downloader-script

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Fri, 27 Sep 2002 09:40:46 -0700 (PDT)


On Fri, 27 Sep 2002, Francois Granger wrote:

> At 11:14 +0200 27/09/02, in message Re: [Tutor] downloader-script,
> Magnus Lycka wrote:
> >
> >Ok, you want to remove HTML tags from an HTML
> >file and turn it into plain text?
> >
> >http://www.faqts.com/knowledge_base/view.phtml/aid/3680/fid/199
>
> I was looking in the same issue. This looks like a good simple solution.


It's simple, but it has a bug: it assumes that a whole HTML tag lies
entirely on the same line.

###
>>> import re
>>> data = """<table
...                 bgcolor="blue">
...            hello world
...            </table>"""
>>> text = re.sub('<.*?>', '', data)
>>> text
'<table\n                bgcolor="blue">\n           hello world\n           '
###

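The opening tag survives because, by default, '.' matches any character
except a newline, so '<.*?>' cannot match a tag that is split across
lines.  This may not be an issue on simple pages, but it's still one to be
aware of.  One way to fix it is to compile the pattern with the re.DOTALL
flag, which tells the regular expression engine to let '.' match newlines
too: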

###
>>> html_pattern = re.compile('<.*?>', re.DOTALL)
>>> text = html_pattern.sub('', data)
>>> text
'\n           hello world\n           '
###
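
For anyone who wants to reuse this, here is a minimal sketch that wraps the
DOTALL pattern in a function.  The names strip_tags, TAG_PATTERN and
TAG_PATTERN_NO_FLAG are made up for illustration; they are not from the
FAQTs recipe.  The second pattern is an alternative I'm adding, not part of
the fix above: a negated character class like [^>] matches newlines without
any flag.

```python
import re

# Pattern and flag from the session above; the variable name is hypothetical.
TAG_PATTERN = re.compile(r'<.*?>', re.DOTALL)

# Alternative (an assumption of this sketch, not from the original recipe):
# [^>] matches any character except '>', including newlines, so no flag
# is needed.
TAG_PATTERN_NO_FLAG = re.compile(r'<[^>]*>')


def strip_tags(html, pattern=TAG_PATTERN):
    """Return html with every match of pattern removed."""
    return pattern.sub('', html)


# A tag split across two lines, like the <table> example above.
data = '<table\n    bgcolor="blue">\n  hello world\n  </table>'
print(repr(strip_tags(data)))
print(repr(strip_tags(data, TAG_PATTERN_NO_FLAG)))
```

Both calls should leave just the text between the tags, with its
surrounding whitespace.  Neither pattern is a real HTML parser, though: a
'>' inside an attribute value, for instance, will end the match too early.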



Good luck!