need help with re module
David Wahler
dwahler at gmail.com
Wed Jun 20 16:56:30 EDT 2007
On 6/20/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <linuxprog at gmail.com>
> wrote:
>
> > I have this string "<html>hello</a>world<anytag>ok" and I want to
> > extract all the text, without HTML tags; the result should be
> > something like this: helloworldok
> >
> > I have tried this:
> >
> > from re import findall
> >
> > chaine = """<html>hello</a>world<anytag>ok"""
> >
> > print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
> > >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
> >
> > The result is not correct! What would be the correct regex to use?
>
> You can't use a regular expression for this task (no matter how
> complicated you write it).
[snip]
I agree that BeautifulSoup is probably the best tool for the job, but
this doesn't sound right to me. Since the OP doesn't care about tags
being properly nested, I don't see why a regex (albeit a tricky one)
wouldn't work. For example:
import re

html = "<html>hello</a>world<anytag>ok"  # the OP's sample string

regex = re.compile(r'''
    <[^!]            # beginning of normal tag
    ([^'">]*         # unquoted text...
    |'[^']*'         # or single-quoted text...
    |"[^"]*")*       # or double-quoted text
    >                # end of tag
    |<!--            # beginning of comment
    ([^-]|-[^-])*
    --\s*>           # end of comment
    ''', re.VERBOSE)
text = regex.sub('', html)   # -> 'helloworldok'
Granted, this misses out a few things (e.g. DOCTYPE declarations), but
those should be straightforward to handle.
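For comparison, a parser-based version avoids hand-rolling the tag grammar entirely. Here's a sketch using the standard library's HTMLParser (shown in modern Python 3 form; the class name is just illustrative). Since the OP doesn't care about tags nesting properly, simply collecting the character data is enough:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only character data, discarding tags and comments."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

parser = TextExtractor()
parser.feed("<html>hello</a>world<anytag>ok")
print(parser.text())  # -> helloworldok
```

Unlike the regex, this also tolerates DOCTYPE declarations and comments for free, since the parser handles them as separate events.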
-- David