need help with re module

Wed Jun 20 16:56:30 EDT 2007

On 6/20/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <linuxprog at gmail.com>
> wrote:
>
> > I have the string "<html>hello</a>world<anytag>ok" and I want to
> > extract all the text, without the HTML tags. The result should be
> > something like: helloworldok
> >
> > I have tried this:
> >
> >         from re import findall
> >
> >         chaine = """<html>hello</a>world<anytag>ok"""
> >
> >         print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
> >       >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
> >
> > The result is not correct! What would be the correct regex to use?
>
> You can't use a regular expression for this task (no matter how
> complicated you write it).
[snip]

I agree that BeautifulSoup is probably the best tool for the job, but
the claim that no regex can do this doesn't sound right to me. Since
the OP doesn't care about tags being properly nested, I don't see why
a regex (albeit a tricky one) wouldn't work. For example:

import re

# Sample input: the string from the original question.
html = """<html>hello</a>world<anytag>ok"""

regex = re.compile(r'''
    <[^!]             # beginning of normal tag
        ([^'">]*        # unquoted text...
        |'[^']*'        # or single-quoted text...
        |"[^"]*")*      # or double-quoted text
    >                 # end of tag
   |<!--              # beginning of comment
        ([^-]|-[^-])*
    --\s*>            # end of comment
''', re.VERBOSE)
text = regex.sub('', html)    # 'helloworldok'

Granted, this misses out a few things (e.g. DOCTYPE declarations), but
those should be straightforward to handle.
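
For what it's worth, here is a rough sketch of how the DOCTYPE case might
be folded into the same pattern. The third alternative (for declarations
such as <!DOCTYPE ...>) and the names TAG_RE and strip_tags are
assumptions made for illustration and are not part of the pattern above.
Two further changes are also mine: the groups are non-capturing, and the
unquoted branch matches one character at a time, to avoid nesting one
repetition inside another, which can backtrack heavily on an unclosed
tag. It still does not handle a DOCTYPE with an internal subset ([...])
or CDATA sections.

```python
import re

# Hypothetical extension of the pattern from the post. The third
# alternative (declarations such as <!DOCTYPE ...>) is an assumption
# added for illustration.
TAG_RE = re.compile(r'''
    <[^!]               # beginning of normal tag
        (?:[^'">]       # one unquoted character...
        |'[^']*'        # or single-quoted text...
        |"[^"]*")*      # or double-quoted text
    >                   # end of tag
   |<!--                # beginning of comment
        (?:[^-]|-[^-])*
    --\s*>              # end of comment
   |<![A-Za-z]          # beginning of a declaration, e.g. <!DOCTYPE
        (?:[^'">]       # same body rules as a normal tag
        |'[^']*'
        |"[^"]*")*
    >                   # end of declaration
''', re.VERBOSE)


def strip_tags(html):
    """Return html with tags, comments and simple declarations removed."""
    return TAG_RE.sub('', html)


sample = '<!DOCTYPE html><p title="a > b">hello</p><!-- note > here -->world'
print(strip_tags(sample))  # helloworld
```

Note that the quoted ">" inside the title attribute and the ">" inside the
comment are both swallowed along with their tag or comment rather than
ending the match early, which is the point of the quoted-text branches.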

-- David