need help with re module
Matimus
mccredie at gmail.com
Wed Jun 20 14:02:12 EDT 2007
On Jun 20, 9:58 am, linuxprog <linuxp... at gmail.com> wrote:
> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text , without html tags , the result should be some
> thing like that : helloworldok
>
> i have tried that :
>
> from re import findall
>
> chaine = """<html>hello</a>world<anytag>ok"""
>
> print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>
> >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct ! what would be the correct regex to use ?
This: [^(<.*>)] is a negated character class. It matches any single
character except "(", "<", ".", "*", ">" and ")". Inside square
brackets those characters are plain literals, not grouping or
repetition operators, so it most certainly doesn't do what you want.
Is it absolutely necessary that you use a regular expression? There are
a few HTML parsing libraries out there. If you do stick with re, the
easiest approach is probably a search and replace on all tags: match
each tag and replace it with the empty string.
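
Here is a minimal sketch of that search-and-replace idea. It assumes a
"tag" is simply everything from a "<" up to the next ">", which is good
enough for your sample string but not for real-world HTML (a ">" inside
an attribute value or a comment would trip it up). The names TAG and
strip_tags are made up for the example. print is written with
parentheses so the snippet also runs on Python 3.

```python
import re

chaine = """<html>hello</a>world<anytag>ok"""

# Assumption: a tag is "<", then any run of characters that are not ">",
# then ">". This is a simplification, not a full HTML grammar.
TAG = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Return text with every <...> run replaced by the empty string."""
    return TAG.sub('', text)


print(strip_tags(chaine))  # prints: helloworldok
```

re.sub replaces every non-overlapping match, so there is no need to
describe the text you want to keep. You only describe what to throw
away.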
Matt