Best way to match everything between tags

Robert Roy rjroy at takingcontrol.com
Wed Jan 31 19:17:09 EST 2001


On Wed, 31 Jan 2001 23:46:17 +0100, "Henning VON ROSEN"
<hvrosen at world-online.no> wrote:

>Hi!
>I am learning regular expressions.
>
>What is the natural way to match everything that is not "something"?
>For example, I want to manipulate all the text of an HTML document, but none of the tags.
>
>
>This matches all the tags:
>
>import re
>s = "text<body>some text<tag>more text</tag>text"
>r = re.compile(r"<.*?>")
>print(r.sub("WWW", s))
>
>This works fine if I want to manipulate the tags, but I want to manipulate
>the text between tags.
>
>I haven't found a one-expression way. Thankful for any indications!
>
>/Henning von Rosen
>
>
Some ideas:

You should define groups in your regular expression.

>>> import re
>>> s = "text<body>some text<tag>more text</tag>text"

Defining a group around the content:

>>> m = re.search(r'<[^>]*>([^<]*)<', s)
>>> m.groups()
('some text',)

Adding a group around the tag:

>>> m = re.search(r'<([^>]*)>([^<]*)<', s)
>>> m.groups()
('body', 'some text')
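
For the original question (changing the text but leaving the tags alone), the two groups can be put to work with re.sub and a replacement function. This is only a rough sketch, assuming the goal is simply to upper-case the text; the names TAG_OR_TEXT and transform_text are made up for illustration.

```python
import re

# Group 1 matches a whole tag, group 2 a run of text containing no '<'.
TAG_OR_TEXT = re.compile(r'(<[^>]*>)|([^<]+)')

def transform_text(s, func):
    """Apply func to the text between tags and leave the tags untouched."""
    def repl(m):
        if m.group(1) is not None:
            return m.group(1)        # a tag: copy it through unchanged
        return func(m.group(2))      # plain text: hand it to func
    return TAG_OR_TEXT.sub(repl, s)

s = "text<body>some text<tag>more text</tag>text"
result = transform_text(s, str.upper)
print(result)   # TEXT<body>SOME TEXT<tag>MORE TEXT</tag>TEXT
```

Any one-argument function that returns a string can be passed in place of str.upper.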

Using a backreference (\1) to find the matching close tag; this will find
an inner pair, i.e. one with no more tags in it:

>>> m = re.search(r'<([^>]*)>([^<]*)</\1', s)
>>> m.groups()
('tag', 'more text')

An example of that, with a tag nested inside another:
>>> s="text<body>some text<tag>more <b>bolded</b> text</tag>text"

>>> m = re.search(r'<([^>]*)>([^<]*)</\1', s)
>>> m.groups()
('b', 'bolded')
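
re.search stops at the first hit. To collect every innermost pair, the same pattern can be handed to findall, which returns one tuple of groups per match. A small sketch; the sample string and the name INNER_PAIR are mine, not from the question.

```python
import re

# The backreference \1 requires the close tag to repeat the captured name;
# [^<]* forbids any nested tag inside the content.
INNER_PAIR = re.compile(r'<([^>]*)>([^<]*)</\1')

s = "<p>one <i>two</i> and <b>three</b></p>"
pairs = INNER_PAIR.findall(s)
print(pairs)   # [('i', 'two'), ('b', 'three')]
```

The outer <p> pair is skipped because its content contains other tags.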

This will pick up the outermost tag pair (a greedy match); <body> is
skipped because it is never closed:
>>> m = re.search(r'<([^>]*)>(.*)</\1', s)
>>> m.groups()
('tag', 'more <b>bolded</b> text')


With a nested tag of the same name, the greedy match runs to the last
close tag:

>>> s="text<body>some text<tag>more <tag> stuff</tag><b>bolded</b> text</tag>text"
>>> m = re.search(r'<([^>]*)>(.*)</\1', s)
>>> m.groups()
('tag', 'more <tag> stuff</tag><b>bolded</b> text')

Do a non-greedy match; this will pick up the leftmost tag and the leftmost
matching close tag:
>>> m = re.search(r'<([^>]*)>(.*?)</\1', s)
>>> m.groups()
('tag', 'more <tag> stuff')
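
Putting the greedy and non-greedy forms side by side on the same string makes the difference easier to see. A short sketch using the patterns from above; the variable names are only for illustration.

```python
import re

s = "text<body>some text<tag>more <tag> stuff</tag><b>bolded</b> text</tag>text"

# .* grabs as much as possible, so the match ends at the LAST </tag
greedy = re.search(r'<([^>]*)>(.*)</\1', s)

# .*? grabs as little as possible, so the match ends at the FIRST </tag
lazy = re.search(r'<([^>]*)>(.*?)</\1', s)

print(greedy.group(2))   # more <tag> stuff</tag><b>bolded</b> text
print(lazy.group(2))     # more <tag> stuff
```

Neither form counts nesting depth, so the non-greedy result pairs the outer <tag> with the inner </tag>.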

There are lots of ways to get into trouble here.
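
One example of such trouble is a tag with attributes: the first group then captures the attributes as well, and the backreference can no longer match the close tag. The sketch below shows the failure and one possible workaround that captures only the tag name; the sample string and the second pattern are my own assumptions, not a general HTML solution.

```python
import re

s = 'see <a href="x.html">link</a> here'

# Group 1 swallows the attributes too, so </\1 can never match "</a".
strict = re.search(r'<([^>]*)>([^<]*)</\1', s)

# Hypothetical variant: capture only the tag name, then skip the attributes.
by_name = re.search(r'<(\w+)[^>]*>([^<]*)</\1>', s)

print(strict)             # None
print(by_name.groups())   # ('a', 'link')
```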
Bob


