re.split() not keeping matched text
Christopher T King
squirrel at WPI.EDU
Sun Jul 25 16:55:21 EDT 2004
On Sun, 25 Jul 2004, Robert Oschler wrote:
> Given the following program:
>
> --------------
>
> import re
>
> x = "The dog ran. The cat eats! The bird flies? Done."
> l = re.split("[.?!]", x)
>
> for s in l:
>     print s.strip()
> # for
> ---------------
> I want to keep the punctuation marks.
>
> Where am I going wrong here?
What you need is some magic with the (?<=...), or 'look-behind assertion'
operator:
re.split(r'(?<=[.?!])\s*', x)
What this regex is saying is "match a string of spaces that follows one of
[.?!]". This way, it will not consume the punctuation, but will consume
the spaces (thus killing two birds with one stone by obviating the need
for the subsequent s.strip()).
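To see what the look-behind actually consumes, here is a small sketch (the names pattern and consumed are my own, and print() is written in Python 3 style rather than the print statement used in the quoted program). It lists the text each match covers, which should be whitespace only, never the punctuation:

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Look-behind: the match position must be preceded by one of . ? !
# but that character is not part of the match itself.
pattern = re.compile(r'(?<=[.?!])\s*')

# Collect the text covered by each match.
consumed = [m.group() for m in pattern.finditer(x)]

for text in consumed:
    print(repr(text))
```

Each item printed is either whitespace or an empty string, so the punctuation stays attached to the sentence before it.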
Unfortunately, there is a slight bug: if the punctuation is not
followed by whitespace, re.split won't split there, because the regex
matches a zero-length string. There is a patch to fix this (SF #988761,
see the end of the message for a link), but until then, you can avoid the
problem by using:
re.split(r'(?<=[.?!])\s+', x)
This won't match end-of-sentence marks not followed by whitespace, but
that may be preferable behaviour anyway (e.g. if you're parsing Python
documentation, where dotted names like re.split shouldn't be broken apart).
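Put together with the original program, a minimal sketch of the \s+ version looks like this (the variable name sentences is my own, and print() is Python 3 syntax; on the Python of this posting you would write print s instead):

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Split on the whitespace that follows sentence-ending punctuation.
# The look-behind keeps the punctuation attached to each sentence.
sentences = re.split(r'(?<=[.?!])\s+', x)

for s in sentences:
    # No strip() needed: the separating whitespace was consumed by the split.
    print(s)
```

The final "Done." has no trailing whitespace, so nothing is split off after it. As far as I know, Python 3.7 and later do split on empty matches, so the \s* form may behave differently there than described above; the \s+ form should not be affected.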
Hope this helps.
Patch #988761:
http://sourceforge.net/tracker/index.php?func=detail&aid=988761&group_id=5470&atid=305470