Strange re behavior: normal?
Fredrik Lundh
fredrik at pythonware.com
Sun Aug 17 14:40:31 EDT 2003
Michael Janssen wrote:
> Well, I believe it's a good choice not to split a string by an empty
> string, but when you really want to (version with empty results at start
> and end omitted):
>
> def boundary_split(s):
>     back = []
>     while 1:
>         try:
>             # r'.\b' and +1 prevents endless loop
>             pos = re.search(r'.\b', s, re.DOTALL).start()+1
>         except AttributeError:
>             if s: back.append(s)
>             break
>         back.append(s[:pos])
>         s = s[pos:]
>     return back
note that \b is defined in terms of \w and \W, so you can replace the
above with:

    import re

    def boundary_split(text):
        return re.findall(r"\w+|\W+", text)

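as a quick check of that claim, here is a minimal sketch that runs the quoted loop and the one-line findall version on the same input. the name boundary_split_loop and the sample string are made up for this illustration; the loop is restructured slightly (an explicit None test instead of catching AttributeError) but cuts at the same positions.

```python
import re

def boundary_split_loop(s):
    # quoted approach: cut right after each character that is followed by \b
    back = []
    while True:
        m = re.search(r'.\b', s, re.DOTALL)
        if m is None:
            # no boundary left; keep any trailing non-word text
            if s:
                back.append(s)
            break
        pos = m.start() + 1
        back.append(s[:pos])
        s = s[pos:]
    return back

def boundary_split(text):
    # alternating runs of word and non-word characters
    return re.findall(r"\w+|\W+", text)

sample = "Hello, world! 42"   # hypothetical sample input
parts = boundary_split(sample)
print(parts)                  # ['Hello', ', ', 'world', '! ', '42']
print(parts == boundary_split_loop(sample))
```

since every character is either \w or \W, joining the parts gives back the original string.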
> What's the good of splitting by boundaries? Someone else wanted this a
> few days ago on tutor and I can't figure out a reason by now.
the function extracts the words from a text, but includes the non-word
parts in the list as well (unlike, e.g. text.split() and re.findall("\w+")).
might be useful if you're writing some kind of text filter.

    for part in re.findall(r"\w+|\W+", text):
        ...

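one possible text filter built on that loop, as a sketch only: it masks words from a given set and passes everything else through untouched. the function name censor, the banned set, and the sample sentence are all invented for this example.

```python
import re

def censor(text, banned):
    # banned: a set of lowercase words to mask (hypothetical parameter)
    out = []
    for part in re.findall(r"\w+|\W+", text):
        # non-word parts never match a banned word, so they pass through
        if part.lower() in banned:
            part = "*" * len(part)
        out.append(part)
    return "".join(out)

print(censor("Spam, eggs and spam!", {"spam"}))   # ****, eggs and ****!
```

because the separators are kept in the list, the filtered text keeps its original spacing and punctuation.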
here's an alternative pattern, which might be easier to use:

    for word, sep in re.findall(r"(\w+)(\W*)", text):
        ...

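a small sketch of what the two-group pattern returns, using invented helper names (word_sep_pairs, capitalize_words) and sample strings. one thing to keep in mind: each match has to start with a word, so any non-word text before the first word is not included in the result.

```python
import re

def word_sep_pairs(text):
    # each item is (word, separator that follows it); the last separator may be ''
    return re.findall(r"(\w+)(\W*)", text)

def capitalize_words(text):
    # example use: change the words, keep the separators as they were
    return "".join(word.capitalize() + sep for word, sep in word_sep_pairs(text))

print(word_sep_pairs("one, two three"))
# [('one', ', '), ('two', ' '), ('three', '')]
print(capitalize_words("one, two three"))   # One, Two Three
print(word_sep_pairs("  hi there"))         # leading spaces are dropped
```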
</F>
PS. for proper support of non-ASCII text, prefix the pattern with (?u)
for ISO-8859-1 or Unicode strings, or (?L) to support localized text
(locale.setlocale).
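the following sketch assumes Python 3, which is not the version this message was written for. there, str patterns are Unicode-aware by default (so a (?u) prefix is redundant for them), and the re.ASCII flag restricts \w to ASCII instead. the sample text is made up and written with escapes so that the accented letters are single code points.

```python
import re

text = "na\u00efve caf\u00e9"   # "naive cafe" with i-diaeresis and e-acute

# Python 3 default for str patterns: \w matches Unicode word characters
unicode_words = re.findall(r"\w+", text)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)   # ['na', 've', 'caf']
```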