[Tutor] Splitting by word boundaries

Fri Aug 15 00:03:28 EDT 2003

On Thu, 14 Aug 2003, Jonathan Hayward http://JonathansCorner.com wrote:

> Neil Schemenauer wrote:

> >re.findall(r'\w+', ...) should do what is intended.

> I couldn't figure out how to get this to accept re.DOTALL or equivalent
> and wrote a little bit of HTML handling that the original regexp
> wouldn't have done:

[You needn't set re.DOTALL when your pattern hasn't got a dot.]
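
A small sketch of that point, in case it helps: re.DOTALL only changes what
'.' matches, so it makes no difference to r'\w+'. The sample string and the
variable names are made up for illustration; compiling the pattern is just
one way to pass a flag.

```python
import re

# Illustrative input, not taken from the original thread.
text = "first line\nsecond line"

# No dot in the pattern: the flag changes nothing.
plain = re.findall(r'\w+', text)
with_flag = re.compile(r'\w+', re.DOTALL).findall(text)

# With a dot in the pattern the flag does matter:
# by default '.' stops at a newline, with re.DOTALL it does not.
dot_default = re.findall(r'.+', text)
dot_all = re.compile(r'.+', re.DOTALL).findall(text)

print(plain == with_flag)
print(dot_default)
print(dot_all)
```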

Running your code (after putting it into a state to run it ;-) doesn't
split the input :-( To my eyes it is missing a while loop. In my approach I
rely on r'\b' to look ahead to the next boundary and slice the string:

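import re

def boundary_split(s):
    """Split s into alternating runs of word and non-word characters."""
    back = []
    while True:
        # r'.\b' matches one character followed by a word boundary;
        # consuming that character (+1) prevents an endless loop.
        # re.DOTALL lets '.' match a newline as well.
        match = re.search(r'.\b', s, re.DOTALL)
        if match is None:
            # No boundary left: keep whatever remains and stop.
            if s:
                back.append(s)
            break
        pos = match.start() + 1
        back.append(s[:pos])
        s = s[pos:]
    return back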

>>> boundary_split("  ad=\n2  ")
['  ', 'ad', '=\n', '2', '  ']
>>> boundary_split("<title>word boundary</title>")
['<', 'title', '>', 'word', ' ', 'boundary', '</', 'title', '>']

Is this what you want to get?
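
For comparison, here is a shorter sketch that should give the same pieces
without a loop. It is a different technique from the r'\b' slicing above:
it simply collects alternating runs of word and non-word characters. The
name boundary_split_findall is hypothetical, chosen only for this example.

```python
import re

def boundary_split_findall(s):
    """Return alternating runs of word and non-word characters in s."""
    # \W matches newlines too, so no re.DOTALL is needed here.
    return re.findall(r'\w+|\W+', s)

print(boundary_split_findall("  ad=\n2  "))
print(boundary_split_findall("<title>word boundary</title>"))
```

Every character is either \w or \W, so nothing is dropped and joining the
pieces gives back the original string.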

Michael