Strange re behavior: normal?

Michael Janssen Janssen at rz.uni-frankfurt.de
Sun Aug 17 13:06:48 EDT 2003


Robin Munn wrote:
> How is re.split supposed to work? This wasn't at all what I expected:

> >>> import re
> >>> re.split(r'\b', 'a b c d')
> ['a b c d']

The code (INSTALL_DIR/Modules/_sre.c, function pattern_split) seems to do 
this intentionally. At least, I can see no other purpose for this 
if-clause:

         if (state.start == state.ptr) { /* empty string? (mj) */
             if (last == state.end)
                 break;
             /* skip one character */
             state.start = (void*) ((char*) state.ptr + state.charsize);
             continue;
         }
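
To make the effect of that branch easier to see, here is a rough Python 
sketch of a split loop that ignores empty matches. It is only an 
illustration of the idea, not a translation of the C code: the name 
split_skipping_empty is made up for this example, and it does not handle 
captured groups or maxsplit the way re.split does.

```python
import re

def split_skipping_empty(pattern, s):
    """Split s on non-empty matches of pattern (hypothetical helper)."""
    pieces = []
    last = 0  # end of the previous separator
    for match in re.finditer(pattern, s):
        if match.start() == match.end():
            # Empty match: do not split here, just move on. This plays
            # the role of the "skip one character" branch quoted above.
            continue
        pieces.append(s[last:match.start()])
        last = match.end()
    pieces.append(s[last:])  # whatever follows the last separator
    return pieces

only_boundaries = split_skipping_empty(r'\b', 'a b c d')
on_spaces = split_skipping_empty(r' ', 'a b c d')
```

Since r'\b' only ever matches the empty string, no split happens and 
only_boundaries is the whole input in a one-element list, while on_spaces 
is split at each space as usual.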

Well, I believe it's a good choice not to split a string on an empty 
match, but if you really want to, this works (the empty results at the 
start and end are omitted):

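import re

def boundary_split(s):
    """Split s at every word boundary, without empty strings at the ends."""
    back = []
    while True:
        # r'.\b' matches one character followed by a boundary. Cutting
        # after that character consumes at least one character per pass,
        # which prevents an endless loop.
        match = re.search(r'.\b', s, re.DOTALL)
        if match is None:
            # no boundary left: keep the remainder, if any
            if s:
                back.append(s)
            break
        pos = match.end()
        back.append(s[:pos])
        s = s[pos:]
    return back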


boundary_split('a b c d')
# ['a', ' ', 'b', ' ', 'c', ' ', 'd']
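
A shorter way to get the same pieces, offered as an alternative sketch 
rather than as what re.split does: the pieces between word boundaries are 
the maximal runs of word characters and of non-word characters, so 
re.findall can collect them directly. The name boundary_split_findall is 
made up for this example.

```python
import re

def boundary_split_findall(s):
    """Return the runs of word and non-word characters in s, in order."""
    # \w+ grabs a run of word characters, \W+ a run of everything else;
    # together they cover the whole string, cut at each word boundary.
    return re.findall(r'\w+|\W+', s)

pieces = boundary_split_findall('a b c d')
# ['a', ' ', 'b', ' ', 'c', ' ', 'd']
```

This version has no loop and no empty strings at the ends, because 
findall only reports non-empty runs.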


What is splitting on boundaries good for? Someone else asked for this a 
few days ago on the tutor list, and I still can't figure out a reason.

Michael




