[Tutor] Splitting by word boundaries

Michael Janssen Janssen at rz.uni-frankfurt.de
Thu Aug 14 21:57:41 EDT 2003


On Thu, 14 Aug 2003, A.M. Kuchling wrote:

> On Thu, Aug 14, 2003 at 06:28:24PM +0100, Jonathan Hayward http://JonathansCorner.com wrote:
> >        text_tokens = re.split("\b", text)
>
> Inside a string literal, the \b is interpreted as a backspace
> character; if there were any backspaces in the string, it would be
> split along them.  Either use a raw string (r"\b") to avoid this
> interpretation, or quote the backslash ("\\b") so it gets passed to
> the re module.

This is important but not enough: re.split(r'\b', 'word boundary') still
does not work. I've looked through the sources to find out why.
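
The quoted point about string literals can be checked directly. A minimal
sketch (the sample strings are made up for illustration):

```python
import re

# In a normal string literal, "\b" is ONE character: backspace (0x08).
plain = "\b"
# In a raw string, r"\b" is TWO characters: a backslash and the letter b,
# which the re module then reads as a word-boundary assertion.
raw = r"\b"

print(len(plain), repr(plain))
print(len(raw), repr(raw))

# With the backspace pattern, the text is split only where an actual
# backspace character occurs.
no_backspace = re.split(plain, "word boundary")
with_backspace = re.split(plain, "word\bboundary")
print(no_backspace)
print(with_backspace)
```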

My conclusion is: re.split tries to *match* (not search) at a given
position in the string, so a match is only found when the string *starts*
with a possible match. If the match is empty, it steps one character
further and tries again. An empty match means that nothing is to be done
except proceeding through the string, without splitting at the current
position. Since r'\b' only ever matches the empty string, re.split never
splits anything.

Note that an empty match is still a match: re.match(r'\b', 'word boundary')
does not return None but a match object for the empty string ''.
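
A quick way to see the difference between "empty match" and "no match" is
the following sketch (the sample strings are made up):

```python
import re

# At a word boundary, \b matches, but the matched text is empty.
m = re.match(r"\b", "word boundary")
print(m is None)            # False: there is a match object
print(repr(m.group()))      # '' : the match has zero length
print(m.start(), m.end())   # 0 0

# With a leading space there is no boundary at position 0,
# so re.match (which only tries the start) finds nothing at all.
no_match = re.match(r"\b", " word")
print(no_match)             # None

# re.search scans forward and finds the boundary before 'w'.
found = re.search(r"\b", " word")
print(found.start())        # 1
```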


Unless you rewrite split (I'm not the one to say whether this is possible),
you can't use it with a pattern that matches the empty string. (But
you're free to do it another way ;-).
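
One such other way, as a sketch and certainly not the only option: instead
of splitting on an empty-matching pattern, collect alternating runs of word
and non-word characters with re.findall. The helper name
split_on_word_boundaries is made up for this example.

```python
import re

def split_on_word_boundaries(text):
    """Return the runs of word and non-word characters in text, in order."""
    # \w+ grabs a run of word characters, \W+ a run of everything else;
    # every character belongs to exactly one run, so nothing is lost.
    return re.findall(r"\w+|\W+", text)

text = "word boundary"
text_tokens = split_on_word_boundaries(text)
print(text_tokens)  # ['word', ' ', 'boundary']
```

Joining the tokens gives back the original string, so the pieces are cut
exactly at the word boundaries.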


Where did I find this information (besides some guessing ;-)? Well, how
to put it? split is implemented as the method pattern_split of an
SRE_Pattern in the module _sre. _sre is a builtin module (python -vEc "import
_sre"  gives this piece of information). This means reading C code.
If you've installed from a tarball, you'll find the _sre.c file in
INSTALL_DIR/Modules/. In case you are unfamiliar with C, you might look
into the old pre module (under PYTHON_LIB/), which has a RegexObject class
with a split method very similar to _sre.SRE_Pattern.pattern_split (did
those C coders do it the Python way?).

The interesting lines start at line 2155 (v2.3):

        if (state.start == state.ptr) {  /* indicates empty match; mj */
            if (last == state.end)
                break;  /* break while loop if end of string; mj */
            /* skip one character */
            state.start = (void*) ((char*) state.ptr + state.charsize);
            continue;
        }


"continue" skips the rest of the while loop body. Here it keeps
_sre.pattern_split from actually splitting the string.
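
For those who'd rather not read C, the loop as described above can be
sketched in Python. This is only my reading of the behaviour, not the real
_sre code; the function name old_style_split is made up, and the sketch
ignores maxsplit and capturing groups.

```python
import re

def old_style_split(pattern, text):
    """Split text on pattern, skipping over empty matches instead of splitting."""
    compiled = re.compile(pattern)
    pieces = []
    last = 0  # end of the previous separator
    pos = 0   # position where the next match attempt starts
    while pos <= len(text):
        m = compiled.match(text, pos)
        if m is None or m.end() == m.start():
            # No match, or an empty match: skip one character and
            # do NOT split here (this is the "continue" branch).
            pos += 1
            continue
        # Non-empty separator: keep the text before it, then move past it.
        pieces.append(text[last:m.start()])
        last = pos = m.end()
    pieces.append(text[last:])  # whatever is left after the last separator
    return pieces

print(old_style_split(r"\b", "word boundary"))  # ['word boundary']
print(old_style_split(r"\s+", "a b  c"))        # ['a', 'b', 'c']
```

With r'\b' every match is empty, so the loop only ever takes the skip
branch and the string comes back in one piece.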


Michael


