[Patches] [ python-Patches-988761 ] re.split emptyok flag (fix for #852532)

Wed Jul 28 18:23:32 CEST 2004

Patches item #988761, was opened at 2004-07-10 22:25
Message generated for change (Comment added) made by mkc
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=988761&group_id=5470

Category: Modules
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Mike Coleman (mkc)
Assigned to: Nobody/Anonymous (nobody)
Summary: re.split emptyok flag (fix for #852532)

Initial Comment:
This patch addresses bug #852532.  The underlying
problem is that re.split ignores any match it makes
that has length zero, which blocks a number of useful
possibilities.  The attached patch implements a flag,
'emptyok', which, if set to True, causes re.split to
allow zero-length matches.
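
To make the intended behavior concrete, here is a minimal sketch of
splitting at every match, including zero-length ones.  It is not the
code from the attached patch: the helper name split_allow_empty is
hypothetical, it ignores capture groups, and it assumes re.finditer
reports the empty matches of interest (true for patterns built only
from lookaround assertions, as below).

```python
import re

def split_allow_empty(pattern, string):
    """Hypothetical helper: split `string` at every match of `pattern`,
    including zero-length matches.  Capture groups are ignored."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, string):
        # Keep the text between the previous match and this one,
        # even when this match is empty.
        pieces.append(string[last:m.start()])
        last = m.end()
    # Keep whatever follows the final match.
    pieces.append(string[last:])
    return pieces

# Split at the zero-width boundary between an uppercase and a lowercase letter.
result = split_allow_empty(r'(?<=[A-Z])(?=[a-z])', 'SOMEstring')
print(result)  # ['SOME', 'string']
```

For patterns that never match the empty string, the helper gives the
same pieces as an ordinary split.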

My preference would be to just change the behavior of
re.split, rather than adding this flag.  The old
behavior isn't documented (though a couple of cases in
test_re.py do depend on it).  As a practical matter,
though, I realize that there may be some code out there
relying on this undocumented behavior.  And I'm hoping
that this useful feature can be added quickly.  Perhaps
this new behavior could be made the default in a future
version of Python.

(Linux 2.6.3 i686)

----------------------------------------------------------------------

>Comment By: Mike Coleman (mkc)
Date: 2004-07-28 11:23

Message:

I picked through CVS, python-dev and Google and came up with
this.  The current behavior was present way back in the
earliest regsub.py in CVS (dated Sep 1992); subsequent
implementations seem to mirror this behavior.  The CVS
comment back in 1992 described split as modeled on nawk.  A
check of nawk(1) confirms that nawk only splits on non-null
matches.  Perl (circa 5.6), on the other hand, appears to
split the way this patch does (though I wasn't aware of that
when I wrote the patch), so that might argue in the other
direction.  I would note, too, that re.findall and
re.finditer tend in this direction ("Empty matches are
included in the result unless they touch the beginning of
another match.").
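
As a small illustration of the re.findall and re.finditer behavior
mentioned above, the sketch below uses a pattern that can only match
the empty string.  The sample text and variable names are made up for
this example.

```python
import re

text = 'fooBarBaz'
boundary = r'(?=[A-Z])'  # zero-width: matches just before each uppercase letter

# findall reports each empty match as an empty string.
empties = re.findall(boundary, text)
print(empties)    # ['', '']

# finditer exposes where those empty matches occur.
positions = [m.start() for m in re.finditer(boundary, text)]
print(positions)  # [3, 6]
```

Both functions report the zero-length matches, where re.split
currently discards them.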

The python-dev archive doesn't seem to go back far enough to
be relevant and I'm not sure how to search it.  General
googling (python "re.split" empty match) found a few hits. 
Probably the most relevant is Tim Peters saying "Python
won't change here (IMO)" and giving the example that he also
gives in a comment to bug #852532 (which this patch
addresses).  He also wonders in his comment about the
possibility of a "design constraint", but I think this patch
addresses that concern.

As far as I can tell, the current behavior was a design
decision made over 10 years ago, between two alternatives
that probably didn't matter much at the time.  Skipping
empty matches probably seemed harmless before
lookahead/lookbehind assertions.  Now, though, the current
behavior seems like a significant hindrance.  Furthermore,
it ought to be pretty trivial to modify any existing
patterns to get the old behavior, should that be desired
(e.g., use 'x+' instead of 'x*').
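
A quick sketch of that workaround follows; the sample string is made
up.  What 'x*' returns depends on whether the running Python's
re.split honors empty matches, so only the 'x+' result is stated.

```python
import re

text = 'axxb'

# 'x+' can never match the empty string, so the result does not
# depend on how re.split treats zero-length matches.
plus_result = re.split(r'x+', text)
print(plus_result)  # ['a', 'b']

# 'x*' can match the empty string; extra empty pieces may appear
# when empty matches are honored.
star_result = re.split(r'x*', text)
print(star_result)
```

Dropping the empty pieces from the 'x*' result leaves the same
non-empty pieces that 'x+' produces.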

(I didn't notice that re.findall doc when I originally wrote
this patch.  Perhaps the doc in the patch should be slightly
modified to help emphasize the similarity between how
re.findall and re.split handle empty matches.)

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2004-07-27 09:08

Message:

Overall I like the patch and wouldn't mind seeing the change
become the default behaviour.  However, I'm nervous about
possibly not understanding why the prohibition on
zero-length matches was added in the first place.  Can you
please do some research in the CVS logs and python-dev
archives to figure out why the limitation was implemented?

----------------------------------------------------------------------

Comment By: Chris King (colander_man)
Date: 2004-07-21 07:46

Message:

Practical example where the current behaviour produces
undesirable results (splitting on character transitions):

>>> import re
>>> re.split(r'(?<=[A-Z])(?=[a-z])', 'SOMEstring')
['SOMEstring']    # desired is ['SOME','string']

----------------------------------------------------------------------
