[ python-Bugs-852532 ] ^$ won't split on empty line

SourceForge.net noreply at sourceforge.net
Thu Jan 1 00:28:44 EST 2004

Bugs item #852532, was opened at 2003-12-02 05:01
Message generated for change (Comment added) made by mkc
You can respond by visiting: 

Category: Regular Expressions
Group: Python 2.3
Status: Open
Resolution: Postponed
Priority: 5
Submitted By: Jan Burgy (jburgy)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: ^$ won't split on empty line

Initial Comment:
Python 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32

>>> import re
>>> re.compile('^$', re.MULTILINE).split('foo\n\nbar')

I expect ['foo\n', '\nbar'], since, according to the 
documentation $ "in MULTILINE mode also matches 
before a newline".
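
To make the expectation concrete, here is a minimal sketch that lists where '^$' matches in MULTILINE mode and then cuts the string at those offsets by hand. It uses only re.finditer and slicing, so it does not depend on how split() treats zero-length matches in any particular Python version. The variable names are illustrative.

```python
import re

text = 'foo\n\nbar'
pattern = re.compile('^$', re.MULTILINE)

# Offsets at which the pattern matches; every match here is zero-length.
positions = [m.start() for m in pattern.finditer(text)]

# Cut the string at each match offset by hand.
pieces = []
last = 0
for pos in positions:
    pieces.append(text[last:pos])
    last = pos
pieces.append(text[last:])

print(positions)  # [4]
print(pieces)     # ['foo\n', '\nbar']
```

The single match sits at offset 4, between the two newlines, which is where the expected result ['foo\n', '\nbar'] would be cut.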

Thanks, Jan


Comment By: Mike Coleman (mkc)
Date: 2003-12-31 23:28

Hi, I was going to file this bug just now myself, as this
seems like a really useful feature.  For example, I've
several times wanted to split on '^' or '^(?=S)' (to split
up a data file into paragraphs that start with an initial
S).  Instead I have to do something like '\n(?=S)', which is
rather more hideous.
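
A small sketch of the workaround mentioned above, run on made-up data (the sample string is hypothetical). Because the pattern consumes the newline, the match is never empty and split() handles it; the cost is that the separating newline is dropped from the pieces.

```python
import re

# Hypothetical data: each paragraph begins with a line starting with 'S'.
data = 'Start one\nmore text\nSecond para\nlast line'

# Consume the newline, then look ahead for the 'S' that opens a paragraph.
paragraphs = re.split(r'\n(?=S)', data)

print(paragraphs)  # ['Start one\nmore text', 'Second para\nlast line']
```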

To answer tim_one's challenge, yes, I *do* expect splitting
by 'x*' to break a string into letters, now that I've
thought about it.  To not do so is a bizarre and surprising
behavior, IMO.  (Patient: Doctor, when I split on this
nonsense pattern I get nonsense!  Doctor: Then don't do that.)
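
For illustration only, the behavior argued for here can be sketched in pure Python on top of re.finditer. The helper name split_on_every_match is hypothetical; this is not the _sre.c implementation and not a proposed patch, just a way to see what splitting on every match, empty ones included, would return.

```python
import re

def split_on_every_match(pattern, text):
    """Split text at every match of pattern, zero-length matches included."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

letters = split_on_every_match('x*', 'abc')
print(letters)  # ['', 'a', 'b', 'c', '']
```

The string does come apart into letters, with an empty string at each end because the pattern also matches at the very start and the very end.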

The fix should be near this line in _sre.c, I think.

        if (state.start == state.ptr) {

I could work on a patch if you'll take it...



Comment By: Fredrik Lundh (effbot)
Date: 2003-12-11 07:42

Split never splits on empty substrings; see Tim's answer for a 
brief discussion.

Fred, can you perhaps add something to the documentation?


Comment By: Tim Peters (tim_one)
Date: 2003-12-02 09:20

Confirmed on Pythons 2.1.3, 2.2.3, 2.3.2, and current CVS.

More generally, split() doesn't appear to split on any empty 
(0-length) match.  For example,

>>> pat = re.compile(r'\b')
>>> pat.split('(a b)')
['(a b)']
>>> pat.findall('(a b)')  # but the pattern matches 4 places
['', '', '', '']
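
As a quick check of the "4 places" remark, this sketch prints the offsets of the zero-length matches rather than the matched (empty) strings. It relies only on re.finditer.

```python
import re

pat = re.compile(r'\b')
text = '(a b)'

# Offsets of every word boundary in the string.
boundaries = [m.start() for m in pat.finditer(text)]

print(boundaries)  # [1, 2, 3, 4]
```

The boundaries fall on either side of 'a' and 'b'; there is none at offset 0 or 5 because '(' and ')' are not word characters.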

That's probably a design constraint, but it isn't documented.  
For example, if you split "abc" by the pattern x*, what do you 
expect?  The pattern matches (with length 0) at 4 places, 
but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)


