[New-bugs-announce] [issue17668] re.split loses characters matching ungrouped parts of a pattern

Tomasz J. Kotarba report at bugs.python.org
Mon Apr 8 20:18:58 CEST 2013


New submission from Tomasz J. Kotarba:

Tested in 2.7, but this possibly affects other versions as well.

A real life example (note the first character '>' being lost):

>>> import re
>>> re.split(r'^>(.*)$', '>Homo sapiens catenin (cadherin-associated)')

produces:

['', 'Homo sapiens catenin (cadherin-associated)', '']


Expected (and IMHO most useful) behaviour would be for it to return:

['', '>Homo sapiens catenin (cadherin-associated)', '']

or (IMHO much less useful, as one can already get this result just by adding outer grouping parentheses):

['', '>Homo sapiens catenin (cadherin-associated)', 'Homo sapiens catenin (cadherin-associated)', '']
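
For reference, the current behaviour and the outer-parentheses workaround mentioned above can be reproduced with a short script. This is only a minimal sketch; the variable names (header, current, wrapped) are chosen here for illustration.

```python
import re

header = '>Homo sapiens catenin (cadherin-associated)'

# Current behaviour: only the text captured by the group is returned;
# the '>' matched outside the group is dropped from the result.
current = re.split(r'^>(.*)$', header)

# Workaround: wrap the whole pattern in an outer group, so the full
# match is returned too (followed by the inner group's text).
wrapped = re.split(r'(^>(.*)$)', header)

print(current)
print(wrapped)
```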

I am not sure whether this can be changed in such a mature and widely used module without breaking compatibility. However, adding a new optional parameter that decides how re.split() deals with patterns containing grouping parentheses, and making it default to the current behaviour, would be very helpful.
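
To illustrate the kind of result such a parameter could give, here is a rough sketch of a wrapper built on re.finditer(). The name split_keep_match is hypothetical and not part of the re module, and the sketch makes no attempt to handle empty matches specially.

```python
import re

def split_keep_match(pattern, string, flags=0):
    """Hypothetical helper (NOT part of the re module): split like
    re.split(), but return each whole match instead of its groups."""
    parts = []
    last_end = 0
    for match in re.finditer(pattern, string, flags):
        parts.append(string[last_end:match.start()])  # text before the match
        parts.append(match.group(0))                  # the entire matched text
        last_end = match.end()
    parts.append(string[last_end:])                   # trailing remainder
    return parts

result = split_keep_match(r'^>(.*)$',
                          '>Homo sapiens catenin (cadherin-associated)')
print(result)
```

With this helper the leading '>' is kept, because the whole match is returned rather than only the grouped part.
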
Best Regards

----------
components: Regular Expressions
messages: 186324
nosy: ezio.melotti, mrabarnett, triquetra011
priority: normal
severity: normal
status: open
title: re.split loses characters matching ungrouped parts of a pattern
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue17668>
_______________________________________

