string split without consumption

robert no-spam at not-existing.invalid
Sat Feb 2 11:16:22 EST 2008


Tim Chase wrote:
>>>> this didn't work elegantly as expected:
>>>>
>>>>  >>> ss
>>>> 'owi\nweoifj\nfheu\n'
>>>>  >>> re.split(r'(?m)$',ss)
>>>> ['owi\nweoifj\nfheu\n']
>>> Do you have a need to use a regexp?
>> I'd like the general case - split without consumption.
> 
> I'm not sure there's a one-pass regex solution to the problem
> using Python's regex engine.  If pre-processing was allowed, one
> could do it.
> 

I only found a partial solution, using inverse logic with findall:

 >>> re.findall(r'(?s).*?(?:\n|$)','owi\nweoifj\nfheu\nxx')
['owi\n', 'weoifj\n', 'fheu\n', 'xx', '']
 >>> re.findall(r'(?s).*?(?:\n|$)','owi\nweoifj\nfheu\n')
['owi\n', 'weoifj\n', 'fheu\n', '']
 >>>

but it's also wrong regarding partial last lines: both results end
with a spurious empty string.
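
A minimal sketch of a findall pattern that avoids the trailing empty
string, assuming '\n' is the only line terminator of interest. The
function name split_keep_newlines is hypothetical, chosen only for
illustration. The first alternative matches a complete line including
its '\n'; the second matches a non-empty partial last line and cannot
match the empty string at the end.

```python
import re

def split_keep_newlines(text):
    # Hypothetical helper: return the lines of text, each complete line
    # keeping its '\n'; a partial last line is returned without one.
    # '[^\n]*\n' -> a complete (possibly empty) line plus its terminator
    # '[^\n]+'   -> a non-empty remainder with no terminator
    return re.findall(r'[^\n]*\n|[^\n]+', text)

print(split_keep_newlines('owi\nweoifj\nfheu\nxx'))
# ['owi\n', 'weoifj\n', 'fheu\n', 'xx']
print(split_keep_newlines('owi\nweoifj\nfheu\n'))
# ['owi\n', 'weoifj\n', 'fheu\n']
```

This only covers the newline case, not the general "split at any
zero-width pattern" case.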

re.split apparently does not split on empty matches, so zero-width
patterns such as \A, \Z, ^, $ and \b have no effect:

 >>> re.split(r'\b(?=\n)','owi\nweoifj\nfheu\n\nxx')
['owi\nweoifj\nfheu\n\nxx']
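
One possible workaround for the general case, sketched here under the
assumption that re.finditer reports empty matches (its documentation
says it does): collect the match positions and cut the string there by
hand. The name split_at is hypothetical. As far as I know, Python 3.7
and later also let re.split itself split on empty matches; this sketch
does not rely on that.

```python
import re

def split_at(pattern, text):
    # Hypothetical helper: cut text at the start of every match of
    # pattern without consuming any characters.
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pos = m.start()
        if pos > last:              # skip cuts that would give '' pieces
            pieces.append(text[last:pos])
            last = pos
    if last < len(text):            # the open remainder, if any
        pieces.append(text[last:])
    return pieces

print(split_at(r'\b(?=\n)', 'owi\nweoifj\nfheu\n\nxx'))
# ['owi', '\nweoifj', '\nfheu', '\n\nxx']
print(split_at(r'(?<=\n)', 'owi\nweoifj\nfheu\nxx'))
# ['owi\n', 'weoifj\n', 'fheu\n', 'xx']
```

Because the cut is made at m.start(), a pattern that does consume
characters keeps them at the front of the following piece.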


>>>  >>> ss.splitlines(True)
>>> ['owi\n', 'weoifj\n', 'fheu\n']
>>>
>> thanks. Yet this does not fit "naturally" and consistently into my
>> line-processing algorithm (the further buffering). Compare e.g.
>> ss.split('\n')  ..
> 
> well, one can do
> 
>   >>> [line + '\n' for line in ss.splitlines()]
>   ['owi\n', 'weoifj\n', 'fheu\n']
>   >>> [line + '\n' for line in (ss+'xxx').splitlines()]
>   ['owi\n', 'weoifj\n', 'fheu\n', 'xxx\n']
> 
> as another try for your edge case.  It's understandable and
> natural-looking
> 

nice for some display purposes, but "wrong" as general logic. In the
general case the 'xxx' is not a complete line. It's an (open) partial
line and should appear as such.
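
A minimal sketch of the buffering logic meant here, assuming '\n' is
the only line terminator and that input arrives in arbitrary chunks.
The name take_complete_lines and the sample chunks are hypothetical.
It builds on ss.split('\n'), whose last element is exactly the open
part (or '' if the buffer ends with a newline).

```python
def take_complete_lines(buffer):
    # Hypothetical helper: return (complete_lines, open_rest).
    # Complete lines keep their '\n'; open_rest is the unterminated tail.
    parts = buffer.split('\n')
    rest = parts.pop()                  # '' when buffer ends with '\n'
    lines = [part + '\n' for part in parts]
    return lines, rest

# Feed chunks that break in the middle of a line; the open part is
# carried over and prepended to the next chunk.
all_lines = []
rest = ''
for chunk in ['owi\nwe', 'oifj\nfheu\nxxx']:
    lines, rest = take_complete_lines(rest + chunk)
    all_lines.extend(lines)

print(all_lines)   # ['owi\n', 'weoifj\n', 'fheu\n']
print(repr(rest))  # 'xxx'
```

Here the 'xxx' stays visible as an open part instead of being turned
into a line.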


Robert



