[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?

Finn Bock bckfnn@worldonline.dk
Sat, 13 May 2000 13:47:10 GMT


On Sat, 13 May 2000 14:56:41 +0200, you wrote:

>in the current 're' engine, a newline is chr(10) and nothing
>else.
>
>however, in the new unicode aware engine, I used the new
>LINEBREAK predicate instead, but it turned out to break one
>of the tests in the current test suite:
>
>    sre.match('a.b', 'a\rb') => None
>
>(unicode adds chr(13), chr(28), chr(29), chr(30), and also
>unichr(133), unichr(8232), and unichr(8233) to the list of
>line breaking codes)
>
>what's the best way to deal with this?  I see three
>alternatives:
>
>a) stick to the old definition, and use chr(10) also for
>   unicode strings

In the ORO matcher that comes with JPython, the dot matches every
character except chr(10). IMO that is the wrong behavior for Unicode
strings; they should use the LINEBREAK predicate.
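
To make the difference concrete, here is a minimal sketch, assuming a
present-day Python 3 're' module, where a plain dot still excludes only
chr(10). It is not how sre or ORO implement anything. The names
LINEBREAKS, LB_DOT, matches_plain_dot and matches_linebreak_dot are
made up for illustration; the code points are the ones listed in the
quoted message.

```python
import re

# Line-breaking code points as listed in the quoted message
# (illustrative list, not taken from any engine's source).
LINEBREAKS = [10, 13, 28, 29, 30, 133, 8232, 8233]

# Hypothetical "LINEBREAK-aware dot": a negated character class that
# excludes every code point in LINEBREAKS, spelled with \uXXXX escapes.
LB_DOT = "[^" + "".join("\\u%04x" % cp for cp in LINEBREAKS) + "]"


def matches_plain_dot(text):
    """True if 'a.b' matches, using the plain dot (excludes chr(10) only)."""
    return re.match("a.b", text) is not None


def matches_linebreak_dot(text):
    """True if 'a<LB_DOT>b' matches, i.e. the middle char is no line break."""
    return re.match("a" + LB_DOT + "b", text) is not None
```

Under these assumptions, matches_plain_dot('a\rb') is true while
matches_linebreak_dot('a\rb') is false, which is the same disagreement
the test suite ran into.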

regards,
finn