[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?

M.-A. Lemburg mal@lemburg.com
Tue, 16 May 2000 00:07:53 +0200


Finn Bock wrote:
> 
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
> 
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a.b', 'a\rb') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
> >
> >what's the best way to deal with this?  I see three
> >alternatives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
> 
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE()
and \d Py_UNICODE_ISDECIMAL().
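
To make the difference concrete, here is a minimal sketch in
present-day Python 3 (an assumption: it is not the engine under
discussion). It compares what '.' matches with what counts as a
line boundary for the code points listed above. The helper names
dot_matches and is_linebreak are made up for illustration, and
str.splitlines() only stands in for the LINEBREAK predicate; it
also splits on a couple of code points not listed in this thread.

```python
import re

# Code points the thread lists as Unicode line breaks.
LINEBREAKS = [10, 13, 28, 29, 30, 133, 8232, 8233]


def dot_matches(ch):
    """True if '.' (without DOTALL) matches the single character ch."""
    return re.match(r"a.b", "a" + ch + "b") is not None


def is_linebreak(ch):
    """Stand-in for the LINEBREAK predicate: does splitlines() split on ch?"""
    return len(("a" + ch + "b").splitlines()) == 2


for cp in LINEBREAKS:
    ch = chr(cp)
    print(f"U+{cp:04X}  dot matches: {dot_matches(ch)}  "
          f"line break: {is_linebreak(ch)}")
```

In this sketch '.' refuses only chr(10), while every listed code
point is treated as a line boundary by splitlines().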

BTW, how have you implemented the locale-aware \w and \W
for Unicode? Unicode has no notion of locales, but it does have
far more alphanumeric characters (or equivalents), and there is
currently no Py_UNICODE_ISALPHA() in the core.
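
For illustration only, a rough sketch of Unicode-aware character
classes as they behave in present-day Python 3 (again an
assumption, not the engine discussed here). The helper matches()
and the sample characters are made up for this example; re.ASCII
restricts \w to the ASCII range for comparison.

```python
import re


def matches(pattern, ch, flags=0):
    """True if the one-character string ch fully matches pattern."""
    return re.fullmatch(pattern, ch, flags) is not None


# 'a', e-acute, Cyrillic de, Arabic-Indic digit three, '_', '-'
SAMPLES = ["a", "\u00e9", "\u0434", "\u0663", "_", "-"]

for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X}",
        "\\w:", matches(r"\w", ch),
        "\\w (ASCII):", matches(r"\w", ch, re.ASCII),
        "isalpha:", ch.isalpha(),
        "isdecimal:", ch.isdecimal(),
    )
```

Here \w on a str pattern follows the Unicode character database
rather than any locale, which is one possible answer to the
question above.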

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/