[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?

Tue, 30 May 2000 16:38:29 +0200

M.-A. Lemburg wrote:
...
> > background: in the current implementation, this decision has to
> > be made at compile time, and a compiled expression can be used
> > with either 8-bit strings or 16-bit strings.
...
> For Unicode objects you should really default to using the
> Py_UNICODE_ISLINEBREAK() macro which defines all line break
> characters (note that CRLF should be interpreted as a
> single line break; see PyUnicode_Splitlines()). The reason
> here is that Unicode defines how to handle line breaks
> and we should try to stick to the standard as close as possible.
> All other possibilities could still be made available via new
> flags.
>
> For 8-bit strings I'd suggest sticking to the re definition.

guess my background description wasn't clear:

Once a pattern has been compiled, it will always handle line
endings in the same way. The parser doesn't really care if the
pattern is a unicode string or an 8-bit string (unicode strings
can contain "wide" characters, but that's the only difference).

At the other end, the same compiled pattern can be applied
to either 8-bit or unicode strings.  It's all just characters to
the engine...

Now, I can of course change the engine so that it always uses
chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
result is that

    pattern.match(widestring)

won't necessarily match the same thing as

    pattern.match(str(widestring))

even if the wide string only contains plain ASCII.
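
To make the mismatch concrete, here is a rough sketch in present-day
Python. It is not sre code: it uses str.splitlines() and
bytes.splitlines() as stand-ins for the two line-break definitions
(Unicode LINEBREAK versus the plain 8-bit rules), and the helper name
line_counts is made up for illustration. The point is only that an
ASCII-only string can split differently depending on which definition
is applied.

```python
def line_counts(text):
    """Count lines in an ASCII-only string under both definitions.

    Returns (count using Unicode line-break rules,
             count using 8-bit rules: only \\n, \\r and \\r\\n).
    """
    wide = text.splitlines()                    # Unicode line breaks
    narrow = text.encode("ascii").splitlines()  # 8-bit line breaks
    return len(wide), len(narrow)


print(line_counts("one\ntwo"))      # (2, 2): plain newline, both agree
print(line_counts("one\r\ntwo"))    # (2, 2): CRLF counts as one break
print(line_counts("one\x1ctwo"))    # (2, 1): chr(28) is ASCII, but only
                                    # the Unicode rules treat it as a break
```

The last case is the troublesome one: chr(28) is plain ASCII, yet the
two definitions disagree about it.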

(another alternative is to recompile the pattern for each target
string type, but that will hurt performance...)
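
A minimal sketch of that recompile-per-type alternative, written
against today's re module rather than sre internals. The class name
PerTypePattern and its attributes are hypothetical, and the pattern
source is assumed to be ASCII-only. A small cache keyed on the target
type limits the cost to one compile per type, though the extra lookup
on every call is still overhead.

```python
import re


class PerTypePattern:
    """Hypothetical wrapper: one compiled pattern per target string type."""

    def __init__(self, source, flags=0):
        self.source = source   # pattern text, assumed ASCII-only
        self.flags = flags
        self._cache = {}       # maps target type -> compiled pattern

    def _compiled(self, target):
        kind = type(target)
        if kind not in self._cache:
            # Compile a text pattern for str targets and a bytes
            # pattern for everything else (assumed to be bytes-like).
            if kind is str:
                src = self.source
            else:
                src = self.source.encode("ascii")
            self._cache[kind] = re.compile(src, self.flags)
        return self._cache[kind]

    def match(self, target):
        return self._compiled(target).match(target)


pattern = PerTypePattern(r"a+b")
print(pattern.match("aab").group())    # matches a text string
print(pattern.match(b"aab").group())   # matches an 8-bit string
```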

</F>

<project name="sre" complete="97.1%" />