[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Tue, 30 May 2000 14:03:17 +0200
Fredrik Lundh wrote:
> I wrote:
> > what's the best way to deal with this? I see three alternatives:
> > a) stick to the old definition, and use chr(10) also for
> > unicode strings
> > b) use different definitions for 8-bit strings and unicode
> > strings; if given an 8-bit string, use chr(10); if given
> > a 16-bit string, use the LINEBREAK predicate.
> > c) use LINEBREAK in either case.
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> for the next release, I suggest implementing a fourth alternative:
> d) add a new unicode flag. if set, use LINEBREAK. otherwise,
> use chr(10).
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> e) if locale is not set, use LINEBREAK. otherwise, use chr(10).
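To make alternative (d) concrete, here is a rough sketch of a compile-time flag that selects between chr(10) and a LINEBREAK-style predicate. The flag name UNICODE_LINEBREAK and the helper compile_line_splitter() are hypothetical and exist only for this illustration; they are not part of the re module, and the character set is an assumption based on what str.splitlines() recognises in current Python 3.

```python
import re

# Hypothetical flag for this sketch only; it is NOT a real `re` flag.
UNICODE_LINEBREAK = 1


def compile_line_splitter(flags=0):
    """Pick the line-break definition once, at compile time (alternative d)."""
    if flags & UNICODE_LINEBREAK:
        # LINEBREAK-style predicate. CRLF is listed first so that it is
        # consumed as a single break rather than as two separate ones.
        pattern = "\r\n|[\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]"
    else:
        # Old definition: only chr(10) ends a line.
        pattern = "\n"
    return re.compile(pattern)


text = "a\r\nb\u2028c"
print(compile_line_splitter().split(text))                   # ['a\r', 'b\u2028c']
print(compile_line_splitter(UNICODE_LINEBREAK).split(text))  # ['a', 'b', 'c']
```

Because the choice is baked into the compiled pattern, the same compiled object behaves identically whatever kind of string it is later applied to, which is the constraint described in the background note above.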
For Unicode objects you should really default to using the
Py_UNICODE_ISLINEBREAK() macro which defines all line break
characters (note that CRLF should be interpreted as a
single line break; see PyUnicode_Splitlines()). The reason
here is that Unicode defines how to handle line breaks
and we should try to stick to the standard as closely as possible.
All other possibilities could still be made available via new flags.
For 8-bit strings I'd suggest sticking to the re definition.
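As an illustration of the difference, the snippet below compares a plain chr(10) split with the Unicode-aware behaviour, as seen from the Python level in a current Python 3. The helper is_linebreak() is a hypothetical stand-in for the C-level Py_UNICODE_ISLINEBREAK() macro, approximated here through str.splitlines(); it is only a sketch, not the actual implementation.

```python
def is_linebreak(ch):
    """Approximate Py_UNICODE_ISLINEBREAK() for a single character.

    Hypothetical helper: if `ch` is a line break, splitlines() cuts
    between it and the trailing "x", giving two pieces.
    """
    return len(ch) == 1 and len((ch + "x").splitlines()) == 2


sample = "one\r\ntwo\u2028three\nfour"

# Old definition: only chr(10) separates lines.
print(sample.split(chr(10)))   # ['one\r', 'two\u2028three', 'four']

# Unicode-aware definition: CRLF counts as one break, and
# U+2028 LINE SEPARATOR is a break as well.
print(sample.splitlines())     # ['one', 'two', 'three', 'four']

print(is_linebreak("\n"), is_linebreak("\u2028"), is_linebreak("a"))
# True True False
```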
Python Pages: http://www.lemburg.com/python/