[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?

M.-A. Lemburg mal@lemburg.com
Tue, 30 May 2000 16:57:57 +0200


Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> ...
> > > background: in the current implementation, this decision has to
> > > be made at compile time, and a compiled expression can be used
> > > with either 8-bit strings or 16-bit strings.
> ...
> > For Unicode objects you should really default to using the
> > Py_UNICODE_ISLINEBREAK() macro which defines all line break
> > characters (note that CRLF should be interpreted as a
> > single line break; see PyUnicode_Splitlines()). The reason
> > here is that Unicode defines how to handle line breaks
> > and we should try to stick to the standard as close as possible.
> > All other possibilities could still be made available via new
> > flags.
> >
> > For 8-bit strings I'd suggest sticking to the re definition.
> 
> guess my background description wasn't clear:
> 
> Once a pattern has been compiled, it will always handle line
> endings in the same way. The parser doesn't really care if the
> pattern is a unicode string or an 8-bit string (unicode strings
> can contain "wide" characters, but that's the only difference).

Ok.

> At the other end, the same compiled pattern can be applied
> to either 8-bit or unicode strings.  It's all just characters to
> the engine...

Doesn't the engine remember whether the pattern was a string
or Unicode?
 
> Now, I can of course change the engine so that it always uses
> chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
> result is that
> 
>     pattern.match(widestring)
> 
> won't necessarily match the same thing as
> 
>     pattern.match(str(widestring))
> 
> even if the wide string only contains plain ASCII.

Hmm, I wouldn't mind, as long as the engine does the right
thing for Unicode, which is to respect the line break
standard defined in Unicode TR13.
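
To make the mismatch concrete, here is a rough sketch. It assumes a
present-day Python 3 interpreter, where bytes stands in for the 8-bit
string and str for the wide string, and it uses splitlines() as a
stand-in for the engine's line handling (it is not the regex engine
itself). The sample text is made up for illustration:

```python
# Assumption: present-day Python 3; bytes plays the 8-bit string, str the
# wide string, and splitlines() stands in for the engine's line handling.
wide = "one\x1ctwo\nthree"        # plain ASCII only; \x1c is FILE SEPARATOR
narrow = wide.encode("ascii")     # the same characters as an 8-bit string

wide_lines = wide.splitlines()      # str: Unicode line break characters
narrow_lines = narrow.splitlines()  # bytes: only \n, \r and \r\n

print(wide_lines)
print(narrow_lines)
```

The two results differ even though the text is pure ASCII, which is
the kind of divergence described above.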

Thinking about this some more: I wouldn't even mind if
the engine used LINEBREAK for all strings :-). It would
certainly make life easier whenever you have to deal with
file input from different platforms, e.g. Mac, Unix and
Windows.
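
As an illustration of the mixed-platform case, here is a small sketch.
It assumes a present-day Python 3 re module, in which "$" with
MULTILINE only recognizes "\n"; the sample text and variable names are
invented for the example:

```python
import re

# Hypothetical input mixing Mac (\r), Unix (\n) and Windows (\r\n) endings.
text = "mac\runix\nwin\r\nend"

# Newline-only view: '^' and '$' with MULTILINE only know about "\n",
# so stray "\r" characters stay inside the matched lines.
newline_only = re.findall(r"^.*$", text, re.MULTILINE)

# Line-break view: CR, LF and CRLF each count as one break.
# "\r\n" is listed first so that CRLF is not split into two breaks.
any_break = re.split(r"\r\n|\r|\n", text)

print(newline_only)
print(any_break)
```

With the newline-only rule the Mac line ending goes unnoticed and the
Windows line keeps a trailing "\r"; with the line-break rule all four
lines come out clean.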

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/