[issue7643] What is an ASCII linebreak?
Florent Xicluna
report at bugs.python.org
Fri Jan 8 12:42:42 CET 2010
Florent Xicluna <laxyf at yahoo.fr> added the comment:
It's confusing.
There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
BK, CR, LF, NL
And the resulting list is different:
CODE  ABBR  NAME                 CAT  BIDI  BRK
------------------------------------------------------------------------
000A  LF    LINE FEED            Cc   B     LF
000B  VT    LINE TABULATION      Cc   S     BK  (since Unicode 5.0)
000C  FF    FORM FEED            Cc   WS    BK
000D  CR    CARRIAGE RETURN      Cc   B     CR
0085  NEL   NEXT LINE            Cc   B     NL  (C1 Control Code)
2028  LS    LINE SEPARATOR       Zl   WS    BK
2029  PS    PARAGRAPH SEPARATOR  Zp   B     BK
------------------------------------------------------------------------
Differences:
- VT and FF are mandatory breaks (even if “implementations are not
required to support the VT character”)
- FS, GS, US are combining marks (CM): “Prohibit a line break between
the character and the preceding character”
According to this Annex, the current splitlines() implementation violates the Unicode standard.
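One way to check this against a given interpreter is to probe which code points splitlines() actually splits on and compare them with the mandatory breaks in the table above. The following is only a rough sketch, assuming Python 3 (where str is Unicode); the names UAX14_MANDATORY, splitlines_breaks and fmt are made up for this illustration, and the result depends on the Python version in use.

```python
# Mandatory-break code points copied from the table above
# (UAX #14 classes BK, CR, LF, NL). Illustrative constant, not a stdlib name.
UAX14_MANDATORY = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def splitlines_breaks(limit=0x3000):
    """Return the code points below `limit` that str.splitlines() splits on."""
    found = set()
    for cp in range(limit):
        # A line-break character between "a" and "b" yields two lines.
        if len(("a" + chr(cp) + "b").splitlines()) == 2:
            found.add(cp)
    return found


def fmt(cps):
    """Format a set of code points as a sorted 'U+XXXX, ...' string."""
    return ", ".join("U+%04X" % cp for cp in sorted(cps))


if __name__ == "__main__":
    actual = splitlines_breaks()
    print("split, but not mandatory in UAX #14:", fmt(actual - UAX14_MANDATORY))
    print("mandatory in UAX #14, but not split:", fmt(UAX14_MANDATORY - actual))
```

Any code point printed on either line is a place where the implementation and the Annex disagree.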
References:
- Unicode Standard Annex #14 - Line Breaking Algorithm
http://www.unicode.org/reports/tr14/
- UCD LineBreak.txt
http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________