[Python-3000] Lines breaking

Tue May 29 10:17:20 CEST 2007

"Martin v. Löwis" writes:

 > Alexandre Vassalotti writes:

 > > The change would extend the line breaking behavior to three other
 > > ASCII characters:
 > >   NEL "Next Line" 85
 > >   VT "Vertical Tab" 0B
 > >   FF "Form Feed" 0C
 > > Of course, it is not really necessary to change, but I think full
 > > conformance to the standard [1] could give Python better support of
 > > multilingual texts. However, full conformance would require a good
 > > amount of work.

I don't understand why full conformance would require much work, at
least not for the language itself.  Unicode does not place requirements on
the syntax of Python *including the repertoire of characters allowed*,
only that where a character does occur, it must have the semantics
defined in UAX#14.  (Of course text processing modules in the stdlib
will have some work to do!)

I see no reason in UAX#14 that the Python grammar cannot ignore or
prohibit VT and NEL (see below), prohibit use of LINE SEPARATOR and
PARAGRAPH SEPARATOR, and restrict FORM FEED to occur immediately after
a line break.  (All outside of strings, of course, where there would
be no restriction.  Restrictions *must* apply to comment content,
however.)  Note that given Python's semantics for lines, the algorithm
in Unicode (v4.1, Section 5.8, R1) for remapping to unambiguous use of
LS and PS is well-defined and will leave zero residual ambiguity in a
legal Python program (and no instances of PS).
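
To make the proposed restrictions concrete, here is a rough sketch of a
checker.  It is only an illustration, not anything in the tokenizer: the
names PROHIBITED and find_violations are made up for this message, and it
assumes the text passed in contains no string literals (inside strings
there would be no restriction).  It also assumes an FF at the very start
of the file is acceptable.

```python
# Characters the grammar could simply prohibit outside of strings.
# (Hypothetical table; VT and NEL could alternatively be ignored.)
PROHIBITED = {
    "\u000B": "VT",
    "\u0085": "NEL",
    "\u2028": "LS",
    "\u2029": "PS",
}
FF = "\u000C"


def find_violations(source):
    """Return (index, name) pairs for characters the sketched rules reject.

    Assumes `source` holds no string literals; comments are checked
    like any other text.
    """
    violations = []
    for i, ch in enumerate(source):
        if ch in PROHIBITED:
            violations.append((i, PROHIBITED[ch]))
        elif ch == FF and i > 0 and source[i - 1] not in "\r\n":
            # FF is allowed only immediately after a line break.
            violations.append((i, "FF"))
    return violations
```

A form feed on its own at the start of a line passes; one in the middle
of a line, or any LS, is reported with its offset.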

With the provisions above, you'll get the same display of a legal
Python program as ever when you switch to a UAX#14-conforming text
editor, except that it may provide a more friendly display for strings
containing very long lines.  People who wish to edit Python programs
in Microsoft Word should preprocess with the R1 algorithm.<wink>

 > Can you please point to the chapter and verse where it says that VT
 > must be considered? I only found mention of FF, in R4.

In UAX#14, revision 19, in the descriptions of classes it says:

------------------------------------------------------------------------
  BK: Mandatory Break (A) (Non-tailorable)

  Explicit breaks act independently of the surrounding characters. No
  characters can be added to the BK class as part of tailoring, but
  implementations are not required to support the VT character.

  000C      FORM FEED (FF)
  000B      LINE TABULATION (VT)

  FORM FEED separates pages. The text on the new page starts at the
  beginning of the line. No paragraph formatting is applied.

  2028      LINE SEPARATOR (LS)

  The text after the Line Separator starts at the beginning of the
  line. No paragraph formatting is applied. This is similar to HTML
  <BR>.

  2029      PARAGRAPH SEPARATOR (PS)

  The text of the new paragraph starts at the beginning of the
  line. Paragraph formatting is applied.

  Newline Function (NLF)

  Newline Functions are defined in the Unicode Standard as providing
  additional explicit breaks. They are not individual characters, but
  are encoded as sequences of the control characters NEL, LF, and CR.
------------------------------------------------------------------------

In the descriptions of the singleton classes LF, CR, and NL
(containing NEL), it is indicated that supporting LF and CR is
mandatory; AFAICT the rules are the ones used by Python's universal
newline feature.  NL, however, need not be supported:

------------------------------------------------------------------------
  NL: Next Line (A) (Non-tailorable) 

  0085      NEXT LINE (NEL)

  The NL class acts like BK in all respects (there is a mandatory break
  after any NEL character). It cannot be tailored, but implementations
  are not required to support the NEL character; see the discussion
  under BK.
------------------------------------------------------------------------
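
For text processing (as opposed to the grammar), the classes quoted above
could be turned into a line splitter roughly as follows.  This is a
sketch under my reading of the excerpts, not a complete UAX#14
implementation: split_lines and the two constants are hypothetical
names, CR LF is treated as a single break, and the optional VT and NEL
support is a switch.

```python
import re

# Mandatory-break characters named in the UAX#14 excerpts.
REQUIRED = "\u000C\u2028\u2029"  # FF, LS, PS
OPTIONAL = "\u000B\u0085"        # VT, NEL (support is not required)


def split_lines(text, support_optional=True):
    """Split text at each mandatory break, dropping the break characters."""
    singles = "\r\n" + REQUIRED + (OPTIONAL if support_optional else "")
    # Try CR LF first so the pair counts as one break, not two.
    pattern = re.compile("\r\n|[" + singles + "]")
    lines = pattern.split(text)
    # A break at the very end terminates the last line; it does not
    # start a new empty one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

With support_optional=False, VT and NEL are left in place as ordinary
characters, which is the behavior the "not required to support" wording
appears to permit.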