Python 3 and Unicode line breaking

Fri Jan 14 09:29:27 EST 2011

On Jan 14, 11:48 am, Stefan Behnel <stefan... at behnel.de> wrote:
> Sadly, the OP did not clearly state that the required feature
> is really not supported by "textwrap" and in what way textwrap
> behaves differently. That would have helped in answering.

Oh, textwrap doesn’t work for arbitrary Unicode text at all.  For
example, it separates combining sequences:

    >>> import textwrap
    >>> s = "tiếng Việt" # precomposed
    >>> len(s)
    10
    >>> s = "tiếng Việt" # combining
    >>> len(s) # number of Unicode code points; ≠ line length
    14
    >>> print(textwrap.fill(s, width=4)) # breaks sequences
    tiê
    ng
    Viê
    t
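
To make the gap between code points and visible characters concrete,
here is a rough sketch using only the standard unicodedata module.
The helper name approx_columns is my own invention, and skipping
characters with a non-zero combining class is only an approximation
of real grapheme cluster segmentation (UAX #29), not a replacement
for it.

```python
import unicodedata

def approx_columns(text):
    """Count code points that are not combining marks (a rough width estimate)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Escapes keep the example independent of how the mail was normalized in transit.
precomposed = "ti\u1ebfng Vi\u1ec7t"
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed))            # code points in the precomposed form
print(len(decomposed))             # code points, combining marks included
print(approx_columns(decomposed))  # base characters only
```

Normalizing to NFC before calling textwrap hides the problem for this
particular string, but only because precomposed forms happen to exist
for every sequence in it.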

It also doesn’t know about double-width characters:

    >>> s1 = "日本語のテキト"
    >>> s2 = "12345678901234" # both s1 and s2 use 14 columns
    >>> print(textwrap.fill(s1, width=7))
    日本語のテキト
    >>> print(textwrap.fill(s2, width=7))
    1234567
    8901234
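
The column count can be approximated with
unicodedata.east_asian_width().  This is a minimal sketch, assuming
that Wide ("W") and Fullwidth ("F") characters take two columns,
combining marks take none, and everything else takes one.  The name
display_width is hypothetical, and real terminals may disagree,
especially for Ambiguous-width characters.

```python
import unicodedata

def display_width(text):
    """Approximate the number of terminal columns needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base character
        # Wide and Fullwidth characters are assumed to occupy two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("日本語のテキト"))
print(display_width("12345678901234"))
```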

It doesn’t know about non-ASCII punctuation:

    >>> print(textwrap.fill("abc-def", width=5)) # ASCII hyphen-minus
    abc-
    def
    >>> print(textwrap.fill("abc‐def", width=5)) # true hyphen U+2010
    abc‐d
    ef
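
For what it’s worth, the Unicode database puts both characters in the
same general category, as this small sketch shows (describe is just
an illustrative helper of mine).  The category alone is not enough to
decide where a break is allowed, though: U+2011 NON-BREAKING HYPHEN
is also "Pd", and UAX #14 assigns line breaking classes separately.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)

print(describe("-"))       # ASCII hyphen-minus
print(describe("\u2010"))  # true hyphen
print(describe("\u2011"))  # non-breaking hyphen, same category
```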

It doesn’t know East Asian line-breaking rules (though this is
perhaps pushing it a bit beyond textwrap’s goals):

    >>> print(textwrap.fill("日本語、中国語", width=3))
    日本語
    、中国 # should avoid linebreak before CJK punctuation
    語
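
As a rough illustration of the rule, here is a naive sketch that lets
a line overflow rather than start the next one with punctuation.  The
function wrap_cjk and the NO_BREAK_BEFORE set are my own assumptions
and cover only a handful of characters; it counts characters rather
than columns, and it is nowhere near what UAX #14 specifies.

```python
# A small, hand-picked (and incomplete) set of characters that should not start a line.
NO_BREAK_BEFORE = set("、。，．！？）」』")

def wrap_cjk(text, width):
    """Cut text every `width` characters, never breaking before listed punctuation."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Let the line hang over the limit instead of pushing punctuation down.
        while end < len(text) and text[end] in NO_BREAK_BEFORE:
            end += 1
        lines.append(text[start:end])
        start = end
    return lines

print("\n".join(wrap_cjk("日本語、中国語", 3)))
```

The other common choice is to break one character earlier so the line
never exceeds the width; which one is right depends on the layout.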

And it generally doesn’t try to pick good places to break lines
at all: it just assumes that 1 character = 1 column and that
breaking on ASCII whitespace and hyphens is enough.  We can’t
really blame textwrap for that; it is a very simple module, and
Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm).  It’s just that,
with Python 3’s emphasis on Unicode support, I was surprised not
to be able to find a UAX #14 implementation.  I thought someone
would surely have written one and I simply couldn’t find it, so
I asked precisely that.