python 3 and Unicode line breaking
leoboiko
leoboiko at gmail.com
Fri Jan 14 09:29:27 EST 2011
On Jan 14, 11:48 am, Stefan Behnel <stefan... at behnel.de> wrote:
> Sadly, the OP did not clearly state that the required feature
> is really not supported by "textwrap" and in what way textwrap
> behaves differently. That would have helped in answering.
Oh, textwrap doesn’t work for arbitrary Unicode text at all. For
example, it separates combining sequences:
>>> import textwrap
>>> s = "tiếng Việt" # precomposed
>>> len(s)
10
>>> s = "tiếng Việt" # combining
>>> len(s) # number of Unicode code points; ≠ line length
14
>>> print(textwrap.fill(s, width=4)) # breaks sequences
tiê
ng
Viê
t
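As a rough illustration of why len() is the wrong measure here, the sketch below counts only the code points that are not combining marks, using the standard unicodedata module. The helper name visible_length is made up for this example, and the approach is an approximation: proper grapheme cluster segmentation (UAX #29) covers more cases than a nonzero combining class.

```python
import unicodedata

def visible_length(text):
    """Count code points that are not combining marks (hypothetical helper)."""
    # unicodedata.combining() returns a nonzero canonical combining class
    # for combining marks such as U+0301 COMBINING ACUTE ACCENT.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Escapes are used so the example does not depend on how the text was typed.
precomposed = "ti\u1ebfng Vi\u1ec7t"
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))   # code point counts differ
print(visible_length(decomposed))          # same as the precomposed length
```

A wrapper built on this idea would also have to keep each base character together with the marks that follow it, which this sketch does not attempt.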
It also doesn’t know about double-width characters:
>>> s1 = "日本語のテキト"
>>> s2 = "12345678901234" # both s1 and s2 use 14 columns
>>> print(textwrap.fill(s1, width=7))
日本語のテキト
>>> print(textwrap.fill(s2, width=7))
1234567
8901234
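For comparison, a column count can be estimated with unicodedata.east_asian_width from the standard library. This is only a sketch under simple assumptions: the helper name display_width is invented here, wide ("W") and fullwidth ("F") characters are taken as two columns, everything else as one, and combining marks and ambiguous-width characters are not handled.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns used by text (hypothetical helper)."""
    width = 0
    for ch in text:
        # "W" (wide) and "F" (fullwidth) usually occupy two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

s1 = "日本語のテキト"
s2 = "12345678901234"
print(display_width(s1), display_width(s2))
```

With a measure like this, s1 and s2 come out at the same width even though len() reports 7 and 14.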
It doesn’t know about non-ASCII punctuation:
>>> print(textwrap.fill("abc-def", width=5)) # ASCII hyphen-minus
abc-
def
>>> print(textwrap.fill("abc‐def", width=5)) # true hyphen U+2010
abc‐d
ef
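One crude way a wrapper could notice such characters is the general category from unicodedata: both U+002D and U+2010 are reported as "Pd" (dash punctuation). The sketch below is an assumption-laden simplification, not what UAX #14 specifies; the name can_break_after is hypothetical, and U+2011 NON-BREAKING HYPHEN has to be excluded by hand because it is also "Pd".

```python
import unicodedata

# U+2011 NON-BREAKING HYPHEN is category Pd but must not allow a break.
NO_BREAK_DASHES = {"\u2011"}

def can_break_after(ch):
    """Very rough test for a break opportunity after a dash (hypothetical)."""
    return unicodedata.category(ch) == "Pd" and ch not in NO_BREAK_DASHES

for ch in ("-", "\u2010", "\u2011", "a"):
    print("U+%04X" % ord(ch), can_break_after(ch))
```

UAX #14 assigns dashes to several different line breaking classes, so a real implementation would consult its data tables rather than the general category.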
It doesn’t know East Asian filling rules (though this is
perhaps pushing it a bit beyond textwrap’s goals):
>>> print(textwrap.fill("日本語、中国語", width=3))
日本語
、中国 # should avoid linebreak before CJK punctuation
語
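To make the rule concrete, here is a minimal sketch that refuses to start a line with certain punctuation and lets it hang past the width instead. Everything in it is an assumption for illustration: the name wrap_cjk, the small NO_LINE_START set, and the choice to count characters rather than columns. Real rules (JIS X 4051, UAX #14) are considerably more detailed.

```python
# A small, illustrative subset of characters that should not begin a line.
NO_LINE_START = set("、。，．）」』")

def wrap_cjk(text, width):
    """Break text every `width` characters, never before NO_LINE_START."""
    lines = []
    current = ""
    for ch in text:
        # Start a new line only if the next character may begin one;
        # otherwise let it overhang the current line by one character.
        if len(current) >= width and ch not in NO_LINE_START:
            lines.append(current)
            current = ""
        current += ch
    if current:
        lines.append(current)
    return lines

for line in wrap_cjk("日本語、中国語", 3):
    print(line)
```

The alternative to hanging the punctuation is to push the preceding character down to the next line together with it; which one is preferable depends on the typesetting conventions in use.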
And it generally doesn’t try to pick good places to break lines
at all; it just assumes that 1 character = 1 column and that
breaking on ASCII whitespace and hyphens is enough. We can’t
really blame textwrap for that: it is a very simple module, and
Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm). It’s just that,
with Python 3’s emphasis on Unicode support, I was surprised not
to be able to find a UAX #14 implementation. I thought someone
would surely have written one and I simply couldn’t find it, so
I asked precisely that.