[Python-Dev] textwrap and unicode

Martin v. Loewis martin@v.loewis.de
22 Oct 2002 23:17:52 +0200


Greg Ward <gward@python.net> writes:

>         if self.replace_whitespace:
>             text = text.translate(self.whitespace_trans)
>         return text
> 
> (The rationale: having tabs and newlines in a paragraph about to be
> wrapped doesn't make any sense to me.)

Is it then really necessary to replace each of these characters, or
would it be acceptable to replace sequences of them, as Fredrik
proposes (.split, .join)?
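
A minimal sketch of the split/join idea, written for Python 3 rather than the Python 2 of this thread. The helper name collapse_whitespace is hypothetical and not part of textwrap. Note the behavioural difference from a per-character translate: runs of whitespace collapse to a single space, and leading and trailing whitespace is dropped.

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on runs of whitespace and
    # discards empty strings, so joining with " " leaves exactly one
    # space between words.
    return " ".join(text.split())


if __name__ == "__main__":
    print(collapse_whitespace("one\t\ttwo\n  three"))
```

Whether collapsing is acceptable depends on whether callers rely on the wrapped text keeping one output space per input whitespace character.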

>     for c in string.whitespace:
>         unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')

There are conceptually 5 times as many space characters in Unicode
(NO-BREAK SPACE, THREE-PER-EM SPACE, OGHAM SPACE MARK, and whatnot),
but it is probably safe to ignore them. The complete fragment would
read

for c in range(sys.maxunicode + 1):
    if unichr(c).isspace():
        unicode_whitespace_trans[c] = u' '

(which is somewhat time-consuming, so you could hard-code a larger
list if you wanted to)
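
For comparison, a sketch of the same table built in Python 3, where chr() replaces unichr() and str.translate() accepts a dict keyed by code point. The function name replace_whitespace is hypothetical. The table is built once at import time, which is the time-consuming part mentioned above.

```python
import sys

# Map every code point that str.isspace() accepts to a plain space.
# range() excludes its upper bound, hence the "+ 1".
unicode_whitespace_trans = {
    c: " " for c in range(sys.maxunicode + 1) if chr(c).isspace()
}


def replace_whitespace(text):
    # One output space per input whitespace character; runs are kept.
    return text.translate(unicode_whitespace_trans)
```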

> That's ugly as hell, but it works.  Is there a cleaner way?

You may want to time re.sub, perhaps to find that the speed decrease
is acceptable:

space = re.compile(r"\s")

   text = space.sub(" ", text)
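
One way to do that timing, sketched in Python 3 with the standard timeit module. The sample text, the repeat count, and the function names are arbitrary assumptions; the table here covers only the six ASCII whitespace characters, so both functions give the same result on ASCII input.

```python
import re
import timeit

space = re.compile(r"\s")
# Translation table for the six ASCII whitespace characters only.
table = {ord(c): " " for c in " \t\n\r\x0b\x0c"}


def with_regex(text):
    return space.sub(" ", text)


def with_translate(text):
    return text.translate(table)


sample = "some\ttext with\nmixed whitespace " * 1000

if __name__ == "__main__":
    for func in (with_regex, with_translate):
        seconds = timeit.timeit(lambda: func(sample), number=100)
        print(func.__name__, seconds)
```

The absolute numbers depend on the interpreter and the input, so only the ratio between the two is worth reading.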

> The other bit of ASCII/English prejudice hardcoded into textwrap.py is
> this regex:
> 
>     sentence_end_re = re.compile(r'[%s]'              # lowercase letter
>                                  r'[\.\!\?]'          # sentence-ending punct.
>                                  r'[\"\']?'           # optional end-of-quote
>                                  % string.lowercase)

For the issue at hand: this code does "work" with Unicode, right?
I.e. it will give some result, even if confronted with funny characters?
If so, I think you can ignore this for the moment.

> But I couldn't find a better way to do it when I wrote this code
> last spring.  Is there one?

I believe the right approach is to support more classes in
SRE. This one would be covered if there was a [:lower:] class.
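
Lacking such a class, one workaround is to match any letter in the regex and test its case in code. The sketch below is Python 3 and is not how textwrap does it; the names candidate_re and ends_sentence are hypothetical, and anchoring the pattern at the end of the chunk with \Z is an added assumption.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", i.e. any letter, including non-ASCII ones.
candidate_re = re.compile(r"([^\W\d_])"   # any letter (case checked below)
                          r"[.!?]"        # sentence-ending punctuation
                          r"[\"']?"       # optional end-of-quote
                          r"\Z")          # end of the chunk


def ends_sentence(chunk):
    # The regex finds a letter before the punctuation; str.islower()
    # then stands in for the missing lowercase character class.
    match = candidate_re.search(chunk)
    return bool(match) and match.group(1).islower()
```

Like the original pattern, this still treats abbreviations such as "Dr." as sentence ends; it only removes the dependence on string.lowercase.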

Regards,
Martin