[Python-Dev] textwrap and unicode

Tue, 22 Oct 2002 16:24:08 -0400

On 22 October 2002, Martin v. Loewis said:
> I don't know how precisely you want to formulate the property. If x is
> a Unicode letter, then x.isspace() tells you whether it is a space
> character (this property holds for all characters of the Zs category,
> and all characters that have a bidirectionality of WS, B, or S).

OK, then it's an implementation problem rather than a "you can't get
there from here" problem.  Good.  The reason I need a list of
"whitespace chars" is to convert all whitespace to spaces; I use
string.maketrans() and s.translate() to do this efficiently:

import string

class TextWrapper:

    [...]

    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))
    [...]

    def _munge_whitespace(self, text):
        """_munge_whitespace(text : string) -> string

        Munge whitespace in text: expand tabs and convert all other
        whitespace characters to spaces.  E.g. " foo\tbar\n\nbaz"
        becomes " foo    bar  baz".
        """
        if self.expand_tabs:
            text = text.expandtabs()
        if self.replace_whitespace:
            text = text.translate(self.whitespace_trans)
        return text

(The rationale: having tabs and newlines in a paragraph about to be
wrapped doesn't make any sense to me.)

Ahh, OK, I'm starting to see the problem: there's nothing wrong with the
translate() method of strings or unicode strings, but string.maketrans()
returns a 256-character table string, and that isn't a mapping that
u''.translate() likes.  Hmmmm.
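
The mismatch comes down to the shape of the two tables.  Here is a
minimal sketch of both shapes side by side, assuming a current Python 3
interpreter, where bytes.maketrans() stands in for string.maketrans()
and str stands in for unicode.  The names byte_table and unicode_table
are made up for illustration.

```python
import string

ws = string.whitespace

# 8-bit style: a flat 256-entry table indexed by byte value.
byte_table = bytes.maketrans(ws.encode('ascii'), b' ' * len(ws))

# Unicode style: a dict mapping code point -> code point.
unicode_table = {ord(c): ord(' ') for c in ws}

print(len(byte_table), type(byte_table).__name__)
print(repr(b"foo\tbar\n".translate(byte_table)))
print(repr("foo\tbar\n".translate(unicode_table)))
```

The flat table has to cover every possible byte, which is why it cannot
scale to Unicode; the dict only needs entries for the characters that
actually change.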

Right, now I've RTFD'd (read the fine docstring) for u''.translate():
it wants a mapping of Unicode ordinals to Unicode ordinals, Unicode
strings or None.  Here's what I've got now:

    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))

    unicode_whitespace_trans = {}
    for c in string.whitespace:
        unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')
    [...]
    def _munge_whitespace(self, text):
        [...]
        if self.replace_whitespace:
            if isinstance(text, str):
                text = text.translate(self.whitespace_trans)
            elif isinstance(text, unicode):
                text = text.translate(self.unicode_whitespace_trans)

That's ugly as hell, but it works.  Is there a cleaner way?
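
One possible alternative, offered only as a sketch: a single regex
substitution gives one code path regardless of string type, most likely
at some cost in speed compared with translate().  The helper name
replace_whitespace and the module-level pattern are hypothetical, not
part of textwrap, and the block is written for Python 3.

```python
import re
import string

# Hypothetical helper, not part of textwrap: one pattern, one code path.
_whitespace_re = re.compile('[%s]' % re.escape(string.whitespace))

def replace_whitespace(text):
    """Replace each whitespace character in text with a single space."""
    return _whitespace_re.sub(' ', text)

print(repr(replace_whitespace("foo\tbar\n\nbaz")))
```

Each whitespace character is replaced one for one, so runs of
whitespace keep their length, just as with the translate() version.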

The other bit of ASCII/English prejudice hardcoded into textwrap.py is
this regex:

    sentence_end_re = re.compile(r'[%s]'              # lowercase letter
                                 r'[\.\!\?]'          # sentence-ending punct.
                                 r'[\"\']?'           # optional end-of-quote
                                 % string.lowercase)

You may recall this from the kerfuffle over whether there should be two
spaces after a sentence in fixed-width fonts.  The feature is there, and
off by default, in TextWrapper.  I'm not so concerned about this -- I
mean, this doesn't even work with German or French, never mind Hebrew or
Chinese or Hindi.  Apart from the narrow definition of "lowercase
letter", it has English punctuation conventions hardcoded into it.
Still, it seems *awfully* dumb in this day and age to hardcode
string.lowercase into a regex that's meant to detect "lowercase
letters", but I couldn't find a better way to do it when I wrote this
code last spring.  Is there one?
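
For the "lowercase letter" half of the question, one approach is to let
the regex find the punctuation and then ask the preceding character
itself whether it is lowercase, via islower().  A minimal sketch for
Python 3 follows.  It assumes the test is applied to a single chunk and
should match at the end of that chunk.  The names ends_sentence and
_candidate_re are hypothetical.  The English punctuation conventions
are still hardcoded.

```python
import re

# Hypothetical replacement for the string.lowercase-based pattern:
# capture whatever character precedes the punctuation, then check
# that character with islower() instead of a fixed letter list.
_candidate_re = re.compile(r'(.)[.!?]["\']?\Z', re.DOTALL)

def ends_sentence(chunk):
    """Return True if chunk ends like the end of a sentence."""
    m = _candidate_re.search(chunk)
    return bool(m) and m.group(1).islower()

for chunk in ["done.", 'done."', "déjà.", "USA.", "3.", "done"]:
    print(repr(chunk), ends_sentence(chunk))
```

This handles accented lowercase letters, but scripts without case
(Hebrew, Chinese, Hindi) will never satisfy islower(), so it narrows
the problem rather than solving it.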

Thanks!

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD STOP
HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP