
On 22 October 2002, Martin v. Loewis said:
> I don't know how precisely you want to formulate the property. If x is
> a Unicode letter, then x.isspace() tells you whether it is a space
> character (this property holds for all characters of the Zs category,
> and all characters that have a bidirectionality of WS, B, or S).
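
To make that property concrete, here is a small sketch in present-day Python 3, where str is the Unicode string type (the code in this thread predates that and uses the separate unicode type). The helper names is_space_by_property and all_whitespace_chars are made up for illustration; the only library calls assumed are str.isspace(), unicodedata.category() and unicodedata.bidirectional().

```python
import sys
import unicodedata


def is_space_by_property(ch):
    """Apply the quoted rule directly: category Zs, or bidi class WS, B or S."""
    return (unicodedata.category(ch) == "Zs"
            or unicodedata.bidirectional(ch) in ("WS", "B", "S"))


def all_whitespace_chars():
    """Return every character that str.isspace() accepts, as one string."""
    return "".join(chr(cp) for cp in range(sys.maxunicode + 1)
                   if chr(cp).isspace())


# Compare the built-in method with the hand-written property check
# on a few sample characters (tab, newline, space, no-break space,
# em space, a letter, zero width space).
for ch in ("\t", "\n", " ", "\u00a0", "\u2003", "a", "\u200b"):
    print(repr(ch), ch.isspace(), is_space_by_property(ch))
```

Scanning every code point takes a moment, but it only has to happen once; the resulting string would play roughly the role that string.whitespace plays for ASCII.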

OK, then it's an implementation problem rather than a "you can't get
there from here" problem. Good.

The reason I need a list of "whitespace chars" is to convert all
whitespace to spaces; I use string.maketrans() and s.translate() to do
this efficiently:

    class TextWrapper:
        [...]
        whitespace_trans = string.maketrans(string.whitespace,
                                            ' ' * len(string.whitespace))
        [...]
        def _munge_whitespace(self, text):
            """_munge_whitespace(text : string) -> string

            Munge whitespace in text: expand tabs and convert all other
            whitespace characters to spaces.  Eg. " foo\tbar\n\nbaz"
            becomes " foo    bar  baz".
            """
            if self.expand_tabs:
                text = text.expandtabs()
            if self.replace_whitespace:
                text = text.translate(self.whitespace_trans)
            return text

(The rationale: having tabs and newlines in a paragraph about to be
wrapped doesn't make any sense to me.)

Ahh, OK, I'm starting to see the problem: there's nothing wrong with the
translate() method of strings or unicode strings, but string.maketrans()
doesn't generate a mapping that u''.translate() likes. Hmmmm.

Right, now I've RTFD'd (read the fine docstring) for u''.translate().
Here's what I've got now:

        whitespace_trans = string.maketrans(string.whitespace,
                                            ' ' * len(string.whitespace))

        unicode_whitespace_trans = {}
        for c in string.whitespace:
            unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')
        [...]
        def _munge_whitespace (self, text):
            [...]
            if self.replace_whitespace:
                if isinstance(text, str):
                    text = text.translate(self.whitespace_trans)
                elif isinstance(text, unicode):
                    text = text.translate(self.unicode_whitespace_trans)

That's ugly as hell, but it works. Is there a cleaner way?

The other bit of ASCII/English prejudice hardcoded into textwrap.py is
this regex:

        sentence_end_re = re.compile(r'[%s]'         # lowercase letter
                                     r'[\.\!\?]'     # sentence-ending punct.
                                     r'[\"\']?'      # optional end-of-quote
                                     % string.lowercase)

You may recall this from the kerfuffle over whether there should be two
spaces after a sentence in fixed-width fonts.
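
For comparison, here is a rough sketch of how both pieces might look with Unicode in mind, written for present-day Python 3 (one str type, so no str/unicode branching). It is not what textwrap.py does. The names WHITESPACE_TO_SPACE, munge_whitespace and ends_sentence are invented for this example, and the trailing \Z anchor is an added assumption that the pattern is tested against the end of a chunk. The standard re module has no "lowercase letter" class, so the sketch matches a letter-like character and checks its case separately with str.islower().

```python
import re
import sys

# Translation table keyed by code point, the form str.translate() accepts.
# Built from str.isspace() rather than from a hard-coded ASCII list.
WHITESPACE_TO_SPACE = {
    cp: " " for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
}


def munge_whitespace(text, tabsize=8):
    """Expand tabs, then turn every remaining whitespace character into ' '."""
    return text.expandtabs(tabsize).translate(WHITESPACE_TO_SPACE)


# [^\W\d_] matches a word character that is neither a decimal digit nor an
# underscore, i.e. roughly "a letter" in any script. It is followed by
# sentence-ending punctuation and an optional closing quote.
_SENTENCE_END_RE = re.compile(r"([^\W\d_])[.!?][\"']?\Z")


def ends_sentence(chunk):
    """True if chunk ends with a lowercase letter plus '.', '!' or '?'."""
    match = _SENTENCE_END_RE.search(chunk)
    return bool(match) and match.group(1).islower()
```

Two caveats on this sketch: the table also rewrites no-break spaces and similar characters, which a wrapper may well want to leave alone, and the punctuation set is still the English one, so only the "lowercase letter" part is any more general than before.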
The feature is there, and off by default, in TextWrapper.

I'm not so concerned about this -- I mean, this doesn't even work with
German or French, never mind Hebrew or Chinese or Hindi. Apart from the
narrow definition of "lowercase letter", it has English punctuation
conventions hardcoded into it. But still, it seems *awfully* dumb in
this day and age to hardcode string.lowercase into a regex that's meant
to detect "lowercase letters". But I couldn't find a better way to do it
when I wrote this code last spring. Is there one?

Thanks!

        Greg
--
Greg Ward <gward@python.net>                         http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD
STOP HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP