
Well, my ignorance of Unicode has finally bitten me -- someone filed a bug (#622831) against textwrap.py because it crashes when it attempts to wrap a Unicode string.

Here are the problems that I am aware of:

  * textwrap assumes "whitespace" means "the characters in string.whitespace"
  * textwrap assumes "lowercase letter" means "the characters in string.lowercase" (heck, this only works in English)

Can someone tell me what the proper way to do this is? Or just point me at the relevant documentation? I've scoured the online docs and *Python Essential Reference*, and I know more about the codecs and unicodedata modules than I did before. But I still don't know how to replace all whitespace with space, or detect words that end with a lowercase letter.

Thanks --

        Greg
-- 
Greg Ward <gward@python.net> http://www.gerg.ca/
I brought my BOWLING BALL -- and some DRUGS!!

Greg Ward <gward@python.net> writes:
I don't know how precisely you want to formulate the property. If x is a Unicode character, then x.isspace() tells you whether it is a space character (this property holds for all characters of the Zs category, and all characters that have a bidirectionality of WS, B, or S).

> * textwrap assumes "lowercase letter" means "the characters in string.lowercase" (heck, this only works in English)

Works the same way: x.islower() tells you whether a character is lower-case (meaning it is in the Ll category).

HTH,
Martin
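To make the two predicates concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter, where every str is Unicode (the thread itself predates that), and the helper name describe is invented for illustration. It only reports what the interpreter says for a few sample characters; the exact set of characters accepted depends on the Unicode database shipped with the interpreter.

```python
import unicodedata

def describe(ch):
    """Return (isspace, islower, general category) for one character."""
    return ch.isspace(), ch.islower(), unicodedata.category(ch)

# A tab, a no-break space, an em space, and a few letters from different scripts.
samples = ["\t", "\u00a0", "\u2003", "a", "\u00e9", "A", "\u4e2d"]

for ch in samples:
    print(repr(ch), describe(ch))
```

Note that the tab is reported as whitespace even though its category is Cc rather than Zs, which matches the bidirectionality clause in the explanation above.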

On 22 October 2002, Martin v. Loewis said:
OK, then it's an implementation problem rather than a "you can't get there from here" problem. Good.

The reason I need a list of "whitespace chars" is to convert all whitespace to spaces; I use string.maketrans() and s.translate() to do this efficiently:

    class TextWrapper:
        [...]
        whitespace_trans = string.maketrans(string.whitespace,
                                            ' ' * len(string.whitespace))
        [...]
        def _munge_whitespace(self, text):
            """_munge_whitespace(text : string) -> string

            Munge whitespace in text: expand tabs and convert all other
            whitespace characters to spaces.  Eg. " foo\tbar\n\nbaz"
            becomes " foo    bar  baz".
            """
            if self.expand_tabs:
                text = text.expandtabs()
            if self.replace_whitespace:
                text = text.translate(self.whitespace_trans)
            return text

(The rationale: having tabs and newlines in a paragraph about to be wrapped doesn't make any sense to me.)

Ahh, OK, I'm starting to see the problem: there's nothing wrong with the translate() method of strings or unicode strings, but string.maketrans() doesn't generate a mapping that u''.translate() likes. Hmmmm.

Right, now I've RTFD'd (read the fine docstring) for u''.translate(). Here's what I've got now:

    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))

    unicode_whitespace_trans = {}
    for c in string.whitespace:
        unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')
    [...]
    def _munge_whitespace (self, text):
        [...]
        if self.replace_whitespace:
            if isinstance(text, str):
                text = text.translate(self.whitespace_trans)
            elif isinstance(text, unicode):
                text = text.translate(self.unicode_whitespace_trans)

That's ugly as hell, but it works. Is there a cleaner way?

The other bit of ASCII/English prejudice hardcoded into textwrap.py is this regex:

    sentence_end_re = re.compile(r'[%s]'        # lowercase letter
                                 r'[\.\!\?]'    # sentence-ending punct.
                                 r'[\"\']?'     # optional end-of-quote
                                 % string.lowercase)

You may recall this from the kerfuffle over whether there should be two spaces after a sentence in fixed-width fonts. The feature is there, and off by default, in TextWrapper. I'm not so concerned about this -- I mean, this doesn't even work with German or French, never mind Hebrew or Chinese or Hindi. Apart from the narrow definition of "lowercase letter", it has English punctuation conventions hardcoded into it.

But still, it seems *awfully* dumb in this day and age to hardcode string.lowercase into a regex that's meant to detect "lowercase letters". But I couldn't find a better way to do it when I wrote this code last spring. Is there one?

Thanks!

        Greg
-- 
Greg Ward <gward@python.net> http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD STOP HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP
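For readers on a current interpreter, the same split between the two translate() APIs still exists, only the type names have moved. The following is a rough sketch assuming Python 3, where the 8-bit type is bytes and the Unicode type is str; the names byte_table, text_table and munge_whitespace are made up for this example and are not taken from textwrap.py.

```python
import string

# 8-bit strings: translate() expects a 256-byte lookup table.
byte_table = bytes.maketrans(string.whitespace.encode("ascii"),
                             b" " * len(string.whitespace))

# Unicode strings: translate() expects a mapping keyed by code point.
text_table = {ord(c): " " for c in string.whitespace}

def munge_whitespace(text):
    """Replace each ASCII whitespace character with a single space."""
    if isinstance(text, bytes):
        return text.translate(byte_table)
    return text.translate(text_table)

print(munge_whitespace("foo\tbar\n\nbaz"))
print(munge_whitespace(b"foo\tbar\n\nbaz"))
```

This keeps the two-table approach from the message above; it only covers the characters in string.whitespace, not the wider Unicode set.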

Greg Ward <gward@python.net> writes:
Is it then really necessary to replace each of these characters, or would it be acceptable to replace sequences of them, as Fredrik proposes (.split,.join)?
> for c in string.whitespace: unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')
There are conceptually 5 times as many space characters in Unicode (NO-BREAK SPACE, THREE-PER-EM SPACE, OGHAM SPACE MARK, and whatnot), but it is probably safe to ignore them. The complete fragment would read

    for c in range(sys.maxunicode + 1):
        if unichr(c).isspace():
            unicode_whitespace_trans[c] = u' '

(which is somewhat time-consuming, so you could hard-code a larger list if you wanted to)
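A sketch of that full scan, assuming Python 3 (chr() in place of unichr()); the names build_whitespace_table, WHITESPACE_TABLE and replace_whitespace are hypothetical. The table is built once at import time, since walking every code point is the slow part.

```python
import sys

def build_whitespace_table():
    """Map every code point that str.isspace() accepts to a plain space."""
    return {cp: " " for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Built once; the scan covers the whole code point range.
WHITESPACE_TABLE = build_whitespace_table()

def replace_whitespace(text):
    """Replace each whitespace character, one for one, with a space."""
    return text.translate(WHITESPACE_TABLE)

print(repr(replace_whitespace("a\u00a0b\u2003c\td")))
```

The resulting table is small, so hard-coding it, as suggested above, is a reasonable alternative if the start-up scan is a concern.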
> That's ugly as hell, but it works. Is there a cleaner way?
You may want to time re.sub, perhaps to find that the speed decrease is acceptable:

    space = re.compile("\s")
    text = space.sub(" ", text)
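One way to run that timing, sketched for Python 3 with the standard timeit module. The sample text, the repeat count and the function names are arbitrary choices for illustration, and no particular result is implied; the numbers will vary by machine and interpreter.

```python
import re
import timeit

SPACE_RE = re.compile(r"\s")
TABLE = {ord(c): " " for c in "\t\n\x0b\x0c\r "}

def with_regex(text):
    """Replace each whitespace character using a compiled regex."""
    return SPACE_RE.sub(" ", text)

def with_translate(text):
    """Replace each whitespace character using a translation table."""
    return text.translate(TABLE)

sample = "foo\tbar\n\nbaz qux\r\n" * 200

if __name__ == "__main__":
    for func in (with_regex, with_translate):
        seconds = timeit.timeit(lambda: func(sample), number=200)
        print(func.__name__, round(seconds, 4))
```

Both functions replace character for character, so their outputs can be compared directly before trusting the timings.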
For the issue at hand: this code does "work" with Unicode, right? I.e. it will give some result, even if confronted with funny characters? If so, I think you can ignore this for the moment.
> But I couldn't find a better way to do it when I wrote this code last spring. Is there one?
I believe the right approach is to support more classes in SRE. This one would be covered if there was a [:lower:] class. Regards, Martin

Greg Ward wrote:
Use the trick Fredrik posted: u' '.join(x.split()) (.split() defaults to splitting on whitespace, Unicode whitespace if x is Unicode).
Unicode uses a different API for this since it wouldn't make sense to pass a sys.maxunicode character Unicode string to translate just to map a few characters.
There are far too many lowercase characters in Unicode to make this approach usable. It would be better if there were a way to use Unicode character categories in the re sets. Since that's not available, why not search for all potential sentence ends and then try all of them using .islower() in a for-loop ?!

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
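The for-loop idea could look roughly like this. It is a sketch assuming Python 3, not the code that ended up in textwrap.py; PUNCT_RE and sentence_ends are names invented here, and the punctuation set is the same English-centric one from the original regex.

```python
import re

# Candidate sentence end: ., ! or ? followed by an optional closing quote.
PUNCT_RE = re.compile(r"[.!?][\"']?")

def sentence_ends(text):
    """Return end offsets of candidates preceded by a lowercase character."""
    ends = []
    for match in PUNCT_RE.finditer(text):
        start = match.start()
        # The regex finds the punctuation; islower() does the Unicode-aware check.
        if start > 0 and text[start - 1].islower():
            ends.append(match.end())
    return ends

print(sentence_ends("caf\u00e9. OK!"))
```

The lowercase test no longer depends on string.lowercase, but abbreviations such as "Dr." are still treated as sentence ends, just as with the original pattern.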

Greg Ward wrote:
It should use u.isspace() for this. You might also want to consider u.splitlines() for line breaking, since Unicode has a lot more line breaking characters than ASCII (which u.splitlines() knows about).
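A small illustration of the splitlines() point, assuming Python 3 str objects; the sample string is made up. It contains LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029) and NEXT LINE (U+0085) alongside the usual ASCII line endings.

```python
text = "one\ntwo\r\nthree\u2028four\u2029five\x85six"

# splitlines() recognises the Unicode line boundaries as well as \n and \r\n.
unicode_lines = text.splitlines()

# Splitting on "\n" alone misses them and leaves a stray "\r" behind.
ascii_lines = text.split("\n")

print(unicode_lines)
print(ascii_lines)
```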
> * textwrap assumes "lowercase letter" means "the characters in string.lowercase" (heck, this only works in English)
u.lower() will do the right thing for Unicode.
Hope that helps,
-- 
Marc-Andre Lemburg

greg wrote:
> But I still don't know how to replace all whitespace with space
    string.join(phrase.split(), " ")

or

    re.sub("(?u)\s+", " ", phrase)

not sure which one's faster; I suggest benchmarking.

(if you want to preserve leading/trailing space with the split approach, use isspace on the start/end of phrase)
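The parenthetical about preserving leading and trailing space might be sketched as follows, assuming Python 3 (where " ".join replaces string.join and \s is Unicode-aware for str patterns without the (?u) flag). The function names are hypothetical.

```python
import re

def collapse(phrase):
    """Collapse whitespace runs to one space; edge whitespace is dropped."""
    return " ".join(phrase.split())

def collapse_keep_edges(phrase):
    """Like collapse(), but keep one space where the phrase began or ended with whitespace."""
    body = " ".join(phrase.split())
    if not body:
        # The phrase was empty or consisted of whitespace only.
        return " " if phrase else ""
    lead = " " if phrase[0].isspace() else ""
    trail = " " if phrase[-1].isspace() else ""
    return lead + body + trail

def collapse_regex(phrase):
    """The regex alternative; it keeps edge whitespace as a single space."""
    return re.sub(r"\s+", " ", phrase)

print(repr(collapse_keep_edges("  foo\t\nbar ")))
```

Unlike the translate() approach, both variants merge runs of whitespace, so they are only a substitute if a one-for-one replacement is not required.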
> or detect words that end with a lowercase letter.
    word[-1].islower()

</F>

participants (5): Fredrik Lundh, Greg Ward, M.-A. Lemburg, martin@v.loewis.de, Raymond Hettinger