
Greg Ward wrote:
On 22 October 2002, Martin v. Loewis said:
OK, then it's an implementation problem rather than a "you can't get there from here" problem. Good. The reason I need a list of "whitespace chars" is to convert all whitespace to spaces; I use string.maketrans() and s.translate() to do this efficiently:
Use the trick Fredrik posted: u' '.join(x.split()) (.split() defaults to splitting on whitespace, Unicode whitespace if x is Unicode).
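A minimal sketch of the split/join trick, assuming Python 3 (where every str is Unicode); the helper name collapse_whitespace is hypothetical and not part of textwrap:

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    # split() with no argument splits on any run of whitespace,
    # including non-ASCII whitespace such as U+2003 (EM SPACE),
    # and drops leading and trailing whitespace.
    return ' '.join(text.split())
```

Note that this is not a drop-in replacement for a translate()-based approach: it collapses runs of whitespace into one space and strips both ends, whereas translate() maps characters one for one.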
Ahh, OK, I'm starting to see the problem: there's nothing wrong with the translate() method of strings or unicode strings, but string.maketrans() doesn't generate a mapping that u''.translate() likes. Hmmmm.
Unicode uses a different API for this, since it wouldn't make sense to pass a Unicode string of sys.maxunicode characters to translate() just to map a few characters.
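For illustration, the mapping-based translate() API can be sketched as below. This assumes Python 3, where str.translate() accepts a dict keyed by code point; the names SPACE_TABLE and whitespace_to_spaces are hypothetical, and building the table from str.isspace() is one possible choice, not something the thread prescribes:

```python
import sys

# Sparse translation table: only code points that str.isspace()
# reports as whitespace are listed, each mapped to a plain space.
# Built once at import time by scanning the whole code point range.
SPACE_TABLE = {
    cp: ' '
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isspace()
}

def whitespace_to_spaces(text):
    """Map each whitespace character to a space, one for one."""
    # Characters missing from the table are left unchanged.
    return text.translate(SPACE_TABLE)
```

Unlike the split/join trick, this keeps the string length unchanged, which matters if offsets into the text are used later.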
The other bit of ASCII/English prejudice hardcoded into textwrap.py is this regex:
    sentence_end_re = re.compile(r'[%s]'              # lowercase letter
                                 r'[\.\!\?]'          # sentence-ending punct.
                                 r'[\"\']?'           # optional end-of-quote
                                 % string.lowercase)
You may recall this from the kerfuffle over whether there should be two spaces after a sentence in fixed-width fonts. The feature is there, and off by default, in TextWrapper. I'm not so concerned about this -- I mean, this doesn't even work with German or French, never mind Hebrew or Chinese or Hindi. Apart from the narrow definition of "lowercase letter", it has English punctuation conventions hardcoded into it. But still, it seems *awfully* dumb in this day and age to hardcode string.lowercase into a regex that's meant to detect "lowercase letters". But I couldn't find a better way to do it when I wrote this code last spring. Is there one?
There are far too many lowercase characters in Unicode to make this approach usable. It would be better if there were a way to use Unicode character categories in the re sets. Since that's not available, why not search for all potential sentence ends and then test each of them using .islower() in a for-loop?!

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
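The for-loop suggestion above could look roughly like this. It is a sketch under assumptions, not the textwrap implementation: it assumes Python 3, the names candidate_end_re and sentence_end_offsets are hypothetical, and the English punctuation conventions remain hardcoded in the regex:

```python
import re

# Candidate sentence ends: any one character, then sentence-ending
# punctuation, then an optional closing quote.
candidate_end_re = re.compile(r'(.)[.!?]["\']?')

def sentence_end_offsets(text):
    """Return the offset just past each likely sentence end."""
    ends = []
    for match in candidate_end_re.finditer(text):
        # Let str.islower() decide whether the preceding character is
        # a lowercase letter, instead of listing letters in the regex.
        if match.group(1).islower():
            ends.append(match.end())
    return ends
```

With this approach the regex only finds the punctuation, and the lowercase test works for any letter that str.islower() recognizes, such as "é" or "ü".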