[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy
report at bugs.python.org
Mon Aug 15 06:18:45 CEST 2011
Terry J. Reedy <tjreedy at udel.edu> added the comment:
>It is always better to deliver more than you say than to deliver less.
Except when promising too little is a copout.
>Everyone always talks about how important they're sure O(1) access must be,
I thought that too until your challenge. But now that you mention it, indexing is probably not the bottleneck in most document processing. We are optimizing without measuring! We all know that is bad.
If done transparently, non-O(1) indexing should only be done when it is *needed*. And if it is a bottleneck, switch to a wide build -- or get a newer, faster machine.
I first used Python 1.3 on a 10 megahertz DOS machine. I just got a multicore 3.+ gigahertz machine. Tradeoffs have changed and just as we use cycles (and space) for nice graphical interfaces, we should use some for global text support. In the same pair of machines, core memory jumped from 2 megabytes to 24 gigabytes. (And the new machine cost perhaps as much in adjusted dollars.) Of course, better unicode support should come standard with the OS and not have to be re-invented by every language and app.
Having promised to actually 'work on a prototype in Python', I decided to do so before playing. I wrote the following test:
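tucs2 = 'A\U0001043cBC\U0001042f\U00010445DE\U00010428H'
tutf16 = UTF16(tucs2)  # wrapper class from the uploaded utf16.py; the name UTF16 is assumed here
tlist = ['A', '\U0001043c','B','C','\U0001042f','\U00010445',
         'D','E','\U00010428','H']
tlis2 = [tutf16[i] for i in range(len(tlist))]
assert tlist == tlis2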
and in a couple hours wrote and debugged the class to make it pass (and added a couple of length tests). See the uploaded file.
Adding an __iter__ method to iterate by characters (with hi chars returned as wrapped length-1 surrogate pairs) instead of code units would be trivial. Adding the code to __getitem__ to handle slices should not be too hard. Slices containing hi characters should be wrapped. The cpdex array would make that possible without looking at the whole slice.
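For readers without the uploaded file at hand, here is a minimal sketch of the indexing idea, not the contents of utf16.py. The class name CodePointView and the helper names are hypothetical. The name cpdex is borrowed from the paragraph above, but its layout here (a sorted list of the code-point indexes of the characters that need a surrogate pair) is an assumption. Because current Python builds no longer expose code units, the sketch simulates a narrow build by storing UTF-16 code units as integers.

```python
import bisect
import struct


class CodePointView:
    """Hypothetical sketch: index a UTF-16 code-unit sequence by code point."""

    def __init__(self, text):
        # Simulate narrow-build storage: one integer per UTF-16 code unit.
        raw = text.encode('utf-16-le')
        self.units = list(struct.unpack('<%dH' % (len(raw) // 2), raw))
        # Assumed layout: cpdex holds the code-point index of every
        # character that occupies two code units (a surrogate pair).
        self.cpdex = []
        cp = 0
        i = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:  # lead surrogate
                self.cpdex.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        self.length = cp

    def __len__(self):
        # Length in code points, not code units.
        return self.length

    def _unit_offset(self, index):
        # Each surrogate-pair character before `index` adds one extra unit.
        return index + bisect.bisect_left(self.cpdex, index)

    @staticmethod
    def _decode(units):
        return struct.pack('<%dH' % len(units), *units).decode('utf-16-le')

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError('this sketch supports integer indexes only')
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError('index out of range')
        start = self._unit_offset(index)
        width = 2 if 0xD800 <= self.units[start] <= 0xDBFF else 1
        return self._decode(self.units[start:start + width])

    def __iter__(self):
        # Iterate by character rather than by code unit.
        for i in range(self.length):
            yield self[i]
```

With this layout a lookup costs one binary search over cpdex, so text with no characters outside the BMP has an empty cpdex and indexes at essentially the usual cost. Slice support is left out; it could use _unit_offset on both endpoints.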
The same idea could be used to index by graphemes. For European text that used codepoints for pre-combined (accented) characters as much as possible, the overhead should not be too much.
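A rough illustration of the grapheme variant, under a loud simplification: a cluster is taken to be a base character followed by any characters with a nonzero canonical combining class, which is much cruder than the full Unicode segmentation rules. The function names grapheme_starts and grapheme_at are hypothetical.

```python
import unicodedata


def grapheme_starts(text):
    """Return the index of each character that begins a cluster."""
    starts = []
    for i, ch in enumerate(text):
        # A combining mark (nonzero combining class) joins the previous
        # cluster; anything else starts a new one.
        if i == 0 or unicodedata.combining(ch) == 0:
            starts.append(i)
    return starts


def grapheme_at(text, starts, k):
    """Return the k-th cluster of text, given its precomputed starts."""
    end = starts[k + 1] if k + 1 < len(starts) else len(text)
    return text[starts[k]:end]


decomposed = 'e\u0301le\u0300ve'    # "eleve" with combining accents
precomposed = '\u00e9l\u00e8ve'     # the same word, precomposed
starts = grapheme_starts(decomposed)
clusters = [grapheme_at(decomposed, starts, k) for k in range(len(starts))]
```

For the precomposed spelling every character starts its own cluster, so an index that recorded only the exceptions would stay empty, which is the low-overhead case described above.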
This may not be the best issue to attach this to, but I believe that improving the narrow build would allow fixing of the re/regex problems reported here.
Added file: http://bugs.python.org/file22900/utf16.py
Python tracker <report at bugs.python.org>