"Fredrik" == Fredrik Lundh writes:
Fredrik> Stephen J. Turnbull wrote:

>> But I worry that it's an exceptional example, when you use
>> assumptions like "real-life text uses characters drawn from a
>> small number of short contiguous regions in the alphabet."

Fredrik> The problem is that I cannot tell if you've studied
Fredrik> search issues,

Enough to understand Boyer-Moore and how the proposed algorithm
differs, and to recognize that your statements about the distribution
of search applications are true.

Not that I want to argue about search, I'm all in favor of better
search. I was startled to read that Python still uses a brute-force
algorithm for searching.

My point about distribution of ideographs was simply that you made an
unjustified assumption in the context of what is (to me, anyway) an
important subdomain of text processing. Here, it is "obviously
harmless," but that's because brute force search is so bad. In other
applications, or with a better status quo, there very well may be real
tradeoffs between what's good for 8-bit text and what's good for
Unicode.

Fredrik> or if you're just applying general "but wait, it's
Fredrik> different for asian languages" arguments here.

No, I know that ostrich won't fly.

Fredrik> Searches for "human text" are not that common, really,
Fredrik> and search terms are usually limited to only a few words.

In the context of PEP 292 is a focus on "human text" unwarranted?
After all, what motivated the PEP and the implementation was evidently
"human text" processing. In my experience, the notation for
interpolation it uses would have much bigger advantages over the
format string style for "human text" than for the "non-human text"
applications I know of. Not that it's useless for the latter, just
that it's much more of a luxury there.
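
As a rough illustration of that contrast, here is a minimal sketch. It assumes PEP 292 is available as the standard library's `string.Template` class; the message text and the values are made up for the example.

```python
from string import Template

# Hypothetical message values, for illustration only.
values = {"user": "Fredrik", "count": 3}

# Format-string style: every placeholder carries a type code
# ("s", "d") that whoever edits the text must preserve exactly.
percent_style = "%(user)s has %(count)d new messages" % values

# PEP 292 style: a bare $name, with no type code to get wrong.
dollar_style = Template("$user has $count new messages").substitute(values)

# A reworded or translated template can reorder placeholders freely.
reordered = Template("$count new messages for $user").substitute(values)

print(percent_style)
print(dollar_style)
print(reordered)
```

Both styles produce the same text here; the difference is in how much a non-programmer editing the message has to know about the placeholder syntax.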

If that's valid, there's a point where it makes sense for people who
develop human-text-oriented features based on Unicode strings to say
"pick the features you really want for 8-bit strings, because you have
to support them yourselves."

Fredrik> The only way to know for sure is if anyone has the time
Fredrik> and energy to carry out tests on real-life datasets. (or
Fredrik> at least prepare some datasets;

I can prepare datasets and do some statistical work for Japanese, but
it probably won't happen this month. Sounds like a worthwhile thing
to have around, though.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.