[Python-Dev] Re: Alternative Implementation for PEP 292: Simple
String Substitutions
Stephen J. Turnbull
stephen at xemacs.org
Mon Sep 13 06:21:32 CEST 2004
>>>>> "Fredrik" == Fredrik Lundh <fredrik at pythonware.com> writes:
Fredrik> M.-A. Lemburg wrote:
>>> (google for "stringlib" for some work I'm doing in this area)
>> Ah, now I know where you're coming from :-) Shift tables don't
>> work well in the Unicode world with its large alphabet.
Fredrik> since most real-life text use characters from only a
Fredrik> small number of regions in that alphabet,
This is true of "most real-life text", but it's going to be false most
of the time for a large (and rapidly growing) minority of users: those
working with texts composed mostly of Asian ideographs. Unihan
(spread over about 80 256-character rows) has a potentially big problem:
because it is ordered by root, then stroke count, the simpler (and
usually more frequently used) ideographs with a common root cluster
near the root. Whether those clusters frequently collide under a
simple compression method like "lowest 5 bits" I don't know offhand.
I don't know whether the composed Hangul (~ 40 rows) would show
clustering; that would depend on phonetic frequencies in the Korean
language.
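To make the collision worry concrete, here is a minimal sketch of a "lowest 5 bits" compression. It is an illustration under my own assumptions, not the actual stringlib code; the names low5_mask, maybe_contains and bits_set are hypothetical. Each character is folded onto one of 32 bits, so two characters whose code points agree in the low 5 bits become indistinguishable.

```python
def low5_mask(text):
    """Fold every character of text onto a 32-bit mask using its lowest 5 bits."""
    mask = 0
    for ch in text:
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


def maybe_contains(mask, ch):
    """Return True if ch may be in the set the mask was built from.

    False positives are possible; false negatives are not.
    """
    return bool(mask & (1 << (ord(ch) & 0x1F)))


def bits_set(mask):
    """Count how many of the 32 slots are occupied."""
    return bin(mask).count("1")


ascii_mask = low5_mask("hello")
print(bits_set(ascii_mask))              # distinct slots used by h, e, l, o
print(maybe_contains(ascii_mask, "z"))   # 'z' falls in an unused slot

# U+4E00 and U+4E20 differ only above the low 5 bits, so they share a slot.
han_mask = low5_mask("\u4e00")
print(maybe_contains(han_mask, "\u4e20"))  # a false positive
```

The more slots a pattern fills, the less often the mask can rule a character out. How quickly slots fill for real Unihan or Hangul text is exactly the thing I don't know offhand; measuring bits_set over a representative corpus would be one way to find out.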
Of course the find algorithm you present is almost surely a big win
over the brute-force method, even in the presence of some degree of
clustering in Unihan and Hangul. But I worry that it's an exceptional
case when you rely on assumptions like "real-life text uses characters
drawn from a small number of short contiguous regions in the alphabet."
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.