[Python-Dev] Re: Alternative Implementation for PEP 292: Simple String Substitutions

Stephen J. Turnbull stephen at xemacs.org
Mon Sep 13 06:21:32 CEST 2004


>>>>> "Fredrik" == Fredrik Lundh <fredrik at pythonware.com> writes:

    Fredrik> M.-A. Lemburg wrote:

    >>> (google for "stringlib" for some work I'm doing in this area)

    >> Ah, now I know where you're coming from :-) Shift tables don't
    >> work well in the Unicode world with its large alphabet.

    Fredrik> since most real-life text use characters from only a
    Fredrik> small number of regions in that alphabet,

This is true of "most real-life text", but it's going to be false most
of the time for a large (and rapidly growing) minority of users: those
working with texts composed mostly of Asian ideographs.  Unihan
(spread over about 80 256-character rows) has a potential big problem:
because it is ordered by radical, then stroke count, the simpler (and
usually more frequently used) ideographs with a common radical cluster
near that radical.  Whether those clusters frequently collide under a
simple compression method like "lowest 5 bits" I don't know offhand.
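
To make the "lowest 5 bits" question concrete, here is a minimal Python
sketch that folds each character of a string into a 32-bit mask by the
low 5 bits of its code point and counts how many bits end up set.  The
names low5_mask and occupancy are made up for illustration and are not
taken from stringlib; this only shows how one could measure the
collisions on a sample text, not what real corpora look like.

```python
def low5_mask(text):
    """Return a 32-bit mask with bit (ord(ch) & 31) set for each character."""
    mask = 0
    for ch in text:
        # Keep only the lowest 5 bits of the code point (hypothetical choice
        # of "compression", matching the "lowest 5 bits" idea in the text).
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


def occupancy(text):
    """Count how many of the 32 mask bits are set for this text."""
    return bin(low5_mask(text)).count("1")
```

Characters whose code points differ by a multiple of 32 land on the
same bit, so "a" and "A" collide, as do U+4E00 and U+4E20.  Running
occupancy() over representative Chinese, Japanese, or Korean text would
show how quickly the mask fills up; the fuller the mask, the less often
a search can rule out a character at a glance.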

I don't know whether the composed Hangul (~ 40 rows) would show
clustering; that would depend on phonetic frequencies in the Korean
language.

Of course the find algorithm you present is almost surely a big win
over the brute-force method, even in the presence of some degree of
clustering in Unihan and Hangul.  But I worry that it's the exception
rather than the rule among algorithms that rely on assumptions like
"real-life text uses characters drawn from a small number of short
contiguous regions in the alphabet."
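
For readers who want to see why such a search still beats brute force
when the mask is fairly full, here is a simplified sketch in Python.
It is not Fredrik's code: find_with_mask is a hypothetical name, and
the only trick it uses is a 32-bit mask of the pattern's characters
(low 5 bits each) to decide whether the window can jump ahead.

```python
def find_with_mask(text, pattern):
    """Return the lowest index of pattern in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0

    # Build the 32-bit mask: one bit per (code point & 31) seen in pattern.
    mask = 0
    for ch in pattern:
        mask |= 1 << (ord(ch) & 0x1F)

    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Inspect the character just past the current window.  If its bit
        # is not in the mask, it cannot occur in the pattern, so no match
        # can overlap it and the window may jump past it entirely.
        if i + m < n and not (mask >> (ord(text[i + m]) & 0x1F)) & 1:
            i += m + 1
        else:
            # Bit is set (a real hit or a collision): fall back to a
            # brute-force step of one position.
            i += 1
    return -1
```

A collision in the mask never produces a wrong answer; it only turns a
long jump into a single step.  So heavy clustering degrades this sketch
toward brute-force speed, but not below it.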

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

