[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Fri Jun 8 10:21:01 CEST 2007


On 6/8/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> In principle, yes. What's the cost of the additional field in terms of
> a size increase? If you just need another bit, could that fit into
> _PyUnicode_TypeRecord.flags instead?

The additional field is 8 bits, two bits for each normalization (a
Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
used, but I don't know if that's true of later versions. As
_PyUnicode_Database_Records stores only unique records, this also results
in an increase of the number of records, from 219 to 304. Each record
looks like this:

typedef struct {
    const unsigned char category;
    const unsigned char combining;
    const unsigned char bidirectional;
    const unsigned char mirrored;
    const unsigned char east_asian_width;
    const unsigned char normalization_quick_check; /* my addition */
} _PyUnicode_DatabaseRecord;

I added the field to this record because the function needs to get the
record anyway for each character (it needs the field "combining", too).
The new field combines values for the derived properties (trinary)
NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check.

Here's the main loop (works for all four normalizations, only the value of
quickcheck_shift changes):

    while (i < end) {
        const _PyUnicode_DatabaseRecord *record = _getrecord_ex(*i++);
        unsigned char combining = record->combining;
        unsigned char quickcheck = record->normalization_quick_check;

        if ((quickcheck>>quickcheck_shift) & 3)
            return 0; /* this character might need normalization */
        if (combining && prev_combining > combining)
            return 0; /* non-canonical order, not normalized */
        prev_combining = combining;
    }

> That would be best. You only need to include the patch to the generator,
> not the generated data. I'd like to see it in 2.6, so ideally, you would
> test it for the trunk (not that the branch should matter much)).

This is easy to do. The differences in these files between the versions
are very small, and I actually initially wrote it for 2.5, as
py3k-struni's normalization test fails at the moment.

> Don't forget to include test suite and documentation changes.

It doesn't affect behavior or the API much(*), only performance. Current
test_normalize.py uses a test suite it fetches from UCD, so it
should be adequate.

(*) You *can* test for its presence by e.g. checking whether
    id(unicodedata.normalize('NFC', u'a')) is id(u'a') or not.
    The documentation does not specify either way. I'd say it's an
    implementation detail, and both tests and documentation should ignore
    it.


More information about the Python-3000 mailing list