How to waste computer memory?
Steven D'Aprano
steve at pearwood.info
Sat Mar 19 10:39:47 EDT 2016
On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:
> The problem is not theoretical. If I implement a web form and someone
> enters "Aña" as their name, how do I make sure queries find the name
> regardless of the unicode code point sequence? I have to normalize using
> unicodedata.normalize().
I didn't say that it was theoretical. It is a real problem, but it is a
problem with human languages: the number of characters-with-accents is
vast, possibly impossibly vast. They can't all have unique code points.
I must admit I had completely missed your example of multiple combining
characters; that's a good one. Here's the example again:
a + combining ring above + combining ring below, versus
a + combining ring below + combining ring above
Naturally, just comparing them directly reports them as unequal:
py> s = "a\u030A\u0325"
py> t = "a\u0325\u030A"
py> s == t
False
But we can normalise them:
====  =============  =============  ==================  ==================
Form  NFC            NFKC           NFD                 NFKD
====  =============  =============  ==================  ==================
s     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
t     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
====  =============  =============  ==================  ==================
As you can see, *any* of the normalisation forms will put the code points
into the same, canonical order, making them equal.
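For anyone who wants to check this themselves, here is a minimal sketch that normalises both strings under each of the four forms and compares the results. The helper name code_points is my own invention for display purposes, and the "Aña" comparison at the end is just one way of spelling the two variants from the web form example; only unicodedata.normalize() comes from the standard library.

```python
import unicodedata

s = "a\u030A\u0325"   # a + combining ring above + combining ring below
t = "a\u0325\u030A"   # a + combining ring below + combining ring above

def code_points(text):
    """Hypothetical helper: render a string as comma-separated U+XXXX values."""
    return ",".join("U+%04X" % ord(ch) for ch in text)

FORMS = ("NFC", "NFKC", "NFD", "NFKD")

# Normalise both strings under every form and remember the results.
results = {}
for form in FORMS:
    ns = unicodedata.normalize(form, s)
    nt = unicodedata.normalize(form, t)
    results[form] = (ns, nt)
    print(form, code_points(ns), code_points(nt), ns == nt)

# The same idea applied to the web form example: a precomposed n-with-tilde
# versus n followed by a combining tilde.
composed = "A\u00F1a"
decomposed = "An\u0303a"
names_match = (unicodedata.normalize("NFC", composed)
               == unicodedata.normalize("NFC", decomposed))
print("names match after NFC:", names_match)
```

Which form to use for stored data is a separate design choice; the point here is only that both sides of a comparison need to go through the same one.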
> When glorifying Python's advanced Unicode capabilities, are we careful
> to emphasize the necessity of unicodedata.normalize() everywhere? Should
> Python normalize strings unconditionally and transparently? What does
> the O(1) character lookup mean under normalization?
>
> Some weeks ago I had to spend 30 minutes to debug my Python program when
> a user complained it didn't work. Turns out they had accidentally
> invoked the program using a space and a composing tilde instead of the
> ASCII ~. There was no visual indication of a problem on the screen, but
> the Python program acted up.
We recently had somebody here who wrote capital I by pressing the lower case
l on the keyboard. Should a pure-ASCII program be able to operate without
malfunction if the user confuses 0 and O, or I, l and 1? What about ' and `
or possibly even '' and "?
--
Steven