ascii to latin1
Richie Hindle
richie at entrian.com
Tue May 9 09:48:09 EDT 2006
[Serge]
> I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.
I only have to look at the length of the document to understand it's not
so easy. 8-) I'll take your two-line normalization function any day.
> IMHO it is perfectly acceptable to declare you don't interpret those
> symbols. After all they are called *compatibility* code points. I
> tried "a quarter" symbol: Google and MSN don't interpret it. Yahoo
> doesn't support it at all. [...]
> if you have character "digit two" followed by "superscript
> digit two"; they look like 2 power 2, but NFKD will convert them into
> 22 (twenty two), which is wrong. So if you want to use NFKD for search
> you will have to preprocess your data, for example by inserting a space
> between the twos.
I'm not sure it's obvious that it's wrong. How might a user enter
"2<superscript digit 2>" into a search box? They might enter a genuine
"<superscript digit 2>" in which case you're fine, or they might enter
"2^2" in which case it depends how you deal with punctuation. They
probably won't enter "2 2".
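
For reference, a minimal sketch of the behaviour under discussion, using
the standard unicodedata module in Python 3. The helper name fold_compat
is made up for illustration; it is not the function from earlier in the
thread.

```python
import unicodedata

def fold_compat(text):
    """Apply NFKD, which replaces compatibility characters with plain ones."""
    return unicodedata.normalize("NFKD", text)

two_squared = "2\u00b2"            # DIGIT TWO followed by SUPERSCRIPT TWO
print(fold_compat(two_squared))    # the superscript folds to a plain "2"
print(fold_compat("\u00b2"))       # a lone superscript two folds the same way
```

Both the indexed text and the query would need to pass through the same
function for a search on a genuine superscript two to match.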
It's certainly not wrong in the case of ligatures like LATIN SMALL
LIGATURE FI - it's quite likely that the user will search for "fish"
rather than finding and (somehow) typing the ligature.
Some superscripts are similar - I imagine there's a code point for the
"superscript st" in "1st" (though I can't find it offhand) and you'd
definitely want to convert that to "st".
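
A sketch of both cases in Python 3, assuming the search simply compares
NFKD-folded, lowercased strings (search_key is a hypothetical helper).
Unicode does not appear to have a single "superscript st" character; the
closest candidates are probably MODIFIER LETTER SMALL S (U+02E2) and
MODIFIER LETTER SMALL T (U+1D57), so those are used here as an assumption.

```python
import unicodedata

def search_key(text):
    """Fold compatibility characters with NFKD, then lowercase for matching."""
    return unicodedata.normalize("NFKD", text).lower()

document = "Fresh \ufb01sh"        # contains U+FB01 LATIN SMALL LIGATURE FI
query = "fish"
print(search_key(query) in search_key(document))   # ligature matches plain "fi"

ordinal = "1\u02e2\u1d57"          # "1" + modifier letters s and t
print(search_key(ordinal))         # folds to plain ASCII letters
```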
NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into
"1/4" - I wonder whether there's some way to do that?
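
One possible approach, sketched in Python 3: NFKD decomposes the fraction
into "1", FRACTION SLASH (U+2044) and "4", so replacing the fraction slash
with an ordinary "/" afterwards gets to "1/4". The name fold_fractions is
made up for illustration.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def fold_fractions(text):
    """NFKD-normalize, then turn FRACTION SLASH into an ASCII slash."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.replace(FRACTION_SLASH, "/")

quarter = "\u00bc"                 # VULGAR FRACTION ONE QUARTER
print(fold_fractions(quarter))
```

This has the same ambiguity as the superscript case: "2" followed by the
quarter sign folds to "21/4", so some preprocessing may still be needed.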
> After all they are called *compatibility* code points.
Yes, compatible with what the user types. 8-)
--
Richie Hindle
richie at entrian.com