trying to strip out non ascii.. or rather convert non ascii
steve+comp.lang.python at pearwood.info
Fri Nov 1 08:16:36 CET 2013
On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
> On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other search engines. When I search for "mispealled" Google gives me:
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind, than search engines.
Yes, you have, I acknowledge that now. I see now that at the time I made
my response to you, you had already replied recognising your error.
Unfortunately I had not seen that. So in that case, I withdraw my
comments and apologize.
> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
> Pure unicode == independent from any locale.
Unfortunately it is not that simple. The same code point can have
different meanings in different languages, and should be treated
differently when sorting. The natural Unicode sort order satisfies very
few European languages, including English. A few examples:
* Swedish ä is a distinct letter of the alphabet, appearing
after z: "a b c z ä" is sorted according to Swedish rules.
But in German ä is considered to be the letter 'a' plus an
umlaut, and is collated after 'a': "a ä b c z" is sorted
according to German rules.
* In German ö is considered to be a variant of o, equivalent
to 'oe', while in Finnish ö is a distinct letter which
cannot be expanded to 'oe', and which appears at the end
of the alphabet.
* Similarly, in modern English æ is a ligature of ae, while in
Danish and Norwegian it is a distinct letter of the alphabet
appearing after z: in English dictionaries, "Æsir" will be
found with other "A" words, often expanded to "Aesir", while
in Norwegian it will be found after "Z" words.
* Most European languages convert uppercase I to lowercase i,
but Turkish has distinct letters for dotted and dotless I.
According to Turkish rules, lowercase(I) is ı and uppercase(i) is İ.
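The Swedish and German collation rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names german_key and swedish_key and the word list are made up for the example, the German key follows the "ä sorts with a" dictionary convention, and the Swedish key only knows the 29 letters of the Swedish alphabet. Real applications would more likely rely on locale.strxfrm or an ICU binding.

```python
# Hypothetical sample data, chosen to show all three orderings.
words = ["zebra", "äpple", "apple", "Zoo"]

# Plain sorted() compares code points: 'Z' (U+005A) < 'a' (U+0061)
# < 'z' (U+007A) < 'ä' (U+00E4).
code_point_order = sorted(words)

# Assumed German dictionary-style rule: an umlaut sorts with its base letter.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def german_key(word):
    # casefold() also expands ß to "ss"; the original word breaks ties.
    return (word.casefold().translate(GERMAN_FOLD), word)

# Swedish rule: å, ä and ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    # Characters outside the alphabet are simply ranked last in this sketch.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.casefold()]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(code_point_order)  # ['Zoo', 'apple', 'zebra', 'äpple']
print(german_order)      # ['apple', 'äpple', 'zebra', 'Zoo']
print(swedish_order)     # ['apple', 'zebra', 'Zoo', 'äpple']
```

The same four strings come out in three different orders, and which one is correct depends entirely on the language of the reader.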
While it is true that the Unicode character set is independent of locale,
for natural processing of characters, it isn't enough to just use Unicode.
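The Turkish casing point can be checked in Python 3, where str.lower() and str.upper() apply the locale-independent Unicode mappings. The helpers turkish_lower and turkish_upper below are hypothetical names for a minimal sketch that special-cases only the four dotted and dotless letters; they are not part of the standard library.

```python
# The built-in case mappings ignore locale.
print("I".lower() == "i")         # True: correct for English, wrong for Turkish
print("İ".lower() == "i\u0307")   # True: 'i' followed by COMBINING DOT ABOVE

def turkish_lower(text):
    # Assumed Turkish rule: I -> ı (dotless) and İ -> i (dotted).
    return text.replace("I", "ı").replace("İ", "i").lower()

def turkish_upper(text):
    # Assumed Turkish rule: i -> İ (dotted) and ı -> I (dotless).
    return text.replace("i", "İ").replace("ı", "I").upper()

print(turkish_lower("DIŞ"))       # dış
print(turkish_upper("istanbul"))  # İSTANBUL
```

Whether to call the built-in methods or the Turkish-aware ones cannot be decided from the code points alone; the program has to know which language the text is in.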