ascii to latin1
Serge.Orlov at gmail.com
Tue May 9 15:06:54 CEST 2006
Luis P. Mendes wrote:
> Richie Hindle wrote:
> > [Serge]
> >> def search_key(s):
> >>     de_str = unicodedata.normalize("NFD", s)
> >>     return ''.join(cp for cp in de_str if not
> >>                    unicodedata.category(cp).startswith('M'))
> > Lovely bit of code - thanks for posting it!
> > You might want to use "NFKD" to normalize things like LATIN SMALL
> > LIGATURE FI and subscript/superscript characters as well as diacritics.
> Thank you very much for your info. It's a very good approach.
> When I used the "NFD" option, I came across many errors on these and
> possibly other codes: \xba, \xc9, \xcd.
What errors? The normalize method is not supposed to raise any errors. Do
you mean it doesn't work as expected? Well, I have to admit that using
normalize is a far from perfect way to implement search. The most
advanced algorithm is the Unicode Collation Algorithm, published by the
Unicode Consortium: <http://www.unicode.org/reports/tr10/> If you read it
you'll understand it's not so easy.
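To make the difference between the two forms concrete, here is a minimal sketch, assuming Python 3 (where str is already Unicode). The helper name strip_marks is made up for this example; it is the same idea as search_key with the normalization form passed in as a parameter.

```python
import unicodedata

def strip_marks(s, form):
    # Decompose the string, then drop every combining mark
    # (general categories Mn, Mc and Me all start with 'M').
    decomposed = unicodedata.normalize(form, s)
    return ''.join(cp for cp in decomposed
                   if not unicodedata.category(cp).startswith('M'))

# Show what each form does to the code points mentioned above.
for ch in '\xba\xc9\xcd\xbf':
    print(repr(ch),
          repr(strip_marks(ch, 'NFD')),
          repr(strip_marks(ch, 'NFKD')))
```

As far as I can tell, \xc9 and \xcd come out as plain E and I with either form, \xba only turns into o with NFKD (it has a compatibility decomposition but no canonical one), and \xbf is left alone by both because it has no decomposition at all. No exception is raised in any of these cases.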
> I tried to use "NFKD" instead, and the number of errors was only about
> half a dozen, for a universe of 600000+ names, on code \xbf.
> It looks like I have to do a search and substitute using regular
> expressions for these cases. Or is there a better way to do it?
Perhaps you can use the unicode translate method to map the characters
that still give you problems to whatever you want.
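A rough sketch of that idea, assuming Python 3, where str.translate accepts a dict keyed by code point. The LEFTOVERS table and the replacements in it are only illustrative guesses, not a recommended mapping; put in whatever suits your data.

```python
import unicodedata

# Hypothetical table for characters that survive normalization.
# Keys are code points; values are a replacement string or None to delete.
LEFTOVERS = {
    0xBF: None,   # INVERTED QUESTION MARK: drop it
    0xDF: 'ss',   # LATIN SMALL LETTER SHARP S
    0xF8: 'o',    # LATIN SMALL LETTER O WITH STROKE
}

def search_key(s):
    # Decompose (including compatibility forms), strip combining marks,
    # then map whatever is left over through the table.
    de_str = unicodedata.normalize('NFKD', s)
    stripped = ''.join(cp for cp in de_str
                       if not unicodedata.category(cp).startswith('M'))
    return stripped.translate(LEFTOVERS)

print(search_key('\xbfQu\xe9?'))
print(search_key('Stra\xdfe'))
```

A dict lookup like this is a single pass over the string, so it should be cheaper and easier to maintain than a series of regular expression substitutions.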