[Baypiggies] April snippets meeting - Unicode normalisation/normalization trick

Shannon -jj Behrens jjinux at gmail.com
Sat Apr 14 02:21:05 CEST 2007


On 4/12/07, Chris Clark <Chris.Clark at ingres.com> wrote:
> Here is the "barely a snippet, more of a reminder of the wealth of
> libraries that Python ships with", I showed this evening. jj told me
> that there is a similar piece of code in the cookbook, I've not checked
> it for fear of seeing a better explanation....
>
> I've sent this mail in cp1252 encoding (almost latin1), if you can't see
> some of the special characters in it, try the html attachment in a
> browser instead.
>
> I needed to perform string comparisons so that:
>
>     "o" == "ö"
>
> would be considered true. I.e. lower case "O" matches lower case "O"
> with umlaut. I really wanted this to work with all decorated characters,
> "A acute", "A caret", ......
>
> The unicodedata library has a normalisation function which allows
> normalisation to different forms. One of the forms (NFKD) decomposes a
> single decorated character into a sequence of the undecorated character
> + the decoration (a combining mark). If you can strip the decoration off
> you end up with the undecorated character. Simply encoding in 7 bit
> ASCII (ignoring unmappable characters) happens to do just that. Viz.:
>
>      >>> import unicodedata
>      >>> test_str = u'Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'
>      >>> test_str
>     u'Bj\xf6rk'
>      >>> print test_str
>     Björk
>      >>> unicodedata.normalize('NFKD', test_str )
>     u'Bjo\u0308rk'
>      >>> unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
>     'Bjork'
>
> Once converted you can perform comparisons, storage, or even convert
> back to Unicode :-)
>
> One word of caution, some characters won't decompose (for sensible
> reasons), this is really intended for decorated or accented characters.
> E.g.:
>
>      >>> test_str = u'\N{LATIN CAPITAL LETTER AE}'
>      >>> test_str
>     u'\xc6'
>      >>> print test_str
>     Æ
>      >>> unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
>     ''
>
> I.e. u'\N{LATIN CAPITAL LETTER AE}' has no decomposition, so NFKD leaves
> it unchanged and the ASCII encode then drops it entirely.
>
> http://unicode.org is *the* place for all things Unicode but there are
> some sites around that are slightly more friendly for simple lookups, e.g.
> http://www.fileformat.info/info/unicode/char/00f6/index.htm
> gives the name, picture (in case you do not have a suitable font
> installed) as well as a bunch of other truly useful information.
>
> Chris
>
>
>
>
>  >>> import unicodedata
>  >>> test_str = u'Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'
>  >>> print test_str
>  Björk
>  >>> test_str = u'Bj\u00F6rk'
>  >>> print test_str
>  Björk
>  >>> print unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
>  Bjork
>  >>> test_str = u'\N{LATIN CAPITAL LETTER AE}'
>  >>> print test_str
>  Æ
>  >>> print unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
>
>  >>>
>
>
>
>  See
> http://www.fileformat.info/info/unicode/char/00f6/index.htm

latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

I always remember this cookbook entry because of its clever name ;)
The comments are actually better than the recipe itself, if I
remember right.
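
For anyone reading this on Python 3, here is a minimal sketch of the
same trick. It drops combining marks explicitly instead of encoding to
ASCII, so characters with no decomposition are kept rather than
silently lost. The names strip_accents, loosely_equal and the FALLBACK
table are made up for illustration; they are not from Chris's snippet
or from the cookbook recipe, and the table entries are just an assumed
choice of replacements.

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode
# decomposition; the replacements are an assumption, adjust to taste.
FALLBACK = {
    "\N{LATIN CAPITAL LETTER AE}": "AE",
    "\N{LATIN SMALL LETTER AE}": "ae",
    "\N{LATIN SMALL LETTER O WITH STROKE}": "o",
}


def strip_accents(text):
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # this is the accent itself, so skip it
        kept.append(FALLBACK.get(ch, ch))
    return "".join(kept)


def loosely_equal(a, b):
    """Compare two strings while ignoring accents."""
    return strip_accents(a) == strip_accents(b)


print(strip_accents("Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk"))
print(loosely_equal("o", "\N{LATIN SMALL LETTER O WITH DIAERESIS}"))
```

Unlike the encode('ASCII', 'ignore') approach, this leaves text in
other scripts alone, which may or may not be what you want.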

Best Regards,
-jj

-- 
"'Software Engineering' is something of an oxymoron.  It's very
difficult to have real engineering before you have physics, and there
isn't anything even close to a physics for software." -- L. Peter
Deutsch

