compare unicode to non-unicode strings

Matt Nordhoff mnordhoff at mattnordhoff.com
Sun Aug 31 21:31:52 CEST 2008


Asterix wrote:
> how could I test that those 2 strings are the same:
> 
> 'séd' (repr is 's\xc3\xa9d')
> 
> u'séd' (repr is u's\xe9d')

You may also want to look at unicodedata.normalize(). For example, é can
be represented in more than one way as a sequence of code points:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'é')
u'\xe9'
>>> unicodedata.normalize('NFD', u'é')
u'e\u0301'
>>> u'\xe9' == u'e\u0301'
False

The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
WITH ACUTE). The second form is "decomposed", being made up of U+0065
(LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).
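If you want to check for yourself which code points a string contains, something like the following rough sketch should work. The describe() helper is just a name made up for this illustration, not part of unicodedata; it only relies on ord() and unicodedata.name().

```python
import unicodedata

composed = u'\xe9'        # single code point
decomposed = u'e\u0301'   # base letter followed by a combining accent

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text.

    Hypothetical helper for illustration only.
    """
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]

print(describe(composed))
print(describe(decomposed))
```

The composed string should come back as one entry and the decomposed string as two, matching the names given above.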

Even though they represent the same thing to a human, they don't compare
as equal. But if you normalize them both to the same form (NFC or NFD,
it doesn't matter which as long as it's the same one), they will.
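Putting that together with the original question, here is a minimal sketch of a comparison that decodes byte strings first and then normalizes both sides. The same_text() helper and its parameters are made up for this example, and it assumes the byte string is UTF-8 encoded (which the repr in the question suggests, since \xc3\xa9 is the UTF-8 encoding of é). If your bytes are in some other encoding, you would need to pass that instead.

```python
import unicodedata

def same_text(a, b, encoding='utf-8', form='NFC'):
    """Compare two strings after decoding bytes and normalizing.

    Hypothetical helper: 'encoding' is an assumption about how any
    byte string was encoded; 'form' is the normalization form applied
    to both sides before comparing.
    """
    def to_normalized(s):
        # Byte strings have to be decoded to unicode before normalizing.
        if isinstance(s, bytes):
            s = s.decode(encoding)
        return unicodedata.normalize(form, s)
    return to_normalized(a) == to_normalized(b)

# Composed vs. decomposed forms of é
print(same_text(u'\xe9', u'e\u0301'))

# UTF-8 byte string vs. unicode string from the question
print(same_text(b's\xc3\xa9d', u's\xe9d'))
```

Both calls should print True. Normalizing to NFD instead would work equally well for the comparison; the only thing that matters is that both sides go through the same form.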

For more information, look at the unicodedata module's documentation:
<http://docs.python.org/lib/module-unicodedata.html>