[Tutor] UTF-8 title() string method

Kent Johnson kent37 at tds.net
Wed Jul 4 22:33:11 CEST 2007


Jon Crump wrote:
> Dear All,
> 
> I have some utf-8 unicode text with lines like this:
> 
> ANVERS-LE-HOMONT, Maine.
> ANGOULÊME, Angoumois.
> ANDELY (le Petit), Normandie.
> 
> which I'm using as-is in this line of code:
> 
> place.append(line.strip())
> 
> What I would prefer would be something like this:
> 
> place.append(line.title().strip())
> 
> which works for most lines, giving me, for example:
> 
> Anvers-Le-Homont, Maine.
> and
> Andely (Le Petit), Normandie.
> 
> but where there are diacritics involved, title() gives me:
> 
> AngoulÊMe, Angoumois.
> 
> Can anyone give the clueless a clue on how to manage such unicode 
> strings more effectively?

First, don't confuse unicode and utf-8. What you have is utf-8: a byte 
string that encodes unicode characters, not a unicode string.
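
The difference shows up as soon as you count. A minimal sketch (the 
variable names are only for illustration): the single character Ê is one 
item in a unicode string, but two bytes once it is encoded as utf-8.

```python
# One character, U+00CA (capital E with circumflex), written with an
# escape so the example does not depend on the source file encoding.
char = u'\xca'

# Encoding turns the character into bytes; utf-8 needs two for this one.
utf8_bytes = char.encode('utf-8')

print(len(char))        # length in characters
print(len(utf8_bytes))  # length in bytes

# Decoding the bytes gives back the original character.
assert utf8_bytes.decode('utf-8') == char
```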

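Second, title() on a byte string only knows about ASCII letters, so the 
two bytes of the Ê look like a break between words. Convert the string 
to unicode and then title-case it, then convert back to utf-8 if you 
need to: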
In [3]: s='ANGOUL\303\212ME, Angoumois'
In [5]: s
Out[5]: 'ANGOUL\xc3\x8aME, Angoumois'
In [4]: s.title()
Out[4]: 'Angoul\xc3\x8aMe, Angoumois'
In [10]: print s.title()
AngoulÊMe, Angoumois
In [6]: u=s.decode('utf-8')
In [7]: u.title()
Out[7]: u'Angoul\xeame, Angoumois'
In [8]: print u.title()
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode 
character u'\xea' in position 6: ordinal not in range(128)

Oops, print is trying to convert the unicode string to a byte string 
with the default encoding (ascii here), so we have to give it some help:

In [9]: print u.title().encode('utf-8')
Angoulême, Angoumois
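
Putting the steps together for the original loop, here is a rough 
sketch. The helper name title_utf8 is made up for this example, and the 
list of lines stands in for whatever file you are reading; it assumes 
each line really is valid utf-8.

```python
def title_utf8(raw):
    """Title-case a utf-8 byte string and return utf-8 bytes.

    title_utf8 is a hypothetical helper for this sketch, not a
    library function.
    """
    text = raw.decode('utf-8')      # bytes -> unicode
    titled = text.title()           # casing now works per character
    return titled.encode('utf-8')   # unicode -> bytes


# Stand-in for the lines read from the file, as utf-8 byte strings.
lines = [
    b'ANVERS-LE-HOMONT, Maine.\n',
    b'ANGOUL\xc3\x8aME, Angoumois.\n',
    b'ANDELY (le Petit), Normandie.\n',
]

place = []
for line in lines:
    place.append(title_utf8(line.strip()))
```

If the rest of the program can work with unicode strings, it is simpler 
to decode once when reading and encode only when writing output.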

Kent

