[Tutor] UTF-8 title() string method
Kent Johnson
kent37 at tds.net
Wed Jul 4 22:33:11 CEST 2007
Jon Crump wrote:
> Dear All,
>
> I have some utf-8 unicode text with lines like this:
>
> ANVERS-LE-HOMONT, Maine.
> ANGOULÊME, Angoumois.
> ANDELY (le Petit), Normandie.
>
> which I'm using as-is in this line of code:
>
> place.append(line.strip())
>
> What I would prefer would be something like this:
>
> place.append(line.title().strip())
>
> which works for most lines, giving me, for example:
>
> Anvers-Le-Homont, Maine.
> and
> Andely (Le Petit), Normandie.
>
> but where there are diacritics involved, title() gives me:
>
> AngoulÊMe, Angoumois.
>
> Can anyone give the clueless a clue on how to manage such unicode
> strings more effectively?
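First, don't confuse unicode and utf-8. Unicode is the character set;
utf-8 is one way of encoding those characters as bytes. Your text is a
utf-8 byte string, and title() on a byte string doesn't know that the
two bytes \xc3\x8a together make up one letter. It sees a non-letter,
so it treats the M that follows as the start of a new word.
Second, decode the string to unicode, title-case it, then encode back
to utf-8 if you need to: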
In [3]: s='ANGOUL\303\212ME, Angoumois'
In [5]: s
Out[5]: 'ANGOUL\xc3\x8aME, Angoumois'
In [4]: s.title()
Out[4]: 'Angoul\xc3\x8aMe, Angoumois'
In [10]: print s.title()
AngoulÊMe, Angoumois
In [6]: u=s.decode('utf-8')
In [7]: u.title()
Out[7]: u'Angoul\xeame, Angoumois'
In [8]: print u.title()
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode
character u'\xea' in position 6: ordinal not in range(128)
Oops, print is trying to convert the unicode string to a byte string
using the default encoding (ascii), so I have to give it some help:
In [9]: print u.title().encode('utf-8')
Angoulême, Angoumois
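To tie the steps above together, here is a minimal sketch of a helper
that does the decode / title-case / encode round trip in one place. The
name title_utf8 is made up for illustration, and the b'...' literals
assume Python 2.6 or later:

```python
def title_utf8(raw):
    """Title-case a utf-8 encoded byte string; return utf-8 bytes.

    Hypothetical helper, not part of any library.
    """
    text = raw.decode('utf-8')           # bytes -> unicode characters
    return text.title().encode('utf-8')  # title-case, then back to bytes

raw = b'ANGOUL\xc3\x8aME, Angoumois.'

# Calling title() on the bytes directly reproduces the original problem.
broken = raw.title()

# Going through unicode gives the expected result.
fixed = title_utf8(raw)
```

In your loop that would be something like
place.append(title_utf8(line.strip())), assuming line is still the raw
utf-8 byte string read from the file.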
Kent