Problem with lower() for unicode strings in Russian

konstantin konstantin.selivanov at gmail.com
Mon Oct 6 07:35:36 EDT 2008


On Oct 6, 8:39 am, Alexey Moskvin <d... at inbox.ru> wrote:
> Martin, thanks for the fast reply, now everything is OK!
> On Oct 6, 1:30 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>
> > > I have a set of strings (all letters are capitalized) in utf-8,
>
> > That's the problem. If these are really utf-8 encoded byte strings,
> > then .lower likely won't work. It uses the C library's tolower API,
> > which works on a byte level, i.e. can't work for multi-byte encodings.
>
> > What you need to do is to operate on Unicode strings. I.e. instead
> > of
>
> >   s.lower()
>
> > do
>
> >   s.decode("utf-8").lower()
>
> > or (if you need byte strings back)
>
> >   s.decode("utf-8").lower().encode("utf-8")
>
> > If you find that you write the latter, I recommend that you redesign
> > your application. Don't use byte strings to represent text, but use
> > Unicode strings all the time, except at the system boundary (where
> > you decode/encode as appropriate).
>
> > There are some limitations with Unicode .lower also, but I don't
> > think they apply to Russian (specifically, SpecialCasing.txt is
> > not considered).
>
> > HTH,
> > Martin
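
To make Martin's advice concrete, here is a minimal sketch of the decode / lower / encode round trip. The helper name lower_utf8 and the sample string are my own and purely illustrative; it assumes the input really is valid utf-8.

```python
# -*- coding: utf-8 -*-

def lower_utf8(raw):
    """Lower-case a utf-8 encoded byte string (hypothetical helper)."""
    text = raw.decode("utf-8")           # bytes -> Unicode at the boundary
    return text.lower().encode("utf-8")  # lower-case as Unicode, then back to bytes

# Sample data (an assumption, not taken from the original question).
raw = u"ПРИВЕТ МИР".encode("utf-8")
lowered = lower_utf8(raw)
```

As Martin says, if the rest of the program can work with Unicode strings, the final encode is better left to the point where the text is written out.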

Alexey,

If your strings are stored in a text file, you can use the "codecs" module to read them as Unicode:

  import codecs
  # Open the file so that reads return Unicode strings decoded from utf-8
  handler = codecs.open('somefile', 'r', 'utf-8')
  # ... do the job
  handler.close()

I prefer this way of dealing with Russian text in utf-8.
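
A slightly fuller sketch of the same idea, assuming the file holds one capitalized string per line. The function name read_lowercased, the throwaway temporary file, and the sample city names are hypothetical, used only so the example runs on its own.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

def read_lowercased(path):
    """Return the lines of a utf-8 text file, lower-cased, without line endings."""
    handler = codecs.open(path, 'r', 'utf-8')
    try:
        # Each line is already a Unicode string, so .lower() handles Cyrillic.
        return [line.rstrip(u'\r\n').lower() for line in handler]
    finally:
        handler.close()

# Demo: write a small utf-8 file, read it back lower-cased, then clean up.
fd, path = tempfile.mkstemp()
os.close(fd)
out = codecs.open(path, 'w', 'utf-8')
out.write(u"МОСКВА\nКАЗАНЬ\n")
out.close()
result = read_lowercased(path)
os.remove(path)
```

Because decoding happens when the file is read, no explicit decode("utf-8") call is needed on the individual strings.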

Konstantin.




