utf - string translation

John Machin sjmachin at lexicon.net
Wed Nov 22 20:49:22 CET 2006


hg wrote:
> Duncan Booth wrote:
> > hg <hg at nospam.com> wrote:
> >
> >>> or in other words, put this at the top of your file (where "utf-8" is
> >>> whatever your editor/system is using):
> >>>
> >>>    # -*- coding: utf-8 -*-
> >>>
> >>> and use
> >>>
> >>>    u'<text>'
> >>>
> >>> for all non-ASCII literals.
> >>>
> >>> </F>
> >>>
> >> Hi,
> >>
> >> The problem is that:
> >>
> >> # -*- coding: utf-8 -*-
> >> import string
> >> print len('a')
> >> print len('à')
> >>
> >> returns 1 then 2
> >
> > And if you do what was suggested and write:
> >
> > # -*- coding: utf-8 -*-
> > import string
> > print len(u'a')
> > print len(u'à')
> >
> > then you get:
> >
> > 1
> > 1

Some general comments:

1. There has been at least one thread on the subject of ripping accents
off Latin1 characters in the last 3 or 4 months. Try Google.

2. About your earlier problem, when len(thing1) != len(thing2):
In that and similar situations, it can be *very* useful to use this
technique:
    print repr(thing1), type(thing1)
    print repr(thing2), type(thing2)
Go back now and try it out!
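
As a minimal sketch of that technique: thing1 and thing2 below are only
stand-ins for your own values, and the escapes and the __future__ import
are my assumptions, there so that the same lines run on Python 2.6+ and
Python 3 regardless of the source file's encoding.

```python
from __future__ import print_function  # same syntax on Python 2.6+ and 3

thing1 = u'\xe0'.encode('utf-8')  # a-grave as UTF-8 bytes (a byte string)
thing2 = u'\xe0'                  # the same character as a unicode object

# repr() shows what is really stored; type() shows which kind of string it is.
print(repr(thing1), type(thing1), len(thing1))
print(repr(thing2), type(thing2), len(thing2))
```

On Python 2 the first line reports a str of length 2 and the second a
unicode of length 1. On Python 3 the types are called bytes and str, but
the lengths are the same.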

> OK,
>
> How would you handle the string.maketrans then ?
>

I suggest that you first read the documentation on the str and unicode
"translate" methods.
You can obtain this quickly at the interactive prompt by doing
    help(''.translate)
and
    help(u''.translate)
respectively.
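
The main difference you should find there is the shape of the table. A
rough sketch, assuming a two-character mapping invented purely for
illustration (the try/except is only there so the example also runs on
Python 3, where string.maketrans became bytes.maketrans):

```python
try:
    from string import maketrans   # Python 2
except ImportError:
    maketrans = bytes.maketrans    # Python 3.1+

# Byte-string flavour: a 256-byte table, one output byte per input byte.
# The input bytes here are latin1 a-grave and e-acute.
byte_table = maketrans(b'\xe0\xe9', b'ae')
byte_result = b'd\xe9j\xe0'.translate(byte_table)

# Unicode flavour: a dict keyed by code point (an integer).
uni_table = {0xE0: u'a', 0xE9: u'e'}
uni_result = u'd\xe9j\xe0'.translate(uni_table)
```

Both results spell "deja"; the first is a byte string, the second a
unicode object.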

Next steps:

Is your *real* data (not the examples you were hard-coding earlier)
encoded (latin1, utf8) in str objects or is it in unicode objects?
After reading previous posts my head is spinning and I'm not going to
guess; you determine it yourself.

[pseudocode -- blend of Pythonic & Knuthian styles]
if latin1:
    (A) you can use string.maketrans and str.translate immediately.

elif unicode:
    (B) either (1) encode to latin1; goto (A)
        or (2) use unicode.translate with do-it-yourself mapping

elif utf8:
    decode to unicode; goto (B)

else:
    ???
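
One way the branches above might look in real code. This is a sketch
only: strip_accents, its encoding argument, and the four-character
mapping are all hypothetical names and data made up for illustration,
and branch (B) takes option (2). It is written to run on Python 2.6+
and Python 3.

```python
try:
    from string import maketrans   # Python 2
except ImportError:
    maketrans = bytes.maketrans    # Python 3.1+

# Hypothetical, deliberately tiny mapping; extend it to cover your data.
ACCENTED = u'\xe0\xe9\xe8\xe7'     # a-grave, e-acute, e-grave, c-cedilla
PLAIN = u'aeec'

LATIN1_TABLE = maketrans(ACCENTED.encode('latin-1'), PLAIN.encode('latin-1'))
UNICODE_TABLE = dict((ord(a), p) for a, p in zip(ACCENTED, PLAIN))

def strip_accents(data, encoding=None):
    """encoding is 'latin-1', 'utf-8', or None for a unicode object."""
    if encoding is None:                       # (B), option (2)
        return data.translate(UNICODE_TABLE)
    if encoding == 'latin-1':                  # (A)
        return data.translate(LATIN1_TABLE)
    if encoding == 'utf-8':                    # decode to unicode; goto (B)
        return data.decode('utf-8').translate(UNICODE_TABLE)
    raise ValueError('unknown encoding: %r' % (encoding,))   # else: ???
```

Note that the utf-8 branch hands back a unicode object, not bytes; encode
the result again if the rest of your program expects encoded data.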

HTH,
John



