utf - string translation

John Machin sjmachin at lexicon.net
Wed Nov 29 18:08:05 EST 2006


Fredrik Lundh wrote:
> John Machin wrote:
>
> > Another point: there are many non-latin1 characters that could be
> > mapped to ASCII. For example:
> >     u"\u0141ukasziewicz".translate(unaccented_map())
> > doesn't work unless an entry is added to the no-decomposition table:
> >     0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
> >
> > It looks like generating extra entries like that could be done, with
> > the aid of unicodedata.name():
> >
> > LATIN CAPITAL LETTER X WITH blahblah -> "X"
> > LATIN SMALL LETTER X WITH blahblah -> "X".lower()
> >
> > This would require a fair bit of care -- obviously there are special
> > cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
> > experts is probably required.
>
> see the comments over at
>
>      http://effbot.org/zone/unicode-convert.htm

Don't rush me, I was getting to that next :-)

>
> for an extended table, eyeballed by a regional expert (and since he
> makes the same point about OE vs Oe as you do, I'll probably have to
> change the code ;-)
>

Slightly extended. My point is that there are a large number of LATIN
(CAPITAL|SMALL) LETTER X WITH twiddly-bits characters that don't have a
decomposition; the table entries for those could be generated
automatically from their names.
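
Something along these lines might do as a starting point. It is only a
rough sketch, written for Python 3 (chr() and str.translate with a dict
of code points); the function name generate_candidates and the default
code point range are my own assumptions, not anything in the effbot
code. The output is a list of candidates for eyeballing, not a finished
table: it will happily map O WITH STROKE to plain "O", which is exactly
the sort of special case that needs a human decision.

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE".
# Only a single-letter base is accepted, so ETH, THORN, AE and friends
# (no single base letter followed by " WITH ") are left for manual entries.
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")


def generate_candidates(start=0x00C0, stop=0x0250):
    """Return {code point: ASCII letter} for undecomposable Latin letters.

    The default range (Latin-1 Supplement up to the end of Latin
    Extended-B) is an arbitrary choice for illustration.
    """
    table = {}
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.decomposition(ch):
            continue  # has a decomposition, so it is handled elsewhere
        match = _NAME_RE.match(unicodedata.name(ch, ""))
        if match is None:
            continue
        case, letter = match.groups()
        table[cp] = letter if case == "CAPITAL" else letter.lower()
    return table


candidates = generate_candidates()
print("\u0141ukasziewicz".translate(candidates))
```

Printing sorted(candidates.items()) alongside unicodedata.name() for
each code point gives a list that a regional expert can check line by
line before anything is merged into the real table.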

As well as regional experts, Google can be handy: googling for Thord,
Thordh, Thordsson and Thordhsson and noting the number of hits for each
tends to indicate that you and I are right about the treatment of
"eth"; Marcin's "dh" might better indicate how it's pronounced, but "d"
is AFAICT the standard transcription.

Cheers,
John



