utf - string translation
John Machin
sjmachin at lexicon.net
Wed Nov 29 16:52:16 EST 2006
Fredrik Lundh wrote:
> John Machin wrote:
>
> > 3. ... and to check for missing maps. The OP may be working only with
> > French text, and may not care about Icelandic and German letters, but
> > other readers who stumble on this (and miss past thread(s) on this
> > topic) may like something done with \xde (capital thorn), \xfe (small
> > thorn) and \xdf (sharp s aka Eszett).
>
> I did post links to code that does this to this thread, several days ago...
>
Ah yes, I missed that -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)
This code
(http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py)
looks generally very good, but I'm left wondering why "AE" and "OE" are in
the table, not "Ae" and "Oe":
[snip]
0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
0xd0: u"D", # LATIN CAPITAL LETTER ETH
0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]
Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
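A quick way to see why the entry is needed, written as a sketch in Python 3 syntax rather than the Python 2 of the linked module: U+0141 has no Unicode decomposition, so a table built only from decompositions cannot reach it. The dict name `extra` is just for illustration and is not part of unaccent.py.

```python
import unicodedata

# An accented letter such as E WITH ACUTE decomposes into base letter + mark.
print(repr(unicodedata.decomposition("\u00c9")))  # '0045 0301'

# L WITH STROKE has no decomposition, so there is no base letter to fall back on.
print(repr(unicodedata.decomposition("\u0141")))  # ''

# A hand-written entry, like those in the no-decomposition table, fills the gap.
extra = {0x0141: "L"}  # hypothetical stand-alone table for this example
print("\u0141ukasziewicz".translate(extra))  # Lukasziewicz
```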
It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():
LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()
This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
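Here is a minimal sketch of that name-based idea, in Python 3 syntax. The function name `guess_ascii_entries`, the regular expression and the default code point range are all assumptions made for illustration; none of it comes from unaccent.py. It only proposes candidates for characters that have no decomposition, and its output still needs the eyeballing mentioned above.

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE" and captures
# the case and the single base letter. Names without "WITH" (AE, ETH,
# THORN, SHARP S) deliberately do not match.
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")

def guess_ascii_entries(start=0x00C0, stop=0x0250):
    """Return {code point: ASCII letter} guessed from character names.

    Hypothetical helper: covers only characters with no decomposition,
    since the others can be handled by decomposing them.
    """
    table = {}
    for code in range(start, stop):
        ch = chr(code)
        if unicodedata.decomposition(ch):
            continue  # a decomposition-based table already covers this one
        match = _NAME_RE.match(unicodedata.name(ch, ""))
        if match is None:
            continue  # unassigned, or name does not fit the pattern
        case, letter = match.groups()
        table[code] = letter if case == "CAPITAL" else letter.lower()
    return table

candidates = guess_ascii_entries()
print(candidates[0x0141], candidates[0x0142])  # L l
print(candidates[0x00D8])  # O  <- the O WITH STROKE special case, needs review
```

Note that the sketch blindly proposes "O" for LATIN CAPITAL LETTER O WITH STROKE, which is exactly the kind of entry a human would have to override.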
Cheers,
John