Least-lossy string.encode to us-ascii?
Christian Heimes
lists at cheimes.de
Thu Sep 13 18:00:45 EDT 2012
On 13.09.2012 23:26, Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit). I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc. So
> "serviço móvil" becomes "servico movil". Is there anything stock
> that I've missed? I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.
The unidecode [1] package contains a large mapping of Unicode characters
to ASCII. It even supports cool stuff like Chinese to ASCII:
>>> import unidecode
>>> print u"\u5317\u4EB0"
北亰
>>> print unidecode.unidecode(u"\u5317\u4EB0")
Bei Jing
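For the Portuguese case in the question, a stdlib-only sketch is also
possible. This is not part of unidecode, and strip_accents is a
hypothetical helper name: it assumes that NFKD normalization followed by
an ASCII encode that ignores the leftover combining marks is good enough
for the text at hand.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: decompose each character into base letter plus
    # combining marks (e.g. c-cedilla -> "c" + combining cedilla), then
    # drop everything that cannot be encoded as 7-bit ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents(u"servi\u00e7o m\u00f3vil"))  # servico movil
```

Characters that have no decomposition (for example "ß" or "ø") are
dropped silently by this approach, which is where a mapping table like
the one in unidecode keeps more information.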
icu4c and pyicu [2] may contain more methods for conversion, but they
require binary extensions. By the way, ICU can do a lot of cool stuff, too:
>>> import icu
>>> rbf = icu.RuleBasedNumberFormat(icu.URBNFRuleSetTag.SPELLOUT,
...                                 icu.Locale.getUS())
>>> rbf.format(23)
u'twenty-three'
>>> rbf.format(100000)
u'one hundred thousand'
Regards,
Christian
[1] http://pypi.python.org/pypi/Unidecode/0.04.9
[2] http://pypi.python.org/pypi/PyICU/1.4