cleaning up an ASCII file?
John Machin
sjmachin at lexicon.net
Thu Jun 11 00:58:28 EDT 2009
Nick Matzke <matzke <at> berkeley.edu> writes:
>
>
> Looks like this was a solution:
>
> 1. Use this guy's unescape function to convert from HTML/XML Entities to
> unicode
> http://effbot.org/zone/re-sub.htm#unescape-html
Looks like you didn't notice "this guy"'s unaccent.py :-)
http://effbot.org/zone/unicode-convert.htm
[Aside: Has anyone sighted the effbot recently? He's been very quiet.]
> 2. Take the unicode and convert to approximate plain ASCII matches with
> unicodedata (after import unicodedata)
>
> ascii_content2 = unescape(line)
>
> ascii_content = unicodedata.normalize('NFKD',
> unicode(ascii_content2)).encode('ascii','ignore')
The normalize hack gets you only so far. Many Latin-based characters (e.g.
ø, ł, ß) have no Unicode decomposition, so NFKD leaves them unchanged and
encode('ascii', 'ignore') then silently drops them. Look for the thread in
this newsgroup with subject "convert unicode characters to visibly similar
ascii characters" around 2008-07-01 or google("hefferon unicode2ascii")
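To make the limitation concrete, here is a minimal sketch, assuming Python 3
(where html.unescape from the standard library stands in for the effbot
unescape function). The FALLBACK table and the names normalize_only and
to_ascii are made up for illustration; the table is deliberately tiny and is
not the mapping from the thread mentioned above.

```python
import html
import unicodedata

# Hypothetical, deliberately tiny fallback table for characters that have
# no Unicode decomposition. A usable table would need many more entries.
FALLBACK = {
    "Ø": "O", "ø": "o",
    "Ł": "L", "ł": "l",
    "Đ": "D", "đ": "d",
    "Æ": "AE", "æ": "ae",
    "ß": "ss",
}


def normalize_only(text):
    """The 'normalize hack': NFKD-decompose, then drop all non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_ascii(text):
    """Unescape entities, apply the fallback table, then the normalize hack."""
    text = html.unescape(text)
    text = "".join(FALLBACK.get(ch, ch) for ch in text)
    return normalize_only(text)


sample = "&Oslash;resund, Stra&szlig;e, &#321;&oacute;d&#378;, caf&eacute;"

# Normalize alone loses the undecomposable letters entirely.
print(normalize_only(html.unescape(sample)))  # resund, Strae, odz, cafe

# With the fallback table applied first, they survive as ASCII lookalikes.
print(to_ascii(sample))                       # Oresund, Strasse, Lodz, cafe
```

Anything that is neither decomposable nor in the table still vanishes without
warning, so it may be worth logging the characters that get dropped rather
than trusting the output blindly.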
Alternative: If you told us which platform you are running on, people familiar
with that platform could help you set up your terminal to display non-ASCII
characters correctly.
HTH,
John