cleaning up an ASCII file?

John Machin sjmachin at
Thu Jun 11 00:58:28 EDT 2009

Nick Matzke <matzke <at>> writes:

> Looks like this was a solution:
> 1. Use this guy's unescape function to convert from HTML/XML Entities to 
> unicode

Looks like you didn't notice who "this guy" is :-)

[Aside: Has anyone sighted the effbot recently? He's been very quiet.]

> 2. Take the unicode and convert to approximate plain ASCII matches with 
> unicodedata (after import unicodedata)
> ascii_content2 = unescape(line)
> ascii_content = unicodedata.normalize('NFKD', 
> unicode(ascii_content2)).encode('ascii','ignore')

The normalize hack gets you only so far. Many Latin-based characters (for
example ø, ł, ß, æ) have no Unicode decomposition, so NFKD leaves them
unchanged and encode('ascii', 'ignore') then silently drops them. Look for the
thread in this newsgroup with subject "convert unicode characters to visibly
similar ascii characters" around 2008-07-01, or google("hefferon unicode2ascii").
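
To illustrate where the normalize approach stops working, here is a minimal
sketch. It assumes Python 3 (str rather than the unicode type used in the
quoted code). The function names and the FALLBACK table are hypothetical,
made up for this example, and the table covers only a handful of characters;
it is not the mapping from the thread mentioned above.

```python
import unicodedata

def to_ascii_nfkd(text):
    """Decompose with NFKD, then discard every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Hypothetical, deliberately tiny table for characters that have no
# Unicode decomposition and would otherwise be dropped.
FALLBACK = {
    '\u00f8': 'o',   # o with stroke
    '\u00d8': 'O',   # O with stroke
    '\u0142': 'l',   # l with stroke
    '\u0141': 'L',   # L with stroke
    '\u00df': 'ss',  # sharp s
    '\u00e6': 'ae',  # ae ligature
    '\u00c6': 'AE',  # AE ligature
}

def to_ascii_with_fallback(text):
    """Apply the fallback table first, then the NFKD hack."""
    replaced = ''.join(FALLBACK.get(ch, ch) for ch in text)
    return to_ascii_nfkd(replaced)

print(to_ascii_nfkd('caf\u00e9'))                           # cafe
print(to_ascii_nfkd('sm\u00f8rrebr\u00f8d'))                # smrrebrd
print(to_ascii_with_fallback('sm\u00f8rrebr\u00f8d'))       # smorrebrod
```

The accented e survives as a plain "e" because it decomposes into "e" plus a
combining accent, while the o-with-stroke simply vanishes unless something
like the fallback table handles it first.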

Alternative: If you told us which platform you are running on, people familiar
with that platform could help you set up your terminal to display non-ASCII
characters correctly.


