[Tutor] converting encoded symbols from rss feed?

Fri Jun 19 02:07:24 CEST 2009

2009/6/18 Serdar Tumgoren <zstumgoren at gmail.com>:

>> In [7]: print x.encode('cp437')
>> ------> print(x.encode('cp437'))
>> abc░
>>
> So does this mean that my python install is incapable of encoding the
> en/em dash?

No, the problem is with the print, not the encoding. Your console, as
configured, is incapable of displaying the em dash.
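
To see the difference between "can be encoded" and "can be shown on
this console", here is a rough sketch. The can_encode() helper is a
made-up name for illustration, and cp437 is only assumed to be the
console's code page because that is the codec used above.

```python
# Sketch only: can_encode() is a hypothetical helper, not a library call.
def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

sample = u'abc\u2014'  # 'abc' followed by an em dash (U+2014)

print(can_encode(sample, 'utf-8'))  # UTF-8 can represent any character
print(can_encode(sample, 'cp437'))  # cp437 has no em dash in its table
```

The string itself is fine either way; only the choice of output codec
decides whether the dash survives.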

> But for some reason, I can't seem to get my translate_code function to
> work inside the same loop as Mr. Lundh's html cleanup code. Below is
> the problem code:
>
> infile = open('test.txt','rb')
> outfile = open('test_cleaned.txt','wb')
>
> for line in infile:
>    try:
>        newline = strip_html(line)
>        cleanline = translate_code(newline)
>        outfile.write(cleanline)
>    except:
>        newline = "NOT CLEANED: %s" % line
>        outfile.write(newline)
>
> infile.close()
> outfile.close()
>
> The strip_html function, documented here
> (http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
> string as far as I can tell. I'm confused why I wouldn't be able to
> further manipulate the string with the "translate_code" function and
> store the result in the "cleanline" variable. When I try this
> approach, none of the translations succeed and I'm left with the same
> HTML gook in the "outfile".

Your try/except is hiding the problem. What happens if you take it
out? What error do you get?
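
If you would rather keep the fallback while debugging, a minimal
sketch is to catch Exception and print the traceback before writing
the "NOT CLEANED" line. The clean_lines() function and its arguments
are invented for this example; they stand in for your loop and your
cleanup functions.

```python
import traceback

# Sketch only: clean_lines() is a hypothetical stand-in for the loop.
def clean_lines(lines, clean):
    """Apply clean() to each line, reporting any error instead of hiding it."""
    results = []
    for line in lines:
        try:
            results.append(clean(line))
        except Exception:
            # Show the real error on stderr, then fall back as before.
            traceback.print_exc()
            results.append("NOT CLEANED: %s" % line)
    return results
```

The traceback names the exception and the line that raised it, which
is the information a bare except throws away.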

My guess is that strip_html() is returning unicode and
translate_code() is expecting byte strings, but I'm not sure without
seeing the error.
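
If that guess turns out to be right, one common approach is to decode
the bytes read from the file once, work in unicode throughout, and
encode again only when writing. A small sketch follows; to_text() is a
hypothetical helper, and UTF-8 is an assumption about your file, not
something I know.

```python
# Sketch only: to_text() is a hypothetical helper; UTF-8 is an assumption.
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b'abc\xe2\x80\x94'      # UTF-8 bytes for 'abc' plus an em dash
text = to_text(raw)           # unicode, safe to pass between functions
encoded = text.encode('utf-8')  # back to bytes for a file opened in 'wb'
```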

Kent