[Tutor] converting encoded symbols from rss feed?
Kent Johnson
kent37 at tds.net
Fri Jun 19 02:07:24 CEST 2009
2009/6/18 Serdar Tumgoren <zstumgoren at gmail.com>:
>> In [7]: print x.encode('cp437')
>> ------> print(x.encode('cp437'))
>> abc░
>>
> So does this mean that my python install is incapable of encoding the
> en/em dash?
No, the problem is with the print, not the encoding. Your console, as
configured, is incapable of displaying the em dash.
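A rough way to check this yourself is sketched below. The helper name can_display is made up for illustration, and the sample string is an assumption, not your actual feed data. It separates the two questions: can the text be encoded at all, and can the console's own encoding represent it?

```python
import sys

def can_display(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = u"abc\u2014"  # hypothetical sample: "abc" followed by an em dash

# Encoding itself succeeds; repr() keeps the printed output ASCII-only.
print(repr(sample.encode("utf-8")))

# sys.stdout.encoding may be None when output is piped, so fall back to ASCII.
console_encoding = sys.stdout.encoding or "ascii"
print(console_encoding, can_display(sample, console_encoding))
```

If the last line reports False, the console is the limiting factor, not Python. For instance, can_display(u"\u2014", "cp437") is False because cp437 has no em dash.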
> But for some reason, I can't seem to get my translate_code function to
> work inside the same loop as Mr. Lundh's html cleanup code. Below is
> the problem code:
>
> infile = open('test.txt','rb')
> outfile = open('test_cleaned.txt','wb')
>
> for line in infile:
>     try:
>         newline = strip_html(line)
>         cleanline = translate_code(newline)
>         outfile.write(cleanline)
>     except:
>         newline = "NOT CLEANED: %s" % line
>         outfile.write(newline)
>
> infile.close()
> outfile.close()
>
> The strip_html function, documented here
> (http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
> string as far as I can tell. I'm confused why I wouldn't be able to
> further manipulate the string with the "translate_code" function and
> store the result in the "cleanline" variable. When I try this
> approach, none of the translations succeed and I'm left with the same
> HTML gook in the "outfile".
Your try/except is hiding the problem. What happens if you take it
out? What error do you get?
My guess is that strip_html() is returning unicode and
translate_code() is expecting byte strings, but I'm not sure without
seeing the error.
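If you want to keep the "NOT CLEANED" fallback while still seeing what went wrong, one option is to record the traceback instead of swallowing it. This is only a sketch: clean_lines is a hypothetical helper, and it takes your strip_html and translate_code as arguments because I haven't seen their code.

```python
import traceback

def clean_lines(lines, strip_html, translate_code):
    """Clean each line; keep the traceback for any line that fails."""
    cleaned = []
    errors = []
    for lineno, line in enumerate(lines, 1):
        try:
            cleaned.append(translate_code(strip_html(line)))
        except Exception:
            # Record the full traceback so the real error stays visible.
            errors.append((lineno, traceback.format_exc()))
            cleaned.append("NOT CLEANED: %s" % line)
    return cleaned, errors
```

Printing the first entry of errors should show which function raised and with what exception type, which would confirm or rule out the unicode guess.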
Kent