[Tutor] converting encoded symbols from rss feed?
zstumgoren at gmail.com
Fri Jun 19 00:33:18 CEST 2009
> The example is written assuming the console encoding is utf-8. Yours
> seems to be cp437. Try this:
> In : import sys
> In : sys.stdout.encoding
> Out: 'cp437'
That is indeed the result that I get as well.
> But there is another problem - \u2013 is an en dash which does not
> appear in cp437, so even giving the correct encoding doesn't work. Try
> In : x = u"abc\u2591"
> In : print x.encode('cp437')
> ------> print(x.encode('cp437'))
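For anyone following along, here is a small sketch of how one might test whether a character survives a given console encoding. The helper name try_encode is made up for illustration. U+2591 (a shaded block character) is part of cp437, while U+2013 (the en dash) is not, so the second call falls into the except branch. Passing 'replace' as the error handler is one way to avoid the exception, at the cost of getting a '?' in place of the character.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# U+2591 exists in cp437, so this returns bytes.
shade = try_encode(u"abc\u2591", "cp437")

# U+2013 does not exist in cp437, so this returns None.
dash = try_encode(u"abc\u2013", "cp437")

# The 'replace' error handler substitutes '?' instead of raising.
lossy = u"abc\u2013".encode("cp437", "replace")

print(repr(shade))
print(repr(dash))
print(repr(lossy))
```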
So does this mean that my python install is incapable of encoding the
For the time being, I've gone with treating the symptom rather than
the root problem and created a translate function.
text = text.replace("‘","'")
text = text.replace("’","'")
text = text.replace("“",'"')
text = text.replace("”",'"')
text = text.replace("–","-")
text = text.replace("—","--")
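A rough sketch of how such a translate function could be written follows. It keeps the same six substitutions but spells the characters as Unicode escapes, so the result does not depend on how the source file itself is encoded. The name PUNCTUATION is made up for illustration, and the function assumes it is handed a unicode string rather than raw bytes.

```python
# Hypothetical lookup table; the entries mirror the six replace() calls above.
PUNCTUATION = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
    u"\u2013": u"-",   # en dash
    u"\u2014": u"--",  # em dash
}


def translate_code(text):
    """Replace typographic quotes and dashes in a unicode string with ASCII."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    return text


print(translate_code(u"\u201cHi\u201d \u2013 it\u2019s"))
```

If the input is still a byte string, none of these replacements can match, because the bytes in the file are not the same thing as the characters in the table. Decoding first, for example with line.decode('cp1252'), is one way around that.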
Which of course has led to a new problem. I'm first using Fredrik
Lundh's code to strip out random HTML gobbledygook, then running my
translate function over the file to replace the windows-1252 encoded
But for some reason, I can't seem to get my translate_code function to
work inside the same loop as Mr. Lundh's html cleanup code. Below is
the problem code:
infile = open('test.txt','rb')
outfile = open('test_cleaned.txt','wb')
for line in infile:
    newline = strip_html(line)
    cleanline = translate_code(newline)
    newline = "NOT CLEANED: %s" % line
The strip_html function, documented here
(http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
string as far as I can tell. I'm confused why I wouldn't be able to
further manipulate the string with the "translate_code" function and
store the result in the "cleanline" variable. When I try this
approach, none of the translations succeed and I'm left with the same
HTML gook in the "outfile".
Is there some way to combine these functions so I can perform all the
processing in one pass?
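One possible shape for a single-pass version is sketched below, under several assumptions: the input bytes are windows-1252, the output should be UTF-8, and the code runs on Python 3. strip_html_simple is a simplified stand-in written for this example, not Mr. Lundh's function, and clean_line, clean_file and PUNCTUATION are likewise made-up names. The idea is to decode each line to unicode once, strip tags and entities, and only then translate the punctuation.

```python
import html
import io
import re

TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical lookup table of typographic characters and ASCII stand-ins.
PUNCTUATION = {
    u"\u2018": u"'", u"\u2019": u"'",
    u"\u201c": u'"', u"\u201d": u'"',
    u"\u2013": u"-", u"\u2014": u"--",
}


def strip_html_simple(text):
    """Simplified stand-in: drop tags, then resolve entities like &ndash;."""
    return html.unescape(TAG_RE.sub(u"", text))


def translate_code(text):
    """Replace typographic quotes and dashes with ASCII equivalents."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    return text


def clean_line(raw, encoding="cp1252"):
    """Decode one raw line (assumed cp1252), strip HTML, fix punctuation."""
    text = raw.decode(encoding, "replace")
    return translate_code(strip_html_simple(text))


def clean_file(in_path, out_path, encoding="cp1252"):
    """Clean every line of in_path and write the result to out_path as UTF-8."""
    with io.open(in_path, "rb") as infile, \
            io.open(out_path, "w", encoding="utf-8", newline="") as outfile:
        for raw in infile:
            outfile.write(clean_line(raw, encoding))


print(clean_line(b"<p>\x93Hi\x94 &ndash; it\x92s</p>\n"))
```

The translation runs after the HTML step on purpose: entities such as &ndash; only turn into real dash characters once they have been unescaped, so translating first would miss them.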