[Tutor] unicode decode/encode issue
Steven D'Aprano
steve at pearwood.info
Mon Sep 26 20:42:14 EDT 2016
I'm sorry, I have misinterpreted your question.
On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:
> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii.
Why would you do that? It's 2016, not 1963, and ASCII is well and truly
obsolete. (ASCII was obsolete even in 1963: even then there were
characters in common use in American English that couldn't be written in
ASCII, like ¢.) Any modern program should be dealing with UTF-8.
Nevertheless, assuming you have a good reason, you are dealing with data
scraped from a webpage, so it is likely to include HTML escape codes, as
you have already learned. So you need to go from something like this:
&ndash;
*first* to the actual EN DASH character, and then to the - hyphen. And
remember that HTML supports all(?) of Unicode via character escapes, so
you shouldn't assume that this is the only Unicode character.
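
As a rough illustration of that point, here is a small sketch showing that
the same character can arrive as a named, decimal or hexadecimal escape.
It assumes Python 3, where the standard library offers html.unescape();
the sample strings are made up for the example.

```python
# Python 3 sketch: html.unescape() decodes named, decimal and hex references.
from html import unescape

# Hypothetical sample escapes; the first three all mean U+2013 EN DASH.
samples = ["&ndash;", "&#8211;", "&#x2013;", "&eacute;", "&amp;"]
decoded = [unescape(s) for s in samples]

for raw, char in zip(samples, decoded):
    # repr() makes the decoded character visible even on an ASCII terminal
    print(raw, "->", repr(char))
```

So a single hard-coded replacement of one escape is unlikely to cover
everything a real page can contain.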
Assuming you scrape the data from the webpage as a byte string, you'll
have something like this:
data = "hello world &ndash; goodbye" # byte-string read from HTML page
from HTMLParser import HTMLParser
parser = HTMLParser()
text = parser.unescape(data)
print text
which should display:
hello world – goodbye
including the en-dash. So now you have a Unicode string, which you can
manipulate any way you like:
text = text.replace(u'–', u'-') # remember to use Unicode strings here
See also the text.translate() method if you have to do lots of changes
in one go.
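
A minimal sketch of the translate() approach, assuming Python 3 (where
str.translate() accepts a dict keyed by code point). The table and the
function name asciify_punctuation are illustrative choices, not a
complete or standard mapping.

```python
# Python 3 sketch: map several typographic characters to ASCII in one pass.
# The table below is an illustrative choice, not a complete list.
replacements = {
    0x2013: "-",    # EN DASH
    0x2014: "--",   # EM DASH
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x00A0: " ",    # NO-BREAK SPACE
}

def asciify_punctuation(text):
    """Return text with the characters in `replacements` substituted."""
    return text.translate(replacements)

cleaned = asciify_punctuation("hello world \u2013 \u201cgoodbye\u201d")
print(cleaned)
```

Characters that are not in the table pass through unchanged, so anything
you did not anticipate still needs handling when you encode.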
Lastly you can convert to an ASCII byte-string using the encode method.
By default, this will raise an exception (UnicodeEncodeError) if there
are any non-ASCII characters in your text string:
data = text.encode('ascii')
You can also skip non-ASCII characters, replace them with question
marks, or replace them with an escape code:
data = text.encode('ascii', 'ignore')
data = text.encode('ascii', 'replace')
data = text.encode('ascii', 'xmlcharrefreplace')
which will finally give you something suitable for use in programs
written in the 1970s :-)
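
To compare the error handlers side by side, here is a small sketch,
assuming Python 3 and a made-up sample string containing two non-ASCII
characters (é and an en-dash):

```python
# Python 3 sketch: the same text encoded to ASCII with each error handler.
text = "caf\u00e9 \u2013 ok"   # hypothetical sample: e-acute and EN DASH

strict_error = None
try:
    text.encode("ascii")       # default handler is 'strict'
except UnicodeEncodeError as err:
    strict_error = err
    print("strict:", err)

ignored = text.encode("ascii", "ignore")              # drops the characters
replaced = text.encode("ascii", "replace")            # substitutes '?'
xmlref = text.encode("ascii", "xmlcharrefreplace")    # numeric escapes

print(ignored)
print(replaced)
print(xmlref)
```

Only xmlcharrefreplace keeps enough information to recover the original
characters later; the other two handlers lose it.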
--
Steve