[Tutor] unicode decode/encode issue

Steven D'Aprano steve at pearwood.info
Mon Sep 26 20:42:14 EDT 2016


I'm sorry, I have misinterpreted your question.

On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:

> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii. 

Why would you do that? It's 2016, not 1963, and ASCII is well and truly 
obsolete. (ASCII was obsolete even in 1963: even then there were 
characters in common use in American English that couldn't be written in 
ASCII, like ¢.) Any modern program should be dealing with UTF-8.

Nevertheless, assuming you have a good reason, you are dealing with data 
scraped from a webpage, so it is likely to include HTML escape codes, as 
you have already learned. So you need to go from something like this:

   &#8211;

*first* to the actual EN DASH character, and then to the - hyphen. And 
remember that HTML supports all(?) of Unicode via character escapes, so 
you shouldn't assume that this is the only Unicode character.

Assuming you scrape the data from the webpage as a byte string, you'll 
have something like this:

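from HTMLParser import HTMLParser  # Python 2 module name

data = "hello world &#8211; goodbye"  # byte-string read from HTML page
parser = HTMLParser()
# decode the UTF-8 bytes to Unicode first, then resolve the HTML escapes
text = parser.unescape(data.decode('utf-8'))
print text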


which should display:

hello world – goodbye


including the en-dash. So now you have a Unicode string, which you can 
manipulate any way you like:

# u'\u2013' is EN DASH; the escape avoids needing a source encoding declaration
text = text.replace(u'\u2013', u'--')  # remember to use Unicode strings here
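
For anyone following along on Python 3 rather than Python 2, here is a 
rough sketch of the same unescape-and-replace steps. It assumes the page 
arrives as UTF-8 bytes; the name "raw" and the sample string are made up 
for illustration. html.unescape() is the Python 3 standard library 
replacement for the HTMLParser.unescape() method.

```python
# Python 3 sketch of the steps so far (the surrounding code is Python 2).
import html

raw = b"hello world &#8211; goodbye"        # hypothetical bytes from a web fetch
text = html.unescape(raw.decode("utf-8"))   # bytes -> str, then resolve HTML escapes
text = text.replace("\u2013", "--")         # EN DASH -> two ASCII hyphens
print(text)
```

Decoding before unescaping matters: html.unescape() works on text, not 
on bytes.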

See also the text.translate() method if you have to do lots of changes 
in one go.
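
As a sketch of what that might look like (in Python 3 syntax), 
translate() accepts a dict mapping code points to replacement strings. 
The table below is a hypothetical starting point, not a complete list, 
and the name asciify_punctuation is invented for this example.

```python
# Hypothetical table: Unicode code point -> ASCII stand-in.
ASCII_FALLBACKS = {
    0x2013: "-",    # EN DASH
    0x2014: "--",   # EM DASH
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}

def asciify_punctuation(text):
    """Replace common typographic punctuation in a single pass."""
    return text.translate(ASCII_FALLBACKS)

sample = "\u201cwait\u201d \u2013 it\u2019s done\u2026"
print(asciify_punctuation(sample))
```

Characters that are not in the table pass through unchanged, so a later 
encode step still has to decide what to do with them.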

Lastly, you can convert to an ASCII byte-string using the encode method. 
By default, this will raise an exception if there are any non-ASCII 
characters in your text string:

data = text.encode('ascii')
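
If you would rather catch that exception than let it propagate, 
something along these lines should work (Python 3 syntax; the sample 
text is an assumption). The UnicodeEncodeError object carries start and 
end attributes that locate the offending characters.

```python
text = "hello world \u2013 goodbye"   # sample text containing an EN DASH

bad = None
position = None
try:
    data = text.encode("ascii")       # strict error handling is the default
except UnicodeEncodeError as err:
    # err.start and err.end delimit the run of unencodable characters
    bad = text[err.start:err.end]
    position = err.start
    print("cannot encode %r at index %d" % (bad, position))
```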


You can also skip non-ASCII characters, replace them with question 
marks, or replace them with an escape code:

data = text.encode('ascii', 'ignore')
data = text.encode('ascii', 'replace')
data = text.encode('ascii', 'xmlcharrefreplace')


which will finally give you something suitable for use in programs 
written in the 1970s :-)
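
To compare the three error handlers side by side, here is a small 
Python 3 sketch using an assumed sample string. It also checks that 
the 'xmlcharrefreplace' result can be turned back into the original 
text with html.unescape().

```python
import html

text = "hello world \u2013 goodbye"   # sample text containing an EN DASH

# Encode the same text with each error handler and keep the results.
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace"):
    results[handler] = text.encode("ascii", handler)
    print(handler, results[handler])

# Only the character-reference form keeps enough information to round-trip.
restored = html.unescape(results["xmlcharrefreplace"].decode("ascii"))
```

Of the three, 'ignore' and 'replace' throw information away, so they 
only make sense if the consumer really cannot cope with anything else.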



-- 
Steve

