Replacing utf-8 characters

David Bolen db3l at fitlinxx.com
Wed Oct 5 23:51:06 CEST 2005


Mike <no at spam> writes:

> What you and I typed was ascii. The value of link came from importing
> that utf-8 web page into that variable.  That is why I think it is not
> working.  But not sure what the solution is.

Are you sure you're asking what you think you are asking?  Both the
ampersand character (&) and the characters within the ampersand entity
character reference (&amp;) are ASCII.  As it turns out they are also
valid UTF-8, but I would not call a web page UTF-8 just because I saw
the sequence of characters "&amp;" within the stream.  (That's not to
say it isn't UTF-8 encoded, just that I don't think that's the issue.)
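
To illustrate the point about ASCII and UTF-8, here is a minimal
sketch (assuming Python 3; the variable names are only for
illustration).  It encodes the entity reference both ways and compares
the resulting bytes:

```python
# The entity reference consists solely of ASCII characters, so encoding
# it as ASCII or as UTF-8 should produce the same bytes.
entity = "&amp;"

ascii_bytes = entity.encode("ascii")
utf8_bytes = entity.encode("utf-8")

# True when both encodings yield an identical byte sequence.
same_bytes = ascii_bytes == utf8_bytes

print(ascii_bytes, utf8_bytes, same_bytes)
```

So seeing "&amp;" in the data tells you nothing either way about how
the page as a whole is encoded.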

I'm just guessing, but you do realize that legal HTML should quote all
uses of the ampersand character with an entity reference, since the
ampersand itself is reserved for introducing such references?  This
includes URLs, whether inside attributes or in the body of the text.

So when you see something in a browser in a web page that shows a URL
that includes "&" such as for separating parameters, internally that
page is (or should be) stored with "&amp;" for that character.  Thus
if you retrieve the page in code, that's what you'll find.  It's the
browser processing that entity reference that turns it back into the
"&" for presentation.

Note that whether or not the page in question is encoded as UTF-8 is a
completely distinct question - whatever encoding the page is in would
be used to encode the characters in the entity reference (namely
"&amp;").

I'm assuming that in scraping the page you want to reverse the process
(i.e., interpret the entity references much as a browser would) before
using that URL for other purposes.  If so, the string replacement you
tried should work just fine, at least on the value of the URL as
managed by your code.
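
As a rough sketch of reversing the process (assuming Python 3, where
the standard library provides html.unescape; the URL below is made up
for illustration and is not the one from your page), both a plain
string replacement and a general entity decode should give the same
result for a link like this:

```python
import html

# Hypothetical link value as it would appear in the raw HTML source.
link = "http://example.com/search?q=python&amp;page=2"

# Simple approach: replace only the ampersand entity reference.
replaced = link.replace("&amp;", "&")

# More general approach: decode all character references in the string,
# much as a browser would when interpreting the page.
unescaped = html.unescape(link)

print(replaced)
print(unescaped)
```

The plain replacement is enough if "&amp;" is the only reference you
expect; the general decode also handles others such as "&lt;" or
"&quot;".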

You then mention it being the same when you view the contents of the
link, which isn't quite clear to me.  If that means retrieving
another copy of the link as embedded in an HTML page, then yes, it'll
be quoted again: just as in the original page, an ampersand within
HTML has to be written as an entity reference.
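
To show why the entity comes back, here is a small sketch (again
assuming Python 3 and its html module; the URL and the anchor markup
are hypothetical).  Embedding a decoded URL in HTML quotes the
ampersand again, and decoding that markup recovers the original:

```python
import html

# Hypothetical decoded URL, as your code would hold it after unescaping.
url = "http://example.com/search?q=python&page=2"

# Writing the URL into HTML requires quoting the ampersand again.
escaped = html.escape(url)
anchor = '<a href="{}">results</a>'.format(escaped)

# Decoding the quoted form gives back the URL we started with.
round_trip = html.unescape(escaped)

print(anchor)
print(round_trip == url)
```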

What did you mean by "view the contents link"?

-- David



