Python HTML parser chokes on UTF-8 input

Thu Oct 9 19:54:04 EDT 2008

On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer <dfnsonfsduifb at gmx.de> wrote:
> Hello group,
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(u"föö", u"bär")

That's not UTF-8, that's unicode.  Even if your file is saved as
UTF-8, prefixing those literals with 'u' tells Python to decode them
from UTF-8 encoded bytes into unicode characters.
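
To make the distinction concrete, here is a minimal sketch.  It is
written in Python 3 spelling, which is an assumption on my part: there,
plain string literals are already unicode and byte strings take a b
prefix, whereas in Python 2 the text literal would be written with a u
prefix.  The variable names are only for illustration.

```python
# The same text as unicode characters and as two different byte encodings.
text = "f\u00f6\u00f6"                  # "foo" with o-umlauts: three characters
utf8_bytes = text.encode("utf-8")       # o-umlaut becomes two bytes each
latin1_bytes = text.encode("latin-1")   # o-umlaut becomes one byte each

print(len(text))          # count of unicode characters
print(len(utf8_bytes))    # count of bytes in the UTF-8 encoding
print(len(latin1_bytes))  # count of bytes in the Latin-1 encoding
```

The character string is the same in both cases; only the byte
representation differs, which is why the encoding has to be named
whenever you cross between the two.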

> Therefore, after fetching the web site content, I have to convert it to
> UTF-8 first, modify it and convert it back:

You have to convert it to unicode if and only if you are doing
manipulation with unicode strings.

> website = website.decode("latin1")
> website = website.replace(u"föö", u"bär")
> website = website.encode("latin1")
>
> This is incredibly ugly IMHO, as I would really like the parser to just
> accept UTF-8 input. However when I omit the re-encoding to latin1:

You could just use the precise Latin-1 byte strings you'd like to replace:

website = website.replace("f\xf6\xf6", "b\xe4r")
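
As a rough, runnable version of that byte-level replace: the sketch
below uses Python 3 spelling (byte literals need a b prefix there), and
the page content is a made-up stand-in for whatever you actually
fetched.

```python
# Hypothetical Latin-1 encoded page content, standing in for the fetched site.
website = b"<p>f\xf6\xf6</p>"

# Replace the Latin-1 bytes directly; no decoding or encoding involved.
website = website.replace(b"f\xf6\xf6", b"b\xe4r")

print(website)
```

This only works as long as the data really is Latin-1; the same
characters have different byte sequences in other encodings.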

Or, you could set the encoding of your source file to Latin-1, by
putting the following on the first or second line of your source file:

# -*- coding: Latin-1 -*-

Then use the appropriate literals in your source code, making sure
that you save it as Latin-1 in your editor of choice.

Truthfully, though, I think your current approach really is the right
one.  Decode to unicode character strings as soon as they come into
your program, manipulate them as unicode, then select your preferred
encoding when you write them back out.  It's explicit, and only takes
two lines of code.
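
A minimal sketch of that decode, manipulate, encode round trip, again in
Python 3 spelling.  The helper name replace_in_page, its parameters, and
the sample page are all hypothetical, and Latin-1 is assumed only
because that is what your page appears to use.

```python
def replace_in_page(raw, old, new, encoding="latin-1"):
    """Decode bytes to text, replace on the text, then encode back."""
    text = raw.decode(encoding)      # bytes -> unicode at the boundary
    text = text.replace(old, new)    # all manipulation happens on unicode
    return text.encode(encoding)     # unicode -> bytes on the way out


# Hypothetical Latin-1 encoded input, standing in for the fetched page.
page = b"<p>f\xf6\xf6</p>"
result = replace_in_page(page, "f\u00f6\u00f6", "b\u00e4r")
print(result)
```

Keeping the encoding as a parameter means the same code works unchanged
if a page turns out to be UTF-8 instead.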

-- 
Jerry