[XML-SIG] Re: HTML<->UTF-8 'codec'?

Martin v. Loewis martin@v.loewis.de
08 Mar 2002 09:42:16 +0100


"David Primmer" <dave@primco.org> writes:

> Ok... but here's the problem. Using a cut'paste from my Word generated
> utf-8 file into IDLE I get:
>
> >>> print 'I’ve had'.encode("html-utf-8")
> I&#226;&#128;&#153;ve had
>
> Which makes a bunch of garbage in my browser of course.

The reason is that you cannot cut-and-paste UTF-8 bytes using the
Windows clipboard. You did not describe exactly how you performed the
"cut'paste", but I assume you've used some UTF-8-unaware editor (or
perhaps even "type" on a console), then copied the resulting
characters. This cannot work: it will paste the resulting characters,
*not* the sequence of bytes. If your Windows system code page is, say,
CP 1252, then you will get a bunch of Latin characters pasted into
IDLE. When saving the file with all those characters, IDLE will save
them as UTF-8 (because of the Python default encoding).

> f=open('newfile.html','wb')
> f.write(unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK'))
> f.close()
>
> f=open('newfile.html','rb')
> a = f.read()
> b = a.encode('html-utf-8')
> print 'from file'
> print b

> results in:
>
> from file
> &#226;&#128;&#157;

There is an error in this code. unicodedata.lookup gives you a Unicode
object. You try to write this into a file. This should normally not
work, but you've changed the default encoding, so it unfortunately
does: saving the Unicode object as UTF-8. Then you read it back as
variable a, which is a byte string. This byte string happens to be
three bytes (as defined in UTF-8).
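
To make the distinction between the Unicode object and the byte string
concrete, here is a small sketch in present-day Python 3 (where the two
types are called str and bytes); the variable names are illustrative only.

```python
import unicodedata

# lookup() returns a text (Unicode) object holding one character.
ch = unicodedata.lookup("RIGHT DOUBLE QUOTATION MARK")

# Encoding it explicitly yields a byte string; in UTF-8 this is three bytes.
data = ch.encode("utf-8")

print(ascii(ch), len(ch))
print(list(data), len(data))
```

The byte values 226, 128 and 157 are the same numbers that show up in the
character references quoted above.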

Now you invoke the .encode method on the byte string. Encoding a
string is a somewhat difficult notion, since the string already *is*
encoded - it is not clear what this should do. What it does is:
- find the html-utf-8 codec,
- pass it the byte string.
The codec operates on any sequence, converting all non-ASCII bytes
to character references. This gives you the result that you got.
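
The difference between feeding such a conversion bytes and feeding it
characters can be sketched as follows. The function charrefs is a
hypothetical stand-in written for this illustration, not the html-utf-8
codec from this thread, and the code is Python 3.

```python
def charrefs(seq):
    """Hypothetical stand-in: replace every item above 127 with &#N;."""
    out = []
    for item in seq:
        # Iterating bytes yields ints; iterating str yields characters.
        code = item if isinstance(item, int) else ord(item)
        out.append(chr(code) if code < 128 else "&#%d;" % code)
    return "".join(out)

data = b"\xe2\x80\x9d"        # UTF-8 bytes of RIGHT DOUBLE QUOTATION MARK
text = data.decode("utf-8")   # the single character itself

print(charrefs(data))         # one reference per byte
print(charrefs(text))         # one reference for the character
```

Applied to the bytes, the conversion produces three references, one per
byte; applied to the decoded text, it produces the single reference a
browser will render as the quotation mark.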

What you really meant is

import unicodedata

f=open('newfile.html','wb')
f.write(unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK').encode("utf-8"))
f.close()

f=open('newfile.html','rb')
a = f.read().decode("utf-8")
b = a.encode('html-utf-8')
print 'from file'
print b

or better

import codecs, unicodedata

f=codecs.open('newfile.html','wb', encoding="utf-8")
f.write(unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK'))
f.close()

f=codecs.open('newfile.html','rb',encoding="utf-8")
a = f.read()
b = a.encode('html-utf-8')
print 'from file'
print b

When you have a file that is encoded in UTF-8, please say so in your
program; do not rely on the system default encoding (in fact, never
change it): Explicit is better than implicit.
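
For readers on present-day Python 3, the same advice might look like the
sketch below. The temporary directory is an assumption made so the example
is self-contained, and the standard xmlcharrefreplace error handler is
swapped in for the html-utf-8 codec, which is specific to this thread.

```python
import os
import tempfile
import unicodedata

quote = unicodedata.lookup("RIGHT DOUBLE QUOTATION MARK")
path = os.path.join(tempfile.mkdtemp(), "newfile.html")  # assumed location

# State the encoding when writing text ...
with open(path, "w", encoding="utf-8") as f:
    f.write(quote)

# ... and again when reading it back.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# The raw file content, for comparison.
with open(path, "rb") as f:
    raw = f.read()

# Substitute for the thread's codec: non-ASCII becomes character references.
ref = text.encode("ascii", "xmlcharrefreplace")
print(raw, ref)
```

Nothing here depends on the platform or interpreter default, because both
the write and the read name the encoding.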

Regards,
Martin