problems writing utf8

Sat Apr 13 04:19:54 EDT 2002

Boudewijn Rempt <boud at valdyas.org> writes:

> Then I tried to write the utf-8 data to a file. I have tried to
> construct that file with two methods:
> 
>     f = open("syllables", "w+")
>     f2 = codecs.EncodedFile(f, "unicode_internal", "utf-8")
>     f2.write(u"a")
>     f2.close()

This can't work (even though it should not crash). EncodedFile
performs transparent recoding between two named encodings. In this
context, unicode_internal is the name of a *byte* encoding, namely the
encoding which exposes the internal memory layout of Unicode objects.

Unicode objects themselves have no "encoding"; they are just sequences
of Unicode characters. That they have an internal representation also
should not matter in most cases.
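
To illustrate the distinction, here is a minimal sketch (the variable
names are mine, chosen for illustration): one Unicode object, and
several different byte encodings of it. The u"" and b"" prefixes
assume a Python version that accepts both.

```python
# One Unicode string holding a single character, U+00E9 (e with acute).
text = u"\xe9"

# Encoding turns the character sequence into bytes; each codec yields
# a different byte string for the same character.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-be")

# Decoding with the matching codec recovers the same Unicode string.
round_trip = as_utf8.decode("utf-8")
```

The Unicode object is the same throughout; only the byte strings
produced from it differ.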

So what happens here is that you pass a Unicode object to the
unicode_internal decoder, which expects a byte string. u"a" will be
implicitly converted to the byte string "a", which is then interpreted
as the internal encoding of a Unicode object, which it is not.
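
For contrast, a sketch of the kind of job EncodedFile is meant for,
assuming the caller already holds Latin-1 *bytes* and wants UTF-8 in
the file. The in-memory io.BytesIO standing in for a real file, the
choice of Latin-1, and the variable names are my assumptions for
illustration only.

```python
import codecs
import io

# In-memory stand-in for a file opened in binary mode.
raw = io.BytesIO()

# data_encoding="latin-1": the encoding of the bytes the caller writes.
# file_encoding="utf-8": the encoding of the bytes stored in the file.
recoder = codecs.EncodedFile(raw, "latin-1", "utf-8")

# Write a byte string, not a Unicode object: 0xE9 is e-acute in Latin-1.
recoder.write(b"\xe9")

# The bytes that actually reached the underlying file object.
stored = raw.getvalue()
```

Both ends of the recoder deal in bytes; the Unicode form exists only
in between, inside the recoder.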

> The second method writes garbage-encoded data to the file:
> 
>     f3 = codecs.open("syllables2", "w+", "utf-8")
>     f3.write(u"?")
>     f3.close()

That should work fine, and it works for me in all cases. 

> (Where u"?" contains any Unicode character you like -- in this
> case a glottal stop.)

When I do

    import codecs
    f3 = codecs.open("syllables2", "w+", "utf-8")
    f3.write(u"\N{LATIN LETTER GLOTTAL STOP}")
    f3.close()

    print repr(open("syllables2").read())

I get

    '\xca\x94'

which indeed is the UTF-8 representation of the glottal stop. What did
you get?
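
In case it helps when comparing against your output, the two bytes
can be checked by hand. This is only a sketch of the two-byte UTF-8
layout; the variable names are hypothetical, and bytearray is used so
the same lines run on older and newer Python versions.

```python
# Code point of the glottal stop, U+0294.
cp = ord(u"\N{LATIN LETTER GLOTTAL STOP}")

# Code points from U+0080 to U+07FF take two bytes in UTF-8:
#   110xxxxx 10yyyyyy
lead = 0xC0 | (cp >> 6)      # marker bits 110 plus the top five bits
trail = 0x80 | (cp & 0x3F)   # marker bits 10 plus the low six bits
by_hand = bytearray([lead, trail])

# The same character run through the utf-8 codec, for comparison.
by_codec = u"\N{LATIN LETTER GLOTTAL STOP}".encode("utf-8")
```

If your file contains anything other than these two bytes, something
between the write and the read is altering the data.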

Regards,
Martin