problems writing utf8
Martin v. Loewis
martin at v.loewis.de
Sat Apr 13 04:19:54 EDT 2002
Boudewijn Rempt <boud at valdyas.org> writes:
> Then I tried to write the utf-8 data to a file. I have tried to
> construct that file with two methods:
>
> f = open("syllables", "w+")
> f2 = codecs.EncodedFile(f, "unicode_internal", "utf-8")
> f2.write(u"a")
> f2.close()
This can't work (even though it should not crash). The EncodedFile
performs transparent recoding between two named encodings. In this
context, unicode_internal is the name of a *byte* encoding, namely the
encoding which exposes the internal memory layout of Unicode objects.
Unicode objects themselves have no "encoding"; they are just sequences
of Unicode characters. That they have an internal representation also
should not matter in most cases.
So what happens here is that you pass a Unicode object to the
unicode_internal decoder, which expects a byte string. u"a" will be
converted to the byte string "a" which is then interpreted as an
internal encoding of a Unicode object, which it is not.
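To make the recoding direction concrete, here is a minimal sketch of what EncodedFile is meant for. It is written for a current Python 3 interpreter rather than the Python 2 of this thread, and the choice of latin-1 as the data encoding and of an in-memory BytesIO as the "file" are my assumptions for illustration only. The point is that write() expects *bytes* in the data encoding, not a Unicode object:

```python
import codecs
import io

# In-memory stand-in for a file opened in binary mode (assumption for the demo).
raw = io.BytesIO()

# Data written to the wrapper is decoded as latin-1 and stored as UTF-8.
recoder = codecs.EncodedFile(raw, "latin-1", "utf-8")

# Pass a byte string in the data encoding: e-acute is the single byte 0xE9 in latin-1.
recoder.write("\u00e9".encode("latin-1"))

recoded = raw.getvalue()
print(repr(recoded))  # b'\xc3\xa9', the UTF-8 form of the same character
```

If you already hold a Unicode object, there is nothing to recode, so a plain encoding writer is the better fit than EncodedFile.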
> The second method writes garbage-encoded data to the file:
>
> f3 = codecs.open("syllables2", "w+", "utf-8")
> f3.write(u"?")
> f3.close()
That should work fine, and it works for me in all cases.
> (Where u"?" contains any Unicode character you like -- in this
> case a glottal stop.)
When I do
import codecs
f3 = codecs.open("syllables2", "w+", "utf-8")
f3.write(u"\N{LATIN LETTER GLOTTAL STOP}")
f3.close()
print repr(open("syllables2").read())
I get
'\xca\x94'
which indeed is the UTF-8 representation of the glottal stop. What did
you get?
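For comparison, the same check can be sketched on a current Python 3 interpreter. This is an assumption-laden variant, not what I ran above: it uses the built-in open() with an encoding argument in place of codecs.open, writes to a temporary directory, and reads the file back in binary mode so the raw bytes are visible:

```python
import os
import tempfile

glottal_stop = "\N{LATIN LETTER GLOTTAL STOP}"  # U+0294

# Temporary location so the demo does not touch the current directory.
path = os.path.join(tempfile.mkdtemp(), "syllables2")

# Text-mode file that encodes everything written to it as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(glottal_stop)

# Read back in binary mode to inspect the bytes actually stored.
with open(path, "rb") as f:
    data = f.read()

print(repr(data))  # b'\xca\x94'
```

If the bytes differ from these two, the problem is on the writing side; if they match but the display looks wrong, the viewer or terminal is probably not interpreting the file as UTF-8.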
Regards,
Martin