[Tutor] Unicode Encode Error
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Thu Apr 27 23:12:40 CEST 2006
>>> You're right, I realised after playing with Tim's example that the
>>> problem was that I wasn't calling close() on the codecs file. Adding
>>> this after the f.write(html_text) seems to flush the buffer which
>>> means that the content now gets written to the file.
>>
>> Quick note: it may be important to write and read from the file using
>> binary mode "b". It's not so significant under Unix, but it is more
>> significant under Windows, because otherwise we may get some weird
>> results.
>
> But the file is utf-8 text, ISTM it should be written as text, not
> binary. Why do you recommend binary mode?
Hi Kent,
Oh! I just wrote that out because I had a vague and fuzzy feeling that
utf-8, having high-order binary bits, needed to be written carefully.
But let me examine that unexamined assumption...
No, you're right, we don't have to be so careful here, for carriage
returns and newlines have their standard interpretation under utf-8 too.
Ok, good to know. Thank you!
I'd seen too many problems with Windows and binary data, so I use 'rb'
out of habit whenever dealing with high-order binary data. For example,
chr(26) (Ctrl-Z) causes Windows to prematurely truncate the reading of a
file in text mode:
http://mail.python.org/pipermail/python-list/2003-March/154659.html
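Here is a minimal sketch of that 'rb'/'wb' habit, assuming Python 3 and a
throwaway temporary file (the file name is made up for illustration). It
only shows that binary mode hands back every byte, including chr(26). It
does not try to reproduce the Windows text-mode truncation itself.

```python
import os
import tempfile

# Bytes with 0x1A (Ctrl-Z, i.e. chr(26)) sitting in the middle.
payload = b"before" + bytes([26]) + b"after"

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "ctrl_z_demo.bin")

# Binary mode: no newline translation, no special handling of Ctrl-Z.
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    round_trip = f.read()

# True when every byte, including the Ctrl-Z, came back unchanged.
print(round_trip == payload)
```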
On a close reading of the utf-8 encoding standard, though, I see that it
does say that utf-8 avoids encoding high Unicode code points with
control characters: every byte of a multi-byte sequence has its high bit
set. So my caution is unfounded.
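A quick way to check this, as a sketch assuming Python 3 (the sample
strings are arbitrary picks of mine, not from the standard): encode a few
non-ASCII characters and confirm that none of the resulting bytes falls
in the ASCII range, while carriage return, newline and Ctrl-Z encode as
themselves.

```python
# Arbitrary non-ASCII samples: 2-, 3- and 4-byte utf-8 sequences.
samples = ["\u00e9", "\u00df", "\u20ac", "\u65e5\u672c\u8a9e", "\U0001F600"]

# Every byte of a multi-byte utf-8 sequence should be 0x80 or above,
# so none can collide with a control character like 0x0A, 0x0D or 0x1A.
high_bytes_only = all(
    byte >= 0x80
    for text in samples
    for byte in text.encode("utf-8")
)

# ASCII control characters keep their single-byte values under utf-8.
newline_bytes = "\r\n".encode("utf-8")
ctrl_z_bytes = chr(26).encode("utf-8")

print(high_bytes_only)
print(newline_bytes, ctrl_z_bytes)
```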