Python 3.0 automatic decoding of UTF16

MRAB google at mrabarnett.plus.com
Fri Dec 5 14:36:16 EST 2008


Joe Strout wrote:
> On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
> 
>>> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
>>> but 'uninterpretable as a utf16 character'.  The traceback below
>>> confirms that.  It should be an end-of-file marker and should not be
>>> passed to Python.  I strongly suspect that whatever wrote the file
>>> screwed up the (OS-specific) end-of-file marker.  I have seen this
>>> occasionally on Dos/Windows with ascii byte files, with the same symptom
>>> of reading random garbage past the end of the file.  Or perhaps
>>> end-of-file does not work right with utf16.
>>
>> So UTF-16 has an explicit EOF marker within the text?
> 
> No, it does not.  I don't know what Terry's thinking of there, but text 
> files do not have any EOF marker.  They start at the beginning 
> (sometimes including a byte-order mark), and go till the end of the 
> file, period.
> 
Text files _do_ sometimes have an EOF marker, such as character 0x1A
(Ctrl-Z). It can occur in text files on Windows.
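
As a rough sketch of what that looks like in practice (not taken from the
file under discussion): if some tool had appended a Ctrl-Z to a UTF-16
file, it would survive decoding as the character U+001A. The helper name
strip_dos_eof and the sample bytes below are made up for illustration.

```python
import codecs

def strip_dos_eof(text):
    """Drop a single trailing Ctrl-Z (U+001A), if present."""
    return text[:-1] if text.endswith("\x1a") else text

# Hypothetical sample: big-endian UTF-16 with a BOM, a CRLF, then Ctrl-Z.
raw = codecs.BOM_UTF16_BE + "line\r\n\x1a".encode("utf-16-be")

decoded = raw.decode("utf-16")   # the BOM selects the byte order and is consumed
cleaned = strip_dos_eof(decoded)
```

Whether to strip the marker at all is a judgment call; some data may
legitimately contain U+001A, so this only removes one at the very end.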

>> I cannot find one in the original file, only some kind of starting
>> sequence I suppose (0xfeff).
> 
> That's your byte-order mark (BOM).
> 
>> The last characters of the file are 0x00 0x0d 0x00 0x0a,
>> a simple \r\n line ending.
> 
> Sounds like a perfectly normal file to me.
> 
> It's hard to imagine, but it looks to me like you've found a bug.
> 
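
For what it's worth, the bytes described above are consistent with
big-endian UTF-16. A minimal sketch, using made-up contents as a stand-in
for the real file, of how the BOM and the final CRLF should decode, and of
what "\n" turns into if the byte order is read the wrong way round:

```python
import codecs

# Hypothetical stand-in for the file: BOM, then text ending in CRLF.
data = codecs.BOM_UTF16_BE + "some text\r\n".encode("utf-16-be")

first_two = data[:2]    # the byte-order mark, 0xFE 0xFF
last_four = data[-4:]   # 0x00 0x0D 0x00 0x0A, i.e. "\r\n" in big-endian UTF-16

# The generic "utf-16" codec uses the BOM to pick the byte order
# and does not return it as part of the text.
text = data.decode("utf-16")

# Same payload decoded with the wrong byte order: every code unit is swapped.
swapped = data[2:].decode("utf-16-le")
```

This does not show where the problem in the original report comes from; it
only illustrates that 0x00 0x0A read with swapped bytes is U+0A00.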


