[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

Steven D'Aprano steve at pearwood.info
Mon Nov 21 12:15:49 CET 2011


Dave Angel wrote:
> On 11/20/2011 04:45 PM, Steven D'Aprano wrote:
>> <snip>
>>
>> Something in the tool chain before it reached Python has saved it 
>> using a wide (four byte) encoding, most likely UTF-16 as that is 
>> widely used by Windows and Java. With the right settings, it could 
>> take as little as opening the file in Notepad, then clicking Save.
>>
> 
> UTF-16 is a two byte format.  That's typically what Windows uses for 
> Unicode.  It's Unices that are more likely to use a four-byte format.

Oops, you're right of course, two bytes, not four:

py> u'M'.encode('utf-16BE')
'\x00M'

I was thinking of four hex digits:

py> u'M'.encode('utf-16BE').encode('hex')
'004d'
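
For anyone following along on Python 3, here is a minimal sketch of the same check. It assumes Python 3 (the session above is Python 2, where str.encode('hex') still exists), and the variable names are only illustrative. The sample bytes are taken from the subject line of this thread; treating them as big-endian UTF-16 is an assumption based on the \x00 appearing before each letter.

```python
import binascii

# Encode one character as big-endian UTF-16: a single two-byte code unit.
encoded = u'M'.encode('utf-16-be')
print(repr(encoded))   # the two raw bytes
print(len(encoded))    # number of bytes, not hex digits

# Python 3 has no 'hex' codec on str, so use binascii.hexlify instead.
hex_digits = binascii.hexlify(encoded).decode('ascii')
print(hex_digits)      # four hex digits for two bytes

# Going the other way: bytes with a \x00 before every letter decode
# cleanly once the encoding is named explicitly.
raw = b'\x00T\x00r\x00i\x00a'
text = raw.decode('utf-16-be')
print(text)
```

Two bytes print as four hex digits, which is where my "four" came from. If the file has a byte order mark, plain 'utf-16' should work as the codec name and the BOM decides the byte order.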




-- 
Steven

