[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Wed Sep 5 17:09:03 CEST 2012

Ray Jones wrote:

> On 09/05/2012 04:52 AM, Peter Otten wrote:
>> Ray Jones wrote:
>>
>>>
>>> But doesn't that entail knowing in advance which encoding you will be
>>> working with? How would you automate the process while reading existing
>>> files?
>> If you don't *know* the encoding you *have* to guess. For instance you
>> could default to UTF-8 and fall back to Latin-1 if you get an error.
>> While decoding non-UTF-8 data with a UTF-8 decoder is likely to fail,
>> Latin-1 will always "succeed" as there is one codepoint associated with
>> every possible byte. The result, however, may not make sense. Think
>>
>> import codecs
>> for line in codecs.open("lol_cat.jpg", encoding="latin1"):
>>     print line.rstrip()
> :))
> 
> So when glob reads and returns garbley, non-unicode file
> names....\xb4\xb9....is it making a guess as to which encoding should be
> used for that filename? Does Linux store that information when it saves
> the file name? And (most?) importantly, how can I use that fouled up
> file name as an argument in calling Dolphin?
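
The UTF-8-first, Latin-1-fallback guess quoted above can be sketched
roughly as follows. This is only an illustration, written for Python 3
rather than the Python 2 used elsewhere in this thread, and decode_guess
is a made-up helper name, not a library function:

```python
def decode_guess(data):
    """Decode bytes as UTF-8, falling back to Latin-1 on failure.

    Returns a (text, encoding_used) tuple. The Latin-1 branch can never
    raise, because every byte value maps to exactly one code point, but
    the resulting text may still be nonsense.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"


# The two bytes from the garbled file name are not valid UTF-8,
# so the fallback is used here.
text, used = decode_guess(b"\xb4\xb9")
print(repr(text), used)
```

Knowing which branch was taken is useful, because a "successful" Latin-1
decode says nothing about whether the guess was right.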

Linux always stores filenames as bytes. If you pass a unicode path to
os.listdir(), it tries to decode each resulting name with the file system
encoding and returns that name as a byte string if decoding fails:

>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
>>> import os
>>> os.mkdir(u"alpha")
>>> odd_name = "".join(map(chr, range(128, 144))) # a byte string!
>>> os.mkdir(odd_name)
>>> os.listdir(u".") # unicode arg triggers unicode output (where possible)
[u'alpha', 
'\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f']

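(Python 3 takes a slightly different approach: given a str path, it maps
undecodable bytes to lone surrogates with the "surrogateescape" error
handler, so every name comes back as str and can still be converted back
to the original bytes.)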
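
As a rough Python 3 illustration (not part of the session above), the
same odd byte sequence can be pushed through the "surrogateescape" error
handler by hand. The explicit "utf-8" here is an assumption standing in
for the file system encoding; os.fsdecode() and os.fsencode() do the
equivalent using whatever encoding the system reports:

```python
# The same 16 odd bytes as in the session above, built the Python 3 way.
raw_name = bytes(range(128, 144))

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN
# instead of raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text_name.encode("utf-8", errors="surrogateescape")

print(ascii(text_name))
print(roundtrip == raw_name)
```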

Dolphin (at least the version I have tried) can only cope with filenames 
that can be decoded into unicode using the file system encoding. 

Neither Python nor Linux cares about the "meaning" of the file names. They
can process arbitrary byte sequences just fine.
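
A small Python 3 sketch of that last point, assuming a Linux file system
that accepts arbitrary bytes in names (some other platforms reject them).
The helper name make_and_list is made up for this example; passing a
bytes path to the os functions keeps everything as bytes, so no decoding
is attempted at all:

```python
import os
import tempfile

# The same 16 odd bytes used in the session above.
odd_name = bytes(range(128, 144))


def make_and_list(name):
    """Create a directory called `name` (bytes) inside a temporary
    directory and return the bytes listing of that temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)  # work with a bytes path throughout
        os.mkdir(os.path.join(tmp_bytes, name))
        # A bytes argument makes os.listdir() return bytes names.
        return os.listdir(tmp_bytes)


names = make_and_list(odd_name)
print(names)
```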