[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Wed Sep 5 17:16:37 CEST 2012

On 06/09/12 00:04, Ray Jones wrote:
> On 09/05/2012 04:52 AM, Peter Otten wrote:
>> Ray Jones wrote:
>>
>>>
>>> But doesn't that entail knowing in advance which encoding you will be
>>> working with? How would you automate the process while reading existing
>>> files?
>> If you don't *know* the encoding you *have* to guess. For instance you could
>> default to UTF-8 and fall back to Latin-1 if you get an error. While
>> decoding non-UTF-8 data with a UTF-8 decoder is likely to fail, Latin-1 will
>> always "succeed" as there is one codepoint associated with every possible
>> byte. The result, however, may not make sense. Think
>>
>> import codecs
>> for line in codecs.open("lol_cat.jpg", encoding="latin1"):
>>      print(line.rstrip())
> :))
>
> So when glob reads and returns garbley, non-unicode file
> names....\xb4\xb9....is it making a guess as to which encoding should be
> used for that filename?

No. It is returning the actual bytes stored by the file system.

At least that's what it does under Linux. Windows is different.

The most common Linux file systems (ext2 and ext3) store file names as bytes,
not Unicode. Your desktop environment (KDE, Gnome, Unity, etc.) *may* try to
enforce Unicode names, probably using UTF-8, but the file system is perfectly
happy to let you create file names using different encodings, or no encoding
at all. (I believe the only invalid bytes in ext2 or ext3 file names are
ASCII NUL and FORWARD SLASH. Even newline \n is valid.)
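
If you want to see those raw bytes for yourself, here is a minimal sketch.
It assumes Python 3, where os.listdir (like glob.glob) returns bytes when it
is given a bytes argument. The helper name list_raw_names is made up for
this example.

```python
import os


def list_raw_names(directory):
    """Return the entries of `directory` as raw bytes, with no decoding."""
    # os.fsencode turns a str path into bytes; a bytes path passes through.
    # Given a bytes path, os.listdir returns the names as bytes.
    return sorted(os.listdir(os.fsencode(directory)))


if __name__ == "__main__":
    for raw in list_raw_names("."):
        # repr() shows non-ASCII bytes as escapes such as \xb4\xb9.
        print(repr(raw))
```

Nothing is guessed here: what comes back is whatever the file system stored.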

> Does Linux store that information when it saves the file name?

No. The file system doesn't care about encodings, it just knows about bytes.
Your desktop environment might try to enforce UTF-8 encoded file names, but
nothing stops some other program from creating a file using a different
encoding.
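
Since the encoding isn't recorded anywhere, a program that wants to display
such a name has to guess. Peter's suggestion (try UTF-8, fall back to
Latin-1) works for file name bytes too. A rough sketch, assuming Python 3;
guess_decode is just a name I picked for illustration:

```python
def guess_decode(raw_name):
    """Decode file name bytes: try UTF-8 first, then fall back to Latin-1."""
    try:
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails,
        # but the result is only a guess and may be nonsense.
        return raw_name.decode("latin-1")
```

Only use the decoded text for display. Keep the original bytes around for
actually opening the file.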

For example, suppose I want to name a file "AπЯ†" (just because I can).
Assuming that KDE uses UTF-8, as it should, then Dolphin or Konqueror will
tell the file system:

name this file "\x41\xcf\x80\xd0\xaf\xe2\x80\xa0"

(Note that the first byte, \x41, is just the ASCII code for uppercase A.)

When another UTF-8 aware program sees that byte-string, it will decode it
back to "AπЯ†" and I will be happy that my files have cool names.
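
You can check that round trip at the interpreter. This is only a sketch,
assuming Python 3 (where string literals are Unicode); the variable names
are mine.

```python
name = "A\u03c0\u042f\u2020"  # the four characters A, π, Я, †

# Encoding to UTF-8 gives the bytes that end up in the file system.
raw = name.encode("utf-8")
assert raw == b"\x41\xcf\x80\xd0\xaf\xe2\x80\xa0"

# A UTF-8 aware program reverses the step and gets the same text back.
assert raw.decode("utf-8") == name
```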

But then some day I use another program, which knows nothing about UTF-8
but thinks I'm running an Apple Mac in Greece back in 1990 or thereabouts.
It sees the same sequence of bytes, and decodes it using the MacGreek
encoding, which gives "AœÄ–·βÄ†" instead, and I'll be unhappy because my
cool file names look like rubbish. But the actual file names (stored as
bytes) are the same.
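
The same experiment works for the wrong decoder. A small sketch, assuming
Python 3 and its "mac_greek" codec; decode_both is a made-up helper name.

```python
def decode_both(raw):
    """Decode the same bytes as UTF-8 and as MacGreek."""
    return raw.decode("utf-8"), raw.decode("mac_greek")


raw = b"\x41\xcf\x80\xd0\xaf\xe2\x80\xa0"
as_utf8, as_macgreek = decode_both(raw)

# MacGreek is a single-byte encoding, so every byte becomes one character.
assert len(as_macgreek) == len(raw)

# The text differs, but encoding it back gives the identical bytes.
assert as_macgreek != as_utf8
assert as_macgreek.encode("mac_greek") == raw

if __name__ == "__main__":
    print(as_utf8)
    print(as_macgreek)
```

Only the interpretation changed; the bytes on disk never did.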

> And (most?) importantly, how can I use that fouled up
> file name as an argument in calling Dolphin?

Just pass it as the file name and hope for the best :)

Seriously, I *expect* (but don't know for sure) that just passing the
raw bytes to Dolphin will be fine; it will decode them as it sees fit.
Try it and see.
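
Something along these lines might do for the experiment. It is an untested
sketch, assuming Python 3 on Linux and that the file manager's executable is
called "dolphin" and takes a path as its argument. Both function names are
made up.

```python
import subprocess


def build_command(raw_name, program=b"dolphin"):
    """Build an argument list that keeps the file name as raw bytes."""
    # No decoding here: the bytes go to the program exactly as the
    # file system returned them.
    return [program, raw_name]


def open_in_file_manager(raw_name):
    """Launch the program with the raw name and return its exit status."""
    # On POSIX systems subprocess accepts bytes arguments.
    return subprocess.call(build_command(raw_name))
```

You would call open_in_file_manager() with one of the byte strings that glob
or os.listdir gave you, without decoding it first.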

-- 
Steven