[Python-3000] Pre-PEP: Easy Text File Decoding
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Sat Oct 14 22:15:21 CEST 2006
"Martin v. Löwis" <martin at v.loewis.de> writes:
>> It changes the interpretation of some filenames which are valid UTF-8
                                         ^^^^^^^^^
>> (or generally of texts known to not contain '\0'). My hack is a pure
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> extension since U+0000 can't be produced by standard UTF-8.
>
> That's not true. See RFC 2279:
>
> # Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
> # correspond to octets 00 to 7F (7 bit US-ASCII values).
>
> So U+0000 is represented by the octet 00.

'\0' (and thus U+0000) can't appear in Unix filenames, in names or
values of environment variables, in program invocation arguments, etc.
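
Both points can be checked from Python. The following is only a small
sketch, assuming a recent CPython 3; the helper name path_accepts_nul is
made up for illustration. It shows that U+0000 does encode to the single
octet 00, as RFC 2279 says, and that a path containing it is refused
before it ever reaches the OS.

```python
import os

# RFC 2279: U+0000 is encoded as the single octet 00.
nul_encoded = "\0".encode("utf-8")


def path_accepts_nul(name):
    """Return True if os.stat() passes `name` on to the OS at all.

    Hypothetical helper for this illustration only.
    """
    try:
        os.stat(name)
    except (ValueError, TypeError):
        # CPython rejects an embedded NUL itself, before any system call.
        return False
    except OSError:
        # The name reached the OS, which simply could not find it.
        return True
    return True


print(nul_encoded)
print(path_accepts_nul("a\0b"))
```

So the octet 00 is valid UTF-8 in general, but it is not a value that a
filename, environment variable or argv entry can contain.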

It is true that my hack can change the interpretation of file contents,
which may legitimately contain '\0'. This is unavoidable, unless someone
uses unpaired surrogates for this purpose (or code points above
U+10FFFF). I've seen such proposals, but IMHO they bend the rules too far.

Anyway, Unicode people don't like my hack (it was actually inspired by
Mono). They dislike any modification of UTF-8, perhaps because such
modifications invite incompatibilities when different programs use
different variants. I understand this, and I don't want software to
start exchanging data in such an "encoding", but IMHO it's a useful hack
when the given language runtime translates filenames to UTF-8: it's
often the case that the interpretation of characters in filenames
doesn't matter, only that the byte sequences are preserved.
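
To illustrate what "preserving the byte sequences" means, here is a rough
Python sketch. The exact escaping rule is not spelled out in this message,
so the sketch assumes one plausible variant: every byte that is not part of
valid UTF-8 is decoded as U+0000 followed by the character whose code point
equals that byte. The names "nulescape", decode_name and encode_name are
invented for the example and are not part of any runtime.

```python
import codecs


def _nul_escape(err):
    """Error handler: turn each undecodable byte into U+0000 + chr(byte)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\0" + chr(b) for b in bad), err.end


# "nulescape" is a made-up handler name for this sketch.
codecs.register_error("nulescape", _nul_escape)


def decode_name(raw):
    """Decode a filename; the input must not contain the octet 00."""
    if b"\0" in raw:
        raise ValueError("a real NUL would collide with the escape marker")
    return raw.decode("utf-8", "nulescape")


def encode_name(text):
    """Inverse of decode_name: restore the original bytes exactly."""
    first, *rest = text.split("\0")
    out = bytearray(first.encode("utf-8"))
    for chunk in rest:
        # Each chunk starts with the escaped byte, stored as one character.
        if not chunk or ord(chunk[0]) > 0xFF:
            raise ValueError("dangling or invalid NUL escape")
        out.append(ord(chunk[0]))
        out += chunk[1:].encode("utf-8")
    return bytes(out)


latin1_name = b"caf\xe9.txt"           # not valid UTF-8
as_text = decode_name(latin1_name)     # contains U+0000 before the bad byte
assert encode_name(as_text) == latin1_name
```

Valid UTF-8 names pass through unchanged, which is why this is a pure
extension for inputs known not to contain '\0'. For file contents, where
the octet 00 may legitimately occur, the same trick would be ambiguous.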
--
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/