Is there really a default source encoding?
"Martin v. Löwis"
martin at v.loewis.de
Fri Jan 24 12:07:51 CET 2003
Alexander Schmolck wrote:
> This is a laudable goal. I just didn't understand why one would want to, after
> slowly and carefully moving away from eurocentric (latin-1), revert even
> further back to anglocentric (ascii) instead of opting for truly international
> (and anglo-neutral, utf-8). That seemed a bit like 1 step forward, 2 steps
In this case, it is Pythonic:
In the face of ambiguity, refuse the temptation to guess.
Python needs to support multiple source encodings, as it is a real-world
fact that people do use multiple source encodings, and they won't move
away from that unless forced. Python tries not to force things onto
people if it can be avoided, so we want people to use the encodings that
they happen to use.
Given that there are multiple encodings, we need declarations for them,
or else we would have to guess. The only unambiguous case is ASCII,
since you can't really mistake a file written in some other relevant
encoding for ASCII (I know iso-2022 somewhat breaks this rule, but it may
not be relevant for Python source code).
So if there are no bytes >= 128, it is ASCII. If there are bytes >= 128, we
would have to guess, which we should refuse.
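
The rule can be sketched roughly as follows. This is only an illustration,
not the interpreter's actual code: the function name and the regular
expression are assumptions made for this example, with the pattern modelled
on the "coding" comment form of encoding declarations.

```python
import re

# Hypothetical pattern for a declaration such as "# -*- coding: latin-1 -*-".
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding of a source file, or refuse instead of guessing."""
    # An explicit declaration in the first two lines settles the question.
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")

    # No declaration: the only unambiguous case is pure ASCII.
    if all(byte < 128 for byte in data):
        return "ascii"

    # Bytes >= 128 without a declaration: refuse the temptation to guess.
    raise SyntaxError("non-ASCII bytes found, but no encoding declaration")
```

With this sketch, a plain ASCII file is accepted, a file carrying a
declaration is read in the declared encoding, and anything else is rejected
rather than guessed at.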
> Great. Only are you sure that BOMs are such a great idea?
Yes. First of all, notepad.exe supports it, for which Markus' comments
are irrelevant. Furthermore, I disagree with Markus on several counts.
> On POSIX systems, the locale and not magic file type codes define the encoding
> of plain text files.
Does not apply: this is not plain text, but source code. Relying on the
locale is next to guessing, as the file may have been created by a user
of a different locale.
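
A small sketch of why the locale is close to guessing, while a signature
travels with the file itself. The helper name is made up for this example;
codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the UTF-8 signature (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# One non-ASCII character, saved by a user whose locale is Latin-1.
raw = "\u00e4".encode("latin-1")

# A reader in a different locale decodes the same byte differently.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8_r")
print(as_latin1 == as_koi8r)  # False

# With a signature, the file identifies its own encoding.
signed = codecs.BOM_UTF8 + "\u00e4".encode("utf-8")
print(has_utf8_signature(signed))              # True
print(signed.decode("utf-8-sig") == "\u00e4")  # True
```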
> Mixing the two concepts would add a lot of complexity and
> break existing functionality. Adding a UTF-8 signature at the start of a file
> would interfere with many established conventions such as the kernel looking
> for "#!" at the beginning of a plaintext executable to locate the appropriate
> interpreter. Handling BOMs properly would add undesirable complexity even to
> simple programs like cat or grep that mix contents of several files into one.
cat is irrelevant as well, as you rarely cat two source code files
together. I can't see how grep is affected - if you have encoding
declarations (which you also have in HTML, XML, and probably other file
formats), you can't grep for non-ASCII text anyway.
> The '#!'-bit would seem especially relevant.
Indeed, this is the only valid point in this objection. I have a patch
for the Linux kernel which I still hope to incorporate into future Linux
versions to support UTF-8-BOM #! in binfmt_script. If the feature takes
off, other POSIX systems may follow.
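
To make the "#!" problem concrete, here is a rough model in Python. It is
not the kernel patch mentioned above, and both function names are invented
for this illustration; it only shows why a check for "#!" in the first two
bytes fails once a signature precedes it, and what a tolerant check would
have to do.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Strict check: "#!" must be the very first two bytes."""
    return data[:2] == b"#!"


def starts_with_shebang_bom_tolerant(data: bytes) -> bool:
    """Hypothetical tolerant check: skip a UTF-8 signature first."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data[:2] == b"#!"


plain = b"#!/usr/bin/env python\n"
signed = codecs.BOM_UTF8 + plain

print(starts_with_shebang(plain))                # True
print(starts_with_shebang(signed))               # False
print(starts_with_shebang_bom_tolerant(signed))  # True
```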