Is there really a default source encoding?

Fri Jan 24 06:07:51 EST 2003

Alexander Schmolck wrote:
> This is a laudable goal. I just didn't understand why one would want to, after
> slowly and carefully moving away from eurocentric (latin-1), revert even
> further back to anglocentric (ascii) instead of opting for truly international
> (and anglo-neutral, utf-8). That seemed a bit like 1 step forward, 2 steps
> back. 

In this case, it is Pythonic:

In the face of ambiguity, refuse the temptation to guess.

Python needs to support multiple source encodings, as it is a real-world 
fact that people do use multiple source encodings, and they won't move 
away from that unless forced. Python tries not to force things onto 
people if it can be avoided, so we want people to use the encodings that 
they happen to use.

Given that there are multiple encodings, we need declarations for them, 
or else we would have to guess. The only unambiguous case is ASCII, 
since you can't really mistake a file written in some other relevant 
encoding for ASCII (I know iso-2022 somewhat breaks this rule, but it may 
not be relevant for Python source code).
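
To illustrate the iso-2022 caveat, here is a small sketch, assuming a 
current Python 3 and its standard "iso2022_jp" codec. It shows that 
Japanese text in that encoding consists only of 7-bit bytes, so a "no 
high bytes" test cannot tell it apart from ASCII. The variable names are 
made up for the example.

```python
# Three Japanese characters, written as escapes so this file stays ASCII.
text = "\u65e5\u672c\u8a9e"

# ISO-2022-JP switches character sets with escape sequences and uses
# only 7-bit bytes for the payload.
encoded = text.encode("iso2022_jp")

# Every byte is below 128, although the text is clearly not ASCII text.
only_seven_bit = all(byte < 128 for byte in encoded)

# The bytes even decode as ASCII without an error (escape bytes and all),
# which is why this family of encodings weakens the "ASCII is
# unambiguous" rule.
decoded_as_ascii = encoded.decode("ascii")
round_trip = encoded.decode("iso2022_jp")

print(only_seven_bit, round_trip == text, decoded_as_ascii != text)
```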

So if there are no bytes >= 128, it is ASCII. If there are bytes >= 128, we 
would have to guess, which we should refuse.
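
The rule can be sketched roughly as follows. This is only an 
illustration of the policy, not the interpreter's actual code: the 
function name source_encoding is hypothetical, the declaration pattern 
is modelled on the comment form described in PEP 263, and a current 
Python 3 is assumed.

```python
import codecs
import re

# Declaration comment such as "# -*- coding: latin-1 -*-",
# modelled on the pattern given in PEP 263.
DECLARATION = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def source_encoding(source: bytes) -> str:
    """Return the encoding of a source file, or refuse to guess."""
    # A UTF-8 signature (BOM) counts as an explicit declaration.
    if source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise look for a declaration comment in the first two lines.
    for line in source.splitlines()[:2]:
        match = DECLARATION.match(line)
        if match:
            return match.group(1).decode("ascii")
    # No declaration: only the pure-ASCII case is unambiguous.
    if all(byte < 128 for byte in source):
        return "ascii"
    # Bytes >= 128 without a declaration: refuse the temptation to guess.
    raise ValueError("non-ASCII bytes but no encoding declaration")


print(source_encoding(b"print('hi')\n"))
print(source_encoding(b"# -*- coding: latin-1 -*-\nprint('h\xe9')\n"))
```

Calling source_encoding(b"print('h\xe9')\n") raises ValueError, which 
is the "refuse" branch.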

> Great. Only are you sure that BOMs are such a great idea?

Yes. First of all, notepad.exe supports it, for which Markus' comments 
are irrelevant. Furthermore, I disagree with Markus on several counts.

>     On POSIX systems, the locale and not magic file type codes define the encoding
>     of plain text files. 

Does not apply: this is not plain text, but source code. Relying on the 
locale is little better than guessing, as the file may have been created 
by a user of a different locale.
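
A small sketch of why the locale is a poor guide, assuming a current 
Python 3. The same bytes are decoded once as a Latin-1 user would read 
them and once as a KOI8-R user would; the two encodings are picked only 
as examples, and the variable names are made up.

```python
# One line of source containing a single byte >= 128.
raw = b'name = "\xe4"\n'

# A reader in a Latin-1 locale sees a-umlaut (U+00E4) here.
as_latin1 = raw.decode("latin-1")

# A reader in a KOI8-R locale sees a Cyrillic letter for the same byte.
as_koi8r = raw.decode("koi8_r")

# Both decodings succeed, yet they yield different program text.
print(as_latin1 == as_koi8r)
```

Neither decoding fails, so nothing tells the reader that the result is 
wrong; only a declaration in the file itself settles it.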

>     Mixing the two concepts would add a lot of complexity and
>     break existing functionality.  Adding a UTF-8 signature at the start of a file
>     would interfere with many established conventions such as the kernel looking
>     for "#!" at the beginning of a plaintext executable to locate the appropriate
>     interpreter.  Handling BOMs properly would add undesirable complexity even to
>     simple programs like cat or grep that mix contents of several files into one.

cat is irrelevant as well, as you rarely cat two source code files 
together. I can't see how grep is affected - if you have encoding 
declarations (which you also have in HTML, XML, and probably other file 
formats), you can't grep for non-ASCII text anyway.

> The '#!'-bit would seem especially relevant.  

Indeed, this is the only valid point in this objection. I have a patch 
for the Linux kernel which I still hope to incorporate into future Linux 
versions to support UTF-8-BOM #! in binfmt_script. If the feature takes 
off, other POSIX systems may follow.
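
To make the problem concrete, here is a rough sketch in Python (not the 
kernel patch itself, which is C and is not shown here). It assumes a 
current Python 3; both function names are hypothetical. The first check 
mimics a loader that looks at the first two bytes only, the second one 
skips a UTF-8 signature first.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Naive check: the file must begin with the two bytes '#!'."""
    return data[:2] == b"#!"


def starts_with_shebang_bom_aware(data: bytes) -> bool:
    """Same check, but tolerate a leading UTF-8 signature (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data[:2] == b"#!"


plain_script = b"#!/usr/bin/env python\nprint('hi')\n"
bom_script = codecs.BOM_UTF8 + plain_script

print(starts_with_shebang(plain_script))          # True
print(starts_with_shebang(bom_script))            # False
print(starts_with_shebang_bom_aware(bom_script))  # True
```

The three signature bytes come before "#!", so the naive check no longer 
recognizes the script.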

Regards,
Martin