[Tutor] close, but no cigar

Tue Jul 23 09:30:20 CEST 2013

On 23/07/13 09:39, Marc Tompkins wrote:
> On Mon, Jul 22, 2013 at 3:22 PM, Jim Mooney <cybervigilante at gmail.com>wrote:
>
>> On 22 July 2013 14:11, Marc Tompkins <marc.tompkins at gmail.com> wrote:
>>
>>>
>>> One way to deal with this is to specify an encoding:
>>>      newchar = char.decode('cp437').encode('utf-8')
>>>
>>
>> Works fine, but I decided to add a dos graphics dash to the existing dash
>> to expand the tree
>> visually. Except I got a complaint from IDLE that I should add this:
>>
>> # -*- coding: utf-8 -*-
>>
>> Will that always work? Setting coding in a comment? Or am I looking at a
>> Linux hash line?
>>
>>
> I speak under correction here, but:  what you're setting there is the
> encoding for the script file itself (and - the real point here - any
> strings you specify, without explicit encoding, inside the script), NOT the
> default encoding that Python is going to use while executing your script.
> Unless I'm very much mistaken, Python will still use the default encoding
> ('ascii' in your case) when reading strings from external files.

Correct. The encoding declaration ONLY tells Python how to read the script. Remember, source code is text, but has to be stored on disk as bytes. If you only use ASCII characters, pretty much every program will agree what the bytes represent (since IBM mainframes using EBCDIC are pretty rare, and few programs expect double-byte encodings). But if you include non-ASCII characters, your text editor has to convert them to bytes. How does it do so? Nearly every editor is different, and a plain text file doesn't have any way of storing metadata such as the encoding. Contrast this with things like JPEG files, which can store metadata like the camera you used to take the photo.
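To make the text-versus-bytes point concrete, here is a minimal sketch (written for Python 3, which is an assumption; the thread itself is about Python 2). It encodes the same one-character string with three encodings and then deliberately decodes the UTF-8 bytes with the wrong one. The variable names are only for illustration.

```python
# One character of text: e with an acute accent, written as an escape so
# this snippet does not itself depend on the file's encoding.
text = "\u00e9"

# The same text becomes different bytes depending on the encoding chosen.
utf8_bytes = text.encode("utf-8")
cp437_bytes = text.encode("cp437")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)    # two bytes
print(cp437_bytes)   # one byte
print(latin1_bytes)  # one byte, but a different one

# Reading bytes back with the wrong encoding silently gives wrong text.
misread = utf8_bytes.decode("cp437")
print(ascii(misread), "is not", ascii(text))
```

Nothing in the bytes themselves says which encoding produced them, which is why the reader of a file has to be told.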

So, some programmers' editors have taken up the convention of using so-called "mode lines" to record editor settings as comments in source code, usually in the first couple or last couple of lines. Especially on Linux systems, Emacs and Vim users frequently include such mode lines.

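Python stole this idea from them. If the first or second line in the source code file is a comment containing something like "coding: SPAM" or "coding=SPAM" (the colon or equals sign must come straight after the word "coding", so "encoding=SPAM" and "fileencoding=SPAM" work too), then Python will read that source code using encoding SPAM. The form shown above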

-*- coding: utf-8 -*-

is copied from Emacs. Python is pretty flexible, though.

However, the encoding must be a known encoding (naturally), and the comment must be in the first or second line. You can't use it anywhere else. Well, you actually can, since it is a comment, but it will have no effect anywhere else.
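If you want to check what Python will make of a declaration, the standard library can tell you. This is a sketch for Python 3 (an assumption; tokenize.detect_encoding does not exist in Python 2), and declared_encoding is just a helper name made up for this example. Note that Python 3 falls back to UTF-8 when it finds no declaration, whereas Python 2 falls back to ASCII.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    # detect_encoding wants a readline callable that returns bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Emacs-style declaration on the first line.
first_line = b"# -*- coding: cp437 -*-\nx = 1\n"

# Vim-style declaration on the second line, after a hash-bang line.
second_line = b"#!/usr/bin/env python\n# vim: set fileencoding=cp1252 :\nx = 1\n"

# Declaration on the third line: just an ordinary comment, so it is ignored.
too_late = b"x = 1\ny = 2\n# -*- coding: cp437 -*-\n"

print(declared_encoding(first_line))
print(declared_encoding(second_line))
print(declared_encoding(too_late))
```

The first two report the declared encoding; the third reports the default, because the comment comes too late to count.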

-- 
Steven