[Tutor] How does # -*- coding: utf-8 -*- work?

eryksun eryksun at gmail.com
Sun Jan 27 00:50:39 CET 2013


On Sat, Jan 26, 2013 at 11:38 AM, Santosh Kumar <sntshkmr60 at gmail.com> wrote:
>
> Everything starting with a hash character in Python is a comment and
> is not interpreted by the interpreter. So how does that work? Give me
> a full explanation.

Comments are normally discarded, but the encoding declaration is an
exception: the tokenizer looks for it in the first two lines while
compiling the source. CPython uses the function get_coding_spec in
tokenizer.c.

CPython 2.7.3 source link:
http://hg.python.org/cpython/file/70274d53c1dd/Parser/tokenizer.c#l205
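
As a rough illustration of what that check looks for (this is not the
C code, just a sketch in Python 3), the pattern below is the one given
in PEP 263. The helper names find_coding and detect are made up for
this example; tokenize.detect_encoding is the standard library's
pure-Python counterpart and returns a normalized codec name.

```python
import io
import re
import tokenize

# Pattern from PEP 263: a comment-only line containing "coding:" or "coding=".
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def find_coding(line):
    """Return the declared encoding name in a source line, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


def detect(source_bytes):
    """Ask the stdlib which encoding it would use for this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


print(find_coding('# -*- coding: utf-8 -*-'))  # utf-8
print(find_coding('# coding=latin-1'))         # latin-1
print(find_coding('x = 1  # coding: utf-8'))   # None, code precedes the comment
print(detect(b'# coding=latin-1\nx = 1\n'))    # iso-8859-1 (normalized name)
```

Both spellings of the declaration match the same pattern, which is why
they end up being treated the same way.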

You can use the parser module to represent the nodes of a parsed
source tree as a sequence of nested tuples. The first item in each
tuple is the node type number. The associated names for each number
are split across two dictionaries. symbol.sym_name maps non-terminal
node types, and token.tok_name maps terminal nodes (i.e. leaf nodes in
the tree). In CPython 2.7/3.3, node types below 256 are terminal.
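
A small sketch of that lookup, assuming only the token module is
guaranteed to be present (the symbol module was removed in Python
3.10, so it is imported defensively here). The helper node_name is
hypothetical, not part of either module.

```python
import token

try:
    import symbol  # available in 2.7 through 3.9, removed in 3.10
    SYM_NAME = symbol.sym_name
except ImportError:
    SYM_NAME = {}  # assumption: fall back to a placeholder name below


def node_name(node_type):
    """Map a node type number to a name using the two dictionaries."""
    if token.ISTERMINAL(node_type):
        return token.tok_name.get(node_type, '<terminal %d>' % node_type)
    return SYM_NAME.get(node_type, '<non-terminal %d>' % node_type)


print(token.NT_OFFSET)          # 256, the first non-terminal number
print(token.ISNONTERMINAL(257)) # True
print(node_name(0))             # ENDMARKER
```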

Here's an example source tree for two types of encoding declaration:

    >>> import parser, symbol, token
    >>> src1 = '# -*- coding: utf-8 -*-'
    >>> parser.suite(src1).totuple()
    (339, (257, (0, '')), 'utf-8')

    >>> src2 = '# coding=utf-8'
    >>> parser.suite(src2).totuple()
    (339, (257, (0, '')), 'utf-8')

As expected, src1 and src2 are equivalent. Now find the names of node
types 339, 257, and 0:

    >>> symbol.sym_name[339]
    'encoding_decl'
    >>> symbol.sym_name[257]
    'file_input'

    >>> token.ISTERMINAL(0)
    True
    >>> token.tok_name[0]
    'ENDMARKER'

The base node is type 339 (encoding_decl). The child is type 257
(file_input), which is just the empty body of the source (to keep it
simple, src1 and src2 lack statements). Tacked on at the end is the
string value of the encoding_decl (e.g. 'utf-8').
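
To make that structure easier to read, here is a minimal sketch that
swaps the numbers for names. It works on the tuple literal from the
session above, so it does not need the parser module; the NAMES table
is hard-coded from the CPython 2.7 values shown earlier, and annotate
is a hypothetical helper.

```python
# Node type numbers as reported by CPython 2.7 in the session above.
NAMES = {339: 'encoding_decl', 257: 'file_input', 0: 'ENDMARKER'}


def annotate(node):
    """Return a copy of a nested-tuple tree with type numbers named."""
    name = NAMES.get(node[0], str(node[0]))
    # Children are tuples; anything else (e.g. a string) is kept as-is.
    rest = tuple(annotate(item) if isinstance(item, tuple) else item
                 for item in node[1:])
    return (name,) + rest


tree = (339, (257, (0, '')), 'utf-8')
print(annotate(tree))
# ('encoding_decl', ('file_input', ('ENDMARKER', '')), 'utf-8')
```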

