[Python-Dev] Reading Python source file

Serhiy Storchaka storchaka at gmail.com
Mon Nov 16 20:53:32 EST 2015


I'm working on rewriting the Python tokenizer (in particular the part 
that reads and decodes a Python source file). The code is complicated. 
Currently it handles the following cases:

* Reading from a string in memory.
* Interactive reading from a file.
* Reading from a file:
   - Raw reading, ignoring the encoding, in the parser generator.
   - Raw reading of a UTF-8 encoded file.
   - Reading and recoding to UTF-8.

The file is read line by line. This makes it hard to check the 
correctness of the first line if the encoding is specified in the 
second line. It also causes very hard problems with null bytes and with 
desynchronization of buffered C and Python files. All these problems 
can be easily solved by reading the whole Python source file into 
memory and then parsing it as a string. This would allow dropping a 
large, complex, and buggy part of the code.
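
To illustrate the idea at the Python level (the real tokenizer is 
written in C, so this is only a rough sketch and not the proposed 
patch), the whole-file approach could look roughly like this. The 
function names read_source and decode_source are hypothetical; 
tokenize.detect_encoding is the existing stdlib helper that looks for a 
BOM or a coding cookie in the first two lines.

```python
import io
import tokenize


def decode_source(data):
    """Decode the complete contents of a source file in one step.

    `data` is the raw bytes of the whole file. Hypothetical helper,
    shown only to illustrate the approach.
    """
    # With the whole file in memory, null bytes can be rejected up front.
    if b"\0" in data:
        raise ValueError("source contains null bytes")
    # detect_encoding() inspects the BOM and the first two lines for a
    # coding cookie and falls back to UTF-8.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    # Decoding everything at once also validates the first line against
    # an encoding declared in the second line.
    return data.decode(encoding)


def read_source(path):
    """Read a whole source file into memory and return it as a string."""
    with open(path, "rb") as f:
        return decode_source(f.read())
```

Note that this sketch does not normalize line endings; the resulting 
string would then be handed to the parser as in the "string in memory" 
case.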

Are there disadvantages to this solution? As for memory consumption, the 
source text itself will consume only a small part of the memory consumed 
by the AST and other structures. As for performance, reading and 
decoding the whole file can be faster than doing it line by line.

[1] http://bugs.python.org/issue25643


