[Python-Dev] Reading Python source file

Tue Nov 17 10:22:37 EST 2015

On Tue, Nov 17, 2015 at 1:59 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> On 17.11.2015 02:53, Serhiy Storchaka wrote:
>> I'm working on rewriting the Python tokenizer (in particular the part that reads and decodes
>> Python source files). The code is complicated. Currently it has to handle these cases:
>>
>> * Reading from a string in memory.
>> * Interactive reading from a file.
>> * Reading from a file:
>>   - Raw reading, ignoring the encoding, in the parser generator.
>>   - Raw reading of a UTF-8 encoded file.
>>   - Reading and recoding to UTF-8.
>>
>> The file is read line by line. This makes it hard to check the correctness of the first line if
>> the encoding is specified in the second line. It also causes very hard problems with null bytes
>> and with desynchronization of buffered C and Python files. All these problems can be solved easily
>> by reading the whole Python source file into memory and then parsing it as a string. This would
>> allow dropping a large, complex and buggy part of the code.
>>
>> Are there disadvantages to this solution? As for memory consumption, the source text itself will
>> consume only a small part of the memory consumed by the AST and other structures. As for
>> performance, reading and decoding the whole file can be faster than doing it line by line.
>
> A problem with this approach is that you can no
> longer fail early and detect indentation errors et al. while
> parsing the data (which may well come from a pipe).

Oh, this use case I had forgotten about. I don't know how common or
important it is though.
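
To make the proposal concrete, here is a minimal sketch of the
read-everything-then-decode approach. It assumes the stdlib
tokenize.detect_encoding() is an acceptable stand-in for the C
tokenizer's encoding detection; read_source() is a made-up helper
name, not anything in CPython.

```python
import io
import tokenize


def read_source(data: bytes) -> str:
    """Decode a complete Python source buffer (hypothetical helper)."""
    # detect_encoding() honours a UTF-8 BOM and a coding cookie on
    # line 1 or line 2, so the "cookie on the second line" case needs
    # no special handling once the whole file is in memory.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    # Decode the whole buffer in one go instead of line by line.
    return data.decode(encoding)


# The coding cookie sits on the second line, after the shebang.
sample = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nname = '\xe9'\n"
text = read_source(sample)
```

With the full buffer available, the first line can be decoded with the
encoding declared on the second line, which is awkward to get right
when reading line by line.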

But more important is the interactive REPL, which parses your input
fully each time you hit ENTER.
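
A rough illustration of that behavior, assuming the stdlib codeop
module is a fair approximation of what the interactive loop does
(classify() is a made-up name for this sketch):

```python
import codeop


def classify(source: str) -> str:
    """Report how a REPL-like loop would treat `source` (hypothetical helper)."""
    try:
        # compile_command() returns a code object for a complete statement,
        # None when more input is needed, and raises on invalid input.
        code = codeop.compile_command(source, "<stdin>", "single")
    except SyntaxError:
        return "error"
    return "incomplete" if code is None else "complete"


results = [classify(s) for s in ("x = 1", "if x:", "x = = 1")]
```

The input accumulated so far is parsed in full on every call, which is
how the loop knows whether to execute, ask for a continuation line,
or report an error right away.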

> Another related problem is that you have to wait for the full
> input data before you can start compiling the code.

That's always the case -- we don't start compiling before we have the
full parse tree.
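
A small sketch of that ordering, using the ast module from Python code
rather than the C internals (run() is a made-up helper for
illustration):

```python
import ast


def run(source: str, namespace: dict) -> None:
    """Parse everything, then compile, then execute (hypothetical helper)."""
    tree = ast.parse(source, "<demo>")      # the whole input is parsed first
    code = compile(tree, "<demo>", "exec")  # compiling starts from the full tree
    exec(code, namespace)


ns = {}
try:
    # The first line is valid, but the error on the second line is found
    # during parsing, so nothing is compiled or executed.
    run("ran = True\nif\n", ns)
except SyntaxError:
    pass
```

After this runs, ns is still empty: the assignment on the first line
never executed, because the parse of the complete input failed first.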

> I don't think these situations are all that common, though,
> so reading in the full source code before compiling it
> sounds like a reasonable approach.
>
> We use the same simplification in eGenix PyRun's emulation of
> the Python command line interface and it has so far not
> caused any problems.

Curious how you do it? I'd actually be quite disappointed if the
amount of parsing done by the standard REPL went down.

>> [1] http://bugs.python.org/issue25643

-- 
--Guido van Rossum (python.org/~guido)