[Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

Nick Coghlan ncoghlan at gmail.com
Wed Jun 27 11:17:52 EDT 2018

On 27 June 2018 at 17:04, Fiedler Roman <Roman.Fiedler at ait.ac.at> wrote:
> Hello List,
> Context: we are conducting machine learning experiments that generate some kind of nested decision trees. As the tree includes specific decision elements (which require custom code to evaluate), we decided to store the decision tree (result of the analysis) as generated Python code. Thus the decision tree can be transferred to sensor nodes (detectors) that will then filter data according to the decision tree when executing the given code.
> Tracking down a crash when executing that generated code, we came to the following simplified reproducer, which crashes the interpreter (on Python 2 and 3) while the code is being loaded, before execution even starts:
> #!/usr/bin/python2 -BEsStt
> A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])
> The error message is:
> s_push: parser stack overflow
> MemoryError
> Despite the machine having 16GB of RAM, the code cannot be loaded. Splitting it into two lines using an intermediate variable is the current workaround; after manual adaptation the code then runs.
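That workaround generalises: the code generator can emit one assignment
per tree level instead of a single nested expression, so the emitted
source never nests deeply at all. A minimal sketch of the idea (the `A`
class here is a hypothetical stand-in for the generated decision
elements, not part of the original code):

```python
# Hypothetical stand-in for the generated decision elements.
class A:
    def __init__(self, children):
        self.children = children

def emit_flat(depth):
    # Emit one flat assignment per level instead of nesting calls,
    # so the parser never sees a deeply nested expression.
    lines = ["n0 = A(None)"]
    for i in range(1, depth + 1):
        lines.append("n%d = A([n%d])" % (i, i - 1))
    lines.append("tree = n%d" % depth)
    return "\n".join(lines)

# A nesting depth that overflows the parser's stack as a single
# expression compiles fine as a sequence of flat statements.
namespace = {"A": A}
exec(compile(emit_flat(1000), "<generated>", "exec"), namespace)
```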

This may indicate a problem in the pgen2 parser generator, since
compilation fails at the initial parse step; yet the largest version
of this expression that CPython can parse on my machine yields a
syntax tree of only ~77 kB:

    >>> tree = parser.expr("A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])")
    >>> sys.getsizeof(tree)

Attempting to print that hints more closely at the potential problem:

    >>> tree.tolist()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RecursionError: maximum recursion depth exceeded while getting the repr of an object

As far as I'm aware, the CPython parser uses the actual C stack for
recursion, and hence throws MemoryError because it ran out of stack
space to recurse into, not because it ran out of memory in general
(RecursionError would be the more accurate exception).
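To illustrate the distinction: recursion in Python code is bounded by
the interpreter's configurable recursion limit and reports
RecursionError, whereas the parser's C stack is a separate, fixed
resource that this limit does not control. A minimal sketch of the
Python-level side:

```python
import sys

# Python-level recursion is bounded by sys.getrecursionlimit() and
# raises RecursionError when exceeded; the parser's C stack is a
# separate resource that this limit does not govern.
def recurse(n):
    return recurse(n + 1)

try:
    recurse(0)
except RecursionError as exc:
    caught = type(exc).__name__

print(caught)
```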

Trying your original example in PyPy (which uses a different parser
implementation) suggests you may want to try using that as your
execution target before resorting to switching languages entirely:

    >>>> tree2 =
    >>>> len(tree2.tolist())

Alternatively, you could explore mimicking the way that scikit-learn
saves its trained models (which I believe is a variation on "use
pickle", but I've never actually gone and checked for sure).


Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
