[Python-Dev] status of development documentation

Tim Peters tim.peters at gmail.com
Sun Dec 25 05:43:08 CET 2005


[Tim]
>> FWIW, test_builtin and test_pep263 both passed on WinXP in rev 39757.
>> That's the last revision before the AST branch was merged.
>>
>> I can't build rev 39758 on WinXP (VC complains that pythoncore.vcproj
>> can't be loaded -- looks like it got checked in with unresolved SVN
>> conflict markers -- which isn't easy to do under SVN ;-( ), so don't
>> know about that.
>>
>> The first revision at which Python built again was 39791 (23 Oct), and
>> test_builtin and test_pep263 both fail under that revision the same
>> way they fail today.

[Brett]
> Both syntax errors, right?

In test_builtin, yes, two syntax errors.  test_pep263 is different:

test test_pep263 failed -- Traceback (most recent call last):
  File "C:\Code\python\lib\test\test_pep263.py", line 12, in test_pep263
    '\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'
AssertionError:
'\xc3\xb0\xc3\x89\xc3\x94\xc3\x8f\xc3\x8e' !=
'\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'

That's not a syntax error; it's a wrong result.  There are other
parsing-related test failures, but those are the only two I've written
up so far (partly because I expect they all have the same underlying
cause, and partly because nobody so far seems to understand the code
well enough to explain why the first one works on any platform ;-)).
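
FWIW, the "wrong" bytes smell like the coding cookie simply being
ignored.  Assuming test_pep263.py really does carry a koi8-r coding
cookie and compares the UTF-8 encoding of its Cyrillic literal against
a hard-coded byte string (that's my reading of the traceback, not
something I've re-verified), the bad value is exactly what you get by
decoding the koi8-r source bytes as Latin-1 instead.  A quick check
(the names here are mine, not the test's):

    # The test's Cyrillic literal as raw koi8-r source bytes.
    koi8r_bytes = '\xf0\xc9\xd4\xcf\xce'

    # What the test expects:  decode with the declared koi8-r codec,
    # then encode to UTF-8.
    expected = koi8r_bytes.decode('koi8-r').encode('utf-8')

    # What you get if the cookie is ignored and the bytes are treated
    # as Latin-1 instead.
    wrong = koi8r_bytes.decode('latin-1').encode('utf-8')

    print repr(expected)   # '\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'
    print repr(wrong)      # '\xc3\xb0\xc3\x89\xc3\x94\xc3\x8f\xc3\x8e'

That would be consistent with the source-decoding machinery in
tokenizer.c not running at all on this box (more on that below).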

> My mind is partially gone thanks to being on vacation so following this thread
> has been abnormally hard.  =)
>
> Since it is a syntax error there won't be any bytecode to compare against.

Shouldn't be needed.  The snippet:

    bom = '\xef\xbb\xbf'
    compile(bom + 'print 1\n', '', 'exec')

treats the `bom` prefix like any other sequence of illegal characters.
That's why it raises SyntaxError:

    It peels off the first character (\xef), and says "syntax
    error" at that point:

    Py_CompileStringFlags ->
    PyParser_ASTFromString ->
    PyParser_ParseStringFlagsFilename ->
    parsetok ->
    PyTokenizer_Get

    That sets `a` to point at the start of the string, `b` to point at the
    second character, and returns type==51.  Then `len` is set to 1,
    `str` is malloc'ed to hold 2 bytes, and `str` is filled in with
    "\xef\x00" (the first byte of the input, as a NUL-terminated C
    string).

    PyParser_AddToken then calls classify(), which falls off the end of
    its last loop and returns -1:  syntax error.
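
For concreteness, the symptom boils down to a few lines you can paste
at a prompt.  This just restates what test_builtin exercises (the
filenames are made up), but it shows the three leading bytes are the
whole problem:

    import codecs

    bom = '\xef\xbb\xbf'
    assert bom == codecs.BOM_UTF8    # those three bytes are the UTF-8 BOM

    compile('print 1\n', '<no bom>', 'exec')       # fine on every platform
    compile(bom + 'print 1\n', '<bom>', 'exec')    # SyntaxError on the
                                                   # broken Windows build

The identical source minus the BOM compiles everywhere, so it's the
BOM itself the tokenizer is choking on -- which is what the trace
above shows.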

And later in that same write-up:

    I'm getting a strong suspicion that I'm the only developer to _try_
    building the trunk on WinXP since the AST merge was done, and that
    something obscure is fundamentally broken with it on this box.  For
    example, in tokenizer.c, these functions don't even exist on Windows
    today (because an enclosing #ifdef says not to compile them):

    error_ret
    new_string
    get_normal_name
    get_coding_spec
    check_coding_spec
    check_bom
    fp_readl
    fp_setreadl
    fp_getc
    fp_ungetc
    decoding_fgets
    decoding_feof
    buf_getc
    buf_ungetc
    buf_setreadl
    translate_into_utf8
    decode_str

    OK, that's not quite true.  "Degenerate" forms of three of those
    functions exist on Windows:

    static char *
    decoding_fgets(char *s, int size, struct tok_state *tok)
    {
        return fgets(s, size, tok->fp);
    }

    static int
    decoding_feof(struct tok_state *tok)
    {
        return feof(tok->fp);
    }

    static const char *
    decode_str(const char *str, struct tok_state *tok)
    {
        return str;
    }

    In the simple failing test, that degenerate decode_str() is getting
    called.  If the "fancy" decode_str() were being used instead, that one
    _does_ call check_bom().  Why do we have two versions of these
    functions?  Which set is supposed to be in use now?  What's the
    meaning of "#ifdef PGEN" today?  Should PGEN be defined here or not?
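
For reference, here's roughly what I believe the "fancy" decode_str()
is supposed to arrange, written out in Python.  This is a hedged
paraphrase from reading tokenizer.c, not a spec:  the function name is
mine, and the cookie handling below is far sloppier than the real
get_coding_spec()/check_coding_spec():

    import codecs

    def intended_decode_str(src):
        # Step 1 (check_bom):  strip a leading UTF-8 BOM and pin the
        # encoding to utf-8.
        enc = None
        if src.startswith(codecs.BOM_UTF8):
            src = src[len(codecs.BOM_UTF8):]
            enc = 'utf-8'
        # Step 2 (check_coding_spec):  otherwise look for a PEP 263
        # coding cookie in the first two lines (crudely, here).
        if enc is None:
            for line in src.split('\n', 2)[:2]:
                if line.startswith('#') and 'coding:' in line:
                    enc = line.split('coding:')[1].split()[0]
                    break
        # Step 3 (translate_into_utf8):  if some other encoding was
        # declared, recode the source to UTF-8 before tokenizing.
        if enc not in (None, 'utf-8'):
            src = src.decode(enc).encode('utf-8')
        return src

    # With that done up front, the BOM'ed snippet compiles fine:
    compile(intended_decode_str('\xef\xbb\xbf' + 'print 1\n'), '', 'exec')

If the degenerate versions are what got compiled in, none of that
happens, which would explain both failures:  the BOM reaches the
tokenizer as garbage bytes (test_builtin, via decode_str), and the
koi8-r cookie is never honored (test_pep263, via decoding_fgets).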

>> I'm darned near certain that we're not using the _intended_ parsing
>> code on Windows now -- PGEN is still #define'd when the "final"
>> parsing code is compiled into python25.dll.  Don't know how to fix
>> that (I don't understand it).

> But the AST branch didn't touch the parser (unless you are grouping
> ast.c and compile.c under the "parser" umbrella just to throw me off
> =).

Possibly.  See above for unanswered questions about tokenizer.c, which
appears to be the whole problem wrt test_builtin.  Python couldn't be
built under VC7.1 on Windows after the AST merge.  However the build
got repaired, the fix left parsing/tokenizing broken on Windows wrt
(at least) some encoding gimmicks.  Since the tests passed immediately before the AST
merge, and failed the first time Python could be built again after
that merge, it's the only natural candidate for finger-wagging.

> What can I do to help?

I don't know.  Enjoying Christmas couldn't hurt :-)  What this needs
is someone who understands how

    bom = '\xef\xbb\xbf'
    compile(bom + 'print 1\n', '', 'exec')

is supposed to work at the front-end level.

>  Do you need me to step through something?

Why doesn't the little code snippet above fail anywhere else? 
"Should" the degenerate decode_str() be getting called during it -- or
should the other decode_str() be getting called?  If the latter, what
got broken on Windows during the merge so that the wrong one is getting
called now?

> Do you need to know how gcc is preprocessing some file?

No, I just need to know how to fix Python on Windows ;-)

