[Python-Dev] status of development documentation
Brett Cannon
bcannon at gmail.com
Sun Dec 25 07:29:36 CET 2005
On 12/24/05, Tim Peters <tim.peters at gmail.com> wrote:
> [Tim]
> >> FWIW, test_builtin and test_pep263 both passed on WinXP in rev 39757.
> >> That's the last revision before the AST branch was merged.
> >>
> >> I can't build rev 39758 on WinXP (VC complains that pythoncore.vcproj
> >> can't be loaded -- looks like it got checked in with unresolved SVN
> >> conflict markers -- which isn't easy to do under SVN ;-( ), so don't
> >> know about that.
> >>
> >> The first revision at which Python built again was 39791 (23 Oct), and
> >> test_builtin and test_pep263 both fail under that the same way they
> >> fail today.
>
> [Brett]
> > Both syntax errors, right?
>
> In test_builtin, yes, two syntax errors. test_pep263 is different:
>
> test test_pep263 failed -- Traceback (most recent call last):
>   File "C:\Code\python\lib\test\test_pep263.py", line 12, in test_pep263
>     '\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'
> AssertionError:
> '\xc3\xb0\xc3\x89\xc3\x94\xc3\x8f\xc3\x8e' !=
> '\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'
>
> That's not a syntax error, it's a wrong result. There are other
> parsing-related test failures, but those are the only two I've written
> up so far (partly because I expect they all have the same underlying
> cause, and partly because nobody so far seems to understand the code
> well enough to explain why the first one works on any platform ;-)).
>
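For readers following the byte strings: the value on the left of the AssertionError is what one would get if the five KOI8-R bytes for the Russian word for "Python" were decoded as Latin-1 instead of KOI8-R and then re-encoded as UTF-8. The sketch below (modern Python 3, not the 2005 trunk) only checks that arithmetic. That the test file declares KOI8-R is an assumption not stated in this message, and the variable names are illustrative.

```python
# Assumption: the literal in the test file is the Russian word for "Python",
# stored on disk as KOI8-R bytes. Written with escapes to stay ASCII-only.
word = "\u041f\u0438\u0442\u043e\u043d"
koi8_source = word.encode("koi8_r")

# Decoding with the declared codec, then re-encoding as UTF-8.
expected = koi8_source.decode("koi8_r").encode("utf-8")

# Decoding the same bytes as Latin-1 instead (i.e. ignoring the declaration).
observed = koi8_source.decode("latin-1").encode("utf-8")

# These are the two byte strings from the traceback.
assert expected == b"\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd"
assert observed == b"\xc3\xb0\xc3\x89\xc3\x94\xc3\x8f\xc3\x8e"
```

If that reading is right, the failure looks like the source encoding declaration being ignored rather than a parse error, which is consistent with Tim's "wrong result" description.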
> > My mind is partially gone thanks to being on vacation so following this thread
> > has been abnormally hard. =)
> >
> > Since it is a syntax error there won't be any bytecode to compare against.
>
> Shouldn't be needed. The snippet:
>
> bom = '\xef\xbb\xbf'
> compile(bom + 'print 1\n', '', 'exec')
>
> treats the `bom` prefix like any other sequence of illegal characters.
> That's why it raises SyntaxError:
>
> It peels off the first character (\xef), and says "syntax
> error" at that point:
>
> Py_CompileStringFlags ->
> PyParser_ASTFromString ->
> PyParser_ParseStringFlagsFilename ->
> parsetok ->
> PyTokenizer_Get
>
> That sets `a` to point at the start of the string, `b` to point at the
> second character, and returns type==51. Then `len` is set to 1,
> `str` is malloc'ed to hold 2 bytes, and `str` is filled in with
> "\xef\x00" (the first byte of the input, as a NUL-terminated C
> string).
>
> PyParser_AddToken then calls classify(), which falls off the end of
> its last loop and returns -1: syntax error.
>
> and later:
>
> I'm getting a strong suspicion that I'm the only developer to _try_
> building the trunk on WinXP since the AST merge was done, and that
> something obscure is fundamentally broken with it on this box. For
> example, in tokenizer.c, these functions don't even exist on Windows
> today (because an enclosing #ifdef says not to compile them):
>
> error_ret
> new_string
> get_normal_name
> get_coding_spec
> check_coding_spec
> check_bom
> fp_readl
> fp_setreadl
> fp_getc
> fp_ungetc
> decoding_fgets
> decoding_feof
> buf_getc
> buf_ungetc
> buf_setreadl
> translate_into_utf8
> decode_str
>
> OK, that's not quite true. "Degenerate" forms of three of those
> functions exist on Windows:
>
>     static char *
>     decoding_fgets(char *s, int size, struct tok_state *tok)
>     {
>         return fgets(s, size, tok->fp);
>     }
>
>     static int
>     decoding_feof(struct tok_state *tok)
>     {
>         return feof(tok->fp);
>     }
>
>     static const char *
>     decode_str(const char *str, struct tok_state *tok)
>     {
>         return str;
>     }
>
> In the simple failing test, that degenerate decode_str() is getting
> called. If the "fancy" decode_str() were being used instead, that one
> _does_ call check_bom(). Why do we have two versions of these
> functions? Which set is supposed to be in use now? What's the
> meaning of "#ifdef PGEN" today? Should it be true or false?
>
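The difference between the two decode_str() variants described above can be sketched in Python 3 as follows. This is an illustrative model, not a port of tokenizer.c: the function names are hypothetical, only the BOM check is modeled (coding-spec handling is left out), and `x = 1` stands in for the thread's Python 2 `print 1`.

```python
import codecs


def decode_str_degenerate(source: bytes) -> bytes:
    """Model of the pass-through variant: return the input untouched."""
    return source


def decode_str_bom_aware(source: bytes) -> bytes:
    """Hypothetical model of the check_bom() step only: drop a UTF-8 BOM."""
    if source.startswith(codecs.BOM_UTF8):
        return source[len(codecs.BOM_UTF8):]
    return source


source = codecs.BOM_UTF8 + b"x = 1\n"

passthrough = decode_str_degenerate(source)
stripped = decode_str_bom_aware(source)

# The first byte a tokenizer would see in each case.
first_byte_passthrough = passthrough[:1]
first_byte_stripped = stripped[:1]
```

With the pass-through variant the first byte handed to the tokenizer is `\xef`, which matches the failure Tim traces; with the BOM-aware variant the tokenizer starts on ordinary source text.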
Looking at the logs for tokenizer.c, tokenizer.h, and
tokenizer_pgen.c, it looks like these files have not been heavily
touched since Martin did the work for PEP 263.

> >> I'm darned near certain that we're not using the _intended_ parsing
> >> code on Windows now -- PGEN is still #define'd when the "final"
> >> parsing code is compiled into python25.dll. Don't know how to fix
> >> that (I don't understand it).
>
> > But the AST branch didn't touch the parser (unless you are grouping
> > ast.c and compile.c under the "parser" umbrella just to throw me off
> > =).
>
> Possibly. See above for unanswered questions about tokenizer.c, which
> appears to be the whole problem wrt test_builtin. Python couldn't be
> built under VC7.1 on Windows after the AST merge. However that got
> repaired, it left parsing/tokenizing broken on Windows wrt (at least) some
> encoding gimmicks. Since the tests passed immediately before the AST
> merge, and failed the first time Python could be built again after
> that merge, it's the only natural candidate for finger-wagging.
>
Did it lead to tokenizer_pgen.c suddenly being used for the build
instead of tokenizer.c? The former seems to be the only place where
PGEN is defined.

> > What can I do to help?
>
> I don't know. Enjoying Christmas couldn't hurt :-) What this needs
> is someone who understands how
>
> bom = '\xef\xbb\xbf'
> compile(bom + 'print 1\n', '', 'exec')
>
> is supposed to work at the front-end level.
>
Hopefully Martin will have some inkling since he committed the phase 1
stuff for PEP 263.

> > Do you need me to step through something?
>
> Why doesn't the little code snippet above fail anywhere else?
> "Should" the degenerate decode_str() be getting called during it -- or
> should the other decode_str() be getting called? If the latter, what
> got broke on Windows during the merge so that the wrong one is getting
> called now?
>
> > Do you need to know how gcc is preprocessing some file?
>
> No, I just need to know how to fix Python on Windows ;-)

=)

-Brett