On Tue, May 28, 2013 at 1:51 AM, Haoyi Li wrote:
If-if-if all that works out, you would be able to completely remove the ("b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" | "r" | "u" | "R" | "U") from the grammar specification! Not add more to it, remove it! That would shift the specification of all the different string prefixes into a user-land library. I'd say that's a pretty creative way of getting rid of that nasty blob of grammar =D.
Now, that's a lot of "if"s, and I have no idea if any of them at all are true, but if-if-if they are all true, we could both simplify the lexer/parser and open up the prefixes for extensibility, while maintaining the exact semantics for existing code.
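To make the registration idea concrete, here is a minimal pure-Python sketch of what such a user-land prefix registry could look like. Everything here is hypothetical — `register_str_prefix`, the `(source_bytes, source_encoding)` handler signature, and `evaluate_literal` are assumptions extrapolated from the proposal, not an existing CPython API:

```python
# Hypothetical user-land registry mapping string prefixes to handlers.
_prefix_handlers = {}

def register_str_prefix(prefix, handler, cache_token):
    """Map a literal prefix (or None for no prefix) to a handler."""
    _prefix_handlers[prefix] = (handler, cache_token)

def evaluate_literal(prefix, source_bytes, source_encoding="utf-8"):
    """Dispatch the raw bytes of a literal to its registered handler."""
    try:
        handler, _token = _prefix_handlers[prefix]
    except KeyError:
        raise SyntaxError("unknown string prefix: %r" % (prefix,))
    return handler(source_bytes, source_encoding)

# The two simplest handlers from the proposal: raw str and raw bytes.
def str_raw(source_bytes, source_encoding):
    return source_bytes.decode(source_encoding)

def bytes_raw(source_bytes, source_encoding):
    return source_bytes

for prefix in ("r", "R"):
    register_str_prefix(prefix, str_raw, "core")
for prefix in ("rb", "rB", "Rb", "RB"):
    register_str_prefix(prefix, bytes_raw, "core")
```

With this in place, `evaluate_literal("r", b"hello")` returns the str `"hello"`, while `evaluate_literal("rb", b"hello")` returns the bytes unchanged, and an unregistered prefix raises `SyntaxError` — which is roughly the behaviour the lexer would delegate to.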
Oops, should have read more of the thread before replying :) But, yeah, it would be nice if we could get to a mechanism that replaces the current horror show that is string prefix handling (see http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-li...) with the functional equivalent of the following Python code:

    def str_raw(source_bytes, source_encoding):
        return source_bytes.decode(source_encoding)

    def str_with_escapes(source_bytes, source_encoding):
        # Handle escapes to create "s"
        return s

    def bytes_raw(source_bytes, source_encoding):
        return source_bytes

    def bytes_with_escapes(source_bytes, source_encoding):
        # Handle escapes to create "b"
        return b

    cache_token = "Marker for pyc validity checking"

    for prefix in (None, "u", "U"):
        ast.register_str_prefix(prefix, str_with_escapes, cache_token)
    for prefix in ("r", "R"):
        ast.register_str_prefix(prefix, str_raw, cache_token)
    for prefix in (None, "b", "B"):
        ast.register_str_prefix(prefix, bytes_with_escapes, cache_token)
    for prefix in ("br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
        ast.register_str_prefix(prefix, bytes_raw, cache_token)

The module caching code would likely need to grow another header dict that stores a mapping of prefix implementation names to their cache tokens. If the cache file references an unregistered prefix, the import would fail; if it references one with a mismatched cache token, the cached file would need to be regenerated. We could either just live with the fact that running the same file with different registrations may regenerate the file in __pycache__, or else come up with a nonconflicting naming scheme (I suspect the latter would be too messy and too rarely needed to be worth the hassle).

(Obviously, the four core handlers wouldn't work quite this way - they would always be present, and their cache invalidation would be handled with the existing global bytecode cookie.
However, it's a useful demonstration of the value of the generalisation, and the issues any such generalisation will need to handle).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia