[Python-ideas] Custom string prefixes

Tue May 28 05:06:38 CEST 2013

On Tue, May 28, 2013 at 1:51 AM, Haoyi Li <haoyi.sg at gmail.com> wrote:
> If-if-if all that works out, you would be able to completely remove the ("b"
> | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" | "r" | "u" |
> "R" | "U") from the grammar specification! Not add more to it, remove it!
> Shifting the specification of all the different string prefixes into a
> user-land library. I'd say that's a pretty creative way of getting rid of
> that nasty blob of grammar =D.
>
> Now, that's a lot of "if"s, and I have no idea if any of them at all are
> true, but if-if-if they are all true, we could both simplify the
> lexer/parser, open up the prefixes for extensibility while maintaining the
> exact semantics for existing code.

Oops, should have read more of the thread before replying :)

But, yeah, it would be nice if we could get to a mechanism that
replaces the current horror show that is string prefix handling (see
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)
with the functional equivalent of the following Python code:

    def str_raw(source_bytes, source_encoding):
        return source_bytes.decode(source_encoding)

    def str_with_escapes(source_bytes, source_encoding):
        # Handle escapes to create "s"
        return s

    def bytes_raw(source_bytes, source_encoding):
        return source_bytes

    def bytes_with_escapes(source_bytes, source_encoding):
        # Handle escapes to create "b"
        return b

    cache_token = "Marker for pyc validity checking"

    for prefix in (None, "u", "U"):
        ast.register_str_prefix(prefix, str_with_escapes, cache_token)

    for prefix in ("r", "R"):
        ast.register_str_prefix(prefix, str_raw, cache_token)

    for prefix in ("b", "B"):
        ast.register_str_prefix(prefix, bytes_with_escapes, cache_token)

    for prefix in ("br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
        ast.register_str_prefix(prefix, bytes_raw, cache_token)
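To make the dispatch concrete, here is a toy registry that could sit behind
calls like the ones above. It is only a sketch: register_str_prefix here is a
plain function defined in the example (nothing like it exists in the real ast
module), evaluate_literal is a hypothetical name for whatever the compiler
would call, and "demo-token" is a placeholder value. Only the two raw handlers
are wired up, since they need no escape processing.

```python
# Hypothetical registry: maps a literal prefix to (handler, cache_token).
# None of this exists in the real `ast` module; it only illustrates the idea.
_PREFIX_HANDLERS = {}


def register_str_prefix(prefix, handler, cache_token):
    """Associate a prefix (None would mean "no prefix") with a handler."""
    if prefix in _PREFIX_HANDLERS:
        raise ValueError(f"prefix {prefix!r} is already registered")
    _PREFIX_HANDLERS[prefix] = (handler, cache_token)


def evaluate_literal(prefix, source_bytes, source_encoding="utf-8"):
    """Look up the handler for `prefix` and apply it to the literal body."""
    try:
        handler, _cache_token = _PREFIX_HANDLERS[prefix]
    except KeyError:
        raise LookupError(f"no handler for prefix {prefix!r}") from None
    return handler(source_bytes, source_encoding)


def str_raw(source_bytes, source_encoding):
    # Raw str literal: decode the source bytes, leave backslashes alone.
    return source_bytes.decode(source_encoding)


def bytes_raw(source_bytes, source_encoding):
    # Raw bytes literal: the source bytes are the value.
    return source_bytes


for prefix in ("r", "R"):
    register_str_prefix(prefix, str_raw, "demo-token")

for prefix in ("br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
    register_str_prefix(prefix, bytes_raw, "demo-token")
```

With this in place, evaluate_literal("r", b"a\\n") gives back the str
containing a backslash and an "n", and registering the same prefix twice
raises rather than silently replacing the earlier handler.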

The module caching code would likely need to grow another header dict
that stores a mapping of prefix implementation names to their cache
tokens. If the cache file references an unregistered prefix then the
import would fail, while if it references one with a mismatched cache
token, then the cached file would need to be regenerated. We could
either just live with the fact that running the same file with
different registrations may regenerate the file in __pycache__, or
else come up with a nonconflicting naming scheme (I suspect the latter
would be too messy and too rarely needed to be worth the hassle).
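The validity check described above could be sketched roughly as follows. This
is an illustration under assumptions, not how the import system works today:
check_cached_prefixes, CacheStatus and the two dicts are hypothetical names,
and the "header dict" is modelled as a plain mapping of handler name to cache
token.

```python
from enum import Enum


class CacheStatus(Enum):
    VALID = "valid"   # cached bytecode can be used as-is
    STALE = "stale"   # cached bytecode must be regenerated


def check_cached_prefixes(cached_tokens, registered_tokens):
    """Compare tokens stored in a cache file with the live registrations.

    cached_tokens:     handler name -> token recorded in the cache header
    registered_tokens: handler name -> token of the current registration
    """
    # An unregistered handler is fatal: the import cannot proceed.
    for name in cached_tokens:
        if name not in registered_tokens:
            raise ImportError(
                f"cached module uses unregistered string prefix handler {name!r}"
            )
    # A mismatched token only means the cache file is out of date.
    for name, token in cached_tokens.items():
        if registered_tokens[name] != token:
            return CacheStatus.STALE
    return CacheStatus.VALID
```

Checking for missing handlers before comparing tokens means a file that is
both stale and dependent on an unregistered prefix fails the import instead of
triggering a regeneration that could not succeed anyway.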

(Obviously, the four core handlers wouldn't work quite this way - they
would always be present, and their cache invalidation would be handled
with the existing global bytecode cookie. However, it's a useful
demonstration of the value of the generalisation, and the issues any
such generalisation will need to handle.)

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia