[Python-ideas] Custom string prefixes
Ron Adam
ron3200 at gmail.com
Tue May 28 08:32:17 CEST 2013
On 05/27/2013 10:06 PM, Nick Coghlan wrote:
> On Tue, May 28, 2013 at 1:51 AM, Haoyi Li <haoyi.sg at gmail.com> wrote:
>> If-if-if all that works out, you would be able to completely remove the
>> ("b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" | "r"
>> | "u" | "R" | "U") from the grammar specification! Not add more to it,
>> remove it! Shifting the specification of all the different string
>> prefixes into a user-land library. I'd say that's a pretty creative way
>> of getting rid of that nasty blob of grammar =D.
>>
>> Now, that's a lot of "if"s, and I have no idea if any of them at all are
>> true, but if-if-if they are all true, we could both simplify the
>> lexer/parser, open up the prefixes for extensibility while maintaining
>> the exact semantics for existing code.
> Oops, should have read more of the thread before replying :)
>
> But, yeah, it would be nice if we could get to a mechanism that
> replaces the current horror show that is string prefix handling (see
> http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)
> with the functional equivalent
> of the following Python code:
Yes, the grammar reference is a bit hard to grasp, but there are really
only four cases there (see the sketch after the quoted code below). It
just seems like more when you consider the different variations of upper
and lower case.
> def str_raw(source_bytes, source_encoding):
>     return source_bytes.decode(source_encoding)
>
> def str_with_escapes(source_bytes, source_encoding):
>     # Handle escapes to create "s"
>     return s
>
> def bytes_raw(source_bytes, source_encoding):
>     return source_bytes
>
> def bytes_with_escapes(source_bytes, source_encoding):
>     # Handle escapes to create "b"
>     return b
>
> cache_token = "Marker for pyc validity checking"
>
> for prefix in (None, "u", "U"):
>     ast.register_str_prefix(prefix, str_with_escapes, cache_token)
>
> for prefix in ("r", "R"):
>     ast.register_str_prefix(prefix, str_raw, cache_token)
>
> for prefix in (None, "b", "B"):
>     ast.register_str_prefix(prefix, bytes_with_escapes, cache_token)
>
> for prefix in ("br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
>     ast.register_str_prefix(prefix, bytes_raw, cache_token)
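To put the "four cases" point another way: every legal prefix spelling
answers two independent yes/no questions, bytes-or-str and raw-or-cooked.
A quick sketch (plain illustrative Python, not anything from CPython):

# Every prefix spelling collapses to (bytes-or-str, raw-or-cooked).
def classify_prefix(prefix):
    p = prefix.lower()
    return ('bytes' if 'b' in p else 'str',
            'raw' if 'r' in p else 'cooked')

# All the spellings from the grammar blob produce only four pairs:
for spelling in ("", "u", "U", "r", "R", "b", "B",
                 "br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
    print(repr(spelling), classify_prefix(spelling))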
While playing around in tokenizer.c, it took me a bit to figure out how
this worked...
    if (is_potential_identifier_start(c)) {
        /* Process b"", r"", u"", br"" and rb"" */
        int saw_b = 0, saw_r = 0, saw_u = 0;
        while (1) {
            if (!(saw_b || saw_u) && (c == 'b' || c == 'B'))
                saw_b = 1;
            /* Since this is a backwards compatibility support literal we
               don't want to support it in arbitrary order like byte
               literals. */
            else if (!(saw_b || saw_u || saw_r) && (c == 'u' || c == 'U'))
                saw_u = 1;
            /* ur"" and ru"" are not supported */
            else if (!(saw_r || saw_u) && (c == 'r' || c == 'R'))
                saw_r = 1;
            else
                break;
            c = tok_nextc(tok);
            if (c == '"' || c == '\'')
                goto letter_quote;
        }
It continues with the identifier section if it doesn't jump to the string
section.
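A rough Python paraphrase of that loop may make the flag logic easier to
follow (my own illustration; the real code reads one character at a time
from the tok stream):

# Rough Python paraphrase of the prefix-scanning loop above.
def scan_prefix(chars):
    saw_b = saw_r = saw_u = False
    i = 0
    while True:
        c = chars[i:i+1]
        if not (saw_b or saw_u) and c in ('b', 'B'):
            saw_b = True
        # u"" is a backwards-compatibility literal, so it may not
        # combine with the others in arbitrary order.
        elif not (saw_b or saw_u or saw_r) and c in ('u', 'U'):
            saw_u = True
        # ur"" and ru"" are not supported.
        elif not (saw_r or saw_u) and c in ('r', 'R'):
            saw_r = True
        else:
            break
        i += 1
        if chars[i:i+1] in ('"', "'"):
            return chars[:i], chars[i:]   # prefix found, string follows
    return None, chars                    # fall through to identifier

print(scan_prefix('rb"abc"'))   # ('rb', '"abc"')
print(scan_prefix('ur"abc"'))   # (None, 'ur"abc"') -- rejected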
I came up with a working alternative that I think is much easier to
understand...
    /* Check for standard string */
    if (c == '"' || c == '\'')
        goto letter_quote;

    /* Check for string literals b"", r"" or u"". */
    c2 = c;
    c = tok_nextc(tok);
    if ((c == '"' || c == '\'')
        && ((c2 == 'b') || (c2 == 'B') || (c2 == 'r') || (c2 == 'R')
            || (c2 == 'u') || (c2 == 'U')))
        goto letter_quote;

    /* Check for string literals rb"" and br"".  The two letters must
       differ case-insensitively, so rr"", bb"", rR"" etc. are
       rejected. */
    c3 = c;
    c = tok_nextc(tok);
    if ((c == '"' || c == '\'')
        && (Py_TOLOWER(c2) != Py_TOLOWER(c3))
        && ((c2 == 'r') || (c2 == 'R') || (c2 == 'b') || (c2 == 'B'))
        && ((c3 == 'b') || (c3 == 'B') || (c3 == 'r') || (c3 == 'R')))
        goto letter_quote;

    tok_backup(tok, c);
    tok_backup(tok, c3);
    c = c2;
    goto not_a_string;

letter_quote:
    /* String */
    {
        ... reads string to find its end.
The jump to "not_a_string" just skips over the rest of the string section.
Because it's not a loop, it takes a few more lines. This puts all the
string code together in one place, and the identifier parts don't have any
string-testing lines in them.
Maybe it's not quite as efficient, but I think it's much easier to
understand.
(And yes, I could've used if-elses and avoided the gotos, but I like the
fall-through pattern in this case, without the deep indentation.)
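For comparison, here's the lookahead version in the same kind of Python
paraphrase (again just an illustration, with the case-insensitive letter
comparison written out):

# Rough Python paraphrase of the lookahead alternative above.
QUOTES = ('"', "'")

def scan_prefix2(chars):
    # Check for standard string.
    if chars[:1] in QUOTES:
        return '', chars
    c2, c3 = chars[:1], chars[1:2]
    # Check for string literals b"", r"" or u"".
    if chars[1:2] in QUOTES and c2 in 'bBrRuU':
        return c2, chars[1:]
    # Check for string literals rb"" and br"" -- the two letters must
    # differ case-insensitively, so rr"" and bb"" are rejected.
    if (chars[2:3] in QUOTES and c2.lower() != c3.lower()
            and c2 in 'rRbB' and c3 in 'bBrR'):
        return c2 + c3, chars[2:]
    return None, chars    # not a string; back up and try identifier

print(scan_prefix2('br"abc"'))   # ('br', '"abc"')
print(scan_prefix2('bb"abc"'))   # (None, 'bb"abc"') -- rejected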
I'm not sure how practical removing or moving string prefixes would be.
Having only a few literals is probably the best practical compromise
between the two ideals.
Moving them to the run-time parser would make some things slower. Being
able to add or register more prefixes would probably hurt Python's
readability when you want to review someone else's programs. I think it
would only improve readability for programs we write ourselves, because
then we know much more easily what we defined those prefixes to mean. That
wouldn't be the case when we read someone else's code.
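For example, with a registration API like the one Nick sketched above, an
author might define an "sql" prefix. None of this exists; in the sketch
below a plain dict stands in for the hypothetical registry, and the
dispatch the compiler would do is written out by hand -- but it shows why
an unfamiliar prefix forces a trip to its definition:

# Hypothetical illustration only: neither register_str_prefix nor
# an "sql" prefix exists.
_prefix_registry = {}

def register_str_prefix(prefix, handler):
    _prefix_registry[prefix] = handler

def sql_literal(source_bytes, source_encoding):
    # The author knows this tags text as a SQL fragment; a reader of
    # someone else's code has to go find this definition first.
    return ("SQL", source_bytes.decode(source_encoding))

register_str_prefix("sql", sql_literal)

# What the compiler would do on seeing: sql"SELECT * FROM users"
handler = _prefix_registry["sql"]
query = handler(b"SELECT * FROM users", "utf-8")
print(query)    # ('SQL', 'SELECT * FROM users')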
There are some options for cleaning up parts of the interpreter. We could
move all (or as much as is doable) of the compile-time stuff to later in
the chain, which probably means moving it to a place where it can all be
done from ast.c. That would make the tokenizer simpler and cleaner.
Alternatively, we could go the other way and move as much as is doable to
a preprocessor step, which would happen just before a program is tokenized.
But I'm not sure either of these options has much real benefit. I'm more
interested in the mini core language that was suggested a while back to
help solve some of the bootstrapping issues. Is anyone working on that?
Cheers,
Ron