[Python-ideas] Custom string prefixes
Ron Adam
ron3200 at gmail.com
Tue May 28 08:32:17 CEST 2013
On 05/27/2013 10:06 PM, Nick Coghlan wrote:
> On Tue, May 28, 2013 at 1:51 AM, Haoyi Li <haoyi.sg at gmail.com> wrote:
>> If-if-if all that works out, you would be able to completely remove the
>> ("b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" | "r"
>> | "u" | "R" | "U") from the grammar specification! Not add more to it,
>> remove it! Shifting the specification of all the different string
>> prefixes into a user-land library. I'd say that's a pretty creative way
>> of getting rid of that nasty blob of grammar =D.
>>
>> Now, that's a lot of "if"s, and I have no idea if any of them at all are
>> true, but if-if-if they are all true, we could both simplify the
>> lexer/parser, open up the prefixes for extensibility while maintaining
>> the exact semantics for existing code.
> Oops, should have read more of the thread before replying :)
>
> But, yeah, it would be nice if we could get to a mechanism that
> replaces the current horror show that is string prefix handling (see
> http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)
> with the functional equivalent
> of the following Python code:
Yes, the grammar reference is a bit hard to grasp, but there are really
only four cases there (see the sketch after the quoted code below). It
just seems like more when you consider the different variations of upper
and lower case.
> def str_raw(source_bytes, source_encoding):
>     return source_bytes.decode(source_encoding)
>
> def str_with_escapes(source_bytes, source_encoding):
>     # Handle escapes to create "s"
>     return s
>
> def bytes_raw(source_bytes, source_encoding):
>     return source_bytes
>
> def bytes_with_escapes(source_bytes, source_encoding):
>     # Handle escapes to create "b"
>     return b
>
> cache_token = "Marker for pyc validity checking"
>
> for prefix in (None, "u", "U"):
>     ast.register_str_prefix(prefix, str_with_escapes, cache_token)
>
> for prefix in ("r", "R"):
>     ast.register_str_prefix(prefix, str_raw, cache_token)
>
> for prefix in (None, "b", "B"):
>     ast.register_str_prefix(prefix, bytes_with_escapes, cache_token)
>
> for prefix in ("br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
>     ast.register_str_prefix(prefix, bytes_raw, cache_token)
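To put the "four cases" point another way: every legal prefix spelling
answers two independent yes/no questions, bytes-or-str and raw-or-cooked.
A quick sketch (plain illustrative Python, not anything from CPython):

# Every prefix spelling collapses to (bytes-or-str, raw-or-cooked).
def classify_prefix(prefix):
    p = prefix.lower()
    return ('bytes' if 'b' in p else 'str',
            'raw' if 'r' in p else 'cooked')

# All the spellings from the grammar blob produce only four pairs:
for spelling in ("", "u", "U", "r", "R", "b", "B",
                 "br", "Br", "bR", "BR", "rb", "rB", "Rb", "RB"):
    print(repr(spelling), classify_prefix(spelling))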
While playing around in tokenizer.c, it took me a bit to figure out how
this worked...
    if (is_potential_identifier_start(c)) {
        /* Process b"", r"", u"", br"" and rb"" */
        int saw_b = 0, saw_r = 0, saw_u = 0;
        while (1) {
            if (!(saw_b || saw_u) && (c == 'b' || c == 'B'))
                saw_b = 1;
            /* Since this is a backwards compatibility support literal we
               don't want to support it in arbitrary order like byte
               literals. */
            else if (!(saw_b || saw_u || saw_r) && (c == 'u' || c == 'U'))
                saw_u = 1;
            /* ur"" and ru"" are not supported */
            else if (!(saw_r || saw_u) && (c == 'r' || c == 'R'))
                saw_r = 1;
            else
                break;
            c = tok_nextc(tok);
            if (c == '"' || c == '\'')
                goto letter_quote;
        }
It continues with the identifier section if it doesn't jump to the string
section.
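A rough Python paraphrase of that loop may make the flag logic easier to
follow (my own illustration; the real code reads one character at a time
from the tok stream):

# Rough Python paraphrase of the prefix-scanning loop above.
def scan_prefix(chars):
    saw_b = saw_r = saw_u = False
    i = 0
    while True:
        c = chars[i:i+1]
        if not (saw_b or saw_u) and c in ('b', 'B'):
            saw_b = True
        # u"" is a backwards-compatibility literal, so it may not
        # combine with the others in arbitrary order.
        elif not (saw_b or saw_u or saw_r) and c in ('u', 'U'):
            saw_u = True
        # ur"" and ru"" are not supported.
        elif not (saw_r or saw_u) and c in ('r', 'R'):
            saw_r = True
        else:
            break
        i += 1
        if chars[i:i+1] in ('"', "'"):
            return chars[:i], chars[i:]   # prefix found, string follows
    return None, chars                    # fall through to identifier

print(scan_prefix('rb"abc"'))   # ('rb', '"abc"')
print(scan_prefix('ur"abc"'))   # (None, 'ur"abc"') -- rejected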
I came up with a working alternative that I think is much easier to
understand...
    /* Check for standard string */
    if (c == '"' || c == '\'')
        goto letter_quote;

    /* Check for string literals b"", r"" or u"". */
    c2 = c;
    c = tok_nextc(tok);
    if ((c == '"' || c == '\'')
        && ((c2 == 'b') || (c2 == 'B') || (c2 == 'r') || (c2 == 'R')
            || (c2 == 'u') || (c2 == 'U')))
        goto letter_quote;

    /* Check for string literals rb"" and br"".  The two letters must
       differ case-insensitively, so rr"", bb"", rR"" etc. are
       rejected. */
    c3 = c;
    c = tok_nextc(tok);
    if ((c == '"' || c == '\'')
        && (Py_TOLOWER(c2) != Py_TOLOWER(c3))
        && ((c2 == 'r') || (c2 == 'R') || (c2 == 'b') || (c2 == 'B'))
        && ((c3 == 'b') || (c3 == 'B') || (c3 == 'r') || (c3 == 'R')))
        goto letter_quote;

    tok_backup(tok, c);
    tok_backup(tok, c3);
    c = c2;
    goto not_a_string;

letter_quote:
    /* String */
    {
        ... reads string to find its end.
The jump to "not_a_string" just skips over the rest of the string section.
Because it's not a loop, it takes a few more lines. This puts all the
string code together in one place, and the identifier parts don't have any
string-testing lines in them.
Maybe it's not quite as efficient, but I think it's much easier to
understand.
(And yes, I could've used if-elses and avoided the gotos, but I like the
fall-through pattern in this case, without the deep indentation.)
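For comparison, here's the lookahead version in the same kind of Python
paraphrase (again just an illustration, with the case-insensitive letter
comparison written out):

# Rough Python paraphrase of the lookahead alternative above.
QUOTES = ('"', "'")

def scan_prefix2(chars):
    # Check for standard string.
    if chars[:1] in QUOTES:
        return '', chars
    c2, c3 = chars[:1], chars[1:2]
    # Check for string literals b"", r"" or u"".
    if chars[1:2] in QUOTES and c2 in 'bBrRuU':
        return c2, chars[1:]
    # Check for string literals rb"" and br"" -- the two letters must
    # differ case-insensitively, so rr"" and bb"" are rejected.
    if (chars[2:3] in QUOTES and c2.lower() != c3.lower()
            and c2 in 'rRbB' and c3 in 'bBrR'):
        return c2 + c3, chars[2:]
    return None, chars    # not a string; back up and try identifier

print(scan_prefix2('br"abc"'))   # ('br', '"abc"')
print(scan_prefix2('bb"abc"'))   # (None, 'bb"abc"') -- rejected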
I'm not sure how practical removing or moving string prefixes would be.
Having only a few literals is probably the best practical compromise
between the two ideals.
Moving them to the run-time parser would make some things slower. Being
able to add or register more prefixes would probably hurt Python's
readability when you want to review someone else's programs. I think it
would only improve readability for programs we write ourselves, because
then we know much more easily what we defined those prefixes to mean. That
wouldn't be the case when we read someone else's code.
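For example, with a registration API like the one Nick sketched above, an
author might define an "sql" prefix. None of this exists; in the sketch
below a plain dict stands in for the hypothetical registry, and the
dispatch the compiler would do is written out by hand -- but it shows why
an unfamiliar prefix forces a trip to its definition:

# Hypothetical illustration only: neither register_str_prefix nor
# an "sql" prefix exists.
_prefix_registry = {}

def register_str_prefix(prefix, handler):
    _prefix_registry[prefix] = handler

def sql_literal(source_bytes, source_encoding):
    # The author knows this tags text as a SQL fragment; a reader of
    # someone else's code has to go find this definition first.
    return ("SQL", source_bytes.decode(source_encoding))

register_str_prefix("sql", sql_literal)

# What the compiler would do on seeing: sql"SELECT * FROM users"
handler = _prefix_registry["sql"]
query = handler(b"SELECT * FROM users", "utf-8")
print(query)    # ('SQL', 'SELECT * FROM users')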
There are some options for cleaning up parts of the interpreter. We could
move all (or as much as is doable) of the compile-time stuff to later in
the chain, which probably means moving it to a place where it can all be
done from ast.c. That would make the tokenizer simpler and cleaner.
Alternatively, we could go the other way and move as much as is doable to
a preprocessor step, which would happen just before a program is tokenized.
But I'm not sure either of these options has much real benefit. I'm more
interested in the mini core language that was suggested a while back to
help solve some of the bootstrapping issues. Is anyone working on that?
Cheers,
Ron