[Python-Dev] Re: User extendable literal modifiers ?!

Gareth McCaughan gmccaughan@synaptics-uk.com
Fri, 27 Sep 2002 12:03:17 +0100 (BST)

Marc-Andre Lemburg wrote:

> Since these are numbers, it would be convenient if there were
> some way to create them in form of literals, much like 123L
> creates longs instead of integers or u"abc" gives you Unicode
> instead of an 8-bit string.
>
> I was wondering whether it would be worth adding something
> like a registry of literal modifiers to Python, so that
> extensions can register new modifiers with the compiler,
> e.g.
>
> sitecustomize.py:
>
> def create_I_literal(literal_string):
>      return 'mx.Number.Integer(%s)' % literal_string
> sys.register_numberlitmod('I', create_I_literal)
>
> test.py:
>
> x = 123I * 456I
> print x, 234I

Too limiting. You'd only be able to do this for numbers,
and it doesn't seem worth the pain just for numbers.
Better would be user-definable *prefixes*.

Common Lisp, for instance, makes it easy to customize
the reader to recognize tokens of the form <hash> <character> <anything>.
So you can arrange that #Q123,234,456:a(b)c turns into, erm,
something terribly useful :-). Some of these characters are
already taken for things like arrays [#(1 2 3), #2A((1 2) (3 4))],
"logical pathnames" (lightly abstracted filenames) [#P"foo/bar/baz"],
bit vectors [#*0001101011001], and so on. As perceptive readers
will have noticed, you can splice a number between "#" and
the magic character for special effects.

Python could do something similar, though obviously "#"
isn't a suitable character :-). Letting the user hijack
the reader as completely as can be done in CL would probably
be un-Pythonic, too. Here's a strawman suggestion.

    For any character "x" in some set I can't be bothered to
    specify, the Python tokenizer/parser will subject input
    of the form $x<string-literal> to special processing.
    The string-literal can be formed using any of {',",''',"""}.

    When I say "tokenizer/parser", I mean: the tokenizer will
    produce a special token encoding the character "x" and the
    contents of the string-literal. The parser will perform
    "special processing" in an attempt to turn it into a more
    normal token.
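
The tokenizer half of this can be approximated with a regular expression. The following is only a sketch in present-day Python: the names DOLLAR_LITERAL and find_dollar_literals are made up for illustration, and a real tokenizer would not look inside ordinary strings or comments the way a plain regex scan does.

```python
import ast
import re

# The four string-literal forms, triple-quoted first so that """...""" is not
# mistaken for an empty "" followed by more text.
_STRING = "|".join([
    r'"""(?:[^\\]|\\.)*?"""',   # triple double-quoted
    r"'''(?:[^\\]|\\.)*?'''",   # triple single-quoted
    r'"(?:[^"\\\n]|\\.)*"',     # double-quoted, single line
    r"'(?:[^'\\\n]|\\.)*'",     # single-quoted, single line
])

# Hypothetical pattern: "$", one non-space, non-quote character, then a string.
DOLLAR_LITERAL = re.compile(
    r"\$(?P<char>[^\s'\"])(?P<literal>" + _STRING + ")", re.DOTALL
)


def find_dollar_literals(source):
    """Yield (char, string_value) for each $x<string-literal> found in source."""
    for match in DOLLAR_LITERAL.finditer(source):
        # ast.literal_eval decodes the quoted text using Python's own rules.
        yield match.group("char"), ast.literal_eval(match.group("literal"))
```

Each pair it yields stands in for the "special token" described above: the character "x" plus the decoded contents of the string-literal.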

    The default "special processing" is to raise a SyntaxError.
    The user can define the special processing appropriate for
    a particular character "x" by making a function that
    interprets the string and feeding it to sys.register_dollar_handler.
    (In fact, anything callable will do.) The function will
    be passed two arguments: the character "x" and the string.
    Its return value will replace the $x"..." combination in
    the token stream, as a literal token.
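
The registration and dispatch step might look roughly like this. It is a sketch under assumptions: sys.register_dollar_handler does not exist, so a module-level function stands in for it, process_dollar_literal is an invented name for the parser's "special processing", and fractions.Fraction plays the part of a Rational class.

```python
from fractions import Fraction

_handlers = {}  # handler character -> callable taking (char, string)


def register_dollar_handler(char, handler):
    """Stand-in for the proposed sys.register_dollar_handler."""
    _handlers[char] = handler


def process_dollar_literal(char, string):
    """Return the value that would replace $char"string" in the token stream."""
    try:
        handler = _handlers[char]
    except KeyError:
        # Default "special processing": no handler means a syntax error.
        raise SyntaxError("no handler registered for $%s literals" % char) from None
    # The handler receives both the character and the string.
    return handler(char, string)


def handle_rational(char, s):
    numerator, denominator = map(int, s.split("/"))
    return Fraction(numerator, denominator)


register_dollar_handler("r", handle_rational)
total = process_dollar_literal("r", "1/2") + process_dollar_literal("r", "3/4")
```

Here total ends up as Fraction(5, 4), which is what $r"1/2" + $r"3/4" would evaluate to under the proposal.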

    If an exception other than a SyntaxError is raised and
    not caught in the handler function then it will be silently
    replaced by a SyntaxError whose parameter has the form
    "ill-formed <xxx> literal". The value of "xxx" is defined
    when registering the handler.
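
The exception rule could be implemented with a small wrapper around the handler call. Again a sketch: call_handler is a hypothetical name, and using "from None" to drop the original exception is just one way of reading "silently replaced".

```python
from fractions import Fraction


def call_handler(handler, char, string, name):
    """Run a handler, turning unexpected failures into SyntaxError."""
    try:
        return handler(char, string)
    except SyntaxError:
        raise  # a handler's own SyntaxError passes through unchanged
    except Exception:
        # Any other exception is replaced; "name" is the <xxx> given at
        # registration time.
        raise SyntaxError("ill-formed %s literal" % name) from None


def handle_rational(char, s):
    # Unpacking fails with ValueError when there is no "/" in the string.
    numerator, denominator = map(int, s.split("/"))
    return Fraction(numerator, denominator)
```

With this wrapper, a string such as "12345" makes handle_rational fail on the unpacking step, and the caller sees only a SyntaxError saying "ill-formed rational literal".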

    Handler functions are permitted to call "eval".

    Examples:

        >>> def handle_rational(char, s):
        ...     assert char == 'r'
        ...     components = s.split('/')
        ...     numerator, denominator = map(int, components)
        ...     return Rational(numerator, denominator)
        >>> sys.register_dollar_handler('r', handle_rational, 'rational')
        >>> print $r"1/2" + $r"3/4"
        >>> print $r"12345"
          File "<string>", line 1
            print $r"12345"
        SyntaxError: ill-formed rational literal


        >>> class Rational:
        ...     def __init__(self, x, y):
        ...         # As a handler, x is the character 'r' and y the string.
        ...         if isinstance(x, str):
        ...             x, y = map(int, y.split("/"))
        ...         self._numerator, self._denominator = x, y
        ...     [etc]
        >>> sys.register_dollar_handler('r', Rational, 'rational')

    Some dollar-syntax characters may be handled by Python itself
    or the standard library, or may be reserved for their use.
    It is possible for users to override them, but this should
    be considered bad practice.

    Registering a handler when one is already in place will produce
    a warning. To un-register a handler, pass None instead of the
    handler function. 

Possible applications:

  - Rational numbers.    $r"123/234"
  - Regular expressions. $/"foo.*bar"
  - Dates and times.     $t"2002-09-27 11:38"
  - Hostnames and ports. $h"www.google.com:80"
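
For what it's worth, handlers for these four could be quite short. The sketch below uses standard-library types as stand-ins (fractions.Fraction for rationals, a plain (host, port) tuple for hostnames); the function names and the date format are assumptions, not part of the proposal.

```python
import re
from datetime import datetime
from fractions import Fraction


def handle_rational(char, s):
    numerator, denominator = map(int, s.split("/"))
    return Fraction(numerator, denominator)


def handle_regex(char, s):
    return re.compile(s)


def handle_datetime(char, s):
    # Assumes the "YYYY-MM-DD HH:MM" layout used in the example above.
    return datetime.strptime(s, "%Y-%m-%d %H:%M")


def handle_hostport(char, s):
    # Split on the last colon so the host part is left untouched.
    host, port = s.rsplit(":", 1)
    return host, int(port)


# Hypothetical mapping from dollar character to handler.
HANDLERS = {
    "r": handle_rational,
    "/": handle_regex,
    "t": handle_datetime,
    "h": handle_hostport,
}
```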


Questions:

  - Is this insane?
  - Is "$" the best character?
  - Should there be a way to return tokens other than literal ones?
    For instance, identifiers or keywords?
  - Is the behaviour with exceptions correct?