Re: User extendable literal modifiers ?!

Marc-Andre Lemburg wrote:
Too limiting. You'd only be able to do this for numbers, and it doesn't seem worth the pain just for numbers.

Better would be user-definable *prefixes*. Common Lisp, for instance, makes it easy to customize the reader to recognize tokens of the form <hash> <character> <anything>. So you can arrange that #Q123,234,456:a(b)c turns into, erm, something terribly useful :-). Some of these characters are already taken for things like arrays [#(1 2 3), #2((1 2) (3 4))], "logical pathnames" (lightly abstracted filenames) [#"foo/bar/baz"], bit vectors [#*0001101011001], and so on. As perceptive readers will have noticed, you can splice a number between "#" and the magic character for special effects.

Python could do something similar, though obviously "#" isn't a suitable character :-). Letting the user hijack the reader as completely as can be done in CL would probably be un-Pythonic, too.

Here's a strawman suggestion.

For any character "x" in some set I can't be bothered to specify, the Python tokenizer/parser will subject input of the form $x<string-literal> to special processing. The string-literal can be formed using any of {',",''',"""}.

When I say "tokenizer/parser", I mean: the tokenizer will produce a special token encoding the character "x" and the contents of the string-literal. The parser will perform "special processing" in an attempt to turn it into a more normal token.

The default "special processing" is to raise a SyntaxError.

The user can define the special processing appropriate for a particular character "x" by making a function that interprets the string and feeding it to sys.register_dollar_handler. (In fact, anything callable will do.) The function will be passed two arguments: the character "x" and the string. Its return value will replace the $x"..." combination in the token stream, as a literal token.

If an exception other than a SyntaxError is raised and not caught in the handler function then it will be silently replaced by a SyntaxError whose parameter has the form "ill-formed <xxx> literal". The value of "xxx" is defined when registering the handler.

Handler functions are permitted to call "eval".

Example:

    >>> def handle_rational(char, s):
    ...     assert char == 'r'
    ...     components = s.split('/')
    ...     numerator, denominator = map(int, components)
    ...     return Rational(numerator, denominator)
    ...
    >>> sys.register_dollar_handler('r', handle_rational, 'rational')
    >>> print $r"1/2" + $r"3/4"
    $r"5/4"
    >>> print $r"12345"
      File "<string>", line 1
        print $r"12345"
              ^
    SyntaxError: ill-formed rational literal
    >>>

Alternatively:

    >>> class Rational:
    ...     def __init__(self, x, y):
    ...         if isinstance(x, str):
    ...             x,y = map(int, y.split("/"))
    ...         self._numerator, self._denominator = x,y
    ...     [etc]
    ...
    >>> sys.register_dollar_handler('r', Rational, 'rational')

Some dollar-syntax characters may be handled by Python itself or the standard library, or may be reserved for their use. It is possible for users to override them, but this should be considered bad practice. Registering a handler when one is already in place will produce a warning. To un-register a handler, pass None instead of the handler function.

Possible applications:

- Rational numbers. $r"123/234"
- Regular expressions. $/"foo.*bar"
- Dates and times. $t"2002-09-27 11:38"
- Hostnames and ports. $h"www.google.com:80"

Questions:

- Is this insane?
- Is "$" the best character?
- Should there be a way to return tokens other than literal ones? For instance, identifiers or keywords?
- Is the behaviour with exceptions correct?

-- g

On Friday 27 September 2002 01:03 pm, Gareth McCaughan wrote: ...
> Better would be user-definable *prefixes*.
Yes -- nice idea.
> Its return value will replace the $x"..." combination in the token stream, as a literal token.
Why just one token, and why just a literal? Returning an arbitrary sequence of tokens seems more natural. This would allow e.g. Tim Berners-Lee to have basically what he wants (and asked for in his talk at IPC10) in terms of extended syntax for graphs, just with some $x in front. I had a similar idea right after Tim's talk, but could not articulate it clearly enough in a chat with Guido right afterwards, and later I didn't follow through with it. It seems to me that your proposal is detailed and precise enough (while my idea was rather vague) and that, by returning an arbitrary sequence of tokens, it will let Tim embed whatever funky syntax he requires. This power is also the downside of the whole idea of course -- no guarantee that somebody can't use this mechanism to produce highly obfuscated programs. But I think that such a somebody could already obfuscate quite effectively in other ways, and the risk of abuse shouldn't stop this interesting proposal.
> ... return Rational(numerator, denominator)
Hmmm, how would this "return a literal token"? It returns an instance of Rational -- how does the parser treat this instance as a literal token? I thought this use would have to return the sequence of tokens for identifier 'Rational', open parenthesis, literal (value of) numerator, comma, literal (value of) denominator, closed parenthesis -- which in turn is why I thought of an arbitrary sequence of tokens. If a single instance of any arbitrary class may be returned and get treated as a literal token by the parser, then that's much better (maybe I don't know Python's parser well enough, but I don't clearly see how that would be done).
> - Is this insane?
Hope not, since I like it.
> - Is "$" the best character?
Among the few available ones, I think I slightly prefer "@" for this use, but there's little to choose IMHO. Alex
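The token-sequence alternative raised in this reply (identifier, open parenthesis, numerator, comma, denominator, close parenthesis) can be sketched with the standard tokenize module. This is an illustration only: rational_tokens is a hypothetical handler, the expansion is turned back into source text with tokenize.untokenize and then evaluated, and Fraction again stands in for Rational.

```python
import tokenize
from fractions import Fraction
from token import NAME, NUMBER, OP


def rational_tokens(char, s):
    """Hypothetical handler that returns a token sequence, not an object."""
    numerator, denominator = s.split('/')
    # Normalise through int() so that malformed input fails here.
    numerator, denominator = str(int(numerator)), str(int(denominator))
    return [
        (NAME, 'Rational'),
        (OP, '('),
        (NUMBER, numerator),
        (OP, ','),
        (NUMBER, denominator),
        (OP, ')'),
    ]


# untokenize accepts (type, string) pairs and returns equivalent source text;
# the exact spacing it produces is not significant here.
source = tokenize.untokenize(rational_tokens('r', "1/2"))

# The name 'Rational' must be resolvable wherever the expansion is evaluated.
value = eval(source, {'Rational': Fraction})
print(value)  # prints 1/2
```

The last step shows the cost of this approach: the expansion is ordinary code, so it depends on the names visible at the point of use.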

From: Alex Martelli <aleax@aleax.it>
Indeed, because otherwise $r"123/234" = literal transformation => Rational(123,234) would require Rational to be installed in the builtins, or some kind of implicit import (ugly), or people would have to remember to put an explicit from ... import Rational in all modules that use $r; one import per program just to register $r would not be enough. regards
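The namespace problem described here is easy to reproduce. In this sketch the textual expansion Rational(123, 234) is evaluated in two hypothetical module namespaces, one that never imported the name and one that did; Fraction stands in for Rational.

```python
from fractions import Fraction

expansion = "Rational(123, 234)"  # what a purely textual $r"123/234" would become

# A module that uses $r but never imported Rational.
module_without_import = {}
failure = None
try:
    eval(expansion, module_without_import)
except NameError as exc:
    failure = str(exc)

# A module that did the explicit import (Fraction plays the role of Rational).
module_with_import = {"Rational": Fraction}
value = eval(expansion, module_with_import)

print(failure)
print(value)  # prints 41/78
```

Registering the handler once per program does not help the first module, which is the argument for having the handler return an object rather than source text.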

Samuele Pedroni wrote:
These are implementation details, e.g. if Python would provide a way to register new modifiers, these would only start working after having been registered.

Let's say that a user wants 123I to map to mx.Number.Integer(123), then he'd have to make sure that mx.Number is imported in sitecustomize.py to have Python load modules containing the 123I literal using the registered object constructor for that literal modifier. Otherwise, the compiler or module loader would fail. There should not be any magic imports going on behind the scenes.

Note that the whole point of the idea is to simplify using really basic types. Anything more complicated than a single character modifier would fail to meet this requirement.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/

Given all the discussion, this will need a PEP first. I'd suggest Marc-Andre and Alex as co-authors, but that's up to you. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, Sep 27, 2002 at 12:03:17PM +0100, Gareth McCaughan wrote:
Of course, if you have no shame, each of these but $/ can be written with today's syntax in no more characters, placing the type identifier first and then an arbitrary, existing operator second:

    r+"123/234"

This, in turn, saves only one character over

    r("123/234")

Here's an example I wrote for work:

    class Dimension: ...

    class DimensionMaker:
        def __call__(self, v): return Dimension(v)
        def __add__(self, v): return Dimension(v)

    D = DimensionMaker()

I don't know if we'll ultimately judge the D+"..." syntax justified, given that it feels yucky and saves only one character.

Note that we're also treading very close to allowing function calls without parens, if we allow an arbitrary identifier before string literals. What actually happens if you write

    trailer: test | '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME

in the grammar and change the compiler accordingly? I guess the problem becomes that '(' could be the beginning of a testlist from inside atom, but if you could arrange for '(' here to always start an arglist instead, or invent a new production "altpower"

    trailer: altpower | '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
    altpower: altatom trailer*
    altatom: NAME | NUMBER | STRING+

Now,

    a x.y.z()[:]

becomes legal syntax (and would be a call to 'a' with one arg, x.y.z()[:]). Likewise, D"123/234" becomes legal, and is equivalent to D("123/234").

You have a problem with anything now recognized as a prefix of a string, so r"123/234" can't work as $r"123/234" is proposed to work. Of course, you could make R 123/234 work, since that'd be (R 123)/234 which would be R(123)/234.

Personally, I think all of this is pretty ugly.

Jeff
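The operator trick above already runs in current Python. The sketch below fills in the elided Dimension body with a hypothetical one (a single text attribute plus equality and repr) purely so that the two spellings can be compared; the real class from the message is not shown in the thread.

```python
class Dimension:
    """Hypothetical value type; the original message elides its body."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Dimension) and self.text == other.text

    def __repr__(self):
        return "Dimension(%r)" % self.text


class DimensionMaker:
    def __call__(self, v):
        # D("3in"): the ordinary constructor-style spelling.
        return Dimension(v)

    def __add__(self, v):
        # D + "3in": abuses the binary operator as a literal prefix.
        return Dimension(v)


D = DimensionMaker()
with_operator = D + "3in"
with_call = D("3in")
print(with_operator, with_call)  # prints Dimension('3in') Dimension('3in')
```

Both spellings build the same object; the operator form only drops the closing parenthesis.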

Gareth McCaughan <gmccaughan@synaptics-uk.com>:
This strikes me as ugly. There doesn't seem to be much, if any, syntactical advantage over using a constructor:

    Rat("123/234")
    Regex("foo.*bar")
    Date("2002-09-27 11:38")
    Port("www.google.com:80")

These look cleaner and easier to read to me.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+
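For comparison, the four constructor calls above can be written with the standard library alone. The names Rat, Regex, Date and Port come from the message; their bodies here are assumptions, since the thread never defines them.

```python
import re
from datetime import datetime
from fractions import Fraction


def Rat(text):
    # Fraction accepts "numerator/denominator" strings directly.
    return Fraction(text)


def Regex(text):
    return re.compile(text)


def Date(text):
    # Assumed format, matching the example string in the thread.
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


def Port(text):
    # Split on the last colon into a (host, port) pair.
    host, _, port = text.rpartition(":")
    return host, int(port)


print(Rat("123/234"))             # prints 41/78
print(Port("www.google.com:80"))  # prints ('www.google.com', 80)
```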

greg wrote:
isn't the whole idea that with a special syntax, you can do some of the processing when compiling the script? it's pretty pointless to invent more ways to call functions with string literals as arguments... btw, the following note is slightly related to this topic, and has been generating some buzz lately (at least in my mailbox): http://effbot.org/zone/idea-xml-literal.htm </F>

"Fredrik Lundh" <fredrik@pythonware.com> writes:
That can't be the idea: Marshalling would store the string form, so any compilation done until marshalling must be undone.

Perhaps the idea is that these things are interpreted once before byte code interpretation starts (i.e. after loading a .pyc file). In that case, a number of interesting questions arise:

- in what order, precisely, are those things evaluated? Probably in textual order, but this is not that easy, since the marshalling procedure might make such a requirement unimplementable.
- are duplicate occurrences eliminated? If so, how does one determine duplicates?

In any case, I think users will be surprised if $h"www.google.com:80" causes a dial-up connection to be set up as soon as a module is imported.

Regards, Martin
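One possible answer to the two questions above is to evaluate the literals once at load time, in textual order, and to treat two occurrences as duplicates only when prefix and spelling are identical. The sketch below assumes exactly that policy; evaluate_literals is a hypothetical helper, not something proposed in the thread.

```python
from fractions import Fraction


def evaluate_literals(occurrences, handlers):
    """Evaluate (char, text) pairs in textual order, once per distinct pair."""
    cache = {}
    for char, text in occurrences:
        key = (char, text)
        if key not in cache:
            cache[key] = handlers[char](char, text)
    return cache


calls = []  # records the order in which the handler actually runs


def rational(char, text):
    calls.append(text)
    numerator, denominator = map(int, text.split('/'))
    return Fraction(numerator, denominator)


occurrences = [('r', "1/2"), ('r', "3/4"), ('r', "1/2"), ('r', "2/4")]
table = evaluate_literals(occurrences, {'r': rational})
print(calls)  # prints ['1/2', '3/4', '2/4']
```

The repeated "1/2" is evaluated once, but "2/4" is not recognised as the same value, which shows why "how does one determine duplicates?" has no obvious answer.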

[effbot]
Not necessarily. Domain-specific notations are useful with or without compile-time processing, and sometimes the added noise of the function call syntax + string literals can get in the way of readability. (Hey, binary operators are [mostly] just another syntax for calling functions, and around here we all agree that they're a good thing. :-) That said, I'm not very enamored of the $x"foo" notation -- too much line noise. MAL's original minimalistic proposal (123x, or perhaps also 123.456x, and maybe even 1.23e-456x) seems cleaner in cases where it's applicable. I don't expect Python will ever grow date/time or (heaven forbid) IP address literals, and we already have r"regex" literals.
That looks interesting in a futuristic kind of way. I'm curious why you decided not to return fixed-type tuples of the form (tag, attrs, content) -- that seems easier to deal with than having to deal with both (tag, content) and (tag, attrs, content). Tuples used as records ought to have a fixed lay-out. Parsing this would be tricky -- the tokenizer would have to know in what state the parser is in order to tell when to switch to XML if it sees a '<'. And if you want to use a standard XML parser you'd have to be careful to stop reading after the final '>'. And what can this do that you can't do by putting it in a string literal and feeding it to a convenience function? --Guido van Rossum (home page: http://www.python.org/~guido/)
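The closing question, what a literal offers over a string fed to a convenience function, can be made concrete. The sketch below is such a function built on xml.etree.ElementTree; xml_literal is a hypothetical name, and the fixed (tag, attrs, content) layout follows the suggestion in this message rather than the format described on the effbot page.

```python
import xml.etree.ElementTree as ET


def xml_literal(text):
    """Parse an XML string into fixed-layout (tag, attrs, content) tuples."""

    def convert(element):
        content = []
        if element.text and element.text.strip():
            content.append(element.text.strip())
        for child in element:
            content.append(convert(child))
            # Text following a child element belongs to the parent's content.
            if child.tail and child.tail.strip():
                content.append(child.tail.strip())
        return (element.tag, dict(element.attrib), content)

    return convert(ET.fromstring(text))


tree = xml_literal('<ul class="menu"><li>one</li><li>two</li></ul>')
print(tree)
# prints ('ul', {'class': 'menu'}, [('li', {}, ['one']), ('li', {}, ['two'])])
```

Every node has the same three fields, so code walking the result never has to check the tuple length.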

On Mon, Sep 30, 2002 at 11:01:24AM +0200, Fredrik Lundh wrote:
This is a little like what I implemented for 'pyhtml'. It was intended to be an extension to the Quixote templating system, so it used the idea that an HTML tag embedded in the code should write itself directly to the output, like the result of expression statements already does in templates.

An excerpt from the README:

The following code:

    <UL>
        for i in range(10):
            <LI>
                i

would output something like

    <UL><LI>0</LI><LI>1</LI>....<LI>9</LI></UL>

As you can see, I let <TAG> start a block, and let blocks end according to Python's normal indentation rules. The productions added to the grammar were:

    compound_stmt: ... | tag_stmt
    tag_stmt: '<' NAME [tag_args] '>' suite
    tag_args: NAME '=' expr (',' NAME '=' expr)* [',']

so that <DIV CLASS="blue"> "this might be blue" would also work.

I thought it was rather cute to reverse the normal practice of finding a way to shoehorn Python syntax into the midst of an HTML document, but never wrote anything serious using pyhtml. The remains of the project can be seen at http://unpythonic.net/~jepler/falcon/pyhtml/

Jeff

Fredrik Lundh <fredrik@pythonware.com>:
> isn't the whole idea that with a special syntax, you can do some of the processing when compiling the script?
I suppose the literal object could be precomputed when compiling -- but how would you marshal it when saving the bytecode?

Greg Ewing
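The marshalling obstacle is visible with the standard marshal module, which serialises the constants stored in .pyc files. In this sketch a plain string round-trips, while an instance of an ordinary class (Fraction, standing in for a user-defined literal type) is rejected.

```python
import marshal
from fractions import Fraction

# The string form of the literal survives a marshal round trip unchanged.
constant = marshal.loads(marshal.dumps("1/2"))

# A precomputed instance of an arbitrary class cannot be marshalled.
marshal_error = None
try:
    marshal.dumps(Fraction(1, 2))
except ValueError as exc:
    marshal_error = exc

print(constant)                 # prints 1/2
print(type(marshal_error).__name__)  # prints ValueError
```

Any scheme that precomputes the object at compile time therefore has to store something else in the bytecode file, such as the original string, and rebuild the object when the module is loaded.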

participants (9)
- Alex Martelli
- Fredrik Lundh
- Gareth McCaughan
- Greg Ewing
- Guido van Rossum
- Jeff Epler
- M.-A. Lemburg
- martin@v.loewis.de
- Samuele Pedroni