[Python-ideas] User-defined literals

Andrew Barnert abarnert at yahoo.com
Wed Jun 3 18:55:21 CEST 2015

On Jun 2, 2015, at 20:12, Chris Angelico <rosuav at gmail.com> wrote:
>> On Wed, Jun 3, 2015 at 11:56 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>>> On Jun 2, 2015, at 18:05, Chris Angelico <rosuav at gmail.com> wrote:
>>> Once the code's finished being compiled, there's no
>>> record of what type of string literal was used (raw, triple-quoted,
>>> etc), only the type of string object (bytes/unicode). Custom literals
>>> could be the same
>> But how? Without magic (like a registry or something similarly not locally visible in the source), how does the compiler know about user-defined literals at compile time? Python (unlike C++) doesn't have an extensible notion of "compile-time computation" to hook into here.
> Well, an additional parameter to compile() would do it.

I don't understand what you mean. Sure, you can pass the magic registry a separate argument instead of leaving it in the local/global environment, but that doesn't really change anything.

> I've no idea
> how hard it is to write an import hook, but my notion was that you
> could do it that way and alter the behaviour of the compilation
> process.

It's not _that_ hard to write an import hook. But what are you going to do in that hook? 

If you're trying to change the syntax of Python by adding a new literal suffix, you have to rewrite the parser. (My hack gets around that by tokenizing, modifying the token stream, untokenizing, and compiling. But you don't want to do that in real life.)
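For anyone curious, the token-stream version of that hack is short enough to sketch. One caveat: recent CPython tokenizers reject an adjacent suffix like `2.3d` outright, so this toy accepts the suffix as a separate NAME token (`2.3 d`); `literal_d` and the `SUFFIXES` set are made-up names for illustration:

```python
import io
import tokenize

SUFFIXES = {'d'}  # hypothetical literal suffixes we recognize

def translate_literals(src):
    """Rewrite NUMBER-plus-suffix pairs like `2.3 d` into calls
    such as literal_d('2.3'). A toy sketch, not production code."""
    out = []
    toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
    i = 0
    while i < len(toks):
        tok = toks[i]
        nxt = toks[i + 1] if i + 1 < len(toks) else None
        if (tok.type == tokenize.NUMBER and nxt is not None
                and nxt.type == tokenize.NAME and nxt.string in SUFFIXES):
            # 2.3 d  ->  literal_d('2.3')
            out.extend([(tokenize.NAME, 'literal_' + nxt.string),
                        (tokenize.OP, '('),
                        (tokenize.STRING, repr(tok.string)),
                        (tokenize.OP, ')')])
            i += 2
        else:
            out.append((tok.type, tok.string))
            i += 1
    return tokenize.untokenize(out)
```

The untokenized output has slightly odd spacing, but it compiles and runs like the original with each suffixed number replaced by a call.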

So I assume your idea means something like: first we parse 2.3d into something like a new UserLiteral AST node, then if no hook translates that into something else before the AST is compiled, it's a SyntaxError?

But that still means:

 * If you want to use a user-defined literal, you can't just import it; some other module has to first import that literal's import hook and only then import your module.

 * Your .pyc file won't get updated when that other module changes which hooks are in place at the time your module gets imported.

 * That's a significant amount of boilerplate for each module that wants to offer a new literal.

 * While it isn't actually that hard, it is something most module developers have no idea how to write. (A HOWTO could maybe help here....)

 * Every import has to be hooked and transformed once for each literal you want to be available.
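In the mechanical sense it really isn't that hard. A toy version of the compile step, using a `SourceFileLoader` subclass whose `source_to_code` rewrites the source before compiling; a real hook would do a token- or AST-level transform, but this sketch just substitutes a marker string (`__2_3_d__`, `literal_d`, and `demo_mod` are all made-up names):

```python
import importlib.util
import os
import tempfile
from importlib.machinery import SourceFileLoader

class LiteralLoader(SourceFileLoader):
    """Toy import hook: rewrite the source before compiling it."""
    def source_to_code(self, data, path, *, _optimize=-1):
        # Stand-in for a real literal transform: replace a marker
        # with the call the hook would have generated.
        src = data.decode('utf-8').replace('__2_3_d__', "literal_d('2.3')")
        return compile(src, path, 'exec', optimize=_optimize)

# Demo: write a module that uses the fake literal marker, then import it.
with tempfile.TemporaryDirectory() as tmp:
    mod_path = os.path.join(tmp, 'demo_mod.py')
    with open(mod_path, 'w') as f:
        f.write("from decimal import Decimal as literal_d\n"
                "x = __2_3_d__\n")
    spec = importlib.util.spec_from_file_location(
        'demo_mod', mod_path, loader=LiteralLoader('demo_mod', mod_path))
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
```

After `exec_module`, `mod.x` is `Decimal('2.3')`; but notice the demo module still had to bind `literal_d` itself, which is exactly the scoping question discussed below.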

Meanwhile, what exactly could the hook _do_ at compile time? It could generate the expression `Decimal('1.2')`, but that's no more "literal" than `literal_d('1.2')`, and now it means your script has to import `Decimal` into its scope instead. I suppose your import hook could push that import into the top of the script, but that seems even more magical. Or maybe you could generate an actual Decimal object, pickle it, compile in the expression `pickle.loads(b'cdecimal\nDecimal\np0\n(V1.2\np1\ntp2\nRp3\n.')`, and push in a pickle import, but that doesn't really solve anything.
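The pickle variant does at least round-trip, for what it's worth. A minimal sketch, building the payload at "compile time" rather than hard-coding protocol-0 bytes:

```python
import pickle
from decimal import Decimal

# "Compile time": reduce the object to an embeddable bytes constant.
payload = pickle.dumps(Decimal('1.2'))

# "Run time": the compiled-in expression would be pickle.loads(<constant>).
value = pickle.loads(payload)
assert value == Decimal('1.2')
```

But as the paragraph above says, this just trades a visible call for a hidden one, plus a `pickle` import.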

Really, trying to force something into a "compile-time computation" in a language that doesn't have a full compile-time sub-language is a losing proposition. C++03 had a sort of accidental minimal compile-time sub-language based on template expansion and required constant folding for integer and pointer arithmetic, and that really wasn't sufficient, which is why C++11 and D both added ways to use most of the language explicitly at compile time (and C++11 still didn't get it right, which is why C++14 had to redo it).

In Python, it's perfectly fine that -2 and 1+2j and (1, 2) are all compiled into expressions, so why isn't it fine that 1.2d is compiled into an expression? And, once you accept that, what's wrong with the expression being `literal_d('1.2')` instead of `Decimal('1.2')`?
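Concretely, the translation function is nothing more than this (a sketch; `literal_d` is the hypothetical name the compiler would emit):

```python
from decimal import Decimal

def literal_d(s):
    # Runtime half of the hypothetical 1.2d literal: the compiler
    # emits literal_d('1.2'), and this function builds the object.
    return Decimal(s)

# What `0.2d * 3` would compile to:
profit = literal_d('0.2') * 3
```

The interesting design question is only where that name gets looked up, which is the subject of the rest of the thread.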

> But I haven't put a lot of thought into implementation, nor
> do I know enough of the internals to know what's plausible and what
> isn't.
>> And why do you actually care that it happens at compile time? If it's for optimization, that may be premature and irrelevant. (Certainly 1.2d isn't going to be any _worse_ than Decimal('1.2'), it just may not be better.) If it's because you want to reflect on code objects or something, that's not normal end-user code. Why should a normal user ever even know, much less care, whether 1.2d is stored as a constant or an expression in memory or in a .pyc file?
> It's to do with expectations. A literal should simply be itself,
> nothing else. When you have a string literal in your code, nothing can
> change what string that represents; at compilation time, it turns into
> a string object, and there it remains. Shadowing the name 'str' won't
> affect it. But if something that looks like a literal ends up being a
> function call, it could get extremely confusing - name lookups
> happening at run-time when the name doesn't occur in the code. Imagine
> the traceback:
> def calc_profit(hex):
>    decimal = int(hex, 16)
>    return 0.2d * decimal
>>>> calc_profit("1E2A")
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "<stdin>", line 3, in calc_profit
> AttributeError: 'int' object has no attribute 'Decimal'

But that _can't_ happen with my design: the `0.2d` is compiled to `literal_d('0.2')`, and the call to `decimal.Decimal` happens inside `literal_d`'s own defining scope, so nothing you do in your function can interfere with it.

Sure, you can still redefine `literal_d`, but (a) why would you, and (b) even if you do, the problem will be a lot more obvious (especially since you had to explicitly `from decimalliterals import literal_d` at the top of the script, while you didn't have to even mention `decimal` or `Decimal` anywhere).

But your design, or any design that does the translation at compile time, _would_ have this problem. If you compile `0.2d` directly into `decimal.Decimal('0.2')`, then it's `decimal` that has to be in scope.
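To make the scoping point concrete, here is Chris's example under the `literal_d` design (a sketch; `literal_d` is the hypothetical translation function):

```python
from decimal import Decimal

def literal_d(s):
    return Decimal(s)

def calc_profit(hex):
    decimal = int(hex, 16)             # shadows the *module* name locally
    return literal_d('0.2') * decimal  # what `0.2d * decimal` compiles to

print(calc_profit("1E2A"))  # prints 1544.4 -- no AttributeError
```

The lookup at runtime is `literal_d`, which nothing in `calc_profit` shadows; the local name `decimal` is irrelevant.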

Also, notice that my design leaves the door open for later coming up with a special bytecode to look up translation functions following different rules (a registry, an explicit global lookup that ignores local shadowing, etc.); translating into a normal constructor expression doesn't.
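A registry along those lines might look like the following; this is purely hypothetical (no such bytecode or registry exists), with `register_literal` and `_LITERAL_SUFFIXES` as made-up names:

```python
from decimal import Decimal

# Hypothetical suffix registry that a dedicated bytecode could consult
# directly, bypassing local and global name lookup entirely.
_LITERAL_SUFFIXES = {}

def register_literal(suffix, func):
    _LITERAL_SUFFIXES[suffix] = func

register_literal('d', Decimal)

# What the bytecode for `1.2d` would do under this scheme:
value = _LITERAL_SUFFIXES['d']('1.2')
```

The point is that `literal_d('1.2')` is forward-compatible with such a scheme, while compiling straight to `decimal.Decimal('1.2')` is not.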

> Uhh... what? Sure, I shadowed the module name there, but I'm not
> *using* the decimal module! I'm just using a decimal literal! It's no
> problem to shadow the built-in function 'hex' there, because I'm not
> using the built-in function!
> Whatever name you use, there's the possibility that it'll have been
> changed at run-time, and that will cause no end of confusion. A
> literal shouldn't cause surprise function calls and name lookups.
>>> - come to think of it, it might be nice to have
>>> pathlib.Path literals, represented as p"/home/rosuav" or something. In
>>> any case, they'd be evaluated using only compile-time information, and
>>> would then be saved as constants.
>>> That implies that only immutables should have literal syntaxes. I'm
>>> not sure whether that's significant or not.
>> But pathlib.Path isn't immutable.
> Huh, it isn't? That's a pity. In that case, I guess you can't have a
> path literal.

I don't understand why you think this is important.

Literal values, compile-time-computable/accessible values, and run-time-constant values are certainly not unrelated, but they're not the same thing. Other languages don't try to force them to be the same. In C++, for example, a literal has to evaluate into a compile-time-computable expression that only uses constant compile-time-accessible values, but the resulting value doesn't have to be constant at runtime. In fact, it's quite common for it not to be.

> In any case, I'm sure there'll be other string-like
> things that people can come up with literal syntaxes for.
