Custom string prefixes

In Python, strings are allowed to have a number of special prefixes: b'', r'', u'', f'' + their combinations. The proposal is to allow arbitrary (or letter-only) user-defined prefixes as well. Essentially, a string prefix would serve as a decorator for a string, allowing the user to impose special semantics of their choosing. There are quite a few situations where this can be used:

- Fraction literals: `frac'123/4567'`
- Decimals: `dec'5.34'`
- Date/time constants: `t'2019-08-26'`
- SQL expressions: `sql'SELECT * FROM tbl WHERE a=?'.bind(a=...)`
- Regular expressions: `rx'[a-zA-Z]+'`
- Version strings: `v'1.13.0a'`
- etc.

This proposal has already been discussed before, in 2013: https://mail.python.org/archives/list/python-ideas@python.org/thread/M3OLUUR... The opinions were divided on whether this is a useful addition. The opponents mainly argued that as this only "saves a couple of keystrokes", there is no need to overcomplicate the language.

It seems to me that now, 6 years later, that argument can be dismissed by the fact that we did, in fact, add the new prefix "f" to the language. Note how the "format strings" would fall squarely within this framework had they not been added by now. In addition, I believe that "saving a few keystrokes" is a worthy goal if it adds considerable clarity to the expression. Readability counts. Compare:

    v"1.13.0a"
    v("1.13.0a")

To me, the former expression is far easier to read. Parentheses, especially as they become deeply nested, are not easy on the eyes. But, even more importantly, the first expression much better conveys the *intent* of a version string. It has a feeling of an immutable object. In the second case the string is passed to the constructor, but the string has no meaning of its own. As such, the second expression feels artificial. Consider this: if the feature already existed, how *would* you prefer to write your code?

The prefixes would also help when writing functions that accept different types of argument. For example:

    collection.select("abc")        # find items with name 'abc'
    collection.select(rx"[abc]+")   # find items that match regular expression

I'm not discussing possible implementation of this feature just yet; we can get to that point later, once there is a general understanding that this is worth considering.
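[To make the select example concrete, here is roughly what such a type-based dispatch could look like with today's tools. This is only a sketch: select here is a made-up stand-in, not a real API, and a compiled re.Pattern (exposed under that name in recent Pythons) stands in for the proposed rx-string.]

    import re

    def select(items, query):
        # Dispatch on the type of the query: a compiled regular
        # expression behaves differently from a plain name string.
        if isinstance(query, re.Pattern):
            return [item for item in items if query.search(item)]
        return [item for item in items if item == query]

    names = ["abc", "abd", "xyz"]
    select(names, "abc")                  # ['abc']
    select(names, re.compile("[abc]+"))   # ['abc', 'abd']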

On Aug 26, 2019, at 16:03, stpasha@gmail.com wrote:
I don’t think you can fairly discuss this idea without getting at least a _little_ bit into the implementation details. How does your code specify a new prefix? How does the tokenizer know which prefixes are active? What code does the compiler emit for a prefixed string? The answers to those questions will determine which potential prefixes are useful. In particular, you mention that f-strings “would fall squarely within this framework”, but it’s actually pretty hard to imagine an implementation that would have actually allowed for f-strings. They essentially need to recursively call the compiler on the elements inside braces, and then inline the resulting expressions into the containing scope.
Neither. I’d prefer this:

    2.3D   # decimal.Decimal('2.3')
    1/3F   # 1/fractions.Fraction('3')

After all, why would I want to put the number in a string when it’s not a string, but a number? This looks a lot like C’s `2.3f` that gives me 2.3 as a float rather than a double, and it works like it too, so there’s no surprise. And C++ already proves that such a thing can be widely usable; it’s been part of that language for three versions, since 2011.

Also this:

    p'C:\'

That can’t be handled by just using a “native Path” prefix together with the existing raw prefix, because even in raw string literals you can’t end with a backslash. And this is another place where talking about implementation matters. At first glance it might seem like arbitrary-literal affixes would be a lot more difficult than string-literal-only affixes, but in fact, as soon as you try to implement it, you realize that you get the exact same set of issues, no more. See https://github.com/abarnert/userliteralhack for a proof of concept I wrote back in 2015. (Not that we’d want to actually implement them the way I did, just demonstrating that it can be done, and doesn’t cause ambiguity.) I’ve got a couple of older PoCs up there as well if you want to play around more, including one that only allows string literals (so you can see that it’s actually no easier, and solves no ambiguity problems). I can’t remember if I did one that does prefixes instead of suffixes, but I don’t _think_ that raises any new issues, except for the one about interacting with the existing prefixes.

And it might seem like having some affixes get the totally raw token, others get a cooked string is too complicated, but C++ actually lives with a 3-way distinction between raw token, cooked string, and fully parsed value. (Why would you ever want the last one? So your units-and-quantities library can define a _km suffix so 2_km is a km<int>, 2.3_km is a km<double>, 2.3f_km is a km<float>, and maybe even 2.3dec_km is a km<Decimal>.) I’m not sure we need this last distinction, but the first one might be worth copying, so that Path and other literals can work, but things like version can interact nicely with plain string literals, and r, and b if that’s appropriate, and most of all f, by just accepting a cooked string.

Thanks, Andrew, for your feedback. I didn't even think about string **suffixes**, but clearly they can be implemented together with the prefixes for additional flexibility. And your idea that `<string literal> <suffix>` is conceptually no different than `<numeric literal> <suffix>` is absolutely insightful. Speaking of string suffixes, flags on regular expressions immediately come to mind. For example `rx"(abc)"ig` could create a regular expression that performs global case-insensitive search.
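[For concreteness, a hypothetical sketch of what such an rx handler could do with suffix flags, using only the stdlib re module. The names rx and FLAG_MAP are invented, and note that re has no "global" flag as such; functions like findall and sub play that role.]

    import re

    FLAG_MAP = {'i': re.IGNORECASE, 'm': re.MULTILINE, 's': re.DOTALL}

    def rx(pattern, suffix=''):
        # Fold each suffix character into the corresponding re flag.
        flags = 0
        for ch in suffix:
            flags |= FLAG_MAP[ch]   # KeyError for an unknown flag
        return re.compile(pattern, flags)

    # rx"(abc)"i would then mean:
    pat = rx("(abc)", suffix="i")
    pat.search("xxABCxx")   # matches, case-insensitively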
I don’t think you can fairly discuss this idea without getting at least a _little_ bit into the implementation details.
Right. So, the first question to answer is what the compiler should do when it sees a prefixed (suffixed) string? That is, what byte-code should be emitted when the compiler sees `lambda: a"bcd"e`?

In one approach, we'd want this expression to be evaluated at compile time, similar to how f-strings work. However, how would the compiler know what prefix "a" means exactly? There has to be some kind of directive to tell the compiler that. For example, imagine the compiler sees near the top of the file

    #pragma from mymodule import a

It would then import the symbol `a`, and call `a("bcd", suffix="e")`. This would return an AST tree that will be plugged in place of the original string. This solution allows maximum efficiency, but seems inflexible and deeply invasive.

Another approach would defer the construction of objects to run time. Though not as efficient, it would allow loading prefixes at run-time. In this case `a"bcd"e` can be interpreted by the compiler as if it was `a("bcd", suffix="e")`, where symbol `a` is to be looked up in the local/global scope. One thing I would rather want to avoid, though, is the pollution of the variable namespace. For example, I'd like to be able to use variable `t = ...` without worrying about string prefix `t"..."`. For this approach to work, we'd create a new code op, so that `a"bcd"e` would become

    0 LOAD_CONST       1 ('a', 'bcd', 'e')
    2 STR_RESOLVE_TAG  0

where `STR_RESOLVE_TAG` would effectively call the `__resolve_tag__()` special method. The method would search for `a` in the registry of known string tags, and then pass the tuple to the corresponding constructor. There will, of course, be a method to register new tags. Something like

    str.__register_tag__('a', MyAObject)

As for suffix-only literals, we can treat them as if they begin with an underscore. Thus, `1/3f` would be equivalent to `1/_f(3)`.
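[To illustrate the run-time variant, here is a minimal emulation of the proposed registry in plain Python. All names here are invented for illustration; nothing like this exists today.]

    from fractions import Fraction

    _string_tags = {}

    def register_tag(name, handler):
        # Stand-in for the proposed str.__register_tag__()
        _string_tags[name] = handler

    def resolve_tag(prefix, body, suffix=''):
        # Stand-in for what the proposed STR_RESOLVE_TAG op would do
        return _string_tags[prefix](body, suffix=suffix)

    register_tag('frac', lambda s, suffix='': Fraction(s))

    # frac'123/4567' would then evaluate to:
    resolve_tag('frac', '123/4567')   # Fraction(123, 4567)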

What about _instead of_ rather than _together with_? Half of Stephen’s objections are related to the ambiguity (to a human, even if not to the parser) of user prefixes in the (potential) presence of the builtin prefixes. None of those even arise with suffixes. Anyway, maybe you already have good answers for all of those objections, but if not… Also, there’s at least one mainstream language (C++) that allows user suffixes and has literal syntax otherwise somewhat like Python’s, and the proposals for other languages like Rust generally seem to be trying to do “like C++ but minus all the usual C++ over-complexity”. Are there actual examples of languages with user prefixes? The only different designs I know of rely on the static type of the evaluation context. (For example, in Swift, you can just statically type `23 : km` or `"abc]*" : regex`, or even just pass the literal to a function that’s declared or inferred to take a regex if that happens to be readable in your use case, so there’s no need for a suffix syntax.) Which is neat, but obviously not applicable to Python.
And your idea that `<string literal> <suffix>` is conceptually no different than `<numeric literal> <suffix>` is absolutely insightful.
Well, back in 2015 I probably just stole the idea from C++. :) Another question that this raises, which I just remembered: the word “literal” has three overlapping but distinct meanings in Python. Which one do we actually mean here? In particular, are container displays “literals”? For that matter, is -2 even a literal? Also, from what I remember, either in 2013 or in 2015, the discussion got side-tracked over people not liking the word “literal” to mean “something that’s actually the result of a runtime function call”. That may be less of a problem after f-strings (which are called literals in the PEP; not sure about the language reference), but last time around, bringing up the fact that “-2” is actually a function call didn’t sway anyone. So, maybe I shouldn’t be using the word “literal” this time, and I really hope it doesn’t ruin your proposal…
That’s an interesting idea. And that’s something you can’t do with a single-affix design; you need prefixes and suffixes, unless you have some kind of separator for chaining, or only allow single characters.
My hack works basically like this. The compiler just converts it to a function call, which is looked up normally. I think that’s the right tack here. IIRC, my hack translates a D suffix into a call to something like _user_literal_D, which solves the problem with accidental pollution of the namespace. But this does mean that any code that wants to use the D suffix has to `from decimal_literals import *`, or `2.3D` raises a NameError about nothing named _user_literal_D. (Either that, or someone has to inject it into builtins…) I’m not sure whether that’s user-friendly enough. Anyway, I think your registry idea makes more sense. Then `2.3D` effectively just means `__user_literals__['D']('2.3')`, and there’s no namespace pollution at all.
Do we even need that? It’s true that most things in Python translate reasonably directly to bytecodes, but in this case it might be easier to just compile to existing bytecodes to look up and call the function.
There will, of course, be a method to register new tags. Something like
    str.__register_tag__('a', MyAObject)
If the params are (handler, name=None), and None means to use the __name__ of the handler as the tag, then you can use it as a decorator:

    @__register_tag__
    def D(decimal_string):
        return decimal.Decimal(decimal_string)

Although this may not be the best example, because it might actually be clearer (as well as more efficient) to just register the constructor:

    __register_tag__(decimal.Decimal, 'D')

… but I suspect many examples won’t be just a matter of calling a constructor on the string.
Does that mean you can’t actually register a prefix named `_f`? Or that, if you do, it also registers a suffix named `f`?

Also, I think for most non-single-letter suffixes you’d actually want an underscore at the start of the suffix. See C++ for lots of examples, but for a quick illustration, compare these:

    c = 2.99792458e8mps
    c = 2.99792458e8_mps
    c = 299_792_458mps
    c = 299_792_458_mps

The _mps suffix looks a lot better than the mps suffix, doesn’t it? But would you want the function to have to be named __mps with two underscores?

It may be worth coming up with the most compelling examples and then working out what feature set would support as many as possible, rather than trying to work out the ultimate feature set first and then see what we can do with it. It’s probably worth stealing liberally from the C++ discussion (and any other languages that have similar features) as well as the 2013 Python discussion, but off the top of my head:

* Decimal, Fraction, np.float32, mpz, …
* Path objects
* Windows native Path objects, possibly with “really raw” processing to allow trailing backslashes
* regex, possibly with flags, possibly with “really raw” backslashes
* “Really raw” strings in general.
* JSON (register the stdlib or simplejson or ujson), XML (register ETree or lxml or bs4 or whatever you want), HTML, etc.
* unit suffixes for quantities (see the sketch below)
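[A hypothetical sketch of that last item, assuming the registry design discussed above. Quantity and _mps are invented names; the handler receives the raw token text.]

    class Quantity:
        def __init__(self, value, unit):
            self.value, self.unit = value, unit
        def __repr__(self):
            return f"{self.value} {self.unit}"

    def _mps(text):
        # float() accepts underscore separators, so the token text
        # can be passed through unchanged.
        return Quantity(float(text), "m/s")

    # c = 299_792_458_mps would then mean:
    c = _mps("299_792_458")   # 299792458.0 m/s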

27.08.19 06:38, Andrew Barnert via Python-ideas wrote:
* JSON (register the stdlib or simplejson or ujson),
What is the JSON suffix for? JSON is virtually a subset of Python, except that it uses true, false and null instead of True, False and None. If you set these three variables you can embed JSON syntax in pure Python.
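[Concretely, after binding those three names, a JSON document can be pasted directly into Python source, modulo the edge cases discussed below:]

    true, false, null = True, False, None

    config = {
        "name": "example",
        "enabled": true,
        "parent": null,
        "retries": 3
    }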

On Aug 26, 2019, at 23:43, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think you’d mainly want it in combination with percent-, html-, or uu-equals-decoding, which makes it a potential stress test of the “multiple affixes” or “affixes with modifiers” idea. Which I think is important, because I like what the OP came up with for that idea, so I want to push it beyond just the “regex with flags” example to see if it breaks. Maybe URL, which often has the same html and percent encoding issues, would be a better example? I personally don’t need to decode URLs that often in Python (unlike in, say, ObjC, where there’s a smart URL class that you use in place of strings all over the place), but maybe others do?
JSON is virtually a subset of Python, except that it uses true, false and null instead of True, False and None.
Is it _virtually_ a subset, or literally so, modulo those three values? I don’t know off the top of my head. Look at all the trouble caused by Crockford just assuming that the syntax he’d defined was a strict subset of JS when actually it isn’t quite. Actually, now that I think of it, I do know. Python has allow_nan on by default, so you’d need to also `from math import nan as NaN` and `from math import inf as Infinity`. But is that it? I’m not sure. And of course if you’ve done this:

    jdec = json.JSONDecoder(parse_float=Decimal)
    __register_prefix__(jdec.decode, 'j')

… then even j'1.1' and 1.1 are no longer the same values. Not to mention what you get if you registered Pandas’s JSON reader instead of the stdlib’s.

On Mon, Aug 26, 2019 at 11:03:38PM -0000, stpasha@gmail.com wrote:
Current string prefixes are allowed in combinations. Does the same apply to your custom prefixes? If yes, then they are ambiguous: how could the reader tell whether the string prefix frac'...' is a f- r- a- c-string combination, a fra- c-string combination, a fr- ac-string combination, or a f- rac- string combination? If no, then it will confuse and frustrate users who wonder why they can combine built-in prefixes like fr'...' but not their own prefixes.

What kind of object is a frac-string? You might think it is obvious that it is a "frac" (Fraction? Something else?) but how about a czt-string? As a reader, at least I know that czt('...') is a function call that could return anything at all. That is standard across hundreds of programming languages. But as a string prefix, it looks like a kind of string, but could be anything at all. Imagine trying to reason about Python syntax:

1. u'...' is a unicode string, evaluating to a str.
2. r'...' is a raw string, evaluating to a str.
3. f'...' is a f-string, evaluating to a str.
4. b'...' is a byte-string, evaluating to a bytes object, which is not a str object but is still conceptually a kind of string.
5. Therefore z'...' is what kind of string, evaluating to what kind of object?

Things that look similar should be similar. This string prefix idea means that things that look similar can be radically different. It looks like a string, but may not be anything like a string. The same applies to function call syntax, of course, but as I mentioned above, function call syntax is standard across hundreds of languages and readers don't expect that the result of an arbitrary function call is necessarily the same as its first argument(s). We don't expect that foo('abcde') will return a string, even if we're a little unclear about what foo() actually does.

u- (unicode) strings, r- (raw) strings, and even b- (byte) strings are all kinds of *string*. We know just by looking at them that they evaluate to a str or bytes object. Even f-strings, which are syntax for executable code, are at least guaranteed to evaluate to a str object. But these arbitrary string prefixes could return anything.
Indeed. czt'...' saves only two characters from czt('...').
I don't see how that follows. The existence of one new prefix adds *this* much new complexity: [holds forefinger and thumb about a millimeter apart] for significant gains. Trying to write your own f-string equivalent function would be quite difficult, but being in the language, not only is it faster and more efficient than a function call, but it needs to be written only once.

But adding a new way of writing single-argument function calls with a string argument:

    czt'...' is equivalent to czt('...')

adds *this* much complexity to the language: [holds forefingers of each hand about shoulder-width apart] for rather insignificant gains, the saving of two parentheses. You still have to write the czt() function, it will have to parse the string itself, you will have no support from the compiler, and anyone needing this czt() will either have to re-invent the wheel or hope that somebody publishes it on PyPI with a suitable licence.
What's v() do? Verbose string?
Oh, you intended a version string, did you? If only you had written ``version`` instead of ``v``, I might not have guessed wrong. What were you saying about preferring readability and clarity over brevity? *semi-wink*

I'm only half joking here. Of course I could guess that '1.13.0a' looks like a version string. But I genuinely expected v-string to mean "verbose", not version, and could only guess otherwise because I know what version strings look like. In other words, I got *all* of the meaning from the string part, not the prefix. The prefix on its own, I would have guessed completely wrong. This goes against your claim that "the string has no meaning of its own". Of course it has meaning on its own. It looks like a version string, which is the only way I could predict that v'...' stands for version-string rather than verbose-string.

What if we didn't recognise the semantics of the string part?

    v'cal-{a}-%.5f-^H/7:d{b}s'

What does this v-string mean, what does it do, how do I parse the string part of it? I think that one of the weaknesses of this proposal is that you are assuming that the meanings of these prefixes are as obvious to everyone else as they are to you. They aren't.
It has a feeling of an immutable object.
How are we supposed to know that v-strings return an immutable object? Let's suppose you come across l'abc' in somebody's code base. What's an l-string? Does it still look immutable to you? What if I told you that l-string stands for "list-string" and it returns a mutable list?
If I wanted to parse a string and return a Version object, I would write it as Version('1.13.0a'). If your v-string prefix does something other than that, I cannot comment, as I have no idea what your v-string prefix would do or how it would differ from the regular string '1.13.0a'. -- Steven

Thank you, Steven, for taking the time to write such an elaborate rebuttal. If I understand the heart of your argument correctly, you're concerned that the prefixed strings may add confusion to the code. That nobody knows what `l'abc'` or `czt'xxx'` could possibly mean, while at the same time `v'1.0'` could mean many things, whereas `v'cal-{a}'` would mean nothing at all... These are all valid concerns. The string (or number) prefixes add new power to the language, and with new power comes new responsibility. While the syntax can be used to enhance readability of the code, it can also be abused to make the code more obscure. However, Python does not attempt to be an idiot-proof language. "We are all consenting adults" is one of its guiding principles. That a certain feature can potentially be misused shouldn't deter us from adding it, if the benefits are significant.

And the benefits in terms of readability can be significant. Consider the existing Python prefixes: `r'...'` is purely for readability, it adds no extra functionality; `f'...'` has neat compiler support, but even if it didn't (and most Python users don't actually realize f-strings get preprocessed by the compiler) it would still enhance readability compared to `str.format()`. It's nice to be able to write a complex number as `5 + 3j` instead of `complex(5, 3)`. And so on.
You're correct that, devoid of context, `v"smth..."` is not very meaningful. The "v" suffix could mean "version", or "verbose", or "volatile", or "vectorized", or "velociraptor", or whatever. Luckily, code almost always exists within a specific context. It solves a particular problem, works within a particular domain, and makes perfect sense for people working within that domain. This isn't much different than, say, the `np.` suffix, which means "numpy" in the domain of numerical computations, NP-completeness for some mathematicians, and "no problem" for regular users. From a practical perspective, the meaning of each particular symbol will come from the way that it was created or imported. For example, if your script says `from packaging.version import v`, then "v" is a version. If, on the other hand, it says `from zoo import velociraptor as v`, then it's an altogether different beast.
In other words, I got all of the meaning from the string part, not the prefix. The prefix on its own, I would have guessed completely wrong.
Exactly. You look at string "1.10a" and you know it must be a version string, because you're a human, you're smart. The compiler is not a human, it has no idea. To the Python interpreter it's just a PyUnicode object of length 5. It's meaningless. But when you combine this string with a prefix into a single object, it gains power. It can have methods or special behaviors. It can have a type, different from `str`, that can be inspected when passing this object to another function. Think of `v"1.10a"` as making a "typed string" (even though it may end up not being a string at all). By writing `v"1.10a"` I convey the intent for this to be a version string.
for rather insignificant gains, the saving of two parentheses.
Two bytes doesn't sound like a lot. I mean, it is quite little on the grand scale of things. However, I don't think the simple byte count is a proper measure here. There could be benefits to readability even if the byte difference was zero or negative. I believe a good way to think about this is the following: if the feature was already implemented, would people want to use it, and would it improve readability of their code? I speculate that the answer to both questions is yes, at least for some people.

As a practical example, consider the function `pandas.read_csv()`. The documentation for its `sep` parameter says "In addition, separators longer than 1 character and different from ``'\s+'`` will be interpreted as regular expressions ...". In this case they wanted the `sep` parameter to handle both simple separators and regular expression separators. However, as there is no syntax to create a "regular expression string", they ended up with this dubious heuristic based on the length of the string... Ideally, they should have said that `sep` could be either a string or a regexp object, but the barrier of writing

    from re import compile as rx
    rx('...')

is just impossibly high for a typical user. Not to mention that such code **would** actually be harder to read, because I'd be inventing my own notation for a function that is commonly known under a different name.

Another pet peeve of mine is datetime literals. Or, rather, their absence. I often see, again in pandas, how people create columns of strings ["2010-05-01", "2010-05-02", ...], and then call `parse_datetime()`. It would have been more straightforward if there was a standard syntax for denoting datetime constants, allowing us to create a column of datetime type directly.
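[For illustration, a hypothetical t prefix handler could be nothing more than a thin wrapper over the stdlib. The name t and the suffix parameter are assumptions carried over from the registry sketch earlier in the thread.]

    from datetime import date

    def t(text, suffix=''):
        # date.fromisoformat handles the YYYY-MM-DD form directly.
        return date.fromisoformat(text)

    # t'2019-08-26' would then evaluate to:
    t('2019-08-26')   # datetime.date(2019, 8, 26)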

On Tue, Aug 27, 2019 at 6:25 PM <stpasha@gmail.com> wrote:
Syntactically, the "np." prefix (not suffix fwiw) actually means "look up the np object, then locate an attribute called <whatever comes next>". That's true of every prefix you could ever get, and they're always looked up at run time; the attribute name always follows the exact same syntactic rules no matter what the prefix is. Literals, on the other hand, are part of syntax - a different string type prefix can change the way the entire file gets parsed. Will these "custom prefixes" be able to define anything syntactically? If not, why not just use a function call? And if they can, then you have created an absolute monster, where a v-string in one context can have completely different syntactic influence on what follows it than a v-string in another context. At least with attribute lookups, you can parse a file without knowing what "np" actually means, and even examine things at run-time. ChrisA

On Aug 27, 2019, at 01:42, Chris Angelico <rosuav@gmail.com> wrote:
There is a possibility in between the two extremes of “useless” and “complete monster”: the prefix accepts exactly one token, but can parse that token however it wants. That’s pretty close to what C++ does, and pretty close to the way my hacky proof of concept last time around worked, and I don’t think that only works because those are suffix-only designs. (That being said, if you do allow “really raw” string literals as input to the user prefixes/suffixes to handle the path'C:\' case, then it’s possible to invent cases that would tokenize differently with and without the feature—in fact, I just did—and therefore it _might_ be possible to invent cases that parse validly but differently, in which case the monster is lurking after all. Someone might want to look more carefully at the C++ rules for that?)

On Tue, Aug 27, 2019 at 05:24:19AM -0700, Andrew Barnert via Python-ideas wrote:
How is that different from passing a string argument to a function or class constructor that can parse that token however it wants?

    x'...'
    x('...')

Unless there is some significant difference between the two, what does this proposal give us?

-- Steven

On Aug 27, 2019, at 08:52, Steven D'Aprano <steve@pearwood.info> wrote:
Before I get into this, let me ask you a question. What does the j suffix give us? You can write complex numbers without it just fine:

    c = complex
    c(1, 2)

And you can even write a j function trivially:

    def j(x): return complex(0, x)
    1 + j(2)

But would anyone ever write that when they can write it like this:

    1 + 2j

I don’t think so. What does the j suffix give us? The two extra keystrokes are trivial. The visual noise of the parens is a bigger deal. The real issue is that this matches the way we conceptually think of complex numbers, and the way we write them in other contexts. (Well, the way electrical engineers write them; most of the rest of us use i rather than j… but still, having to use j instead of i is less of an impediment to reading 1+2j than having to use function syntax like 1+i(2).)

And the exact same thing is true in 3D or CUDA code that uses a lot of float32 values. Or code that uses a lot of Decimal values. In those cases, I actually have to go through a string for implementation reasons (because otherwise Python would force me to go through a float64 and distort the values), but conceptually, there are no strings involved when I write this:

    array([f('0.2'), f('0.3'), f('0.1')])

… and it would be a lot more readable if I could write it the same way I do in other programming languages:

    array([0.2f, 0.3f, 0.1f])

Again, it’s not about saving 4 keystrokes per number, and the visual noise of the parens is an issue but not the main one (and quotes are barely any noise by comparison); it’s the fact that these numeric values look like numeric values instead of looking like strings. The fact that they look the same as the same values in other contexts like a C++ program or a GLSL shader is a pretty large added bonus. But I don’t think that’s essential to the value here. If you forced me to use prefixes instead of suffixes (I don’t think there’s any good reason for that, but who knows how the winds of bikeshedding may blow), I’d still prefer f2.3 to f('2.3'), because it still looks like a number, as it should. I know this is doable, because I’ve written an import hook that does it, plus I have a decade of experience with another popular language (C++) that has essentially the same feature.

What about the performance cost of these values not being constants? A decorator that finds np.float32 calls on constants and promotes them to constants by hacking the bytecode is pretty trivial to write, or you can load the whole array in one go from a bytes constant and put the readable version in a comment, or whatever. But anything that’s slow enough to be worth optimizing is doing a huge matmul or pushing zillions of values back and forth to the GPU or something else that swamps the setup cost, even if the setup cost involves a few dozen string parses, so it never matters. At least not for me.

For a completely different example—but one that I’ve also already given earlier in this thread, so I won’t belabor it too much:

    path'C:\'
    bs"this\ space won’t have a backslash before it, also \e[22; is an escape sequence and of course \a is still a bell because I’m using the rules from C/JS/etc."
    bs37"this\ space has a backslash before it without raising a warning or an error even in Python 3.15 because I’ve implemented the 3.7 rules"

… and so on. Some of these _could_ be done with a raw string and a (maybe slightly more complicated) function call, but at least the first one is impossible to do that way.
Unlike the numeric suffixes, this one I haven’t actually implemented a hacky version of, and I don’t know of any other languages that have an identical feature, so I can’t promise it’s feasible, but it seems like it should be.
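[As a concrete illustration of the go-through-a-string point: the distortion is easiest to demonstrate with Decimal, where routing the value through a float changes it visibly.]

    from decimal import Decimal

    Decimal('2.3')   # Decimal('2.3')
    Decimal(2.3)     # Decimal('2.29999999999999982236431605997495353221893310546875')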

On Aug 27, 2019, at 10:21, Rhodri James <rhodri@kynesim.co.uk> wrote:
You make the point yourself: this is something we already understand from dealing with complex numbers in other circumstances. That is not true of generic single-character string prefixes.
It certainly is true for 1.23f. And, while 1.23d for a decimal or 1/3F for a Fraction may not be identical to any other context, it’s a close-enough analogy that it’s immediately familiar. Although I might actually prefer 1.23dec or 1/3frac or something more explicit in those cases. (Fortunately, there’s nothing in the design stopping me from doing that.) As for string prefixes, I don’t think those should usually, or maybe even ever, be single-character. People have given examples like sql"…" (I’m still not sure exactly what that does, but it’s apparently used in other languages for something?) and regex"…" and path"…" (which are a lot more obvious). I’m not sure if they actually are useful, which is why my proposal didn’t have them; I’m waiting on the OP to give more complete examples, cite similar uses from other languages, etc. But I doubt the problem you’re talking about, that they’d all be unfamiliar cryptic one-letter things, is likely to arise.

On 29/08/2019 00:24, Andrew Barnert wrote:
I would contend that (and anyway 1.23f is redundant; 1.23 is already a float literal). But anyway I said "generic single-character string prefixes", because that's what the original proposal was. You seem to be going off on creating literal syntax for standard library types (which, for the record, I think is a good idea and deserves its own thread), but that's not what the OP seems to be going for. -- Rhodri James *-* Kynesim Ltd

On Wed, Aug 28, 2019 at 3:10 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
If your conclusion here were "and that's why Python needs a proper syntax for Decimal literals", then I would be inclined to agree with you - a Decimal literal would be lossless (as it can entirely encode whatever was in the source file), and you could then create the float32 values from those. But you haven't made the case for generic string prefixes or any sort of "arbitrary literal" that would let you import something that registers something to make your float32 literals. ChrisA

On Tuesday, August 27, 2019, 11:12:51 AM PDT, Chris Angelico <rosuav@gmail.com> wrote:
Sure I did; you just cut off the rest of the email that had other cases. And ignored most of what you quoted about the float32 case. And ignored the previous emails by both me and the OP that had other cases. Or can you explain to me how a builtin Decimal literal could solve the problem of Windows paths? Here's a few more: Numeric types that can't be losslessly converted to and from Decimal, like Fraction. Something more similar to complex (e.g., `quat = 1.0x + 0.0y + 0.1z + 1.0w`). What would Decimal literals do for me there? I think your reluctance and the OP's excitement here both come from the same source: Any feature that gives you a more convenient way to write and read something is good, because it lets you write things in a way that's consistent with your actual domain, and also bad, because it lets you write things in a way that's not readable to people who aren't steeped in your domain. Those are _always_ both true, so just arguing from first principles is pointless. The question is whether, for this specific feature, there are good uses where the benefit outweighs the cost. And I think there are. In fact, if you're already convinced that we need Decimal literals, unless you can come up with a more feasible way to add builtin Decimal literals to Python, Decimal on its own seems like a sufficient use case for the feature.

On Wed, Aug 28, 2019 at 6:03 AM Andrew Barnert <abarnert@yahoo.com> wrote:
Not sure that's a total blocker, but in any case, I'm not arguing for that - I'm just saying that everything up to that point in your argument would be better served by a Decimal literal than by any notion of "custom literals".
But they're not. You didn't even attempt to answer the comparison with complex that you quoted. The problem that `j` solves is not that there's no way to create complex values losslessly out of floats, but that there's no way to create them _readably_, in a way that's consistent with the way you read and write them in every other context. Which is exactly the problem that `f` solves. Adding a Decimal literal would not help that at all—letting me write `f(1.23d)` instead of `f('1.23')` does not let me write `1.23f`.
TBH I don't quite understand the problem. Is it only an issue with negative zero? If so, maybe you should say so, because in every other way, building a complex out of a float added to an imaginary is perfectly lossless.
Also, I think you're the one who brought up performance earlier? `%timeit np.float32('1.23')` is 671ns, while `%timeit np.float32(d)` with a pre-constructed `Decimal(1.23)` is 2.56us on my laptop, so adding a Decimal literal instead of custom literals actually encourages _slower_ code, not faster.
No, I didn't say that. I have no idea why numpy would take longer to work with a Decimal than a string, and that's the sort of thing that could easily change from one version to another. But the main argument here is about readability, not performance.
Also, as the OP has pointed out repeatedly and nobody has yet answered, if I want to write `f(1.23d)` or `f('1.23')`, I have to pollute the global namespace with a function named `f` (a very commonly-used name); if I want to write `1.23f`, I don't, since the converter gets stored in some out-of-the-way place like `__user_literals_registry__['f']` rather than `f`. That seems like a serious benefit to me.
Maybe. But far worse is that you have a very confusing situation that this registered value could be different in different programs. In contrast, f(1.23d) would have the same meaning everywhere: call a function 'f' with one parameter, the Decimal value 1.23. Allowing language syntax to vary between programs is a mess that needs a LOT more justification than anything I've seen so far.
Which said basically the same as the parts I quoted.
And ignored most of what you quoted about the float32 case.
What did I ignore?
And ignored the previous emails by both me and the OP that had other cases. Or can you explain to me how a builtin Decimal literal could solve the problem of Windows paths?
All the examples about Windows paths fall into one of two problematic boxes:

1) Proposals that allow an arbitrary prefix to redefine the entire parser - basically impossible for anything sane
2) Proposals that do not allow the prefix to redefine the parser, and are utterly useless, because the rest of the string still has to be valid.

So no, you still haven't made a case for arbitrary literals.
Here's a few more: Numeric types that can't be losslessly converted to and from Decimal, like Fraction.
If you want to push for Fraction literals as well, then sure. But that's still very very different from *arbitrary literal types*.
Something more similar to complex (e.g., `quat = 1.0x + 0.0y + 0.1z + 1.0w`). What would Decimal literals do for me there?
Quaternions are sufficiently niche that it should be possible to represent them with multiplication.

    quat = 1.0 + 0.0*i + 0.1*j + 1.0*k

With appropriate objects i, j, k, it should be possible to craft something that implements quaternion arithmetic using this syntax. Yes, it's not quite as easy as 4+3j is, but it's also far FAR rarer. (And remember, even regular complex numbers are more advanced than a lot of languages have syntactic support for.)
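[A minimal sketch of the i/j/k-objects idea. Quat and the unit names are invented, and only the operations needed for the line above are implemented, not full quaternion arithmetic.]

    class Quat:
        def __init__(self, w=0.0, x=0.0, y=0.0, z=0.0):
            self.w, self.x, self.y, self.z = w, x, y, z
        def __rmul__(self, s):    # supports 0.1 * j
            return Quat(self.w * s, self.x * s, self.y * s, self.z * s)
        def __add__(self, other): # supports Quat + Quat
            return Quat(self.w + other.w, self.x + other.x,
                        self.y + other.y, self.z + other.z)
        def __radd__(self, s):    # supports 1.0 + Quat
            return self + Quat(w=s)
        def __repr__(self):
            return f"({self.w} + {self.x}i + {self.y}j + {self.z}k)"

    i, j, k = Quat(x=1.0), Quat(y=1.0), Quat(z=1.0)
    quat = 1.0 + 0.0*i + 0.1*j + 1.0*k   # (1.0 + 0.0i + 0.1j + 1.0k)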
I think your reluctance and the OP's excitement here both come from the same source: Any feature that gives you a more convenient way to write and read something is good, because it lets you write things in a way that's consistent with your actual domain, and also bad, because it lets you write things in a way that's not readable to people who aren't steeped in your domain. Those are _always_ both true, so just arguing from first principles is pointless. The question is whether, for this specific feature, there are good uses where the benefit outweighs the cost. And I think there are.
That line of argument is valid for anything that is specifically defined by the language. Creating a way to represent matrix multiplication benefits people who do matrix multiplication. Those of us who don't work with matrix multiplication on a daily basis, however, can at least read some Python code and go "ah, a @ b means matrix multiplication". The creation of custom literals means we can't do that any more. For instance, you want this:

    x = path"C:\"

but that means that it's equally possible for me to create this:

    y = tree" \" " \ "

Now, what does that mean? Can you even parse the rest of the script without knowing what my 'tree' type does?
In fact, if you're already convinced that we need Decimal literals, unless you can come up with a more feasible way to add builtin Decimal literals to Python, Decimal on its own seems like a sufficient use case for the feature.
There are valid use cases for Decimal literals and Fraction literals, but not, IMO, for custom literals. Look at some of the worst abuses of #define in C to get an idea of what syntax customization can do to readability. ChrisA

On Aug 27, 2019, at 14:41, Chris Angelico <rosuav@gmail.com> wrote:
No, it really couldn’t. A builtin Decimal literal would arguably serve the Decimal use case better (but I’m not even sure about that one; see below), but it doesn’t serve the float32 case that you’re responding to.
Negative zero is an irrelevant side issue that Serhiy brought up. It means j is not quite perfect—and yet j is still perfectly usable despite that. Ignore negative zero.

The problem that j solves is dead simple: 1 + 2j is more readable than complex(1, 2). And it matches what you write and read in other contexts besides Python. That’s the only problem j solves. But it’s a problem worth solving, at least for code that uses a lot of complex numbers. Without it, even if you wanted to pollute the namespace with a single-letter global so you could write c(1, 2) or 1 + j(2), it _still_ wouldn’t be nearly as readable or as familiar. That’s why we have j. There is literally no other benefit, and yet it’s enough.

And the problem that f solves would be exactly the same: 1.23f is more readable than calling float32, and it matches what you read and write in other contexts besides Python (like, say, C or shader code). Even if you wanted to pollute the namespace with a single-letter global f, it still wouldn’t be as readable or as familiar. That’s why we should have f. There is literally no other benefit, but I think it’s enough benefit, for enough programs, that we should be allowed to do it. Just like j.

Unlike j, however, I don’t think it’s useful in enough programs that it should be builtin. And I think the same is probably true for Decimal. And for most of the other examples that have come up in this thread. Which is why I think we’d be better served with something akin to C++ allowing you to explicitly register affixes for your specific program, than something like C with its too-big-to-remember-but-still-not-enough-for-many-uses zoo of builtin affixes.
Sure, and the global f could also be different in different programs—or even in different modules in the same program. So what? 1.23f would always have the same meaning everywhere; it’s just that the meaning is something like __user_literals__['f']('1.23') instead of globals()['f']('1.23'). Yes, of course that is something new to be learned: if you’re looking at a program that does a lot of 3D math, or a lot of decimal math, or a lot of Windows path stuff, or whatever, people are likely to have used this feature, so you’ll need to know how to look up the f or d or whatever. But that really isn’t a huge hardship, and I think the benefits outweigh the cost.
This doesn’t really allow syntax to vary between programs. It just allows literals to be followed (or maybe preceded) by tags. The rest of the syntax is unchanged.
Not even remotely—again, unless you think that Windows paths could somehow be served by builtin decimal literals?
And ignored most of what you quoted about the float32 case.
What did I ignore?
That 1.23f is more readable, familiar, etc. in exactly the same way that 2.3j is.
3) Proposals that do not allow the prefix to redefine the parser for the entire program, but do allow it to manually parse anything the tokenizer can recognize as a single (literal) token. As I said, I haven’t tried to implement this example as I have with the other examples, so I can’t promise that it’s doable (with the current tokenizer, or with a reasonable change to it). But if it is doable, it’s neither insane nor useless. (And even if it’s not doable, that’s just two examples that affixes can’t solve—Windows paths and general “super-raw strings”. They still solve all of the other examples.)
If we really only ever needed Decimal and Fraction, then yes, I think allowing user-defined literal tags would be better than adding two hard-coded tags that most people will rarely use.
Yes, and? “Literal token” is specifically defined by the language. “Literal token with attached tag” will also be specifically defined by the language. The only thing open to customization is what that token gets compiled to. (Of course this is just one suggestion I came up with, not the only way to do things, or what the OP suggested. But it does show that there is at least one possibility besides “insane” and “useless”.) Your argument comes down to the fact that anything that could possibly be construed as affecting syntax, no matter how constrained it is, and no matter how far you have to stretch to see it as affecting syntax in the first place, and even if nobody would ever do that with it, is inherently so evil that it can’t possibly be allowed. I think that’s a silly argument. Especially in a language that already has a nicely documented feature like, say, import hooks.
Look at the plethora of suffixes C has for number and character literals. Look at how many things people still can’t do with them that they want to. Look at the way user literals work in C++. While technically you can argue that they are “syntax customization”, in practice the customization is highly constrained. Is it _impossible_ to use that feature to write code that can’t be parsed by a human reader? I don’t know if I could prove that it’s impossible. However, I do know that it’s not easy. And that none of the examples, or real-life uses, that I’ve seen have done so. Do you think Python users are incapable of the kind of restraint and taste shown by C++ users, and therefore we can’t trust Python users with a tool that might possibly (but we aren’t sure) if abused badly enough make code harder to visually parse?

On Wed, Aug 28, 2019 at 10:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
So what is the definition of "a single literal token" when you're creating a path-string? You want this to be valid:

    x = path"C:\"

For this to work, the path prefix has to redefine the way the parser finds the end of the token, does it not? Otherwise, you still have the same problems you already do - backslashes have to be escaped. That's why I say that, without being able to redefine the parser, this is completely useless, as a "path string" might as well just be a "string". Which way is it?
I don't understand. Are you saying that the prefix is not going to be able to change how backslashes are handled, or that it is? If you keep the tokenizer exactly the same and just add a token in front of it, then things like path"C:\" will be considered to be incomplete and will continue to consume source code until the next quote (or throw SyntaxError for EOL inside string literal). Or is your idea of "literal token" something other than that? If a "literal token" is simply a string literal, then how is this actually helping anything? What do you achieve?
Look at the plethora of suffixes C has for number and character literals. Look at how many things people still can’t do with them that they want to.
I don't know how many there are. The only ones I can think of are "f" for single-precision float, and the long and unsigned suffixes on integers. Python doesn't have these because very few programs need to care about whether a float is single-precision or double-precision, or how large an int is.
Look at the way user literals work in C++. While technically you can argue that they are “syntax customization”, in practice the customization is highly constrained. Is it _impossible_ to use that feature to write code that can’t be parsed by a human reader? I don’t know if I could prove that it’s impossible. However, I do know that it’s not easy. And that none of the examples, or real-life uses, that I’ve seen have done so.
I also have not yet seen any good examples of user literals in C++.
Do you think Python users are incapable of the kind of restraint and taste shown by C++ users, and therefore we can’t trust Python users with a tool that might possibly (but we aren’t sure) if abused badly enough make code harder to visually parse?
People can be trusted with powerful features that can introduce complexity. There's just not a lot of point introducing a low-value feature that adds a lot of complexity. ChrisA

I’m not sure (maybe about 60% at best), but I think last time I checked this, the tokenizer actually hits the error without munching the rest of the file. If I’m wrong, then you would need to add a “really raw string literal” builtin that any affixes that want really raw string literals could use, but that’s all you’d have to do. And I really don’t think it’s worth getting this in-depth into just one of the possible uses that I just tossed off as an aside, especially without actually sitting down and testing anything.
Off the top of my head, there are also long long integers, and long doubles, and wide and three Unicode suffixes for char. Those probably aren’t all of them. And your compiler probably has extensions for “legacy” suffixes and nonstandard types like int128 or decimal64 and so on.
Right, but the issue isn’t which ones, but how many. C doesn’t have decimals or fractions, and other things like datetime objects have been suggested in this thread, and even more in the two earlier threads. If there are too many useful kinds of constants, there are too many to make them all builtins.
But it really doesn’t add a lot of complexity. If you’re not convinced that really-raw string processing is doable, drop that. Since the OP hasn’t given a detailed version of his grammar, just take mine: a literal token immediately followed by one or more identifier characters (that couldn’t have been munched by the literal) is a user-suffix literal. This is compiled into code that looks up the suffix in a central registry and calls it with the token’s text. That’s all there is to it. Compare that to adding Decimal (and Fraction, as you said last time) literals when the types aren’t even builtin. That’s more complexity, for less benefit. So why is it better?

On Wed, Aug 28, 2019 at 2:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:
What is a "literal token", what is an "identifier character", and how does this apply to your example of having digits, a decimal point, and then a suffix? What if you want to have a string, and what if you want to have that string contain backslashes or quotes? If you want to say that this doesn't add complexity, give us some SIMPLE rules that explain this. And make absolutely sure that the rules are identical for EVERY possible custom prefix/suffix, because otherwise you're opening up the problem of custom prefixes changing the parser again.
Compare that to adding Decimal (and Fraction, as you said last time) literals when the types aren’t even builtin. That’s more complexity, for less benefit. So why is it better?
Actually no, it's a lot less complexity, because it's all baked into the language. You don't have to have the affix registry to figure out how to parse a script into AST. The definition of a "literal" is given by the tokenizer, and for instance, "-1+2j" is not a literal. How is this going to impact your registry? The distinction doesn't matter to Decimal or Fraction, because you can perform operations on them at compile time and retain the results, so "-1.23d" would syntactically be unary negation on the literal Decimal("1.23"), and -4/5f would be unary negation on the integer 4 and division between that and Fraction(5). But does that work with your proposed registry? What is a "literal token", and would it need to include these kinds of things? What if some registered types need to include them and some don't? ChrisA

On Aug 28, 2019, at 00:40, Chris Angelico <rosuav@gmail.com> wrote:
Literals and identifier characters are already defined today, so I don’t need new definitions for them. The existing tokens are already implemented in the tokenizer and in the tokenize module, which is why I was able to slap together multiple variations on a proof of concept 4 years ago in a few minutes as a token-stream-processing import hook. My import hook version is a hack, of course, but it serves as a counter to your argument that there’s no simple thing that could work, by being a dead simple thing that does work. And there’s no reason to believe a real version wouldn’t be at least as simple.
We add a `suffixedfloatnumber` production defined as `floatnumber identifier`. So, the `2.34` parses as a `floatnumber` the same as always. That `d` can’t be part of a `floatnumber`, but it can be the start of an `identifier`, and those two nodes together can make up a `suffixedfloatnumber`. No need for any new lookahead or other context. And for the concrete implementation in CPython, it should be obvious that the suffix can be pushed down into the tokenizer, at which point the parse becomes trivial.

If you’re asking how my hacky version works, you could just read the code, which is simpler than an explanation, but here goes (from memory, because I’m on my phone): To the existing tokenizer, `d` isn’t a delimiter character, so it tries to match the whole `2.34d`. That doesn’t match anything. But `2.34` does match something, etc., so ultimately it emits two tokens, `floatnumber('2.34'), error('d')`. My import hook reads the stream of tokens. When it sees a `floatnumber` followed by an `error`, it checks whether the error body could be an identifier token. If so, it replaces those two tokens in the stream with… I forget, but probably I just hand-parsed the lookup and call and emitted the tokens for that.

I can’t _guarantee_ that the real version would be simpler until I try it. And I don’t want to hijack the OP’s thread and replace his proposal (which does give me what I want) with mine (which doesn’t give him what he wants), unless he abandons the idea of attempting to implement his version. But I’m pretty confident it would be as simple as it sounds, which is even simpler than the hacky version (which, again, is dead simple and works today). And most variations on the idea you could design would be just as simple. Maybe the OP will perversely design one that isn’t. If so, it’s his job to show that it can be implemented. And if he gives up, then I’ll argue for something that I can implement simply. But I don’t think that’s even going to come up.
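[For a flavor of what such a token-stream rewrite looks like, here is a rough sketch using the stdlib tokenize module on source text; the real hack is an import hook, and __user_literals__ is an invented name. Note that the pure-Python tokenize module of this era yields NUMBER '2.34' followed by NAME 'd' for '2.34d'; newer tokenizers may reject such input outright.]

    import io
    import tokenize

    def rewrite_suffixed_numbers(source):
        out = []
        toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
        i = 0
        while i < len(toks):
            tok = toks[i]
            nxt = toks[i + 1] if i + 1 < len(toks) else None
            # NUMBER immediately followed by NAME (no whitespace between
            # them) is a suffixed literal; replace the pair with a
            # registry lookup and call.
            if (tok.type == tokenize.NUMBER and nxt is not None
                    and nxt.type == tokenize.NAME and nxt.start == tok.end):
                out.append("__user_literals__[%r](%r)" % (nxt.string, tok.string))
                i += 2
            else:
                out.append(tok.string)
                i += 1
        return " ".join(out)

    print(rewrite_suffixed_numbers("x = 2.34d + 1"))
    # x = __user_literals__['d']('2.34') + 1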
Well, that works exactly the same way a string does today (including the optional r prefix). The closing quote can now be followed by a string of identifier characters, but everything up to there is exactly the same as today. So, it doesn’t add any complexity, because it uses the same rules as today. I did suggest, as a throwaway addon to the OP’s proposal, that you could instead do raw strings or even really-raw (the string ends at the first matching quote; backslashes mean nothing). I don’t know if he wants either of those, but if he does, raw string literals are already defined in the grammar and implemented in the tokenizer, and really-raw is an even simpler grammar (identical to the existing grammar except that instead of `longstringchar | stringescapeseq` there’s a `<any source character except the quote>` node, and the same for `shortstringitem`).
And make absolutely sure that the rules are identical for EVERY possible custom prefix/suffix,
Well, in my version, since the rule for suffixedstringliteral is just `stringliteral identifier`, of course it’s the same for every possible suffix; there’s no conceivable way it could be different. If the OP wants to propose something more complicated that provides some way of selecting different rules, he could, but I don’t think he has, and if he doesn’t, then the issue will be equally nonexistent. I don’t know whether he wants to interact with the existing string prefixes (or, if so, how that works), or always do normal strings, or always do really-raw strings, or what, but there are multiple plausible designs, most of which are not impossible or even complicated, so the fact that you can imagine that there might be a design that would be impossible really isn’t relevant. Just to show how easy it is to come up with something (but which, again, may not be what the OP actually wants here): a stringliteral is now a stringprefix followed by shortstring or longstring (as today) or an identifier followed by rrshortstring or rrlongstring. The rr tokens are defined as I described above: they end at the first matching quote, no backslashing. This option would have some limitations— people can’t use \” to escape quotes in prefixed strings, there’s no way to get prefixed bytes, you probably can’t call a prefix “bub”… does that make some of the OP’s desired use cases or some of the 2013 use cases no longer viable? I don’t know. If so, the OP presumably won’t use this option and will use a different one. Any option will have some limitations, and I don’t know which one he wants, but there are a huge number of simple, and nonmagical, options that he could pick.
Making the language definition and the interpreter and the compiler more complicated doesn’t eliminate the complexity, it just moves it somewhere else.
You don't have to have the affix registry to figure out how to parse a script into an AST.
You don’t need the registry to parse to an AST for my proposal either; it’s only used at runtime. And, while the OP didn’t give us a grammar, he did give us proposed bytecode output of (one version of) his idea, and it’s pretty obvious that the registry isn’t getting involved until the interpreter eval loop processes the new registry-lookup opcode, so it clearly isn’t involved in parsing. And why would it get involved in parsing? It’s not like someone is proposing Rust or Dylan macros here.
Not at all. Why would it?
Yes, of course it does. That should be obvious from the fact that I said that `1/2F` would end up equivalent to `1/Fraction(2)`. Concretely, it ends up as something like `1/sys.__user_suffixes__['F']('2')`, except probably with nicer error handling, so you don’t get a KeyError for an unknown suffix. Notice that the only thing looked up in the registry is the function to process the text, and this doesn’t need to happen until runtime, long after the code has been not just parsed, but compiled. Of course the OP’s version will be a little different. He wants to handle both prefixes and suffixes by looking up the prefix and passing the suffix as a second argument. And I’m not sure what exactly he wants as the main argument. But I still don’t see any reason it would need to look in the registry at tokenizer or parse or compile time. And again, his proposed bytecode translation implies that it doesn’t do so. So why imagine that it has to when there’s no visible reason for it?
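Spelled out, the entire runtime side of my version is something like this (a minimal sketch; `_apply_suffix` is a name I just made up, and the “nicer error handling” is nothing more than turning KeyError into SyntaxError):

    import fractions
    import sys

    def _apply_suffix(suffix, text):
        try:
            handler = sys.__user_suffixes__[suffix]
        except (AttributeError, KeyError):
            raise SyntaxError(f'unknown literal suffix {suffix!r}') from None
        return handler(text)

    # hypothetical registration, then what 1/2F would compile to:
    sys.__user_suffixes__ = {'F': fractions.Fraction}
    print(1 / _apply_suffix('F', '2'))  # Fraction(1, 2), printed as 1/2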
What is a "literal token", and would it need to include these kinds of things?
How could this not be obvious? I deliberately chose the phrase “literal token”, and you clearly understand what this means because you invoked that meaning just one paragraph above. I also provided a link to a hacky implementation that blatantly relies on the tokenizer’s current processing of literals. And I gave examples that make it clear that `2` is a literal token and `1/2` is not. So why do you even need to ask whether `-4/5` is one? How could `-4/5` possibly be a literal token if `1/2` is not, or if it isn’t a token at all?
What if some registered types need to include them and some don't?
They can’t. The simple rule works for every numeric example everyone has come up with so far, even Steven’s facetious quaternion example that he proposed as too ridiculous for anyone to actually want. Is it a flaw that there may or may not be some examples that nobody has been able to think of that might work with a much more complicated feature but won’t work with this feature? Of course not. That’s true for every feature ever. There’s no reason to ignore the obvious simple design and try to imagine more complicated designs that may or may not solve additional problems that nobody’s even imagined just so you can dismiss the idea as too complicated.

Thanks, Andrew, you're able to explain this much better than I can. Just wanted to add that Python *already* has ways to grossly abuse its syntax and create unreadable code. For example, I can write

    >>> о = 3
    >>> o = 5
    >>> ο = 6
    >>> (о, o, ο)
    (3, 5, 6)

But just because some feature CAN get abused, doesn't mean it ACTUALLY gets abused in practice. People want to write nice, readable code, because they will ultimately be the ones to support it.

On Wed, Aug 28, 2019 at 10:50 PM Rhodri James <rhodri@kynesim.co.uk> wrote:
'\u043e' CYRILLIC SMALL LETTER O
'o' LATIN SMALL LETTER O
'\u03bf' GREEK SMALL LETTER OMICRON

Virtually indistinguishable in most fonts, but distinct characters. It's the same thing you can do with "I" and "l" in many fonts, or "rn" and "m" in some, but taken to a more untypable level. ChrisA

The case can be made as follows: different people use different parts of the Python language. Andrew would love to see support for decimals, fractions and float32s (possibly float16s too, and maybe even posit numbers). Myself, I miss datetime and regular expression literals. Other people on the 2013 thread argued at length in favor of supporting sql-literals, which would allow them to be used in a much safer manner. Then there are those who want to write complex numbers in a natural fashion, but they already got their wish granted. In short, the needs vary, and not all of the functionality belongs in the Python standard library either.

On Tuesday, August 27, 2019, 11:42:23 AM PDT, Serhiy Storchaka <storchaka@gmail.com> wrote:
And yet, despite that limitation, many people find it useful, and use it on a daily basis. Are you suggesting that Python would be better off without the `j` suffix because of that problem?

On Tue, Aug 27, 2019 at 10:07:41AM -0700, Andrew Barnert wrote:
Before I get into this, let me ask you a question. What does the j suffix give us?
I'm going to answer that question, but before I answer it, I'm going to object that this analogy is a poor one. This proposal is *in no way* a proposal for a new compile-time literal. If it were, it might be interesting: I would be very interested to hear more about literals for a Decimal type, say, or regular expressions. But this proposal doesn't offer that. This proposal is for mere syntactic sugar allowing us to drop the parentheses from a tiny subset of function calls, those which take a single string argument. And even then, only when the argument is a string literal:

    czt'abc'  # Okay.
    s = 'abc'
    czt's'    # Oops, wrong, doesn't work.

But, to answer your question, what does the j suffix give us? Damn little. Unless there is a large community of Scipy and Numpy users who need complex literals, I suspect that complex literals are one of the least used features in Python. I do a lot of maths in Python, and aside from experimentation in the interactive interpreter, I think I can safely say that I have used complex literals exactly zero times in code.
You can write complex numbers without it just fine: [...]
Indeed. And if we didn't already have complex literals, would we accept a proposal to add them now? I doubt it. But if you think we would, how about a proposal to add quaternions?

    q = 3 + 4i + 2j - 7k
But would anyone ever write that when they can write it like this:
1 + 2j
Given that complex literals are already a thing, of course you are correct that if I ever needed a complex literal, I would use the literal syntax. But that's the point: it is *literal syntax* handled by the compiler at compile time, not syntactic sugar for a runtime function call that has to inefficiently parse a string. Because it is built-in to the language, we don't have to do this:

    def c(astring):
        assert isinstance(astring, str)
        # Parse the string at runtime
        real, imag = ...
        return complex(real, imag)

    z = c"1.23 + 4.56j"

(I'm aware that the complex constructor actually does parse strings already, so in *this specific* example we don't have to write our own parser. But that doesn't apply in the general case.) That is nothing like complex literals:

    py> from dis import dis
    py> dis(compile('1+2j', '', 'eval'))
      1           0 LOAD_CONST               2 ((1+2j))
                  3 RETURN_VALUE

    # Hypothetical byte-code generated from custom string prefix
    py> dis(compile("c'1+2j'", '', 'eval'))
      1           0 LOAD_NAME                0 (c)
                  3 LOAD_CONST               0 ('1+2j')
                  6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
                  9 RETURN_VALUE

Note that in the first case, we generate a complex literal at compile time; in the second case, we generate a *string* literal at compile time, which must be parsed at runtime. This is not a rhetorical question: if we didn't have complex literals, why would you write your complex number as a string, deferring parsing it until runtime, when you could parse it in your head at edit-time and call the constructor directly?

    z = complex(1.23, 4.56)  # Assuming there was no literal syntax.
I don't think it is. I think the big deals in this proposal are:

- you have something that looks like a kind of string czt'...' but is really a function call that might return absolutely anything at all;
- you have a redundant special case for calling functions that take a single argument, but only if that argument is a string literal;
- you encourage people to write cryptic single-character functions, like v(), x(), instead of meaningful expressions like Version() and re.compile();
- you encourage people to defer parsing that could be efficiently done in your head at edit time into slow and likely inefficient string parsing done at runtime;
- the OP still hasn't responded to my question about the ambiguity of the proposal (is czt'...' one three-letter prefix, or three one-letter prefixes?)

all of which *hugely* outweighs the gain of being able to avoid a pair of parentheses. [...]
Indeed, but this proposal doesn't help you here. You still have to write strings. What you want is a float32 literal, let's say 1.23f, but what you have to write is a function call with a string argument f('1.23'). All this proposal buys you is to drop the parentheses: f'1.23'. It's still a function call, except it looks like a string. While I'm sympathetic, and I'd like to see a Decimal literal, I doubt that there's enough use-cases outside of specialists like yourself for 16- or 32-bit floats to justify making them built-ins with literal syntax. Python is not Julia :-) But if you could get the numpy people to write a PEP... *wink* -- Steven

On Aug 27, 2019, at 18:59, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, you’re the same person who got hung up on the fact that these affixes don’t really give us “literals” back in either 2013 or 2016, and I don’t want to rehash that argument. I could point out that nobody cares that -1 isn’t really a literal, and almost nobody cares that the CPython optimizer special-cases its way around that, and the whole issue with Python having three different definitions of “literal” that don’t coincide, and so on, but we already had this conversation and I don’t think anyone but the two of us cared. What matters here is not whether things like the OP’s czt'abc' or my 1.23f or 1.23d are literals to the compiler, but whether they’re readable ways to enter constant values to the human reader. If so, they’re useful. Period. Now, it’s possible that even though they’re useful, the feature is still not worth adding because of Chris’s issue that it can be abused, or because there’s an unavoidable performance cost that makes it a bad idea to rely on them, or because they’re not useful in _enough_ code to be worth the effort, or whatever. Those are questions worth discussing. But arguing about whether they meet (one of the three definitions of) “literal” is not relevant.
And to drop the quotes as well. And to avoid polluting the global namespace with otherwise-unused one-character function names. Can you honestly tell me that you see no significant readability difference between these examples: vec = [1.23f, 2.5f, 1.11f] vec = [f('1.23'), f('2.5'), f('1.11')] I think anyone would agree that the former is a lot more readable. Sure, you have to learn what the f suffix means, but once you do, it means all of the dozens of constants in the module are more readable. (And of course most people reading this code will probably be people who are used to 3D code and already _expect_ that format, since that’s how you write it in C, in shaders, etc.)
Sure, just like you can’t apply an r or f prefix to a string expression.
I don’t think your experience here is typical. I can’t think of a good way to search GitHub python repos for uses of j, but a hacky search immediately turned up this numpy issue: https://github.com/numpy/numpy/issues/13179:
A fast way to get the inverse of angle, i.e., exp(1j * a) = cos(a) + 1j * sin(a). Note that for large angle arrays, exp(1j*a) needlessly triples memory use…
That doesn’t prove that people actually call it with `1j * a` instead of `complex(0, a)`, but it does seem likely.
I’m not sure. I assume you’d be against it, but I suspect that most of the people who use it today would be for it. But if we had custom affixes, I think everyone would be happy with “just define a custom j suffix”. Would anyone really argue that they need the performance benefit or compile-time handling? How often do you evaluate zillions of constants in the middle of a tight loop? And what other argument would there be for adding it to the grammar and the compiler and forcing every project to use it? Which is exactly what I think of the Decimal and Fraction suffixes, contrary to what Chris says. There will be a small number of projects that get a lot of readability benefit, but every other project gains nothing, so why add it as a builtin for every project? And I don’t see why float32 is any different from Decimal and Fraction, given that the actual problem is not lossless values but readability, so a builtin Decimal suffix wouldn’t help there.
I’m sure you can guess my answer to that: most projects don’t need it, so there should definitely not be builtin suffixes for it. But if we have custom suffixes, then anyone who _does_ need it can do it trivially, without having to bother the rest of us asking for it.
You’ve cut off my paragraph to make it misleading. The visual noise of the parens _is_ a bigger deal than the two extra keystrokes, but as I said in the very next sentence, it’s still not the point of the feature. The real big deal is that it lets you write complex numbers in a way that looks like complex numbers. I don’t see any benefit in arguing about whether the “bigger deal but still not the point” is actually a bigger deal or not, because who cares?
It doesn’t return “anything at all”, any more than a function returns “anything at all”. It returns something consistent that has a specific meaning in your project. I don’t know what czt means a priori, but if I were reading the OP’s code, I could look it up, and then I would know. And I could assume that, unless the author is an idiot, an affix on a string literal is going to be something stringy, and an affix on a number literal is going to be something numbery. Sure, you _could_ violate that assumption, but that’s no different from the fact that you could write a function called sqrt(n) that returns an iterable of the contents of all the files in the nth subdirectory of $HOME. You’re not going to do that. (Or, if you do, I’m going to stop reading your code.)
Do you honestly not see the readability benefit in a bunch of constants that all look like `2.3d` instead of `Decimal('2.3')`? Do you honestly think that `D('2.3')` is just as good as `2.3d`, and also worth using up a scarce resource (one-letter global names) for? If not, then I don’t see why you’re pretending not to see the benefit of the proposal.
Well, this is why I wanted him to get into more details than his initial 10,000-foot-vision thing. I honestly see a lot more use for numeric affixes than string ones, and for suffixes rather than prefixes, and for just one suffix per value rather than some rule for combining them. But I know that last time around (or maybe the time before), a sql prefix was the thing that got the most people excited, and I could see wanting to combine that with raw or not, and so on, so I’d like to see a concrete proposal on how all of that works.
all of which *hugely* outweighs the gain of being able to avoid a pair of parentheses.
Which, you’ll note, I already said was not the point of the proposal.
No I don’t. I write `1.23f`, just like I do in C or in GLSL, exactly what I want to write, and read. In my version of the proposal (which I described in my first email in the thread—including a link to a hacky import hook that implements it that I wrote up last time this subject came up in 2015), the parser sees a literal token (any kind of literal, not just string literals) followed by a string of identifier characters (that weren’t munched by that literal), it looks up that string, and calls the looked-up function with the raw text of the literal token. The OP’s second email in the thread incorporated my idea into his existing idea. His version is more complicated than mine because it handles prefixes as well as suffixes, and it doesn’t have a proof of concept to verify that it’s all doable unambiguously, but it still allows me to write `1.23f`.
What you want is a float32 literal, let's say 1.23f
I don’t care whether it’s an actual literal. I do care that I can write it as `1.23f`. And both my proposal and the OP’s allow that, so I’m happy with either.
See, I would _not_ like to see a builtin Decimal literal. Or float16 or float32 or fixed1616, or Fraction, or _any_ other new kinds of literals. As long as I can use `1.23f` to mean `float32('1.23')` in the projects where I have a mess of float32 constants (and maybe copy-paste them into my REPL from a shader or a C debugger session), I’m happy. And I think user-defined affixes are a better way to get that than trying to convince everyone that Python should add a builtin float32 type and all the math that goes with it and then add a suffix for float32 literals. Not because I doubt I could convince anyone of that, not because I’d have to wait 5 years before I could start using it even if I could, but because it’s simply not right in the first place. Python should not have a builtin float32 type. And therefore, Python should not have a builtin `f` suffix. And even if that weren’t an issue, I’d _still_ rather have a custom affix feature than a mess of new builtin ones. If I run into an unfamiliar affix in your code, I’d rather look it up in your project than consult a table with a mess of builtin affixes. If I want `d` for Decimal in one project, why should that mean nobody can ever use `d` for something different in a project that doesn’t do any decimal math but does a whole bunch of… something else I can’t think of off the top of my head, but I’m sure there will be at least one suffix that has two good meanings in widely different uses. Just like the namespace for builtin functions, the namespace for builtin affixes is—and should be—a limited resource. And meanwhile, what would I get from `d` being builtin? I can save one import or register call or whatever per project. My program that takes 80 seconds to run starts up 2us faster because a few dozen (or at most hundred) constructed constants can be stored in the .pyc file. I don’t have to watch my speech to carefully avoid using the word “literal” imprecisely. None of those are worth anywhere near as much to me as being able to have the suffixes I want for the project, even if I just thought of them today—and to _not_ have the ones I _don’t_ want, even if some of them have a broader niche.

On Wed, 28 Aug 2019 at 05:04, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Extended (I'm avoiding the term "custom" for now) literals like 0.2f, 3.14D, re/^hello.*/ or qw{a b c} have a fairly solid track record in other languages, and I think in general have proved both useful and straightforward in those languages. And even in Python, constructs like f-strings and complex numbers are examples of such things. However, I know of almost no examples of other languages that have added *user-definable* literal types (with the notable exception of C++, and I don't believe I've seen use of that feature in user code - which is not to say that it's not used). That to me says that there are complexities in extending the question to user-defined literals that we need to be careful of. In my view, the issue isn't abuse of the feature, or performance, or limited value. It's the very basic problem that it's *really hard* to define and implement such a feature in a way that everyone is happy with - particularly in a language like Python which doesn't have a user-exposed "compile source to binary" step (I tried very hard to cover myself against nitpicking there - I'm sure I failed, but please, don't get sidetracked, you know what I mean here :-)). Some specific questions which would need to be dealt with:

1. What is valid in the "literal" part of the construct (this is the p"C:\" question)?
2. How do definitions of literal syntax get brought into scope in time for the parser to act on them (this is about "import xyz_literal" making xyz"a string" valid but leaving abc"a string" as a syntax error)?

These questions also fundamentally affect other tools like IDEs, linters, code formatters, etc. In addition, there is the question of how user-defined literals would get turned into constants within the code. In common with list expressions, tuples, etc, user-defined literals would need to be handled as translating into runtime instructions for constructing the value (i.e., a function call). But people typically don't expect values that take the form of a literal like this to be "just" syntax sugar for a function call. So there's an education issue here. Code will get errors at runtime that the users might have expected to happen at compile time, or in the linter. It's not that these questions can't be answered. Obviously they can, as you produced a proof of concept implementation. But the design trade-offs that one person might make are deeply unsatisfactory to someone else, and there's no "obviously right" answer (at least not yet, as no-one Dutch has explained what's obvious ;-)) Also, it's worth noting that the benefits of *user-defined* literals are *not* the same as the benefits of things like 0.2f, or 3.14d, or even re/^hello.*/. Those things may well be useful. But the benefit you gain from *user-defined* literals is that of letting the end user make the design decisions, rather than the language designer. And that's a subtly different thing. So, to summarise, the real problem with user defined literal proposals is that the benefit they give hasn't yet proven sufficient to push anyone to properly address all of the design-time details. We keep having high-level "would this be useful" debates, but never really focus on the key question, of what, in precise detail, is the "this" that we're talking about - so people are continually making arguments based on how they conceive such a feature might work. A really good example here is the p"C:\" question. Is the proposal that the "string part" of the literal is just a normal string?
If so, then how do you address the genuine issue that not all paths are valid string literals? What about backslash-escapes (p"C:\temp")? Is the string a raw string or not? If the proposal is that the path-literal code can define how the string is parsed, then *how does that work*? The OP even made this point explicitly:
I'm not discussing possible implementation of this feature just yet, we can get to that point later when there is a general understanding that this is worth considering.
I don't think we *can* agree on much without the implementation details (well, other than "yes, it's worth discussing, but only if someone proposes a properly specified design" ;-)) Paul

On 2019-08-28 01:05, Paul Moore wrote:
However, I know of almost no examples of other languages that have added *user-definable* literal types (with the notable exception of
I believe there is such a feature in modern JavaScript (tagged template literals): https://developers.google.com/web/updates/2015/01/ES6-Template-Strings#tagge... -Mike

On Wed, Aug 28, 2019 at 04:02:26PM +0100, Paul Moore wrote:
Elixir has something it calls sigils. It seems to be basically the map-to-function variant: https://elixir-lang.org/getting-started/sigils.html Konstantin

In addition, there is the question of how user-defined literals would get turned into constants within the code.
So, I'm just brainstorming here, but how about the following approach:

- Whenever a compiler sees `abc"def"`, it creates a constant of the type `ud_literal` with fields `.prefix="abc"`, `.content="def"`.
- When it compiles a function then instead of `LOAD_CONST n` op it would emit `LOAD_UD_CONST n` op.
- This new op first checks whether its argument is a "ud_literal", and if so calls the `.resolve()` method on that argument. The method should call the prefix with the content, producing an object that the LOAD_UD_CONST op stores back in the `co_consts` storage of the function. It is a TypeError for the resolve method to return another ud_literal.
- Subsequent calls to the LOAD_UD_CONST op will see that the argument is no longer a ud-literal, and will return it as-is.

This system would allow each constant to be evaluated only once and subsequently memoized, and only compute those constants that will actually be used.
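In rough pseudo-Python, the new op would behave something like this (a sketch only: the registry dict, the direct call in place of `.resolve()`, and the list standing in for the real immutable `co_consts` tuple are all simplifications):

    import datetime

    REGISTRY = {'dt': datetime.date.fromisoformat}  # hypothetical prefix registry

    class UDLiteral:
        def __init__(self, prefix, content):
            self.prefix = prefix
            self.content = content

    def load_ud_const(co_consts, n):
        const = co_consts[n]
        if isinstance(const, UDLiteral):
            value = REGISTRY[const.prefix](const.content)
            if isinstance(value, UDLiteral):
                raise TypeError('resolve() must not return another ud_literal')
            co_consts[n] = value  # memoize back into the consts storage
            const = value
        return const

    consts = [UDLiteral('dt', '2019-08-28')]
    print(load_ud_const(consts, 0))  # parsed on first load
    print(load_ud_const(consts, 0))  # returned as-is afterwards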

I don't usually work with windows, but I can see how this could be a pain point for windows users. They need both backslashes and the quotation marks in their paths. As nobody has suggested yet how to deal with the problem, I'd like to give it a try. Behold:

    p{C:\}

The part within the curly braces is considered a "really-raw" string. The "really-raw" means that every character is interpreted exactly as it looks, there are no escape characters. Internal braces will be allowed too, provided that they are properly nested:

    p{C:\"Program Files"\{hello}\}

If you **need** to have unmatched braces in the string, your last hope is the triple-braced literal:

    p{{{Letter Ж looks like }|{... }}}

The curly braces can only be used with a string prefix (or suffix?). And while we're at it, why not allow chained literals:

    re{(\w+)}{"\1"}
    frac{1}{17}
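And to show that the brace-matching rule fits in a few lines, here is a toy scanner for it (my own sketch: nesting is the only rule, and backslashes mean nothing, which is the whole point):

    def scan_braced(src, i):
        """src[i] must be '{'; return (content, index just past the matching '}')."""
        depth = 0
        start = i + 1
        for j in range(i, len(src)):
            if src[j] == '{':
                depth += 1
            elif src[j] == '}':
                depth -= 1
                if depth == 0:
                    return src[start:j], j + 1
        raise SyntaxError('unterminated braced literal')

    print(scan_braced(r'p{C:\"Program Files"\{hello}\}', 1))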

On Aug 28, 2019, at 01:05, Paul Moore <p.f.moore@gmail.com> wrote:
Agreed 100%. That’s why I think we need a more concrete proposal, that includes at least some thought on implementation, before we can go any farther, as I said in my first reply. The OP wanted to get some feeling of whether at least some people might find some version of this useful before going further. I think we’ve got that now (the fact that not 100% of the responders agree doesn’t change that), so we need to get more detailed now. My own proposal was just to answer the charge that any design will inherently be impossible or magical or complicated by giving a design that is none of those. It shouldn’t be taken as any more than that. If there are good use cases for prefixes, prefixes plus suffixes, etc., then my proposal can’t get you there, so let’s wait for the OP’s more detailed proposal.
I think this pretty much has to be either (a) exactly what’s valid in the equivalent literals today, or (b) something equally simple to describe, and parse, even if it’s different (like really-raw strings, or perlesque regex with delimiters other than quotes, or whatever). Either way, I think you want to use the same rule for all affixed literals, not allow a choice of different ones like C++ does.
I don’t know that this is actually necessary. If `abc"a string"` raises an error at execution time rather than compile time, yes, that’s different from how most syntax errors work today, but is it really unacceptable? (Notice that in the most typical case, the error still gets raised from importing the module or from the top level of the script—but that’s just the most typical case, not all cases—you could get those errors from, say, calling a method, which you don’t normally expect.) There’s clearly a trade off here, because the only other alternative (at least that I’ve thought of or seen from anyone else; I’d love to be wrong) is that what you’ve imported and/or registered affects how later imports work (and doesn’t that mean some kind of registry hash needs to get encoded in .pyc files or something too?). While that is normal for people who use import hooks, most people don’t use import hooks most of the time, and I suspect that weirdness would be more off-putting than the late errors. Another big one: How do custom prefixes interact with builtin string prefixes? For suffixes, there’s no problem suffixing, say, a b-string, but for prefixes, there is. If this is going to be allowed, there are multiple ways it could be designed, but someone has to pick one and specify it. (Actually, for suffixes, there _is_ a similar issue: is `1.2jd` a `d` suffix on the literal `1.2j`, or a `jd` suffix on `1.2`? I think the former, because it’s a trivially simple rule that doesn’t need to touch any of the rest of the grammar. Plus, not only is it likely to never matter, but in the rare cases where it does matter, I think it’s the rule you’d want. For example, if I created my own ComplexDecimal class and wanted to use a suffix for it, why would I want to define both `d` and `jd` instead of just defining `d` and having it work with imaginary literals?)
These questions also fundamentally affect other tools like IDEs, linters, code formatters, etc.
Good point. I was thinking that any rule that’s simple enough for Python and humans to parse will probably be reasonably simple for other tools, and any rule that isn’t simple enough for Python and humans is probably a non-starter anyway. But the “lookup affixes at compile time” idea is an example of something that would be easy for Python and for humans but difficult for single-file-at-a-time tools, so this can be important.
I really don’t think this one is a serious issue. Many people never need to learn that -2, 1+2j, (1,2), etc. are not literals, or which of those get optimized by CPython and packed into co_consts anyway, or which things that don’t even look like literals get similarly optimized. So how often will they need to know whether 1.23f is a literal, not a literal but optimized into a const, or neither?
That’s a good point, but I think you’re missing something big here. Think about it this way; assuming f and frac and dec and re and sql and so on are useful, our options are:

1) people don’t get a useful feature
2) we add user-defined affixes
3) we add all of these as builtin affixes

While #3 theoretically isn’t impossible, it’s wildly implausible, and probably a bad idea to boot, so the realistic choice is between 1 and 2. Now, you’re right that choice 2 inherently means that we’re putting a new design decision on the end user (or library designer). That is definitely a factor to be weighed on the decision. But I don’t think it’s an immediate disqualifying factor. And in fact, if the feature is properly designed to be restrictive enough (but not too restrictive) I don’t think it will even end up being that big of a deal. There are all kinds of things that we leave up to the user, from the trivial (e.g., in Haskell, a capital letter means a type rather than a value; in Python it’s entirely up to each project whether it means anything at all) to the drastic but rarely used (import hooks probably being the most extreme). This one isn’t going to be trivial, but I think it will fall much closer to the less-disruptive side than many people are assuming. It’s only going to touch a small part of the grammar, and the language in general. (And if that turns out not to be true of the actual proposal, then I probably won’t support the actual proposal.)
Again, agreed.

On Thu, 29 Aug 2019 at 01:18, Andrew Barnert <abarnert@yahoo.com> wrote:
That's a completely different point. Built-in affixes are defined by the language, user-defined affixes are defined by the user (obviously!) That includes all aspects of design - both how a given affix works, and whether it's justified to have an affix at all for a given use case. The argument is identical to that of user-defined operators vs built-in operators. If you can use this argument to justify user-defined affixes, it applies equally to user-defined operators, which is something that has been asked for far more often, with much more widespread precedents in other languages, and been rejected every time. Regarding your cases #1, #2, and #3, this is the fundamental point of language design - you have to choose whether a feature is worthwhile (in the face of people saying "well *I* would find it useful"), and whether to provide a general mechanism or make a judgement on which (if any) use cases warrant a special-case language builtin. If you assume everything should be handled by general mechanisms, you end up at the Lisp/Haskell end of the spectrum. If you decide that the language defines the limits, you are at the C end. Traditionally, Python has been a lot closer to the "language defined" end of the scale than the "general mechanisms" end. You can argue whether that's good or bad, or even whether things should change because people have different expectations nowadays, but it's a fairly pervasive design principle, and should be treated as such. This actually goes back to the OP's point:
we can get to that point later when there is a general understanding that this is worth considering
The biggest roadblock to a "general understanding that this is worth considering" is precisely that Python has traditionally avoided (over-) general mechanisms for things like this. The obvious other example, as I mentioned above, being user defined operators. I've been very careful *not* to use the term "Pythonic" here, as it's too easy for that to be a way of just saying "my opinion is more correct than yours" without a real justification, but the real stumbling block for proposals like this tends to be far less about the technical issues, and far *more* about "does this fit into the philosophy of Python as a language, that has made it as successful as it is?" My instinct is that it doesn't fit well with Python's general philosophy. Paul

On Aug 29, 2019, at 00:58, Paul Moore <p.f.moore@gmail.com> wrote:
And if you don’t make either assumption, but instead judge each case on its own merits, you end up with a language which is better than languages at either extreme. There are plenty of cases where Python generalizes beyond most languages (how many languages use the same feature for async functions and sequence iteration? or get metaclasses for free by having only one “kind” and then defining both construction and class definitions as type calls?), and plenty where it doesn’t generalize as much as most languages, and its best features are found all across that spectrum. You can’t avoid tradeoffs by trying to come up with a rule that makes language decisions automatically. (If you could, why would this list even exist?) The closest thing you can get to that is the vague and self-contradictory and facetious but still useful Zen. If you really did try to zealously pick one side or the other, always avoiding general solutions whenever a hardcoded solution is simpler no matter what, the best-case scenario would be something like Go, where a big ecosystem of codegen tools defeats your attempt to be zealous and makes your language actually usable despite your own efforts until soon you start using those tools even in the stdlib. Also, I’m not sure the spectrum is nearly as well defined as you imply in the first place. It’s hard to find a large C project that doesn’t use the hell out of preprocessor macros to effectively create custom syntax for things like error handling and looping over collections (not to mention M4 macros to autoconf the code so it’s actually portable instead of just theoretically portable), and meanwhile Haskell’s syntax is chock full of special-purpose features you couldn’t build yourself (would anyone even use the language without, say, do blocks?).

On Thu, 29 Aug 2019 at 14:21, Andrew Barnert <abarnert@yahoo.com> wrote:
You can’t avoid tradeoffs by trying to come up with a rule that makes language decisions automatically. (If you could, why would this list even exist?) The closest thing you can get to that is the vague and self-contradictory and facetious but still useful Zen.
Sorry, I wasn't trying to imply that you could. Just that choosing to implement some, but not all, possible literal affixes on a case by case basis was a valid language design option, and one that is taken in many cases. Your statement
seemed to imply that you thought it was an "all or nothing" choice. My apologies if I misunderstood your point. Paul

all of which hugely outweighs the gain of being able to avoid a pair of parentheses.
Thank you for summarizing the main objections so succinctly, otherwise it becomes too easy to get lost in the discussion. Let me try to answer them as best as I can:
This is kinda the whole point. I understand, of course, how the idea of a string-that-is-not-a-string may sound blasphemous, however I invite you to look at this from a different perspective. Today's date is 2019-08-28. The date is a moment in time, or perhaps a point in the calendar, but it is certainly not a string. How do we write this date in Python? As `datetime("2019-08-28")`. We are forced to put the date into a string and pass that string into a function to create an actual datetime object. With this proposal the code would look something like `dt"2019-08-28"`. You're right, it's not a string anymore. But it *should not* have been a string to begin with; we only used a string there because Python didn't offer us any other way. Now, with prefixed strings, justice is finally done: we are able to express the notion of <a specific date> directly. And the fact that it may still use strings under the hood to achieve the desired result is really an implementation detail, that may even change at some point in the future.
There are many things in Python that are in fact function calls in disguise. Decorators? Function calls. Imports? Function calls. Class definition? Function call. Getters/setters? Function calls. Attribute access? Function calls. Even a function call is a function call via `__call__()`. I may be oversimplifying a bit, but the point is that just because something can be written as a function call doesn't mean it's the most natural way of doing it. Besides, there are use cases (such as `sql'...'`) where people do actually want to have a function that is constrained to string literals only. Having said that, prefixed (suffixed) strings (numbers) are not *exactly* equivalent to function calls. The points of difference are:

- prefixes/suffixes are namespaced separately from regular variable names.
- their results can be automatically memoized, bringing them closer to builtin literals.
Which is why I suggested putting them in a separate namespace. You're right that function `v()` is cryptic and should be avoided. But a prefix `v"..."` is neither a function nor a variable, it's ok for it to be short. The existing string prefixes are all short after all.
I don't encourage such a thing, it's just that most often there is no other way. For example, consider regular expression `[0-9]+`. I can "parse it in my head" to understand that it means a sequence of digits, but how exactly am I supposed to convey this understanding to Python? Or perhaps I can parse "2019-08-28" in my head, and write in Python `datetime(year=2019, month=8, day=28)`. However, such a form would greatly reduce the readability of the code from humans' perspective. And human readability matters more than computer readability, for now. In fact, purely from the efficiency perspective, the prefixed strings can potentially have better performance because they are auto-memoized, while `datetime("2019-08-28")` needs to re-parse its input string every time (or add its own internal memoization, but even that would be less efficient because it doesn't know the input is a literal string).
Sorry, I thought this part was obvious. It's a single three-letter prefix.

On Wed, Aug 28, 2019 at 10:01:25PM -0000, stpasha@gmail.com wrote:
Yes, I understand that. And that's one of the reasons why I think that this is a bad idea. Since Python is limited to ASCII syntax, we only have a small number of symbols suitable for delimiters. With such a small number available:

- parentheses () are used for grouping and function calls;
- square brackets [] are used for lists and subscripting;
- curly brackets {} are used for dicts and sets;
- quote marks are used for bytes and strings.

And with your proposal:

- quote marks are also used for function calls, but only a limited subset of function calls (those which take a single string literal argument).

Across a large majority of languages, it is traditional and common to use round brackets for grouping and function calls, and square and curly brackets for collections. There are a handful of languages, like Mathematica, which use [] for function calls.
I understand, of course, how the idea of a string-that-is-not-a-string may sound blasphemous,
It's not a matter of blasphemy. It's a matter of readability and clarity.
We are "forced" to write that are we? Have you ever tried it? py> from datetime import datetime py> datetime("2019-08-28") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: an integer is required (got type str)
    py> datetime(2019, 8, 28)
    datetime.datetime(2019, 8, 28, 0, 0)

It is difficult to take your argument seriously when so much of it rests on things which aren't true. -- Steven

On Aug 29, 2019, at 04:58, Steven D'Aprano <steve@pearwood.info> wrote:
This is a disingenuous argument. When you read spam.eggs, of course you know that that means to call the __getattr__('eggs') method on spam. But do you actually read it as a special method calling syntax that’s restricted to taking a single string that must be an identifier as an argument, or do you read it as accessing the eggs member? Of course you read it as member access, not as a special restricted calling syntax (except in rare cases—e.g., you’re debugging a __getattribute__), because to do otherwise would be willfully obtuse, and would actively impede your understanding of the code. And the same goes for lots of other cases, like [1:7]. And the same goes for regex"a.*b" or 1.23f as well. Of course you’ll know that under the covers that means something like calling __whatever_registry__['regex'] with the argument "a.*b", but you’re going to think of it as a regex object or a float object, not as a special restricted calling syntax, unless you want to actively impede your understanding of the code.

On Thu, Aug 29, 2019 at 05:30:39AM -0700, Andrew Barnert wrote:
You make a good point about abstractions, but you are missing the critical point that spam.eggs *doesn't look like a string*. Things that look similar should be similar; things which are different should not look similar. I acknowledge your point (and the OP's) that many things in Python are ultimately implemented as function calls. But none of those things look like strings:

- The argument to the import statement looks like an identifier (since it is an identifier, not an arbitrary string);
- The argument to __getattr__ etc looks like an identifier (since it is an identifier, not an arbitrary string);
- The argument to __getitem__ is an arbitrary expression, not just a string.

All three are well understood to involve runtime lookups: modules must be searched for and potentially compiled, object superclass inheritance hierarchies must be searched; items or keys in a list or dict must be looked up. None of them suggest a constant literal in the same way that "" string delimiters do. The large majority of languages follow similar principles, allowing for usually minor syntactic differences. Some syntactic conventions are very weak, and languages can and do differ greatly. But some are very, very strong, e.g.:

- 123.4567 is nearly always a numeric float of some kind, rather than (say) multiplying two ints;
- ' and/or " are nearly always used for delimiting strings.

Even languages like Forth, which have radically different syntax to mainstream languages, sort-of follow that convention of associating quote marks with strings. ." outputs the following character string, terminating at the next " character. i.e. ." foo" in Forth would be more or less equivalent to print("foo") in Python. Let me suggest some design principles that should hold for languages with more-or-less "conventional" syntax. Languages like APL or Forth excluded.

- anything using ' or " quotation marks as delimiters (with or without affixes) ought to return a string, and nothing but a string;
- as a strong preference, anything using quotation marks as delimiters ought to be processed at compile-time (f-strings are a conspicuous exception to that principle);
- using affixes for numeric types seems like a fine idea, and languages like Julia that offer a wide range of builtin numeric types show that this works fine; in Python 2 we used to have native ints and longints that took a L suffix so there's precedent there. [...]
No I'm not. I'm going to think of it as a *string*, because it looks like a string. Particularly given the OP's preference for single-letter prefixes. 1.23f doesn't look like a string, it looks like a number. I have no objection to that in principle, although of course there is a question whether float32 is important enough to justify either builtin syntax or custom, user-defined syntax. -- Steven

On Thu, 29 Aug 2019 at 15:54, Steven D'Aprano <steve@pearwood.info> wrote:
This will degenerate into nitpicking very fast, so let me just say that I understand the general idea that you're trying to express here. I don't entirely agree with it, though, and I think there are some fairly common violations of your suggestion below that make your arguments less persuasive than maybe you'd like.
- anything using ' or " quotation marks as delimiters (with or without affixes) ought to return a string, and nothing but a string;
In C, Java and C++, 'x' is an integer (char). In SQL (some dialects, at least) TIMESTAMP'2019-08-22 11:32:12' is a TIMESTAMP value. In Python, b'123' is a bytes object (which maybe you're willing to classify as "a string", but the line blurs quite fast). Paul

On Aug 29, 2019, at 07:52, Steven D'Aprano <steve@pearwood.info> wrote:
Which is exactly why you’d read 1.23dec or 1.23f as a number, because it looks like a number and also acts like a number, rather than as a function call that takes the string '1.23', even if you know that’s how it’s implemented. And most of the string affixes people have suggested are for string-ish things. I’m not sure what a “version string” is, but I might design that as an actual subclass of str that adds extractor methods and overrides comparison. A compiled regex isn’t literally a string, but neither is a bytes; it’s still clearly _similar_ to a string, in important ways. And so is a path, or a URL (although I don’t know what you’d use the url prefix for in Python, given that we don’t have a string-ish type like ObjC’s NSURL to return and I don’t think we need one, but presumably whoever wrote the url affix would be someone who disagreed and packaged the prefix with such a class). And versions of the proposal that allow delimiters other than quotes so you can write things like regex/a.*b/, well, I’d need to see a specific proposal to be sure, but that seems even less objectionable in this regard. That looks like nothing else in Python, but it looks like a regex in awk or sed or perl, so I’d probably read it as a regex object.
The arguments to the dec and f affix handlers look like numeric literals, not arbitrary strings. The arguments to path and version are… probably string literal representations (with the quotes and all), not arbitrary strings. Although that does depend on the details of the specific proposal: if _any_ of your killer uses needs uncooked strings, then either you come up with something overcomplicated like C++ where you can register three different kinds of affixes, or you just always pass uncooked strings (because it’s trivial to cook on demand but impossible to de-cook). And the arguments to regex may be some _other_ kind of restricted special string that… I don’t think anyone has tried to define yet, but you can vaguely imagine what it would have to be like, and it certainly won’t be any arbitrary string.
So b"abc" should not be allowed? Let’s say I created a native-UTF16-string type to deal with some horrible Windows or Java stuff. Why would this principle of yours suggest that I shouldn’t be allowed to use u16"" just like b””? This is a design guideline for affixes, custom or otherwise. Which could be useful as a filter on the list of proposed uses, to see if any good ones remain (and if no string affix uses remain, then of course the proposal is either useless or should be restricted to just numbers or whatever), but it can’t be an argument against all affixes, or against custom affixes, or anything else generic like that.
I don’t see why you should even want to _know_ whether it’s true, much less have a strong preference. Here are things you probably really do care about: (a) they act like strings, (b) they act like constants, (c) if there are potential issues parsing them, you see those issues as soon as possible, (d) working with them is more than fast enough. Compile time is neither necessary (Haskell) nor sufficient (Tcl) for any of that. So why insist on compile-time instead of insisting on a-d?
No I'm not. I'm going to think of it as a *string*, because it looks like a string.
Well, yes. It’s a path string, or a regex string, or a version string, or whatever, which is loosely a kind of string but not literally one. Like bytes. Or it’s a sql cursor, in which case it was probably a misuse of the feature.
Particularly given the OP's preference for single-letter prefixes.
OK, I will agree with you there that the overuse of single-letter prefixes in the motivating examples is a worrying sign. In principle there’s nothing wrong with single letters (and I think I can make a good case for the f suffix as a good use in 3D-math code). And a program that used a whole ton of version strings and version string constants might find it useful to use v instead of ver. But I’m having a hard time imagining such a program existing. (Even something like pip or the PyPI backend might have lots of version strings, but why would it have lots of version string constants?) So, maybe that’s a sign that the OP’s eventual detailed set of use cases is not going to make me happy. Of course the burden is on the proposer, and if every proposed string affix use case ends up looking bad, then I’d either oppose the proposal or suggest that it be restricted to numeric affixes or something. But that’s not a reason to reject the proposal before seeing it, or to argue that whatever it is can’t conceivably be good because of [some posited universal principle that doesn’t even hold in Python today].
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint. But it’s not just you; at least three people (all of whom dislike the whole concept of custom affixes) seem at least in principle open to the idea of adding builtin affixes for types that don’t exist. Which makes me think it’s almost certainly not that you’re all crazy, but that I’m missing something important. Can you explain it to me?

On 29/08/2019 22:10:21, Andrew Barnert via Python-ideas wrote:
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint.
+1 (maybe that means I'm missing something). Just curious: Is there any reason not to make decimal.Decimal a built-in type? It's tried and tested. There are situations where floats are appropriate, and others where Decimals are appropriate (I'm currently using it myself); conceptually I see them as on an equal footing. If it were built-in, there would be good reason to accept 1.23d meaning a Decimal literal (distinct from a float literal), whether or not (any part of) the OP was adopted. Rob Cliffe

On Thu, Aug 29, 2019 at 11:19:58PM +0100, Rob Cliffe via Python-ideas wrote:
Just curious: Is there any reason not to make decimal.Decimal a built-in type?
Yes: it is big and complex, with a big complex API that is over-kill for the major motivating use-case for a built-in decimal type. There might be a strong case for adding a fixed-precision decimal type, and leaving out the complex parts of the Decimal API: no variable precision, just a single rounding mode, no contexts, no traps. If you need the full API, use the decimal module; if you just need something like builtin floats, but in base 10, use the built-in decimal. There have been at least two proposals. Neither has got as far as a PEP. If I recall correctly, the first suggested using Decimal64: https://en.wikipedia.org/wiki/Decimal64_floating-point_format the second suggested Decimal128: https://en.wikipedia.org/wiki/Decimal128_floating-point_format -- Steven

On Thu, 29 Aug 2019 at 22:12, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint. But it’s not just you; at least three people (all of whom dislike the whole concept of custom affixes) seem at least in principle open to the idea of adding builtin affixes for types that don’t exist. Which makes me think it’s almost certainly not that you’re all crazy, but that I’m missing something important. Can you explain it to me?
In my case, it's me that had missed something - namely the whole of this point. I can imagine having builtin syntax for a stdlib type (like Decimal, Fraction, or regex), but I agree that it gives the stdlib special privileges which I'm uncomfortable with. I definitely agree that built in syntax for 3rd party types is unacceptable. That quite probably contradicts some of my earlier statements - just assume I was wrong previously, I'm not going to bother going back over what I said and correcting my comments :-) I remain of the opinion that the benefits of user-defined literals would be sufficiently marginal that they wouldn't justify the cost, though. Paul

On Thu, Aug 29, 2019 at 02:10:21PM -0700, Andrew Barnert wrote: [...]
And most of the string affixes people have suggested are for string-ish things.
I don't think that's correct. Looking back at the original post in this thread, here are the motivating examples:

[quote]
There are quite a few situations where this can be used:
- Fraction literals: `frac'123/4567'`
- Decimals: `dec'5.34'`
- Date/time constants: `t'2019-08-26'`
- SQL expressions: `sql'SELECT * FROM tbl WHERE a=?'.bind(a=...)`
- Regular expressions: `rx'[a-zA-Z]+'`
- Version strings: `v'1.13.0a'`
- etc.
[/quote]

By my count, that's zero out of six string-ish things. There may have been other proposals, but I haven't trawled through the entire thread to find them.
A version object is a record with fields, most of which are numeric. For an existing example, see sys.version_info which is a kind of named tuple, not a string. The version *string* is just a nice human-readable representation. It doesn't make sense to implement string methods on a Version object. Why would you offer expandtabs(), find(), splitlines(), translate(), isspace(), capitalise(), etc methods? Or * and + (repetition and concatenation) operators? I cannot think of a single string method/operator that a Version object should implement.
It isn't clear to me how a compiled regex object is "similar" to a string. The set of methods offered by both regexes and strings is pretty small; by my generous count it is just two methods:

- str.split and SRE_Pattern.split;
- str.replace and SRE_Pattern.sub

neither of which use the same API or have the same semantics. Compiled regex objects don't offer string methods like translate, isdigit, upper, encode, etc. I would say that they are clearly *not* strings. [...]
Why do you need the "regex" prefix? Assuming the parser and the human reader can cope with using / as both a delimiter and an operator (which isn't a given!) /.../ for a regex object seems fine to me. I suspect that this is going to be ambiguous though:

    target = regex/a*b/ +x

could be:

    target = ((regex / a) * b) / (unary-plus x)

or:

    target = (regex object) + x

so maybe we do need a prefix.
In what way are byte-STRINGS not strings? Unicode-strings and byte-strings share a significant fraction of their APIs, and are so similar that back in Python 2.2 the devs thought it was a good idea to try automagically coercing from one to the other. I was careful to write *string* rather than *str*. Sorry if that wasn't clear enough.
It is a utf16 STRING so making it look like a STRING is perfectly fine. [...]
Because I care about performance, at least a bit. Because I don't want to write code that is unnecessarily slow, for some definition of "unnecessary". Because I want to be able to reason (at least in broad terms) about the cost of certain operations. Because I want to be able to reason about the semantics of my code. Why do I write 1234 instead of int("1234")? The second is longer, but it is more explicit and it is self-documenting: the reader knows that it's an int because it says so right there in the code, even if they come from Javascript where 1234 is an IEEE-754 float. Assuming the builtin int() hasn't been shadowed. But it's also wastefully slow. If we are genuinely indifferent to the difference, then we should be equally indifferent to a proposal to replace the LOAD_CONST byte-code for ints as follows:

    dis("1234")

    # in current Python
    LOAD_CONST     0 (1234)

    # In the future:
    LOAD_NAME      0 (int)
    LOAD_CONST     0 ('1234')
    CALL_FUNCTION  1 (1 positional, 0 keyword pair)

If you were asked to vote +1 or -1 on this proposal (sitting on the fence not allowed), which would you vote? I would vote -1. Aside from the performance hit, it's also a semantic change: what was a compile-time literal is now a runtime function call which can be shadowed. It is nice to know that when I say ``n = 1234`` that the value of n is guaranteed to be 1234 no matter what odd things are going on. (Short of running a modified interpreter.) String literals (byte- or unicode, raw or cooked, triple- or single-quoted) are, with the exception of f-strings, LOAD_CONST calls like ints. I think that's a valuable, useful thing to know, and not something we should lightly give up.
Here are things you probably really do care about: (a) they act like strings, (b) they act like constants,
Don't confuse immutability with constant-ness. Python doesn't have constants, except by convention. There's no easy way to prevent a simple name from being rebound.
(c) if there are potential issues parsing them, you see those issues as soon as possible,
Like at compile-time? Consider the difference between the compile-time syntax error you get here:

    x = 123w456

versus the run-time error you get here:

    x = int("123w456")

I can understand saying "we have no choice but to make this a runtime operation", or even "on the balance of pros and cons, it isn't worth the extra work to make this happen at compile-time". I don't like it that we have to write Decimal("123.456"), but I understand the reasons why we have to and can accept that it is a necessary evil. (To clarify: of course it is a feature that we *can* pass strings to the Decimal constructor, when such strings come from user-input or are read from data files etc.) But I don't think that it is a feature that there is no alternative but to pass a string, even when the value is known at edit-time. And I don't understand your position that I shouldn't care about the difference.
(d) working with them is more than fast enough.
You are right that Python is usually "fast enough" (except when it isn't), and that the one-off cost of creating a few pseudo-constants is generally only a small fraction of the cost of most programs. But Python isn't quote-unquote "slow" because of any one single thing; it is more of a death by a thousand cuts, lots of *small* inefficiencies which individually don't matter but collectively add up to making Python up to a hundred times slower than C. When performance matters, which would you rather write?

    for item in huge_sequence:
        value = item + 1234
        value = item + int("1234")

I know that when I use a literal, it will be as fast as it possibly can be in Python, or at least there's nothing *I* can do to make it faster. But when I have to use a call like Decimal("123.45"), that's one more thing for me to have to worry about: is it fast enough? Can I make it faster? Should I make it faster? We should be wary about hiding potentially slow code in something that looks like fast code. (Yes, that's a criticism of properties too, but in the case of properties we know that the benefits outweigh the risk. It's not clear that this is the case here.)
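For concreteness, a micro-benchmark along these lines shows the gap (a minimal sketch; absolute numbers will vary by machine and CPython version):

    import timeit

    # the literal is baked into the bytecode as a constant
    print(timeit.timeit("value = item + 1234", setup="item = 1"))
    # the call does a name lookup and parses the string on every iteration
    print(timeit.timeit('value = item + int("1234")', setup="item = 1"))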
I think you will find that I said this should be "a strong preference", which is hardly *insisting*.
Actually, no, it will be a Path object, a compiled regex SRE_Pattern object, or a Version object, not a string at all.
or whatever, which is loosely a kind of string but not literally one. Like bytes.
Bytes literally are strings. They just aren't strings of Unicode characters.
Or it’s a sql cursor, in which case it was probably a misuse of the feature.
That's one of the motivating examples. I agree it is a misuse of the proposed feature.
I can concur with all of that. [...]
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax.
Agreed. Although there's a bit of fuzziness over the concept of "builtin". Not all built-in objects are available in the ``builtins`` module, e.g. NoneType, or FunctionType.
I thought it went without saying that a necessary pre-condition for adding builtin syntax for a type was for the type to become built-in first. Sorry if it wasn't as clear or obvious as I thought. -- Steven

On Sat, Aug 31, 2019 at 8:44 PM Steven D'Aprano <steve@pearwood.info> wrote:
We call it a string, but a bytes object has as much in common with bytearray and with a list of integers as it does with a text string. Is the contents of a MIDI file a "string"? I would say no, it's not - but it can *contain* strings, eg for metadata and lyrics. The MIDI file representation of an integer might be stored in a byte-string, but the common API between text strings and byte strings is going to be mostly irrelevant here. You can't upper-case the variable-length-integer b"\xe7\x61" any more than you can upper-case the integer 13281. Those common methods are mostly built on the assumption that the string contains ASCII text. There are a few string-like functions that truly can be used with completely binary data, and which actually do make a lot more sense on a byte string than on, say, a list of integers. Notably, finding a particular byte sequence can be done without knowing what the bytes actually mean (and similarly bytes.split(), which does the same sort of search), and you can strip off trailing b"\0" without needing to give much meaning to the content. But I cannot recollect *ever* using these methods on any bytes object that wasn't storing some form of encoded text. Bytes and text have a long relationship, and as such, there are special similarities. That doesn't mean that bytes ARE text, any more than a compiled regex is text just because it's traditional to describe a regex in a textual form. Path objects also blur the "is this text?" line, since you can divide a Path by a string to concatenate them, and there are ways of smuggling arbitrary bytes through them. I don't think it's necessary to be too adamant about "must be some sort of thing-we-call-string" here. Let practicality rule, since purity has already waved a white flag at us. ChrisA

On Sat, Aug 31, 2019 at 09:31:15PM +1000, Chris Angelico wrote:
I don't think that's true.

    py> b'abc'.upper()
    b'ABC'
    py> [1, 2, 3].upper()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'list' object has no attribute 'upper'

Shall I beat this dead horse some more by listing the other 33 methods that byte-strings share with Unicode-strings but not lists? Compared to just two methods shared by all three of bytes, str and list (namely count() and index()), and *zero* methods shared by bytes and list but not str. In Python 2, byte-strings and Unicode strings were both subclasses of type basestring. Although we have moved away from that shared base class in Python 3, it does demonstrate that conceptually bytes and str are closely related to each other.
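Anyone who wants to reproduce the count can do so mechanically (a sketch; exact numbers shift a little across Python versions, and dir() includes dunders, so filter those out):

    str_bytes = {m for m in set(dir(str)) & set(dir(bytes))
                 if not m.startswith('_')}
    all_three = str_bytes & set(dir(list))

    print(sorted(str_bytes - set(dir(list))))  # shared by str and bytes, not list
    print(sorted(all_three))                   # ['count', 'index']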
Is the contents of a MIDI file a "string"? I would say no, it's not - but it can *contain* strings, eg for metadata and lyrics.
Don't confuse *human-readable native language strings* for generic strings. "Hello world!" is a string, but so are '&w-8\x02^xs\0' and b'DEADBEEF'.
Of course you can.

    py> b"\xe7\x61".upper()
    b'\xe7A'

Whether it is *meaningful* to do so is another question. But the same applies to str.upper: just because you can call the method doesn't mean that the result will be semantically valid.

    source = "def spam():\n\tpass\n"
    source = source.upper()  # no longer valid Python source code.
Those common methods are mostly built on the assumption that the string contains ASCII text.
As they often do. If they don't, then don't call the text methods which don't make sense in context. Just as there are cases where text methods don't make sense on Unicode strings. You wouldn't want to call .casefold() on a password, or .lstrip() on a line of Python source code. [...]
Bytes and text have a long relationship, and as such, there are special similarities. That doesn't mean that bytes ARE text,
I didn't say that bytes are (human-readable) text. Although they can be: not every application needs Unicode strings, ASCII strings are still special, and there are still applications where one has to mix binary and ASCII text data. I said they were *strings*. Strings are not necessarily text, although they often are. Formally, a string is a finite sequence of symbols that are chosen from a set called an alphabet. See: https://en.wikipedia.org/wiki/String_%28computer_science%29
It is because of *practicality* that we should prefer that things that look similar should be similar. Code is read far more often than it is written, and if you read two pieces of code that look similar, we should strongly prefer that they actually be similar. Would you be happy with a Pythonesque language that used prefixed strings as the delimiter for arbitrary data types?

    mylist = L"1, 2, None, {}, L"", 99.5"
    mydict = D"key: value, None: L"", "abc": "xyz""
    myset = S"1, 2, None"

That's what this proposal wants: string syntax that can return arbitrary data types. How about using quotes for function calls?

    assert chr"9" == "\t"
    assert ord"9" == 57

That's what this proposal wants: string syntax for a subset of function calls. Don't say that this proposal won't be abused. Every one of the OP's motivating examples is an abuse of the syntax, returning non-strings from something that looks like a string.

-- Steven

On Sun, Sep 1, 2019 at 10:47 AM Steven D'Aprano <steve@pearwood.info> wrote:
Older versions of Python had text and bytes be the same things. That means that, for backward compatibility, they have some common methods. But does that really mean that bytes can be uppercased? Or is it that we allow bytes to be treated as ASCII-encoded text, which is then uppercased, and then returned to being bytes?
Or does it actually demonstrate that Python 3 maintains backward compatibility with Python 2?
So what did you actually do here? You took some bytes that represent an integer, and you called a method on it that makes no sense whatsoever, because now it represents a different integer. There's no sense in which your new bytes object represents an "upper-cased version of" the integer 13281. If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
b"\xe7\x61".decode("Latin-1").upper().encode("Latin-1") b'\xc7A'
And if you choose some other encoding than Latin-1, you might get different results again. I put it to you that bytes.upper() exists more for backward compatibility with Python 2 than because a bytes object is, in some way, uppercaseable.
source = "def spam():\n\tpass\n" source = source.upper() # no longer valid Python source code.
But it started out as text, and it is now uppercase text. When you do that with bytes, you have to first layer in "this is actually encoded text", and you are then able to destroy that.
A finite sequence of symbols... you mean like a list of integers within the range [0, 255]? Nothing in that formal definition says that a "string" of anything other than characters should be meaningfully treated as text.
And you have yet to prove that this similarity is actually a thing.
At some point it's meaningless to call it a "Pythonesque" language, but I've worked with plenty of languages that simply do not have data types this rich, and so everything is manipulated the exact same way. When a list of values is represented as ";item 1;item 2;item 3" (actually as a string), or when you unpack a URL to find that it has JSON embedded inside it, the idea of a "prefixed string" that tells you exactly what data type is coming would be a luxury.
Let's look at regular expressions. JavaScript has a syntax for them involving leading and trailing slashes, borrowed from Perl, but I can't figure out whether a regex is a first-class object in Perl. So you can do something like this:

    "This has spaaaaam in it".match(/spa*m/)
In Python, I can do the exact same thing, only using double quotes as the delimiter.
re.search("spa*m", "This has spaaaaam in it") <re.Match object; span=(9, 17), match='spaaaaam'>
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string", yet the byte-encoded form of an integer somehow is? It makes absolutely no sense to uppercase an integer, yet you could uppercase a regex (since all regex special characters are non-letters, this will make it match uppercase strings). Yet when you encode the string as bytes, it gains an upper() method, and when you encode a regex as a compiled regex object, it loses one. Why do you insist that a regex is somehow not a string, but b"\xe7\x61" is? ChrisA

Chris Angelico writes:
Not just older versions. There have been several, more or less hotly contested, changes post-2/3 fork that basically come down to "bytes are frequently the wire format of ASCII-compatibly-encoded text, so we're going to add text methods for the convenience of people who work with those wire formats but do not need to (and sometimes cannot) decode to Unicode." For example, RFC 5322 header field tags are defined to be case- insensitive ASCII, and therefore it's useful to match them by upper- or lowercasing the tag, then matching against fixed strings. Could you convert to text and do the work? Not usefully: you need to parse the bytes to determine which text encoding is in use. (And ironically enough, if the message is RFC 5322 + RFC 2045-conformant, the hacky iso-8859-1 "conversion" will be allocation of a str object and then a memcpy of the bytes. I don't think that's a rebuttal to your argument, of course, it's just amusing.) That doesn't mean that bytes ARE text that happens to fit in 8-bit code units (PEP 393). It does mean that the similarities of the APIs are neither random accidents nor historical artifact. They're intentional. I don't think this has anything whatsoever to do with whether the "custom string prefix" proposal is a good idea or not. (other) Steve

On Sun, Sep 01, 2019 at 12:24:24PM +1000, Chris Angelico wrote:
Older versions of Python had text and bytes be the same things.
Whether a string object is *text* is a semantic question, and independent of what data format you use. 'Hello world!' is text, whether you are using Python 1.5 or Python 3.8. '\x01\x06\x13\0' is not text, whether you are using Python 1.5 or Python 3.8.
I'm curious what you think that b'chris angelico'.upper() is doing, if it is not uppercasing the byte-string b'chris angelico'. Is it a mere accident that the result happens to be b'CHRIS ANGELICO'? Unicode strings are sequences of code-points, abstract integers between 0 and 1114111 inclusive. When you uppercase the Unicode string 'chris angelico', you're transforming the sequence of integers:

    U+0063,0068,0072,0069,0073,0020,0061,006e,0067,0065,006c,0069,0063,006f

to this sequence of integers:

    U+0043,0048,0052,0049,0053,0020,0041,004e,0047,0045,004c,0049,0043,004f

If you are prepared to call that "uppercasing", you should be prepared to do the same for the byte-string equivalent. (For the avoidance of doubt: this is independent of the encoding used to store those code points in memory or on disk. Encodings have nothing to do with this.) [...]
I'm fairly confident that bytes methods aren't implemented by decoding to Unicode, applying the method, then re-encoding back to bytes. But even if they were, that's just an implementation detail. Imagine a float method that internally converted the float to a pair of integers (numerator/denominator), operated on that fraction, and then re-converted back to a float. I'm sure you wouldn't want to say that this proves that floats aren't numbers. The same applies to byte-strings. In the unlikely case that byte methods delegate to str methods, that doesn't mean byte-strings aren't strings. It just means that two sorts of strings can share a single implementation for their methods. Code reuse for the win! [...]
For the sake of the argument I'll accept that *this particular* byte string represents an integer rather than a series of mixed binary data and ASCII text, or text in some unusual encoding, or pixels in an image, or any of a million other things it could represent. That's absolutely fine: if it doesn't make sense to call .upper() on your bytes, then don't call .upper() on them. Precisely as you wouldn't call .upper() on a str object, if it didn't make sense to do so.
and you called a method on it that makes no sense whatsoever, because now it represents a different integer.
The same applies to Unicode strings too. Any Unicode string method that transforms the input returns something that represents a different sequence of code-points, hence a different sequence of integers. Shall we agree that neither bytes nor Unicode are strings? No, I don't think so either :-)
If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
Sure. If you perform *any* transformation on the data first, it might give a different result on uppercasing:

- if you reverse the bytes, uppercasing gives a different result;
- if you replace b'a' with b'e', uppercasing gives a different result;

etc. And exactly the same observation applies to str objects:

- if you reverse the characters, uppercasing gives a different result;
- if you replace 'a' with 'e', uppercasing gives a different result.
And if you choose some other encoding than Latin-1, you might get different results again.
Sure. The bytes methods like .upper() etc are predicated on the assumption that your bytes represent ASCII text. If your bytes represent something else, then calling the .upper() method may not be meaningful or useful. In other words... if your bytes string came from an ASCII text file, it's probably safe to uppercase it. If your bytes string came from a JPEG, then uppercasing them will probably make a mess of the image, if not corrupt the file. So don't do that :-) Analogy: ints support the unary minus operator. But if your int represents a mass, then negating it isn't meaningful. There's no such thing as -5 kg. Should we conclude from this that the int type in Python doesn't represent a number, and that the support of numeric operators and methods is merely for backwards compatibility? I think not. The formal definition of a string is a sequence of symbols from an alphabet. That is precisely what bytes objects are: the alphabet in this case is the 8-bit numbers 0 to 255 inclusive, which for usefulness, convenience and backwards compatibility can be optionally interpreted as the 7-bit ASCII character set plus another 128 abstract "characters".
Sure. If your bytes don't represent text, then methods like upper() probably won't do anything meaningful. It's still a string though.
I'm not sure the onus is on me to prove this. "Status quo wins a stalemate." And surely the onus is on those proposing the new syntax to demonstrate that it will be fine to use string delimiters as function calls. You could make a good start by finding other languages, reasonably conventional languages with syntax based on the Algol or C tradition, that use quotes '' or "" to return arbitrary types. Even languages with unconventional syntax like Forth or APL would be a good start. Maybe I'm wrong. Maybe quotation marks are widely used for purposes other than delimiting strings, and I'm just too ignorant of other languages to know it. Maybe Python is in the minority here. Anyway, the bottom line is this: I have no objection to using prefixed quotes to represent Unicode strings, or byte strings, or Andrew's hypothetical UTF-16 strings, or EBCDIC strings, or TRON strings. https://en.wikipedia.org/wiki/TRON_(encoding) But I think that any API that would allow z"..." to represent (let's say) a socket, or a float, or a HTTP_Server instance, or a list, would be a deeply flawed API.
Sure. As a convenience, the re module has functions which accept regular expression patterns as well as compiled regular expression objects.
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string",
That question is ambiguous. Are you asking about regular expression patterns, or regular expression objects? Regular expression *patterns* are clearly strings:

    pattern = r'...'

We type them with string delimiters; if you call type(pattern) it will return str; you can slice the pattern or uppercase it. Regular expression *objects* are just as clearly not strings:

    rx = re.compile(pattern)

You can't slice them, they aren't sequences of symbols, they have attributes like rx.flags which have no meaning in a string, they lack string methods like upper, and those methods they do have operate very differently from their equivalent string methods:

    pattern.find("X")  # search pattern for "X"
    rx.search("X")     # search "X" for pattern

Regex objects are far more than just the regex pattern.
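To see the contrast in a REPL (a sketch; re.Pattern is the modern name for what older Pythons called SRE_Pattern):

    import re

    pattern = r'[a-z]+'        # a plain str
    rx = re.compile(pattern)   # a compiled pattern object

    print(type(pattern))         # <class 'str'>
    print(pattern.upper())       # '[A-Z]+' -- string methods apply
    print(rx.flags)              # a regex-specific attribute
    print(hasattr(rx, 'upper'))  # False -- no string methods at all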
yet the byte-encoded form of an integer somehow is?
If your bytes represent an integer, then uppercasing them isn't meaningful. If your bytes represent ASCII text then uppercasing them may be meaningful.
In general, you can't uppercase regex patterns without radically changing the meaning of them. Consider r'\d' and r'\D'.
Because a byte-string matches the definition of strings, while compiled regex objects do not. -- Steven

On Mon, Sep 2, 2019 at 9:56 PM Steven D'Aprano <steve@pearwood.info> wrote:
Okay, so "string" and "text" are completely different concepts. Hold that thought.
No, they're not decoded. What happens is an *assumption* that certain bytes represent uppercaseable characters, and others do not. I specifically chose my example such that the corresponding code points both represented letters, and that the uppercased versions of each land inside the first 256 Unicode codepoints; yet uppercasing the bytestring changes one and not the other. Is it uppercasing the number 0x61 to create the number 0x41? No, it's assuming that it means "a" and uppercasing it to "A".
I specifically said a *list* of integers. Like what you'd get if you call list() on a bytestring. There's nothing in the formal definition you gave that precludes this from being considered a string, yet it is somehow, by your own words, fundamentally different.
Actually it is, because YOU are the one who said that quoted strings should be restricted to "string-like" things. Would a Path literal be sufficiently string-like to be blessed with double quotes? A regex literal? An IP header, represented as a bytestring? What's a string and what's not? Why are you trying to draw a line?
I gave an example wherein a list/array is represented as ";foo;bar;quux" - does that count? (VX-REXX, if you're curious.)
What if it represents a "connectable endpoint"? Is that a string? It'd be kinda like a pathlib.Path but with a bit more flexibility, allowing it to define a variety of information including the method of connection and perhaps some credentials. IOW a URI.
Exactly. To the re module, strings and compiled regexes are interchangeable.
Both at once. We're discussing the possibility of a "regex literal" concept that may or may not use double quotes. To most human beings, a regular expression IS a text string. Is a compiled regex allowed to have a literal form that uses double quotes, based on your definition of "string-like"? YOU are the one who is trying to draw a line in the sand here.
Right, but even if they represent an integer, you're fine with them using double quotes. Or am I mistaken here, and you would prefer to see it represented as bytes((0xe7, 0x61)) ?
And [0xe7, 0x61] also matches the definition of a string. ChrisA

If you strongly believe that if something looks like a string it ought to quack like a string too, then we can consider two potential remedies:

1. Change the delimiter, for example use curly braces: `re{abc}`. This would still be parseable, since currently an identifier cannot be followed by a set or a dict. (The forward-slash, on the other hand, would be ambiguous.)

2. We can also leave the quotation marks as delimiters. Once this feature is implemented, the IDEs will update their parsers and will emit a token of "user-defined literal" type. Simply setting the color for this token to something different from your preferred color for strings will make it visually clear that those tokens aren't strings. Hence, no possibility for confusion.

On Mon, 2 Sep 2019 at 07:04, Pasha Stetsenko <stpasha@gmail.com> wrote:
Just to add my 2 cents: there are always two sides in each language proposal: more flexibility/usability, and more language complexity. These need to be compared, and the comparison is hard because it is often subjective. FWIW, I think in this case the added complexity outweighs the benefits. I think only the very widely used literals (like numbers) deserve their own syntax. For everything else it is fine to have a few extra keystrokes. -- Ivan

On 31/08/2019 12:31, Chris Angelico wrote:
We call it a string, but a bytes object has as much in common with bytearray and with a list of integers as it does with a text string.
You say that as if text strings aren't sequences of bytes. Complicated and restricted sequences, I grant you, but no more so than a packet for a given network protocol. -- Rhodri James *-* Kynesim Ltd

On Wed, Sep 4, 2019 at 12:43 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
Is an integer also a sequence of bytes? A float? A list? At some level, everything's just stored as bytes in memory, but since there are many possible representations of the same information, it's best not to say that a character "is" a byte, but that it "can be stored in" some number of bytes. In Python, subscripting a text string gives you another text string. Subscripting a list of integers gives you an integer. Subscripting a bytearray gives you an integer. And (as of Python 3.0) subscripting a bytestring also gives you an integer. Whether that's right or wrong (maybe subscripting a bytestring should have been defined as yielding a length-1 bytestring), subscripting a text string does not give an integer, and subscripting a bytestring does not give a character. ChrisA

On Sep 3, 2019, at 06:17, Rhodri James <rhodri@kynesim.co.uk> wrote:
Forget about bytes vs. octets; this still isn’t a useful perspective. A character is a grapheme cluster, a sequence of one or more code points. A code point is an integer between 0 and 1.1M. A string is a flattened sequence of grapheme clusters—that is, a sequence of code points. (Python ignores the cluster part, pretending code points are characters, at the cost of requiring every application to handle normalization manually. Which is normally a good tradeoff, but it does mean that you can’t even say whether two sequences of code points are the same string without calling a function.)

Meanwhile, there are multiple ways to store those code points as bytes. Python does whatever it wants under the covers, hiding it from the user. Obviously there is _some_ array of bytes somewhere in memory that represents the characters of the string in some way (I say “obviously”, but that isn’t always true in Swift, and isn’t even frequently true in Haskell…), but you don’t have access to that. If you want a sequence of bytes, you have to ask for a sequence in some specific representation, like UTF-8 or UTF-16-BE or Shift-JIS, which it creates for you on the fly (albeit cached in a few special cases). So, from your system programmer’s perspective, in what useful sense is a character, or a string, a sequence of bytes?

And this is all still ignoring the fact that in Python, all values are “boxed” in an opaque structure that you can’t access from within the language, and even from the C API of CPython the box structure isn’t part of the API, so even something simpler like, say, an int isn’t usefully a sequence of 30-bit digits from the system programmer’s perspective; it’s an opaque handle that you can pass to functions to _obtain_ a sequence of 30-bit digits. (In the case of strings, you have to first pass the opaque handle to one function to see what format to ask for, then pass it to another to obtain a sequence of 1-, 2-, or 4-byte integers representing the code points in native-endian ASCII, UCS2, or UCS4. Which normally you don’t do—you ask for a UTF-8 string or a UTF-32 string that may get constructed on the fly—but if you really do want the actual storage, this is the way to get it.)

And most of this is not peculiar to Python. In Swift, a string is a sequence of grapheme clusters. In Java, it’s a sequence of UTF-16 code units. In Go, it’s a sequence of UTF-8 code units. In Haskell, it’s a lazy linked list of code points. And so on. In some of those cases, a character does happen to be represented as a string of bytes within a larger representation, but even when it is, that still doesn’t mean you can usefully access it that way.

Of course a text file on disk is a sequence of bytes, and (if you know the encoding and normalization) you could operate directly on those. But you don’t; you pass the byte strings to a function that decodes them (and then sometimes to a second function that normalizes them into a canonical form) and then use your language’s string functions on the result. In fact, you probably don’t even do that; you let the file object buffer the byte strings however it wants to and just hand you decoded text objects, so you don’t even know which byte substrings exist in memory at any given time. (Languages with powerful optimizers or macro systems like Haskell or Rust might actually do that by translating all your string-function calls into calls directly on the stream of bytes, but from your perspective that’s entirely under the covers, and you’re doing the same thing you do in Python.)

On Thu, Aug 29, 2019 at 09:58:35PM +1000, Steven D'Aprano wrote:
Since Python is limited to ASCII syntax, we only have a small number of symbols suitable for delimiters. With such a small number available,
Oops, I had an interrupted thought there. With such a small number available, there is bound to be some duplication, but it tends to be fairly consistent across the majority of conventional programming languages. -- Steven

On 28/08/2019 23:01, stpasha@gmail.com wrote:
I don't think it's blasphemous. I think it's misleading, and that's far worse.
Pace Stephen's point that this is not in fact how datetime works, this has the major advantage of being readable. My thought processes on coming across that in code would go something like; "OK, we have a function call. Judging from the name its something to do with dates and times, so the result is going to be some date/time thing. Oh, I remember seeing "from datetime import datetime" at the top, so I know where to look it up if it becomes important. Fine. Moving on."
Here my thoughts would be more like; "OK, this is some kind of special string. I wonder what "dt" means. I wonder where I look it up. The string looks kind of like a date in ISO order, bear that in mind. Maybe "dt" is "date/time"." Followed a few lines later by "wait, why are we calling methods on that string that don't look like string methods? WTF? Maybe "dt" means "delirium tremens". Abort! Abort!" Obviously I've played this up a bit, but the point remains that even if I do work out that "dt" is actually a secret function call, I have to go back and fix my understanding of the code that I've already read. This significantly increases the chance that my understanding will be wrong. This is a Bad Thing.
If all that dt"string" gives us is a run-time call to dt("string"), it's a complete non-starter as far as I'm concerned. It's adding confusion for no real gain. However, it sounds like what you really want is something I've often really wanted to -- a way to get the compiler to pre-create "constant" objects for me. The trouble is that after thinking about it for a bit, it almost always turns out that I don't want that after all. Suppose that we did have some funky mechanism to get the compiler to create objects at compile time so we don't have the run-time creation cost to contend with. For the sake of argument, let's make it start_date = $datetime(2019,8,28) (I know this syntax would be laughed out of court, but like I said, for the sake of argument...) So we use "start_date" somewhere, and mutate it because the start date for some purpose was different. Then we use it somewhere else, and it's not the start date we thought it was. This is essentially the mutable default argument gotcha, just writ globally. The obvious cure for that would be to have our compile-time created objects be immutable. Leaving aside questions like how we do that, and whether contained containers are immutable, and so on, we still have the problem that we don't actually want an immutable object most of the time. I find that almost invariably I need to use the constant as a starting point, but tweak it somehow. Perhaps like in the example above, the start date is different for a particular purpose. In that case I need to copy the immutable object to a mutable version, so I have all the object creation shenanigans to go through anyway, and that saving I thought I had has gone away. I'm afraid these custom string prefixes won't achieve what I think you want to achieve, and they will make code less readable in the process.
So how do you distinguish the custom prefix "br" from a raw byte string? Existing syntax allows prefixes to stack, so there's inherent ambiguity in multi-character prefixes. -- Rhodri James *-* Kynesim Ltd

On Aug 29, 2019, at 06:40, Rhodri James <rhodri@kynesim.co.uk> wrote:
However, it sounds like what you really want is something I've often really wanted -- a way to get the compiler to pre-create "constant" objects for me.
People often say they want this, but does anyone actually ever have a good reason for it? I was taken in by the lure of this idea myself—all those wasted frozenset constructor calls! (This was before the peephole optimizer understood frozensets.) Of course I hadn’t even bothered to construct the frozensets from tuples instead of lists, which should have been a hint that I was in premature optimization mode, and should have been the first thing I tried before going off the deep end. But hacking bytecode is fun, so I sat down and wrote a bytecode processor that let me replace any expression with a LOAD_CONST, much as the builtin optimizer does for things like simple arithmetic. It’s easy to hook it up to a decorator to call on a function, or to an import hook to call at module compile time. And then, finally, it’s time to benchmark and discover that it makes no difference. Stripping things down to something trivial enough to be tested… aha, I really was saving 13us, it’s just that 13us is not measurable in code that takes seconds to run.

Maybe someone has a real use case where it matters. But I’ve never seen one. I tried to find good nails for my shiny new hammer and never found one, and eventually just stopped maintaining it. And then I revived it when I wrote my decimal literal hack (the predecessor to the more general user literal hack I linked earlier in the thread) back during the 2015 iteration of this discussion, but again couldn’t come up with a plausible example where those 2.3d pseudo-literals were measurably affecting performance and needed constifying; I don’t think I even bothered mentioning it in that thread.

Also, even if you find a problem, it‘s almost always easy to work around today. If the constant is constructed inside a loop, just manually lift it out of the loop. If it’s in a function body, this is effectively the same problem as global or builtin lookups being too slow inside a function body, and can be solved the same way, with a keyword parameter with a default value. And if the Python community thinks that _sin=sin is good enough for the uncommon problem of lookups significantly affecting performance, surely _vals=frozenset((1,2,3)) is also good enough for that far more uncommon problem, and therefore _limit=1e1000dec would also be good enough for the new but probably even more uncommon one. (Also, notice that the param default can be used with mutable values; it’s just up to you to make sure you don’t accidentally mutate them; an invisible compiler optimization couldn’t do that, at least not without something like Victor Stinner’s FAT guards.)

For what it’s worth, I actually found my @constify decorator more readable than the param default, especially for global functions—but not nearly enough so that it’s worth using a hacky, CPython-specific module that I have to maintain across Python versions (and byteplay to byteplay3 to bytecode) and that nobody else is using. Or to propose for a builtin (or stdlib but magic) feature.

What this all comes down to is that, despite my initial impression, I really don’t care whether Python thinks 1.23d is a constant value or not; I only care whether the human reader thinks it is one. Think about it this way: do you know off the top of your head whether (1, (2,3)) gets optimized to a const the same way (1,2) does in CPython? Has it ever occurred to you to check before I asked? And this is actually something that changed relatively recently.
Why would someone who doesn’t even think about when tuples are constified want to talk about how to force Python to constify other types? Because even years of Python experience hasn’t cured us of premature-optimization-itis.
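For reference, the param-default workaround mentioned above looks like this in practice (a minimal sketch; is_valid and the values are made up):

    # The default is evaluated once, when the function is defined,
    # so the frozenset is not rebuilt on every call.
    def is_valid(x, _vals=frozenset((1, 2, 3))):
        return x in _vals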

Rhodri James wrote:
Suppose that we did have some funky mechanism to get the compiler to create objects at compile time
It doesn't necessarily have to be at compile time. It can be at run time, as long as it only happens once.
I don't think this is as much of a problem as it seems. We often assign things to globals that are intended to be treated as constants, with the understanding that it's our responsibility to refrain from mutating them. -- Greg

Unless there is some significant difference between the two, what does this proposal give us?
The difference between `x'...'` and `x('...')`, other than visual noise, is the following:

- The first "x" is in its own namespace of string prefixes. The second "x" exists in the global namespace of all other symbols.

- Python style discourages too-short variable names, especially in libraries, because they have an increased chance of clashing with other symbols, and generally may be hard to understand. At the same time, short names for string prefixes could be perfectly fine: there won't be too many of them anyway. The standard prefixes "b", "r", "u", "f" are all short, and nobody gets confused about them.

- Barrier of entry. Today you can write `from re import compile as x` and then write `x('...')` to denote a regular expression (if you don't mind having `x` as a global variable). But this is not the way people usually write code. People write the code the way they are taught from examples, and the examples don't speak about regular expression objects. The examples only show regular-expressions-as-strings, so many Python users don't even realize that regular expressions can be objects. Now, if the string prefixes were available, library authors would think "Do we want to export such functionality for the benefit of our users?" And if they answer yes, then they'll showcase this in the documentation and examples, and the user will see that their code has become cleaner and more understandable.

On Tue, Aug 27, 2019 at 05:13:41PM -0000, stpasha@gmail.com wrote:
Ouch! That's adding a lot of additional complexity to the language. Python's scoping rules are usually described as LEGB:

- Local
- Enclosing (non-local)
- Global (module)
- Builtins

but that's an over-simplification, dating back to something like Python 1.5 days. Python scope also includes:

- class bodies, which can be the local scope (but they don't work quite the same as function locals);
- parts of the body of comprehensions, which behave as if they were a separate scope.

This proposal adds a completely separate, parallel set of scoping rules for these string prefixes. How many layers in this parallel scope? The simplest design is to have a single, interpreter-wide namespace for prefixes. Then we will have name clashes, especially since you seem to want to encourage single character prefixes like "v" (verbose, version) or "d" (date, datetime, decimal). Worse, defining a new prefix will affect all other modules using the same prefix. So we need a more complex parallel scope. How much more complex?

* if I define a string prefix inside a comprehension, function or class body, will that apply across the entire module or just inside that comp/func/class?
* how do nested functions interact with prefixes?
* do we need a set of parallel keywords equivalent to global and nonlocal for prefixes?

If different modules have different registries, then not only do we need to build a parallel set of scoping rules for prefixes into the interpreter, but we need a parallel way to import them from other modules, otherwise they can't be re-used. Does "from module import x" import the regular object x from the module namespace, or the prefix x from the prefix-namespace? So it seems we'll need a parallel import system as well. All this adds more complexity to the language, more things to be coded and tested and documented, more for users to learn, more for other implementations to re-implement, and the benefit is marginal: the ability to drop parentheses from some but not all function calls.

Now consider another problem: introspection, or the lack thereof. One of the weaknesses of string prefixes is that it's hard to get help for them. In the REPL, we can easily get help on any class or function:

    help(function)

and that's really, really great. We can use the inspect module or dir() to introspect functions, classes and instances, but we can't do the same for string prefixes. What's the difference between r-strings and u-strings? help() is no help (pun intended), since help sees only the string instance, not the syntax you used to create it. All of these will give precisely the same output:

    help(str())
    help('')
    help(u'')
    help(r"")

etc. This is a real weakness of the prefix system, and will apply equally to custom prefixes. It is *super easy* to introspect a class or function like Version; it is *really hard* to do the same for a prefix.

You want this separate namespace for prefixes so that you can have a v prefix without "polluting" the module namespace with a v function (or class). But v doesn't write itself! You still have to write a function or class, although you might give it a better name and then register it with the single letter prefix:

    @register_prefix('v')
    class Version:
        ...

(say). This still leaves Version lying around in your global namespace, unless you explicitly delete it:

    del Version

but you probably won't want to do that, since Version will probably be useful for those who want to create Version objects from expressions or variables, not just string literals.
So the "pollution" isn't really pollution at all, at least not if you use reasonable names, and the main justification for parallel namespaces seems much weaker. Let me put it another way: parallel namespaces is not a feature of this proposal. It is a point against it.
That's an interesting position for the proponent of a new feature to take. "Don't worry about this being confusing, because hardly anyone will use it."
The standard prefixes "b", "r", "u", "f" are all short, and nobody gets confused about them.
Plenty of people get confused about raw strings. There's only four, plus uppercase and combinations, and they are standard across the entire language. If there were dozens of them, coming from lots of different modules and third-party libraries, with lots of conflicts ('v' for version in foolib, but 'v' for verbose in barlib), the situation would be very different. We can't extrapolate from four built-in prefixes being manageable to concluding that dozens of clashing user-defined prefixes will be too.
I doubt that is true. "from module import foo as bar" is a standard, commonly used Python language feature: https://stackoverflow.com/questions/22245711/from-import-or-import-as-for-mo... in particular this answer here: https://stackoverflow.com/a/29010729 Besides, we don't design the language for the least knowledgable, most ignorant, copy-and-paste coders.
That's simply wrong. The *very first* example of a regular expression here: https://scotch.io/tutorials/an-introduction-to-regex-in-python uses the compile function. More examples talking about regex objects: https://docs.python.org/3/library/re.html#re.compile https://pymotw.com/2/re/#compiling-expressions https://docs.python.org/3/howto/regex.html#compiling-regular-expressions https://stackoverflow.com/questions/20386207/what-does-pythons-re-compile-do These weren't hard to find. You don't have to dig deep into obscure parts of the WWW to find people talking about regex objects. I think you underestimate the knowledge of the average Python programmer. -- Steven

On Wed, 28 Aug 2019 at 13:15, Anders Hovmöller <boxed@killingar.net> wrote:
On 28 Aug 2019, at 14:09, Piotr Duda <duda.piotr@gmail.com> wrote:
The only sane proposal that I can see (assuming that no-one is proposing to drop the principle that Python shouldn't have mutable syntax) is to modify the definition

    stringliteral ::= [stringprefix](shortstring | longstring)
    stringprefix ::= "r" | "u" | "R" | "U" | "f" | "F"
                     | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"

to expand the definition of <stringprefix> to allow any identifier-like token (precise details to be confirmed). Then, if it's one of the values enumerated above (you'd also need some provision for special-casing bytes literals, which are in a different syntax rule), work as at present. For any other identifier-like token, you'd define TOKEN(shortstring|longstring) as being equivalent to TOKEN(r(shortstring|longstring)). I.e., treat the string as a raw string, and TOKEN as a function name, and compile to a function call of the named function with the raw string as argument.

That's a well-defined proposal, although whether it's what people want is a different question. Potential issues:

1. It makes a whole class of typos that are currently syntax errors into runtime errors - fru"foo\and {bar}" is now a function call rather than a syntax error (it was never a raw Unicode f-string, even though someone might think it was and be glad to be corrected by the current syntax error...)

2. It raises the question of whether people want raw-string semantics - whilst it's the most flexible option, it does mean that literals wanting to allow escape sequences would need to implement it themselves.

3. It does nothing for the edge case that a trailing \ isn't allowed - p"C:\" wouldn't be a valid Path literal.

There are of course other possible proposals, but we'd need more than broad statements to make sense of them (specifically, either "exactly *what* new syntax are you suggesting we allow?", or "how are you proposing to allow users to alter Python syntax on demand?")

Paul

Right, having a parallel set of scopes sounds like WAY too much work. Which is why I didn't want to start my proposal with a particular implementation -- I simply don't have enough experience for that. Still, we can brainstorm possible approaches, and come up with something that is feasible. For example, how about this: prefixes/suffixes "live" in the same local scope as normal variables; however, in order to separate them from the normal variables, their names get mangled into something that is not a valid variable name. Thus,

    re'a|b|c'  --becomes-->  (locals()["re~"])("a|b|c")
    2.3f       --becomes-->  (locals()["~f"])("2.3")

Assuming that most people don't create variable names that start or end with `~`, the impact on existing code should be minimal (we could use an even more rare character there, say `\0`). The current string prefixes would be special-cased by the compiler to behave exactly as they behave right now. Also, a prefix such as `czt""` is always just a single prefix; there is no need to treat it as 3 single-char prefixes.
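At module level, where locals() and globals() are the same mapping, the two halves of this sketch could look like the following (the mangled-key registration is illustrative, mirroring the desugaring above):

    import re

    # hypothetical registration of the 're' prefix under its mangled key
    globals()["re~"] = re.compile

    # re'a|b|c' would then desugar to:
    rx = globals()["re~"]("a|b|c")
    print(rx.match("b"))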
Well, it's just another problem to overcome. I know in Python one can get help on keywords and even operators by saying `help('class')` or `help('+')`. We could extend this to allow `help('foo""')` to give the help for the prefix "foo". Specifically, if the argument to `help` is a string, and that string is not a registered topic, then check whether the string is of the form `<id>""` or `<id>''` or `""<id>` or `''<id>`, and invoke the help for the corresponding prefix / suffix. This will even solve the problem with the help for existing affixes `b""`, `f""`, `0j`, etc.
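A rough sketch of that dispatch logic, with stand-ins for help's topic table and the affix lookup (REGISTERED_TOPICS and help_for_affix are purely illustrative, not real APIs):

    REGISTERED_TOPICS = {'class', '+'}   # stand-in for help()'s topic table

    def help_for_affix(name):
        print(f"Help on string affix {name!r} (hypothetical)")

    def extended_help(topic):
        # mirror help('class') / help('+'): fall back to affix lookup
        if isinstance(topic, str) and topic not in REGISTERED_TOPICS:
            for q in ('""', "''"):
                if topic.endswith(q) and topic[:-2].isidentifier():
                    return help_for_affix(topic[:-2])   # e.g. help('foo""')
                if topic.startswith(q) and topic[2:].isidentifier():
                    return help_for_affix(topic[2:])    # e.g. help('""km')
        return help(topic)

    extended_help('foo""')   # -> Help on string affix 'foo' (hypothetical)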
For the Version class you're right. But use cases vary. In the thread from 2013 where this issue was discussed, many people wanted the `sql"..."` literal to be available as a literal and nothing else. Presumably, if you wanted to construct a query dynamically there could be a separate function `sql_unsafe()` taking a simple string as an argument.
The pollution argument is that, on one hand, we want to use short names such as "v" for prefixes/suffixes, while on the other hand we don't want them to be "regular" variable names because of the possibilities of name clashes. It's perfectly fine to have a short character for a prefix and at the same time a longer name for a function. It's like we have the `unicode()` function and `u"..."` prefix. It's like most command line utilities offer short single-character options and longer full-name options.
I'm sorry if I expressed myself ambiguously. What I meant to say is that the set of different prefixes within a single program will likely be small.
We can't extrapolate from four built-in prefixes being manageable to concluding that dozens of clashing user-defined prefixes will be too.
That's a valid point. Though we can't extrapolate that they will be unmanageable either. There's just not enough data. But we could look at other languages that have more suffixes. Say, C or C++. Ultimately, this can be a self-regulating feature: if having too many suffixes/prefixes makes one's code unreadable, then simply stop using them and go back to regular function calls.

On Aug 28, 2019, at 12:45, stpasha@gmail.com wrote:
Since this specific use has come up a few times—and a similar feature in other languages—can you summarize exactly what people want from this one? IIRC, DB-API 2.0 doesn’t have any notion of compiled statements, or bound statements, just this:

    Connection.execute(statement: str, *args) -> Cursor

So the only thing I can think of is that sql"…" is a shortcut for that. Maybe:

    curs = sql"SELECT lastname FROM person WHERE firstname={firstname}"

… which would do the equivalent of:

    curs = conn.execute("SELECT lastname FROM person WHERE firstname=?", firstname)

… except that it knows whether your particular database library uses ? or %s or whatever for SQL params. I can see how that could be useful, but I’m not sure how it could be easily implemented. First, it has to know where to find your connection object. Maybe the library that exposes the prefix requires you to put the connection in a global (or threadlocal or contextvar) with a specific name, or manages a pool of connections that it stores in its own module or something? But that seems simultaneously too magical and too restrictive. And then it has to do f-string-style evaluation of the brace contents, in your scope, to get the args to pass along. Which I’d assume means that prefix handlers need to get passed locals and globals, so the sql prefix handler can eval each braced expression? (Even that wouldn’t be as good as f-strings, but it might be good enough here?)

Even with all that, I‘m pretty sure I’d never use it. I’m often willing to bring magic into my database API, but only if I get a lot more magic (an expression-builder library, a full-blown ORM, that thing that I forget the name of that translates generators into SQL queries quasi-LINQ-style, etc.). But maybe there are lots of people who do want just this much magic and no more. Is this roughly what people are asking for? If so, is that eval magic needed for any other examples you’ve seen besides sql? It’s definitely not needed for regexes, paths, really-raw strings, or any of the numeric examples, but if it is needed for more than one good example, it’s probably still worth looking at whether it’s feasible.

My understanding is that for a sql prefix the most valuable part is to be able to know that it was created from a literal. No other magic, definitely not auto-executing. Then it would be legal to write

    result = conn.execute(sql"SELECT * FROM people WHERE id=?", user_id)

but not

    result = conn.execute(f"SELECT * FROM people WHERE id={user_id}")

In order to achieve this, the `execute()` method only has to look at the type of its argument, and throw an error if it's a plain string. Perhaps with some more imagination we can make

    result = conn.execute(sql"SELECT * FROM people WHERE id={user_id}")

work too, but in this case the `sql"..."` token would only create an `UnpreparedStatement` object, which expects a variable named "user_id", and then the `conn.execute()` method would pass locals()/globals() into the `.prepare()` method of that statement, binding those values to the placeholders. Crucially, the `.prepare()` method shouldn't modify the object, but return a new PreparedStatement, which then gets executed by `conn.execute()`.
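A minimal sketch of that type check, assuming the prefix produced a dedicated marker type (SqlLiteral and this execute() are illustrative, not part of DB-API):

    class SqlLiteral(str):
        """Hypothetical type that a sql"..." literal would produce."""
        __slots__ = ()

    def execute(statement, *args):
        # reject plain str -- which is also what an f-string produces
        if not isinstance(statement, SqlLiteral):
            raise TypeError('execute() requires a sql"..." literal')
        print("executing:", statement, args)

    execute(SqlLiteral("SELECT * FROM people WHERE id=?"), 42)   # ok
    # execute(f"SELECT * FROM people WHERE id={42}")             # TypeError

Subclassing str keeps the literal usable anywhere a statement string is expected, while concatenation or formatting quietly demotes the result back to plain str, which is exactly the "created from a literal" property being asked for.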

On Fri, Aug 30, 2019 at 3:51 AM Pasha Stetsenko <stpasha@gmail.com> wrote:
There's no such thing, though, any more than there's such a thing as a "raw string". There are only two types of string in Python - text and bytes. You can't behave differently based on whether you were given a triple-quoted, raw, or other string literal.
One way to handle this particular case would be to do it as a variant of f-string that doesn't join its arguments, but passes the list to some other function. Just replace the final BUILD_STRING step with BUILD_LIST, then call the function. There'd need to be some way to recognize which sections were in the literal and which came from interpolations (one option is to simply include empty strings where necessary such that it always starts with a literal and then alternates), but otherwise, the "sql" manager could do all the escaping it wants. However, this wouldn't be enough to truly parameterize a query; it would only do escaping into the string itself. Another option would be to have a single variant of f-string that, instead of creating a string, creates a "string with formatted values". That would then be a single object that can be passed around as normal, and if conn.execute() received such a string, it could do the proper parameterization. Not sure either of them would be worth the hassle, though. ChrisA
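A sketch of that second option -- a "string with formatted values" that keeps the literal chunks apart from the interpolations (TemplateString is made up here; the idea is roughly the shape PEP 501 proposed):

    class TemplateString:
        """Literal chunks interleaved with interpolated values;
        chunks always start and end the sequence."""
        def __init__(self, strings, values):
            assert len(strings) == len(values) + 1
            self.strings = list(strings)
            self.values = list(values)

    # what sql"SELECT name FROM people WHERE id={user_id}" might build:
    user_id = 42
    stmt = TemplateString(["SELECT name FROM people WHERE id=", ""], [user_id])

    # a driver could then emit placeholders and pass the values separately
    query = "?".join(stmt.strings)
    print(query, stmt.values)   # SELECT name FROM people WHERE id=? [42]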

On 8/29/19 11:14 AM, Chris Angelico wrote:
But isn't the idea of the sql" (or other) prefix was that the 'plain string' was put through a special function that processes it, and that function could return an object of some other type, so it could detect the difference.
-- Richard Damon

On Thu, Aug 29, 2019 at 08:17:39PM +1200, Greg Ewing wrote:
I don't think that stpasha@gmail.com means that the user literally assigns to locals() themselves. I read his proposal as having the compiler automatically mangle the names in some way, similar to name mangling inside classes. The transformation from prefix re to mangled name 're~' is easy, the compiler could surely handle that, but I'm not sure how the other side of it will work. How does one register that re.compile (say) is to be aliased as the prefix 're'? I'm fairly sure we don't want to allow ~ in identifiers:

    # not this
    re~ = re.compile

I'm still not convinced that we need this parallel namespace idea, even in a watered down version as name-mangling. Why not just have the prefix X call name X for any valid name X (apart from the builtin prefixes)? I still am not convinced that is a good idea, but at least the complexity is significantly reduced.

P.S. stpasha@gmail.com if you're reading this, it would be nice if you signed your emails with a name, so we don't have to refer to you by your email address or as "the OP".

-- Steven

Steven D'Aprano wrote:
Yes, but at some point you have to define a function to handle your string prefix. If it's at the module level then it's no problem, because you can do something like

    globals()["~f"] = lambda: ...

But you can't do that for locals. So mangling to something unspellable would effectively preclude having string prefixes local to a function.

-- Greg

On Aug 29, 2019, at 16:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
What happens if you do this, and then include "~f" in __all__, and then import * from that module? I personally would rather have my prefixes or suffixes available in every module that imports them, without needing to manually register them each time. Not a huge deal, and if nobody else agrees, fine. But if I could __all__ it, I could get what I want anyway. :)

How does one get a value into locals()["re~"]?
You're right, I didn't think about that. I agree with Steven's interpretation that the user is not expected to modify locals herself; still, the immutable nature of locals presents a considerable challenge. So I'm thinking that perhaps we could change that to `globals()["re~"]`, where globals are in fact mutable and can even be modified by the user. This would mean that affixes can only be declared at module level, similar to how `from library import *` is not allowed in a function either. This is probably a saner approach anyway -- if affixes could mean different things in different functions, that could be quite confusing...
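A sketch of how that could look (the "re~" mangling is only the proposal under discussion; nothing in Python does this today):

    import re

    # Registration: "re~" is not a valid identifier, so it can only be set indirectly.
    globals()["re~"] = re.compile

    # The compiler would then translate re'[a-z]+' into roughly:
    pattern = globals()["re~"]('[a-z]+')
    assert pattern.match('abc')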

On Tue, Aug 27, 2019 at 08:22:22AM -0000, stpasha@gmail.com wrote:
The string (or number) prefixes add new power to the language
I don't think they do. It's just syntactic sugar for a function call. There's nothing that czt'...' will do that czt('...') can't already do. If you have a proposal that allows custom string prefixes to do something that a function call cannot do, I've missed it.
That a certain feature can potentially be misused shouldn't deter us from adding it, if the benefits are significant.
Very true, but so far I see nothing in this proposal that suggests that the benefits are more significant than avoiding having to type a pair of parentheses. Every benefit I have seen applies equally to the function call version, but without the added complexity to the language of allowing custom string prefixes.
And the benefits in terms of readability can be significant.
I don't think they will be. I think they will encourage cryptic one-character function names disguised as prefixes:

    v'...'   instead of Version(...)
    x'...'   instead of re.compile(...)

to take two examples from your proposal. At least this is somewhat better:

    sql'...'

but that leaves the ambiguity of not knowing whether that's a chained function call s(q(l(...))) or a single sql(...). I believe it will also encourage inefficient and cryptic string parsing instead of the clearer use of separate arguments. Your earlier example:

    frac'123/4567'

The Fraction constructor already accepts such strings, and it is occasionally handy for parsing user input. But using it to parse string literals gives slow, inefficient code for little or no benefit:

    [steve@ando cpython]$ ./python -m timeit -s 'from fractions import Fraction' 'Fraction(123, 4567)'
    20000 loops, best of 5: 18.9 usec per loop
    [steve@ando cpython]$ ./python -m timeit -s 'from fractions import Fraction' 'Fraction("123/4567")'
    5000 loops, best of 5: 52.9 usec per loop

Unless you can suggest a way to parse arbitrary strings in arbitrary ways at compile-time, these custom string prefixes are probably doomed to be slow and inefficient. The best thing I can say about this is that at least frac'123/4567' would probably be easy to understand, since the / syntax for fractions is familiar to most people from school. But the same cannot be said for other custom prefixes:

    cf'[0; 37, 7, 1, 2, 5]'

Perhaps you can guess the meaning of that cf-string. Perhaps you can't. A hint might point you in the right direction:

    assert cf'[0; 37, 7, 1, 2, 5]' == Fraction(123, 4567)

(By the way, the semi-colon is meaningful and not a typo.) To the degree that custom string prefixes will encourage cryptic one- and two-letter names, I think that this will hurt readability and clarity of code. But if the reader has the domain knowledge to recognise what "cf" stands for, this may be no worse than (say) "re" (regular expression). In conventional code, we might call the cf function like this:

    cf([0, 37, 7, 1, 2, 5])  # Single list argument.
    cf(0, 37, 7, 1, 2, 5)    # *args version.

Either way works for me. But it is your argument that replacing the parentheses with quote marks is "more readable":

    cf([0, 37, 7, 1, 2, 5])
    cf'[0; 37, 7, 1, 2, 5]'

not just a little bit more readable, but enough to make up for the inefficiency of having to write your own parser, deal with errors, compile a string literal, parse it at runtime, and only then call the actual cf constructor and return a cf object. Even if I accepted your claim that swapping (...) for '...' was more readable, I am skeptical that the additional work and runtime inefficiency would be worth the supposed benefit. I don't wish to say that parsing strings to extract information is always an anti-pattern: http://cyrille.martraire.com/2010/01/the-string-obsession-anti-pattern/ after all, we often need to process data coming from config files or other user input, where we have no choice but to accept a string. But parsing string *literals* usually is an anti-pattern, especially when there is a trivial transformation from the string to the constructor arguments, e.g. 123/4567 --> Fraction(123, 4567). [...]
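For the curious, a cf function along the lines hinted at above might look like this (a sketch; the name and list-based signature are one choice among several):

    from fractions import Fraction

    def cf(coeffs):
        # Evaluate a continued fraction [a0; a1, a2, ...] from the inside out.
        result = Fraction(coeffs[-1])
        for a in reversed(coeffs[:-1]):
            result = a + 1 / result
        return result

    assert cf([0, 37, 7, 1, 2, 5]) == Fraction(123, 4567)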
Everything you say there applies to ordinary function call syntax too: Version('1.10a') can have methods, special behaviours, a type different from str, etc. Not one of those benefits comes from *custom string prefixes*. They all come from the use of a custom type. In fact, we can be more explicit and clear with the constructor:

    Version(major=1, minor=10, stage='a')

There is nothing magic about this v-string prefix. You still have to write a Version class with a version-string parser. The compiler can't help you, because it has no knowledge of the format of version strings. All the compiler can do is pass the string '1.10a' to the function v(). [...]
"There could be..." lots of things, but the onus is on you to prove that there actually *are* such benefits.
I answered that in my previous post. I would prefer an explicit, clear, self-documenting function call Version() over a terse, unclear syntax that looks like a string but isn't. I don't think that v'1.10a' is clearer or more readable than Version('1.10a'). It is *shorter*, but that's it. The bottom line is, so long as this proposal is for nothing more than mere syntactic sugar allowing you to drop the parentheses from certain function calls (those that take a single string argument), the benefit is tiny, and the added complexity and opportunity for abuse and confusion is large.
I can't help pandas' poor API, and I doubt that your proposal would have prevented it either.
Think about what you are saying about the sophisticated data scientists who are typical pandas users:

- they can write "import pandas"
- but not "import re" or "from re import compile as rx"
- they will be able to import your rx'...' string prefix from wherever it comes from (perhaps "from re import rx"?)
- and are capable of writing regular expressions using your custom rx'...' syntax
- but adding parentheses is beyond them: rx('...').

I cannot take this argument about sophisticated regex-users who are defeated by function call syntax seriously. -- Steven

On Aug 27, 2019, at 08:36, Steven D'Aprano <steve@pearwood.info> wrote:
But there are plenty of cases where parsing string literals is the current usual practice. Decimal is obvious, as well as most other non-native numeric types. Path objects even more so. Pandas users seem to always build their datetime objects out of YYYYMMDDTHHMMSS strings. And so on. So the status quo doesn't mean nobody parses string literals, it means people _explicitly_ parse string literals. And the proposed change doesn't mean more string literal parsing, it means making some of the existing, uneliminable uses less visually prominent and more readable. (And, relevant to the blog you linked, it seems to make it _less_ likely, not more, that you'd bind the string rather than the value to a name, or pass it around and parse it repeatedly, or the other bad practices they were talking about.) I'll admit there are some cases where I might sacrifice performance for convenience if we had this feature. For example, F1/3 (or 1/3F with suffixes) would have to mean at least Fraction(1) / 3, if not Fraction('1') / 3, or even that plus an extra LOAD_ATTR. That is clearly going to be more expensive than F(1, 3) meaning Fraction(1, 3), but I'd still do it at the REPL, and likely in real code as well. But I don't think that choice would make my code worse (because when setup costs matter, I _wouldn't_ make that choice), so I don't see that as a problem.

On 8/26/19 4:03 PM, stpasha@gmail.com wrote:
I have seen a lot of discussion on this, but a few points that I thought of haven't been brought up. One solution to all of these would be to do these as suffixes. Python currently has a number of valid string prefixes, and it might catch some people out when they want to use a combination that is currently a valid prefix. (It has been brought up that this converts an invalid prefix from an immediately diagnosable syntax error to a run-time error.) This also means that it becomes very hard for the language to add a new builtin prefix later, as that prefix might already have a user-defined meaning. A second issue is that currently some of the prefixes (like r) change how the string literal is parsed. This means that the existing prefixes are just a slightly special case of the general rules, but need to be treated very differently; or perhaps somehow the prefix needs to indicate what standard prefix to use to parse the string. Some of your examples could benefit by sometimes being able to use r'' and sometimes not, so being able to say both r'string're or 'string're could be useful. -- Richard Damon


Thanks, Andrew, for your feedback. I didn't even think about string **suffixes**, but clearly they can be implemented together with the prefixes for additional flexibility. And your idea that `<string literal> <suffix>` is conceptually no different than `<numeric literal> <suffix>` is absolutely insightful. Speaking of string suffixes, flags on regular expressions immediately come to mind. For example `rx"(abc)"ig` could create a regular expression that performs global case-insensitive search.
I don’t think you can fairly discuss this idea without getting at least a _little_ bit into the implementation details.
Right. So, the first question to answer is what the compiler should do when it sees a prefixed (or suffixed) string. That is, what byte-code should be emitted when the compiler sees `lambda: a"bcd"e`?

In one approach, we'd want this expression to be evaluated at compile time, similar to how f-strings work. However, how would the compiler know what prefix "a" means exactly? There has to be some kind of directive to tell the compiler that. For example, imagine the compiler sees near the top of the file

    #pragma from mymodule import a

It would then import the symbol `a` and call `a("bcd", suffix="e")`. This would return an AST tree that will be plugged in place of the original string. This solution allows maximum efficiency, but seems inflexible and deeply invasive.

Another approach would defer the construction of objects to run time. Though not as efficient, it would allow loading prefixes at run time. In this case `a"bcd"e` can be interpreted by the compiler as if it was a("bcd", suffix="e"), where the symbol `a` is to be looked up in the local/global scope. One thing I would rather want to avoid, though, is the pollution of the variable namespace. For example, I'd like to be able to use a variable `t = ...` without worrying about the string prefix `t"..."`. For this approach to work, we'd create a new code op, so that `a"bcd"e` would become

    0 LOAD_CONST        1 ('a', 'bcd', 'e')
    2 STR_RESOLVE_TAG   0

where `STR_RESOLVE_TAG` would effectively call the `__resolve_tag__()` special method. The method would search for `a` in the registry of known string tags, and then pass the tuple to the corresponding constructor. There will, of course, be a method to register new tags. Something like

    str.__register_tag__('a', MyAObject)

As for suffix-only literals, we can treat them as if they begin with an underscore. Thus, `1/3f` would be equivalent to 1/_f(3)
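A pure-Python model of what that could do at run time (the opcode, the registry, and the calling convention are all proposals here, not existing APIs):

    import fractions

    _tag_registry = {}

    def register_tag(tag, handler):
        _tag_registry[tag] = handler

    def resolve_tag(prefix, body, suffix=''):
        # What the hypothetical STR_RESOLVE_TAG op would do for a"bcd"e.
        handler = _tag_registry[prefix]
        return handler(body, suffix=suffix) if suffix else handler(body)

    register_tag('frac', fractions.Fraction)   # so frac'1/3' means Fraction('1/3')
    assert resolve_tag('frac', '1/3') == fractions.Fraction(1, 3)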

What about _instead of_ rather than _together with_? Half of Steven's objections are related to the ambiguity (to a human, even if not to the parser) of user prefixes in the (potential) presence of the builtin prefixes. None of those even arise with suffixes. Anyway, maybe you already have good answers for all of those objections, but if not… Also, there's at least one mainstream language (C++) that allows user suffixes and has literal syntax otherwise somewhat like Python's, and the proposals for other languages like Rust generally seem to be trying to do "like C++ but minus all the usual C++ over-complexity". Are there actual examples of languages with user prefixes? The only different designs I know of rely on the static type of the evaluation context. (For example, in Swift, you can just statically type `23 : km` or `"abc]*" : regex`, or even just pass the literal to a function that's declared or inferred to take a regex if that happens to be readable in your use case, so there's no need for a suffix syntax.) Which is neat, but obviously not applicable to Python.
And your idea that `<string literal> <suffix>` is conceptually no different than `<numeric literal> <suffix>` is absolutely insightful.
Well, back in 2015 I probably just stole the idea from C++. :) Another question this raises, which I just remembered: the word "literal" has three overlapping but distinct meanings in Python. Which one do we actually mean here? In particular, are container displays "literals"? For that matter, is -2 even a literal? Also, from what I remember, either in 2013 or in 2015, the discussion got side-tracked over people not liking the word "literal" to mean "something that's actually the result of a runtime function call". That may be less of a problem after f-strings (which are called literals in the PEP; I'm not sure about the language reference), but last time around, bringing up the fact that "-2" is actually a function call didn't sway anyone. So maybe I shouldn't be using the word "literal" this time, and I really hope it doesn't ruin your proposal…
That’s an interesting idea. And that’s something you can’t do with a single-affix design; you need prefixes and suffixes, unless you have some kind of separator for chaining, or only allow single characters.
My hack works basically like this. The compiler just converts it to a function call, which is looked up normally. I think that's the right tack here. IIRC, my hack translates a D suffix into a call to something like `_user_literal_D`, which solves the problem of accidental pollution of the namespace. But this does mean that any code that wants to use the D suffix has to `from decimal_literals import *`, or `2.3D` raises a NameError about nothing named _user_literal_D. (Either that, or someone has to inject it into builtins…) I'm not sure whether that's user-friendly enough. Anyway, I think your registry idea makes more sense. Then `2.3D` effectively just means `__user_literals__['D']('2.3')`, and there's no namespace pollution at all.
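Spelled out (the `__user_literals__` name is just the one floated in this thread, not an existing mechanism):

    import decimal

    __user_literals__ = {}

    def register_literal(tag, handler):
        __user_literals__[tag] = handler

    register_literal('D', decimal.Decimal)

    # The compiler would translate 2.3D into roughly:
    value = __user_literals__['D']('2.3')
    assert value == decimal.Decimal('2.3')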
Do we even need that? It’s true that most things in Python translate reasonably directly to bytecodes, but in this case it might be easier to just compile to existing bytecodes to look up and call the function.
There will, of course, be a method to register new tags. Something like
str.__register_tag__('a', MyAObject)
If the params are (handler, name=None), and None means to use the __name__ of the handler as the tag, then you can use it as a decorator:

    @__register_tag__
    def D(decimal_string):
        return decimal.Decimal(decimal_string)

Although this may not be the best example, because it might actually be clearer (as well as more efficient) to just register the constructor:

    __register_tag__(decimal.Decimal, 'D')

… but I suspect many examples won't be just a matter of calling a constructor on the string.
Does that mean you can't actually register a prefix named `_f`? Or that, if you do, it also registers a suffix named `f`? Also, I think for most non-single-letter suffixes you'd actually want an underscore at the start of the suffix. See C++ for lots of examples, but for a quick illustration, compare these:

    c = 2.99792458e8mps
    c = 2.99792458e8_mps
    c = 299_792_458mps
    c = 299_792_458_mps

The _mps suffix looks a lot better than the mps suffix, doesn't it? But would you want the function to have to be named __mps with two underscores? It may be worth coming up with the most compelling examples and then working out what feature set would support as many as possible, rather than trying to work out the ultimate feature set first and then seeing what we can do with it. It's probably worth stealing liberally from the C++ discussion (and any other languages that have similar features) as well as the 2013 Python discussion, but off the top of my head:

* Decimal, Fraction, np.float32, mpz, …
* Path objects
* Windows native Path objects, possibly with "really raw" processing to allow trailing backslashes
* regex, possibly with flags, possibly with "really raw" backslashes
* "Really raw" strings in general.
* JSON (register the stdlib or simplejson or ujson), XML (register ETree or lxml or bs4 or whatever you want), HTML, etc.
* unit suffixes for quantities

27.08.19 06:38, Andrew Barnert via Python-ideas пише:
* JSON (register the stdlib or simplejson or ujson),
What is the JSON suffix for? JSON is virtually a subset of Python, except that it uses true, false and null instead of True, False and None. If you set these three variables you can embed JSON syntax in pure Python.
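For instance (a quick runnable illustration of the point):

    # With these three names bound, most JSON documents are valid Python expressions.
    true, false, null = True, False, None
    data = eval('{"active": true, "tags": ["a", "b"], "parent": null}')
    assert data == {"active": True, "tags": ["a", "b"], "parent": None}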

On Aug 26, 2019, at 23:43, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think you’d mainly want it in combination with percent-, html-, or uu-equals-decoding, which makes it a potential stress test of the “multiple affixes” or “affixes with modifiers” idea. Which I think is important, because I like what the OP came up with for that idea, so I want to push it beyond just the “regex with flags” example to see if it breaks. Maybe URL, which often has the same html and percent encoding issues, would be a better example? I personally don’t need to decode URLs that often in Python (unlike in, say, ObjC, where there’s a smart URL class that you use in place of strings all over the place), but maybe others do?
JSON is virtually a subset of Python except that it uses true, false and null instead of True, False and None.
Is it _virtually_ a subset, or literally so, modulo those three values? I don't know off the top of my head. Look at all the trouble caused by Crockford just assuming that the syntax he'd defined was a strict subset of JS when actually it isn't quite. Actually, now that I think of it, I do know. Python has allow_nan on by default, so you'd need to also `from math import nan as NaN` and `from math import inf as Infinity`. But is that it? I'm not sure. And of course if you've done this:

    jdec = json.JSONDecoder(parse_float=Decimal)
    __register_prefix__(jdec.decode, 'j')

… then even j'1.1' and 1.1 are no longer the same values. Not to mention what you get if you registered Pandas's JSON reader instead of the stdlib's.

On Mon, Aug 26, 2019 at 11:03:38PM -0000, stpasha@gmail.com wrote:
Current string prefixes are allowed in combinations. Does the same apply to your custom prefixes? If yes, then they are ambiguous: how could the reader tell whether the string prefix frac'...' is a f- r- a- c-string combination, a fra- c-string combination, a fr- ac-string combination, or a f- rac- string combination? If no, then it will confuse and frustrate users who wonder why they can combine built-in prefixes like fr'...' but not their own prefixes. What kind of object is a frac-string? You might think it is obvious that it is a "frac" (Fraction? Something else?) but how about a czt-string? As a reader, at least I know that czt('...') is a function call that could return anything at all. That is standard across hundreds of programming languages. But as a string prefix, it looks like a kind of string, but could be anything at all. Imagine trying to reason about Python syntax:

1. u'...' is a unicode string, evaluating to a str.
2. r'...' is a raw string, evaluating to a str.
3. f'...' is a f-string, evaluating to a str.
4. b'...' is a byte-string, evaluating to a bytes object, which is not a str object but is still conceptually a kind of string.
5. Therefore z'...' is what kind of string, evaluating to what kind of object?

Things that look similar should be similar. This string prefix idea means that things that look similar can be radically different. It looks like a string, but may not be anything like a string. The same applies to function call syntax, of course, but as I mentioned above, function call syntax is standard across hundreds of languages and readers don't expect that the result of an arbitrary function call is necessarily the same as its first argument(s). We don't expect that foo('abcde') will return a string, even if we're a little unclear about what foo() actually does. u- (unicode) strings, r- (raw) strings, and even b- (byte) strings are all kinds of *string*. We know just by looking at them that they evaluate to a str or bytes object. Even f-strings, which are syntax for executable code, at least are guaranteed to evaluate to a str object. But these arbitrary string prefixes could return anything.
Indeed. czt'...' saves only two characters from czt('...').
I don't see how that follows. The existence of one new prefix adds *this* much new complexity: [holds forefinger and thumb about a millimeter apart] for significant gains. Trying to write your own f-string equivalent function would be quite difficult, but being in the language not only is it faster and more efficient than a function call, but it needs to be only written once. But adding a new way of writing single-argument function calls with a string argument: czt'...' is equivalent to czt('...') adds *this* much complexity to the language: [holds forefingers of each hand about shoulder-width apart] for rather insignificant gains, the saving of two parentheses. You still have to write the czt() function, it will have to parse the string itself, you will have no support from the compiler, and anyone needing this czt() will either have to re-invent the wheel or hope that somebody publishes it on PyPI with a suitable licence.
What's v() do? Verbose string?
Oh, you intended a version string, did you? If only you had written ``version`` instead of ``v``, I might not have guessed wrong. What were you saying about preferring readability and clarity over brevity? *semi-wink* I'm only half joking here. Of course I could guess that '1.13.0a' looks like a version string. But I genuinely expected v-string to mean "verbose", not version, and could only guess otherwise because I know what version strings look like. In other words, I got *all* of the meaning from the string part, not the prefix. The prefix on its own, I would have guessed completely wrong. This goes against your claim that "the string has no meaning of its own". Of course it has meaning on its own. It looks like a version string, which is the only way I could predict that v'...' stands for version-string rather than verbose-string. What if we didn't recognise the semantics of the string part?

    v'cal-{a}-%.5f-^H/7:d{b}s'

What does this v-string mean, what does it do, and how do I parse the string part of it? I think that one of the weaknesses of this proposal is that you are assuming that the meanings of these prefixes are as obvious to everyone else as they are to you. They aren't.
It has a feeling of an immutable object.
How are we supposed to know that v-strings return an immutable object? Let's suppose you come across l'abc' in somebody's code base. What's an l-string? Does it still look immutable to you? What if I told you that l-string stands for "list-string" and it returns a mutable list?
If I wanted to parse a string and return a Version object, I would write it as Version('1.13.0a'). If your v-string prefix does something other than that, I cannot comment, as I have no idea what your v-string prefix would do or how it would differ from the regular string '1.13.0a'. -- Steven

Thank you, Steven, for taking the time to write such an elaborate rebuttal. If I understand the heart of your argument correctly, you're concerned that the prefixed strings may add confusion to the code. That nobody knows what `l'abc'` or `czt'xxx'` could possibly mean, while at the same time `v'1.0'` could mean many things, whereas `v'cal-{a}'` would mean nothing at all... These are all valid concerns. The string (or number) prefixes add new power to the language, and with new power comes new responsibility. While the syntax can be used to enhance readability of the code, it can also be abused to make the code more obscure. However, Python does not attempt to be an idiot-proof language. "We are all consenting adults" is one of its guiding principles. That a certain feature can potentially be misused shouldn't deter us from adding it, if the benefits are significant. And the benefits in terms of readability can be significant. Consider the existing python prefixes: `r'...'` is purely for readability, it adds no extra functionality; `f'...'` has neat compiler support, but even if it didn't (and most python users don't actually realize f-strings get preprocessed by the compiler) it would still enhance readability compared to `str.format()`. It's nice to be able to write a complex number as `5 + 3j` instead of `complex(5, 3)`. And so on.
You're correct that, devoid of context, `v"smth..."` is not very meaningful. The "v" prefix could mean "version", or "verbose", or "volatile", or "vectorized", or "velociraptor", or whatever. Luckily, code almost always exists within a specific context. It solves a particular problem, works within a particular domain, and makes perfect sense for people working within that domain. This isn't much different than, say, the `np.` suffix, which means "numpy" in the domain of numerical computations, NP-completeness for some mathematicians, and "no problem" for regular users. From a practical perspective, the meaning of each particular symbol will come from the way that it was created or imported. For example, if your script says `from packaging.version import v`, then "v" is a version. If, on the other hand, it says `from zoo import velociraptor as v`, then it's an altogether different beast.
In other words, I got all of the meaning from the string part, not the prefix. The prefix on its own, I would have guessed completely wrong.
Exactly. You look at string "1.10a" and you know it must be a version string, because you're a human, you're smart. The compiler is not a human, it has no idea. To the Python interpreter it's just a PyUnicode object of length 5. It's meaningless. But when you combine this string with a prefix into a single object, it gains power. It can have methods or special behaviors. It can have a type, different from `str`, that can be inspected when passing this object to another function. Think of `v"1.10a"` as making a "typed string" (even though it may end up not being a string at all). By writing `v"1.10a"` I convey the intent for this to be a version string.
for rather insignificant gains, the saving of two parentheses.
Two bytes doesn't sound like a lot. I mean, it is quite little in the grand scheme of things. However, I don't think a simple byte count is the proper measure here. There could be benefits to readability even if the byte difference were zero or negative. I believe a good way to think about this is the following: if the feature were already implemented, would people want to use it, and would it improve the readability of their code? I speculate that the answer to both questions is yes, at least for some people. As a practical example, consider the function `pandas.read_csv()`. The documentation for its `sep` parameter says "In addition, separators longer than 1 character and different from ``'\s+'`` will be interpreted as regular expressions ...". In this case they wanted the `sep` parameter to handle both simple separators and regular-expression separators. However, as there is no syntax to create a "regular expression string", they ended up with this dubious heuristic based on the length of the string... Ideally, they should have said that `sep` could be either a string or a regexp object, but the barrier of writing

    from re import compile as rx
    rx('...')

is just impossibly high for a typical user. Not to mention that such code **would** actually be harder to read, because I'd be inventing my own notation for a function that is commonly known under a different name. Another pet peeve of mine is datetime literals. Or, rather, their absence. I often see, again in pandas, how people create columns of strings ["2010-05-01", "2010-05-02", ...], and then call `parse_datetime()`. It would be more straightforward if there were a standard syntax for denoting datetime constants, allowing us to create a column of datetime type directly.
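For comparison, a library can already dispatch on the argument's type today; the missing piece is only a low-friction way for users to produce the regex object (a sketch; `select` is hypothetical):

    import re

    def select(collection, pattern):
        # A compiled regex selects by match; a plain string selects by equality.
        if isinstance(pattern, re.Pattern):
            return [x for x in collection if pattern.fullmatch(x)]
        return [x for x in collection if x == pattern]

    names = ["abc", "abd", "xyz"]
    assert select(names, "abc") == ["abc"]
    assert select(names, re.compile("ab.")) == ["abc", "abd"]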

On Tue, Aug 27, 2019 at 6:25 PM <stpasha@gmail.com> wrote:
Syntactically, the "np." prefix (not suffix fwiw) actually means "look up the np object, then locate an attribute called <whatever comes next>". That's true of every prefix you could ever get, and they're always looked up at run time; the attribute name always follows the exact same syntactic rules no matter what the prefix is. Literals, on the other hand, are part of syntax - a different string type prefix can change the way the entire file gets parsed. Will these "custom prefixes" be able to define anything syntactically? If not, why not just use a function call? And if they can, then you have created an absolute monster, where a v-string in one context can have completely different syntactic influence on what follows it than a v-string in another context. At least with attribute lookups, you can parse a file without knowing what "np" actually means, and even examine things at run-time. ChrisA

On Aug 27, 2019, at 01:42, Chris Angelico <rosuav@gmail.com> wrote:
There is a possibility in between the two extremes of "useless" and "complete monster": the prefix accepts exactly one token, but can parse that token however it wants. That's pretty close to what C++ does, and pretty close to the way my hacky proof of concept last time around worked, and I don't think that only works because those are suffix-only designs. (That being said, if you do allow "really raw" string literals as input to the user prefixes/suffixes to handle the path'C:\' case, then it's possible to invent cases that would tokenize differently with and without the feature—in fact, I just did—and therefore it _might_ be possible to invent cases that parse validly but differently, in which case the monster is lurking after all. Someone might want to look more carefully at the C++ rules for that?)

On Tue, Aug 27, 2019 at 05:24:19AM -0700, Andrew Barnert via Python-ideas wrote:
How is that different from passing a string argument to a function or class constructor that can parse that token however it wants?

    x'...'
    x('...')

Unless there is some significant difference between the two, what does this proposal give us? -- Steven

On Aug 27, 2019, at 08:52, Steven D'Aprano <steve@pearwood.info> wrote:
Before I get into this, let me ask you a question. What does the j suffix give us? You can write complex numbers without it just fine:

    c = complex
    c(1, 2)

And you can even write a j function trivially:

    def j(x): return complex(0, x)
    1 + j(2)

But would anyone ever write that when they can write it like this?

    1 + 2j

I don't think so. What does the j suffix give us? The two extra keystrokes are trivial. The visual noise of the parens is a bigger deal. The real issue is that this matches the way we conceptually think of complex numbers, and the way we write them in other contexts. (Well, the way electrical engineers write them; most of the rest of us use i rather than j… but still, having to use j instead of i is less of an impediment to reading 1+2j than having to use function syntax like 1+i(2).) And the exact same thing is true in 3D or CUDA code that uses a lot of float32 values. Or code that uses a lot of Decimal values. In those cases, I actually have to go through a string for implementation reasons (because otherwise Python would force me to go through a float64 and distort the values), but conceptually, there are no strings involved when I write this:

    array([f('0.2'), f('0.3'), f('0.1')])

… and it would be a lot more readable if I could write it the same way I do in other programming languages:

    array([0.2f, 0.3f, 0.1f])

Again, it's not about saving 4 keystrokes per number, and the visual noise of the parens is an issue but not the main one (and quotes are barely any noise by comparison); it's the fact that these numeric values look like numeric values instead of looking like strings. The fact that they look the same as the same values in other contexts like a C++ program or a GLSL shader is a pretty large added bonus. But I don't think that's essential to the value here. If you forced me to use prefixes instead of suffixes (I don't think there's any good reason for that, but who knows how the winds of bikeshedding may blow), I'd still prefer f2.3 to f('2.3'), because it still looks like a number, as it should. I know this is doable, because I've written an import hook that does it, plus I have a decade of experience with another popular language (C++) that has essentially the same feature. What about the performance cost of these values not being constants? A decorator that finds np.float32 calls on constants and promotes them to constants by hacking the bytecode is pretty trivial to write, or you can load the whole array in one go from a bytes constant and put the readable version in a comment, or whatever. But anything that's slow enough to be worth optimizing is doing a huge matmul or pushing zillions of values back and forth to the GPU or something else that swamps the setup cost, even if the setup cost involves a few dozen string parses, so it never matters. At least not for me.

For a completely different example, but one that I've also already given earlier in this thread, so I won't belabor it too much:

    path'C:\'
    bs"this\ space won't have a backslash before it, also \e[22; is an escape sequence and of course \a is still a bell because I'm using the rules from C/JS/etc."
    bs37"this\ space has a backslash before it without raising a warning or an error even in Python 3.15 because I've implemented the 3.7 rules"

… and so on. Some of these _could_ be done with a raw string and a (maybe slightly more complicated) function call, but at least the first one is impossible to do that way.
Unlike the numeric suffixes, this one I haven’t actually implemented a hacky version of, and I don’t know of any other languages that have an identical feature, so I can’t promise it’s feasible, but it seems like it should be.

On Aug 27, 2019, at 10:21, Rhodri James <rhodri@kynesim.co.uk> wrote:
You make the point yourself: this is something we already understand from dealing with complex numbers in other circumstances. That is not true of generic single-character string prefixes.
It certainly is true for 1.23f. And, while 1.23d for a decimal or 1/3F for a Fraction may not be identical to any other context, it’s a close-enough analogy that it’s immediately familiar. Although I might actually prefer 1.23dec or 1/3frac or something more explicit in those cases. (Fortunately, there’s nothing in the design stopping me from doing that.) As for string prefixes, I don’t think those should usually, or maybe even ever, be single-character. People have given examples like sql"…" (I’m still not sure exactly what that does, but it’s apparently used in other languages for something?) and regex"…" and path"…" (which are a lot more obvious). I’m not sure if they actually are useful, which is why my proposal didn’t have them; I’m waiting on the OP to give more complete examples, cite similar uses from other languages, etc. But I doubt the problem you’re talking about, that they’d all be unfamiliar cryptic one-letter things, is likely to arise.

On 29/08/2019 00:24, Andrew Barnert wrote:
I would contend that (and anyway 1.23f is redundant; 1.23 is already a float literal). But anyway I said "generic single-character string prefixes", because that's what the original proposal was. You seem to be going off on creating literal syntax for standard library types (which, for the record, I think is a good idea and deserves its own thread), but that's not what the OP seems to be going for. -- Rhodri James *-* Kynesim Ltd

On Wed, Aug 28, 2019 at 3:10 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
If your conclusion here were "and that's why Python needs a proper syntax for Decimal literals", then I would be inclined to agree with you - a Decimal literal would be lossless (as it can entirely encode whatever was in the source file), and you could then create the float32 values from those. But you haven't made the case for generic string prefixes or any sort of "arbitrary literal" that would let you import something that registers something to make your float32 literals. ChrisA

On Tuesday, August 27, 2019, 11:12:51 AM PDT, Chris Angelico <rosuav@gmail.com> wrote:
Sure I did; you just cut off the rest of the email that had other cases. And ignored most of what you quoted about the float32 case. And ignored the previous emails by both me and the OP that had other cases. Or can you explain to me how a builtin Decimal literal could solve the problem of Windows paths? Here's a few more: Numeric types that can't be losslessly converted to and from Decimal, like Fraction. Something more similar to complex (e.g., `quat = 1.0x + 0.0y + 0.1z + 1.0w`). What would Decimal literals do for me there? I think your reluctance and the OP's excitement here both come from the same source: Any feature that gives you a more convenient way to write and read something is good, because it lets you write things in a way that's consistent with your actual domain, and also bad, because it lets you write things in a way that's not readable to people who aren't steeped in your domain. Those are _always_ both true, so just arguing from first principles is pointless. The question is whether, for this specific feature, there are good uses where the benefit outweighs the cost. And I think there are. In fact, if you're already convinced that we need Decimal literals, unless you can come up with a more feasible way to add builtin Decimal literals to Python, Decimal on its own seems like a sufficient use case for the feature.

On Wed, Aug 28, 2019 at 6:03 AM Andrew Barnert <abarnert@yahoo.com> wrote:
Not sure that's a total blocker, but in any case, I'm not arguing for that - I'm just saying that everything up to that point in your argument would be better served by a Decimal literal than by any notion of "custom literals".
But they're not. You didn't even attempt to answer the comparison with complex that you quoted. The problem that `j` solves is not that there's no way to create complex values losslessly out of floats, but that there's no way to create them _readably_, in a way that's consistent with the way you read and write them in every other context. Which is exactly the problem that `f` solves. Adding a Decimal literal would not help that at all—letting me write `f(1.23d)` instead of `f('1.23')` does not let me write `1.23f`.
TBH I don't quite understand the problem. Is it only an issue with negative zero? If so, maybe you should say so, because in every other way, building a complex out of a float added to an imaginary is perfectly lossless.
Also, I think you're the one who brought up performance earlier? `%timeit np.float32('1.23')` is 671ns, while `%timeit np.float32(d)` with a pre-constructed `Decimal(1.23)` is 2.56us on my laptop, so adding a Decimal literal instead of custom literals actually encourages _slower_ code, not faster.
No, I didn't say that. I have no idea why numpy would take longer to work with a Decimal than a string, and that's the sort of thing that could easily change from one version to another. But the main argument here is about readability, not performance.
Also, as the OP has pointed out repeatedly and nobody has yet answered, if I want to write `f(1.23d)` or `f('1.23')`, I have to pollute the global namespace with a function named `f` (a very commonly-used name); if I want to write `1.23f`, I don't, since the converter gets stored in some out-of-the-way place like `__user_literals_registry__['f']` rather than `f`. That seems like a serious benefit to me.
Maybe. But far worse is that you have a very confusing situation in which this registered value could be different in different programs. In contrast, f(1.23d) would have the same meaning everywhere: call a function 'f' with one parameter, the Decimal value 1.23. Allowing language syntax to vary between programs is a mess that needs a LOT more justification than anything I've seen so far.
Which said basically the same as the parts I quoted.
And ignored most of what you quoted about the float32 case.
What did I ignore?
And ignored the previous emails by both me and the OP that had other cases. Or can you explain to me how a builtin Decimal literal could solve the problem of Windows paths?
All the examples about Windows paths fall into one of two problematic boxes:

1) Proposals that allow an arbitrary prefix to redefine the entire parser - basically impossible for anything sane.
2) Proposals that do not allow the prefix to redefine the parser, and are utterly useless, because the rest of the string still has to be valid.

So no, you still haven't made a case for arbitrary literals.
Here's a few more: Numeric types that can't be losslessly converted to and from Decimal, like Fraction.
If you want to push for Fraction literals as well, then sure. But that's still very very different from *arbitrary literal types*.
Something more similar to complex (e.g., `quat = 1.0x + 0.0y + 0.1z + 1.0w`). What would Decimal literals do for me there?
Quaternions are sufficiently niche that it should be possible to represent them with multiplication:

    quat = 1.0 + 0.0*i + 0.1*j + 1.0*k

With appropriate objects i, j, k, it should be possible to craft something that implements quaternion arithmetic using this syntax. Yes, it's not quite as easy as 4+3j is, but it's also far FAR rarer. (And remember, even regular complex numbers are more advanced than a lot of languages have syntactic support for.)
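A sketch of that approach, with a toy Quaternion class that implements only the arithmetic needed for this construction syntax:

    class Quaternion:
        def __init__(self, w=0.0, x=0.0, y=0.0, z=0.0):
            self.w, self.x, self.y, self.z = w, x, y, z

        def __add__(self, other):
            if isinstance(other, (int, float)):
                other = Quaternion(w=other)
            return Quaternion(self.w + other.w, self.x + other.x,
                              self.y + other.y, self.z + other.z)
        __radd__ = __add__

        def __rmul__(self, scalar):
            # Supports scalar * unit, e.g. 0.1*j
            return Quaternion(scalar * self.w, scalar * self.x,
                              scalar * self.y, scalar * self.z)

    i, j, k = Quaternion(x=1.0), Quaternion(y=1.0), Quaternion(z=1.0)
    quat = 1.0 + 0.0*i + 0.1*j + 1.0*k
    assert (quat.w, quat.x, quat.y, quat.z) == (1.0, 0.0, 0.1, 1.0)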
I think your reluctance and the OP's excitement here both come from the same source: Any feature that gives you a more convenient way to write and read something is good, because it lets you write things in a way that's consistent with your actual domain, and also bad, because it lets you write things in a way that's not readable to people who aren't steeped in your domain. Those are _always_ both true, so just arguing from first principles is pointless. The question is whether, for this specific feature, there are good uses where the benefit outweighs the cost. And I think there are.
That line of argument is valid for anything that is specifically defined by the language. Creating a way to represent matrix multiplication benefits people who do matrix multiplication. Those of us who don't work with matrix multiplication on a daily basis, however, can at least read some Python code and go "ah, a @ b means matrix multiplication". The creation of custom literals means we can't do that any more. For instance, you want this:

    x = path"C:\"

but that means that it's equally possible for me to create this:

    y = tree" \" " \ "

Now, what does that mean? Can you even parse the rest of the script without knowing what my 'tree' type does?
In fact, if you're already convinced that we need Decimal literals, unless you can come up with a more feasible way to add builtin Decimal literals to Python, Decimal on its own seems like a sufficient use case for the feature.
There are valid use cases for Decimal literals and Fraction literals, but not, IMO, for custom literals. Look at some of the worst abuses of #define in C to get an idea of what syntax customization can do to readability. ChrisA

On Aug 27, 2019, at 14:41, Chris Angelico <rosuav@gmail.com> wrote:
No, it really couldn’t. A builtin Decimal literal would arguably serve the Decimal use case better (but I’m not even sure about that one; see below), but it doesn’t serve the float32 case that you’re responding to.
Negative zero is an irrelevant side issue that Serhiy brought up. It means j is not quite perfect—and yet j is still perfectly usable despite that. Ignore negative zero. The problem that j solves is dead simple: 1 + 2j is more readable than complex(1, 2). And it matches what you write and read in other contexts besides Python. That's the only problem j solves. But it's a problem worth solving, at least for code that uses a lot of complex numbers. Without it, even if you wanted to pollute the namespace with a single-letter global so you could write c(1, 2) or 1 + j(2), it _still_ wouldn't be nearly as readable or as familiar. That's why we have j. There is literally no other benefit, and yet it's enough. And the problem that f solves would be exactly the same: 1.23f is more readable than calling float32, and it matches what you read and write in other contexts besides Python (like, say, C or shader code). Even if you wanted to pollute the namespace with a single-letter global f, it still wouldn't be as readable or as familiar. That's why we should have f. There is literally no other benefit, but I think it's enough benefit, for enough programs, that we should be allowed to do it. Just like j. Unlike j, however, I don't think it's useful in enough programs that it should be builtin. And I think the same is probably true for Decimal. And for most of the other examples that have come up in this thread. Which is why I think we'd be better served with something akin to C++ allowing you to explicitly register affixes for your specific program, than something like C with its too-big-to-remember-but-still-not-enough-for-many-uses zoo of builtin affixes.
Sure, and the global f could also be different in different programs—or even in different modules in the same program. So what? 1.23f would always have the same meaning everywhere, it's just that the meaning is something like __user_literals__['f']('1.23') instead of globals()['f']('1.23'). Yes, of course that is something new to be learned: if you're looking at a program that does a lot of 3D math, or a lot of decimal math, or a lot of Windows path stuff, or whatever, people are likely to have used this feature, so you'll need to know how to look up the f or d or whatever. But that really isn't a huge hardship, and I think the benefits outweigh the cost.
This doesn’t really allow syntax to vary between programs. It just allows literals to be followed (or maybe preceded) by tags. The rest of the syntax is unchanged.
Not even remotely—again, unless you think that Windows paths could somehow be served by builtin decimal literals?
And ignored most of what you quoted about the float32 case.
What did I ignore?
That 1.23f is more readable, familiar, etc. in exactly the same way that 2.3j is.
3) Proposals that do not allow the prefix to redefine the parser for the entire program, but do allow it to manually parse anything the tokenizer can recognize as a single (literal) token. As I said, I haven't tried to implement this example as I have with the other examples, so I can't promise that it's doable (with the current tokenizer, or with a reasonable change to it). But if it is doable, it's neither insane nor useless. (And even if it's not doable, that's just two examples that affixes can't solve—Windows paths and general "super-raw strings". They still solve all of the other examples.)
If we really only ever needed Decimal and Fraction, then yes, I think allowing user-defined literal tags would be better than adding two hard-coded tags that most people will rarely use.
Yes, and? "Literal token" is specifically defined by the language. "Literal token with attached tag" will also be specifically defined by the language. The only thing open to customization is what that token gets compiled to. (Of course this is just one suggestion I came up with, not the only way to do things, or what the OP suggested. But it does show that there is at least one possibility besides "insane" and "useless".) Your argument comes down to the fact that anything that could possibly be construed as affecting syntax, no matter how constrained it is, and no matter how far you have to stretch to see it as affecting syntax in the first place, and even if nobody would ever do that with it, is inherently so evil that it can't possibly be allowed. I think that's a silly argument. Especially in a language that already has a nicely documented feature like, say, import hooks.
Look at the plethora of suffixes C has for number and character literals. Look at how many things people still can’t do with them that they want to. Look at the way user literals work in C++. While technically you can argue that they are “syntax customization”, in practice the customization is highly constrained. Is it _impossible_ to use that feature to write code that can’t be parsed by a human reader? I don’t know if I could prove that it’s impossible. However, I do know that it’s not easy. And that none of the examples, or real-life uses, that I’ve seen have done so. Do you think Python users are incapable of the kind of restraint and taste shown by C++ users, and therefore we can’t trust Python users with a tool that might possibly (but we aren’t sure) if abused badly enough make code harder to visually parse?

On Wed, Aug 28, 2019 at 10:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
So what is the definition of "a single literal token" when you're creating a path-string? You want this to be valid:

    x = path"C:\"

For this to work, the path prefix has to redefine the way the parser finds the end of the token, does it not? Otherwise, you still have the same problems you already do - backslashes have to be escaped. That's why I say that, without being able to redefine the parser, this is completely useless, as a "path string" might as well just be a "string". Which way is it?
I don't understand. Are you saying that the prefix is not going to be able to change how backslashes are handled, or that it is? If you keep the tokenizer exactly the same and just add a token in front of it, then things like path"C:\" will be considered to be incomplete and will continue to consume source code until the next quote (or throw SyntaxError for EOL inside string literal). Or is your idea of "literal token" something other than that? If a "literal token" is simply a string literal, then how is this actually helping anything? What do you achieve?
Look at the plethora of suffixes C has for number and character literals. Look at how many things people still can’t do with them that they want to.
I don't know how many there are. The only ones I can think of are "f" for single-precision float, and the long and unsigned suffixes on integers. Python doesn't have these because very few programs need to care about whether a float is single-precision or double-precision, or how large an int is.
Look at the way user literals work in C++. While technically you can argue that they are “syntax customization”, in practice the customization is highly constrained. Is it _impossible_ to use that feature to write code that can’t be parsed by a human reader? I don’t know if I could prove that it’s impossible. However, I do know that it’s not easy. And that none of the examples, or real-life uses, that I’ve seen have done so.
I also have not yet seen any good examples of user literals in C++.
Do you think Python users are incapable of the kind of restraint and taste shown by C++ users, and therefore we can’t trust Python users with a tool that might possibly (but we aren’t sure) if abused badly enough make code harder to visually parse?
People can be trusted with powerful features that can introduce complexity. There's just not a lot of point introducing a low-value feature that adds a lot of complexity. ChrisA

I’m not sure (maybe about 60% at best), but I think last time I checked this, the tokenizer actually hits the error without munching the rest of the file. If I’m wrong, then you would need to add a “really raw string literal” builtin that any affixes that want really raw string literals could use, but that’s all you’d have to do. And I really don’t think it’s worth getting this in-depth into just one of the possible uses that I just tossed off as an aside, especially without actually sitting down and testing anything.
Off the top of my head, there are also long long integers, and long doubles, and the wide and three Unicode suffixes for char. Those probably aren't all of them. And your compiler probably has extensions for "legacy" suffixes and nonstandard types like int128 or decimal64 and so on.
Right, but the issue isn’t which ones, but how many. C doesn’t have decimals or fractions, and other things like datetime objects have been suggested in this thread, and even more in the two earlier threads. If there are too many useful kinds of constants, there are too many to make them all builtins.
But it really doesn't add a lot of complexity. If you're not convinced that really-raw string processing is doable, drop that. Since the OP hasn't given a detailed version of his grammar, just take mine: a literal token immediately followed by one or more identifier characters (that couldn't have been munched by the literal) is a user-suffix literal. This is compiled into code that looks up the suffix in a central registry and calls it with the token's text. That's all there is to it. Compare that to adding Decimal (and Fraction, as you said last time) literals when the types aren't even builtin. That's more complexity, for less benefit. So why is it better?

On Wed, Aug 28, 2019 at 2:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:
What is a "literal token", what is an "identifier character", and how does this apply to your example of having digits, a decimal point, and then a suffix? What if you want to have a string, and what if you want to have that string contain backslashes or quotes? If you want to say that this doesn't add complexity, give us some SIMPLE rules that explain this. And make absolutely sure that the rules are identical for EVERY possible custom prefix/suffix, because otherwise you're opening up the problem of custom prefixes changing the parser again.
Compare that to adding Decimal (and Fraction, as you said last time) literals when the types aren’t even builtin. That’s more complexity, for less benefit. So why is it better?
Actually no, it's a lot less complexity, because it's all baked into the language. You don't have to have the affix registry to figure out how to parse a script into AST. The definition of a "literal" is given by the tokenizer, and for instance, "-1+2j" is not a literal. How is this going to impact your registry? The distinction doesn't matter to Decimal or Fraction, because you can perform operations on them at compile time and retain the results, so "-1.23d" would syntactically be unary negation on the literal Decimal("1.23"), and -4/5f would be unary negation on the integer 4 and division between that and Fraction(5). But does that work with your proposed registry? What is a "literal token", and would it need to include these kinds of things? What if some registered types need to include them and some don't? ChrisA

On Aug 28, 2019, at 00:40, Chris Angelico <rosuav@gmail.com> wrote:
Literals and identifier characters are already defined today, so I don’t need new definitions for them. The existing tokens are already implemented in the tokenizer and in the tokenize module, which is why I was able to slap together multiple variations on a proof of concept 4 years ago in a few minutes as a token-stream-processing import hook. My import hook version is a hack, of course, but it serves as a counter to your argument that there’s no simple thing that could work by being a dead simple thing that does work. And there’s no reason to believe a real version wouldn’t be at least as simple.
We add a `suffixedfloatnumber` production defined as `floatnumber identifier`. So, the `2.34` parses as a `floatnumber` the same as always. That `d` can’t be part of a `floatnumber`, but it can be the start of an `identifier`, and those two nodes together can make up a `suffixedfloatnumber`. No need for any new lookahead or other context. And for the concrete implementation in CPython, it should be obvious that the suffix can be pushed down into the tokenizer, at which point the parse becomes trivial.

If you’re asking how my hacky version works, you could just read the code, which is simpler than an explanation, but here goes (from memory, because I’m on my phone): To the existing tokenizer, `d` isn’t a delimiter character, so it tries to match the whole `2.34d`. That doesn’t match anything. But `2.34` does match something, etc., so ultimately it emits two tokens, `floatnumber('2.34'), error('d')`. My import hook reads the stream of tokens. When it sees a `floatnumber` followed by an `error`, it checks whether the error body could be an identifier token. If so, it replaces those two tokens in the stream with… I forget, but probably I just hand-parsed the lookup and call and emitted the tokens for that.

I can’t _guarantee_ that the real version would be simpler until I try it. And I don’t want to hijack the OP’s thread and replace his proposal (which does give me what I want) with mine (which doesn’t give him what he wants), unless he abandons the idea of attempting to implement his version. But I’m pretty confident it would be as simple as it sounds, which is even simpler than the hacky version (which, again, is dead simple and works today). And most variations on the idea you could design would be just as simple. Maybe the OP will perversely design one that isn’t. If so, it’s his job to show that it can be implemented. And if he gives up, then I’ll argue for something that I can implement simply. But I don’t think that’s even going to come up.
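For the curious, here is a minimal stand-in for that kind of token-stream rewriting, using only the stdlib tokenize module rather than the actual import-hook machinery. On the CPython versions current as of this thread, `2.34d` tokenizes as a NUMBER immediately followed by a NAME (newer tokenizers may reject it outright, in which case a hook would have to work at a lower level); the `__user_suffixes__` name is hypothetical:

    import io
    import tokenize

    def rewrite(source):
        """Rewrite NUMBER-immediately-followed-by-NAME into a registry call."""
        toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
        out, i = [], 0
        while i < len(toks):
            tok = toks[i]
            nxt = toks[i + 1] if i + 1 < len(toks) else None
            if (nxt is not None and tok.type == tokenize.NUMBER
                    and nxt.type == tokenize.NAME and tok.end == nxt.start):
                # 2.34d  ->  __user_suffixes__['d']('2.34')
                out.extend([(tokenize.NAME, '__user_suffixes__'),
                            (tokenize.OP, '['),
                            (tokenize.STRING, repr(nxt.string)),
                            (tokenize.OP, ']'),
                            (tokenize.OP, '('),
                            (tokenize.STRING, repr(tok.string)),
                            (tokenize.OP, ')')])
                i += 2
            else:
                out.append((tok.type, tok.string))
                i += 1
        # untokenize's two-tuple mode produces rough spacing, but valid code.
        return tokenize.untokenize(out)

    print(rewrite("x = 2.34d\n"))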
Well, that works exactly the same way a string does today (including the optional r prefix). The closing quote can now be followed by a string of identifier characters, but everything up to there is exactly the same as today. So, it doesn’t add any complexity, because it uses the same rules as today. I did suggest, as a throwaway addon to the OP’s proposal, that you could instead do raw strings or even really-raw (the string ends at the first matching quote; backslashes mean nothing). I don’t know if he wants either of those, but if he does, raw string literals are already defined in the grammar and implemented in the tokenizer, and really-raw is an even simpler grammar (identical to the existing grammar except that instead of `longstringchar | stringescapeseq` there’s a `<any source character except the quote>` node, and the same for `shortstringitem`).
And make absolutely sure that the rules are identical for EVERY possible custom prefix/suffix,
Well, in my version, since the rule for suffixedstringliteral is just `stringliteral identifier`, of course it’s the same for every possible suffix; there’s no conceivable way it could be different. If the OP wants to propose something more complicated that provides some way of selecting different rules, he could, but I don’t think he has, and if he doesn’t, then the issue will be equally nonexistent.

I don’t know whether he wants to interact with the existing string prefixes (or, if so, how that works), or always do normal strings, or always do really-raw strings, or what, but there are multiple plausible designs, most of which are not impossible or even complicated, so the fact that you can imagine that there might be a design that would be impossible really isn’t relevant.

Just to show how easy it is to come up with something (but which, again, may not be what the OP actually wants here): a stringliteral is now a stringprefix followed by shortstring or longstring (as today), or an identifier followed by rrshortstring or rrlongstring. The rr tokens are defined as I described above: they end at the first matching quote, no backslashing. This option would have some limitations: people can’t use \" to escape quotes in prefixed strings, there’s no way to get prefixed bytes, you probably can’t call a prefix "bub"… does that make some of the OP’s desired use cases or some of the 2013 use cases no longer viable? I don’t know. If so, the OP presumably won’t use this option and will use a different one. Any option will have some limitations, and I don’t know which one he wants, but there are a huge number of simple, and nonmagical, options that he could pick.
Making the language definition and the interpreter and the compiler more complicated doesn’t eliminate the complexity, it just moves it somewhere else.
You don't have to have the affix registry to figure out how to parse a script into AST.
You don’t need the registry to parse to an AST for my proposal either; it’s only used at runtime. And, while the OP didn’t give us a grammar, he did give us proposed bytecode output of (one version of) his idea, and it’s pretty obvious that the registry isn’t getting involved until the interpreter eval loop processes the new registry-lookup opcode, so it clearly isn’t involved in parsing. And why would it get involved in parsing? It’s not like someone is proposing Rust or Dylan macros here.
Not at all. Why would it?
Yes, of course it does. That should be obvious from the fact that I said that `1/2F` would end up equivalent to `1/Fraction(2)`. Concretely, it ends up as something like `1/sys.__user_suffixes__['F']('2')`, except probably with nicer error handling, so you don’t get a KeyError for an unknown suffix. Notice that the only thing looked up in the registry is the function to process the text, and this doesn’t need to happen until runtime, long after the code has been not just parsed, but compiled. Of course the OP’s version will be a little different. He wants to handle both prefixes and suffixes by looking up the prefix and passing the suffix as a second argument. And I’m not sure what exactly he wants as the main argument. But I still don’t see any reason it would need to look in the registry at tokenize or parse or compile time. And again, his proposed bytecode translation implies that it doesn’t do so. So why imagine that it has to when there’s no visible reason for it?
What is a "literal token", and would it need to include these kinds of things?
How could this not be obvious? I deliberately chose the phrase “literal token”, and you clearly understand what this means because you invoked that meaning just one paragraph above. I also provided a link to a hacky implementation that blatantly relies on the tokenizer’s current processing of literals. And I gave examples that make it clear that `2` is a literal token and `1/2` is not. So why do you even need to ask whether `-4/5` is one? How could `-4/5` possibly be a literal token if `1/2` is not, or if it isn’t a token at all?
What if some registered types need to include them and some don't?
They can’t. The simple rule works for every numeric example everyone has come up with so far, even Steven’s facetious quaternion example that he proposed as too ridiculous for anyone to actually want. Is it a flaw that there may or may not be some examples that nobody has been able to think of that might work with a much more complicated feature but won’t work with this feature? Of course not. That’s true for every feature ever. There’s no reason to ignore the obvious simple design and try to imagine more complicated designs that may or may not solve additional problems that nobody’s even imagined just so you can dismiss the idea as too complicated.

Thanks, Andrew, you're able to explain this much better than I do. Just wanted to add that Python *already* has ways to grossly abuse its syntax and create unreadable code. For example, I can write

    >>> о = 3
    >>> o = 5
    >>> ο = 6
    >>> (о, o, ο)
    (3, 5, 6)

But just because some feature CAN get abused, doesn't mean it ACTUALLY gets abused in practice. People want to write nice, readable code, because they will ultimately be the ones to support it.

On Wed, Aug 28, 2019 at 10:50 PM Rhodri James <rhodri@kynesim.co.uk> wrote:
    '\u043e' CYRILLIC SMALL LETTER O
    'o'      LATIN SMALL LETTER O
    '\u03bf' GREEK SMALL LETTER OMICRON

Virtually indistinguishable in most fonts, but distinct characters. It's the same thing you can do with "I" and "l" in many fonts, or "rn" and "m" in some, but taken to a more untypable level. ChrisA

The case can be made as follows: different people use different parts of the Python language. Andrew would love to see the support for decimals, fractions and float32s (possibly float16s too, and maybe even posit numbers). Myself, I miss datetime and regular expression literals. Other people on the 2013 thread argued at length in favor of supporting sql-literals, which would allow them to be used in a much safer manner. Then there are those who want to write complex numbers in a natural fashion, but they already got their wish granted. In short, the needs vary, and not all of the functionality belongs to the python standard library either.

On Tuesday, August 27, 2019, 11:42:23 AM PDT, Serhiy Storchaka <storchaka@gmail.com> wrote:
And yet, despite that limitation, many people find it useful, and use it on a daily basis. Are you suggesting that Python would be better off without the `j` suffix because of that problem?

On Tue, Aug 27, 2019 at 10:07:41AM -0700, Andrew Barnert wrote:
Before I get into this, let me ask you a question. What does the j suffix give us?
I'm going to answer that question, but before I answer it, I'm going to object that this analogy is a poor one. This proposal is *in no way* a proposal for a new compile-time literal. If it were, it might be interesting: I would be very interested to hear more about literals for a Decimal type, say, or regular expressions. But this proposal doesn't offer that. This proposal is for mere syntactic sugar allowing us to drop the parentheses from a tiny subset of function calls, those which take a single string argument. And even then, only when the argument is a string literal:

    czt'abc'   # Okay.
    s = 'abc'
    czt's'     # Oops, wrong, doesn't work.

But, to answer your question, what does the j suffix give us? Damn little. Unless there is a large community of Scipy and Numpy users who need complex literals, I suspect that complex literals are one of the least used features in Python. I do a lot of maths in Python, and aside from experimentation in the interactive interpreter, I think I can safely say that I have used complex literals exactly zero times in code.
You can write complex numbers without it just fine: [...]
Indeed. And if we didn't already have complex literals, would we accept a proposal to add them now? I doubt it. But if you think we would, how about a proposal to add quaternions?

    q = 3 + 4i + 2j - 7k
But would anyone ever write that when they can write it like this:
1 + 2j
Given that complex literals are already a thing, of course you are correct that if I ever needed a complex literal, I would use the literal syntax. But that's the point: it is *literal syntax* handled by the compiler at compile time, not syntactic sugar for a runtime function call that has to inefficiently parse a string. Because it is built-in to the language, we don't have to do this:

    def c(astring):
        assert isinstance(astring, str)
        # Parse the string at runtime
        real, imag = ...
        return complex(real, imag)

    z = c"1.23 + 4.56j"

(I'm aware that the complex constructor actually does parse strings already, so in *this specific* example we don't have to write our own parser. But that doesn't apply in the general case.) That is nothing like complex literals:

    py> from dis import dis
    py> dis(compile('1+2j', '', 'eval'))
      1           0 LOAD_CONST               2 ((1+2j))
                  3 RETURN_VALUE

    # Hypothetical byte-code generated from custom string prefix
    py> dis(compile("c'1+2j'", '', 'eval'))
      1           0 LOAD_NAME                0 (c)
                  3 LOAD_CONST               0 ('1+2j')
                  6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
                  9 RETURN_VALUE

Note that in the first case, we generate a complex literal at compile time; in the second case, we generate a *string* literal at compile time, which must be parsed at runtime. This is not a rhetorical question: if we didn't have complex literals, why would you write your complex number as a string, deferring parsing it until runtime, when you could parse it in your head at edit-time and call the constructor directly?

    z = complex(1.23, 4.56)  # Assuming there was no literal syntax.
I don't think it is. I think the big deals in this proposal are:

- you have something that looks like a kind of string czt'...' but is really a function call that might return absolutely anything at all;
- you have a redundant special case for calling functions that take a single argument, but only if that argument is a string literal;
- you encourage people to write cryptic single-character functions, like v(), x(), instead of meaningful expressions like Version() and re.compile();
- you encourage people to defer parsing that could be efficiently done in your head at edit time into slow and likely inefficient string parsing done at runtime;
- the OP still hasn't responded to my question about the ambiguity of the proposal (is czt'...' one three-letter prefix, or three one-letter prefixes?)

all of which *hugely* outweighs the gain of being able to avoid a pair of parentheses. [...]
Indeed, but this proposal doesn't help you here. You still have to write strings. What you want is a float32 literal, let's say 1.23f, but what you have to write is a function call with a string argument, f('1.23'). All this proposal buys you is to drop the parentheses: f'1.23'. It's still a function call, except it looks like a string. While I'm sympathetic, and I'd like to see a Decimal literal, I doubt that there's enough use-cases outside of specialists like yourself for 16- or 32-bit floats to justify making them built-ins with literal syntax. Python is not Julia :-) But if you could get the numpy people to write a PEP... *wink* -- Steven

On Aug 27, 2019, at 18:59, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, you’re the same person who got hung up on the fact that these affixes don’t really give us “literals” back in either 2013 or 2016, and I don’t want to rehash that argument. I could point out that nobody cares that -1 isn’t really a literal, and almost nobody cares that the CPython optimizer special-cases its way around that, and the whole issue with Python having three different definitions of “literal” that don’t coincide, and so on, but we already had this conversation and I don’t think anyone but the two of us cared. What matters here is not whether things like the OP’s czt'abc' or my 1.23f or 1.23d are literals to the compiler, but whether they’re readable ways to enter constant values to the human reader. If so, they’re useful. Period. Now, it’s possible that even though they’re useful, the feature is still not worth adding because of Chris’s issue that it can be abused, or because there’s an unavoidable performance cost that makes it a bad idea to rely on them, or because they’re not useful in _enough_ code to be worth the effort, or whatever. Those are questions worth discussing. But arguing about whether they meet (one of the three definitions of) “literal” is not relevant.
And to drop the quotes as well. And to avoid polluting the global namespace with otherwise-unused one-character function names. Can you honestly tell me that you see no significant readability difference between these examples:

    vec = [1.23f, 2.5f, 1.11f]
    vec = [f('1.23'), f('2.5'), f('1.11')]

I think anyone would agree that the former is a lot more readable. Sure, you have to learn what the f suffix means, but once you do, it means all of the dozens of constants in the module are more readable. (And of course most people reading this code will probably be people who are used to 3D code and already _expect_ that format, since that’s how you write it in C, in shaders, etc.)
Sure, just like you can’t apply an r or f prefix to a string expression.
I don’t think your experience here is typical. I can’t think of a good way to search GitHub python repos for uses of j, but a hacky search immediately turned up this numpy issue: https://github.com/numpy/numpy/issues/13179:
A fast way to get the inverse of angle, i.e., exp(1j * a) = cos(a) + 1j * sin(a). Note that for large angle arrays, exp(1j*a) needlessly triples memory use…
That doesn’t prove that people actually call it with `1j * a` instead of `complex(0, a)`, but it does seem likely.
I’m not sure. I assume you’d be against it, but I suspect that most of the people who use it today would be for it. But if we had custom affixes, I think everyone would be happy with “just define a custom j suffix”. Would anyone really argue that they need the performance benefit or compile-time handling? How often do you evaluate zillions of constants in the middle of a tight loop? And what other argument would there be for adding it to the grammar and the compiler and forcing every project to use it? Which is exactly what I think of the Decimal and Fraction suffixes, contrary to what Chris says. There will be a small number of projects that get a lot of readability benefit, but every other project gains nothing, so why add it as a builtin for every project? And I don’t see why float32 is any different from Decimal and Fraction, given that the actual problem is not lossless values but readability, so a builtin Decimal suffix wouldn’t help there.
I’m sure you can guess my answer to that: most projects don’t need it, so there should definitely not be builtin suffixes for it. But if we have custom suffixes, then if anyone _does_ need it, they can do it trivially, without having to bother the rest of us asking for it.
You’ve cut off my paragraph to make it misleading. The visual noise of the parens _is_ a bigger deal than the two extra keystrokes, but as I said in the very next sentence, it’s still not the point of the feature. The real big deal is that it lets you write complex numbers in a way that looks like complex numbers. I don’t see any benefit in arguing about whether the “bigger deal but still not the point” is actually a bigger deal or not, because who cares?
It doesn’t return “anything at all”, any more than a function returns “anything at all”. It returns something consistent that has a specific meaning in your project. I don’t know what czt means a priori, but if I were reading the OP’s code, I could look it up, and then I would know. And I could assume that, unless the author is an idiot, an affix on a string literal is going to be something stringy, and an affix on a number literal is going to be something numbery. Sure, you _could_ violate that assumption, but that’s no different from the fact that you could write a function called sqrt(n) that returns an iterable of the contents of all the files in the nth subdirectory of $HOME. You’re not going to do that. (Or, if you do, I’m going to stop reading your code.)
Do you honestly not see the readability benefit in a bunch of constants that all look like `2.3d` instead of `Decimal('2.3')`? Do you honestly think that `D('2.3')` is just as good as `2.3d`, and also worth using up a scarce resource (one-letter global names) for? If not, then I don’t see why you’re pretending not to see the benefit of the proposal.
Well, this is why I wanted him to get into more details than his initial 10000-feet-vision thing. I honestly see a lot more use for numeric affixes than string ones, and for suffixes rather than prefixes, and for just one suffix per value rather than some rule for combining them. But I know that last time around (or maybe the time before), a sql prefix was the thing that got the most people excited, and I could see wanting to combine that with raw or not, and so on, so I’d like to see a concrete proposal on how all of that works.
all of which *hugely* outweighs the gain of being able to avoid a pair of parentheses.
Which, you’ll note, I already said was not the point of the proposal.
No I don’t. I write `1.23f`, just like I do in C or in GLSL, exactly what I want to write, and read. In my version of the proposal (which I described in my first email in the thread, including a link to a hacky import hook that implements it, which I wrote up the last time this subject came up in 2015), when the parser sees a literal token (any kind of literal, not just string literals) followed by a string of identifier characters (that weren’t munched by that literal), it looks up that string and calls the looked-up function with the raw text of the literal token. The OP’s second email in the thread incorporated my idea into his existing idea. His version is more complicated than mine because it handles prefixes as well as suffixes, and it doesn’t have a proof of concept to verify that it’s all doable unambiguously, but it still allows me to write `1.23f`.
What you want is a float32 literal, let's say 1.23f
I don’t care whether it’s an actual literal. I do care that I can write it as `1.23f`. And both my proposal and the OP’s allow that, so I’m happy with either.
See, I would _not_ like to see a builtin Decimal literal. Or float16 or float32 or fixed1616, or Fraction, or _any_ other new kinds of literals. As long as I can use `1.23f` to mean `float32('1.23')` in the projects where I have a mess of float32 constants (and maybe copy-paste them into my REPL from a shader or a C debugger session), I’m happy. And I think user-defined affixes are a better way to get that than trying to convince everyone that Python should add a builtin float32 type and all the math that goes with it and then add a suffix for float32 literals. Not because I doubt I could convince anyone of that, not because I’d have to wait 5 years before I could start using it even if I could, but because it’s simply not right in the first place. Python should not have a builtin float32 type. And therefore, Python should not have a builtin `f` suffix.

And even if that weren’t an issue, I’d _still_ rather have a custom affix feature than a mess of new builtin ones. If I run into an unfamiliar affix in your code, I’d rather look it up in your project than consult a table with a mess of builtin affixes. If I want `d` for Decimal in one project, why should that mean nobody can ever use `d` for something different in a project that doesn’t do any decimal math but does a whole bunch of… something else I can’t think of off the top of my head, but I’m sure there will be at least one suffix that has two good meanings in widely different uses. Just like the namespace for builtin functions, the namespace for builtin affixes is, and should be, a limited resource.

And meanwhile, what would I get from `d` being builtin? I can save one import or register call or whatever per project. My program that takes 80 seconds to run starts up 2us faster because a few dozen (or at most a hundred) constructed constants can be stored in the .pyc file. I don’t have to watch my speech to carefully avoid using the word “literal” imprecisely. None of those are worth anywhere near as much to me as being able to have the suffixes I want for the project, even if I just thought of them today, and to _not_ have the ones I _don’t_ want, even if some of them have a broader niche.

On Wed, 28 Aug 2019 at 05:04, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Extended (I'm avoiding the term "custom" for now) literals like 0.2f, 3.14D, re/^hello.*/ or qw{a b c} have a fairly solid track record in other languages, and I think in general have proved both useful and straightforward in those languages. And even in Python, constructs like f-strings and complex numbers are examples of such things. However, I know of almost no examples of other languages that have added *user-definable* literal types (with the notable exception of C++, and I don't believe I've seen use of that feature in user code - which is not to say that it's not used). That to me says that there are complexities in extending the question to user-defined literals that we need to be careful of.

In my view, the issue isn't abuse of the feature, or performance, or limited value. It's the very basic problem that it's *really hard* to define and implement such a feature in a way that everyone is happy with - particularly in a language like Python which doesn't have a user-exposed "compile source to binary" step (I tried very hard to cover myself against nitpicking there - I'm sure I failed, but please, don't get sidetracked, you know what I mean here :-)). Some specific questions which would need to be dealt with:

1. What is valid in the "literal" part of the construct (this is the p"C:\" question)?
2. How do definitions of literal syntax get brought into scope in time for the parser to act on them (this is about "import xyz_literal" making xyz"a string" valid but leaving abc"a string" as a syntax error)?

These questions also fundamentally affect other tools like IDEs, linters, code formatters, etc.

In addition, there is the question of how user-defined literals would get turned into constants within the code. In common with list expressions, tuples, etc, user-defined literals would need to be handled as translating into runtime instructions for constructing the value (i.e., a function call). But people typically don't expect values that take the form of a literal like this to be "just" syntax sugar for a function call. So there's an education issue here. Code will get errors at runtime that the users might have expected to happen at compile time, or in the linter.

It's not that these questions can't be answered. Obviously they can, as you produced a proof of concept implementation. But the design trade-offs that one person might make are deeply unsatisfactory to someone else, and there's no "obviously right" answer (at least not yet, as no-one Dutch has explained what's obvious ;-))

Also, it's worth noting that the benefits of *user-defined* literals are *not* the same as the benefits of things like 0.2f, or 3.14d, or even re/^hello.*/. Those things may well be useful. But the benefit you gain from *user-defined* literals is that of letting the end user make the design decisions, rather than the language designer. And that's a subtly different thing.

So, to summarise, the real problem with user-defined literal proposals is that the benefit they give hasn't yet proven sufficient to push anyone to properly address all of the design-time details. We keep having high-level "would this be useful" debates, but never really focus on the key question of what, in precise detail, is the "this" that we're talking about - so people are continually making arguments based on how they conceive such a feature might work. A really good example here is the p"C:\" question. Is the proposal that the "string part" of the literal is just a normal string?
If so, then how do you address the genuine issue that not all paths are valid string literals? What about backslash-escapes (p"C:\temp")? Is the string a raw string or not? If the proposal is that the path-literal code can define how the string is parsed, then *how does that work*? The OP even made this point explicitly:
I'm not discussing possible implementation of this feature just yet, we can get to that point later when there is a general understanding that this is worth considering.
I don't think we *can* agree on much without the implementation details (well, other than "yes, it's worth discussing, but only if someone proposes a properly specified design" ;-)) Paul

On 2019-08-28 01:05, Paul Moore wrote:
However, I know of almost no examples of other languages that have added *user-definable* literal types (with the notable exception of
I believe there is such a feature in modern JavaScript: https://developers.google.com/web/updates/2015/01/ES6-Template-Strings#tagge... -Mike

On Wed, Aug 28, 2019 at 04:02:26PM +0100, Paul Moore wrote:
Elixir has something it calls sigils. It seems to be basically the map-to-function variant: https://elixir-lang.org/getting-started/sigils.html Konstantin

In addition, there is the question of how user-defined literals would get turned into constants within the code.
So, I'm just brainstorming here, but how about the following approach:

- Whenever the compiler sees `abc"def"`, it creates a constant of the type `ud_literal` with fields `.prefix="abc"`, `.content="def"`.
- When it compiles a function, then instead of a `LOAD_CONST n` op it would emit `LOAD_UD_CONST n`.
- This new op first checks whether its argument is a `ud_literal`, and if so calls the `.resolve()` method on that argument. The method should call the prefix with the content, producing an object that the LOAD_UD_CONST op stores back in the `co_consts` storage of the function. It is a TypeError for the resolve method to return another ud_literal.
- Subsequent calls to the LOAD_UD_CONST op will see that the argument is no longer a ud-literal, and will return it as-is.

This system would allow each constant to be evaluated only once and subsequently memoized, and only compute those constants that will actually be used. A rough model of the opcode's behaviour is sketched below.
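Here is a rough pure-Python model of that behaviour, just to pin the semantics down; the class and function names are made up for illustration, and a list stands in for the (really immutable) co_consts tuple:

    class ud_literal:
        """Placeholder constant the compiler would create for abc"def"."""
        def __init__(self, prefix, content):
            self.prefix = prefix
            self.content = content

        def resolve(self, registry):
            result = registry[self.prefix](self.content)
            if isinstance(result, ud_literal):
                raise TypeError("resolve() must not return a ud_literal")
            return result

    def load_ud_const(consts, n, registry):
        # What LOAD_UD_CONST would do: resolve on first use, then store
        # the result back so later loads return it as-is.
        if isinstance(consts[n], ud_literal):
            consts[n] = consts[n].resolve(registry)
        return consts[n]

    from datetime import date
    registry = {'dt': date.fromisoformat}
    consts = [ud_literal('dt', '2019-08-28')]
    print(load_ud_const(consts, 0, registry))  # parsed once
    print(load_ud_const(consts, 0, registry))  # memoized result, as-is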

I don't usually work with Windows, but I can see how this could be a pain point for Windows users. They need both backslashes and the quotation marks in their paths. As nobody has suggested yet how to deal with the problem, I'd like to give it a try. Behold:

    p{C:\}

The part within the curly braces is considered a "really-raw" string. The "really-raw" means that every character is interpreted exactly as it looks; there are no escape characters. Internal braces will be allowed too, provided that they are properly nested:

    p{C:\"Program Files"\{hello}\}

If you **need** to have unmatched braces in the string, your last hope is the triple-braced literal:

    p{{{Letter Ж looks like }|{... }}}

The curly braces can only be used with a string prefix (or suffix?). And while we're at it, why not allow chained literals:

    re{(\w+)}{"\1"}
    frac{1}{17}
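For what it's worth, the balanced-brace scan the single-brace form would need is simple. Here is an illustrative sketch (not a proposed implementation) under the rule above, where braces nest and backslashes mean nothing:

    def scan_braced(source, start):
        """Return (content, end) for a {...}-delimited really-raw string
        beginning at source[start] == '{'.  Braces must balance; no
        escape characters are recognised."""
        assert source[start] == '{'
        depth = 0
        for i in range(start, len(source)):
            if source[i] == '{':
                depth += 1
            elif source[i] == '}':
                depth -= 1
                if depth == 0:
                    return source[start + 1:i], i + 1
        raise SyntaxError('unterminated braced literal')

    content, end = scan_braced(r'p{C:\"Program Files"\{hello}\}', 1)
    print(content)  # C:\"Program Files"\{hello}\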

On Aug 28, 2019, at 01:05, Paul Moore <p.f.moore@gmail.com> wrote:
Agreed 100%. That’s why I think we need a more concrete proposal, that includes at least some thought on implementation, before we can go any farther, as I said in my first reply. The OP wanted to get some feeling of whether at least some people might find some version of this useful before going further. I think we’ve got that now (the fact that not 100% of the responders agree doesn’t change that), so we need to get more detailed now. My own proposal was just to answer the charge that any design will inherently be impossible or magical or complicated, by giving a design that is none of those. It shouldn’t be taken as any more than that. If there are good use cases for prefixes, prefixes plus suffixes, etc., then my proposal can’t get you there, so let’s wait for the OP’s detailed proposal.
I think this pretty much has to be either (a) exactly what’s valid in the equivalent literals today, or (b) something equally simple to describe, and parse, even if it’s different (like really-raw strings, or perlesque regex with delimiters other than quotes, or whatever). Either way, I think you want to use the same rule for all affixed literals, not allow a choice of different ones like C++ does.
I don’t know that this is actually necessary. If `abc"a string"` raises an error at execution time rather than compile time, yes, that’s different from how most syntax errors work today, but is it really unacceptable? (Notice that in the most typical case, the error still gets raised from importing the module or from the top level of the script—but that’s just the most typical case, not all cases—you could get those errors from, say, calling a method, which you don’t normally expect.)

There’s clearly a trade-off here, because the only other alternative (at least that I’ve thought of or seen from anyone else; I’d love to be wrong) is that what you’ve imported and/or registered affects how later imports work (and doesn’t that mean some kind of registry hash needs to get encoded in .pyc files or something too?). While that is normal for people who use import hooks, most people don’t use import hooks most of the time, and I suspect that weirdness would be more off-putting than the late errors.

Another big one: How do custom prefixes interact with builtin string prefixes? For suffixes, there’s no problem suffixing, say, a b-string, but for prefixes, there is. If this is going to be allowed, there are multiple ways it could be designed, but someone has to pick one and specify it. (Actually, for suffixes, there _is_ a similar issue: is `1.2jd` a `d` suffix on the literal `1.2j`, or a `jd` suffix on `1.2`? I think the former, because it’s a trivially simple rule that doesn’t need to touch any of the rest of the grammar. Plus, not only is it likely to never matter, but on the rare cases where it does matter, I think it’s the rule you’d want. For example, if I created my own ComplexDecimal class and wanted to use a suffix for it, why would I want to define both `d` and `jd` instead of just defining `d` and having it work with imaginary literals?)
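To illustrate that last rule with a toy handler (everything here is hypothetical, including the tuple standing in for an actual ComplexDecimal type):

    from decimal import Decimal

    def complex_decimal(text):
        # Hypothetical handler registered for the 'd' suffix.  Because
        # 1.2jd means d applied to the literal 1.2j (not jd applied to
        # 1.2), the handler just accepts the j-form text as well:
        if text.endswith(('j', 'J')):
            return (Decimal('0'), Decimal(text[:-1]))  # (real, imag) pair
        return (Decimal(text), Decimal('0'))

    # 1.2jd  ->  complex_decimal('1.2j')
    print(complex_decimal('1.2j'))  # (Decimal('0'), Decimal('1.2'))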
These questions also fundamentally affect other tools like IDEs, linters, code formatters, etc.
Good point. I was thinking that any rule that’s simple enough for Python and humans to parse will probably be reasonably simple for other tools, and any rule that isn’t simple enough for Python and humans is probably a non-starter anyway. But the “lookup affixes at compile time” idea is an example of something that would be easy for Python and for humans but difficult for single-file-at-a-time tools, so this can be important.
I really don’t think this one is a serious issue. Many people never need to learn that -2, 1+2j, (1,2), etc. are not literals, or which of those get optimized by CPython and packed into co_consts anyway, or which things that don’t even look like literals get similarly optimized. So how often will they need to know whether 1.23f is a literal, not a literal but optimized into a const, or neither?
That’s a good point, but I think you’re missing something big here. Think about it this way: assuming f and frac and dec and re and sql and so on are useful, our options are:

1) people don’t get a useful feature
2) we add user-defined affixes
3) we add all of these as builtin affixes

While #3 theoretically isn’t impossible, it’s wildly implausible, and probably a bad idea to boot, so the realistic choice is between 1 and 2. Now, you’re right that choice 2 inherently means that we’re putting a new design decision on the end user (or library designer). That is definitely a factor to be weighed in the decision. But I don’t think it’s an immediate disqualifying factor. And in fact, if the feature is properly designed to be restrictive enough (but not too restrictive), I don’t think it will even end up being that big of a deal. There are all kinds of things that we leave up to the user, from the trivial (e.g., in Haskell, a capital letter means a type rather than a value; in Python it’s entirely up to each project whether it means anything at all) to the drastic but rarely used (import hooks probably being the most extreme). This one isn’t going to be trivial, but I think it will fall much closer to the less-disruptive side than many people are assuming. It’s only going to touch a small part of the grammar, and the language in general. (And if that turns out not to be true of the actual proposal, then I probably won’t support the actual proposal.)
Again, agreed.

On Thu, 29 Aug 2019 at 01:18, Andrew Barnert <abarnert@yahoo.com> wrote:
That's a completely different point. Built-in affixes are defined by the language, user-defined affixes are defined by the user (obviously!). That includes all aspects of design - both how a given affix works, and whether it's justified to have an affix at all for a given use case. The argument is identical to that of user-defined operators vs built-in operators. If you can use this argument to justify user-defined affixes, it applies equally to user-defined operators, which is something that has been asked for far more often, with much more widespread precedents in other languages, and been rejected every time.

Regarding your cases #1, #2, and #3, this is the fundamental point of language design - you have to choose whether a feature is worthwhile (in the face of people saying "well *I* would find it useful"), and whether to provide a general mechanism or make a judgement on which (if any) use cases warrant a special-case language builtin. If you assume everything should be handled by general mechanisms, you end up at the Lisp/Haskell end of the spectrum. If you decide that the language defines the limits, you are at the C end. Traditionally, Python has been a lot closer to the "language defined" end of the scale than the "general mechanisms" end. You can argue whether that's good or bad, or even whether things should change because people have different expectations nowadays, but it's a fairly pervasive design principle, and should be treated as such. This actually goes back to the OP's point:
we can get to that point later when there is a general understanding that this is worth considering
The biggest roadblock to a "general understanding that this is worth considering" is precisely that Python has traditionally avoided (over-) general mechanisms for things like this. The obvious other example, as I mentioned above, being user defined operators. I've been very careful *not* to use the term "Pythonic" here, as it's too easy for that to be a way of just saying "my opinion is more correct than yours" without a real justification, but the real stumbling block for proposals like this tends to be far less about the technical issues, and far *more* about "does this fit into the philosophy of Python as a language, that has made it as successful as it is?" My instinct is that it doesn't fit well with Python's general philosophy. Paul

On Aug 29, 2019, at 00:58, Paul Moore <p.f.moore@gmail.com> wrote:
And if you don’t make either assumption, but instead judge each case on its own merits, you end up with a language which is better than languages at either extreme. There are plenty of cases where Python generalizes beyond most languages (how many languages use the same feature for async functions and sequence iteration? or get metaclasses for free by having only one “kind” and then defining both construction and class definitions as type calls?), and plenty where it doesn’t generalize as much as most languages, and its best features are found all across that spectrum.

You can’t avoid tradeoffs by trying to come up with a rule that makes language decisions automatically. (If you could, why would this list even exist?) The closest thing you can get to that is the vague and self-contradictory and facetious but still useful Zen. If you really did try to zealously pick one side or the other, always avoiding general solutions whenever a hardcoded solution is simpler no matter what, the best-case scenario would be something like Go, where a big ecosystem of codegen tools defeats your attempt to be zealous and makes your language actually usable despite your own efforts, until soon you start using those tools even in the stdlib.

Also, I’m not sure the spectrum is nearly as well defined as you imply in the first place. It’s hard to find a large C project that doesn’t use the hell out of preprocessor macros to effectively create custom syntax for things like error handling and looping over collections (not to mention M4 macros to autoconf the code so it’s actually portable instead of just theoretically portable), and meanwhile Haskell’s syntax is chock full of special-purpose features you couldn’t build yourself (would anyone even use the language without, say, do blocks?).

On Thu, 29 Aug 2019 at 14:21, Andrew Barnert <abarnert@yahoo.com> wrote:
You can’t avoid tradeoffs by trying to come up with a rule that makes language decisions automatically. (If you could, why would this list even exist?) The closest thing you can get to that is the vague and self-contradictory and facetious but still useful Zen.
Sorry, I wasn't trying to imply that you could. Just that choosing to implement some, but not all, possible literal affixes on a case by case basis was a valid language design option, and one that is taken in many cases. Your statement
seemed to imply that you thought it was an "all or nothing" choice. My apologies if I misunderstood your point. Paul

all of which hugely outweighs the gain of being able to avoid a pair of parentheses.
Thank you for summarizing the main objections so succinctly, otherwise it becomes too easy to get lost in the discussion. Let me try to answer them as best as I can:
This is kinda the whole point. I understand, of course, how the idea of a string-that-is-not-a-string may sound blasphemous, however I invite you to look at this from a different perspective. Today's date is 2019-08-28. The date is a moment in time, or perhaps a point in the calendar, but it is certainly not a string. How do we write this date in Python? As `datetime("2019-08-28")`. We are forced to put the date into a string and pass that string into a function to create an actual datetime object. With this proposal the code would look something like `dt"2019-08-28"`. You're right, it's not a string anymore. But it *should not* have been a string to begin with; we only used a string there because Python didn't offer us any other way. Now with prefixed strings justice is finally done: we are able to express the notion of <a specific date> directly. And the fact that it may still use strings under the hood to achieve the desired result is really an implementation detail, which may even change at some point in the future.
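For instance, the handler behind a dt prefix could be as small as this (date.fromisoformat really does parse this format; the prefix registration itself is the hypothetical part):

    from datetime import date

    def dt(text):
        # Would be registered as the handler for dt'...'.
        return date.fromisoformat(text)

    # dt"2019-08-28" would then evaluate to date(2019, 8, 28):
    print(dt("2019-08-28"))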
There are many things in Python that are in fact function calls in disguise. Decorators? Function calls. Imports? Function calls. Class definition? Function call. Getters/setters? Function calls. Attribute access? Function calls. Even a function call is a function call via `__call__()`. I may be oversimplifying a bit, but the point is that just because something can be written as a function call doesn't mean it's the most natural way of doing it. Besides, there are use cases (such as `sql'...'`) where people do actually want to have a function that is constrained to string literals only. Having said that, prefixed (suffixed) strings (numbers) are not *exactly* equivalent to function calls. The points of difference are:

- prefixes/suffixes are namespaced separately from regular variable names;
- their results can be automatically memoized, bringing them closer to builtin literals.
Which is why I suggested to put them in a separate namespace. You're right that function `v()` is cryptic and should be avoided. But a prefix `v"..."` is neither a function nor a variable, it's ok for it to be short. The existing string prefixes are all short after all.
I don't encourage such a thing, it's just that most often there is no other way. For example, consider regular expression `[0-9]+`. I can "parse it in my head" to understand that it means a sequence of digits, but how exactly am I supposed to convey this understanding to Python? Or perhaps I can parse "2019-08-28" in my head, and write in Python `datetime(year=2019, month=8, day=28)`. However, such a form would greatly reduce readability of the code from the humans' perspective. And human readability matters more than computer readability, for now. In fact, purely from the efficiency perspective, the prefixed strings can potentially have better performance because they are auto-memoized, while `datetime("2019-08-28")` needs to re-parse its input string every time (or add its own internal memoization, but even that would be less efficient because it doesn't know the input is a literal string).
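As a sketch of that memoization point: an rx prefix handler could simply cache its compiled patterns, so a pattern occurring in a hot path is parsed only once (all names here are illustrative, not part of any actual API):

    import functools
    import re

    @functools.lru_cache(maxsize=None)
    def rx(pattern):
        # Hypothetical handler for rx'...': compile once, then reuse.
        return re.compile(pattern)

    # rx"[0-9]+" would evaluate to the same compiled pattern every time:
    print(rx('[0-9]+').findall('a1b22c333'))  # ['1', '22', '333']
    assert rx('[0-9]+') is rx('[0-9]+')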
Sorry, I thought this part was obvious. It's a single three-letter prefix.

On Wed, Aug 28, 2019 at 10:01:25PM -0000, stpasha@gmail.com wrote:
Yes, I understand that. And that's one of the reasons why I think that this is a bad idea. Since Python is limited to ASCII syntax, we only have a small number of symbols suitable for delimiters. With such a small number available:

- parentheses () are used for grouping and function calls;
- square brackets [] are used for lists and subscripting;
- curly brackets {} are used for dicts and sets;
- quote marks are used for bytes and strings.

And with your proposal:

- quote marks are also used for function calls, but only a limited subset of function calls (those which take a single string literal argument).

Across a large majority of languages, it is traditional and common to use round brackets for grouping and function calls, and square and curly brackets for collections. There are a handful of languages, like Mathematica, which use [] for function calls.
I understand, of course, how the idea of a string-that-is-not-a-string may sound blasphemous,
It's not a matter of blasphemy. It's a matter of readability and clarity.
We are "forced" to write that are we? Have you ever tried it? py> from datetime import datetime py> datetime("2019-08-28") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: an integer is required (got type str)
    py> datetime(2019, 8, 28)
    datetime.datetime(2019, 8, 28, 0, 0)

It is difficult to take your argument seriously when so much of it rests on things which aren't true. -- Steven

On Aug 29, 2019, at 04:58, Steven D'Aprano <steve@pearwood.info> wrote:
This is a disingenuous argument. When you read spam.eggs, of course you know that that means to call the __getattr__('eggs') method on spam. But do you actually read it as a special method-calling syntax that’s restricted to taking a single string that must be an identifier as an argument, or do you read it as accessing the eggs member? Of course you read it as member access, not as a special restricted calling syntax (except in rare cases—e.g., you’re debugging a __getattribute__), because to do otherwise would be willfully obtuse, and would actively impede your understanding of the code. And the same goes for lots of other cases, like [1:7]. And the same goes for regex"a.*b" or 1.23f as well. Of course you’ll know that under the covers that means something like calling __whatever_registry__['regex'] with the argument "a.*b", but you’re going to think of it as a regex object or a float object, not as a special restricted calling syntax, unless you want to actively impede your understanding of the code.
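To spell the analogy out in code (a toy demonstration, nothing more):

    class Spam:
        eggs = 42

    spam = Spam()
    # Attribute syntax is sugar for a special method call whose single
    # string argument is restricted to identifier form, yet nobody
    # reads spam.eggs as a "restricted calling syntax":
    assert spam.eggs == type(spam).__getattribute__(spam, 'eggs')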

On Thu, Aug 29, 2019 at 05:30:39AM -0700, Andrew Barnert wrote:
You make a good point about abstractions, but you are missing the critical point that spam.eggs *doesn't look like a string*. Things that look similar should be similar; things which are different should not look similar.

I acknowledge your point (and the OP's) that many things in Python are ultimately implemented as function calls. But none of those things look like strings:

- The argument to the import statement looks like an identifier (since it is an identifier, not an arbitrary string);
- The argument to __getattr__ etc looks like an identifier (since it is an identifier, not an arbitrary string);
- The argument to __getitem__ is an arbitrary expression, not just a string.

All three are well understood to involve runtime lookups: modules must be searched for and potentially compiled, object superclass inheritance hierarchies must be searched, items or keys in a list or dict must be looked up. None of them suggest a constant literal in the same way that "" string delimiters do.

The large majority of languages follow similar principles, allowing for usually minor syntactic differences. Some syntactic conventions are very weak, and languages can and do differ greatly. But some are very, very strong, e.g.:

- 123.4567 is nearly always a numeric float of some kind, rather than (say) multiplying two ints;
- ' and/or " are nearly always used for delimiting strings.

Even languages like Forth, which have radically different syntax to mainstream languages, sort-of follow that convention of associating quote marks with strings. ." outputs the following character string, terminating at the next " character, i.e. ." foo" in Forth would be more or less equivalent to print("foo") in Python.

Let me suggest some design principles that should hold for languages with more-or-less "conventional" syntax (languages like APL or Forth excluded):

- anything using ' or " quotation marks as delimiters (with or without affixes) ought to return a string, and nothing but a string;
- as a strong preference, anything using quotation marks as delimiters ought to be processed at compile-time (f-strings are a conspicuous exception to that principle);
- using affixes for numeric types seems like a fine idea, and languages like Julia that offer a wide range of builtin numeric types show that this works fine; in Python 2 we used to have native ints and longints that took an L suffix, so there's precedent there.

[...]
No I'm not. I'm going to think of it as a *string*, because it looks like a string. Particularly given the OP's preference for single-letter prefixes. 1.23f doesn't look like a string, it looks like a number. I have no objection to that in principle, although of course there is a question whether float32 is important enough to justify either builtin syntax or custom, user-defined syntax. -- Steven

On Thu, 29 Aug 2019 at 15:54, Steven D'Aprano <steve@pearwood.info> wrote:
This will degenerate into nitpicking very fast, so let me just say that I understand the general idea that you're trying to express here. I don't entirely agree with it, though, and I think there are some fairly common violations of your suggestion below that make your arguments less persuasive than maybe you'd like.
- anything using ' or " quotation marks as delimiters (with or without affixes) ought to return a string, and nothing but a string;
In C, Java and C++, 'x' is an integer (char). In SQL (some dialects, at least) TIMESTAMP'2019-08-22 11:32:12' is a TIMESTAMP value. In Python, b'123' is a bytes object (which maybe you're willing to classify as "a string", but the line blurs quite fast). Paul

On Aug 29, 2019, at 07:52, Steven D'Aprano <steve@pearwood.info> wrote:
Which is exactly why you’d read 1.23dec or 1.23f as a number, because it looks like a number and also acts like a number, rather than as a function call that takes the string '1.23', even if you know that’s how it’s implemented. And most of the string affixes people have suggested are for string-ish things. I’m not sure what a “version string” is, but I might design that as an actual subclass of str that adds extractor methods and overrides comparison. A compiled regex isn’t literally a string, but neither is a bytes; it’s still clearly _similar_ to a string, in important ways. And so is a path, or a URL (although I don’t know what you’d use the url prefix for in Python, given that we don’t have a string-ish type like ObjC’s NSURL to return and I don’t think we need one, but presumably whoever wrote the url affix would be someone who disagreed and packaged the prefix with such a class). And versions of the proposal that allow delimiters other than quotes so you can write things like regex/a.*b/, well, I’d need to see a specific proposal to be sure, but that seems even less objectionable in this regard. That looks like nothing else in Python, but it looks like a regex in awk or sed or perl, so I’d probably read it as a regex object.
The arguments to the dec and f affix handlers look like numeric literals, not arbitrary strings. The arguments to path and version are… probably string literal representations (with the quotes and all), not arbitrary strings. Although that does depend on the details of the specific proposal: if _any_ of your killer uses needs uncooked strings, then either you come up with something overcomplicated like C++, where you can register three different kinds of affixes, or you just always pass uncooked strings (because it’s trivial to cook on demand but impossible to de-cook). And the arguments to regex may be some _other_ kind of restricted special string that… I don’t think anyone has tried to define yet, but you can vaguely imagine what it would have to be like, and it certainly won’t be any arbitrary string.
So b"abc" should not be allowed? Let’s say I created a native-UTF16-string type to deal with some horrible Windows or Java stuff. Why would this principle of yours suggest that I shouldn’t be allowed to use u16"" just like b””? This is a design guideline for affixes, custom or otherwise. Which could be useful as a filter on the list of proposed uses, to see if any good ones remain (and if no string affix uses remain, then of course the proposal is either useless or should be restricted to just numbers or whatever), but it can’t be an argument against all affixes, or against custom affixes, or anything else generic like that.
I don’t see why you should even want to _know_ whether it’s true, much less have a strong preference. Here are things you probably really do care about: (a) they act like strings, (b) they act like constants, (c) if there are potential issues parsing them, you see those issues as soon as possible, (d) working with them is more than fast enough. Compile time is neither necessary (Haskell) nor sufficient (Tcl) for any of that. So why insist on compile-time instead of insisting on a-d?
No I'm not. I'm going to think of it as a *string*, because it looks like a string.
Well, yes. It’s a path string, or a regex string, or a version string, or whatever, which is loosely a kind of string but not literally one. Like bytes. Or it’s a sql cursor, in which case it was probably a misuse of the feature.
Particularly given the OP's preference for single-letter prefixes.
OK, I will agree with you there that the overuse of single-letter prefixes in the motivating examples is a worrying sign. In principle there’s nothing wrong with single letters (and I think I can make a good case for the f suffix as a good use in 3D-math code). And a program that used a whole ton of version strings and version string constants might find it useful to use v instead of ver. But I’m having a hard time imagining such a program existing. (Even something like pip or the PyPI backend might have lots of version strings, but why would it have lots of version string constants?) So, maybe that’s a sign that the OP’s eventual detailed set of use cases is not going to make me happy. Of course the burden is on the proposer, and if every proposed string affix use case ends up looking bad, then I’d either oppose the proposal or suggest that it be restricted to numeric affixes or something. But that’s not a reason to reject the proposal before seeing it, or to argue that whatever it is can’t conceivably be good because of [some posited universal principle that doesn’t even hold in Python today].
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint. But it’s not just you; at least three people (all of whom dislike the whole concept of custom affixes) seem at least in principle open to the idea of adding builtin affixes for types that don’t exist. Which makes me think it’s almost certainly not that you’re all crazy, but that I’m missing something important. Can you explain it to me?

On 29/08/2019 22:10:21, Andrew Barnert via Python-ideas wrote:
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint.
+1 (maybe that means I'm missing something). Just curious: Is there any reason not to make decimal.Decimal a built-in type? It's tried and tested. There are situations where floats are appropriate, and others where Decimals are appropriate (I'm currently using it myself); conceptually I see them as on an equal footing. If it were built-in, there would be good reason to accept 1.23d meaning a Decimal literal (distinct from a float literal), whether or not (any part of) the OP was adopted. Rob Cliffe

On Thu, Aug 29, 2019 at 11:19:58PM +0100, Rob Cliffe via Python-ideas wrote:
Just curious: Is there any reason not to make decimal.Decimal a built-in type?
Yes: it is big and complex, with a big complex API that is overkill for the major motivating use-case for a built-in decimal type. There might be a strong case for adding a fixed-precision decimal type, and leaving out the complex parts of the Decimal API: no variable precision, just a single rounding mode, no contexts, no traps. If you need the full API, use the decimal module; if you just need something like builtin floats, but in base 10, use the built-in decimal. There have been at least two proposals. Neither has got so far as a PEP. If I recall correctly, the first suggested using Decimal64:

https://en.wikipedia.org/wiki/Decimal64_floating-point_format

the second suggested Decimal128:

https://en.wikipedia.org/wiki/Decimal128_floating-point_format

-- Steven
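To make "big and complex" concrete, here is a taste of the machinery the full decimal module carries -- variable precision, scoped contexts, selectable rounding modes -- which are exactly the parts a minimal builtin would presumably drop:

    from decimal import Decimal, getcontext, localcontext, ROUND_HALF_UP

    getcontext().prec = 4                  # precision is a mutable, global setting
    print(Decimal(1) / Decimal(7))         # 0.1429

    with localcontext() as ctx:            # contexts can be scoped per block
        ctx.prec = 20
        ctx.rounding = ROUND_HALF_UP       # one of several rounding modes
        print(Decimal(1) / Decimal(7))     # 0.14285714285714285714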

On Thu, 29 Aug 2019 at 22:12, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax. And I don’t understand why that isn’t a near-ubiquitous viewpoint. But it’s not just you; at least three people (all of whom dislike the whole concept of custom affixes) seem at least in principle open to the idea of adding builtin affixes for types that don’t exist. Which makes me think it’s almost certainly not that you’re all crazy, but that I’m missing something important. Can you explain it to me?
In my case, it's me that had missed something - namely the whole of this point. I can imagine having builtin syntax for a stdlib type (like Decimal, Fraction, or regex), but I agree that it gives the stdlib special privileges which I'm uncomfortable with. I definitely agree that built in syntax for 3rd party types is unacceptable. That quite probably contradicts some of my earlier statements - just assume I was wrong previously, I'm not going to bother going back over what I said and correcting my comments :-) I remain of the opinion that the benefits of user-defined literals would be sufficiently marginal that they wouldn't justify the cost, though. Paul

On Thu, Aug 29, 2019 at 02:10:21PM -0700, Andrew Barnert wrote: [...]
And most of the string affixes people have suggested are for string-ish things.
I don't think that's correct. Looking back at the original post in this thread, here are the motivating examples:

[quote]
There are quite a few situations where this can be used:
- Fraction literals: `frac'123/4567'`
- Decimals: `dec'5.34'`
- Date/time constants: `t'2019-08-26'`
- SQL expressions: `sql'SELECT * FROM tbl WHERE a=?'.bind(a=...)`
- Regular expressions: `rx'[a-zA-Z]+'`
- Version strings: `v'1.13.0a'`
- etc.
[/quote]

By my count, that's zero out of six string-ish things. There may have been other proposals, but I haven't trawled through the entire thread to find them.
A version object is a record with fields, most of which are numeric. For an existing example, see sys.version_info which is a kind of named tuple, not a string. The version *string* is just a nice human-readable representation. It doesn't make sense to implement string methods on a Version object. Why would you offer expandtabs(), find(), splitlines(), translate(), isspace(), capitalise(), etc methods? Or * and + (repetition and concatenation) operators? I cannot think of a single string method/operator that a Version object should implement.
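sys.version_info shows what that record-like design looks like in practice; everything useful about it is field access and tuple comparison, not string manipulation:

    import sys

    print(sys.version_info)            # e.g. sys.version_info(major=3, minor=8, micro=0, ...)
    print(sys.version_info.major)      # field access, not slicing
    print(sys.version_info >= (3, 6))  # field-by-field comparison, not string comparison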
It isn't clear to me how a compiled regex object is "similar" to a string. The set of methods offered by both regexes and strings is pretty small, by my generous count it is just two methods:

- str.split and SRE_Pattern.split;
- str.replace and SRE_Pattern.sub;

neither of which uses the same API or has the same semantics. Compiled regex objects don't offer string methods like translate, isdigit, upper, encode, etc. I would say that they are clearly *not* strings. [...]
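The semantic gap is easy to see side by side:

    import re

    print("a1b2c".split("1"))                # ['a', 'b2c'] -- splits on a literal substring
    print(re.compile(r"\d").split("a1b2c"))  # ['a', 'b', 'c'] -- splits on every pattern match

    print("a1b".replace("1", "-"))           # 'a-b' -- (old, new) argument order
    print(re.compile(r"\d").sub("-", "a1b")) # 'a-b' -- (replacement, string) argument order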
Why do you need the "regex" prefix? Assuming the parser and the human reader can cope with using / as both a delimiter and an operator (which isn't a given!), /.../ for a regex object seems fine to me. I suspect that this is going to be ambiguous though:

    target = regex/a*b/ +x

could be:

    target = ((regex / a) * b) / (unary-plus x)

or:

    target = (regex object) + x

so maybe we do need a prefix.
In what way are byte-STRINGS not strings? Unicode-strings and byte-strings share a significant fraction of their APIs, and are so similar that back in Python 2.2 the devs thought it was a good idea to try automagically coercing from one to the other. I was careful to write *string* rather than *str*. Sorry if that wasn't clear enough.
It is a utf16 STRING so making it look like a STRING is perfectly fine. [...]
Because I care about performance, at least a bit. Because I don't want to write code that is unnecessarily slow, for some definition of "unnecessary". Because I want to be able to reason (at least in broad terms) about the cost of certain operations. Because I want to be able to reason about the semantics of my code.

Why do I write 1234 instead of int("1234")? The second is longer, but it is more explicit and it is self-documenting: the reader knows that it's an int because it says so right there in the code, even if they come from Javascript where 1234 is an IEEE-754 float. Assuming the builtin int() hasn't been shadowed. But it's also wastefully slow. If we are genuinely indifferent to the difference, then we should be equally indifferent to a proposal to replace the LOAD_CONST byte-code for ints as follows:

    dis("1234")

    # in current Python:
    LOAD_CONST     0 (1234)

    # in the future:
    LOAD_NAME      0 (int)
    LOAD_CONST     0 ('1234')
    CALL_FUNCTION  1 (1 positional, 0 keyword pair)

If you were asked to vote +1 or -1 on this proposal (sitting on the fence not allowed), which would you vote? I would vote -1. Aside from the performance hit, it's also a semantic change: what was a compile-time literal is now a runtime function call which can be shadowed. It is nice to know that when I say ``n = 1234`` that the value of n is guaranteed to be 1234 no matter what odd things are going on. (Short of running a modified interpreter.)

String literals (byte- or unicode, raw or cooked, triple- or single-quoted) are, with the exception of f-strings, LOAD_CONST calls like ints. I think that's a valuable, useful thing to know, and not something we should lightly give up.
Here are things you probably really do care about: (a) they act like strings, (b) they act like constants,
Don't confuse immutability with constant-ness. Python doesn't have constants, except by convention. There's no easy way to prevent a simple name from being rebound.
(c) if there are potential issues parsing them, you see those issues as soon as possible,
Like at compile-time? Consider the difference between the compile-time syntax error you get here:

    x = 123w456

versus the run-time error you get here:

    x = int("123w456")

I can understand saying "we have no choice but to make this a runtime operation", or even "on the balance of pros and cons, it isn't worth the extra work to make this happen at compile-time". I don't like it that we have to write Decimal("123.456"), but I understand the reasons why we have to and can accept that it is a necessary evil. (To clarify: of course it is a feature that we *can* pass strings to the Decimal constructor, when such strings come from user-input or are read from data files etc.) But I don't think that it is a feature that there is no alternative but to pass a string, even when the value is known at edit-time. And I don't understand your position that I shouldn't care about the difference.
(d) working with them is more than fast enough.
You are right that Python is usually "fast enough" (except when it isn't), and that the one-off cost of creating a few pseudo-constants is generally only a small fraction of the cost of most programs. But Python isn't quote-unquote "slow" because of any one single thing, it is more of a death by a thousand cuts, lots of *small* inefficiencies which individually don't matter but collectively add up to making Python up to a hundred times slower than C. When performance matters, which would you rather write?

    for item in huge_sequence:
        value = item + 1234

or:

    for item in huge_sequence:
        value = item + int("1234")

I know that when I use a literal, it will be as fast as it possibly can be in Python, or at least there's nothing *I* can do to make it faster. But when I have to use a call like Decimal("123.45"), that's one more thing for me to have to worry about: is it fast enough? Can I make it faster? Should I make it faster? We should be wary about hiding potentially slow code in something that looks like fast code. (Yes, that's a criticism of properties too, but in the case of properties we know that the benefits outweigh the risk. It's not clear that this is the case here.)
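A quick, unscientific way to see the gap being described here (absolute numbers vary by machine and Python version, but the ratio is consistently several-fold):

    import timeit

    print(timeit.timeit('x = 1234'))          # a single LOAD_CONST per loop
    print(timeit.timeit('x = int("1234")'))   # a name lookup plus a call per loop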
I think you will find that I said this should be "a strong preference", which is hardly *insisting*.
Actually, no, it will be a Path object, a compiled regex SRE_pattern object, or a Version object, not a string at all.
or whatever, which is loosely a kind of string but not literally one. Like bytes.
Bytes literally are strings. They just aren't strings of Unicode characters.
Or it’s a sql cursor, in which case it was probably a misuse of the feature.
That's one of the motivating examples. I agree it is a misuse of the proposed feature.
I can concur with all of that. [...]
As I’ve said before, I believe that anything that doesn’t have a builtin type does not deserve builtin syntax.
Agreed. Although there's a bit of fuzziness over the concept of "builtin". Not all built-in objects are available in the ``builtins`` module, e.g. NoneType, or FunctionType.
I thought it went without saying that a necessary pre-condition for adding builtin syntax for a type was for the type to become built-in first. Sorry if it wasn't as clear or obvious as I thought. -- Steven

On Sat, Aug 31, 2019 at 8:44 PM Steven D'Aprano <steve@pearwood.info> wrote:
We call it a string, but a bytes object has as much in common with bytearray and with a list of integers as it does with a text string. Is the contents of a MIDI file a "string"? I would say no, it's not - but it can *contain* strings, eg for metadata and lyrics. The MIDI file representation of an integer might be stored in a byte-string, but the common API between text strings and byte strings is going to be mostly irrelevant here. You can't upper-case the variable-length-integer b"\xe7\x61" any more than you can upper-case the integer 13281. Those common methods are mostly built on the assumption that the string contains ASCII text. There are a few string-like functions that truly can be used with completely binary data, and which actually do make a lot more sense on a byte string than on, say, a list of integers. Notably, finding a particular byte sequence can be done without knowing what the bytes actually mean (and similarly bytes.split(), which does the same sort of search), and you can strip off trailing b"\0" without needing to give much meaning to the content. But I cannot recollect *ever* using these methods on any bytes object that wasn't storing some form of encoded text. Bytes and text have a long relationship, and as such, there are special similarities. That doesn't mean that bytes ARE text, any more than a compiled regex is text just because it's traditional to describe a regex in a textual form. Path objects also blur the "is this text?" line, since you can divide a Path by a string to concatenate them, and there are ways of smuggling arbitrary bytes through them. I don't think it's necessary to be too adamant about "must be some sort of thing-we-call-string" here. Let practicality rule, since purity has already waved a white flag at us. ChrisA

On Sat, Aug 31, 2019 at 09:31:15PM +1000, Chris Angelico wrote:
I don't think that's true.

    py> b'abc'.upper()
    b'ABC'
    py> [1, 2, 3].upper()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'list' object has no attribute 'upper'

Shall I beat this dead horse some more by listing the other 33 methods that byte-strings share with Unicode-strings but not lists? Compared to just two methods shared by all three of bytes, str and list (namely count() and index()), and *zero* methods shared by bytes and list but not str.

In Python2, byte-strings and Unicode strings were both subclasses of type basestring. Although we have moved away from that shared base class in Python3, it does demonstrate that conceptually bytes and str are closely related to each other.
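If you want to check those counts yourself, a throwaway snippet along these lines will do (exact numbers drift a little between Python versions):

    def public(t):
        return {m for m in dir(t) if not m.startswith('_')}

    print(len((public(bytes) & public(str)) - public(list)))  # dozens of shared string methods
    print(public(bytes) & public(list))                       # just the sequence basics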
Is the contents of a MIDI file a "string"? I would say no, it's not - but it can *contain* strings, eg for metadata and lyrics.
Don't confuse *human-readable native language strings* for generic strings. "Hello world!" is a string, but so are '&w-8\x02^xs\0' and b'DEADBEEF'.
Of course you can.

    py> b"\xe7\x61".upper()
    b'\xe7A'

Whether it is *meaningful* to do so is another question. But the same applies to str.upper: just because you can call the method doesn't mean that the result will be semantically valid.

    source = "def spam():\n\tpass\n"
    source = source.upper()  # no longer valid Python source code
Those common methods are mostly built on the assumption that the string contains ASCII text.
As they often do. If they don't, then don't call the text methods which don't make sense in context. Just as there are cases where text methods don't make sense on Unicode strings. You wouldn't want to call .casefold() on a password, or .lstrip() on a line of Python source code. [...]
Bytes and text have a long relationship, and as such, there are special similarities. That doesn't mean that bytes ARE text,
I didn't say that bytes are (human-readable) text. Although they can be: not every application needs Unicode strings, ASCII strings are still special, and there are still applications where one has to mix binary and ASCII text data. I said they were *strings*. Strings are not necessarily text, although they often are. Formally, a string is a finite sequence of symbols that are chosen from a set called an alphabet. See: https://en.wikipedia.org/wiki/String_%28computer_science%29
It is because of *practicality* that we should prefer that things that look similar should be similar. Code is read far more often than it is written, and if you read two pieces of code that look similar, we should strongly prefer that they should actually be similar. Would you be happy with a Pythonesque language that used prefixed strings as the delimiter for arbitrary data types?

    mylist = L"1, 2, None, {}, L"", 99.5"
    mydict = D"key: value, None: L"", "abc": "xyz""
    myset = S"1, 2, None"

That's what this proposal wants: string syntax that can return arbitrary data types. How about using quotes for function calls?

    assert chr"9" == "\t"
    assert ord"9" == 57

That's what this proposal wants: string syntax for a subset of function calls. Don't say that this proposal won't be abused. Every one of the OP's motivating examples is an abuse of the syntax, returning non-strings from something that looks like a string.

-- Steven

On Sun, Sep 1, 2019 at 10:47 AM Steven D'Aprano <steve@pearwood.info> wrote:
Older versions of Python had text and bytes be the same things. That means that, for backward compatibility, they have some common methods. But does that really mean that bytes can be uppercased? Or is it that we allow bytes to be treated as ASCII-encoded text, which is then uppercased, and then returned to being bytes?
Or does it actually demonstrate that Python 3 maintains backward compatibility with Python 2?
So what did you actually do here? You took some bytes that represent an integer, and you called a method on it that makes no sense whatsoever, because now it represents a different integer. There's no sense in which your new bytes object represents an "upper-cased version of" the integer 13281. If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
b"\xe7\x61".decode("Latin-1").upper().encode("Latin-1") b'\xc7A'
And if you choose some other encoding than Latin-1, you might get different results again. I put it to you that bytes.upper() exists more for backward compatibility with Python 2 than because a bytes object is, in some way, uppercaseable.
source = "def spam():\n\tpass\n" source = source.upper() # no longer valid Python source code.
But it started out as text, and it is now uppercase text. When you do that with bytes, you have to first layer in "this is actually encoded text", and you are then able to destroy that.
A finite sequence of symbols... you mean like a list of integers within the range [0, 255]? Nothing in that formal definition says that a "string" of anything other than characters should be meaningfully treated as text.
And you have yet to prove that this similarity is actually a thing.
At some point it's meaningless to call it a "Pythonesque" language, but I've worked with plenty of languages that simply do not have data types this rich, and so everything is manipulated the exact same way. When a list of values is represented as ";item 1;item 2;item 3" (actually as a string), or when you unpack a URL to find that it has JSON embedded inside it, the idea of a "prefixed string" that tells you exactly what data type is coming would be a luxury.
Let's look at regular expressions. JavaScript has a syntax for them involving leading and trailing slashes, borrowed from Perl, but I can't figure out whether a regex is a first-class object in Perl. So you can do something like this:

    "This has spaaaaam in it".match(/spa*m/)   // a first-class regex object in JS
In Python, I can do the exact same thing, only using double quotes as the delimiter.
re.search("spa*m", "This has spaaaaam in it") <re.Match object; span=(9, 17), match='spaaaaam'>
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string", yet the byte-encoded form of an integer somehow is? It makes absolutely no sense to uppercase an integer, yet you could uppercase a regex (since all regex special characters are non-letters, this will make it match uppercase strings). Yet when you encode the string as bytes, it gains an upper() method, and when you encode a regex as a compiled regex object, it loses one. Why do you insist that a regex is somehow not a string, but b"\xe7\x61" is? ChrisA

Chris Angelico writes:
Not just older versions. There have been several, more or less hotly contested, changes post-2/3 fork that basically come down to "bytes are frequently the wire format of ASCII-compatibly-encoded text, so we're going to add text methods for the convenience of people who work with those wire formats but do not need to (and sometimes cannot) decode to Unicode." For example, RFC 5322 header field tags are defined to be case- insensitive ASCII, and therefore it's useful to match them by upper- or lowercasing the tag, then matching against fixed strings. Could you convert to text and do the work? Not usefully: you need to parse the bytes to determine which text encoding is in use. (And ironically enough, if the message is RFC 5322 + RFC 2045-conformant, the hacky iso-8859-1 "conversion" will be allocation of a str object and then a memcpy of the bytes. I don't think that's a rebuttal to your argument, of course, it's just amusing.) That doesn't mean that bytes ARE text that happens to fit in 8-bit code units (PEP 393). It does mean that the similarities of the APIs are neither random accidents nor historical artifact. They're intentional. I don't think this has anything whatsoever to do with whether the "custom string prefix" proposal is a good idea or not. (other) Steve
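The wire-format pattern described above, in the kind of code RFC-handling libraries actually write (the header value here is made up, but partition(), strip() and lower() on bytes are the real tools):

    header = b"Content-Type: text/plain; charset=iso-8859-1"
    tag, _, value = header.partition(b":")
    if tag.strip().lower() == b"content-type":   # case-insensitive ASCII match, no decoding
        print(value.strip())                     # b'text/plain; charset=iso-8859-1'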

On Sun, Sep 01, 2019 at 12:24:24PM +1000, Chris Angelico wrote:
Older versions of Python had text and bytes be the same things.
Whether a string object is *text* is a semantic question, and independent of what data format you use. 'Hello world!' is text, whether you are using Python 1.5 or Python 3.8. '\x01\x06\x13\0' is not text, whether you are using Python 1.5 or Python 3.8.
I'm curious what you think that b'chris angelico'.upper() is doing, if it is not uppercasing the byte-string b'chris angelico'. Is it a mere accident that the result happens to be b'CHRIS ANGELICO'? Unicode strings are sequences of code-points, abstract integers between 0 and 1114111 inclusive. When you uppercase the Unicode string 'chris angelico', you're transforming the sequence of integers:

    U+0063,0068,0072,0069,0073,0020,0061,006e,0067,0065,006c,0069,0063,006f

to this sequence of integers:

    U+0043,0048,0052,0049,0053,0020,0041,004e,0047,0045,004c,0049,0043,004f

If you are prepared to call that "uppercasing", you should be prepared to do the same for the byte-string equivalent. (For the avoidance of doubt: this is independent of the encoding used to store those code points in memory or on disk. Encodings have nothing to do with this.) [...]
I'm fairly confident that bytes methods aren't implemented by decoding to Unicode, applying the method, then re-encoding back to bytes. But even if they were, that's just an implementation detail. Imagine a float method that internally converted the float to a pair of integers (numerator/denominator), operated on that fraction, and then re-converted back to a float. I'm sure you wouldn't want to say that this proves that floats aren't numbers. The same applies to byte-strings. In the unlikely case that byte methods delegate to str methods, that doesn't mean byte-strings aren't strings. It just means that two sorts of strings can share a single implementation for their methods. Code reuse for the win! [...]
For the sake of the argument I'll accept that *this particular* byte string represents an integer rather than a series of mixed binary data and ASCII text, or text in some unusual encoding, or pixels in an image, or any of a million other things it could represent. That's absolutely fine: if it doesn't make sense to call .upper() on your bytes, then don't call .upper() on them. Precisely as you wouldn't call .upper() on a str object, if it didn't make sense to do so.
and you called a method on it that makes no sense whatsoever, because now it represents a different integer.
The same applies to Unicode strings too. Any Unicode string method that transforms the input returns something that represents a different sequence of code-points, hence a different sequence of integers. Shall we agree that neither bytes nor Unicode are strings? No, I don't think so either :-)
If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
Sure. If you perform *any* transformation on the data first, it might give a different result on uppercasing:

- if you reverse the bytes, uppercasing gives a different result;
- if you replace b'a' with b'e', uppercasing gives a different result;

etc. And exactly the same observation applies to str objects:

- if you reverse the characters, uppercasing gives a different result;
- if you replace 'a' with 'e', uppercasing gives a different result.
And if you choose some other encoding than Latin-1, you might get different results again.
Sure. The bytes methods like .upper() etc are predicated on the assumption that your bytes represent ASCII text. If your bytes represent something else, then calling the .upper() method may not be meaningful or useful. In other words... if your bytes string came from an ASCII text file, it's probably safe to uppercase it. If your bytes string came from a JPEG, then uppercasing them will probably make a mess of the image, if not corrupt the file. So don't do that :-) Analogy: ints support the unary minus operator. But if your int represents a mass, then negating it isn't meaningful. There's no such thing as -5 kg. Should we conclude from this that the int type in Python doesn't represent a number, and that the support of numeric operators and methods is merely for backwards compatibility? I think not. The formal definition of a string is a sequence of symbols from an alphabet. That is precisely what bytes objects are: the alphabet in this case is the 8-bit numbers 0 to 255 inclusive, which for usefulness, convenience and backwards compatibility can be optionally interpreted as the 7-bit ASCII character set plus another 128 abstract "characters".
Sure. If your bytes don't represent text, then methods like upper() probably won't do anything meaningful. It's still a string though.
I'm not sure the onus is on me to prove this. "Status quo wins a stalemate." And surely the onus is on those proposing the new syntax to demonstrate that it will be fine to use string delimiters as function calls. You could make a good start by finding other languages, reasonably conventional languages with syntax based on the Algol or C tradition, that use quotes '' or "" to return arbitrary types. Even languages with unconventional syntax like Forth or APL would be a good start. Maybe I'm wrong. Maybe quotation marks are widely used for purposes other than delimiting strings, and I'm just too ignorant of other languages to know it. Maybe Python is in the minority here. Anyway, the bottom line is this: I have no objection to using prefixed quotes to represent Unicode strings, or byte strings, or Andrew's hypothetical UTF-16 strings, or EBCDIC strings, or TRON strings. https://en.wikipedia.org/wiki/TRON_(encoding) But I think that any API that would allow z"..." to represent (let's say) a socket, or a float, or a HTTP_Server instance, or a list, would be a deeply flawed API.
Sure. As a convenience, the re module has functions which accepts regular expression patterns as well as compiled regular expression objects.
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string",
That question is ambiguous. Are you asking about regular expression patterns, or regular expression objects? Regular expression *patterns* are clearly strings:

    pattern = r'...'

We type them with string delimiters, if you call type(pattern) it will return str, you can slice the pattern or uppercase it. Regular expression *objects* are just as clearly not strings:

    rx = re.compile(pattern)

You can't slice them, they aren't sequences of symbols, they have attributes like rx.flags which have no meaning in a string, they lack string methods like upper, and those methods they have operate very differently from their equivalent string methods:

    pattern.find("X")   # search pattern for "X"
    rx.search("X")      # search "X" for pattern

Regex objects are far more than just the regex pattern.
yet the byte-encoded form of an integer somehow is?
If your bytes represent an integer, then uppercasing them isn't meaningful. If your bytes represent ASCII text then uppercasing them may be meaningful.
In general, you can't uppercase regex patterns without radically changing the meaning of them. Consider r'\d' and r'\D'.
Because a byte-string matches the definition of strings, while compiled regex objects do not. -- Steven

On Mon, Sep 2, 2019 at 9:56 PM Steven D'Aprano <steve@pearwood.info> wrote:
Okay, so "string" and "text" are completely different concepts. Hold that thought.
No, they're not decoded. What happens is an *assumption* that certain bytes represent uppercaseable characters, and others do not. I specifically chose my example such that the corresponding code points both represented letters, and that the uppercased versions of each land inside the first 256 Unicode codepoints; yet uppercasing the bytestring changes one and not the other. Is it uppercasing the number 0x61 to create the number 0x41? No, it's assuming that it means "a" and uppercasing it to "A".
I specifically said a *list* of integers. Like what you'd get if you call list() on a bytestring. There's nothing in the formal definition you gave that precludes this from being considered a string, yet it is somehow, by your own words, fundamentally different.
Actually it is, because YOU are the one who said that quoted strings should be restricted to "string-like" things. Would a Path literal be sufficiently string-like to be blessed with double quotes? A regex literal? An IP header, represented as a bytestring? What's a string and what's not? Why are you trying to draw a line?
I gave an example wherein a list/array is represented as ";foo;bar;quux" - does that count? (VX-REXX, if you're curious.)
What if it represents a "connectable endpoint"? Is that a string? It'd be kinda like a pathlib.Path but with a bit more flexibility, allowing it to define a variety of information including the method of connection and perhaps some credentials. IOW a URI.
Exactly. To the re module, strings and compiled regexes are interchangeable.
Both at once. We're discussing the possibility of a "regex literal" concept that may or may not use double quotes. To most human beings, a regular expression IS a text string. Is a compiled regex allowed to have a literal form that uses double quotes, based on your definition of "string-like"? YOU are the one who is trying to draw a line in the sand here.
Right, but even if they represent an integer, you're fine with them using double quotes. Or am I mistaken here, and you would prefer to see it represented as bytes((0xe7, 0x61)) ?
And [0xe7, 0x61] also matches the definition of a string. ChrisA

If you strongly believe that if something looks like a string it ought to quack like a string too, then we can consider 2 potential remedies:

1. Change the delimiter, for example use curly braces: `re{abc}`. This would still be parseable, since currently an id cannot be followed by a set or a dict. (The forward-slash, on the other hand, would be ambiguous.)

2. We can also leave the quotation marks as delimiters. Once this feature is implemented, the IDEs will update their parsers and will be emitting a token of "user-defined literal" type. Simply setting the color for this token to something different from your preferred color for strings will make it visually clear that those tokens aren't strings. Hence, no possibility for confusion.

On Mon, 2 Sep 2019 at 07:04, Pasha Stetsenko <stpasha@gmail.com> wrote:
Just to add my 2 cents: there are always two sides to each language proposal: more flexibility/usability, and more language complexity. These need to be compared, and the comparison is hard because it is often subjective. FWIW, I think in this case the added complexity outweighs the benefits. I think only the very widely used literals (like numbers) deserve their own syntax. For everything else it is fine to type a few extra keystrokes. -- Ivan

On 31/08/2019 12:31, Chris Angelico wrote:
We call it a string, but a bytes object has as much in common with bytearray and with a list of integers as it does with a text string.
You say that as if text strings aren't sequences of bytes. Complicated and restricted sequences, I grant you, but no more so than a packet for a given network protocol. -- Rhodri James *-* Kynesim Ltd

On Wed, Sep 4, 2019 at 12:43 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
Is an integer also a sequence of bytes? A float? A list? At some level, everything's just stored as bytes in memory, but since there are many possible representations of the same information, it's best not to say that a character "is" a byte, but that it "can be stored in" some number of bytes. In Python, subscripting a text string gives you another text string. Subscripting a list of integers gives you an integer. Subscripting a bytearray gives you an integer. And (as of Python 3.0) subscripting a bytestring also gives you an integer. Whether that's right or wrong (maybe subscripting a bytestring should have been defined as yielding a length-1 bytestring), subscripting a text string does not give an integer, and subscripting a bytestring does not give a character. ChrisA
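Chris's subscripting rundown, as a quick interactive check:

    >>> "abc"[0]              # str -> str
    'a'
    >>> b"abc"[0]             # bytes -> int (since Python 3.0)
    97
    >>> bytearray(b"abc")[0]  # bytearray -> int
    97
    >>> [97, 98, 99][0]       # list -> whatever the element is
    97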

On Sep 3, 2019, at 06:17, Rhodri James <rhodri@kynesim.co.uk> wrote:
Forget about bytes vs. octets; this still isn’t a useful perspective. A character is a grapheme cluster, a sequence of one or more code points. A code point is an integer between 0 and 1.1M. A string is a flattened sequence of grapheme clusters—that is, a sequence of code points. (Python ignores the cluster part, pretending code points are characters, at the cost of requiring every application to handle normalization manually. Which is normally a good tradeoff, but it does mean that you can’t even say whether two sequences of code points are the same string without calling a function.)

Meanwhile, there are multiple ways to store those code points as bytes. Python does whatever it wants under the covers, hiding it from the user. Obviously there is _some_ array of bytes somewhere in memory that represents the characters of the string in some way (I say “obviously”, but that isn’t always true in Swift, and isn’t even frequently true in Haskell…), but you don’t have access to that. If you want a sequence of bytes, you have to ask for a sequence in some specific representation, like UTF-8 or UTF-16-BE or Shift-JIS, which it creates for you on the fly (albeit cached in a few special cases). So, from your system programmer’s perspective, in what useful sense is a character, or a string, a sequence of bytes?

And this is all still ignoring the fact that in Python, all values are “boxed” in an opaque structure that you can’t access from within the language, and even from the C API of CPython the box structure isn’t part of the API, so even something simpler like, say, an int isn’t usefully a sequence of 30-bit digits from the system programmer’s perspective, it’s an opaque handle that you can pass to functions to _obtain_ a sequence of 30-bit digits. (In the case of strings, you have to first pass the opaque handle to one function to see what format to ask for, then pass it to another to obtain a sequence of 1, 2, or 4-byte integers representing the code points in native-endian ASCII, UCS2, or UCS4. Which normally you don’t do—you ask for a UTF-8 string or a UTF-32 string that may get constructed on the fly—but if you really do want the actual storage, this is the way to get it.)

And most of this is not peculiar to Python. In Swift, a string is a sequence of grapheme clusters. In Java, it’s a sequence of UTF-16 code units. In Go, it’s a sequence of UTF-8 code units. In Haskell, it’s a lazy linked list of code points. And so on. In some of those cases, a character does happen to be represented as a string of bytes within a larger representation, but even when it is, that still doesn’t mean you can usefully access it that way.

Of course a text file on disk is a sequence of bytes, and (if you know the encoding and normalization) you could operate directly on those. But you don’t; you pass the byte strings to a function that decodes them (and then sometimes to a second function that normalizes them into a canonical form) and then use your language’s string functions on the result. In fact, you probably don’t even do that; you let the file object buffer the byte strings however it wants to and just hand you decoded text objects, so you don’t even know which byte substrings exist in memory at any given time. (Languages with powerful optimizers or macro systems like Haskell or Rust might actually do that by translating all your string-function calls into calls directly on the stream of bytes, but from your perspective that’s entirely under the covers, and you’re doing the same thing you do in Python.)
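The normalization point, concretely:

    import unicodedata

    s1 = "e\u0301"   # 'e' followed by a combining acute accent
    s2 = "\u00e9"    # the precomposed character 'é'
    print(s1 == s2)                                 # False: different code point sequences
    print(unicodedata.normalize("NFC", s1) == s2)   # True: the same string after normalizing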

On Thu, Aug 29, 2019 at 09:58:35PM +1000, Steven D'Aprano wrote:
Since Python is limited to ASCII syntax, we only have a small number of symbols suitable for delimiters. With such a small number available,
Oops, I had an interrupted thought there. With such a small number available, there is bound to be some duplication, but it tends to be fairly consistent across the majority of conventional programming languages. -- Steven

On 28/08/2019 23:01, stpasha@gmail.com wrote:
I don't think it's blasphemous. I think it's misleading, and that's far worse.
Pace Stephen's point that this is not in fact how datetime works, this has the major advantage of being readable. My thought processes on coming across that in code would go something like; "OK, we have a function call. Judging from the name its something to do with dates and times, so the result is going to be some date/time thing. Oh, I remember seeing "from datetime import datetime" at the top, so I know where to look it up if it becomes important. Fine. Moving on."
Here my thoughts would be more like; "OK, this is some kind of special string. I wonder what "dt" means. I wonder where I look it up. The string looks kind of like a date in ISO order, bear that in mind. Maybe "dt" is "date/time"." Followed a few lines later by "wait, why are we calling methods on that string that don't look like string methods? WTF? Maybe "dt" means "delirium tremens". Abort! Abort!" Obviously I've played this up a bit, but the point remains that even if I do work out that "dt" is actually a secret function call, I have to go back and fix my understanding of the code that I've already read. This significantly increases the chance that my understanding will be wrong. This is a Bad Thing.
If all that dt"string" gives us is a run-time call to dt("string"), it's a complete non-starter as far as I'm concerned. It's adding confusion for no real gain. However, it sounds like what you really want is something I've often really wanted to -- a way to get the compiler to pre-create "constant" objects for me. The trouble is that after thinking about it for a bit, it almost always turns out that I don't want that after all. Suppose that we did have some funky mechanism to get the compiler to create objects at compile time so we don't have the run-time creation cost to contend with. For the sake of argument, let's make it start_date = $datetime(2019,8,28) (I know this syntax would be laughed out of court, but like I said, for the sake of argument...) So we use "start_date" somewhere, and mutate it because the start date for some purpose was different. Then we use it somewhere else, and it's not the start date we thought it was. This is essentially the mutable default argument gotcha, just writ globally. The obvious cure for that would be to have our compile-time created objects be immutable. Leaving aside questions like how we do that, and whether contained containers are immutable, and so on, we still have the problem that we don't actually want an immutable object most of the time. I find that almost invariably I need to use the constant as a starting point, but tweak it somehow. Perhaps like in the example above, the start date is different for a particular purpose. In that case I need to copy the immutable object to a mutable version, so I have all the object creation shenanigans to go through anyway, and that saving I thought I had has gone away. I'm afraid these custom string prefixes won't achieve what I think you want to achieve, and they will make code less readable in the process.
So how do you distinguish the custom prefix "br" from a raw byte string? Existing syntax allows prefixes to stack, so there's inherent ambiguity in multi-character prefixes. -- Rhodri James *-* Kynesim Ltd

On Aug 29, 2019, at 06:40, Rhodri James <rhodri@kynesim.co.uk> wrote:
However, it sounds like what you really want is something I've often really wanted to -- a way to get the compiler to pre-create "constant" objects for me.
People often say they want this, but does anyone actually ever have a good reason for it? I was taken in by the lure of this idea myself—all those wasted frozenset constructor calls! (This was before the peephole optimizer understood frozensets.) Of course I hadn’t even bothered to construct the frozensets from tuples instead of lists, which should have been a hint that I was in premature optimization mode, and should have been the first thing I tried before going off the deep end. But hacking bytecode is fun, so I sat down and wrote a bytecode processor that let me replace any expression with a LOAD_CONST, much as the builtin optimizer does for things like simple arithmetic. It’s easy to hook it up to a decorator to call on a function, or to an import hook to call at module compile time. And then, finally, it’s time to benchmark and discover that it makes no difference. Stripping things down to something trivial enough to be tested… aha, I really was saving 13us, it’s just that 13us is not measurable in code that takes seconds to run.

Maybe someone has a real use case where it matters. But I’ve never seen one. I tried to find good nails for my shiny new hammer and never found one, and eventually just stopped maintaining it. And then I revived it when I wrote my decimal literal hack (the predecessor to the more general user literal hack I linked earlier in the thread) back during the 2015 iteration of this discussion, but again couldn’t come up with a plausible example where those 2.3d pseudo-literals were measurably affecting performance and needed constifying; I don’t think I even bothered mentioning it in that thread.

Also, even if you find a problem, it’s almost always easy to work around today. If the constant is constructed inside a loop, just manually lift it out of the loop. If it’s in a function body, this is effectively the same problem as global or builtin lookups being too slow inside a function body, and can be solved the same way, with a keyword parameter with a default value. And if the Python community thinks that _sin=sin is good enough for the uncommon problem of lookups significantly affecting performance, surely _vals=frozenset((1,2,3)) is also good enough for that far more uncommon problem, and therefore _limit=1e1000dec would also be good enough for the new but probably even more uncommon one. (Also, notice that the param default can be used with mutable values, it’s just up to you to make sure you don’t accidentally mutate them; an invisible compiler optimization couldn’t do that, at least not without something like Victor Stinner’s FAT guards.)

For what it’s worth, I actually found my @constify decorator more readable than the param default, especially for global functions—but not nearly enough so that it’s worth using a hacky, CPython-specific module that I have to maintain across Python versions (and byteplay to byteplay3 to bytecode) and that nobody else is using. Or to propose for a builtin (or stdlib but magic) feature.

What this all comes down to is that, despite my initial impression, I really don’t care whether Python thinks 1.23d is a constant value or not; I only care whether the human reader thinks it is one. Think about it this way: do you know off the top of your head whether (1, (2,3)) gets optimized to a const the same way (1,2) does in CPython? Has it ever occurred to you to check before I asked? And this is actually something that changed relatively recently.
Why would someone who doesn’t even think about when tuples are constified want to talk about how to force Python to constify other types? Because even years of Python experience hasn’t cured us of premature-optimization-itis.
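The parameter-default trick referred to above, spelled out (the leading underscore is just a convention for "implementation detail, don't pass this"):

    import math

    def unit_circle_points(n, _cos=math.cos, _sin=math.sin, _tau=2 * math.pi):
        # _cos, _sin and _tau are bound once, at function definition time,
        # so the loop avoids repeated global and attribute lookups
        return [(_cos(_tau * i / n), _sin(_tau * i / n)) for i in range(n)]

    def is_allowed(x, _vals=frozenset((1, 2, 3))):
        # the frozenset is built once, not on every call
        return x in _vals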

Rhodri James wrote:
Suppose that we did have some funky mechanism to get the compiler to create objects at compile time
It doesn't necessarily have to be at compile time. It can be at run time, as long as it only happens once.
I don't think this is as much of a problem as it seems. We often assign things to globals that are intended to be treated as constants, with the understanding that it's our responsibility to refrain from mutating them. -- Greg

Unless there is some significant difference between the two, what does this proposal give us?
The difference between `x'...'` and `x('...')`, other than visual noise, is the following:

- The first "x" is in its own namespace of string prefixes. The second "x" exists in the global namespace of all other symbols.

- Python style discourages too-short variable names, especially in libraries, because they have an increased chance of clashing with other symbols, and generally may be hard to understand. At the same time, short names for string prefixes could be perfectly fine: there won't be too many of them anyways. The standard prefixes "b", "r", "u", "f" are all short, and nobody gets confused about them.

- Barrier of entry. Today you can write `from re import compile as x` and then write `x('...')` to denote a regular expression (if you don't mind having `x` as a global variable). But this is not the way people usually write code. People write the code the way they are taught from examples, and the examples don't speak about regular expression objects. The examples only show regular expressions-as-strings, so many Python users don't even realize that regular expressions can be objects. Now, if the string prefixes were available, library authors would think "Do we want to export such functionality for the benefit of our users?" And if they answer yes, then they'll showcase this in the documentation and examples, and the user will see that their code has become cleaner and more understandable.
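For reference, the closest spelling available today (with a more readable alias than `x`):

    from re import compile as rx

    pattern = rx("[a-zA-Z]+")          # today's nearest equivalent of rx'[a-zA-Z]+'
    print(pattern.fullmatch("hello"))  # <re.Match object; span=(0, 5), match='hello'>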

On Tue, Aug 27, 2019 at 05:13:41PM -0000, stpasha@gmail.com wrote:
Ouch! That's adding a lot of additional complexity to the language. Python's scoping rules are usually described as LEGB:

- Local
- Enclosing (non-local)
- Global (module)
- Builtins

but that's an over-simplification, dating back to something like Python 1.5 days. Python scope also includes:

- class bodies can be the local scope (but they don't work quite the same as function locals);
- parts of the body of comprehensions behave as if they were a separate scope.

This proposal adds a completely separate, parallel set of scoping rules for these string prefixes. How many layers in this parallel scope? The simplest design is to have a single, interpreter-wide namespace for prefixes. Then we will have name clashes, especially since you seem to want to encourage single character prefixes like "v" (verbose, version) or "d" (date, datetime, decimal). Worse, defining a new prefix will affect all other modules using the same prefix. So we need a more complex parallel scope. How much more complex?

* if I define a string prefix inside a comprehension, function or class body, will that apply across the entire module or just inside that comp/func/class?
* how do nested functions interact with prefixes?
* do we need a set of parallel keywords equivalent to global and nonlocal for prefixes?

If different modules have different registries, then not only do we need to build a parallel set of scoping rules for prefixes into the interpreter, but we need a parallel way to import them from other modules, otherwise they can't be re-used. Does "from module import x" import the regular object x from the module namespace, or the prefix x from the prefix-namespace? So it seems we'll need a parallel import system as well. All this adds more complexity to the language, more things to be coded and tested and documented, more for users to learn, more for other implementations to re-implement, and the benefit is marginal: the ability to drop parentheses from some but not all function calls.

Now consider another problem: introspection, or the lack thereof. One of the weaknesses of string prefixes is that it's hard to get help for them. In the REPL, we can easily get help on any class or function:

    help(function)

and that's really, really great. We can use the inspect module or dir() to introspect functions, classes and instances, but we can't do the same for string prefixes. What's the difference between r-strings and u-strings? help() is no help (pun intended), since help sees only the string instance, not the syntax you used to create it. All of these will give precisely the same output:

    help(str())
    help('')
    help(u'')
    help(r"")

etc. This is a real weakness of the prefix system, and will apply equally to custom prefixes. It is *super easy* to introspect a class or function like Version; it is *really hard* to do the same for a prefix.

You want this separate namespace for prefixes so that you can have a v prefix without "polluting" the module namespace with a v function (or class). But v doesn't write itself! You still have to write a function or class, although you might give it a better name and then register it with the single letter prefix:

    @register_prefix('v')
    class Version:
        ...

(say). This still leaves Version lying around in your global namespace, unless you explicitly delete it:

    del Version

but you probably won't want to do that, since Version will probably be useful for those who want to create Version objects from expressions or variables, not just string literals.
So the "pollution" isn't really pollution at all, at least not if you use reasonable names, and the main justification for parallel namespaces seems much weaker. Let me put it another way: parallel namespaces is not a feature of this proposal. It is a point against it.
That's an interesting position for the proponent of a new feature to take. "Don't worry about this being confusing, because hardly anyone will use it."
The standard prefixes "b", "r", "u", "f" are all short, and nobody gets confused about them.
Plenty of people get confused about raw strings. There's only four, plus uppercase and combinations, and they are standard across the entire language. If there were dozens of them, coming from lots of different modules and third-party libraries, with lots of conflicts ('v' for version in foolib, but 'v' for verbose in barlib), the situation would be very different. We can't extrapolate from four built-in prefixes being manageable to concluding that dozens of clashing user-defined prefixes will be too.
I doubt that is true. "from module import foo as bar" is a standard, commonly used Python language feature:

https://stackoverflow.com/questions/22245711/from-import-or-import-as-for-mo...

in particular this answer here:

https://stackoverflow.com/a/29010729

Besides, we don't design the language for the least knowledgeable, most ignorant, copy-and-paste coders.
That's simply wrong. The *very first* example of a regular expression here:

https://scotch.io/tutorials/an-introduction-to-regex-in-python

uses the compile function. More examples talking about regex objects:

https://docs.python.org/3/library/re.html#re.compile
https://pymotw.com/2/re/#compiling-expressions
https://docs.python.org/3/howto/regex.html#compiling-regular-expressions
https://stackoverflow.com/questions/20386207/what-does-pythons-re-compile-do

These weren't hard to find. You don't have to dig deep into obscure parts of the WWW to find people talking about regex objects. I think you underestimate the knowledge of the average Python programmer. -- Steven

On Wed, 28 Aug 2019 at 13:15, Anders Hovmöller <boxed@killingar.net> wrote:
On 28 Aug 2019, at 14:09, Piotr Duda <duda.piotr@gmail.com> wrote:
The only sane proposal that I can see (assuming that no-one is proposing to drop the principle that Python shouldn't have mutable syntax) is to modify the definition

    stringliteral ::= [stringprefix](shortstring | longstring)
    stringprefix  ::= "r" | "u" | "R" | "U" | "f" | "F"
                      | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"

to expand the definition of <stringprefix> to allow any identifier-like token (precise details to be confirmed). Then, if it's one of the values enumerated above (you'd also need some provision for special-casing bytes literals, which are in a different syntax rule), work as at present. For any other identifier-like token, you'd define TOKEN(shortstring|longstring) as being equivalent to TOKEN(r(shortstring|longstring)). I.e., treat the string as a raw string, and TOKEN as a function name, and compile to a function call of the named function with the raw string as argument.

That's a well-defined proposal, although whether it's what people want is a different question. Potential issues:

1. It makes a whole class of typos that are currently syntax errors into runtime errors - fru"foo\and {bar}" is now a function call rather than a syntax error (it was never a raw Unicode f-string, even though someone might think it was and be glad to be corrected by the current syntax error...)

2. It begs the question of whether people want raw-string semantics - whilst it's the most flexible option, it does mean that literals wanting to allow escape sequences would need to implement it themselves (see the sketch below).

3. It does nothing for the edge case that a trailing \ isn't allowed - p"C:\" wouldn't be a valid Path literal.

There are of course other possible proposals, but we'd need more than broad statements to make sense of them (specifically, either "exactly *what* new syntax are you suggesting we allow?", or "how are you proposing to allow users to alter Python syntax on demand?") Paul
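On point 2, a handler that wants cooked semantics under this scheme would have to redo escape processing itself. A rough sketch (the prefix name is hypothetical, and unicode_escape mangles non-Latin-1 text, so a real implementation would need more care):

    import codecs

    def my(raw):
        # the compiler hands us the raw string; cook it ourselves
        cooked = codecs.decode(raw, "unicode_escape")
        return cooked.upper()      # stand-in for whatever the literal really builds

    print(my(r"tab:\there"))       # what my"tab:\there" would mean under the proposal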

Right, having a parallel set of scopes sounds like WAY too much work. Which is why I didn't want to start my proposal with a particular implementation -- I simply don't have enough experience for that. Still, we can brainstorm possible approaches, and come up with something that is feasible. For example, how about this: prefixes/suffixes "live" in the same local scope as normal variables, however, in order to separate them from the normal variables, their names get mangled into something that is not a valid variable name. Thus,

    re'a|b|c'  --becomes-->  (locals()["re~"])("a|b|c")
    2.3f       --becomes-->  (locals()["~f"])("2.3")

Assuming that most people don't create variable names that start or end with `~`, the impact on existing code should be minimal (we could use an even more rare character there, say `\0`). The current string prefixes would be special-cased by the compiler to behave exactly as they behave right now. Also, a prefix such as `czt""` is always just a single prefix, there is no need to treat it as 3 single-char prefixes.
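A toy version of that desugaring, runnable today at module scope (the registration step is hypothetical, and note that locals() inside a function isn't actually writable like this, which is one detail a real implementation would have to solve):

    import re

    # register a handler under a mangled key that no identifier can collide with
    globals()["re~"] = re.compile

    # what the compiler would emit for re'a|b|c':
    pattern = (globals()["re~"])("a|b|c")
    print(pattern.match("b"))   # <re.Match object; span=(0, 1), match='b'>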
Well, it's just another problem to overcome. I know in Python one can get help on keywords and even operators by saying `help('class')` or `help('+')`. We could extend this to allow `help('foo""')` to give the help for the prefix "foo". Specifically, if the argument to `help` is a string, and that string is not a registered topic, then check whether the string is of the form `<id>""` or `<id>''` or `""<id>` or `''<id>`, and invoke the help for the corresponding prefix / suffix. This will even solve the problem with the help for existing affixes `b""`, `f""`, `0j`, etc.
For the Version class you're right. But use cases vary. In the thread from 2013 where this issue was discussed, many people wanted `sql"..."` literal to be available as literal and nothing else. Presumably, if you wanted to construct a query dynamically there could be a separate function `sql_unsafe()` taking a simple string as an argument.
The pollution argument is that, on one hand, we want to use short names such as "v" for prefixes/suffixes, while on the other hand we don't want them to be "regular" variable names because of the possibilities of name clashes. It's perfectly fine to have a short character for a prefix and at the same time a longer name for a function. It's like we have the `unicode()` function and `u"..."` prefix. It's like most command line utilities offer short single-character options and longer full-name options.
I'm sorry if I expressed myself ambiguously. What I meant to say is that the set of different prefixes within a single program will likely be small.
We can't extrapolate from four built-in prefixes being manageable to concluding that dozens of clashing user-defined prefixes will be too.
That's a valid point. Though we can't extrapolate that they will be unmanageable either. There's just not enough data. But we could look at other languages who have more suffixes. Say, C or C++. Ultimately, this can be a self-regulating feature: if having too many suffixes/prefixes makes one's code unreadable, then simply stop using them and go back to regular function calls.

On Aug 28, 2019, at 12:45, stpasha@gmail.com wrote:
Since this specific use has come up a few times—and a similar feature in other languages—can you summarize exactly what people want from this one?

IIRC, DB-API 2.0 doesn't have any notion of compiled statements, or bound statements, just this:

    Connection.execute(statement: str, *args) -> Cursor

So the only thing I can think of is that sql"…" is a shortcut for that. Maybe:

    curs = sql"SELECT lastname FROM person WHERE firstname={firstname}"

… which would do the equivalent of:

    curs = conn.execute("SELECT lastname FROM person WHERE firstname=?", firstname)

… except that it knows whether your particular database library uses ? or %s or whatever for SQL params.

I can see how that could be useful, but I'm not sure how it could be easily implemented.

First, it has to know where to find your connection object. Maybe the library that exposes the prefix requires you to put the connection in a global (or threadlocal or contextvar) with a specific name, or manages a pool of connections that it stores in its own module or something? But that seems simultaneously too magical and too restrictive.

And then it has to do f-string-style evaluation of the brace contents, in your scope, to get the args to pass along. Which I'd assume means that prefix handlers need to get passed locals and globals, so the sql prefix handler can eval each braced expression? (Even that wouldn't be as good as f-strings, but it might be good enough here?)

Even with all that, I'm pretty sure I'd never use it. I'm often willing to bring magic into my database API, but only if I get a lot more magic (an expression-builder library, a full-blown ORM, that thing that I forget the name of that translates generators into SQL queries quasi-LINQ-style, etc.). But maybe there are lots of people who do want just this much magic and no more.

Is this roughly what people are asking for? If so, is that eval magic needed for any other examples you've seen besides sql? It's definitely not needed for regexes, paths, really-raw strings, or any of the numeric examples, but if it is needed for more than one good example, it's probably still worth looking at whether it's feasible.

My understanding is that for a sql prefix the most valuable part is to be able to know that it was created from a literal. No other magic, definitely not auto-executing. Then it would be legal to write

    result = conn.execute(sql"SELECT * FROM people WHERE id=?", user_id)

but not

    result = conn.execute(f"SELECT * FROM people WHERE id={user_id}")

In order to achieve this, the `execute()` method only has to look at the type of its argument, and throw an error if it's a plain string.

Perhaps with some more imagination we can make

    result = conn.execute(sql"SELECT * FROM people WHERE id={user_id}")

work too, but in this case the `sql"..."` token would only create an `UnpreparedStatement` object, which expects a variable named "user_id", and then the `conn.execute()` method would pass locals()/globals() into the `.prepare()` method of that statement, binding those values to the placeholders. Crucially, the `.prepare()` method shouldn't modify the object, but return a new PreparedStatement, which then gets executed by `conn.execute()`.
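A minimal sketch of the type-check half of this idea; the `SQL` marker class and `Connection` here are hypothetical, standing in for whatever object a sql"..." literal would produce and for the driver's connection type:

    class SQL(str):
        # Hypothetical marker type produced by the sql"..." literal.
        __slots__ = ()

    class Connection:
        def execute(self, statement, *args):
            # Reject plain strings: only objects built from sql"..." literals pass.
            if not isinstance(statement, SQL):
                raise TypeError('execute() requires a sql"..." literal, not a plain str')
            ...  # hand off to the underlying driver here

    conn = Connection()
    conn.execute(SQL("SELECT * FROM people WHERE id=?"), 42)   # ok
    # conn.execute("SELECT * FROM people WHERE id=42")         # TypeError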

On Fri, Aug 30, 2019 at 3:51 AM Pasha Stetsenko <stpasha@gmail.com> wrote:
There's no such thing, though, any more than there's such a thing as a "raw string". There are only two types of string in Python - text and bytes. You can't behave differently based on whether you were given a triple-quoted, raw, or other string literal.
One way to handle this particular case would be to do it as a variant of f-string that doesn't join its arguments, but passes the list to some other function. Just replace the final BUILD_STRING step with BUILD_LIST, then call the function. There'd need to be some way to recognize which sections were in the literal and which came from interpolations (one option is to simply include empty strings where necessary such that it always starts with a literal and then alternates), but otherwise, the "sql" manager could do all the escaping it wants. However, this wouldn't be enough to truly parameterize a query; it would only do escaping into the string itself.

Another option would be to have a single variant of f-string that, instead of creating a string, creates a "string with formatted values". That would then be a single object that can be passed around as normal, and if conn.execute() received such a string, it could do the proper parameterization.

Not sure either of them would be worth the hassle, though.

ChrisA
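A sketch of what such a "string with formatted values" object might look like (roughly the shape of PEP 501's interpolation templates; all names here are hypothetical). As described above, the literal segments are padded with empty strings so the text always starts with a literal and then alternates:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FormattedString:
        # n+1 literal segments alternating with n interpolated values.
        strings: tuple
        values: tuple

    # What  sql"SELECT * FROM people WHERE id={user_id}"  might compile to:
    user_id = 42
    stmt = FormattedString(("SELECT * FROM people WHERE id=", ""), (user_id,))

    # conn.execute() could then parameterize properly:
    query = "?".join(stmt.strings)   # 'SELECT * FROM people WHERE id=?'
    params = stmt.values             # (42,)
    assert query == "SELECT * FROM people WHERE id=?" and params == (42,)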

On 8/29/19 11:14 AM, Chris Angelico wrote:
But wasn't the idea of the sql" (or other) prefix that the 'plain string' is put through a special function that processes it? That function could return an object of some other type, so the difference could be detected.
-- Richard Damon

On Thu, Aug 29, 2019 at 08:17:39PM +1200, Greg Ewing wrote:
I don't think that stpasha@gmail.com means that the user literally assigns to locals() themselves. I read his proposal as having the compiler automatically mangle the names in some way, similar to name mangling inside classes.

The transformation from prefix re to mangled name 're~' is easy, the compiler could surely handle that, but I'm not sure how the other side of it will work. How does one register that re.compile (say) is to be aliased as the prefix 're'? I'm fairly sure we don't want to allow ~ in identifiers:

    # not this
    re~ = re.compile

I'm still not convinced that we need this parallel namespace idea, even in a watered down version as name-mangling. Why not just have the prefix X call the name X, for any valid name X (apart from the builtin prefixes)? I still am not convinced that is a good idea, but at least the complexity is significantly reduced.

P.S. stpasha@gmail.com if you're reading this, it would be nice if you signed your emails with a name, so we don't have to refer to you by your email address or as "the OP".

-- Steven
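(For reference, the existing name mangling inside classes, which the analogy above invokes, works like this:)

    class C:
        def __init__(self):
            self.__x = 1    # the compiler rewrites this to self._C__x

    c = C()
    assert c._C__x == 1     # the mangled name is still a valid identifier
    # By contrast, a name like 're~' could never be reached via normal
    # attribute or name syntax, only via the namespace dict.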

Steven D'Aprano wrote:
Yes, but at some point you have to define a function to handle your string prefix. If it's at the module level then it's no problem, because you can do something like

    globals()["~f"] = lambda: ...

But you can't do that for locals. So mangling to something unspellable would effectively preclude having string prefixes local to a function.

-- Greg

On Aug 29, 2019, at 16:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
What happens if you do this, and then include "~f" in __all__, and then import * from that module? I personally would rather have my prefixes or suffixes available in every module that imports them, without needing to manually register them each time. Not a huge deal, and if nobody else agrees, fine. But if I could __all__ it, I could get what I want anyway. :)

How does one get a value into locals()["re~"]?
You're right, I didn't think about that. I agree with Steven's interpretation that the user is not expected to modify locals herself, still the immutable nature of locals presents a considerable challenge. So I'm thinking that perhaps we could change that to `globals()["re~"]`, where globals are in fact mutable and can even be modified by the user. This would make it so that affixes can only be declared at a module level, similar to how `from library import *` is not allowed in a function either. This is probably a saner approach anyways -- if affixes could mean different things in different functions, that could be quite confusing...
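Under that variant, registration might look like this at module level (the mangling scheme is hypothetical; combining it with __all__, as mused above, might let `import *` carry the prefix along):

    import re

    # Hypothetical: register re.compile as the handler for the re'...' prefix.
    globals()["re~"] = re.compile

    # re'a|b|c' in this module would then mean:
    pattern = globals()["re~"](r'a|b|c')

    __all__ = ["re~"]   # so `from thismodule import *` re-exports the prefix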

On Tue, Aug 27, 2019 at 08:22:22AM -0000, stpasha@gmail.com wrote:
The string (or number) prefixes add new power to the language
I don't think they do. It's just syntactic sugar for a function call. There's nothing that czt'...' will do that czt('...') can't already do. If you have a proposal that allows custom string prefixes to do something that a function call cannot do, I've missed it.
If a certain feature can potentially be misused shouldn't deter us from adding it, if the benefits are significant.
Very true, but so far I see nothing in this proposal that suggests that the benefits are more significant than avoiding having to type a pair of parentheses. Every benefit I have seen applies equally to the function call version, but without the added complexity to the language of allowing custom string prefixes.
And the benefits in terms of readability can be significant.
I don't think they will be. I think they will encourage cryptic one-character function names disguised as prefixes:

    v'...'   instead of Version(...)
    x'...'   instead of re.compile(...)

to take two examples from your proposal. At least this is somewhat better:

    sql'...'

but that leaves the ambiguity of not knowing whether that's a chained function call s(q(l(...))) or a single sql(...).

I believe it will also encourage inefficient and cryptic string parsing instead of clearer use of separate arguments. Your earlier example:

    frac'123/4567'

The Fraction constructor already accepts such strings, and it is occasionally handy for parsing user-input. But using it to parse string literals gives slow, inefficient code for little or no benefit:

    [steve@ando cpython]$ ./python -m timeit -s 'from fractions import Fraction' 'Fraction(123, 4567)'
    20000 loops, best of 5: 18.9 usec per loop
    [steve@ando cpython]$ ./python -m timeit -s 'from fractions import Fraction' 'Fraction("123/4567")'
    5000 loops, best of 5: 52.9 usec per loop

Unless you can suggest a way to parse arbitrary strings in arbitrary ways at compile-time, these custom string prefixes are probably doomed to be slow and inefficient.

The best thing I can say about this is that at least frac'123/4567' would probably be easy to understand, since the / syntax for fractions is familiar to most people from school. But the same cannot be said for other custom prefixes:

    cf'[0; 37, 7, 1, 2, 5]'

Perhaps you can guess the meaning of that cf-string. Perhaps you can't. A hint might point you in the right direction:

    assert cf'[0; 37, 7, 1, 2, 5]' == Fraction(123, 4567)

(By the way, the semi-colon is meaningful and not a typo.)

To the degree that custom string prefixes will encourage cryptic one and two letter names, I think that this will hurt readability and clarity of code. But if the reader has the domain knowledge to recognise what "cf" stands for, this may be no worse than (say) "re" (regular expression). In conventional code, we might call the cf function like this:

    cf([0, 37, 7, 1, 2, 5])   # Single list argument.
    cf(0, 37, 7, 1, 2, 5)     # *args version.

Either way works for me. But it is your argument that replacing the parentheses with quote marks is "more readable":

    cf([0, 37, 7, 1, 2, 5])
    cf'[0; 37, 7, 1, 2, 5]'

not just a little bit more readable, but enough to make up for the inefficiency of having to write your own parser, deal with errors, compile a string literal, parse it at runtime, and only then call the actual cf constructor and return a cf object. Even if I accepted your claim that swapping (...) for '...' was more readable, I am skeptical that the additional work and runtime inefficiency would be worth the supposed benefit.

I don't wish to say that parsing strings to extract information is always an anti-pattern:

http://cyrille.martraire.com/2010/01/the-string-obsession-anti-pattern/

after all we often need to process data coming from config files or other user-input, where we have no choice but to accept a string. But parsing string *literals* usually is an anti-pattern, especially when there is a trivial transformation from the string to the constructor arguments, e.g. 123/4567 --> Fraction(123, 4567).

[...]
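(A minimal sketch of the *args version of such a cf() constructor, purely illustrative, which if anything reinforces the point that the prefix buys little over a plain call:)

    from fractions import Fraction

    def cf(*terms):
        # Evaluate a continued fraction [a0; a1, ..., an] from the inside out.
        a0, *rest = terms
        acc = Fraction(0)
        for a in reversed(rest):
            acc = 1 / (a + acc)
        return a0 + acc

    assert cf(0, 37, 7, 1, 2, 5) == Fraction(123, 4567)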
Everything you say there applies to ordinary function call syntax too: Version('1.10a') can have methods, special behaviours, a type different from str, etc. Not one of those benefits comes from *custom string prefixes*. They all come from the use of a custom type. In fact, we can be more explicit and clear with the constructor:

    Version(major=1, minor=10, stage='a')

There is nothing magic about this v-string prefix. You still have to write a Version class with a version-string parser. The compiler can't help you, because it has no knowledge of the format of version strings. All the compiler can do is pass the string '1.10a' to the function v().

[...]
"There could be..." lots of things, but the onus is on you to prove that there actually *are* such benefits.
I answered that in my previous post. I would prefer an explicit, clear, self-documenting function call Version() over a terse, unclear syntax that looks like a string but isn't. I don't think that v'1.10a' is clearer or more readable than Version('1.10a'). It is *shorter*, but that's it. The bottom line is, so long as this proposal is for nothing more than mere syntactic sugar allowing you to drop the parentheses from certain function calls (those that take a single string argument), the benefit is tiny, and the added complexity and opportunity for abuse and confusion is large.
I can't help pandas' poor API, and I doubt that your proposal would have prevented it either.
Think about what you are saying about the sophisticated data scientists who are typical pandas users:

- they can write "import pandas"
- but not "import re" or "from re import compile as rx"
- they will be able to import your rx'...' string prefix from wherever it comes from (perhaps "from re import rx"?)
- and are capable of writing regular expressions using your custom rx'...' syntax
- but adding parentheses is beyond them: rx('...').

I cannot take this argument about sophisticated regex-users who are defeated by function call syntax seriously.

-- Steven

On Aug 27, 2019, at 08:36, Steven D'Aprano <steve@pearwood.info> wrote:
But there are plenty of cases where parsing string literals is the current usual practice. Decimal is obvious, as well as most other non-native numeric types. Path objects even more so. Pandas users seem to always build their datetime objects out of YYYYMMDDTHHMMSS strings. And so on.

So the status quo doesn't mean nobody parses string literals, it means people _explicitly_ parse string literals. And the proposed change doesn't mean more string literal parsing, it means making some of the existing, uneliminable uses less visually prominent and more readable. (And, relevant to the blog you linked, it seems to make it _less_ likely, not more, that you'd bind the string rather than the value to a name, or pass it around and parse it repeatedly, or the other bad practices they were talking about.)

I'll admit there are some cases where I might sacrifice performance for convenience if we had this feature. For example, F1/3 (or 1/3F with suffixes) would have to mean at least Fraction(1) / 3, if not Fraction('1') / 3, or even that plus an extra LOAD_ATTR. That is clearly going to be more expensive than F(1, 3) meaning Fraction(1, 3), but I'd still do it at the REPL, and likely in real code as well. But I don't think that choice would make my code worse (because when setup costs matter, I _wouldn't_ make that choice), so I don't see that as a problem.

On 8/26/19 4:03 PM, stpasha@gmail.com wrote:
I have seen a lot of discussion on this, but a few points that I thought of haven't been brought up yet. One solution to all of these would be to do these as suffixes instead.

Python currently has a number of existing string prefixes that are valid, and it might catch some people out when they want to use a combination that is currently a valid prefix. (It has been brought up that this converts an invalid prefix from an immediately diagnosable syntax error to a run time error.) This also means that it becomes very hard to add a new built-in prefix later, as that spelling would by then have a defined meaning.

A second issue is that currently some of the prefixes (like r) change how the string literal is parsed. This means that the existing prefixes are not just a slightly special case of the general rule; they need to be treated very differently, or perhaps the prefix needs to somehow indicate which standard prefix to use when parsing the string. Some of your examples could benefit by sometimes being able to use r' and sometimes not, so being able to say both r'string're or 'string're could be useful.

-- Richard Damon
participants (20)
- Anders Hovmöller
- Andrew Barnert
- Chris Angelico
- Eric V. Smith
- Greg Ewing
- Ivan Levkivskyi
- Konstantin Schukraft
- Mike Miller
- MRAB
- Pasha Stetsenko
- Paul Moore
- Piotr Duda
- Rhodri James
- Richard Damon
- Rob Cliffe
- Robert Vanden Eynde
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- stpasha@gmail.com