Re: [Python-ideas] Empty set, Empty dict

Victor Stinner wrote: 2014-06-10 8:15 GMT+02:00 Neil Girdhar <mistersheik at gmail.com>:
Perhaps {,} would be a possible spelling. For consistency you might want to allow (,) to create an empty tuple as well; personally I would find that more intuitive than (()). Wichert.

+1 for using {,}.

On Tue, Jun 10, 2014 at 4:07 AM, Wichert Akkerman <wichert@wiggy.net> wrote:
> Victor Stinner wrote:
>> 2014-06-10 8:15 GMT+02:00 Neil Girdhar <mistersheik at gmail.com>:
>>> I've seen this proposed before, and I personally would love this, but my
>>> guess is that it breaks too much code for too little gain.
>>>
>>> On Wednesday, May 21, 2014 12:33:30 PM UTC-4, Frédéric Legembre wrote:
>>>>  Now    | Future |
>>>>  ----------------------------------------------------
>>>>  ()     | ()     | empty tuple   ( 1, 2, 3 )
>>>>  []     | []     | empty list    [ 1, 2, 3 ]
>>>>  set()  | {}     | empty set     { 1, 2, 3 }
>>>>  {}     | {:}    | empty dict    { 1:a, 2:b, 3:c }
>>
>> Your guess is right. It will break all Python 2 and Python 3 in the world.
>>
>> Technically, set((1, 2)) is different than {1, 2}: the first creates a
>> tuple and loads the global name "set" (which can be replaced at
>> runtime!), whereas the latter uses bytecode and only stores values
>> (numbers 1 and 2).
>>
>> It would be nice to have a syntax for empty set, but {} is a no-no.
>
> Perhaps {,} would be a possible spelling. For consistency you might want
> to allow (,) to create an empty tuple as well; personally I would find that
> more intuitive than (()).
>
> Wichert.

--
Ryan
If anybody ever asks me why I prefer C++ to C, my answer will be simple: "It's becauseslejfp23(@#Q*(E*EIdc-SEGFAULT. Wait, I don't think that was nul-terminated."

No. Jeez. :-(

On Tue, Jun 10, 2014 at 9:25 AM, Ryan Gonzalez <rymg19@gmail.com> wrote:
> +1 for using {,}.

-- --Guido van Rossum (python.org/~guido)

of course it’s ugly, but it’s also obvious that it had to be suggested, because it’s the only obvious idea.

which leaves us with either a non-obvious idea or no empty set literal, which is a bit sad and inconsistent.

if i’d develop a language from scratch, i’d possibly use the following empty literals:

    []  = list;          ()  = tuple,        {}  = set
    [:] = ordered dict,  (:) = named tuple,  {:} = dict

but that ship has sailed.

2014-06-10 18:39 GMT+02:00 Guido van Rossum <guido@python.org>:
> No. Jeez. :-(

Clint Hepner, 22.06.2014 14:29:
There's one more, even more obvious (IMO) option for an empty set literal: U+2205, EMPTY SET. But that opens a whole other can of worms, namely expanding the grammar to allow Unicode characters outside of identifiers.
... and then teaching people how to type them on their keyboards. Stefan

On Sun, Jun 22, 2014 at 10:48 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
At least the concept of "empty set literal" has more merit than "save a few keystrokes on the word 'lambda'". Even if it isn't something easily typed, it would have value over the current spelling of "set()", which isn't a literal. (Whether it has *enough* value over set() to be worth doing is still in question, but it's not like lambda vs λ.) ChrisA

On 22 June 2014 22:51, Chris Angelico <rosuav@gmail.com> wrote:
Yep, "status quo wins a stalemate" tends to be the winner on this particular topic. With a blank slate, the obvious choice is {} for the empty set and {:} for the empty dict, but Python doesn't have that option due to builtin sets arriving *long* after builtin dicts (for a very long time, sets weren't even in the standard library - folks just used dicts with the values all set to None). So, for those historical reasons, set() will likely persist indefinitely with its discontinuity in appearance between the "zero items" and "one or more predefined items" cases. Cheers, Nick, -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
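The discontinuity described above is easy to see at the interactive prompt. A minimal illustration (the variable names are only for this example):

```python
# {} has always meant "empty dict", so the empty set needs a call.
empty_braces = {}
empty_set = set()
one_item = {1}

print(type(empty_braces).__name__)  # dict
print(repr(empty_set))              # set()
print(repr(one_item))               # {1}

# The pre-set idiom mentioned above: a dict whose values are all None.
seen = dict.fromkeys(["a", "b", "a"])
print(sorted(seen))                 # ['a', 'b']
```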

i honestly don’t see the problem here. if people are too lazy to find an input method that works for them (Alt Gr, compose key, copy&paste), they should just continue to type ASCII, and leave the more elegant unicode variants for others.

this violates TSBOOWTDI, but as there’s also dict() next to {}, this shouldn’t be a problem either.

i like scala’s way to allow both <- and ←, as well as => and ⇒, and so on.

∅ and λ seem like good ideas to me as un-redefinable empty set literal and shorter/more elegant lambda. And “…” for “Ellipsis”.

there’s also ∀, ¬, ×, ∧, ∨, ∩, ∪, ∈, ∉, ≠, ≡, ≤, and ≥, but i think those are a bit much:

    my_set = ∅
    my_set ∪= other_set
    my_set = map(λ e: e × 5, my_set ∩ third_set)
    ∀ spam ∈ my_set:
        if spam ≡ None ∨ spam ≤ 8:
            print(spam ∉ allowed_values, ¬ spam)

vs.

    my_set = ∅
    my_set |= other_set
    my_set = map(λ e: e * 5, my_set & third_set)
    for spam in my_set:
        if spam is None or spam <= 8:
            print(spam not in allowed_values, not spam)

Problem: For years, various people have suggested that they would like to use syntactically significant unicode symbols in Python code. A prime example is using U+2205, EMPTY SET, ∅, instead of 'set()'. On the other hand, the conservative, overwhelmed core development group is not much interested and would rather do other things.

Solution: Act instead of ask. One or more of the people who really want this could get themselves together and produce a working system. (If multiple people, ask for a new sig and mailing list).

1. Ask core development to reserve '.pyu' for python with unicode symbols. (If refused, choose something else.)

2. Write pyu.py. It should first translate x.pyu to the equivalent x.py. If x.py exists, check the date (as with .py and .pyc). Optionally, but probably by default, run x.py.

Translation requires two operations: masking comments and string literals from translation and translating the remainder. I personally would start by doing the two operations separately, with separately testable functions.

    def codechunk(unisymcode):
        '''Yield code_or_not, code_chunk pairs for code with unicode symbols.

        Chunks are comments or string literals (code_or_not == False),
        and code that might have unicode symbols that need translation
        (code_or_not == True).
        '''
        <Simplified parser, possibly derived from tokenize.tokenize(),
        which already knows how to recognize comments and strings.>

    unisym = <dict mapping unicode ordinals to ascii replacements>

    def unisym2ascii(unisymcode):
        blocklist = []
        for code, block in codechunk(unisymcode):
            if code:
                block = block.translate(unisym)
            blocklist.append(block)
        return ''.join(blocklist)

3. Upload pyu.py to PyPI, *along with instructions on the various ways to enter unicode symbols on various systems*. Announce and promote.

On 6/22/2014 10:41 AM, Philipp A. wrote:
Being snarky can be fun, but if I wrote and distributed pyu.py, I would want as many users as possible.
I think the unisym dict should be inclusive and let people choose to use the symbols they want. I suspect I would use ≤ and ≥ sooner than λ. A mathematician that used most of those symbols, for a math audience, could still use the ascii translation for other audiences. On 6/22/2014 11:01 AM, MRAB wrote:
λ is a valid identifier in Python 3 because it's a letter.
Overall, I see this as less of a problem than the possibility of rebinding builtin names. The program could have a 'translate_lambda' (default True) parameter. But I would be willing to say that if you use unicode symbols, then you cannot also use λ as an identifier. (If one did, the resulting .py would stop with SyntaxError where 'lambda' replaced identifier λ.) -- Terry Jan Reedy
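The masking step proposed above can be sketched with the standard tokenize module. This is only one possible approach and not the chunk-based design from the post: it swaps ∅ for a placeholder identifier, lets tokenize decide which occurrences are real code, and restores the symbol everywhere else. The placeholder name is an assumption made for this sketch, and only ∅ is handled.

```python
import io
import tokenize

SYMBOL = "\u2205"                    # EMPTY SET
PLACEHOLDER = "__pyu_empty_set__"    # hypothetical name; must not occur in the input

def unisym2ascii(source):
    """Translate ∅ to set() in code, leaving strings and comments alone."""
    if PLACEHOLDER in source:
        raise ValueError("placeholder already present in source")
    # ∅ is not a valid token, so tokenize a copy that uses an identifier instead.
    masked = source.replace(SYMBOL, PLACEHOLDER)
    code_positions = set()
    for tok in tokenize.generate_tokens(io.StringIO(masked).readline):
        if tok.type == tokenize.NAME and tok.string == PLACEHOLDER:
            code_positions.add(tok.start)  # (row, col), rows are 1-based
    out_lines = []
    for row, line in enumerate(masked.split("\n"), start=1):
        pieces = []
        col = 0
        while True:
            hit = line.find(PLACEHOLDER, col)
            if hit == -1:
                pieces.append(line[col:])
                break
            pieces.append(line[col:hit])
            # Code occurrences become set(); everything else gets ∅ back.
            pieces.append("set()" if (row, hit) in code_positions else SYMBOL)
            col = hit + len(PLACEHOLDER)
        out_lines.append("".join(pieces))
    return "\n".join(out_lines)

example = 'items = \u2205  # \u2205 in a comment\nlabel = "\u2205 in a string"\n'
translated = unisym2ascii(example)
print(translated)
```

Extending this to other symbols would need a mapping from each symbol to its own placeholder, and operators such as ≤ would need a different masking trick because they are not identifiers.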

On 23 June 2014 10:30, Guido van Rossum <guido@python.org> wrote:
Hm. What's wrong with rejecting bad ideas?
While I agree it's a bad idea to use symbols that can't be readily typed as part of the language syntax, I think Terry's broader point that anything which *can* be implemented outside the core usually *should* be implemented outside the core (at least as a proof-of-concept) is a good one. Hy shows it is possible to implement a Lisp on top of the CPython runtime, so folks should certainly be capable of implementing a Python-with-Unicode-symbols on top of existing Python runtimes without needing the blessing of the core development team. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 22, 2014 at 8:23 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
This particular proposal sounds to me like something that shouldn't be implemented at all. We don't need another split in the community over how to spell operators.
Hy shows it is possible to implement a Lisp on top of the CPython runtime,
It wasn't proposed as a serious feature on python-ideas.
Terry *is* asking for a blessing of the .pyu extension by the core team. (Although it seems he wouldn't be too upset if he didn't get it. :-) -- --Guido van Rossum (python.org/~guido)

On Sun, Jun 22, 2014 at 09:00:20PM -0700, Guido van Rossum wrote:
I think you're exaggerating the danger here a tad. Split the community? We can barely get the community to grudgingly accept that maybe there's a use for Unicode *at all*, let alone use it as syntax :-) -- Steven

On Mon, Jun 23, 2014 at 5:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
AFAIU this *really* looks like a bad idea. I don't even understand why anyone would want to do such a thing. -- Giampaolo - http://grodola.blogspot.com

On 6/22/2014 8:30 PM, Guido van Rossum wrote:
Hm. What's wrong with rejecting bad ideas?
[I am not sure whether you are asking seriously or rhetorically, but I think this question is worth a serious response.]

Aside from the fact that different people have different ideas of what is an absolutely bad idea, nothing. I personally reject almost all new syntax ideas because I think most of them are local small-audience optimizations that would overall make Python worse.

However, the purpose of python-ideas is "Discussions of speculative Python language ideas". 'Discussion' means not routinely trying to stop discussion. Indeed, some good can come from discussion of ideas I (or you) think are bad.

Rejection has multiple forms, not mutually exclusive:

    Inaction: by default, an idea is effectively rejected until a patch is committed.
    Education: explaining how one can already accomplish the desired task.
    Deflection: suggest implementing the idea somewhere other than in core Python.
    Explanation: explain why something is bad.
    Downvote or BDFL rejection: (self-explanatory)
Specifically, I believe people have asked that Python parsers accept and translate unicode symbols *in .py files*. This would have the immediate effect of making some .py files invisibly and unnecessarily incompatible with all existing Python interpreters, even if the translated code would run just fine. I, too, do not want the meaning of '.py' fragmented further than it already is.
In other words, the idea of changing Python itself has been and will be rejected by inaction for at least the next few years, and until circumstances change after that. (Hence, no need for *me* to 'reject' it.)
Solution: Act instead of ask.
'Stop asking' is not only rejection of the idea of changing Python, but also of continuing the discussion that has gone on for years. People who do not want to give up the idea should do something else. In the course of suggesting an implementation, I also suggested some aspects of an implementation that I consider important.
Discuss it elsewhere because python-ideas is not 3rd-party-package-dev.
1. Ask core development to reserve '.pyu' for python with unicode symbols. (If refused, chose something else.)
In other words, 1. do not use .py for unisym_python. 2. while .pyu seems like an obvious alternative (to me), recognize python-devs moral rights to .pyx, regardless of legalities.
2. Write pyu.py. It should first translate x.pyu to the equivalent x.py. ... run x.py.
To be clear, I meant write x.py to disk, where it would be available for humans to read. This is specifically aimed at the issue of 'fragmenting the community'.
Again, I would want the standard .py file available. In my first post to clp/python list over 17 years ago, I dubbed Python 'executable pseudocode' and opined that it should be used to communicate algorithms in preference to non-executable notation. I would rather a mathematician use symbols embedded in Python, with a link to a .py file, than the same symbols in a non-executable *and non-testable* notation. -- Terry Jan Reedy

Sorry Terry, I was short (and ended up being cryptic) because I was on a mobile device. I meant "this is a bad idea and should be rejected", and in addition I also meant to discourage a 3rd party implementation of the idea. I also wanted to object to your claim that this idea has only been left unimplemented because of disinterest or inaction by the core dev team; to the contrary, the general sentiment is pretty clear that it's a bad idea. There are other ideas that are not suitable for adding to the language but where we would encourage folks to help themselves by writing a module or extension and posting it on PyPI, or even ideas where it would eventually be a good idea to include such a package into the stdlib. But this is not one of them. On Mon, Jun 23, 2014 at 12:11 PM, Terry Reedy <tjreedy@udel.edu> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Jun 22, 2014, at 16:18, Terry Reedy wrote:
2. Write pyu.py. It should first translate x.pyu to the equivalent x.py.
What is the equivalent x.py for "BUILD_SET 0" rather than "LOAD_GLOBAL (set), CALL_FUNCTION 0"?
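The difference behind this question can be checked with the dis module. A small sketch, kept independent of exact opcode listings because those change between CPython versions: the call form records the name "set" for a run-time lookup, while a set display uses BUILD_SET and needs no name at all.

```python
import dis

call_code = compile("set()", "<example>", "eval")
display_code = compile("{1, 2}", "<example>", "eval")

def opnames(code):
    """Return the opcode names used by a code object."""
    return [ins.opname for ins in dis.get_instructions(code)]

# The call form has to look up the name "set" when it runs.
print("set" in call_code.co_names)
# The display form has a dedicated opcode and looks up nothing.
print("BUILD_SET" in opnames(display_code), display_code.co_names)

# Rebinding the name changes what "set()" returns, but not what a display builds.
namespace = {"set": list}
rebound_call = eval(call_code, namespace)
display_value = eval(display_code, namespace)
print(rebound_call, display_value)
```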

On Sat, Jun 28, 2014 at 3:00 PM, <random832@fastmail.us> wrote:
Is there any reason that it has to be normal-looking source code?

    def empty_set_literal(): # line 123 of somefile.pyu
        print("I'm an empty set!", ∅)

    # becomes

    empty_set_literal = type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm an empty set!",{}),('print',),(),"somefile.pyu","empty_set_literal",123,b"\x00\x01"),globals(),"empty_set_literal")

I got most of the args for the code() constructor by disassembling the function, using a one-element set, and then manually edited the code afterward. It does appear to work:
Given that the purpose of this is to make something executable, not something readable (in contrast to, say, 2to3), I don't think it would be a problem to have nightmare-level code in there occasionally. That said, I'm not particularly in favour of the proposal - I just felt like answering this part of it :) ChrisA

On Sat, Jun 28, 2014 at 3:28 PM, Chris Angelico <rosuav@gmail.com> wrote:
Here's a POC translator. Give it a string with the source code for one function, and it'll give back a string that'll generate a similar function. Currently assumes it's working at top level - doesn't handle nested functions, methods, etc, etc. But it seems to work. https://github.com/Rosuav/shed/blob/master/empty_set.py ChrisA

On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote:
empty_set_literal = type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm
If you're embedding the entire compiler (in fact, a modified one) in your tool, why not just output a .pyc?

On Tue, Jul 1, 2014 at 3:18 AM, <random832@fastmail.us> wrote:
I'm not, I'm calling on the normal compiler. Also, I'm not familiar with the pyc format, nor with any of the potential pit-falls of that approach. But if someone wants to make an "alternative front end that makes a .pyc file" kind of thing, they're most welcome to. ChrisA

First, two quick side notes:

It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial.

I doubt either of those would be useful often enough that anyone wants to put in the work. But then I doubt the empty-set literal would be either, so anyone who seriously wants to work on this might want to work on the inline assembly and/or hookable compiler first.

Anyway:

On Monday, June 30, 2014 3:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Jul 1, 2014 at 3:18 AM, <random832@fastmail.us> wrote:
I think it makes more sense to use types.FunctionType and types.CodeType here than to generate two extra functions for each function, even if that means you have to put an import types at the top of every munged source file.
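For readers unfamiliar with the two names, types.FunctionType and types.CodeType are the same objects as type(lambda:0) and type((lambda:0).__code__). A minimal sketch of wrapping a compiled code object in a function this way; the greet function and the "<example>" filename are made up for illustration, and no bytecode is modified here:

```python
import types

SOURCE = '''
def greet(name):
    return "hello, " + name
'''

module_code = compile(SOURCE, "<example>", "exec")

# A module's code object stores each top-level function body as a constant.
func_code = next(
    const for const in module_code.co_consts
    if isinstance(const, types.CodeType) and const.co_name == "greet"
)

# FunctionType(code, globals, name) wraps a code object in a callable.
greet = types.FunctionType(func_code, globals(), "greet")
print(greet("world"))
```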
The tricky bit with making a .pyc file is generating the header information; last I checked (quite a while ago, and not that deeply…) that wasn't documented, and there were no helpers exported to Python.

But I think what he was suggesting is something like this: Let py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions.

Besides being a lot less work, his version works for ∅ at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something).

But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful. You obviously need the function to compile, so you have to replace the ∅ with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled.

The only thing I can think of off the top of my head is to replace it with whichever of [], (), or {} doesn't appear anywhere in the code being compiled, then you can search-replace BUILD_LIST/TUPLE/MAP 0 with BUILD_SET 0. But what if all three appear in the code? Raise a SyntaxError('Cannot use all 4 kinds of empty literals in the same scope')?
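The first half of the "compile normally, then munge the .pyc" idea can be sketched as below. It assumes the CPython 3.7+ file layout from PEP 552 (a 16-byte header followed by a marshalled code object), which is newer than this thread; the demo.py file name is made up, and the bytecode rewriting step itself is left out because it depends on the interpreter version.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

HEADER_SIZE = 16  # assumption: CPython 3.7+ layout (magic, flags, then 8 more bytes)

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.py")
    with open(src_path, "w", encoding="utf-8") as handle:
        handle.write("VALUE = set()\n")
    # py_compile.compile returns the path of the file it wrote.
    pyc_path = py_compile.compile(
        src_path, cfile=os.path.join(tmp, "demo.pyc"), doraise=True
    )
    with open(pyc_path, "rb") as handle:
        raw = handle.read()

header, payload = raw[:HEADER_SIZE], raw[HEADER_SIZE:]
magic_ok = header[:4] == importlib.util.MAGIC_NUMBER

# The payload is an ordinary code object; a munging tool would rewrite it here
# and write header + marshal.dumps(new_code) back out.
module_code = marshal.loads(payload)
namespace = {}
exec(module_code, namespace)
print(magic_ok, namespace["VALUE"])
```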
One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function. So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the current frame, and I don't think there's a way to do that from Python. So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function. And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore. And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing.

On Mon, Jun 30, 2014 at 04:48:14PM -0700, Andrew Barnert wrote:
Again, to be absolutely clear here, I hate this idea. `set()` is perfectly clear. Sorry. Had to be said before any of this.

Right, so, this was brought up before, but with Hylang (https://github.com/hylang/hy), we abuse the PEP 302 new import hooks to search sys.path for .hy files rather than .py files.

You could do the same for your .pyu files (again, *without* the blessing of the core team, as this is insane), and do the mangling before passing it to the normal internals to turn it into bytecode / AST.

Doing it this way means you won't have to futz with the compiler, and you can remain happy. And we like being happy.

More info:

    https://github.com/hylang/hy/blob/master/hy/importer.py
    http://slides.pault.ag/hy.html#/15
    https://www.youtube.com/watch?v=AmMaN1AokTI
    https://www.youtube.com/watch?v=ulekCWvDFVI
    http://legacy.python.org/dev/peps/pep-0302/

Again, this approach can be a bit flaky, and this particular issue might very well cause problems for us as a community - seeing as how the syntax is almost exactly identical. Hylang (for what it's worth) is just a nice way for us Lisp nerds to stop complaining as much.

Godspeed,
Paul

-- #define sizeof(x) rand() </paul> :wq
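A rough sketch of what such an import hook could look like with today's importlib. This is not Hy's importer; the PyuLoader and PyuFinder names are hypothetical, and the translation is a deliberately naive text replacement that would also rewrite ∅ inside strings and comments.

```python
import importlib.abc
import importlib.machinery
import importlib.util
import os
import sys

class PyuLoader(importlib.machinery.SourceFileLoader):
    """Hypothetical loader: rewrites ∅ to set() before compiling."""

    def source_to_code(self, data, path, **kwargs):
        text = importlib.util.decode_source(data)
        # Naive on purpose: no masking of strings or comments.
        translated = text.replace("\u2205", "set()")
        return super().source_to_code(translated, path, **kwargs)

class PyuFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder: looks for <name>.pyu on the search path."""

    def find_spec(self, fullname, path=None, target=None):
        name = fullname.rpartition(".")[2]
        for entry in (path if path is not None else sys.path):
            if not isinstance(entry, str):
                continue
            candidate = os.path.join(entry or os.getcwd(), name + ".pyu")
            if os.path.isfile(candidate):
                return importlib.util.spec_from_file_location(
                    fullname, candidate, loader=PyuLoader(fullname, candidate)
                )
        return None

def install():
    """Register the finder once; appended so .py modules keep priority."""
    if not any(isinstance(finder, PyuFinder) for finder in sys.meta_path):
        sys.meta_path.append(PyuFinder())
```

After calling install(), an ordinary "import spam" would pick up spam.pyu from any sys.path directory that has no regular spam module. As the reply below points out, the result still compiles to a call of the global set, not to BUILD_SET 0.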

On Monday, June 30, 2014 5:18 PM, Paul Tagliamonte <paultag@gmail.com> wrote: [snip]
The reason for needing to futz with the compiler is to generate source code that actually compiles to the bytecode to build an empty set directly, instead of the bytecode to load and call the "set" global. I agree with both you and Guido that the whole thing is silly, and set() is fine. I also agree with your implied assumption that, even if you needed an empty set literal, having it compile to the same thing as set() would be fine. But those who disagree with both, and really want an empty set literal that compiles to "BUILD_SET 0", cannot have it without futzing with the compiler. So, I'd encourage them to work on making the compiler more futzable (which surely more people would have a use for than the number of people for whom set() is intolerably slow, or unusable because they want to redefine the global).

On Tue, Jul 1, 2014 at 9:48 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
First, two quick side notes:
It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial.
That would be interesting, but it raises the possibility of mucking up the stack. (Imagine if you put BUILD_SET 1 in there instead. What's it going to make a set of? What's going to happen to the rest of the stack? Do you REALLY want to debug that?)

Back when I did a lot of C and C++ programming, I used to make good use of a "drop to assembly" feature. There were two broad areas where I'd use it: either to access a CPU feature that the compiler and library didn't offer me (like CPUID, in its early days), or to hand-optimize some code. Then compilers got better and better, and the first set of cases got replaced with library functions... and the second lot ended up being no better than the compiler's output, and potentially a lot worse - particularly because they're non-portable.

Allowing a "drop to bytecode" in CPython would have the exact same effects, I think. Some people would use it to create an empty set, others would use it to replace variable swapping with a marginally faster and *almost* identical stack-based swap:

    x,y = y,x

    LOAD_GLOBAL y
    LOAD_GLOBAL x
    ROT_TWO
    STORE_GLOBAL x
    STORE_GLOBAL y

becomes

    LOAD_GLOBAL x
    LOAD_GLOBAL y
    STORE_GLOBAL x
    STORE_GLOBAL y

Seems fine, right? But it's a subtle change to semantics (evaluation order), and not much benefit anyway. Plus, if it's decided that this semantic change is safe (if it's provably not going to have any significance), a future version of CPython would be able to make the exact same optimization, while leaving the code readable, and portable to other Python implementations.

So while an inline bytecode assembler might have some uses, I suspect it'd be an attractive nuisance more than anything else.
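The evaluation-order point is observable from plain Python. In this small sketch (the helper name is made up), neither x nor y exists, so whichever name is loaded first is the one the NameError reports; the normal compilation loads y first, whereas the hand-reordered bytecode would report x.

```python
def first_missing_name(statement):
    """Run `statement` in an empty namespace and return the NameError text."""
    try:
        exec(statement, {})
    except NameError as exc:
        return str(exc)
    return None

# The right-hand side "y, x" is evaluated left to right, so y is looked up first.
message = first_missing_name("x, y = y, x")
print(message)
```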
Sure. This is just a proof-of-concept anyway, and it's not meant to be good code. Either way works, I just tried to minimize name usage (and potential name collisions).
But I think what he was suggesting is something like this: Let py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions.
Besides being a lot less work, his version works for ∅ at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something).
Sure. But all I was doing was responding to the implied statement that it's not possible to write a .py file that makes a function with BUILD_SET 0 in it. Translating a .pyu directly into a .pyc is still possible, but was not the proposal.
But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful. You obviously need the function to compile, so you have to replace the ∅ with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled.
What I did was put in a literal string. https://github.com/Rosuav/shed/blob/master/empty_set.py It uses "∅ is set()" as a marker, depending on that string not existing in the source. (I could compile the function twice, once with that string, and then a second time with another string; the first compilation would show what consts it uses, and the program could then generate an arbitrary constant which doesn't exist.) The opcode is the right length (assuming it doesn't go for EXTENDED_ARG, which I've never heard of; it seems to be necessary if you have more than 64K consts/globals/locals in a function???), and the resulting function has an unnecessary const in it. It wouldn't be hard to drop it (the code already parses through everything; it could just go "if it's LOAD_CONST, three options - if it's the marker, switch in a BUILD_SET, if it's less than the marker, no change, if it's more than the marker, decrement"), but it doesn't seem to be a problem to have an extra const in there.
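The "compile once to see which constants are used, then pick a marker that isn't one of them" idea can be sketched as follows. This is not the code from the linked script; string_constants and pick_marker are hypothetical helpers, and the source passed in is assumed to be already compilable (that is, with ∅ already substituted).

```python
import types

def string_constants(consts):
    """Yield every string constant, descending into nested code objects,
    tuples and frozensets."""
    for const in consts:
        if isinstance(const, str):
            yield const
        elif isinstance(const, types.CodeType):
            yield from string_constants(const.co_consts)
        elif isinstance(const, (tuple, frozenset)):
            yield from string_constants(const)

def pick_marker(source, base='∅ is set()'):
    """Return a string that does not occur as a constant in the compiled source."""
    code = compile(source, '<marker-check>', 'exec')
    used = set(string_constants(code.co_consts))
    marker, n = base, 0
    while marker in used:
        n += 1
        marker = '{} #{}'.format(base, n)
    return marker

print(pick_marker('def f():\n    return "∅ is set()"\n'))
```

The first compilation is used only to collect constants; a second compilation with the chosen marker would then be the one whose bytecode gets rewritten.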
One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function.
It handles arguments and stuff. All the attributes of the original function object get passed through unchanged to the resulting function, with the exception of the bytecode, obviously.
So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the current frame, and I don't think there's a way to do that from Python. So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function. And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore.
Ah, that part I've no idea about. But it wouldn't be impossible for someone to develop that a bit further.
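For the easier half of the closure question, where the function object already exists at run time, its cells can simply be reused. The sketch below assumes exactly that situation; rebuild is a hypothetical helper, and it does not address the harder case raised above of emitting source that has to construct the closure itself.

```python
import types

def rebuild(func, new_code=None):
    """Create a new function that reuses everything from func except,
    optionally, the code object."""
    code = func.__code__ if new_code is None else new_code
    clone = types.FunctionType(
        code,
        func.__globals__,
        func.__name__,
        func.__defaults__,
        func.__closure__,  # the existing cell objects are reused as-is
    )
    clone.__kwdefaults__ = func.__kwdefaults__
    clone.__dict__.update(func.__dict__)
    return clone

def make_counter():
    count = 0
    def bump(step=1):
        nonlocal count
        count += step
        return count
    return bump

bump = make_counter()
clone = rebuild(bump)
print(bump(), clone())  # both calls update the same cell
```

A replacement code object passed as new_code must declare the same free variables as the original, otherwise the function constructor rejects the closure tuple.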
And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing.
Right. I absolutely agree with your conclusion (not worth doing), and always have had that view. This is proof that it's kinda possible, but still a bad idea. Now, if someone comes up with a really compelling use-case for an empty set literal, then maybe it'd be more important; but if that happens, CPython will probably grow an empty set literal in ASCII somehow, and then the .pyu translation can just turn ∅ into that. ChrisA

Before I get to the reply, because I couldn't find a 3.x-compatible bytecode assembler, I slapped one together at https://github.com/abarnert/cpyasm. I think it would be reasonably possible to use this to add inline assembly to a preprocessor, but I haven't tried, because I don't have a preprocessor I actually want, and this was the fun part. :)
On Monday, June 30, 2014 5:39 PM, Chris Angelico <rosuav@gmail.com> wrote:
The same thing that happens if you use bad inline assembly in C, or a bad C extension module in Python—bad things that you can't debug at source level. And yet, inline assembly in C and C extension modules in Python are still quite useful. Of course the difference is that you can drop from the source level to the machine level pretty easily in gdb, lldb, Microsoft's debugger, etc., while you can't as easily drop from the source level to the bytecode level in pdb. (I'm not sure that wouldn't be an interesting feature to add in itself, but that's getting even farther off topic, so forget it for now.)
Back when I did a lot of C and C++ programming, I used to make good
I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions—which are compiled with the same compiler you use for your code—have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it. There are cases where that isn't true. For example, most modern compilers that care about x86 have extended the C language in some way to make it unnecessary for you to write LOCK CMPXCHG all over the place if you want to do lockfree refcounting (and, even better, they've done so in a way that also does the right thing on ARM 9 or SPARC or whatever else you care about). Or, in some cases, they've done something halfway in between, adding "compiler intrinsic functions" that look like functions, but are compiled directly into inline asm. But either way, that didn't happen until a lot of people were publishing code that used that inline assembly. Otherwise, the compiler vendors have no reason to believe it's necessary to add a new feature. Plus, people still needed to keep distributing code that uses the inline asm for years, until the oldest compiler and library on every platform they support incorporated the change they needed. And, just as you say, I think it would have the exact same effects in CPython. If we added inline bytecode asm to 3.5, and there were actually something useful to do with it, people would start doing it, and that's how we'd know that something useful was worth adding to the language, and when we added that something useful in 3.7, eventually people could start using that, and then it would be years before all of the projects that need that feature either die or require 3.7. 
But that's not a problem; that's inline asm working exactly as it should. There is one good reason to reject the inline asm idea: if it's unlikely that there will be anything worth using it for (or if it might plausibly be useful, but not useful enough to be worth anyone doing the work). Which I think is at least plausible, and maybe likely.
Do you really think anyone would do the latter? Seriously, what kind of code can you imagine that's too slow in CPython, not worth rewriting in C or running in PyPy or whatever, but fast enough with the rot opcode removed? And if someone really _did_ need that, I doubt they'd care much that Python 3.8 makes it unnecessary; they obviously have a specific deployment platform that has to work and that needed that last 3% speedup under 3.6.2, and they're going to need that to keep working for years. The former, maybe. Not just to allow ∅, but maybe someone would want to write a Unicode-math-ified Python dialect as an import-hook preprocessor that used inline asm among other tools. In which case… so what? That's not going to be something that people just randomly drop into their code, there will be a single project with however many users, which will be no worse for the Python community than Hylang. If their demonstration is just so cool that everyone decides we need Unicode symbols in Python core, then great. If not, and they still want to keep using it, well, a simpler preprocessor will be easier for the rest of us to understand than a ridiculously complicated one that does bytecode hackery, or than a hacked-up CPython compiler.
So while an inline bytecode assembler might have some uses, I suspect
it'd be an attractive nuisance more than anything else.
I honestly don't see it becoming an attractive nuisance. I can easily see it just not getting used for anything at all, beyond people playing with the interpreter. And now, on to your other replies:
On Monday, June 30, 2014 3:12 PM, Chris Angelico <rosuav@gmail.com>
Agreed, I just think it's an _easier_ proposal than yours, not a harder one (assuming you want to actually build the real thing, not just a proof of concept), which I think is why Random suggested it. Also, again, I don't think a real project that allowed ∅ in a def but not in a lambda, class, or top-level code would be acceptable to anyone, and I don't see how your solution can be easily adapted to those cases (well, except lambda). [snip, and everything below here condensed]
I assumed that leaving the unnecessary const behind was unacceptable. After all, we're talking about (hypothetical?) people who find the cost of LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable… But you're right that fixing up all the other LOAD_CONST bytecodes' args is a feasible way to solve that.
So, if the function is a closure, how do you do that? Ah, that part I've no idea about. But it wouldn't be impossible for
someone to develop that a bit further.
Not impossible, but very hard, much harder than what you've done so far. Ultimately, I think that just backs up your larger point: This is doable, but it's going to be a lot of work, and the benefit isn't even nearly worth the cost. My point is that there are other ways to do it that would be less work and/or that would have more side benefits… but the benefit still isn't even nearly worth the cost, so who cares? :)

On Tue, Jul 1, 2014 at 6:04 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Right, useful but it adds another set of problems. (Just out of curiosity, what protection _is_ there for a smashed stack? I just tried fiddling with it and didn't manage to crash stuff.)
I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions—which are compiled with the same compiler you use for your code—have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it.
Or those library functions are written in assembly language directly. It's entirely possible to write something that uses CPUID and doesn't use inline assembly in a C source file. The equivalent here, I suppose, would be hand-rolling a .pyc file.
Hang on, you're asking two different questions there. I'll split it out:
1) Do I really think anyone *should* do this? Your subsequent comments support this question, and the answer is resoundingly NO. CPython is not the sort of platform on which that kind of thing is ever worth doing. You'll get far more performance by using Cython for parts, or in some other way improving your code, than you will by hand-tweaking the Python bytecode.
2) Do I think anyone would, if given the ability to tweak the bytecode, go "Ah ha!" and proudly improve on what the compiler has done, and then brag about the performance improvement? Definitely. Someone will. It'll make some marginal difference to a microbenchmark, and if you don't believe that would cause people to warp their code into utter unreadability, you clearly don't hang out on python-list enough :)
The "attractive nuisance" part is with microbenchmarking. Code won't materially improve, and it'll be markedly worse in readability/maintainability and portability (although the latter probably doesn't matter all that much; a lot of people's code will be suboptimal on Pythons other than CPython, if only for lack of 'with' statements around files and such), with the addition of such a feature.
I'm not sure whether the problem is the cost of LOAD_GLOBAL followed by CALL_FUNCTION (and, by the way, one unnecessary constant in the function won't have anything like that cost - a bit of wasted RAM, but not a function call), or the fact that such a style is vulnerable to shadowing of the name 'set', which admittedly is a very useful name. But in any case, it's quite solvable.
Yep. Maybe someone (great, that probably means me) should write this up into a PEP for immediate rejection or withdrawal, just to be a document to point to - if you want an empty set literal, answer these objections. ChrisA

On Tuesday, July 1, 2014 1:39 AM, Chris Angelico <rosuav@gmail.com> wrote:
I believe there are cases where the interpreter can detect that you've gone below 0 and raise an exception, but in general there's no protection, or at least nothing you can count on. For example, assemble this code as a complete function:
    CALL_FUNCTION 1
    RETURN_VALUE
In 3.4.1, on my Mac, I get a bus error. But, even when you don't manage to crash the interpreter, when you just confuse it at the bytecode level, there's still no way to debug that except by dropping to gdb/lldb/etc.
Yeah, that's entirely possible, but that's not how the linux device driver or the FreeBSD libc wrapper do it; they use inline assembly. Why? Well, for one thing, you get the function prolog and epilog code appropriate for your compiler automatically, instead of having to write it yourself. Also, you can do nice things like cast the result to a struct that you defined in C (which could be done with, e.g., a C macro wrapping the assembly source, but that's just making things more complicated for no benefit). And you don't need to know how to configure and run an assembler alongside the C compiler to build the device. And so on. Basically, the C versions of the exact same reasons you wouldn't want to hand-roll a .pyc file in Python…
Using ctypes to load Python.so to swap the pointers under the covers is already significantly faster, and would still be significantly faster than your optimized bytecode, and yes, people have suggested it on at least two StackOverflow questions. For that matter, you can already do exactly your optimization with a relatively simple bytecode hack, which would look a lot worse than the inline asm and have the same effect. Also, that bytecode hack could be factored out into a function, without any performance cost except a constant cost at .pyc time, while the inline asm obviously can't, another reason the inline asm (which would have to be written inline, and edited to fit the variables in question, each time) would be less of an attractive nuisance than what's already there. Sure, there may be a few people who are looking for horrible micro-optimizations like this, would know enough to figure out how to do this with inline asm, would not know how to do it with bytecode hacks, would not know any of the better (as in much worse, to anyone but them) alternatives, etc., but I think that number is vanishingly small.
What I did was put in a literal string…
I realize the cost of an extra LOAD_GLOBAL is much smaller than an extra CALL_FUNCTION, it's just that I think in 99.9999% of real cases neither will make a difference, and anyone who's objecting to the latter on principle will probably also object to the former on principle…
I think Terry Reedy actually had a better answer: just tell people to implement it, polish it up, put it on PyPI, and come back to us when they're ready to show off their tons of users who can't live without it. Random objected that wasn't possible, in which case Terry's idea is more of a dismissal than a helpful suggestion, but I think https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy.

On 7/1/2014 6:51 AM, Andrew Barnert wrote:
'Random' said something quite different. He only noted that if '∅' were translated to 'set()', then the resulting CPython-specific bytecode would continue to be "LOAD_GLOBAL (set), CALL_FUNCTION 0" rather than the 'optimized' "BUILD_SET 0". He also noted (objected) that there is no Python code that CPython currently compiles as "BUILD_SET 0". Well, it's unfortunate that {} is not available. If it were, there would be no issue, to me anyway, of using '∅'. However, optimizing CPython bytecode, and compiler hooks, are completely different issues from translating unisym python to standard python that could run on any implementation of Python. If we thought the bytecode difference was important (which most do not), we could have a peephole optimizer to 'fix' it, completely independently of the existence of '∅' or any idea of using it in python code.
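The difference between the two spellings is visible on the compiled code objects themselves. A minimal sketch, with one loud assumption: the '{*()}' display used for comparison only exists from Python 3.5 on (PEP 448), so it postdates this thread and stands in here for a hypothetical empty-set literal.

```python
import dis

call_code = compile('set()', '<call>', 'eval')
display_code = compile('{*()}', '<display>', 'eval')

# Names that the bytecode has to look up when it runs.
print('set() looks up:', call_code.co_names)
print('{*()} looks up:', display_code.co_names)

# The exact instructions differ between CPython versions; print, don't rely on them.
dis.dis(call_code)
dis.dis(display_code)
```

Only the call form carries the name 'set' in co_names, which is the run-time lookup being discussed.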
in which case Terry's idea is more of a dismissal than a helpful suggestion,
My post was a dismissal of the idea of changing python itself *and* a suggestion of how to proceed without involving pydev.
https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy.
I consider producing (or at least being able to produce) a standard .py file that can be published outside the specialized group run on and debugged on standard interpreters to be essential to any sensible idea for augmented Python code (whether with unicode symbols or anything else, such as native-language keywords). However, as I said before, off topic here for unicode symbols, though not on python-list. -- Terry Jan Reedy

Somewhere in this thread, someone mentioned https://github.com/ehamberg/vim-cute-python (and something similar for emacs, but I'm a vim user). I'm not sure if this mention was a joke or not, but I thought it looked cool and started using it. I can't decide if I actually find it useful or distracting, but in truth it seems to answer the *entire* concern of anyone wanting to see an empty-set symbol (but not to save one bytecode instruction, I admit), and also various other math symbols that name concepts spelled in ASCII in Python. While some hypothetical .pyu translation tool or import hook could do the same thing, this really *does* seem like something to just do at the editor level since there's nothing *semantic* about the new symbols, just a way for them to look. On Tue, Jul 1, 2014 at 1:04 PM, Terry Reedy <tjreedy@udel.edu> wrote:

On Tuesday, July 1, 2014 1:05 PM, Terry Reedy <tjreedy@udel.edu> wrote:
You're reading a lot into a 2-line message, but your take is that he interpreted the problem as needing to compile "BUILD_SET 0", and pointed out that there is no way to do that with a source preprocessor. You can insist that they're two separate problems to be solved (or, maybe, not solved), and I think you're right. You just have to make that point—as you, I, and half a dozen others have done since his original post. But meanwhile, Chris Angelico offered a solution to the problem that answers his complaint, and I offered another solution that doesn't even require bytecode hacking. That shows that even if you accept the objection, it still doesn't block anyone.
First, as others have pointed out, it's not just, or even primarily, an optimization, it's also a semantic difference.
If we thought the bytecode difference was
But you can't make semantic changes in a peephole optimizer. You'd have to first change the language to document that set() may (or may not!) return an empty set even if the name set resolves to something different. This isn't entirely unique in Python history (e.g., back when you could redefine False through various kinds of trickery, the compiler was still allowed to optimize out if False: code), but it's very unusual. And nobody's going to do that for a minor optimization (if False:, besides being a potentially huge optimization, also _fixes_ a semantic problem, rather than causing one, since False was supposed to be un-redefinable, but wasn't because of various holes).
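The semantic difference is easy to demonstrate. In this small sketch the '{*()}' display (Python 3.5+, so newer than this thread) stands in for a literal that cannot be shadowed; with_shadowing is just an illustrative name.

```python
def with_shadowing():
    set = list  # deliberately shadow the builtin name
    return set(), {*()}

called, displayed = with_shadowing()
print(type(called).__name__, type(displayed).__name__)
```

An optimizer that silently rewrote the call into BUILD_SET 0 would change what the first expression returns here, which is why it would be a language change rather than an optimization.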
My point is that _if_ you take Random's objection as being critical, _then_ your post dismisses the idea, even though it wasn't intended to. You can follow up in two ways: challenge his objection, or answer his objection; there were replies doing both, and if either of the two succeeds, the idea is still alive for people to take further if they want.
My approach is made up of nothing but standard .py files. Those files can be published outside a specialized group, and run and debugged on CPython 3.4+. They can also be edited by people outside that specialized group, without needing a specialized build process involving a preprocessor, just a standard Python module that they already have. Sure, it only works on CPython, but Python 3.4, scipy, etc. also currently only work on CPython, and that doesn't prevent a large community of users from making use of them, publishing code outside a specialized group, and, most importantly for the topic at hand, coming up with suggestions that are germane to Python as a whole and taken seriously. For example, nobody suggested that PEP 465 wasn't a sensible idea because all of the sample code presented only runs on CPython; the idea itself is clearly portable, the community using such code is gigantic and mature, and that's all that matters. Finally, I don't think anyone actually needs this feature, but I was able to whip up a proof of concept in an hour that provides it. Anyone who seriously wants to pursue it doesn't have to use my approach, much less my code; it still serves as an existence proof that what they want to do can be done, meaning they should go do it.

On Tue, Jul 01, 2014 at 06:38:37PM +1000, Chris Angelico wrote: [...]
1) Do I really think anyone *should* do this? Your subsequent comments support this question, and the answer is resoundingly NO. CPython is
"This" being trying to micro-optimize code by bytecode-hacking.
I think that micro-optimization is probably the wrong reason to hack bytecodes. What I'm more interested in is exploring potential new features, or adding functionality, for example:
* Adding the ability to trace individual expressions, not just lines: http://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.htm...
* Exploring dynamic scoping: http://www.voidspace.org.uk/python/articles/code_blocks.shtml
* A proposal from Python 2.3 days for a brand-new decorator syntax: http://code.activestate.com/recipes/286147
* A (serious!) defence of GOTO in Python: http://www.dr-josiah.com/2012/04/python-bytecode-hacks-gotos-revisited.html (although even Josiah doesn't suggest using COMEFROM :-)
I don't know that such bytecode manipulations should be provided in the standard library, and certainly not as a built-in "asm" command. But, I think that we ought to acknowledge that bytecode hacking has a role to play in the wider Python ecosystem. I'm led to understand that in the Java community, bytecode hacking is, perhaps not common, but accepted as something that power users do when all else fails: https://weblogs.java.net/blog/simonis/archive/2009/02/we_need_a_dirty.html
[Aside: does Python do any sort of verification of the bytecode before executing it, as Java does?] -- Steven

On 1 July 2014 10:33, Steven D'Aprano <steve@pearwood.info> wrote:
https://pypi.python.org/pypi/withhacks and https://pypi.python.org/pypi/byteplay may also be of interest to anyone wishing to seriously tinker with what the CPython VM (as opposed to Python-the-language) already supports. I also highly advise working with Python 3.4, since we made some substantial improvements to the dis module API (adding the yield from tests for 3.3 highlighted how limited the previous API was for testing purposes, so we fixed it in a way that made bytecode easier to work with in general). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
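As a rough illustration of the 3.4 dis API mentioned here, dis.get_instructions yields Instruction objects instead of printing text. The helper opnames and the sample function make_empty are made up for this sketch, and the instruction names it prints vary between CPython versions.

```python
import dis

def opnames(func):
    """List the opcode names of a function, in order (CPython-specific)."""
    return [ins.opname for ins in dis.get_instructions(func)]

def make_empty():
    return set()

for ins in dis.get_instructions(make_empty):
    print(ins.offset, ins.opname, ins.argrepr)
```

Because each instruction is an object with named fields, tests and bytecode tools can inspect it directly rather than parsing the output of dis.dis.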

On 1 July 2014 10:33, Steven D'Aprano <steve@pearwood.info> wrote:
[Aside: does Python do any sort of verification of the bytecode before executing it, as Java does?]
Nope, it will happily attempt to execute invalid bytecode. That's actually one of the reasons executing untrusted bytecode is even less safe than executing untrusted source code - it's likely to be possible to trigger segfaults that way. There's an initial attempt at a bytecode verifier on PyPI (https://pypi.python.org/pypi/Python-Bytecode-Verifier/), and I have a vague recollection that Google have a bytecode verifier kicking around somewhere, but there's nothing built in to the CPython runtime. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 2014-07-01 19:16, Nick Coghlan wrote:
The re module also uses a kind of bytecode that's generated by the Python front end and verified by the C back end. The bytecode contains things like offsets; for example, the bytecode that starts a repeated sequence has an offset to the corresponding bytecode that ends it, and vice versa. The problem with that is that the structure (i.e. the nesting) is no longer explicit, so it's more difficult to spot misnested structures. For the regex module, I decided that it would be easier to verify if I kept the structure explicit by using bytecodes to indicate the start and end of the structures. For example, a repeated sequence could be indicated by having a structure like GREEDY_REPEAT min_count max_count ... END. The C back end could then build the internal representation that's actually interpreted.
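Keeping the structure explicit makes verification a matter of checking that openers and END markers balance. The sketch below only illustrates that idea and is not the regex module's actual format: the opcode names other than GREEDY_REPEAT and END, the tuple layout, and check_nesting are all assumptions.

```python
# Hypothetical structured opcodes; only GREEDY_REPEAT and END come from the message above.
OPENERS = {'GREEDY_REPEAT', 'LAZY_REPEAT', 'GROUP'}
END = 'END'

def check_nesting(ops):
    """Return the maximum nesting depth, or raise ValueError if misnested.

    Each op is a tuple whose first item is the opcode name.
    """
    stack = []
    deepest = 0
    for pos, op in enumerate(ops):
        name = op[0]
        if name in OPENERS:
            stack.append((name, pos))
            deepest = max(deepest, len(stack))
        elif name == END:
            if not stack:
                raise ValueError('END at position {} closes nothing'.format(pos))
            stack.pop()
    if stack:
        name, pos = stack[-1]
        raise ValueError('{} at position {} is never closed'.format(name, pos))
    return deepest

program = [('GREEDY_REPEAT', 1, 3), ('CHAR', 'a'), ('END',)]
print(check_nesting(program))
```

With offset-based bytecode the same check would have to follow and cross-validate every offset, which is the difficulty described above.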

On Tuesday, July 1, 2014 10:35 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I think CPython provides just about the right level of support here. The documentation, the APIs, and the helper tools for dealing with bytecode are all superb, and get better with each release. It's all more than sufficient to figure out what you're doing, and how to do it. It might be nice if there were an assembler in the stdlib, but the format is simple enough, and the documentation complete enough, that you can write one in a couple hours (as I did). And, honestly, I suspect a stdlib assembler wouldn't be updated fast enough—e.g., when support for Instruction objects was added to CPython's dis module in 3.4, I doubt an existing assembler would have been modified to take advantage of that, but a new one that you slap together can do so easily. Documenting that bytecode is only supported on CPython, and can change between CPython versions, isn't a problem for anyone who's just looking to experiment with and explore ideas, rather than write production code. As your examples show, you can usually even publish your explorations for others to experiment with, granting those limitations, and maintain them for years without much headache. (Bytecode has traditionally been much more conservative than what the documentation allows; it's generally only when your hacks rely on knowing exactly what bytecode will be generated for a given Python expression that they break. But even there, with a sufficient test suite, it's usually pretty simple to adapt.)
I'm led to understand that in the Java community, bytecode hacking is,
Here, it sounds like you _are_ suggesting that bytecode hacking may need to be used for production code, not just for exploration. But there are some pretty big differences between Java and Python that I think are relevant here:
* Java is designed for one specific VM, on which many other languages run; Python is designed to run on a variety of VMs, and nothing else runs on the CPython VM.
* Java is designed to be secure first, fast second, and flexible a distant third; Python is designed to be simple and transparent first, flexible and dynamic second, and everything else a distant third. So most of what you'd want to do (including solving problems like the one in the blog) can be done with simple monkey-patching and related techniques, and you can go a lot deeper than that without getting beyond the supported, portable reflection techniques.
* Java's VM is designed to be debuggable and optimizable; CPython's is designed to be the simplest thing that could support CPython. So, anything that's too hard to do with runtime structures is often easier at the VM level in Java, while the reverse is true in CPython.
* Java code is often distributed and always deployed as binary files; Python almost always as source. Besides being the cause of problems like the one in this article, it also means that if you have to go below the runtime level, you don't have the intermediate steps of source and AST hacking, you have no choice but to go to the bytecode.

On 30 Jun 2014 16:51, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
First, two quick side notes:
It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial.
Eugene Toder & Dave Malcolm have some interesting patches on the tracker to help enhance the compiler (in particular, Dave's allowed compiler plugins to be written in Python). Neither set of patches made it to being merge ready, though, and they'll be rather stale at this point. Cheers, Nick.

Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers). On Mon, Jun 30, 2014 at 7:15 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

(Replies to both Guido's top-post and Nick's reply-post below.) On Monday, June 30, 2014 7:19 PM, Guido van Rossum <guido@python.org> wrote:
Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers).
But CPython does expose bytecode via the dis module, parts of inspect, etc. For that matter, it exposes some of the compiler's workings (especially if you consider everything up to AST generation part of the compiler, since every step up to there is exposed, including doing the whole thing in one whack with PyCF_ONLY_AST). So, I don't see how exposing the AST-to-bytecode transformation part (or, while we're at it, the .pyc generation part) is any more unportable than what's already there. That being said, I can appreciate that it would almost certainly take a lot more work, and a lot riskier work to do that, so the same tradeoff could easily go the other way in this case. (Not to mention that the dis module and so on are already there, while the patches Nick was talking about, much less something brand new, are obviously not.)
Thanks! Are you referring to Dave Malcolm's patch adding a hook for an AST optimizer (in Python) right before compiling the AST to code (http://bugs.python.org/issue10399 and related)? If so, I don't think that would actually help here. Unless it's possible to say "BUILD_SET 0" in AST, but in that case, we don't need any new compiler hooks; just use an import hook the same way MacroPy does. (Doing it without import hooks would be a little nicer, but it's not essential.) The only patch I could find by Eugene Toder is one to reenable constant folding on -0, which I think was already committed in 3.3, and doesn't seem relevant anyway. Is there something else I should be searching for?

On , Andrew Barnert <abarnert@yahoo.com> wrote:
[snip]
I should have just tested it before saying anything:
So… it is possible to say "BUILD_SET 0" in AST. Which means the easy way to do this is to wrap an import hook around this:
    import ast
    class FixEmptySet(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id == '_EMPTY_SET_LITERAL':
                # ast.Set has only an 'elts' field (no ctx)
                return ast.copy_location(ast.Set(elts=[]), node)
            return node
    def ecompile(src, fname):
        src = src.replace('∅', '_EMPTY_SET_LITERAL')
        tree = compile(src, fname, 'exec', flags=ast.PyCF_ONLY_AST)
        tree = FixEmptySet().visit(tree)
        return compile(tree, fname, 'exec')
    code = ecompile('def f(): return ∅', '<>')
    exec(code)
    f()
That returns set(). And if you dis.dis(f), it's just BUILD_SET 0 and RETURN_VALUE.

On Tue, Jul 1, 2014 at 7:00 PM, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
src = src.replace('∅', '_EMPTY_SET_LITERAL')
Note that this suffers from a flaw that my POC script also suffers from: it replaces this character *anywhere*, rather than only when it's being used as a symbol on its own. Even inside a literal string. It might be necessary to replace it back the other way afterward, somehow, but I'm not sure if that would work. ChrisA

On Tuesday, July 1, 2014 2:08 AM, Chris Angelico <rosuav@gmail.com> wrote:
Yes, that's easy. Also, _EMPTY_SET_LITERAL_ itself could exist in your source (after all, it exists in my source fragment right above, right?), but that's easy too. See https://github.com/abarnert/emptyset for a slapped-together implementation that solves both those problems (except for bytes literals, but it explains how to do that). If it prints out "set() is the empty set ∅", then it worked; it successfully replaced the ∅ in your source with an empty set literal, and left the ∅ in your format string as ∅.

On 1 July 2014 01:27, Andrew Barnert <abarnert@yahoo.com> wrote:
Note that the dis module has a "CPython implementation detail" disclaimer, and the AST structure is deliberately exempted from the usual backwards compatibility guarantees. As far as hooking compilation goes, https://docs.python.org/3/library/importlib.html#importlib.abc.InspectLoader... was added in 3.4 specifically to make it easier to define custom loaders that make use of most of the existing import machinery (including bytecode cache files), but do something different for the source -> bytecode transformation step. Cheers, Nick.
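A minimal sketch of the loader-based route Nick mentions, assuming Python 3.4 or later and CPython. The names EmptySetLoader, load_pyu and PLACEHOLDER are hypothetical, chosen only for illustration. The blind str.replace keeps the flaw Chris pointed out (it also rewrites ∅ inside string literals), so this is not a finished tool:

```python
import ast
import importlib.machinery
import importlib.util

# Hypothetical placeholder name; it must not occur in the real source.
PLACEHOLDER = '_EMPTY_SET_LITERAL'


class _FixEmptySet(ast.NodeTransformer):
    """Replace every reference to PLACEHOLDER with an empty Set node."""

    def visit_Name(self, node):
        if node.id == PLACEHOLDER:
            return ast.copy_location(ast.Set(elts=[]), node)
        return node


class EmptySetLoader(importlib.machinery.SourceFileLoader):
    """Source loader that only changes the source -> code object step."""

    def source_to_code(self, data, path, *, _optimize=-1):
        # 'data' arrives as bytes; decode it the way the import system does.
        text = importlib.util.decode_source(data)
        # Naive textual substitution: also hits ∅ inside string literals.
        text = text.replace('∅', PLACEHOLDER)
        tree = _FixEmptySet().visit(ast.parse(text, filename=path))
        return compile(tree, path, 'exec',
                       dont_inherit=True, optimize=_optimize)


def load_pyu(name, path):
    """Load the file at 'path' as module 'name' using EmptySetLoader."""
    loader = EmptySetLoader(name, path)
    spec = importlib.util.spec_from_file_location(name, path, loader=loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module
```

Because only source_to_code is overridden, the rest of the import machinery (including the bytecode cache) is reused unchanged, which appears to be the point of the hook.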

On Tue, Jul 01, 2014 at 09:58:37AM -0700, Nick Coghlan wrote:
On 1 July 2014 01:27, Andrew Barnert <abarnert@yahoo.com> wrote:
Further to what Nick says, the *output* of dis is not expected to remain backwards compatible from version to version, only the dis API itself. There's a big difference between saying "we guarantee that the dis module will correctly and accurately disassemble valid bytecode", and saying "we guarantee that this specific chunk of bytecode will do these things". In order to use a hypothetical asm function, you need to know what pseudo-assembly to write, say `asm [SPAM, EGGS]`. That means that SPAM and EGGS must be stable and part of the language definition. (Or at least part of the CPython API.) That's a big step from the current situation. -- Steven

No. Jeez. :-( On Tue, Jun 10, 2014 at 9:25 AM, Ryan Gonzalez <rymg19@gmail.com> wrote:
+1 for using {,}.
-- --Guido van Rossum (python.org/~guido)

of course it’s ugly, but it’s also obvious that it had to be suggested, because it’s the only obvious idea. which leaves us with either a non-obvious idea or no empty set literal, which is a bit sad and inconsistent. if i’d develop a language from scratch, i’d possibly use the following empty literals:
    [] = list, () = tuple, {} = set
    [:] = ordered dict, (:) = named tuple, {:} = dict
but that ship has sailed. 2014-06-10 18:39 GMT+02:00 Guido van Rossum <guido@python.org>:
No. Jeez. :-(

Clint Hepner, 22.06.2014 14:29:
There's one more, even more obvious (IMO) option for an empty set literal: U+2205, EMPTY SET. But that opens a whole other can of worms, namely expanding the grammar to allow Unicode characters outside of identifiers.
... and then teaching people how to type them on their keyboards. Stefan

On Sun, Jun 22, 2014 at 10:48 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
At least the concept of "empty set literal" has more merit than "save a few keystrokes on the word 'lambda'". Even if it isn't something easily typed, it would have value over the current spelling of "set()", which isn't a literal. (Whether it has *enough* value over set() to be worth doing is still in question, but it's not like lambda vs λ.) ChrisA

On 22 June 2014 22:51, Chris Angelico <rosuav@gmail.com> wrote:
Yep, "status quo wins a stalemate" tends to be the winner on this particular topic. With a blank slate, the obvious choice is {} for the empty set and {:} for the empty dict, but Python doesn't have that option due to builtin sets arriving *long* after builtin dicts (for a very long time, sets weren't even in the standard library - folks just used dicts with the values all set to None). So, for those historical reasons, set() will likely persist indefinitely with its discontinuity in appearance between the "zero items" and "one or more predefined items" cases. Cheers, Nick, -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

i honestly don’t see the problem here. if people are too lazy to find an input method that works for them (Alt Gr, compose key, copy&paste), they should just continue to type ASCII, and leave the more elegant unicode variants for others. this violates TSBOOWTDI, but as there’s also dict() next to {}, this shouldn’t be a problem either. i like scala’s way to allow both <- and ←, as well as => and ⇒, and so on. ∅ and λ seem like good ideas to me as un-redefinable empty set literal and shorter/more elegant lambda. And “…” for “Ellipsis”. there’s also ∀, ¬, ×, ∧, ∨, ∩, ∪, ∈, ∉, ≠, ≡, ≤, and ≥, but i think those are a bit much:
    my_set = ∅
    my_set ∪= other_set
    my_set = map(λ e: e × 5, my_set ∩ third_set)
    ∀ spam ∈ my_set:
        if spam ≡ None ∨ spam ≤ 8:
            print(spam ∉ allowed_values, ¬ spam)
vs.
    my_set = ∅
    my_set |= other_set
    my_set = map(λ e: e * 5, my_set & third_set)
    for spam in my_set:
        if spam is None or spam <= 8:
            print(spam not in allowed_values, not spam)
2014-06-22 14:29 GMT+02:00 Clint Hepner <clint.hepner@gmail.com>:

Problem: For years, various people have suggested that they would like to use syntactically significant unicode symbols in Python code. A prime example is using U+2205, EMPTY SET, ∅, instead of 'set()'. On the other hand, the conservative, overwhelmed core development group is not much interested and would rather do other things. Solution: Act instead of ask. One or more of the people who really want this could get themselves together and produce a working system. (If multiple people, ask for a new sig and mailing list).
1. Ask core development to reserve '.pyu' for python with unicode symbols. (If refused, choose something else.)
2. Write pyu.py. It should first translate x.pyu to the equivalent x.py. If x.py exists, check the date (as with .py and .pyc). Optionally, but probably by default, run x.py. Translation requires two operations: masking comments and string literals from translation and translating the remainder. I personally would start by doing the two operations separately, with separately testable functions.
    def codechunk(unisymcode):
        '''Yield code_or_not, code_chunk pairs for code with unicode symbols.
        Chunks are comments or string literals (code_or_not == False), and
        code that might have unicode symbols that need translation
        (code_or_not == True).
        '''
        <Simplified parser, possibly derived from tokenize.tokenize(), which already knows how to recognize comments and strings.>
    unisym = <dict mapping unicode ordinals to ascii replacements>
    def unisym2ascii(unisymcode):
        blocklist = []
        for code, block in codechunk(unisymcode):
            if code:
                block = block.translate(unisym)
            blocklist.append(block)
        return ''.join(blocklist)
3. Upload pyu.py to PyPI, *along with instructions on the various ways to enter unicode symbols on various systems*. Announce and promote. On 6/22/2014 10:41 AM, Philipp A. wrote:
Being snarky can be fun, but if I wrote and distributed pyu.py, I would want as many users as possible.
I think the unisym dict should be inclusive and let people choose to use the symbols they want. I suspect I would use ≤ and ≥ sooner than λ. A mathematician who used most of those symbols, for a math audience, could still use the ascii translation for other audiences. On 6/22/2014 11:01 AM, MRAB wrote:
λ is a valid identifier in Python 3 because it's a letter.
Overall, I see this as less of a problem than the possibility of rebinding builtin names. The program could have a 'translate_lambda' (default True) parameter. But I would be willing to say that if you use unicode symbols, then you cannot also use λ as an identifier. (If one did, the resulting .py would stop with SyntaxError where 'lambda' replaced identifier λ.) -- Terry Jan Reedy
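A rough, runnable sketch of the two-step translation Terry outlines (mask comments and string literals, then translate the rest). It is only an illustration under some assumptions: the masking uses a simple regular expression instead of a tokenize-derived parser, the names codechunks, UNISYM and _MASKED are hypothetical, and the symbol table is a small arbitrary sample. Note that ∅ becomes the ordinary call set(), not a true literal:

```python
import re

# Comments and string literals, which must be left untouched.
# A simplified stand-in for a real tokenizer; string prefixes such as r or b
# stay in the neighbouring code chunk, which is harmless because they are ASCII.
_MASKED = re.compile(
    r'#[^\n]*'                     # comment up to end of line
    r'|"""(?:\\.|[^\\])*?"""'      # triple double-quoted string
    r"|'''(?:\\.|[^\\])*?'''"      # triple single-quoted string
    r'|"(?:\\.|[^"\\\n])*"'        # double-quoted string
    r"|'(?:\\.|[^'\\\n])*'",       # single-quoted string
    re.DOTALL,
)

# Sample mapping from unicode ordinals to ASCII replacements (an assumption).
UNISYM = {
    ord('∅'): 'set()',
    ord('≤'): '<=',
    ord('≥'): '>=',
    ord('≠'): '!=',
    ord('×'): '*',
}


def codechunks(source):
    """Yield (is_code, chunk) pairs covering the whole of 'source'."""
    pos = 0
    for match in _MASKED.finditer(source):
        if match.start() > pos:
            yield True, source[pos:match.start()]
        yield False, match.group()
        pos = match.end()
    if pos < len(source):
        yield True, source[pos:]


def unisym2ascii(source):
    """Translate unicode symbols in code, leaving comments and strings alone."""
    return ''.join(
        chunk.translate(UNISYM) if is_code else chunk
        for is_code, chunk in codechunks(source)
    )
```

For example, unisym2ascii('s = ∅  # ∅ stays\n') rewrites the first ∅ and leaves the one in the comment alone.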

On 23 June 2014 10:30, Guido van Rossum <guido@python.org> wrote:
Hm. What's wrong with rejecting bad ideas?
While I agree it's a bad idea to use symbols that can't be readily typed as part of the language syntax, I think Terry's broader point that anything which *can* be implemented outside the core usually *should* be implemented outside the core (at least as a proof-of-concept) is a good one. Hy shows it is possible to implement a Lisp on top of the CPython runtime, so folks should certainly be capable of implementing a Python-with-Unicode-symbols on top of existing Python runtimes without needing the blessing of the core development team. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 22, 2014 at 8:23 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
This particular proposal sounds to me like something that shouldn't be implemented at all. We don't need another split in the community over how to spell operators.
Hy shows it is possible to implement a Lisp on top of the CPython runtime,
It wasn't proposed as a serious feature on python-ideas.
Terry *is* asking for a blessing of the .pyu extension by the core team. (Although it seems he wouldn't be too upset if he didn't get it. :-) -- --Guido van Rossum (python.org/~guido)

On Sun, Jun 22, 2014 at 09:00:20PM -0700, Guido van Rossum wrote:
I think you're exaggerating the danger here a tad. Split the community? We can barely get the community to grudgingly accept that maybe there's a use for Unicode *at all*, let alone use it as syntax :-) -- Steven

On Mon, Jun 23, 2014 at 5:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
AFAIU this *really* looks like a bad idea. I don't even understand why anyone would want to do such a thing. -- Giampaolo - http://grodola.blogspot.com

On 6/22/2014 8:30 PM, Guido van Rossum wrote:
Hm. What's wrong with rejecting bad ideas?
[I am not sure whether you are asking seriously or rhetorically, but I think this question is worth a serious response.] Aside from the fact that different people have different ideas of what is an absolutely bad idea, nothing. I personally reject almost all new syntax ideas because I think most of them are local small-audience optimizations that would overall make Python worse. However, the purpose of python-ideas is "Discussions of speculative Python language ideas". 'Discussion' means not routinely trying to stop discussion. Indeed, some good can come from discussion of ideas I (or you) think are bad. Rejection has multiple forms, not mutually exclusive:
    Inaction: by default, an idea is effectively rejected until a patch is committed.
    Education: explaining how one can already accomplish the desired task.
    Deflection: suggest implementing the idea somewhere other than in core Python.
    Explanation: explain why something is bad.
    Downvote or BDFL rejection: (self-explanatory)
Specifically, I believe people have asked that Python parsers accept and translate unicode symbols *in .py files*. This would have the immediate effect of making some .py files invisibly and unnecessarily incompatible with all existing Python interpreters, even if the translated code would run just fine. I, too, do not want the meaning of '.py' fragmented further than it already is.
In other words, the idea of changing Python itself has been and will be rejected by inaction for at least the next few years, and until circumstances change after that. (Hence, no need for *me* to 'reject' it.)
Solution: Act instead of ask.
'Stop asking' is not only rejection of the idea of changing Python, but also of continuing the discussion that has gone on for years. People who do not want to give up the idea should do something else. In the course of suggesting an implementation, I also suggested some aspects of an implementation that I consider important.
Discuss it elsewhere because python-ideas is not 3rd-party-package-dev.
1. Ask core development to reserve '.pyu' for python with unicode symbols. (If refused, choose something else.)
In other words, 1. do not use .py for unisym_python. 2. while .pyu seems like an obvious alternative (to me), recognize python-dev's moral rights to .pyx, regardless of legalities.
2. Write pyu.py. It should first translate x.pyu to the equivalent x.py. ... run x.py.
To be clear, I meant write x.py to disk, where it would be available for humans to read. This is specifically aimed at the issue of 'fragmenting the community'.
Again, I would want the standard .py file available. In my first post to clp/python list over 17 years ago, I dubbed Python 'executable pseudocode' and opined that it should be used to communicate algorithms in preference to non-executable notation. I would rather a mathematician use symbols embedded in Python, with a link to a .py file, than the same symbols in a non-executable *and non-testable* notation. -- Terry Jan Reedy

Sorry Terry, I was short (and ended up being cryptic) because I was on a mobile device. I meant "this is a bad idea and should be rejected", and in addition I also meant to discourage a 3rd party implementation of the idea. I also wanted to object against your claim that this idea has only been left unimplemented because of disinterest or inaction by the core dev team; to the contrary, the general sentiment is pretty clear that it's a bad idea. There are other ideas that are not suitable for adding to the language but where we would encourage folks to help themselves by writing a module or extension and posting it on PyPI, or even ideas where it would eventually be a good idea to include such a package into the stdlib. But this is not one of them. On Mon, Jun 23, 2014 at 12:11 PM, Terry Reedy <tjreedy@udel.edu> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Jun 22, 2014, at 16:18, Terry Reedy wrote:
2. Write pyu.py. It should first translate x.pyu to the equivalent x.py.
What is the equivalent x.py for "BUILD_SET 0" rather than "LOAD_GLOBAL (set), CALL_FUNCTION 0"?
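To make the distinction behind this question concrete, here is a small sketch using the dis module. The function names are hypothetical, and since opcode names are a CPython implementation detail that changes between versions, it only looks for the presence of LOAD_GLOBAL and BUILD_SET and does not assume an exact listing:

```python
import dis


def via_name():
    # Looks up the global/builtin name 'set' at run time, then calls it.
    return set()


def via_display():
    # A set display: built directly by the interpreter, no name lookup.
    return {1}


def opnames(func):
    """Return the list of opcode names in func's bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)]


def call_with_shadowed_set():
    """Evaluate both spellings in a namespace where 'set' is rebound."""
    namespace = {'set': list}
    exec('result = (set(), {1})', namespace)
    return namespace['result']
```

In the shadowed namespace, set() goes through the rebound name and produces a list, while the display still builds a real set. That difference is what a translation to plain set() cannot preserve.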

On Sat, Jun 28, 2014 at 3:00 PM, <random832@fastmail.us> wrote:
Is there any reason that it has to be normal-looking source code?
    def empty_set_literal():  # line 123 of somefile.pyu
        print("I'm an empty set!", ∅)
    # becomes
    empty_set_literal = type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm an empty set!",{}),('print',),(),"somefile.pyu","empty_set_literal",123,b"\x00\x01"),globals(),"empty_set_literal")
I got most of the args for the code() constructor by disassembling the function, using a one-element set, and then manually edited the code afterward. It does appear to work:
Given that the purpose of this is to make something executable, not something readable (in contrast to, say, 2to3), I don't think it would be a problem to have nightmare-level code in there occasionally. That said, I'm not particularly in favour of the proposal - I just felt like answering this part of it :) ChrisA

On Sat, Jun 28, 2014 at 3:28 PM, Chris Angelico <rosuav@gmail.com> wrote:
Here's a POC translator. Give it a string with the source code for one function, and it'll give back a string that'll generate a similar function. Currently assumes it's working at top level - doesn't handle nested functions, methods, etc, etc. But it seems to work. https://github.com/Rosuav/shed/blob/master/empty_set.py ChrisA

On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote:
empty_set_literal = type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm
If you're embedding the entire compiler (in fact, a modified one) in your tool, why not just output a .pyc?

On Tue, Jul 1, 2014 at 3:18 AM, <random832@fastmail.us> wrote:
I'm not, I'm calling on the normal compiler. Also, I'm not familiar with the pyc format, nor with any of the potential pit-falls of that approach. But if someone wants to make an "alternative front end that makes a .pyc file" kind of thing, they're most welcome to. ChrisA

First, two quick side notes: It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. I doubt either of those would be useful often enough that anyone wants to put in the work. But then I doubt the empty-set literal would be either, so anyone who seriously wants to work on this might want to work on the inline assembly and/or hookable compiler first. Anyway: On Monday, June 30, 2014 3:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Jul 1, 2014 at 3:18 AM, <random832@fastmail.us> wrote:
I think it makes more sense to use types.FunctionType and types.CodeType here than to generate two extra functions for each function, even if that means you have to put an import types at the top of every munged source file.
The tricky bit with making a .pyc file is generating the header information—last I checked (quite a while ago, and not that deeply…) that wasn't documented, and there were no helpers exported to Python. But I think what he was suggesting is something like this: Let py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions. Besides being a lot less work, his version works for ∅ at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something). But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful. You obviously need the function to compile, so you have to replace the ∅ with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled. The only thing I can think of off the top of my head is to replace it with whichever of [], (), or {} doesn't appear anywhere in the code being compiled, then you can search-replace BUILD_LIST/TUPLE/MAP 0 with BUILD_SET 0. But what if all three appear in the code? Raise a SyntaxError('Cannot use all 4 kinds of empty literals in the same scope')? 
One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function. So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the current frame, and I don't think there's a way to do that from Python. So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function. And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore. And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing.

On Mon, Jun 30, 2014 at 04:48:14PM -0700, Andrew Barnert wrote:
Again, to be absolutely clear here, I hate this idea. `set()` is perfectly clear. Sorry. Had to be said before any of this. Right, so, this was brought up before, but with Hylang (https://github.com/hylang/hy), we abuse the PEP 302 new import hooks to search sys.path for .hy files rather than .py files. You could do the same for your .pyu files (again, *without* the blessing of the core team, as this is insane), and do the mangling before passing it to the normal internals to turn it into bytecode / AST. Doing it this way means you won't have to futz with the compiler, and you can remain happy. And we like being happy. More info: https://github.com/hylang/hy/blob/master/hy/importer.py http://slides.pault.ag/hy.html#/15 https://www.youtube.com/watch?v=AmMaN1AokTI https://www.youtube.com/watch?v=ulekCWvDFVI http://legacy.python.org/dev/peps/pep-0302/ Again, this approach can be a bit flaky, and this particular issue might very well cause problems for us as a community - seeing as how the syntax is almost exactly identical. Hylang (for what it's worth) is just a nice way for us Lisp nerds to stop complaining as much. Godspeed, Paul -- #define sizeof(x) rand() </paul> :wq

On Monday, June 30, 2014 5:18 PM, Paul Tagliamonte <paultag@gmail.com> wrote: [snip]
The reason for needing to futz with the compiler is to generate source code that actually compiles to the bytecode to build an empty set directly, instead of the bytecode to load and call the "set" global. I agree with both you and Guido that the whole thing is silly, and set() is fine. I also agree with your implied assumption that, even if you needed an empty set literal, having it compile to the same thing as set() would be fine. But those who disagree with both, and really want an empty set literal that compiles to "BUILD_SET 0", cannot have it without futzing with the compiler. So, I'd encourage them to work on making the compiler more futzable (which surely more people would have a use for than the number of people for whom set() is intolerably slow, or unusable because they want to redefine the global).
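The difference being discussed can be inspected with the dis module. This is only an illustrative sketch: the function names are made up, and the exact opcode names vary between CPython versions, so the checks below only look for the presence or absence of BUILD_SET and LOAD_GLOBAL.

```python
import dis


def opnames(func):
    """Return the list of opcode names CPython compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


def make_empty():
    # Looks up the global name "set" at run time and calls it.
    return set()


def make_display(x):
    # A set display is built directly by the BUILD_SET opcode.
    return {x}


empty_ops = opnames(make_empty)
display_ops = opnames(make_display)

print("set():", empty_ops)
print("{x}:  ", display_ops)
```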

On Tue, Jul 1, 2014 at 9:48 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
First, two quick side notes:
It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial.
That would be interesting, but it raises the possibility of mucking up the stack. (Imagine if you put BUILD_SET 1 in there instead. What's it going to make a set of? What's going to happen to the rest of the stack? Do you REALLY want to debug that?)

Back when I did a lot of C and C++ programming, I used to make good use of a "drop to assembly" feature. There were two broad areas where I'd use it: either to access a CPU feature that the compiler and library didn't offer me (like CPUID, in its early days), or to hand-optimize some code. Then compilers got better and better, and the first set of cases got replaced with library functions... and the second lot ended up being no better than the compiler's output, and potentially a lot worse - particularly because they're non-portable.

Allowing a "drop to bytecode" in CPython would have the exact same effects, I think. Some people would use it to create an empty set, others would use it to replace variable swapping with a marginally faster and *almost* identical stack-based swap:

    x,y = y,x

    LOAD_GLOBAL y
    LOAD_GLOBAL x
    ROT_TWO
    STORE_GLOBAL x
    STORE_GLOBAL y

becomes

    LOAD_GLOBAL x
    LOAD_GLOBAL y
    STORE_GLOBAL x
    STORE_GLOBAL y

Seems fine, right? But it's a subtle change to semantics (evaluation order), and not much benefit anyway. Plus, if it's decided that this semantic change is safe (if it's provably not going to have any significance), a future version of CPython would be able to make the exact same optimization, while leaving the code readable, and portable to other Python implementations.

So while an inline bytecode assembler might have some uses, I suspect it'd be an attractive nuisance more than anything else.
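To make the evaluation-order point concrete, here is a small sketch showing that the order of name loads and stores in a swap is observable from Python. The LoggingNamespace class is a made-up helper; it relies on exec() accepting an arbitrary mapping for local names.

```python
class LoggingNamespace(dict):
    """Hypothetical mapping that records every name load and store."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.log = []

    def __getitem__(self, key):
        self.log.append(("load", key))
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        self.log.append(("store", key))
        super().__setitem__(key, value)


namespace = LoggingNamespace(x=1, y=2)
# Run the swap with our mapping as the local namespace.
exec("x, y = y, x", {}, namespace)

for event in namespace.log:
    print(event)
```

The right-hand side is loaded left to right (y, then x) before anything is stored, so a hand-written variant that loads x first is a visible change to any code that can observe name lookups.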
Sure. This is just a proof-of-concept anyway, and it's not meant to be good code. Either way works, I just tried to minimize name usage (and potential name collisions).
But I think what he was suggesting is something like this: Let py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions.
Besides being a lot less work, his version works for ∅ at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something).
Sure. But all I was doing was responding to the implied statement that it's not possible to write a .py file that makes a function with BUILD_SET 0 in it. Translating a .pyu directly into a .pyc is still possible, but was not the proposal.
But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful. You obviously need the function to compile, so you have to replace the ∅ with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled.
What I did was put in a literal string. https://github.com/Rosuav/shed/blob/master/empty_set.py It uses "∅ is set()" as a marker, depending on that string not existing in the source. (I could compile the function twice, once with that string, and then a second time with another string; the first compilation would show what consts it uses, and the program could then generate an arbitrary constant which doesn't exist.) The opcode is the right length (assuming it doesn't go for EXTENDED_ARG, which I've never heard of; it seems to be necessary if you have more than 64K consts/globals/locals in a function???), and the resulting function has an unnecessary const in it. It wouldn't be hard to drop it (the code already parses through everything; it could just go "if it's LOAD_CONST, three options - if it's the marker, switch in a BUILD_SET, if it's less than the marker, no change, if it's more than the marker, decrement"), but it doesn't seem to be a problem to have an extra const in there.
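The marker trick can be sketched in a few lines. This is not the linked script: it assumes CPython 3.8 or later (two-byte wordcode and code.replace(), rather than the 3-byte opcodes of 2014), the names patch_empty_set and template are made up, and it leaves the unused marker constant behind, as described above.

```python
import dis
import types

MARKER = "∅ is set()"  # marker constant, assumed absent from real code


def patch_empty_set(func):
    """Return a copy of func whose LOAD_CONST MARKER becomes BUILD_SET 0."""
    code = func.__code__
    raw = bytearray(code.co_code)
    for ins in dis.get_instructions(code):
        if ins.opname == "LOAD_CONST" and ins.argval == MARKER:
            # Wordcode: one opcode byte followed by one argument byte,
            # so the replacement has the same length and no jump moves.
            raw[ins.offset] = dis.opmap["BUILD_SET"]
            raw[ins.offset + 1] = 0
    new_code = code.replace(co_code=bytes(raw))
    return types.FunctionType(new_code, func.__globals__, func.__name__,
                              func.__defaults__, func.__closure__)


def template():
    # Unpatched, this would fail: str has no .add() method.
    items = "∅ is set()"
    items.add(1)
    return items


patched = patch_empty_set(template)
print(patched())
```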
One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function.
It handles arguments and stuff. All the attributes of the original function object get passed through unchanged to the resulting function, with the exception of the bytecode, obviously.
So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the current frame, and I don't think there's a way to do that from Python. So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function. And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore.
Ah, that part I've no idea about. But it wouldn't be impossible for someone to develop that a bit further.
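As a partial data point on the closure question: the function constructor does accept a tuple of cell objects, and an existing function's cells can be reused. The sketch below assumes CPython 3.8 or later for types.CellType, and the function names are made up. It does not address the harder part raised above, namely getting the outer function's compiled code to keep its cellvars.

```python
import types


def make_counter(start):
    def counter(step=1):
        return start + step  # "start" is a free variable held in a cell
    return counter


original = make_counter(10)

# Rebuild the inner function, passing its existing closure cells through.
rebuilt = types.FunctionType(original.__code__, original.__globals__,
                             original.__name__, original.__defaults__,
                             original.__closure__)

# Build a brand-new cell and bind the same code object to it instead.
fresh_cell = types.CellType(100)
rebound = types.FunctionType(original.__code__, original.__globals__,
                             "rebound", original.__defaults__,
                             (fresh_cell,))

print(rebuilt(), rebound(5))
```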
And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing.
Right. I absolutely agree with your conclusion (not worth doing), and always have had that view. This is proof that it's kinda possible, but still a bad idea. Now, if someone comes up with a really compelling use-case for an empty set literal, then maybe it'd be more important; but if that happens, CPython will probably grow an empty set literal in ASCII somehow, and then the .pyu translation can just turn ∅ into that. ChrisA

Before I get to the reply, because I couldn't find a 3.x-compatible bytecode assembler, I slapped one together at https://github.com/abarnert/cpyasm. I think it would be reasonably possible to use this to add inline assembly to a preprocessor, but I haven't tried, because I don't have a preprocessor I actually want, and this was the fun part. :)
On Monday, June 30, 2014 5:39 PM, Chris Angelico <rosuav@gmail.com> wrote:
The same thing that happens if you use bad inline assembly in C, or a bad C extension module in Python—bad things that you can't debug at source level. And yet, inline assembly in C and C extension modules in Python are still quite useful. Of course the difference is that you can drop from the source level to the machine level pretty easily in gdb, lldb, Microsoft's debugger, etc., while you can't as easily drop from the source level to the bytecode level in pdb. (I'm not sure that wouldn't be an interesting feature to add in itself, but that's getting even farther off topic, so forget it for now.)
Back when I did a lot of C and C++ programming, I used to make good
I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions—which are compiled with the same compiler you use for your code—have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it. There are cases where that isn't true. For example, most modern compilers that care about x86 have extended the C language in some way to make it unnecessary for you to write LOCK CMPXCHG all over the place if you want to do lockfree refcounting (and, even better, they've done so in a way that also does the right thing on ARM 9 or SPARC or whatever else you care about). Or, in some cases, they've done something halfway in between, adding "compiler intrinsic functions" that look like functions, but are compiled directly into inline asm. But either way, that didn't happen until a lot of people were publishing code that used that inline assembly. Otherwise, the compiler vendors have no reason to believe it's necessary to add a new feature. Plus, people still needed to keep distributing code that uses the inline asm for years, until the oldest compiler and library on every platform they support incorporated the change they needed. And, just as you say, I think it would have the exact same effects in CPython. If we added inline bytecode asm to 3.5, and there were actually something useful to do with it, people would start doing it, and that's how we'd know that something useful was worth adding to the language, and when we added that something useful in 3.7, eventually people could start using that, and then it would be years before all of the projects that need that feature either die or require 3.7. 
But that's not a problem; that's inline asm working exactly as it should. There is one good reason to reject the inline asm idea: If it's unlikely that there will be anything worth using it for (or if it might plausibly be useful, but not enough so that anyone's worth doing the work). Which I think is at least plausible, and maybe likely.
Do you really think anyone would do the latter? Seriously, what kind of code can you imagine that's too slow in CPython, not worth rewriting in C or running in PyPy or whatever, but fast enough with the rot opcode removed? And if someone really _did_ need that, I doubt they'd care much that Python 3.8 makes it unnecessary; they obviously have a specific deployment platform that has to work and that needed that last 3% speedup under 3.6.2, and they're going to need that to keep working for years. The former, maybe. Not just to allow ∅, but maybe someone would want to write a Unicode-math-ified Python dialect as an import-hook preprocessor that used inline asm among other tools. In which case… so what? That's not going to be something that people just randomly drop into their code, there will be a single project with however many users, which will be no worse for the Python community than Hylang. If their demonstration is just so cool that everyone decides we need Unicode symbols in Python core, then great. If not, and they still want to keep using it, well, a simpler preprocessor will be easier for the rest of us to understand than a ridiculously complicated one that does bytecode hackery, or than a hacked-up CPython compiler.
So while an inline bytecode assembler might have some uses, I suspect it'd be an attractive nuisance more than anything else.
I honestly don't see it becoming an attractive nuisance. I can easily see it just not getting used for anything at all, beyond people playing with the interpreter. And now, on to your other replies:
On Monday, June 30, 2014 3:12 PM, Chris Angelico <rosuav@gmail.com>
Agreed, I just think it's an _easier_ proposal than yours, not a harder one (assuming you want to actually build the real thing, not just a proof of concept), which I think is why Random suggested it. Also, again, I don't think a real project that allowed ∅ in a def but not in a lambda, class, or top-level code would be acceptable to anyone, and I don't see how your solution can be easily adapted to those cases (well, except lambda). [snip, and everything below here condensed]
I assumed that leaving the unnecessary const behind was unacceptable. After all, we're talking about (hypothetical?) people who find the cost of LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable… But you're right that fixing up all the other LOAD_CONST bytecodes' args is a feasible way to solve that.
So, if the function is a closure, how do you do that?
Ah, that part I've no idea about. But it wouldn't be impossible for someone to develop that a bit further.
Not impossible, but very hard, much harder than what you've done so far. Ultimately, I think that just backs up your larger point: This is doable, but it's going to be a lot of work, and the benefit isn't even nearly worth the cost. My point is that there are other ways to do it that would be less work and/or that would have more side benefits… but the benefit still isn't even nearly worth the cost, so who cares? :)

On Tue, Jul 1, 2014 at 6:04 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Right, useful but it adds another set of problems. (Just out of curiosity, what protection _is_ there for a smashed stack? I just tried fiddling with it and didn't manage to crash stuff.)
I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions—which are compiled with the same compiler you use for your code—have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it.
Or those library functions are written in assembly language directly. It's entirely possible to write something that uses CPUID and doesn't use inline assembly in a C source file. The equivalent here, I suppose, would be hand-rolling a .pyc file.
Hang on, you're asking two different questions there. I'll split it out:

1) Do I really think anyone *should* do this? Your subsequent comments support this question, and the answer is resoundingly NO. CPython is not the sort of platform on which that kind of thing is ever worth doing. You'll get far more performance by using Cython for parts, or in some other way improving your code, than you will by hand-tweaking the Python bytecode.

2) Do I think anyone would, if given the ability to tweak the bytecode, go "Ah ha!" and proudly improve on what the compiler has done, and then brag about the performance improvement? Definitely. Someone will. It'll make some marginal difference to a microbenchmark, and if you don't believe that would cause people to warp their code into utter unreadability, you clearly don't hang out on python-list enough :)
The "attractive nuisance" part is with microbenchmarking. Code won't materially improve, and it'll be markedly worse in readability/maintainability and portability (although the latter probably doesn't matter all that much; a lot of people's code will be suboptimal on Pythons other than CPython, if only for lack of 'with' statements around files and such), with the addition of such a feature.
I'm not sure whether the problem is the cost of LOAD_GLOBAL followed by CALL_FUNCTION (and, by the way, one unnecessary constant in the function won't have anything like that cost - a bit of wasted RAM, but not a function call), or the fact that such a style is vulnerable to shadowing of the name 'set', which admittedly is a very useful name. But in any case, it's quite solvable.
Yep. Maybe someone (great, that probably means me) should write this up into a PEP for immediate rejection or withdrawal, just to be a document to point to - if you want an empty set literal, answer these objections. ChrisA

On Tuesday, July 1, 2014 1:39 AM, Chris Angelico <rosuav@gmail.com> wrote:
I believe there are cases where the interpreter can detect that you've gone below 0 and raise an exception, but in general there's no protection, or at least nothing you can count on. For example, assemble this code as a complete function:

    CALL_FUNCTION 1
    RETURN_VALUE

In 3.4.1, on my Mac, I get a bus error. But, even when you don't manage to crash the interpreter, when you just confuse it at the bytecode level, there's still no way to debug that except by dropping to gdb/lldb/etc.
Yeah, that's entirely possible, but that's not how the linux device driver or the FreeBSD libc wrapper do it; they use inline assembly. Why? Well, for one thing, you get the function prolog and epilog code appropriate for your compiler automatically, instead of having to write it yourself. Also, you can do nice things like cast the result to a struct that you defined in C (which could be done with, e.g., a C macro wrapping the assembly source, but that's just making things more complicated for no benefit). And you don't need to know how to configure and run an assembler alongside the C compiler to build the device. And so on. Basically, the C versions of the exact same reasons you wouldn't want to hand-roll a .pyc file in Python…
Using ctypes to load Python.so to swap the pointers under the covers is already significantly faster, and would still be significantly faster than your optimized bytecode, and yes, people have suggested it on at least two StackOverflow questions. For that matter, you can already do exactly your optimization with a relatively simple bytecode hack, which would look a lot worse than the inline asm and have the same effect. Also, that bytecode hack could be factored out into a function, without any performance cost except a constant cost at .pyc time, while the inline asm obviously can't, another reason the inline asm (which would have to be written inline, and edited to fit the variables in question, each time) would be less of an attractive nuisance than what's already there. Sure, there may be a few people who are looking for horrible micro-optimizations like this, would know enough to figure out how to do this with inline asm, would not know how to do it with bytecode hacks, would not know any of the better (as in much worse, to anyone but them) alternatives, etc., but I think that number is vanishingly small.
What I did was put in a literal string…
I realize the cost of an extra LOAD_GLOBAL is much smaller than an extra CALL_FUNCTION, it's just that I think in 99.9999% of real cases neither will make a difference, and anyone who's objecting to the latter on principle will probably also object to the former on principle…
I think Terry Reedy actually had a better answer: just tell people to implement it, polish it up, put it on PyPI, and come back to us when they're ready to show off their tons of users who can't live without it. Random objected that wasn't possible, in which case Terry's idea is more of a dismissal than a helpful suggestion, but I think https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy.

On 7/1/2014 6:51 AM, Andrew Barnert wrote:
'Random' said something quite different. He only noted that if '∅' were translated to 'set()', then the resulting CPython-specific bytecode would continue to be "LOAD_GLOBAL (set), CALL_FUNCTION 0" rather than the 'optimized' "BUILD_SET 0". He also noted (objected) that there is no Python code that CPython currently compiles as "BUILD_SET 0". Well, it's unfortunate that {} is not available. If it were, there would be no issue, to me anyway, of using '∅'. However, optimizing CPython bytecode, and compiler hooks, are completely different issues from translating unisym Python to standard Python that could run on any implementation of Python. If we thought the bytecode difference was important (which most do not), we could have a peephole optimizer to 'fix' it, completely independently of the existence of '∅' or any idea of using it in Python code.
in which case Terry's idea is more of a dismissal than a helpful suggestion,
My post was a dismissal of the idea of changing python itself *and* a suggestion of how to proceed without involving pydev.
https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy.
I consider producing (or at least being able to produce) a standard .py file that can be published outside the specialized group run on and debugged on standard interpreters to be essential to any sensible idea for augmented Python code (whether with unicode symbols or anything else, such as native-language keywords). However, as I said before, off topic here for unicode symbols, though not on python-list. -- Terry Jan Reedy

Somewhere in this thread, someone mentioned https://github.com/ehamberg/vim-cute-python (and something similar for emacs, but I'm a vim user). I'm not sure if this mention was a joke or not, but I thought it looked cool and started using it. I can't decide if I actually find it useful or distracting, but in truth it seems to answer the *entire* concern of anyone wanting to see an empty-set symbol (but not to save one bytecode instruction, I admit), and also various other math symbols that name concepts spelled in ASCII in Python. While some hypothetical .pyu translation tool or import hook could do the same thing, this really *does* seem like something to just do at the editor level since there's nothing *semantic* about the new symbols, just a way for them to look. On Tue, Jul 1, 2014 at 1:04 PM, Terry Reedy <tjreedy@udel.edu> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Tuesday, July 1, 2014 1:05 PM, Terry Reedy <tjreedy@udel.edu> wrote:
You're reading a lot into a 2-line message, but your take is that he interpreted the problem as needing to compile "BUILD_SET 0", and pointed out that there is no way to do that with a source preprocessor. You can insist that they're two separate problems to be solved (or, maybe, not solved), and I think you're right. You just have to make that point—as you, I, and half a dozen others have done since his original post. But meanwhile, Chris Angelico offered a solution to the problem that answers his complaint, and I offered another solution that doesn't even require bytecode hacking. That shows that even if you accept the objection, it still doesn't block anyone.
First, as others have pointed out, it's not just, or even primarily, an optimization, it's also a semantic difference.
If we thought the bytecode difference was
But you can't make semantic changes in a peephole optimizer. You'd have to first change the language to document that set() may (or may not!) return an empty set even if the name set resolves to something different. This isn't entirely unique in Python history (e.g., back when you could redefine False through various kinds of trickery, the compiler was still allowed to optimize out if False: code), but it's very unusual. And nobody's going to do that for a minor optimization (if False:, besides being a potentially huge optimization, also _fixes_ a semantic problem, rather than causing one, since False was supposed to be un-redefinable, but wasn't because of various holes).
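The semantic difference is easy to demonstrate. In this sketch the function names are made up, and the source is run through exec() with a private namespace so that rebinding set does not touch the real builtin.

```python
source = (
    "def via_name():\n"
    "    return set()\n"
    "\n"
    "def via_display(x):\n"
    "    return {x}\n"
)

# In this namespace the global name "set" has been rebound to list.
namespace = {"set": list}
exec(source, namespace)

from_name = namespace["via_name"]()         # calls whatever "set" names now
from_display = namespace["via_display"](1)  # always builds a real set

print(type(from_name).__name__, type(from_display).__name__)
```

An optimizer that silently turned set() into BUILD_SET 0 would change what via_name returns here, which is why it cannot be treated as a pure optimization.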
My point is that _if_ you take Random's objection as being critical, _then_ your post dismisses the idea, even though it wasn't intended to. You can follow up in two ways: challenge his objection, or answer his objection; there were replies doing both, and if either of the two succeeds, the idea is still alive for people to take further if they want.
My approach is made up of nothing but standard .py files. Those files can be published outside a specialized group, and run and debugged on CPython 3.4+. They can also be edited by people outside that specialized group, without needing a specialized build process involving a preprocessor, just a standard Python module that they already have. Sure, it only works on CPython, but Python 3.4, scipy, etc. also currently only work on CPython, and that doesn't prevent a large community of users from making use of them, publishing code outside a specialized group, and, most importantly for the topic at hand, coming up with suggestions that are germane to Python as a whole and taken seriously. For example, nobody suggested that PEP 465 wasn't a sensible idea because all of the sample code presented only runs on CPython; the idea itself is clearly portable, the community using such code is gigantic and mature, and that's all that matters. Finally, I don't think anyone actually needs this feature, but I was able to whip up a proof of concept in an hour that provides it. Anyone who seriously wants to pursue it doesn't have to use my approach, much less my code; it still serves as an existence proof that what they want to do can be done, meaning they should go do it.

On Tue, Jul 01, 2014 at 06:38:37PM +1000, Chris Angelico wrote: [...]
1) Do I really think anyone *should* do this? Your subsequent comments support this question, and the answer is resoundingly NO. CPython is
"This" being trying to micro-optimize code by bytecode-hacking.
I think that micro-optimization is probably the wrong reason to hack bytecodes. What I'm more interested in is exploring potential new features, or adding functionality, for example:

Adding the ability to trace individual expressions, not just lines: http://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.htm...

Exploring dynamic scoping: http://www.voidspace.org.uk/python/articles/code_blocks.shtml

A proposal from Python 2.3 days for a brand-new decorator syntax: http://code.activestate.com/recipes/286147

A (serious!) defence of GOTO in Python: http://www.dr-josiah.com/2012/04/python-bytecode-hacks-gotos-revisited.html (although even Josiah doesn't suggest using COMEFROM :-)

I don't know that such bytecode manipulations should be provided in the standard library, and certainly not as a built-in "asm" command. But, I think that we ought to acknowledge that bytecode hacking has a role to play in the wider Python ecosystem. I'm led to understand that in the Java community, bytecode hacking is, perhaps not common, but accepted as something that power users do when all else fails: https://weblogs.java.net/blog/simonis/archive/2009/02/we_need_a_dirty.html

[Aside: does Python do any sort of verification of the bytecode before executing it, as Java does?]

-- Steven

On 1 July 2014 10:33, Steven D'Aprano <steve@pearwood.info> wrote:
https://pypi.python.org/pypi/withhacks and https://pypi.python.org/pypi/byteplay may also be of interest to anyone wishing to seriously tinker with what the CPython VM (as opposed to Python-the-language) already supports. I also highly advise working with Python 3.4, since we made some substantial improvements to the dis module API (adding the yield from tests for 3.3 highlighted how limited the previous API was for testing purposes, so we fixed it in a way that made bytecode easier to work with in general). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
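As a quick illustration of the 3.4 dis API mentioned here, dis.Bytecode yields Instruction objects that can be inspected programmatically instead of parsing printed output. The sample function is made up, and the exact opcodes listed depend on the CPython version.

```python
import dis


def add(a, b):
    return a + b


# Each Instruction carries its offset, opcode name, and a readable argument.
rows = [(ins.offset, ins.opname, ins.argrepr) for ins in dis.Bytecode(add)]

for offset, opname, argrepr in rows:
    print(offset, opname, argrepr)
```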

On 1 July 2014 10:33, Steven D'Aprano <steve@pearwood.info> wrote:
[Aside: does Python do any sort of verification of the bytecode before executing it, as Java does?]
Nope, it will happily attempt to execute invalid bytecode. That's actually one of the reasons executing untrusted bytecode is even less safe than executing untrusted source code - it's likely to be possible to trigger segfaults that way. There's an initial attempt at a bytecode verifier on PyPI (https://pypi.python.org/pypi/Python-Bytecode-Verifier/), and I have a vague recollection that Google have a bytecode verifier kicking around somewhere, but there's nothing built in to the CPython runtime. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 2014-07-01 19:16, Nick Coghlan wrote:
The re module also uses a kind of bytecode that's generated by the Python front end and verified by the C back end. The bytecode contains things like offsets; for example, the bytecode that starts a repeated sequence has an offset to the corresponding bytecode that ends it, and vice versa. The problem with that is that the structure (i.e. the nesting) is no longer explicit, so it's more difficult to spot misnested structures. For the regex module, I decided that it would be easier to verify if I kept the structure explicit by using bytecodes to indicate the start and end of the structures. For example, a repeated sequence could be indicated by having a structure like GREEDY_REPEAT min_count max_count ... END. The C back end could then build the internal representation that's actually interpreted.
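The appeal of explicit start and end markers is that nesting can be checked with a simple counter. The sketch below is not the regex module's actual format: the tuple encoding, the LITERAL opcode, and the verify function are assumptions invented to illustrate the idea.

```python
def verify(program):
    """Return True if every GREEDY_REPEAT is closed by a matching END."""
    open_repeats = 0
    for instruction in program:
        op = instruction[0]
        if op == "GREEDY_REPEAT":
            _, min_count, max_count = instruction
            if min_count < 0 or max_count < min_count:
                return False  # nonsensical repeat bounds
            open_repeats += 1
        elif op == "END":
            if open_repeats == 0:
                return False  # END with nothing left to close
            open_repeats -= 1
    return open_repeats == 0  # every repeat must have been closed


well_nested = [
    ("GREEDY_REPEAT", 1, 3),
    ("LITERAL", "a"),
    ("GREEDY_REPEAT", 0, 1),
    ("LITERAL", "b"),
    ("END",),
    ("END",),
]
unclosed = [("GREEDY_REPEAT", 1, 3), ("LITERAL", "a")]
stray_end = [("LITERAL", "a"), ("END",)]

print(verify(well_nested), verify(unclosed), verify(stray_end))
```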

On Tuesday, July 1, 2014 10:35 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I think CPython provides just about the right level of support here. The documentation, the APIs, and the helper tools for dealing with bytecode are all superb, and get better with each release. It's all more than sufficient to figure out what you're doing, and how to do it. It might be nice if there were an assembler in the stdlib, but the format is simple enough, and the documentation complete enough, that you can write one in a couple hours (as I did). And, honestly, I suspect a stdlib assembler wouldn't be updated fast enough—e.g., when support for Instruction objects was added to CPython's dis module in 3.4, I doubt an existing assembler would have been modified to take advantage of that, but a new one that you slap together can do so easily. Documenting that bytecode is only supported on CPython, and can change between CPython versions, isn't a problem for anyone who's just looking to experiment with and explore ideas, rather than write production code. As your examples show, you can usually even publish your explorations for others to experiment with, granting those limitations, and maintain them for years without much headache. (Bytecode has traditionally been much more conservative than what the documentation allows; it's generally only when your hacks rely on knowing exactly what bytecode will be generated for a given Python expression that they break. But even there, with a sufficient test suite, it's usually pretty simple to adapt.)
I'm led to understand that in the Java community, bytecode hacking is,
Here, it sounds like you _are_ suggesting that bytecode hacking may need to be used for production code, not just for exploration. But there are some pretty big differences between Java and Python that I think are relevant here:

* Java is designed for one specific VM, on which many other languages run; Python is designed to run on a variety of VMs, and nothing else runs on the CPython VM.

* Java is designed to be secure first, fast second, and flexible a distant third; Python is designed to be simple and transparent first, flexible and dynamic second, and everything else a distant third. So most of what you'd want to do (including solving problems like the one in the blog) can be done with simple monkey-patching and related techniques, and you can go a lot deeper than that without getting beyond the supported, portable reflection techniques.

* Java's VM is designed to be debuggable and optimizable; CPython's is designed to be the simplest thing that could support CPython. So, anything that's too hard to do with runtime structures is often easier at the VM level in Java, while the reverse is true in CPython.

* Java code is often distributed and always deployed as binary files; Python almost always as source. Besides being the cause of problems like the one in this article, it also means that if you have to go below the runtime level, you don't have the intermediate steps of source and AST hacking, you have no choice but to go to the bytecode.

On 30 Jun 2014 16:51, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
First, two quick side notes:
It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial.

Eugene Toder & Dave Malcolm have some interesting patches on the tracker to help enhance the compiler (in particular, Dave's allowed compiler plugins to be written in Python). Neither set of patches made it to being merge ready, though, and they'll be rather stale at this point.

Cheers, Nick.

Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers). On Mon, Jun 30, 2014 at 7:15 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

(Replies to both Guido's top-post and Nick's reply-post below.) On Monday, June 30, 2014 7:19 PM, Guido van Rossum <guido@python.org> wrote:
Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers).
But CPython does expose bytecode via the dis module, parts of inspect, etc. For that matter, it exposes some of the compiler's workings (especially if you consider everything up to AST generation part of the compiler, since every step up to there is exposed, including doing the whole thing in one whack with PyCF_ONLY_AST). So, I don't see how exposing the AST-to-bytecode transformation part (or, while we're at it, the .pyc generation part) is any more unportable than what's already there. That being said, I can appreciate that it would almost certainly take a lot more work, and a lot riskier work to do that, so the same tradeoff could easily go the other way in this case. (Not to mention that the dis module and so on are already there, while the patches Nick was talking about, much less something brand new, are obviously not.)
Thanks! Are you referring to Dave Malcolm's patch to adding a hook for an AST optimizer (in Python) right before compiling the AST to code (http://bugs.python.org/issue10399 and related)? If so, I don't think that would actually help here. Unless it's possible to say "BUILD_SET 0" in AST, but in that case, we don't need any new compiler hooks; just use an import hook the same way MacroPy does. (Doing it without import hooks would be a little nicer, but it's not essential.) The only patch I could find by Eugene Toder is one to reenable constant folding on -0, which I think was already committed in 3.3, and doesn't seem relevant anyway. Is there something else I should be searching for?

On , Andrew Barnert <abarnert@yahoo.com> wrote:
[snip]
I should have just tested it before saying anything:
So… it is possible to say "BUILD_SET 0" in AST. Which means the easy way to do this is to wrap an import hook around this:

    import ast

    class FixEmptySet(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id == '_EMPTY_SET_LITERAL':
                # ast.Set has only an elts field, so no ctx is passed
                return ast.copy_location(ast.Set(elts=[]), node)
            return node

    def ecompile(src, fname):
        src = src.replace('∅', '_EMPTY_SET_LITERAL')
        tree = compile(src, fname, 'exec', flags=ast.PyCF_ONLY_AST)
        tree = FixEmptySet().visit(tree)
        return compile(tree, fname, 'exec')

    code = ecompile('def f(): return ∅', '<>')
    exec(code)
    f()

That returns set(). And if you dis.dis(f), it's just BUILD_SET 0 and RETURN_VALUE.

On Tue, Jul 1, 2014 at 7:00 PM, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
src = src.replace('∅', '_EMPTY_SET_LITERAL')
Note that this suffers from a flaw that my POC script also suffers from: it replaces this character *anywhere*, rather than only when it's being used as a symbol on its own. Even inside a literal string. It might be necessary to replace it back the other way afterward, somehow, but I'm not sure if that would work. ChrisA

On Tuesday, July 1, 2014 2:08 AM, Chris Angelico <rosuav@gmail.com> wrote:
Yes, that's easy. Also, _EMPTY_SET_LITERAL itself could exist in your source (after all, it exists in my source fragment right above, right?), but that's easy too. See https://github.com/abarnert/emptyset for a slapped-together implementation that solves both those problems (except for bytes literals, but it explains how to do that). If it prints out "set() is the empty set ∅", then it worked; it successfully replaced the ∅ in your source with an empty set literal, and left the ∅ in your format string as ∅.
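For readers following along, here is a minimal sketch of the "replace it back the other way afterward" idea. It is not the code from the linked repository. It assumes Python 3.8 or later (where string literals appear in the AST as ast.Constant), and it still assumes the placeholder name never occurs in the original source; PLACEHOLDER is only an illustrative name.

```python
import ast

# Hypothetical placeholder; the sketch assumes it does not already occur in the source.
PLACEHOLDER = '_EMPTY_SET_LITERAL'

class FixEmptySet(ast.NodeTransformer):
    def visit_Name(self, node):
        # A bare placeholder name becomes an empty set display (BUILD_SET 0).
        if node.id == PLACEHOLDER:
            return ast.copy_location(ast.Set(elts=[]), node)
        return node

    def visit_Constant(self, node):
        # Inside string literals, undo the textual substitution.
        if isinstance(node.value, str) and PLACEHOLDER in node.value:
            restored = ast.Constant(value=node.value.replace(PLACEHOLDER, '∅'))
            return ast.copy_location(restored, node)
        return node

def ecompile(src, fname):
    src = src.replace('∅', PLACEHOLDER)
    tree = ast.parse(src, fname, 'exec')
    tree = FixEmptySet().visit(tree)
    ast.fix_missing_locations(tree)
    return compile(tree, fname, 'exec')

namespace = {}
exec(ecompile("def f(): return ∅, 'the empty set ∅'", '<demo>'), namespace)
result = namespace['f']()
print(result)
```

The symbol outside the string becomes a set and the one inside the string is left alone. Bytes literals are not handled here, the same limitation noted above.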

On 1 July 2014 01:27, Andrew Barnert <abarnert@yahoo.com> wrote:
Note that the dis module has a "CPython implementation detail" disclaimer, and the AST structure is deliberately exempted from the usual backwards compatibility guarantees. As far as hooking compilation goes, https://docs.python.org/3/library/importlib.html#importlib.abc.InspectLoader... was added in 3.4 specifically to make it easier to define custom loaders that make use of most of the existing import machinery (including bytecode cache files), but do something different for the source -> bytecode transformation step. Cheers, Nick.
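As a rough illustration of the kind of custom loader described above, the sketch below assumes the hook in question is source_to_code, overridden on a SourceFileLoader subclass. The class name EmptySetLoader, the module name demo_emptyset and the placeholder are all made up for the example, and it loads one file directly rather than installing anything on sys.meta_path.

```python
import ast
import importlib.machinery
import importlib.util
import os
import tempfile

class FixEmptySet(ast.NodeTransformer):
    def visit_Name(self, node):
        if node.id == '_EMPTY_SET_LITERAL':
            return ast.copy_location(ast.Set(elts=[]), node)
        return node

class EmptySetLoader(importlib.machinery.SourceFileLoader):
    """Hypothetical loader: only the source -> code step is customized."""

    def source_to_code(self, data, path, *, _optimize=-1):
        # SourceFileLoader passes the raw file bytes; decode them like the stock loader.
        source = importlib.util.decode_source(data)
        source = source.replace('∅', '_EMPTY_SET_LITERAL')
        tree = ast.parse(source, path, 'exec')
        tree = ast.fix_missing_locations(FixEmptySet().visit(tree))
        return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_emptyset.py')
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write('def f():\n    return ∅\n')
    loader = EmptySetLoader('demo_emptyset', path)
    spec = importlib.util.spec_from_file_location('demo_emptyset', path, loader=loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    value = module.f()

print(value)
```

Everything else, including the bytecode cache, is inherited from the stock loader. That also means a cached .pyc written this way looks like any other, so a real implementation would need to think about when the cache is valid.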

On Tue, Jul 01, 2014 at 09:58:37AM -0700, Nick Coghlan wrote:
On 1 July 2014 01:27, Andrew Barnert <abarnert@yahoo.com> wrote:
Further to what Nick says, the *output* of dis is not expected to remain backwards compatible from version to version, only the dis API itself. There's a big difference between saying "we guarantee that the dis module will correctly and accurately disassemble valid bytecode", and saying "we guarantee that this specific chunk of bytecode will do these things". In order to use a hypothetical asm function, you need to know what pseudo-assembly to write, say `asm [SPAM, EGGS]`. That means that SPAM and EGGS must be stable and part of the language definition. (Or at least part of the CPython API.) That's a big step from the current situation. -- Steven
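To make the distinction concrete, here is a small sketch that uses only the documented dis API (get_instructions, available since 3.4) to list opcode names. The helper name opnames is made up for the example. The lists it prints are exactly the part that is not guaranteed: they differ between CPython versions, so no expected output is shown.

```python
import dis
import sys

def opnames(src):
    """Return the opcode names CPython generates for an expression (version-dependent)."""
    code = compile(src, '<demo>', 'eval')
    return [ins.opname for ins in dis.get_instructions(code)]

# The API call is stable; the lists it returns are an implementation detail.
print(sys.version_info[:2], opnames('set()'))
print(sys.version_info[:2], opnames('{x, y}'))
```

On the CPython versions I am aware of, the set display compiles to a BUILD_SET instruction while set() goes through a name lookup and a call, but the surrounding instructions vary from release to release.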
participants (18)
- Andrew Barnert
- Antoine Pitrou
- Barry Warsaw
- Chris Angelico
- Clint Hepner
- David Mertz
- Giampaolo Rodola'
- Guido van Rossum
- MRAB
- Nick Coghlan
- Paul Tagliamonte
- Philipp A.
- random832@fastmail.us
- Ryan Gonzalez
- Stefan Behnel
- Steven D'Aprano
- Terry Reedy
- Wichert Akkerman