Exposing regular expression bytecode

(This was previously sent to python-dev [1], but it was suggested that I bring it here first.) I filed http://bugs.python.org/issue26336 a few days ago, but now I think this list might be a better place to get discussion going.
Basically, I'd like to see the bytecode of a compiled regex object exposed as a public (probably read-only) attribute of the object. Currently, although compiled in pure Python through modules sre_compile and sre_parse, the list of opcodes is then passed into C and copied into an array in a C struct, without being publicly exposed in any way. The only way for a user to get an internal representation of the regex is the re.DEBUG flag, which only produces an intermediate representation rather than the actual bytecode and only goes to stdout, which makes it useless for someone who wants to examine it programmatically.
I'm sure others can think of other potential use cases for this, but one in particular would be that someone could write a debugger that can allow a user to step through a regex one opcode at a time to see exactly where it is failing. It would also perhaps be nice to have a public constructor for the regex object type, which would enable users to modify the bytecode and directly create a new regex object from it, similar to what is currently possible for function bytecode through the types.FunctionType and types.CodeType constructors. This would make possible things such as optimizers.
In addition to exposing the code in a public attribute, a helper module written in Python similar to the dis module (which is for Python's own bytecode) would be very helpful, allowing the code to be easily disassembled and examined at a higher level.
Is this a good idea, or am I barking up the wrong tree? I think it's a great idea, but I'm open to being told this is a horrible idea. :) I welcome any and all comments both here and on the bug tracker.
[1] https://mail.python.org/pipermail/python-dev/2016-February/143355.html
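
To illustrate the re.DEBUG limitation described above: the dump can be captured by temporarily redirecting stdout, but what comes back is still only text that has to be parsed by hand. This is a minimal sketch of that workaround, not a proposed API; the helper name debug_dump is made up for illustration, and the exact layout of the dump may differ between Python versions.

```python
import contextlib
import io
import re


def debug_dump(pattern, flags=0):
    """Return whatever re.DEBUG prints for `pattern`, as a string.

    debug_dump is a hypothetical helper name. The dump format is an
    implementation detail and is not guaranteed to be stable.
    """
    buf = io.StringIO()
    # re.DEBUG writes with print(), so redirecting stdout captures it.
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, flags | re.DEBUG)
    return buf.getvalue()


if __name__ == "__main__":
    print(debug_dump("ab+"))
```

Even with this trick, a tool would be scraping a human-oriented dump rather than reading structured data, which is the gap the proposal is about.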

The new 'regex' module (with a re compatibility mode and Unicode) may be the place to find/add more debugging symbols:
* PyPI: https://pypi.python.org/pypi/regex
* Src: https://bitbucket.org/mrabarnett/mrab-regex
On Feb 14, 2016 11:49 PM, "Jonathan Goble" <jcgoble3@gmail.com> wrote:

On Mon, Feb 15, 2016 at 2:26 AM, Wes Turner <wes.turner@gmail.com> wrote:
the new 'regex' module (with a re compatibility mode and unicode) may be the place to find/add more debugging syms
Thanks, but I'm already well aware of that module, as I'm already using it for another project of mine that requires some of its new features. But this proposal, I think, is simple enough that it would be worth doing for the built-in 're' module. Also, the things that implementing it would enable, such as optimizers and especially debuggers, would be available to a much larger number of people if the built-in module supported them. I'm not really asking for much; just take an array already in the struct that represents the object, expose it as a public attribute, and write a simple 'dis'-like module to enable low-level programmatic analysis. Once that's available, third parties are likely to build on it from there; that's much less likely to happen if such tools only worked on a different regex module (albeit a great one) from PyPI and not on the built-in version.
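
For a rough idea of what the exposed array would contain, the opcode list can already be obtained today by calling the pure-Python compiler directly. The sketch below is only an approximation under loud assumptions: it relies on private, undocumented internals (sre_parse.parse and sre_compile._code, which live in re._parser and re._compiler on newer Pythons), the helper name regex_code is invented here, and none of this is a supported interface.

```python
try:
    # Newer Pythons keep the pure-Python compiler in private submodules of re.
    from re import _compiler as sre_compile, _parser as sre_parse
except ImportError:
    # Older Pythons expose the same code as top-level sre_* modules.
    import sre_compile
    import sre_parse


def regex_code(pattern, flags=0):
    """Return the list of integers the compiler would hand to the C engine.

    ASSUMPTION: parse() and _code() are private implementation details;
    their names, signatures and output may change without notice.
    """
    parsed = sre_parse.parse(pattern, flags)
    return list(sre_compile._code(parsed, flags))


if __name__ == "__main__":
    # Prints a flat list of opcodes and arguments; the meaning of each
    # number depends on the interpreter version.
    print(regex_code("ab+"))
```

A public attribute on the compiled pattern would make this kind of poking at private modules unnecessary, and a 'dis'-like helper would turn the flat list of integers into named opcodes.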

Hello? Anyone? I'd really like to see this idea move forward, but I need some comments for that. Original message: https://mail.python.org/pipermail/python-ideas/2016-February/038488.html On Mon, Feb 15, 2016 at 2:47 AM, Jonathan Goble <jcgoble3@gmail.com> wrote:

On Tue, Feb 16, 2016 at 1:15 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
Isn't the regex module intended to replace the re module "one of these days", once it's actually whipped into stdlibable shape? If so, it seems like it would be better to help Matthew finish that than to add to the re module.
As I said roughly 24 hours ago, I'd like to see this in core now (meaning 3.6) because of its usefulness and (I presume) its simplicity. The regex module has been "intended" to replace re for what seems like (and probably has been) years, and could take several more years at this pace. I really don't see what is accomplished by excluding a simple, useful feature from this year's version of Python core in favor of shunting it to (what is currently) a third-party PyPI module in the hopes that it will eventually be approved to replace the core module some years down the road. All that does is drastically reduce the number of people who have access to the feature in the time being. Don't get me wrong here; I completely understand why the regex module is being developed separately, because the changes being made are massive. But this proposed feature is not even remotely on the scale of those changes and can not only easily be made to core now, but can benefit many, many users of regular expressions, especially after third-party debuggers and optimizers come along (which is not likely to happen *until* it's in core). Why arbitrarily limit the number of people with access to it? And why does the fact that another regex module has been in development for several years have to preclude any possibility of changes to the current one, especially when the other one doesn't seem likely to get approved any time soon? As a final note, I ALREADY argued against shunting this to regex last night, and quoted that response in my bump tonight. As such, you're not leaving a good first impression on newbie me by totally ignoring that and repeating the same suggestion without any attempt to counter my arguments against it. If you have actual counter-arguments to either of these responses, I'm open to listening and possibly being convinced otherwise, but please don't just ignore what I've already said.

On Feb 15, 2016, at 22:36, Jonathan Goble <jcgoble3@gmail.com> wrote:
Sorry; I was assuming it was close because it almost made it into 3.3 and then again 3.4. But I guess the alternative position--that almost but not quite making 3.3 and 3.4 and then not even being mentioned for 3.5--implies that it may not be that near-future, even if you were to help with it. Anyway, assuming it's not that complicated, and you did the work, the only negative is that it might encourage people to write code that depends on re that regex can't be backward-compatible with. And at this point, that doesn't seem like much of a negative. (At the very least, nobody's made a good argument for it being a major negative.)
As such, you're not leaving a good first impression on newbie me
On a personal note: please keep in mind that I'm just some random guy, like you, and so are many other people on this list. Please don't hold anything I say against the core devs, any more than you'd want something you say to represent them. It can be frustrating to push ideas through -ideas--but if you think of the long-term consequences of keeping Python simple and conservative vs. keeping the discussion fair to new ideas, it helps with the frustration. I hope I haven't contributed to pushing you away from contributing to Python, because it needs people willing to do the work and put together complete ideas, with complete implementations.

On Tue, Feb 16, 2016 at 2:28 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
That's precisely what I was getting at. :-)
Anyway, assuming it's not that complicated, and you did the work, the only negative is that it might encourage people to write code that depends on re that regex can't be backward-compatible with. And at this point, that doesn't seem like much of a negative. (At the very least, nobody's made a good argument for it being a major negative.)
I would imagine that a change this simple could be easily made to regex as well as re. In other words, I'm not opposed to having the change made in regex; I just want to see it in re at the same time. As for doing the work, I'm willing to look into that possibility, but I had never, ever written C code until last month, and I know nothing about the Python/C API. Literally all I've done in C is a fork and extension of Lua's pattern matching [1]; considering a feature for that is what led me to inspect Python's re internals and made me realize that exposing the code could be very beneficial. [1] https://github.com/jcgoble3/lua-matchext
I understand, and probably shouldn't have added that last paragraph, but it is frustrating to argue against something and then have the same thing thrown back at me without counter-argument. I guess I'm just annoyed that I haven't gotten any substantive comments regarding the usefulness of it, while other threads are highly active (e.g. generator unpacking). Perhaps I should be a bit more patient; patience has never been one of my positive qualities. :-P
I hope I haven't contributed to pushing you away from contributing to Python, because it needs people willing to do the work and put together complete ideas, with complete implementations.
You haven't. :-) And again, I'm willing to do some coding work on this, but it may take some time for me to learn my way around the Python/C API, as a cursory look at the docs immediately shows that it's vastly different from the Lua/C API that I've recently become accustomed to.

On Tue, Feb 16, 2016 at 6:42 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
For what it's worth, I read your post with interest, but didn't have anything substantive to reply - mainly because I don't use regexes much. But it would be rather cool to be able to decompile a regex. Imagine a regex pretty-printer: compile an arbitrary string, and if it's a valid regex, decompile it to a valid source code form, using re.VERBOSE. That could help _hugely_ with debugging, if the trick can be pulled off. ChrisA
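
A very small sketch of the pretty-printer idea, under stated assumptions: it walks the parse tree produced by the private sre_parse module (re._parser on newer Pythons) rather than real bytecode, it only understands plain literal characters, and the function name to_verbose is made up for illustration. Anything it does not understand is emitted as a comment, so the output is only equivalent to the input for literal-only patterns.

```python
import re

try:
    # Newer Pythons: private parser module inside the re package.
    from re import _parser as sre_parse
except ImportError:
    # Older Pythons: top-level module.
    import sre_parse


def to_verbose(pattern):
    """Render a pattern as re.VERBOSE source, one node per line.

    ASSUMPTION: relies on the private parse() function and on parse-tree
    nodes being (opcode, argument) pairs whose opcode has a .name.
    Only LITERAL nodes are handled; everything else becomes a comment.
    """
    lines = []
    for op, arg in sre_parse.parse(pattern):
        name = getattr(op, "name", str(op))
        if name == "LITERAL":
            ch = chr(arg)
            # re.escape protects spaces and '#', which are special in VERBOSE.
            lines.append("%s  # literal %r" % (re.escape(ch), ch))
        else:
            lines.append("# (unhandled node: %s)" % name)
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_verbose("a b"))
```

A real decompiler would need a case for every node type (repeats, groups, character classes and so on), which is where a documented, exposed representation would help.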

On 16 February 2016 at 06:04, Jonathan Goble <jcgoble3@gmail.com> wrote:
Hello? Anyone? I'd really like to see this idea move forward, but I need some comments for that.
Sorry. I don't personally have any issue with the proposal, and it sounds like a reasonable idea. I don't think it's likely to be *hugely* controversial - although it will need a little care in documenting the feature, to make clear that we offer no backward-compatibility guarantees on the newly exposed data beyond those we are willing to commit to. And we should also ensure that by exposing this information, we don't preclude changes such as the incorporation of the regex module (I don't know if the regex module has a bytecode implementation like the re module does).
The next step is probably simply to raise a tracker issue for this. I know you said you have little C experience, but the big problem is that it's unlikely that any of the core devs with C experience will have the time or motivation to code up your idea. So without a working patch, and someone willing and able to respond to comments on the patch, it's not likely to progress. But if you are willing to dig into Python's C API yourself (and it sounds like you are) there are definitely people who will help you. You might want to join the core mentorship list (see http://pythonmentors.com/) where you should get plenty of assistance.
This proposal sounds like a great "beginner" task, as well - so even if you don't want to implement it yourself, still put it on the tracker, and mark it as an "easy" change, and maybe some other newcomer who wants a task to help them learn the C API will pick it up.
Hope that helps - thanks for the suggestion and sorry if it seems like no-one was interested at first. It's an unfortunate fact of life around here that things *do* take time to get people's interest. You mention patience in one of your messages - that's definitely something you'll need to cultivate, I'm afraid... :-)
Paul

participants (5)
* Andrew Barnert
* Chris Angelico
* Jonathan Goble
* Paul Moore
* Wes Turner