something like sscanf for Python
On Jun 26, 2019, at 7:13 PM, Chris Angelico <rosuav@gmail.com> wrote:
The main advantage of sscanf over a regular expression is that it performs a single left-to-right pass over the format string and the target string simultaneously, with no backtracking. (This is also its main DISadvantage compared to a regular expression.) A tiny amount of look-ahead in the format string is the sole exception (for instance, format string "%s$%d" would collect a string up until it finds a dollar sign, which would otherwise have to be written "%[^$]$%d"). There is significant value in having an extremely simple parsing tool available; the question is, is it worth complicating matters with yet another way to parse strings? (We still have fewer ways to parse than ways to format strings. I think.)

I agree. Python should have an equivalent of scanf, but perhaps it should have some extensions:
%P - read pickled object
%J - read JSON object
%M - read msgpack object
On 27/06/2019 18:58, James Lu wrote:
On Jun 26, 2019, at 7:13 PM, Chris Angelico <rosuav@gmail.com> wrote:
The main advantage of sscanf over a regular expression is that it performs a single left-to-right pass over the format string and the target string simultaneously, with no backtracking. (This is also its main DISadvantage compared to a regular expression.) A tiny amount of look-ahead in the format string is the sole exception (for instance, format string "%s$%d" would collect a string up until it finds a dollar sign, which would otherwise have to be written "%[^$]$%d"). There is significant value in having an extremely simple parsing tool available; the question is, is it worth complicating matters with yet another way to parse strings? (We still have fewer ways to parse than ways to format strings. I think.) I agree. Python should have an equivalent of scanf, but perhaps it should have some extensions:
%P - read pickled object %J - read JSON object %M - read msgpack object
I somewhat disagree; scanf (or rather sscanf) always looks like a brilliant idea right up until I come to use it, at which point I almost always do something else that gives me better control. I get very paranoid about parsing, and rolling my own usually feels safer. Whether or not it is safer is, of course, another issue :-/ -- Rhodri James *-* Kynesim Ltd
On 28 Jun 2019, at 19:01, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 27/06/2019 18:58, James Lu wrote:
On Jun 26, 2019, at 7:13 PM, Chris Angelico <rosuav@gmail.com> wrote:
The main advantage of sscanf over a regular expression is that it performs a single left-to-right pass over the format string and the target string simultaneously, with no backtracking. (This is also its main DISadvantage compared to a regular expression.) A tiny amount of look-ahead in the format string is the sole exception (for instance, format string "%s$%d" would collect a string up until it finds a dollar sign, which would otherwise have to be written "%[^$]$%d"). There is significant value in having an extremely simple parsing tool available; the question is, is it worth complicating matters with yet another way to parse strings? (We still have fewer ways to parse than ways to format strings. I think.) I agree. Python should have an equivalent of scanf, but perhaps it should have some extensions: %P - read pickled object %J - read JSON object %M - read msgpack object
I somewhat disagree; scanf (or rather sscanf) always looks like a brilliant idea right up until I come to use it, at which point I almost always do something else that gives me better control. I get very paranoid about parsing, and rolling my own usually feels safer. Whether or not it is safer is, of course, another issue :-/
And let's not forget how bad positional matching is. So if one were to implement such a library, it would be best if one could supply names for the parts and have it spit out a dict. / Anders
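For comparison, the standard library can already produce a dict from named parts by way of regular-expression named groups. The following is only a minimal sketch of that idea, not the library Anders has in mind; the pattern, the field names `count` and `item`, and the helper `scan_named` are all made up for illustration.

```python
import re

# Hypothetical pattern: "count" and "item" are illustrative field names.
PATTERN = re.compile(r"(?P<count>\d+) (?P<item>\w+)")

def scan_named(text):
    """Return a dict of named fields, or None if the text does not match."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Regex groups are always strings, so numeric fields need explicit conversion.
    fields["count"] = int(fields["count"])
    return fields

print(scan_named("3 apples"))
```

The explicit `int()` conversion is the part a scanf-style tool would presumably do for you.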
On Sat, Jun 29, 2019 at 3:24 AM Anders Hovmöller <boxed@killingar.net> wrote:
On 28 Jun 2019, at 19:01, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 27/06/2019 18:58, James Lu wrote:
On Jun 26, 2019, at 7:13 PM, Chris Angelico <rosuav@gmail.com> wrote:
The main advantage of sscanf over a regular expression is that it performs a single left-to-right pass over the format string and the target string simultaneously, with no backtracking. (This is also its main DISadvantage compared to a regular expression.) A tiny amount of look-ahead in the format string is the sole exception (for instance, format string "%s$%d" would collect a string up until it finds a dollar sign, which would otherwise have to be written "%[^$]$%d"). There is significant value in having an extremely simple parsing tool available; the question is, is it worth complicating matters with yet another way to parse strings? (We still have fewer ways to parse than ways to format strings. I think.) I agree. Python should have an equivalent of scanf, but perhaps it should have some extensions: %P - read pickled object %J - read JSON object %M - read msgpack object
I somewhat disagree; scanf (or rather sscanf) always looks like a brilliant idea right up until I come to use it, at which point I almost always do something else that gives me better control. I get very paranoid about parsing, and rolling my own usually feels safer. Whether or not it is safer is, of course, another issue :-/
And let's not forget how bad positional matching is. So if one were to implement such a library, it would be best if one could supply names for the parts and have it spit out a dict.
Dunno about that; positional matching works really nicely with unpacking assignment:

    spam, eggs, sausages = sscanf(string, "%d /// %s ||| %d")

No dictionary needed. Of course, if you _want_ named placeholders, that can be a good feature to support, but I wouldn't say that positional matching is "bad".

Here's a random thought, though. Let's break this into two separate parts.

1) For all the different types of object that can be read (integer, string, JSON blob, etc), have a function that will read one, stop when it's done, and report both the parsed object and the point where it stopped parsing.

2) Have a general template handler that looks for the literal text tokens, says, "oh, you want that kind of object up to the slash-slash-slash", and splits up the string to hand to the main parser.

I know for sure that the first half of that will have value in other contexts. Just last night I was wishing that I could call compile(data, "-", "eval") and have it read one Python expression and leave behind the rest. (To my knowledge, that doesn't exist either.) There are ways to hack around the JSON module to get that effect, but it's not exactly a supported feature.

The second half? Less sure, and the API would make or break it. Anyone feel inspired by this and want to come up with one?

ChrisA
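The two-part split described above could be sketched roughly as follows. This is a toy under stated assumptions, not a proposed API: the names `read_int`, `read_str` and `sscanf` are hypothetical, only %d and %s are handled, %s reads up to the next literal in the format string (which is not what C's scanf does), and trailing input after the last marker is ignored.

```python
import re

INT_RE = re.compile(r"[-+]?\d+")

def read_int(text, pos):
    """Part 1: read one integer at pos; return (value, end_position)."""
    match = INT_RE.match(text, pos)
    if match is None:
        raise ValueError(f"expected an integer at position {pos}")
    return int(match.group()), match.end()

def read_str(text, pos, stop):
    """Part 1: read text up to the literal `stop`, or to the end if stop is empty."""
    end = text.find(stop, pos) if stop else len(text)
    if end < 0:
        raise ValueError(f"literal {stop!r} not found after position {pos}")
    return text[pos:end], end

def sscanf(text, fmt):
    """Part 2: toy template handler that walks the format string once."""
    # The capturing group keeps the markers: literals land at even indexes.
    tokens = re.split(r"(%[ds])", fmt)
    pos, values = 0, []
    for i, token in enumerate(tokens):
        if token == "%d":
            value, pos = read_int(text, pos)
            values.append(value)
        elif token == "%s":
            # One token of look-ahead: the next literal ends the string.
            stop = tokens[i + 1] if i + 1 < len(tokens) else ""
            value, pos = read_str(text, pos, stop)
            values.append(value)
        elif token:
            if not text.startswith(token, pos):
                raise ValueError(f"expected {token!r} at position {pos}")
            pos += len(token)
    return tuple(values)

spam, eggs, sausages = sscanf("12 /// ham ||| 7", "%d /// %s ||| %d")
```

Keeping the per-type readers separate from the template walker means the readers can be reused on their own, which is the half of the idea described as certainly useful.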
On Jun 28, 2019, at 12:09, Chris Angelico <rosuav@gmail.com> wrote:
1) For all the different types of object that can be read (integer, string, JSON blob, etc), have a function that will read one, stop when it's done, and report both the parsed object and the point where it stopped parsing.
For a string, what does it mean to “read one”? Does it just munch everything, or until the end of the line (whether \n or universal newlines), or until white space, or just one character? Whichever one you decide is right is probably trivial to implement (value, _, rest = arg.partition('\n')), but unless the goal is “exactly what C scanf does” (in which case I’m not sure we need a whole protocol-and-wrapper thing), there doesn’t seem to be a TOOWTDI answer here.

Meanwhile, the json module can already do this with the raw_decode method (although you do have to construct a decoder instance, as it doesn’t have a convenience wrapper like loads), and so can lots of other things (even stuff like struct.unpack_from), but they mostly have a wide range of inconsistent APIs. Maybe just having a consistent “val, rest = parse_one(source_str, type)” function that calls a dunder protocol type.__parse_one__(source_str) or accesses a registry that each module can add to (and users can customize), or …?

Assuming you have that, then writing an unformat function where you specify the types by name is trivial, and just as extensible as format. That seems a lot more useful than a function which is like C scanf with a few differences and a few extensions (that aren’t the same extensions as, say, ObjC). Although I suppose there’s no reason you couldn’t do both.
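To make the “val, rest = parse_one(...)” shape concrete, here is a small sketch built on json.JSONDecoder.raw_decode, which does exist in the standard library and returns the decoded value together with the index where parsing stopped. Everything else here is an assumption for illustration: the `PARSERS` registry, the string keys "json" and "line", and the `parse_one` wrapper are hypothetical, and no `__parse_one__` protocol exists in Python.

```python
import json

_decoder = json.JSONDecoder()

def parse_one_json(source):
    """Parse one JSON value from the start of source; return (value, rest)."""
    # raw_decode does not skip leading whitespace, so source must start
    # directly with the JSON document.
    value, end = _decoder.raw_decode(source)
    return value, source[end:]

def parse_one_line(source):
    """One possible meaning of 'read one string': everything up to a newline."""
    value, _, rest = source.partition("\n")
    return value, rest

# Hypothetical registry that modules or users could extend.
PARSERS = {"json": parse_one_json, "line": parse_one_line}

def parse_one(source, kind):
    """Dispatch to the registered parser and return (value, rest)."""
    return PARSERS[kind](source)

value, rest = parse_one('{"a": [1, 2]} trailing', "json")
```

A registry keyed by name is only one of the options mentioned; a dunder method on the type would give the same calling convention.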
On Sat, Jun 29, 2019 at 9:02 AM Andrew Barnert <abarnert@yahoo.com> wrote:
On Jun 28, 2019, at 12:09, Chris Angelico <rosuav@gmail.com> wrote:
1) For all the different types of object that can be read (integer, string, JSON blob, etc), have a function that will read one, stop when it's done, and report both the parsed object and the point where it stopped parsing.
For a string, what does it mean to “read one”? Does it just munch everything, or until the end of the line (whether \n or universal newlines), or until white space, or just one character? Whichever one you decide is right is probably trivial to implement (value, _, rest = arg.partition('\n')), but unless the goal is “exactly what C scanf does” (in which case I’m not sure we need a whole protocol-and-wrapper thing), there doesn’t seem to be a TOOWTDI answer here.
The %s marker would accept everything up to the next literal text. So if you say "%s@%s", it would read up to the at sign. The second part of the proposal would be doing that, though; the "%s" handler would simply accept everything and return it.
Meanwhile, the json module can already do this with the raw_decode method (although you do have to construct a decoder instance, as it doesn’t have a convenience wrapper like loads), and so can lots of other things (even stuff like struct.unpack_from), but they mostly have a wide range of inconsistent APIs. Maybe just having a consistent “val, rest = parse_one(source_str, type)” function that calls a dunder protocol type.__parse_one__(source_str) or accesses a registry that each module can add to (and users can customize), or …?
Yes, that's what I mentioned as being possible to hack around with JSON parsing, but it's not exactly an API, and it's something that could be done way better for other protocols too. ChrisA
On Jun 28, 2019, at 16:10, Chris Angelico <rosuav@gmail.com> wrote:
The %s marker would accept everything up to the next literal text. So if you say "%s@%s", it would read up to the at sign. The second part of the proposal would be doing that, though; the "%s" handler would simply accept everything and return it
So, not like C scanf at all, where %s reads until white space. Also, there’s nothing like your “second part”; literals are almost useless in scanf except for things like binary protocols, because they’re not even looked at until after the previous format specifier has already been parsed. So if, say, you scan “I have 20. How many do you have?” with “I have %f. %s…”, the %f will munch the “20.”, and then the literal “.” will fail.

Anyway, something like my unformat or your implied design might be more useful, but I’m not sure that it would be. People have been trying to improve on scanf for 40 years, and the only things that have caught on look nothing like it (regex, or just not having a format string at all and doing something like C++ >> operator).
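The “20.” failure can be imitated in Python to show why greedy, no-backtracking matching breaks the following literal. This is only an approximation of C's behaviour, assuming a simplified float syntax (no exponents, hex floats, inf or nan, and no skipping of leading white space); the helper name `scan_float_then_literal` is hypothetical.

```python
import re

# Rough stand-in for the longest prefix that %f would accept.
FLOAT_PREFIX = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")

def scan_float_then_literal(text, literal):
    """Consume a float greedily, then require `literal`, with no backtracking."""
    match = FLOAT_PREFIX.match(text)
    if match is None:
        return None
    rest = text[match.end():]
    if not rest.startswith(literal):
        # The float already swallowed the character the literal needed.
        return None
    return float(match.group()), rest[len(literal):]

print(scan_float_then_literal("20. How many do you have?", "."))
print(scan_float_then_literal("20.5. How many do you have?", "."))
```

A regular expression would backtrack and give the “.” back to the literal; a single-pass scanner never revisits what the previous specifier consumed.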
On Sat, Jun 29, 2019 at 9:41 AM Andrew Barnert <abarnert@yahoo.com> wrote:
On Jun 28, 2019, at 16:10, Chris Angelico <rosuav@gmail.com> wrote:
The %s marker would accept everything up to the next literal text. So if you say "%s@%s", it would read up to the at sign. The second part of the proposal would be doing that, though; the "%s" handler would simply accept everything and return it
So, not like C scanf at all, where %s reads until white space. Also, there’s nothing like your “second part”; literals are almost useless in scanf except for things like binary protocols, because they’re not even looked at until after the previous format specifier has already been parsed. So if, say, you scan “I have 20. How many do you have?” with “I have %f. %s…”, the %f will munch the “20.”, and then the literal “.” will fail.
Anyway, something like my unformat or your implied design might be more useful, but I’m not sure that it would be. People have been trying to improve on scanf for 40 years, and the only things that have caught on look nothing like it (regex, or just not having a format string at all and doing something like C++ >> operator).
Hmm, I'm actually rather rusty on the details of C's sscanf, having used high level languages most of the time for years. There are [s]scanf functions in a number of languages, and I forgot to check back what C's own semantics are. But hey. If Python takes something that's inspired heavily by C's, and partially by (say) Pike's, that can still be useful. And it's a fair sight better to work with strings than character pointers. ChrisA
Rhodri James wrote:
scanf (or rather sscanf) always looks like a brilliant idea right up until I come to use it, at which point I almost always do something else that gives me better control.
My experience is similar, but that's largely because error detection and reporting with the C version of sscanf is pretty terrible. At best all you can say is "there is something wrong with this line of input". If the Python version could produce better diagnostics, it might find more use. -- Greg
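As a rough illustration of what better diagnostics might look like, a Python scanner could raise an exception that records where parsing stopped and points at the offending column. This is a sketch only; `ScanError` and `read_int` are hypothetical names, not part of any existing module.

```python
class ScanError(ValueError):
    """Parse failure that remembers the input and the failing position."""

    def __init__(self, message, text, pos):
        # Show the input with a caret under the column where parsing failed.
        pointer = " " * pos + "^"
        super().__init__(f"{message} at column {pos}\n  {text}\n  {pointer}")
        self.text = text
        self.pos = pos

def read_int(text, pos=0):
    """Read one integer at pos; return (value, end) or raise ScanError."""
    end = pos
    if end < len(text) and text[end] in "+-":
        end += 1
    digits_start = end
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == digits_start:
        raise ScanError("expected an integer", text, pos)
    return int(text[pos:end]), end

try:
    read_int("width=abc", 6)
except ScanError as exc:
    print(exc)
```

The caller gets both a readable message and the `pos` attribute, which is more than "there is something wrong with this line of input".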
This thread caught my interest because I’ve written code that wraps fscanf more than once (and have some in production). So a few observations:

1) My primary reason for doing it was performance: reading lots of numbers into numpy arrays. I literally NEVER felt Python’s built-in string manipulation facilities were inadequate.

2) C’s scanf is well suited to pulling numbers out of text, but it is not a very good general-purpose parser.

So: while I can see that a nifty new parser could be great, I’m not sure scanf is particularly good inspiration. And this clearly seems like a “put a package on PyPI, and if it really catches on, then consider adding to the stdlib” type of proposal.

Finally, wasn’t there a thread recently on this list about a parser for the stdlib?

-CHB

On Fri, Jun 28, 2019 at 5:42 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Rhodri James wrote:
scanf (or rather sscanf) always looks like a brilliant idea right up until I come to use it, at which point I almost always do something else that gives me better control.
My experience is similar, but that's largely because error detection and reporting with the C version of sscanf is pretty terrible. At best all you can say is "there is something wrong with this line of input".
If the Python version could produce better diagnostics, it might find more use.
-- Greg
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 6/27/19 1:58 PM, James Lu wrote:
On Jun 26, 2019, at 7:13 PM, Chris Angelico <rosuav@gmail.com> wrote:
The main advantage of sscanf over a regular expression is that it performs a single left-to-right pass over the format string and the target string simultaneously, with no backtracking. (This is also its main DISadvantage compared to a regular expression.) A tiny amount of look-ahead in the format string is the sole exception (for instance, format string "%s$%d" would collect a string up until it finds a dollar sign, which would otherwise have to be written "%[^$]$%d"). There is significant value in having an extremely simple parsing tool available; the question is, is it worth complicating matters with yet another way to parse strings? (We still have fewer ways to parse than ways to format strings. I think.) I agree. Python should have an equivalent of scanf, but perhaps it should have some extensions:
%P - read pickled object %J - read JSON object %M - read msgpack object
James, have you considered investigating whether such things already exist? I googled "python scanf", which took me to a Stack Overflow page, which linked me to https://pypi.org/project/parse/, which could be just what you are looking for. --Ned.
participants (8)
- Anders Hovmöller
- Andrew Barnert
- Chris Angelico
- Christopher Barker
- Greg Ewing
- James Lu
- Ned Batchelder
- Rhodri James