A string function idea

Hello everyone,

When I am coding in Python I often encounter situations where I have a string like this one:

    sample = """
    fruit:apple
    tree:[Apple tree]
    quantity:{5}
    quantity:{3}
    """

and I want to grab some kind of value from it. So let me introduce you to the grab function. This is the definition:

    def grab(start, end, list_out=False):

Now, if you want to get the fruit from the sample string, you can just "grab" it as follows:

    >>> sample.grab(start="fruit:", end="\n")
    'apple'

You can also "grab" values enclosed in brackets or in any kind of character, and it works as you would expect.

    >>> sample.grab(start="tree:[", end="]")
    'Apple tree'

The optional argument "list_out" makes the function return a list of values, for cases where there are several of them to grab.

    >>> sample.grab(start="quantity:{", end="}", list_out=True)
    [5, 3]

The "grab" function will return only the first occurrence if "list_out" is omitted or passed as False.

    >>> sample.grab(start="quantity:{", end="}")
    5
    >>> sample.grab(start="quantity:{", end="}", list_out=False)
    5

As you can see, it is incredibly useful for extracting substrings from a larger and more complex string, and it can change the way we parse strings for the better. For example, it could simplify the way we parse the fields of HTTP headers. It also has many applications in parsing configuration files and in constructing strings for SQL queries and bash commands, among others.

I have implemented this function locally for my personal projects and it has proven to be really useful in practice, so I hope you find it useful as well and consider adding it as a method of strings in Python.

Julio Cabria
Engineering student
Autonomous University of Madrid
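For readers who want to experiment, here is a minimal sketch of the behaviour described in the post, written as a plain function rather than a str method. It is not the poster's implementation (which the post does not show); the function body, the guard against empty delimiters, and the ValueError on failure are assumptions made for illustration.

```python
def grab(text, start, end, list_out=False):
    """Return the text between `start` and the next `end` that follows it.

    Hypothetical stand-alone version of the proposed method; the error
    handling below is an assumption, not part of the proposal.
    """
    if not start or not end:
        raise ValueError("start and end must be non-empty")
    results = []
    pos = 0
    while True:
        a = text.find(start, pos)
        if a == -1:
            break
        a += len(start)            # skip past the start delimiter
        b = text.find(end, a)
        if b == -1:
            break
        results.append(text[a:b])
        if not list_out:
            break                  # only the first occurrence is wanted
        pos = b + len(end)         # continue searching after this match
    if list_out:
        return results
    if not results:
        raise ValueError(f"no text found between {start!r} and {end!r}")
    return results[0]


sample = """
fruit:apple
tree:[Apple tree]
quantity:{5}
quantity:{3}
"""

print(grab(sample, "fruit:", "\n"))
print(grab(sample, "tree:[", "]"))
print(grab(sample, "quantity:{", "}", list_out=True))
```

Note that this sketch returns substrings, so the quantities come back as the strings '5' and '3'; converting them to integers would be left to the caller.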

The grab function would find the index of the first occurrence of the "start" string in the parent string, then the next occurrence of the "end" string starting from that index, and return the substring between those. So in the example:

    sample = "sqrt(sin(x) + cos(y))"

the grab function would return:

    >>> sample.grab(start="sqrt(", end=")")
    "sin(x"

This shows that "grab" is only useful if you specify the "start" and "end" delimiters unambiguously. It depends on that to produce the correct output.

Julio Cabria
Engineering student
Autonomous University of Madrid

On Tue, Mar 29, 2022, Steven D'Aprano <steve@pearwood.info> wrote:

On Tue, 29 Mar 2022 at 18:13, StrikerOmega <oddochentaycinco850@gmail.com> wrote:
> The grab function would find the index of the first occurrence of the "start" string in the parent string and then the next occurrence of the "end" string starting from that index and return the substring between those.

This sounds like a really good job for preparsing. Take your string, parse it according to your rules, and build a list or dict of the cooked results. Then you can look up in that very easily and efficiently. Going back to the string every time tends to be inefficient, but a single pass that gives you a convenient lookup table is both easier to work with and easier to execute.

In your example:

    sample = """
    fruit:apple
    tree:[Apple tree]
    quantity:{5}
    quantity:{3}
    """

I'd start by splitting it into lines, then for each line, partitioning it on the colon, thus giving you a keyword and a value. (I'm not sure what it means to have quantity 5 and quantity 3, but I'm sure you'd define that in some way - maybe first one wins, or last one wins, or build a list of all the values, whatever makes sense.) You could end up with something like:

    { "fruit": "apple", "tree": "Apple tree", "quantity": ... }

depending on how you resolve the conflict.

Python is an excellent language for text processing; you have a wide variety of pretty cool tools available.

ChrisA
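The preparsing pass described above could look roughly like the following sketch. The helper name preparse is hypothetical, and two choices are assumptions made for illustration: one pair of enclosing [] or {} is stripped from each value, and repeated keys are resolved by collecting every value into a list (one of the options mentioned in the message).

```python
def preparse(text):
    """Build a {key: [values]} table from 'key:value' lines.

    Hypothetical helper; bracket stripping and list-valued entries
    are assumptions about how the data should be interpreted.
    """
    table = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line, or a line without a colon
        value = value.strip()
        # Strip one pair of enclosing brackets or braces, if present.
        if len(value) >= 2 and (value[0], value[-1]) in {("[", "]"), ("{", "}")}:
            value = value[1:-1]
        table.setdefault(key.strip(), []).append(value)
    return table


sample = """
fruit:apple
tree:[Apple tree]
quantity:{5}
quantity:{3}
"""

table = preparse(sample)
print(table)
# {'fruit': ['apple'], 'tree': ['Apple tree'], 'quantity': ['5', '3']}
```

After the single pass, each lookup is a plain dictionary access, for example table["quantity"].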

On Tue, Mar 29, 2022 at 09:12:36AM +0200, StrikerOmega wrote:
That's what I guessed it would do. So your earlier statement:

    "You can also "grab" values enclosed in brackets or in any kind of character and it works as you would expect."

is wrong. When using brackets I expect it to understand nesting.

-- Steve
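To make the nesting point concrete, here is a small sketch of a bracket-aware variant. It is not part of the proposal; the name grab_nested and its parameters are hypothetical, and it assumes that the start delimiter ends with the opening bracket and that brackets inside string literals or escapes do not need special treatment.

```python
def grab_nested(text, start, open_ch="(", close_ch=")"):
    """Return the text after `start` up to the bracket that closes it.

    Hypothetical helper; assumes `start` ends with `open_ch`.
    """
    a = text.index(start) + len(start)
    depth = 1  # the opening bracket at the end of `start` is already open
    for i in range(a, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                return text[a:i]
    raise ValueError(f"unbalanced {open_ch!r} after {start!r}")


print(grab_nested("sqrt(sin(x) + cos(y))", "sqrt("))
# sin(x) + cos(y)
```

Compared with the first-match behaviour, which stops at the first ")" and yields "sin(x", this version tracks depth and stops at the bracket that balances the one in "sqrt(".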

On Tue, Mar 29, 2022 at 12:35:56AM -0700, Paul Bryan wrote:
> I wonder if applying regular expressions would sufficiently address your use case.

'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' -- Jamie Zawinski

Apart from (probably) being slower and harder to understand, what benefit would a regular expression bring to this problem? Julio already has a working solution that does exactly what he wants:

-- Steve

28.03.22 15:13, StrikerOmega writes:
> And I want to grab some kind of value from it.

There is a powerful tool designed for solving such problems. It is called regular expressions.

> sample.grab(start="fruit:", end="\n")
> 'apple'

    re.search(r'fruit:(.*?)\n', sample)[1]

> sample.grab(start="tree:[", end="]")
> 'Apple tree'

    re.search(r'tree:\[(.*?)\]', sample)[1]

> sample.grab(start="quantity:{", end="}", list_out=True)
> [5, 3]

    list(re.findall(r'quantity:\{(.*?)\}', sample))

On Tue, Mar 29, 2022 at 11:00:41AM +0300, Serhiy Storchaka wrote:
Now do grab(start="*", end=".").

Of course you know how to do it, but a naive solution:

    re.search(r'*(.*?).', sample)[1]

will fail. So now we have to learn about escaping characters in order to do a simple find-and-extract. And you need to memorise what characters have to be escaped, and if your start and end parameters are expressions or parameters rather than literals, the complexity goes up a lot:

    # Untested, so probably wrong.
    re.search(re.escape(start) + "(.*?)" + re.escape(end), sample)[1]

and we both know that many people won't bother with the escapes until they get bitten by bugs in their production code. And even then, regexes are a leading source of serious software vulnerabilities.

https://cwe.mitre.org/data/definitions/185.html

Yes, regular expressions can be used. We know that regexes can be used to solve most problems, for some definition of "solve". Including finding prime numbers:

https://iluxonchik.github.io/regular-expression-check-if-number-is-prime/

A method can raise a useful, self-explanatory error message on failure. Your regex raises "TypeError: 'NoneType' object is not subscriptable".

A method can be written to parse nested brackets correctly. A regular expression cannot.

And then regexes are significantly slower:

Here's the version of grab I used:

    def grab(text, start, end):
        a = text.index(start)
        b = text.index(end, a+len(start))
        return text[a+len(start):b]

I have no strong opinion on whether this simple function should be built into the string class, but I do have a strong opinion about re-writing it into a slower, more fragile, harder to understand, less user-friendly regex.

Don't make me quote Jamie Zawinski again.

-- Steve
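As an illustration of the escaping and error-message points in this message, the sketch below puts an escaped regex version next to the str.index version quoted above. The name grab_re, the explicit ValueError, and the choice of re.DOTALL (so that "." also matches newlines, as the index-based version implicitly allows) are assumptions made for this example, not something proposed in the thread. No timing figures are claimed here.

```python
import re


def grab(text, start, end):
    """The str.index-based version quoted in the message above."""
    a = text.index(start)
    b = text.index(end, a + len(start))
    return text[a + len(start):b]


def grab_re(text, start, end):
    """Regex equivalent with escaped delimiters (hypothetical helper)."""
    # re.escape makes characters such as "*" and "." match literally.
    pattern = re.escape(start) + "(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        # Explicit error instead of "'NoneType' object is not subscriptable".
        raise ValueError(f"no text found between {start!r} and {end!r}")
    return match[1]


text = "rate: *high priority. done"
print(grab(text, "*", "."))
print(grab_re(text, "*", "."))
```

Both calls return the text between the literal "*" and the next literal ".", whereas the unescaped pattern r'*(.*?).' is rejected by the re module because "*" has nothing to repeat.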

On Wed, 30 Mar 2022 at 10:08, Steven D'Aprano <steve@pearwood.info> wrote:
This is where Python would benefit from an sscanf-style parser. Instead of regexps, something this simple could be written like this:

    [fruit] = sscanf(sample, "%*sfruit:%s\n")

It's simple left-to-right tokenization, so it's faster than a regex (due to the lack of backtracking). It's approximately as clear, and doesn't require playing with the index and remembering to skip len(start).

That said, though - I do think the OP's task is better served by a tokenization pass that transforms the string into something easier to look things up in.

ChrisA

Chris Angelico writes:
> [fruit] = sscanf(sample, "%*sfruit:%s\n")

I'm warming to this idea. It does hit the sweet spot of doing exactly what you want -- except when it can't do what you want at all. :-) It's concise and quite powerful, applicable to many common use cases.

I do have one windowframe of the bikeshed to paint: this is Python, so maybe just "scanf" is a fine name?

The first argument can be any iterable of characters, and if an iterator it would leave the iteration pointer where it left off (eg, beginning of next line in 'sample' above). Then the question would be how to use that feature. Specifically, how does scanf deal with the case that the parse fails? Consider

    while True:
        fruits.append(scanf(input_file, "%*sfruit:%s\n")[0])

Neither returning a sentinel (presumably None) nor raising a NotFound exception seems palatable. Can it raise StopIteration, perhaps conditional on the first argument having a .__next__?

On Wed, 30 Mar 2022 at 15:11, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
It fits nicely between "x,sep,y = str.partition(...)" and a regular expression.
> I do have one windowframe of the bikeshed to paint: this is Python, so maybe just "scanf" is a fine name?

Sure, whether it's scanf or sscanf doesn't really matter to me. And - I had to look this up - the converse is referred to in the docs as "printf-style formatting", not sprintf. So that's an argument in favour of "scanf".

Hmm, I'm not really a fan. To be efficient, scanf will need to be able to use core string searching functionality - str.index() is faster than simply iterating over a string and comparing character by character. I don't think, for instance, that json.load() promises anything about where it leaves an iterable; in fact, I believe it simply reads everything into a string and then parses that.

It would be worth supporting both byte strings and text strings, though, for the same reason that they both support printf formatting.

ChrisA

Chris Angelico writes:
> On Wed, 30 Mar 2022 at 15:11, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
OK. That performance issue plus my concerns about how to make use of a generic iterable pretty in a loop convince me it's not a good idea, and if someone wants it later they can deal with those issues. I figure it will eventually get a C (or Rust ;-) accelerator anyway, but the pure Python implementation should be as fast as possible.
It would seem so: the only constraint on the source is that it support .read(). It's also documented to assume a "file-like object containing a [single] JSON document" (brackets are my insertion). AFAIK, JSON doesn't define "document", and a "JSON text" appears to be defined to be a single JSON value (array, object, number, string, 'true', 'false', 'null'). If you take that seriously, then you might assume the file to contain exactly one value, and throw away any excess.

> It would be worth supporting both byte strings and text strings, though, for the same reason that they both support printf formatting.
Ethan would show up in our nightmares if we didn't!

On Wed, 30 Mar 2022 at 18:09, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Yeah. Logically, it ought to be possible and reasonable to build a JSON loader that can consume a file progressively, minimizing memory usage (particularly if there ends up being an error). But that's not what happens.
That is correct. The standard definition of JSON is that there is a single value (calling it a "document" just implies intent - it's the same thing as any other value, just that most people don't use json.load() on a file that contains the word "false"), with nothing else in the file other than whitespace. Although I have frequently made use of JSONDecoder().raw_decode(), which returns a document and the point where it ends. It still doesn't work with partial reads, but I've used it with chunked formats a number of times.
Actually, I'd like to see that. Nothing wrong with making one's nightmares more interesting. :)

ChrisA
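The JSONDecoder().raw_decode() technique mentioned in this message can be sketched as follows. raw_decode(s, idx) is a real method of json.JSONDecoder that returns the decoded value together with the index where it ended; the generator name iter_json_values and the sample stream are made up for illustration. Because raw_decode does not skip leading whitespace, the sketch does that itself.

```python
import json


def iter_json_values(text):
    """Yield each JSON value found back to back in `text`.

    Hypothetical helper built on json.JSONDecoder.raw_decode.
    """
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while True:
        # raw_decode expects the value to begin exactly at `pos`.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


stream = '{"fruit": "apple"} [5, 3]\nfalse'
print(list(iter_json_values(stream)))
# [{'fruit': 'apple'}, [5, 3], False]
```

As noted in the message, this still needs the whole text in memory; it only avoids the "exactly one value per file" restriction of json.load().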

On Wed, Mar 30, 2022 at 1:24 AM Chris Angelico <rosuav@gmail.com> wrote:
> The standard definition of JSON is that there is a single value

I believe that single value has to be either an array or an object. At least some sub-specifications call for that.

But we’ve gotten quite sidetracked :-)

-CHB

--
Christopher Barker, PhD (Chris)
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython

FYI, there is a “parse” library on PyPI: https://pypi.org/project/parse/

*Parse strings using a specification based on the Python format() syntax.*

*parse() is the opposite of format()*

(I haven’t used it myself, but I find the idea compelling, especially for people unfamiliar with regexes or C-style printf syntax).

S.

--
Stefane Fermigier - http://fermigier.com/ - http://twitter.com/sfermigier - http://linkedin.com/in/sfermigier
Founder & CEO, Abilian - Enterprise Social Software - http://www.abilian.com/
Co-Founder & Co-Chairman, National Council for Free & Open Source Software (CNLL) - http://cnll.fr/
Co-Founder & Chairman, Association Professionnelle Européenne du Logiciel Libre (APELL) - https://www.apell.info/
Co-Founder & Spokesperson, European Cloud Industrial Alliance (EUCLIDIA) - https://www.euclidia.eu/
Founder, PyParis & PyData Paris - http://pyparis.org/ & http://pydata.fr/

On Wed, Mar 30, 2022 at 1:50 AM Stéfane Fermigier <sf@fermigier.com> wrote:
Me neither, but I do like this idea better than scanf style. And there’s an implementation ready to try out.

-CHB

parse is a great library! I've used it a lot.

---
Ricky.

"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler

On Wed, Mar 30, 2022 at 1:08 PM Paul Moore <p.f.moore@gmail.com> wrote:

On Tue, Mar 29, 2022 at 4:08 PM Steven D'Aprano <steve@pearwood.info> wrote:
> I have no strong opinion on whether this simple function should be built into the string class,

I do -- this is not sufficiently general to be a string method.

I do agree there. I also agree with Chris A's suggestion: *some* scanner / parser that could be used for this and many other things that's significantly more straightforward than regexes.

-CHB

participants (11):
- Abdulla Al Kathiri
- Chris Angelico
- Christopher Barker
- Paul Bryan
- Paul Moore
- Ricky Teachey
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- StrikerOmega
- Stéfane Fermigier