Very often....make that very very very very very very very very very often, I find myself processing text in Python where, when .split()'ing a line, I'd like to exclude the split for a 'quoted' item...quoted because it contains whitespace or the sep char.

For example:

s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'

If I want to yank the ESSID in the above example, it's a pain. But with my new dandy split-quoted method, we have a 3rd argument to .split() in which we can specify the quote delimiter; no splitting will occur inside it, and the quote chars will be dropped:

>>> s.split(None, -1, '"')[5]
'Spaced Out Wifi'

Attached is a proof-of-concept patch against Python-2.4.1/Objects/stringobject.c that implements this. It is limited to whitespace splitting only (sep == None).

As implemented, the quote delimiter also doubles as an additional separator for splitting out a substring. For example:

>>> 'There is"no whitespace before these"quotes'.split(None, -1, '"')
['There', 'is', 'no whitespace before these', 'quotes']

This is useful, but possibly better put into practice as a separate method??

Comments please.

Dave
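For reference, the proposed semantics can be approximated in pure Python today without touching stringobject.c. The following is a sketch only, not the attached C patch, and the helper name splitquoted is hypothetical:

```python
def splitquoted(s, quote='"'):
    """Whitespace-split s, but keep quoted spans intact, dropping the quotes.

    Splitting on the quote char leaves the even-indexed chunks outside
    the quotes and the odd-indexed chunks inside them.
    """
    parts = s.split(quote)
    result = []
    for i, chunk in enumerate(parts):
        if i % 2:                 # inside quotes: keep verbatim
            result.append(chunk)
        else:                     # outside quotes: ordinary whitespace split
            result.extend(chunk.split())
    return result

s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
print(splitquoted(s)[5])          # Spaced Out Wifi
```

This also reproduces the "quote doubles as a separator" behavior: splitquoted('There is"no whitespace before these"quotes') yields ['There', 'is', 'no whitespace before these', 'quotes'].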
Am Donnerstag 18 Mai 2006 06:06 schrieb Dave Cinege:
This is useful, but possibly better put into practice as a separate method??
I personally don't think it's particularly useful, at least not in the special case that your patch tries to address.

1) Generally, you won't have only one character that does quoting, but several. Think of the Python syntax, where you have ", ', """ and ''', which all behave slightly differently. The logic for " and ' is simple enough to implement (basically that's what your patch does, and I'm sure it's easy enough to extend it to accept a range of characters as splitters), but if you have more complicated quoting operators (such as """), are you sure it's sensible to implement the logic in split()?

2) What should the result of "this is a \"test string".split(None,-1,'"') be? An exception (ParseError)? Silently ignoring the missing delimiter and returning ['this','is','a','test string']? Ignoring the delimiter altogether and returning ['this','is','a','"test','string']? I don't think there's one case to satisfy all here...

3) What about escapes of the delimiter? Your current patch doesn't address them at all (AFAICT) at the moment, but what should the escaping character be? Should "escape processing" take place, i.e. what should the result of "this is a \\\"delimiter \\test".split(None,-1,'"') be?

Don't get me wrong, I personally find this functionality very, very interesting (I'm +0.5 on adding it in some way or another), especially as part of the standard library (not necessarily as an extension to .split()). But there's quite a lot of semantic stuff to get right before you can implement it properly; see the complexity of the csv module, where you have to define pretty much all of this in the dialect you use to parse the csv file...

Why not write up a PEP?

--- Heiko.
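For comparison, today's shlex module already picks one answer to question 2: an unterminated quote is treated as an error. A quick check (modern Python syntax):

```python
import shlex

# shlex.split() raises ValueError when a quoted span is never closed,
# illustrating the "exception" option among the alternatives above.
try:
    shlex.split('this is a "test string')
    error = None
except ValueError as e:
    error = str(e)

print(error)   # No closing quotation
```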
Heiko Wundram
Don't get me wrong, I personally find this functionality very, very interesting (I'm +0.5 on adding it in some way or another), especially as a part of the standard library (not necessarily as an extension to .split()).
It's already there. It's called shlex.split(), and it follows the semantics of a standard UNIX shell, including escaping and other things.
>>> import shlex
>>> shlex.split(r"""Hey I\'m a "bad guy" for you""")
['Hey', "I'm", 'a', 'bad guy', 'for', 'you']
Giovanni Bajo
Am Donnerstag 18 Mai 2006 10:21 schrieb Giovanni Bajo:
Heiko Wundram
wrote: Don't get me wrong, I personally find this functionality very, very interesting (I'm +0.5 on adding it in some way or another), especially as a part of the standard library (not necessarily as an extension to .split()).
It's already there. It's called shlex.split(), and follows the semantic of a standard UNIX shell, including escaping and other things.
I knew about *nix shell escaping, but that isn't necessarily what I find in the input I have to process (although generally it's what you see, yeah). That's why I said it would be interesting to have a generalized method, sort of like the csv module but only for string "interpretation", which takes a dialect and parses a string according to the specified dialect.

Remember, there is also escaping by doubling the end-of-string marker (for example, '""this is not a single argument""'.split() should be parsed as ['"this','is','not','a',....]), and I know programs that use exactly this format for file storage.

Maybe one could simply export the function the csv module uses to parse the actual data fields as a more prominent method, one which accepts keyword arguments instead of a Dialect-derived class.

--- Heiko.
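As a sketch of the kind of interface Heiko describes, today's csv.reader can already be pointed at a single in-memory string, with the delimiter and quotechar keywords standing in for a dialect (the parameter choices here are illustrative, not an exposed field-parsing method):

```python
import csv

# Parse one line with space as the delimiter and double quotes
# protecting embedded spaces.
s = 'Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
row = next(csv.reader([s], delimiter=' ', quotechar='"', skipinitialspace=True))

print(row)
```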
Heiko Wundram
Don't get me wrong, I personally find this functionality very, very interesting (I'm +0.5 on adding it in some way or another), especially as a part of the standard library (not necessarily as an extension to .split()).
It's already there. It's called shlex.split(), and follows the semantic of a standard UNIX shell, including escaping and other things.
I knew about *nix shell escaping, but that isn't necessarily what I find in input I have to process (although generally it's what you see, yeah). That's why I said that it would be interesting to have a generalized method, sort of like the csv module but only for string "interpretation", which takes a dialect, and parses a string for the specified dialect.
Remember, there is also escaping by doubling the end-of-string marker (for example, '""this is not a single argument""'.split() should be parsed as ['"this','is','not','a',....]), and I know programs that use exactly this format for file storage.
I've never run into that one. Anyway, I don't think it's harder than:
>>> def mysplit(s):
...     """Allow double quotes to escape quotes."""
...     return shlex.split(s.replace(r'""', r'\"'))
...
>>> mysplit('""This is not a single argument""')
['"This', 'is', 'not', 'a', 'single', 'argument"']
Maybe, one could simply export the function the csv module uses to parse the actual data fields as a more prominent method, which accepts keyword arguments, instead of a Dialect-derived class.
I think you're over-generalizing a very simple problem. I believe that str.split, shlex.split, and some simple variation like the one above (maybe using regular expressions to do the substitution if you have slightly more complex cases) can handle 99.99% of the splitting cases. They surely handle 100% of those I myself have had to parse.

I believe the standard library already covers common usage. There will surely be cases where a custom lexer/splitter will have to be written, but that's life :)

Giovanni Bajo
Am Donnerstag 18 Mai 2006 12:26 schrieb Giovanni Bajo:
I believe the standard library already covers common usage. There will surely be cases where a custom lexer/splitter will have to be written, but that's life
The csv data field parser handles all common usage I have encountered so far, yes. ;-) But generally you can't (easily) get at the method that parses a data field directly; that's why I proposed publishing that method with keyword arguments. (Actually, I've only tried getting at it when the csv module was still plain Python; I don't even know whether the "method" is exported now that the module is written in C.)

I've had the need to write a custom lexer time and again, and generally I'd love to have a somewhat more general string interpretation facility available to spare me from writing a state machine... But as I said before, the "simple" patch that was proposed here won't do for my case, and I don't know if it's worth the trouble to actually write a more general version, because there are quite a few different pitfalls that have to be overcome...

I remain +0.5 on adding something like this to the stdlib, but only if it's general enough to handle all the cases the csv module can handle.

--- Heiko.
On Thursday 18 May 2006 04:21, Giovanni Bajo wrote:
It's already there. It's called shlex.split(), and follows the semantic of a standard UNIX shell, including escaping and other things.
Not quite. As I said in my other post, simplicity is the idea here, just like the split method itself. (No escaping, etc. ... just recognizing delimiters as an exception to the split separation.) shlex.split() does not let one choose the separator or use a maxsplit, nor is it a pure method on strings. Dave
Dave Cinege wrote:
It's already there. It's called shlex.split(), and follows the semantic of a standard UNIX shell, including escaping and other things.
Not quite. As I said in my other post, simplicity is the idea here, just like the split method itself. (No escaping, etc. ... just recognizing delimiters as an exception to the split separation.)
And what's the actual problem? You either have a syntax which does not support escaping or one that does. If it can't be escaped, there won't be any weird characters in the way, and shlex.split() will do it. If it does support escaping in a decent way, you can either use shlex.split() directly or modify the string beforehand (like I've shown in the other message). In any case, you get your job done. Do you have any real-world case where you are still not able to split a string? And if you do, are there really so many as to warrant a place in the standard library? As I said before, I think that split() and shlex.split() cover the majority of real-world usage cases.
shlex.split() does not let one choose the separator or use a maxsplit
Real-world use case? Show me what you need to parse, and I assume this weird format is generated by a program you have not written yourself (or you could just change it to generate a more standard and simple format!)
, nor is it a pure method on strings.
This is a totally different problem. It doesn't make it less useful, nor does it provide a need for adding a new method to strings. -- Giovanni Bajo
Sorry to all about tmda on my dcinege-mlists email addy. It was not supposed to be enabled; the dash in dcinege-mlists was confusing the latest incarnation of my mail server config. Please use this address to reply to me in this thread. Dave
On Thursday 18 May 2006 03:00, Heiko Wundram wrote:
Am Donnerstag 18 Mai 2006 06:06 schrieb Dave Cinege:
This is useful, but possibly better put into practice as a separate method??
I personally don't think it's particularily useful, at least not in the special case that your patch tries to address.
Well, I'm thinking along the lines of a method to extract only quoted substrings:

>>> ' this is "something" and"nothing else"but junk'.splitout('"')
['something', 'nothing else']

Useful? I dunno....
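The extraction sketched here can already be had with a regex; a minimal version (the capturing group drops the quote characters, so no new string method is needed):

```python
import re

# Pull out only the quoted substrings; the group captures everything
# between each pair of double quotes.
s = ' this is "something" and"nothing else"but junk'
quoted = re.findall(r'"([^"]*)"', s)

print(quoted)   # ['something', 'nothing else']
```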
splitters), but if you have more complicated quoting operators (such as """), are you sure it's sensible to implement the logic in split()?
Probably not. See below...
2) What should the result of "this is a \"test string".split(None,-1,'"') be? An exception (ParseError)?
I'd probably vote for that. However, my current patch will simply play dumb, stop split'ing the rest of the line, and drop the first quote:

>>> 'this is a "test string'.split(None, -1, '"')
['this', 'is', 'a', 'test string']
Silently ignoring the missing delimiter, and returning ['this','is','a','test string']? Ignoring the delimiter altogether, returning ['this','is','a','"test','string']? I don't think there's one case to satisfy all here...
Well, the point of the patch is a KISS approach: extend the split() method just slightly to exclude a range of substrings from split'ing by a delimiter, not to engage in further text processing.

I'm dealing with this ALL the time while processing output from other programs: (Windope) filenames, (poorly considered) wifi network names, etc. For me it's always some element with whitespace in it and double quotes surrounding it, where otherwise I could just use a slice to dump the quotes from the needed element:

>>> 'filename: "/root/tmp.txt"'.split()[1][1:-1]
'/root/tmp.txt'

OK

>>> 'filename: "/root/is a bit slow.txt"'.split()[1][1:-1]
'/root/i'

NOT OK

This exact bug just zapped me in a product I have, where I didn't foresee whitespace turning up in that element.....

Thus my patch:

>>> 'filename: "/root/is a bit slow.txt"'.split(None, -1, '"')[1]
'/root/is a bit slow.txt'

LIFE IS GOOD
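Without the patch, the same value can also be pulled out portably by splitting on the quote character first, which sidesteps the embedded-whitespace bug entirely; a small sketch:

```python
# Index 1 of the quote-split is everything between the first pair of
# double quotes, whitespace and all.
line = 'filename: "/root/is a bit slow.txt"'
name = line.split('"')[1]

print(name)   # /root/is a bit slow.txt
```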
3) What about escapes of the delimiter? Your current patch doesn't address them at all (AFAICT) at the moment,
And it wouldn't, just like the current split doesn't:

>>> 'this is a \ test string'.split()
['this', 'is', 'a', '\\', 'test', 'string']
Don't get me wrong, I personally find this functionality very, very interesting (I'm +0.5 on adding it in some way or another), especially as a part of the standard library (not necessarily as an extension to .split()).
I'd be happy to have this in as .splitquoted(), but once you use it, it seems more to me like a natural 'ought to be there' extension to split itself.
Why not write up a PEP?
Because I have no idea of the procedure. : ) URL? Dave
On 5/17/06, Dave Cinege
Very often....make that very very very very very very very very very often, I find myself processing text in python that when .split()'ing a line, I'd like to exclude the split for a 'quoted' item...quoted because it contains whitespace or the sep char.
For example:
s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
If I want to yank the essid in the above example, it's a pain. But with my new dandy split quoted method, we have a 3rd argument to .split() that we can spec the quote delimiter where no splitting will occur, and the quote char will be dropped:
s.split(None,-1,'"')[5] 'Spaced Out Wifi'
Attached is a proof of concept patch against Python-2.4.1/Objects/stringobject.c that implements this. It is limited to whitespace splitting only. (sep == None)
As implemented the quote delimiter also doubles as an additional separator for the spliting out a substr.
For example: 'There is"no whitespace before these"quotes'.split(None,-1,'"') ['There', 'is', 'no whitespace before these', 'quotes']
This is useful, but possibly better put into practice as a separate method??
Comments please.
What's wrong with:

re.findall(r'"[^"]*"|[^"\s]+', s)

YMMV,
n
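Applied to the ESSID line from the original post, Neal's pattern keeps the surrounding quotes on quoted tokens; stripping them afterwards is a one-liner (the post-processing step is my addition, not part of Neal's suggestion):

```python
import re

# Match either a whole quoted span or a run of non-quote, non-space chars.
s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
tokens = re.findall(r'"[^"]*"|[^"\s]+', s)

# Quoted tokens still carry their quotes; drop them.
clean = [t[1:-1] if t.startswith('"') else t for t in tokens]

print(clean)
```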
Dave Cinege wrote:
Very often....make that very very very very very very very very very often, I find myself processing text in python that when .split()'ing a line, I'd like to exclude the split for a 'quoted' item...quoted because it contains whitespace or the sep char.
For example:
s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
Even if you don't like Neal's more efficient regex-based version, the necessary utility function to do a two-pass split operation really isn't that tricky:

def split_quoted(text, sep=None, quote='"'):
    sections = text.split(quote)
    result = []
    for idx, unquoted_text in enumerate(sections[::2]):
        result.extend(unquoted_text.split(sep))
        quoted = 2*idx + 1
        quoted_text = sections[quoted:quoted+1]
        result.extend(quoted_text)
    return result
>>> split_quoted(' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On')
['Chan:', '11', 'SNR:', '22', 'ESSID:', 'Spaced Out Wifi', 'Enc:', 'On']
Given that this function (or a regex based equivalent) is easy enough to add if you do need it, I don't find the idea of increasing the complexity of the basic split API particularly compelling. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
This is not an appropriate function to add as a string method. There
are too many conventions for quoting and too many details to get
right. One method can't possibly handle them all without an enormous
number of weird options. It's better to figure out how to do this with
regexps or use some of the other approaches that have been suggested.
(Did anyone mention the csv module yet? It deals with this too.)
--Guido
On 5/17/06, Dave Cinege
Very often....make that very very very very very very very very very often, I find myself processing text in python that when .split()'ing a line, I'd like to exclude the split for a 'quoted' item...quoted because it contains whitespace or the sep char.
For example:
s = ' Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On'
If I want to yank the essid in the above example, it's a pain. But with my new dandy split quoted method, we have a 3rd argument to .split() that we can spec the quote delimiter where no splitting will occur, and the quote char will be dropped:
s.split(None,-1,'"')[5] 'Spaced Out Wifi'
Attached is a proof of concept patch against Python-2.4.1/Objects/stringobject.c that implements this. It is limited to whitespace splitting only. (sep == None)
As implemented the quote delimiter also doubles as an additional separator for the spliting out a substr.
For example: 'There is"no whitespace before these"quotes'.split(None,-1,'"') ['There', 'is', 'no whitespace before these', 'quotes']
This is useful, but possibly better put into practice as a separate method??
Comments please.
Dave
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thursday 18 May 2006 11:11, Guido van Rossum wrote:
This is not an appropriate function to add as a string method. There are too many conventions for quoting and too many details to get right. One method can't possibly handle them all without an enormous number of weird options. It's better to figure out how to do this with regexps or use some of the other approaches that have been suggested. (Did anyone mention the csv module yet? It deals with this too.)
Maybe my idea is better called splitexcept instead of splitquoted, as my goal is to (simply) provide a way to limit the split by delimiters, not to dive into an all-encompassing quoting algorithm. To me this is in the spirit of the maxsplit option already present. Dave
participants (11)
-
Dave Cinege
-
Dave Cinege
-
Dave Cinege
-
Dave Cinege
-
Dave Cinege
-
Giovanni Bajo
-
Guido van Rossum
-
Heiko Wundram
-
Heiko Wundram
-
Neal Norwitz
-
Nick Coghlan