Give regex operations more sugar

Hi all,

Regexes are really useful in many places, and to me it's sad to see the builtin "re" module having to require a source string as an argument. It would be much more elegant to write "s.search(pattern)" than "re.search(pattern, s)". I suggest building all regex operations into the str class itself, along with a new syntax for regular expression literals.

A "findall" for any lowercase letter in a string would look like this:

    >>> "1a3c5e7g9i".findall(!%[a-z]%)
    ['a', 'c', 'e', 'g', 'i']

A "findall" for any letter, case insensitive:

    >>> "1A3c5E7g9I".findall(!%[a-z]%i)
    ['A', 'c', 'E', 'g', 'I']

A substitution of any letter with the string " WOOF WOOF ":

    >>> "1a3c5e7g9i".sub(!%[a-z]% WOOF WOOF %)
    '1 WOOF WOOF 3 WOOF WOOF 5 WOOF WOOF 7 WOOF WOOF 9 WOOF WOOF '

A substitution of any letter, case insensitive, with the string "hovercraft":

    >>> "1A3c5E7g9I".sub(!%[a-z]%hovercraft%i)
    '1hovercraft3hovercraft5hovercraft7hovercraft9hovercraft'

You may wonder why I chose the regex delimiters as "!%" ... "%" [ ... "%" ]. The choice of "%" was purely arbitrary; I thought of it because there seems to be a convention to use "%" in PHP regex patterns. The "!" is in front to disambiguate it from the "%" modulo operator and the "%" string formatting operator, and because "!" is currently not used in Python.

Another potential idea is to simply use "!" to denote the start of a regex, and use the character immediately following it to delimit the regex. Thus all of the following would be regexes matching a single lowercase letter:

    !%[a-z]%
    !#[a-z]#
    !?[a-z]?
    !/[a-z]/

And all of the following would be substitution regexes replacing a single case-insensitive letter with "@":

    !%[a-z]%@%i
    !#[a-z]#@#i
    !?[a-z]?@?i
    !/[a-z]/@/i

Some examples of how this would be used:

    >>> "pneumonoultramicroscopicsilicovolcanokoniosis".findall(!%[aeiou]+%)
    ['eu', 'o', 'ou', 'a', 'i', 'o', 'o', 'i', 'i', 'i', 'o', 'o', 'a', 'o', 'o', 'io', 'i']
    >>> "GMzKqtnnyGdqIQNlQSLidbDlqpdhoRbHrrUAgyhMgkZKYVhQuI".search(!%[^A-Z][A-Z]{3}([a-z])[A-Z]{3}[^A-Z]%)
    <regex_match; span=(11, 20); match='qIQNlQSLi'>
    >>> "My name is Joanne.".findall(!%[A-Z][a-z]+%)
    ['My', 'Joanne']

Thoughts?

Sincerely,
Ken
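For comparison, the same four operations as they are spelled today with the stdlib re module (a minimal, runnable sketch):

    import re

    print(re.findall(r'[a-z]', '1a3c5e7g9i'))        # ['a', 'c', 'e', 'g', 'i']
    print(re.findall(r'[a-z]', '1A3c5E7g9I', re.I))  # ['A', 'c', 'E', 'g', 'I']
    print(re.sub(r'[a-z]', ' WOOF WOOF ', '1a3c5e7g9i'))
    # '1 WOOF WOOF 3 WOOF WOOF 5 WOOF WOOF 7 WOOF WOOF 9 WOOF WOOF '
    print(re.sub(r'[a-z]', 'hovercraft', '1A3c5E7g9I', flags=re.I))
    # '1hovercraft3hovercraft5hovercraft7hovercraft9hovercraft'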

On 13/06/18 12:06, Ken Hilton wrote:
My first, most obvious thought is that Python is not Perl, and does not encourage people to reach for regular expressions at every opportunity. That said, I don't see how having a special delimiter syntax for something that could just as well be a string is a help. -- Rhodri James *-* Kynesim Ltd

I think you'll have to be more specific about why it is sad to pass strings to a function. There already is a class that has all the methods you want, although with the roles of regex and string reversed from what you want:

    >>> x = re.compile(r'[a-z]')   # x = !%[a-z]%, or what have you
    >>> type(x)
    <type '_sre.SRE_Pattern'>
    >>> x.findall("1a3c5e7g9i")
    ['a', 'c', 'e', 'g', 'i']

Strictly speaking, a regular expression is just a string that encodes a (non)deterministic finite automaton. A "regex" (the thing that supports all sorts of extensions that make the expression decidedly non-regular) is a string that encodes ... some class of Turing machines. [Aside: a discussion of just what they can match can be found at http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html, which suggests they can match context-free languages, some (but possibly not all) context-sensitive languages, and maybe some languages that context-sensitive grammars cannot. Suffice it to say, they are powerful and complex.] I don't think it is the job of the str class to build, store, and run the resulting machines, solely for the sake of some perceived syntactic benefit. I don't see any merit in adding regex literals beyond making Python look more like Perl. -- Clint

Clint Hepner wrote:
Strictly speaking, a regular expression is just a string that encodes a (non)deterministic finite automaton.
More strictly speaking, regular expressions themselves are agnostic about determinism vs. non-determinism, since for any NFA you can always find an equivalent DFA. -- Greg
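For the record, the equivalence Greg mentions is constructive: the subset construction builds a DFA whose states are sets of NFA states. A toy sketch, with the NFA encoding and all names made up for illustration:

    from itertools import chain

    def nfa_to_dfa(nfa, start, accepting, alphabet):
        """Subset construction. NFA transitions are given as a dict
        mapping (state, symbol) -> set of states; DFA states are
        frozensets of NFA states."""
        dfa_start = frozenset([start])
        dfa, todo, seen = {}, [dfa_start], {dfa_start}
        while todo:
            state = todo.pop()
            for sym in alphabet:
                nxt = frozenset(chain.from_iterable(
                    nfa.get((q, sym), ()) for q in state))
                dfa[(state, sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        dfa_accepting = {s for s in seen if s & accepting}
        return dfa, dfa_start, dfa_accepting

    # NFA for strings over {a, b} ending in "ab":
    nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
    dfa, s0, acc = nfa_to_dfa(nfa, 0, {2}, 'ab')

    def accepts(s):
        state = s0
        for ch in s:
            state = dfa[(state, ch)]
        return state in acc

    print(accepts("aab"), accepts("aba"))  # True False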

On Wed, Jun 13, 2018 at 07:06:09PM +0800, Ken Hilton <kenlhilton@gmail.com> wrote:
    pat_compiled = re.compile(pattern)
    pat_compiled.search(s)
I suggest building all regex operations into the str class itself, as well as a new syntax for regular expressions.
There are many different regular expression implementations (regex, re2). How to make ``s.search(pattern)`` work with all of them?
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

You don't; they work standalone anyway. Besides, nobody is proposing to retire the re module either. But if it's really important, you can make hooks to provide the implementation, like we did with breakpoint(). I really wish, however, that we separate the issue of adding the methods on the str object from the issue of making literals. I know that literals are going to be rejected, "python is not perl", etc. But the str methods are quite an interesting idea.
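By analogy with PEP 553's breakpoint()/sys.breakpointhook, such a hook could be a module-level indirection. A minimal sketch, in which the hook, the setter, and the str_search stand-in are all hypothetical:

    import re

    _regex_engine = re  # default engine: stdlib re; could be regex, re2, ...

    def set_regex_engine(module):
        """Hypothetical hook: swap in another engine globally."""
        global _regex_engine
        _regex_engine = module

    def str_search(s, pattern, flags=0):
        # what a proposed str.search(pattern) could do internally
        return _regex_engine.search(pattern, s, flags)

    print(str_search("1a3c5e7g9i", r"[a-z]"))
    # <re.Match object; span=(1, 2), match='a'>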

On 2018-06-13 06:33, Michel Desmoulin wrote:
I often wished for findall and sub to be string methods, so +1 on that.
Agreed, and there are a few string functions that could be extended (to take a sequence) to handle more cases that push folks to regex, perhaps earlier than they should. Some string functions accept sequences to match on, some don't, if memory serves. -Mike

On 13/06/2018 at 19:11, Mike Miller wrote:
str.replace comes to mind. It's annoying to have to chain it 5 times when we could optionally pass a tuple. Several startswith() and endswith() calls require a loop, but we could make them accept *args. Also, we don't have to saturate the str namespace with all the re functions. We could decide to go for `str.re.stuff`.

On 2018-06-13 22:29, Steven D'Aprano wrote:
str.re can be a descriptor object which "knows" which string instance it is bound to. This kind of thing is common in many libraries. Pandas for example has all kinds of things like df.loc[1:3], df.column.str.startswith('blah'), etc. The "loc" and "str" attributes give objects which are bound (in the sense that bound methods are bound) to the objects on which they are accessed, so when you use these attributes to do things, the effect takes account of the "root" object on which you accessed the attribute. Personally I think this is a great way to reduce namespace clutter and group related functionality without having to worry about using up all the short or "good" names at the top level. I'm not sure I agree with the specific proposal here for allowing regex operations on strings, but if we do do it, this would be a good way to do it. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
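A minimal sketch of the descriptor mechanism being described here; StrRegex, BoundRegex and MyStr are made-up names, and the str subclass stands in for the builtin, which can't be patched:

    import re

    class BoundRegex:
        """Holds the string it was accessed on, so its methods
        don't need the string passed back in."""
        def __init__(self, string):
            self._string = string

        def findall(self, pattern, flags=0):
            return re.findall(pattern, self._string, flags)

    class StrRegex:
        """A descriptor: attribute access on an instance returns
        an object bound to that instance, like a bound method."""
        def __get__(self, instance, owner=None):
            return BoundRegex(instance)

    class MyStr(str):      # stand-in; builtin str can't be patched
        re = StrRegex()

    print(MyStr("1a3c5e7g9i").re.findall(r"[a-z]"))
    # ['a', 'c', 'e', 'g', 'i']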

On Thu, Jun 14, 2018 at 4:12 PM, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
How is this materially different from: "some string".re_match(...) ? It's not a grouped namespace in any technical sense, but to any human, a set of methods that start with a clear prefix is functionally a group. That said, though, I don't think any of them need to be methods. The 're' module is there to be imported. ChrisA

On 2018-06-13 23:37, Chris Angelico wrote:
Do you really mean that? :-) As far as I can see, by the same argument, there is no need for modules. Instead of math.sin and math.cos, we can just have math_sin and math_cos. Instead of os.path.join we can just have os_path_join. And so on. Just one big namespace for everything. But as we all know, namespaces are one honking great idea! Now, of course there are other advantages to modules (such as being able to save the time of loading things you don't need), and likewise there are other advantages to this descriptor mechanism in some cases. (For instance, sometimes the sub-object may want to hold state if it is going to be passed around and used later, rather than just having a method called and being thrown away immediately.) But I think it's clear that in both cases the namespacing is also nice. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

On Thu, Jun 14, 2018 at 12:12:34AM -0700, Brendan Barnwell wrote:
I'm not Chris, but I'll try to give an answer... Visually, there shouldn't be any difference between using . as a namespace separator and using _ instead. Whether we type math.sin or math_sin makes little difference beyond familiarity. But it does make a difference in whether we can treat math as a distinct object without the .sin part, and whether we can treat namespaces as real values or not. So math.sin is little different from math_sin, but the fact that math alone is a module, a first-class object, and not just a prefix of the name, makes a big difference. As you say:
Now, of course there are other advantages to modules (such as being able to save the time of loading things you don't need),
Loading on demand is one such advantage. Organising source code is another. Being able to pass the math object around as a first-class value, to call getattr() and setattr() or vars() or use introspection on it. You can't do that if it's just a name prefix.
We can get that from making the regex method a method directly on the string object. The question I have is: what benefit does the str.re intermediate object bring? Does it carry its own weight? In his refactoring books, Martin Fowler makes it clear that objects ought to carry their own weight. When an object grows too big, you ought to split out functionality and state into intermediate objects. But if those intermediate objects do too little, the extra complexity they bring isn't justified by their usefulness.

    class Count:
        def __init__(self, start=0):
            self.counter = start
        def __iadd__(self, value):
            self.counter += value
            return self    # __iadd__ must return the updated object

Would you use that class, or say it simply adds a needless level of indirection? If the re namespace doesn't do something to justify itself beyond simply adding a namespace, then Chris is right: we might as well just use re_ as a prefix and use a de facto namespace, and save the extra mental complexity and the additional indirection by dropping this intermediate descriptor object. -- Steve

On Thu, Jun 14, 2018 at 6:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yep. That's pretty much what I meant. There are many different types of namespace in Python. Some are actual first-class objects (modules, classes, etc). Others are not, but (to a programmer) are very similar (classes that end "Error", the various constants in the stat module, etc). Sometimes it's useful to query a collection - you can say "show me all the methods and attributes of float" or "give me all the builtins that end with Error" - and as groups or collections, both types of namespace are reasonably functional. But there is a very real *thing* that collects up all the float methods, and that is the type <float>. That's a thing, and it has an identity. What is the thing that gathers together all Errors (as opposed to, say, all subclasses of Exception, which can be queried from the Exception type)? Sometimes the line is blurry. What's the true identity of the math module, other than "the collection of all things mathy"? It'd be plausible to have a "trig" module that has sin/cos/tan etc, and it'd also be plausible to say "from math import Fraction". But when there is no strong identity to the actual thing, and there's a technical and technological reason to avoid giving it an arbitrary identity (what is "spam".re and just how magical is it?), there's basically no reason to do it. Python gives us multiple tools, and there are good reasons to use all of them. In this case, yes, I most definitely *am* saying that <"spam".re_> is a valid human-readable namespace, but one which has no intrinsic identity. ChrisA

Steven D'Aprano wrote:
This is important because it provides ways of referring to things in the module without having to write out the whole module name every time, e.g. import math as m y = m.sin(x) Would it be useful to pull out mystring.re and use it this way? I don't know. Maybe sometimes, the same way that extracting bound methods is sometimes useful. -- Greg

On Thu, Jun 14, 2018 at 12:12:34AM -0700, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
as we all know, namespaces are one honking great idea!
Flat is better than nested, so additional string.re subnamespace is not needed.
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Wed, Jun 13, 2018 at 11:12:43PM -0700, Brendan Barnwell wrote:
Obviously, but then it's not "just a namespace". This idea might be common in libraries like pandas, but I don't like it. Common is not necessarily good. Unless str.re is something meaningful on its own, what purpose does it hold? If str.re doesn't carry its own weight as a meaningful object, then it shouldn't exist. Particularly since we're only talking about a handful of new methods. Looking at re, we have these public functions:

- match
- fullmatch
- search

match and fullmatch are redundant; they're the same as calling search with a pattern that matches "start of string" and "end of string". str.find() could easily take a pattern object instead of needing a separate search method, particularly if we have dedicated syntax for regexes.

- sub
- subn

sub is redundant since it does the same as subn; or the str.replace() method could take a pattern object.

- split

Likewise str.split() could take a pattern object.

- findall
- finditer

findall is just list(finditer); search, match and fullmatch are just next(finditer).

The re module API is full of redundancies. That's okay; I'm not proposing we "fix" that. But we don't have to duplicate that in string objects. Rather than add eight new methods, we could allow the existing string methods to take pattern objects as arguments. That gives us potentially: count, endswith, find, index, lstrip, partition, replace, rfind, rindex, rpartition, rsplit, rstrip, split, startswith, strip (15 methods) that support regex pattern objects, pretty much covering all the functionality of match, fullmatch, search, split, sub, subn and then some. re.findall is redundant. That leaves (potentially) only a single re function to turn into a string method: finditer.

How do you get the pattern object? We have three possible tactics:

- import re and call re.compile;
- add a compile method to str;
- add special regex syntax, let's say /pattern/ for the sake of the argument.

With pattern literals, we can do this with a single new string method, finditer. (Or whatever name we choose.) Without pattern literals, it won't be so convenient, but we could do it with just one more method: compile. Or we could simply require people to import re to compile their patterns, which would be even less convenient, but it would work. (But maybe that's a good thing, to encourage people to think before reaching for a regular expression, not to encourage them to see every problem as a nail and regexes as the hammer.) -- Steve
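A sketch of what "existing string methods accept compiled patterns" could look like, using a str subclass since the builtin type can't be extended; PatStr and every choice below is illustrative, not a concrete proposal:

    import re

    Pattern = type(re.compile(""))   # compiled-pattern type, version-portable

    class PatStr(str):
        """Illustrative only: a few str methods that also accept a
        compiled regex pattern in place of a plain substring."""
        def find(self, sub, *args):
            if isinstance(sub, Pattern):
                m = sub.search(self, *args)
                return m.start() if m else -1
            return str.find(self, sub, *args)

        def replace(self, old, new, count=-1):
            if isinstance(old, Pattern):
                # str.replace uses -1 for "all"; Pattern.sub uses 0
                return old.sub(new, self, count=max(count, 0))
            return str.replace(self, old, new, count)

        def split(self, sep=None, maxsplit=-1):
            if isinstance(sep, Pattern):
                return sep.split(self, maxsplit=max(maxsplit, 0))
            return str.split(self, sep, maxsplit)

    s = PatStr("Lunch: cheese & coffee, salad & cream")
    print(s.split(re.compile(r"\s*[&,]\s*")))
    # ['Lunch: cheese', 'coffee', 'salad', 'cream']
    print(s.find(re.compile(r"s\w+d")))   # start index of 'salad'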

On 2018-06-14 00:10, Steven D'Aprano wrote:
Unless a special regex syntax is added, I don't see that there's much benefit to allowing a compiled object as the argument. (And I don't support adding special regex syntax!) The point is to be able to easily type regular expressions. If using a pattern argument still requires you to import re and call functions in there, it's not worth it. In order for there to be any gain in convenience, you need to be able to pass the actual regex directly to the string method. But there is another way to do this beyond the ones you listed: give .find() (or whatever methods we decide should support regexes) an extra boolean "regex" argument that specifies whether to interpret the target string as a literal string or a regex. I'm not sure why I'm arguing this point, though. :-) Because I actually agree with you (and others on this thread) that there is no real need to make regexes more convenient. I think importing the re module and using the functions therein is fine. If anything, I think the name "re" is too short and cryptic and should be made longer! -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
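What that boolean-flag variant could look like as a free function; the signature is hypothetical, sketched only to make the trade-off concrete:

    import re

    def find(s, target, start=0, end=None, *, regex=False):
        """Hypothetical: str.find with a 'regex' switch.
        regex=False -> ordinary substring search;
        regex=True  -> 'target' is interpreted as a pattern."""
        end = len(s) if end is None else end
        if regex:
            m = re.compile(target).search(s, start, end)
            return m.start() if m else -1
        return s.find(target, start, end)

    print(find("1a3c5e7g9i", "3c"))                  # 2
    print(find("1a3c5e7g9i", r"[a-z]", regex=True))  # 1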

On Thu, Jun 14, 2018 at 12:22:45AM -0700, Brendan Barnwell wrote:
Unless a special regex syntax is added, I don't see that there's much benefit to allowing a compiled object as the argument.
Fair enough -- I'm not convinced that this proposal is either desirable or necessary either, I'm just suggesting what we *could* do if we choose to. But I'll admit that I'm biased: I find all but the simplest regexes virtually unreadable and I'm very antipathetic to anything which encourages the use of regexes. (Even though I intellectually know that they're just a tool and we shouldn't blame regexes for the abuses some people put them to.) [...]
Guido has a guideline (one I agree with): no constant bool arguments. If you have a method or function that takes a flag that swaps between two modes, and in practice the flag is only ever (or almost only ever) going to be given as a literal, then it is better to split the function into two distinctly named functions and forego the flag. *Especially* if the flag simply swaps between two distinct implementations with little or nothing in common.
Heh, even I don't go that far :-) -- Steve
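Recast under that guideline, the flag sketch above becomes two distinctly named functions (again hypothetical):

    import re

    def find_text(s, target, start=0, end=None):
        end = len(s) if end is None else end
        return s.find(target, start, end)

    def find_pattern(s, pattern, start=0, end=None):
        end = len(s) if end is None else end
        m = re.compile(pattern).search(s, start, end)
        return m.start() if m else -1

    print(find_text("1a3c5e7g9i", "3c"))         # 2
    print(find_pattern("1a3c5e7g9i", r"[a-z]"))  # 1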

On Thu, Jun 14, 2018 at 12:22:45AM -0700, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
If anything, I think the name "re" is too short and cryptic and should be made longer!
import re as regular_expressions_operations
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Thu, Jun 14, 2018 at 2:12 AM, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
It's a clever idea, but it's a completely new (at least to standard Python) way to call a function that acts on a given argument. That means more to learn. We already have foo.bar(...) and bar(foo): "Hello!".count("o") len("Hello!") Nesting is hiding. Hiding can be good or bad. Adding `foo.b.ar()` will make it harder to discover. It's also magical: To understand what `foo.b.ar()` does, you can't think of `foo.b` as a (semantic) property of the object, or a method of the object, but as a descriptor trick which holds more methods of that object. I mainly use Python on a REPL. When I'm on IPython, I can ask what properties and methods an object has. When I'm on the basic Python REPL, I use `dir`, or a function which filters and prints `dir` in a nicer way. Nested method namespaces will be harder to navigate through. I would not be able to programmatically tell whether a property is just a property. I'd need to manually inspect each oddly-named property, to make sure it's not hiding more methods of the object (and that would only work if the docstrings are maintained and clear enough for me). I don't see any advantage of using `foo.b.ar()` over `foo.b_ar()`. In either case, you'd need to spell out the whole name each time (unlike with import statements), unless you save the bound method, which you can do in both cases. P.S.: Is there any way of guessing what proportion of Python programs use `re`, either explicitly or implicitly? How many programs will, at some point in their runtime, load the `re` module?

From a lurker's perspective: why not just implement str.compile() as a new method, and have the methods where it's relevant support its result as an argument? That's a small, additive change, and in the normal case the other methods just do the same as now. It's also pretty clear what something like "whatever".replace("regex".compile(), "otherstring") should do in that case.

On 14/06/2018 at 07:29, Steven D'Aprano wrote:
There are a lot of ways to do that. One possible way:

    import re

    class re_proxy:
        def __init__(self, string):
            self.string = string

        def match(self, pattern, flags=0):
            return re.match(pattern, self.string, flags)

        ...

    # and on the str class (hypothetically; the builtin can't be
    # patched, so think of this as living in a str subclass):
    @property
    def re(self):
        return re_proxy(self)

On Thu, Jun 14, 2018 at 6:43 AM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
That would be handy. Either pass two sequences of equal length (replace each with the corresponding one), or one sequence and one string (replace any with that). (And yes, I know that a string IS a sequence.) This would want to be semantically different from chained calls, in that a single replace([x,y,z], q) would avoid re-replacing; but for many situations, it'll be functionally identical.
Several startswith() and endswith() calls require a loop, but we could make them accept *args.
Not without breaking other code: they already accept two optional parameters. It'd have to be accepting a tuple of strings. Which... they already do. :)

    startswith(...) method of builtins.str instance
        S.startswith(prefix[, start[, end]]) -> bool

        Return True if S starts with the specified prefix, False otherwise.
        With optional start, test S beginning at that position.
        With optional end, stop comparing S at that position.
        prefix can also be a tuple of strings to try.

ChrisA

Chris Angelico wrote:
This would want to be semantically different from chained calls, in that a single replace([x,y,z], q) would avoid re-replacing;
+1, this would be REALLY handy! It's easy to trip yourself up with chained replacements if you're not careful -- like I did once when escaping things using &xxx; sequences in XML. If you don't do it in the right order, you end up escaping some of the &s you just inserted. :-( -- Greg

On Wed, Jun 13, 2018 at 8:15 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
In a thread earlier this year, I suggested allowing a dict: https://mail.python.org/pipermail/python-ideas/2018-February/048875.html For example:

    txt.replace({
        '&': '&amp;',
        '"': '&quot;',
        "'": '&#39;',
        ...
    })

Tuples of strings can be dict keys, so it's also possible to allow several options to be replaced with a single thing. One use I had for multi-replace was to parse a file that was almost CSV, but not close enough to be parsed by `import csv`. I had to be careful to get the order right so that old replacements wouldn't cause newer ones.
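A single-pass multi-replace in this spirit is easy to sketch with today's re module; multi_replace is an illustrative helper, not an existing API:

    import re

    def multi_replace(s, table):
        """Replace every key of `table` found in `s` in one pass,
        so replacement text is never itself re-scanned. Longer keys
        are tried first, to resolve overlapping targets."""
        keys = sorted(table, key=len, reverse=True)
        pattern = re.compile("|".join(map(re.escape, keys)))
        return pattern.sub(lambda m: table[m.group(0)], s)

    print(multi_replace("a & b 'c'", {"&": "&amp;", "'": "&#39;"}))
    # a &amp; b &#39;c&#39;

    # The ordering hazard disappears: the '&' inside a replacement
    # is never revisited, unlike chained str.replace() calls.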

On Wed, Jun 13, 2018, 4:44 PM Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
Several startswith() and endswith() calls require a loop, but we could make them accept *args.
You mean something like:

    "Lorem ipsum".startswith(('Lo', 'Hi', 'Foo'))

You might want to check the time machine.

Michel Desmoulin wrote:
Also, we do have to saturate the str namespace with all the re functions. We could decide to go for `str.re.stuff`.
However, note that this is not as simple as just adding a class attribute to str that references the re module, since presumably you want to be able to write

    mystring.re.match(pattern)

instead of having to write

    mystring.re.match(pattern, mystring)

or

    str.re.match(pattern, mystring)

which would mostly defeat the purpose. -- Greg

On Wed, Jun 13, 2018 at 10:43:43PM +0200, Michel Desmoulin wrote:
str.replace comes to mind. It's annoying to have to chain it 5 times when we could optionally pass a tuple.
It's not so simple. Multiple replacements underspecify the behaviour. The simplest behaviour is to have

    astring.replace((spam, eggs, cheese), new)

be simply syntactic sugar for:

    astring.replace(spam, new).replace(eggs, new).replace(cheese, new)

which is nice and simple to explain and nice and simple to implement (it's just a loop calling the method for each argument in the tuple), but it's probably not the most useful solution:

    # replace any of "salad", "cheese" or "ham" with "cheesecake".
    s = "Lunch courses are cheese & coffee, salad & cream, or ham & peas"
    s.replace("salad", "cheesecake").replace("cheese", "cheesecake").replace("ham", "cheesecake")
    => 'Lunch courses are cheesecake & coffee, cheesecakecake & cream, or cheesecake & peas'

which is highly unlikely to be what anyone wants. But it isn't clear what people *will* want, so we need to decide what replace with multiple targets actually means. Here are some suggestions:

- the order of targets ought to be irrelevant: replace((a, b) ...) and replace((b, a) ...) ought to mean the same thing;
- should targets match longest first or shortest first? or a flag to choose which you want?
- what if you have multiple targets and you need to give some longer ones priority, and some shorter ones?
- there ought to be a single pass through the string, not multiple passes -- this is not just syntactic sugar for calling replace in a loop!
- the replacement string should be skipped and not scanned.

-- Steve

On Thu, Jun 14, 2018 at 06:33:14PM +1200, Greg Ewing wrote:
"Explicit is better than implicit" -- the problem with having the order be meaningful is that it opens us up to silent errors when we neglect to consider the order. replace((spam, eggs, cheese) ...) *seems* like it simply means "replace any of spam, eggs or cheese" and it is easy to forget that that the order of replacement is *sometimes* meaningful. But not always. So this is a bug magnet in waiting. So I'd rather have to explicitly specify the order with a parameter rather than implicitly according to how I happen to have built the tuple. # remove duplicates targets = tuple(set(targets)) newstring = mystring.replace(targets, replacement) That's buggy, but it doesn't look buggy, and you could test it until the cows come home and never notice the bug. -- Steve
