[Python-ideas] Experiment: Adding "re" to string objects.

Mon Jul 20 23:47:05 CEST 2009

I've combined the various replies into a single "digest" and responded to
them below.  Thanks so much, everyone, for the discussion.

MRAB wrote:
>Why not drop the ".re" part? You would, however, then need a new name
>for the re split, eg "re_split".

Because "str.re.match()" goes together with other attributes that only make
sense after str.re.match() or str.re.search() has been called, I thought it
best to consolidate them into a single area, to make it more obvious that
they are related.

But I'm not tied to that idea; it just seemed to make sense.

MRAB wrote:
>Or you could make the string the pattern, eg r'whatever(.*)'.match(s).

The problem with that is that you already have the "s", but you may not
have a regex object.  So the above may lead to:

   regex = r'whatever(.*)'
   if regex.match(s):
      regex.group(1)

As compared with:

   if s.re.match(r'whatever(.*)'):
      s.re.group(1)

In which case we're probably talking about going the direction that Nick
mentioned below, with a helper class.

Nick Coghlan wrote:
>The idea of adding a mutable attribute to strings (even if it plays no
>part in string equality) sends shivers down my spine.

I'm having a hard time seeing this as a mutable attribute.  Now, this would
be unprecedented in that the value returned by the re.group() type calls
would vary depending on the re.match() type calls; nothing else has similar
sorts of side effects on string objects.  Even so, it just doesn't "feel
mutable" to me.

>It also seems like
>it would be difficult to do this in a way that didn't increase the size
>of all strings by at least one pointer slot.

I gather that's a big deal?  I honestly don't know; to me it probably
isn't, but I'll admit there are many cases I don't really care about.  :-)

Nick Coghlan wrote:
>However, the idea of adding more convenience classes to the re module
>may still have some merit. In particular, when checking against multiple
>regexes, being able to do something like the following might be helpful:
>
>m = re.Matcher(s)
>if m.match(r'whatever(.*)'):
>      print m.group(1)
>elif m.match(r'something (.*) else(.*)'):
>      print m.group(2)

I think that's just another way of implementing the "restr()" class I made
in the "filtertools" module.  Possibly the biggest difference is that a
"restr()" can be used as a string, instead of having to do something like:

   m = re.Matcher(s)
   if not m.match(r'whatever(.*)'):
      print "Didn't match line: '%s'" % m.origstr
(or:)
      print "Didn't match line: '%s'" % s

Christian Heimes wrote:
>* regular expressions are rarely used in Python. I have just a couple of
>scripts that use re

Oh really?

   guin:~$ cd ~/cvs/python-trunk/Lib
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import re\>' '{}' + | wc -l
   137
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import os\>' '{}' + | wc -l
   447
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import sys\>' '{}' + | wc -l
   643
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import time\>' '{}' + | wc -l
   142
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import socket\>' '{}' + | wc -l
   54
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import datetime\>' '{}' + | wc -l
   13
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import math\>' '{}' + | wc -l
   23
   guin:Lib$ find . -type f -name \*.py -exec egrep 'import pickle\>' '{}' + | wc -l
   31
   guin:Lib$ 

Admittedly, this is a very crude metric, but I'm not sure it's fair to say
that regular expressions are rarely used.  YOU may rarely use them, but I
probably use them in one out of 3 Python programs I write.  I do a lot of
stuff where I process the output of other commands or log or text files
though...

Christian Heimes wrote:
>* we shouldn't encourage people in using re when there is a simpler
>solution. Python isn't Perl.

I know people like to say "Python isn't Perl", but one of the reasons Perl
and other languages make dealing with regexes so easy is that it's
something that is useful in a lot of cases.  Python isn't LISP or other
languages either, but we still take the good stuff from them.

With all due respect, "Python isn't Perl" seems designed more to evoke an
emotional response than to be based on hard reasoning.

Christian Heimes wrote:
>* Several modules and a C extension must be loaded during *every*
>interpreter startup. Everybody must pay the speed penalty and the memory
>usage of the interpreter increases with every module, too. Lazy loading
>may be a workaround, though

I had expected this would be loaded lazily.  Possibly even to the extent
that it can address Nick's concern about using another pointer slot in
strings.  I think it can be resolved, but I don't have the solution at
hand.

Note that "str.re" is *NOT* just an import of "re" into the string module.
The implementation of restr() in filtertools uses an inner class that the
re attribute is an instance of, and could load the re module only when
used.
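
To illustrate the lazy-loading point, here is a minimal sketch in Python 3
syntax.  It is not the filtertools code: the names _ReProxy and _re_proxy
are hypothetical, and it only assumes what is described above (a str
subclass whose re attribute is an instance of a helper class that imports
re on first use).

```python
class _ReProxy:
    """Hypothetical helper; the re module is imported only inside match()."""

    def __init__(self, text):
        self._text = text
        self._last = None  # result of the most recent match() call

    def match(self, pattern, flags=0):
        import re  # deferred import: nothing is loaded until a match is requested
        self._last = re.match(pattern, self._text, flags)
        return self._last

    def group(self, *args):
        if self._last is None:
            raise ValueError("the last match() did not succeed")
        return self._last.group(*args)


class restr(str):
    """Sketch of a str subclass that exposes regex helpers under .re."""

    @property
    def re(self):
        # Create the helper on first access and keep it on the instance,
        # so match state persists between s.re.match() and s.re.group().
        proxy = self.__dict__.get("_re_proxy")
        if proxy is None:
            proxy = self.__dict__["_re_proxy"] = _ReProxy(str(self))
        return proxy


line = restr("whatever comes next")
if line.re.match(r"whatever(.*)"):
    rest = line.re.group(1)
```

This only shows the deferred import in a subclass; it says nothing about
whether the built-in str type could do the same without an extra pointer
slot.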

I would agree that if it caused re to be loaded at interpreter startup just
so strings could be used, it would make this idea unusable.

Christian Heimes wrote:
>* Beautiful is better than ugly.
>* Explicit is better than implicit.
>* Simple is better than complex.
>* Readability counts.

I'll presume you mean this tongue-in-cheek, because most of these arguments
could apply to this proposal, making some handling of regexes more
beautiful, simple, and readable.  Explicit?  Provide more details of what
you are thinking there...

Steven D'Aprano wrote:
>Apologies for the Metoo, but I'm with Nick and Christian on this. It 
>sounds like a terrible idea to me, just to avoid a temporary name in 
>the standard idiom:
>
>m = re.match(r'whatever(.*)', s)
>if m:
>    m.group(1)

It's not so much about adding a temporary name, it's about the above being
an ugly construct.  Particularly in more complex cases:

   m = re.match(r'whatever(.*)', s)
   if m:
       m.group(1)
   m = re.match(r'something else(.*)', s)
   if m:
       m.group(1)

instead of:

   if s.re.match(r'whatever(.*)') or s.re.match(r'something else(.*)'):
       s.re.group(1)

Georg Brandl wrote:
>That one looks very useful, and will prevent more proposals of new syntax
>along the lines of "if rex.match(...) as m" :)

Well, "if rex.match() as m" is more general, but if it's really the primary
reason for it, that's a good reason for it.  :-)

Antoine Pitrou wrote:
>-0.5. Right now, objects in the re module are constructed from a regular
>expression pattern -- one of the reasons being that these patterns are compiled
>to bytecode form, and the objects help retain the bytecode.

Sure, but in my testing the compiled regex cache pretty much eradicated any
performance reason for compiling regexes explicitly.
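
One way to check this on a given interpreter is a quick timing comparison
like the sketch below (Python 3 syntax).  The pattern, the sample line and
the iteration count are arbitrary choices; the numbers depend on the Python
version and machine, so no particular result is claimed here.

```python
import re
import timeit

PATTERN = r"whatever(.*)"
LINE = "whatever comes next"
COMPILED = re.compile(PATTERN)


def with_module_function():
    # re.match() with a pattern string relies on the re module's internal
    # cache of compiled patterns after the first call.
    return re.match(PATTERN, LINE).group(1)


def with_compiled_pattern():
    # Calls the method on an explicitly compiled pattern object.
    return COMPILED.match(LINE).group(1)


t_module = timeit.timeit(with_module_function, number=100_000)
t_compiled = timeit.timeit(with_compiled_pattern, number=100_000)
print(f"re.match(pattern, s): {t_module:.3f}s")
print(f"compiled.match(s):    {t_compiled:.3f}s")
```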

Antoine Pitrou wrote:
>Having another
>object type constructed from the string-to-match is confusing.

I think you mean "the string-to-match against"?  I'm not sure that it's
really more confusing that way, though.  It makes sense considering that you
can't assign and compare in a single expression in Python, so there's some
argument for it being "more pythonic".

Antoine Pitrou wrote:
>Besides, keeping
>some kind of internal state about the last matched pattern, only for
>"convenience" purposes, isn't pretty either.

I guess that depends on whether you use convenience in the pejorative sense
or not.  :-)  This isn't the Most Manly Programming Contest (tm); making
programming more convenient is a GOOD thing.  It's kind of like the
caching of compiled regexes -- that's a convenience thing.

Jan Kaliszewski wrote:
>Maybe it should be limited to using compiled regexps, not strings?

I don't think so, though I could see it being able to take compiled
regexes as well as strings.  If you HAVE to compile the regexes then you're
already making a temporary object and I'm not sure how you would gain
anything from this pattern.
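
Accepting both forms is cheap to support.  The sketch below (a recent
Python 3 where re.Pattern is available) uses a hypothetical helper name,
match_either, to show the idea; it is an illustration, not part of the
proposal's code.

```python
import re


def match_either(pattern, text):
    """Match text against a pattern string or an already compiled pattern."""
    if isinstance(pattern, re.Pattern):
        # Already compiled: use the pattern object's own method.
        return pattern.match(text)
    # Plain string: let the re module compile (and cache) it.
    return re.match(pattern, text)


from_string = match_either(r"whatever(.*)", "whatever x")
from_compiled = match_either(re.compile(r"whatever(.*)"), "whatever x")
```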

Thanks,
Sean
-- 
 Do bad programmers wake up on Christmas morning to find coal in
 their sockets?  -- Sean Reifschneider
Sean Reifschneider, Member of Technical Staff <jafo at tummy.com>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability