[Tutor] don't understand iteration

Steven D'Aprano steve at pearwood.info
Thu Nov 13 22:55:57 CET 2014


On Mon, Nov 10, 2014 at 03:08:23PM -0800, Clayton Kirkwood wrote:

> Also of confusion, the library reference says:
> 
> Match objects always have a boolean value of True. Since match() and
> search() return None when there is no match, you can test whether there was
> a match with a simple if statement:
> 
> match = re.search(pattern, string)
> if match:
>     process(match)
> 
> blah = re.search(
>    r'<\w\w>(\w{3})\.\s+(\d{2}),\s+(\d{2}).+([AP]M)\s+(E[SD]T)', line)
> >>> blah
> <_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>
> >>> if blah: print(blah)
> <_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>
> >>> if blah == True: print(blah)>
> No print out
> 
> To me, this doesn't *appear* to be quite true.

I think you are misreading a plain English expression, namely to "have 
a boolean value", as "is a boolean value". If I said:

    The binary string '0b1001' has a numeric value of 9.

I don't think anyone would interpret that sentence as meaning that 
Python treats the string as equal to the int:

    '0b1001' == 9  # returns False

but rather that converting the string to an int returns 9:

    int('0b1001', 0) == 9  # returns True


Somebody unfamiliar with Python might (wrongly) believe that Python 
requires an explicit bool() conversion, just as Python requires an 
explicit int() conversion, but to avoid that misapprehension, the docs 
show an example of the correct idiomatic code to use. You tried it 
yourself and saw that it works:

    if blah: print(blah)

prints blah, exactly as the docs suggest. As you can see from the 
printed string value of blah, it is a Match object, and it behaves like 
True in conditionals (such as an if-statement).

On the other hand, this piece of code does something completely 
different:

    s = "<_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>"
    if blah == s: print(blah)

First it checks whether blah equals the given string, then it tests the 
condition. Not surprisingly, that doesn't print anything. Match objects 
are not strings, and although they do have a printable string 
representation, they are not equal to that representation.

Nor are they equal to True:

    if blah == True: print(blah)  # also fails to print anything

The comparison "blah == True" returns False, as it should, and the if 
clause does not run.

Match objects might not be equal to True, but they are true, in 
the same way that my car is not equal to red, but it is red. (You'll 
have to take my word for it that I actually do own a red car.)
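
The difference between "is true" and "equals True" is easy to check at 
the interpreter. This is only a minimal sketch: the pattern and the 
sample strings are made up for illustration and are not taken from the 
original question.

```python
import re

# Hypothetical pattern and text, chosen only for illustration.
mo = re.search(r'\d+', 'order 66')

truthy = bool(mo)             # what "if mo:" effectively tests
equal_to_true = (mo == True)  # an equality comparison, not a truth test

# When nothing matches, search() returns None, which is false.
no_match = re.search(r'\d+', 'no digits here')

print(truthy)         # True
print(equal_to_true)  # False
print(no_match)       # None
```

So "if mo:" is the test to write; comparing against True with == asks a 
different question.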

[...]
> I would expect len(sizeof, whatever)(blah) to return the number of (in this
> case) matches, so 5. Doing a search suggests what is important: the number
> of matches. Why else would you do a search, normally.

The number of groups in a match is comparatively unimportant. The 
*content* of the matched groups is important. Consider this regular 
expression:

    regex = r'(\w*?)\s*=\s*\$(\d*)'

That has two groups. It *always* has two groups, regardless of what it 
matches:

py> re.match(regex, "profit = $10").groups()
('profit', '10')
py> re.match(regex, "loss = $3456").groups()
('loss', '3456')

I can imagine writing code that looks like:

    key, amount = mo.groups()
    if key == 'profit':
        handle_profit(amount)
    elif key == 'loss':
        handle_loss(amount)
    else:
        raise ValueError('unexpected key "%s"' % key)


but I wouldn't expect to write code like this:

    t = mo.groups()
    if len(t) == 2:
        handle_two_groups(t)
    else:
        raise ValueError('and a miracle occurs')


It truly would be a miracle, or perhaps a bug is more accurate, if the 
regular expression r'(\w*?)\s*=\s*\$(\d*)' ever returned a match object 
with fewer than, or more than, two groups. That would be like:

    mylist = [1, 2]
    if len(mylist) != 2:
        raise ValueError


The only time you don't know how many groups are in a Match object is if 
the regular expression itself was generated programmatically, and that's 
very unusual.
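
For that unusual case, a compiled pattern can report its own group 
count through its "groups" attribute, so there is no need to run a 
match just to count. A minimal sketch follows; the build_regex helper 
and the sample data are hypothetical, invented only to stand in for 
"a regex generated programmatically".

```python
import re

def build_regex(fields):
    """Hypothetical helper: one capturing group per field, comma-separated."""
    return ','.join(r'(\w+)' for _ in fields)

pattern = re.compile(build_regex(['name', 'colour', 'size']))

# Number of capturing groups in the pattern itself.
group_count = pattern.groups

mo = pattern.match('car,red,small')
print(group_count)       # 3
print(len(mo.groups()))  # 3
```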


> That could then be used in the range()
> It would be nice to have the number of arguments.
> I would expect len(blah.group()) to be 5, because that is the relevant
> number of elements returned from group. And that is the basic thing that
> group is about; the groups, what they are and how many there are. I
> certainly wouldn't want len(group) to return the number of characters, in
> this case, 28 (which it does:>{{{
> 
> 
> >>> blah.group()
> '<BR>Nov. 09, 07:15:46 PM EST'

Calling group() on a Match object with no arguments is the same as 
passing the default argument 0, which returns the entire matched string. 
For many purposes that is all you need; you may not care about the 
individual groups in the regex.
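
As a rough illustration of how the argument to group() selects what 
comes back, here is a sketch that reuses the profit/loss regex from 
earlier in this message. The variable names are only for illustration.

```python
import re

regex = r'(\w*?)\s*=\s*\$(\d*)'
mo = re.match(regex, "profit = $10")

whole = mo.group()      # no argument: the entire matched string
also_whole = mo.group(0)  # explicit 0 means the same thing
first = mo.group(1)     # first parenthesised group
second = mo.group(2)    # second parenthesised group
pair = mo.group(1, 2)   # several indexes at once give a tuple

print(whole)   # profit = $10
print(first)   # profit
print(second)  # 10
print(pair)    # ('profit', '10')
```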


> >>> len(blah.group())
> 28

What would you expect 

    len('<BR>Nov. 09, 07:15:46 PM EST')

to return? There are 28 characters, so returning anything other than 28 
would be a terrible bug. There is no way that len() can tell the 
difference between any of these:

    len('<BR>Nov. 09, 07:15:46 PM EST')
    len(blah.group())
    len('<BR>Nov. %s, 07:15:46 PM EST' % '09')
    s = '<BR>Nov. 09, 07:15:46 PM EST'; len(s)
    len((lambda: '<BR>Nov. 09, 07:15:46 PM EST')())


or any other of an infinite number of ways to get the same string. All 
len() sees is the string, not where it came from.

If you want to know how many groups are in the regex, *look at it*:

    r'<\w\w>(\w{3})\.\s+(\d{2}),\s+(\d{2}).+([AP]M)\s+(E[SD]T)'

has five groups. Or call groups and count the number of items returned:

    len(blah.groups())


> I didn't run group to find out the number of characters in a string, I ran
> it to find out something about blah and its matches.

Well, of course nobody is stopping you from calling blah.group() to find 
out the number of groups, in the same way that nobody is stopping you 
from calling int('123456') to find out the time of day. But in both 
cases you will be disappointed. You have to use the correct tool for the 
correct job, and blah.group() returns the entire matching string, not a 
tuple of groups. For that, you call blah.groups() (note the plural).
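
Putting the two side by side, using the regex from your own example. 
The text of "line" here is an assumption: only the matched portion 
appears in your message, so the words in front of it are made up.

```python
import re

regex = r'<\w\w>(\w{3})\.\s+(\d{2}),\s+(\d{2}).+([AP]M)\s+(E[SD]T)'

# Assumed input: the prefix is invented, the matched part is from the post.
line = 'updated <BR>Nov. 09, 07:15:46 PM EST'
blah = re.search(regex, line)

whole = blah.group()    # singular: one string, the whole match
parts = blah.groups()   # plural: a tuple with one item per group

print(len(whole))  # 28, the number of characters in the match
print(len(parts))  # 5, the number of groups
print(parts)       # ('Nov', '09', '07', 'PM', 'EST')
```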


-- 
Steven

