[Tutor] don't understand iteration
Steven D'Aprano
steve at pearwood.info
Thu Nov 13 22:55:57 CET 2014
On Mon, Nov 10, 2014 at 03:08:23PM -0800, Clayton Kirkwood wrote:
> Also of confusion, the library reference says:
>
> Match objects always have a boolean value of True. Since match() and
> search() return None when there is no match, you can test whether there was
> a match with a simple if statement:
>
> match = re.search(pattern, string)
> if match:
>     process(match)
>
> blah = re.search(
> r'<\w\w>(\w{3})\.\s+(\d{2}),\s+(\d{2}).+([AP]M)\s+(E[SD]T)', line)
> >>> blah
> <_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>
> >>> if blah: print(blah)
> <_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>
> >>> if blah == True: print(blah)>
> No print out
>
> To me, this doesn't *appear* to be quite true.
I think you are misreading a plain English expression, namely to "have
a boolean value", as "is a boolean value". If I said:
    The binary string '0b1001' has a numeric value of 9.
I don't think anyone would interpret that sentence as meaning that
Python treats the string as equal to the int:
    '0b1001' == 9  # returns False
but rather that converting the string to an int returns 9:
    int('0b1001', 0) == 9  # returns True
Somebody unfamiliar with Python might (wrongly) believe that Python
requires an explicit bool() conversion, just as Python requires an
explicit int() conversion, but to avoid that misapprehension, the docs
show an example of the correct idiomatic code to use. You tried it
yourself and saw that it works:
    if blah: print(blah)
prints blah, exactly as the docs suggest. As you can see from the
printed string value of blah, it is a Match object, and it behaves like
True in conditionals (such as if-statements).
On the other hand, this piece of code does something completely
different:
    s = "<_sre.SRE_Match object; span=(45, 73), match='<BR>Nov. 09, 07:15:46 PM EST'>"
    if blah == s: print(blah)
First it checks whether blah equals the given string, then it tests the
condition. Not surprisingly, that doesn't print anything. Match objects
are not strings, and although they do have a printable string
representation, they are not equal to that representation.
Nor are they equal to True:
    if blah == True: print(blah)  # also fails to print anything
The comparison "blah == True" returns False, as it should, and the if
clause does not run.
Match objects are not equal to True; however, they are true, in
the same way that my car is not equal to red, but it is red. (You'll
have to take my word for it that I actually do own a red car.)
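To make the difference concrete, here is a minimal sketch. The sample
line and the shortened pattern are made up for illustration; they are
not the ones from your session:

```python
import re

# Hypothetical sample data, invented for this illustration.
line = "<BR>Nov. 09, 07:15:46 PM EST"
blah = re.search(r'(\w{3})\.\s+(\d{2})', line)

# "Has a boolean value of True": this is what "if blah:" checks implicitly.
has_truth_value = bool(blah)

# "Is equal to True": a different question, and the answer is no.
equals_true = (blah == True)

# A failed search returns None, which is false in a conditional.
no_match = re.search(r'xyz', line)

print(has_truth_value)   # True
print(equals_true)       # False
print(no_match is None)  # True
```

If you prefer to be explicit, "if blah is not None:" asks the same
question as "if blah:" for the return value of re.search().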
[...]
> I would expect len(sizeof, whatever)(blah) to return the number of (in this
> case) matches, so 5. Doing a search suggests what is important: the number
> of matches. Why else would you do a search, normally.
The number of groups in a match is comparatively unimportant. The
*content* of the matched groups is important. Consider this regular
expression:
    regex = r'(\w*?)\s*=\s*\$(\d*)'
That has two groups. It *always* has two groups, regardless of what it
matches:
    py> re.match(regex, "profit = $10").groups()
    ('profit', '10')
    py> re.match(regex, "loss = $3456").groups()
    ('loss', '3456')
I can imagine writing code that looks like:
    key, amount = mo.groups()
    if key == 'profit':
        handle_profit(amount)
    elif key == 'loss':
        handle_loss(amount)
    else:
        raise ValueError('unexpected key "%s"' % key)
but I wouldn't expect to write code like this:
    t = mo.groups()
    if len(t) == 2:
        handle_two_groups(t)
    else:
        raise ValueError('and a miracle occurs')
It truly would be a miracle, or perhaps a bug is more accurate, if the
regular expression r'(\w*?)\s*=\s*\$(\d*)' ever returned a match object
with fewer than, or more than, two groups. That would be like:
    mylist = [1, 2]
    if len(mylist) != 2:
        raise ValueError
The only time you don't know how many groups are in a Match object is if
the regular expression itself was generated programmatically, and that's
very unusual.
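In that unusual case, you don't have to count by hand: compiled
patterns carry a "groups" attribute holding the number of capturing
groups. A small sketch, where the way the pattern is generated (joining
n copies of a word group) is only an assumption for the example:

```python
import re

# Hypothetical: build a pattern with n whitespace-separated word groups.
n = 3
pattern = re.compile(r'\s+'.join([r'(\w+)'] * n))

# The compiled pattern knows how many capturing groups it has.
group_count = pattern.groups

mo = pattern.match("profit 10 USD")
fields = mo.groups()

print(group_count)   # 3
print(fields)        # ('profit', '10', 'USD')
print(mo.re.groups)  # 3, the same number reached via the match object
```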
> That could then be used in the range()
> It would be nice to have the number of arguments.
> I would expect len(blah.group()) to be 5, because that is the relevant
> number of elements returned from group. And that is the basic thing that
> group is about; the groups, what they are and how many there are. I
> certainly wouldn't want len(group) to return the number of characters, in
> this case, 28 (which it does:>{{{
>
>
> >>> blah.group()
> '<BR>Nov. 09, 07:15:46 PM EST'
Calling group() on a match object with no arguments is the same as
calling group(0), which returns the entire matched string. For many
purposes, that is all you need; you may not care about the individual
groups in the regex.
> >>> len(blah.group())
> 28
What would you expect
    len('<BR>Nov. 09, 07:15:46 PM EST')
to return? There are 28 characters, so returning anything other than 28
would be a terrible bug. There is no way that len() can tell the
difference between any of these:
    len('<BR>Nov. 09, 07:15:46 PM EST')
    len(blah.group())
    len('<BR>Nov. %s, 07:15:46 PM EST' % '09')
    s = '<BR>Nov. 09, 07:15:46 PM EST'; len(s)
    len((lambda: '<BR>Nov. 09, 07:15:46 PM EST')())
or any other of an infinite number of ways to get the same string. All
len() sees is the string, not where it came from.
If you want to know how many groups are in the regex, *look at it*:
    r'<\w\w>(\w{3})\.\s+(\d{2}),\s+(\d{2}).+([AP]M)\s+(E[SD]T)'
has five groups. Or call groups and count the number of items returned:
    len(blah.groups())
> I didn't run group to find out the number of characters in a string, I ran
> it to find out something about blah and its matches.
Well, of course nobody is stopping you from calling blah.group() to find
out the number of groups, in the same way that nobody is stopping you
from calling int('123456') to find out the time of day. But in both
cases you will be disappointed. You have to use the correct tool for the
correct job, and blah.group() returns the entire matching string, not a
tuple of groups. For that, you call blah.groups() (note the plural).
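Here is the difference side by side, as a short sketch reusing the
two-group profit/loss regex from earlier in this message (the variable
names are just for illustration):

```python
import re

regex = r'(\w*?)\s*=\s*\$(\d*)'
mo = re.match(regex, "profit = $10")

whole = mo.group()     # same as mo.group(0): the entire matched string
first = mo.group(1)    # the first parenthesised group
second = mo.group(2)   # the second parenthesised group
parts = mo.groups()    # a tuple of all the groups

print(whole)           # profit = $10
print(len(whole))      # 12, the number of characters in the string
print(parts)           # ('profit', '10')
print(len(parts))      # 2, the number of groups
```

So len(mo.group()) counts characters, and len(mo.groups()) counts
groups.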
--
Steven