[Python-Dev] Scanner

Sat, 26 May 2001 17:15:11 -0400

[Paul Prescod]
> What ever happened to the sre Scanner? It seemed like a good idea
> but it was not documented

I previously urged /F to document, and Python-Dev to accept, the .lastindex
and .lastgroup match object extensions, but to date <wink> got no response.
Whether to adopt the Scanner class too is fuzzier, since AFAICT almost
nobody has figured out how to use it.

> and it doesn't work for me.

This isn't a code problem, it's a failure to reverse-engineer the
undocumeted API <wink>.

> Is it just a case of nobody got around to the documentation or have
> we decided against it?

WRT Scanner, partly the former, nothing of the latter, mostly that there's
been no discussion of the API at all.

WRT lastindex and lastgroup, I think purely the former.

> Here's the code that doesn't work for me:
>
> from sre import Scanner
>
> scanner = Scanner([
>     (r"[a-zA-Z_]\w*", None),
>     (r"\d+\.\d*", None),
>     (r"\d+", None),
>     (r"=|\+|-|\*|/", None),
>     (r"\s+", None),
>     ])

1. Every tokenization regexp must contain exactly one capturing group.
   The lack above is the source of your later TypeError.  Unclear to
   me whether that was the intent, or ust the way the code happens to
   work today.

2. When an action is None, the substring matched by the pattern will
   be thrown away.  You need to supply non-None actions if you want
   anything to show up in the token list.

> tokens, tail = scanner.scan("sum = 3*foo + 312.50 + bar")
>
> Traceback (most recent call last):
>   File "junk.py", line 11, in ?
>     tokens, tail = scanner.scan("sum = 3*foo + 312.50 + bar")
>   File "c:\program files\python21\lib\sre.py", line 254, in scan
>     action = self.lexicon[m.lastindex][1]
> TypeError: sequence index must be integer
>
> m.lastindex is None

Here's a working rewrite:

from sre import Scanner

def retrieve(scanner, group):
    return group

scanner = Scanner([
    (r"([a-zA-Z_]\w*)", retrieve),
    (r"(\d+\.\d*)", retrieve),
    (r"(\d+)", retrieve),
    (r"(=|\+|-|\*|/)", retrieve),
    (r"(\s+)", None),  # ignore whitespace
    ])

tokens, tail = scanner.scan("sum = 3*foo + 312.50 + bar")
print tokens, `tail`

That prints

['sum', '=', '3', '*', 'foo', '+', '312.50', '+', 'bar'] ''

In return for that, how about *you* supply a works-on-Windows rewrite of
test_urllib2.py?  You know more about that than anyone, and the test has
been failing for weeks.