[Python-ideas] Pattern matching

Cara ceridwen.mailing.lists at gmail.com
Thu Apr 16 22:22:47 CEST 2015


In my previous post I aimed for brevity and skimmed over some of the
details.  In some places, I was a bit too brief, and so this post will
be much more detailed.

> > The second is performing algebra on the struct module's format strings.
> > I've written some automatic optimization of them with concatenation and
> > scalar multiplication, so that for instance 'b' + 'b' = 
> > '2b' rather than
> > 'bb' and '10s' + '10s' = '10s10s'.  The value of 
> > this is in dealing with
> > format strings that can be combined in various ways to describe the
> > layouts of complicated binary data formats.  With some context stripped
> > for brevity,
> > 
> >     def _check_add(self, other):
> >         if not isinstance(other, _StructBase):
> >             return NotImplemented
> >         if not self.same_endianness(other):
> >             return NotImplemented
> >         if (self.endianness == other.endianness or
> >                 len(self.endianness) > len(other.endianness)):
> >             return type(self)
> >         else:
> >             if isinstance(other, _StructSequence):
> >                 return type(other)
> >             else:
> >                 return other._STRUCT_SEQUENCES[other.endianness]
> > 
> >     def __add__(self, other):
> >         struct_sequence = self._check_add(other)
> >         if isinstance(other, _StructSequence):
> >             if other._elements == []:
> >                 return struct_sequence(self._elements.copy())
> >             else:
> >                 begin = other._elements[0]
> >                 rest = other._elements[1:]
> >         else:
> >             begin = other
> >             rest = []
> >         if self._elements == []:
> >             rest.insert(0, begin)
> >             return struct_sequence(rest)
> >         if self._elements[-1].character == begin.character:
> >             return struct_sequence(self._elements[:-1] +
> >                                    [self._elements[-1] + begin] + rest)
> >         else:
> >             return struct_sequence(self._elements + [begin] + rest)
> 
> This seems to be overcomplicating things by trying to write ML code in
> Python instead of just writing Python code. For example, the only
> reason you decompose other._elements into begin and rest is so you can
> recompose them (a copy of) into the same list you started with. Why
> bother? (Also, if you're avoiding the limited pattern-matching already
> available in Python by not writing "begin, *rest = other._elements",
> that doesn't make a great case for adding more.)
>
>     def __add__(self, other):
>         struct_sequence = self._check_add(other)
>         if isinstance(other, _StructSequence):
>             elements = other._elements
>         else:
>             elements = [other]
>         if (self._elements and elements
>             and self._elements[-1].character == elements[0].character):
>             return struct_sequence(self._elements[:-1] +
>                                    [self._elements[-1] + elements[0]] + elements[1:])
>         return struct_sequence(self._elements + elements)
> 
> If this is all you're doing, your argument is equivalent to showing
> some inherently iterative code written instead in tail-recursive form
> to argue that Python should do tail call elimination. You're asking to
> make it easier to provide another, less Pythonic way to do something
> that can already be easily done Pythonically.

I didn't use "begin, *rest = other._elements" because this code has to
be Python-2 compatible.  (This is another argument for a library
solution.)

I meant more to give the flavor of the complications of needing to test
and decompose objects at the same time in Python than to provide the
best possible solution.  Your version of __add__ is better and simpler
than mine, but I still find it hard to read and understand, and you're
using _check_add too.  The following is another description of this
algorithm, simplifying by ignoring the case where you're adding a
StructAtom to a StructSequence.  (Arguably, that should be another
method, or StructAtoms shouldn't be part of the public API, just as
Python has no char built-in type.)

A StructSequence is a sequence of StructAtoms.  A StructAtom is composed
of an endianness, a legal format character, and a positive integer.  To
concatenate two StructSequences in __add__ or __radd__, here are the
steps (a code sketch follows the list):

1) Check that `other` is a StructSequence.  If it's not, return
NotImplemented.

2) Check that `self` and `other` have compatible endiannesses.  Probably
the best way to do this is to use a dictionary with two-element frozensets
of endianness characters mapping to one endianness character each.  I
used conditionals above in _check_add, and that's messier.  A missing
key means they aren't compatible, and then __radd__ has to raise some
kind of exception indicating invalid endiannesses for concatenation.
__add__ needs to return NotImplemented.  A successful lookup will give
the endianness character for the result StructSequence.

3) Check that neither `self` nor `other` is empty.  If either is empty,
return a copy of the other.  (If they're both empty, you'll end up
returning a copy of an empty StructSequence, which is fine.)

4) Compare the first StructAtom of one StructSequence with the final
StructAtom of the other.  (Which is `self` and which is `other` depends
on whether we're in __add__ or __radd__.)  If the format characters are
the same and not 's' or 'p', then create a new StructSequence composed
of all but the final StructAtom of one, a merged StructAtom, and all
but the first StructAtom of the other, where the merged StructAtom has
the endianness of the result StructSequence, the same character as the
two original StructAtoms, and an integer that's the sum of the two
integers in the original StructAtoms.  If the format characters aren't
the same or are both 's' or 'p', return a new StructSequence that's the
concatenation of the two original sequences.

Replacing _check_add with the aforesaid dictionary won't improve its
clarity much, if at all.  You still end up with at least four
conditionals of some form and need to do object decomposition in some
of the branches.  Moreover, struct format string manipulation is a
simple instance of this kind of problem: consider doing simplification
on regular expressions.  Devin's examples from the ast and sre_compile
modules are better illustrations than this.  If you don't find the code
in these examples hard to read or comprehend, I don't expect you to
agree that pattern matching is useful.  Do you or don't you?  That's an
honest question.

> > Those are two small-scale examples.  Some broad areas where pattern
> > matching is useful are:
> > 
> > 1) Parsing: It's useful both for performing manipulations on grammars
> > like left factoring and operating on the trees that parsers produce.
> 
> Parsing, and related things (compiling, schema validation, etc.) looks
> like a good case, but I'm not sure it is.
> 
> For simple languages, an expression-tree library like PyParsing or
> Boost.Spirit seems to be more than sufficient, and often be even nicer
> than explicit pattern matching.
> 
> For more complex languages (and sometimes even for simple ones), you
> can use or write a framework that lets you write a fully-declarative
> syntax and then executes it or generates executable code from it. For
> example, if you want to validate and parse an XML language, doing it
> by hand in ML would be nicer than in Python, but writing it in XML
> Schema and using a schema-validating parser is much nicer than either.
> Parsing the Python grammar by hand would be nicer in ML than in C or
> Python, but writing it as a declarative GRAMMAR file is nicer than
> either. And so on. You could argue that the work to write that
> framework is wasted if it's only going to be used once, but how many
> complex languages really need to be parsed only once? Even CPython's
> parser, which isn't used for anything but parsing Python's GRAMMAR, is
> effectively reused every time Python's syntax changes, and every time
> someone wants to hack on it to play with new syntax.
> 
> The one really good case for pattern matching seems to be when you
> pretty much _have_ to write a custom parser--maybe because the
> language is full of special-case rules and tangled context, like C++.
> At least the GCC and LLVM/Clang groups gave up trying to write C++ as
> a declarative specification and instead wrote custom recursive-descent
> parsers in C, which I suspect could be a lot nicer with pattern
> matching. But has anyone written an equivalent C++ parser in Haskell,
> ML, etc. to compare, or is that just a suspicion?

I became interested in pattern matching when I started writing a parsing
library and realized that both the manipulations I needed to perform in
the parser itself and the manipulations I would need to do on the AST
after parsing were better expressed using pattern matching than the
if-elif-else trees I was writing.  As part of this, I've been doing some
informal research on how people do parsing in the real world.
Unfortunately, I don't have any formal data and don't know how I'd
collect it, not to mention not having the money or time to do a formal
survey even if I knew how to do it.  With that caveat, some patterns
I've noticed:

1) Hand-written recursive-descent parsers are ubiquitous in real code.
If you take a random program that needs to do parsing somewhere, I'd
give better than even odds that it's done with a hand-written
recursive-descent parser, not a library.  I have some guesses as to why
this is, but I won't speculate here.  I think there's a solid case for
putting a parser package in the standard library---Haskell and Scala,
for instance, have already done this---which helps with one part of this
problem, but then you have to do something with the tree you've parsed,
and pattern matching is also useful for that.

2) Many programmers don't know (E)BNF or formal grammars.  I prefer to
use some declarative DSL too, but I know the languages and the concepts
already.  Those who don't might benefit from features that make parsers
easier to write without that background.

3) The world is filled with obscure data formats that none of us have
ever heard of; this is especially true of binary data formats.  Each
of these requires some new parser whether hand-written, built with a
library, or generated.

4) Most languages and formats can't be parsed with anything less than a
context-sensitive grammar, and sometimes require more than that.  To
handle this problem, successful parser libraries make it easy to add
custom code to the parsing process.

Thus, there's a lot of parsing code in the world, whether there should
be or not, and even if you know formal grammars and how to use
parser generators and similar tools, you might still end up needing to
write parsing code because you have an obscure format to parse, a format
that has non-standard properties, or both.  Aside: if you do end up
needing to write a framework for a language or format, wouldn't pattern
matching make it easier to write that framework?

As far as I know, there's no C++ parser in Haskell or ML, mainly because
C++ is so hard to parse.  There exist parsers for other languages (C,
for instance) that do make extensive use of pattern matching.

> > 2) Mathematics: manipulations of abstract mathematical objects like
> > algebraic expressions (not numerical analysis), for instance algebraic
> > simplification rules.
> 
> Are you talking about parsing the expressions into a tree? If so, we're still on parsing.

I'm talking about doing manipulations on mathematical objects, like,
say, doing substitutions with trigonometric identities.
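For example, even a single rewrite rule like sin(x)**2 + cos(x)**2 -> 1
takes a pile of nested checks without pattern matching.  The
expression-tree representation here is hypothetical:

from collections import namedtuple

Add = namedtuple('Add', 'left right')
Pow = namedtuple('Pow', 'base exp')
Sin = namedtuple('Sin', 'arg')
Cos = namedtuple('Cos', 'arg')
Const = namedtuple('Const', 'value')
Var = namedtuple('Var', 'name')

def simplify(node):
    # sin(x)**2 + cos(x)**2 -> 1, written as explicit tests and
    # attribute accesses.
    if (isinstance(node, Add)
            and isinstance(node.left, Pow) and node.left.exp == 2
            and isinstance(node.right, Pow) and node.right.exp == 2
            and isinstance(node.left.base, Sin)
            and isinstance(node.right.base, Cos)
            and node.left.base.arg == node.right.base.arg):
        return Const(1)
    return node

simplify(Add(Pow(Sin(Var('x')), 2), Pow(Cos(Var('x')), 2)))  # Const(1)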

> > 3) Recursive data structures: Python doesn't have algebraic data types
> > or embed recursion into everything the way languages like Haskell do,
> > but there's plenty of recursively-structured data out in the wild like
> > HTML and JSON.
> 
> The parsing of this data is covered above. But what do you do with it
> once it's parsed? If you're transforming it, that again seems to be a
> case where writing the transformer declaratively in, say, XSSL wins.
> And if you're just extracting data from it, usually you're better off
> with the simple data['machines'][0]['name'] syntax (in an except
> KeyError block), and when that's not good enough, you're usually
> better off with a declarative search language like XPath. So this
> seems to be the same as with parsing.

What is XSSL?  All my comments above on parsing apply here, too.  One
problem I've encountered is that since most languages and formats don't
have nice solutions like XPath already available, you end up
implementing your own.

> > 
> > Whether I've convinced you or not that pattern matching is a worthwhile
> > addition to the language, I'd like it, and if I could, I'd write a
> > library to get it.
> 
> That raises the question: what's wrong with the more than a dozen libraries already written?

The short answer is some combination of poor performance, lack of
compatibility (with 2.7, with 3.4, with PyPy, with other alternate
interpreters), bad syntax, and missing functionality, depending on the
library.  If you have a particular library you're curious about, I can
be more specific.

> > with match(expr) as case:
> >     with case(pattern) as matches:
> >         if matches:
> >             target_list = matches
> >             suite
> >     with case(pattern) as matches:
> >         if matches:
> >             target_list = matches
> >             suite
> >     ...
> 
> 
> First, this seems like it will execute _every_ branch that matches,
> instead of just the first one. Surely you don't want that, do you? You
> can wrap the whole thing in a "while True:" and have each suite (and
> the "else" case) end with "break", or wrap it in a try/except and have
> each suite raise a custom exception, but you need some way to get out.

Yes, I have `while True:` in my notes; it was a clumsy copy-paste, and
omitting the break lines was a simple mistake.  There are some reasons
that try/except might be better.
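
With those fixes, the sketch would look something like this (same
pseudocode placeholders as before):

while True:
    with match(expr) as case:
        with case(pattern) as matches:
            if matches:
                target_list = matches
                suite
                break
        with case(pattern) as matches:
            if matches:
                target_list = matches
                suite
                break
        ...
    suite  # the else case, reached only when nothing matched
    break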

> Also, why not just put the target_list in the as clause (inside
> parens, to fit it in a target, which is what an as clause actually
> takes)? Of course that means you get an exception rather than an else
> when it doesn't match, but that's fine. (It would be nice if it were a
> subclass of ValueError, in case the suite raises some ValueError of
> its own, however...)

I didn't know you could do that :).  Yes, that's preferable.

> > 
> > for target_list in pattern(expr):
> >     suite
> 
> 
> What is this going to do? From your above example, if looks like
> match(expr)(pattern) returns either an iterable that can be assigned
> to a target_list or None. What does pattern(expr) return? Return an
> iterable of one target_list or an empty iterable?
> 
> Also, the very idea of having an object which can be called on an
> expression or can have a (processed) expression called on it with
> similar but not identical results seems very confusing.

Here, what should happen is that `expr` should be an iterator, and
calling `pattern` on it returns another iterator containing the
deconstructed parts of each object in the original iterator *if* they
match.  I agree the lack of parallelism is confusing.
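
That is, something like this hypothetical sketch, where the pattern
matches two-element tuples:

def pattern(objects):
    # Yield the deconstructed parts of each object that matches;
    # silently skip the objects that don't.
    for obj in objects:
        if isinstance(obj, tuple) and len(obj) == 2:
            yield obj

for x, y in pattern([(1, 2), 'skipped', (3, 4)]):
    print(x + y)  # prints 3, then 7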

> 
> The useful part of pattern matching--and the hard part, too--is
> decomposing structures. A proposal that ignores that and only
> decomposes lists and dicts misses the point.
> 
> Yes, it's true that Python doesn't have ADTs. So that means one of two
> things: (1) types that want to be matched will have to do a bit of
> work to expose their structure and/or to decompose themselves, or (2)
> pattern matching isn't useful in Python. The examples you give from
> other languages show that the first is at least plausible. And both
> Szymon's proposal and mine and various other ideas suggested on this
> thread show attempts at a Pythonic design. But just ignoring structure
> decomposition, as you're doing, is not a way toward anything useful.

I assumed the answer to this was obvious because you've already given
it :).  Objects have to opt in to pattern matching by having some
method that inverts their constructor or at least provides some
plausible facsimile thereof.  For any object-oriented language, if you
want to preserve the ability to define classes with arbitrary
attributes and have pattern matching, you have to do it this way or
with structural typing, and Python doesn't really have that.  Python
classes don't have private variables like Java except by convention,
but neither do they define their variables except by convention.
__init__ is a black box that could be doing any Turing-complete
computation on its arguments, and with properties, any attribute
access could be the result of a computation rather than a member of
the class.  Scala calls its pattern-matching method unapply by analogy
with its apply method.  In Python, I'd probably pick a different name,
but the concept is the same.
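
For example, here's a minimal sketch of such an opt-in method; the
name __match__ and its protocol are mine, not from any existing
library:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __match__(self):
        # Invert the constructor: return the arguments it was called
        # with, in order, for a pattern to deconstruct.
        return (self.x, self.y)

x, y = Point(1, 2).__match__()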

> If you're willing to use MacroPy, that brings up the question: what's
> wrong with the pattern matching that it comes with as an example, and
> why don't you want to build on it instead of starting over?

I'm tempted to answer, "Nothing!" because I am thinking about building
on it, but the short answer is that I'm not sure the syntactic and
performance advantages of macros outweigh the usability problems
introduced by macros (unfamiliarity of macros to most Python developers,
bootstrapping, complication of the implementation).  This is my major
concern: if you have a case to make one way or the other, I'd like to
hear it.  Other secondary concerns are that, despite doing AST
rewriting, MacroPy's pattern matching doesn't seem to do any
optimization, it isn't feature-complete (it's still listed as
experimental for good reasons), and MacroPy itself requires updating to
3.4 (and soon 3.5).  None of these are insurmountable problems if I were
convinced that macros were the right approach.  I would prefer to build
off one of the existing solutions rather than make a new one.
Unfortunately, for most of them it would be more work to comprehend the
existing code than it would be to reimplement their functionality,
especially for the ones with nonexistent documentation.  In that
respect, MacroPy's is definitely one of the more promising options.

> > 
> > I haven't talked at all about the nitty-gritty of implementation, though
> > I can.  I think the most productive way to have that discussion on this
> > list would be to look at what smaller features would be most helpful for
> > implementing pattern matching as a library, and likewise for syntax,
> > shades of the recent discussion about macros.
> 
> This last part, I think, is very useful: After building the best
> pattern-matching library you can with Python today, figure out what
> small changes to the language would allow you to either make it better
> or simplify its implementation, then propose those small changes
> instead of proposing pattern matching itself. For example, if you have
> to use nonportable hacks (or a MacroPy macro) so your context
> manager's __enter__ method can skip the statement body, that would be
> a good argument for reopening PEP 377 (because you have a new use case
> that wasn't known at the time). But trying to guess in advance what
> features might be helpful seems harder and less likely to pay off.
> 
> That being said, I suspect it's not going to pay off, because I think
> the with statement is a false lead in the first place. The semantics
> look promising, but they're not actually what you want to build on
> (unless you want to completely transform with statements into
> something that isn't actually a with statement, using MacroPy AST
> transformations or otherwise). I'd love to be proven wrong, of course.

After some thinking about it, I've concluded you're right, because
both of the other options (try and for) combine binding with control
flow, while with requires an additional control flow statement without
something like PEP 377.  Let's leave aside MacroPy and other ways of
changing Python's semantics like the decorators in Grant's PyPatt and
https://github.com/Suor/patterns for the moment.  Pattern
matching has two basic forms, the case/match block and function
definition, but before I can talk about them, I have to discuss
patterns themselves.  Without metaprogramming, patterns have to be
defined either as ordinary functions or classes.  You have some set of
elementary patterns like class patterns, dict patterns, list patterns,
variable patterns, and so on.  You construct or initialize patterns
with either objects to match against or other patterns, mimicking the
structure of the data you want to match.  With the same tentative
syntax I suggested last time:

ClassPattern(Foo(1, VariablePattern('a', str)))
DictPattern['a': 1, 'b': 2, 'c': 'a': int]
ListPattern[1, ..., 'a': Bar]

Alternately, if you don't want to use the slicing syntax like I am
there, you can make everything very explicit:

ClassPattern(Foo(1, VariablePattern('a', str)))
DictPattern(('a', 1), ('b', 2), ('c', VariablePattern('a', int)))
ListPattern(1, ..., VariablePattern('a', Bar))

Even the slicing syntax is verbose.  Both forms of syntax could be
made less verbose.  Would that make it more or less clear?  I'm not
sure.  This syntax is tightly circumscribed because the whole point of
pattern matching is that the patterns should look like the objects
they're being matched against.  The closer the resemblance, the
happier I'd be.  After defining a pattern, you call it or one of its
methods on the object you want to match against, and it returns either
some indication of failure, maybe None, maybe some object with
information about the non-match, or an object containing the
deconstructed objects in a form suitable for tuple assignment.

The case/match block should start with the object that the patterns
are to be matched against (avoiding this duplication is a syntactic
advantage of pattern matching over if/elif/else blocks), contain a
list of patterns to match against, have access to variables in the
local scope, and bind variables in the same scope.  The only
statements that bind variables in the same scope beyond basic
assignment statements in Python are, as far as I know, try-except,
with, and for.  (Well, import, but it's obviously unsuitable.)  Here
are some sketches of syntax using try-except and for.

try:
    pattern0(expression,
             pattern1,
             pattern2,
             ...)
except pattern0.exception as identifier:
    target_list = identifier
    suite
except pattern1.exception as identifier:
    target_list = identifier
    suite
except pattern2.exception as identifier:
    target_list = identifier    
    suite
...
else:
    suite

Here, each pattern has an associated exception.  If a pattern matches,
it raises its exception; if it doesn't match, it calls the first
pattern in its varargs list on the value of the expression it was
passed, and so on.  The last pattern has no pattern to call so returns
None if it fails to match.  The exceptions have iterator methods for
tuple unpacking their arguments after they've been caught.  Another
possible choice for setting up the patterns uses operator overloading
and a dummy function to look vaguely like ML's syntax:

match(expression,
      (pattern0
       | pattern1
       | pattern2
       ...))
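
However the patterns get chained, a minimal sketch of the
exception-based protocol itself might look like this; the matcher
signature and all the names are assumptions:

class _MatchExc(Exception):
    def __init__(self, values):
        self.values = values

    def __iter__(self):
        # Support tuple unpacking of the bound values once caught.
        return iter(self.values)

class Pattern:
    def __init__(self, matcher):
        self.matcher = matcher  # returns a tuple of bindings or None
        # A distinct exception class per pattern lets except clauses
        # tell the patterns apart.
        self.exception = type('MatchFound', (_MatchExc,), {})

    def __call__(self, value, *rest):
        values = self.matcher(value)
        if values is not None:
            raise self.exception(values)
        if rest:
            return rest[0](value, *rest[1:])
        return None

is_int = Pattern(lambda v: (v,) if isinstance(v, int) else None)
is_str = Pattern(lambda v: (v,) if isinstance(v, str) else None)

try:
    is_int('spam', is_str)
except is_int.exception as caught:
    (n,) = caught
    print('int:', n)
except is_str.exception as caught:
    (s,) = caught
    print('str:', s)  # this branch runs
else:
    print('no match')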

Using for statements can avoid the extra assignment statements at the
cost of needing to reiterate the expression and some raises.

try:
    for target_list in pattern0(expression):
        suite
        raise MatchFound
    for target_list in pattern1(expression):
        suite
        raise MatchFound
    ...
except MatchFound:
    pass
else:
    suite

A pattern call either returns a one-element iterator whose element is
unpacked into target_list, or an empty iterator (one that raises
StopIteration immediately), in which case the loop body gets skipped
and the next pattern is tried.  Once any loop body is executed, it
raises MatchFound, which skips all the other patterns.
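
A generator is a natural way to sketch such a pattern; the
two-int-tuple pattern below is a hypothetical example:

class MatchFound(Exception):
    pass

def int_pair(value):
    # Yield exactly one binding tuple on a match and nothing on a
    # failure, so a failed match skips the loop body entirely.
    if (isinstance(value, tuple) and len(value) == 2
            and all(isinstance(v, int) for v in value)):
        yield value

expression = (3, 4)
try:
    for x, y in int_pair(expression):
        print(x + y)  # the suite for this pattern
        raise MatchFound
    # ... further patterns ...
except MatchFound:
    pass
else:
    print('no pattern matched')  # the else suite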

In Python, function definition is easier both for syntax and
implementation.  All the pattern matching implementations for Python
that dispatch to functions
( http://blog.chadselph.com/adding-functional-style-pattern-matching-to-python.html ,
 https://github.com/martinblech/pyfpm ,
 http://svn.colorstudy.com/home/ianb/recipes/patmatch.py ,
 https://github.com/admk/patmat ,
 http://peak.telecommunity.com/DevCenter/PEAK-Rules ) use
decorators and all use them in similar ways.  There are two major
variations.  First, some decorators take patterns as arguments while
others use introspection on keyword arguments to embed the patterns in
the function definition.  You could use annotations instead of keyword
arguments, though I don't know how easily that would integrate with
type-hinting---it's probably possible because pattern matching and
typing are closely related.  I assume the reason no one's done so is
because annotations are Python-3 only.  Second, some implementations
have one uniform decorator across all definitions of a function with
frame-hacking or similar hacks to introspect the namespace and link up
the decorated functions, while others use two decorators like
functools.singledispatch, one to create a pattern-matched
function and another to link other functions to the first one. I
prefer the latter approach because I dislike frame-hacking on
principle, since it introduces compatibility issues, and I think two
decorators is more explicit as to what's going on.  Beyond those
differences, they're all similar: there's an outer function that tries
patterns against its arguments until it finds one that matches, then
it takes the names bound by the pattern and passes them as arguments
to the inner function.  Quoting my suggested function definition
syntax from my last message:

@match(pattern)
def f(parameter_list):
    suite

@f.case(pattern)
def g(parameter_list):
    suite

@f.case(pattern)
def h(parameter_list):
    suite

f(expression)

In all of these potential syntax choices, the names of bound variables
are assigned manually in assignment statements or function arguments.
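
For illustration, here's a minimal sketch of that two-decorator
approach; representing patterns as callables that return a dict of
bindings or None is just an assumption to keep the sketch
self-contained:

def match(pattern):
    def decorator(func):
        cases = [(pattern, func)]

        def dispatcher(*args):
            # Try the patterns in registration order and call the
            # inner function with the names the first match binds.
            for pat, inner in cases:
                bindings = pat(*args)
                if bindings is not None:
                    return inner(**bindings)
            raise ValueError('no pattern matched')

        def case(pat):
            def register(inner):
                cases.append((pat, inner))
                return inner
            return register

        dispatcher.case = case
        return dispatcher
    return decorator

@match(lambda x: {'n': x} if isinstance(x, int) else None)
def f(n):
    return n + 1

@f.case(lambda x: {'s': x} if isinstance(x, str) else None)
def g(s):
    return s.upper()

f(41)      # 42, handled by the first case
f('spam')  # 'SPAM', dispatched to g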

These approaches all have some boilerplate, but I'm more worried about
overhead.  Without any desugaring, most of them will require at least
one function call per pattern on top of the conditionals and copying
necessary.  With metaprogramming, it's possible to eliminate some of
the function calls for function definitions and the pseudo-ML
match-case statement by constructing a single dispatch function or
single pattern before calling any patterns.  Unfortunately, in the
match-case statement, moving overhead to pattern definition time
doesn't help because the patterns have to be redefined every time the
statement is executed unless they're explicitly defined outside it, making
the code less readable; this is another way in which pattern matching
for function definition is easier.  Any metaprogramming could remove the
boilerplate, but MacroPy could move the overhead to module import time.
This is, I think, the best reason to use MacroPy over both
non-metaprogramming approaches and other metaprogramming approaches.

Unfortunately, Python's current implementation imposes some
limitations on desugaring.  In ML and Haskell, desugaring pattern
matching involves creating a finite automaton.  In Python, the VM
doesn't provide any way to jump to an offset determined by runtime
information short of Philip J. Eby's computed goto hack (
https://mail.python.org/pipermail/python-dev/2006-June/066115.html
), which isn't portable.  This means that any state with more than two
outgoing transitions has to be represented with an if-else tree that
turns what could be a constant-time operation (look up the jump
location in a hash table, jump) into a linear-time operation (check a
number of conditionals equal to the number of transitions minus one).
This is,
of course, the switch statement, PEPs 275 and 3103, in another guise.
Another, less serious problem is that there's a constant-time
performance penalty from needing to use exceptions to implement
backtracking or
nodes with more than one incoming transition.

Supposing I were to build on MacroPy's pattern matching, is MacroPy's
current syntax for the match-case statement good enough?  If not, how
could it be improved?  As I understand it, by using import hooks like
MacroPy does and a new parser, the sky's the limit as far as changing
syntax goes, but I'd much rather stay within MacroPy's limitations,
using Python's current parsers, than go outside them.

> Meanwhile transforming a pattern-match into something that can be
> assigned to a target_list instead of just assigning in the match does
> work in simple cases, but I think Szymon's proposal shows why it's not
> sufficient, and I think any real examples will show that it's
> redundant. Even without the language support in my proposal, explicit
> quoting of names like in Grant Jenks' PyPatt library shows that you
> can get most of the way there. (The "as" clause is useful for the case
> when you want to bind the whole structure at the same time you
> decompose it and potentially bind its parts, but that's pretty
> rare--which is another reason I think "with" is misleading.)

I don't understand your comment about how Szymon's proposal shows
assigning to a target_list isn't sufficient.

Cara


