Aahz <aahz@pythoncraft.com>:

> Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing

Lexical analysis, otherwise known as "lexing" or
"tokenising", is the process of splitting the input
up into a sequence of "tokens", such as (in the case
of a programming language) identifiers, operators,
string literals, etc.
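
To make that concrete, here is a minimal sketch of a lexer for a toy
expression language. The token classes (NUMBER, IDENT, STRING, OP) and
the name tokenize() are made up for illustration; a real language
would have its own set.

```python
import re

# Hypothetical token classes for a toy expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("STRING", r'"[^"]*"'),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),  # whitespace is matched but not emitted
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split text into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {text[pos]!r} at position {pos}"
            )
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Note that the lexer knows nothing about which sequences of tokens make
sense; it only decides where one token ends and the next begins.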

Parsing is the next higher level in the process,
which takes the sequence of tokens and recognises
language constructs -- statements, expressions, etc.
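
As a rough illustration, here is a small recursive-descent parser
that turns a list of (kind, value) tokens into a nested tuple. The
grammar, the token shape and the function names are all assumptions
chosen for the example, and recursive descent is only one of several
parsing techniques.

```python
# Assumed toy grammar:
#   expr := term (("+" | "-") term)*
#   term := atom (("*" | "/") atom)*
#   atom := NUMBER | IDENT | "(" expr ")"
# Tokens are assumed to be (kind, value) tuples, e.g. ("OP", "+").


def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (("OP", "+"), ("OP", "-")):
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)  # left-associative
    return node, pos


def parse_term(tokens, pos):
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (("OP", "*"), ("OP", "/")):
        op = tokens[pos][1]
        right, pos = parse_atom(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_atom(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[pos]
    if kind == "NUMBER":
        return int(value), pos + 1
    if kind == "IDENT":
        return value, pos + 1
    if (kind, value) == ("OP", "("):
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ("OP", ")"):
            raise SyntaxError("expected ')'")
        return node, pos + 1
    raise SyntaxError(f"unexpected token {value!r}")


def parse(tokens):
    """Parse a complete token list into a nested-tuple syntax tree."""
    node, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return node
```

The parser never looks at individual characters. It only sees tokens,
and its job is to work out how they group into larger constructs (here,
that multiplication binds tighter than addition).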

> particularly in the context of general string processing (e.g. XML)
> rather than the arcane art of compiler technology?

The lexing and parsing part of compiler technology isn't really any
more arcane than it is for XML or anything else -- exactly the same
principles apply.
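
For instance, the same two-stage split works on XML-like markup. The
sketch below is a deliberately tiny toy, not an XML parser: it assumes
only plain open tags, close tags and text, with no attributes,
comments, entities or self-closing tags. The names xml_tokens() and
check_nesting() are made up for the example.

```python
import re

# Lexing stage: one alternative per (assumed) token class.
XML_TOKEN_RE = re.compile(
    r"</(?P<close>[\w:-]+)\s*>"
    r"|<(?P<open>[\w:-]+)\s*>"
    r"|(?P<text>[^<]+)"
)


def xml_tokens(source):
    """Split markup into ("open" | "close" | "text", value) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = XML_TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"bad markup at position {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def check_nesting(tokens):
    """Parsing stage (greatly simplified): are the tags properly nested?"""
    stack = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack
```

With these definitions, xml_tokens("<a><b>hi</b></a>") yields open,
open, text, close, close tokens, and check_nesting() accepts that
sequence but rejects the tokens of "<a><b>hi</a></b>".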

It's more a matter of how deeply you want to get into the theory. The
standard text on this stuff around here seems to be Aho and Ullman,
"The Theory of Parsing, Translation, and Compiling", but you
might find that a bit much if all you want to do is parse XML. It
will, however, give you a good grounding in the theory of REs, various
classes of grammar, different parsing techniques, etc., after which
writing an XML parser will seem like quite a trivial task. :-)

