
Aahz <aahz@pythoncraft.com>:
> Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing
Lexical analysis, otherwise known as "lexing" or "tokenising", is the process of splitting the input up into a sequence of "tokens", such as (in the case of a programming language) identifiers, operators, string literals, etc. Parsing is the next level up in the process: it takes the sequence of tokens and recognises language constructs -- statements, expressions, and so on.
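
To make the distinction concrete, here's a minimal sketch in Python (my illustration, not something from the original exchange): a regular-expression lexer that splits simple arithmetic into tokens, and a parser that walks the token stream recognising the grammar expr ::= NUM (("+" | "-") NUM)*.

    import re

    # Lexing: split the raw input into a flat sequence of (kind, text) tokens.
    # \s* skips whitespace; a token is either a run of digits or any other
    # single non-space character.
    TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")

    def tokenize(text):
        return [("NUM", num) if num else ("OP", other)
                for num, other in TOKEN_RE.findall(text)]

    # Parsing: recognise  expr ::= NUM (("+" | "-") NUM)*  over the token
    # sequence, evaluating as it goes.
    def parse(tokens):
        pos = 0

        def expect_num():
            nonlocal pos
            if pos >= len(tokens) or tokens[pos][0] != "NUM":
                raise SyntaxError("expected a number at token %d" % pos)
            pos += 1
            return int(tokens[pos - 1][1])

        value = expect_num()
        while pos < len(tokens):
            kind, op = tokens[pos]
            if kind != "OP" or op not in "+-":
                raise SyntaxError("expected '+' or '-', got %r" % op)
            pos += 1
            value = value + expect_num() if op == "+" else value - expect_num()
        return value

    print(tokenize("1 + 20 - 3"))
    # [('NUM', '1'), ('OP', '+'), ('NUM', '20'), ('OP', '-'), ('NUM', '3')]
    print(parse(tokenize("1 + 20 - 3")))   # 18

Exactly the same division of labour applies to XML or any other string format; only the token classes and the grammar change.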
> particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
The lexing and parsing part of compiler technology isn't really any more arcane than it is for XML or anything else -- exactly the same principles apply. It's more a matter of how deeply you want to get into the theory. The standard text on this stuff around here seems to be Aho and Ullman, "The Theory of Parsing, Translation, and Compiling", but you might find that a bit much if all you want to do is parse XML. It will, however, give you a good grounding in the theory of REs, various classes of grammar, different parsing techniques, etc., after which writing an XML parser will seem like quite a trivial task. :-)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+