[XML-SIG] Shallow Parsing with Regular Expressions

Wed, 10 Nov 1999 13:05:40 -0600

This is a multi-part message in MIME format.

------=_NextPart_000_00CB_01BF2B7C.49F97ED0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Hello,

I ran across an interesting article titled "REX: XML Shallow Parsing with
Regular Expressions" by Robert D. Cameron that I thought others following
the XML-SIG and String-SIG might find intriguing.

From the article's abstract:

"The syntax of XML is simple enough that it is possible to parse an XML
document into a list of its markup and text items using a single regular
expression. Such a shallow parse of an XML document can be very useful for
the construction of a variety of lightweight XML processing tools. However,
complex regular expressions can be difficult to construct and even more
difficult to read. Using a form of literate programming for regular
expressions, this paper documents a set of XML shallow parsing expressions
that can be used a basis for simple, correct, efficient, robust and
language-independent XML shallow parsing. Complete shallow parser
implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex
are also given."

The paper was just published in Markup Languages: Theory and Practice,
Volume 1, Number 3, Summer 1999, pp. 61-88 (MIT Press) but is available
online as a technical report at

ftp://fas.sfu.ca/pub/cs/TR/1998/CMPT1998-17.html

The regex he creates ends up being about 1K (DFA and efficient), and can
match just about any part of and XML document you're interested in. What you
do next (entity reference extraction, attribute processing, error detection,
etc.) is left to the reader. ;-) The Perl version of the regex works as-is
with Python (of course). I've adjusted for Python syntax and attached
'REX.py' to this message for anyone interested. The article is clearly
written and the method he uses to build up the regex out of small,
well-reasoned pieces is very instructive (although his retangle and reweave
literate programming tools aren't detailed).

As a simple example,

>>> import REX
>>> print REX.ShallowParse('<MYTAG>This is <I>shallow parsing</I>.</MYTAG>')
['<MYTAG>', 'This is ', '<I>', 'shallow parsing', '</I>', '.', '</MYTAG>']

As a simple example of what can be done, I made a variation on the regex
that adds named groups, e.g., (?P<ElemTagSPE>...), to each of the components
(see ftp://starship.python.net/pub/crew/dni/REX/REX_detail.py ). The group
names that are not None show exactly which parts of the regex contributed to
the match, effectively categorizing each part of the document. Prettyprinted
output for the string

    '<tag1 att="123" att2="456">my &first; <i>shallow parse</i></tag1>'

looks like

D:\python> REX_detail.py

 MarkupSPE: '<tag1 att="123" att2="456">'
 ElemTagCE: 'tag1 att="123" att2="456">'
      Name: 'att2'
  NameStrt: 'a'
  NameChar: '2'
  AttValSE: '"456"'

    TextSE: 'my &first; '

 MarkupSPE: '<i>'
 ElemTagCE: 'i>'

    TextSE: 'shallow parse'

 MarkupSPE: '</i>'
  EndTagCE: 'i>'

 MarkupSPE: '</tag1>'
  EndTagCE: 'tag1>'

More detail than you need, but you can see where further processing (for
attributes, entities) would start.

--David Niergarth

------=_NextPart_000_00CB_01BF2B7C.49F97ED0
Content-Type: text/plain;
	name="REX.py"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="REX.py"

"""REX/Python

Based on Robert D. Cameron's REX/Perl 1.0.=20

Original copyright notice follows:

REX/Perl 1.0

Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
Technical Report TR 1998-17, School of Computing Science, Simon Fraser=20
University, November, 1998.
Copyright (c) 1998, Robert D. Cameron.=20
The following code may be freely used and distributed provided that
this copyright and citation notice remains intact and that modifications
or additions are clearly identified.

"""

import re

TextSE =3D "[^<]+"
UntilHyphen =3D "[^-]*-"
Until2Hyphens =3D UntilHyphen + "(?:[^-]" + UntilHyphen + ")*-"
CommentCE =3D Until2Hyphens + ">?"
UntilRSBs =3D "[^\\]]*](?:[^\\]]+])*]+"
CDATA_CE =3D UntilRSBs + "(?:[^\\]>]" + UntilRSBs + ")*>"
S =3D "[ \\n\\t\\r]+"
NameStrt =3D "[A-Za-z_:]|[^\\x00-\\x7F]"
NameChar =3D "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]"
Name =3D "(?:" + NameStrt + ")(?:" + NameChar + ")*"
QuoteSE =3D "\"[^\"]*\"|'[^']*'"
DT_IdentSE =3D S + Name + "(?:" + S + "(?:" + Name + "|" + QuoteSE + =
"))*"
MarkupDeclCE =3D "(?:[^\\]\"'><]+|" + QuoteSE + ")*>"
S1 =3D "[\\n\\r\\t ]"
UntilQMs =3D "[^?]*\\?+"
PI_Tail =3D "\\?>|" + S1 + UntilQMs + "(?:[^>?]" + UntilQMs + ")*>"
DT_ItemSE =3D "<(?:!(?:--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + =
")|\\?" + Name + "(?:" + PI_Tail + "))|%" + Name + ";|" + S
DocTypeCE =3D DT_IdentSE + "(?:" + S + ")?(?:\\[(?:" + DT_ItemSE + =
")*](?:" + S + ")?)?>?"
DeclCE =3D "--(?:" + CommentCE + ")?|\\[CDATA\\[(?:" + CDATA_CE + =
")?|DOCTYPE(?:" + DocTypeCE + ")?"
PI_CE =3D Name + "(?:" + PI_Tail + ")?"
EndTagCE =3D Name + "(?:" + S + ")?>?"
AttValSE =3D "\"[^<\"]*\"|'[^<']*'"
ElemTagCE =3D Name + "(?:" + S + Name + "(?:" + S + ")?=3D(?:" + S + =
")?(?:" + AttValSE + "))*(?:" + S + ")?/?>?"
MarkupSPE =3D "<(?:!(?:" + DeclCE + ")?|\\?(?:" + PI_CE + ")?|/(?:" + =
EndTagCE + ")?|(?:" + ElemTagCE + ")?)"
XML_SPE =3D TextSE + "|" + MarkupSPE

# The original Perl subroutine
#
# sub ShallowParse {=20
#   my($XML_document) =3D @_;
#   return $XML_document =3D~ /$XML_SPE/g;
# }

def ShallowParse(XML_document):
	return re.findall(XML_SPE, XML_document)

def _test():
	print ShallowParse('<MYTAG>This is <I>shallow parsing</I>.</MYTAG>')

if __name__ =3D=3D '__main__':
	_test()

# _test() output looks like
# ['<MYTAG>', 'This is ', '<I>', 'shallow parsing', '</I>', '.', =
'</MYTAG>']

------=_NextPart_000_00CB_01BF2B7C.49F97ED0--