Parsing markup.

Mon Nov 29 20:46:08 EST 2010

On 11/29/2010 11:20 AM, Joe Goldthwaite wrote:
> Hi MRAB,
>
> I was trying to avoid regex because my poor old brain has trouble with it. I
> have to admit though, that line is slick!  I'll have to go through my regex
> documentation to try and figure out what it actually means.

Personally, I'd be hesitant to use a regex.  It can be done and I've 
done it myself on occasion when I had a simple job to do and a very, 
very, very well defined target.

The problem with using regular expressions is that there are many 
variations in the text of valid XML.  There can be namespaces, 
attributes, newlines in surprising places, unexpected character 
encodings, alternative quoting styles (e.g. id='123' or id = "123"), 
character entities ("&lt;") and possibly other things that I haven't 
thought of.
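
To make that concrete, here is a minimal sketch in Python.  The sample 
strings and the regex are made up for illustration (they are not from 
the earlier messages in this thread); the parser is the standard 
library's xml.etree.ElementTree.  It writes one element three legal 
ways and compares a naive pattern against a real parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same element spelled three legal ways.
variants = [
    '<item id="123">a &lt; b</item>',         # the form the regex expects
    "<item id='123'>a &lt; b</item>",         # single-quoted attribute
    '<item\n   id = "123" >a &lt; b</item>',  # newline and spaces in the tag
]

# A naive pattern written with only the first spelling in mind.
naive = re.compile(r'<item id="(\d+)">([^<]*)</item>')

# Which variants does the regex recognise at all?
regex_hits = [naive.search(v) is not None for v in variants]

# The parser normalises quoting and whitespace and decodes the entity.
parsed = []
for v in variants:
    elem = ET.fromstring(v)
    parsed.append((elem.get("id"), elem.text))

print(regex_hits)  # [True, False, False]
print(parsed)      # [('123', 'a < b'), ('123', 'a < b'), ('123', 'a < b')]
```

Even where the regex does match, it hands back the raw text 
"a &lt; b", so decoding the entity would be yet another job to do by 
hand.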

The parser authors have thought of those things and written parsing code 
that works properly on legal XML, even in these surprising cases that 
rarely show up in your data but can.  You might think there won't ever 
be any surprises that break your regexes, but then some new programmer 
appears on the project and thinks, Aha, this is XML, I can solve my 
problem by adding a new attribute to that 'p' tag.  He will be tearing 
his hair and muttering sentences that happen to have your name in them 
when he discovers that his perfectly legal XML content won't parse 
correctly in your regex-based parser.  The muttering will get much 
louder if he only discovers this after much data has been processed and 
important items silently skipped over by the parser.
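
A small sketch of that silent failure, again in Python with invented 
data (the document and the "class" attribute are assumptions for 
illustration only).  The regex raises no error; it simply returns 
fewer items than the document contains.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: someone has added an attribute to the second <p>.
doc = """<doc>
<p>first</p>
<p class="note">second</p>
<p>third</p>
</doc>"""

# The regex only knows about a bare <p> tag, so the second item is
# skipped without any warning.
by_regex = re.findall(r"<p>(.*?)</p>", doc)

# The parser matches on the element name, whatever attributes it carries.
by_parser = [p.text for p in ET.fromstring(doc).iter("p")]

print(by_regex)   # ['first', 'third']
print(by_parser)  # ['first', 'second', 'third']
```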

     Alan