Parsing markup.
Alan Meyer
ameyer2 at yahoo.com
Mon Nov 29 20:46:08 EST 2010
On 11/29/2010 11:20 AM, Joe Goldthwaite wrote:
> Hi MRAB,
>
> I was trying to avoid regex because my poor old brain has trouble with it. I
> have to admit though, that line is slick! I'll have to go through my regex
> documentation to try and figure out what it actually means.
Personally, I'd be hesitant to use a regex. It can be done and I've
done it myself on occasion when I had a simple job to do and a very,
very, very well defined target.
The problem with using regular expressions is that there are many
variations in the text of valid XML. There can be namespaces,
attributes, newlines in surprising places, unexpected character
encodings, alternative quoting styles (e.g. id='123' or id = "123"),
character entities ("&lt;") and possibly other things that I haven't
thought of.
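A rough sketch of how a few of those variations trip up a regex. The
pattern and the sample strings here are made up for illustration (they
are not from the original problem); the pattern is the sort of thing one
might write for a target that always looks like <p id="123">.

```python
import re

# Hypothetical pattern: expects exactly <p id="..."> with double quotes,
# one space after 'p', and no whitespace around the equals sign.
NAIVE = re.compile(r'<p id="(\d+)">')

samples = [
    '<p id="123">',            # the form the pattern was written for
    "<p id='123'>",            # single quotes, equally legal XML
    '<p id = "123">',          # whitespace around '='
    '<p\n   id="123">',        # newline inside the tag
    '<p class="x" id="123">',  # another attribute comes first
]

def naive_ids(texts):
    """Return the id captured by NAIVE for each text, or None if no match."""
    results = []
    for text in texts:
        m = NAIVE.search(text)
        results.append(m.group(1) if m else None)
    return results

found = naive_ids(samples)
print(found)
```

Only the first sample matches. The other four are legal XML for the
same element, and the pattern skips them without raising any error.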
The parser authors have thought of those things and written parsing code
that works properly on legal XML, even in these surprising cases that
rarely show up in your data but can. You might think there won't ever
be any surprises that break your regexes, but then some new programmer
appears on the project and thinks, "Aha, this is XML, I can solve my
problem by adding a new attribute to that 'p' tag." He will be tearing
his hair and muttering sentences that happen to have your name in them
when he discovers that his perfectly legal XML content won't parse
correctly in your regex-based parser. The muttering will get much
louder if he only discovers this after much data has been processed and
important items silently skipped over by the parser.
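For comparison, a minimal sketch using the standard library's
xml.etree.ElementTree, which is just one of several parsers that would
do. The documents below are invented examples, each writing the same
'p' element in a different but legal way, including an entity in the
text.

```python
import xml.etree.ElementTree as ET

# Invented sample documents: same element, different legal spellings.
variants = [
    '<doc><p id="123">a &lt; b</p></doc>',            # plain form
    "<doc><p id='123'>a &lt; b</p></doc>",            # single quotes
    '<doc><p id = "123">a &lt; b</p></doc>',          # spaces around '='
    '<doc><p\n   id="123">a &lt; b</p></doc>',        # newline inside the tag
    '<doc><p class="x" id="123">a &lt; b</p></doc>',  # extra attribute first
]

def parsed_ids(docs):
    """Parse each document and collect (id, text) for every 'p' element."""
    out = []
    for doc in docs:
        root = ET.fromstring(doc)
        for p in root.iter("p"):
            out.append((p.get("id"), p.text))
    return out

results = parsed_ids(variants)
print(results)
```

Every variant yields the same id and the same decoded text, and a
document that is not well formed raises ET.ParseError instead of being
silently skipped. Namespaces would still need explicit handling; this
sketch does not cover them.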
Alan