[Tutor] making a custom file parser?

Mon Jan 9 03:06:12 CET 2012

On Mon, Jan 9, 2012 at 2:19 AM, Devin Jeanpierre <jeanpierreda at gmail.com> wrote:
>> Parsing XML with regular expressions is generally very bad idea. In
>> the general case, it's actually impossible. XML is not what is called
>> a regular language, and therefore cannot be parsed with regular
>> expressions. You can use regular expressions to grab a limited amount
>> of data from a limited set of XML files, but this is dangerous, hard,
>> and error-prone.
>
> Python regexes aren't regular, and this isn't XML.
>
> A working XML parser has been written using .NET regexes (sorry, no
> citation -- can't find it), and they only have one extra feature
> (recursion, of course). And it was dreadfully ugly and nasty and
> probably terrible to maintain -- that's the real cost of regexes.
>

IIRC, Python's only non-regular feature is backreferences though; I'm
pretty sure that isn't enough to parse XML. It does not make it
powerful enough to parse context-free languages. I really would like
that citation though, tried googling for it but not much turned up.
I'm not calling bs or anything, I don't know anything about .net
regexes and I'll readily believe it can be done (I just want to see
the code for myself). But really I still wouldn't dare try without a
feature set like perl 6's regexes. And even then..

You're technically correct (it's the best kind), but I feel like it
doesn't really take away the general correctness of my advice ;)

> In particular, his data actually does look regular.
>

Quite right. We haven't seen enough of it to be sure, but that little
bite seems parseable enough with some basic string methods and one or
two regexes. That's really all you need, and trying to do the whole
thing with pure regex is just needlessly overcomplicating things (I'm
pretty sure we all actually agree on that).

>> I'll assume that said "(.*)". There's still a few problems: < and >
>> shouldn't be escaped, which is why you're not getting any matches.
>> Also you shouldn't use * because it is greedy, matching as much as
>> possible. So it would match everything in between the first <unit> and
>> the last </unit> tag in the file, including other <unit></unit> tags
>> that might show up.
>
> On the "can you do work with this with regexes" angle: if units can be
> nested, then neither greedy nor non-greedy matching will work. That's
> a particular case where regular expressions can't work for your data.
>
>> Test it carefully, ditch elementtree, use as little regexes as
>> possible (string functions are your friends! startswith, split, strip,
>> et cetera) and you might end up with something that is only slightly
>> ugly and mostly works. That said, I'd still advise against it. turning
>> the files into valid XML and then using whatever XML parser you fancy
>> will probably be easier.
>
> He'd probably do that using regexes.
>

Yeah, that's what I was thinking when I said it too. Something like,
one regex to quote attributes, and one that adds close tags at the
earliest opportunity. Like right before a newline? It looks okay based
on just that sample, but it's really hard to say. The viability of
regexes depends so much on the dataset you have. If you can make the
dataset valid XML with just three regexes (quotes, end tags, comments)
then just parse it that way, that sounds like the simplest possible
option.

> Easiest way is probably to write a real parser using some PEG or CFG
> thingy. Less error-prone.
>

You mean like flex/bison? May be overkill, but then again, maybe not.
So much depends on the data.

> Overall agree with advice, though. Just being picky. Sorry.
>
> -- Devin
>
>

I love being picky myself, so I don't mind, as long as there is a
disclaimer somewhere ;) Cheers,
Hugo