[Tutor] making a custom file parser?

Devin Jeanpierre jeanpierreda at gmail.com
Mon Jan 9 14:39:53 CET 2012


> IIRC, Python's only non-regular feature is backreferences though

Probably. I'm not too familiar with a couple of the other features or
how their semantics work, in particular the (?(id)yes|no) conditional syntax.
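For the curious, Python's re module does support that conditional syntax: (?(1)yes|no) matches the "yes" branch if group 1 participated in the match and the "no" branch otherwise. A small illustrative sketch (the pattern is made up for demonstration):

```python
import re

# Group 1 is an optional opening quote; the conditional (?(1)"|)
# then demands a closing quote only if the opening one was matched.
pattern = re.compile(r'^(")?\w+(?(1)"|)$')

print(pattern.match('"hello"'))  # matches: quote opened and closed
print(pattern.match('hello'))    # matches: no quotes at all
print(pattern.match('"hello'))   # None: opening quote never closed
```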

> I'm not calling bs or anything, I don't know anything about .net
> regexes and I'll readily believe it can be done (I just want to see
> the code for myself).

They add the ability to push and pop from a stack, which makes their
regular expressions at least as powerful as push-down automata, which
are equivalent in power to context-free grammars, which means they can
match XML. I think this has been well known in the .NET community for
years, but nobody had actually done it, and nobody ever mentioned it.
It's a dirty secret you don't tell the newbies, because then they
think regexps are fine to use for everything.
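To make the push-down-automaton point concrete: a regex for the tokens plus an explicit stack is exactly the extra machinery that lets you track arbitrary nesting depth, which no finite automaton can do. A minimal sketch in plain Python (function and tag names invented for illustration):

```python
import re

TAG = re.compile(r'<(/?)(\w+)>')

def check_nesting(text):
    """Return True if the tags in `text` are properly nested.

    The regex only tokenizes; the stack supplies the unbounded
    memory -- together they form a push-down automaton, which is
    the kind of power .NET's balancing groups bolt onto the engine.
    """
    stack = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)            # push on an open tag
        elif not stack or stack.pop() != name:
            return False                  # close with no matching open
    return not stack                      # any unclosed tags left?

print(check_nesting('<unit><unit></unit></unit>'))  # True
print(check_nesting('<unit></chunk>'))              # False
```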

It's also why I don't like the "this isn't regular, so don't use
regular expressions" spiel. We call things regular expressions even
when they're context-free parsing expressions! The term has meaning,
but it's no longer tied to finite state automata, and any argument
along those lines is just waiting to be broken by the next feature
addition to the re module.
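Backreferences alone already push Python's re past the regular languages. One classic illustration: the language of "a word repeated twice," which no finite automaton can recognize, but a one-line pattern can:

```python
import re

# (\w+)\1 matches any word immediately followed by itself -- the
# non-regular language { ww : w a word }. A finite automaton would
# need unbounded memory to remember w; the backreference supplies it.
square = re.compile(r'^(\w+)\1$')

print(square.match('abcabc'))  # matches: 'abc' repeated
print(square.match('abcabd'))  # None: second half differs
```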

Anyway, I found the reference I was thinking of:
http://porg.es/blog/so-it-turns-out-that-dot-nets-regex-are-more-powerful-than-i-originally-thought

> Quite right. We haven't seen enough of it to be sure, but that little
> bit seems parseable enough with some basic string methods and one or
> two regexes. That's really all you need, and trying to do the whole
> thing with pure regex is just needlessly overcomplicating things (I'm
> pretty sure we all actually agree on that).

Oh, I dunno. If the regex would be simple, it'd be the simplest
solution. As soon as you have order-independence, though...
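The order-independence problem in a nutshell: fields that may appear in any order force a plain regex to spell out one alternative per permutation, n! branches for n fields. A hedged sketch with two invented attributes:

```python
import re

# Two hypothetical fields that may appear in either order. With a
# plain regex each ordering needs its own alternative, so the
# pattern grows factorially with the number of fields.
either_order = re.compile(
    r'^(?:name=(\w+) size=(\w+)|size=(\w+) name=(\w+))$'
)

for line in ('name=foo size=12', 'size=12 name=foo'):
    m = either_order.match(line)
    name = m.group(1) or m.group(4)  # whichever branch matched
    size = m.group(2) or m.group(3)
    print(name, size)  # foo 12, both times
```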

> You mean like flex/bison? May be overkill, but then again, maybe not.
> So much depends on the data.

Flex/Bison are a little old-school and difficult to deal with. I'm
thinking more of LEPL or PyMeta or something along those lines.
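A tiny hand-rolled recursive-descent parser gives the flavor of what those libraries automate — you declare the grammar instead of writing the plumbing below by hand. This sketch parses nested <unit>...</unit> blocks; the tag name and function names are invented for illustration, not taken from any library's API:

```python
import re

# Tokenize into open tags, close tags, and runs of text.
TOKEN = re.compile(r'<unit>|</unit>|[^<]+')

def parse_units(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def unit():
        nonlocal pos
        children = []
        while pos < len(tokens) and tokens[pos] != '</unit>':
            if tokens[pos] == '<unit>':
                pos += 1
                children.append(unit())       # recurse into nested unit
            else:
                children.append(tokens[pos])  # plain text child
                pos += 1
        if pos == len(tokens):
            raise ValueError('missing </unit>')
        pos += 1                              # consume '</unit>'
        return ('unit', children)

    if not tokens or tokens[0] != '<unit>':
        raise ValueError('input must start with <unit>')
    pos = 1
    return unit()

print(parse_units('<unit>a<unit>b</unit></unit>'))
# ('unit', ['a', ('unit', ['b'])])
```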

-- Devin

On Sun, Jan 8, 2012 at 9:06 PM, Hugo Arts <hugo.yoshi at gmail.com> wrote:
> On Mon, Jan 9, 2012 at 2:19 AM, Devin Jeanpierre <jeanpierreda at gmail.com> wrote:
>>> Parsing XML with regular expressions is generally a very bad idea. In
>>> the general case, it's actually impossible. XML is not what is called
>>> a regular language, and therefore cannot be parsed with regular
>>> expressions. You can use regular expressions to grab a limited amount
>>> of data from a limited set of XML files, but this is dangerous, hard,
>>> and error-prone.
>>
>> Python regexes aren't regular, and this isn't XML.
>>
>> A working XML parser has been written using .NET regexes (sorry, no
>> citation -- can't find it), and they only have one extra feature
>> (recursion, of course). And it was dreadfully ugly and nasty and
>> probably terrible to maintain -- that's the real cost of regexes.
>>
>
> IIRC, Python's only non-regular feature is backreferences though; I'm
> pretty sure that isn't enough to parse XML. It does not make it
> powerful enough to parse context-free languages. I really would like
> that citation though, tried googling for it but not much turned up.
> I'm not calling bs or anything, I don't know anything about .net
> regexes and I'll readily believe it can be done (I just want to see
> the code for myself). But really I still wouldn't dare try without a
> feature set like perl 6's regexes. And even then..
>
> You're technically correct (it's the best kind), but I feel like it
> doesn't really take away the general correctness of my advice ;)
>
>> In particular, his data actually does look regular.
>>
>
> Quite right. We haven't seen enough of it to be sure, but that little
> bit seems parseable enough with some basic string methods and one or
> two regexes. That's really all you need, and trying to do the whole
> thing with pure regex is just needlessly overcomplicating things (I'm
> pretty sure we all actually agree on that).
>
>>> I'll assume that said "(.*)". There's still a few problems: < and >
>>> shouldn't be escaped, which is why you're not getting any matches.
>>> Also you shouldn't use * because it is greedy, matching as much as
>>> possible. So it would match everything in between the first <unit> and
>>> the last </unit> tag in the file, including other <unit></unit> tags
>>> that might show up.
>>
>> On the "can you do work with this with regexes" angle: if units can be
>> nested, then neither greedy nor non-greedy matching will work. That's
>> a particular case where regular expressions can't work for your data.
>>
>>> Test it carefully, ditch elementtree, use as few regexes as
>>> possible (string functions are your friends! startswith, split, strip,
>>> et cetera) and you might end up with something that is only slightly
>>> ugly and mostly works. That said, I'd still advise against it. turning
>>> the files into valid XML and then using whatever XML parser you fancy
>>> will probably be easier.
>>
>> He'd probably do that using regexes.
>>
>
> Yeah, that's what I was thinking when I said it too. Something like,
> one regex to quote attributes, and one that adds close tags at the
> earliest opportunity. Like right before a newline? It looks okay based
> on just that sample, but it's really hard to say. The viability of
> regexes depends so much on the dataset you have. If you can make the
> dataset valid XML with just three regexes (quotes, end tags, comments)
> then just parse it that way, that sounds like the simplest possible
> option.
>
>> Easiest way is probably to write a real parser using some PEG or CFG
>> thingy. Less error-prone.
>>
>
> You mean like flex/bison? May be overkill, but then again, maybe not.
> So much depends on the data.
>
>> Overall agree with advice, though. Just being picky. Sorry.
>>
>> -- Devin
>>
>>
>
> I love being picky myself, so I don't mind, as long as there is a
> disclaimer somewhere ;) Cheers,
> Hugo
