[Tutor] making a custom file parser?

Sat Jan 7 21:15:15 CET 2012

On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall <mehgcap at gmail.com> wrote:
> I had planned to parse myself, but am not sure how to go about it. I
> assume regular expressions, but I couldn't even find the amount of
> units in the file by using:
> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
> unitCount=unitReg.search(fileContents)
> print "number of units: "+unitCount.len(groups())
>
> I just get an exception that "None type object has no attribute
> groups", meaning that the search was unsuccessful. What I was hoping
> to do was to grab everything between the opening and closing unit
> tags, then read it one at a time and parse further. There is a tag
> inside a unit tag called AttackTable which also terminates, so I would
> need to pull that out and work with it separately. I probably just
> have misunderstood how regular expressions and groups work...
>

Parsing XML with regular expressions is generally a very bad idea. In
the general case, it's actually impossible. XML is not what is called
a regular language, and therefore cannot be parsed with regular
expressions. You can use regular expressions to grab a limited amount
of data from a limited set of XML files, but this is dangerous, hard,
and error-prone.
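
To make the "not a regular language" point concrete, here is a small
sketch. The <group> tag and both sample strings are made up for
illustration and are not from your file. Neither the lazy nor the
greedy pattern copes with both nested and side-by-side tags:

```python
import re

# Hypothetical inputs: one nested pair, one pair of siblings.
nested = "<group><group>x</group></group>"
siblings = "<group>a</group><group>b</group>"

lazy = re.compile(r"<group>(.*?)</group>")
greedy = re.compile(r"<group>(.*)</group>")

# The lazy pattern stops at the FIRST closing tag, so on nested input
# it pairs the outer opening tag with the inner closing tag.
lazy_on_nested = lazy.search(nested).group(1)

# The greedy pattern runs to the LAST closing tag, so on sibling input
# it swallows both elements as if they were one.
greedy_on_siblings = greedy.search(siblings).group(1)

print(lazy_on_nested)      # <group>x
print(greedy_on_siblings)  # a</group><group>b
```

A real parser keeps track of nesting depth, which a single regular
expression of this kind does not do.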

As long as you realize this, though, you could possibly give it a shot
(here be dragons, you have been warned).

> unitReg=re.compile(r"\<unit\>(*)\</unit\>")

This is probably not what you actually did, because it fails with a
different error:

>>> a = re.compile(r"\<unit\>(*)\</unit\>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 188, in compile
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
sre_constants.error: nothing to repeat

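I'll assume that said "(.*)". There are still a few problems. First, <
and > don't need to be escaped. In Python's re module the backslashes
are harmless, so they are not why the search fails; a more likely
cause (assuming your units span several lines) is that . does not
match newlines unless you pass re.DOTALL. Second, .* is greedy,
matching as much as possible. It would match everything between the
first <unit> and the last </unit> tag in the file, including any other
<unit></unit> tags in between. What you want is more like this:

unit_reg = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)

Use unit_reg.findall(fileContents) to get every unit; search() stops
at the first match.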
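
Here is a rough sketch of the difference, so you can see it for
yourself. The sample text is invented (I don't know what your file
looks like), and I'm assuming each unit spans several lines, which is
why re.DOTALL is passed so that . also matches newlines:

```python
import re

# Hypothetical sample; the real file contents are not shown in this thread.
sample = "<unit>\nname = knight\n</unit>\n<unit>\nname = archer\n</unit>\n"

greedy = re.compile(r"<unit>(.*)</unit>", re.DOTALL)
lazy = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)
no_dotall = re.compile(r"<unit>(.*?)</unit>")

greedy_units = greedy.findall(sample)        # one big match, first tag to last
lazy_units = lazy.findall(sample)            # one match per unit
no_dotall_units = no_dotall.findall(sample)  # nothing: . stops at newlines

print("greedy: %d" % len(greedy_units))
print("lazy: %d" % len(lazy_units))
print("without DOTALL: %d" % len(no_dotall_units))
```

findall returns a list with the contents of the group for every match,
so len() of that list is the number of units.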

Test it carefully, ditch elementtree, use as few regexes as
possible (string functions are your friends! startswith, split, strip,
et cetera) and you might end up with something that is only slightly
ugly and mostly works. That said, I'd still advise against it. Turning
the files into valid XML and then using whatever XML parser you fancy
will probably be easier. Adding quotes and closing tags and removing
comments with regexes is still bad, but easier than parsing the whole
thing with regexes.
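
For comparison, this is roughly what the parser route could look like
once a file has been repaired into valid XML. It's only a sketch: the
<units> root, the <name> and <attack> tags and the target attribute
are guesses of mine, and only <unit> and AttackTable come from your
description. I've used ElementTree from the standard library, but any
XML parser would do:

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-repaired input. The element layout is assumed.
valid_xml = """<units>
  <unit>
    <name>knight</name>
    <AttackTable>
      <attack target="archer">5</attack>
    </AttackTable>
  </unit>
  <unit>
    <name>archer</name>
    <AttackTable>
      <attack target="knight">2</attack>
    </AttackTable>
  </unit>
</units>"""

root = ET.fromstring(valid_xml)

units = []
for unit in root.findall("unit"):
    name = unit.findtext("name")
    attacks = {}
    table = unit.find("AttackTable")  # the nested tag, handled separately
    if table is not None:
        for attack in table.findall("attack"):
            attacks[attack.get("target")] = int(attack.text)
    units.append((name, attacks))

print("number of units: %d" % len(units))
```

The nested AttackTable needs no special treatment here, because the
parser tracks the nesting for you.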

HTH,
Hugo