[Tutor] Optional groups in RE's

Sun Apr 12 04:06:33 CEST 2009

Mark Tolonen <metolone+gmane at gmail.com> wrote:
> Your data looks like XML.  If it is actually well-formed XML, have you tried
> ElementTree?

It is XML. I used minidom from xml.dom, and it worked fine, except it
was ~16 times slower. I'm parsing a ~70 MB file, and the difference is
roughly 3 minutes with minidom versus 10 seconds with re's.

I used separate re's for each field I wanted, and it worked nicely.
(Each DOM call mapped one-to-one onto a re.search or re.finditer call.)
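
A minimal sketch of the one-re-per-field approach, for illustration only.
The sample data and the tag names (ship, name, title) are assumptions; the
real file's layout is not shown in this thread.

```python
import re

# Hypothetical sample; the real file's tags and layout may differ.
DATA = """<ship><name>Rover</name></ship>
<title>Captain</title>
<ship><name>Drifter</name></ship>
"""

# One compiled pattern per field. The non-greedy (.*?) stops at the
# first closing tag; re.DOTALL lets "." cross line breaks.
NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

# finditer for fields that repeat, search for a single occurrence.
names = [m.group(1) for m in NAME_RE.finditer(DATA)]
first_title = TITLE_RE.search(DATA)
title = first_title.group(1) if first_title else None
```

Each pattern stays simple because it only has to describe one field, which
is why the separate searches are easy to get right.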

This problem arose when I tried to do the match in one re.

I guess instead of minidom I could try lxml, which uses libxml2, which
is written in C.
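
Before reaching for lxml, the standard library's ElementTree (which Mark
suggested) has an incremental parser that may be worth timing. The sketch
below is untested against the real file; the element names and the
nesting of <title> inside <ship> are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def iter_ships(source):
    """Yield (name, title) for each <ship>, freeing elements as we go."""
    # iterparse reads the input incrementally instead of building
    # the whole tree up front.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ship":
            # findtext returns None when the child is missing.
            yield elem.findtext("name"), elem.findtext("title")
            elem.clear()  # drop the element's children once processed

# Hypothetical in-memory sample standing in for the real file.
sample = io.BytesIO(
    b"<fleet>"
    b"<ship><name>Rover</name><title>Captain</title></ship>"
    b"<ship><name>Drifter</name></ship>"
    b"</fleet>"
)
ships = list(iter_ships(sample))
```

iterparse also accepts a filename, so the same function could be pointed
at the file on disk.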

Kent Johnson <kent37 at tds.net> wrote:
> This re doesn't have to match anything after </ship> so it doesn't.
> You can force it to match to the end by adding $ at the end but that
> is not enough, you have to make the "</ship>.*?" *not* match <title>.
> One way to do that is to use [^<]*? instead of .*?:

Ah. Thanks.
Unfortunately, the input string is multi-line and doesn't end in </title>.
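
One possible way around both constraints, sketched under the assumption
that only whitespace separates </ship> from an optional <title>: consume
that whitespace greedily, then let the optional group try to match. No $
anchor is needed. The sample data and tag layout are made up.

```python
import re

# Hypothetical multi-line input that does not end in </title>.
DATA = """<ship><name>Rover</name></ship>
<title>Captain</title>
<ship><name>Drifter</name></ship>
<ship><name>Comet</name></ship>
<title>Pilot</title>
trailing text
"""

SHIP_RE = re.compile(
    r"<ship>.*?<name>(.*?)</name>.*?</ship>"  # required part
    r"\s*"                                    # greedy: skip newlines up to the next tag
    r"(?:<title>(.*?)</title>)?",             # optional group, attempted before giving up
    re.DOTALL,
)

# groups() gives (name, title); title is None when the group did not match.
pairs = [m.groups() for m in SHIP_RE.finditer(DATA)]
```

The greedy \s* matters here: a lazy .*? or [^<]*? in front of an optional
group is satisfied by matching nothing, so the group never gets a chance
unless something forces the match further right.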

Moos

P.S.

I'm still relatively new to RE's, or IRE's. sed, awk, grep, and perl
each have a different format for re's. grep alone has four different
versions of RE's!

Since the only form of re I'm using is "start(.*?)end" I was thinking
about writing a C program to do that.
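
A "start(.*?)end" match can also be done with plain string searching, no
re involved. The sketch below uses a hypothetical helper named between();
it makes no claim about speed, which would need timing on the real file.

```python
def between(text, start, end):
    """Return every substring found between start and end, left to right."""
    found = []
    pos = 0
    while True:
        i = text.find(start, pos)      # next opening delimiter
        if i == -1:
            break
        i += len(start)                # first character after it
        j = text.find(end, i)          # nearest closing delimiter (non-greedy)
        if j == -1:
            break
        found.append(text[i:j])
        pos = j + len(end)             # continue after this match
    return found

# Hypothetical usage with made-up delimiters.
items = between("<a>1</a> junk <a>two\nlines</a>", "<a>", "</a>")
```

Because str.find looks for literal text, the delimiters need no escaping
and line breaks inside a match need no special flag.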