[Tutor] Optional groups in RE's

Kent Johnson kent37 at tds.net
Sun Apr 12 05:07:02 CEST 2009


On Sat, Apr 11, 2009 at 10:06 PM, Moos Heintzen <iwasroot at gmail.com> wrote:
> Mark Tolonen <metolone+gmane at gmail.com> wrote:
>> Your data looks like XML.  If it is actually well-formed XML, have you tried
>> ElementTree?
>
> It is XML. I used minidom from xml.dom, and it worked fine, except it
> was ~16 times slower. I'm parsing a ~70mb file, and the difference is
> 3 minutes to 10 seconds with re's.
>
> I used separate re's for each field I wanted, and it worked nicely.
> (1-1 between DOM calls and re.search and re.finditer)
>
> This problem arose when I tried to do the match in one re.
>
> I guess instead of minidom I could try lxml, which uses libxml2, which
> is written in C.

ElementTree is likely faster than minidom; it has a C implementation.
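
For a file that size, incremental parsing with iterparse() is worth a try.
Here is a minimal sketch, not a tested solution for your file: the <fleet>
wrapper, the <name> tag, and the iter_ships() helper are made up for
illustration, since I haven't seen your real layout. On recent Python 3
versions xml.etree.ElementTree picks up the C accelerator automatically when
it is available; on Python 2.5 and later 2.x you would import
xml.etree.cElementTree instead.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file layout was not shown in this thread.
sample = b"""<fleet>
  <ship><name>Enterprise</name><title>Captain</title></ship>
  <ship><name>Voyager</name></ship>
</fleet>"""

def iter_ships(source):
    """Yield (name, title) per <ship>; title is None when the tag is absent."""
    # iterparse() accepts a filename or a file object opened in binary mode.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ship":
            yield elem.findtext("name"), elem.findtext("title")
            elem.clear()  # drop the children we have already processed

ships = list(iter_ships(io.BytesIO(sample)))
print(ships)
```

Optional fields come for free here: findtext() returns None for a missing
child, so there is no optional group to get right.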

> Kent Johnson <kent37 at tds.net> wrote:
>> This re doesn't have to match anything after </ship> so it doesn't.
>> You can force it to match to the end by adding $ at the end but that
>> is not enough, you have to make the "</ship>.*?" *not* match <title>.
>> One way to do that is to use [^<]*? instead of .*?:
>
> Ah. Thanks.
> Unfortunately, the input string is multi-line, and doesn't end in </title>

Perhaps you should show your actual input then.
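
In the meantime, here is a small sketch of the [^<]*? plus $ idea on
made-up, single-line data. The sample strings and both patterns are my own
assumptions, not your actual re or input, so a multi-line record that does
not end in </title> will need a different anchor than $.

```python
import re

# Hypothetical sample data; the real input was not shown in this thread.
with_title = "<ship>Enterprise</ship> <title>Captain</title>"
without_title = "<ship>Voyager</ship> "

# Lazy .*? before an optional group: the group is allowed to match nothing,
# so the re stops right after </ship> and never captures the title.
loose = re.compile(r"<ship>(.*?)</ship>.*?(?:<title>(.*?)</title>)?")

# [^<]*? cannot step over the "<" of <title>, and $ forces the match to
# reach the end of the string, so the optional group has to be tried.
strict = re.compile(r"<ship>(.*?)</ship>[^<]*?(?:<title>(.*?)</title>)?$")

print(loose.search(with_title).groups())
print(strict.search(with_title).groups())
print(strict.search(without_title).groups())
```

The first print shows the title group coming back as None even though the
title is present; the second captures it, and the third shows the group is
still optional when there is no <title> at all.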

Kent

