extract occurrence of regular expression from elements of XML documents

Martin Schmidt martin.schmidt1 at gmail.com
Tue Mar 16 17:58:00 CET 2010


On Tue, Mar 16, 2010 at 11:56 AM, Martin Schmidt
<martin.schmidt1 at gmail.com> wrote:

> Thanks, Stefan.
> Actually I will have to run the searches I am interested in only a few
> times and therefore will drop performance concerns.
>
> Thanks for len(text.split()) .
> I will try it later.
>
> The text I am interested in is always in leaf elements.
>
> I have posted a concrete example incl. a representative XML file a few
> minutes ago.
> I hope this clarifies my problem.
>
> Rereading what I wrote, it admittedly sounds funny.
> What I meant was that I did not find a post that closely matches my problem
> (I know that the closeness needed in my case will seem excessive to more
> experienced Python/XML users).
>
> Best regards.
>
>   Martin
>
>
> P.S. Sorry for my late reply, but my Internet connection was down for a
> day.
>
>
>
>
>> ---------- Forwarded message ----------
>> From: Stefan Behnel <stefan_ml at behnel.de>
>> To: python-list at python.org
>> Date: Tue, 16 Mar 2010 08:50:30 +0100
>> Subject: Re: extract occurrence of regular expression from elements of XML
>> documents
>> Martin Schmidt, 15.03.2010 18:16:
>>
>>> I just started using Python a few weeks ago, and until last week I had
>>> no knowledge of XML.
>>> Obviously my programming knowledge is pretty basic.
>>> Now I would like to use Python in combination with ca. 2000 XML documents
>>> (about 30 kb each) to search for a certain regular expression within
>>> specific elements of these documents.
>>>
>>
>> 2000 * 30K isn't a huge problem, that's just 60M in total. If you just
>> have to do it once, drop your performance concerns and just get a solution
>> going. If you have to do it once a day, take care to use a tool that is not
>> too resource consuming. If you have strict requirements to do it once a
>> minute, use a fast machine with a couple of cores and do it in parallel. If
>> you have a huge request workload and want to reverse index the XML to do all
>> sorts of sophisticated queries on it, use a database instead.
>>
>>
>>> I would then like to record the number of occurrences of the regular
>>> expression within these elements.
>>> Moreover I would like to count the total number of words contained within
>>> these,
>>>
>>
>> len(text.split()) will give you those.
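
A minimal sketch of the two counts discussed above, assuming the element text is already available as a plain string. The sample sentence and the pattern are made up for illustration; only len(text.split()) comes from the thread.

```python
import re

# Hypothetical sample text standing in for one element's content.
text = "Python parses XML; python also counts words."

# Word count as suggested in the thread: split on runs of whitespace.
# Punctuation stays attached to its word ("XML;" counts as one word).
word_count = len(text.split())

# Occurrences of a regular expression: findall() returns every
# non-overlapping match, so its length is the number of hits.
hits = len(re.findall(r"python", text, flags=re.IGNORECASE))

print(word_count, hits)
```

Note that findall() returns groups rather than whole matches when the pattern contains capturing groups; the length of the list is still the number of matches.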
>>
>> BTW, is it document-style XML (with mixed content as in HTML) or is the
>> text always within a leaf element?
>>
>>
>>> and record the attribute of a higher level element that contains
>>> them.
>>>
>>
>> An example would certainly help here.
>>
>>
>>> I was trying to figure out the best way to do this, but got overwhelmed
>>> by the available information (e.g. posts using different approaches based
>>> on dom, sax, xpath, elementtree, expat).
>>> The outcome should be a file that lists the extracted attribute, the
>>> number of occurrences of the regular expression, and the total number of
>>> words.
>>> I did not find a post that addresses my problem.
>>>
>>
>> Funny that you say that after stating that you were overwhelmed by the
>> available information.
>>
>>
>>> If someone could help me with this I would really appreciate it.
>>>
>>
>> Most likely, the solution with the best simplicity/performance trade-off
>> would be to use xml.etree.cElementTree's iterparse(), intercept each
>> interesting tag name, and search its text/tail using the regexp. That's
>> doable in a couple of lines.
>>
>> But unless you provide more information, it's hard to give better advice.
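
A rough sketch of the iterparse() approach described above, not a definitive solution. It assumes the text of interest sits in leaf elements, as stated earlier in this message. The tag names "section" and "para", the attribute "id", the function name scan() and the sample document are all hypothetical, since the real XML was not shown here. It uses xml.etree.ElementTree, because cElementTree was folded into that module in Python 3.3 and removed in 3.9; on the Python 2 versions of this thread the import would be xml.etree.cElementTree.

```python
import io
import re
import xml.etree.ElementTree as ET


def scan(source, parent_tag, attr_name, leaf_tag, pattern):
    """Yield (attribute value, regex hits, word count) per parent element.

    Assumes parent elements are not nested inside each other.
    """
    regex = re.compile(pattern)
    hits = words = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == parent_tag:
                hits = words = 0  # new parent: reset the running totals
        elif elem.tag == leaf_tag:
            text = elem.text or ""  # leaf text is complete at its "end" event
            hits += len(regex.findall(text))
            words += len(text.split())
        elif elem.tag == parent_tag:
            yield elem.get(attr_name), hits, words
            elem.clear()  # free the subtree that was just processed


# Hypothetical document; real tag and attribute names will differ.
SAMPLE = b"""<docs>
  <section id="a1">
    <para>The cat sat on the mat.</para>
    <para>Another cat, another mat.</para>
  </section>
  <section id="b2">
    <para>No felines here.</para>
  </section>
</docs>"""

for row in scan(io.BytesIO(SAMPLE), "section", "id", "para", r"\bcat\b"):
    print(row)
```

scan() also accepts a file name instead of a file object, so looping over the 2000 files and writing each row with the csv module would produce the output file that was asked for.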
>>
>> Stefan
>>
>>
>>
>>