Python-list Digest, Vol 78, Issue 161

Martin Schmidt martin.schmidt1 at gmail.com
Tue Mar 16 17:56:15 CET 2010


Thanks, Stefan.
Actually, I will only have to run the searches I am interested in a few times,
so I will drop performance concerns.

Thanks for len(text.split()).
I will try it later.
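
A minimal sketch of how that word count could be combined with counting
regular expression matches in one element's text. The function name, the
sample string and the pattern are made up for illustration only:

```python
import re

def count_words_and_matches(text, pattern):
    """Return (word_count, match_count) for one element's text."""
    text = text or ""  # element.text is None for an empty element
    words = len(text.split())  # whitespace-separated tokens
    matches = sum(1 for _ in re.finditer(pattern, text))  # non-overlapping matches
    return words, matches

# Hypothetical sample text and pattern.
sample = "The cat sat on the mat; the cat slept."
print(count_words_and_matches(sample, r"\bcat\b"))
```

Note that split() counts punctuation attached to a word ("mat;") as part of
that word, which is probably good enough for a rough total.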

The text I am interested in is always in leaf elements.
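
Since the text sits in leaf elements, the iterparse() approach Stefan suggests
in the quoted message below might look roughly like this. It is only a sketch:
the tag names "section" and "para", the attribute "id" and the sample document
are assumptions, not taken from the real files, and it uses
xml.etree.ElementTree because cElementTree was removed in later Python
versions.

```python
import io
import re
import xml.etree.ElementTree as ET

def scan(source, parent_tag, attr_name, leaf_tag, pattern):
    """Return one (attribute, match_count, word_count) tuple per parent element."""
    regex = re.compile(pattern)
    rows = []
    # By default iterparse() reports "end" events, i.e. each element is
    # yielded once it has been parsed completely, children included.
    for _event, elem in ET.iterparse(source):
        if elem.tag != parent_tag:
            continue
        matches = words = 0
        for leaf in elem.iter(leaf_tag):
            text = leaf.text or ""  # .text is None for empty elements
            matches += sum(1 for _ in regex.finditer(text))
            words += len(text.split())
        rows.append((elem.get(attr_name), matches, words))
        elem.clear()  # free the subtree that has just been processed
    return rows

# Hypothetical sample document standing in for one of the XML files.
sample_xml = b"""<doc>
  <section id="a1"><para>spam and eggs and spam</para><para>no match here</para></section>
  <section id="b2"><para>spam</para></section>
</doc>"""

rows = scan(io.BytesIO(sample_xml), "section", "id", "para", r"spam")
for row in rows:
    print(row)
```

The sketch assumes that the parent elements are not nested inside each other.
Each tuple could then be written out as one line of the result file, for
example with the csv module.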

I have posted a concrete example incl. a representative XML file a few
minutes ago.
I hope this clarifies my problem.

Rereading what I wrote, it admittedly sounds funny.
What I meant was that I did not find a post that closely matches my problem (I
know that the closeness needed in my case will seem excessive to more
experienced Python/XML users).

Best regards.

  Martin


P.S. Sorry for my late reply, but my Internet connection was down for a day.




> ---------- Forwarded message ----------
> From: Stefan Behnel <stefan_ml at behnel.de>
> To: python-list at python.org
> Date: Tue, 16 Mar 2010 08:50:30 +0100
> Subject: Re: extract occurrence of regular expression from elements of XML
> documents
> Martin Schmidt, 15.03.2010 18:16:
>
>> I just started using Python a few weeks ago, and until last week I had
>> no knowledge of XML.
>> Obviously my programming knowledge is pretty basic.
>> Now I would like to use Python in combination with ca. 2000 XML documents
>> (about 30 kb each) to search for a certain regular expression within
>> specific elements of these documents.
>>
>
> 2000 * 30K isn't a huge problem, that's just 60M in total. If you just have
> to do it once, drop your performance concerns and just get a solution going.
> If you have to do it once a day, take care to use a tool that is not too
> resource consuming. If you have strict requirements to do it once a minute,
> use a fast machine with a couple of cores and do it in parallel. If you have
> a huge request workload and want to reverse index the XML to do all sorts of
> sophisticated queries on it, use a database instead.
>
>
>> I would then like to record the number of occurrences of the regular
>> expression within these elements.
>> Moreover I would like to count the total number of words contained within
>> these,
>>
>
> len(text.split()) will give you those.
>
> BTW, is it document-style XML (with mixed content as in HTML) or is the
> text always within a leaf element?
>
>
>> and record the attribute of a higher level element that contains them.
>>
>
> An example would certainly help here.
>
>
>> I was trying to figure out the best way to do this, but got overwhelmed
>> by the available information (e.g. posts using different approaches based
>> on dom, sax, xpath, elementtree, expat).
>> The outcome should be a file that lists the extracted attribute, the number
>> of occurrences of the regular expression, and the total number of words.
>> I did not find a post that addresses my problem.
>>
>
> Funny that you say that after stating that you were overwhelmed by the
> available information.
>
>
>> If someone could help me with this I would really appreciate it.
>>
>
> Most likely, the solution with the best simplicity/performance trade-off
> would be xml.etree.cElementTree's iterparse(), intercept on each interesting
> tag name, and search its text/tail using the regexp. That's doable in a
> couple of lines.
>
> But unless you provide more information, it's hard to give better advice.
>
> Stefan
>
>
>
>
> ---------- Forwarded message ----------
> From: Chris Rebert <clp2 at rebertia.com>
> To: "Lawrence D'Oliveiro" <ldo at geek-central.gen.nz>
> Date: Tue, 16 Mar 2010 00:52:07 -0700
> Subject: Re: import antigravity
> On Tue, Mar 16, 2010 at 12:40 AM, Lawrence D'Oliveiro
> <ldo at geek-central.gen.new_zealand> wrote:
> > Subtle...
>
> You're a bit behind the times.
> If my calculations are right, that comic is over 2 years old.
>
> Cheers,
> Chris
>
>
>
> ---------- Forwarded message ----------
> From: Stefan Behnel <stefan_ml at behnel.de>
> To: python-list at python.org
> Date: Tue, 16 Mar 2010 08:51:58 +0100
> Subject: Re: import antigravity
> Lawrence D'Oliveiro, 16.03.2010 08:40:
>
>> Subtle...
>>
>
> Absolutely.
>
>  Python 2.4.6 (#2, Jan 21 2010, 23:45:25)
>  [GCC 4.4.1] on linux2
>  Type "help", "copyright", "credits" or "license" for more information.
>  >>> import antigravity
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
>  ImportError: No module named antigravity
>
>
> Stefan
>
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>