<br><br><div class="gmail_quote">On Tue, Mar 16, 2010 at 11:56 AM, Martin Schmidt <span dir="ltr"><<a href="mailto:martin.schmidt1@gmail.com">martin.schmidt1@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="gmail_quote"><div>Thanks, Stefan.</div><div>Actually, I will only have to run the searches I am interested in a few times, so I will drop the performance concerns.</div><div><br></div><div>Thanks for len(text.split()).</div>


<div>I will try it later.</div><div><br></div><div>The text I am interested in is always in leaf elements.</div><div><br></div><div>I posted a concrete example, including a representative XML file, a few minutes ago.</div>


<div>I hope this clarifies my problem.</div><div><br></div><div>Rereading it, what I wrote admittedly sounds funny.</div><div>What I meant was that I did not find a post that closely matches my problem (I know that the closeness I need will seem excessive to more experienced Python/XML users).</div>


<div><br></div><div>Best regards.</div><div><br></div><div>  Martin</div><div><br></div><div><br></div><div>P.S. Sorry for my late reply, but my Internet connection was down for a day.</div><div><br></div><div><br></div>

<div>

 </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

---------- Forwarded message ----------<br>From: Stefan Behnel <<a href="mailto:stefan_ml@behnel.de" target="_blank">stefan_ml@behnel.de</a>><br>To: <a href="mailto:python-list@python.org" target="_blank">python-list@python.org</a><br>


Date: Tue, 16 Mar 2010 08:50:30 +0100<br>

Subject: Re: extract occurrence of regular expression from elements of XML documents<br>Martin Schmidt, 15.03.2010 18:16:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I have just started to use Python a few weeks ago and until last week I had<br>

no knowledge of XML.<br>

Obviously my programming knowledge is pretty basic.<br>

Now I would like to use Python in combination with ca. 2000 XML documents<br>

(about 30 kb each) to search for certain regular expression within specific<br>

elements of these documents.<br>

</blockquote>

<br>

2000 * 30K isn't a huge problem, that's just 60M in total. If you just have to do it once, drop your performance concerns and just get a solution going. If you have to do it once a day, take care to use a tool that is not too resource consuming. If you have strict requirements to do it once a minute, use a fast machine with a couple of cores and do it in parallel. If you have a huge request workload and want to reverse index the XML to do all sorts of sophisticated queries on it, use a database instead.<br>


<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I would then like to record the number of occurrences of the regular<br>

expression within these elements.<br>

Moreover I would like to count the total number of words contained within<br>

these,<br>

</blockquote>

<br>

len(text.split()) will give you those.<br>
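<br>
A minimal sketch of that word count, paired with a count of regular expression hits in the same text. The function name and the sample string are illustrative only and do not come from the thread:<br>

```python
import re


def count_words_and_matches(text, pattern):
    """Return (word_count, match_count) for one piece of element text."""
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines and repeated spaces do not produce empty "words".
    word_count = len(text.split())
    # finditer yields one match object per non-overlapping match.
    match_count = sum(1 for _ in re.finditer(pattern, text))
    return word_count, match_count


# Hypothetical sample text, made up for this sketch.
sample_counts = count_words_and_matches("the cat sat on the mat", r"\bthe\b")
```

Note that splitting on whitespace treats punctuation attached to a word as part of that word, which is usually fine for a rough total.<br>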

<br>

BTW, is it document-style XML (with mixed content as in HTML) or is the text always within a leaf element?<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

and record the attribute of a higher level element that contains<br>

them.<br>

</blockquote>

<br>

An example would certainly help here.<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I was trying to figure out the best way how to do this, but got overwhelmed<br>

by the available information (e.g. posts using different approaches based on<br>

dom, sax, xpath, elementtree, expat).<br>

The outcome should be a file that lists the extracted attribute, the number<br>

of occurrences of the regular expression, and the total number of words.<br>

I did not find a post that addresses my problem.<br>

</blockquote>

<br>

Funny that you say that after stating that you were overwhelmed by the available information.<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

If someone could help me with this I would really appreciate it.<br>

</blockquote>

<br>

Most likely, the solution with the best simplicity/performance trade-off would be to use xml.etree.cElementTree's iterparse(), intercept each interesting tag name, and search its text/tail using the regexp. That's doable in a couple of lines.<br>
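<br>
A rough sketch of the iterparse() approach, under several assumptions: the text sits in leaf elements (so only .text is searched, not .tail), the tag and attribute names below are invented for illustration, and xml.etree.ElementTree stands in for cElementTree, which was removed in Python 3.9:<br>

```python
import io
import re
import xml.etree.ElementTree as ET


def scan(source, text_tag, pattern, parent_tag, attr_name):
    """Collect (parent attribute, regex hits, word count) per text element."""
    regex = re.compile(pattern)
    current_attr = None
    rows = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == parent_tag:
            # Attributes are already available when the start tag is seen.
            current_attr = elem.get(attr_name)
        elif event == "end" and elem.tag == text_tag:
            # The element's text is only complete at its end event.
            text = elem.text or ""
            hits = sum(1 for _ in regex.finditer(text))
            rows.append((current_attr, hits, len(text.split())))
        elif event == "end" and elem.tag == parent_tag:
            # Free the finished subtree to keep memory use low.
            elem.clear()
            current_attr = None
    return rows


# Hypothetical document; the real files in the thread may look different.
SAMPLE = (b'<doc><section id="a"><p>foo bar foo</p><p>baz</p></section>'
          b'<section id="b"><p>foo</p></section></doc>')
rows = scan(io.BytesIO(SAMPLE), "p", r"foo", "section", "id")
```

Each row could then be written to the output file, for example with the csv module, one line per matching element.<br>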


<br>

But unless you provide more information, it's hard to give better advice.<br>

<br>

Stefan<br>

<br>

<br>

<br><br>---------- Forwarded message ----------<br>From: Chris Rebert <<a href="mailto:clp2@rebertia.com" target="_blank">clp2@rebertia.com</a>><br>To: "Lawrence D'Oliveiro" <<a href="mailto:ldo@geek-central.gen.nz" target="_blank">ldo@geek-central.gen.nz</a>><br>


Date: Tue, 16 Mar 2010 00:52:07 -0700<br>Subject: Re: import antigravity<br>On Tue, Mar 16, 2010 at 12:40 AM, Lawrence D'Oliveiro<br>

<ldo@geek-central.gen.new_zealand> wrote:<br>

> Subtle...<br>

<br>

You're a bit behind the times.<br>

If my calculations are right, that comic is over 2 years old.<br>

<br>

Cheers,<br>

Chris<br>

<br>

<br><br>---------- Forwarded message ----------<br>From: Stefan Behnel <<a href="mailto:stefan_ml@behnel.de" target="_blank">stefan_ml@behnel.de</a>><br>To: <a href="mailto:python-list@python.org" target="_blank">python-list@python.org</a><br>


Date: Tue, 16 Mar 2010 08:51:58 +0100<br>

Subject: Re: import antigravity<br>Lawrence D'Oliveiro, 16.03.2010 08:40:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Subtle...<br>

</blockquote>

<br>

Absolutely.<br>

<br>

  Python 2.4.6 (#2, Jan 21 2010, 23:45:25)<br>

  [GCC 4.4.1] on linux2<br>

  Type "help", "copyright", "credits" or "license" for more information.<br>

  >>> import antigravity<br>

  Traceback (most recent call last):<br>

    File "<stdin>", line 1, in ?<br>

  ImportError: No module named antigravity<br>

<br>

<br>

Stefan<br>

<br>

<br>

<br></blockquote></div><br>

</blockquote></div><br>