extract occurrence of regular expression from elements of XML documents
Stefan Behnel
stefan_ml at behnel.de
Tue Mar 16 03:50:30 EDT 2010
Martin Schmidt, 15.03.2010 18:16:
> I have just started to use Python a few weeks ago and until last week I had
> no knowledge of XML.
> Obviously my programming knowledge is pretty basic.
> Now I would like to use Python in combination with ca. 2000 XML documents
> (about 30 kb each) to search for certain regular expression within specific
> elements of these documents.
2000 * 30K isn't a huge problem, that's just 60M in total. If you just have
to do it once, drop your performance concerns and just get a solution
going. If you have to do it once a day, take care to use a tool that is not
too resource consuming. If you have strict requirements to do it once a
minute, use a fast machine with a couple of cores and do it in parallel. If
you have a huge request workload and want to reverse index the XML to do
all sorts of sophisticated queries on it, use a database instead.
> I would then like to record the number of occurrences of the regular
> expression within these elements.
> Moreover I would like to count the total number of words contained within
> these,
len(text.split()) will give you those.
BTW, is it document-style XML (with mixed content as in HTML) or is the
text always withing a leaf element?
> and record the attribute of a higher level element that contains
> them.
An example would certainly help here.
> I was trying to figure out the best way how to do this, but got overwhelmed
> by the available information (e.g. posts using different approaches based on
> dom, sax, xpath, elementtree, expat).
> The outcome should be a file that lists the extracted attribute, the number
> of occurrences of the regular expression, and the total number of words.
> I did not find a post that addresses my problem.
Funny that you say that after stating that you were overwhelmed by the
available information.
> If someone could help me with this I would really appreciate it.
Most likely, the solution with the best simplicity/performance trade-off
would be xml.etree.cElementTree's iterparse(), intercept on each
interesting tag name, and search its text/tail using the regexp. That's
doable in a couple of lines.
But unless you provide more information, it's hard to give better advice.
Stefan
More information about the Python-list
mailing list