[Tutor] memory error
Joshua Valdez
jdv12 at case.edu
Thu Jul 2 18:57:13 CEST 2015
Hi, so I figured out my problem with this code and it's working great, but
it's still taking a very long time to process. I was wondering if there was
a way to do this with just regular expressions instead of parsing the text
with lxml.
The idea would be to identify a <page> tag and then move to the next line
of the file to see if there is a match between the title text and the pages
in my pages file. I would then want to write the entire page element
(<page>fdsalkfdjadslf</page>) to output.
So again, my pages are just an array like [Anarchism, Abrahamic Mythology,
...]. I'm a little confused as to how to even start this. My initial idea was
something like this, but I'm not sure how to execute it:
# wiki --> XML file (an open file object)
# page_titles -> array of strings corresponding to titles
import re

tag = r'(<page>)'
wiki = wiki.readlines()
for line in wiki:
    page = re.search(tag, line)
    if page:
        ......(I'm not sure what to do here)
Is it possible to look ahead in a loop to discover other lines and then
backtrack?
I think this may be the solution, but again I'm not sure how I would execute
such a command structure.
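
One way the idea above could be sketched without lxml: rather than looking
ahead and backtracking, buffer the lines from <page> up to </page> and decide
whether to keep the buffer once the title line has been seen. This is only a
rough sketch under some assumptions that are not confirmed anywhere in this
thread: that <page>, <title>...</title>, and </page> each sit on their own
line in the dump, and that titles appear in the file exactly as they do in
page_titles (no XML entities such as &amp;). The function name matching_pages
and the three pattern names are made up for illustration.

```python
import io
import re

# Hypothetical patterns. They assume <page>, <title>...</title>, and </page>
# each appear on their own line of the dump.
PAGE_OPEN = re.compile(r"<page>")
PAGE_CLOSE = re.compile(r"</page>")
TITLE = re.compile(r"<title>(.*?)</title>")


def matching_pages(lines, wanted_titles):
    """Yield the full text of each <page> block whose title is wanted."""
    wanted = set(wanted_titles)  # set lookup avoids scanning a list each time
    buffer = None                # None means "not currently inside a page"
    keep = False
    for line in lines:
        if buffer is None:
            # Outside a page: ignore everything until a page opens.
            if PAGE_OPEN.search(line):
                buffer = [line]
                keep = False
            continue
        buffer.append(line)
        title = TITLE.search(line)
        if title and title.group(1) in wanted:
            keep = True
        if PAGE_CLOSE.search(line):
            if keep:
                yield "".join(buffer)
            buffer = None        # drop the page text so memory can be reused


# Small in-memory stand-in for the wiki file, just to show the call.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n"
    "    <title>Anarchism</title>\n"
    "    <text>first</text>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Autism</title>\n"
    "    <text>second</text>\n"
    "  </page>\n"
    "</mediawiki>\n"
)
for page in matching_pages(sample, ["Anarchism", "Abrahamic Mythology"]):
    print(page, end="")
```

Passing the open file object itself (instead of the result of readlines())
would let the loop read one line at a time, so only the current page is held
in memory. Whether this is actually faster than iterparse on the real dump is
something that would need to be measured.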
*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
(440)-231-0479
jdv12 at case.edu <jdv2 at uw.edu> | jdv2 at uw.edu | joshv at armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>
On Wed, Jul 1, 2015 at 10:13 AM, Joshua Valdez <jdv12 at case.edu> wrote:
> Hi Danny,
>
> So I got my code working now and it looks like this:
>
> TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'
> doc = etree.iterparse(wiki)
>
> for _, node in doc:
>     if node.tag == TAG:
>         title = node.find("{http://www.mediawiki.org/xml/export-0.10/}title").text
>         if title in page_titles:
>             print(etree.tostring(node))
>         node.clear()
>
> It's mostly giving me what I want. However, it is adding extra formatting
> (I believe namespaces and attributes). I was wondering if there was a way
> to strip these out when I'm printing the node with tostring?
>
> Here is an example of the last few lines of my output:
>
> [[Category:Asteroids| ]]
> [[Category:Spaceflight]]</ns0:text>
> <ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
> </ns0:revision>
> </ns0:page>
>
> On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <dyoo at hashcollision.org> wrote:
>
>> Hi Joshua,
>>
>>
>>
>> The issue you're encountering sounds like XML namespace issues.
>>
>>
>> >> So I tried that code snippet you pointed me to and I'm not getting
>> any output.
>>
>>
>> This is probably because the tag names of the XML are being prefixed
>> with namespaces. This would make the original test for node.tag to be
>> too stingy: it wouldn't exactly match the string we want, because
>> there's a namespace prefix in front that's making the string mismatch.
>>
>>
>> Try relaxing the condition from:
>>
>> if node.tag == "page": ...
>>
>> to something like:
>>
>> if node.tag.endswith("page"): ...
>>
>>
>> This isn't quite technically correct, but we want to confirm whether
>> namespaces are the issue that's preventing you from seeing those
>> pages.
>>
>>
>> If namespaces are the issue, then read:
>>
>> http://effbot.org/zone/element-namespaces.htm
>>
>
>