[Tutor] memory error

Joshua Valdez jdv12 at case.edu
Thu Jul 2 18:57:13 CEST 2015


Hi, so I figured out my problem with this code and it's working great, but
it's still taking a very long time to process. I was wondering if there is
a way to do this with just regular expressions instead of parsing the text
with lxml.

The idea would be to identify a <page> tag and then move to the next line
of the file to see whether the title text matches one of the pages in my
pages file.  I would then want to write the entire page element
(<page>fdsalkfdjadslf</page>) to output.

So again, my pages are just an array like: [Anarchism, Abrahamic Mythology,
...]. I'm a little confused as to how to even start this. My initial idea was
something like this, but I'm not sure how to execute it:
# wiki        --> the XML file (an open file object)
# page_titles --> array of strings corresponding to titles

import re

tag = r'(<page>)'
wiki = wiki.readlines()

for line in wiki:
  page = re.search(tag, line)
  if page:
    ......(I'm not sure what to do here)

Is it possible to look ahead in a loop to discover other lines and then
backtrack?
I think this may be the solution, but again I'm not sure how I would write
such a loop.
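
One way the look-ahead idea above could be sketched without any
backtracking: buffer the lines from <page> until </page>, note the title
on the way, and only emit the buffer if the title is wanted. This is a
minimal sketch, not a tested solution for the real dump. It assumes each
<page>, <title> and </page> tag sits on a single line and that titles
contain no XML entities such as &amp;. The names extract_pages, TITLE_RE
and the sample data are hypothetical.

```python
import io
import re

# Assumption: the whole <title>...</title> element sits on one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def extract_pages(lines, page_titles):
    """Yield the text of each <page>...</page> block whose title is wanted."""
    wanted = set(page_titles)   # set lookup is much faster than a list
    buffer = None               # None means "not inside a <page> block"
    title = None
    for line in lines:
        if "<page>" in line:
            buffer = []         # start collecting a new page
            title = None
        if buffer is not None:
            buffer.append(line)
            if title is None:
                match = TITLE_RE.search(line)
                if match:
                    title = match.group(1)
            if "</page>" in line:
                if title in wanted:
                    yield "".join(buffer)
                buffer = None   # page finished, stop collecting

# Tiny stand-in for the wiki file; a real open file object works the same
# way and is read one line at a time instead of all at once.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n"
    "    <title>Anarchism</title>\n"
    "    <text>first body</text>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Autism</title>\n"
    "    <text>second body</text>\n"
    "  </page>\n"
    "</mediawiki>\n"
)
matches = list(extract_pages(sample, ["Anarchism", "Abrahamic Mythology"]))
```

Iterating over the file object directly, rather than calling readlines(),
keeps only the current page in memory, which may matter for a large dump.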


*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*

(440)-231-0479
jdv12 at case.edu <jdv2 at uw.edu> | jdv2 at uw.edu | joshv at armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>

On Wed, Jul 1, 2015 at 10:13 AM, Joshua Valdez <jdv12 at case.edu> wrote:

> Hi Danny,
>
> So I got my code working now, and it looks like this:
>
> TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'
> doc = etree.iterparse(wiki)
>
> for _, node in doc:
>     if node.tag == TAG:
>         title = node.find(
>             "{http://www.mediawiki.org/xml/export-0.10/}title").text
>         if title in page_titles:
>             print (etree.tostring(node))
>         node.clear()
>
> It's mostly giving me what I want.  However, it is adding extra formatting
> (I believe namespaces and attributes).  I was wondering if there is a way
> to strip these out when I'm printing the node with tostring?
>
> Here is an example of the last few lines of my output:
>
> [[Category:Asteroids| ]]
> [[Category:Spaceflight]]</ns0:text>
>       <ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
>     </ns0:revision>
>   </ns0:page>
>
> On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <dyoo at hashcollision.org> wrote:
>
>> Hi Joshua,
>>
>>
>>
>> The issue you're encountering sounds like an XML namespace issue.
>>
>>
>> >> So I tried that code snippet you pointed me to and I'm not getting
>> >> any output.
>>
>>
>> This is probably because the tag names of the XML are being prefixed
>> with namespaces.  This would make the original test for node.tag
>> too strict: it wouldn't exactly match the string we want, because
>> there's a namespace prefix in front that's making the string mismatch.
>>
>>
>> Try relaxing the condition from:
>>
>>     if node.tag == "page": ...
>>
>> to something like:
>>
>>     if node.tag.endswith("page"): ...
>>
>>
>> This isn't quite technically correct, but we want to confirm whether
>> namespaces are the issue that's preventing you from seeing those
>> pages.
>>
>>
>> If namespaces are the issue, then read:
>>
>>     http://effbot.org/zone/element-namespaces.htm
>>
>
>
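
On the namespace question quoted above (the ns0: prefixes in the tostring
output): one possible approach is to rewrite each tag to its local name
before serializing. The sketch below uses the standard library's
xml.etree.ElementTree so that it is self-contained; that choice is an
assumption, as are the helper name strip_namespace and the sample
document. It is meant as an illustration, not as the definitive fix.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def strip_namespace(node):
    """Rewrite '{uri}name' tags to 'name' on node and all its descendants."""
    for element in node.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return node

# Small stand-in for the wiki dump (hypothetical data).
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Anarchism</title><revision><sha1>abc</sha1></revision></page>"
    b"<page><title>Autism</title></page>"
    b"</mediawiki>"
)
page_titles = {"Anarchism"}   # a set keeps the membership test fast

found = []
for _, node in ET.iterparse(sample):
    if node.tag == NS + "page":
        title = node.find(NS + "title").text
        if title in page_titles:
            # Strip only after the namespaced lookups above are done.
            found.append(ET.tostring(strip_namespace(node), encoding="unicode"))
        node.clear()          # free the finished page
```

The tags are rewritten in place, so any namespaced find() calls on the
node need to happen before strip_namespace is applied.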