[Tutor] memory error

Danny Yoo dyoo at hashcollision.org
Wed Jul 1 03:20:38 CEST 2015


Please use reply-to-all: I'm not in front of a keyboard at the moment.
Others on the mailing list should be able to help.
On Jun 30, 2015 6:13 PM, "Joshua Valdez" <jdv12 at case.edu> wrote:

> Hi Danny,
>
> So I tried the code snippet you pointed me to, and I'm not getting any
> output.
>
> I tried playing around with the code and when I tried
>
> doc = etree.iterparse(wiki)
> for _, node in doc:
>   print node
>
> I get output like:
>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}sitename' at
> 0x100602410>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}dbname' at
> 0x1006024d0>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}base' at 0x100602590>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}generator' at
> 0x100602710>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}case' at 0x100602750>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at
> 0x1006027d0>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at
> 0x100602810>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at
> 0x100602850>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at
> 0x100602890>
> <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at
> 0x1006028d0>
>
> so .tag is returning the whole namespace-qualified string rather than just
> the tag name.  Do you know why this happens and how I can get around it?
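
A minimal sketch of why that happens and one way around it: ElementTree
reports every tag in "{namespace URI}localname" form, so the comparison has
to use the qualified name.  The namespace URI below is the export-0.10 one
shown in the output above, the wiki variable is the dump path from the
snippet, and the print statement is Python 2, as elsewhere in this thread
(untested):

###############################
import xml.etree.cElementTree as etree

# wiki = path to the XML dump, as in the snippet above
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# iterparse yields (event, element) pairs; element.tag is always the
# namespace-qualified name, so compare against NS + "page", not "page".
for _, node in etree.iterparse(wiki):
    if node.tag == NS + "page":
        print node.find(NS + "title").text   # child lookups need NS too
###############################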
>
> *Joshua Valdez*
> *Computational Linguist : Cognitive Scientist*
>
> (440)-231-0479
> jdv12 at case.edu | jdv2 at uw.edu | joshv at armsandanchors.com
> http://www.linkedin.com/in/valdezjoshua/
>
> On Tue, Jun 30, 2015 at 7:27 PM, Danny Yoo <dyoo at hashcollision.org> wrote:
>
>> On Tue, Jun 30, 2015 at 8:10 AM, Joshua Valdez <jdv12 at case.edu> wrote:
>> > So I wrote this script to go over a large wiki XML dump and pull out the
>> > pages I want.  However, every time I run it the kernel displays 'Killed'.
>> > After reading around, I'm assuming this is a memory issue, but I'm not
>> > sure where the memory problem is in my script or whether there are any
>> > tricks to reduce the virtual memory usage.
>>
>> Yes.  Unfortunately, this is a common problem with representing a
>> potentially large stream of data as a single XML document.  The
>> straightforward approach of loading the XML by reading it all into memory
>> at once doesn't work when files get large.
>>
>> We can work around this by using a parser that knows how to read the
>> document progressively, in a streaming or "pulling" approach.  I don't
>> think Beautiful Soup knows how to do this, but if you're working with
>> XML, there are other libraries that work similarly to Beautiful Soup and
>> can operate in a streaming way.
>>
>> There was a thread about this roughly a year ago with good references,
>> the "XML Parsing from XML" thread:
>>
>>     https://mail.python.org/pipermail/tutor/2014-May/101227.html
>>
>> Stefan Behnel's contribution to that thread is probably the most
>> helpful for seeing example code:
>>
>>     https://mail.python.org/pipermail/tutor/2014-May/101270.html
>>
>> I think you'll probably want to use xml.etree.cElementTree; I expect the
>> code for your situation will look something like this (untested though!):
>>
>> ###############################
>> from xml.etree.cElementTree import iterparse, tostring
>>
>> ## ... later in your code, something like this...
>>
>> doc = iterparse(wiki)
>> for _, node in doc:
>>     if node.tag == "page":
>>         title = node.find("title").text
>>         if title in page_titles:
>>             print tostring(node)
>>         node.clear()
>> ###############################
>>
>>
>> Also, don't use "del" unless you know what you're doing.  It's not
>> helping in your particular scenario, and it just clutters up the code.
>>
>>
>> Let us know if this works out or if you're having difficulty, and I'm
>> sure folks would be happy to help out.
>>
>>
>> Good luck!
>>
>
>
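
For the memory error itself, here is a variant sketch (also untested) that
reuses the wiki and page_titles variables from the example above and follows
the common iterparse recipe of grabbing the root element and clearing it
after each finished page, so the partially built tree never grows to the
size of the whole dump:

###############################
from xml.etree.cElementTree import iterparse, tostring

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Ask for "start" events too so the root element can be grabbed up front.
context = iter(iterparse(wiki, events=("start", "end")))
_, root = next(context)

for event, node in context:
    if event == "end" and node.tag == NS + "page":
        title = node.find(NS + "title").text
        if title in page_titles:
            print tostring(node)
        # Throw away pages that have already been processed so the
        # in-memory tree stays small.
        root.clear()
###############################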


More information about the Tutor mailing list