I'm still trying to find a good way to process big XML files (i.e. XML files about as large as the available RAM, or larger). My last post to the mailing list asked whether there might be a way to do this by processing the document one fragment at a time. I now realise that lxml's feed parser is intended for this sort of task (correct me if I'm wrong). So I'm trying to learn to use it, but I find its behaviour a little odd. I'm not sure whether this is a bug, so I'm posting to the list for advice.
When I run:
from lxml import etree

class EchoTarget:
    def start(self, tag, attrib):
        print "start", tag, attrib
    def end(self, tag):
        print "end", tag
    def data(self, data):
        print "data", repr(data)
    def close(self):
        print "close"
        return "closed!"

parser = etree.XMLParser(target=EchoTarget())
# the document is split across two feed() calls, in the middle of a tag
parser.feed("<somethin")
parser.feed("g>foo</something>")
I get the result:
start something {}
data u'foo'
end something
But when I run:
from lxml import etree

class EchoTarget:
    def start(self, tag, attrib):
        print "start", tag, attrib
    def end(self, tag):
        print "end", tag
    def data(self, data):
        print "data", repr(data)
    def close(self):
        print "close"
        return "closed!"

parser = etree.XMLParser(target=EchoTarget())
# the same document, passed in a single feed() call
parser.feed("<something>foo</something>")
nothing gets sent to stdout. Isn't that weird? I think so. I would have expected it to give the same result as the program above. I'd be very thankful if anyone can shed some light on the matter for me.
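One possibility worth ruling out is that the events are only held back until the parser is told the input is complete. Neither program above calls parser.close(), so the sketch below adds that call and records the events in a list instead of printing them. CollectTarget is a hypothetical name used only for illustration, and the fallback to the standard library's ElementTree is an assumption made so the sketch runs where lxml isn't installed (its XMLParser also offers feed() and close() with a target object). The code avoids print statements inside the class so that it runs under both Python 2 and Python 3.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib parser, which offers the same
    # feed()/close() target interface, so the sketch runs without lxml.
    import xml.etree.ElementTree as etree


class CollectTarget(object):
    """Hypothetical target that records events instead of printing them."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        self.events.append(("close",))
        return "closed!"


target = CollectTarget()
parser = etree.XMLParser(target=target)
parser.feed("<something>foo</something>")
# close() signals that no more input is coming, so any events the parser
# was still holding back should reach the target at this point.
result = parser.close()

for event in target.events:
    print(event)
print(result)
```

If the start, data and end events show up once close() has been called, the silence after a single feed() would point to buffering inside the parser rather than to lost input.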
Many thanks,
Sam
To add to the message below, I've just tried running a much simpler program that doesn't call lxml, to see whether the memory error comes from Python or the environment rather than from lxml. It turns out that it does:
>>> infile = open("content.rdf.u8.xml", "r")
>>> print infile.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
Ok, so clearly Python isn't happy to read content.rdf.u8.xml in one go. The normal workaround for processing large text files piece by piece seems to be either to set a byte limit on how much is read at once, or to read the file line by line. However, neither of those will work in this case because they won't produce well-formed XML that the target parser interface can handle (correct me if I'm wrong).
I'm sure there must be a fairly easy solution to this, but it's eluding me. All assistance greatly appreciated!
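For what it's worth, the first feed parser example already splits a tag across two feed() calls and still produces sensible events, so fixed-size chunks may not need to be well-formed on their own. Below is a minimal sketch of reading a file with a byte limit and passing each piece to feed(), under that assumption. TagCounter, parse_in_chunks, the chunk sizes and the tiny temporary file are all hypothetical stand-ins for illustration (the real input would be content.rdf.u8.xml), and the fallback to the standard library's ElementTree is only there so the sketch runs where lxml isn't installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for the demo.
    import xml.etree.ElementTree as etree


class TagCounter(object):
    """Hypothetical target: counts start tags and builds no tree."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.count


def parse_in_chunks(path, target, chunk_size=64 * 1024):
    """Feed the file to the parser chunk_size bytes at a time."""
    parser = etree.XMLParser(target=target)
    with open(path, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
    # close() returns whatever the target's close() method returns
    return parser.close()


# Small stand-in file for the demo
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(b"<root>" + b"<item>foo</item>" * 1000 + b"</root>")
    demo_path = tmp.name

try:
    # A deliberately tiny chunk size, so that chunks end in mid-tag
    total = parse_in_chunks(demo_path, TagCounter(), chunk_size=7)
finally:
    os.remove(demo_path)

print(total)
```

Because the target keeps only a counter, memory use should depend on the chunk size rather than on the file size. Whether that holds for a file as large as the one above is not something this sketch demonstrates.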
Sam