To whom it may concern,

I was using xml.dom.minidom to parse XML files, but because of memory usage (I'm currently working with ~10GB files) I had to switch modules and started using xml.dom.pulldom.

Since pulldom and minidom rely on the same data structures, the processing functions I used with minidom also work with the pulldom implementation, but there is something wrong with pulldom's behaviour.

As you can see below, this is a simple for loop in which I check the tagName and then get the attributes I want with the get_prjna function.
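A minimal sketch of a loop of this shape, for readers who want something runnable. Everything specific in it is an assumption: the sample document, the "ID" tag name, the "PRJNA000000" value, and the body of get_prjna are hypothetical stand-ins, not the real data or the real function. Only the "STUDY" tag name and the use of expandNode come from the description here.

```python
import io
from xml.dom import pulldom

# Hypothetical sample document; the real files are ~10GB and contain more fields.
SAMPLE = b"<ROOT><STUDY><ID>PRJNA000000</ID></STUDY></ROOT>"


def get_prjna(node):
    """Hypothetical stand-in: return the text of the first <ID> descendant."""
    id_node = node.getElementsByTagName("ID")[0]
    # Reads only the first child text node, as minidom-style code often does.
    return id_node.firstChild.data


def iter_studies(stream):
    """Yield one value per <STUDY> element without loading the whole file."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "STUDY":
            # Build the DOM subtree for this element only.
            events.expandNode(node)
            yield get_prjna(node)


result = list(iter_studies(io.BytesIO(SAMPLE)))
```

On a document this small the whole input fits in one read, so a sketch like this is not expected to show the truncation by itself.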

This function works perfectly with minidom, but with pulldom the data I get are truncated. The first value in the first row is the key of a dictionary and has the format PRJ******; in the second row you can see that the function fails to read the full string (it gets only "PRJN"). This also happens with other data. The second element of the first row has the format "Illumina **Seq ****", and something is clearly going wrong here, because I get truncated strings of seemingly random length. The last element of the first row is "TRANSCRIPTOMIC", but again I get incomplete strings.

To show how the data are stored, this is the "PRJ******" value in this example:

And this is another field:

Having used minidom until this morning, I can confirm that the same data structure shown above is built correctly with minidom, with no trace of incomplete data.

I think this data loss happens when using the expandNode function, but I can't test it because the files are too large to print the content of the expanded nodes.
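One scaled-down way to inspect what expandNode produces, without printing the large files, is to parse a tiny document with a deliberately small read buffer and list the text pieces under one element. This is a sketch under assumptions, not a confirmed diagnosis: the sample document and the "ID" tag are hypothetical, and the idea being checked is that a SAX-based parser may deliver one run of text in several chunks, each of which could become its own adjacent Text node. If that is what happens, reading firstChild.data would return only the first piece, while the pieces joined together (or the node after normalize()) would give the full string.

```python
import io
from xml.dom import Node, pulldom

# Hypothetical sample document standing in for one record of the real file.
SAMPLE = b"<ROOT><STUDY><ID>PRJNA000000</ID></STUDY></ROOT>"


def expanded_text_pieces(data, bufsize):
    """Expand the first <STUDY> and return (text pieces, merged text) of its <ID>."""
    # A small bufsize makes pulldom feed the parser in small chunks.
    events = pulldom.parse(io.BytesIO(data), bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "STUDY":
            events.expandNode(node)
            id_node = node.getElementsByTagName("ID")[0]
            # Collect every Text child separately, before any merging.
            pieces = [
                child.data
                for child in id_node.childNodes
                if child.nodeType == Node.TEXT_NODE
            ]
            # normalize() merges adjacent Text nodes in the subtree.
            node.normalize()
            merged = id_node.firstChild.data
            return pieces, merged
    return [], ""


pieces, merged = expanded_text_pieces(SAMPLE, bufsize=8)
print(len(pieces), pieces, merged)
```

If more than one piece is printed, the strings are split rather than lost, and calling normalize() on the expanded node before reading the values would be worth trying on the real files.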

To be as complete as possible, I also tried something like:

    if event == pulldom.START_ELEMENT and node.tagName == "STUDY":

but this error keeps happening.

Thanks in advance for the help and have a nice day!