[Expat-bugs] [ expat-Bugs-2723522 ] expat memory consumption issue - advise needed

SourceForge.net noreply at sourceforge.net
Thu Apr 2 14:07:56 CEST 2009


Bugs item #2723522, was opened at 2009-03-31 11:50
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=2723522&group_id=10127

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: Not a Bug
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Alex Manov (alexmanovbg)
Assigned to: Nobody/Anonymous (nobody)
Summary: expat memory consumption issue - advise needed

Initial Comment:

We have an application which uses Expat to convert an XML data file into a binary version of the file. The file is currently about 600 MB but will grow.
We have encountered a blocking problem: while parsing the file, the application uses a huge amount of memory. It needs 4 GB of RAM to successfully finish a 600 MB file.
Our engineers explained that this is due to block memory management in Expat when it builds the XML tree. They explained that our XML has a lot of tags, each of which requires a separate 4K memory page even for 3 bytes of actual data.

Is there any way to improve this? Could anyone suggest how we can optimize this process? Are there any settings we can use to make it work?

Here is the file structure (I am not uploading the file since it is 600 MB, but I can provide it).
<?xml version="1.0" encoding="utf-8" ?>
<Groups>
<Group>
<ID>9</ID>
<Status>Active</Status>
<EffectiveDate>196912311900</EffectiveDate>
<ExpireDate>203012301700</ExpireDate>
<Elements>
<Element>
<ID>2345737</ID>
<StartDate>20000101</StartDate>
<EndDate>20351231</EndDate>
<StartTime>00:00</StartTime>
<EndTime>00:00</EndTime>
<DayOfWeek>0,1,2,3,4,5,6</DayOfWeek>
<DayOfMonth></DayOfMonth>
<Month></Month>
<Data>1619</Data>
<Value_1>0.0000</Value_1>
<Value_2 type ="RELATIVE">0.0000</Value_2>
<Subelements>
<Subelement>
<ID>1</ID>
<Value_3>0</Value_3>
<Value_4>1</Value_4>
<Value_5>0.0000</Value_5>
<Value_6 type="FIXED">0.0000</Value_6>
</Subelement>
<Subelement>
<ID>Default</ID>
<Value_3>0</Value_3>
<Value_4>1</Value_4>
<Value_5>0.0000</Value_5>
<Value_6 type="FIXED">0.0000</Value_6>
</Subelement>
</Subelements>
</Element>
</Elements>
</Group>
</Groups>

There can be many Groups - in practice about 100
Each Group can have many elements - in practice about 100,000
Each Element can have many subelements - in practice about 4


----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2009-04-02 08:07

Message:
I looked at your C code.

The XmlNodeRef class gave it away: you are not actually using Expat
directly, but apparently some DOM wrapper around it. This will build an
in-memory representation of the XML file and consequently consume several
times as much memory as the file size.

If you check the Expat reference document you will see that Expat-based
code uses call-backs to inform the application of each tag/attribute it
encounters, but does not build a tree.

Karl

----------------------------------------------------------------------

Comment By: Alex Manov (alexmanovbg)
Date: 2009-04-01 23:51

Message:
Karl,

The file is around 3MB compressed. I can send it to you via email if you
need it.

----------------------------------------------------------------------

Comment By: Alex Manov (alexmanovbg)
Date: 2009-04-01 23:44

Message:
Sorry, I could not attach it. Here is the C application code:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <string.h>
#include <time.h>
#include "xml.h"


//// Load() of CRatingEngine
int LoadXML ( const char *szFileName)
{
        FILE *pXMLFile = fopen ( szFileName, "rb");
        if ( !pXMLFile) {
                printf ("Load: File %s does not exist\n", szFileName);
                return 0;
        }

        if ( fseek ( pXMLFile, 0, SEEK_END) != 0) {
                printf ("Load: The file %s is corrupted\n", szFileName);
                fclose ( pXMLFile);
                return 0;
        }

        // ftell returns long; casting to int truncates where long is wider
        long nSize = ftell ( pXMLFile);
        if ( nSize < 0) {
                printf ("Load: Cannot determine the size of %s\n", szFileName);
                fclose ( pXMLFile);
                return 0;
        }
        rewind ( pXMLFile);

        //// Load all data once
        char *szBuffer = (char *) malloc ( (size_t) nSize + 1);
        if ( !szBuffer) {
                printf ("Parse: Insufficient memory to alloc %ld bytes\n", nSize);
                fclose ( pXMLFile);
                return 0;
        }

        // Check that the whole file was actually read
        if ( fread ( szBuffer, 1, (size_t) nSize, pXMLFile) != (size_t) nSize) {
                printf ("Load: Could not read %s\n", szFileName);
                free ( szBuffer);
                fclose ( pXMLFile);
                return 0;
        }
        fclose ( pXMLFile);
        szBuffer[nSize] = 0;

        printf ("Parse: Starting ...\n");

        XmlParser xmlParser;
        XmlNodeRef xtRaps = xmlParser.Parse ( szBuffer);

        // Release the buffer on both the success and the failure path
        free ( szBuffer);

        if ( !xtRaps) {
                printf ("Error: XML data incorrect (%s)\n", szFileName);
                return 0;
        }

        printf ("Parse: End ...\n");
        return 1;
}

int main(int argc, char* argv[])
{
        const char *szXMLFile = "rbx.xml";

#ifndef WIN32
        if ( argc != 2) {
                printf ("Command format: rbxdbgen [xmlfile]\n");
                return 1;
        }
        // Use argv[1] directly; strcpy into a 20-byte array overflows on long paths
        szXMLFile = argv[1];
#endif

        time_t start = time ( NULL);

        printf ("Load from XML file...\n");
        if ( LoadXML ( szXMLFile) != 1) {
                printf ("Failed\n");
                return 1;
        }
        printf ("Done (%.0f seconds)\n", difftime ( time ( NULL), start));

        return 0;
}


----------------------------------------------------------------------

Comment By: Alex Manov (alexmanovbg)
Date: 2009-04-01 23:41

Message:
Hello Karl,

I am attaching a small file (testxml.cxx) that we use only for the parsing. It
basically loads the XML and starts the parsing process with no other
processing at all.
The behavior is the same: the parsing application eats up to 3 GB of RAM before
we kill it.

Could you tell us what we are doing wrong? Or is it a bug with expat?

Attached in a rar file are the following files:
- testxml.cxx - test program
- rbx.xml.gz - compressed XML file

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2009-03-31 15:22

Message:
I think your engineers are mistaken.
Expat does not build an in-memory tree of the XML file at all, and its
memory consumption is negligible, even for multi-gigabyte files. The only
exceptions are entity declarations in the DTD which could use a lot of
memory (google for "billion laughs attack"). If you don't have a DTD (it
looks like that from your example), then I cannot see how Expat would
consume much memory.

Maybe you have a software library/layer on top of Expat which builds the
tree?

Karl

----------------------------------------------------------------------
