[Tutor] Problems with creating XML-documents

Stefan Behnel stefan_ml at behnel.de
Thu Apr 15 09:32:20 CEST 2010


Hi,

my main advice up-front: if you want good advice, give good details about 
the problem you face and enough background to let others understand what 
the real problem is and what your environmental constraints are.


Karjer Jdfjdf, 15.04.2010 08:03:
> I know the theory of XML but have never used it really and
> I'm a bit insecure about it.

At least in Python, XML is trivial to use, as long as you use proper XML 
tools. It's a bit more tricky in Java, but there are some usable tools 
there, too. From what you write about the program, it's rather unlikely 
that it uses them, though.


> Basically I'm doing the following:
>
> 1. retrieve data from a database ( instance in q )
> 2. pass the data to an external java-program that requires file-input

XML input, I assume. So this is the part where you generate your XML output 
from database input. Fair enough.


> 3. the java-program modifies the inputfile and creates an outputfile based on the inputfile

Is this program outside of your control or do you have the sources?

Does it do anything really complex and tricky, or could it be reimplemented 
in a couple of lines in Python? The mere fact that a Java program "is huge 
and takes minutes to run" doesn't mean much in general. That's often just 
due to a bad design.


> 4. I read the outputfile and try to parse it.

Is the output file a newly created XML file or does the Java program do 
text processing on the input file?


> 1 to 3 are performed by a separate program that creates the XML
> 4 is a program that tries to parse it (and then perform other
> modifications using python)
>
> When I try to parse the outputfile it creates different errors such as:
>     * ExpatError: not well-formed (invalid token):
>
> Basically it usually has something to do with not-well-formed XML.

I.e. not XML at all.


> Unfortunately the Java-program also alters the content on essential
> points such as inserting spaces in tags (e.g. id="value" to id = " value " ),

Any XML parser will handle the whitespace around 'id' and '=' for you, but 
the fact that the program alters the content of an 'id' attribute seems 
rather frightening.
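
To make the distinction concrete, here is a small sketch using the standard 
library's xml.etree.ElementTree. The element name 'record' and the values 
are only for illustration. Whitespace around the attribute name and the '=' 
is just markup, but whitespace inside the quotes is part of the value.

```python
import xml.etree.ElementTree as ET

# Whitespace around the attribute name and '=' is insignificant markup,
# so both spellings parse to the same attribute value.
tidy = ET.fromstring('<record id="value"/>')
spaced = ET.fromstring('<record id = "value" />')
print(tidy.get("id") == spaced.get("id"))

# Whitespace inside the quotes belongs to the value itself. The parser
# keeps it, so the id no longer matches what was written originally.
padded = ET.fromstring('<record id = " value " />')
print(repr(padded.get("id")))
```

So a parser cannot undo the second kind of change; that would need an 
explicit strip() on the value after parsing.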


> The Java is really a b&%$#!, but I have
> no alternatives because it is custom-made (but very poorly imho).

You should make sure it's really set in stone. Could you give a hint 
about what the program actually does?


> This means trying a few little things takes hours.
> Because of the long load and processing time of the Java-program I'm forced
> to store the output in a single file instead of processing it record by record.

Maybe you could reduce the input file size for testing, or does it really 
take 10 minutes to run a small file through it?


> Also each time I have to change something I have to modify functions in
> different libraries that perform specific functions. This probably means
> that I've not done it the right way in the first place.

Not sure I understand what you mean here.


>>> text = str('<record id="' + str(instance.id)+ '">\n' + \
>>> '<date>' + str(instance.datetime) + '</date>\n' + \
>>> '<order>' + instance.order + '</order>\n' + \
>>> '</record>\n')
>
>> You can simplify this quite a lot. You almost certainly don't need
>> the outer str() and you probably don't need the \ characters either.
>
> I use a very simplified text-variable here. In reality I also include
> other fields which contain numeric values as well. I use the \ to
> keep each XML-tag on a separate line to keep the overview.

You should really use a proper tool to generate the XML. I never used jaxml 
myself, but I guess it would do the job at hand. If the file isn't too 
large (a couple of MB maybe), you can also go with cElementTree and just 
create the file in memory before writing it out.
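
A minimal sketch of the in-memory approach, using xml.etree.ElementTree 
from the standard library. The 'Row' tuple is a hypothetical stand-in for 
the database instance, and the 'records' root element is an assumption: 
the quoted snippet only shows 'record', but a well-formed document needs 
exactly one root element.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical stand-in for a database row; the field names follow the
# quoted snippet (id, datetime, order).
Row = namedtuple("Row", "id datetime order")


def build_records(rows):
    """Build a <records> element with one <record> child per row."""
    root = ET.Element("records")  # assumed root element name
    for row in rows:
        record = ET.SubElement(root, "record", id=str(row.id))
        ET.SubElement(record, "date").text = str(row.datetime)
        ET.SubElement(record, "order").text = row.order
    return root


rows = [Row(1, "2010-04-15 08:03:00", "tea & <biscuits>")]
root = build_records(rows)

# The serializer escapes characters such as '&' and '<' on output.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To write a file instead of a string, ET.ElementTree(root).write(filename) 
does the same serialization. Either way, the escaping that is easy to 
forget with string concatenation is handled by the library.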


>> Also it might be easier to use a triple quoted string and format
>> characters to insert the data values.

That's certainly good advice if you really want to take the "string output" 
road. Check the tutorial for '%' string formatting.
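
If you go that way, something along these lines might do. It is only a 
sketch: the template and the function name 'format_record' are made up for 
illustration, and the escape() and quoteattr() helpers from 
xml.sax.saxutils are used so that characters like '&' or '<' in the data 
cannot break the markup.

```python
from xml.sax.saxutils import escape, quoteattr

# Triple-quoted template with named '%' placeholders. The id placeholder
# has no quotes around it because quoteattr() adds them.
RECORD_TEMPLATE = """\
<record id=%(id)s>
<date>%(date)s</date>
<order>%(order)s</order>
</record>
"""


def format_record(record_id, date, order):
    """Return one <record> element as text, with the values escaped."""
    return RECORD_TEMPLATE % {
        "id": quoteattr(str(record_id)),  # quotes and escapes the value
        "date": escape(str(date)),
        "order": escape(order),
    }


print(format_record(42, "2010-04-15 08:03:00", "tea & biscuits"))
```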


>> Did you try to use the standard library tools that come with Python,
>> like elementTree or even sax?
>
> I've been trying to do this with minidom, but I'm not sure if this
> is the right solution because I'm pretty unaware of XML-writing/parsing

If anything, use cElementTree instead.
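
For readers on current Python versions: cElementTree was the C-accelerated 
variant in Python 2. Since Python 3.3, xml.etree.ElementTree picks up the 
accelerator automatically, and the cElementTree module was removed in 
Python 3.9. Here is a sketch of reading a large output file record by 
record with iterparse(). The layout (a 'records' root around 'record' 
elements with 'date' and 'order' children) is assumed from the quoted 
snippet, not known from the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source):
    """Yield (id, date, order) for each <record>, one at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            yield elem.get("id"), elem.findtext("date"), elem.findtext("order")
            elem.clear()  # drop the children we no longer need


# Small in-memory sample; a filename works in place of the BytesIO object.
sample = io.BytesIO(
    b"<records>"
    b'<record id="1"><date>2010-04-15</date><order>tea</order></record>'
    b'<record id="2"><date>2010-04-16</date><order>coffee</order></record>'
    b"</records>"
)
records = list(iter_records(sample))
print(records)
```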


> At the moment I'm tempted to do a line-by-line parse and trigger on
> an identifier-string that identifies the end and start of a record.
> But that way I'll never learn XML.

If the Java program really can't be forced into outputting XML, running an 
XML parser over the result is doomed to fail. XML is a very well specified 
format and a compliant XML parser is bound to reject anything that violates 
the spec.
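
What a parser can still do for you is tell you where the input stops being 
XML, which helps when reporting the problem to whoever maintains the Java 
program. A small sketch with xml.etree.ElementTree follows; the function 
name 'check_well_formed' and the sample strings are made up for 
illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        line, column = exc.position  # where the parser gave up
        return line, column, str(exc)
    return None


good = '<record id="1"><order>tea</order></record>'
bad = '<record id="1"><order>tea & biscuits</order></record>'  # bare '&'

print(check_well_formed(good))
print(check_well_formed(bad))
```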

Stefan


