Encoding when building XML via minidom? (...beginner needs advice)
Martin v. Loewis
martin at v.loewis.de
Wed Mar 27 13:03:47 CET 2002
"Petr Prikryl" <Answer.via.news.please.prikryl at skil.nospam.cz> writes:
> Briefly: I want to parse txt files with some internal structure
> and convert them into well-formed XML internal format
> (custom markup). I do not know how to deal with
> windows-1250 input encoding and how to produce
> the XML output with the same encoding.
People have varying opinions about that, but I would recommend *not*
using the DOM to generate XML files from scratch. Instead, plain print
statements might work just fine.
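A minimal sketch of the print-style approach, written in current Python 3
syntax. The function name write_records, the element names, and the sample
data are made up for illustration; only escape and quoteattr come from the
standard library (xml.sax.saxutils). The output stream is assumed to be a
text stream that encodes to windows-1250.

```python
import io
from xml.sax.saxutils import escape, quoteattr

def write_records(out, records, encoding="windows-1250"):
    """Write (name, text) pairs as a small XML document using plain writes."""
    # The declared encoding must match the encoding of the stream 'out'.
    out.write('<?xml version="1.0" encoding="%s"?>\n' % encoding)
    out.write("<records>\n")
    for name, text in records:
        # quoteattr adds the surrounding quotes and escapes the value;
        # escape handles &, < and > in character data.
        out.write("  <record name=%s>%s</record>\n"
                  % (quoteattr(name), escape(text)))
    out.write("</records>\n")

# Hypothetical usage: collect the encoded bytes in memory.
raw = io.BytesIO()
text_out = io.TextIOWrapper(raw, encoding="windows-1250", newline="\n")
write_records(text_out, [("title", "Černý & bílý")])
text_out.flush()
xml_bytes = raw.getvalue()
```

The one thing plain writes do not do for you is escaping, which is why
every value goes through escape or quoteattr before it is written.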
> I probably need some advice for what is the best way to create my
> own parser
Writing a parser is always tricky, and the best approach depends on
the input language. In any case, this seems mostly unrelated to XML.
Once you get a parser working, it should be straightforward to build
a DOM structure from its output, if you wish to.
If you really care, you could try to produce a SAX reader: an object
that emits SAX events. In your case, it would not read XML, but it
could still emit SAX events as if it were reading XML. With that, you
could pass your reader to xml.dom.minidom.parse, and it would build a
DOM tree right away. This is quite involved, though, and likely more
complicated than necessary.
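To illustrate only the "emit SAX events without reading XML" part, here is
a rough sketch in Python 3. It does not implement a full reader for
xml.dom.minidom.parse; instead it swaps in xml.sax.saxutils.XMLGenerator
as the content handler, so the events are serialized directly. The
"key: value" input format and the names emit_events and convert are
assumptions made up for the example.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def emit_events(lines, handler):
    """Drive a SAX ContentHandler from 'key: value' lines (hypothetical format)."""
    handler.startDocument()
    handler.startElement("records", AttributesImpl({}))
    for line in lines:
        key, _, value = line.partition(":")
        handler.startElement("field", AttributesImpl({"name": key.strip()}))
        handler.characters(value.strip())
        handler.endElement("field")
    handler.endElement("records")
    handler.endDocument()

def convert(lines, encoding="windows-1250"):
    """Serialize the emitted events as XML bytes in the given encoding."""
    out = io.BytesIO()
    generator = XMLGenerator(out, encoding=encoding)
    emit_events(lines, generator)
    return out.getvalue()

xml_bytes = convert(["title: Příklad", "note: a < b"])
```

Because emit_events only talks to the ContentHandler interface, the same
function could feed any other SAX handler, including one that builds a
tree.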
> <?xml version="1.0" encoding="windows-1250" ?>
This is somewhat tricky with minidom: the toxml method will always
assume UTF-8 output. You can use a StreamWriter for windows-1250,
and you will need to write the XML header separately.
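A sketch of the StreamWriter route in Python 3 syntax, assuming the DOM
tree already exists. The element name "zaznam" and the sample text are
invented for the example; codecs.getwriter and Element.writexml are
standard library calls.

```python
import codecs
import io
from xml.dom import minidom

# Build a tiny document; all text is handled as Unicode inside the DOM.
doc = minidom.getDOMImplementation().createDocument(None, "zaznam", None)
doc.documentElement.appendChild(doc.createTextNode("Žluťoučký kůň"))

raw = io.BytesIO()
# The StreamWriter encodes everything written to it as windows-1250.
writer = codecs.getwriter("windows-1250")(raw)
# Write the XML header by hand, then only the root element.
writer.write('<?xml version="1.0" encoding="windows-1250"?>\n')
doc.documentElement.writexml(writer)
xml_bytes = raw.getvalue()
```

Writing documentElement rather than the whole document avoids getting a
second XML declaration. Note that in current Python versions toxml also
accepts an encoding argument, which may make the manual header
unnecessary there.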
> Are there some tutorials on how to work with encoding conversions?
You really need to familiarize yourself with the Unicode support.
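As a starting point, the basic conversion pattern looks like this in
Python 3 syntax: decode bytes into a Unicode string once on input, work
with the string, and encode once on output. The sample bytes are made up
for the example.

```python
raw = b"P\xf8\xedklad"               # "Příklad" as windows-1250 bytes

text = raw.decode("windows-1250")    # bytes -> Unicode string
utf8 = text.encode("utf-8")          # Unicode string -> UTF-8 bytes
back = text.encode("windows-1250")   # Unicode string -> original bytes
```

When reading from a file, open(path, encoding="windows-1250") performs the
decoding step for you.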
> (Python's DOM works with Unicode internally, doesn't it?)
In principle, yes. It won't complain if you pass byte strings, but you
may face troubles later on if you do.
> Are there some means to make writing the plain text with indentation
> easier, in Python?
Yes, you can use the multiplication of strings:
    print " "*indent+text
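The same idea wrapped in a small helper, written in Python 3 syntax (where
print is a function). The name indented and the two-space unit are
arbitrary choices for the sketch.

```python
def indented(depth, text, unit="  "):
    """Return text prefixed with 'depth' copies of the indentation unit."""
    return unit * depth + text

lines = [
    indented(0, "<doc>"),
    indented(1, "<item>text</item>"),
    indented(0, "</doc>"),
]
print("\n".join(lines))
```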