[XML-SIG] Parsing XML file with Minidom has problem with cr/lf
stefan_ml at behnel.de
Mon May 10 09:43:25 CEST 2010
Dieter Maurer, 10.05.2010 09:07:
> Stefan Behnel wrote at 2010-5-10 08:57 +0200:
>> Dieter Maurer, 10.05.2010 07:50:
>>> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>>>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>>>> mostly working but minidom seems to have problems dealing with Windows
>>>> cr/lf characters. It creates an extra textnode that needs to be ignored
>>>> instead of just returning the xml elements. I have tried different
>>>> methods of opening the file but it doesn't seem to make a difference. It
>>>> is happiest when reading a file in Unix format.
>>> The parser should not see these "cr/lf" characters at all.
>>> Python strings themselves use only "\n" (aka "lf") to delimit lines.
>>> The "\r" (aka "cr") should only be introduced when those lines
>>> are written to text files. And they should be removed when
>>> those lines are read in again.
>>> Are you sure that you access your files as "text" files?
>> The correct way to parse XML files is as binary data.
> Why do you think so?
> The default "minidom" parser seems not to expect "\r\n" line endings....
Interesting. Then this might really be a bug. There was a change in Python
2.6.5 that broke universal newline handling for the codecs module, and that
might be what is hitting here.
However, according to what the OP described, the cr/lf characters turn up
correctly now, so ISTM that it's the plain '\n' line ending that needs fixing.
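
For anyone who wants to check this on their own installation, here is a
minimal sketch. It assumes a current Python 3 rather than the 2.6.5 discussed
above, so it may not reproduce the reported behaviour. The documents are passed
to minidom as bytes (i.e. binary data), once with "\n" and once with "\r\n"
line endings. The helper name element_children and the sample document are
made up for illustration. Note that minidom normally keeps the whitespace
between elements as text nodes, so filtering on the node type is one way to
get at just the elements.

```python
from xml.dom import minidom


def element_children(node):
    """Return only the element children, skipping text nodes such as whitespace."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


# The same (made-up) document with Unix and with Windows line endings,
# kept as bytes so that the XML parser sees the raw data.
unix_doc = b"<root>\n<a/>\n<b/>\n</root>"
windows_doc = unix_doc.replace(b"\n", b"\r\n")

for raw in (unix_doc, windows_doc):
    root = minidom.parseString(raw).documentElement
    # Show everything minidom created, then only the elements.
    print([child.nodeName for child in root.childNodes])
    print([child.nodeName for child in element_children(root)])
```

For a file on disk, the equivalent would be to open it with mode "rb" and hand
the file object to minidom.parse().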