puzzled by xml declaration (PS)

After posting this query it occurred to me that I could do the following:

declaration = '<?xml version="1.0" encoding="utf-8"?>'
print(declaration, '\n', etree.tostring(tree, encoding='unicode', pretty_print=True))

That works. Is there a more elegant or formal way of doing it?

=====

I usually serialize lxml trees (lxml 3.7, Python 3.6) with the command

print(etree.tostring(tree, encoding="unicode", pretty_print=True))

That command strips the xml declaration from the first line. This happens to matter at some level to eXist, the database I use. I posed a question on this list a couple of weeks ago about how to keep the declaration. Holger Joukl helpfully suggested that I need to add an explicit xml_declaration. There are lots of postings on Stack Overflow, mainly from a few years ago, that say the same thing. However, if I add xml_declaration=True to the above command, I get the error message

  File "/Users/martin/Dropbox/PycharmProjects/earlyprint/transform/try12.py", line 62, in <module>
    print(etree.tostring(tree, xml_declaration=True, encoding="unicode", pretty_print=True), file=fileout)
  File "src/lxml/lxml.etree.pyx", line 3320, in lxml.etree.tostring (src/lxml/lxml.etree.c:80187)
ValueError: Serialisation to unicode must not request an XML declaration

If I formulate the command (as recommended by Stack Overflow) as

Print(etree.tostring(tree, encoding="utf-8", xml_declaration=True, pretty_print=True)

I do indeed get the xml declaration, but the file is processed in b'...' format, with line breaks given as '\n', which is not what I want. I assume that encoding="unicode" is, or should be, the common garden variety form of serializing in the world of Python 3.x. But how can I ensure that serialization will keep rather than drop the xml declaration? There are ways of adding it afterwards, but that's rather kludgy, and it's also easy to forget.

Grateful for any help
MM
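In compact form, the two behaviours described above look like this (a sketch against a throwaway one-element document, not the real files):

from lxml import etree

tree = etree.fromstring('<p>ein Text</p>').getroottree()  # stand-in for a parsed document

# encoding='unicode' plus xml_declaration=True raises the error quoted above
try:
    etree.tostring(tree, encoding='unicode', xml_declaration=True)
except ValueError as exc:
    print(exc)  # Serialisation to unicode must not request an XML declaration

# serializing to a real encoding keeps the declaration, but returns bytes
print(etree.tostring(tree, encoding='utf-8', xml_declaration=True))
# b"<?xml version='1.0' encoding='utf-8'?>\n<p>ein Text</p>"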

Hi,
After posting this query it occurred to me that I could do the following:
declaration = '<?xml version="1.0" encoding="utf-8"?>'
print(declaration, '\n', etree.tostring(tree, encoding='unicode', pretty_print=True))
Well, the
print(etree.tostring(tree, xml_declaration=True, encoding="unicode", pretty_print=True), file=fileout)
  File "src/lxml/lxml.etree.pyx", line 3320, in lxml.etree.tostring (src/lxml/lxml.etree.c:80187)
ValueError: Serialisation to unicode must not request an XML declaration
is there for a reason: XML knows nothing about an "encoding" called unicode. In fact, it's not even an encoding, really. In my experience it's best to think of unicode objects conceptually as "program-internal string objects that can represent any characters in the world" (and basically forget about the fact that of course this is bytes in memory too ;-)). I'd consider it an error to add a "utf-8" declaration to something that isn't UTF-8.

So unicode is not UTF-8 (neither is it UTF-16, UTF-32, UCS-4, ...). These encodings are serialization formats that can represent the code space of Unicode characters (or parts of it). With that in mind, etree.write(..., encoding="unicode") is really a bit of a misnomer (and in fact there's also a deprecated etree.tounicode() function...).

A bit more in-depth: a Unicode code point is first of all just a number that represents a character. So the Unicode concept is basically that of a code table that assigns numbers to characters - literally all characters you can ever think of, e.g. Chinese or Arabic scripts etc. It is best to think of Unicode not as an encoding but as a logical concept.

Whenever a Unicode string, i.e. a sequence of Unicode code points representing characters, is serialized to a file, a database, a middleware message or whatever (which means that for each Unicode character of the string one or more bytes need to be stored in a certain way), it needs to be encoded. There exist several alternatives for how a sequence of such characters, or their respective integer values, can be represented as a sequence of bytes: the different encodings. Due to the widespread usage of ISO-8859-1/Latin-1, Unicode has been designed as an "extension" of its code table: the Unicode code points 0-255 (hex: 00-FF) and the ISO-8859-1 ordinal byte values 00-FF represent the same characters.

For further reading on Unicode you might want to look at this good Unicode FAQ: http://www.cl.cam.ac.uk/~mgk25/unicode.html

I'm not even going into the distinction between characters, glyphs and graphemes, or even multi-code-point characters, here. If you are really interested, look here: http://icu-project.org/docs/papers/forms_of_unicode/
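The distinction is easy to see at the interactive prompt; a tiny sketch, nothing lxml-specific (the sample string is arbitrary):

s = "ein Märchen"                 # a Python 3 str: a sequence of Unicode code points
data = s.encode("utf-8")          # an encoding turns it into bytes for storage or transport
print(type(s), type(data))        # <class 'str'> <class 'bytes'>
print(data)                       # b'ein M\xc3\xa4rchen'
print(data.decode("utf-8") == s)  # True - decoding with the same encoding restores the characters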
Print(etree.tostring(tree, encoding="utf-8", xml_declaration=True, pretty_print=True)
I do indeed get the xml declaration, but the file is processed in b'...' format, with line breaks given as '\n', which is not what I want.
I don't quite understand that. What are you trying to do here, just write to stdout? Is Print() a custom function of yours or just a typo? etree.tostring() returns a byte string (unless you give the non-encoding unicode with the encoding="unicode" parameter ;-)). But line breaks are always represented as '\n', also in unicode (I'm using Python 2.7 here):

>>> s = u"""hello
... you
... XMLista"""
>>> s
u'hello\nyou\nXMLista'
>>> print s
hello
you
XMLista
I assume that encoding="unicode" is, or should be, the common garden variety form of serializing in the world of Python 3.x.
No. Serializing XML means encoding it to a byte string - which isn't unicode. Much like you can't just write Python unicode strings to a file without encoding them to byte strings first (which sometimes the file or file-like object might do for you; see e.g. the codecs module).
But how can I ensure that serialization will keep rather than drop the xml declaration? There are ways of adding it afterwards, but that's rather kludgy, and it's also easy to forget.
If you serialize, you'll invariably serialize to a byte string with a certain encoding, and xml_declaration=True will help you to keep the XML declaration. etree.write(..., encoding="unicode") doesn't do serialization imho but gives you an in-memory unicode string representation of the XML tree.

Holger
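Spelled out, that advice might look like the following sketch ('input.xml' and 'output.xml' are just placeholder names):

from lxml import etree

tree = etree.parse('input.xml')  # placeholder source document

# serialize to bytes in an explicit encoding; only then can the declaration be kept
data = etree.tostring(tree, encoding='utf-8', xml_declaration=True, pretty_print=True)

with open('output.xml', 'wb') as f:  # 'wb' because we are writing bytes, not text
    f.write(data)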

How can encoding="unicode" be a "misnomer" when it is an essential instruction for lxml to produce files in a format that you can then feed into something else? If you don't use it you get "byte code", which may have its purposes but is useless for mine - and I suspect those of many others as well.

Which brings me back to the question: is there a standard way for lxml to keep the xml declaration at the beginning of an input file? The lxml documentation appears to be quiet about it. I know that you can make lxml prepend a declaration by a bit of code that says "before writing out the document, write out the xml declaration which you ignored". That works, but it seems to me a klutzy way of addressing what is somewhere between a bug and a missing feature in an otherwise excellent program.

MM

On 20.03.2017, 14:07, Martin Mueller <martinmueller@northwestern.edu> wrote:
How can ‘encoding=”unicode”’ be a “misnomer” when it is an essential instruction for lxml to produce files in a format that you can then feed into something else? If you don’t use it you get “byte code”, which may have its purposes but is useless for mine—and I suspect those of many others as well.
As Holger says, unicode is not an encoding. You always have to encode unicode when you serialise it, i.e. to a file or a stream. It's incredibly easy to get confused between unicode and things like UTF-8 (OS developers haven't helped to make this easier), but they are not the same thing.
Which brings me back to the question: Is there a standard way for lxml to keep the xml declaration at the beginning of an input file. The lxml documentation appears to be quiet about it. I know that you can make lxml prepend a declaration by a bit of code that says “before writing out the document, write out the xml declaration which you ignored”. That works, but it seems to me a klutzy way of addressing what is somewhere between a bug and a missing feature in an otherwise excellent program.
What do you want to maintain? The XML header is basically there for compatibility with SGML but is rarely any use. For example, you'll routinely encounter files that declare a different encoding than the one they use.

Charlie

I do understand that 'unicode' is a misnomer when it comes to distinguishing between different forms of Unicode. But within the lxml environment, encoding='unicode' is an essential instruction if you want output of a certain kind. But that instruction prevents you from using a previously valid instruction that tells the machine not to drop the xml declaration.

In the world in which I operate - TEI-encoded texts that are managed by eXist - it is normal to prefix a text with the declaration

<?xml version="1.0" encoding="UTF-8"?>

It may not be strictly speaking necessary, although there is a version 1.1 of xml (don't ask me what it is because I don't know). My colleagues use XQuery or XSLT, neither of which drops an xml declaration. We're in the business of scholarly editing and use git diff to identify files that have changed. Dropping the xml declaration will produce a changed file. That's not a showstopper, but it's a minor nuisance, and they are rightly annoyed when I supply them files that have "fake changes". So I need a dependable and preferably standard routine for keeping (or restoring) the xml declaration.

If the only way to do it is to prepend it to a file before serialization, so be it. But in that case the lxml documentation should say something like: if you use Python 3 and serialize a text with the instruction encoding='unicode', the instruction xml_declaration=True will raise an exception. The only way to keep the xml declaration is to write a special instruction. For instance:

declaration = '<?xml version="1.0" encoding="utf-8"?>'
print(declaration, '\n', etree.tostring(tree, encoding='unicode', pretty_print=True))

If there is a better way of achieving this result, I'll be grateful to hear about it.

MM

On 20.03.2017, 16:01, Martin Mueller <martinmueller@northwestern.edu> wrote:
I do understand that ‘unicode’ is a misnomer when it comes to distinguishing between different forms of Unicode. But within the lxml environment “encoding=’unicode’” is an essential instruction if you want output of a certain kind. But that instruction prevents you from using a previously valid instruction that tells the machine not to drop the xml declaration.
It's not a misnomer, it's nonsense.
In the world in which I operate—TEI encoded texts that are managed by eXist—it is normal to prefix a text with the declaration <?xml version="1.0" encoding="UTF-8"?> It may not be strictly speaking necessary, although there is a version 1.1 of xml (don’t ask me what it is because I don’t know).
In 1.1 a declaration is mandatory even if it is redundant but the attributes are entirely optional as they really only make sense in an SGML world. Does anyone remember SGML? ;-) Unfortunately, there are some poorly written libraries that will fail if they don't find the header.
My colleagues use xquery or XSLT, neither of which drops an xml declaration. We’re in the business of scholarly editing and use git diff to identify files that have changed. Dropping the xml declaration will produce a changed file. That’s not a showstopper, but it’s a minor nuisance, and they are rightly annoyed when I supply them files that have “fake changes”.
Using diff on XML can be a frustrating exercise because of things like namespace declarations or attribute order. You might want to look at using docutils for this. Stefan patiently explained to me how to use this and we use it extensively in the openpyxl test suite.

from lxml.doctestcompare import LXMLOutputChecker, PARSE_XML


def compare_xml(generated, expected):
    """Use doctest checking from lxml for comparing XML trees. Returns
    diff if the two are not the same"""
    checker = LXMLOutputChecker()

    class DummyDocTest():
        pass

    ob = DummyDocTest()
    ob.want = expected

    check = checker.check_output(expected, generated, PARSE_XML)
    if check is False:
        diff = checker.output_difference(ob, generated, PARSE_XML)
        return diff

Charlie
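For what it's worth, a possible way to use that helper (the XML strings here are invented for illustration; on this reading, equal trees yield None and unequal ones a doctest-style diff):

same = compare_xml('<root a="1" b="2"/>', '<root b="2" a="1"/>')
print(same)        # None - attribute order does not matter to the tree comparison

different = compare_xml('<root><p>one</p></root>', '<root><p>two</p></root>')
print(different)   # a diff pointing at the differing text content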

Thanks for the good advice on docutils. But whose "nonsense" is the instruction encoding='unicode'? Given the minimal xml document

<?xml version="1.0" encoding="UTF-8"?>
<p>ein Märchen</p>

and its transformation into a 'tree' via etree.parse,

print(etree.tostring(tree))

will produce

b'<p>ein M&#228;rchen</p>'

which will rarely be what you want. If you want a readable version with the declaration intact you have to say

declaration = '<?xml version="1.0" encoding="UTF-8"?>'
print(declaration, '\n', etree.tostring(tree, encoding='unicode'))

That will reproduce the original document

<?xml version="1.0" encoding="UTF-8"?>
<p>ein Märchen</p>

As I understand it, the instruction encoding='unicode' is not an encoding declaration of the same type that appears in the xml declaration. I'm not sure whether it is an lxml convention or a Python 3 convention. But it appears to be a way of saying "print out Unicode characters in their readable form rather than as code points."

I still haven't had an answer to my original question. Is there a standard way of keeping the xml declaration while using the necessary instruction encoding='unicode', or is my kludge the only way of getting there?

On 20. Mar 2017, at 16:56, Martin Mueller <martinmueller@northwestern.edu> wrote:
I still haven't had an answer to my original question. Is there a standard way of keeping the xml declaration while using the necessary instruction encoding='unicode', or is my kludge the only way of getting there?
There are other possible kludges; two examples follow.

- Encoding to bytes, then decoding to unicode, then encoding to utf-8 again during print. Note that triple encoding/decoding of the xml document is not a very efficient way to print the XML declaration.
- Replacing the linefeed in the encoded bytes directly (to fix the line endings before writing the bytes to a file).

>>> xml = "<p>Märchen</p>"
>>> from lxml import etree
>>> a = etree.XML(xml)
>>> etree.tostring(a, encoding='utf-8', xml_declaration=True)
b"<?xml version='1.0' encoding='utf-8'?>\n<p>M\xc3\xa4rchen</p>"
>>> etree.tostring(a, encoding='utf-8', xml_declaration=True).decode('utf-8')
"<?xml version='1.0' encoding='utf-8'?>\n<p>Märchen</p>"
>>> print(etree.tostring(a, encoding='utf-8', xml_declaration=True).decode('utf-8'))
<?xml version='1.0' encoding='utf-8'?>
<p>Märchen</p>
>>> etree.tostring(a, encoding='utf-8', xml_declaration=True).replace(b'\n', b'\r\n')
b"<?xml version='1.0' encoding='utf-8'?>\r\n<p>M\xc3\xa4rchen</p>"

But whose “nonsense” is the instruction “encoding=’unicode’” ? Given the minimal xml document
<?xml version="1.0" encoding="UTF-8"?>
<p>ein Märchen</p>
and its transformation into a ‘tree’ via etree.parse
print(etree.tostring(tree))
will produce
b'<p>ein M&#228;rchen</p>'
Note that this just happens because etree.tostring() serializes to "(7bit-) ASCII encoding without XML declaration" by default, see help(etree.tostring). As the umlaut "ä" isn't representable in ASCII, lxml uses a character entity reference instead.
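To make the defaults concrete, a small sketch against a throwaway element (not the actual parsed tree from the thread):

from lxml import etree

p = etree.fromstring('<p>ein Märchen</p>')

etree.tostring(p)                      # default: ASCII bytes, no declaration
# b'<p>ein M&#228;rchen</p>'
etree.tostring(p, encoding='utf-8')    # UTF-8 bytes; the umlaut is encoded rather than escaped
# b'<p>ein M\xc3\xa4rchen</p>'
etree.tostring(p, encoding='unicode')  # a Python str; no declaration possible here
# '<p>ein Märchen</p>'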
which will rarely be what you want. If you want a readable version with the declaration intact you have to say
declaration = '<?xml version="1.0" encoding="UTF-8"?>'
print(declaration, '\n', etree.tostring(tree, encoding='unicode'))
That will reproduce the original document
<?xml version="1.0" encoding="UTF-8"?>
<p>ein Märchen</p>
If you want any chance to get diffable original (serialized) document bytes back you'll need to encode to the same encoding as the one you parsed from. But be aware that attribute order and whitespace might not even be guaranteed to survive a parse-serialize cycle; from an XML tree perspective they don't really make a difference. I *think* attribute order is undefined, and whitespace is (partly) insignificant.
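For instance, re-parsing and re-serializing a two-line toy document in its declared encoding comes close but need not be byte-identical (a sketch; note the single quotes lxml writes in the declaration, as in the tostring() output shown earlier in the thread):

from lxml import etree

original = b'<?xml version="1.0" encoding="UTF-8"?>\n<p>ein M\xc3\xa4rchen</p>'
tree = etree.fromstring(original).getroottree()

roundtrip = etree.tostring(tree, encoding='UTF-8', xml_declaration=True)
print(roundtrip)
# b"<?xml version='1.0' encoding='UTF-8'?>\n<p>ein M\xc3\xa4rchen</p>"
print(roundtrip == original)  # False here, merely because of the quote style in the declaration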
As I understand it, the instruction “encoding=’unicode’” is not an encoding declaration of the same type that appears in the xml declaration. I’m not sure whether it is an lxml convention or a Python3 convention. But it appears to be a way of saying “print out Unicode characters in their readable form rather than as code points.”
If you use a target encoding (e.g. UTF-8) that is able to represent the characters in the document tree you will not get character references (or, as you put it, code points). When using etree.tostring(..., encoding="unicode") you won't get a byte string back but a unicode object; encoding declarations only make sense with byte strings.
I still haven't had an answer to my original question. Is there a standard way of keeping the xml declaration while using the necessary instruction encoding='unicode', or is my kludge the only way of getting there?
I still believe you don't really *need* to "serialize" to unicode - you certainly can't write the resulting unicode string to a file. To do this, you'd then manually need to encode to a byte string afterwards, like

declaration = '<?xml version="1.0" encoding="UTF-8"?>'
unicode_xml_str = declaration + '\n' + etree.tostring(tree, encoding='unicode')
encoded_xml_str = unicode_xml_str.encode('utf-8')
myfile.write(encoded_xml_str)

but what you should rather do is

encoded_xml_str = etree.tostring(..., encoding='utf-8')
myfile.write(encoded_xml_str)

Or maybe your need to "serialize" to unicode stems from print()ing in Python 3? Then this might be helpful as background information (copying from an FAQ of ours):

How can I programmatically output (print) with an encoding of my choice?

In Python 2 you'd usually do something like:

>>> import sys
>>> print sys.stdout.encoding
ISO8859-1
>>> print(u'\u20AC')  # EURO sign - not in iso-8859-1
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/apps/prod//lib/python2.4/encodings/iso8859_1.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position 0: character maps to <undefined>
>>> print(u'\u20AC'.encode('iso-8859-15'))  # EURO sign - available in iso-8859-15
€

I.e. you hand encoded strings (the str type in Python 2) to print. Whereas in Python 3 you need to reset sys.stdout to an encoding of your choice:

Python 3.2.1 (default, Aug 26 2011, 15:12:40) [GCC 4.6.1] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print(sys.stdout.encoding)
ISO8859-1
>>> print('\u20AC')  # EURO sign - not in iso-8859-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
>>> import io
>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='iso-8859-15')  # change stdout to a different encoding of choice
>>> print('\u20AC')
€

You can also set the encoding error handling when resetting sys.stdout:

>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='iso-8859-1', errors='replace')
>>> print("hällo, need 20 \u20AC")
hällo, need 20 ?

Some low-level background: when you call the print() function (execute the print statement in Python 2), the thing you're printing is sent to sys.stdout: print basically adds a line break to the end of the string you're printing and calls sys.stdout.write.

In Python 2, sys.stdout is a plain file object (representing the built-in stdout pipe on UNIX-like systems):

>>> sys.stdout
<open file '<stdout>', mode 'w' at 0x13d068>

It expects str byte strings as input. While it also accepts unicode objects, it internally converts those to str byte strings with the default encoding used for the Unicode implementation:

>>> print(sys.getdefaultencoding())
ascii
>>> sys.stdout.write(u"hello\n")
hello
>>> sys.stdout.write(u"hällo\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

In Python 3, sys.stdout is an io.TextIOWrapper:

>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='ISO8859-1'>

It expects str text input (remember that str objects are unicode string objects in Python 3) and encodes it to sys.stdout.encoding for output:

>>> ret = sys.stdout.write("hello\n")  # TextIOWrapper.write() returns the number of characters written - assign it to a variable so the return value doesn't mangle the interactive output
hello
>>> print(ret)
6
>>> ret = sys.stdout.write("hällo, need 20 \u20AC\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 15: ordinal not in range(256)

It does not accept bytes objects:

>>> ret = sys.stdout.write("hällo, need 20 \u20AC\n".encode('iso-8859-15'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
>>> "hällo, need 20 \u20AC\n".encode('iso-8859-15')
b'h\xe4llo, need 20 \xa4\n'
>>> type("hällo, need 20 \u20AC\n".encode('iso-8859-15'))
<class 'bytes'>

A bytes object can be handed to the underlying "raw" IO buffer, though:

>>> ret = sys.stdout.buffer.write("hällo, need 20 \u20AC\n".encode('iso-8859-15'))
hällo, need 20 €

But you shouldn't normally do this; instead, reset sys.stdout to an encoding of your choice and then feed it with your text input, as shown above:

>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='iso-8859-15')  # change stdout to a different encoding of choice
>>> ret = sys.stdout.write("hällo, need 20 \u20AC\n")
hällo, need 20 €

print() does accept bytes objects, but it prints the repr() of the bytes object:

>>> print("hällo, need 20 \u20AC\n".encode('iso-8859-15'))
b'h\xe4llo, need 20 \xa4\n'

Holger

Dear Martin,

I wonder if your issues stem from using print() in the first place: it is surely helpful to see the inner state of your script or experiment in the console, but usually not the preferred method of serializing your documents. "Serializing" to unicode is usually only required if you want to feed the XML programmatically into another function which expects a unicode string - or for testing with print(). But then, prefixing an XML declaration that lies about an encoding (UTF-8) that is not really used is rather unclean. But if that is what you need, just adding the declaration as a string seems to be okay.

If you actually want to output a file, I'd always use tree.write() instead of print(etree.tostring(tree)). So why not just:

tree.write('myfile.xml', encoding=tree.docinfo.encoding, xml_declaration=True)

But even then, this does not guarantee that diff won't spot a difference. I used your test file, and after parsing and re-serializing, diff spots two differences:

diff -u testdoc.xml testdoc2.xml
--- testdoc.xml 2017-03-20 21:07:33.959225775 +0100
+++ testdoc2.xml 2017-03-20 21:10:27.379191770 +0100
@@ -1,2 +1,2 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<p>ein Märchen</p>
+<?xml version='1.0' encoding='UTF-8'?>
+<p>ein Märchen</p>
\ No newline at end of file

First, your file uses " in the declaration, while lxml's output uses '. Both are perfectly fine (AFAIK), but diff will spot a difference. Second, lxml omits a newline at the end of the file, which my text editor adds. So, as others pointed out: diff is not very reliable for XML anyway, because the very same logical XML might have different representations on disk.

If you want to print to a console and see the XML declaration, this works:

tree.write(sys.stdout.buffer, encoding=sys.stdout.encoding, xml_declaration=True)

If there is a specific reason why tree.write() does not work for you, maybe you could say a bit more about your scenario?

Best,
Frederik
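Putting that suggestion together in one place (a sketch; the file names are placeholders):

import sys
from lxml import etree

tree = etree.parse('testdoc.xml')  # placeholder input file

# write back to disk, reusing whatever encoding the input document declared
tree.write('testdoc2.xml', encoding=tree.docinfo.encoding, xml_declaration=True)

# or, for a quick look on the console, write to the underlying byte stream
tree.write(sys.stdout.buffer, encoding=sys.stdout.encoding, xml_declaration=True)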

That's very helpful, Frederik. I've used print because that's what I know how to do, and it works. It had occurred to me in the course of this correspondence that there might be something to tree.write() that I should explore. So I'll try that.

Many thanks
MM
participants (5)

- Charlie Clark
- Frederik Elwert
- Holger Joukl
- Jens Quade
- Martin Mueller