Copyright symbol changed to HTML entity
I have a Python program that reads XML files and modifies the version attributes. Some of these files also have a copyright notice, with the copyright symbol (©). The lxml package turns these into the HTML entity &#169;. Is there a way to prevent this? I tried using the resolve_entities parameter of the XMLParser function, but that has no effect. I've tried with both Python 2.7 and 3.6.3. The program below is for Python 3.

    # coding: utf-8
    import os
    import glob
    import argparse
    from lxml import etree

    xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)
    etree.set_default_parser(xParser)

    someXML = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'

    doc = etree.fromstring(someXML)

    print(someXML)
    print(etree.tostring(doc))

It prints out:

    <node version="1.0.1"><copyright>Copyright c 2017 by me</copyright></node>
    b'<node version="1.0.1"><copyright>Copyright &#169; 2017 by me</copyright></node>'

I have also posted this question to Stack Overflow.
https://stackoverflow.com/questions/47779890/lxml-python-package-changes-copyright-symbol-to-html-entity

Thanks for any suggestions.

Jeff Lanam
Software Designer
NonStop Enterprise Division
jeff.lanam@hpe.com +1 650 386 4703 Skype
Palo Alto, CA hpe.com @jefflanam
Hi Lanam,

use an explicit encoding:

    print(etree.tostring(doc, encoding='utf-8'))

--dirk

On 12.12.2017 at 20:23, Lanam, Jeff <jeff.lanam@hpe.com> wrote:
> [...]
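To illustrate the suggestion above, here is a minimal sketch of how the encoding argument of tostring changes the serialization, reusing the sample document from the original post. With no encoding given, the serializer emits ASCII, so the copyright symbol can only be written as the numeric character reference &#169;. The variable names are made up for this sketch, and the fallback to the standard library's xml.etree.ElementTree is an assumption added so the snippet also runs where lxml is not installed; it behaves the same way for these particular calls.

```python
# coding: utf-8
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback, which serializes the same way for these calls.
    import xml.etree.ElementTree as etree

# \u00a9 is the copyright symbol, written as an escape to keep this file ASCII.
some_xml = '<node version="1.0.1"><copyright>Copyright \u00a9 2017 by me</copyright></node>'
doc = etree.fromstring(some_xml)

# Default serialization is ASCII: non-ASCII characters become character references.
ascii_bytes = etree.tostring(doc)

# Explicit UTF-8: the symbol is written as the two bytes 0xC2 0xA9.
utf8_bytes = etree.tostring(doc, encoding='utf-8')

# encoding='unicode' returns a str instead of bytes.
text = etree.tostring(doc, encoding='unicode')

print(ascii_bytes)
print(utf8_bytes)
print('\u00a9' in text)
```

Both byte strings describe the same document; they differ only in how the one non-ASCII character is represented.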
Thank you. It turns out that printing to the Windows Command Prompt window displays

    b'<node version="1.0.1"><copyright>Copyright \xc2\xa9 2017 by me</copyright></node>'

but when I use the encoding parameter in the write function, which is what I really need, I get the correct symbol.

    etree.ElementTree(doc).write('testout.xml', encoding='utf-8')

Regards,
Jeff

-----Original Message-----
From: Dirk Rothe [mailto:d.rothe@semantics.de]
Sent: Tuesday, December 12, 2017 11:34 AM
To: lxml@lxml.de; Lanam, Jeff <jeff.lanam@hpe.com>
Subject: Re: [lxml] Copyright symbol changed to HTML entity

[...]
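The \xc2\xa9 in the printed bytes is only how Python displays the UTF-8 encoding of the copyright symbol; the data itself is unchanged. A small sketch of the write call mentioned above, assuming a temporary directory in place of the working directory and, as a further assumption, a fallback to the standard library's xml.etree.ElementTree when lxml is not available:

```python
# coding: utf-8
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback, which offers the same write() call.
    import xml.etree.ElementTree as etree

doc = etree.fromstring(
    '<node version="1.0.1"><copyright>Copyright \u00a9 2017 by me</copyright></node>'
)

# \xc2\xa9 is simply the UTF-8 byte sequence for the copyright symbol.
symbol_bytes = '\u00a9'.encode('utf-8')

# Hypothetical output location; the thread wrote 'testout.xml' directly.
out_path = os.path.join(tempfile.mkdtemp(), 'testout.xml')
etree.ElementTree(doc).write(out_path, encoding='utf-8')

# Read the file back as raw bytes to see what was actually written.
with open(out_path, 'rb') as fh:
    written = fh.read()

print(symbol_bytes == b'\xc2\xa9')
print(symbol_bytes in written)
print(b'&#169;' in written)
```

Opening the resulting file in an editor that reads UTF-8 shows the symbol itself rather than a character reference.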
participants (2)
- Dirk Rothe
- Lanam, Jeff