Hi Lanam,

use an explicit encoding:

print(etree.tostring(doc, encoding='utf-8'))

--dirk

On 12.12.2017 at 20:23, Lanam, Jeff <jeff.lanam@hpe.com> wrote:
I have a Python program that reads XML files and modifies the version attributes. Some of these files also have a copyright notice containing the copyright symbol (©). The lxml package turns the symbol into the HTML entity &#169;. Is there a way to prevent this? I tried using the resolve_entities parameter of the XMLParser function, but that has no effect. I've tried with both Python 2.7 and 3.6.3. The program below is for Python 3.
# coding: utf-8
import os
import glob
import argparse
from lxml import etree

xParser = etree.XMLParser(strip_cdata=False, resolve_entities=False)
etree.set_default_parser(xParser)

someXML = '<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>'

doc = etree.fromstring(someXML)

print(someXML)
print(etree.tostring(doc))
It prints out:
<node version="1.0.1"><copyright>Copyright © 2017 by me</copyright></node>
b'<node version="1.0.1"><copyright>Copyright &#169; 2017 by me</copyright></node>'
I have also posted this question to Stack Overflow. https://stackoverflow.com/questions/47779890/lxml-python-package-changes-cop...
Thanks for any suggestions.
Jeff Lanam Software Designer NonStop Enterprise Division
jeff.lanam@hpe.com +1 650 386 4703 Skype
Palo Alto, CA hpe.com @jefflanam
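For anyone finding this thread later, here is a minimal sketch of the suggestion at the top. It assumes lxml is installed; the variable names (some_xml, default_bytes, utf8_bytes, text) are made up for illustration and are not from Jeff's program. Without an encoding argument, etree.tostring() serializes to ASCII bytes, so any non-ASCII character is written as a numeric character reference. Passing encoding='utf-8' keeps the character itself, as UTF-8 bytes, and encoding='unicode' returns a Python 3 str instead of bytes.

```python
from lxml import etree

# Same document as in the question; \u00a9 is the copyright sign.
some_xml = '<node version="1.0.1"><copyright>Copyright \u00a9 2017 by me</copyright></node>'
doc = etree.fromstring(some_xml)

# Default: ASCII bytes, so the copyright sign is escaped as &#169;
default_bytes = etree.tostring(doc)

# Explicit encoding: the character is kept, encoded as UTF-8 bytes
utf8_bytes = etree.tostring(doc, encoding='utf-8')

# 'unicode' asks lxml for a text string (str) rather than bytes
text = etree.tostring(doc, encoding='unicode')

print(default_bytes)
print(utf8_bytes.decode('utf-8'))  # decode so print() shows text, not a bytes repr
print(text)
```

Note that in Python 3, print() on a bytes object shows its repr, so the UTF-8 result appears with escaped bytes unless it is decoded first or encoding='unicode' is used.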