Adding HTML inside XML

Hello,

I need to add some HTML inside XML. The result should look like this:

    <content>
    <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd"><en-note><p>line one</p><p>line two</p></en-note>]]>
    </content>

The code I'm using is this:

    # read html from file - result is :
    content_text = '<p>line one</p><p>line two</p>'

    en_note_el = etree.Element('en-note')
    en_note_el.text = content_text
    en_note_doctype = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
    en_note_str = etree.tostring(en_note_el, encoding='UTF-8', method="xml",
                                 xml_declaration=True, pretty_print=False,
                                 standalone=False, doctype=en_note_doctype)

    content_el = etree.SubElement(note_el, 'content')
    content_el.text = etree.CDATA(en_note_str)

This works, except that the HTML placed in the text of en-note is escaped. Can you help me figure out how to keep it from being escaped? The contents inside the <en-note> tags are supposed to be valid HTML, but without any <html> or <body> sections, and there isn't really a root element.

Hi Karl,

You're not parsing content_text as XML or HTML, so lxml treats it as plain text that merely looks like XML. Plain text has to be escaped before it can be included within XML, which is why your markup comes out escaped.

The following:

    import lxml.etree as etree

    content_text = '<p>line one</p><p>line two</p>'

    # Parse the fragment inside its wrapper so the <p> tags become real elements
    en_note_el = etree.XML(f'<en-note>{content_text}</en-note>')
    en_note_doctype = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
    en_note_str = etree.tostring(en_note_el, encoding='UTF-8', method="xml",
                                 xml_declaration=True, pretty_print=False,
                                 standalone=False, doctype=en_note_doctype)

    content_el = etree.Element('content')
    content_el.text = etree.CDATA(en_note_str)

    print(etree.tostring(content_el).decode('utf8'))

Produces the output:

    <content><![CDATA[<?xml version='1.0' encoding='UTF-8' standalone='no'?>
    <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
    <en-note><p>line one</p><p>line two</p></en-note>]]></content>

Which I would expect is what you're after?

Cheers,
aid
On 18 Aug 2022, at 15:57, karl@cs.stanford.edu wrote:
> This works, except the included HTML in the text element of en-note is escaped. Can you help me figure out how to not have it be escaped?
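
The difference between assigning markup as text and parsing it can be seen in isolation. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that both escape text content in the same way for this case; the variable names as_text, as_tree, escaped and parsed are made up for illustration. It does not cover the CDATA step, which the standard library module does not provide.

```python
import xml.etree.ElementTree as ET

content_text = '<p>line one</p><p>line two</p>'

# Case 1: assign the markup as text. The serializer sees character data,
# so the angle brackets are escaped on output.
as_text = ET.Element('en-note')
as_text.text = content_text
escaped = ET.tostring(as_text, encoding='unicode')

# Case 2: parse the markup inside its wrapper element. The <p> tags become
# real child elements and are serialized as tags.
as_tree = ET.fromstring(f'<en-note>{content_text}</en-note>')
parsed = ET.tostring(as_tree, encoding='unicode')

print(escaped)
print(parsed)
print(len(as_text), len(as_tree))  # number of child elements in each case
```

In the first case en-note has no child elements, only a text string; in the second it has one child per <p>. Note that parsing this way requires the fragment to be well-formed XML.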