preserving entities with lxml
Robin Becker
robin at reportlab.com
Thu Jan 13 04:13:43 EST 2022
On 12/01/2022 20:49, Dieter Maurer wrote:
.......
>>
>> when run I see this
>>
>> $ python tmp/tlp.py
>> using tostring
>> xxml=b'<a attr="&mysym; < & > !">aaaaa &mysym; < & >
>> ! AAAAA</a>'
>> ET.tostring(tree)=b'<a attr="&mysym; < & > !">aaaaa &mysym; < &
>> > ! AAAAA</a>'
>>
>> using attributes
>> tree.text='aaaaa &mysym; < & > ! AAAAA'
>> tree.getchildren()=[]
>> tree.tail=None
>
> Apparently, the `resolve_entities=False` was not effective: otherwise,
> your tree content should have more structure (especially some
> entity reference children).
>
Except that ET.tostring does serialize the entities unexpanded, so in some circumstances resolve_entities=False
does work.
I expected that the tree would contain the parsed (unexpanded) values, but referencing the actual tree.text/tail/attrib
doesn't give the expected results. This isn't a criticism; it actually makes my life a bit easier. If I had wanted the
unexpanded values in the attrib/text/tail it would be more of a problem.
> `&#<value>` is not an entity reference but a character reference.
> It may rightfully be treated differently from entity references.
I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference
expansion. This makes using lxml for source transformation a bit harder since the original text is not preserved.
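
One possible workaround, sketched here as an assumption rather than anything lxml offers: mask the character references before parsing and restore them after serialization. To keep the sketch dependency-free it uses the standard library's xml.etree.ElementTree as a stand-in for lxml. The helper names and the private-use sentinel characters are hypothetical, and the sample input leaves out &mysym; because ElementTree rejects undeclared entities.

```python
import re
import xml.etree.ElementTree as ET

# Decimal (&#33;) and hexadecimal (&#x21;) character references.
CHAR_REF = re.compile(r"&#([0-9]+|[xX][0-9A-Fa-f]+);")

# Hypothetical sentinels from the Unicode private use area. They are legal
# XML characters, but the scheme only works if the real input never uses them.
OPEN, CLOSE = "\ue000", "\ue001"
MASKED_REF = re.compile(OPEN + r"([0-9]+|[xX][0-9A-Fa-f]+)" + CLOSE)


def mask_char_refs(source):
    """Swap each character reference for a sentinel the parser passes through."""
    return CHAR_REF.sub(lambda m: OPEN + m.group(1) + CLOSE, source)


def unmask_char_refs(serialized):
    """Turn the sentinels in serialized output back into character references."""
    return MASKED_REF.sub(lambda m: "&#" + m.group(1) + ";", serialized)


source = '<a attr="x &#33;">aaaaa &lt; &amp; &gt; &#33; AAAAA</a>'

# Plain parse: the character reference is expanded and cannot be recovered.
plain = ET.tostring(ET.fromstring(source), encoding="unicode")

# Masked parse: the reference travels through the tree as sentinel text.
# encoding="unicode" matters here, since the default ASCII output would
# rewrite the sentinels themselves as character references.
tree = ET.fromstring(mask_char_refs(source))
round_tripped = unmask_char_refs(ET.tostring(tree, encoding="unicode"))

print(plain)
print(round_tripped)
```

The cost is that tree.text and the attribute values now contain the sentinels instead of the real characters, so any code that inspects or edits the tree has to know about them.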