what to do with multiple BOMs
Robin Becker
robin at reportlab.com
Thu Aug 19 09:07:43 EDT 2021
Channeling Unicode text experts and XML people:
I have an XML entity whose initial bytes are ff fe ff fe, which the file command says is
UTF-16, little-endian text.
I agree, but what should be done about the additional BOM?
A test output made many years ago seems to keep the extra BOM. The XML context is
the file 014.xml:
<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY e SYSTEM "014.ent">
]>
<doc>&e;</doc>
The entity file 014.ent is BOM BOM data:
b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'
The old saved test output of processing is
b'<doc>\xef\xbb\xbfdata</doc>'
which implies that the extra BOM in the entity has been kept as a character and re-encoded as ef bb bf, the byte sequence that serves as the UTF-8 BOM.
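As a quick check of that reading, here is a minimal sketch using only Python's built-in codecs. It assumes the processor behaves like Python's utf-16 decoder, which may not be how the original test output was produced; the variable names are just for illustration.

```python
# Bytes of 014.ent as given above: two little-endian BOMs, then "data".
raw = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

# The utf-16 codec consumes only the first BOM to choose the byte order.
# The second ff fe is decoded as an ordinary character, U+FEFF.
text = raw.decode('utf-16')
print(repr(text))

# Re-encoding that leftover U+FEFF as UTF-8 yields ef bb bf.
utf8 = text.encode('utf-8')
print(b'<doc>' + utf8 + b'</doc>')
```

Under that assumption, the saved output is what you get when only the first BOM is treated as an encoding signature.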
I think the test file is wrong and that the multiple BOM characters in the entity should have been removed.
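If removal is the wanted behaviour, one possible sketch is below. The helper name decode_entity is hypothetical, and it assumes the entity is UTF-16 with at least one BOM and that any further leading U+FEFF characters are unwanted; whether that is the correct treatment is exactly the open question.

```python
def decode_entity(raw: bytes) -> str:
    """Decode UTF-16 entity bytes and drop any extra leading U+FEFF."""
    # The codec removes the first BOM while detecting the byte order.
    text = raw.decode('utf-16')
    # Strip any remaining leading U+FEFF characters (an assumption,
    # not something the original test output does).
    return text.lstrip('\ufeff')


raw = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'
result = decode_entity(raw)
print(b'<doc>' + result.encode('utf-8') + b'</doc>')
```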
Am I right?
--
Robin Becker
More information about the Python-list mailing list