what to do with multiple BOMs
MRAB
python at mrabarnett.plus.com
Thu Aug 19 14:52:35 EDT 2021
On 2021-08-19 14:07, Robin Becker wrote:
> Channeling unicode text experts and xml people:
>
> I have an xml entity with initial bytes ff fe ff fe which the file command says is
> UTF-16, little-endian text.
>
> I agree, but what should be done about the additional BOM?
>
> A test output made many years ago seems to keep the extra BOM. The xml context is
>
>
> xml file 014.xml
> <!DOCTYPE doc [
> <!ELEMENT doc (#PCDATA)>
> <!ENTITY e SYSTEM "014.ent">
> ]>
> <doc>&e;</doc>
>
> the entity file 014.ent is bombomdata
>
> b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'
>
> The old saved test output of processing is
>
> b'<doc>\xef\xbb\xbfdata</doc>'
>
> which seems to imply that the extra BOM in the entity has been kept and processed into a different BOM meaning utf8.
>
> I think the test file is wrong and that multiple BOM chars in the entity should have been removed.
>
> Am I right?
>
The use of a BOM b'\xef\xbb\xbf' at the start of a UTF-8 file is mostly a
Windows thing. It's rarely used on non-Windows systems. Putting it in the
middle, e.g. b'<doc>\xef\xbb\xbfdata</doc>', just looks wrong.
It looks like the contents of a UTF-8 file, with a BOM because it
originated on a Windows system, were read in without stripping the BOM
first.
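
For what it's worth, here is a minimal Python sketch of one way those bytes
could end up in the output. It uses the entity bytes quoted above and only the
standard codecs; the final stripping step is just one possible policy, an
assumption for illustration, not something the XML processor in question is
known to do.

```python
import codecs

# Bytes of 014.ent as quoted above: two UTF-16-LE BOMs followed by "data".
raw = b'\xff\xfe\xff\xfed\x00a\x00t\x00a\x00'

# The 'utf-16' codec uses the first BOM to pick the byte order and drops it.
# The second FF FE is decoded as an ordinary character, U+FEFF.
text = raw.decode('utf-16')
print(repr(text))   # '\ufeffdata'

# Encoding U+FEFF as UTF-8 gives EF BB BF, the same bytes as a UTF-8 BOM.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)   # b'\xef\xbb\xbfdata'
assert utf8_bytes.startswith(codecs.BOM_UTF8)

# Assumed policy, for illustration only: drop any leftover leading U+FEFF
# before the text is substituted into the document.
cleaned = text.lstrip('\ufeff')
print(b'<doc>' + cleaned.encode('utf-8') + b'</doc>')   # b'<doc>data</doc>'
```

So the b'\xef\xbb\xbf' in the saved output need not have come from a UTF-8
file at all; a second UTF-16 BOM that survived decoding would produce the same
bytes once re-encoded as UTF-8.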