Unicode BOM marks
shoot at the.moon
Wed Mar 9 22:16:38 CET 2005
Francis Girard wrote:
> On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:
> Thank you for your very informative answer. Some interspersed remarks follow.
>>I personally would write my applications so that they put the signature
>>into files that cannot be concatenated meaningfully (since the
>>signature simplifies encoding auto-detection) and leave out the
>>signature from files which can be concatenated (as concatenating the
>>files will put the signature in the middle of a file).
> Well, there are no text files that can't be concatenated! Sooner or later,
> someone will use "cat" on the text files your application generated. That
> will be a lot of fun for the new Unicode-aware "super-cat".
It is my understanding that the BOM (U+FEFF) is actually the
Unicode character "ZERO WIDTH NO-BREAK SPACE". I take
this to mean that the character can appear invisibly
anywhere in text, and its appearance as the first character
of a text is pretty harmless. Concatenating files will
leave invisible space characters in the middle of the text,
but presumably not in the middle of words, so no harm is
done there either.
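
A minimal sketch of the concatenation case, assuming a modern Python 3 and
its "utf-8-sig" codec (which did not exist when this thread was written).
The two byte strings stand in for hypothetical files that each start with
a UTF-8 signature; joining them as "cat" would leaves one U+FEFF in the
middle of the decoded text:

```python
import codecs

# Two hypothetical "files", each written with a UTF-8 signature (BOM).
part_a = "first\n".encode("utf-8-sig")
part_b = "second\n".encode("utf-8-sig")
assert part_a.startswith(codecs.BOM_UTF8)

# Byte-level concatenation, as "cat" would do it.
joined = part_a + part_b

# The utf-8-sig decoder strips only a leading signature; the second
# one survives as an ordinary U+FEFF character inside the text.
text = joined.decode("utf-8-sig")

print(repr(text))
print("embedded U+FEFF count:", text.count("\ufeff"))
```

Whether the leftover character is harmless depends on the consumer: it is
invisible when displayed, but a byte-wise comparison or a parser that does
not expect it will still see it.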
I suspect that the fact that an explicitly invisible
character U+FEFF has an invalid character code U+FFFE for its
byte-reversed counterpart is no accident, and that the
character was intended from inception to also serve as a
byte order indication.
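
The byte-order role can be sketched as follows, assuming UTF-16 input.
The function name sniff_utf16_order is made up for illustration, and the
check is deliberately naive: it ignores UTF-32, whose little-endian
signature begins with the same two bytes.

```python
import codecs

def sniff_utf16_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"

# The same text, with an explicit U+FEFF first, in both byte orders.
be = "\ufeffhi".encode("utf-16-be")
le = "\ufeffhi".encode("utf-16-le")

print(sniff_utf16_order(be), sniff_utf16_order(le))

# Reading the little-endian BOM with the wrong byte order yields U+FFFE,
# which is not a valid character, so the mismatch is detectable.
swapped = le[:2].decode("utf-16-be")
print(hex(ord(swapped)))
```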