Unicode BOM marks
newsgroups at jhrothjr.com
Tue Mar 8 16:52:18 CET 2005
"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:422cf441$0$12162$9b622d9e at news.freenet.de...
> Francis Girard wrote:
>> Well, no text files can't be concatenated ! Sooner or later, someone will
>> use "cat" on the text files your application did generate. That will be a
>> lot of fun for the new unicode aware "super-cat".
> Well, no. For example, Python source code is not typically concatenated,
> nor is source code in any other language. The same holds for XML files:
> concatenating two XML documents (using cat) gives an ill-formed document
> - whether the files start with an UTF-8 signature or not.
And if you're talking HTML and XML, the situation is even worse, since
the application absolutely needs to be aware of the signature. HTML might
have a <meta ... > directive close to the front to tell you what the
encoding is supposed to be, and then again, it might not. You should be able to
depend on the first character being a <, but you might not be able to. FitNesse,
for example, sends FIT a file that consists of the HTML between the <body>
and </body> tags, and nothing else. This situation makes character set
detection in PyFit, um, interesting. (Fortunately, I have other ways of
dealing with FitNesse, but it's still an issue for batch use.)
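A minimal sketch of the signature check for a headerless fragment like that, assuming modern Python 3 and a UTF-8 fallback when no signature is present. The names sniff_bom and decode_fragment are made up for illustration; this is not PyFit's actual code.

```python
import codecs

# Longer signatures are tested before their prefixes: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default="utf-8"):
    """Return (encoding, payload) with any leading signature removed.

    `default` is an assumption used when no signature is found.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


def decode_fragment(data, default="utf-8"):
    """Decode a byte string, honouring a leading BOM if there is one."""
    encoding, payload = sniff_bom(data, default)
    return payload.decode(encoding)
```

For the UTF-8-only case, Python's built-in "utf-8-sig" codec does the same stripping on decode.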
> As for the "super-cat": there is actually no problem with putting U+FEFF
> in the middle of some document - applications are supposed to filter it
> out. The precise processing instructions in the Unicode standard vary
> from Unicode version to Unicode version, but essentially, you are
> supposed to ignore the BOM if you see it.
It would be useful for "super-cat" to filter all but the first one, however.
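A rough sketch of that filtering, assuming every input is a UTF-8 byte string already read into memory. The function name super_cat is hypothetical, and a real tool would stream files rather than join them in memory.

```python
import codecs


def super_cat(chunks):
    """Concatenate UTF-8 byte strings, keeping only the first one's signature."""
    bom = codecs.BOM_UTF8
    out = []
    for index, chunk in enumerate(chunks):
        # Strip the leading signature from every chunk except the first.
        if index > 0 and chunk.startswith(bom):
            chunk = chunk[len(bom):]
        out.append(chunk)
    return b"".join(out)
```

Only a leading signature is removed from each input; a U+FEFF that already sits in the middle of a file is left for the reading application to ignore.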