Unicode BOM marks
francis.girard at free.fr
Mon Mar 7 23:56:57 CET 2005
On Monday, 7 March 2005 at 21:54, "Martin v. Löwis" wrote:
Thank you for your very informative answer. Some interspersed remarks follow.
> I personally would write my applications so that they put the signature
> into files that cannot be concatenated meaningfully (since the
> signature simplifies encoding auto-detection) and leave out the
> signature from files which can be concatenated (as concatenating the
> files will put the signature in the middle of a file).
Well, there is no text file that can't be concatenated! Sooner or later,
someone will use "cat" on the text files your application generated. That will
be a lot of fun for the new Unicode-aware "super-cat".
> > I guess that this leading BOM mark is made of special marking bytes that
> > can't, in any way, be decoded as valid text.
> > Right ?
> Wrong. The BOM mark decodes as U+FEFF:
> >>> codecs.BOM_UTF8.decode("utf-8")
> u'\ufeff'
I meant "valid text" to denote human-readable, actual, real natural-language
text. My intent with this question was to make sure that we can easily
distinguish a UTF-8 file with the signature from one without. Your answer
implies that we can, since U+FEFF never shows up in such text.
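In any case, a UTF-8 signature is easy to detect after decoding; a small
sketch (modern Python syntax, with a made-up sample string):

```python
import codecs

with_sig = codecs.BOM_UTF8 + "Bonjour".encode("utf-8")   # signature present
without_sig = "Bonjour".encode("utf-8")                  # signature absent

# The signature decodes to the single code point U+FEFF, which never
# appears in ordinary natural-language text, so the two variants are
# easy to tell apart after decoding.
print(with_sig.decode("utf-8").startswith("\ufeff"))     # True
print(without_sig.decode("utf-8").startswith("\ufeff"))  # False
```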
> > I also guess that this leading BOM mark is silently ignored by any
> > unicode aware file stream reader to which we already indicated that the
> > file follows the UTF-8 encoding standard.
> > Right ?
> No. It should eventually be ignored by the application, but whether the
> stream reader special-cases it or not depends on application needs.
Well, for most of us, I think, the need is to transparently decode the input
into a unique internal Unicode encoding (UTF-16 for both Java and Qt ; the Qt
docs saying there might be a need to switch to UTF-32 some day) and then be
able to manipulate this internal text with the usual tools your programming
system provides. By "transparent", I mean, at least, to be able to
automatically process the two variants of the same UTF-8 encoding. We should
only have to specify "UTF-8" and the streamer should take care of the rest.
BTW, the Python "unicode" built-in function documentation says it returns a
"unicode" string, which hardly means anything. What is the Python
"internal" Unicode encoding ?
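One can at least peek at this from within Python itself (a sketch; the answer
has historically depended on whether the interpreter was a "narrow" UCS-2 or
"wide" UCS-4 build):

```python
import sys

# sys.maxunicode hints at the internal representation: 0xFFFF on
# narrow (16-bit) builds, 0x10FFFF on wide builds and on all modern
# Python versions.
print(hex(sys.maxunicode))
```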
> No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
> it to the application when it finds it, and it will never generate the
> signature on its own. So processing the UTF-8 signature is left to the
> application in Python.
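Since the codec leaves the signature alone, the application has to strip it
itself; a minimal sketch (the helper name decode_utf8_signature is my own
invention, not a standard API):

```python
import codecs

def decode_utf8_signature(data):
    """Decode UTF-8 bytes, dropping a leading signature if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

print(decode_utf8_signature(codecs.BOM_UTF8 + b"Bonjour"))  # Bonjour
print(decode_utf8_signature(b"Bonjour"))                    # Bonjour
```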
> > In the Python documentation, I see these constants. The documentation is
> > not clear about which encoding these constants apply to. Here's my
> > understanding :
> > BOM : UTF-8 only or UTF-8 and UTF-32 ?
> > BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
> > BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
> > Why should I need these constants if codecs decoder can handle them
> > without my help, only specifying the encoding ?
> Well, because the codecs don't. It might be useful to add a
> "utf-8-signature" codec some day, which generates the signature on
> encoding, and removes it on decoding.
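As for which encodings those constants belong to: in CPython's codecs module
they are plain byte strings that can be inspected directly, and BOM, BOM_BE
and BOM_LE turn out to be the UTF-16 marks (easy to verify, sketched here in
modern syntax):

```python
import codecs

# BOM is the UTF-16 mark in native byte order; BOM_BE and BOM_LE are
# the big- and little-endian UTF-16 marks -- not UTF-8 or UTF-32 ones.
print(codecs.BOM == codecs.BOM_UTF16)        # True
print(codecs.BOM_BE == codecs.BOM_UTF16_BE)  # True
print(codecs.BOM_LE == codecs.BOM_UTF16_LE)  # True
print(codecs.BOM_UTF8)                       # b'\xef\xbb\xbf'
```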
My sincere thanks,