Unicode BOM marks
francis.girard at free.fr
Tue Mar 8 09:53:46 CET 2005
> Well, no. For example, Python source code is not typically concatenated,
> nor is source code in any other language.
We did it with C++ files in order to have only one compilation unit and
accelerate compilation over the network. Also, every language with some
kind of "include" directive will have to take care of it. I guess a
Unicode-aware C preprocessor already does.
> As for the "super-cat": there is actually no problem with putting U+FEFF
> in the middle of some document - applications are supposed to filter it
> out. The precise processing instructions in the Unicode standard vary
> from Unicode version to Unicode version, but essentially, you are
> supposed to ignore the BOM if you see it.
Ok. I'm reassured.
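For the record, here is a minimal sketch of what such a "super-cat" could look like. It assumes Python 3 and UTF-8 input; the `super_cat` name and the sample chunks are made up for illustration. It relies on the standard `utf-8-sig` codec, which drops a leading BOM on decoding, and also shows the "filter it out" approach applied to a naive byte-level concatenation.

```python
import codecs

BOM = "\ufeff"  # U+FEFF is the byte order mark code point once decoded


def super_cat(chunks):
    """Concatenate UTF-8 byte strings, dropping a leading BOM from each one.

    Hypothetical helper: "utf-8-sig" decodes UTF-8 and skips a BOM if one
    is present at the very start of the input.
    """
    return "".join(raw.decode("utf-8-sig") for raw in chunks)


# Three pretend source files; the first two start with a UTF-8 BOM.
files = [
    codecs.BOM_UTF8 + b"int a;\n",
    codecs.BOM_UTF8 + b"int b;\n",
    b"int c;\n",
]

merged = super_cat(files)

# A naive byte-level concatenation keeps every BOM, including the ones that
# end up in the middle of the text after decoding with plain "utf-8".
naive = b"".join(files).decode("utf-8")

# The consumer can then simply ignore them, as described above.
filtered = naive.replace(BOM, "")

print(repr(merged))
print(naive.count(BOM))
```

Either way the result is the same text; the only difference is whether the BOMs are dropped by whoever concatenates or by whoever reads.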
> A Unicode string is a sequence of integers. The numbers are typically
> represented as base-2, but the details depend on the C compiler.
> It is specifically *not* UTF-16, big or little endian (i.e. a single
> number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
> depending on a compile-time choice (which can be determined by looking
> at sys.maxunicode, which in turn can be either 65535 or 1114111).
> The programming interface to the individual characters is formed by
> the unichr and ord builtin functions, which expect and return integers
> between 0 and sys.maxunicode.
Ok. I guess that Python can be configured (when compiling Python itself) to
represent unicode strings internally with a fixed 2 or 4 bytes per
character (UCS-2 or UCS-4).
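A small sketch of how this can be checked from the interpreter. It assumes a Python 3 interpreter (3.3 or later), where `unichr` is spelled `chr` and `sys.maxunicode` is always 1114111; on the older builds discussed here the same check would use `unichr` and could report 65535 instead.

```python
import sys

# 1114111 (0x10FFFF) on Python 3.3+; a narrow Python 2 build reported 65535.
print(sys.maxunicode)

# ord() maps a one-character string to its integer code point,
# and chr() (unichr in Python 2) maps the integer back.
code_point = ord("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
char = chr(code_point)

# The highest value accepted is sys.maxunicode itself.
top = chr(sys.maxunicode)

# One step beyond the range is rejected.
try:
    chr(sys.maxunicode + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

# A string element is a code point, not a byte sequence: a character outside
# the 16-bit range is one element here, but four bytes once encoded as UTF-16.
astral = "\U0001F600"
astral_length = len(astral)
utf16_length = len(astral.encode("utf-16-le"))

print(code_point, out_of_range_rejected, astral_length, utf16_length)
```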