Unicode BOM marks

Francis Girard francis.girard at free.fr
Tue Mar 8 09:53:46 CET 2005


> Well, no. For example, Python source code is not typically concatenated,
> nor is source code in any other language. 

We did it with C++ files in order to have only one compilation unit and 
accelerate compilation over the network. Also, all the languages with some 
"include" directive will have to take care of it. I guess a Unicode-aware C 
preprocessor already does.

> As for the "super-cat": there is actually no problem with putting U+FEFF
> in the middle of some document - applications are supposed to filter it
> out. The precise processing instructions in the Unicode standard vary
> from Unicode version to Unicode version, but essentially, you are
> supposed to ignore the BOM if you see it.

Ok. I'm reassured.
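
For what it's worth, here is a minimal sketch of what such a filtering 
"super-cat" could look like. It assumes Python 3 and UTF-8 input, and the 
helper name cat_without_boms is made up for illustration. Note that it drops 
every U+FEFF, so it would also remove one that was meant as a zero width 
no-break space.

```python
import codecs

def cat_without_boms(chunks):
    """Decode UTF-8 byte chunks, concatenate them, and drop every U+FEFF."""
    # Plain "utf-8" decoding keeps a BOM as the character U+FEFF,
    # so each chunk's BOM ends up somewhere in the joined text.
    text = "".join(chunk.decode("utf-8") for chunk in chunks)
    # Filter the BOM out wherever it landed, including mid-document.
    return text.replace("\ufeff", "")

# Two "files", each starting with a UTF-8 BOM (hypothetical contents).
first = codecs.BOM_UTF8 + "int a;\n".encode("utf-8")
second = codecs.BOM_UTF8 + "int b;\n".encode("utf-8")
joined = cat_without_boms([first, second])
```

Decoding with "utf-8-sig" instead would only strip a BOM at the very start 
of each chunk, which is enough when every file is decoded separately.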

> A Unicode string is a sequence of integers. The numbers are typically
> represented as base-2, but the details depend on the C compiler.
> It is specifically *not* UTF-16, big or little endian (i.e. a single
> number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
> depending on a compile-time choice (which can be determined by looking
> at sys.maxunicode, which in turn can be either 65535 or 1114111).
> The programming interface to the individual characters is formed by
> the unichr and ord builtin functions, which expect and return integers
> between 0 and sys.maxunicode.

Ok. I guess that Python gives the flexibility of being configurable (when 
compiling Python) to internally represent Unicode strings with a fixed 2 or 4 
bytes per character (UCS-2 or UCS-4). 
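
A small sketch of how one could check this from the interpreter. It is 
written for Python 3, where unichr is spelled chr; the function name 
describe_build is only illustrative. Since Python 3.3 the compile-time choice 
is gone and sys.maxunicode is always 1114111, so the narrow branch only 
matters on older builds.

```python
import sys

def describe_build():
    """Report which internal width sys.maxunicode implies."""
    if sys.maxunicode == 0xFFFF:   # 65535: 2 bytes per character
        return "narrow (UCS-2)"
    return "wide (UCS-4)"          # 1114111: 4 bytes per character

# ord() and chr() map between characters and integer code points.
code_point = ord("\u00e9")         # LATIN SMALL LETTER E WITH ACUTE
round_trip = chr(code_point)
build = describe_build()
```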

Thank you
Francis Girard
