[Python-Dev] Internationalization Toolkit
M.-A. Lemburg
mal@lemburg.com
Fri, 12 Nov 1999 16:50:33 +0100
"Fred L. Drake, Jr." wrote:
>
> M.-A. Lemburg writes:
> > Access to this mark will go into sys: sys.bom.
>
> Can the name in sys be a little more descriptive?
> sys.byte_order_mark would be reasonable.
The abbreviation BOM is quite common with respect to Unicode.
> I think that a support module (possibly unicodec) should provide
> constants for all four byte order marks as strings (2- & 4-byte,
> little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE,
> etc.
Good idea...
sys.bom should return the byte order mark (BOM) for the format used
internally. The unicodec module should provide symbols for all
possible values of this variable:
BOM_BE: '\376\377'
  (corresponds to Unicode 0x0000FEFF in UTF-16
   == ZERO WIDTH NO-BREAK SPACE)
BOM_LE: '\377\376'
  (corresponds to Unicode 0x0000FFFE in UTF-16
   == illegal Unicode character)
BOM4_BE: '\000\000\376\377'
  (corresponds to Unicode 0x0000FEFF in UCS-4)
BOM4_LE: '\377\376\000\000'
  (corresponds to 0xFFFE0000 in UCS-4
   == not a valid Unicode character)
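
A rough sketch of how these symbols could be defined and used to sniff
the byte order of some data. The constant names are the ones proposed
in this thread; detect_bom() is a hypothetical helper and not part of
the proposal. The values are written as bytes literals with hex escapes
(\xfe == \376, \xff == \377), and each mark is U+FEFF stored in two or
four bytes in the given byte order:

```python
# Constant names as proposed in this thread; values as bytes literals.
BOM_BE = b'\xfe\xff'            # U+FEFF, 2 bytes, big endian
BOM_LE = b'\xff\xfe'            # U+FEFF, 2 bytes, little endian
BOM4_BE = b'\x00\x00\xfe\xff'   # U+FEFF, 4 bytes, big endian
BOM4_LE = b'\xff\xfe\x00\x00'   # U+FEFF, 4 bytes, little endian


def detect_bom(data):
    """Return the name of the BOM that data starts with, or None.

    Hypothetical helper for illustration only. The 4-byte marks are
    tested first because BOM4_LE starts with the same two bytes as
    BOM_LE.
    """
    candidates = (
        ('BOM4_BE', BOM4_BE),
        ('BOM4_LE', BOM4_LE),
        ('BOM_BE', BOM_BE),
        ('BOM_LE', BOM_LE),
    )
    for name, mark in candidates:
        if data.startswith(mark):
            return name
    return None
```

Note that 2-byte little endian data whose first character after the
mark is U+0000 looks the same as 4-byte little endian data, so a real
implementation would probably want the caller to say which width it
expects.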
Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
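
To illustrate the point about the swapped order, here is a small sketch
using the struct module. The function names are made up for this
example. Reading the first two bytes as a big endian 16-bit value gives
0xFEFF when the data really is big endian and 0xFFFE when the bytes are
swapped:

```python
import struct


def first_unit_big_endian(data):
    """Read the first two bytes of data as a big endian 16-bit value."""
    return struct.unpack('>H', data[:2])[0]


def is_byte_swapped(data):
    """True if data starts with a BOM stored in little endian order.

    A reader that assumes big endian sees 0xFFFE, which is not a
    character, so it can tell that the bytes need swapping.
    """
    return first_unit_big_endian(data) == 0xFFFE
```

For example, is_byte_swapped(b'\xff\xfeA\x00') is true, while
is_byte_swapped(b'\xfe\xff\x00A') is false.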
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/