[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Fri, 12 Nov 1999 16:50:33 +0100


"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Access to this mark will go into sys: sys.bom.
> 
>   Can the name in sys be a little more descriptive?
> sys.byte_order_mark would be reasonable.

The abbreviation BOM is quite common with respect to Unicode.

>   I think that a support module (possibly unicodec) should provide
> constants for all four byte order marks as strings (2- & 4-byte,
> little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
> etc.

Good idea...

sys.bom should return the byte order mark (BOM) for the format used
internally. The unicodec module should provide symbols for all
possible values of this variable:

  BOM_BE: '\376\377' 
    (corresponds to Unicode 0x0000FEFF in UTF-16 
     == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376' 
    (corresponds to Unicode 0x0000FFFE in UTF-16 
     == illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode 0x0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (reads as 0xFFFE0000 when taken as big-endian UCS-4
     == not a valid Unicode character)
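
As a rough sketch of how such constants could be derived rather than
typed in by hand, the following encodes U+FEFF in each byte order using
present-day Python. The names BOM_BE, BOM_LE, BOM4_BE and BOM4_LE follow
the proposal above; NATIVE_BOM is only a hypothetical stand-in for the
proposed sys.bom, and it assumes a 2-byte internal format.

```python
import codecs
import sys

# U+FEFF, ZERO WIDTH NO-BREAK SPACE, is the character used as the mark.
ZWNBSP = "\ufeff"

# Encoding it with an explicit byte order yields the raw mark bytes.
BOM_BE = ZWNBSP.encode("utf-16-be")
BOM_LE = ZWNBSP.encode("utf-16-le")
BOM4_BE = ZWNBSP.encode("utf-32-be")
BOM4_LE = ZWNBSP.encode("utf-32-le")

# The standard codecs module exposes the same byte strings.
assert BOM_BE == codecs.BOM_UTF16_BE
assert BOM_LE == codecs.BOM_UTF16_LE
assert BOM4_BE == codecs.BOM_UTF32_BE
assert BOM4_LE == codecs.BOM_UTF32_LE

# Hypothetical stand-in for the proposed sys.bom: the 2-byte mark in the
# byte order of the running machine (an assumption, not an existing API).
NATIVE_BOM = BOM_LE if sys.byteorder == "little" else BOM_BE
```

Deriving the values from the character itself avoids transposing bytes
when writing the 4-byte forms.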

Note that Unicode treats big-endian byte order as the "correct" one. A
mark that reads as the byte-swapped value indicates data in the opposite
byte order, hence the definition of 0xFFFE as an illegal character.
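
To illustrate how a reader of the data might use the mark, here is a
minimal sketch in present-day Python that inspects the leading bytes of
a buffer. The function name sniff_byte_order and its return convention
are assumptions made for this example, not part of the proposal.

```python
import codecs

# 4-byte marks are listed first: the little-endian 4-byte mark begins
# with the same two bytes as the little-endian 2-byte mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_byte_order(data):
    """Return (codec_name, bom_length) for a leading BOM, else (None, 0).

    Hypothetical helper. 2-byte little-endian data whose first character
    after the mark is U+0000 cannot be told apart from 4-byte
    little-endian data, so the result is only a guess.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: strip the mark and decode with the detected byte order.
raw = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
name, skip = sniff_byte_order(raw)
text = raw[skip:].decode(name) if name else None
```

Seeing the swapped mark is then a signal to decode with the opposite
byte order, not an error in the data.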

-- 
Marc-Andre Lemburg