Unicode BOM marks

Jeff Epler jepler at unpythonic.net
Mon Mar 7 18:44:42 EST 2005


On Mon, Mar 07, 2005 at 11:56:57PM +0100, Francis Girard wrote:
> BTW, the python "unicode" built-in function documentation says it returns a 
> "unicode" string which scarcely means something. What is the python 
> "internal" unicode encoding ?

The language reference says fairly little about unicode objects.  Here's
what it does say: [http://docs.python.org/ref/types.html#l2h-48]
    Unicode
        The items of a Unicode object are Unicode code units. A Unicode
        code unit is represented by a Unicode object of one item and can
        hold either a 16-bit or 32-bit value representing a Unicode
        ordinal (the maximum value for the ordinal is given in
        sys.maxunicode, and depends on how Python is configured at
        compile time). Surrogate pairs may be present in the Unicode
        object, and will be reported as two separate items. The built-in
        functions unichr() and ord() convert between code units and
        nonnegative integers representing the Unicode ordinals as
        defined in the Unicode Standard 3.0. Conversion from and to
        other encodings are possible through the Unicode method encode
        and the built-in function unicode().
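
To make that paragraph a bit more concrete, here is a rough sketch of
how code units, ordinals and surrogate pairs relate.  The helper name
utf16_code_units is made up for illustration, and the "unichr = chr"
fallback is my assumption so the same snippet also runs on a Python
that has no unichr() built-in:

```python
import sys

try:
    unichr            # built-in on Python 2
except NameError:
    unichr = chr      # assumption: chr() fills this role on Python 3


def utf16_code_units(s):
    """Return the 16-bit code units of s as a list of ints.

    Hypothetical helper: it goes through the utf-16-le codec rather
    than peeking at the interpreter's internal buffer.
    """
    data = bytearray(s.encode("utf-16-le"))
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]


# unichr() and ord() convert between an ordinal and a one-item string.
e_acute = unichr(0xE9)
print(ord(e_acute))                  # 233
print(utf16_code_units(e_acute))     # [233]

# U+1D11E lies outside the 16-bit range, so a 16-bit build stores it
# as a surrogate pair and reports it as two separate items.
clef = b"\xf0\x9d\x84\x9e".decode("utf-8")
print(len(clef))                     # 1 or 2, depending on the build
print([hex(u) for u in utf16_code_units(clef)])   # ['0xd834', '0xdd1e']
```

The len() result is the part that varies: where sys.maxunicode is
0xFFFF you should see 2, otherwise 1.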

In terms of the CPython implementation, the PyUnicodeObject is laid out
as follows:
    typedef struct {
        PyObject_HEAD
        int length;                 /* Length of raw Unicode data in buffer */
        Py_UNICODE *str;            /* Raw Unicode buffer */
        long hash;                  /* Hash value; -1 if not set */
        PyObject *defenc;           /* (Default) Encoded version as Python
                                       string, or NULL; this is used for
                                       implementing the buffer protocol */
    } PyUnicodeObject;
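Py_UNICODE is a C integral type wide enough to hold values up to
sys.maxunicode (probably one of unsigned short, unsigned int, unsigned
long, or wchar_t, depending on the platform and build).  So the
"internal encoding" is simply an array of these code units: 16-bit
units on a build where sys.maxunicode is 0xFFFF, 32-bit units otherwise.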
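
If you just want to know which kind of build you are running, you can
infer the code unit width from sys.maxunicode.  This is only a sketch;
py_unicode_bits is a made-up helper name, and the mapping assumes the
two build flavours described above.  (On Python 3.3 and later
sys.maxunicode is always 0x10FFFF and the storage is chosen per string,
so there the answer says nothing about the actual buffer.)

```python
import sys


def py_unicode_bits(maxunicode=sys.maxunicode):
    """Guess the width in bits of one code unit (hypothetical helper).

    Assumes a 16-bit build reports 0xFFFF and anything else is 32-bit.
    """
    return 16 if maxunicode == 0xFFFF else 32


print(hex(sys.maxunicode), py_unicode_bits())
```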

Jeff