Unicode BOM marks

Francis Girard francis.girard at free.fr
Tue Mar 8 04:01:00 EST 2005


Hi,

Thank you for your answer. That confirms what Martin v. Löwis says: when 
Python is compiled, you can choose between UCS-2 and UCS-4 for the internal 
unicode representation.
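
For anyone who wants to check which of the two a given interpreter was built 
with, here is a minimal sketch based on the sys.maxunicode value mentioned in 
the quoted reference text. The function name is mine, not something from the 
standard library, and the "UCS-2"/"UCS-4" labels are only the informal names 
used in this thread. Note that on Python 3.3 and later sys.maxunicode is 
always 0x10FFFF, so the narrow case can only occur on older builds.

```python
import sys


def internal_unicode_width():
    """Guess the build-time unicode width from sys.maxunicode.

    Hypothetical helper for illustration only.
    """
    if sys.maxunicode == 0xFFFF:
        # Narrow build: items are 16-bit code units.
        return "UCS-2"
    # Wide build: each item can hold any code point up to U+10FFFF.
    return "UCS-4"


print("%s (sys.maxunicode = 0x%X)" % (internal_unicode_width(), sys.maxunicode))
```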

Francis Girard

On Tuesday 8 March 2005 00:44, Jeff Epler wrote:
> On Mon, Mar 07, 2005 at 11:56:57PM +0100, Francis Girard wrote:
> > BTW, the python "unicode" built-in function documentation says it returns
> > a "unicode" string which scarcely means something. What is the python
> > "internal" unicode encoding ?
>
> The language reference says fairly little about unicode objects.  Here's
> what it does say: [http://docs.python.org/ref/types.html#l2h-48]
>     Unicode
>         The items of a Unicode object are Unicode code units. A Unicode
>         code unit is represented by a Unicode object of one item and can
>         hold either a 16-bit or 32-bit value representing a Unicode
>         ordinal (the maximum value for the ordinal is given in
>         sys.maxunicode, and depends on how Python is configured at
>         compile time). Surrogate pairs may be present in the Unicode
>         object, and will be reported as two separate items. The built-in
>         functions unichr() and ord() convert between code units and
>         nonnegative integers representing the Unicode ordinals as
>         defined in the Unicode Standard 3.0. Conversion from and to
>         other encodings are possible through the Unicode method encode
>         and the built-in function unicode().
>
> In terms of the CPython implementation, the PyUnicodeObject is laid out
> as follows:
>     typedef struct {
>         PyObject_HEAD
>         int length;                 /* Length of raw Unicode data in buffer */
>         Py_UNICODE *str;            /* Raw Unicode buffer */
>         long hash;                  /* Hash value; -1 if not set */
>         PyObject *defenc;           /* (Default) Encoded version as Python
>                                        string, or NULL; this is used for
>                                        implementing the buffer protocol */
>     } PyUnicodeObject;
> Py_UNICODE is some "C" integral type that can hold values up to
> sys.maxunicode (probably one of unsigned short, unsigned int, unsigned
> long, wchar_t).
>
> Jeff
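
The remark in the quoted reference text about surrogate pairs being reported 
as two separate items can be illustrated without depending on how the 
interpreter was built. The sketch below is only an illustration: both function 
names are made up for this message. One function does the UTF-16 surrogate 
arithmetic by hand, and the other asks the utf-16-be codec for the 16-bit code 
units of a string.

```python
import struct


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text, as a narrow build would store them."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
print([hex(u) for u in to_surrogates(0x1D11E)])
print([hex(u) for u in utf16_code_units(u"\U0001D11E")])
```

Both lines print the same pair, 0xd834 and 0xdd1e. On a build with 16-bit 
items, those are the two separate items the reference text refers to.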



