Unicode BOM marks
Jeff Epler
jepler at unpythonic.net
Mon Mar 7 18:44:42 EST 2005
On Mon, Mar 07, 2005 at 11:56:57PM +0100, Francis Girard wrote:
> BTW, the python "unicode" built-in function documentation says it returns a
> "unicode" string which scarcely means something. What is the python
> "internal" unicode encoding ?
The language reference says fairly little about unicode objects. Here's
what it does say: [http://docs.python.org/ref/types.html#l2h-48]
    Unicode
        The items of a Unicode object are Unicode code units. A Unicode
        code unit is represented by a Unicode object of one item and can
        hold either a 16-bit or 32-bit value representing a Unicode
        ordinal (the maximum value for the ordinal is given in
        sys.maxunicode, and depends on how Python is configured at
        compile time). Surrogate pairs may be present in the Unicode
        object, and will be reported as two separate items. The built-in
        functions unichr() and ord() convert between code units and
        nonnegative integers representing the Unicode ordinals as
        defined in the Unicode Standard 3.0. Conversion from and to
        other encodings are possible through the Unicode method encode
        and the built-in function unicode().
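
To make the "surrogate pairs are reported as two separate items" point concrete, here is a minimal sketch. The helper name utf16_code_units is my own, not anything from the Python documentation, and the sketch sidesteps the build-dependent behaviour by encoding to UTF-16 explicitly, so it shows what a 16-bit build would store rather than inspecting the interpreter's actual buffer. Note that on Python 3.3 and later sys.maxunicode is always 0x10FFFF; the 0xFFFF value only appears on older "narrow" builds.

```python
import struct
import sys


def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers.

    Hypothetical helper for illustration: it encodes explicitly to
    little-endian UTF-16 and unpacks each 2-byte unit.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range, so in
# UTF-16 it needs a surrogate pair: two code units for one character.
clef = u"\U0001D11E"

print("sys.maxunicode = %#x" % sys.maxunicode)
print("len(clef) = %d" % len(clef))  # 2 on a narrow build, 1 otherwise
print([hex(unit) for unit in utf16_code_units(clef)])
```

On a build where sys.maxunicode is 0xFFFF, len(clef) is 2 and indexing the string yields the two surrogates individually, which is exactly the behaviour the reference describes.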
In terms of the CPython implementation, the PyUnicodeObject is laid out
as follows:
    typedef struct {
        PyObject_HEAD
        int length;          /* Length of raw Unicode data in buffer */
        Py_UNICODE *str;     /* Raw Unicode buffer */
        long hash;           /* Hash value; -1 if not set */
        PyObject *defenc;    /* (Default) Encoded version as Python
                                string, or NULL; this is used for
                                implementing the buffer protocol */
    } PyUnicodeObject;
Py_UNICODE is some "C" integral type that can hold values up to
sys.maxunicode (probably one of unsigned short, unsigned int, unsigned
long, wchar_t).
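
If you want a rough idea of how wide that type has to be on your interpreter, the following sketch may help. The function names are made up for illustration, and it only reports a lower bound derived from sys.maxunicode plus the platform's wchar_t size as seen by ctypes; it does not read the real Py_UNICODE definition, so treat the output as a hint rather than the actual layout.

```python
import ctypes
import sys


def min_bits_for_code_unit():
    """Smallest number of bits able to hold sys.maxunicode.

    16 when sys.maxunicode is 0xFFFF, 21 when it is 0x10FFFF.
    """
    return sys.maxunicode.bit_length()


def wchar_size_bytes():
    """Size of the platform's C wchar_t, as reported by ctypes."""
    return ctypes.sizeof(ctypes.c_wchar)


print("bits needed for sys.maxunicode: %d" % min_bits_for_code_unit())
print("sizeof(wchar_t) on this platform: %d bytes" % wchar_size_bytes())
```

A result of 16 bits means an unsigned short is enough; 21 bits means the type has to be one of the 32-bit candidates.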
Jeff