[Python-3000] Unicode and OS strings

Wed Sep 19 00:29:24 CEST 2007

On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> > > > There's no UTF-8 in Python's internal string encoding.
>
> > > (At least as of a few days ago)
>
> > > In Python 3 there is; strings are Unicode.  A PyUnicodeObject has
> > > two encodings that you can grab through a pointer (which means
> > > they have to be there already; there is no chance to generate
> > > them on demand, as there would be behind a function).
>
> > Incorrect. The pointer can be NULL.
>
> I had missed that comment, but I do see it now; thank you.
>
> > The API for getting the UTF-8 encoding is a function
>
> Thank you.  But given that defenc is now always UTF-8, won't exposing
> it in the public typedef then just be an attractive nuisance?

*ALL* fields of the struct def are strictly internal.

> > (moreover a function whose name starts with _Py).
>
> That I still don't see.

I am talking about _PyUnicode_AsDefaultEncoding(). (Which you
shouldn't be calling. :-)
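
To illustrate the two points above (the cached pointer can be NULL, and
the supported way to get at the encoded form is a function), here is a
minimal sketch in plain C. It is not CPython code: demo_str,
demo_str_cached() and demo_str_clear() are made-up names, and the
"encoding" is just a byte copy of an ASCII string. The cached field
plays the role that defenc plays in the struct quoted below.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object with a lazily filled cache field.
 * None of these names exist in CPython. */
typedef struct {
    const char *text;   /* the "real" data (plain ASCII here, for brevity) */
    char *cached;       /* derived copy, or NULL until first requested */
} demo_str;

/* Accessor: fills the cache on first use and returns it.
 * Returns NULL only if the allocation fails. */
const char *demo_str_cached(demo_str *s)
{
    if (s->cached == NULL) {
        size_t n = strlen(s->text) + 1;     /* include the terminator */
        s->cached = malloc(n);
        if (s->cached == NULL)
            return NULL;
        memcpy(s->cached, s->text, n);
    }
    return s->cached;
}

/* Drop the cached copy; the next accessor call regenerates it. */
void demo_str_clear(demo_str *s)
{
    free(s->cached);
    s->cached = NULL;
}
```

Code that reads s.cached directly sees NULL until somebody has called
demo_str_cached(); code that goes through the function never has to
care. In real extension code the equivalent of "going through the
function" is a public call such as PyUnicode_AsUTF8String(), quoted
below.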

> http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup
>
> PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
>     PyObject *unicode           /* Unicode object */
>     );
>
> PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8(
>     const Py_UNICODE *data,     /* Unicode char buffer */
>     Py_ssize_t length,          /* number of Py_UNICODE chars to encode */
>     const char *errors          /* error handling */
>     );
>
>
> Later, the same file shows me:
>
> /* --- Unicode Type ------------------------------------------------------- */
>
> typedef struct {
>     PyObject_HEAD
>     Py_ssize_t length;          /* Length of raw Unicode data in buffer */
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are *not* counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } PyUnicodeObject;
>
>
> I would be happier with:
>
> typedef struct {
>     PyObject_VAR_HEAD           /* Length in code points, not chars */
> } PyUnicodeObject;
>
> And, in unicodeobject.c (*not* in a public header)
>
> typedef struct {
>     PyUnicodeObject ob_unicodehead;
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are *not* counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } _PyDefaultUnicodeObject;
>
> This would allow third parties to create implementations specialized
> for (and saving space on) smaller alphabets, without breaking C
> extensions that stick to the public header files.  (Moving hash or
> even state to the public header might be OK too, but they seemed to
> get ignored for subclasses anyhow.)
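
For reference, the split proposed in the quoted text (a small public
head, with the full layout private to one .c file) can be sketched in
plain C roughly as follows. This is an illustration under assumptions,
not CPython's layout: demo_unicode, demo_default_unicode and the three
demo_unicode_* functions are hypothetical names, and the sketch only
handles ASCII input.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* "Public header" part: all that extension code is meant to see. */
typedef struct {
    size_t length;              /* length in code points */
} demo_unicode;

/* "Private" part: the full layout of the default implementation.
 * The public head comes first, so a pointer to the head can be
 * converted back to a pointer to the whole struct. */
typedef struct {
    demo_unicode head;
    unsigned long *str;         /* one element per code point */
} demo_default_unicode;

/* Build an object from an ASCII C string; returns NULL on failure. */
demo_unicode *demo_unicode_from_ascii(const char *s)
{
    size_t n = strlen(s);
    demo_default_unicode *d = malloc(sizeof *d);
    if (d == NULL)
        return NULL;
    d->str = malloc((n ? n : 1) * sizeof *d->str);
    if (d->str == NULL) {
        free(d);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        d->str[i] = (unsigned char)s[i];
    d->head.length = n;
    return &d->head;            /* callers only ever see the head */
}

/* Read one code point; the caller must ensure i < u->length. */
unsigned long demo_unicode_char_at(const demo_unicode *u, size_t i)
{
    const demo_default_unicode *d = (const demo_default_unicode *)u;
    return d->str[i];
}

/* Release an object returned by demo_unicode_from_ascii(). */
void demo_unicode_free(demo_unicode *u)
{
    demo_default_unicode *d = (demo_default_unicode *)u;
    if (d != NULL) {
        free(d->str);
        free(d);
    }
}
```

Callers that use only demo_unicode and the functions never learn how
str is stored. Note that the sketch has a single implementation; a
second, more compact one would also need some way to dispatch on the
object's type, which is left out here.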

That is not a supported use case.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)