Re: [Python-Dev] Unicode and Windows
[on the user-supplies-buffer interface] I think this would be much less error-prone than having fixed-length buffers all over the place.
PyArg_ParseTuple() should probably raise an error in case the data doesn't fit into the buffer.
Ah, that's right, that solves most of that problem.
[on the malloced interface] Good point. You'll still need the buffer_len output parameter though -- otherwise you wouldn't be able to tell the size of the allocated buffer (the returned data may not be terminated).
Are you sure? I would expect the "eS" format to be used to obtain 8-bit data in some local encoding, and I would expect that all 8-bit encodings of unicode data would still allow for null-termination. Or are there 8-bit encodings out there where a zero byte is a normal occurrence and where it can't be used as terminator?
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Jack Jansen wrote:
[on the user-supplies-buffer interface] I think this would be much less error-prone than having fixed-length buffers all over the place.
PyArg_ParseTuple() should probably raise an error in case the data doesn't fit into the buffer.
Ah, that's right, that solves most of that problem.
[on the malloced interface] Good point. You'll still need the buffer_len output parameter though -- otherwise you wouldn't be able to tell the size of the allocated buffer (the returned data may not be terminated).
Are you sure? I would expect the "eS" format to be used to obtain 8-bit data in some local encoding, and I would expect that all 8-bit encodings of unicode data would still allow for null-termination. Or are there 8-bit encodings out there where a zero byte is a normal occurrence and where it can't be used as terminator?
Not sure whether these exist or not, but they are certainly a possibility to keep in mind. Perhaps adding "es#" and "es" (with 0-byte check) would be ideal?!
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
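The "0-byte check" suggested above could look roughly like the following. This is only a sketch: has_embedded_nul() is a made-up helper name and is not part of the Python C API or of the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (assumption, not a Python API): returns 1 if the
   encoded data holds a zero byte within its first len bytes, which would
   make it unsafe to hand to code expecting a NUL-terminated C string. */
int has_embedded_nul(const char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len == 0)
        return 0;
    return memchr(data, '\0', len) != NULL;
}
```

A parser marker without a length output would reject the argument when such a check reports a zero byte; a marker with a length output has no need for it.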
Ok, I've just added two new parser markers to PyArg_ParseTuple() which will hopefully make life a little easier for extension writers. The new code will be in the next patch set which I will release early next week.

Here are the docs:

Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

  "U":   Check for Unicode object and return a pointer to it

  "s":   For Unicode objects: auto convert them to the <default encoding>
         and return a pointer to the object's <defencstr> buffer.

  "s#":  Access to the Unicode object via the bf_getreadbuf buffer
         interface (see Buffer Interface); note that the length relates
         to the buffer length, not the Unicode string length (this may be
         different depending on the Internal Format).

  "t#":  Access to the Unicode object via the bf_getcharbuf buffer
         interface (see Buffer Interface); note that the length relates
         to the buffer length, not necessarily to the Unicode string
         length (this may be different depending on the
         <default encoding>).

  "es":  Takes two parameters: encoding (const char **) and
         buffer (char **).

         The input object is first coerced to Unicode in the usual way
         and then encoded into a string using the given encoding.

         On output, a buffer of the needed size is allocated and returned
         through *buffer as NULL-terminated string. The encoded data may
         not contain embedded NULL characters. The caller is responsible
         for free()ing the allocated *buffer after usage.

  "es#": Takes three parameters: encoding (const char **),
         buffer (char **) and buffer_len (int *).

         The input object is first coerced to Unicode in the usual way
         and then encoded into a string using the given encoding.

         If *buffer is non-NULL, *buffer_len must be set to
         sizeof(buffer) on input. Output is then copied to *buffer.

         If *buffer is NULL, a buffer of the needed size is allocated and
         output copied into it. *buffer is then updated to point to the
         allocated memory area. The caller is responsible for free()ing
         *buffer after usage.

         In both cases *buffer_len is updated to the number of characters
         written (excluding the trailing NULL-byte). The output buffer is
         assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              &encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              &encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              &encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
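To make the two "es#" buffer modes concrete, here is a small stand-alone sketch of the contract as described in the docs above. It is not the getargs.c code: store_encoded() and its parameters are invented for illustration, and it starts from data that has already been encoded.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the copy step of "es#" (an assumption made
   for illustration, not the real parser code).
   *buffer == NULL : allocate encoded_len + 1 bytes; caller must free().
   *buffer != NULL : *buffer_len holds the buffer capacity on input.
   On success returns 0 and sets *buffer_len to the number of bytes
   written, excluding the trailing NUL byte. Returns -1 on error and
   leaves *buffer and *buffer_len untouched. */
int store_encoded(const char *encoded, int encoded_len,
                  char **buffer, int *buffer_len)
{
    assert(encoded != NULL && encoded_len >= 0);
    assert(buffer != NULL && buffer_len != NULL);

    if (*buffer == NULL) {
        char *mem = malloc((size_t)encoded_len + 1);
        if (mem == NULL)
            return -1;                  /* out of memory */
        *buffer = mem;
    }
    else if (encoded_len >= *buffer_len) {
        return -1;                      /* data plus NUL does not fit */
    }
    memcpy(*buffer, encoded, (size_t)encoded_len);
    (*buffer)[encoded_len] = '\0';      /* output is always terminated */
    *buffer_len = encoded_len;
    return 0;
}
```

Because the length is reported separately, the data may contain zero bytes in this mode; the terminator is only a convenience for callers that know their encoding never produces them.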
On Fri, 24 Mar 2000, M.-A. Lemburg wrote:
... "s": For Unicode objects: auto convert them to the <default encoding> and return a pointer to the object's <defencstr> buffer.
Guess that I didn't notice this before, but it seems weird that "s" and "s#" return different encodings. Why?
"es": Takes two parameters: encoding (const char **) and buffer (char **). ... "es#": Takes three parameters: encoding (const char **), buffer (char **) and buffer_len (int *).
I see no reason to make the encoding (const char **) rather than (const char *). We are never returning a value, so this just makes it harder to pass the encoding into ParseTuple.

There is precedent for passing in single-ref pointers. For example:

    PyArg_ParseTuple(args, "O!", &s, PyString_Type)

I would recommend using just one pointer level for the encoding.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Greg Stein wrote:
On Fri, 24 Mar 2000, M.-A. Lemburg wrote:
... "s": For Unicode objects: auto convert them to the <default encoding> and return a pointer to the object's <defencstr> buffer.
Guess that I didn't notice this before, but it seems weird that "s" and "s#" return different encodings.
Why?
This is due to the buffer interface being used for "s#". Since "s#" refers to the getreadbuf slot, it returns raw data. In this case this is UTF-16 in platform dependent byte order. "s" relies on NULL-terminated strings and doesn't use the buffer interface at all. Thus "s" returns NULL-terminated UTF-8 (UTF-16 is full of NULLs). "t#" uses the getcharbuf slot and thus should return character data. UTF-8 is the right encoding here.
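The remark that UTF-16 is "full of NULLs" while UTF-8 is not can be checked with a few lines of plain C. The helpers below are illustrative only (their names are made up, they handle just BMP code points, and little-endian order is assumed for the UTF-16 case since the real byte order is platform dependent):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code point (cp < 0x10000) as UTF-16, little-endian
   byte order assumed. Always writes 2 bytes. */
size_t utf16le_encode_bmp(uint32_t cp, unsigned char out[2])
{
    assert(cp < 0x10000);
    out[0] = (unsigned char)(cp & 0xFF);
    out[1] = (unsigned char)(cp >> 8);
    return 2;
}

/* Encode one BMP code point as UTF-8; returns the byte count (1 to 3). */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Count zero bytes in an encoded byte sequence. */
size_t count_zero_bytes(const unsigned char *p, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}
```

Every code point below U+0100 yields a zero high byte in UTF-16, whereas UTF-8 only produces a zero byte for U+0000 itself, which is why a NUL-terminated "s" result needs an encoding like UTF-8.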
"es": Takes two parameters: encoding (const char **) and buffer (char **). ... "es#": Takes three parameters: encoding (const char **), buffer (char **) and buffer_len (int *).
I see no reason to make the encoding (const char **) rather than (const char *). We are never returning a value, so this just makes it harder to pass the encoding into ParseTuple.
There is precedent for passing in single-ref pointers. For example:
PyArg_ParseTuple(args, "O!", &s, PyString_Type)
I would recommend using just one pointer level for the encoding.
You have a point there... even though it breaks the notion of prepending all parameters with an '&' (ok, except the type check one). OTOH, it would allow passing the encoding right with the PyArg_ParseTuple() call which probably makes more sense in this context. I'll change it...
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
M.-A. Lemburg writes:
You have a point there... even though it breaks the notion of prepending all parameters with an '&' (ok, except the
I've never heard of this notion; I hope I didn't just miss it in the docs! The O& also doesn't require a & in front of the name of the conversion function, you just pass the right value. So there are at least two cases where you *typically* don't use a &. (Other cases in the 1.5.2 API are probably just plain weird if they don't!) Changing it to avoid the extra machinery is the Right Thing; you get to feel good today. ;)

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
Corporation for National Research Initiatives
"Fred L. Drake, Jr." wrote:
M.-A. Lemburg writes:
You have a point there... even though it breaks the notion of prepending all parameters with an '&' (ok, except the
I've never heard of this notion; I hope I didn't just miss it in the docs!
If you scan the parameters list in getargs.c you'll come to this conclusion and thus my notion: I've been programming like this for years now :-)
The O& also doesn't require a & in front of the name of the conversion function, you just pass the right value. So there are at least two cases where you *typically* don't use a &. (Other cases in the 1.5.2 API are probably just plain weird if they don't!) Changing it to avoid the extra machinery is the Right Thing; you get to feel good today. ;)
Ok, feeling good now ;-)
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Greg Stein writes:
There is precedent for passing in single-ref pointers. For example:
    PyArg_ParseTuple(args, "O!", &s, PyString_Type)
                                 ^^^^^^^^^^^^^^^^^
Feeling ok? I *suspect* these are reversed. :)

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
Corporation for National Research Initiatives
On Fri, 24 Mar 2000, Fred L. Drake, Jr. wrote:
Greg Stein writes:
There is precedent for passing in single-ref pointers. For example:
    PyArg_ParseTuple(args, "O!", &s, PyString_Type)
                                 ^^^^^^^^^^^^^^^^^
Feeling ok? I *suspect* these are reversed. :)
I just checked the code to ensure that it took a single pointer rather than a double-pointer. I guess that I didn't verify the order :-)

Concept is valid, tho... the params do not necessarily require an ampersand.

Oops! Actually... this does require an ampersand:

    PyArg_ParseTuple(args, "O!", &PyString_Type, &s)

Don't want to pass the whole structure...

Well, regardless: I would much prefer to see the encoding passed as a constant string, rather than having to shove the sucker into a variable first, just so that I can insert a useless address-of operator.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Well, regardless: I would much prefer to see the encoding passed as a constant string, rather than having to shove the sucker into a variable first, just so that I can insert a useless address-of operator.
Of course. Use & for output args, not as a matter of principle.

--Guido van Rossum (home page: http://www.python.org/~guido/)
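The rule of thumb, inputs by value and outputs by address, can be shown with a toy varargs parser. This is purely illustrative: mini_parse() and its "e" format code are invented here and have nothing to do with how PyArg_ParseTuple() is implemented.

```c
#include <assert.h>
#include <stdarg.h>
#include <string.h>

/* Hypothetical mini-parser (assumption for illustration only).
   Each 'e' in the format consumes two arguments:
     const char *encoding  - an input, read by value, so no '&' needed
     int *name_len         - an output, so the caller passes an address
   Returns the number of items handled, or -1 on an unknown format code. */
int mini_parse(const char *format, ...)
{
    va_list ap;
    int handled = 0;

    assert(format != NULL);
    va_start(ap, format);
    for (; *format != '\0'; format++) {
        if (*format != 'e') {
            handled = -1;
            break;
        }
        const char *encoding = va_arg(ap, const char *);
        int *name_len = va_arg(ap, int *);
        *name_len = (int)strlen(encoding);  /* write through the output */
        handled++;
    }
    va_end(ap);
    return handled;
}
```

With this shape a caller can write mini_parse("e", "latin-1", &len) and pass the string constant directly, which is the convenience being argued for in the thread.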
participants (5)
- Fred L. Drake, Jr.
- Greg Stein
- Guido van Rossum
- Jack Jansen
- M.-A. Lemburg