_PyUnicode_CheckConsistency() too strict?

_PyUnicode_CheckConsistency() checks that the contents of the string match the _KIND of the string. However, it does this in a very strict manner, i.e. the contents must *exactly* match the _KIND, rather than the check just detecting an inconsistency between the contents and the _KIND. For example, a string created with a maxchar of 255 (i.e. a Latin-1 string) must contain at least one character in the range 128-255, otherwise you get an assertion failure. As it stands, when converting Latin-1 strings in my C extension module I must first check each character and specify a maxchar of 127 if the string happens to contain only ASCII characters. What is the reasoning behind the checks being so strict? Phil

2014-02-03 Phil Thompson <phil@riverbankcomputing.com>:
Yes, that is what PEP 393 specifies.
Use PyUnicode_FromKindAndData(PyUnicode_1BYTE_KIND, latin1_str, length) which computes the kind for you.
What is the reasoning behind the checks being so strict?
Different Python functions rely on the exact kind to compare strings. For example, if you search for a latin1 substring in an ASCII string, the search returns immediately instead of scanning the string: a latin1 string cannot be found in an ASCII string. The main reason is in PEP 393 itself: a string must be compact so as not to waste memory. Victor

On 03-02-2014 3:35 pm, Victor Stinner wrote:
Are you saying that code will fail if a particular Latin-1 string just happens not to contain any character greater than 127? I would be very surprised if that was the case. If it isn't the case then I think that particular check shouldn't be made. Phil

2014-02-03 Phil Thompson <phil@riverbankcomputing.com>:
Are you saying that code will fail if a particular Latin-1 string just happens not to contain any character greater than 127?
PyUnicode_FromKindAndData(PyUnicode_1BYTE_KIND, latin1_str, length) accepts both latin1 and ASCII strings. It computes the maximum code point and then creates either an ASCII or a latin1 unicode string. Victor
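For anyone who wants to see the idea spelled out, here is a minimal sketch of the "compute the maximum code point" step for an 8-bit buffer. The helper name latin1_maxchar is hypothetical and not part of the CPython API. This is only an illustration of the decision between 127 and 255, not the code CPython actually runs.

```c
#include <stddef.h>

/* Hypothetical helper, not a CPython function: scan an 8-bit Latin-1
 * buffer and return the smallest maxchar that describes its contents,
 * 127 if every byte is ASCII, otherwise 255. */
unsigned int latin1_maxchar(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 127) {
            return 255;   /* one non-ASCII byte is enough to decide */
        }
    }
    return 127;           /* pure ASCII, including the empty buffer */
}
```

In an extension module the result could presumably be passed as the maxchar argument of PyUnicode_New() before copying the bytes in, although calling PyUnicode_FromKindAndData() directly avoids having to do this by hand.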

On 03-02-2014 4:04 pm, Victor Stinner wrote:
That doesn't answer my original question, that just works around the use case I presented. To restate... Why is a Latin-1 string considered inconsistent just because it doesn't happen to contain any characters in the range 128-255? Phil

On 3 February 2014 16:10, Phil Thompson <phil@riverbankcomputing.com> wrote:
Butting in here (sorry) but I thought what Victor was trying to say is that being able to say that a string marked as Latin1 "kind" definitely has characters >127 allows the code to optimise some tests (for example, two strings cannot be equal if their kinds differ). Obviously, requiring this kind of constraint makes it somewhat harder for user code to construct string objects that conform to the spec. That's why the PyUnicode_FromKindAndData function has the convenience feature of doing the check and setting the kind correctly for you - you should use that rather than trying to get the details right yourself. Paul.

On 03-02-2014 4:38 pm, Paul Moore wrote:
So there *is* code that will fail if a particular Latin-1 string just happens not to contain any character greater than 127?
I see now... The docs for PyUnicode_FromKindAndData() say... "Create a new Unicode object *with* the given kind" ...and so I didn't think it was useful to me. If they said... "Create a new Unicode object *from* the given kind" ...then I might have got it. Thanks - I'm happy now. Phil

On 02/03/2014 09:16 AM, Phil Thompson wrote:
So there *is* code that will fail if a particular Latin-1 string just happens not to contain any character greater than 127?
Yes, because if it does not contain a character > 127 it is not a latin-1 string as far as Python is concerned. -- ~Ethan~

On Mon, 03 Feb 2014 16:10:03 +0000 Phil Thompson <phil@riverbankcomputing.com> wrote:
Why is a Latin-1 string considered inconsistent just because it doesn't happen to contain any characters in the range 128-255?
Because as Victor said, it allows for some optimization shortcuts (e.g. a non-ASCII latin1 string cannot be equal to an ASCII string - no need for a memcmp). Regards Antoine.
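To make the shortcut concrete, here is a toy model in C. The struct toy_str and the function toy_str_equal are invented for illustration and do not reflect CPython's real object layout or comparison code. The sketch only shows why a loosely labelled string would break a comparison that trusts the label.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a 1-byte string, NOT CPython's real layout. */
typedef struct {
    bool is_ascii;              /* invariant: true only if every byte <= 127 */
    size_t length;
    const unsigned char *data;
} toy_str;

/* Equality test that trusts the is_ascii flag. */
bool toy_str_equal(const toy_str *a, const toy_str *b)
{
    if (a->length != b->length) {
        return false;
    }
    if (a->is_ascii != b->is_ascii) {
        /* Shortcut: an ASCII string can never equal a non-ASCII one,
         * so no memcmp is needed. Only valid if the flag is exact. */
        return false;
    }
    return memcmp(a->data, b->data, a->length) == 0;
}
```

If "abc" were stored once with is_ascii set and once without, this function would report the two as different even though the bytes are identical, which is the kind of failure the strict consistency check guards against.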

Can we provide a convenience API (or even a few lines of code one could copy+paste) that determines if a particular 8-bit string should have max-char equal to 127 or 255? I can easily imagine a number of use cases where this would come in handy (e.g. a list of strings produced by translation, or strings returned in Latin-1 by some other non-Python C-level API) -- and let's not get into a debate about whether UTF-8 wouldn't be better, I can also easily imagine legacy APIs where that isn't (yet) an option. On Mon, Feb 3, 2014 at 9:35 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
-- --Guido van Rossum (python.org/~guido)

On 02/03/2014 09:52 AM, Guido van Rossum wrote:
Can we provide a convenience API (or even a few lines of code one could copy+paste) that determines if a particular 8-bit string should have max-char equal to 127 or 255?
Isn't that what this is?
============================================================
PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)
Create a new Unicode object with the given kind (possible values are PyUnicode_1BYTE_KIND etc., as returned by PyUnicode_KIND()). The buffer must point to an array of size units of 1, 2 or 4 bytes per character, as given by the kind.
============================================================
-- ~Ethan~

On 03-02-2014 5:52 pm, Guido van Rossum wrote:
For my particular use case PyUnicode_FromKindAndData() (once I'd interpreted the docs correctly) should have made such code unnecessary. However I've just discovered that it doesn't support surrogates in UCS2 so I'm going to have to roll my own anyway. Phil
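As a rough idea of what "rolling my own" might involve for 16-bit input, the sketch below scans a buffer of UTF-16 code units, pairs up surrogates, and reports the number of code points and the largest one found. The names utf16_info and utf16_scan are hypothetical, and the choice to count a lone surrogate as itself is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what a caller would need to size a string. */
typedef struct {
    size_t length;      /* number of code points after pairing surrogates */
    uint32_t maxchar;   /* largest code point found (0 for an empty buffer) */
} utf16_info;

/* Hypothetical helper: scan n UTF-16 code units in native byte order. */
utf16_info utf16_scan(const uint16_t *units, size_t n)
{
    utf16_info info = { 0, 0 };
    size_t i = 0;

    while (i < n) {
        uint32_t cp = units[i++];

        /* A high surrogate followed by a low surrogate forms one
         * code point above U+FFFF. A lone surrogate is kept as is. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n
                && units[i] >= 0xDC00 && units[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00);
            i++;
        }
        if (cp > info.maxchar) {
            info.maxchar = cp;
        }
        info.length++;
    }
    return info;
}
```

The two fields could presumably be fed to PyUnicode_New(length, maxchar), with a second pass writing each code point into the new object. PyUnicode_DecodeUTF16() may also be worth a look, since it decodes surrogate pairs itself.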

participants (6): Antoine Pitrou, Ethan Furman, Guido van Rossum, Paul Moore, Phil Thompson, Victor Stinner