Flexible string representation, unicode, typography, ...
Terry Reedy
tjreedy at udel.edu
Thu Aug 30 16:44:32 EDT 2012
On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>
>> In article <503f0e45$0$9416$c3e8da3$76491128 at news.astraweb.com>,
>> Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
>>
>>> The only thing which is innovative here is that instead of the Python
>>> compiler declaring that "all strings will be stored in UCS-2", the
>>> compiler chooses an implementation for each string as needed. So some
>>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>>> ASCII (which is a standard, but not the Unicode consortium's standard).
>>
>> Is the implementation smart enough to know that x == y is always False
>> if x and y are using different internal representations?
Yes, after checking lengths, and in the same circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c
PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;

    if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
        PyObject *v;
        if (PyUnicode_READY(left) == -1 ||
            PyUnicode_READY(right) == -1)
            return NULL;
        if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
            PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
            if (op == Py_EQ) {
                Py_INCREF(Py_False);
                return Py_False;
            }
            if (op == Py_NE) {
                Py_INCREF(Py_True);
                return Py_True;
            }
        }
        ...
KIND is the character width: 1, 2, or 4 bytes per char.
'a in s' is also False if the chars of a are wider than the chars of s.
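
To make the shortcut concrete, here is a rough pure-Python model of it. This is
only a sketch: kind(), quick_unequal() and quick_not_in() are made-up names for
illustration, not anything CPython exposes, and kind() assumes strings are
stored in the narrowest width that fits their highest code point.

```python
def kind(s):
    """Bytes per char that the narrowest storage for s would need (1, 2 or 4).

    Hypothetical helper: models PyUnicode_KIND for a canonical string.
    """
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def quick_unequal(left, right):
    """True when length or kind alone proves the two strings differ."""
    return len(left) != len(right) or kind(left) != kind(right)


def quick_not_in(a, s):
    """True when a cannot be a substring of s because a needs wider chars."""
    return kind(a) > kind(s)


print(kind('abc'), kind('\u20ac'), kind('\U0001F600'))
print(quick_unequal('abc', 'ab\u20ac'))
print(quick_not_in('\u20ac', 'abc'))
```

When quick_unequal() returns False the strings may still differ, so a full
character-by-character comparison is needed in that case.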
If s is all ASCII, s.encode('ascii') or s.encode('utf-8') is a fast,
constant-time operation, as I showed earlier in this discussion. This is
one thing that is much faster in 3.3.
Such things can be tested by timing with different lengths of strings,
where the initial string creation is done in setup code rather than in
the repeated operation code.
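
One way to set up such a test with the timeit module is sketched below. The
function name time_encode, the string lengths and the repeat count are
arbitrary choices for illustration; actual numbers will depend on the machine
and the Python version, so none are shown.

```python
import timeit


def time_encode(n, number=1000):
    """Time encoding an all-ASCII string of length n.

    The string is built in the setup code, so only encode() is timed.
    Returns (seconds for 'ascii', seconds for 'utf-8').
    """
    setup = "s = 'a' * %d" % n
    t_ascii = timeit.timeit("s.encode('ascii')", setup=setup, number=number)
    t_utf8 = timeit.timeit("s.encode('utf-8')", setup=setup, number=number)
    return t_ascii, t_utf8


# Compare how the timings change as the string gets longer.
for n in (1000, 10000, 100000):
    print(n, time_encode(n))
```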
> But x == y is not necessarily always False just because they have
> different representations. There may be circumstances where two strings
> have different internal representations even though their content is the
> same, so it's an unsafe optimization to automatically treat them as
> unequal.
I am sure that str objects are always in canonical form once visible to
Python code. Note that unready (non-canonical) objects are rejected by
the rich comparison function.
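
A small check along these lines, assuming CPython 3.3 or later: slicing the
wide character off a string should give back a string stored as narrowly as
one built from a plain literal. sys.getsizeof is used here only as an indirect
hint about the internal width; it is implementation-specific.

```python
import sys

wide = 'abc\u20ac'   # the euro sign forces 2 bytes per char
narrow = wide[:3]    # only ASCII chars are left after slicing
plain = 'abc'

# Same content compares equal ...
same_content = narrow == plain
# ... and on CPython 3.3+ the slice takes no more memory than the literal,
# which suggests it was stored in the narrow form again.
same_size = sys.getsizeof(narrow) == sys.getsizeof(plain)

print(same_content, same_size)
```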
> My expectation is that the initial implementation of PEP 393 will be
> relatively unoptimized,
The initial implementation was a year ago. At least three people have
expended considerable effort improving it since, so that the slowdown
mentioned in the PEP has mostly disappeared. The things that are still
slower are somewhat balanced by things that are faster.
--
Terry Jan Reedy