[Python-checkins] r46002 - in python/branches/release24-maint: Misc/ACKS Misc/NEWS Objects/unicodeobject.c

Tue May 16 14:13:25 CEST 2006

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> Could you please make this fix apply only on Solaris,
>> e.g. using an #ifdef ?!
> 
> That shouldn't be done. The code, as it was before, had
> undefined behaviour in C. With the fix, it is now correct.

I don't understand - what's undefined in:

const char *s;
Py_UNICODE *p;
...
*p = *(Py_UNICODE *)s;

> If you want to drop usage of memcpy on systems where you
> think it isn't needed, you should make a positive list of
> such systems, e.g. through an autoconf test (although such
> a test is difficult to formulate).

I don't want to drop memcpy() - just keep the existing
working code on platforms where the memcpy() is not
needed.
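
To make the request concrete, here is a minimal sketch of the kind of guard meant above. HYPOTHETICAL_STRICT_ALIGNMENT, COPY_CODE_UNIT and copy_units are made-up names for illustration, not existing configure or Python macros, and the typedef is only a stand-in for the real Py_UNICODE of a UCS2 build:

```c
#include <string.h>

/* Stand-in for Py_UNICODE in a UCS2 build (assumption for this sketch). */
typedef unsigned short Py_UNICODE;

/* HYPOTHETICAL_STRICT_ALIGNMENT is an invented switch; a real build would
   need a configure test or a platform list to decide when to set it. */
#ifdef HYPOTHETICAL_STRICT_ALIGNMENT
#define COPY_CODE_UNIT(p, s)  memcpy((p), (s), sizeof(Py_UNICODE))
#else
#define COPY_CODE_UNIT(p, s)  ((void)(*(p) = *(const Py_UNICODE *)(s)))
#endif

/* Copy n code units from a byte buffer into p.  With the direct-copy
   branch, s is assumed to be suitably aligned for Py_UNICODE. */
static void copy_units(Py_UNICODE *p, const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        COPY_CODE_UNIT(p + i, s + i * sizeof(Py_UNICODE));
    }
}
```

Both branches give the same result for an aligned source buffer; the open question in this thread is only which branch a given platform should get.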

>> The memcpy is a lot more expensive than a simple memory
>> copy via registers and this operation is done per code point
>> in the Unicode string, so any change to the inner loop makes
>> a difference.
> 
> This is a bit too pessimistic. On Linux/x86, with gcc 4.0.4,
> this memcpy call is compiled into
> 
>         movl    8(%ebp), %eax          ; eax = s
>         movzwl  (%eax), %edx           ; (e)dx = *s
>         movl    -32(%ebp), %eax        ; eax = p
>         movw    %dx, (%eax)            ; *p = dx (= *s)
> 
> So it *is* a simple memory copy via registers. Any modern C
> compiler should be able to achieve this optimization: it can
> know what memcpy does, it can compute the number of bytes to
> be moved at compile time, see that this is two bytes only,
> and avoid calling a function, or generating a copy loop.

Last time I checked this (some years ago), the above
direct copy was always faster. Some compilers didn't even
inline the memcpy() as you would expect.

This is what gcc 3.3.4 (standard on SuSE 9.2 x64) generates for
the direct copy:

Without -O3:

        .loc 1 2316 0
        movq    -72(%rbp), %rdx
        movq    -8(%rbp), %rax
        movzwl  (%rax), %eax
        movw    %ax, (%rdx)

With -O3:

        .loc 1 2316 0
        movzwl  (%rax), %edx
.LVL2623:
        movw    %dx, (%rcx)

> (if you want to see what your compiler generates, put two
> function calls, say, foo() and bar(), around this statement,
> and find these function calls in the assembler output).

(or search for the embedded .loc directives which point at
the source code line in the original C file)

> If you worry about compilers which cannot do this optimization,
> you should use individual char assignments, e.g. through
> 
>         ((char*)p)[0] = s[0];
>         ((char*)p)[1] = s[1];
> 
> (and similarly for Py_UNICODE_WIDE).

What's wrong with the direct copy ?

A modern compiler should know the alignment requirements
of Py_UNICODE* on the platform and generate appropriate
code.

AFAICT, only 64-bit platforms are subject to any
such problems due to their requirement to have pointers
aligned on 8-byte boundaries.
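
As an illustration of the alignment concern only, a sketch that takes the direct path when s happens to be aligned for Py_UNICODE and falls back to memcpy() otherwise. is_aligned_for_unit and load_unit are hypothetical helpers, the typedef is a stand-in, and uintptr_t assumes a C99 <stdint.h>. This does not settle whether the cast itself is permitted by the C standard:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for Py_UNICODE in a UCS2 build (assumption for this sketch). */
typedef unsigned short Py_UNICODE;

/* Hypothetical helper: non-zero if ptr is a multiple of sizeof(Py_UNICODE). */
static int is_aligned_for_unit(const void *ptr)
{
    return ((uintptr_t)ptr % sizeof(Py_UNICODE)) == 0;
}

/* Hypothetical helper: read one code unit from a byte pointer. */
static Py_UNICODE load_unit(const char *s)
{
    Py_UNICODE u;
    if (is_aligned_for_unit(s)) {
        u = *(const Py_UNICODE *)s;      /* direct copy via a register */
    } else {
        memcpy(&u, s, sizeof(u));        /* safe for any address */
    }
    return u;
}
```

A runtime check like this costs a test and branch per code point, so it is meant to show the problem rather than to go into the inner loop.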

> While this also avoids
> the function call, it does generate worse code for gcc 4.0.4:
> 
>         movl    8(%ebp), %eax
>         movzbl  (%eax), %edx
>         movl    -32(%ebp), %eax
>         movb    %dl, (%eax)
> 
>         movl    8(%ebp), %eax
>         movzbl  1(%eax), %edx
>         movl    -32(%ebp), %eax
>         movb    %dl, 1(%eax)
> 
> (other compilers might be able to compile this into a single
>  two-byte move, of course).
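
For anyone who wants to compare the generated code on their own compiler, the three variants discussed in this thread can be put side by side roughly as follows. The function names are invented for this sketch, the typedef is a stand-in for a 2-byte Py_UNICODE, and the byte-wise version assumes exactly two bytes per code unit:

```c
#include <string.h>

/* Stand-in for Py_UNICODE in a UCS2 build (assumption for this sketch). */
typedef unsigned short Py_UNICODE;

/* Variant 1: the original direct copy; s must be aligned for Py_UNICODE. */
static void copy_direct(Py_UNICODE *p, const char *s)
{
    *p = *(const Py_UNICODE *)s;
}

/* Variant 2: the memcpy() from the checkin; works for any address. */
static void copy_memcpy(Py_UNICODE *p, const char *s)
{
    memcpy(p, s, sizeof(Py_UNICODE));
}

/* Variant 3: individual char assignments; assumes sizeof(Py_UNICODE) == 2. */
static void copy_bytewise(Py_UNICODE *p, const char *s)
{
    ((char *)p)[0] = s[0];
    ((char *)p)[1] = s[1];
}
```

Compiling this with -S at different optimization levels and looking at the three function bodies should show whether a given compiler reduces all of them to the same two-byte move.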

-- 
Marc-Andre Lemburg
eGenix.com
