[Tutor] how can I use unicode in ctypes?
Albert-Jan Roskam
fomcl at yahoo.com
Mon Dec 10 11:15:16 CET 2012
----- Original Message -----
> From: eryksun <eryksun at gmail.com>
> To: Albert-Jan Roskam <fomcl at yahoo.com>
> Cc: Python Mailing List <tutor at python.org>
> Sent: Friday, December 7, 2012 7:39 PM
> Subject: Re: [Tutor] how can I use unicode in ctypes?
>
> On Thu, Dec 6, 2012 at 2:39 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>>
>> http://pastecode.org/index.php/view/29608996
>>
>> import ctypes
>> s = u'\u0627\u0644\u0633\u0644\u0627\u0645'
>> v = ctypes.c_wchar_p(s)
>> print v # prints c_wchar_p(u'\u0627\u0644\u0633\u0644\u0627\u0645')
>> v.value # prints u'\u0627\u0644\u0633\u0644\u0627\u0645'
>
> Your decorator could end up encoding str or decoding unicode.
> Typically this gets routed through the default encoding (i.e. ASCII)
> and probably triggers a UnicodeDecodeError or UnicodeEncodeError. I'd
> limit encoding to unicode and decoding to bytes/str.
Thanks. I modified it. Kinda annoying that the default encoding is ASCII, but
I read that it could be changed with sys.setdefaultencoding and reload(sys).
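A minimal sketch of the rule suggested above: encode only text and decode only
bytes, so nothing gets routed through the default ASCII codec. The helper names
to_bytes and to_text and the UTF-8 default are assumptions for illustration,
not the actual decorator from the paste.

```python
# Hypothetical helpers; names and the UTF-8 default are assumptions.
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_bytes(value, encoding='utf-8'):
    """Encode text to bytes; pass anything else through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


def to_text(value, encoding='utf-8'):
    """Decode bytes to text; pass anything else through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Because each helper touches only one type, an already-encoded byte string is
never encoded again, and text is never decoded a second time.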
> On the subject of wchar_t, here's a funny rant:
>
> http://losingfight.com/blog/2006/07/28/wchar_t-unsafe-at-any-size/
>
Interesting article. I experimented with it a little in ctypes and also concluded that I couldn't use it
for my purposes. Apparently this is still not an entirely mature area of Python. Very useful to know.
Thank you very much for this detailed information!
> The base type for Unicode in CPython isn't wchar_t on all
> platforms/builds. It depends on Py_UNICODE_SIZE (2 or 4 bytes) vs
> sizeof(wchar_t) (also on whether wchar_t is unsigned, but that's not
> relevant here). 3.3 is in its own flexible universe.
>
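A quick way to check both sizes on a given interpreter, as a sketch:
ctypes.sizeof(ctypes.c_wchar) reports sizeof(wchar_t), and sys.maxunicode
distinguishes narrow from wide builds before 3.3 (from 3.3 on it is always
0x10FFFF). The variable names are just for illustration.

```python
import ctypes
import sys

# Size in bytes of the platform's wchar_t as seen by ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# False only on a narrow (Py_UNICODE_SIZE == 2) build before Python 3.3.
wide_build = sys.maxunicode > 0xFFFF

print("sizeof(wchar_t) = %d" % wchar_size)
print("wide build or 3.3+: %s" % wide_build)
```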
> I recently came across a bug in create_unicode_buffer on Windows
> Python 3.3. The new flexible string implementation uses Py_UCS4
> instead of creating surrogate pairs on Windows. However, given that
> the size of c_wchar is 2 [bytes] on Windows, create_unicode_buffer
> still needs to factor in the surrogate pairs by calculating the target
> size = len(init) + sum(ord(c) > 0xffff for c in init) + 1. Naively it
> uses size = len(init) + 1, which fails if the string has multiple
> non-BMP characters.
>
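The size calculation quoted above can be sketched as follows. The function
name wchar16_buffer_size is hypothetical; the formula is the one given in the
quoted message and only applies where c_wchar is 2 bytes.

```python
def wchar16_buffer_size(init):
    """Number of 2-byte wchar_t units needed for init plus a terminator.

    Each non-BMP character (code point above 0xFFFF) needs a surrogate
    pair, so it counts as two units instead of one.
    """
    return len(init) + sum(ord(c) > 0xFFFF for c in init) + 1


# Two non-BMP characters: 4 UTF-16 code units plus the terminating null.
sample = u'\U0001F600\U0001F601'
print(wchar16_buffer_size(sample))
```

The result matches the UTF-16 encoding: len(sample.encode('utf-16-le')) // 2
code units, plus one for the terminator.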
> Here's another ctypes related issue. On a narrow build prior to 3.2,
> PyUnicode_AsWideChar returns a wide-character string that may contain
> surrogate pairs even if wchar_t is 32-bit. That isn't well-formed
> UTF-32. This was fixed in 3.2 as part of fixing a ctypes bug. ctypes
> u_set (type 'u' is c_wchar) was modified to use an updated
> PyUnicode_AsWideChar and Z_set (type 'Z' is c_wchar_p) was modified to
> use the new PyUnicode_AsWideCharString.
>
> 3.2.3 links:
>
> u_set:
> http://hg.python.org/cpython/file/3d0686d90f55/Modules/_ctypes/cfield.c#l1202
>
> Z_set:
> http://hg.python.org/cpython/file/3d0686d90f55/Modules/_ctypes/cfield.c#l1401
>
> The new PyUnicode_AsWideChar and PyUnicode_AsWideCharString call the
> helper function unicode_aswidechar. This was added in 3.2 to handle
> the different cases of Py_UNICODE_SIZE more carefully:
>
> http://hg.python.org/cpython/file/3d0686d90f55/Objects/unicodeobject.c#l1187
>
> Py_UNICODE_SIZE == SIZEOF_WCHAR_T
> Py_UNICODE_SIZE == 2 && SIZEOF_WCHAR_T == 4
> Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2
>
> The 2nd case takes advantage of the larger wchar_t to recombine
> surrogate pairs. The 3rd case creates surrogate pairs instead of
> truncating the character code. (Note: this helper was updated in 3.3
> to use the new function PyUnicode_AsUnicodeAndSize.)
>
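For reference, the surrogate arithmetic behind the 2nd and 3rd cases can be
sketched in Python. This only illustrates the standard UTF-16 math; it is not
the C code in unicode_aswidechar, and the function names are hypothetical.

```python
def split_to_surrogates(code_point):
    """3rd case: turn a non-BMP code point into a (high, low) surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def join_surrogates(high, low):
    """2nd case: recombine a surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = split_to_surrogates(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00
print(hex(join_surrogates(high, low)))   # 0x1f600
```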
> Prior to 3.2, PyUnicode_AsWideChar wasn't nearly as careful. See the
> version in 3.1.5:
>
> http://hg.python.org/cpython/file/7395330e495e/Objects/unicodeobject.c#l1085
>