[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

hubo hubo at jiedaibao.com
Mon Mar 7 04:45:51 EST 2016


Yes, it seems CPython 2.7 on Windows uses UTF-16, so:
>>> '\ud805\udc09'
'\\ud805\\udc09'
>>> u'\ud805\udc09'
u'\U00011409'
>>> u'\ud805\udc09' == u'\U00011409'
True
>>> len(u'\U00011409')
2

In Linux CPython 2.7:
>>> u'\U00011409'
u'\U00011409'
>>> len(u'\U00011409')
1
>>> u'\ud805\udc09'
u'\ud805\udc09'
>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09' == u'\U00011409'
False
>>> u'\ud805\udc09'.encode('utf-8')
'\xf0\x91\x90\x89'
>>> u'\U00011409'.encode('utf-8')
'\xf0\x91\x90\x89'
>>> u'\ud805\udc09'.encode('utf-8') == u'\U00011409'.encode('utf-8')
True
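
(For reference, the two builds can be told apart with sys.maxunicode;
the value shown below is what a wide build reports:)

>>> import sys
>>> sys.maxunicode   # 65535 on a narrow build, 1114111 on a wide build
1114111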

2016-03-07 

hubo 



From: Maciej Fijalkowski <fijall at gmail.com>
Sent: 2016-03-07 17:31
Subject: Re: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "hubo"<hubo at jiedaibao.com>
Cc: "Armin Rigo"<arigo at tunes.org>,"Piotr Jurkiewicz"<piotr.jerzy.jurkiewicz at gmail.com>,"PyPy Developer Mailing List"<pypy-dev at python.org>

I think you're misunderstanding what we're proposing. 

We're proposing a utf8 representation that is completely hidden from the user, 
where everything behaves just like cpython unicode (the len() example 
you're showing is from a narrow unicode build, I presume?) 
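
(To make "completely hidden" concrete, here is a minimal sketch of the
idea - the class name Utf8Str and its layout are made up for
illustration, this is not PyPy code:)

    class Utf8Str(object):
        def __init__(self, data):
            # data is a UTF-8 encoded byte string
            self._data = data
            # count codepoints once: every byte outside the
            # continuation range 0x80-0xBF starts a codepoint
            self._len = sum(1 for b in data
                            if not 0x80 <= ord(b) <= 0xBF)

        def __len__(self):
            # codepoints, exactly like len() on a wide CPython build
            return self._len

    >>> len(Utf8Str(u'\U00011409'.encode('utf-8')))
    1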

On Mon, Mar 7, 2016 at 11:21 AM, hubo <hubo at jiedaibao.com> wrote: 
> Yes, there are two-word characters in UTF-16, as I mentioned. But len() in 
> CPython returns 2 for these characters (even though they are correctly 
> processed in repr()): 
> 
>>>> len(u'\ud805\udc09') 
> 2 
>>>> u'\ud805\udc09' 
> u'\U00011409' 
> 
> (Python 3.x seems to have removed this display processing) 
> 
> Maybe it is better to be compatible with CPython in these situations. Since 
> two-word characters are really rare in Unicode strings, programmers may not 
> know of their existence and may allocate exactly 2 * len(s) bytes to store a 
> unicode string. It will crash the program or create security problems if 
> len() returns 1 for these characters, even though that is the correct result 
> according to the Unicode standard. 
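> 
> (To illustrate the mismatch on a wide build - the naive 2 * len(s) 
> estimate undercounts the bytes a two-word character needs in UTF-16:) 
> 
> >>> s = u'\U00011409' 
> >>> 2 * len(s)                    # naive buffer estimate 
> 2 
> >>> len(s.encode('utf-16-le'))    # actual bytes needed 
> 4 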
> 
> UTF-8 might be very useful in XML or Web processing, which is quite 
> important in Python programming nowadays. But I think it is more important 
> to let programmers "understand" the mechanism. In C/C++, it is quite common 
> to use char[] for ASCII (or ANSI) characters and wchar_t for unicode 
> (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode 
> is actually "UTF-8" in PyPy. Web programmers who use CPython may already be 
> familiar with the differences between bytes (or str in Python 2) and unicode 
> (or str in Python 3), so it is less likely that they would design their 
> programs around PyPy-specific implementation details. 
> 
> 2016-03-07 
> ________________________________ 
> hubo 
> ________________________________ 
> 
> From: Maciej Fijalkowski <fijall at gmail.com> 
> Sent: 2016-03-07 16:46 
> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage 
> To: "hubo"<hubo at jiedaibao.com> 
> Cc: "Armin Rigo"<arigo at tunes.org>,"Piotr 
> Jurkiewicz"<piotr.jerzy.jurkiewicz at gmail.com>,"PyPy Developer Mailing 
> List"<pypy-dev at python.org> 
> 
> Hi hubo. 
> 
> I think you're slightly confusing two things. 
> 
> UTF-16 is a variable-length encoding with two-word characters, and 
> len() *has to* return "1" for those. UCS-2 seems closer to what you 
> described (it is a fixed-width encoding), but it can't encode all the 
> unicode characters and as such is unsuitable for a modern unicode 
> representation. 
> 
> I'll discard UCS-2 as unsuitable; were we to use UTF-16, the slicing 
> and size calculations would still have to be as complicated as for 
> UTF-8. 
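> 
> (A minimal sketch of why - counting codepoints in UTF-16 already means 
> skipping trailing surrogates; utf16_len and its input format are made 
> up here for illustration:) 
> 
>     def utf16_len(units): 
>         # units: a sequence of 16-bit code units; a codepoint starts 
>         # at every unit that is not a trailing surrogate (DC00-DFFF) 
>         return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF) 
> 
>     >>> utf16_len([0xD805, 0xDC09])   # one astral codepoint 
>     1 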
> 
> As for the complicated logic in repr() - that is not usually a 
> performance-critical part of your program, and it's ok to have some 
> complications there. 
> 
> It's true that UTF-16 can be less efficient than UTF-8 for certain 
> languages; however, both are more memory efficient than what we 
> currently use (UCS-4). There are some caveats, though - even if you 
> work exclusively in, say, Korean, a web server still has to deal with 
> parts that are ASCII (HTML markup, CSS, etc.) while handling the text 
> in Korean. In those cases UTF-8 vs. UTF-16 is more muddled and the 
> exact balance depends a lot on the workload. We also need to consider 
> the fact that we ship one canonical PyPy to everybody - people using 
> different languages and different encodings. 
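> 
> (A concrete illustration, encoded lengths in bytes - Korean syllables 
> take 3 bytes each in UTF-8 vs. 2 in UTF-16, while ASCII markup takes 
> 1 byte vs. 2:) 
> 
> >>> kr = u'\ud55c\uad6d\uc5b4'    # "Korean language" in Korean 
> >>> len(kr.encode('utf-8')), len(kr.encode('utf-16-le')) 
> (9, 6) 
> >>> len(u'<p>'.encode('utf-8')), len(u'<p>'.encode('utf-16-le')) 
> (3, 6) 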
> 
> Overall, UTF-8 definitely seems like a better alternative to UCS-4 
> (also for Asian languages), which is what we are using now, and I would 
> be inclined to keep UTF-16 as an option to see whether it performs 
> better on certain benchmarks. 
> 
> Best regards, 
> Maciej Fijalkowski 
> 
> On Mon, Mar 7, 2016 at 9:58 AM, hubo <hubo at jiedaibao.com> wrote: 
>> I think it is not reasonable to use UTF-8 to represent the unicode string 
>> type. 
>> 
>> 
>> 1. Less storage - this is not always true. It is only true for strings 
>> with a lot of ASCII characters. In Asia, most strings in the local 
>> languages (Japanese, Chinese, Korean) consist of non-ASCII characters, 
>> so they may consume more storage than in UTF-16. To make things worse, 
>> while an N-character string always consumes 2 * N bytes in UTF-16, it 
>> is difficult to estimate the size of an N-character string in UTF-8 
>> (it may be anywhere from N to 3 * N bytes). (UTF-16 also has two-word 
>> characters, but len() reports 2 for these characters; I think it is 
>> not harmful to treat them as two characters.) 
>> 
>> 2. The logic for size calculation and slicing would be very complicated. 
>> In UTF-16, every character is represented with a 16-bit integer, so size 
>> calculation and slicing are convenient. But a character in UTF-8 occupies 
>> a variable number of bytes, so we either call mb_* string functions 
>> instead (which are slow by nature) or use special logic, like storing the 
>> byte indices of characters in another array (which introduces the cost of 
>> extra addressing). 
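>> 
>> (A minimal sketch of such an index array, made up for illustration - 
>> the byte offset of every codepoint start in a UTF-8 string, which 
>> makes indexing O(1) at the cost of one extra word per character:) 
>> 
>>     def build_index(data): 
>>         # data: a UTF-8 encoded byte string; every byte outside 
>>         # the continuation range 0x80-0xBF starts a codepoint 
>>         return [i for i, b in enumerate(data) 
>>                 if not 0x80 <= ord(b) <= 0xBF] 
>> 
>>     >>> build_index('\xf0\x91\x90\x89ab') 
>>     [0, 4, 5] 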
>> 
>> 3. When displayed with repr(), non-ASCII characters are shown in the 
>> \uXXXX format. If the internal storage for unicode is UTF-8, the only 
>> way to be compatible with this format is to convert it back to UTF-16. 
>> 
>> It may be wiser to let programmers decide which encoding they would 
>> like to use. If they want to process UTF-8 strings without paying a 
>> conversion cost, they should use "bytes". When correct size calculation 
>> and slicing of non-ASCII characters are the concern, it may be better 
>> to use "unicode". 
>> 
>> 2016-03-07 
>> ________________________________ 
>> hubo 
>> ________________________________ 
>> 
>> From: Armin Rigo <arigo at tunes.org> 
>> Sent: 2016-03-05 16:09 
>> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage 
>> To: "Piotr Jurkiewicz"<piotr.jerzy.jurkiewicz at gmail.com> 
>> Cc: "PyPy Developer Mailing List"<pypy-dev at python.org> 
>> 
>> Hi Piotr, 
>> 
>> Thanks for giving some serious thoughts to the utf8-stored unicode 
>> string proposal! 
>> 
>> On 5 March 2016 at 01:48, Piotr Jurkiewicz 
>> <piotr.jerzy.jurkiewicz at gmail.com> wrote: 
>>>     Random access would be as follows: 
>>> 
>>>         page_num, byte_in_page = divmod(codepoint_pos, 64) 
>>>         page_start_byte = index[page_num] 
>>>         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) 
>>>         return buffer[exact_byte] 
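>> 
>> (For concreteness - a direct scalar Python reading of seek_forward, 
>> with the signature adapted to take the whole buffer and a starting 
>> byte offset; an illustration, not part of the proposal:) 
>> 
>>     def seek_forward(buf, start_byte, codepoints): 
>>         # advance over `codepoints` codepoints from a lead byte, 
>>         # skipping continuation bytes (0x80-0xBF) along the way 
>>         pos = start_byte 
>>         for _ in range(codepoints): 
>>             pos += 1 
>>             while pos < len(buf) and 0x80 <= ord(buf[pos]) <= 0xBF: 
>>                 pos += 1 
>>         return pos 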
>> 
>> This is the part I'm least sure about: seek_forward() needs to be a 
>> loop over 0 to 63 codepoints.  True, each loop can be branchless, and 
>> very short---let's say 4 instructions.  But it still makes a total of 
>> up to 252 instructions (plus the checks to know if we must go on). 
>> These instructions are all or almost all dependent on the previous 
>> one: you must have finished computing the length of one sequence to 
>> even begin computing the length of the next one.  Maybe it's faster to 
>> use a more "XMM-izable" algorithm which counts 0 for each byte in 
>> 0x80-0xBF and 1 otherwise, and makes the sum. 
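>> 
>> (That counting algorithm as a scalar Python sketch - the real version 
>> would be vectorized; this only shows the idea:) 
>> 
>>     def count_codepoints(chunk): 
>>         # chunk: UTF-8 bytes; count 0 for each continuation byte 
>>         # (0x80-0xBF) and 1 otherwise, then sum 
>>         return sum(0 if 0x80 <= ord(b) <= 0xBF else 1 for b in chunk) 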
>> 
>> There are also variants, e.g. adding a second array of words similar 
>> to 'index', but where each word is 8 packed bytes giving 8 starting 
>> points inside the page (each in range 0-252).  This would reduce the 
>> walk to 0-7 codepoints. 
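>> 
>> (A sketch of that second array, again made up for illustration - per 
>> page, the byte offsets of codepoints 0, 8, ..., 56 relative to the 
>> page start, using seek_forward from above; a real implementation 
>> would pack the 8 bytes into one word:) 
>> 
>>     def build_sub_index(buf, index): 
>>         words = [] 
>>         for page_start in index: 
>>             offs = [seek_forward(buf, page_start, k * 8) - page_start 
>>                     for k in range(8)]   # each fits in 0..252 
>>             words.append(offs) 
>>         return words 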
>> 
>> I'm +1 on your proposal. The whole thing is definitely worth a try. 
>> 
>> 
>> A bientôt, 
>> 
>> Armin. 
>> _______________________________________________ 
>> pypy-dev mailing list 
>> pypy-dev at python.org 
>> https://mail.python.org/mailman/listinfo/pypy-dev 
>> 