[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

hubo hubo at jiedaibao.com
Mon Mar 7 04:21:17 EST 2016


Yes, there are two-word characters in UTF-16, as I mentioned. But len() in CPython returns 2 for these characters (even though they are processed correctly by repr()):

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09'
u'\U00011409'

(Python 3.x seems to have removed this display processing)

Maybe it is better to stay compatible with CPython in these situations. Since two-word characters are quite rare in Unicode strings, programmers may not even know they exist and may allocate exactly 2 * len(s) bytes to store a unicode string. It could crash the program or create security problems if len() returned 1 for these characters, even though that is the correct result according to the Unicode standard. 
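
For comparison, on Python 3 the supplementary-plane character is a
single code point, while two explicit surrogates remain two separate,
unjoined code points (a quick illustrative check):

>>> len('\U00011409')
1
>>> '\ud805\udc09' == '\U00011409'
False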

UTF-8 might be very useful in XML or Web processing, which is quite important in Python programming nowadays. But I think it is more important to let programmers "understand" the mechanism. In C/C++, it is quite common to use char[] for ASCII (or ANSI) characters and wchar_t for unicode (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode is actually "UTF-8" in PyPy. Web programmers who use CPython may already be familiar with the differences between bytes (str in Python 2) and unicode (str in Python 3), so they are less likely to design their programs around implementation details specific to PyPy.

2016-03-07 

hubo 



From: Maciej Fijalkowski <fijall at gmail.com>
Sent: 2016-03-07 16:46
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "hubo" <hubo at jiedaibao.com>
Cc: "Armin Rigo" <arigo at tunes.org>, "Piotr Jurkiewicz" <piotr.jerzy.jurkiewicz at gmail.com>, "PyPy Developer Mailing List" <pypy-dev at python.org>

Hi hubo. 

I think you're slightly confusing two things. 

UTF-16 is a variable-length encoding: it has two-word characters, and 
len() *has to* return 1 for them. UCS-2 seems closer to what you 
described (a fixed-width encoding), but it can't encode all the 
Unicode characters and as such is unsuitable for a modern Unicode 
representation. 

So I'll discard UCS-2 as unsuitable; and were we to use UTF-16, the 
slicing and size calculations would still have to be as complicated 
as for UTF-8. 
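
For instance, encoding the character from the earlier example (an
illustrative check):

>>> len('\U00011409'.encode('utf-16-le'))   # a surrogate pair: two 16-bit units
4
>>> len('\U00011409'.encode('utf-8'))
4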

As for complicated logic in repr(): that is not usually a 
performance-critical part of your program, so it's ok to have some 
complication there. 

It's true that UTF-16 can be less efficient than UTF-8 for certain 
languages; however, both are more memory-efficient than what we 
currently use (UCS4). There are some complications, though - even if 
you work exclusively in, say, Korean, web servers still have to deal 
with parts that are ASCII (HTML markup, CSS, etc.) while handling the 
Korean text. In those cases UTF-8 vs UTF-16 is more muddled and the 
exact trade-off depends a lot on the workload. We also need to 
consider the fact that we ship one canonical PyPy to everybody - 
people using different languages and different encodings. 
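
As a rough, illustrative comparison for that mixed case (byte sizes of
a short Korean snippet wrapped in ASCII markup):

>>> s = '<p>\uc548\ub155\ud558\uc138\uc694</p>'   # "<p>안녕하세요</p>"
>>> len(s.encode('utf-8')), len(s.encode('utf-16-le')), len(s.encode('utf-32-le'))
(22, 24, 48)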

Overall, UTF-8 definitely seems like a better alternative than UCS4 
(also for Asian languages), which is what we are using now, and I 
would be inclined to leave UTF-16 as an option, to see if it performs 
better on certain benchmarks. 

Best regards, 
Maciej Fijalkowski 

On Mon, Mar 7, 2016 at 9:58 AM, hubo <hubo at jiedaibao.com> wrote: 
> I think it is not reasonable to use UTF-8 to represent the unicode string 
> type. 
> 
> 
> 1. Less storage - this is not always true. It is only true for strings 
> with a lot of ASCII characters. In Asia, most strings in the local 
> languages (Japanese, Chinese, Korean) consist of non-ASCII characters, so 
> they may consume more storage than in UTF-16. To make things worse, while 
> an N-character string always consumes 2 * N bytes in UTF-16, it is 
> difficult to estimate the size of an N-character string in UTF-8 (it may 
> be anywhere from N to 3 * N bytes). (UTF-16 also has two-word characters, 
> but len() reports 2 for them; I think it is not harmful to treat them as 
> two characters.) 
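> 
> For example, for a purely-CJK string (an illustrative check): 
> 
> >>> s = u'\u65e5\u672c\u8a9e'   # "Japanese" written in kanji 
> >>> len(s.encode('utf-16-le')), len(s.encode('utf-8')) 
> (6, 9) 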
> 
> 2. There would be very complicated logic for size calculation and 
> slicing. In UTF-16, every character is represented by a 16-bit integer, 
> so size calculation and slicing are convenient. But a character in UTF-8 
> occupies a variable number of bytes, so either we call the mb_* string 
> functions instead (which are slow by nature) or we use special logic, 
> such as storing the indices of characters in another array (which 
> introduces the cost of extra addressing). 
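> 
> To sketch that cost (a minimal illustration of the per-character scan, 
> assuming valid UTF-8 input; the helper name is my own): 
> 
> def utf8_byte_offset(buf, n): 
>     """Byte offset of the n-th code point in valid UTF-8 (O(n) scan).""" 
>     i = 0 
>     for _ in range(n): 
>         b = ord(buf[i:i+1])      # works for both str (2.x) and bytes (3.x) 
>         if b < 0x80:    i += 1   # 1-byte sequence (ASCII) 
>         elif b < 0xE0:  i += 2   # 2-byte sequence 
>         elif b < 0xF0:  i += 3   # 3-byte sequence 
>         else:           i += 4   # 4-byte sequence 
>     return i 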
> 
> 3. When displayed with repr(), non-ASCII characters are shown in the 
> \uXXXX format. If the internal storage for unicode is UTF-8, the only 
> way to be compatible with this format is to convert it back to UTF-16. 
> 
> It may be wiser to let programmers decide which encoding they would like 
> to use. If they want to process UTF-8 strings without the performance 
> cost of conversion, they should use "bytes". When correct size 
> calculation and slicing of non-ASCII characters are a concern, it may be 
> better to use "unicode". 
> 
> 2016-03-07 
> ________________________________ 
> hubo 
> ________________________________ 
> 
> From: Armin Rigo <arigo at tunes.org> 
> Sent: 2016-03-05 16:09 
> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage 
> To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiewicz at gmail.com> 
> Cc: "PyPy Developer Mailing List" <pypy-dev at python.org> 
> 
> Hi Piotr, 
> 
> Thanks for giving some serious thought to the utf8-stored unicode 
> string proposal! 
> 
> On 5 March 2016 at 01:48, Piotr Jurkiewicz 
> <piotr.jerzy.jurkiewicz at gmail.com> wrote: 
>>     Random access would be as follows: 
>> 
>>         page_num, codepoint_in_page = divmod(codepoint_pos, 64) 
>>         page_start_byte = index[page_num] 
>>         exact_byte = seek_forward(buffer, page_start_byte, codepoint_in_page) 
>>         return buffer[exact_byte] 
> 
> This is the part I'm least sure about: seek_forward() needs to be a 
> loop over 0 to 63 codepoints.  True, each loop iteration can be 
> branchless, and very short---let's say 4 instructions.  But that still 
> makes a total of up to 252 instructions (plus the checks to know if we 
> must go on).  These instructions are all or almost all dependent on 
> the previous one: you must have finished computing the length of one 
> sequence before you can even begin computing the length of the next 
> one.  Maybe it's faster to use a more "XMM-izable" algorithm which 
> counts 0 for each byte in 0x80-0xBF and 1 otherwise, and sums the results. 
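> 
> (A minimal sketch of that counting idea, in plain Python rather than 
> the vectorized form, assuming valid UTF-8: a byte begins a new code 
> point unless it is a continuation byte 0b10xxxxxx.) 
> 
> def seek_forward(buf, start, n): 
>     """Advance from the lead byte at `start` past `n` code points, 
>     counting non-continuation bytes.""" 
>     i, seen = start, 0 
>     while seen < n: 
>         i += 1 
>         if (ord(buf[i:i+1]) & 0xC0) != 0x80:   # a new lead byte 
>             seen += 1 
>     return i 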
> 
> There are also variants, e.g. adding a second array of words similar 
> to 'index', but where each word is 8 packed bytes giving 8 starting 
> points inside the page (each in range 0-252).  This would reduce the 
> walk to 0-7 codepoints. 
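> 
> (A rough sketch of that variant, under my reading of it and assuming 
> full 64-codepoint pages; it reuses seek_forward() from above.) 
> 
> def build_subindex(buf, index): 
>     """For each page, pack 8 bytes: the byte offsets, relative to the 
>     page start, of code points 0, 8, 16, ..., 56 of that page.""" 
>     sub = [] 
>     for page_start in index: 
>         pos, offsets = page_start, [] 
>         for k in range(8): 
>             offsets.append(pos - page_start)      # fits in one byte (0-252) 
>             if k < 7: 
>                 pos = seek_forward(buf, pos, 8)   # skip 8 code points 
>         sub.append(bytearray(offsets)) 
>     return sub 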
> 
> I'm +1 on your proposal. The whole thing is definitely worth a try. 
> 
> 
> A bientôt, 
> 
> Armin. 
> _______________________________________________ 
> pypy-dev mailing list 
> pypy-dev at python.org 
> https://mail.python.org/mailman/listinfo/pypy-dev 
> 