Internal representation of strings and Micropython
There is a discussion over at MicroPython about the internal representation of Unicode strings. MicroPython is aimed at embedded devices, and so minimizing memory use is important, possibly even more important than performance. (I'm not speaking on their behalf, just commenting as an interested outsider.)

At the moment, their Unicode support is patchy. They are talking about either:

* Having a build-time option to restrict all strings to ASCII-only. (I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)

* Implementing Unicode internally as UTF-8, and giving up O(1) indexing operations.

https://github.com/micropython/micropython/issues/657

Would either of these trade-offs be acceptable while still claiming "Python 3.4 compatibility"?

My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.

--
Steven
I think UTF8 is the best option.
For those that haven't seen this: http://www.utf8everywhere.org/
On 4 June 2014 01:46, Donald Stufft wrote:
I think UTF8 is the best option.
On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <steve@pearwood.info> wrote:
* Having a build-time option to restrict all strings to ASCII-only.
(I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
What I was actually suggesting along those lines was that the str type still be notionally a Unicode string, but that any codepoints >127 would either raise an exception or blow an assertion, and all the code to handle multibyte representations would be compiled out. So there'd still be a difference between strings of text and streams of bytes, but all encoding and decoding to/from ASCII-compatible encodings would just point to the same bytes in RAM.

Risk: Someone would implement that with assertions, then compile with assertions disabled, test only with ASCII, and have lurking bugs.

ChrisA
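[A rough illustration of the proposed behaviour, in Python rather than in MicroPython's C internals (the function name and message here are mine, purely for exposition): strings stay notionally Unicode, but anything outside ASCII fails loudly at construction time.

    def make_ascii_str(codepoints):
        # in an ASCII-only build, construction is the single choke point:
        # reject anything that would need a multibyte representation
        for cp in codepoints:
            if cp > 127:
                raise ValueError("codepoint %d not representable in ASCII-only build" % cp)
        return ''.join(map(chr, codepoints))

With that guard in place, encoding or decoding via any ASCII-compatible codec really can be an identity operation on the underlying bytes.]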
On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <steve@pearwood.info> wrote:
* Having a build-time option to restrict all strings to ASCII-only.
(I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
What I was actually suggesting along those lines was that the str type still be notionally a Unicode string, but that any codepoints >127 would either raise an exception or blow an assertion, and all the code to handle multibyte representations would be compiled out.
That would be a pretty lousy option.

So there'd still be a difference between strings of text and streams of bytes, but all encoding and decoding to/from ASCII-compatible encodings would just point to the same bytes in RAM.
I suppose this is why you propose to reject 128-255?
Risk: Someone would implement that with assertions, then compile with assertions disabled, test only with ASCII, and have lurking bugs.
Never mind disabling assertions -- even with enabled assertions you'd have to expect most Python programs to fail with non-ASCII input.

Then again the UTF-8 option would be pretty devastating too for anything manipulating strings (especially since many Python APIs are defined using indexes, e.g. the re module). Why not support variable-width strings like CPython 3.4?

--
--Guido van Rossum (python.org/~guido)
On Wed, Jun 4, 2014 at 3:23 PM, Guido van Rossum <guido@python.org> wrote:
On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <steve@pearwood.info> wrote:
* Having a build-time option to restrict all strings to ASCII-only.
(I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
What I was actually suggesting along those lines was that the str type still be notionally a Unicode string, but that any codepoints >127 would either raise an exception or blow an assertion, and all the code to handle multibyte representations would be compiled out.
That would be a pretty lousy option.
So there'd still be a difference between strings of text and streams of bytes, but all encoding and decoding to/from ASCII-compatible encodings would just point to the same bytes in RAM.
I suppose this is why you propose to reject 128-255?
Correct. It would allow small devices to guarantee that strings are compact (MicroPython is aimed primarily at an embedded controller), guarantee identity transformations in several common encodings (and maybe this sort of build wouldn't ship with any non-ASCII-compat encodings at all), and never demonstrate behaviour different from CPython's except by explicitly failing.
Risk: Someone would implement that with assertions, then compile with assertions disabled, test only with ASCII, and have lurking bugs.
Never mind disabling assertions -- even with enabled assertions you'd have to expect most Python programs to fail with non-ASCII input.
Right, which is why I don't like the idea. But you don't need non-ASCII characters to blink an LED or turn a servo, and there is significant resistance to the notion that appending a non-ASCII character to a long ASCII-only string requires the whole string to be copied and doubled in size (lots of heap space used).
Then again the UTF-8 option would be pretty devastating too for anything manipulating strings (especially since many Python APIs are defined using indexes, e.g. the re module).
That's what I thought, too, but a quick poll on python-list suggests that indexing isn't nearly as common as I had thought it to be. On a smallish device, you won't have megabytes of string to index, so even O(N) indexing can't get pathological. (This would be an acknowledged limitation of micropython as a Unix Python - "it's designed for small programs, and it's performance-optimized for small programs, so it might get pathologically slow on certain large data manipulations".)
Why not support variable-width strings like CPython 3.4?
That was my first recommendation, and in fact I started writing code to implement parts of PEP 393, with a view to basically doing it the same way in both Pythons. But discussion on the tracker issue showed a certain amount of hostility toward the potential expansion of strings, particularly in the worst-case example of appending a single SMP character onto a long ASCII string.

ChrisA
Hello,

On Wed, 4 Jun 2014 17:03:22 +1000 Chris Angelico <rosuav@gmail.com> wrote:

[]
Why not support variable-width strings like CPython 3.4?
That was my first recommendation, and in fact I started writing code to implement parts of PEP 393, with a view to basically doing it the same way in both Pythons. But discussion on the tracker issue showed a certain amount of hostility toward the potential expansion of strings, particularly in the worst-case example of appending a single SMP character onto a long ASCII string.
An alternative view is that the discussion on the tracker showed Python developers' mind-fixation on implementing things the way CPython does. And I didn't yet get to that argument, but in the end, MicroPython does not try to rewrite CPython or compete with it. So, when a few choices have pros and cons that lead approximately to a tie among them, making the same choice as CPython did is the least productive option. Even having a rule of thumb of choosing the not-a-CPython way would be more productive than having the same rule of thumb for blindly choosing the CPython way. (Of course, it should actually be a technical discussion based on the target requirements, like we hopefully had, with strong arguments against using anything but the de-facto standard transfer encoding for Unicode.)
-- Best regards, Paul mailto:pmiscml@gmail.com
On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
An alternative view is that the discussion on the tracker showed Python developers' mind-fixation on implementing things the way CPython does. And I didn't yet get to that argument, but in the end, MicroPython does not try to rewrite CPython or compete with it. So, when a few choices have pros and cons that lead approximately to a tie among them, making the same choice as CPython did is the least productive option.
I'm not a CPython dev, nor a Python dev, and I don't think any of the big names of CPython or Python has shown up on that tracker as yet. But why is "be different from CPython" such a valuable choice? CPython works. It's had many hours of dev time put into it. Problems have been identified and avoided. Throwing that out means throwing away a freely-given shoulder to stand on, in an Isaac Newton way.

http://www.joelonsoftware.com/articles/fog0000000069.html

ChrisA
Can of worms, opened.
Hello,

On Wed, 4 Jun 2014 21:17:12 +1000 Chris Angelico <rosuav@gmail.com> wrote:
I'm not a CPython dev, nor a Python dev, and I don't think any of the big names of CPython or Python has shown up on that tracker as yet. But why is "be different from CPython" such a valuable choice? CPython works. It's had many hours of dev time put into it.
Exactly, CPython (already) exists, and it works, so people can just use it. MicroPython's aim is to go where CPython didn't, and couldn't, go. For that, it's got to be different, or it literally won't fit there, like CPython doesn't.

[]

-- Best regards, Paul mailto:pmiscml@gmail.com
On 04.06.14 10:03, Chris Angelico wrote:
Right, which is why I don't like the idea. But you don't need non-ASCII characters to blink an LED or turn a servo, and there is significant resistance to the notion that appending a non-ASCII character to a long ASCII-only string requires the whole string to be copied and doubled in size (lots of heap space used).
But you need non-ASCII characters to display the title of an MP3 track.
On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.06.14 10:03, Chris Angelico wrote:
Right, which is why I don't like the idea. But you don't need non-ASCII characters to blink an LED or turn a servo, and there is significant resistance to the notion that appending a non-ASCII character to a long ASCII-only string requires the whole string to be copied and doubled in size (lots of heap space used).
But you need non-ASCII characters to display the title of an MP3 track.
Agreed. IMO, any Python, no matter how micro, needs full Unicode support; but there is resistance from uPy's devs.

ChrisA
Hello,

On Thu, 5 Jun 2014 00:26:10 +1000 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.06.14 10:03, Chris Angelico wrote:
Right, which is why I don't like the idea. But you don't need non-ASCII characters to blink an LED or turn a servo, and there is significant resistance to the notion that appending a non-ASCII character to a long ASCII-only string requires the whole string to be copied and doubled in size (lots of heap space used).
But you need non-ASCII characters to display the title of an MP3 track.
Yes, but to display a title, you don't need to do codepoint access at random - you need to either take a block of memory (length in bytes) and do something with it (pass to a C function, transfer over some bus, etc.), or *iterate in order* over codepoints in a string. All these operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Some operations are not going to be as fast, so - oops - avoid doing them without good reason. And kindly drop the expectation that arbitrary operations on *Unicode* are as efficient as you imagined. (Note: *Unicode* in general, not the particular flavor you got used to, to the point of thinking it's the one and only "right" flavor.)
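[To make Paul's point concrete, a small sketch (my own code, not from the thread): in-order iteration over a UTF-8 buffer touches each byte once, so walking all codepoints is linear in the buffer length, the same asymptotic cost as for a fixed-width encoding.

    def iter_codepoints(buf):
        # buf is a bytes object holding UTF-8; continuation bytes
        # match the bit pattern 0b10xxxxxx
        i = 0
        while i < len(buf):
            j = i + 1
            while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                j += 1
            yield buf[i:j].decode('utf-8')
            i = j

    for ch in iter_codepoints('Nikšić'.encode('utf-8')):
        print(ch)

What UTF-8 gives up is random access: buf[i] no longer corresponds to codepoint i.]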
Agreed. IMO, any Python, no matter how micro, needs full Unicode support; but there is resistance from uPy's devs.
FUD ;-).
-- Best regards, Paul mailto:pmiscml@gmail.com
On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
But you need non-ASCII characters to display the title of an MP3 track.
Yes, but to display a title, you don't need to do codepoint access at random - you need to either take a block of memory (length in bytes) and do something with it (pass to a C function, transfer over some bus, etc.), or *iterate in order* over codepoints in a string. All these operations are as efficient (O-notation) for UTF-8 as for UTF-32.
Suppose you have a long title, and you need to abbreviate it by dropping out words (delimited by whitespace), such that you keep the first word (always) and the last (if possible) and as many as possible in between. How are you going to write that? With PEP 393 or UTF-32 strings, you can simply record the index of every whitespace you find, count off lengths, and decide what to keep and what to ellipsize.
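[A minimal sketch of that index-recording approach (my own code; the budget arithmetic and the " ... " ellipsis style are assumptions, not from the thread):

    def abbreviate(title, max_len):
        # record the index of every space -- cheap if indexing is O(1)
        spaces = [i for i, ch in enumerate(title) if ch == ' ']
        if len(title) <= max_len or not spaces:
            return title[:max_len]
        first = title[:spaces[0]]
        last = title[spaces[-1] + 1:]
        out = first
        # keep middle words while the budget still allows " ... " + last word
        for a, b in zip(spaces, spaces[1:]):
            word = title[a + 1:b]
            if len(out) + 1 + len(word) + 5 + len(last) > max_len:
                break
            out += ' ' + word
        if len(out) + 5 + len(last) <= max_len:
            return out + ' ... ' + last
        return out + ' ...'

    print(abbreviate('the quick brown fox jumps over the lazy dog', 25))
    # -> the quick brown ... dog

With a fixed-width representation, every title[i] and title[a:b] above is cheap; with internal UTF-8, each one becomes a scan.]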
Some operations are not going to be as fast, so - oops - avoid doing them without good reason. And kindly drop the expectation that arbitrary operations on *Unicode* are as efficient as you imagined. (Note: *Unicode* in general, not the particular flavor you got used to, to the point of thinking it's the one and only "right" flavor.)
Not sure what you mean by flavors of Unicode. Unicode is a mapping of codepoints to characters, not an in-memory representation. And I've been working with Python 3.3 since before it came out, and with Pike (which has a very similar model) for longer, and in both of them, I casually perform operations on Unicode strings in the same way that I used to perform operations on REXX strings (which were eight-bit in the current system codepage - 437 for us). I do expect those operations to be efficient, and I get what I expect. Maybe they won't be in uPy, but that would be a limitation of uPy, not a fundamental problem with Unicode.

ChrisA
Steven D'Aprano wrote:
The language semantics says that a string is an array of code points. Every index relates to a single code point, no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"?
We're discussing the behaviour under a different (hypothetical) design decision than a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.
e.g.
s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'
That cannot fail with Python's semantics.
Agreed, and it shouldn't (I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.

But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes. Chris's example of an actual case where it may look like a good idea to use indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
Suppose you have a long title, and you need to abbreviate it by dropping out words (delimited by whitespace), such that you keep the first word (always) and the last (if possible) and as many as possible in between. How are you going to write that? With PEP 393 or UTF-32 strings, you can simply record the index of every whitespace you find, count off lengths, and decide what to keep and what to ellipsize.
"Recording the index" is where the optimization comes in. With a variable-length encoding - heck, even with a fixed-length one - I'd just use str.split(' ') (or re.split('\\s', string), depending on how much I care about the type of delimiter) and manipulate the list. If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything. The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding. In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython - having official (and hopefully flexible) guidance on deviations from the reference implementation would I think help other implementations provide even more value, which is only a good thing for Python. Cheers, Steve
On 04/06/2014 16:32, Steve Dower wrote:
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?

--
Mark Lawrence
On 04/06/2014 16:52, Mark Lawrence wrote:
On 04/06/2014 16:32, Steve Dower wrote:
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?
Anybody?

--
Mark Lawrence
Hello,

On Fri, 06 Jun 2014 09:32:25 +0100 Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 04/06/2014 16:52, Mark Lawrence wrote:
On 04/06/2014 16:32, Steve Dower wrote:
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?
Anybody?
I'd like to address this, and other buffer manipulation optimization ideas I have for MicroPython, at some later time. But as you suggest, it would be possible to transparently have "strings-by-reference". The reason MicroPython doesn't have them so far (and why I'm, as a uPy contributor, not ready to discuss them) is that they're an optimization, and everyone knows what premature optimization is.

[]

-- Best regards, Paul mailto:pmiscml@gmail.com
On 06/04/2014 05:52 PM, Mark Lawrence wrote:
On 04/06/2014 16:32, Steve Dower wrote:
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?
Memory view of Unicode strings is controversial for two reasons:

1. It exposes the internal representation of the string. If memoryviews of strings were supported in Python 3, PEP 393 would not have been possible (without breaking that feature).

2. Even if it were OK to expose the internal representation, it might not be what the users expect. For example, memoryview("Hrvoje") would return a view of a 6-byte buffer, while memoryview("Nikšić") would return a view of a 12-byte UCS-2 buffer. The user of a memory view might expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.

An implementation that decided to export strings as memory views might be forced to make a decision about internal representation of strings, and then stick to it.

The byte objects don't have these issues, which is why in Python 2.7 memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.
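[Although memoryview(str) isn't available, the size effect Hrvoje describes is easy to observe (a small demonstration I'm adding, not from the thread; exact byte counts vary across CPython builds and versions, but the per-character growth is what PEP 393 prescribes):

    import sys

    # ASCII-only text grows by 1 byte per extra character...
    print(sys.getsizeof('aaaa'), sys.getsizeof('aaaaaaaa'))
    # ...while text containing U+0161 grows by 2 bytes per extra character
    print(sys.getsizeof('šššš'), sys.getsizeof('šššššššš'))]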
On 06/06/2014 09:53, Hrvoje Niksic wrote:
On 06/04/2014 05:52 PM, Mark Lawrence wrote:
On 04/06/2014 16:32, Steve Dower wrote:
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?
Memory view of Unicode strings is controversial for two reasons:
1. It exposes the internal representation of the string. If memoryviews of strings were supported in Python 3, PEP 393 would not have been possible (without breaking that feature).
2. Even if it were OK to expose the internal representation, it might not be what the users expect. For example, memoryview("Hrvoje") would return a view of a 6-byte buffer, while memoryview("Nikšić") would return a view of a 12-byte UCS-2 buffer. The user of a memory view might expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.
An implementation that decided to export strings as memory views might be forced to make a decision about internal representation of strings, and then stick to it.
The byte objects don't have these issues, which is why in Python 2.7 memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.
Thanks for the explanation :)

--
Mark Lawrence
On 6/6/2014 4:53 AM, Hrvoje Niksic wrote:
On 06/04/2014 05:52 PM, Mark Lawrence wrote:
Out of idle curiosity is there anything that stops MicroPython, or any other implementation for that matter, from providing views of a string rather than copying every time? IIRC memoryviews in CPython rely on the buffer protocol at the C API level, so since strings don't support this protocol you can't take a memoryview of them. Could this actually be implemented in the future, is the underlying C code just too complicated, or what?
Memory view of Unicode strings is controversial for two reasons:
1. It exposes the internal representation of the string. If memoryviews of strings were supported in Python 3, PEP 393 would not have been possible (without breaking that feature).
2. Even if it were OK to expose the internal representation, it might not be what the users expect. For example, memoryview("Hrvoje") would return a view of a 6-byte buffer, while memoryview("Nikšić") would return a view of a 12-byte UCS-2 buffer. The user of a memory view might expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.
An implementation that decided to export strings as memory views might be forced to make a decision about internal representation of strings, and then stick to it.
The byte objects don't have these issues, which is why in Python 2.7 memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.
The other problem is that a small slice view of a large object keeps the large object alive, so a view user needs to think carefully about whether to make a copy or create a view, and later to copy views to delete the base object. This is not for beginners.

--
Terry Jan Reedy
On 06/06/2014 05:59 PM, Terry Reedy wrote:
The other problem is that a small slice view of a large object keeps the large object alive, so a view user needs to think carefully about whether to make a copy or create a view, and later to copy views to delete the base object. This is not for beginners.
And this was important enough that Java 7 actually removed the long-standing feature of String.substring creating a string that shares the character array with the original. http://java-performance.info/changes-to-string-java-1-7-0_06/
Hello,

On Fri, 06 Jun 2014 11:59:31 -0400 Terry Reedy <tjreedy@udel.edu> wrote:

[]
The other problem is that a small slice view of a large object keeps the large object alive, so a view user needs to think carefully about whether to make a copy or create a view, and later to copy views to delete the base object. This is not for beginners.
Yes, so it doesn't make sense to add such a feature to any of the existing APIs. However, as I pointed out in another mail, it would make a lot of sense to add an iterator-based string API (because if dict methods were *switched* to iterators, why can't strings have them *as an alternative*), and for their return values it would be fairly natural to return "string views", especially if it's clearly and explicitly documented that if the user wants to store them, they should be explicitly copied via str(view).

One argument against this would of course be API bloat. But API bloat happens all the time; for example, compare this modest proposal http://bugs.python.org/issue21180 with what's actually going to be implemented: http://legacy.python.org/dev/peps/pep-0467/#alternate-constructors

-- Best regards, Paul mailto:pmiscml@gmail.com
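[A hypothetical illustration of the kind of API Paul sketches (the names StrView and iwords are mine; neither is a real or proposed MicroPython API): an iterator yields lightweight views into the base string, and str() is the explicit copy point.

    class StrView:
        # a slice that references the base string instead of copying it
        def __init__(self, base, start, stop):
            self._base, self._start, self._stop = base, start, stop

        def __len__(self):
            return self._stop - self._start

        def __str__(self):
            # the explicit copy: detaches the data from the base object
            return self._base[self._start:self._stop]

    def iwords(s):
        # iterator-based alternative to s.split() that yields views, not copies
        i, n = 0, len(s)
        while i < n:
            while i < n and s[i] == ' ':
                i += 1
            j = i
            while j < n and s[j] != ' ':
                j += 1
            if i < j:
                yield StrView(s, i, j)
            i = j

Terry's caveat applies directly: every StrView keeps the whole base string alive until the caller copies it out with str(view).]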
On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
Steven D'Aprano wrote:
The language semantics says that a string is an array of code points. Every index relates to a single code point, no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"?
We're discussing the behaviour under a different (hypothetical) design decision than a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.
I'm open to different implementations. I earlier even suggested that the choice of O(1) indexing versus O(N) indexing was a quality of implementation issue, not a make-or-break issue for whether something can call itself Python (or even "99% compatible with Python"). But I don't believe that exposing that implementation at the Python level is valid: regardless of whether it is efficient or not, I should be able to write code like this:

a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

That is not the case if you expose the underlying byte-level implementation at the Python level, and treat strings as an array of *bytes*. Paul seems to want to do this, or at least he wants Python 4 to do this. I think it is *completely* inappropriate to do so.

I *think* you may agree with me (correct me if I'm wrong), because you go on to agree with me:
e.g.
s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'
That cannot fail with Python's semantics.
Agreed, and it shouldn't
but I'm not actually sure.
(I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.
You don't seem to be talking about sys.getsizeof, so I guess you're talking about something at the C level (or other underlying implementation), ignoring the object overhead. I don't know why you think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code points in a single byte, so whether you use UTF-32, UTF-16, UTF-8, Python 3.3's FSR or some other implementation, at least some code points are going to use more than one byte.
But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes.
I'm trying to understand what point you are trying to make, because I'm afraid I don't quite get it. [...]
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
finditer returns a bunch of MatchObjects, which give you the indexes of the found substring. Whether you do it yourself, or get the re module to do it, you're indexing somewhere.
The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding.
I don't have a problem with different implementations, so long as that implementation isn't exposed at the Python level with changes of semantics such as breaking the promise that a string is an array of code points, not of bytes.
In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython - having official (and hopefully flexible) guidance on deviations from the reference implementation would I think help other implementations provide even more value, which is only a good thing for Python.
Yes, agreed.

--
Steven
Hello,

On Thu, 5 Jun 2014 01:00:52 +1000 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
But you need non-ASCII characters to display the title of an MP3 track.
Yes, but to display a title, you don't need to do codepoint access at random - you need to either take a block of memory (length in bytes) and do something with it (pass to a C function, transfer over some bus, etc.), or *iterate in order* over codepoints in a string. All these operations are as efficient (O-notation) for UTF-8 as for UTF-32.
Suppose you have a long title, and you need to abbreviate it by dropping out words (delimited by whitespace), such that you keep the first word (always) and the last (if possible) and as many as possible in between. How are you going to write that? With PEP 393 or UTF-32 strings, you can simply record the index of every whitespace you find, count off lengths, and decide what to keep and what to ellipsize.
I'll submit an angry bug report along the lines of "WWWHAT, it's 3.5 and there's still no str.isplit()??!!11", then do it with re.finditer() (while submitting another report on the inconsistent naming scheme).

[]

-- Best regards, Paul mailto:pmiscml@gmail.com
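[For what it's worth, the lazy splitter being joked about is a three-liner on top of re.finditer() (the name isplit is hypothetical; no such str method exists):

    import re

    def isplit(s):
        # lazy analogue of s.split(): yields one word at a time
        # instead of building the whole list up front
        return (m.group() for m in re.finditer(r'\S+', s))

    for word in isplit('some long MP3 track title'):
        print(word)]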
On 04.06.14 17:49, Paul Sokolovsky wrote:
On Thu, 5 Jun 2014 00:26:10 +1000 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.06.14 10:03, Chris Angelico wrote:
Right, which is why I don't like the idea. But you don't need non-ASCII characters to blink an LED or turn a servo, and there is significant resistance to the notion that appending a non-ASCII character to a long ASCII-only string requires the whole string to be copied and doubled in size (lots of heap space used).

But you need non-ASCII characters to display the title of an MP3 track.
Yes, but to display a title, you don't need to do codepoint access at random - you need to either take a block of memory (length in bytes) and do something with it (pass to a C function, transfer over some bus, etc.), or *iterate in order* over codepoints in a string. All these operations are as efficient (O-notation) for UTF-8 as for UTF-32.
Several previous comments discuss the first option, ASCII-only strings.
Hello,

On Tue, 3 Jun 2014 22:23:07 -0700 Guido van Rossum <guido@python.org> wrote:

[]
Never mind disabling assertions -- even with enabled assertions you'd have to expect most Python programs to fail with non-ASCII input.
Then again the UTF-8 option would be pretty devastating too for anything manipulating strings (especially since many Python APIs are defined using indexes, e.g. the re module).
If Unicode is slow (*), then the obvious choice is not to use Unicode when it's not needed. Too bad that's a bit hard in Python 3, as it enforces Unicode everywhere, and dealing with efficient strings requires prefixing them with funny characters like "b", etc.

* If Unicode is slow because it causes the heap to bloat and go to swap, the choice is still the same.
Why not support variable-width strings like CPython 3.4?
Because, like a good deal of the community, we hope that Python 4 will get back to reality, and strings will be efficient (both for processing and storage) by default, with the niche and marginal "Unicode string" type used explicitly (using funny prefixes, etc.), only when really needed. Ah, all these not so funny geek jokes about internals of language implementation, hope they didn't make somebody's day dull!
-- Best regards, Paul mailto:pmiscml@gmail.com
On 04/06/2014 11:53, Paul Sokolovsky wrote:
Hello,
On Tue, 3 Jun 2014 22:23:07 -0700 Guido van Rossum <guido@python.org> wrote:
[]
Never mind disabling assertions -- even with enabled assertions you'd have to expect most Python programs to fail with non-ASCII input.
Then again the UTF-8 option would be pretty devastating too for anything manipulating strings (especially since many Python APIs are defined using indexes, e.g. the re module).
If Unicode is slow (*), then the obvious choice is not to use Unicode when it's not needed. Too bad that's a bit hard in Python 3, as it enforces Unicode everywhere, and dealing with efficient strings requires prefixing them with funny characters like "b", etc.
* If Unicode is slow because it causes the heap to bloat and go to swap, the choice is still the same.
Where is your evidence that (presumably) CPython unicode is slow? What is your response to this message http://bugs.python.org/issue16061#msg171413 from the bug tracker?
Why not support variable-width strings like CPython 3.4?
Because, like a good deal of the community, we hope that Python 4 will get back to reality, and strings will be efficient (both for processing and storage) by default, with the niche and marginal "Unicode string" type used explicitly (using funny prefixes, etc.), only when really needed.
Where is your evidence that supports the above claim?
Ah, all these not so funny geek jokes about internals of language implementation, hope they didn't make somebody's day dull!
--
Mark Lawrence
Hello,

On Wed, 4 Jun 2014 12:32:12 +1000 Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <steve@pearwood.info> wrote:
* Having a build-time option to restrict all strings to ASCII-only.
(I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
What I was actually suggesting along those lines was that the str type still be notionally a Unicode string, but that any codepoints >127 would either raise an exception or blow an assertion,
That's another reason why people don't like Unicode enforced upon them - all the talk about supporting all languages and scripts is demagogy and hypocrisy; given a choice, Unicode zealots would rather limit people to the Latin script than give up on their arbitrarily chosen, one-among-thousands, soon-to-be-replaced-by-Apple's-and-Microsoft's-"exciting-new" encoding.

Once again, my claim is that what MicroPython implements now is the more correct handling - in a sense wider than technical. We don't provide Unicode encoding support, because it's highly bloated, but we let people use any encoding they like. That comes at some price, like the length of strings in characters not being known to the runtime, only in bytes, but quite a lot of applications can be written with just that. And I'm saying that not to discourage Unicode addition to MicroPython, but to hint that the "force-force" approach implemented by CPython 3 and causing rage and a split in the community is not appreciated.
and all the code to handle multibyte representations would be compiled out. So there'd still be a difference between strings of text and streams of bytes, but all encoding and decoding to/from ASCII-compatible encodings would just point to the same bytes in RAM.
Risk: Someone would implement that with assertions, then compile with assertions disabled, test only with ASCII, and have lurking bugs.
-- Best regards, Paul mailto:pmiscml@gmail.com
On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
That's another reason why people don't like Unicode enforced upon them - all the talk about supporting all languages and scripts is demagogy and hypocrisy; given a choice, Unicode zealots would rather limit people to the Latin script than give up on their arbitrarily chosen, one-among-thousands, soon-to-be-replaced-by-Apple's-and-Microsoft's-"exciting-new" encoding.
Wrong. I use and recommend Unicode, with UTF-8 for transmission, and I do not ever want to limit people to Latin-1 or any other such subset. Even though English is the only language I speak, I am *frequently* using non-ASCII characters (eg when I discuss mathematics on a MUD), and if I could be absolutely sure that everyone in the conversation correctly comprehended Unicode, I could do this with a lot more confidence. Unfortunately, the server I use just passes bytes in and out, and some clients assume CP-1252, others assume Latin-1, and others (including my Gypsum) try UTF-8 first and fall back on an eight-bit encoding (currently CP-1252 because of the first group). But in an ideal world, server and clients would all speak Unicode everywhere, and transmit and receive UTF-8. This is not hypocrisy, this is the way to work reliably.
Once again, my claim is that what MicroPython implements now is the more correct handling - in a sense wider than technical. We don't provide Unicode encoding support, because it's highly bloated, but we let people use any encoding they like. That comes at some price, like the length of strings in characters not being known to the runtime, only in bytes, but quite a lot of applications can be written with just that.
The current implementation is flat-out lying, actually. It claims that it's storing Unicode codepoints (as per the Python spec) while actually storing bytes, and then it transmits those bytes to the console etc as-is. This is a bug. It needs to be fixed. The only question is, what form will the fix take? Will it be PEP 393's flexible fixed-width representation? UTF-8? UTF-16 (I hope not!)? A hybrid of Latin-1 where possible and UTF-8 otherwise? But something has to be done.

ChrisA
On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
And I'm saying that not to discourage Unicode addition to MicroPython, but to hint that the "force-force" approach implemented by CPython 3 and causing rage and a split in the community is not appreciated.
FWIW, it's Python 3 (the language) and not CPython 3.x (the implementation) that specifies Unicode strings in this way. I don't know why it has to cause a split in the community; this is the one way to make sure *everyone's* strings work perfectly, rather than having ASCII strings work fine and others start tripping over problems in various APIs.

ChrisA
Hello,

On Wed, 4 Jun 2014 20:53:46 +1000 Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
And I'm saying that not to discourage Unicode addition to MicroPython, but to hint that the "force-force" approach implemented by CPython 3 and causing rage and a split in the community is not appreciated.
FWIW, it's Python 3 (the language) and not CPython 3.x (the implementation) that specifies Unicode strings in this way.
Yeah, but it's CPython that dictates how the language evolves (some people even think that it dictates how the language should be implemented!), so all the good parts belong to Python 3, and all the bad parts to CPython 3, right? ;-)
I don't know why it has to cause a split in the community; this is the one way to make sure *everyone's* strings work perfectly, rather than having ASCII strings work fine and others start tripping over problems in various APIs.
It did cause a split in the community, that's a fact; that's why Python 2 and Python 3 are at their respective positions. Anyway, I'm not interested in participating in that split. I have not yet uttered my opinion on it publicly enough, so I seized a chance to drop some witty remarks, but I don't want to start yet another Unicode flame.

So, let's please get back to Unicode storage representation in MicroPython. https://github.com/micropython/micropython/issues/657 discussed the technical aspects, and in a recent mail on this list I expressed my opinion on why following the CPython way is not productive (for development satisfaction and evolution of the Python community, to be explicit).

The final argument I would have is that you certainly can implement Unicode support the PEP 393 way - it would be enormous help and would be gladly accepted. The question is how useful it will be for MicroPython. It certainly will be useful for reporting the passing of testsuites. But will it be *really* used? For a microcontroller board, it might be too heavy (put simply, with it, people will be able to do less (== heap running out sooner)) than without it, so one may expect it to be disabled by default. Then, the POSIX port is there surely not to let people replace the "python" command with "micropython" and run Django, but to let people develop and debug their apps with more comfort than on an embedded board. So, it should behave close to the MCU version, and would follow the MCU choice re: Unicode.

That's actually the reason why I keep up this discussion - not for the sake of argument or to bash Python 3's Unicode choices. With the recent MicroPython announcement, we surely looked for more people to contribute to its development. But then we (or at least I can speak for myself) would like to make sure that these contributions are actually the most useful ones (for both MicroPython, and the Python community in general, which gets more choices, rather than just an N% smaller CPython rewrite).

So, you're not sure how O(N) string indexing will work? But MicroPython offers a great opportunity to try! And it's something new and exciting, which surely will be useful (== will save people memory), not just something old and boring ;-).
-- Best regards, Paul mailto:pmiscml@gmail.com
If we're voting I think representing Unicode internally in micropython as utf-8 with O(N) indexing is a great idea, partly because I'm not sure indexing into strings is a good idea - lots of Unicode code points don't make sense by themselves; see also grapheme clusters. It would probably work great.
I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings. Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize their implementation of these easily enough for O(N) indexing.

Cheers,
Steve
On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings.
I'm afraid I don't understand this argument. The language semantics says that a string is an array of code points. Every index relates to a single code point, no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"? e.g.

s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'

That cannot fail with Python's semantics. [Aside: it does fail in Python 2, showing that the idea that "strings are bytes" is fatally broken. Fortunately Python has moved beyond that.]
Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize their implementation of these easily enough for O(N) indexing.
Really? Well, it will be a nice experiment. Fortunately MicroPython runs under Linux as well as on embedded systems (a clever decision, by the way) so I look forward to seeing how their internal-UTF-8 implementation stacks up against CPython's FSR implementation.

Out of curiosity, when the FSR was proposed, did anyone consider an internal UTF-8 representation? If so, why was it rejected?

--
Steven
On Wed, Jun 4, 2014 at 10:12 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings.
I'm afraid I don't understand this argument. The language semantics say that a string is an array of code points. Every index relates to a single code point; no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"?
"Useful" is probably a better word. When you get into the complicated languages and you want to know how wide something is, and you might have y with two dots on it as one code point or two and left-to-right and right-to-left indicators and who knows what else... then looking at individual code points only works sometimes. I get the slicing idea. I like the idea that encoding to utf-8 would be the fastest thing you can do with a string. You could consider doing regexps in that domain, and other implementation specific optimizations in exactly the same way that any Python implementation has them. None of this would make it harder to move a servo.
On 6/4/2014 6:14 AM, Steve Dower wrote:
I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings. Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize its implementations of these easily enough for O(N) indexing.
Cheers, Steve
Top-posted from my Windows Phone

From: Daniel Holth <dholth@gmail.com>
Sent: 6/4/2014 5:17
To: Paul Sokolovsky <pmiscml@gmail.com>
Cc: python-dev <python-dev@python.org>
Subject: Re: [Python-Dev] Internal representation of strings and Micropython
If we're voting, I think representing Unicode internally in micropython as utf-8 with O(N) indexing is a great idea, partly because I'm not sure indexing into strings is a good idea in the first place - lots of Unicode code points don't make sense by themselves; see also grapheme clusters. It would probably work great.
I think native UTF-8 support is the most promising route for MicroPython Unicode support. It would be an interesting proof-of-concept to implement an alternative CPython with PEP 393 replaced by UTF-8 internally... doing conversions for APIs that require a different encoding, but always maintaining and computing with the UTF-8 representation.

1) The first proof-of-concept implementation should implement codepoint indexing as an O(N) operation, searching from the beginning of the string for the Nth codepoint. Later proof-of-concept implementations could add a codepoint boundary cache; there could be a variety of caching algorithms:

2) (Least space efficient) An array that could be indexed by codepoint position and result in byte position. (This would use more space than a UTF-32 representation!)

3) (Most space efficient) One cached entry, that caches the last codepoint/byte position referenced. UTF-8 is able to be traversed in either direction, so "next/previous" codepoint access would be relatively fast (and such are very common operations, even when indexing notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

4) (Fixed size caches) N entries, one for the last codepoint, and others at Codepoint_Length/N intervals. N could be tunable.

5) (Fixed size caches) Like 4, plus an extra entry like 3.

6) (Variable size caches) Like 2, but only indexing every Nth code point. N could be tunable.

7) (Variable size caches) Like 6, plus an extra entry like 3.

8) (Content specific variable size caches) Index each codepoint that is a different byte size than the previous codepoint, allowing indexing to be used in the intervals. Worst case size is like 2, best case size is a single entry for the end, when all code points are represented by the same number of bytes.

9) (Content specific variable size caches) Like 8, only cache entries could indicate fixed or variable size characters in the next interval, with a scheme like 4 or 6 used to prevent one interval from covering the whole string.

Other hybrid schemes may present themselves as useful once experience is gained with some of these. It might be surprising how few applications need more than algorithm 3 (sketched after this message) to get reasonable performance.

Glenn
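As a rough illustration of option 3, a single-entry cache might look like the following Python sketch (the class and method names are illustrative only; a real MicroPython implementation would be C code over its own object layout):

    class Utf8Str:
        def __init__(self, data):
            self._data = data       # UTF-8 encoded bytes
            self._cache = (0, 0)    # (codepoint index, byte offset)

        def _byte_offset(self, index):
            ci, bo = self._cache
            data = self._data
            step = 1 if index >= ci else -1
            while ci != index:
                bo += step
                # skip UTF-8 continuation bytes (0b10xxxxxx)
                while 0 < bo < len(data) and data[bo] & 0xC0 == 0x80:
                    bo += step
                ci += step
            self._cache = (ci, bo)
            return bo

        def __getitem__(self, index):
            start = self._byte_offset(index)
            end = start + 1
            while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
                end += 1
            return self._data[start:end].decode('utf-8')

With this, a sequential pattern like the quoted for-loop is amortized O(1) per access, since each lookup resumes from the previous position.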
On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
8) (Content specific variable size caches) Index each codepoint that is a different byte size than the previous codepoint, allowing indexing to be used in the intervals. Worst case size is like 2, best case size is a single entry for the end, when all code points are represented by the same number of bytes.
Conceptually interesting, and I'd love to know how well that'd perform in real-world usage. Would do very nicely on blocks of text that are all from the same range of codepoints, but if you intersperse high and low codepoints it'll be like 2 but with significantly more complicated lookups (imagine a "name=value\nname=value\n" stream where the names and values are all in the same language - you'll have a lot of transitions).

ChrisA
On 6/4/2014 2:28 PM, Chris Angelico wrote:
On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
8) (Content specific variable size caches) Index each codepoint that is a different byte size than the previous codepoint, allowing indexing to be used in the intervals. Worst case size is like 2, best case size is a single entry for the end, when all code points are represented by the same number of bytes.

Conceptually interesting, and I'd love to know how well that'd perform in real-world usage.
So would I :)
Would do very nicely on blocks of text that are all from the same range of codepoints, but if you intersperse high and low codepoints it'll be like 2 but with significantly more complicated lookups (imagine a "name=value\nname=value\n" stream where the names and values are all in the same language - you'll have a lot of transitions).
Lookup is binary search on code point index, or a search for same in some tree structure, I would think. "Like 2 but ..." - well, the data structure would be bigger than for 2, but your example shows 4-5 high codepoints per low codepoint (for some languages).

I did just think of another refinement to this technique (my list was not intended to be all-inclusive... just a bunch of variations I thought of then).

10) (Content specific variable size caches) Like 8, but the last character in a run is allowed (but not required) to be a different number of bytes than the prior characters, because the offset calculation will still work for the first character of a different size.

So #10 would halve the cache size for your imagined stream that intersperses one low-byte character with each sequence of high-byte characters.
Glenn Linderman writes:
3) (Most space efficient) One cached entry, that caches the last codepoint/byte position referenced. UTF-8 is able to be traversed in either direction, so "next/previous" codepoint access would be relatively fast (and such are very common operations, even when indexing notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)
Been there, tried that (Emacsen). Either it's a YAGNI (moving forward or backward over UTF-8 by characters short distances is plenty fast, especially if you've got a lot of ASCII you can move by words for somewhat longer distances), or it's not good enough. There *may* be a sweet spot, but it's definitely smaller than the one on Sharapova's racket.
4) (Fixed size caches) N entries, one for the last codepoint, and others at Codepoint_Length/N intervals. N could be tunable.
To achieve a space saving the cache has to be quite small, and the bigger your integers, the smaller it has to be. A naive implementation on a 64-bit machine would give you 16 bytes/cache entry. Using a non-native size will be a space win, but needs care in implementation. Initializing the cache is very expensive for small strings, so you need conditional and maybe lazy initialization (for large strings).

By the way, there are also:

10) Keep counts of the leading and trailing number of ASCII (one-octet) characters. This is often a *huge* win; it's quite common to encounter documents where size - lc - tc = 2 (i.e., there's only one two-octet character in the document). (A sketch follows after this message.)

11) Keep a list (or tree) of most-recently-accessed positions.

Despite my negative experience with multibyte encodings in Emacsen, I'm persuaded by the arguments that there probably aren't all that many places in core Python where indexing is used in an essential way, so MicroPython itself can probably optimize those "behind the scenes". Application programmers in the embedded context may be expected to deal with the need to avoid random-access algorithms and use iterators and generators to accomplish most tasks.
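A minimal sketch of option 10, assuming an internal UTF-8 buffer (the function name is illustrative, not from any actual implementation):

    def ascii_margins(data):
        # count leading one-byte (ASCII) characters
        lead = 0
        while lead < len(data) and data[lead] < 0x80:
            lead += 1
        # count trailing one-byte characters, without overlapping the lead run
        trail = 0
        while trail < len(data) - lead and data[-1 - trail] < 0x80:
            trail += 1
        return lead, trail

Indexes below the leading count map directly to byte offsets, and likewise for indexes within the trailing count of the end, so only the middle of the string ever needs a linear scan.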
On 04.06.14 23:50, Glenn Linderman wrote:
3) (Most space efficient) One cached entry, that caches the last codepoint/byte position referenced. UTF-8 is able to be traversed in either direction, so "next/previous" codepoint access would be relatively fast (and such are very common operations, even when indexing notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)
Great idea! It should cover most real-world cases. Note that we can scan a UTF-8 string left-to-right and right-to-left.
On Wed, Jun 04, 2014 at 01:38:57PM +0300, Paul Sokolovsky wrote:
That's another reason why people don't like Unicode enforced upon them
Enforcing design and language decisions is the job of the programming language. You might as well complain that Python forces C doubles as the floating point type, or that it forces Bignums as the integer type, or that it forces significant indentation, or "class" as a keyword. Or that C forces you to use braces and manage your own memory. That's the purpose of the language, to make those decisions as to what features to provide and what not to provide.
- all the talk about supporting all languages and scripts is demagogy and hypocrisy, given a choice, Unicode zealots would rather limit people to Latin script
I have no words to describe how ridiculous this accusation is.
then give up on their arbitrarily chosen, one-among-thousands, soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.
Once again, my claim is that what MicroPython implements now is the more correct handling - in a sense wider than technical. We don't provide Unicode encoding support, because it's highly bloated, but we let people use any encoding they like. That comes at some price, e.g. the length of a string in characters is not known to the runtime, only its length in bytes.
What does uPy return for the length of '∞'? If the answer is anything but 1, that's a bug. -- Steven
On 4 June 2014 11:17, Steven D'Aprano <steve@pearwood.info> wrote:
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python.
If string indexing & iteration is still presented to the user as "an array of code points", it should still avoid the bugs that plagued both Python 2 narrow builds and direct use of UTF-8 encoded Py2 strings. If they don't try to offer C API compatibility, it should be feasible to do it that way. If they *do* try to offer C API compatibility, they may have a problem.
I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
There's a general expectation that indexing will be O(1) because all the builtin containers that support that syntax use it for O(1) lookup operations. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
There's a general expectation that indexing will be O(1) because all the builtin containers that support that syntax use it for O(1) lookup operations.
Depending on your definition of built in, there is at least one standard library container that does not - collections.deque.

Given the specialized kinds of application this Python implementation is targeted at, it seems UTF-8 is ideal, considering the huge memory savings resulting from the compressed representation, and the reduced likelihood of there being any real need for serious text processing on the device. You are also unlikely to find software or libraries like Django or Werkzeug running on a microcontroller; more likely all the Python code would be custom, in which case replacing string indexing with iteration, or a temporary conversion to a list, is easily done (see the sketch below).

In this context, while a fixed-width encoding may be the correct choice it would also likely be the wrong choice.

David
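For illustration, both workarounds mentioned above are one-liners in plain Python (the string content here is made up):

    text = "sensor=42;unit=C"

    for ch in text:            # iterate instead of indexing: O(N) total
        pass

    chars = list(text)         # or convert once if random access is really needed
    assert chars[7] == '4'     # O(N) conversion, then O(1) per access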
dw+python-dev@hmmz.org writes:
Given the specialized kinds of application this Python implementation is targeted at, it seems UTF-8 is ideal considering the huge memory savings resulting from the compressed representation,
I think you really need to check what the applications are in detail. UTF-8 costs about 35% more storage for Japanese, and even more for Chinese, than does UTF-16. So if you might be using a lot of Asian localized strings, it might even be worth implementing PEP-393 to get the best of both worlds for most strings.
On Wed, Jun 4, 2014 at 11:36 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I think you really need to check what the applications are in detail. UTF-8 costs about 35% more storage for Japanese, and even more for Chinese, than does UTF-16.
"UTF-8 can be smaller even for Asian languages, e.g.: front page of Wikipedia Japan: 83 kB in UTF-8, 144 kB in UTF-16"
From http://www.lua.org/wshop12/Ierusalimschy.pdf (p. 12)
On 4 June 2014 15:39, <dw+python-dev@hmmz.org> wrote:
On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
There's a general expectation that indexing will be O(1) because all the builtin containers that support that syntax use it for O(1) lookup operations.
Depending on your definition of built in, there is at least one standard library container that does not - collections.deque.
Given the specialized kinds of application this Python implementation is targeted at, it seems UTF-8 is ideal considering the huge memory savings resulting from the compressed representation, and the reduced likelihood of there being any real need for serious text processing on the device.
Right - I wasn't clear that I think storing text internally as UTF-8 sounds fine for MicroPython. Anything where the O(N) nature of indexing by code point matters probably won't be run in that environment anyway. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 2014-06-04 14:33, Nick Coghlan wrote:
On 4 June 2014 15:39, <dw+python-dev@hmmz.org> wrote:
On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
There's a general expectation that indexing will be O(1) because all the builtin containers that support that syntax use it for O(1) lookup operations.
Depending on your definition of built in, there is at least one standard library container that does not - collections.deque.
Given the specialized kinds of application this Python implementation is targeted at, it seems UTF-8 is ideal considering the huge memory savings resulting from the compressed representation, and the reduced likelihood of there being any real need for serious text processing on the device.
Right - I wasn't clear that I think storing text internally as UTF-8 sounds fine for MicroPython. Anything where the O(N) nature of indexing by code point matters probably won't be run in that environment anyway.
In order to avoid indexing, you could use some kind of 'cursor' class to step forwards and backwards along strings. The cursor could include both the codepoint index and the byte index.
On 04.06.14 19:52, MRAB wrote:
In order to avoid indexing, you could use some kind of 'cursor' class to step forwards and backwards along strings. The cursor could include both the codepoint index and the byte index.
So you need a different string library and a different regular expression library.
On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 4 June 2014 11:17, Steven D'Aprano <steve@pearwood.info> wrote:
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python.
If string indexing & iteration is still presented to the user as "an array of code points", it should still avoid the bugs that plagued both Python 2 narrow builds and direct use of UTF-8 encoded Py2 strings.
It would. The downsides of a UTF-8 representation would be slower iteration and much slower (O(N)) indexing/slicing. ChrisA
On 04/06/2014 02:51, Chris Angelico wrote:
On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It would. The downsides of a UTF-8 representation would be slower iteration and much slower (O(N)) indexing/slicing.
There's no reason for iteration to be slower. Slicing would get O(slice offset + slice size) instead of O(slice size). Regards Antoine.
Quoting Steven D'Aprano <steve@pearwood.info>:
* Having a build-time option to restrict all strings to ASCII-only.
(I think what they mean by that is that strings will be like Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
An ASCII-plus-arbitrary-bytes type called "str" would prevent claiming "Python 3.4 compatibility" for sure. Restricting strings to ASCII (as Chris apparently actually suggested) would allow claiming compatibility with a stretch: existing Python code might not run on such an implementation. However, since a lot of existing Python code wouldn't run on MicroPython anyway, one might claim to implement a Python 3.4 subset.
* Implementing Unicode internally as UTF-8, and giving up O(1) indexing operations.
Would either of these trade-offs be acceptable while still claiming "Python 3.4 compatibility"?
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
I agree. It's an open question whether such an implementation would be practical, both in terms of existing Python code, and in terms of existing C extension modules that people might want to port to MicroPython.

There are more things to consider for the internal implementation, in particular how the string length is implemented. Several alternatives exist:

1. store the UTF-8 length (i.e. memory size)
2. store the number of code points (i.e. Python len())
3. store both
4. store neither, but use null termination instead

Variant 3 is the most run-time efficient, but could easily use 8 bytes just for the lengths, which could outweigh the storage of the actual data. Variants 1 and 2 lose on some operations (1 loses on computing len(), 2 loses on string concatenation). Variant 4 would add the restriction of not allowing U+0000 in a string (which would be reasonable IMO), and make all length computations inefficient. However, it wouldn't be worse than standard C.

Regards, Martin
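For concreteness, the two quantities being distinguished here are easy to see from plain Python:

    s = "héllo"
    print(len(s))                  # 5 code points (what variant 2 stores)
    print(len(s.encode('utf-8')))  # 6 bytes (what variant 1 stores)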
On Wed, Jun 4, 2014 at 5:02 PM, <martin@v.loewis.de> wrote:
There are more things to consider for the internal implementation, in particular how the string length is implemented. Several alternatives exist:

1. store the UTF-8 length (i.e. memory size)
2. store the number of code points (i.e. Python len())
3. store both
4. store neither, but use null termination instead
Variant 3 is the most run-time efficient, but could easily use 8 bytes just for the lengths, which could outweigh the storage of the actual data. Variants 1 and 2 lose on some operations (1 loses on computing len(), 2 loses on string concatenation). Variant 4 would add the restriction of not allowing U+0000 in a string (which would be reasonable IMO), and make all length computations inefficient. However, it wouldn't be worse than standard C.
The current implementation stores a 16-bit length, which is both the memory size and the len(). As far as I can see, the memory size is never needed, so I'd just go for option 2; string concatenation is already known to be one of those operations that can be slow if you do it badly, and an optimized str.join() would cover the recommended use-case. ChrisA
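The recommended pattern alluded to here - collect the pieces, then join once - is ordinary Python:

    parts = []
    for i in range(1000):
        parts.append(str(i))
    result = ','.join(parts)   # one sizing-and-copy pass instead of repeated reallocation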
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters. I've toyed with making this O(1) universally. Like Steven, I understand this to be a freedom afforded to implementers, rather than an issue of conformity. Jeff Allen On 04/06/2014 02:17, Steven D'Aprano wrote:
There is a discussion over at MicroPython about the internal representation of Unicode strings. ... My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
For Jython and IronPython, UTF-16 may be the best internal encoding. Recent languages (Swift, Go, Rust) chose UTF-8 as their internal encoding. Using UTF-8 is simple and efficient: for example, no UTF-8 copy of the string is needed when writing to a file or serializing to JSON. When implementing Python in these languages, UTF-8 will be the best internal encoding. To allow Python implementations other than CPython to use UTF-8 or UTF-16 as internal encodings efficiently, I think adding an internal-position-based API is the best solution.
s = "\U00100000x" len(s) 2 s[1:] 'x' s.find('x') 1 # s.isize() # Internal length. 5 for utf-8, 3 for utf-16 # s.ifind('x') # Internal position, 4 for utf-8, 2 for utf-16 # s.islice(s.ifind('x')) => 'x'
(I like the design of Go and Rust. I hope CPython uses UTF-8 as its internal encoding in the future. But this is off-topic.) On Wed, Jun 4, 2014 at 4:41 PM, Jeff Allen <ja.py@farowl.co.uk> wrote:
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters.
I've toyed with making this O(1) universally. Like Steven, I understand this to be a freedom afforded to implementers, rather than an issue of conformity.
Jeff Allen
On 04/06/2014 02:17, Steven D'Aprano wrote:
There is a discussion over at MicroPython about the internal representation of Unicode strings.
...
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
-- INADA Naoki <songofacandy@gmail.com>
On 6/4/2014 3:41 AM, Jeff Allen wrote:
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters.
I've toyed with making this O(1) universally. Like Steven, I understand this to be a freedom afforded to implementers, rather than an issue of conformity.
Jeff Allen
On 04/06/2014 02:17, Steven D'Aprano wrote:
There is a discussion over at MicroPython about the internal representation of Unicode strings. ... My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
-- Terry Jan Reedy
On 6/4/2014 3:41 AM, Jeff Allen wrote:
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters.
Indexing can be made O(log(k)) where k is the number of astral chars, and is usually small. -- Terry Jan Reedy
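A sketch of that O(log(k)) translation for a UTF-16 representation; it assumes the implementation keeps a sorted list of the codepoint indices of the astral characters (names are illustrative):

    from bisect import bisect_left

    def codeunit_offset(index, astral_positions):
        # each astral char before the given index occupies one extra UTF-16 code unit
        return index + bisect_left(astral_positions, index)

For "a\U0001D11Eb" the astral positions are [1], so codepoint index 2 maps to code unit 3.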
On 05.06.14 00:21, Terry Reedy wrote:
On 6/4/2014 3:41 AM, Jeff Allen wrote:
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters.
Indexing can be made O(log(k)) where k is the number of astral chars, and is usually small.
I like your idea and think it would be great if Jython implements it. Unfortunately it is too late to do this in CPython.
On 6/4/2014 6:54 PM, Serhiy Storchaka wrote:
On 05.06.14 00:21, Terry Reedy wrote:
On 6/4/2014 3:41 AM, Jeff Allen wrote:
Jython uses UTF-16 internally -- probably the only sensible choice in a Python that can call Java. Indexing is O(N), fundamentally. By "fundamentally", I mean for those strings that have not yet noticed that they contain no supplementary (>0xffff) characters.
Indexing can be made O(log(k)) where k is the number of astral chars, and is usually small.
I like your idea and think it would be great if Jython will implement it.
A proof of concept implementation in Python that handles both indexing and slicing is on the tracker. It is simpler than I initially expected.
Unfortunately it is too late to do this in CPython.
I mentioned it as an alternative during the '393 discussion. I more than half agree that the FSR is the better choice for CPython, which had no particular attachment to UTF-16 in the way that I think Jython, for instance, does. -- Terry Jan Reedy
On 05.06.14 05:25, Terry Reedy wrote:
I mentioned it as an alternative during the '393 discussion. I more than half agree that the FSR is the better choice for CPython, which had no particular attachment to UTF-16 in the way that I think Jython, for instance, does.
Yes, I remember. I think that a hybrid FSR-UTF16 (like the FSR, but with UTF-16 used instead of UCS4) is the better choice for CPython. I suppose that with the growing popularity of emoticons and other icon characters over the next 5 or 10 years, even English text will often contain astral characters. And spending 4 bytes per character when a long text contains one astral character looks too prodigal.
Serhiy Storchaka writes:
Yes, I remember. I think that a hybrid FSR-UTF16 (like the FSR, but with UTF-16 used instead of UCS4) is the better choice for CPython. I suppose that with the growing popularity of emoticons and other icon characters over the next 5 or 10 years, even English text will often contain astral characters. And spending 4 bytes per character when a long text contains one astral character looks too prodigal.
Why use something that complex if you don't have to? For the use case you have in mind, just map them into private space. If you really want to be aggressive, use surrogate space, too (anything that cares what a scalar represents should be trapping on non-scalars, catch that exception and look up the char -- dangerous, though, because such exceptions are probably all over the place).
On 04.06.14 04:17, Steven D'Aprano wrote:
Would either of these trade-offs be acceptable while still claiming "Python 3.4 compatibility"?
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
I think that breaking the O(1) expectation for indexing makes the implementation significantly incompatible with Python. Virtually all string operations in Python operate with indices. O(1) indexing operations can be kept with minimal memory requirements if Unicode is implemented internally as modified UTF-8 plus an optional array of offsets for every, say, 32nd character (which can even be compressed to an array of 16-bit or 32-bit integers).
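A rough sketch of that offset-array scheme (the constant and function names are illustrative only):

    STEP = 32   # checkpoint every 32nd codepoint

    def build_offsets(data):
        # record the byte offset of every STEP-th codepoint
        offsets, cp = [], 0
        for byte_pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:          # a lead byte starts a codepoint
                if cp % STEP == 0:
                    offsets.append(byte_pos)
                cp += 1
        return offsets

    def byte_offset(data, offsets, index):
        pos = offsets[index // STEP]         # jump to the nearest checkpoint
        for _ in range(index % STEP):        # then scan at most STEP-1 chars
            pos += 1
            while pos < len(data) and data[pos] & 0xC0 == 0x80:
                pos += 1
        return pos

Indexing then costs O(STEP) rather than O(N), at the price of one offset entry per 32 characters.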
MicroPython is going to be significantly incompatible with Python anyway. But you should be able to run your mp code on regular Python. On Wed, Jun 4, 2014 at 9:39 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.06.14 04:17, Steven D'Aprano wrote:
Would either of these trade-offs be acceptable while still claiming "Python 3.4 compatibility"?
My own feeling is that O(1) string indexing operations are a quality of implementation issue, not a deal breaker to call it a Python. I can't see any requirement in the docs that str[n] must take O(1) time, but perhaps I have missed something.
I think that breaking the O(1) expectation for indexing makes the implementation significantly incompatible with Python. Virtually all string operations in Python operate with indices.
O(1) indexing operations can be kept with minimal memory requirements if Unicode is implemented internally as modified UTF-8 plus an optional array of offsets for every, say, 32nd character (which can even be compressed to an array of 16-bit or 32-bit integers).
On 4 June 2014 14:39, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think that breaking the O(1) expectation for indexing makes the implementation significantly incompatible with Python. Virtually all string operations in Python operate with indices.
I don't use indexing on strings except in rare situations. Sure I use lots of operations that may well use indexing *internally* but that's the point. MicroPython can optimise those operations without needing to guarantee O(1) indexing, and I'd be fine with that. Paul
On 04.06.14 17:02, Paul Moore wrote:
On 4 June 2014 14:39, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think that breaking the O(1) expectation for indexing makes the implementation significantly incompatible with Python. Virtually all string operations in Python operate with indices.
I don't use indexing on strings except in rare situations. Sure I use lots of operations that may well use indexing *internally* but that's the point. MicroPython can optimise those operations without needing to guarantee O(1) indexing, and I'd be fine with that.
Any non-trivial text parsing uses indices or regular expressions (and regular expressions themselves use indices internally). It would be interesting to collect statistics about how many indexing operations happen during the life of a string in a typical (Micro)Python program.
Hello, On Wed, 04 Jun 2014 17:40:14 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.06.14 17:02, Paul Moore wrote:
On 4 June 2014 14:39, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think that breaking the O(1) expectation for indexing makes the implementation significantly incompatible with Python. Virtually all string operations in Python operate with indices.
I don't use indexing on strings except in rare situations. Sure I use lots of operations that may well use indexing *internally* but that's the point. MicroPython can optimise those operations without needing to guarantee O(1) indexing, and I'd be fine with that.
Any non-trivial text parsing uses indices or regular expressions (and regular expressions themselves use indices internally).
I keep hearing this stuff, and unfortunately so far don't have enough time to collect all that stuff and provide a detailed response. So, here's a spur-of-the-moment response - hopefully we're in the same context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access to a string and taking substrings of a string.

Character-by-character random access implies that you would need to scan through (possibly almost) all chars in a string. That's O(N) (N = length of string). With a varlength encoding (taking O(N) to index an arbitrary char), there's thus a concern that this would be an O(N^2) op. But show me a real-world case for that. The common use case is scanning a string left-to-right, which should be done using an iterator and is thus O(N). Right-to-left scanning would be order(s) of magnitude less frequent, and is also handled by an iterator.

What's next? You're doing some funky anagrams and need to swap each 2 adjacent chars? Sorry, a naive implementation will be slow. If you're in serious anagram business, you'll need to code a C extension. No, wait! Instead, you should learn Python better. You should run a string windowing iterator which will return adjacent pairs, and swap those constant-length strings (a sketch follows after this message). More cases anyone? Implementing DES and doing arbitrary permutations? Kindly drop doing that on strings; do it on bytes or lists.

Hopefully, the idea is clear - if you *scan* through a string using indexes in *random* order, you're doing a weird thing and *want* weird performance. Doing stuff like s[0] or s[-1] - there's a finite (and small) number of such operations per string.

Now about taking substrings of strings (which in Python is often expressed by slice indexing). Well, this is quite different from scanning each character of a string. Just like s[0]/s[-1], this usually happens a finite number of times for a particular string, independent of its length, i.e. O(1) times (e.g. you take a string and split it in 3 parts), or maybe the number of substrings is not bound-fixed but has a different growth order, O(M) (for example, you split a string into tokens; tokens can be long, but there are usually external limits on how many it's sensible to have on one line). So, again, you're not going to get quadratic time unless you're unlucky or sloppy. And, just again, you should brush up your Python skills and use the regex functions which return iterators to get your parsed tokens, etc.

(To clarify the obvious - "you" here is an abstract pronoun, not referring to the respectable Python developers who actually made it possible to write efficient Python programs.)

So, hopefully the point is conveyed - you can write inefficient Python programs. CPython goes out of its way to hide many inefficiencies (using unbelievably bloated heap usage - from uPy's point of view, which starts up in a 2K heap). You just shouldn't write inefficient programs, voila. But if you want, you can keep writing inefficient programs; they just will be inefficient. Peace.
It would be interesting to collect statistics about how many indexing operations happen during the life of a string in a typical (Micro)Python program.
Yup. -- Best regards, Paul mailto:pmiscml@gmail.com
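The windowing trick mentioned above can be sketched in a few lines (assuming, for simplicity, a string of even length):

    def swap_adjacent(s):
        it = iter(s)
        # zip(it, it) pulls two chars per step; no random indexing involved
        return ''.join(b + a for a, b in zip(it, it))

    assert swap_adjacent("abcdef") == "badcfe"

This stays O(N) no matter how expensive s[i] is.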
On 04.06.14 18:38, Paul Sokolovsky wrote:
Any non-trivial text parsing uses indices or regular expressions (and regular expressions themselves use indices internally).

I keep hearing this stuff, and unfortunately so far don't have enough time to collect all that stuff and provide a detailed response. So, here's a spur-of-the-moment response - hopefully we're in the same context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access to a string and taking substrings of a string.

Character-by-character random access implies that you would need to scan through (possibly almost) all chars in a string. That's O(N) (N = length of string). With a varlength encoding (taking O(N) to index an arbitrary char), there's thus a concern that this would be an O(N^2) op.

But show me a real-world case for that. The common use case is scanning a string left-to-right, which should be done using an iterator and is thus O(N). Right-to-left scanning would be order(s) of magnitude less frequent, and is also handled by an iterator.
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. The common use case is to quickly find a substring starting from the current position using str.find or re.search, process the found token, advance the position, and repeat.
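That idiom looks roughly like this (the token pattern is made up for illustration; the modules named above each have their own):

    import re

    TOKEN = re.compile(r'\d+|\w+|\S')

    def tokens(text):
        pos = 0
        while pos < len(text):
            m = TOKEN.search(text, pos)   # resume the search at the current index
            if m is None:
                return
            yield m.group()
            pos = m.end()                 # advance position and repeat

On a plain UTF-8 representation, every search(text, pos) call would first have to translate pos into a byte offset - which is exactly the cost under discussion.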
Hello, On Wed, 04 Jun 2014 19:49:18 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote: []
But show me a real-world case for that. The common use case is scanning a string left-to-right, which should be done using an iterator and is thus O(N). Right-to-left scanning would be order(s) of magnitude less frequent, and is also handled by an iterator.
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. The common use case is to quickly find a substring starting from the current position using str.find or re.search, process the found token, advance the position, and repeat.
That's sad, I agree. -- Best regards, Paul mailto:pmiscml@gmail.com
On 04.06.14 20:05, Paul Sokolovsky wrote:
On Wed, 04 Jun 2014 19:49:18 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. The common use case is to quickly find a substring starting from the current position using str.find or re.search, process the found token, advance the position, and repeat.
That's sad, I agree.
Other languages (Go, Rust) can be happy without O(1) indexing of strings. All string and regex operations work with iterators or cursors, and I believe this approach is not significantly worse than implementing strings as O(1)-indexable arrays of characters (for some applications it can be worse, for others it can be better). But Python is a different language; it has different operations for strings and different idioms. A language which doesn't support O(1) indexing is not Python, it is only a Python-like language.
Hello, On Wed, 04 Jun 2014 20:52:14 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote: []
That's sad, I agree.
Other languages (Go, Rust) can be happy without O(1) indexing of strings. All string and regex operations work with iterators or cursors, and I believe this approach is not significantly worse than implementing strings as O(1)-indexable arrays of characters (for some applications it can be worse, for others it can be better). But Python is a different language; it has different operations for strings and different idioms. A language which doesn't support O(1) indexing is not Python, it is only a Python-like language.
Sorry, but that's just your personal opinion, not shared by other developers, as this thread showed. And let's not pretend we live in a happy-ever world of Python 1.5.2 which doesn't need anything more because it's perfect as it is. Somebody added all those iterators and iterator-returning functions to Python. And then the problem Python has is a typical "last mile" problem: iterators were not applied completely everywhere. There's little choice but to move in that direction, though.

What you call "idioms", other people call "sloppy programming practices". There's a common suggestion for how to be at peace with Python's indentation for those who find it a problem - "get over it". Well, somehow it itches to say the same to people who think that Python 3 should be used the same way as Python 1: get over the fact that Python is no longer the little funny language being laughed at by the Perl crowd for being an order of magnitude slower at processing text files. While you still can do the little funny tricks we all love Python for, it now also offers a framework for doing it right, and it makes little sense to say that doing it the little funny way is the definitive trait of Python.

(And for me it's easy to be so categorical - the only way I could subscribe to the idea of running Python on an MCU and not be laughable is by trusting Python to provide a framework for being efficient. I quit working on another language because I trusted that the iterator, generator, and buffer protocols are not little funny things but thoroughly engineered, efficient concepts, and I don't feel betrayed.)

-- Best regards, Paul mailto:pmiscml@gmail.com
Serhiy Storchaka wrote:
A language which doesn't support O(1) indexing is not Python, it is only Python-like language.
That's debatable, but even if it's true, I don't think there's anything wrong with MicroPython being only a "Python-like language". As has been pointed out, fitting Python onto a small device is always going to necessitate some compromises. -- Greg
Hello, On Thu, 05 Jun 2014 12:08:21 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Serhiy Storchaka wrote:
A language which doesn't support O(1) indexing is not Python, it is only a Python-like language.
That's debatable, but even if it's true, I don't think there's anything wrong with MicroPython being only a "Python-like language". As has been pointed out, fitting Python onto a small device is always going to necessitate some compromises.
Thanks. I mentioned in another mail that we are trying to develop exactly a minimalistic, but real, Python implementation, not a Python-like language. What is "Python-like" for me? The other most well-known and mature (as in "started quite some time ago") "small Python" implementation is PyMite aka Python-on-a-chip https://code.google.com/p/python-on-a-chip/ . It implements a good deal of the Python 2 language, but it doesn't implement exception handling (try/except). Can a Python be without exception handling? For me, the clear answer is "no". Please put that in perspective when alarming over O(1) indexing of an inherently problematic niche datatype. (Again, it's not my or MicroPython's fault that it was forced as the standard string type. Maybe if CPython had seriously considered the now-standard UTF-8 encoding, the resulting "str" type might be different. But CPython has gigabytes of heap to spare, and for MicroPython, every half-bit is precious.)
-- Greg
-- Best regards, Paul mailto:pmiscml@gmail.com
Paul Sokolovsky writes:
Please put that in perspective when alarming over O(1) indexing of an inherently problematic niche datatype. (Again, it's not my or MicroPython's fault that it was forced as the standard string type. Maybe if CPython had seriously considered the now-standard UTF-8 encoding, the resulting "str" type might be different. But CPython has gigabytes of heap to spare, and for MicroPython, every half-bit is precious.)
Would you please stop trolling? The reasons for adopting Unicode as a separate data type were good and sufficient in 2000, and they remain so today, even if you have been fortunate enough not to burn yourself on character-byte conflation yet. What matters to you is that str (unicode) is an opaque type -- there is no specification of the internal representation in the language reference, and in fact several different ones coexist happily across existing Python implementations -- and you're free to use a UTF-8 implementation if that suits the applications you expect for MicroPython. PEP 393 exists, of course, and specifies the current internal representation for CPython 3. But I don't see anything in it that suggests it's mandated for any other implementation.
Hello, On Thu, 05 Jun 2014 16:54:11 +0900 "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Paul Sokolovsky writes:
Please put that in perspective when alarming over O(1) indexing of an inherently problematic niche datatype. (Again, it's not my or MicroPython's fault that it was forced as the standard string type. Maybe if CPython had seriously considered the now-standard UTF-8 encoding, the resulting "str" type might be different. But CPython has gigabytes of heap to spare, and for MicroPython, every half-bit is precious.)
Would you please stop trolling? The reasons for adopting Unicode as a separate data type were good and sufficient in 2000, and they remain
If it had been kept as a "separate data type", there wouldn't be any problem. But it was made "the one and only string type", and all the strife started then. And there is going to be "trolling" as long as Python developers and decision-makers ignore (troll?) the outcry from the community (again, I was surprised and not surprised to see ~50% of the traffic on python-list touching Unicode issues). Well, I understand the plan - hoping that people will "get over this". And I'm personally happy to stay away from this "trolling", but any discussion related to Unicode goes in circles and returns to the feeling that the central role Unicode was given in Python 3 is misplaced. Then for me, it's just a matter of job security and my personal future - I don't want to spend the rest of my days as a javascript (or other idiotic language) monkey. And the message is clear in the air (http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ and elsewhere): if Python's (old) strings are now in Go, while Python itself now has Java strings, all causing strife, why not go cruising around and see what's up, instead of staying a strong, and growing bigger, community.
so today, even if you have been fortunate enough not to burn yourself on character-byte conflation yet.
What matters to you is that str (unicode) is an opaque type -- there is no specification of the internal representation in the language reference, and in fact several different ones coexist happily across existing Python implementations -- and you're free to use a UTF-8 implementation if that suits the applications you expect for MicroPython.
PEP 393 exists, of course, and specifies the current internal representation for CPython 3. But I don't see anything in it that suggests it's mandated for any other implementation.
I knew all this very well before. What's strange is that other developers don't know, or don't take seriously, all of the above. That's why the gentleman who kindly was interested in adding Unicode support to MicroPython started with the idea of dragging in the CPython implementation. And the only effect that persuading him it's not necessarily the best solution had was that he started to feel he was being manipulated into writing something ugly, instead of the bright idea he had. That's why another gentleman reduces it to: "O(1) on string indexing or not a Python!". And that's why yet another gentleman, who agrees with the UTF-8 arguments, still gives an excuse (https://mail.python.org/pipermail/python-dev/2014-June/134727.html): "In this context, while a fixed-width encoding may be the correct choice it would also likely be the wrong choice."

In this regard, I'm glad to participate in a mind-resetting discussion. So, let's reiterate - there's nothing like "the best", "the only right", "the only correct", "righter than", "more correct than" in CPython's implementation of Unicode storage. It is *arbitrary*. Well, sure, it's not arbitrary, but based on requirements, and those requirements match CPython's (implied) usage model well enough. But among all possible sets of requirements, CPython's are no more valid than any others. And another set of requirements fairly clearly leads to a situation where the CPython implementation is rejected as not correct for those requirements at all.

-- Best regards, Paul mailto:pmiscml@gmail.com
On 5 June 2014 21:25, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Well, I understand the plan - hoping that people will "get over this". And I'm personally happy to stay away from this "trolling", but any discussion related to Unicode goes in circles and returns to the feeling that the central role Unicode was given in Python 3 is misplaced.
Many of the challenges network programmers face in Python 3 are around binary data being more inconvenient to work with than it needs to be, not the fact we decentralised boundary code by offering a strict binary/text separation as the default mode of operation. Aside from some of the POSIX locale handling issues on Linux, many of the concerns are with the usability of bytes and bytearray, not with str - that's why binary interpolation is coming back in 3.5, and there will likely be other usability tweaks for those types as well. More on that at http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hello, On Thu, 5 Jun 2014 21:43:16 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 5 June 2014 21:25, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Well, I understand the plan - hoping that people will "get over this". And I'm personally happy to stay away from this "trolling", but any discussion related to Unicode goes in circles and returns to the feeling that the central role Unicode was given in Python 3 is misplaced.
Many of the challenges network programmers face in Python 3 are around binary data being more inconvenient to work with than it needs to be, not the fact we decentralised boundary code by offering a strict binary/text separation as the default mode of operation.
Just to clarify - (many) other gentlemen and I (in that order; I'm not taking a lead) aren't calling for going back to the Python 2 behavior with implicit conversion between byte-oriented strings and Unicode, etc. They just point out that perhaps Python 3 went too far with Unicode by making it the default string type. Strict separation is surely mostly a good thing (I can sigh that it leads to Java-like dichotomous bloat for all the I/O classes, but well, I was able to put up with that in MicroPython already).
Aside from some of the POSIX locale handling issues on Linux, many of the concerns are with the usability of bytes and bytearray, not with str - that's why binary interpolation is coming back in 3.5, and there will likely be other usability tweaks for those types as well.
All these changes are what let me dream on and speculate on the possibility that Python 4 could offer an encoding-neutral string type (which means one based on bytes), while moving unicode back to an explicit type to be used only when needed (bloated frameworks like Django can force users to it anyway, but that will be forcing at the framework level, not at the language level, against which people rebel). People can dream, right?

Thanks, Paul mailto:pmiscml@gmail.com
On 5 June 2014 22:01, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Aside from some of the POSIX locale handling issues on Linux, many of the concerns are with the usability of bytes and bytearray, not with str - that's why binary interpolation is coming back in 3.5, and there will likely be other usability tweaks for those types as well.
All these changes are what let me dream on and speculate on the possibility that Python 4 could offer an encoding-neutral string type (which means one based on bytes), while moving unicode back to an explicit type to be used only when needed (bloated frameworks like Django can force users to it anyway, but that will be forcing at the framework level, not at the language level, against which people rebel). People can dream, right?
If you don't model strings as arrays of code points, or at least assume a particular universal encoding (like UTF-8), you have to give up string concatenation in order to tolerate arbitrary encodings - otherwise you end up with unintelligible data that nobody can decode because it switches encodings without notice. That's a viable model if your OS guarantees it (Mac OS X does, for example, so Python 3 assumes UTF-8 for all OS interfaces there), but Linux currently has no such guarantee - many runtimes just decide they don't care, and assume UTF-8 anyway (Python 3 may even join them some day, due to the problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hello, On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote: []
problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).
... And until those golden times come, it would be nice if Python did not force its perfect-world model - which unfortunately is not based on the surrounding reality - and let users solve their encoding problems themselves, when they need to; because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python 3 forces users to deal with encodings almost universally, while forcing one particular choice on all strings (which, again, doesn't correspond to the state of the surrounding reality). I already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs - but that's a bit far from the original aim of making it easy and pleasant to write "correct" programs. (And definitions of "correct" vary.) But all that is just an opinion.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
-- Best regards, Paul mailto:pmiscml@gmail.com
On 5 June 2014 22:37, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).
... And until those golden times come, it would be nice if Python did not force its perfect-world model - which unfortunately is not based on the surrounding reality - and let users solve their encoding problems themselves, when they need to; because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python 3 forces users to deal with encodings almost universally, while forcing one particular choice on all strings (which, again, doesn't correspond to the state of the surrounding reality). I already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs - but that's a bit far from the original aim of making it easy and pleasant to write "correct" programs. (And definitions of "correct" vary.)
As I've said before in other contexts, find me Windows, Mac OS X and JVM developers, or educators and scientists, that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text encodings to be used at platform boundaries (using UTF-16, UTF-8 and UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX) says "well, it's configurable, but we won't provide a reliable mechanism for finding out what the encoding is. So either guess as best you can based on the info the OS *does* provide, assume UTF-8, assume 'some ASCII compatible encoding', or don't do anything that requires knowing the encoding of the data being exchanged with the OS, like, say, displaying file names to users or accepting arbitrary text as input, transforming it in a content aware fashion, and echoing it back in a console application".

None of those options are perfectly good choices. 6(ish) years ago, we chose the first option, because it has the best chance of working properly on Linux systems that use ASCII incompatible encodings like ShiftJIS, ISO-2022, and various other East Asian codecs. For normal user space programming, Linux is pretty reliable when it comes to ensuring the locale encoding is set to something sensible, but the price we currently pay for that decision is interoperability issues with things like daemons not receiving any configuration settings and hence falling back to the POSIX locale, and ssh environment forwarding moving a client's encoding settings to a session on a server with different settings.

I still consider it preferable to impose inconveniences like that based on use case (situations where Linux systems don't provide sensible encoding settings) than geographic region (locales where ASCII incompatible encodings are likely to still be in common use).

If I (or someone else) ever find the time to implement PEP 432 (or something like it) to address some of the limitations of the interpreter startup sequence that currently make it difficult to avoid relying on the POSIX locale encoding on Linux, then we'll be in a position to reassess that decision based on the increased adoption of UTF-8 by Linux distributions in recent years. As the major community Linux distributions complete the migration of their system utilities to Python 3, we'll get to see if they decide it's better to make their locale settings more reliable, or to help make it easier for Python 3 to ignore them when they're wrong.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 5 June 2014 14:15, Nick Coghlan <ncoghlan@gmail.com> wrote:
As I've said before in other contexts, find me Windows, Mac OS X and JVM developers, or educators and scientists that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.
There is once again a strong selection bias in this discussion, by its very nature. People who like the new model don't have anything to complain about, and so are not heard. Just to support Nick's point, I for one find the Python 3 text model a huge benefit, both in practical terms of making my programs more robust, and educationally, as I have a far better understanding of encodings and their issues than I ever did under Python 2. Whenever a discussion like this occurs, I find it hard not to resent the people arguing that the new model should be taken away from me and replaced with a form of the old error-prone (for me) approach - as if it was in my best interests. Internal details don't bother me - using UTF8 and having indexing be potentially O(N) is of little relevance. But make me work with a string type that *doesn't* abstract a string as a sequence of Unicode code points and I'll get very upset. Paul
On Thu, Jun 5, 2014 at 11:59 AM, Paul Moore <p.f.moore@gmail.com> wrote:
On 5 June 2014 14:15, Nick Coghlan <ncoghlan@gmail.com> wrote:
As I've said before in other contexts, find me Windows, Mac OS X and JVM developers, or educators and scientists that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.
There is once again a strong selection bias in this discussion, by its very nature. People who like the new model don't have anything to complain about, and so are not heard.
Just to support Nick's point, I for one find the Python 3 text model a huge benefit, both in practical terms of making my programs more robust, and educationally, as I have a far better understanding of encodings and their issues than I ever did under Python 2. Whenever a discussion like this occurs, I find it hard not to resent the people arguing that the new model should be taken away from me and replaced with a form of the old error-prone (for me) approach - as if it was in my best interests.
Internal details don't bother me - using UTF8 and having indexing be potentially O(N) is of little relevance. But make me work with a string type that *doesn't* abstract a string as a sequence of Unicode code points and I'll get very upset.
Once you get past whether str + bytes throws an exception, which seems to be the divide most people focus on, you can discover new things like dance-encoded strings, bytes decoded using an incorrect encoding intended to be transcoded into the correct encoding later, surrogates that work perfectly until .encode(), str(bytes), APIs that disagree with you about whether the result should be str or bytes, APIs that return either string or bytes depending on their initializers and so on. Unicode can still be complicated in Python 3 independent of any judgement about whether it is worse, better, or different than Python 2.
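To make the "dance-encoded strings" and "surrogates that work perfectly until .encode()" concrete, here is a minimal, self-contained illustration using the standard surrogateescape error handler:

    # Bytes that are not valid UTF-8 (here, Latin-1 encoded data).
    raw = b"caf\xe9"

    # surrogateescape smuggles the bad byte through as a lone
    # surrogate code point instead of raising UnicodeDecodeError.
    s = raw.decode("utf-8", "surrogateescape")

    # The string works fine - until a strict encode:
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as exc:
        print("strict encode fails:", exc)

    # Encoding with surrogateescape round-trips the original bytes.
    assert s.encode("utf-8", "surrogateescape") == raw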
On 6/5/2014 11:41 AM, Daniel Holth wrote:
discover new things like dance-encoded strings, bytes decoded using an incorrect encoding intended to be transcoded into the correct encoding later, surrogates that work perfectly until .encode(), str(bytes), APIs that disagree with you about whether the result should be str or bytes, APIs that return either string or bytes depending on their initializers and so on. Unicode can still be complicated in Python 3 independent of any judgement about whether it is worse, better, or different than Python 2. Yes, people can find ways to write bad code in any language.
On 6 Jun 2014 05:13, "Glenn Linderman" <v+python@g.nevcal.com> wrote:
On 6/5/2014 11:41 AM, Daniel Holth wrote:
discover new things like dance-encoded strings, bytes decoded using an incorrect encoding intended to be transcoded into the correct encoding later, surrogates that work perfectly until .encode(), str(bytes), APIs that disagree with you about whether the result should be str or bytes, APIs that return either string or bytes depending on their initializers and so on. Unicode can still be complicated in Python 3 independent of any judgement about whether it is worse, better, or different than Python 2.
Yes, people can find ways to write bad code in any language.
Note that several of the issues Daniel mentions here are due to the lack of reliable encoding settings on Linux and the challenges of the Py2->3 migration, rather than users writing bad code. Several of them represent bugs to be fixed or serve as indicators of missing features that would make it easier to work around an imperfect world. Cheers, Nick.
Hello, On Thu, 5 Jun 2014 23:15:54 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 5 June 2014 22:37, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).
... And until those golden times come, it would be nice if Python did not force its perfect-world model, which unfortunately is not based on the surrounding reality, and let users solve their encoding problems themselves - when they need to, because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python3 forces users to deal with encodings almost universally, while forcing a particular one on all strings (which, again, doesn't correspond to the state of the surrounding reality). I already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs, but that's a bit far from the original aim of making it easy and pleasant to write "correct" programs. (And definitions of "correct" vary.)
As I've said before in other contexts, find me Windows, Mac OS X and JVM developers, or educators and scientists that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.
Well, but this question reduces to finding out (or specifying) who the target audiences of Python are. It has always been (with a bow to Guido) a stronghold of scientific users (and probably, even if there were a mass exodus of other categories of users, it would remain prominent in that role). But Python has always had its share as a system scripting language among Perl-haters, and with Perl going flatline, I guess it's fair to say that Python is a major system scripting and service implementation language. Whom do features like memoryview, array.array, in-place input operations, etc. cater to? To scientists? I'm sure most of them are just happy with slapping "@jit" on their kernel functions. And scientists who bother with memoryviews for their data structures are system-level-ish programmers too. So, no wonder the Linux crowd cries at Python3 - it makes doing simple things unnecessarily complicated.
Windows, Mac OS X, and the JVM are all opinionated about the text encodings to be used at platform boundaries (using UTF-16, UTF-8 and UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX) says "well, it's configurable, but we won't provide a reliable mechanism for finding out what the encoding is. So either guess as
[] Yes, I understand the complexity of developing a cross-platform language with advanced features. But I may offer another look at all this activity: Python3 was brave enough to do a revolution in its own world (catching a lot of its users by surprise), but surely not brave enough to do a revolution around itself, by saying something like "We choose ONE, the most right, and even the most used (per bytes transferred) encoding as our standard I/O encoding. Grow up, or explicitly specify the encoding which you personally need." Surely, it didn't do that - it makes no sense to fight the world. But then Python3 is sympathetic to Java's desire to use "UTF-16" instead of the "right" encoding, and not so to Unix's desire to treat encodings as a separate level from content (and to treat Unicode as nothing else than yet another arbitrary encoding, which it is formally, and will be for a long time de facto, however sad that is). So, maybe "cross-platform" should have meant "don't do implicit conversions". Because see, Python2 had a problem with implicit encoding conversion when str and unicode objects were mixed, and Python3 has a problem with implicit conversions whenever str is used at all. Anyway, I appreciate the detailed responses, and understand what you (Python3 developers) are trying to achieve, and appreciate your work, and hope it all works out. Each user has their own concerns about Unicode. Mine are efficiency and layering. But once MicroPython has UTF-8 support I will be much more relaxed about it. Layering is harder to accept, but hopefully it can be tackled too, both on the mental and the technical sides. I hope other users will find their peace with Unicode too! [] -- Best regards, Paul mailto:pmiscml@gmail.com
On 6 June 2014 21:15, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Hello,
On Thu, 5 Jun 2014 23:15:54 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 5 June 2014 22:37, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).
... And until those golden times come, it would be nice if Python did not force its perfect-world model, which unfortunately is not based on the surrounding reality, and let users solve their encoding problems themselves - when they need to, because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python3 forces users to deal with encodings almost universally, while forcing a particular one on all strings (which, again, doesn't correspond to the state of the surrounding reality). I already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs, but that's a bit far from the original aim of making it easy and pleasant to write "correct" programs. (And definitions of "correct" vary.)
As I've said before in other contexts, find me Windows, Mac OS X and JVM developers, or educators and scientists that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.
Well, but this question reduces to finding out (or specifying) who the target audiences of Python are. It has always been (with a bow to Guido) a stronghold of scientific users (and probably, even if there were a mass exodus of other categories of users, it would remain prominent in that role). But Python has always had its share as a system scripting language among Perl-haters, and with Perl going flatline, I guess it's fair to say that Python is a major system scripting and service implementation language.
Correct - and the efforts of a number of core developers are focused on getting the Linux distros and major projects like OpenStack migrated. If other Linux users say "I'm not switching to Python 3 until after my distro has switched their own Python applications over", that's a perfectly reasonable course of action for them to take. After all, that approach to the adoption of new Python versions is a large part of why Python 2.6 is still so widely supported by library and framework developers: enterprise Linux distros haven't even finished migrating to Python 2.7 yet, let alone Python 3. (The other reason is that the language moratorium that was applied to Python 2.7 and 3.2 means that supporting back to Python 2.6 isn't that much harder than supporting 2.7 at this point in time). That said, the feedback from the early adopters of Python 3 on Linux is proving invaluable, and Linux users in general will benefit from their work as the distros move their infrastructure applications over. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 5 June 2014 22:01, Paul Sokolovsky <pmiscml@gmail.com> wrote:
All these changes are what let me dream on and speculate on possibility that Python4 could offer an encoding-neutral string type (which means based on bytes)
To me, an "encoding neutral string type" means roughly "characters are atomic", and the best representation we have for a "character" is a Unicode code point. Through any interface that provides "characters" each individual "character" (code point) is indivisible. To me, Python 3 has exactly an "encoding-neutral string type". It also has a bytes type that is is just that - bytes which can represent anything at all.It might be the UTF-8 representation of a string, but you have the freedom to manipulate it however you like - including making it no longer valid UTF-8. Whilst I think O(1) indexing of strings is important, I don't think it's as important as the property that "characters" are indivisible and would be quite happy for MicroPython to use UTF-8 as the underlying string representation (or some more clever thing, several ideas in this thread) so long as: 1. It maintains a string type that presents code points as indivisible elements; 2. The performance consequences of using UTF-8 are documented, as well as any optimisations, tricks, etc that are used to overcome those consequences (and what impact if any they would have if code written for MicroPython was run in CPython). Cheers, Tim Delaney
Hello, On Thu, 5 Jun 2014 22:21:30 +1000 Tim Delaney <timothy.c.delaney@gmail.com> wrote:
On 5 June 2014 22:01, Paul Sokolovsky <pmiscml@gmail.com> wrote:
All these changes are what let me dream on and speculate on possibility that Python4 could offer an encoding-neutral string type (which means based on bytes)
To me, an "encoding neutral string type" means roughly "characters are atomic", and the best representation we have for a "character" is a
And for me it means exactly what the "encoding neutral string type" moniker promises - that you should not make any assumptions about its encoding. That kinda means "string is atomic", instead of your "characters are atomic". That's the most basic level, and you can write a big enough set of applications using it - for example, get some information from a user, store it in a database, then show it back to the user at a later time. []
Cheers,
Tim Delaney
-- Best regards, Paul mailto:pmiscml@gmail.com
Paul Sokolovsky writes:
That kinda means "string is atomic", instead of your "characters are atomic".
I would be very surprised if a language that behaved that way was called a "Python subset". No indexing, no slicing, no regexps, no .split(), no .startswith(), no sorted() or .sort(), ...!? If that's not what you mean by "string is atomic", I think you're using very confusing terminology.
Hello, On Fri, 06 Jun 2014 20:11:27 +0900 "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Paul Sokolovsky writes:
That kinda means "string is atomic", instead of your "characters are atomic".
I would be very surprised if a language that behaved that way was called a "Python subset". No indexing, no slicing, no regexps, no .split(), no .startswith(), no sorted() or .sort(), ...!?
If that's not what you mean by "string is atomic", I think you're using very confusing terminology.
I'm sorry if I didn't mention it, or didn't make it clear enough - it's all about layering.
On level 0, you treat strings verbatim, and can write some subset of apps (my point is that even this level allows writing plenty of apps). Let's call this set A0.
On level 1, you accept that there are some universal enough conventions for some chars, like space or newline. And you can write a set of apps A1 > A0.
On level 2, you add len(), and - oh magic - you now can center a string within a fixed-size field, something you probably do as often as once a month, so hopefully that will keep you busy for a while.
On level 3, it indeed starts to smell of Unicode: we get isdigit() and isalpha(), which require long boring tables, which hopefully can be compressed enough to fit in your pocket.
On level 4, it's pumping up, with tolower() and friends, tables for which you carry around in a suitcase.
On level 5, everything is Unicode - what a bliss! You can even start pretending that no other levels exist (God created Unicode on the second day).
On level 6, there are mind-boggling, ugly, manual-use utilities to deal with the internals of the "magic", "working on its own for everyone" encoding - code point vs character vs surrogate pair vs grapheme separation, etc.
So, once again, for me and some other people, it's not that bright an idea to shoot for level 5 when levels 0-4 exist and form a well-proven pragmatic model. And level 6 is still there anyway. -- Best regards, Paul mailto:pmiscml@gmail.com
On 6 June 2014 21:34, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Fri, 06 Jun 2014 20:11:27 +0900 "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Paul Sokolovsky writes:
That kinda means "string is atomic", instead of your "characters are atomic".
I would be very surprised if a language that behaved that way was called a "Python subset". No indexing, no slicing, no regexps, no .split(), no .startswith(), no sorted() or .sort(), ...!?
If that's not what you mean by "string is atomic", I think you're using very confusing terminology.
I'm sorry if I didn't mention it, or didn't make it clear enough - it's all about layering.
On level 0, you treat strings verbatim, and can write some subset of apps (my point is that even this level allows writing plenty of apps). Let's call this set A0.
On level 1, you accept that there are some universal enough conventions for some chars, like space or newline. And you can write a set of apps A1 > A0.
At heart, this is exactly what the Python 3 "str" type is. The universal convention is "code points". It's got nothing to do with encodings, or bytes. A Python string is simply a finite sequence of atomic code points - it is indexable, and it has a length. Once you have that, everything is layered on top of it. How the code points themselves are implemented is opaque and irrelevant other than the memory and performance consequences of the implementation decisions (for example, a string could be indexable by iterating from the start until you find the nth code point). Similarly the "bytes" type is a sequence of 8-bit bytes. Encodings are simply a way to transport code points via a byte-oriented transport. Tim Delaney
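A small, self-contained illustration of that abstraction, valid on any conforming Python 3.3+ (nothing below depends on the internal representation):

    s = "a\u00e9\U0001F40D"   # 'a', e-acute, and U+1F40D (a snake)

    # len() counts code points - not bytes, not UTF-16 code units.
    assert len(s) == 3

    # Indexing yields exactly one whole code point each time.
    assert [ord(c) for c in s] == [0x61, 0xE9, 0x1F40D]

    # Byte length is an encoding-layer detail, invisible above.
    assert len(s.encode("utf-8")) == 1 + 2 + 4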
Hello, On Fri, 6 Jun 2014 21:48:41 +1000 Tim Delaney <timothy.c.delaney@gmail.com> wrote:
On 6 June 2014 21:34, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Fri, 06 Jun 2014 20:11:27 +0900 "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Paul Sokolovsky writes:
That kinda means "string is atomic", instead of your "characters are atomic".
I would be very surprised if a language that behaved that way was called a "Python subset". No indexing, no slicing, no regexps, no .split(), no .startswith(), no sorted() or .sort(), ...!?
If that's not what you mean by "string is atomic", I think you're using very confusing terminology.
I'm sorry if I didn't mention it, or didn't make it clear enough - it's all about layering.
On level 0, you treat strings verbatim, and can write some subset of apps (my point is that even this level allows writing plenty of apps). Let's call this set A0.
On level 1, you accept that there are some universal enough conventions for some chars, like space or newline. And you can write a set of apps A1 > A0.
At heart, this is exactly what the Python 3 "str" type is. The universal convention is "code points".
Yes. Except for one small detail - Python3 specifies these code points to be Unicode code points. And Unicode is a very bloated thing. But if we drop that "Unicode" stipulation, then it's also exactly what MicroPython implements. Its "str" type consists of code points; we don't have pet names for them yet, like Unicode does, but their numeric values are 0-255. Note that this in no way limits the encodings, characters, or scripts which can be used with MicroPython, because, just like Unicode, it supports the concept of "surrogate pairs" (though we don't call them that) - specifically, smaller code points may comprise bigger groupings. But unlike Unicode, we don't stipulate format, value, or other constraints on how these "surrogate pair"-alikes are formed, leaving that to users. -- Best regards, Paul mailto:pmiscml@gmail.com
On 7 June 2014 00:52, Paul Sokolovsky <pmiscml@gmail.com> wrote:
At heart, this is exactly what the Python 3 "str" type is. The universal convention is "code points".
Yes. Except for one small detail - Python3 specifies these code points to be Unicode code points. And Unicode is a very bloated thing.
But if we drop that "Unicode" stipulation, then it's also exactly what MicroPython implements. Its "str" type consists of code points; we don't have pet names for them yet, like Unicode does, but their numeric values are 0-255. Note that this in no way limits the encodings, characters, or scripts which can be used with MicroPython, because, just like Unicode, it supports the concept of "surrogate pairs" (though we don't call them that) - specifically, smaller code points may comprise bigger groupings. But unlike Unicode, we don't stipulate format, value, or other constraints on how these "surrogate pair"-alikes are formed, leaving that to users.
I think you've missed my point. There is absolutely nothing conceptually bloaty about what a Python 3 string is. It's just like a 7-bit ASCII string, except each entry can be from a larger table. When you index into a Python 3 string, you get back exactly *one valid entry* from the Unicode code point table. That plus the length of the string, plus the guarantee of immutability gives everything needed to layer the rest of the string functionality on top. There are no surrogate pairs - each code point is standalone (unlike code *units*). It is conceptually very simple. The implementation may be difficult (if you're trying to do better than 4 bytes per code point) but the concept is dead simple. If the MicroPython string type requires people *using* it to deal with surrogates (i.e. indexing could return a value that is not a valid Unicode code point) then it will have broken the conceptual simplicity of the Python 3 string type (and most certainly can't be considered in any way compatible). Tim Delaney
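To see the code point / code unit distinction concretely, compare a Python 3 str with its UTF-16 encoding - the surrogate pair exists only at the encoding layer and never leaks through the str interface:

    s = "\U0001F40D"             # one code point above the BMP

    assert len(s) == 1           # one atomic, indexable element
    assert ord(s[0]) == 0x1F40D  # indexing returns the whole code point

    # UTF-16 needs two 16-bit code units (a surrogate pair) for it,
    # but that split is never visible through the str interface.
    assert s.encode("utf-16-be") == b"\xd8\x3d\xdc\x0d"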
On 7 Jun 2014 00:53, "Paul Sokolovsky" <pmiscml@gmail.com> wrote:
Yes. Except for one small detail - Python3 specifies these code points to be Unicode code points. And Unicode is a very bloated thing.
I rather suspect users of East Asian & African scripts might have a different notion of what constitutes "bloated" vs "can actually represent this language properly, unlike 8-bit code spaces".
But if we drop that "Unicode" stipulation, then it's also exactly what MicroPython implements. Its "str" type consists of code points; we don't have pet names for them yet, like Unicode does, but their numeric values are 0-255. Note that this in no way limits the encodings, characters, or scripts which can be used with MicroPython, because, just like Unicode, it supports the concept of "surrogate pairs" (though we don't call them that) - specifically, smaller code points may comprise bigger groupings. But unlike Unicode, we don't stipulate format, value, or other constraints on how these "surrogate pair"-alikes are formed, leaving that to users.
This is effectively what the Python 2 str type does, and it's a recipe for data driven latent defects. You inevitably end up concatenating strings using different code spaces, or else splitting strings between surrogate pairs rather than on the proper boundaries, etc. The abstraction presented to users by the str type *must* be the full range of Unicode code points as atomic units. Storing those internally as UTF-8 rather than as fixed width code points as CPython does is an experiment worth trying, since you don't have the same C level backwards compatibility constraints we do. But limiting the str type to a single code page per process is not an acceptable constraint in a Python 3 implementation. Regards, Nick.
-- Best regards, Paul mailto:pmiscml@gmail.com
Paul Sokolovsky wrote:
All these changes are what let me dream on and speculate on possibility that Python4 could offer an encoding-neutral string type (which means based on bytes)
Can you elaborate on exactly what you have in mind? You seem to want something different from Python 3 str, Python 3 bytes and Python 2 str, but it's far from clear what you want this type to be like. -- Greg
Paul Sokolovsky <pmiscml@gmail.com> wrote:
In this regard, I'm glad to participate in a mind-resetting discussion. So, let's reiterate - there's nothing like "the best", "the only right", "the only correct", "righter than", "more correct than" in CPython's implementation of Unicode storage. It is *arbitrary*. Well, sure, it's not arbitrary, but based on requirements, and these requirements match CPython's (implied) usage model well enough. But among all possible sets of requirements, CPython's requirements are no more valid than other possible ones. And other sets of requirements fairly clearly lead to a situation where the CPython implementation is rejected as not correct for those requirements at all.
Several core-devs have said that using UTF-8 for MicroPython is perfectly okay. I also think it's the right choice and I hope that you guys come up with a very efficient implementation. Stefan Krah
On 5 June 2014 22:10, Stefan Krah <stefan@bytereef.org> wrote:
Paul Sokolovsky <pmiscml@gmail.com> wrote:
In this regard, I'm glad to participate in a mind-resetting discussion. So, let's reiterate - there's nothing like "the best", "the only right", "the only correct", "righter than", "more correct than" in CPython's implementation of Unicode storage. It is *arbitrary*. Well, sure, it's not arbitrary, but based on requirements, and these requirements match CPython's (implied) usage model well enough. But among all possible sets of requirements, CPython's requirements are no more valid than other possible ones. And other sets of requirements fairly clearly lead to a situation where the CPython implementation is rejected as not correct for those requirements at all.
Several core-devs have said that using UTF-8 for MicroPython is perfectly okay. I also think it's the right choice and I hope that you guys come up with a very efficient implementation.
Based on this discussion, I've also posted a draft patch aimed at clarifying the relevant aspects of the data model section of the language reference (http://bugs.python.org/issue21667). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hello, On Thu, 5 Jun 2014 22:38:13 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 5 June 2014 22:10, Stefan Krah <stefan@bytereef.org> wrote:
Paul Sokolovsky <pmiscml@gmail.com> wrote:
In this regard, I'm glad to participate in a mind-resetting discussion. So, let's reiterate - there's nothing like "the best", "the only right", "the only correct", "righter than", "more correct than" in CPython's implementation of Unicode storage. It is *arbitrary*. Well, sure, it's not arbitrary, but based on requirements, and these requirements match CPython's (implied) usage model well enough. But among all possible sets of requirements, CPython's requirements are no more valid than other possible ones. And other sets of requirements fairly clearly lead to a situation where the CPython implementation is rejected as not correct for those requirements at all.
Several core-devs have said that using UTF-8 for MicroPython is perfectly okay. I also think it's the right choice and I hope that you guys come up with a very efficient implementation.
Based on this discussion, I've also posted a draft patch aimed at clarifying the relevant aspects of the data model section of the language reference (http://bugs.python.org/issue21667).
Thanks, it's very much appreciated. Though, the discussion there opened another can of worms. I'm sorry if I was somehow related to that; my bringing in the formal language spec was more a rhetorical figure, a response to people claiming an O(1) requirement. So, it either should be in the spec, or the spec should be treated as such - something not specified means it's underspecified and implementation-dependent. I'm glad that the last point is now explicitly pronounced by the BDFL in the last comment on that ticket (http://bugs.python.org/issue21667#msg219824)
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
-- Best regards, Paul mailto:pmiscml@gmail.com
On Fri, Jun 6, 2014 at 8:15 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
I'm sorry if I was somehow related to that; my bringing in the formal language spec was more a rhetorical figure, a response to people claiming an O(1) requirement.
This was exactly why this whole discussion came up, though. We were debating on the uPy bug tracker about how important O(1) indexing is; I then came to python-list to try to get some solid data from which to debate; and then the discussion jumped here to python-dev for more solid explanations. The spec wasn't perfectly clear, and now it's being made clearer: O(N) indexing does not violate Python's spec, ergo uPy is allowed to use UTF-8 as its internal representation, as long as script-visible behaviour is correct. It'll be interesting to see when it's done (I'm currently working on that implementation, bit by bit) and to run the CPython benchmarks on it. It's been a fruitful and interesting discussion, and the formal language spec is key to it. No need to apologize! ChrisA
On 5 June 2014 17:54, Stephen J. Turnbull <stephen@xemacs.org> wrote:
What matters to you is that str (unicode) is an opaque type -- there is no specification of the internal representation in the language reference, and in fact several different ones coexist happily across existing Python implementations -- and you're free to use a UTF-8 implementation if that suits the applications you expect for MicroPython.
However, as others have noted in the thread, the critical thing is to *not* let that internal implementation detail leak into the Python level string behaviour. That's what happened with narrow builds of Python 2 and pre-PEP-393 releases of Python 3 (effectively using UTF-16 internally), and it was the cause of a sufficiently large number of bugs that the Linux distributions tend to instead accept the memory cost of using wide builds (4 bytes for all code points) for affected versions. Preserving the "the Python 3 str type is an immutable array of code points" semantics matters significantly more than whether or not indexing by code point is O(1). The various caching tricks suggested in this thread (especially "leading ASCII characters", "trailing ASCII characters" and "position & index of last lookup") could keep the typical lookup performance well below O(N).
PEP 393 exists, of course, and specifies the current internal representation for CPython 3. But I don't see anything in it that suggests it's mandated for any other implementation.
CPython is constrained by C API compatibility requirements, as well as implementation constraints due to the amount of internal code that would need to be rewritten to handle a variable width encoding as the canonical internal representation (since the problems with Python 2 narrow builds mean we already know variable width encodings aren't handled correctly by the current code). Implementations that share code with CPython, or try to mimic the C API especially closely, may face similar restrictions. Outside that, I think we're better off if alternative implementations are free to experiment with different internal string representations. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
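One of the caching tricks mentioned above can be sketched in a few lines (purely illustrative - the class name and layout are not from any actual MicroPython code): remember the byte offset of the last lookup, so that sequential indexing of a UTF-8 buffer costs amortized O(1) rather than O(N) per access.

    class Utf8Str:
        """Illustrative UTF-8-backed string with a last-lookup cache."""

        def __init__(self, data):
            self._buf = data      # bytes, assumed to hold valid UTF-8
            self._last = (0, 0)   # (code point index, byte offset)

        def _byte_offset(self, index):
            ci, bi = self._last
            step = 1 if index >= ci else -1
            while ci != index:
                bi += step
                # Continuation bytes look like 0b10xxxxxx; skip them.
                while 0 <= bi < len(self._buf) and self._buf[bi] & 0xC0 == 0x80:
                    bi += step
                ci += step
            self._last = (ci, bi)
            return bi

        def __getitem__(self, index):
            start = self._byte_offset(index)
            end = start + 1
            while end < len(self._buf) and self._buf[end] & 0xC0 == 0x80:
                end += 1
            return self._buf[start:end].decode("utf-8")

    s = Utf8Str("a\u00e9\U0001F40Dx".encode("utf-8"))
    assert s[2] == "\U0001F40D"
    assert s[1] == "\u00e9"   # second lookup seeks backwards from the cache

Sequential scans then touch each byte once in total, while genuinely random access still degrades to O(N) per lookup.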
05.06.14 03:08, Greg Ewing написав(ла):
Serhiy Storchaka wrote:
A language which doesn't support O(1) indexing is not Python, it is only Python-like language.
That's debatable, but even if it's true, I don't think there's anything wrong with MicroPython being only a "Python-like language". As has been pointed out, fitting Python onto a small device is always going to necessitate some compromises.
Agreed, there's nothing wrong with that. I think that even limiting integers to 32 or 64 bits is an acceptable compromise for a Python-like language targeted at small devices. But programming in such a language requires different techniques and habits.
Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. Common use case is quickly find substring starting from current position using str.find or re.search, process found token, advance position and repeat.
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string. Instead of an integer, str.find() etc. could return a StringPosition, which would be an opaque reference to a particular point in a particular string. You would be able to pass StringPositions to indexing and slicing operations to get fast indexing into the string that they were derived from. StringPositions could support the following operations: StringPosition + int --> StringPosition StringPosition - int --> StringPosition StringPosition - StringPosition --> int These would be computed by counting characters forwards or backwards in the string, which would be slower than int arithmetic but still faster than counting from the beginning of the string every time. In other contexts, StringPositions would coerce to ints (maybe being an int subclass?) allowing them to be used in any existing algorithm that slices strings using ints. -- Greg
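A rough sketch of what such a StringPosition might look like as an int subclass (all names and internals here are speculative elaborations of the idea above, not a worked-out proposal):

    class StringPosition(int):
        """Illustrative opaque position tied to a particular string.

        Behaves as a plain code point index everywhere, but also
        carries the source string and the byte offset it maps to,
        so indexing that string again could seek from here instead
        of counting from the beginning.
        """

        def __new__(cls, index, source=None, byte_offset=None):
            self = super().__new__(cls, index)
            self._source = source
            self._byte_offset = byte_offset
            return self

        def __add__(self, n):
            # A real implementation would count code points forward
            # from self._byte_offset; this sketch keeps only the linkage.
            return StringPosition(int(self) + n, self._source)

        def __sub__(self, other):
            if isinstance(other, StringPosition):
                return int(self) - int(other)   # position difference
            return StringPosition(int(self) - other, self._source)

    # Coerces to a plain index in existing slicing code:
    pos = StringPosition(5, source="hello world", byte_offset=5)
    assert "hello world"[pos] == " " and pos + 6 == 11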
On 6/4/2014 5:03 PM, Greg Ewing wrote:
Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. Common use case is quickly find substring starting from current position using str.find or re.search, process found token, advance position and repeat.
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string.
I think you meant codepoint index, rather than character index.
Instead of an integer, str.find() etc. could return a StringPosition, which would be an opaque reference to a particular point in a particular string. You would be able to pass StringPositions to indexing and slicing operations to get fast indexing into the string that they were derived from.
StringPositions could support the following operations:
StringPosition + int --> StringPosition StringPosition - int --> StringPosition StringPosition - StringPosition --> int
These would be computed by counting characters forwards or backwards in the string, which would be slower than int arithmetic but still faster than counting from the beginning of the string every time.
In other contexts, StringPositions would coerce to ints (maybe being an int subclass?) allowing them to be used in any existing algorithm that slices strings using ints.
This starts to diverge from Python codepoint indexing via integers. Calculating or caching the codepoint index to byte offset as part of the str implementation stays compatible with Python. Introducing StringPosition makes a Python-like language. Or so it seems to me.
On 6/4/2014 5:08 PM, Glenn Linderman wrote:
On 6/4/2014 5:03 PM, Greg Ewing wrote:
Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. Common use case is quickly find substring starting from current position using str.find or re.search, process found token, advance position and repeat.
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string.
I think you meant codepoint index, rather than character index.
Instead of an integer, str.find() etc. could return a StringPosition, which would be an opaque reference to a particular point in a particular string. You would be able to pass StringPositions to indexing and slicing operations to get fast indexing into the string that they were derived from.
StringPositions could support the following operations:
StringPosition + int --> StringPosition StringPosition - int --> StringPosition StringPosition - StringPosition --> int
These would be computed by counting characters forwards or backwards in the string, which would be slower than int arithmetic but still faster than counting from the beginning of the string every time.
In other contexts, StringPositions would coerce to ints (maybe being an int subclass?) allowing them to be used in any existing algorithm that slices strings using ints.
This starts to diverge from Python codepoint indexing via integers. Calculating or caching the codepoint index to byte offset as part of the str implementation stays compatible with Python. Introducing StringPosition makes a Python-like language. Or so it seems to me.
Another thought is that a StringPosition only works (quickly, at least), as you point out, for the string it was derived from... so algorithms that walk two strings at a time cannot use the same StringPosition to do so... yep, this is quite divergent from CPython and Python.
Glenn Linderman wrote:
so algorithms that walk two strings at a time cannot use the same StringPosition to do so... yep, this is quite divergent from CPython and Python.
They can, it's just that at most one of the indexing operations would be fast; the StringPosition would devolve into an int for the other one. Such an algorithm would be of dubious correctness anyway, since as you pointed out, codepoints and characters are not quite the same thing. A codepoint index in one string doesn't necessarily count off the same number of characters in another string. So to be safe, you should really walk each string individually. -- Greg
Glenn Linderman wrote:
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string.
I think you meant codepoint index, rather than character index.
Probably, but what I said is true either way.
This starts to diverge from Python codepoint indexing via integers.
That's true, although most programs would have to go out of their way to tell the difference, especially if StringPosition were a subclass of int. I agree that caching indexes would be more transparent, though. -- Greg
Hello, On Thu, 05 Jun 2014 12:03:17 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. Common use case is quickly find substring starting from current position using str.find or re.search, process found token, advance position and repeat.
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string.
Instead of an integer, str.find() etc. could return a StringPosition,
That's braver than I had in mind, but it definitely shows what alternative implementations have in store to fight back if some performance problems are actually detected. My own thought was, for example, as a response to people who (quoting) "slice strings for living", some form of "extended slicing" like str[(0, 4, 6, 8, 15)]. But I really think that providing an iterator interface for common string operations would cover most real-world cases, and would actually be beneficial for the Python language in general.
-- Greg
-- Best regards, Paul mailto:pmiscml@gmail.com
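What that hypothetical "extended slicing" could look like as a plain function (speculative, only to make the idea concrete): fetch several ascending indices in a single left-to-right pass, so a UTF-8 representation pays the scan cost once rather than once per index.

    def multi_index(s, indices):
        """Return the characters at several ascending indices in one pass."""
        out = []
        it = iter(enumerate(s))      # a single forward walk over s
        for want in indices:         # indices must be sorted ascending
            for i, ch in it:
                if i == want:
                    out.append(ch)
                    break
        return out

    assert multi_index("hello world", (0, 4, 6, 8)) == ["h", "o", "w", "r"]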
On Thu, Jun 5, 2014 at 10:03 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
StringPositions could support the following operations:
StringPosition + int --> StringPosition StringPosition - int --> StringPosition StringPosition - StringPosition --> int
These would be computed by counting characters forwards or backwards in the string, which would be slower than int arithmetic but still faster than counting from the beginning of the string every time.
The SP would have to keep track of which string it's associated with, which might make for some surprising retentions of large strings. (Imagine returning what you think is an integer, but which actually turns out to be an SP, and trying to work out why your program is eating up so much more memory than it should. This int-like object is so much more than an int.) ChrisA
05.06.14 03:03, Greg Ewing написав(ла):
Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't use iterators. They use indices, str.find and/or regular expressions. Common use case is quickly find substring starting from current position using str.find or re.search, process found token, advance position and repeat.
For that kind of thing, you don't need an actual character index, just some way of referring to a place in a string.
Of course. But _existing_ Python interfaces all work with indices. And it is too late to change this - that train left 20 years ago. There is no need for yet another way to do string operations. One obvious way is enough.
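For reference, a minimal sketch of the index-based scanning idiom described above (the shape used by html.HTMLParser and friends): find the next token starting from the current position, process it, advance, repeat.

    def iter_tokens(text, sep=","):
        """Lazily split text using the find-and-advance idiom."""
        pos = 0
        while True:
            nxt = text.find(sep, pos)   # search from the current position
            if nxt < 0:
                yield text[pos:]        # trailing token
                return
            yield text[pos:nxt]         # token before the separator
            pos = nxt + len(sep)        # advance past the separator

    assert list(iter_tokens("a,b,,c")) == ["a", "b", "", "c"]

On a UTF-8-backed implementation, each find(sep, pos) call must first map pos to a byte offset, which is exactly where the caching tricks discussed earlier in the thread would pay off.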
Serhiy Storchaka writes:
It would be interesting to collect a statistic about how many indexing operations happened during the life of a string in typical (Micro)Python program.
Probably irrelevant (I doubt anybody is going to be writing programmers' editors in MicroPython), but by far the most frequently called functions in XEmacs are byte_to_char_index and its inverse.
This thread has devolved into a flame war. I think we should trust the Micropython implementers (whoever they are -- are they participating here?) to know their users and let them do what feels right to them. We should just ask them not to claim full compatibility with any particular Python version -- that seems the most contentious point. Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to. -- --Guido van Rossum (python.org/~guido)
Hello, On Wed, 4 Jun 2014 11:25:51 -0700 Guido van Rossum <guido@python.org> wrote:
This thread has devolved into a flame war. I think we should trust the Micropython implementers (whoever they are -- are they participating here?)
I'm a regular contributor. I'm not sure if the author, Damien George, is on the list. In either case, he's a nice guy who prefer to do development rather than participate in flame wars ;-). And for the record, all opinions expressed are solely mine, and not official position of MicroPython project.
to know their users and let them do what feels right to them. We should just ask them not to claim full compatibility with any particular Python version -- that seems the most contentious point.
"Full" compatibility is never claimed, and understanding it as such is optimistic, "between the lines" reading of some users. All of: announcement posted on python-list (which prompted current inflow of MicroPython-related discussions), README at https://github.com/micropython/micropython , and detailed differences doc https://github.com/micropython/micropython/wiki/Differences make it clear there's no talk about "full" compatibility, and only specific compatibility (and incompatibility) points are claimed. That said, and unlike previous attempts to develop a small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is subject to opinions, and we welcome and appreciate independent review, like this thread did.
Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to.
That's true. However, as was said, we're striving to provide a compatible implementation, and compatibility claims must be validated. While we have a simple "in-house" testsuite, more serious compatibility validation requires running the testsuite of the reference implementation (CPython), and that's gradually being approached.
-- --Guido van Rossum (python.org/~guido)
-- Best regards, Paul mailto:pmiscml@gmail.com
On Thu, 05 Jun 2014 00:14:32 +0300, Paul Sokolovsky <pmiscml@gmail.com> wrote:
That said, and unlike previous attempts to develop small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is subject to opinions, and we welcome and appreciate independent review, like this thread did.
The language reference is also the language specification. I don't know what you mean by 'formal', so presumably it doesn't qualify :) That said, if there are places that are not correctly marked as implementation specific, those are bugs in the reference and should be fixed. There almost certainly are still such bugs, and I suspect MicroPython can help us fix them, just as PyPy did/does. --David
On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:
That said, and unlike previous attempts to develop small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is subject to opinions, and we welcome and appreciate independent review, like this thread did.
Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to.
That's true. However, as was said, we're striving to provide a compatible implementation, and compatibility claims must be validated. While we have a simple "in-house" testsuite, more serious compatibility validation requires running the testsuite of the reference implementation (CPython), and that's gradually being approached.
I would call what you are doing a 'Python 3.n subset, with limitations', where n should be a specific number, which I would urge should be at least 3, if not 4 ('yield from'). To me, that would mean that every Micropython program (that does not use a clearly non-Python addon like inline assembly) would run the same* on CPython 3.n. Conversely, a Python 3.n program should either run the same* on MicroPython as CPython, or raise. What is most to be avoided is giving different* answers.
*'same' does not include timing differences or normal float variations or bug fixes in MicroPython not in CPython.
As for unicode: I would see ascii-only (very limited codepoints) or bare utf-8 (limited speed == expanded time) as possibly fitting the definition above. Just be clear what the limitations are. And accept that there will be people who do not bother to read the limitations and then complain when they bang into them.
PS. You do not seem to be aware of how well the current PEP393 implementation works. If you are going to write any more about it, I suggest you run Tools/Stringbench/stringbench.py for timings.
-- Terry Jan Reedy
05.06.14 01:04, Terry Reedy написав(ла):
PS. You do not seem to be aware of how well the current PEP393 implementation works. If you are going to write any more about it, I suggest you run Tools/Stringbench/stringbench.py for timings.
AFAIK stringbench is ASCII-only, so it likely is compatible with current and any future MicroPython implementations, but it is unlikely to expose non-ASCII limitations or performance characteristics.
Hello, On Wed, 04 Jun 2014 18:04:52 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:
That said, and unlike previous attempts to develop a small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is subject to opinions, and we welcome and appreciate independent review, like this thread did.
Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to.
That's true. However, as was said, we're striving to provide a compatible implementation, and compatibility claims must be validated. While we have a simple "in-house" testsuite, more serious compatibility validation requires running the testsuite of the reference implementation (CPython), and that's gradually being approached.
I would call what you are doing a 'Python 3.n subset, with
Thanks, that's what we call it ourselves in the docs linked in the original message, and we use n=4. Note that being a subset is not a design requirement; rather, there's a higher-priority requirement of staying lean, so realistically uPy will always stay a subset.
limitations', where n should be a specific number, which I would urge should be at least 3, if not 4 ('yield from'). To me, that would mean that every Micropython program (that does not use a clearly non-Python addon like inline assembly) would run the same* on CPython 3.n. Conversely, a Python 3.n program should either run the same* on MicroPython as CPython, or raise. What is most to be avoided is giving different* answers.
That's a nice aim, which we don't have enough resources to implement, so we would appreciate any help from interested parties.
*'same' does not include timing differences or normal float variations or bug fixes in MicroPython not in CPython.
As for unicode: I would see ascii-only (very limited codepoints) or bare utf-8 (limited speed == expanded time) as possibly fitting the definition above. Just be clear what the limitations are. And accept that there will be people who do not bother to read the limitations and then complain when they bang into them.
PS. You do not seem to be aware of how well the current PEP393 implementation works. If you are going to write any more about it, I suggest you run Tools/Stringbench/stringbench.py for timings.
"Well" is subjective (or should be defined formally based on the requirements). With my MicroPython hat on, an implementation which receives a string, transcodes it, leading to bigger size, just to immediately transcode back and send out - is awful, environment unfriendly implementation ;-). -- Best regards, Paul mailto:pmiscml@gmail.com
On Thu, Jun 5, 2014 at 8:52 AM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
"Well" is subjective (or should be defined formally based on the requirements). With my MicroPython hat on, an implementation which receives a string, transcodes it, leading to bigger size, just to immediately transcode back and send out - is awful, environment unfriendly implementation ;-).
Be careful of confusing correctness and performance, though. The transcoding you describe is inefficient, but (presumably) correct; something that's fast but wrong is straight-up buggy. You can always fix inefficiency in a later release, but buggy behaviour sometimes is relied on (which is why ECMAScript still exposes UTF-16 to scripts, and why Windows window messages have a WPARAM and an LPARAM, and why Python's threading module has duplicate names for a lot of functions, because it's just not worth changing). I'd be much more comfortable releasing something where "everything works fine, but if you use astral characters in your strings, memory usage blows out by a factor of four" (or "... the len() function takes O(N) time") than one where "everything works fine as long as you use BMP only, but SMP characters result in tests failing". ChrisA
On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
"Well" is subjective (or should be defined formally based on the requirements). With my MicroPython hat on, an implementation which receives a string, transcodes it, leading to bigger size, just to immediately transcode back and send out - is awful, environment unfriendly implementation ;-).
I am not sure what you concretely mean by 'receive a string', but I think you are again batting at a strawman. If you mean 'read from a file', and all you want to do is read bytes from and write bytes to external 'files', then there is obviously no need to transcode, and neither Python 2 nor 3 makes you do so. -- Terry Jan Reedy
Hello, On Wed, 04 Jun 2014 22:15:30 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
"Well" is subjective (or should be defined formally based on the requirements). With my MicroPython hat on, an implementation which receives a string, transcodes it, leading to bigger size, just to immediately transcode back and send out - is awful, environment unfriendly implementation ;-).
I am not sure what you concretely mean by 'receive a string', but I
I (surely) mean an abstract input (as an Input/Output aka I/O) operation.
think you are again batting at a strawman. If you mean 'read from a file', and all you want to do is read bytes from and write bytes to external 'files', then there is obviously no need to transcode, and neither Python 2 nor 3 makes you do so.
But most files and network protocols are text-based, and I (and many other people) don't want to artificially use a "binary data" type for them, with all the attached funny things, like the "b" prefix. And then, Python2 indeed doesn't transcode anything, while Python3 does - without being asked, and for no good purpose, because in most cases input data will be output as-is (maybe in byte-boundary-split chunks). So, it all goes in circles - ignoring the forced-Unicode problem (after a week of subscription to python-list, half of the traffic there appears to be dedicated to Unicode-related flames) on python-dev's behalf is not going to help (the Python community). [] -- Best regards, Paul mailto:pmiscml@gmail.com
On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:
Hello,
On Wed, 04 Jun 2014 22:15:30 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
think you are again batting at a strawman. If you mean 'read from a file', and all you want to do is read bytes from and write bytes to external 'files', then there is obviously no need to transcode, and neither Python 2 nor 3 makes you do so. But most files and network protocols are text-based, and I (and many other people) don't want to artificially use a "binary data" type for them, with all the attached funny things, like the "b" prefix. And then, Python2 indeed doesn't transcode anything, while Python3 does - without being asked, and for no good purpose, because in most cases input data will be output as-is (maybe in byte-boundary-split chunks).
So, it all goes in circles - ignoring the forced-Unicode problem (after a week of subscription to python-list, half of the traffic there appears to be dedicated to Unicode-related flames) on python-dev's behalf is not going to help (the Python community).
If all your program is doing is reading and writing data (input data will be output as-is), then use of binary doesn't require the "b" prefix, because you aren't manipulating the data. Then you have no unnecessary transcoding. If you actually wish to examine or manipulate the content as it flows by, then there are choices.
1) If you need to examine/manipulate only a small fraction of the text data within the file, you can pay the small price of a few "b" prefixes to get high performance, and explicitly transcode only the portions that need to be manipulated.
2) If you are examining the bulk of the data as it flows by, but not manipulating it, just examining/extracting, then a full transcoding may be useful for that purpose... but you can perhaps do it explicitly, so that you keep the binary form for I/O. Careful of the block boundaries, in this case, however.
3) If you are actually manipulating the bulk of the data, then the double transcoding (once on input, and once on output) allows you to work in units of codepoints, rather than bytes, which generally makes the manipulation algorithms easier.
4) If you truly cannot afford the processor cost of the double transcoding, and need to do all your manipulations at the byte level, then you could avoid the need for the "b" prefix by use of a preprocessor for those sections of code that are doing all and only bytes processing... and you'll have lots of arcane, error-prone code to write to manipulate the bytes rather than the codepoints.
On the other hand, if you can convince your data sources and sinks to deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid transcoding, and make the arcane algorithms part of the implementation of μPy rather than of the application code, and support full Unicode. And it seems to me that the world is moving that way... towards UTF-8 as the standard interchange format. Encourage it. Glenn
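A minimal sketch of option 1 above (the file names and header format are invented for illustration): keep the I/O entirely in bytes and decode only the one fragment that actually needs text handling.

    # Copy a byte stream untouched, decoding only a header line for
    # inspection; everything else stays bytes end to end.
    with open("in.log", "rb") as src, open("out.log", "wb") as dst:
        for line in src:
            if line.startswith(b"Subject:"):
                subject = line[len(b"Subject:"):].strip().decode("utf-8")
                print("saw subject:", subject)
            dst.write(line)   # written back as-is, no transcoding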
On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
That said, and unlike previous attempts to develop small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is a matter of opinion, and we welcome and appreciate independent review, like this thread has provided.
Actually, there is a "formal, implementation-independent language spec": https://docs.python.org/3/reference/
Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to.
That's true. However, as was said, we're striving to provide a compatible implementation, and compatibility claims must be validated. While we have a simple "in-house" testsuite, more serious compatibility validation requires running the testsuite of the reference implementation (CPython), and that's gradually being approached.
To a large extent the test suite in http://hg.python.org/cpython/file/default/Lib/test effectively validates (full) compliance with the corresponding release (change "default" to the release branch of your choice). With that goal, no small effort has been made to mark implementation-specific tests as such. So uPy could consider using the test suite (and explicitly skip the tests for features that uPy doesn't support). -eric
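A rough sketch of the pattern Eric describes, assuming a decorator modeled on the CPython test suite's test.support.cpython_only helper (the example tests are invented): tests that depend on CPython internals are skipped on other implementations rather than counted as failures.

    import sys
    import unittest

    def cpython_only(test):
        """Skip on anything that isn't CPython (mirrors the CPython
        test suite's test.support.cpython_only helper)."""
        return unittest.skipUnless(
            sys.implementation.name == "cpython",
            "CPython implementation detail")(test)

    class StringTests(unittest.TestCase):

        def test_upper(self):
            # Pure language behaviour: any conforming implementation,
            # including uPy, should pass this.
            self.assertEqual("spam".upper(), "SPAM")

        @cpython_only
        def test_identical_short_strings(self):
            # Relies on CPython's string interning, an implementation
            # detail: skipped elsewhere, not failed.
            self.assertIs(sys.intern("spam"), sys.intern("spam"))

    if __name__ == "__main__":
        unittest.main()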
Hello, On Wed, 4 Jun 2014 16:12:23 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
That said, and unlike previous attempts to develop small Python implementations (which of course existed), we're striving to be exactly a Python language implementation, not a Python-like language implementation. As there's no formal, implementation-independent language spec, what constitutes a compatible language implementation is a matter of opinion, and we welcome and appreciate independent review, like this thread has provided.
Actually, there is a "formal, implementation-independent language spec":
Opening that link in a browser, pressing Ctrl+F and pasting your quote gives zero hits, so it's not exactly what you claim it to be. It's also pretty far from being formal (unambiguous, covering all choices, etc.) and comprehensive. Also, please point me at the "conformance" section. That said, all of us Pythoneers treat it as the best formal reference available, no news here.
Realistically, most Python code that works on Python 3.4 won't work on Micropython (for various reasons, not just the string behavior) and neither does it need to.
That's true. However, as was said, we're striving to provide a compatible implementation, and compatibility claims must be validated. While we have a simple "in-house" testsuite, more serious compatibility validation requires running the testsuite of the reference implementation (CPython), and that's gradually being approached.
To a large extent the test suite in http://hg.python.org/cpython/file/default/Lib/test effectively validates (full) compliance with the corresponding release (change "default" to the release branch of your choice). With that goal, no small effort has been made to mark implementation-specific tests as such. So uPy could consider using the test suite (and explicitly skip the tests for features that uPy doesn't support).
That's exactly what we do, per the previous paragraph. And we face a lot of questionable tests, just as you say. Shameless plug: if anyone is interested in running existing code on MicroPython, please help us with the CPython testsuite! ;-)
-- Best regards, Paul mailto:pmiscml@gmail.com
On Wed, Jun 4, 2014 at 5:11 PM, Paul Sokolovsky <pmiscml@gmail.com> wrote:
On Wed, 4 Jun 2014 16:12:23 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
Actually, there is a "formal, implementation-independent language spec":
Opening that link in a browser, pressing Ctrl+F and pasting your quote gives zero hits, so it's not exactly what you claim it to be. It's also pretty far from being formal (unambiguous, covering all choices, etc.) and comprehensive. Also, please point me at the "conformance" section.
That said, all of us Pythoneers treat it as the best formal reference available, no news here.
It's not just the best formal reference. It's the official specification. I agree it is not so "formal" as other language specifications and it does not enumerate every facet of the language. However, underspecified parts are worth improving (as we've done with the import system portion in the last few years). Incidentally, the efforts of other Python implementors have often resulted in such improvements to the language reference. Those improvements typically come as a result of questions to this very list. :) That's essentially what this email thread is! -eric
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
There is a discussion over at MicroPython about the internal representation of Unicode strings. Micropython is aimed at embedded devices, and so minimizing memory use is important, possibly even more important than performance. [...]
Wow! I'm amazed at the response here, since I expected it would have received a fairly brief "Yes" or "No" response, not this long thread. Here is a summary (as best as I am able) of a few points which I think are important:

(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option", and since nobody has really defended the suggestion, I think we can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's:

    Then again the UTF-8 option would be pretty devastating too for
    anything manipulating strings (especially since many Python APIs
    are defined using indexes, e.g. the re module).

but unless Guido wants to say different, I think the consensus is that a UTF-8 implementation is allowed, even at the cost of O(N) indexing operations. Trading time for memory -- assuming that it does save memory, which I think is an assumption and not proven -- is allowed.

(3) It seems to me that there's been a lot of theorizing about which implementation will be obviously more efficient. Folks, how about some benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited in my opinion to python-ideas, or even python-list, for ways to implement O(1) indexing on top of UTF-8. Some of them involve per-string mutable state (e.g. the last index seen), or complicated int subclasses that need to know what string they come from. Remember your Zen please:

    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more efficient, but I look forward to seeing the results of benchmarks. The rationale for internal UTF-8 is that the use of any other encoding internally will be inefficient, since those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 in the first place saves the transcoding step. Well, yes, but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if the internal encoding of strings is more efficient than UTF-8, and most of them never need transcoding to UTF-8, a non-UTF-8 internal format might be a net win. I'm looking forward to seeing the results of µPy's experiments with it.

Thanks to all who have commented.

-- Steven
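To make point (2) concrete, here is a minimal sketch (an illustration, not MicroPython code) of why str[n] becomes O(N) under an internal UTF-8 representation: the buffer must be scanned from the start, skipping continuation bytes, to locate the n-th codepoint.

    def utf8_index(buf, n):
        """Return the byte offset of codepoint n in UTF-8 buffer buf.

        Indexing means walking the buffer and skipping continuation
        bytes (0b10xxxxxx), so str[n] costs O(N) rather than the O(1)
        array lookup a fixed-width representation allows."""
        count = -1
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:   # lead byte: starts a codepoint
                count += 1
                if count == n:
                    return offset
        raise IndexError("string index out of range")

    s = "héllo"
    buf = s.encode("utf-8")           # b'h\xc3\xa9llo'
    assert buf[utf8_index(buf, 1):utf8_index(buf, 2)].decode("utf-8") == "é"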
Steven D'Aprano wrote:
(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option",
It would be limiting to have this as the *only* way of dealing with unicode, but I don't see anything wrong with having this available as an option for applications that truly don't need anything more than ascii. There must be plenty of those; the controller that runs my car engine, for example, doesn't exchange text with the outside world at all.
The rationale of internal UTF-8 is that the use of any other encoding internally will be inefficient since those strings will need to be transcoded to UTF-8 before they can be written or printed,
No, I think the rationale is that UTF-8 is likely to use less memory than UTF-16 or UTF-32. -- Greg
On Fri, Jun 06, 2014 at 12:51:11PM +1200, Greg Ewing wrote:
Steven D'Aprano wrote:
(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option",
It would be limiting to have this as the *only* way of dealing with unicode, but I don't see anything wrong with having this available as an option for applications that truly don't need anything more than ascii. There must be plenty of those; the controller that runs my car engine, for example, doesn't exchange text with the outside world at all.
I don't know about car engine controllers, but presumably they have diagnostic ports, and they may sometimes output text. If they output text, then at least hypothetically car mechanics in Russia might prefer their car to output "правда" and "ложный" rather than "true" and "false". I think that opportunities for ASCII-only optimizations are shrinking, not getting bigger, as more people come to expect that their computing devices speak their language rather than Foreign.
The rationale of internal UTF-8 is that the use of any other encoding internally will be inefficient since those strings will need to be transcoded to UTF-8 before they can be written or printed,
No, I think the rationale is that UTF-8 is likely to use less memory than UTF-16 or UTF-32.
Right. I was talking about memory efficiency. Instead of this, which requires two copies of the string at one time:

1) accept UTF-8 bytes
2) transcode to internal representation
3) discard UTF-8 bytes

you could have:

1) accept UTF-8 bytes

and be done. -- Steve
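The two-copies point can be seen on CPython with a quick, hedged illustration (exact sizes vary by Python version and platform):

    import sys

    data = "правда".encode("utf-8")   # step 1: UTF-8 bytes as received

    # Step 2: transcode to the internal representation. Until `data`
    # is discarded (step 3), the text exists in memory twice.
    text = data.decode("utf-8")

    print(sys.getsizeof(data))        # size of the bytes object
    print(sys.getsizeof(text))        # size of the str object

    # With UTF-8 as the internal representation, the received bytes
    # could simply *be* the string's storage: steps 2 and 3 disappear.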
Steven D'Aprano wrote:
I don't know about car engine controllers, but presumably they have diagnostic ports, and they may sometimes output text. If they output text, then at least hypothetically car mechanics in Russia might prefer their car to output "правда" and "ложный" rather than "true" and "false".
From a bit of googling, it seems that engine controller diagnostic ports typically speak some kind of binary protocol. So it would be up to the software running on whatever was plugged into the port to display the information in the user's native language. E.g. this document lists a big pile of hex byte values and little or no text that I can see: https://law.resource.org/pub/us/cfr/ibr/005/sae.j1979.2002.pdf -- Greg
Steven D'Aprano wrote:
(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this as been Guido saying "That would be a pretty lousy option", and since nobody has really defended the suggestion, I think we can assume that it's off the table.
Lousy is not quite the same as forbidden. Doing it in good faith would require making the limit prominent in the documentation, and raising some sort of CharacterNotSupported exception (or at least a warning) whenever there is an attempt to create a non-ASCII string, even via the C API.
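A minimal sketch of that fail-loudly behaviour at the Python level (the class and the ValueError are invented for illustration; no implementation actually has a CharacterNotSupported exception):

    class AsciiStr(str):
        """Illustrative ASCII-only string type: constructing anything
        outside ASCII fails loudly instead of silently misbehaving."""

        def __new__(cls, value=""):
            value = str(value)
            try:
                value.encode("ascii")
            except UnicodeEncodeError:
                # Stand-in for the hypothetical CharacterNotSupported
                raise ValueError("non-ASCII string not supported: %r"
                                 % value)
            return super().__new__(cls, value)

    AsciiStr("true")                  # fine
    try:
        AsciiStr("правда")            # rejected loudly, not corrupted
    except ValueError as exc:
        print(exc)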
(2) I asked if it would be okay ... to use a UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's:
[Non-ASCII character removed.] It is bad when quirks -- even good quirks -- of one implementation lead people to write code that will perform badly on a different Python implementation. CPython has at least delayed obvious optimizations for this reason. Changing idiomatic operations from O(1) to O(N) is big enough to be a concern. That said, the target environment itself apparently limits N to small enough values that the problem should be mostly theoretical. If you want to be good citizens, then do put a note in the documentation warning that particularly long strings are likely to cause performance issues unique to the MicroPython implementation. (Frankly, my personal opinion is that if you're really optimizing for space, then long strings will start getting awkward long before N is big enough for algorithmic complexity to overcome constant factors.)
... those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 ...
That all assumes that the external world is using UTF-8 anyhow. Which is more likely to be true if you document it as a limitation of MicroPython.
... but many strings may never be written out:
print(prefix + s[1:].strip().lower().center(80) + suffix)
creates five strings that are never written out and one that is.
But looking at the actual strings -- UTF-8 doesn't really hurt much. Only the slice and center() are more complex, and for a string less than 80 characters long, O(N) is irrelevant. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them.
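Jim's constant-factor claim, like Steven's point (3) above, invites measurement. A rough micro-benchmark sketch (the utf8_offset scan is an invented stand-in for a UTF-8 indexing loop, not MicroPython code):

    import timeit

    setup = """
    s = "a" * 79 + "é"                 # a short, mostly-ASCII string
    buf = s.encode("utf-8")

    def utf8_offset(buf, n):
        # O(N) scan: count lead bytes until the n-th codepoint.
        count = -1
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:    # not a continuation byte
                count += 1
                if count == n:
                    return offset
        raise IndexError(n)
    """

    # O(1) indexing on CPython's internal representation...
    print(timeit.timeit("s[79]", setup=setup))
    # ...versus an O(N) scan over the UTF-8 bytes.
    print(timeit.timeit("utf8_offset(buf, 79)", setup=setup))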
participants (29)
- Antoine Pitrou
- Chris Angelico
- Daniel Holth
- Donald Stufft
- dw+python-dev@hmmz.org
- Eric Snow
- Glenn Linderman
- Greg Ewing
- Guido van Rossum
- Hrvoje Niksic
- INADA Naoki
- Jeff Allen
- Jim J. Jewett
- Juraj Sukop
- Kristján Valur Jónsson
- Mark Lawrence
- martin@v.loewis.de
- MRAB
- Nick Coghlan
- Paul Moore
- Paul Sokolovsky
- R. David Murray
- Serhiy Storchaka
- Stefan Krah
- Stephen J. Turnbull
- Steve Dower
- Steven D'Aprano
- Terry Reedy
- Tim Delaney