How do I display unicode value stored in a string variable using ord()
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 19 02:30:54 EDT 2012
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:
> As I understand (I think) the underlying mechanism, I can only say, it is
> not a surprise that it happens.
>
> Imagine an editor: I type an "a", internally the text is saved as ASCII,
> then I type an "é", the text can only be saved in at least latin-1. Then
> I enter a "€", the text becomes an internal UCS-4 "string". Then remove
> the "€" and so on.
Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are
Asian or a historian using some obscure ancient script. NONE of the
examples you have shown in your emails have included 4-byte characters;
they have all been ASCII or UCS-2.
You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen.
In *both* Python 3.2 and 3.3, both é and € are represented by two bytes.
That will not change. There is a tiny amount of fixed overhead for
strings, and that overhead is slightly different between the versions,
but you'll never notice the difference.
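A quick way to see this for yourself, using ord() as in the thread's subject line: any character whose code point is below U+10000 lies in the Basic Multilingual Plane and fits in two bytes.

```python
# A code point below 0x10000 lies in the Basic Multilingual Plane
# (BMP) and fits in two bytes; anything above needs four.
for ch in "aé€":
    print(ch, hex(ord(ch)), "BMP" if ord(ch) < 0x10000 else "astral")
# all three characters are in the BMP
```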
Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
will be slow slow SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.
> Intuitively I expect there is some kind of slowdown between all these
> "string" conversions.
Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to
UCS-4 on the fly; they are converted once, when the string is created.
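A minimal sketch of that point, assuming CPython 3.3's flexible representation (PEP 393): the storage width is chosen once, at creation time, from the string's widest character, and sys.getsizeof reflects it.

```python
import sys

# Under PEP 393 (CPython 3.3+), a string's storage width is fixed
# once, when the object is created, based on its widest character.
ascii_s  = 'a' * 1000            # 1 byte per character
bmp_s    = '€' * 1000            # 2 bytes per character (BMP)
astral_s = '\U0001D11E' * 1000   # 4 bytes per character (outside the BMP)

for s in (ascii_s, bmp_s, astral_s):
    print(sys.getsizeof(s))      # size grows with the widest character
```

The exact byte counts are an implementation detail and vary slightly between versions, but the ordering holds.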
The tests we ran earlier, e.g.:
('ab…' * 1000).replace('…', 'œ…')
show the *worst possible case* for the new string handling, because all
we do is create new strings. First we create a string 'ab…', then we
create another string 'ab…'*1000, then we create two new strings '…' and
'œ…', and finally we call replace and create yet another new string.
But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:
steve at runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop
steve at runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop
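The same measurement can be run from inside Python with the timeit module; the absolute numbers will of course differ per machine and Python version.

```python
import timeit

# Re-running the shell benchmark through the timeit module.
stmt = "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
print(timeit.timeit(stmt, number=100000), "seconds for 100000 loops")
```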
Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.
> When I tested this flexible representation, a few months ago, at the
> first alpha release, this is precisely what I tested: string
> manipulations which force this internal change, and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.
Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.
> Does anybody know a way to get the size of the internal "string" in
> bytes?
sys.getsizeof(some_string)
steve at runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve at runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038
As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:
steve at runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve at runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34
You can check the object overhead by looking at the size of the empty
string.
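For example (the exact figures are CPython implementation details and vary by version and build):

```python
import sys

# The fixed per-object overhead is the size of the empty string;
# subtracting it from another string's size leaves the bytes spent
# on the characters themselves.
empty = sys.getsizeof('')
print(empty)
print(sys.getsizeof('abc') - empty)
```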
--
Steven