hex dump w/ or w/out utf-8 chars
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Jul 13 05:36:06 EDT 2013
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:
> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".
Writing it many times doesn't make it correct. You are simply wrong.
UTF-8 produces bytes. That is what gets written to files and transmitted
over networks: bytes, not "Unicode Encoding Units", whatever they are.
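A quick check in Python 3 makes the point concrete. This is only an
illustrative sketch; the names `samples` and `encoded` are mine, not from
the thread:

```python
# Encoding a str with UTF-8 yields a bytes object,
# using between one and four bytes per code point.
samples = ['A', 'á', 'α', '\N{ALCHEMICAL SYMBOL FOR AIR}']
encoded = [s.encode('utf-8') for s in samples]

for s, b in zip(samples, encoded):
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(s), type(b).__name__, len(b), b)
```

Each result is a plain `bytes` object, which is exactly what ends up in a
file or on a socket.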
> A similar coding scheme: iso-6937 .
>
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice, it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil) Hint: what is the character at the caret
> position?
That is a simple index operation into the buffer. If the caret position
is 10 characters in, you index buffer[10-1] and it will give you the
character to the left of the caret. buffer[10] will give you the
character to the right of the caret. It is simple, trivial, and easy. The
buffer itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4
bytes.
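To make the "10 bytes, 10*2 bytes or 10*4 bytes" remark concrete, here is
a rough pure-Python model of that lookup. It is a sketch under stated
assumptions, not CPython's actual implementation: the helper names
`bytes_per_char` and `char_at` are hypothetical, and fixed-width codecs
stand in for the three internal layouts described in PEP 393.

```python
def bytes_per_char(text):
    """Width a PEP 393 style string would pick: 1, 2 or 4 bytes per character."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4

# Assumption for illustration only: these codecs are fixed-width for the
# strings they are applied to here, so they mimic the internal buffers.
CODECS = {1: 'latin-1', 2: 'utf-16-le', 4: 'utf-32-le'}

def char_at(text, index):
    """Return text[index] by computing a byte offset into a fixed-width buffer."""
    width = bytes_per_char(text)
    raw = text.encode(CODECS[width])
    offset = index * width          # the only arithmetic the buffer needs
    return raw[offset:offset + width].decode(CODECS[width])

for sample in ('12345ABCD', '12345áßçð', '12345αдᚪ∞',
               '12345\N{ALCHEMICAL SYMBOL FOR AIR}'):
    assert char_at(sample, 4) == '5'          # left of the caret
    assert char_at(sample, 5) == sample[5]    # right of the caret
```

Whatever the width, finding a character is one multiplication and one
fixed-size read, which is why indexing stays constant time.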
Here is an example of such a tiny buffer, implemented in Python 3.3 with
the hated Flexible String Representation. In each example, imagine the
caret is five characters from the left:
12345|more characters here...
It works regardless of whether your characters are ASCII:
py> buffer = '12345ABCD...'
py> buffer[5-1] # character to the left of the caret
'5'
py> buffer[5] # character to the right of the caret
'A'
Latin 1:
py> buffer = '12345áßçð...'
py> buffer[5-1] # character to the left of the caret
'5'
py> buffer[5] # character to the right of the caret
'á'
Other BMP characters:
py> buffer = '12345αдᚪ∞...'
py> buffer[5-1] # character to the left of the caret
'5'
py> buffer[5] # character to the right of the caret
'α'
And Supplementary Plane Characters:
py> buffer = ('12345'
... '\N{ALCHEMICAL SYMBOL FOR AIR}'
... '\N{ALCHEMICAL SYMBOL FOR FIRE}'
... '\N{ALCHEMICAL SYMBOL FOR EARTH}'
... '\N{ALCHEMICAL SYMBOL FOR WATER}'
... '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1] # character to the left of the caret
'5'
py> buffer[5] # character to the right of the caret
'🜁'
py> import unicodedata
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'
And it all Just Works in Python 3.3. So much for "impossible to tell"
what the character at the caret is. It is *trivial*.
Ah, but how about Python 3.2? We set up the same buffer:
py> buffer = ('12345'
... '\N{ALCHEMICAL SYMBOL FOR AIR}'
... '\N{ALCHEMICAL SYMBOL FOR FIRE}'
... '\N{ALCHEMICAL SYMBOL FOR EARTH}'
... '\N{ALCHEMICAL SYMBOL FOR WATER}'
... '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16
Sixteen? Sixteen? Where did the extra four characters come from? They
came from *surrogate pairs*.
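You do not need a 3.2 narrow build to see where sixteen comes from. A
narrow build stores strings as UTF-16 code units, so counting those units
on a current Python gives the same number. A small sketch (the variable
names are mine):

```python
text = ('12345'
        '\N{ALCHEMICAL SYMBOL FOR AIR}'
        '\N{ALCHEMICAL SYMBOL FOR FIRE}'
        '\N{ALCHEMICAL SYMBOL FOR EARTH}'
        '\N{ALCHEMICAL SYMBOL FOR WATER}'
        '...')

code_points = len(text)                              # what Python 3.3+ reports
utf16_units = len(text.encode('utf-16-le')) // 2     # what a narrow build counts

print(code_points, utf16_units)   # 12 16
```

Each of the four alchemical symbols needs two 16-bit units, which accounts
for the four extra "characters".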
py> buffer[5-1] # character to the left of the caret
'5'
py> buffer[5] # character to the right of the caret
'\ud83d'
Funny, that looks different.
py> import unicodedata
py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
No name?
Because buffer[5] is only *half* of the surrogate pair. It is broken, and
there is really no way of fixing that breakage in Python 3.2 with a
narrow build. You can fix it with a wide build, but only at the cost of
every string, every name, using double the amount of storage, whether it
needs it or not.
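For reference, this is the arithmetic that turns one supplementary-plane
code point into the two halves seen above. It is a sketch of the standard
UTF-16 surrogate calculation; the helper names `to_surrogates` and
`from_surrogates` are hypothetical, not part of any library.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000             # 20 significant bits remain
    high = 0xD800 + (offset >> 10)            # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)           # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

air = ord('\N{ALCHEMICAL SYMBOL FOR AIR}')    # U+1F701
high, low = to_surrogates(air)
print(hex(high), hex(low))                    # 0xd83d 0xdf01
```

The high half, 0xd83d, is the '\ud83d' that buffer[5] returned on the
narrow build. On its own it names no character, which is why
unicodedata.name() fails.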
--
Steven