[python-win32] python3 and extended mapi

Mon Jun 23 16:39:38 CEST 2014

On Jun 14, 2014, at 10:10 AM, Christian K. <ckkart at hoc.net> wrote:

>>>  <Paul_Koning <at> Dell.com> writes:
>> 
>>>> I would only want/expect to see “bytes” types when the values in question
>>>> are binary data streams, or unknown format.  But anytime we’re dealing with
>>>> text strings, the Python 3 approach is that the Python code sees “str”
>>>> type, and questions of encoding have been handled at the edge.  This is
>>>> where Python 3
>>> 
>>> This is not true, see above.
>> 
>> What I mean is that you cannot leave it to the application. Once a "str"
>> type is returned, it has to be assumed that it contains Unicode data. "str"
>> classes do not have a decode method. Or is there a way to decode the data
>> held by a "str" object, assuming that it is encoded and not Unicode?
> 
> No objections here? I am not enough of an expert to be sure of what I said myself. So please tell me if I am wrong.

A “str” contains Unicode text, always.

If you have an object that contains encoded text, in some encoding (UTF-8, Latin-1, Win-314159, KOI-8, whatever), then that object *must* be “bytes”.  You can then obtain the corresponding “str” by performing a decode operation, running it through the correct Codec for the encoding that was used.  The notion of “a str object that is encoded” is incorrect.
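
To make that concrete, here is a minimal sketch of the decode step. The sample string and the choice of UTF-8 are mine, purely for illustration; they have nothing to do with what MAPI actually hands back.

```python
# Encoded text always lives in a bytes object.
raw = "Grüße".encode("utf-8")   # stands in for data read from a file or socket
assert isinstance(raw, bytes)

# Decoding with the codec that matches the encoding yields a str.
text = raw.decode("utf-8")
assert isinstance(text, str)

# In Python 3, str has encode() but no decode(): there is nothing to decode.
has_decode = hasattr(text, "decode")   # False
```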

I suspect you’re still dealing with the damage done by how this was mishandled in Python 2.  Python 3 completely redesigned string handling, and got it right.  It’s helpful to spend some time reading the documentation so you can see what the new scheme is.

The key point is that “str” is an abstract type (it is “text” but the concept of encoding is simply not applicable).  You can perform operations on “str” but it cannot be used directly in I/O, or any other operation that needs a concrete type (something with a defined representation, say as a pile of bytes); it has to be encoded to “bytes” first.
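
A small sketch of what that means at the I/O boundary. Here io.BytesIO is only a stand-in for a binary file or a socket, and the sample text and UTF-8 are assumptions for the example.

```python
import io

buf = io.BytesIO()      # stand-in for a binary file or network connection
text = "naïve"          # a str: text with no byte representation of its own

# A binary stream refuses a str outright.
try:
    buf.write(text)
    wrote_str = True
except TypeError:
    wrote_str = False

# Encoding at the edge turns the text into something concrete.
buf.write(text.encode("utf-8"))
stored = buf.getvalue()
```

Text-mode files accept “str” only because they do this encode step for you, using whatever encoding the file was opened with.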

Conversely, “bytes” is concrete — you can find it in a file, or send it across the network.  “bytes” can represent all kinds of data.  One of the options is to represent text in some encoding.  To know what a given sequence of bytes means as text, you have to know what the encoding is.  The same sequence of bytes, interpreted as Latin-1 text, is different text than that sequence of bytes interpreted as KOI-8 text.  Both are potentially valid; which (if any) is “correct” depends on what the creator of that byte string intended.
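
As an illustration, here is the same three-byte sequence decoded two ways. The byte values are picked arbitrarily for the example, and I am assuming the KOI8-R variant (Python's "koi8_r" codec) for “KOI-8”.

```python
# Three bytes whose meaning as text depends entirely on the encoding.
data = bytes([0xC1, 0xC2, 0xC3])

as_latin1 = data.decode("latin-1")   # accented Latin capitals
as_koi8r = data.decode("koi8_r")     # lowercase Cyrillic letters

# Both decodes succeed; only the creator's intent says which one is right.
same_text = (as_latin1 == as_koi8r)
```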

	paul