On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:

Hello,

On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy <tjreedy@udel.edu> wrote:

I think you are again batting at a strawman. If you mean 'read from a 
file', and all you want to do is read bytes from and write bytes to 
external 'files', then there is obviously no need to transcode and 
neither Python 2 nor 3 makes you do so.

But most files and network protocols are text-based, and I (and many other
people) don't want to artificially use a "binary data" type for them,
with all the attached funny things, like the "b" prefix. And then Python2
indeed doesn't transcode anything, and Python3 does, without being
asked, and for no good purpose, because in most cases, input data will
be output as-is (maybe in byte-boundary-split chunks).

So, it all goes in rounds - ignoring the forced-Unicode problem (after a
week of subscription to python-list, half of the traffic there appears to be
dedicated to Unicode-related flames) on python-dev's behalf is not
going to help (the Python community).

If all your program is doing is reading and writing data (input data will be output as-is), then using binary mode doesn't require the "b" prefix, because you aren't manipulating the data, and there is no unnecessary transcoding.
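To make that concrete, here is a minimal sketch of a pass-through copy in Python 3. The function name copy_stream and the chunk size are my own choices for illustration, and io.BytesIO merely stands in for a file or socket opened in binary mode. Note that no "b" prefix appears anywhere, and nothing is decoded or encoded, even when a chunk boundary falls in the middle of a multibyte character.

```python
import io


def copy_stream(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst unchanged and return the byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty result means end of input
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for binary-mode files; the payload contains multibyte UTF-8.
payload = "naïve café\n".encode("utf-8")
src = io.BytesIO(payload)
dst = io.BytesIO()

# A tiny chunk size forces splits inside multibyte sequences on purpose.
copied = copy_stream(src, dst, chunk_size=4)
```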

If you actually wish to examine or manipulate the content as it flows by, then there are choices.

1) If you need to examine/manipulate only a small fraction of the text data within the file, you can pay the small price of a few "b" prefixes to get high performance, and explicitly transcode only the portions that need to be manipulated.

2) If you are examining the bulk of the data as it flows by, but not manipulating it, just examining/extracting, then a full transcoding may be useful for that purpose... but you can perhaps do it explicitly, so that you keep the binary form for I/O. Be careful at the block boundaries in this case, however: a multibyte character can be split across two blocks.

3) If you are actually manipulating the bulk of the data, then the double transcoding (once on input, and once on output) allows you to work in units of codepoints, rather than bytes, which generally makes the manipulation algorithms easier.

4) If you truly cannot afford the processor cost of the double transcoding, and need to do all your manipulations at the byte level, then you could avoid the need for the "b" prefix by using a preprocessor for those sections of code that do byte processing and nothing else... and you'll have lots of arcane, error-prone code to write to manipulate the bytes rather than the codepoints.
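A rough sketch of choices 1 and 2 in Python 3, under some assumptions of mine: the data is UTF-8, and the helper names rewrite_first_line and count_codepoints are invented for illustration. The first helper transcodes only the one line it changes and leaves the rest as bytes. The second keeps the binary form for I/O and decodes a copy purely for examination, using the standard library's incremental decoder so that a character split across two blocks is handled correctly.

```python
import codecs
import io


def rewrite_first_line(data, encoding="utf-8"):
    """Choice 1: decode, change, and re-encode only the first line."""
    head, sep, rest = data.partition(b"\n")
    new_head = head.decode(encoding).upper().encode(encoding)
    return new_head + sep + rest  # the remainder is never transcoded


def count_codepoints(src, dst, encoding="utf-8", chunk_size=4096):
    """Choice 2: pass bytes through untouched while examining the text."""
    decoder = codecs.getincrementaldecoder(encoding)()
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)  # output is the original bytes
        # The decoder buffers an incomplete trailing sequence until
        # the next chunk supplies the remaining bytes.
        count += len(decoder.decode(chunk))
    # Flush; this raises if the input ended mid-character.
    count += len(decoder.decode(b"", final=True))
    return count


rewritten = rewrite_first_line("héllo\nwörld\n".encode("utf-8"))

payload = "naïve café\n".encode("utf-8")
sink = io.BytesIO()
# chunk_size=3 deliberately splits the two-byte characters.
n_codepoints = count_codepoints(io.BytesIO(payload), sink, chunk_size=3)
```

Choice 3 needs no special code at all: opening the files in text mode with an explicit encoding does the double transcoding for you.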

On the other hand, if you can convince your data sources and sinks to deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid transcoding, and make the arcane algorithms part of the implementation of μPy rather than of the application code, and support full Unicode. And it seems to me that the world is moving that way... towards UTF-8 as the standard interchange format. Encourage it.
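To give a flavor of the "arcane algorithms" that would move into the implementation, here is a sketch, written in ordinary Python rather than as actual μPy internals, of length and indexing over a UTF-8 buffer. The names utf8_len and utf8_char_at are hypothetical, and the code assumes the buffer is already valid UTF-8. It relies on one property of the encoding: continuation bytes always have the bit pattern 10xxxxxx, so every other byte starts a code point.

```python
def utf8_len(buf):
    """Count code points by counting bytes that are not continuations."""
    return sum(1 for byte in buf if (byte & 0xC0) != 0x80)


def utf8_char_at(buf, index):
    """Return the code point at a non-negative index as a 1-char str.

    This is a linear scan from the start of the buffer, which is the
    price of not storing fixed-width code points.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1     # index of the most recent code point start
    start = None  # byte offset where the wanted code point begins
    for pos, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:  # a lead byte or an ASCII byte
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return buf[start:].decode("utf-8")  # wanted code point was the last


sample = "aé€z".encode("utf-8")  # 1-, 2-, 3-, and 1-byte sequences
```

The application code sees only code points, while the bytes read from and written to the outside world are never transcoded.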

Glenn