On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:
On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy <firstname.lastname@example.org> wrote:
I think you are again batting at a strawman. If you mean 'read from a
file', and all you want to do is read bytes from and write bytes to
external 'files', then there is obviously no need to transcode and
neither Python 2 nor 3 makes you do so.
But most files and network protocols are text-based, and I (and many
other people) don't want to artificially use a "binary data" type for
them, with all the attached funny things, like the "b" prefix. And then
Python2 indeed doesn't transcode anything, and Python3 does, without
being asked, and for no good purpose, because in most cases, input data
will be output as-is (maybe in byte-boundary-split chunks).
So, it all goes in rounds - ignoring the forced-Unicode problem (after a
week of subscription to python-list, half of the traffic there appears
to be dedicated to Unicode-related flames) on python-dev's behalf is not
going to help (the Python community).
If all your program is doing is reading and writing data (input data
will be output as-is), then use of binary doesn't require "b"
prefix, because you aren't manipulating the data. Then you have no
If you actually wish to examine or manipulate the content as it
flows by, then there are choices.
1) If you need to examine/manipulate only a small fraction of text
data within the file, you can pay the small price of a few "b"
prefixes to get high performance, and explicitly transcode only the
portions that need to be manipulated.
2) If you are examining the bulk of the data as it flows by, but not
manipulating it, just examining/extracting, then a full transcoding
may be useful for that purpose... but you can perhaps do it
explicitly, so that you keep the binary form for I/O. Careful of the
block boundaries, in this case, however.
3) If you are actually manipulating the bulk of the data, then the
double transcoding (once on input, and once on output) allows you to
work in units of codepoints, rather than bytes, which generally
makes the manipulation algorithms easier.
4) If you truly cannot afford the processor cost of the double
transcoding, and need to do all your manipulations at the byte
level, then you could avoid the need for "b" prefix by use of a
preprocessor for those sections of code that are doing all and only
bytes processing... and you'll have lots of arcane, error-prone code
to write to manipulate the bytes rather than the codepoints.
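
Choices 1 and 2 might look roughly like the sketch below. It assumes
Python 3 and UTF-8 data; the names header_value and copy_and_scan are
made up for illustration. The first function stays in bytes and decodes
only the fragment it needs. The second keeps the I/O binary and uses an
incremental decoder so that a character split across a block boundary is
still decoded correctly.

```python
import codecs


def header_value(raw_line):
    """Choice 1: work in bytes, decode only the piece that is needed."""
    # One "b" prefix; everything else in the line is never transcoded.
    _name, _sep, value = raw_line.partition(b":")
    return value.strip().decode("utf-8")


def copy_and_scan(src, dst, needle, chunk_size=4096):
    """Choice 2: copy bytes unchanged, examine a decoded view on the side.

    src and dst are binary file-like objects; needle is a str.
    Returns True if needle occurs anywhere in the decoded text.
    """
    # The incremental decoder buffers an incomplete multi-byte sequence
    # at the end of one chunk and completes it with the next chunk.
    decoder = codecs.getincrementaldecoder("utf-8")()
    keep = len(needle) - 1  # characters to carry over between chunks
    tail = ""
    found = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)  # output is the input, byte for byte
        text = tail + decoder.decode(chunk)
        if needle in text:
            found = True
        # Carry over enough text to catch a match spanning two chunks.
        tail = text[-keep:] if keep > 0 else ""
    # Raises UnicodeDecodeError if the stream ended mid-character.
    decoder.decode(b"", final=True)
    return found
```

With a deliberately tiny chunk_size (say 3), multi-byte characters get
split across reads, and the search still works while dst receives
exactly the bytes that were read from src.
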
On the other hand, if you can convince your data sources and sinks
to deal in UTF-8, and implement a UTF-8 str in μPy, then you can
both avoid transcoding, and make the arcane algorithms part of the
implementation of μPy rather than of the application code, and
support full Unicode. And it seems to me that the world is moving
that way... towards UTF-8 as the standard interchange format.
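
To give a feel for the kind of "arcane algorithm" a UTF-8 str
implementation would have to hide, here is a rough sketch in plain
Python 3. The helper names utf8_len and utf8_index are hypothetical, the
input is assumed to be well-formed UTF-8, and this is not how μPy or
CPython actually implement str. It only shows that codepoint counting
and indexing can be done directly on the bytes without transcoding.

```python
def utf8_len(data):
    """Count code points in well-formed UTF-8 bytes without decoding."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts
    # a new code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_index(data, i):
    """Return the byte offset where the i-th code point (0-based) starts."""
    count = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # lead byte or ASCII byte
            count += 1
            if count == i:
                return offset
    raise IndexError(i)
```

Indexing is a linear scan here rather than constant time, which is the
main trade-off of keeping the internal representation as UTF-8.
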