[Python-ideas] duck typing for io write methods

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Fri Jun 14 10:35:02 CEST 2013


Andrew Barnert <abarnert at ...> writes:

> 
> From: Wolfgang Maier <wolfgang.maier at ...>
>  
> > Well, I was illustrating the case with a literal integer, but, of course, I
> > was thinking of cases with references:
> > a=1234
> > str(a).encode() # gives b'1234' in Python 3, but converting your int to str
> > first, just to encode it again to bytes seems weird
> 
> Conceptually, it makes perfect sense. b'1234' isn't a string with the
> canonical numeral representation of 1234, it's a sequence of bytes, which
> happens to be a particular (unspecified) encoding of a string with the
> canonical numeral representation of 1234.
> 
> The docs (http://docs.python.org/3.3/library/functions.html#bytes)
> explicitly say a bytes object:
> 
> > is an immutable sequence of integers in the range 0 <= x < 256. bytes is
> > an immutable version of bytearray
> 
> Practically, you often want to use bytes as "ASCII strings", and you often
> can get away with it. It works for literals, some but not all methods, and
> of course everything that strings inherit from sequences (concatenation,
> slicing, etc.).
> 
> But often you can't get away with it. It doesn't work for formatting,
> anything strings do differently from sequences (notably indexing), some
> methods, most functions that special-case on strings, type-checking
> (there's no basestring in 3.x), etc.
> 
> Likewise, the bytes() constructor doesn't work quite like str(), and
> there's no bytes equivalent of repr().
> 
> Obviously, there's a tradeoff behind all of those decisions. It wouldn't
> have been hard to put bytes.__mod__, bytes.format, basestring, etc. into
> Python 3, or to make b'a'[0] return b'a' instead of 97, or to make bytes(x)
> work more like str(x), or to add a brepr or similar function, etc. But it
> would make bytes less useful as a sequence of 8-bit integers. And, more
> importantly, it would be an attractive nuisance, making a lot of common
> errors more common (as they were in 2.x). As the docs
> (http://docs.python.org/3.3/library/stdtypes.html#bytes) put it:
> 
> > This is done deliberately to emphasise that while many binary formats
> > include ASCII based elements and can be usefully manipulated with some
> > text-oriented algorithms, this is not generally the case for arbitrary
> > binary data (blindly applying text processing algorithms to binary data
> > formats that are not ASCII compatible will usually lead to data
> > corruption).
>
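
For reference, here is a small sketch of the behaviour described in the quoted
text, as I understand it on Python 3. The variable names are made up purely for
illustration.

```python
# Illustration only: bytes behaves like a sequence of small integers.
data = b'1234'

first_item = data[0]          # indexing yields an int (the byte value of '1')
first_slice = data[0:1]       # slicing yields a bytes object of length 1
joined = data + b'5'          # concatenation works as for any sequence
from_int = bytes(3)           # bytes(int) gives that many zero bytes, unlike str(3)
via_str = str(1234).encode()  # the int -> str -> bytes detour discussed above
```

So indexing is where bytes departs most visibly from str, while slicing and
concatenation behave the way a string user would expect.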

I have to say I'm not too enthusiastic about the bytes type in Python 3.
The tradeoffs you're mentioning cause bytes to be sort of a hybrid between
strings and numbers trying to combine aspects of both. This makes them very
different from all other types in Python and is the reason behind much
confusion. Personally, I would have preferred a clear decision to make bytes
a sequence of 8-bit characters *or* integers (or have two separate types
bytestring and byteint). Still, the current design has been discussed among
people who understand the topic much better than I do, so I'm not trying to
argue about it, just to come to terms with the status quo.


> Anyway, why do you actually want a bytes here? Maybe there's a better
> design for what you're trying to do that would make this whole issue
> irrelevant to your code.
> 

The actual problem here is that I'm reading bytes from a text file (it's a
huge file, so I/O speed matters and working in text mode is not an option). Then
I'm extracting numeric values from the file that I need for calculations, so
I'm converting bytes to int here. While that's fine, I then want to write
the result along with other parts of the original file to a new file. Now
the result is an integer, while the rest of the data is bytes already, so I
have to convert my integer to bytes to .join it with the rest, then write it.
Here's the (simplified) problem.
An input line from my file:
b'somelinedescriptor\t100\t500\tmorestuffhere\n'
What I need is to calculate the difference between the numbers (500-100), then
write this to a new file:
b'somelinedescriptor\t400\tmorestuffhere\n'

Currently I solve this by splitting on b'\t', converting elements 1 and 2 of
the resulting list to int, then (in slightly abstracted code)
b'\t'.join((element0, str(subtraction_result).encode(), element3)), then
writing. So, in essence, I'm going through this int -> str -> bytes
conversion scheme for a million lines in my file, which just doesn't feel
right. What's missing is a direct way for int -> bytes. Any suggestions are
welcome.
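
To make the current approach concrete, here is a minimal sketch of it. The
function name rewrite_line is made up for this example, and I'm assuming every
line has the descriptor, the two numbers and then the rest, all separated by
tabs; the real code is more involved.

```python
def rewrite_line(line):
    """Replace the two numeric columns of a tab-separated bytes line
    with their difference (second number minus first)."""
    # Split off only the first three fields; anything after stays untouched.
    descriptor, start, end, rest = line.split(b'\t', 3)
    # int() accepts ASCII digits given as bytes, so no decoding is needed here.
    difference = int(end) - int(start)
    # The int -> str -> bytes detour that this thread is about.
    return b'\t'.join((descriptor, str(difference).encode('ascii'), rest))


example = rewrite_line(b'somelinedescriptor\t100\t500\tmorestuffhere\n')
```

The str(...).encode('ascii') call is the only place where the data leaves bytes,
and it runs once per line of the file.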
Best,
Wolfgang



