[Python-Dev] Generalised String Coercion
Phillip J. Eby
pje at telecommunity.com
Mon Aug 8 15:54:20 CEST 2005
At 10:07 AM 8/8/2005 +0200, Martin v. Löwis wrote:
>Phillip J. Eby wrote:
> >>Hm. What would be the use case for using %s with binary, non-text data?
> >
> >
> > Well, I could see using it to write things like netstrings,
> > i.e. sock.send("%d:%s," % (len(data),data)) seems like the One Obvious
> Way
> > to write a netstring in today's Python at least. But perhaps there's a
> > subtlety I've missed here.
>
>As written, this would stop working when strings become Unicode. It's
>pretty clear what '%d' means (format the number in decimal numbers,
>using "\N{DIGIT ZERO}" .. "\N{DIGIT NINE}" as the digits). It's not
>all that clear what %s means: how do you get a sequence of characters
>out of data, when data is a byte string?
>
>Perhaps there could be byte string literals, so that you would write
>
> sock.send(b"%d:%s," % (len(data),data))
Actually, thinking about it some more, it seems to me it's actually more
like this:
sock.send( ("%d:%s," %
(len(data),data.decode('latin1'))).encode('latin1') )
That is, if all we have is unicode and bytes, and 'data' is bytes, then
encoding and decoding from latin1 is the right way to do a netstring. It's
a bit more painful, but still doable.
>but this would raise different questions:
>- what does %d mean for a byte string formatting? str(len(data))
> returns a character string, how do you get a byte string?
> In the specific case of %d, encoding as ASCII would work, though.
>- if byte strings are mutable, what about byte string literals?
> I.e. if I do
>
> x = b"%d:%s,"
> x[1] = b'f'
>
> and run through the code the second time, will the literal have
> changed? Perhaps these would be displays, not literals (although
> I never understood why Guido calls these displays)
I'm thinking that bytes.decode and unicode.encode are the correct way to
convert between the two, and there's no such thing as a bytes literal. We
can always optimize "constant.encode(constant)" to a bytes display
internally if necessary, although it will be a pain for programs that have
lots of bytestring constants. OTOH, we've previously discussed having a
'bytes()' constructor, and perhaps it should use latin1 as its default
encoding.
More information about the Python-Dev
mailing list