Mailman 3 Dropping bytes "support" in json - Python-Dev

Dropping bytes "support" in json

Antoine Pitrou

April 8, 2009

11:10 a.m.

Hello,

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

The bug entry is: http://bugs.python.org/issue4136

Regards

Antoine.

Raymond Hettinger

April 2009

3:51 p.m.

...

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

+1 Raymond

"Martin v. Löwis"

6:33 p.m.

...

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

What does Bob Ippolito think about this change? IIUC, he considers simplejson's speed one of its primary advantages, and also attributes it to the fact that he can parse directly out of byte strings, and marshal into them (which is important, as you typically receive them over the wire). Having to run them through a codec slows parsing down. Regards, Martin

Antoine Pitrou

11:31 p.m.

Martin v. Löwis <martin <at> v.loewis.de> writes:

...

What does Bob Ippolito think about this change? IIUC, he considers simplejson's speed one of its primary advantages, and also attributes it to the fact that he can parse directly out of byte strings, and marshal into them (which is important, as you typically receive them over the wire).

The only thing I know is that the new version (the one I've tried to merge) is massively faster than the old one - several times faster - and within 20-30% of the speed of the 2.x version (*).

Besides, Bob doesn't really seem to care about porting to py3k (he hasn't said anything about it until now, other than that he didn't feel competent to do it).

But I'm happy with someone proposing an alternate patch if they want to. As for me, I just wanted to fill the gap and I'm not interested in doing a lot of work on this issue.

(*)
timeit -s "import json; l=['abc']*100" "json.dumps(l)"
-> trunk: 33.4 usec per loop
-> py3k + patch: 37.1 usec per loop
-> vanilla py3k: 314 usec per loop

timeit -s "import json; s=json.dumps(['abc']*100)" "json.loads(s)"
-> trunk: 44.8 usec per loop
-> py3k + patch: 35.4 usec per loop
-> vanilla py3k: 1.48 msec per loop (!)

Regards

Antoine.
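
For anyone wanting to repeat this kind of measurement from a script rather than the command line, here is a minimal sketch using the standard `timeit` module. The helper name `per_loop_usec` and the loop counts are assumptions made for illustration; absolute numbers will differ from the ones quoted above on any other machine or interpreter.

```python
import json
import timeit

# Same payload as the timings quoted above: a list of 100 short strings.
payload = ["abc"] * 100
encoded = json.dumps(payload)


def per_loop_usec(func, number=10_000):
    """Return the best-of-3 average time per call, in microseconds.

    Hypothetical helper, not part of timeit itself.
    """
    best = min(timeit.repeat(func, number=number, repeat=3))
    return best / number * 1e6


dumps_usec = per_loop_usec(lambda: json.dumps(payload))
loads_usec = per_loop_usec(lambda: json.loads(encoded))
print(f"dumps: {dumps_usec:.1f} usec per loop")
print(f"loads: {loads_usec:.1f} usec per loop")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; it does not make results comparable across machines.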

"Martin v. Löwis"

5:55 a.m.

...

Besides, Bob doesn't really seem to care about porting to py3k (he hasn't said anything about it until now, other than that he didn't feel competent to do it).

That is quite unfortunate, and suggests that perhaps the module shouldn't have been added to Python in the first place. I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely. Regards, Martin

Raymond Hettinger

7:16 a.m.

[Antoine Pitrou]

...

...
Besides, Bob doesn't really seem to care about porting to py3k (he hasn't said anything about it until now, other than that he didn't feel competent to do it).

His actual words were: "I will need some help with 3.0 since I am not well versed in the changes to the C API or Python code for that, but merging for 2.6.1 should be no big deal."

[MvL]

...

That is quite unfortunate, and suggests that perhaps the module shouldn't have been added to Python in the first place.

Bob participated actively in http://bugs.python.org/issue4136 and was responsive to detailed patch review. He gave a popular talk at PyCon less than two weeks ago. He's not derelict.

...

I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely.

I'm speechless. Raymond

"Martin v. Löwis"

8:05 p.m.

...

...
I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely.

I'm speechless.

It seems that my statement has surprised you, so let me explain:

I think we should refrain from making design decisions (such as API decisions) without Bob's explicit consent, unless we assign a new maintainer for the simplejson module (perhaps just for the 3k branch, which perhaps would be a fork from Bob's code).

Antoine suggests that Bob did not comment on the issues at hand; therefore, we should not proceed with the proposed design. Since the 3.1 release is only a few weeks ahead, we have the choice of either shipping with the broken version that is currently in the 3k branch, or dropping the module from the 3k branch. I believe our users are better served by not having to waste time with a module that doesn't quite work, or may change.

Regards,
Martin

Bob Ippolito

9:13 p.m.

On Thu, Apr 9, 2009 at 1:05 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

...
...
I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely.

I'm speechless.

It seems that my statement has surprised you, so let me explain:

I think we should refrain from making design decisions (such as API decisions) without Bob's explicit consent, unless we assign a new maintainer for the simplejson module (perhaps just for the 3k branch, which perhaps would be a fork from Bob's code).

Antoine suggests that Bob did not comment on the issues at hand; therefore, we should not proceed with the proposed design. Since the 3.1 release is only a few weeks ahead, we have the choice of either shipping with the broken version that is currently in the 3k branch, or dropping the module from the 3k branch. I believe our users are better served by not having to waste time with a module that doesn't quite work, or may change.

Most of my time to spend on json/simplejson and these mailing list discussions is on weekends; I try not to bother with it when I'm busy doing Actual Work unless there is a bug or some other issue that needs more immediate attention. I also wasn't aware that I was expected to comment on those issues. I'm CC'ed on the discussion for issue4136 but I don't see any unanswered questions directed at me. I have the issues (issue5723, issue4136) starred in my gmail and I planned to look at them more closely later, hopefully on Friday or Saturday.

As far as Python 3 goes, I honestly have not yet familiarized myself with the changes to the IO infrastructure and what the new idioms are. At this time, I can't make any educated decisions with regard to how it should be done because I don't know exactly how bytes are supposed to work and what the common idioms are for other libraries in the stdlib that do similar things. Until I figure that out, someone else is better off making decisions about the Python 3 version.

My guess is that it should work the same way as it does in Python 2.x: take bytes or unicode input in loads (which means encoding is still relevant). I also think the output of dumps should also be bytes, since it is a serialization, but I am not sure how other libraries do this in Python 3 because one could argue that it is also text. If other libraries that do text/text encodings (e.g. binascii, mimelib, ...) use str for input and output instead of bytes then maybe Antoine's changes are the right solution and I just don't know better because I'm not up to speed with how people write Python 3 code.

I'll do my best to find some time to look into Python 3 more closely soon, but thus far I have not been very motivated to do so because Python 3 isn't useful for us at work and twiddling syntax isn't a very interesting problem for me to solve.

-bob

"Martin v. Löwis"

10:07 p.m.

...

As far as Python 3 goes, I honestly have not yet familiarized myself with the changes to the IO infrastructure and what the new idioms are. At this time, I can't make any educated decisions with regard to how it should be done because I don't know exactly how bytes are supposed to work and what the common idioms are for other libraries in the stdlib that do similar things.

It's really very similar to 2.x: the "bytes" type is to be used in all interfaces that operate on byte sequences that may or may not represent characters; in particular, for interfaces where the operating system deliberately uses bytes - i.e. low-level file IO and socket IO; also for cases where the encoding is embedded in the stream that still needs to be processed (e.g. XML parsing). (Unicode) strings should be used where the data is truly text by nature, i.e. where no encoding information is necessary to find out what characters are intended. It's used on interfaces where the encoding is known (e.g. text IO, where the encoding is specified on opening, XML parser results, with the declared encoding, and GUI libraries, which naturally expect text).
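
The split described above can be illustrated with the `io` module. This is only a sketch: an in-memory `io.BytesIO` stands in for a file or socket, and the sample text is made up.

```python
import io

# Low-level IO hands back bytes: nothing here says which characters they encode.
raw = io.BytesIO("caf\u00e9\n".encode("utf-8"))
chunk = raw.read()
print(type(chunk).__name__, len(chunk))  # bytes, one more octet than characters

# Text IO: the encoding is stated when the stream is opened,
# so reads produce str and no further decoding is needed.
raw.seek(0)
text_stream = io.TextIOWrapper(raw, encoding="utf-8")
line = text_stream.readline()
print(type(line).__name__, len(line))
```

The same octets come back as `bytes` from the binary layer and as `str` from the text layer; the only difference is whether an encoding was supplied at the interface.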

...

Until I figure that out, someone else is better off making decisions about the Python 3 version.

Some of us can certainly explain to you how this is supposed to work. However, we need you to check any assumption against the known use cases - would the users of the module be happy if it worked one way or the other?

...

My guess is that it should work the same way as it does in Python 2.x: take bytes or unicode input in loads (which means encoding is still relevant). I also think the output of dumps should also be bytes, since it is a serialization, but I am not sure how other libraries do this in Python 3 because one could argue that it is also text.

This, indeed, had been an endless debate, and, in the end, the decision was somewhat arbitrary. Here are some examples:

- base64.encodestring expects bytes (naturally, since it is supposed to encode arbitrary binary data), and produces bytes (debatably)
- binascii.b2a_hex likewise (expect and produce bytes)
- pickle.dumps produces bytes (uniformly, both for binary and text pickles)
- marshal.dumps likewise
- email.message.Message().as_string produces a (unicode) string (see Barry's recent thread on whether that's a good thing; the email package hasn't been fully ported to 3k, either)
- the XML libraries (continue to) parse bytes, and produce Unicode strings
- for the IO libraries, see above
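
A quick way to check these return types on a given interpreter is sketched below. One assumption to flag: `base64.encodestring`, named in the list, was later removed from the standard library, so the sketch calls `base64.encodebytes` instead. `json.dumps` is included for contrast.

```python
import base64
import binascii
import json
import marshal
import pickle

data = b"\x00\xffbinary"

# Map each call to the value it returns so the types can be compared.
results = {
    "base64.encodebytes": base64.encodebytes(data),
    "binascii.b2a_hex": binascii.b2a_hex(data),
    "pickle.dumps": pickle.dumps(["abc"]),
    "pickle.dumps(protocol=0)": pickle.dumps(["abc"], protocol=0),
    "marshal.dumps": marshal.dumps(["abc"]),
    "json.dumps": json.dumps(["abc"]),
}

for name, value in results.items():
    print(f"{name:26} -> {type(value).__name__}")
```

Protocol 0 is the "text" pickle format mentioned in the list; it is included to show that it returns the same type as the binary protocols.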

...

If other libraries that do text/text encodings (e.g. binascii, mimelib, ...) use str for input and output

See above - most of them don't; mimetools no longer exists (it was replaced by the email package)

...

instead of bytes then maybe Antoine's changes are the right solution and I just don't know better because I'm not up to speed with how people write Python 3 code.

There isn't too much fresh end-user code out there, so we can't really tell, either. As for standard library users - users will do whatever the library forces them to do. This is why I'm so concerned about this issue: we should get it right, or not do it at all. I still think you would be the best person to determine what is right.

...

I'll do my best to find some time to look into Python 3 more closely soon, but thus far I have not been very motivated to do so because Python 3 isn't useful for us at work and twiddling syntax isn't a very interesting problem for me to solve.

And I didn't expect you to - it seems people are quite willing to do the actual work, as long as there is some guidance. Regards, Martin

Guido van Rossum

12:36 a.m.

On Wed, Apr 8, 2009 at 4:10 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

The bug entry is: http://bugs.python.org/issue4136

I'm kind of surprised that a serialization protocol like JSON wouldn't support reading/writing bytes (as the serialized format -- I don't care about having bytes as values, since JavaScript doesn't have something equivalent AFAIK, and hence JSON doesn't allow it IIRC). Marshal and Pickle, for example, *always* treat the serialized format as bytes. And since in most cases it will be sent over a socket, at some point the serialized representation *will* be bytes, I presume. What makes supporting this hard? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
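
To make the "at some point it will be bytes" point concrete, here is a small sketch that sends a JSON document across a local socket pair. It assumes a str-only json API, as proposed in this thread, and an explicit UTF-8 encode/decode step at the socket boundary; the payload is made up.

```python
import json
import socket

# A connected pair of sockets stands in for a real network connection.
left, right = socket.socketpair()
try:
    text = json.dumps({"greeting": "h\u00e9llo"}, ensure_ascii=False)
    wire = text.encode("utf-8")  # sockets only accept bytes
    left.sendall(wire)
    left.shutdown(socket.SHUT_WR)  # signal end of data to the reader

    received = b""
    while True:
        chunk = right.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    left.close()
    right.close()

decoded = json.loads(received.decode("utf-8"))
print(type(text).__name__, type(received).__name__, decoded)
```

Whether the encode/decode step belongs inside the json module or in the caller is exactly the design question under discussion; the sketch only shows where the boundary falls.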

Antoine Pitrou

5:15 a.m.

Guido van Rossum <guido <at> python.org> writes:

...

I'm kind of surprised that a serialization protocol like JSON wouldn't support reading/writing bytes (as the serialized format -- I don't care about having bytes as values, since JavaScript doesn't have something equivalent AFAIK, and hence JSON doesn't allow it IIRC). Marshal and Pickle, for example, *always* treat the serialized format as bytes. And since in most cases it will be sent over a socket, at some point the serialized representation *will* be bytes, I presume. What makes supporting this hard?

It's not hard, it just means a lot of duplicated code if the library wants to support both str and bytes in an optimized way as Martin alluded to. This duplicated code already exists in the C parts to support the 2.x semantics of accepting unicode objects as well as str, but not in the Python parts, which explains why the bytes support is broken in py3k - in 2.x, the same Python code can be used for str and unicode.

On the other hand, supporting it without going after the last few percent of performance should be fairly trivial (by encoding/decoding before doing the processing proper), and it would avoid the current duplicated code.

As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*)

(*) http://www.ietf.org/rfc/rfc4627.txt

The RFC also specifies a discrimination algorithm for non-supersets of ASCII (“Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.”), but it is not implemented in the json module:

...

...
...
>>> json.loads('"hi"')
'hi'
>>> json.loads(u'"hi"'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Regards Antoine.
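
For reference, the null-pattern rule quoted from RFC 4627 can be sketched in a few lines. The function name `sniff_json_encoding` is hypothetical and this is not code from the json module. It also assumes input without a byte order mark: Python's plain 'utf-16' codec emits one when encoding, and the RFC's table does not cover that case, so the usage below encodes with explicit little- and big-endian codecs.

```python
import json


def sniff_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from the nulls in its first four octets.

    Hypothetical helper following the table in RFC 4627, section 3.
    Input shorter than four octets falls back to the RFC default, UTF-8.
    """
    head = data[:4]
    if len(head) < 4:
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return table.get(nulls, "utf-8")              # xx xx xx xx


guesses = {}
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw = '["hi"]'.encode(codec)
    guesses[codec] = sniff_json_encoding(raw)
    print(codec, "->", guesses[codec], json.loads(raw.decode(guesses[codec])))
```

The rule relies on the document starting with two ASCII characters, so it only applies to texts that follow the RFC's object-or-array grammar.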

Dirkjan Ochtman

7:59 a.m.

On Thu, Apr 9, 2009 at 07:15, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

The RFC also specifies a discrimination algorithm for non-supersets of ASCII (“Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.”), but it is not implemented in the json module:

Well, your example is bad in the context of the RFC. The RFC states that JSON-text = object / array, meaning "loads" for '"hi"' isn't strictly valid. The discrimination algorithm obviously only works in the context of that grammar, where the first character of a document must be { or [ and the next character can only be {, [, f, n, t, ", -, a number, or insignificant whitespace (space, \t, \r, \n).

...

...
...
...
>>> json.loads('"hi"')
'hi'
>>> json.loads(u'"hi"'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Cheers, Dirkjan

Antoine Pitrou

11:10 a.m.

Dirkjan Ochtman <dirkjan <at> ochtman.nl> writes:

...

The RFC states that JSON-text = object / array, meaning "loads" for '"hi"' isn't strictly valid.

Sure, but then:

...

...
...
>>> json.loads('[]')
[]
>>> json.loads(u'[]'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Cheers Antoine.
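
The "decode first, then parse" approach mentioned earlier in the thread would handle this input. A minimal sketch follows, assuming the caller knows the charset; `loads_bytes` and `dumps_bytes` are hypothetical names, not functions of the json module.

```python
import json


def loads_bytes(data, encoding="utf-8"):
    """Hypothetical helper: decode the payload, then parse the resulting str."""
    return json.loads(data.decode(encoding))


def dumps_bytes(obj, encoding="utf-8"):
    """Hypothetical helper: serialize to str, then encode for the wire."""
    return json.dumps(obj).encode(encoding)


# The 'utf-16' codec writes a byte order mark on encode and consumes it on decode.
wire = "[]".encode("utf-16")
parsed = loads_bytes(wire, encoding="utf-16")
round_trip = loads_bytes(dumps_bytes(["abc", 1]))
print(parsed, round_trip)
```

This costs one extra pass over the data compared with parsing bytes directly, which is the performance trade-off raised at the start of the thread.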

Dirkjan Ochtman

12:02 p.m.

On Thu, Apr 9, 2009 at 13:10, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

Sure, but then:

...
...
...
>>> json.loads('[]')
[]
>>> json.loads(u'[]'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Right. :) Just wanted to point out that your test might not be testing what you want to test. Cheers, Dirkjan

Barry Warsaw

11:01 a.m.

On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:

...

Guido van Rossum <guido <at> python.org> writes:

...
I'm kind of surprised that a serialization protocol like JSON wouldn't support reading/writing bytes (as the serialized format -- I don't care about having bytes as values, since JavaScript doesn't have something equivalent AFAIK, and hence JSON doesn't allow it IIRC). Marshal and Pickle, for example, *always* treat the serialized format as bytes. And since in most cases it will be sent over a socket, at some point the serialized representation *will* be bytes, I presume. What makes supporting this hard?

It's not hard, it just means a lot of duplicated code if the library wants to support both str and bytes in an optimized way as Martin alluded to. This duplicated code already exists in the C parts to support the 2.x semantics of accepting unicode objects as well as str, but not in the Python parts, which explains why the bytes support is broken in py3k - in 2.x, the same Python code can be used for str and unicode.

This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Barry
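
A sketch of what "sniffing the arguments" on input could look like is given below. `parse_message` is a hypothetical wrapper. It relies on `email.message_from_bytes`, `email.message_from_string`, `Message.as_string` and `Message.as_bytes` from current Python 3, which postdate this thread, so it illustrates the idea rather than the API being designed here.

```python
import email
from email.message import Message


def parse_message(source) -> Message:
    """Hypothetical wrapper: pick the parser by looking at the argument type."""
    if isinstance(source, (bytes, bytearray)):
        return email.message_from_bytes(bytes(source))
    if isinstance(source, str):
        return email.message_from_string(source)
    raise TypeError(f"expected bytes or str, got {type(source).__name__}")


raw = b"Subject: hello\r\n\r\nbody text\r\n"
from_bytes = parse_message(raw)
from_text = parse_message(raw.decode("ascii"))

# Output cannot be sniffed, so the caller chooses by method name.
as_text = from_bytes.as_string()
as_wire = from_bytes.as_bytes()
print(from_bytes["Subject"], type(as_text).__name__, type(as_wire).__name__)
```

Input can dispatch on type because the argument carries it; output has no argument to inspect, which is why two explicitly named methods are used for it.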

Steve Holden

12:07 p.m.

Barry Warsaw wrote:

...

On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:

...
Guido van Rossum <guido <at> python.org> writes:

...
I'm kind of surprised that a serialization protocol like JSON wouldn't support reading/writing bytes (as the serialized format -- I don't care about having bytes as values, since JavaScript doesn't have something equivalent AFAIK, and hence JSON doesn't allow it IIRC). Marshal and Pickle, for example, *always* treat the serialized format as bytes. And since in most cases it will be sent over a socket, at some point the serialized representation *will* be bytes, I presume. What makes supporting this hard?

...
It's not hard, it just means a lot of duplicated code if the library wants to support both str and bytes in an optimized way as Martin alluded to. This duplicated code already exists in the C parts to support the 2.x semantics of accepting unicode objects as well as str, but not in the Python parts, which explains why the bytes support is broken in py3k - in 2.x, the same Python code can be used for str and unicode.

This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

The real problem I came across in storing email in a relational database was the inability to store messages as Unicode. Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding most relational databases won't be able to help unless you store the components as bytes). All in all, as you might expect from a system that's been growing up since 1970 or so, it can be quite intractable. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/

Tony Nelson

3:05 p.m.

New subject: email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

(email-sig added) At 08:07 -0400 04/09/2009, Steve Holden wrote:

...

Barry Warsaw wrote: ...

...
This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

The real problem I came across in storing email in a relational database was the inability to store messages as Unicode. Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding most relational databases won't be able to help unless you store the components as bytes). ...

I found it confusing myself, and did it wrong for a while. Now, I understand that messages come over the wire as bytes, either 7-bit US-ASCII or 8-bit whatever, and are parsed at the receiver. I think of the database as a wire to the future, and store the data as bytes (a BLOB), letting the future receiver parse them as it did the first time, when I cleaned the message. Data I care to query is extracted into fields (in UTF-8, what I usually use for char fields). I have no need to store messages as Unicode, and they aren't Unicode anyway. I have no need ever to flatten a message to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw 8-bit data.

If you need the data from the message, by all means extract it and store it in whatever form is useful to the purpose of the database. If you need the entire message, store it intact in the database, as the bytes it is. Email isn't Unicode any more than a JPEG or other image types (often payloads in a message) are Unicode.

-- ____________________________________________________________________ TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
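
The storage scheme described above might look like the following sketch. It uses the standard library's `sqlite3` as a stand-in for the databases discussed in the thread, and the table layout, column names and sample message are all invented for illustration.

```python
import email
import sqlite3

# A raw message as received: headers are ASCII, the body contains an 8-bit octet.
raw = b"Subject: status report\r\nFrom: a@example.com\r\n\r\n\xe9 is not ASCII\r\n"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, raw BLOB)"
)

# Extract only the field we want to query; keep the message itself untouched.
subject = email.message_from_bytes(raw)["Subject"]
conn.execute("INSERT INTO messages (subject, raw) VALUES (?, ?)", (subject, raw))

# Later: look up by the extracted field, then re-parse the stored bytes.
(stored,) = conn.execute(
    "SELECT raw FROM messages WHERE subject = ?", ("status report",)
).fetchone()
reparsed = email.message_from_bytes(stored)
conn.close()

print(stored == raw, reparsed["From"])
```

The BLOB column returns exactly the octets that were stored, so the message can be re-parsed at any later time without having committed to a character encoding.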

Steve Holden

4:20 p.m.

New subject: email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

Tony Nelson wrote:

...

(email-sig added)

At 08:07 -0400 04/09/2009, Steve Holden wrote:

...
Barry Warsaw wrote: ...

...
This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

The real problem I came across in storing email in a relational database was the inability to store messages as Unicode. Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding most relational databases won't be able to help unless you store the components as bytes). ...

I found it confusing myself, and did it wrong for a while. Now, I understand that messages come over the wire as bytes, either 7-bit US-ASCII or 8-bit whatever, and are parsed at the receiver. I think of the database as a wire to the future, and store the data as bytes (a BLOB), letting the future receiver parse them as it did the first time, when I cleaned the message. Data I care to query is extracted into fields (in UTF-8, what I usually use for char fields). I have no need to store messages as Unicode, and they aren't Unicode anyway. I have no need ever to flatten a message to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw 8-bit data.

If you need the data from the message, by all means extract it and store it in whatever form is useful to the purpose of the database. If you need the entire message, store it intact in the database, as the bytes it is. Email isn't Unicode any more than a JPEG or other image types (often payloads in a message) are Unicode.

This is all great, and I did quite quickly realize that the best approach was to store the mails in their network byte-stream format as bytes. The approach was negated in my own case because of PostgreSQL's execrable BLOB-handling capabilities. I took a look at the escaping they required, snorted with derision and gave it up as a bad job. PostgreSQL strongly encourages you to store text as encoded columns. Because emails lack an encoding it turns out this is a most inconvenient storage type for it. Sadly BLOBs are such a pain in PostgreSQL that it's easier to store the messages in external files and just use the relational database to index those files to retrieve content, so that's what I ended up doing. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/

Tony Nelson

5:14 p.m.

New subject: email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

(email-sig dropped, as I didn't see Steve Holden's message there) At 12:20 -0400 04/09/2009, Steve Holden wrote:

...

Tony Nelson wrote: ...

...
If you need the data from the message, by all means extract it and store it in whatever form is useful to the purpose of the database. If you need the entire message, store it intact in the database, as the bytes it is. Email isn't Unicode any more than a JPEG or other image types (often payloads in a message) are Unicode.

This is all great, and I did quite quickly realize that the best approach was to store the mails in their network byte-stream format as bytes. The approach was negated in my own case because of PostgreSQL's execrable BLOB-handling capabilities. I took a look at the escaping they required, snorted with derision and gave it up as a bad job. ...

I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I agree that having to import them from a file is awful. Also, there appears to be a severe limit on the size of character data fields, so storing in Base64 is out. About the only thing to do then is to use external storage for the BLOBs. Still, email seems to demand such binary storage, whether all databases provide it or not. -- ____________________________________________________________________ TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

Oleg Broytmann

5:24 p.m.

New subject: BLOBs in Pg (was: email package Bytes vs Unicode)

On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:

...

I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs.

I think it has - BYTEA data type. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

Steve Holden

6:05 p.m.

New subject: BLOBs in Pg

Oleg Broytmann wrote:

...

On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:

...
I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs.

I think it has - BYTEA data type.

But the Python DB adapters appear to require some fairly hairy escaping of the data to make it usable with the cursor execute() method. IMHO you shouldn't have to escape data that is passed for insertion via a parameterized query. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/

Sylvain Thénault

7:49 a.m.

New subject: BLOBs in Pg

On 09 avril 14:05, Steve Holden wrote:

...

Oleg Broytmann wrote:

...
On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:

...
I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs.

I think it has - BYTEA data type.

But the Python DB adapters appear to require some fairly hairy escaping of the data to make it usable with the cursor execute() method. IMHO you shouldn't have to escape data that is passed for insertion via a parameterized query.

can't you simply use dbmodule.Binary to do the job? -- Sylvain Thénault LOGILAB, Paris (France) Formations Python, Debian, Méth. Agiles: http://www.logilab.fr/formations Développement logiciel sur mesure: http://www.logilab.fr/services CubicWeb, the semantic web framework: http://www.cubicweb.org
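For reference, the DB-API 2.0 specification (PEP 249) defines a Binary() constructor for this purpose. Below is a minimal sketch of the round trip being suggested, assuming Python 3 and using the standard library's sqlite3 module as a stand-in for a PostgreSQL adapter. The table and column names are borrowed from later in the thread; with psycopg2 the column type would be BYTEA and the placeholder would be %s rather than ?.

```python
import sqlite3

# sqlite3 stands in for any DB-API 2.0 module; this is an assumption made
# so the sketch runs without a database server.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE tst (id INTEGER PRIMARY KEY, byt BLOB)")

payload = bytes(range(256))  # every possible byte value, including NUL

# Binary() marks the parameter as binary data. The adapter takes care of
# any escaping, so the caller never builds the literal by hand.
curs.execute("INSERT INTO tst (byt) VALUES (?)", (sqlite3.Binary(payload),))
conn.commit()

curs.execute("SELECT byt FROM tst")
(stored,) = curs.fetchone()
roundtrip_ok = bytes(stored) == payload
print(len(stored), roundtrip_ok)
```

The point of the sketch is only that the binary wrapper plus a parameterized query is enough; no manual escaping appears anywhere in the calling code.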

Tony Nelson

6:43 p.m.

New subject: BLOBs in Pg (was: email package Bytes vs Unicode)

At 21:24 +0400 04/09/2009, Oleg Broytmann wrote:

...

On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:

...
I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs.

I think it has - BYTEA data type.

So it does; I see that now that I've opened up the PostgreSQL docs. I don't find escaping data to be a problem -- I do it for all untrusted data. So, after all, there isn't an example of a database that makes onerous the storing of email and other such byte-oriented data, and Python's email package has no need for workarounds in that area. -- ____________________________________________________________________ TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

Steve Holden

8:42 p.m.

New subject: BLOBs in Pg

Tony Nelson wrote:

...

At 21:24 +0400 04/09/2009, Oleg Broytmann wrote:

...
On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:

...
I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I think it has - BYTEA data type.

So it does; I see that now that I've opened up the PostgreSQL docs. I don't find escaping data to be a problem -- I do it for all untrusted data.

You shouldn't have to when you are using parameterized queries.

...

So, after all, there isn't an example of a database that makes onerous the storing of email and other such byte-oriented data, and Python's email package has no need for workarounds in that area.

Create a table:

    CREATE TABLE tst (
        id serial,
        byt bytea,
        PRIMARY KEY (id)
    ) WITH (OIDS=FALSE);
    ALTER TABLE tst OWNER TO steve;

The following program prints "0":

    import psycopg2 as db

    conn = db.connect(database="maildb", user="@@@", password="@@@",
                      host="localhost", port=5432)
    curs = conn.cursor()
    curs.execute("DELETE FROM tst")
    curs.execute("INSERT INTO tst (byt) VALUES (%s)",
                 ("".join(chr(i) for i in range(256)), ))
    conn.commit()
    curs.execute("SELECT byt FROM tst")
    for st, in curs.fetchall():
        print len(st)

If I change the data to use range(1, 256) I get a ProgrammingError from PostgreSQL "invalid input syntax for type bytea". If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. My current belief is that this something is fairly deeply embedded in the PostgreSQL engine. No "syntax" should be necessary. I suppose if we have to go round again on this we should take it to email as we have gotten pretty far off-topic for python-dev. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/

Aahz

8:53 p.m.

New subject: BLOBs in Pg

On Thu, Apr 09, 2009, Steve Holden wrote:

...

    import psycopg2 as db

    conn = db.connect(database="maildb", user="@@@", password="@@@",
                      host="localhost", port=5432)
    curs = conn.cursor()
    curs.execute("DELETE FROM tst")
    curs.execute("INSERT INTO tst (byt) VALUES (%s)",
                 ("".join(chr(i) for i in range(256)), ))
    conn.commit()
    curs.execute("SELECT byt FROM tst")
    for st, in curs.fetchall():
        print len(st)

If I change the data to use range(1, 256) I get a ProgrammingError from PostgreSQL "invalid input syntax for type bytea".

If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. My current belief is that this something is fairly deeply embedded in the PostgreSQL engine. No "syntax" should be necessary.

You're not using a parameterized query. I suggest you post to c.l.py for more information. ;-) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Why is this newsgroup different from all other newsgroups?

Oleg Broytmann

9:12 p.m.

New subject: BLOBs in Pg

On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote:

...

If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless.

    import psycopg2

    con = psycopg2.connect(database="test")
    cur = con.cursor()
    cur.execute("CREATE TABLE test (id serial, data BYTEA)")
    cur.execute('INSERT INTO test (data) VALUES (%s)',
                (psycopg2.Binary(''.join([chr(i) for i in range(256)])),))
    cur.execute('SELECT * FROM test ORDER BY id')
    for rec in cur.fetchall():
        print rec[0], type(rec[1]), repr(str(rec[1]))

Result:

1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

What am I doing wrong?

Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.

Steve Holden

11:56 p.m.

New subject: BLOBs in Pg

Oleg Broytmann wrote:

...

On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote:

...
If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless.

    import psycopg2

    con = psycopg2.connect(database="test")
    cur = con.cursor()
    cur.execute("CREATE TABLE test (id serial, data BYTEA)")
    cur.execute('INSERT INTO test (data) VALUES (%s)',
                (psycopg2.Binary(''.join([chr(i) for i in range(256)])),))
    cur.execute('SELECT * FROM test ORDER BY id')
    for rec in cur.fetchall():
        print rec[0], type(rec[1]), repr(str(rec[1]))

Result:

1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

What am I doing wrong?

Oleg.

Corresponding with me, probably. Thank you Oleg. I feel suddenly saner again. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/

Barry Warsaw

2:40 a.m.

New subject: email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

On Apr 9, 2009, at 12:20 PM, Steve Holden wrote:

...

PostgreSQL strongly encourages you to store text as encoded columns. Because emails lack an encoding it turns out this is a most inconvenient storage type for it. Sadly BLOBs are such a pain in PostgreSQL that it's easier to store the messages in external files and just use the relational database to index those files to retrieve content, so that's what I ended up doing.

That's not insane for other reasons. Do you really want to store 10MB of mp3 data in your database? Which of course reminds me that I want to add an interface, probably to the parser and message class, to allow an application to store message payloads in other than memory. Parsing and holding onto messages with huge payloads can kill some applications, when you might not care too much about the actual payload content. Barry

Barry Warsaw

2:26 a.m.

On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

...

The real problem I came across in storing email in a relational database was the inability to store messages as Unicode. Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding most relational databases won't be able to help unless you store the components as bytes).

All in all, as you might expect from a system that's been growing up since 1970 or so, it can be quite intractable.

There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).

The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course.

It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application.

Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated.

Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

-Barry
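The "two views of a header value" can be illustrated with the existing email.header helpers. This is only a sketch, assuming Python 3; the sample header value is made up, and nothing here is the API being proposed in the thread.

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word: 7-bit ASCII on the wire, UTF-8 underneath.
raw_value = "=?utf-8?q?caf=C3=A9?="

# View 1: the underlying encoded data, as (bytes, charset) chunks.
chunks = decode_header(raw_value)

# View 2: the decoded text an application would display or index.
text_value = str(make_header(chunks))

print(chunks)
print(text_value)
```

An application that cares about the raw data would keep the bytes in chunks; one that only wants text would keep text_value.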

glyph＠divmod.com

3:11 a.m.

New subject: the email module, text, and bytes (was Re: Dropping bytes "support" in json)

On 02:26 am, barry@python.org wrote:

...

There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).

I think this is a problematic way to model bytes vs. text; it gives text a special relationship to bytes which should be avoided.

IMHO the right way to think about domains like this is a multi-level representation. The "low level" representation is always bytes, whether your MIME type is text/whatever or application/x-i-dont-know.

The thing that's "special" about text is that it's a "high level" representation that the standard library can know about. But the 'email' package ought to support being extended to support other types just as well. For example, I want to ask for image/png content as PIL.Image objects, not bags of bytes. Of course this presupposes some way for PIL itself to get at some bytes, but then you need the email module itself to get at the bytes to convert to text in much the same way. There also needs to be layering at the level of bytes->base64->some different bytes->PIL->Image. There are mail clients that will base64-encode unusual encodings so you have to do that same layering for text sometimes.

I'm also being somewhat handwavy with talk of "low" and "high" level representations; of course there are actually multiple levels beyond that. I might want text/x-python content to show up as an AST, but the intermediate DOM-parsing representation really wants to operate on characters. Similarly for a DOM and text/html content. (Modulo the usual encoding-detection weirdness present in parsers.)

So, as long as there's a crisp definition of what layer of the MIME stack one is operating on, I don't think that there's really any ambiguity at all about what type you should be getting.
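The layering described here (wire bytes, then transfer-decoded bytes, then a high-level object) can be sketched with the standard library for the text case. This assumes Python 3 and a hand-written sample message; the addresses and body are invented for illustration, and an image type would swap the last step for whatever the image library provides.

```python
import base64
from email import message_from_bytes

# A tiny hand-written message whose text body is base64-encoded UTF-8.
body = base64.b64encode("na\u00efve caf\u00e9\n".encode("utf-8"))
raw = (
    b"From: sender@example.com\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"Content-Transfer-Encoding: base64\n"
    b"\n" + body + b"\n"
)

msg = message_from_bytes(raw)

# Layer 1: the transfer-encoded payload as it appeared in the message.
encoded = msg.get_payload()
# Layer 2: undo the Content-Transfer-Encoding to get the content bytes.
content_bytes = msg.get_payload(decode=True)
# Layer 3: interpret those bytes with the declared charset to get text.
text = content_bytes.decode(msg.get_content_charset())

print(repr(encoded))
print(repr(content_bytes))
print(repr(text))
```

Each layer has an unambiguous type once you say which layer you are asking for, which is the point being made above.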

Barry Warsaw

3:03 a.m.

New subject: the email module, text, and bytes (was Re: Dropping bytes "support" in json)

On Apr 9, 2009, at 11:11 PM, glyph@divmod.com wrote:

...

I think this is a problematic way to model bytes vs. text; it gives text a special relationship to bytes which should be avoided.

IMHO the right way to think about domains like this is a multi-level representation. The "low level" representation is always bytes, whether your MIME type is text/whatever or application/x-i-dont-know.

This is a really good point, and I really should be clearer when describing my current thinking (sleep would help :).

...

The thing that's "special" about text is that it's a "high level" representation that the standard library can know about. But the 'email' package ought to support being extended to support other types just as well. For example, I want to ask for image/png content as PIL.Image objects, not bags of bytes. Of course this presupposes some way for PIL itself to get at some bytes, but then you need the email module itself to get at the bytes to convert to text in much the same way. There also needs to be layering at the level of bytes->base64->some different bytes->PIL->Image. There are mail clients that will base64-encode unusual encodings so you have to do that same layering for text sometimes.

I'm also being somewhat handwavy with talk of "low" and "high" level representations; of course there are actually multiple levels beyond that. I might want text/x-python content to show up as an AST, but the intermediate DOM-parsing representation really wants to operate on characters. Similarly for a DOM and text/html content. (Modulo the usual encoding-detection weirdness present in parsers.)

When I was talking about supporting text/* content types as strings, I was definitely thinking about using basically the same plug-in or higher level or whatever API to do that as you might use to get PIL images from an image/gif.

...

So, as long as there's a crisp definition of what layer of the MIME stack one is operating on, I don't think that there's really any ambiguity at all about what type you should be getting.

In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that. -Barry

Bill Janssen

4:35 p.m.

New subject: [Email-SIG] the email module, text, and bytes (was Re: Dropping bytes "support" in json)

Barry Warsaw <barry@python.org> wrote:

...

In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that.

Yep. Bill

Stephen J. Turnbull

7:06 p.m.

New subject: [Email-SIG] the email module, text, and bytes (was Re: Dropping bytes "support" in json)

Bill Janssen writes:

...

Barry Warsaw <barry@python.org> wrote:

...
In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that.

Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the *current* email package?

Barry Warsaw

7:04 p.m.

New subject: [Email-SIG] the email module, text, and bytes (was Re: Dropping bytes "support" in json)

On Apr 10, 2009, at 3:06 PM, Stephen J. Turnbull wrote:

...

Bill Janssen writes:

...
Barry Warsaw <barry@python.org> wrote:

...
In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that.

Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the *current* email package?

Not really. We got here because <ahem>we</ahem> were too damn sloppy about the distinction. I'm going to remove python-dev from subsequent follow ups. Please join us at email-sig for further discussion. Barry

Guido van Rossum

2:11 a.m.

New subject: [Email-SIG] the email module, text, and bytes (was Re: Dropping bytes "support" in json)

On Fri, Apr 10, 2009 at 12:04 PM, Barry Warsaw <barry@python.org> wrote:

...

On Apr 10, 2009, at 3:06 PM, Stephen J. Turnbull wrote:

...
Bill Janssen writes:

...
Barry Warsaw <barry@python.org> wrote:

...
In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that.

Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the *current* email package?

Not really. We got here because <ahem>we</ahem> were too damn sloppy about the distinction.

Agreed. I take full responsibility -- the str/unicode approach we introduced in 2.0 seemed like the best thing we could do at the time, but in retrospect it would've been better if we'd left str alone and introduced a unicode type that was truly distinct -- like str in 3.0. The email package is not the only system that ended up with a muddled distinction between the two as a result.

...

I'm going to remove python-dev from subsequent follow ups. Please join us at email-sig for further discussion.

Barry

-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Tony Nelson

3:59 a.m.

New subject: [Email-SIG] Dropping bytes "support" in json

At 22:26 -0400 04/09/2009, Barry Warsaw wrote:

...

There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).

The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course.

Storing an email message in a database does mean storing some of the header fields as database fields, but the set of email header fields is open, so any "unused" fields in a message must be stored elsewhere. It isn't useful to just have a bag of name/value pairs in a table. General message MIME payload trees don't map well to a database either, unless one wants to get very relational. Sometimes the database needs to represent the entire email message, header fields and MIME tree, but only if it is an email program and usually not even then. Usually, the database has a specific purpose, and can be designed for the data it cares about; it may choose to keep the original message as bytes.

...

It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application.

I think of header fields as having text-like names (the set of allowed characters is more than just text, though defined headers don't make use of that), but the data is either bytes or it should be parsed into something appropriate: text for unstructured fields like Subject:, a list of addresses for address fields like To:. Many of the structured header fields have a reasonable mapping to text; certainly this is true for address header fields. Content-Type header fields are barely text, they can be so convolutedly structured, but I suppose one could flatten one of them to text instead of bytes if the user wanted. It's not very useful, though, except for debugging (either by the programmer or the recipient who wants to know what was cleaned from the message).

...

Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated.

If a database stores the Subject: header field, it would be as text. The various recipient address fields are a one message to many names and addresses mapping, and need a related table of name/address fields, with each field being text. The original message (or whatever part of it one preserves) should be bytes. I don't think this complicates the email package API; rather, it just shows where generality is needed.

...

Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

You now have the opportunity to finally unsnarl that mess. It is not an insurmountable opportunity. -- ____________________________________________________________________ TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

Barry Warsaw

5:12 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 9, 2009, at 11:59 PM, Tony Nelson wrote:

...

...
Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

You now have the opportunity to finally unsnarl that mess. It is not an insurmountable opportunity.

No, it's just a full time job <wink>. Now where did I put that hack-drink-coffee-twitter clone? -Barry

Stephen J. Turnbull

5:22 a.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Barry Warsaw writes:

...

There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

...

Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).

*sigh* Why are you back-tracking? The payload should be of an appropriate *object* type. Atomic object types will have their content stored as string or bytes [nb I use Python 3 terminology throughout]. Composite types (multipart/*) won't need string or bytes attributes AFAICS. Start by implementing the application/octet-stream and text/plain;charset=utf-8 object types, of course.

...

It does seem to make sense to think about headers as text header names and text header values.

I disagree. IMHO, structured header types should have object values, and something like

    message['to'] = "Barry 'da FLUFL' Warsaw <barry@python.org>"

should be smart enough to detect that it's a string and attempt to (flexibly) parse it into a fullname and a mailbox adding escapes, etc. Whether these should be structured objects or they can be strings or bytes, I'm not sure (probably bytes, not strings, though -- see next example). OTOH

    message['to'] = b'''"Barry 'da.FLUFL' Warsaw" <barry@python.org>'''

should assume that the client knows what they are doing, and should parse it strictly (and I mean "be a real bastard", eg, raise an exception on any non-ASCII octet), merely dividing it into fullname and mailbox, and caching the bytes for later insertion in a wire-format message.

...

In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with message['to'] always returning a structured_header object (or maybe even more specifically an address_header object), and methods like

    message['to'].build_header_as_text()

which returns

    """To: "Barry 'da.FLUFL' Warsaw" <barry@python.org>"""

and

    message['to'].build_header_in_wire_format()

which returns

    b"""To: "Barry 'da.FLUFL' Warsaw" <barry@python.org>"""

Then have email.textview.Message and email.wireview.Message which provide a simple interface where message['to'] would invoke .build_header_as_text() and .build_header_in_wire_format() respectively.
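To make the proposal concrete, here is a rough sketch of such a header object, assuming Python 3. The AddressHeader class is hypothetical and exists only for illustration; only the two method names come from the message above, and only email.utils.parseaddr is a real library call. The sketch does no quoting or escaping of the name, and it refuses non-ASCII on the wire side instead of applying RFC 2047 encoding as a real implementation presumably would.

```python
from email.utils import parseaddr


class AddressHeader:
    """Hypothetical structured header; not part of the email package."""

    def __init__(self, name, value):
        self.name = name
        # parseaddr splits 'Full Name <mailbox>' into its two parts.
        self.fullname, self.mailbox = parseaddr(value)

    def build_header_as_text(self):
        # str view, e.g. for display or for a text column in a database.
        # Sketch only: quotes and backslashes in the name are not escaped.
        return '%s: "%s" <%s>' % (self.name, self.fullname, self.mailbox)

    def build_header_in_wire_format(self):
        # bytes view. Strict on purpose: any non-ASCII character raises
        # UnicodeEncodeError here.
        return self.build_header_as_text().encode("ascii")


to = AddressHeader("To", '"Barry \'da.FLUFL\' Warsaw" <barry@python.org>')
as_text = to.build_header_as_text()
as_wire = to.build_header_in_wire_format()
print(as_text)
print(as_wire)
```

The textview and wireview Message classes mentioned above would then be thin wrappers that pick one of the two methods when message['to'] is read.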

...

Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

Er, yeah. Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,

Barry Warsaw

5:21 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 10, 2009, at 1:22 AM, Stephen J. Turnbull wrote:

...

...
Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).

*sigh* Why are you back-tracking?

I'm not. Sleep deprivation only makes it seem like that.

...

The payload should be of an appropriate *object* type. Atomic object types will have their content stored as string or bytes [nb I use Python 3 terminology throughout]. Composite types (multipart/*) won't need string or bytes attributes AFAICS.

Yes, agreed.

...

Start by implementing the application/octet-stream and text/plain;charset=utf-8 object types, of course.

Yes. See my lament about using inheritance for this.

...

...
It does seem to make sense to think about headers as text header names and text header values.

I disagree. IMHO, structured header types should have object values, and something like

While I agree, there's still a need for a higher level API that makes it easy to do the simple things.

...

message['to'] = "Barry 'da FLUFL' Warsaw <barry@python.org>"

should be smart enough to detect that it's a string and attempt to (flexibly) parse it into a fullname and a mailbox adding escapes, etc. Whether these should be structured objects or they can be strings or bytes, I'm not sure (probably bytes, not strings, though -- see next example). OTOH

message['to'] = b'''"Barry 'da.FLUFL' Warsaw" <barry@python.org>'''

should assume that the client knows what they are doing, and should parse it strictly (and I mean "be a real bastard", eg, raise an exception on any non-ASCII octet), merely dividing it into fullname and mailbox, and caching the bytes for later insertion in a wire-format message.

I agree that the Message class needs to be strict. A parser needs to be lenient; see the .defects attribute introduced in the current email package. Oh, and this reminds me that we still haven't talked about idempotency. That's an important principle in the current email package, but do we need to give up on that?
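Idempotency, as the term is generally used for the email package, means that parsing a message and flattening it again without changes gives back the original. A minimal check of that property, assuming Python 3.4 or later (for Message.as_bytes) and a simple, well-formed sample message invented for the purpose:

```python
from email import message_from_bytes

raw = (
    b"From: sender@example.com\n"
    b"To: recipient@example.com\n"
    b"Subject: round trip\n"
    b"\n"
    b"Hello, world.\n"
)

# Parse the wire bytes into a Message tree, then flatten it again.
msg = message_from_bytes(raw)
regenerated = msg.as_bytes()

is_idempotent = regenerated == raw
print(is_idempotent)
```

This says nothing about messages with unusual folding or defects, which is where the question of whether to keep the guarantee gets interesting.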

...

...
In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with message['to'] always returning a structured_header object (or maybe even more specifically an address_header object), and methods like

message['to'].build_header_as_text()

which returns

"""To: "Barry 'da.FLUFL' Warsaw" <barry@python.org>"""

and

message['to'].build_header_in_wire_format()

which returns

b"""To: "Barry 'da.FLUFL' Warsaw" <barry@python.org>"""

Then have email.textview.Message and email.wireview.Message which provide a simple interface where message['to'] would invoke .build_header_as_text() and .build_header_in_wire_format() respectively.

This seems similar to Glyph's basic idea, but with a different spelling.

...

...
Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,

Can I have my uucp address back now? -Barry

Stephen J. Turnbull

7:04 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Shouldn't this thread move lock stock and .signature to email-sig?

Barry Warsaw writes:

...

...
...
It does seem to make sense to think about headers as text header names and text header values.

I disagree. IMHO, structured header types should have object values, and something like

While I agree, there's still a need for a higher level API that makes it easy to do the simple things.

Sure. I'm suggesting that the way to determine whether something is simple or not is by whether it falls out naturally from correct structure. Ie, no operations that only a Cirque du Soleil juggler can perform are allowed.

...

I agree that the Message class needs to be strict. A parser needs to be lenient;

Not always. The Postel Principle only applies to stuph coming in off the wire. But we're *also* going to be parsing pseudo-email components that are being handed to us by applications (eg, the perennial control-character-in-the-unremovable-address Mailman bug). Our parser should Just Say No to that crap.

...

see the .defects attribute introduced in the current email package. Oh, and this reminds me that we still haven't talked about idempotency. That's an important principle in the current email package, but do we need to give up on that?

"Idempotency"? I'm not sure what that means in the context of the email package ... multiplication by zero?<wink> Do you mean that .parse().to_wire() should be idempotent? Yes, I think that's a good idea, and it shouldn't be too hard to implement by (optionally?) caching the whole original message or individual components (headers with all whitespace including folding cached verbatim, etc). I think caching has to be done, since stuff like "did the original fold with a leading tab or a leading space, and at what column" and so on seems kind of pointless to encode as attributes on Header objects. [Description of MessageTextView and MessageWireView elided.]

...

This seems similar to Glyph's basic idea, but with a different spelling.

Yes. I don't much care which way it's done, and Glyph's style of spelling is more explicit. But I was thinking in terms of the number of people who are surely going to sing "Mama don' 'low no Unicodes roun' here" and squeal "codec WTF?! outta mah face, man!"

Bill Janssen

3:08 p.m.

Barry Warsaw <barry@python.org> wrote:

...

Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Probably a good thing. It just promotes more confusion to do things that way, IMO. Bill

Barry Warsaw

2:29 a.m.

On Apr 9, 2009, at 11:08 AM, Bill Janssen wrote:

...

Barry Warsaw <barry@python.org> wrote:

...
Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Probably a good thing. It just promotes more confusion to do things that way, IMO.

Very possibly so. But applications will definitely want stuff like the text/plain payload as a unicode, or the image/gif payload as a bytes (or even as a PIL image or whatever). Not that I think the email package needs to know about every content type under the sun, but I do think that it should be pluggable so as to allow applications to more conveniently access the data that way. Possibly the defaults should be unicodes for any text/* type and bytes for everything else. -Barry
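One way to picture the pluggable access described above is a small converter registry. This is only a sketch under stated assumptions: `register_converter`, `get_payload`, and the `"text/*"` wildcard key are hypothetical names invented for illustration and are not part of the email package. The defaults follow the suggestion in the message (str for any text/* type, bytes for everything else).

```python
import json

# Hypothetical registry mapping a content type (or "maintype/*") to a converter.
_converters = {}


def register_converter(content_type, func):
    """Register func(raw_bytes, charset) for a content type or 'maintype/*'."""
    _converters[content_type.lower()] = func


def get_payload(content_type, raw, charset="utf-8"):
    """Return the payload converted by the most specific registered converter."""
    content_type = content_type.lower()
    maintype = content_type.split("/", 1)[0]
    for key in (content_type, maintype + "/*"):
        if key in _converters:
            return _converters[key](raw, charset)
    # No converter registered: hand back the raw bytes unchanged.
    return raw


# Default suggested in the thread: any text/* type comes back as str.
register_converter("text/*", lambda raw, charset: raw.decode(charset))

# An application can plug in its own type without the package knowing about it.
register_converter("application/json",
                   lambda raw, charset: json.loads(raw.decode(charset)))
```

With this shape the package itself only needs the lookup rule; knowledge of individual content types lives in whatever the application registers.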

Daniel Stutzbach

3:55 p.m.

On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <barry@python.org> wrote:

...

Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Won't this work? (assuming dumps() always returns a string)

    def dumpb(obj, encoding='utf-8', *args, **kw):
        s = dumps(obj, *args, **kw)
        return s.encode(encoding)

-- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Barry Warsaw

2:38 a.m.

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:

...

On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <barry@python.org> wrote: Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Won't this work? (assuming dumps() always returns a string)

    def dumpb(obj, encoding='utf-8', *args, **kw):
        s = dumps(obj, *args, **kw)
        return s.encode(encoding)

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...

...
...
message['Subject']

The raw bytes or the decoded unicode? Okay, so you've picked one. Now how do you spell the other way? The Message class probably has these explicit methods:

...

...
...
Message.get_header_bytes('Subject')
Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice? Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...

...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

One of those maps to

...

...
...
message['Subject'] = ???

I'm open to any suggestions here! -Barry
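For concreteness, here is a minimal sketch of the explicit accessors, assuming the message stores each header value as wire-format ASCII bytes. The class `SketchMessage` and its method bodies are hypothetical; only `email.header.Header`, `decode_header`, and `make_header` are real standard-library calls. It deliberately leaves `message['Subject']` undefined, since which accessor that should map to is exactly the open question.

```python
from email.header import Header, decode_header, make_header


class SketchMessage:
    """Hypothetical message class that keeps header values as raw bytes."""

    def __init__(self):
        self._headers = {}

    def set_header(self, name, value, encoding="utf-8"):
        if isinstance(value, bytes):
            # Decode first so bytes and str input share one code path.
            value = value.decode(encoding)
        try:
            # Leave the value unencoded when it is already pure ASCII.
            raw = value.encode("ascii")
        except UnicodeEncodeError:
            # Otherwise produce an RFC 2047 encoded word in the given charset.
            raw = Header(value, encoding).encode().encode("ascii")
        self._headers[name.lower()] = raw

    def get_header_bytes(self, name):
        """Return the header exactly as it would appear on the wire."""
        return self._headers[name.lower()]

    def get_header_string(self, name):
        """Return the header decoded to a str."""
        raw = self._headers[name.lower()].decode("ascii")
        return str(make_header(decode_header(raw)))
```

Storing bytes and decoding on demand is one design choice among several; caching the decoded form would be equally plausible.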

Aahz

2:52 a.m.

On Thu, Apr 09, 2009, Barry Warsaw wrote:

...

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

Let's make that the raw bytes by default -- we can add a parameter to Message() to specify that the default where possible is unicode for returned values, if that isn't too painful. Here's my reasoning: ultimately, everyone NEEDS to understand that the underlying transport for e-mail is bytes (similar to sockets). We do people no favors by pasting over this too much. We can overlay convenience at various points, but except for text payloads, everything should be bytes by default. Even for text payloads, I'm not entirely certain the default shouldn't be bytes: consider an HTML attachment that you want to compare against the output from a webserver. Still, as long as it's easy to get bytes for text payloads, I think overall I'm still leaning toward unicode for them. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Why is this newsgroup different from all other newsgroups?

Barry Warsaw

3:05 a.m.

On Apr 9, 2009, at 10:52 PM, Aahz wrote:

...

On Thu, Apr 09, 2009, Barry Warsaw wrote:

...
So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

Let's make that the raw bytes by default -- we can add a parameter to Message() to specify that the default where possible is unicode for returned values, if that isn't too painful.

I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first. -Barry

Nick Coghlan

3:21 a.m.

Barry Warsaw wrote:

...

I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first.

Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one. So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
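The layering described here (bytes at the bottom, text on top, a codec in between) can be sketched with the io module. This assumes the json API as it exists in current Python 3, where `json.dump()` writes str to a text file object; that is an assumption relative to the in-progress 3.1 code under discussion. The `BytesIO` buffer stands in for a socket or binary file.

```python
import io
import json

# Bottom of the stack: a binary buffer standing in for a socket or file.
raw = io.BytesIO()

# Text layer on top, with the codec named explicitly.
text = io.TextIOWrapper(raw, encoding="utf-8")
json.dump({"name": "L\u00f6wis"}, text, ensure_ascii=False)
text.flush()  # push buffered text down to the bytes layer

wire_bytes = raw.getvalue()

# Reading reverses the layering: bytes in, decoded text handed to json.load().
reader = io.TextIOWrapper(io.BytesIO(wire_bytes), encoding="utf-8")
decoded = json.load(reader)
```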

Barry Warsaw

3:23 a.m.

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:

...

Barry Warsaw wrote:

...
I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first.

Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one.

Agreed!

...

So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs).

Yes, that's a very interesting (and proven?) model. I don't quite see how we could apply that to email and json, but it seems like there's a good idea there. ;) -Barry

glyph@divmod.com

5:28 a.m.

On 03:21 am, ncoghlan@gmail.com wrote:

...

Barry Warsaw wrote:

...

...
I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first.

...

Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one.

I wish I could agree, but JSON isn't really a wire protocol. According to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the serialization of structured data". There are some notes about encoding, but it is very clearly described in terms of unicode code points.

...

So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs).

In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets.

Nick Coghlan

8:40 a.m.

glyph@divmod.com wrote:

...

On 03:21 am, ncoghlan@gmail.com wrote:

...
Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one.

I wish I could agree, but JSON isn't really a wire protocol. According to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the serialization of structured data". There are some notes about encoding, but it is very clearly described in terms of unicode code points.

Ah, my apologies - if the RFC defines things such that the native format is Unicode, then yes, the appropriate Python 3.x data type for the base implementation would indeed be strings. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Paul Moore

11:53 a.m.

2009/4/10 Nick Coghlan <ncoghlan@gmail.com>:

...

glyph@divmod.com wrote:

...
On 03:21 am, ncoghlan@gmail.com wrote:

...
Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one.

I wish I could agree, but JSON isn't really a wire protocol. According to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the serialization of structured data". There are some notes about encoding, but it is very clearly described in terms of unicode code points.

Ah, my apologies - if the RFC defines things such that the native format is Unicode, then yes, the appropriate Python 3.x data type for the base implementation would indeed be strings.

Indeed, the RFC seems to clearly imply that loads should take a Unicode string, dumps should produce one, and load/dump should work in terms of text files (not byte files). On the other hand, further down in the document:

"""
3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
"""

This is at best confused (in my utterly non-expert opinion :-)) as Unicode isn't an encoding... I would guess that what the RFC is trying to say is that JSON is text (Unicode) and where a byte stream purporting to be JSON is encountered without a defined encoding, this is how to guess one.

That implies that loads can/should also allow bytes as input, applying the given algorithm to guess an encoding. And similarly load can/should accept a byte stream, on the same basis. (There's no need to allow the possibility of accepting bytes plus an encoding - in that case the user should decode the bytes before passing Unicode to the JSON module).

An alternative might be for the JSON module to register a special encoding ('JSON-guess'?) which captures the rules here. Then there's no need for special bytes parameter handling.

Of course, this is all from a native English speaker, who therefore has no idea of the real life issues involved in Unicode :-)

Paul.
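The null-pattern rule quoted from section 3 of the RFC can be sketched as follows. The function names `detect_json_encoding` and `loads_bytes` are hypothetical, and the sketch assumes what the RFC assumes: the text starts with two ASCII characters and carries no byte order mark. It is an illustration of the algorithm, not a description of what the json module did at the time.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the UTF flavour of a JSON octet stream from its first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Fewer than four octets cannot be UTF-16 or UTF-32 with two characters.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


def loads_bytes(data: bytes):
    """Decode with the detected encoding, then parse as text."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

Keeping detection in a separate function means the parser proper still only ever sees str, which is compatible with either side of the argument.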

Stephen J. Turnbull

3:38 p.m.

Paul Moore writes:

...

On the other hand, further down in the document:

""" 3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets. """

This is at best confused (in my utterly non-expert opinion :-)) as Unicode isn't an encoding...

The word "encoding" (by itself) does not have a standard definition AFAIK. However, since Unicode *is* a "coded character set" (plus a bunch of hairy usage rules), there's nothing wrong with saying "text is encoded in Unicode". The RFC 2130 and Unicode TR#17 taxonomies are annoyingly verbose and pedantic to say the least. So what is being said there (in UTR#17 terminology) is

(1) JSON is *text*, that is, a sequence of characters.
(2) The abstract repertoire and coded character set are defined by the Unicode standard.
(3) The default transfer encoding syntax is UTF-8.

...

That implies that loads can/should also allow bytes as input, applying the given algorithm to guess an encoding.

It's not a guess, unless the data stream is corrupt---or nonconforming.

But it should not be the JSON package's responsibility to deal with corruption or non-conformance (eg, ISO-8859-15-encoded programs). That's the whole point of specifying the coded character set in the standard in the first place. I think it's a bad idea for any of the core JSON API to accept or produce bytes in any language that provides a Unicode string type.

That doesn't mean Python's module shouldn't provide convenience functions to read and write JSON serialized as UTF-8 (in fact, that *should* be done, IMO) and/or other UTFs (I'm not so happy about that). But those who write programs using them should not report bugs until they've checked out and eliminated the possibility of an encoding screwup!

Bob Ippolito

3:55 p.m.

On Fri, Apr 10, 2009 at 8:38 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:

...

Paul Moore writes:

On the other hand, further down in the document:

"""
3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
"""

This is at best confused (in my utterly non-expert opinion :-)) as Unicode isn't an encoding...

The word "encoding" (by itself) does not have a standard definition AFAIK. However, since Unicode *is* a "coded character set" (plus a bunch of hairy usage rules), there's nothing wrong with saying "text is encoded in Unicode". The RFC 2130 and Unicode TR#17 taxonomies are annoyingly verbose and pedantic to say the least.

So what is being said there (in UTR#17 terminology) is

(1) JSON is *text*, that is, a sequence of characters.
(2) The abstract repertoire and coded character set are defined by the Unicode standard.
(3) The default transfer encoding syntax is UTF-8.

That implies that loads can/should also allow bytes as input, applying the given algorithm to guess an encoding.

It's not a guess, unless the data stream is corrupt---or nonconforming.

But it should not be the JSON package's responsibility to deal with corruption or non-conformance (eg, ISO-8859-15-encoded programs). That's the whole point of specifying the coded character set in the standard in the first place. I think it's a bad idea for any of the core JSON API to accept or produce bytes in any language that provides a Unicode string type.

That doesn't mean Python's module shouldn't provide convenience functions to read and write JSON serialized as UTF-8 (in fact, that *should* be done, IMO) and/or other UTFs (I'm not so happy about that). But those who write programs using them should not report bugs until they've checked out and eliminated the possibility of an encoding screwup!

The current implementation doesn't do any encoding guesswork and I have no intention to allow that as a feature. The input must be unicode, UTF-8 bytes, or an encoding must be specified. Personally, most of my experience with JSON is as a wire protocol and thus bytes, so the obvious function to encode json should do that. There probably should be another function to get unicode output, but nobody has ever asked for that in the Python 2.x version. They either want the default behavior (encoding as ASCII str which can be used as unicode due to implementation details of Python 2.x) or encoding as a more compact UTF-8 str (without escaping non-ASCII code points). Perhaps Python 3 users would ask for a unicode output when decoding though. -bob
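The two output styles mentioned above (ASCII with escapes versus compact UTF-8) can be illustrated with the `ensure_ascii` flag. This sketch uses the json module as it exists in current Python 3, which is an assumption: the message is describing the 2.x behaviour, where the result was a byte str rather than text.

```python
import json

data = {"city": "Z\u00fcrich"}

# Default behaviour: non-ASCII code points are escaped, so the result is
# pure ASCII and safe to encode with any ASCII-compatible codec.
ascii_text = json.dumps(data)

# Compact alternative: keep the code points and encode to UTF-8 explicitly.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

# Both forms decode back to the same object.
same = json.loads(ascii_text) == json.loads(utf8_bytes.decode("utf-8"))
```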

"Martin v. Löwis"

4:11 p.m.

...

(3) The default transfer encoding syntax is UTF-8.

Notice that the RFC is partially irrelevant. It only applies to the application/json mime type, and JSON is used in various other protocols, using various other encodings.

...

I think it's a bad idea for any of the core JSON API to accept or produce bytes in any language that provides a Unicode string type.

So how do you integrate the encoding detection that the RFC suggests to be done? Regards, Martin

Stephen J. Turnbull

6:13 p.m.

"Martin v. Löwis" writes:

...

...
(3) The default transfer encoding syntax is UTF-8.

Notice that the RFC is partially irrelevant. It only applies to the application/json mime type, and JSON is used in various other protocols, using various other encodings.

Sure. That's their problem. In Python, Unicode is the native encoding, and we have codecs to deal with the outside world, no? That happens to match very well not only with RFC 4627, but the sidebar on json.org that defines JSON.

...

...
I think it's a bad idea for any of the core JSON API to accept or produce bytes in any language that provides a Unicode string type.

So how do you integrate the encoding detection that the RFC suggests to be done?

I suggest you don't. That's mission creep. Think about writing tests for it, and remember that out in the wild those "various other encodings" almost certainly include Shift JIS, Big5, and KOI8-R. Both those considerations point to "er, let's delegate detection and en/decoding to the nice folks who maintain the codec suite." Where it's embedded in some other protocol which specifies a TES, the TES can be implemented there, too. As I wrote earlier, I don't see anything wrong with providing a wrapper module that deals with some default/common/easy cases. But I'd stick it in the contrib directory.

Greg Ewing

12:51 a.m.

Paul Moore wrote:

...

3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

This is at best confused (in my utterly non-expert opinion :-)) as Unicode isn't an encoding...

I'm inclined to agree. I'd go further and say that if JSON is really meant to be a text format, the standard has no business mentioning encodings at all. The reason you use a text format in the first place is that you have some way of transmitting text, and you want to send something that isn't text. In that situation, the encoding is already determined by whatever means you're using to send the text. -- Greg

Stephen J. Turnbull

8:35 a.m.

Greg Ewing writes:

...

The reason you use a text format in the first place is that you have some way of transmitting text, and you want to send something that isn't text. In that situation, the encoding is already determined by whatever means you're using to send the text.

Determined, yes, but all too often in a nondeterministic way. That's precisely the problem that the spec is trying to avert. People often schlep "text" around as if that were well-defined, forcing receivers to guess what is meant. Having a spec isn't going to stop them, but at least you can lash them with a wet noodle. The specification of at least the abstract character repertoire and coded character set also allows implementers like Python to proceed confidently with their usual internal encoding.

Antoine Pitrou

11:41 a.m.

<glyph <at> divmod.com> writes:

...

In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets.

Another way to look at it is that JSON is a subset of Javascript, and as such is text rather than bytes. Regards Antoine.

"Martin v. Löwis"

12:55 p.m.

...

...
In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets.

Another way to look at it is that JSON is a subset of Javascript, and as such is text rather than bytes.

I don't think this can be approached from a theoretical point of view. Instead, what matters is how users want to use it. Regards, Martin

Terry Reedy

9:05 p.m.

glyph@divmod.com wrote:

...

On 03:21 am, ncoghlan@gmail.com wrote:

...
Barry Warsaw wrote:

...
...
I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first.

...
Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one.

I wish I could agree, but JSON isn't really a wire protocol. According to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the serialization of structured data". There are some notes about encoding, but it is very clearly described in terms of unicode code points.

...
So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs).

In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets.

What is the 'bytes support' issue for json? Is it about content within a json text? Or about the transport format of a json text?

Reading rfc4627, a json text is a unicode string representation of an instance of one of 6 classes. In Python terms, they are NoneType, bool, numbers (int, float, decimal?), (unicode) str, list, and [string-keyed] dict. The representation is nearly identical to Python's literals and displays. For transport, the encoding SHALL be one of UTF-8, -16LE/BE, -32LE/BE, with UTF-8 the 'default'.

So a json parser (a restricted eval()) tokenizes and parses a stream of unicode chars which in Python could come from either a unicode string or decoded bytes object. The bytes decoding could be either bulk or incremental. Similarly, a json generator (a repr()-like function) produces a stream of unicode chars which again could be optionally encoded to bytes, either incrementally or in bulk.

The standard does not specify any correspondence between representations and domain objects. For Python, making 'null', 'true', and 'false' inter-convert with None, True, False is obvious. Numbers are slightly more problematic. A generator could produce decimal literals from both floats and decimals but without a non-json extension, a parser could only convert back to one, so the other would not round-trip. (Int could be handled by the presence or absence of '.0'.) Similarly, tuples could be represented, like lists, as json square-bracketed arrays, but they would be converted back to lists, not tuples, unless a non-json extension were used.

So the two possible bytes-support content issues I see are how to represent them as legal json strings and/or whether some device should be added to make them round-trip. But as indicated above, these two issues are not unique to bytes.

Terry Jan Reedy
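The round-trip observations above are easy to check against the json module in current Python 3 (an assumption, since the message predates it). The variable names are illustrative only; `parse_float` is a real `json.loads` parameter that selects which type number literals with a fractional part come back as.

```python
import json
from decimal import Decimal

# null/true/false map onto None/True/False.
constants = json.loads("[null, true, false]")

# A tuple is written as a JSON array and comes back as a list.
tuple_round_trip = json.loads(json.dumps((1, 2)))

# Ints and floats are told apart by the presence or absence of '.0'.
int_text, float_text = json.dumps(1), json.dumps(1.0)

# A decimal literal can be parsed back to only one type per call:
# float by default, or Decimal when parse_float is supplied.
as_float = json.loads("1.10")
as_decimal = json.loads("1.10", parse_float=Decimal)
```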

"Martin v. Löwis"

3:06 a.m.

...

...
In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets.

What is the 'bytes support' issue for json? Is it about content within a json text? Or about the transport format of a json text?

The question is whether the json parsing should take bytes or str as input, and whether the json marshalling should produce bytes or str. More specifically, the question is whether it is ok to drop bytes. I personally think that it needs to support bytes, and that perhaps str support is optional (as you could always explicitly encode the str as UTF-8 before passing it to the JSON parser, if you somehow managed to get a str of JSON to parse). However, I really think that this question cannot be answered by reading the RFC. It should be answered by verifying how people use the json library in 2.x.

...

The standard does not specify any correspondence between representations and domain objects

And that is not the issue at all; nobody is debating what output the parsing should produce. Regards, Martin

Mark Hammond

4:36 a.m.

[Dropping email sig] On 11/04/2009 1:06 PM, "Martin v. Löwis" wrote:

...

However, I really think that this question cannot be answered by reading the RFC. It should be answered by verifying how people use the json library in 2.x.

In the absence of anything more formal, here are 2 anecdotes:

* The python-twitter package seems to:
  - Use dumps() mainly to get string objects. It uses it both for __str__, and for an API called 'AsJsonString' - the intent of this seems to be to provide strings for the consumer of the twitter API - it's not clear how such consumers would use them. Note that this API doesn't seem to need to 'write' json objects, else I suspect they would then be expecting dumps to return bytes to put on the wire. They expect loads to accept the bytes they are reading directly off the wire.
* couchdb's wrappers use these functions purely as bytes - they are either decoding an application/json object from the bits they read, or they are encoding it to use directly in the body of a request (or even directly in the URL of the request!)

I find myself conflicted. On one hand I believe the most common use of json will be to exchange data with something inherently byte-based. On the other hand though, json itself seems to be naturally "stringy" and the most natural interface for a casual user would be strings.

I'm personally leaning slightly towards strings, putting the burden on bytes-users of json to explicitly use the appropriate encoding, even in cases where it *must* be utf8. On the other hand, I'm too lazy to dig back through this large thread, but I seem to recall a suggestion that using bytes would be significantly faster. If that is true, I'd be happy to settle for bytes as I believe the most common *actual* use of json will be via things like the twitter and couch libraries - and may even be a key bottleneck for such libraries - so people will not be directly exposed to its interface...

Cheers, Mark

"Martin v. Löwis"

5:49 a.m.

...

I'm personally leaning slightly towards strings, putting the burden on bytes-users of json to explicitly use the appropriate encoding, even in cases where it *must* be utf8. On the other hand, I'm too lazy to dig back through this large thread, but I seem to recall a suggestion that using bytes would be significantly faster.

Not sure whether it would be *significantly* faster, but yes, Bob wrote an accelerator for parsing out of a byte string to make it really fast; IIRC, he claims that it is faster than pickling. Regards, Martin

Antoine Pitrou

8:12 a.m.

Martin v. Löwis <martin <at> v.loewis.de> writes:

...

Not sure whether it would be *significantly* faster, but yes, Bob wrote an accelerator for parsing out of a byte string to make it really fast; IIRC, he claims that it is faster than pickling.

Isn't premature optimization the root of all evil? Besides, the fact that many values in a typical JSON object will be strings, and must be encoded from/decoded to unicode objects in py3k, suggests that accepting/outputting unicode as default is the laziest (i.e. the best) choice performance-wise. But you don't have to trust me: look at the quick numbers I've posted. The py3k version (in the str-only incarnation I've proposed) is sometimes actually faster than the trunk version: http://mail.python.org/pipermail/python-dev/2009-April/088498.html Regards Antoine.

Mark Hammond

2:29 a.m.

On 11/04/2009 6:12 PM, Antoine Pitrou wrote:

...

Martin v. Löwis <martin <at> v.loewis.de> writes:

...
Not sure whether it would be *significantly* faster, but yes, Bob wrote an accelerator for parsing out of a byte string to make it really fast; IIRC, he claims that it is faster than pickling.

Isn't premature optimization the root of all evil?

Besides, the fact that many values in a typical JSON object will be strings, and must be encoded from/decoded to unicode objects in py3k, suggests that accepting/outputting unicode as default is the laziest (i.e. the best) choice performance-wise.

I don't see it as premature optimization, but rather trying to ensure the interface/api best suits the actual use cases.

...

But you don't have to trust me: look at the quick numbers I've posted. The py3k version (in the str-only incarnation I've proposed) is sometimes actually faster than the trunk version: http://mail.python.org/pipermail/python-dev/2009-April/088498.html

But if all *actual* use-cases involve moving to and from utf8 encoded bytes, I'm not sure that little example is particularly useful. In those use-cases, I'd be surprised if there weren't significant time and space benefits in not asking apps to use an 'intermediate' string object before getting the bytes they need, particularly when the payload may be a significant size. Assuming the above is all true, I'd see choosing bytes less as a premature optimization and more a design choice which best supports actual use. So to my mind the only real question is whether the above *is* true, or if there are common use-cases which don't involve utf8-off/on-the-wire... Cheers, Mark

Daniel Stutzbach

4:11 p.m.

On Fri, Apr 10, 2009 at 10:06 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

However, I really think that this question cannot be answered by reading the RFC. It should be answered by verifying how people use the json library in 2.x.

I use the json module in 2.6 to communicate with a C# JSON library and a JavaScript JSON library. The C# and JavaScript libraries produce and consume the equivalent of str, not the equivalent of bytes. Yes, the data eventually has to go over a socket as bytes, but that's often handled by a different layer of code. For JavaScript, data is typically received via XMLHttpRequest(), which automatically figures out the encoding from the HTTP headers and/or other information (defaulting to UTF-8) and returns a str-like object that I pass to the JavaScript JSON library. For C#, I wrap the socket in a StreamReader object, which decodes the byte stream into a string stream (similar to Python's new TextIOWrapper class). Hope that helps, -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

"Martin v. Löwis"

5:19 p.m.

...

I use the json module in 2.6 to communicate with a C# JSON library and a JavaScript JSON library. The C# and JavaScript libraries produce and consume the equivalent of str, not the equivalent of bytes.

I assume there is a TCP connection between the json module and the C#/JavaScript libraries? If so, it doesn't matter what representation these implementations chose to use.

...

Hope that helps,

Maybe I misunderstood, and you are *not* communicating over the wire. In this case, I'm puzzled how you get the data from Python to the C# JSON library, or to the JavaScript library. Regards, Martin

Daniel Stutzbach

6:42 p.m.

On Mon, Apr 13, 2009 at 12:19 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

...
I use the json module in 2.6 to communicate with a C# JSON library and a JavaScript JSON library. The C# and JavaScript libraries produce and consume the equivalent of str, not the equivalent of bytes.

I assume there is a TCP connection between the json module and the C#/JavaScript libraries?

Yes, there's a TCP connection. Sorry for not making that clear to begin with. I also sometimes store JSON objects in a database. In that case, I pass strings to the database API which stores them in a TEXT field. Obviously somewhere they get encoded to bytes, but that's handled by the database.

...

If so, it doesn't matter what representation these implementations chose to use.

True, I can always convert from bytes to str or vice versa. Sometimes it is illustrative to see how others have chosen to solve the same problem. The JSON specification and other implementations serialize an object to a string. Python's json.dumps() needs to either return a str or let the user specify an encoding. At least one of these two needs to work:

json.dumps({}).encode('utf-16le') # dumps() returns str '{\x00}\x00'

json.dumps({}, encoding='utf-16le') # dumps() returns bytes '{\x00}\x00'

In 2.6, the first one works. The second incorrectly returns '{}'.

-- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
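For reference, the first form (dumps() returns str, the caller encodes) looks like this in current Python 3. That is an assumption relative to the 2.6 behaviour reported above. The last line relies on `json.loads()` accepting UTF-8, UTF-16 or UTF-32 encoded bytes, which holds on Python 3.6 and later.

```python
import json

# dumps() produces text; choosing the wire encoding is a separate, explicit step.
encoded = json.dumps({}).encode("utf-16-le")

# The reverse direction: decode explicitly, then parse the text.
decoded = json.loads(encoded.decode("utf-16-le"))

# On Python 3.6+, loads() also accepts the encoded bytes directly.
decoded_from_bytes = json.loads(encoded)
```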

"Martin v. Löwis"

8:02 p.m.

...

Yes, there's a TCP connection. Sorry for not making that clear to begin with.

If so, it doesn't matter what representation these implementations chose to use.

True, I can always convert from bytes to str or vice versa.

I think you are missing the point. It will not be necessary to convert. You can write the JSON into the TCP connection in Python, and it will come out just fine as strings in C# and JavaScript. This is how middleware works - it abstracts from programming languages, and allows for different representations in different languages, in a manner invisible to the participating processes.

...

At least one of these two needs to work:

json.dumps({}).encode('utf-16le') # dumps() returns str '{\x00}\x00'

json.dumps({}, encoding='utf-16le') # dumps() returns bytes '{\x00}\x00'

In 2.6, the first one works. The second incorrectly returns '{}'.

Ok, that might be a bug in the JSON implementation - but you shouldn't be using utf-16le, anyway. Use UTF-8 always, and it will work fine.

The question is which of them is more appropriate if what you want is bytes. I argue that the second form is better, since it saves you an encode invocation.

Regards, Martin

Bob Ippolito

8:28 p.m.

On Mon, Apr 13, 2009 at 1:02 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

...
Yes, there's a TCP connection. Sorry for not making that clear to begin with.

If so, it doesn't matter what representation these implementations chose to use.

True, I can always convert from bytes to str or vice versa.

I think you are missing the point. It will not be necessary to convert. You can write the JSON into the TCP connection in Python, and it will come out just fine as strings in C# and JavaScript. This is how middleware works - it abstracts from programming languages, and allows for different representations in different languages, in a manner invisible to the participating processes.

...
At least one of these two needs to work:

json.dumps({}).encode('utf-16le') # dumps() returns str '{\x00}\x00'

json.dumps({}, encoding='utf-16le') # dumps() returns bytes '{\x00}\x00'

In 2.6, the first one works. The second incorrectly returns '{}'.

Ok, that might be a bug in the JSON implementation - but you shouldn't be using utf-16le, anyway. Use UTF-8 always, and it will work fine.

The question is which of them is more appropriate if what you want is bytes. I argue that the second form is better, since it saves you an encode invocation.

It's not a bug in dumps, it's a matter of not reading the documentation. The encoding parameter of dumps decides how byte strings should be interpreted, not what the output encoding is. The output of json/simplejson dumps for Python 2.x is either an ASCII bytestring (default) or a unicode string (when ensure_ascii=False). This is very practical in 2.x because an ASCII bytestring can be treated as either text or bytes in most situations, isn't going to get mangled over any kind of encoding mismatch (as long as it's an ASCII superset), and skips an encoding step if getting sent over the wire.

...

...
...
>>> simplejson.dumps(['\x00f\x00o\x00o'], encoding='utf-16be')
'["foo"]'
>>> simplejson.dumps(['\x00f\x00o\x00o'], encoding='utf-16be', ensure_ascii=False)
u'["foo"]'

-bob
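The "ASCII superset" point can be illustrated with a small sketch. Note the assumption: this is Python 3, where both results are str, so it shows the escaping behaviour of ensure_ascii rather than the 2.x bytestring/unicode split described above.

```python
import json

data = ["caf\u00e9"]

# Default (ensure_ascii=True): non-ASCII characters are escaped,
# so the result contains only ASCII code points.
escaped = json.dumps(data)

# ensure_ascii=False keeps the character itself in the output text.
literal = json.dumps(data, ensure_ascii=False)

# Any ASCII-superset codec produces identical bytes for the escaped form.
same_bytes = (
    escaped.encode("ascii")
    == escaped.encode("utf-8")
    == escaped.encode("latin-1")
)
```

Because the escaped form is pure ASCII, a receiver that guesses the wrong ASCII-compatible charset still reads the same JSON text.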

Daniel Stutzbach

8:32 p.m.

On Mon, Apr 13, 2009 at 3:28 PM, Bob Ippolito <bob@redivi.com> wrote:

...

It's not a bug in dumps, it's a matter of not reading the documentation. The encoding parameter of dumps decides how byte strings should be interpreted, not what the output encoding is.

You're right; I apologize for not reading more closely. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Antoine Pitrou

11:58 p.m.

Bob Ippolito <bob <at> redivi.com> writes:

...

The output of json/simplejson dumps for Python 2.x is either an ASCII bytestring (default) or a unicode string (when ensure_ascii=False). This is very practical in 2.x because an ASCII bytestring can be treated as either text or bytes in most situations, isn't going to get mangled over any kind of encoding mismatch (as long as it's an ASCII superset), and skips an encoding step if getting sent over the wire.

Which means that the json module already deals with text rather than bytes, apart from the optimization that pure ASCII text is returned as 8-bit strings. Regards Antoine.

Daniel Stutzbach

9:25 p.m.

On Mon, Apr 13, 2009 at 3:02 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

...
True, I can always convert from bytes to str or vice versa.

I think you are missing the point. It will not be necessary to convert.

Sometimes I want bytes and sometimes I want str. I am going to be converting some of the time. ;-)

Below is a basic CGI application that assumes that the json module works with str, not bytes. How would you write it if the json module does not support returning a str?

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.read())
output_object = do_some_work(input_object)
print(json.dumps(output_object))
print()

...

The question is which of them is more appropriate if what you want is bytes. I argue that the second form is better, since it saves you an encode invocation.

If what you want is bytes, encoding has to happen somewhere. If the json module has some optimizations to do the encoding at the same time as the serialization, great. However, based on the original post of this thread, it sounds like that code doesn't exist or doesn't work correctly. What's the benefit of preventing users from getting a str out if that's what they want? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Alexandre Vassalotti

11:44 p.m.

On Mon, Apr 13, 2009 at 5:25 PM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:

...

On Mon, Apr 13, 2009 at 3:02 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...
...
True, I can always convert from bytes to str or vice versa.

I think you are missing the point. It will not be necessary to convert.

Sometimes I want bytes and sometimes I want str. I am going to be converting some of the time. ;-)

Below is a basic CGI application that assumes that json module works with str, not bytes. How would you write it if the json module does not support returning a str?

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.read())
output_object = do_some_work(input_object)
print(json.dumps(output_object))
print()

Like this?

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.buffer.read())
output_object = do_some_work(input_object)
sys.stdout.buffer.write(json.dumps(output_object))

-- Alexandre

Greg Ewing

midnight

Alexandre Vassalotti wrote:

...

...
print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.read())
output_object = do_some_work(input_object)
print(json.dumps(output_object))
print()

That assumes the encoding being used by stdout has ascii as a subset. -- Greg

"Martin v. Löwis"

2:40 a.m.

...

Below is a basic CGI application that assumes that json module works with str, not bytes. How would you write it if the json module does not support returning a str?

In a CGI application, you shouldn't be using sys.stdin or print(). Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw), and sys.stdout.buffer.raw. A CGI script essentially does binary IO; if you use TextIO, there likely will be bugs (e.g. if you have attachments of type application/octet-stream).

...

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.read())
output_object = do_some_work(input_object)
print(json.dumps(output_object))
print()

out = sys.stdout.buffer.raw
out.write(b"Content-Type: application/json; charset=utf-8\n\n")
input_object = json.loads(sys.stdin.buffer.raw.read())
output_object = do_some_work(input_object)
out.write(json.dumps(output_object))

...

What's the benefit of preventing users from getting a str out if that's what they want?

If they really want it, there is no benefit from preventing them. I'm just puzzled why they want it, and what possible applications might be where they want it. Perhaps they misunderstand something when they think they want it. Regards, Martin
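For comparison, here is how the binary-IO CGI handler might look if json only deals in str, as proposed at the top of the thread. It is a sketch, not anyone's posted code: handle_request and the stand-in do_some_work are hypothetical names, UTF-8 is assumed for the request body, and in-memory streams stand in for the real ones.

```python
import io
import json


def do_some_work(obj):
    # Hypothetical placeholder for the application logic.
    return {"echo": obj}


def handle_request(stdin_bytes, stdout_bytes):
    """Read a UTF-8 JSON request body and write a JSON response as bytes."""
    request = json.loads(stdin_bytes.read().decode("utf-8"))
    body = json.dumps(do_some_work(request)).encode("utf-8")
    stdout_bytes.write(b"Content-Type: application/json\n\n")
    stdout_bytes.write(body)


# In-memory binary streams stand in for the CGI input and output.
fake_in = io.BytesIO(b'{"n": 1}')
fake_out = io.BytesIO()
handle_request(fake_in, fake_out)
```

A real script would presumably pass sys.stdin.buffer and sys.stdout.buffer instead of the BytesIO objects; the only extra cost over a bytes-returning dumps() is the explicit decode and encode.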

Lino Mastrodomenico

8:54 a.m.

2009/4/13 Daniel Stutzbach <daniel@stutzbachenterprises.com>:

...

print("Content-Type: application/json; charset=utf-8")

Please don't do that! According to RFC 4627 the "charset" parameter is not allowed for the application/json media type. Just use "Content-Type: application/json", the charset is only misleading because even if you specify, e.g., ISO-8859-1 a standard-compliant receiver will probably still try to interpret the content as UTF-8/16/32. OTOH a charset can be used if you send JSON with an application/javascript MIME type. -- Lino Mastrodomenico
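The reason a receiver can ignore the charset is that RFC 4627 (section 3) lets it infer the encoding from the pattern of zero bytes in the first four octets. A rough sketch of that table follows; guess_json_encoding and loads_bytes are hypothetical helper names, byte order marks are not handled, and inputs shorter than four bytes are simply assumed to be UTF-8.

```python
import json


def guess_json_encoding(data):
    """Guess the encoding of JSON bytes from their null-byte pattern."""
    head = data[:4]
    if len(head) < 4:
        return "utf-8"  # assumption: treat very short inputs as UTF-8
    nulls = tuple(byte == 0 for byte in head)
    if nulls == (True, True, True, False):
        return "utf-32-be"
    if nulls == (True, False, True, False):
        return "utf-16-be"
    if nulls == (False, True, True, True):
        return "utf-32-le"
    if nulls == (False, True, False, True):
        return "utf-16-le"
    return "utf-8"


def loads_bytes(data):
    """Decode JSON bytes using the guessed encoding, then parse the text."""
    return json.loads(data.decode(guess_json_encoding(data)))
```

The detection works because the first two characters of an RFC 4627 JSON text are always ASCII.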

Tony Nelson

3:41 a.m.

New subject: [Email-SIG] Dropping bytes "support" in json

At 22:38 -0400 04/09/2009, Barry Warsaw wrote: ...

...

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

That's an easy one: Subject: is an unstructured header, so it must be text, thus Unicode. We're looking at a high-level representation of an email message, with parsed header fields and a MIME message tree.

...

Okay, so you've picked one. Now how do you spell the other way?

message.get_header_bytes('Subject')

Oh, I see that's what you picked.

...

The Message class probably has these explicit methods:

...
...
...
Message.get_header_bytes('Subject')
Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice?

Structured header fields are more of a problem. Any header with addresses should return a list of addresses. I think the default return type should depend on the data type. To get an explicit bytes or string or list of addresses, be explicit; otherwise, for convenience, return the appropriate type for the particular header field name.

...

Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

Never for header fields. The default is always RFC 2047, unless it isn't, say for params. The Message class should create an object of the appropriate subclass of Header based on the name (or use the existing object, see other discussion), and that should inspect its argument and DTRT or complain.

...

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

One of those maps to

...
...
...
message['Subject'] = ???

The expected data type should depend on the header field. For Subject:, it should be bytes to be parsed or verbatim text. For To:, it should be a list of addresses or bytes or text to be parsed. The email package should be pythonic, and not require deep understanding of dozens of RFCs to use properly. Users don't need to know about the raw bytes; that's the whole point of MIME and any email package. It should be easy to set header fields with their natural data types, and doing it with bad data should produce an error. This may require a bit more care in the message parser, to always produce a parsed message with defects. -- ____________________________________________________________________ TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

Barry Warsaw

5:08 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 9, 2009, at 11:41 PM, Tony Nelson wrote:

...

At 22:38 -0400 04/09/2009, Barry Warsaw wrote: ...

...
So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

That's an easy one: Subject: is an unstructured header, so it must be text, thus Unicode. We're looking at a high-level representation of an email message, with parsed header fields and a MIME message tree.

I'm liking Glyph's suggestion here. We'll probably have to support the message['Subject'] API for backward compatibility, but in that case it really should be a bytes API.

...

...
(or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice?

Structured header fields are more of a problem. Any header with addresses should return a list of addresses. I think the default return type should depend on the data type. To get an explicit bytes or string or list of addresses, be explicit; otherwise, for convenience, return the appropriate type for the particular header field name.

Yes, structured headers are trickier. In a separate message, James Knight makes some excellent points, which I agree with. However the email package obviously cannot support every type of structured header possible. It must support this through extensibility.

The obvious way is through inheritance (i.e. subclasses of Header), but in my experience, using inheritance of the Message class really doesn't work very well. You need to pass around factories to parsing functions and your application tends to have its own hierarchy of subclasses for whatever extra things it needs. ISTM that subclassing is simply not the right pattern to support extensibility in the Message objects or Header objects. Yes, this leads me to think that all the MIME* subclasses are essentially /wrong/.

Having said all that, the email package must support structured headers. Look at the insanity which is the current folding whitespace splitting and the impossibility of the current code to do the right thing for say Subject headers and Received headers, and you begin to see why it must be possible to extend this stuff.

...

...
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

Never for header fields. The default is always RFC 2047, unless it isn't, say for params.

The Message class should create an object of the appropriate subclass of Header based on the name (or use the existing object, see other discussion), and that should inspect its argument and DTRT or complain.

...

...
...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

One of those maps to

...
...
...
message['Subject'] = ???

The expected data type should depend on the header field. For Subject:, it should be bytes to be parsed or verbatim text. For To:, it should be a list of addresses or bytes or text to be parsed.

At a higher level, yes. At the low level, it has to be bytes.

...

The email package should be pythonic, and not require deep understanding of dozens of RFCs to use properly. Users don't need to know about the raw bytes; that's the whole point of MIME and any email package. It should be easy to set header fields with their natural data types, and doing it with bad data should produce an error. This may require a bit more care in the message parser, to always produce a parsed message with defects.

I agree that we should have some higher level APIs that make it easy to compose email messages, and probably easy-ish to parse a byte stream into an email message tree. But we can't build those without the lower level raw support. I'm also convinced that this lower level will be the domain of those crazy enough to have the RFCs tattooed to the back of their eyelids. -Barry

glyph＠divmod.com

5:19 a.m.

On 02:38 am, barry@python.org wrote:

...

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

My personal preference would be to just deprecate this API and get rid of it, replacing it with a slightly more explicit one.

message.headers['Subject']
message.bytes_headers['Subject']

...

Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

message.headers['Subject'] = 'Some text'

should be equivalent to

message.headers['Subject'] = Header('Some text')

My preference would be that

message.headers['Subject'] = b'Some Bytes'

would simply raise an exception. If you've got some bytes, you should instead do

message.bytes_headers['Subject'] = b'Some Bytes'

or

message.headers['Subject'] = Header(bytes=b'Some Bytes', encoding='utf-8')

Explicit is better than implicit, right?
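A toy version of this split makes the intended type strictness concrete. Everything here is hypothetical (TextHeaders, BytesHeaders and this Message class are not part of the email package), and for brevity the two mappings are independent stores, whereas a real design would derive one view from the other.

```python
class TextHeaders(dict):
    """Hypothetical text-only header mapping; rejects bytes values."""

    def __setitem__(self, name, value):
        if isinstance(value, (bytes, bytearray)):
            raise TypeError("use bytes_headers for raw bytes")
        super().__setitem__(name, value)


class BytesHeaders(dict):
    """Hypothetical bytes-only header mapping; rejects text values."""

    def __setitem__(self, name, value):
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError("use headers for text")
        super().__setitem__(name, bytes(value))


class Message:
    """Hypothetical message exposing the two explicit views."""

    def __init__(self):
        self.headers = TextHeaders()
        self.bytes_headers = BytesHeaders()


message = Message()
message.headers["Subject"] = "Some text"
message.bytes_headers["Subject"] = b"Some Bytes"
try:
    message.headers["Subject"] = b"Some Bytes"
except TypeError as exc:
    rejected = str(exc)
```

The point of the sketch is only that mixing the two types fails loudly at assignment time instead of being silently coerced.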

Barry Warsaw

4:56 p.m.

On Apr 10, 2009, at 1:19 AM, glyph@divmod.com wrote:

...

On 02:38 am, barry@python.org wrote:

...
So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

My personal preference would be to just deprecate this API and get rid of it, replacing it with a slightly more explicit one.

message.headers['Subject']
message.bytes_headers['Subject']

This is pretty darn clever Glyph. Stop that! :) I'm not 100% sure I like the name .bytes_headers or that .headers should be the decoded header (rather than have .headers return the bytes thingie and say .decoded_headers return the decoded thingies), but I do like the general approach.

...

...
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

message.headers['Subject'] = 'Some text'

should be equivalent to

message.headers['Subject'] = Header('Some text')

Yes, absolutely. I think we're all in general agreement that header values should be instances of Header, or subclasses thereof.

...

My preference would be that

message.headers['Subject'] = b'Some Bytes'

would simply raise an exception. If you've got some bytes, you should instead do

message.bytes_headers['Subject'] = b'Some Bytes'

or

message.headers['Subject'] = Header(bytes=b'Some Bytes', encoding='utf-8')

Explicit is better than implicit, right?

Yes. Again, I really like the general idea, if I might quibble about some of the details. Thanks for a great suggestion. -Barry

Glenn Linderman

6 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On approximately 4/10/2009 9:56 AM, came the following characters from the keyboard of Barry Warsaw:

...

On Apr 10, 2009, at 1:19 AM, glyph@divmod.com wrote:

...
On 02:38 am, barry@python.org wrote:

...
So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

My personal preference would be to just deprecate this API and get rid of it, replacing it with a slightly more explicit one.

message.headers['Subject']
message.bytes_headers['Subject']

This is pretty darn clever Glyph. Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers should be the decoded header (rather than have .headers return the bytes thingie and say .decoded_headers return the decoded thingies), but I do like the general approach.

If one name has to be longer than the other, it should be the bytes version. Real user code is more likely to want to use the text version, and hopefully there will be more of that type of code than implementations using bytes. Of course, one could use message.header and message.bythdr and they'd be the same length. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Michael Foord

6:06 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Glenn Linderman wrote:

...

On approximately 4/10/2009 9:56 AM, came the following characters from the keyboard of Barry Warsaw:

...
On Apr 10, 2009, at 1:19 AM, glyph@divmod.com wrote:

...
On 02:38 am, barry@python.org wrote:

...
So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
message['Subject']

The raw bytes or the decoded unicode?

My personal preference would be to just deprecate this API and get rid of it, replacing it with a slightly more explicit one.

message.headers['Subject']
message.bytes_headers['Subject']

This is pretty darn clever Glyph. Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers should be the decoded header (rather than have .headers return the bytes thingie and say .decoded_headers return the decoded thingies), but I do like the general approach.

If one name has to be longer than the other, it should be the bytes version. Real user code is more likely to want to use the text version, and hopefully there will be more of that type of code than implementations using bytes.

Of course, one could use message.header and message.bythdr and they'd be the same length.

Shouldn't headers always be text? Michael -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog

Barry Warsaw

6:55 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:

...

Shouldn't headers always be text?

/me weeps

Aahz

7:05 p.m.

On Fri, Apr 10, 2009, Barry Warsaw wrote:

...

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:

...
Shouldn't headers always be text?

/me weeps

/me hands Barry a hankie -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Why is this newsgroup different from all other newsgroups?

Barry Warsaw

6:55 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

...

If one name has to be longer than the other, it should be the bytes version. Real user code is more likely to want to use the text version, and hopefully there will be more of that type of code than implementations using bytes.

I'm not sure we know that yet, actually. Nothing written for Python 2 counts, and email is too broken in 3 for any sane person to be writing such code for Python 3.

...

Of course, one could use message.header and message.bythdr and they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index 'by' it. :) -Barry

Nick Coghlan

7:09 a.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Barry Warsaw wrote:

...

...
Of course, one could use message.header and message.bythdr and they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index 'by' it. :)

In the discussions about os.environ, the suggested approach was to just tack a 'b' onto the end of the name to get the bytes version (i.e. os.environb). That aligns nicely with the b"" prefix for bytes literals, and isn't much of a typing or reading burden when dealing with the bytes API instead of the text one. A similar naming scheme (i.e. msg.headers and msg.headersb) would probably work for email as well. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Chris Withers

12:41 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Nick Coghlan wrote:

...

Barry Warsaw wrote:

...
...
Of course, one could use message.header and message.bythdr and they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index 'by' it. :)

In the discussions about os.environ, the suggested approach was to just tack a 'b' onto the end of the name to get the bytes version (i.e. os.environb).

That aligns nicely with the b"" prefix for bytes literals, and isn't much of a typing or reading burden when dealing with the bytes API instead of the text one.

A similar naming scheme (i.e. msg.headers and msg.headersb) would probably work for email as well.

That just feels nasty though :-( Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk

Greg Ewing

11:49 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Chris Withers wrote:

...

Nick Coghlan wrote:

...
A similar naming scheme (i.e. msg.headers and msg.headersb) would probably work for email as well.

That just feels nasty though :-(

It does tend to look like a typo to me. Inserting an underscore (headers_b) would make it look less accidental. -- Greg

curtin＠acm.org

12:12 a.m.

New subject: [Email-SIG] Dropping bytes "support" in json

FWIW, that is also the way things are done in the pickle/cPickle module: dump/dumps and load/loads differentiate between the file object and string ways of using that functionality.

On Sat, Apr 11, 2009 at 7:41 AM, Chris Withers <chris@simplistix.co.uk> wrote:

...

Nick Coghlan wrote:

...
Barry Warsaw wrote:

...
Of course, one could use message.header and message.bythdr and they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index 'by' it. :)

In the discussions about os.environ, the suggested approach was to just tack a 'b' onto the end of the name to get the bytes version (i.e. os.environb).

That aligns nicely with the b"" prefix for bytes literals, and isn't much of a typing or reading burden when dealing with the bytes API instead of the text one.

A similar naming scheme (i.e. msg.headers and msg.headersb) would probably work for email as well.

That just feels nasty though :-(

Chris

-- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk


Barry Warsaw

2:14 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

...

If one name has to be longer than the other, it should be the bytes version. Real user code is more likely to want to use the text version, and hopefully there will be more of that type of code than implementations using bytes.

Of course, one could use message.header and message.bythdr and they'd be the same length.

Actually, thinking about this over the weekend, it's much better for message['subject'] to return a Header instance in all cases. Use bytes(header) to get the raw bytes. A good API for getting the parsed and decoded header values needs to take into account that it won't always be a string. For unstructured headers like Subject, str(header) would work just fine. For an Originator or Destination address, what does str(header) return? And what would be the API for getting the set of realname/addresses out of the header? -Barry
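One way to picture the "always return a header object" idea is a small wrapper that keeps the wire bytes and decodes on demand. RawHeader is a hypothetical class invented for this sketch; only decode_header and make_header are existing email.header functions, and the str() conversion is only meaningful for unstructured headers such as Subject.

```python
from email.header import decode_header, make_header


class RawHeader:
    """Hypothetical header value that keeps the bytes as received."""

    def __init__(self, raw):
        self._raw = bytes(raw)

    def __bytes__(self):
        # The raw, still RFC 2047 encoded, wire form.
        return self._raw

    def __str__(self):
        # Decode encoded-words; suitable for unstructured headers only.
        text = self._raw.decode("ascii")
        return str(make_header(decode_header(text)))


subject = RawHeader(b"=?utf-8?q?Caf=C3=A9?=")
raw_value = bytes(subject)
text_value = str(subject)
```

The open question of what str() should return for an address header is not answered by this sketch; it would need a structured accessor instead.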

Greg Ewing

11:28 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

Barry Warsaw wrote:

...

For an Originator or Destination address, what does str(header) return?

It should be an error, I think. -- Greg

R. David Murray

11:46 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

On Tue, 14 Apr 2009 at 11:28, Greg Ewing wrote:

...

Barry Warsaw wrote:

...
For an Originator or Destination address, what does str(header) return?

It should be an error, I think.

That doesn't make sense to me. str(<arbitrary object>) should return _something_. --David

Greg Ewing

11:59 p.m.

New subject: [Email-SIG] Dropping bytes "support" in json

R. David Murray wrote:

...

That doesn't make sense to me. str(<arbitrary object>) should return _something_.

Well, it might return something like "<AddressList object at 0x123456>". But you shouldn't rely on it to give you anything useful for an arbitrary header. -- Greg

Chris Withers

12:46 p.m.

glyph@divmod.com wrote:

...

My preference would be that

message.headers['Subject'] = b'Some Bytes'

would simply raise an exception. If you've got some bytes, you should instead do

message.bytes_headers['Subject'] = b'Some Bytes'

Remind me again why you need to differentiate between headers and bytes_headers? I think bytes headers are evil. If you don't know the encoding when you have one, who does or ever will?

...

message.headers['Subject'] = Header(bytes=b'Some Bytes', encoding='utf-8')

Explicit is better than implicit, right?

Indeed, and the case for the above would be to keep idempotence of incoming messages in applications like mailman... ...otherwise we could just decode them and be done with it. cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk

James Y Knight

3:08 p.m.

On Apr 9, 2009, at 10:38 PM, Barry Warsaw wrote:

...

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode.

As I said in the thread having nearly the same exact discussion on web-sig, except about WSGI headers...

...

What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

Until you write a parser for every header, you simply cannot decode to unicode. The only sane choices are:

1) raw bytes
2) parsed structured data

There's no "decoded to unicode but not parsed" option: that's doing things in the wrong order. If you RFC2047-decode the header before doing tokenization and parsing, you will just have a *broken* implementation.

Here's an example where it matters. If you decode the RFC2047 part before parsing, you'd decide that there are two recipients to the message. There aren't. "<broken@example.com>, " is the display-name of "actual@example.com", not a second recipient.

To: =?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>

Here's a quote from RFC2047:

...

NOTE: Decoding and display of encoded-words occurs *after* a structured field body is parsed into tokens. It is therefore possible to hide 'special' characters in encoded-words which, when displayed, will be indistinguishable from 'special' characters in the surrounding text. For this and other reasons, it is NOT generally possible to translate a message header containing 'encoded-word's to an unencoded form which can be parsed by an RFC 822 mail reader.

And another quote for good measure:

(2) Any header field not defined as '*text' should be parsed according to the syntax rules for that header field. However, any 'word' that appears within a 'phrase' should be treated as an 'encoded-word' if it meets the syntax rules in section 2. Otherwise it should be treated as an ordinary 'word'.

Now, I suppose there's also a third possibility: 3) US-ASCII-only strings, unmolested except for doing a .decode('ascii'). That'll give you a string all right, but it's really just cheating. It's not actually a text string in any meaningful sense. (in all this I'm assuming your question is not about the "Subject" header in particular; that is of course just unstructured text so the parse step doesn't actually do anything...). James
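The ordering problem described in this message can be reproduced with the standard library. What follows is a minimal sketch, not part of the original post: it assumes a current Python 3 and uses `email.utils.parseaddr`, `email.utils.getaddresses` and `email.header.decode_header`; the variable names are illustrative only.

```python
from email.header import decode_header
from email.utils import getaddresses, parseaddr

raw = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>"

# Right order: tokenize the structured field first, then decode the
# encoded-word that makes up the display-name.
encoded_name, address = parseaddr(raw)
name_bytes, charset = decode_header(encoded_name)[0]
display_name = name_bytes.decode(charset)

# Wrong order: decode the encoded-word in place, then parse the result.
# The decoded text contains '<', '>' and ',', which the address parser
# now treats as syntax, so it reports two recipients.
decoded_first = raw.replace(encoded_name, display_name)
wrong_addresses = [addr for _, addr in getaddresses([decoded_first])]

print("parse then decode:", (display_name, address))
print("decode then parse:", wrong_addresses)
```

Parsing first yields a single recipient whose display-name happens to look like an address; decoding first turns that display-name into a second, bogus recipient.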

Barry Warsaw

2:11 p.m.

On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:

...

Until you write a parser for every header, you simply cannot decode to unicode. The only sane choices are: 1) raw bytes 2) parsed structured data

The email package does not need a parser for every header, but it should provide a framework that applications (or third party libraries) can use to extend the built-in header parsers. A bare minimum for functionality requires a Content-Type parser. I think the email package should also include an address header (Originator, Destination) parser, and a Message-ID header parser. Possibly others. The default would probably be some unstructured parser for headers like Subject. -Barry

James Y Knight

7:11 p.m.

On Apr 13, 2009, at 10:11 AM, Barry Warsaw wrote:

...

The email package does not need a parser for every header, but it should provide a framework that applications (or third party libraries) can use to extend the built-in header parsers. A bare minimum for functionality requires a Content-Type parser. I think the email package should also include an address header (Originator, Destination) parser, and a Message-ID header parser. Possibly others.

Sure, that's fine...

...

The default would probably be some unstructured parser for headers like Subject.

But for unknown headers, it's not a useful choice to return a "str" object. "str" is just one possible structured data representation for a header: there's no correct useful decoding of all headers into str. Of course for the "Subject" header, str is the correct result type, but that's not a default, that's explicit support for "Subject". You can't correctly decode "To" into a str, so what makes you think you can decode "X-Gabazaborph" into str? The only useful and correct representation for unknown (or unimplemented) headers is the raw bytes. James

Greg Ewing

11:27 p.m.

Barry Warsaw wrote:

...

The default would probably be some unstructured parser for headers like Subject.

Only for headers known to be unstructured, I think. Completely unknown headers should be available only as bytes. -- Greg
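The positions in this sub-thread (an extensible set of header parsers, with raw bytes as the only representation for unknown headers) can be sketched as a small registry. This is an illustration only: `HEADER_PARSERS`, `register_parser` and `parse_header` are hypothetical names, not part of the email package, and the sketch assumes known headers arrive as ASCII bytes.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses

# Hypothetical registry: lower-cased header name -> parser taking raw bytes.
HEADER_PARSERS = {}


def register_parser(*names):
    """Decorator that registers one parser for several header names."""
    def decorator(func):
        for name in names:
            HEADER_PARSERS[name.lower()] = func
        return func
    return decorator


@register_parser("Subject")
def parse_unstructured(raw):
    # Unstructured text: decoding RFC 2047 encoded-words is the whole job.
    return str(make_header(decode_header(raw.decode("ascii"))))


@register_parser("To", "Cc", "From")
def parse_address_list(raw):
    # Structured field: tokenize first, decode each display-name afterwards.
    pairs = getaddresses([raw.decode("ascii")])
    return [(str(make_header(decode_header(name))), addr)
            for name, addr in pairs]


def parse_header(name, raw):
    """Return parsed data for known headers, the raw bytes otherwise."""
    parser = HEADER_PARSERS.get(name.lower())
    return parser(raw) if parser else raw
```

With this layout, `parse_header('X-Gabazaborph', value)` simply hands back the bytes it was given, while applications or third-party libraries could register further parsers without touching the core.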

Stephen J. Turnbull

7 a.m.

Warning: Reply-To set to email-sig. Greg Ewing writes:

...

Only for headers known to be unstructured, I think. Completely unknown headers should be available only as bytes.

Why do I get the feeling that you guys are feeling up an elephant?<wink>

There are four things you might want to do with a header:

(1) Put it on the wire, which must be bytes (in fact, ASCII).
(2) Show it to a user (such as a rootin-tootin spam-fightin mail admin), which for consistency with well-behaved, implemented headers (ie, you might want to *gasp* *concatenate* your unknown header with a string), will sooner or later be string (ie, Unicode).
(3) (Try to) parse it, in which case an internal representation with some other structure may or may not be appropriate for storing the parsed data.
(4) Munge it, in which case an internal representation with some other structure may or may not be appropriate.

I see no particular reason for restricting these basic API classes for any header.

Robert Brewer

4:47 p.m.

On Thu, 2009-04-09 at 22:38 -0400, Barry Warsaw wrote:

...

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:

...
On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <barry@python.org> wrote: Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

Won't this work? (assuming dumps() always returns a string)

from json import dumps

def dumpb(obj, encoding='utf-8', *args, **kw):
    s = dumps(obj, *args, **kw)
    return s.encode(encoding)

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return:

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one. Now how do you spell the other way?

The Message class probably has these explicit methods:

...
...
...
Message.get_header_bytes('Subject')
Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice?

Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

One of those maps to

...
...
...
message['Subject'] = ???

I'm open to any suggestions here!

Syntactically, there's no sense in providing:

Message.set_header('Subject', 'Some text', encoding='utf-16')

...since you could more clearly write the same as:

Message.set_header('Subject', 'Some text'.encode('utf-16'))

The only interesting case is if you provided a *default* encoding, so that:

Message.default_header_encoding = 'utf-16'
Message.set_header('Subject', 'Some text')

...has the same effect. But it would be far easier to do all the encoding at once in an output() or serialize() method. Do different headers need different encodings? If so, make message['Subject'] a subclass of str and give it an .encoding attribute (with a default). If not, Message.header_encoding should be sufficient.

Robert Brewer fumanchu@aminus.org

Stephen J. Turnbull

7:22 p.m.

Robert Brewer writes:

...

Syntactically, there's no sense in providing:

Message.set_header('Subject', 'Some text', encoding='utf-16')

...since you could more clearly write the same as:

Message.set_header('Subject', 'Some text'.encode('utf-16'))

Which you now must *parse* and guess the encoding to determine how to RFC-2047-encode the binary mush. I think the encoding parameter is necessary here.

...

But it would be far easier to do all the encoding at once in an output() or serialize() method. Do different headers need different encodings?

You can have multiple encodings within a single header (and a naïve algorithm might very well encode "The price of Gödel-Escher-Bach is €25" as "The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is =?ISO-8859-15?Q?=A425?=").

...

If so, make message['Subject'] a subclass of str and give it an .encoding attribute (with a default).

But if you've set the .encoding attribute, you don't need to encode 'Some text'; .set_header() can take care of it for you. And what about the possibility that the encoding attributes disagree with the argument you passed to the codec?
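The point about several encodings inside one header can be checked against the example string quoted in this message. The snippet below is a minimal sketch assuming a current Python 3; it only uses `email.header.decode_header` and `email.header.make_header`, and the variable names are illustrative.

```python
from email.header import decode_header, make_header

encoded = ("The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is "
           "=?ISO-8859-15?Q?=A425?=")

# decode_header() yields one (value, charset) pair per run; the charset is
# None for the unencoded ASCII stretches.
parts = decode_header(encoded)
charsets = [charset for _, charset in parts if charset is not None]

# make_header() decodes each chunk with its own charset and joins them.
text = str(make_header(parts))

print(charsets)
print(text)
```

A single per-header `.encoding` attribute cannot describe such a value, which is why the decoder has to track a charset per chunk.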

Chris Withers

12:33 p.m.

New subject: email header encoding

Stephen J. Turnbull wrote:

...

Robert Brewer writes:

...
Syntactically, there's no sense in providing:

Message.set_header('Subject', 'Some text', encoding='utf-16')

...since you could more clearly write the same as:

Message.set_header('Subject', 'Some text'.encode('utf-16'))

Which you now must *parse* and guess the encoding to determine how to RFC-2047-encode the binary mush. I think the encoding parameter is necessary here.

Indeed.

...

...
But it would be far easier to do all the encoding at once in an output() or serialize() method. Do different headers need different encodings?

You can have multiple encodings within a single header (and a naïve

"can" and "should" are two very different things. When is it even a good idea to have more than one encoding in a single header? Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk

Stephen J. Turnbull

2:19 p.m.

New subject: email header encoding

Chris Withers writes:

...

When is it even a good idea to have more than one encoding in a single header?

I'd be happy to discuss that on email-sig, but it's really OT for Python-Dev at this point.

Chris Withers

12:39 p.m.

New subject: headers api for email package

Barry Warsaw wrote:

...

...
...
...
message['Subject']

The raw bytes or the decoded unicode?

A header object.

...

Okay, so you've picked one. Now how do you spell the other way?

str(message['Subject'])
bytes(message['Subject'])

...

Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

Where you just want "a damned valid email and stop making my life hard!":

Message['Subject']='Some text'

Where you care about what encoding is used:

Message['Subject']=Header('Some text',encoding='utf-8')

If you have bytes, for whatever reason:

Message['Subject']=b'some bytes'.decode('utf-8')

...because only you know what encoding those bytes use!

...

One of those maps to

...
...
...
message['Subject'] = ???

...should only accept text or a Header object. Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
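The API proposed in this message can be sketched as follows. This is a hypothetical illustration, not the email package: `HeaderValue` and `SketchMessage` are invented names, and the `encoding=` keyword follows the proposal (the existing `email.header.Header` in the standard library calls that parameter `charset`). The wire form borrows `email.header.Header.encode()` to build the RFC 2047 encoded-word.

```python
from email.header import Header


class HeaderValue:
    """Hypothetical header value: text plus the charset used on the wire."""

    def __init__(self, text, encoding="utf-8"):
        if not isinstance(text, str):
            raise TypeError("header text must be str; decode bytes first")
        self.text = text
        self.encoding = encoding

    def __str__(self):
        # Presentation form: the decoded text.
        return self.text

    def __bytes__(self):
        # Wire form: plain ASCII when possible, else RFC 2047 encoded-words.
        try:
            return self.text.encode("ascii")
        except UnicodeEncodeError:
            return Header(self.text, charset=self.encoding).encode().encode("ascii")


class SketchMessage:
    """Hypothetical mapping whose __setitem__ refuses bytes."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        if isinstance(value, str):
            value = HeaderValue(value)
        elif not isinstance(value, HeaderValue):
            raise TypeError("expected str or HeaderValue, got %s"
                            % type(value).__name__)
        self._headers[name.lower()] = value

    def __getitem__(self, name):
        return self._headers[name.lower()]
```

Assigning `b'some bytes'` raises `TypeError`, so the caller is forced to write `b'some bytes'.decode('utf-8')` and thereby state the encoding explicitly.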

Barry Warsaw

2:28 p.m.

New subject: headers api for email package

On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:

...

Barry Warsaw wrote:

...
...
...
...
message['Subject']

The raw bytes or the decoded unicode?

A header object.

Yep. You got there before I did. :)

...

...
Okay, so you've picked one. Now how do you spell the other way?

str(message['Subject'])

Yes for unstructured headers like Subject. For structured headers... hmm.

...

bytes(message['Subject'])

Yes.

...

...
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

Where you just want "a damned valid email and stop making my life hard!":

Message['Subject']='Some text'

Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf?

...

Where you care about what encoding is used:

Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

...

If you have bytes, for whatever reason:

Message['Subject']=b'some bytes'.decode('utf-8')

...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes? -Barry

R. David Murray

3:49 p.m.

New subject: headers api for email package

On Mon, 13 Apr 2009 at 10:28, Barry Warsaw wrote:

...

On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:

...
Barry Warsaw wrote:

...
...
...
...
message['Subject']

The raw bytes or the decoded unicode?

A header object.

Yep. You got there before I did. :)

...

...
...
Okay, so you've picked one. Now how do you spell the other way?

str(message['Subject'])

Yes for unstructured headers like Subject. For structured headers... hmm.

Some "reasonable" printable interpretation that has no semantic meaning?

...

...
bytes(message['Subject'])

Yes.

...
...
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

Where you just want "a damned valid email and stop making my life hard!":

Message['Subject']='Some text'

Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf?

Given some usenet postings I've just dealt with, (3) appears to sometimes be spelled 'x-unknown' and sometimes (in the most recent case) 'unknown-8bit'. A quick google turns up a hit on RFC1428 for the latter, and a bunch of trouble tickets for the former... so I think 'wtf' is correctly spelled 'unknown-8bit'. However, it's not supposed to be used by mail composers, who are expected to know the encoding. It's for mail gateways that are transforming something and don't know the encoding. I'm not sure what this means for the email module, which certainly will be used in mail gateways... maybe it's the responsibility of the application code to explicitly say 'unknown encoding'?

...

...
Where you care about what encoding is used:

Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

...
If you have bytes, for whatever reason:

Message['Subject']=b'some bytes'.decode('utf-8')

...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes?

If I'm understanding things correctly, if it did accept bytes the person using that interface would need to do whatever encoding (eg: encoded-word) was needed, so the interface should check that the byte string is 8 bit clean. But having some sort of 'setraw' method on Header might be better for that case. --David

Chris Withers

May 2009

5:14 p.m.

New subject: headers api for email package

...

...
...
Where you just want "a damned valid email and stop making my life hard!":

Message['Subject']='Some text'

Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf?

Well, we're talking about Python 3 here right? In which case the above involves only unicode, so why do we need to guess anything? Just use utf-8 and be done with it...

...

However, it's not supposed to be used by mail composers, who are expected to know the encoding. It's for mail gateways that are transforming something and don't know the encoding. I'm not sure what this means for the email module, which certainly will be used in mail gateways... maybe it's the responsibility of the application code to explicitly say 'unknown encoding'?

Indeed, surely this happens when you have bytes and need to do something with it? That's not what my example above is about...

...

...
...
Where you care about what encoding is used:

Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

...it's covered by this.

...

...
...
If you have bytes, for whatever reason:

Message['Subject']=b'some bytes'.decode('utf-8')

...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes?

Indeed :-) Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk

Stephen J. Turnbull

April 2009

5:15 p.m.

New subject: [Email-SIG] headers api for email package

Barry Warsaw writes:

...

On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:

...
Barry Warsaw wrote:

...
...
...
...
message['Subject']

The raw bytes or the decoded unicode?

A header object.

Yep. You got there before I did. :)

...
...
Okay, so you've picked one. Now how do you spell the other way?

str(message['Subject'])

Yes for unstructured headers like Subject. For structured headers... hmm.

Well, suppose we get really radical here. *People* see email as (rich-)text. So ... message['Subject'] returns an object, partly to be consistent with more complex headers' APIs, but partly to remind us that nothing in email is as simple as it seems. Now, str(message['Subject']) is really for presentation to the user, right? OK, so let's make it a presentation function! Decode the MIME-words, optionally unfold folded lines, optionally compress spaces, etc. This by default returns the subject field as a single, possibly quite long, line. Then a higher-level API can rewrap it, add fonts etc, for fancy presentation. This also suggests that we don't want the field tag (ie, "Subject") to be part of this value. Of course a *really* smart higher-level API would access structured headers based on their structure, not on the one-size-fits-all str() conversion. Then MTAs see email as a string of octets. So guess what:

...

...
bytes(message['Subject'])

gives wire format. Yow! I think I'm just joking. Right?

...

...
...
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably.

...
...
...
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')

Where you just want "a damned valid email and stop making my life hard!":

-1 I mean, yeah, Brother, I feel your pain but it just isn't that easy. If that were feasible, it would be *criminal* to have a .set_header() method at all! In fact,

...

...
Message['Subject']='Some text'

is going to (a) need to take *only* unicodes, or (b) raise Exceptions at the slightest provocation when handed bytes. And things only get worse if you try to provide this interface for say "From" (let alone "Content-Type"). Is it really worth doing the mapping interface if it's only usable with free-form headers (ie, only Subject among the commonly used headers)?

...

Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf?

Uh, what guessing? If you don't know what you have but you believe it to be a valid header field, then presumably you got it off the wire and it's still in bytes and you just spit it out on the wire without trying to decode or encode it. But as I already said, I think that's a bad idea. Otherwise, you should have a unicode, and you simply look at the range of the string. If it fits in ASCII, Bob's your uncle. If not, Bob's your aunt (and you use UTF-8).

...

...
Where you care about what encoding is used:

Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

...
If you have bytes, for whatever reason:

Message['Subject']=b'some bytes'.decode('utf-8')

...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes?

How do you distinguish "raw" bytes from "encoded bytes"? __setitem__() shouldn't accept bytes at all. There should be an API which sets a .formatted_for_the_wire member, and it should have a "validate" option (ie, when true the API attempts to parse the header and raises an exception if it fails to do so; when false, it assumes you know what you're doing and will send out the bytes verbatim).
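Two ideas from this message lend themselves to a short sketch: picking the charset by looking at the range of the string, and a separate entry point for bytes that are already formatted for the wire, with an optional validation step. Everything below is hypothetical. `choose_charset` and `WireHeader` are invented names, and the validation is only a stand-in: it checks for 7-bit ASCII and rejects any CR or LF (so folded values are out of scope), where a real implementation would run the field-specific parser.

```python
def choose_charset(text):
    """Return 'us-ascii' when the text fits in ASCII, else 'utf-8'."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "utf-8"
    return "us-ascii"


class WireHeader:
    """Hypothetical holder for a header field already formatted for the wire."""

    def __init__(self, name, wire_bytes, validate=True):
        if not isinstance(wire_bytes, bytes):
            raise TypeError("wire format must be bytes")
        if validate:
            self._check(wire_bytes)
        self.name = name
        self.formatted_for_the_wire = wire_bytes

    @staticmethod
    def _check(wire_bytes):
        # Stand-in for "attempt to parse the header and raise on failure".
        try:
            wire_bytes.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError("header value is not 7-bit ASCII") from None
        if b"\r" in wire_bytes or b"\n" in wire_bytes:
            raise ValueError("header value contains a line break")
```

With `validate=False` the bytes are stored verbatim, matching the "assumes you know what you're doing" branch described above.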

Steven D'Aprano

6:32 p.m.

New subject: [Email-SIG] headers api for email package

On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:

...

*People* see email as (rich-)text.

We do? It's not clear what you actually mean by "(rich-)text". In the context of email, I understand it to mean HTML in the body, web-bugs, security exploits, 36pt hot-pink bold text on a lime-green background, and all the other wonderful things modern mail clients let you put in your email. But as far as I know, no mail client tries to render HTML tags inside mail headers, so you're probably not talking about HTML rich-text. I guess you mean Unicode characters. Am I right?

Now, correct me if I'm wrong, but I don't think mail headers can actually be anything *but* bytes. I see that my mail client, at least, sends bytes in the Subject header. If I try to send characters, e.g. the subject header "Testing-β-" (without the quotes), what actually gets sent is the bytes "=?utf-8?q?Testing-=CE=B2-?=" (again without the quotation marks). This seems to be covered by RFC 2047: http://tools.ietf.org/html/rfc2047

If you're proposing converting those bytes into characters, that's all very well and good, but what's your strategy for dealing with the inevitable wrongly-formatted headers? If the header can't be correctly decoded into text, there still needs to be a way to get to the raw bytes. Apart from (e.g.) mail processing apps like SpamBayes which will want to inspect the raw bytes, mail readers will need to deal with badly formatted mail. The RFC states: "However, a mail reader MUST NOT prevent the display or handling of a message because an 'encoded-word' is incorrectly formed." [...]

...

Then MTAs see email as a string of octets. So guess what:

bytes(message['Subject'])

gives wire format. Yow! I think I'm just joking. Right?

Er, I'm not sure. Are you joking? I hope not, because it is important to be able to get to the raw, unmodified bytes that the MTA sees, without all the fancy processing you suggest. [...]

...

Otherwise, you should have a unicode, and you simply look at the range of the string. If it fits in ASCII, Bob's your uncle. If not, Bob's your aunt (and you use UTF-8).

Again, correct me if I'm wrong, but *all* valid mail headers must fit in ASCII. RFC 5335 defines an experimental approach to allowing full Unicode in mail headers, but surely it's going to be a while before that's common, let alone standard. http://tools.ietf.org/html/rfc5335 -- Steven D'Aprano
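The "Testing-β-" example from this message, together with the requirement that a malformed encoded-word must not block handling, can be sketched with the standard library. This is a minimal illustration assuming a current Python 3; `decode_or_raw` is a hypothetical helper, and falling back to the undecoded value is just one possible policy.

```python
from email.errors import HeaderParseError
from email.header import Header, decode_header

subject = "Testing-\u03b2-"  # "Testing-β-"

# Encoding for the wire produces pure ASCII (an RFC 2047 encoded-word).
wire = Header(subject, charset="utf-8").encode()

# Decoding the encoded-word quoted in the message recovers the characters.
raw, charset = decode_header("=?utf-8?q?Testing-=CE=B2-?=")[0]
decoded = raw.decode(charset)


def decode_or_raw(value):
    """Decode encoded-words, returning the input unchanged on any failure."""
    try:
        return "".join(
            chunk.decode(cs or "ascii") if isinstance(chunk, bytes) else chunk
            for chunk, cs in decode_header(value)
        )
    except (HeaderParseError, LookupError, UnicodeDecodeError):
        return value
```

A header naming an unknown charset, for instance, comes back exactly as it was received, so the raw form stays available to the application.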

Chris Withers

May 2009

5:18 p.m.

New subject: [Email-SIG] headers api for email package

Stephen J. Turnbull wrote:

...

...
...
str(message['Subject'])

Yes for unstructured headers like Subject. For structured headers... hmm.

Well, suppose we get really radical here. *People* see email as (rich-)text. So ... message['Subject'] returns an object, partly to be consistent with more complex headers' APIs, but partly to remind us that nothing in email is as simple as it seems. Now, str(message['Subject']) is really for presentation to the user, right? OK, so let's make it a presentation function! Decode the MIME-words, optionally unfold folded lines, optionally compress spaces, etc. This by default returns the subject field as a single, possibly quite long, line. Then a higher-level API can rewrap it, add fonts etc, for fancy presentation. This also suggests that we don't want the field tag (ie, "Subject") to be part of this value.

Of course a *really* smart higher-level API would access structured headers based on their structure, not on the one-size-fits-all str() conversion.

All sounds good to me.

...

Then MTAs see email as a string of octets. So guess what:

...
...
bytes(message['Subject'])

gives wire format. Yow! I think I'm just joking. Right?

Why? That also sounds fine to me and "feels right"...

...

...
...
Where you just want "a damned valid email and stop making my life hard!":

-1 I mean, yeah, Brother, I feel your pain but it just isn't that easy. If that were feasible, it would be *criminal* to have a .set_header() method at all! In fact,

Don't agree...

...

...
...
Message['Subject']='Some text'

is going to (a) need to take *only* unicodes, or (b) raise Exceptions at the slightest provocation when handed bytes.

It should only take unicodes and bitch profusely about anything else.

...

And things only get worse if you try to provide this interface for say "From" (let alone "Content-Type"). Is it really worth doing the mapping interface if it's only usable with free-form headers (ie, only Subject among the commonly used headers)?

Sure, for other headers it might *not* accept unicodes...

...

How do you distinguish "raw" bytes from "encoded bytes"? __setitem__() shouldn't accept bytes at all.

Right on :-) Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk

"Martin v. Löwis"

April 2009

6:25 p.m.

...

This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

If you allow for content-transfer-encoding: 8bit, I think there is just no way to represent email as text. You have to accept conversion to, say, base64 (or quoted-unreadable) when converting an email message to text. Regards, Martin

Barry Warsaw

2:41 a.m.

On Apr 9, 2009, at 2:25 PM, Martin v. Löwis wrote:

...

...
This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).

If you allow for content-transfer-encoding: 8bit, I think there is just no way to represent email as text. You have to accept conversion to, say, base64 (or quoted-unreadable) when converting an email message to text.

Agreed. But applications will want to deal with some parts of the message as text on the boundaries. Internally, it should be all bytes (although even that is a pain to write ;). -Barry

Alexandre Vassalotti

7:51 p.m.

On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*)

(*) http://www.ietf.org/rfc/rfc4627.txt

That is one short and sweet RFC. :-)

...

The RFC also specifies a discrimination algorithm for non-supersets of ASCII (“Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.”), but it is not implemented in the json module:

Given the RFC specifies that the encoding used should be one of the encodings defined by Unicode, wouldn't it be a better idea to remove the "unicode" support, instead? To me, it would make sense to use the detection algorithms for Unicode to sniff the encoding of the JSON stream and then use the detected encoding to decode the strings embedded in the JSON stream. Cheers, -- Alexandre
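The null-pattern check quoted from RFC 4627 is short enough to sketch directly. The code below is an illustration under stated assumptions, not what the json module did at the time of this thread: `detect_json_encoding` and `loads_bytes` are hypothetical names, byte order marks are not handled, and inputs shorter than four octets are assumed to be UTF-8.

```python
import json


def detect_json_encoding(data):
    """Guess the encoding of a JSON octet stream from its first four octets."""
    if len(data) < 4:
        return "utf-8"  # assumption: too short to inspect, use the default
    nulls = tuple(byte == 0 for byte in data[:4])
    if nulls == (True, True, True, False):   # 00 00 00 xx
        return "utf-32-be"
    if nulls == (False, True, True, True):   # xx 00 00 00
        return "utf-32-le"
    if nulls == (True, False, True, False):  # 00 xx 00 xx
        return "utf-16-be"
    if nulls == (False, True, False, True):  # xx 00 xx 00
        return "utf-16-le"
    return "utf-8"                           # xx xx xx xx


def loads_bytes(data):
    """Decode the octets with the detected encoding, then parse the text."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

The check relies on the first two characters of the JSON text being ASCII, as the RFC passage states, so it only inspects where the zero octets fall.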

"Martin v. Löwis"

8:19 p.m.

Alexandre Vassalotti wrote:

...

On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...
As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*)

(*) http://www.ietf.org/rfc/rfc4627.txt

That is one short and sweet RFC. :-)

It is indeed well-specified. Unfortunately, it only talks about the application/json type; the pre-existing other versions of json in MIME types vary widely, such as text/plain (possibly with a charset= parameter), text/json, or text/javascript. For these, the RFC doesn't apply.

...

Given the RFC specifies that the encoding used should be one of the encodings defined by Unicode, wouldn't it be a better idea to remove the "unicode" support, instead? To me, it would make sense to use the detection algorithms for Unicode to sniff the encoding of the JSON stream and then use the detected encoding to decode the strings embedded in the JSON stream.

That might be reasonable. (but then, I also stand by my view that we shouldn't proceed without Bob's approval). Regards, Martin

Damien Diederen

2:25 p.m.

Hello, Antoine Pitrou <solipsis@pitrou.net> writes:

...

Hello,

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

I had a quick look into the module on both branches, and at Antoine's latest patch (json_py3k-3). The current situation on trunk is indeed not very pretty in terms of code duplication, and I agree it would be nice not to carry that forward. I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers. There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm wrong about that. Should I give it a try, and see how "clean" the result can be made?

...

Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

Provided one of the alternatives is dropped, wouldn't it be better to do the opposite, i.e., have the decoder take bytes as input, and the encoder produce bytes—and layer the str functionality on top of that? I guess the answer depends on how the (most common) lower layers are structured, but it would be nice to allow a straight bytes path to/from the underlying transport. (I'm willing to have a go at the conversion in case somebody is interested.) Bob, would you have an idea of which lower layers are most commonly used with the json module, and whether people are more likely to expect strs or bytes in Python 3.x? Maybe that data could be inferred from some bug tracking system?

...

The bug entry is: http://bugs.python.org/issue4136

Regards Antoine.

Regards, Damien -- http://crosstwine.com "Strong Opinions, Weakly Held" -- Bob Johansen

Eric Smith

3:05 p.m.

...

I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers.

Not sure if this is exactly what you mean, but look at Objects/stringlib. str.format() and unicode.format() share the same implementation, using stringdefs.h and unicodedefs.h. Eric.

Damien Diederen

3:22 p.m.

Hi Eric, "Eric Smith" <eric@trueblade.com> writes:

...

...
I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers.

Not sure if this is exactly what you mean, but look at Objects/stringlib. str.format() and unicode.format() share the same implementation, using stringdefs.h and unicodedefs.h.

That's indeed a much better example! I'm more comfortable applying the same technique to the json module now that I see it used in the core. (Provided Bob and Antoine are not turned away by the relative ugliness, that is.)

...

Eric.

Cheers, Damien -- http://crosstwine.com "Strong Opinions, Weakly Held" -- Bob Johansen

Bob Ippolito

3:07 p.m.

On Mon, Apr 27, 2009 at 7:25 AM, Damien Diederen <dd@crosstwine.com> wrote:

...

Antoine Pitrou <solipsis@pitrou.net> writes:

...
Hello,

We're in the process of forward-porting the recent (massive) json updates to 3.1, and we are also thinking of dropping remnants of support of the bytes type in the json library (in 3.1, again). This bytes support almost didn't work at all, but there was a lot of C and Python code for it nevertheless. We're also thinking of dropping the "encoding" argument in the various APIs, since it is useless.

I had a quick look into the module on both branches, and at Antoine's latest patch (json_py3k-3). The current situation on trunk is indeed not very pretty in terms of code duplication, and I agree it would be nice not to carry that forward.

I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers.

There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm wrong about that. Should I give it a try, and see how "clean" the result can be made?

...
Under the new situation, json would only ever allow str as input, and output str as well. By posting here, I want to know whether anybody would oppose this (knowing, once again, that bytes support is already broken in the current py3k trunk).

Provided one of the alternatives is dropped, wouldn't it be better to do the opposite, i.e., have the decoder take bytes as input, and the encoder produce bytes—and layer the str functionality on top of that? I guess the answer depends on how the (most common) lower layers are structured, but it would be nice to allow a straight bytes path to/from the underlying transport.

(I'm willing to have a go at the conversion in case somebody is interested.)

Bob, would you have an idea of which lower layers are most commonly used with the json module, and whether people are more likely to expect strs or bytes in Python 3.x? Maybe that data could be inferred from some bug tracking system?

I don't know what Python 3.x users expect. As far as I know, none of the lower layers of the json package are used directly. They're certainly not supposed to be or documented as such. My use case for dumps is typically bytes output because we push it straight to and from IO. Some people embed JSON in other documents (e.g. HTML) where you would want it to be text. I'm pretty sure that the IO case is more common. -bob
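As a minimal sketch of the "straight to and from IO" case in Python 3, assuming a str-only encoder: `json.dump` writes text, so a binary transport needs a text layer that does the UTF-8 encoding. Here `io.BytesIO` is only a stand-in for a real transport, and the sample payload is illustrative.

```python
import io
import json

# Stand-in for a binary transport (a socket file, a file opened in "wb" mode, ...).
transport = io.BytesIO()

# json.dump emits str chunks, so wrap the binary stream in a text layer
# that encodes them as UTF-8 on the way out.
writer = io.TextIOWrapper(transport, encoding="utf-8", newline="")
json.dump({"animal": "éléphant"}, writer, ensure_ascii=False)
writer.detach()  # flush the text layer and release the underlying stream

raw = transport.getvalue()  # the bytes that actually went "over the wire"

# Reading back: wrap the same binary stream in a decoding text layer.
transport.seek(0)
reader = io.TextIOWrapper(transport, encoding="utf-8")
decoded = json.load(reader)
```

For the embedding case (JSON inside an HTML document, say), the str returned by `json.dumps` can be used directly with no encoding step.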

Antoine Pitrou

3:24 p.m.

Damien Diederen <dd <at> crosstwine.com> writes:

...

I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers.

There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm wrong about that. Should I give it a try, and see how "clean" the result can be made?

Keep in mind that json is externally maintained by Bob. The more we rework his code, the less easy it will be to backport other changes from the simplejson library. I think we should either keep the code duplication (if we want to keep fast paths for both bytes and str objects), or only keep one of the two versions as my patch does.

...

Provided one of the alternatives is dropped, wouldn't it be better to do the opposite, i.e., have the decoder take bytes as input, and the encoder produce bytes—and layer the str functionality on top of that? I guess the answer depends on how the (most common) lower layers are structured, but it would be nice to allow a straight bytes path to/from the underlying transport.

The straightest path is actually to/from unicode, since JSON data can contain unicode strings but no byte strings. Also, the json library /has/ to output unicode when `ensure_ascii` is False. In 2.x:

...

...
...
json.dumps([u"éléphant"], ensure_ascii=False)
u'["\xe9l\xe9phant"]'

In any case, I don't think it will matter much in terms of speed whether we take one route or the other. UTF-8 encoding/decoding is probably much faster (in characters per second) than JSON encoding/decoding is. Regards Antoine.
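For comparison, here is a small sketch of the same point in Python 3, where `json.dumps` returns a str. The sample list mirrors the 2.x example above; the variable names are illustrative only.

```python
import json

data = ["éléphant"]

# Default (ensure_ascii=True): non-ASCII characters are written as \uXXXX
# escapes, so the resulting str contains only ASCII characters.
escaped = json.dumps(data)

# ensure_ascii=False: non-ASCII characters appear literally in the str.
# Getting bytes for a transport is then an explicit, separate encoding step.
literal = json.dumps(data, ensure_ascii=False)
as_bytes = literal.encode("utf-8")

print(escaped)
print(literal)
print(as_bytes)
```

The encoding step is kept separate from the JSON serialization here, which is the layering argued for above.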

Damien Diederen

4:21 p.m.

Hi Antoine,

Antoine Pitrou <solipsis@pitrou.net> writes:

...

Damien Diederen <dd <at> crosstwine.com> writes:

...
I couldn't figure out a way to get rid of it short of multi-#including "templates" and playing with the C preprocessor, however, and have the nagging feeling the latter would be frowned upon by the maintainers.

There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm wrong about that. Should I give it a try, and see how "clean" the result can be made?

Keep in mind that json is externally maintained by Bob. The more we rework his code, the less easy it will be to backport other changes from the simplejson library.

I think we should either keep the code duplication (if we want to keep fast paths for both bytes and str objects), or only keep one of the two versions as my patch does.

Yes, I was (slowly) reaching the same conclusion.

...

...
Provided one of the alternatives is dropped, wouldn't it be better to do the opposite, i.e., have the decoder take bytes as input, and the encoder produce bytes—and layer the str functionality on top of that? I guess the answer depends on how the (most common) lower layers are structured, but it would be nice to allow a straight bytes path to/from the underlying transport.

The straightest path is actually to/from unicode, since JSON data can contain unicode strings but no byte strings. Also, the json library /has/ to output unicode when `ensure_ascii` is False. In 2.x:

...
...
...
json.dumps([u"éléphant"], ensure_ascii=False)
u'["\xe9l\xe9phant"]'

In any case, I don't think it will matter much in terms of speed whether we take one route or the other. UTF-8 encoding/decoding is probably much faster (in characters per second) than JSON encoding/decoding is.

You're undoubtedly right. I was more concerned about the interaction with other modules, and avoiding unnecessary copies/conversions especially when they don't make sense from the user's perspective. I will whip up a patch adding a {loadb,dumpb} API as you suggested in another email, with the most trivial implementation, and then we'll see where to go from there. It can still be dropped if there is a concern of perpetuating a "bad idea," or I can follow up with a port of Bob's "bytes" implementation from 2.x if there is any interest.
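What "the most trivial implementation" of such a pair might look like can be sketched as thin wrappers over the str-based functions. This is an assumption for illustration, not the patch itself: `dumpb` and `loadb` are hypothetical helpers (they are not functions in the json module), and the choice of UTF-8 is also assumed.

```python
import json


def dumpb(obj, **kwargs):
    """Hypothetical helper: serialize obj to UTF-8 encoded JSON bytes."""
    return json.dumps(obj, **kwargs).encode("utf-8")


def loadb(data, **kwargs):
    """Hypothetical helper: parse UTF-8 encoded JSON bytes."""
    return json.loads(data.decode("utf-8"), **kwargs)


# Round trip through bytes, as one would when talking to a binary transport.
payload = dumpb({"animal": "éléphant"}, ensure_ascii=False)
restored = loadb(payload)
```

Each call pays for one extra copy (the intermediate str), which is the cost a native bytes implementation would avoid.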

...

Regards Antoine.

Cheers, Damien -- http://crosstwine.com "Strong Opinions, Weakly Held" -- Bob Johansen

120 comments

34 participants

participants (34)

"Martin v. Löwis"
Aahz
Alexandre Vassalotti
Antoine Pitrou
Barry Warsaw
Bill Janssen
Bob Ippolito
Chris Withers
curtin＠acm.org
Damien Diederen
Daniel Stutzbach
Dirkjan Ochtman
Eric Smith
Glenn Linderman
glyph＠divmod.com
Greg Ewing
Guido van Rossum
James Y Knight
Lino Mastrodomenico
Mark Hammond
Michael Foord
Nick Coghlan
Oleg Broytmann
Paul Moore
R. David Murray
Raymond Hettinger
Robert Brewer
Stephen J. Turnbull
Stephen J. Turnbull
Steve Holden
Steven D'Aprano
Sylvain Thénault
Terry Reedy
Tony Nelson