[Python-Dev] Dropping bytes "support" in json

Fri Apr 10 17:08:04 CEST 2009

On Apr 9, 2009, at 10:38 PM, Barry Warsaw wrote:
> So, what I'm really asking is this.  Let's say you agree that there  
> are use cases for accessing a header value as either the raw encoded  
> bytes or the decoded unicode.

As I said in the thread having nearly the same exact discussion on
web-sig, except about WSGI headers...

> What should this return:
>
> >>> message['Subject']
>
> The raw bytes or the decoded unicode?

Until you write a parser for every header, you simply cannot decode to  
unicode. The only sane choices are:
1) raw bytes
2) parsed structured data

There's no "decoded to unicode but not parsed" option: that's doing  
things in the wrong order. If you RFC2047-decode the header before  
doing tokenization and parsing, you will just have a *broken*  
implementation.

Here's an example where it matters. If you decode the RFC2047 part
before parsing, you'd decide that there are two recipients to the
message. There aren't. "<broken@example.com>, " is the display-name of
"actual@example.com", not a second recipient.

   To: =?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>
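
To make the failure concrete, here is a rough sketch of the wrong order using the stdlib email package. It assumes the addresses in the header are written with a literal "@", and it uses email.utils.getaddresses as a stand-in for "the parser"; the variable names are just for illustration.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses

# The To: field body from the example above (addresses written with "@").
raw_to = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>"

# WRONG ORDER, step 1: RFC 2047-decode the whole field body up front.
decoded_first = str(make_header(decode_header(raw_to)))

# WRONG ORDER, step 2: only now tokenize it as an address list.
# The "<", ">" and "," that were hidden inside the encoded-word are
# now indistinguishable from real address syntax.
recipients = getaddresses([decoded_first])

print(repr(decoded_first))
print(recipients)  # two entries, although the message has one recipient
```

The parser cannot be blamed here: by the time it sees the string, the information about which characters were data has already been thrown away.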

Here's a quote from RFC2047:
> NOTE: Decoding and display of encoded-words occurs *after* a  
> structured field body is parsed into tokens. It is therefore  
> possible to hide 'special' characters in encoded-words which, when  
> displayed, will be indistinguishable from 'special' characters in  
> the surrounding text. For this and other reasons, it is NOT  
> generally possible to translate a message header containing
> 'encoded-word's to an unencoded form which can be parsed by an RFC 822 mail
> reader.
And another quote for good measure:
> (2) Any header field not defined as '*text' should be parsed  
> according to the syntax rules for that header field. However, any  
> 'word' that appears within a 'phrase' should be treated as an  
> 'encoded-word' if it meets the syntax rules in section 2. Otherwise  
> it should be treated as an ordinary 'word'.
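
The order the RFC asks for (parse into tokens first, then decode the words inside a phrase) might look roughly like this. parse_then_decode is a hypothetical helper, not anything in the stdlib, and decoding the whole phrase in one go rather than word by word is a simplification on my part.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses

# Same To: field body as in the example (addresses written with "@").
raw_to = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>"


def parse_then_decode(field_body):
    """Hypothetical helper: tokenize first, decode display-names second."""
    result = []
    # Step 1: parse the still-encoded field body. The encoded-word is
    # just an opaque atom at this point, so it cannot confuse the parser.
    for phrase, addr_spec in getaddresses([field_body]):
        # Step 2: decode encoded-words only inside the phrase part.
        display_name = str(make_header(decode_header(phrase)))
        result.append((display_name, addr_spec))
    return result


parsed = parse_then_decode(raw_to)
print(parsed)  # one recipient; the "<...>, " text ends up in the display-name
```

What comes out is structured data (a display-name plus an addr-spec per recipient), which is option 2 above, not a single decoded string.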

Now, I suppose there's also a third possibility:
3) US-ASCII-only strings, unmolested except for doing  
a .decode('ascii'). That'll give you a string all right, but it's  
really just cheating. It's not actually a text string in any  
meaningful sense.
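
A small sketch of why option 3 is cheating: the .decode('ascii') changes the type and nothing else. The sample values below are made up for illustration.

```python
# A raw header value as bytes (addresses written with "@").
raw_value = b"=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>"

# Option 3: the bytes become a str, but nothing has been interpreted.
as_str = raw_value.decode("ascii")

# The encoded-word is still sitting there, undecoded.
still_encoded = "=?UTF-8?B?" in as_str

# And it only works at all while the header really is pure US-ASCII.
try:
    b"caf\xc3\xa9".decode("ascii")
    ascii_only = True
except UnicodeDecodeError:
    ascii_only = False

print(type(as_str).__name__, still_encoded, ascii_only)
```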

(In all this I'm assuming your question is not about the "Subject"
header in particular; that is of course just unstructured text, so the
parse step doesn't actually do anything.)

James