[Python-Dev] Edits to Metadata 1.2 to add extras (optional dependencies)

Sat Sep 1 14:58:30 CEST 2012

On Sat, 01 Sep 2012 13:55:11 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> "Martin v. Löwis" writes:
> 
>  > Unfortunately, this conflicts with the desire to use UTF-8 in attribute
>  > values - RFC 822 (and also 2822) don't support this, but require the
>  > use of MIME instead (Q or B encoding).
> 
> This can be achieved simply by extending the set of characters
> permitted, as MIME did for message bodies.  I'd be cautious about RFC
> 5335, not just because it's experimental, but because there may be
> other requirements we don't want to mess with.  (If RDM says
> otherwise, listen to him.  I just know the RFC exists.)

That is essentially what that RFC does.  I haven't gone through it with
a fine-tooth comb yet, but that's why I say the parsing side mostly works
already: we allow unicode characters anywhere non-special-characters are
allowed during parsing.  The only issue is that we encode non-ASCII using
the normal rules during serialization, so we need a new policy control to
disable that.  I'm thinking it will be an easy addition...the hard part
for RFC5335 is doing that fine-tooth read and adding appropriate tests.
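
To make the serialization issue concrete, here is a rough sketch.  It
assumes a Python version (3.5 or later) in which email.policy.EmailPolicy
accepts a utf8 attribute; that attribute postdates this thread and is
only one possible shape for such a control.  The serialize_author helper
and the Author value are made up for illustration.

```python
from email.message import EmailMessage
from email.policy import EmailPolicy


def serialize_author(utf8):
    """Serialize one non-ASCII header under the given utf8 policy setting."""
    # Assumption: EmailPolicy takes a ``utf8`` keyword (Python 3.5+).
    msg = EmailMessage(policy=EmailPolicy(utf8=utf8))
    msg["Author"] = "Martin v. Löwis"
    return msg.as_string()


# Normal rules: the non-ASCII word is emitted as an RFC 2047 encoded word.
encoded = serialize_author(utf8=False)

# With the control switched on, the characters are written out unencoded.
unencoded = serialize_author(utf8=True)

print(encoded)
print(unencoded)
```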

Alternatively, as Donald pointed out, you can use the Binary mode, where
the utf-8 bytes just go along for the ride.  In the context of the
metadata, I think that should produce the desired results, since there
should be no need to re-wrap metadata lines.  It will also preserve the
line endings *if* you don't use the new policies.  But that is why I
would prefer to use explicit RFC5335 support...I'd like the email
backward compatibility policy to go away some day :)  (On the gripping
hand, it will always be possible to recreate it as a custom policy.)
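
A minimal sketch of the binary-mode route, assuming the input is UTF-8
bytes, the lines are short enough not to need re-wrapping, and the
backward compatibility policy (email.policy.compat32) is in effect.  The
field names and values are invented for the example.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator
from email.policy import compat32

# Hypothetical metadata with a non-ASCII value, encoded as UTF-8 bytes.
source = "Name: example\nAuthor: Martin v. Löwis\n\n".encode("utf-8")

# Parse in binary mode under the backward compatibility policy.
msg = message_from_bytes(source, policy=compat32)

# Serialize in binary mode; the generator picks up the message's policy.
buf = io.BytesIO()
BytesGenerator(buf).flatten(msg)
round_tripped = buf.getvalue()

# The UTF-8 bytes of the Author value are carried through untouched.
print(round_tripped)
```

Note that reading msg["Author"] as text is a separate matter under this
policy; the sketch only shows the bytes surviving a parse/serialize
round trip.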

>  > RFC 2822 also has a continuation line semantics which traditionally 
>  > conflicts with the metadata; in particular, line breaks cannot be 
>  > represented (but are interpreted as continuation lines instead).
> 
> Of course line breaks can be represented, without any further change
> to RFC 2822.  Just use Unicode LINE SEPARATOR.  You could even do it
> within ASCII by adhering strictly to RFC 2822 syntax which interprets
> continuation lines by removing exactly the CRLF pair.  Just use ASCII
> TAB as the field separator.

Yes, that is what I was talking to Tarek about.  And since ReST source
shouldn't contain tabs, a tab would probably work as the separator,
if for some reason you didn't want to use LINE SEPARATOR.
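
Both encodings can be sketched with plain string operations.  The
function names are hypothetical, and the tab variant assumes the text
itself contains no tabs (as suggested above for ReST source) and that
unfolding removes exactly the CRLF pair.

```python
LINE_SEPARATOR = "\u2028"  # Unicode LINE SEPARATOR


def encode_with_line_separator(text):
    """Represent line breaks inside a single header value as U+2028."""
    return text.replace("\n", LINE_SEPARATOR)


def decode_with_line_separator(value):
    """Turn U+2028 back into ordinary newlines."""
    return value.replace(LINE_SEPARATOR, "\n")


def fold_with_tabs(text):
    """Fold each line break as CRLF followed by a TAB continuation."""
    return "\r\n\t".join(text.split("\n"))


def unfold_with_tabs(folded):
    """Strict unfolding: drop exactly the CRLF, then read TAB as a break."""
    unfolded = folded.replace("\r\n\t", "\t")  # the TAB itself survives
    return unfolded.replace("\t", "\n")


description = "First line\nSecond line\nThird line"
via_separator = decode_with_line_separator(encode_with_line_separator(description))
via_tabs = unfold_with_tabs(fold_with_tabs(description))
```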

> There's a final dodge that occurs to me: the semantics you're talking
> about are *lexical* semantics in the RFC 2822 context (line unfolding
> and RFC 2047 decoding).  We could possibly in the context of the email
> module treat Metadata as an intermediate post-lexical-decoding
> pre-syntactic-analysis representation.  I don't know if that makes
> sense in the context of using email module facilities to parse
> Metadata.

The policy has hooks that support this.  A policy gets handed the source
line complete with the line breaks, determines what gets stored in the
model, and also gets to control what gets handed back to the application
when a header is retrieved from the model.  The policy can also control
the header folding during serialization.  So preserving line separators
using a custom policy is not only possible, but should be fairly easy.
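
As a rough sketch of such a policy (not a worked-out design): the class
below subclasses email.policy.EmailPolicy and overrides only the
header_fetch_parse hook, relying on the stored value still carrying the
source line breaks.  The class name, the choice of a Description field,
and the rule of dropping one leading whitespace character per
continuation line are all assumptions made for the example.

```python
import re
from email import message_from_string
from email.policy import EmailPolicy


class LinePreservingPolicy(EmailPolicy):
    """Hypothetical policy: keep line breaks when fetching Description."""

    def header_fetch_parse(self, name, value):
        if name.lower() != "description":
            # Every other header gets the default treatment.
            return super().header_fetch_parse(name, value)
        # The stored value still contains the source line breaks.
        lines = re.split(r"\r\n|\r|\n", value)
        # Drop the single whitespace character that marks a continuation.
        rest = [ln[1:] if ln[:1] in (" ", "\t") else ln for ln in lines[1:]]
        return "\n".join([lines[0]] + rest)


source = (
    "Name: example\n"
    "Description: First line\n"
    " second line\n"
    " third line\n"
    "\n"
)
msg = message_from_string(source, policy=LinePreservingPolicy())
description = msg["Description"]
print(description)
```

Serialization is left at the default here; a fuller version would also
override fold so that the breaks are written back out unchanged.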

--David