Re: [Python-Dev] cpython: #12586: add provisional email policy with new header parsing and folding.

On 26.05.2012 00:44, r.david.murray wrote:
http://hg.python.org/cpython/rev/0189b9d2d6bc changeset: 77148:0189b9d2d6bc user: R David Murray rdmurray@bitdance.com date: Fri May 25 18:42:14 2012 -0400 summary: #12586: add provisional email policy with new header parsing and folding.
When the new policies are used (and only when the new policies are explicitly used) headers turn into objects that have attributes based on their parsed values, and can be set using objects that encapsulate the values, as well as set directly from unicode strings. The folding algorithm then takes care of encoding unicode where needed, and folding according to the highest level syntactic objects.
With this patch only date and time headers are parsed as anything other than unstructured, but that is all the helper methods in the existing API handle. I do plan to add more parsers, and complete the set specified in the RFC before the package becomes stable.
files: Doc/library/email.policy.rst | 323 + Lib/email/_encoded_words.py | 211 + Lib/email/_header_value_parser.py | 2145 ++++++++ Lib/email/_headerregistry.py | 456 + Lib/email/_policybase.py | 12 +- Lib/email/errors.py | 43 +- Lib/email/generator.py | 11 +- Lib/email/policy.py | 173 +- Lib/email/utils.py | 7 + Lib/test/test_email/__init__.py | 6 + Lib/test/test_email/test__encoded_words.py | 187 + Lib/test/test_email/test__header_value_parser.py | 2466 ++++++++++ Lib/test/test_email/test__headerregistry.py | 717 ++ Lib/test/test_email/test_generator.py | 170 +- Lib/test/test_email/test_pickleable.py | 57 + Lib/test/test_email/test_policy.py | 126 +- 16 files changed, 6994 insertions(+), 116 deletions(-)
diff --git a/Doc/library/email.policy.rst b/Doc/library/email.policy.rst --- a/Doc/library/email.policy.rst +++ b/Doc/library/email.policy.rst @@ -306,3 +306,326 @@ ``7bit``, non-ascii binary data is CTE encoded using the ``unknown-8bit`` charset. Otherwise the original source header is used, with its existing line breaks and any (RFC invalid) binary data it may contain.
+.. note::
- The remainder of the classes documented below are included in the standard
- library on a :term:`provisional basis <provisional package>`. Backwards
- incompatible changes (up to and including removal of the feature) may occur
- if deemed necessary by the core developers.
+.. class:: EmailPolicy(**kw)
- This concrete :class:`Policy` provides behavior that is intended to be fully
- compliant with the current email RFCs. These include (but are not limited
- to) :rfc:`5322`, :rfc:`2047`, and the current MIME RFCs.
- This policy adds new header parsing and folding algorithms. Instead of
- simple strings, headers are custom objects with custom attributes depending
- on the type of the field. The parsing and folding algorithm fully implement
- :rfc:`2047` and :rfc:`5322`.
- In addition to the settable attributes listed above that apply to all
- policies, this policy adds the following additional attributes:
- .. attribute:: refold_source
If the value for a header in the ``Message`` object originated from a
:mod:`~email.parser` (as opposed to being set by a program), this
attribute indicates whether or not a generator should refold that value
when transforming the message back into stream form. The possible values
are:
======== ===============================================================
``none`` all source values use original folding
``long`` source values that have any line that is longer than
``max_line_length`` will be refolded
``all`` all values are refolded.
======== ===============================================================
The default is ``long``.
- .. attribute:: header_factory
A callable that takes two arguments, ``name`` and ``value``, where
``name`` is a header field name and ``value`` is an unfolded header field
value, and returns a string-like object that represents that header. A
default ``header_factory`` is provided that understands some of the
:rfc:`5322` header field types. (Currently address fields and date
fields have special treatment, while all other fields are treated as
unstructured. This list will be completed before the extension is marked
stable.)
- The class provides the following concrete implementations of the abstract
- methods of :class:`Policy`:
- .. method:: header_source_parse(sourcelines)
The implementation of this method is the same as that for the
:class:`Compat32` policy.
- .. method:: header_store_parse(name, value)
The name is returned unchanged. If the input value has a ``name``
attribute and it matches *name* ignoring case, the value is returned
unchanged. Otherwise the *name* and *value* are passed to
``header_factory``, and the resulting custom header object is returned as
the value. In this case a ``ValueError`` is raised if the input value
contains CR or LF characters.
- .. method:: header_fetch_parse(name, value)
If the value has a ``name`` attribute, it is returned unmodified.
Otherwise the *name*, and the *value* with any CR or LF characters
removed, are passed to the ``header_factory``, and the resulting custom
header object is returned. Any surrogateescaped bytes get turned into
the unicode unknown-character glyph.
- .. method:: fold(name, value)
Header folding is controlled by the :attr:`refold_source` policy setting.
A value is considered to be a 'source value' if and only if it does not
have a ``name`` attribute (having a ``name`` attribute means it is a
header object of some sort). If a source value needs to be refolded
according to the policy, it is converted into a custom header object by
passing the *name* and the *value* with any CR and LF characters removed
to the ``header_factory``. Folding of a custom header object is done by
calling its ``fold`` method with the current policy.
Source values are split into lines using :meth:`~str.splitlines`. If
the value is not to be refolded, the lines are rejoined using the
``linesep`` from the policy and returned. The exception is lines
containing non-ascii binary data. In that case the value is refolded
regardless of the ``refold_source`` setting, which causes the binary data
to be CTE encoded using the ``unknown-8bit`` charset.
- .. method:: fold_binary(name, value)
The same as :meth:`fold` if :attr:`cte_type` is ``7bit``, except that
the returned value is bytes.
If :attr:`cte_type` is ``8bit``, non-ASCII binary data is converted back
into bytes. Headers with binary data are not refolded, regardless of the
``refold_source`` setting, since there is no way to know whether the
binary data consists of single byte characters or multibyte characters.
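The effect of ``refold_source`` on a source value can be sketched by calling ``fold`` directly. This is only an illustration: the header text is made up, and it assumes a Python release in which ``email.policy.default`` is an ``EmailPolicy`` instance as described above.

```python
from email import policy

# A plain string has no ``name`` attribute, so fold() treats it as a
# "source value". This made-up value is 149 characters long.
long_value = " ".join(["word"] * 30)

# refold_source="none": the original folding (here, a single line) is kept.
keep = policy.default.clone(refold_source="none")
kept = keep.fold("Subject", long_value)

# The default, refold_source="long": the value exceeds max_line_length,
# so it is handed to header_factory and refolded.
refolded = policy.default.fold("Subject", long_value)

print(len(kept.splitlines()), len(refolded.splitlines()))
```

With ``none`` the over-long line is emitted as it was given; with the default setting the same value comes back split over several lines.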
+The following instances of :class:`EmailPolicy` provide defaults suitable for +specific application domains. Note that in the future the behavior of these +instances (in particular the ``HTTP`` instance) may be adjusted to conform even +more closely to the RFCs relevant to their domains.
+.. data:: default
- An instance of ``EmailPolicy`` with all defaults unchanged. This policy
- uses the standard Python ``\n`` line endings rather than the RFC-correct
- ``\r\n``.
+.. data:: SMTP
- Suitable for serializing messages in conformance with the email RFCs.
- Like ``default``, but with ``linesep`` set to ``\r\n``, which is RFC
- compliant.
+.. data:: HTTP
- Suitable for serializing headers for use in HTTP traffic. Like
- ``SMTP`` except that ``max_line_length`` is set to ``None`` (unlimited).
+.. data:: strict
- Convenience instance. The same as ``default`` except that
- ``raise_on_defect`` is set to ``True``. This allows any policy to be made
- strict by writing::
somepolicy + policy.strict
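As a small illustration of that addition, combining the ``SMTP`` instance with ``strict`` should give a policy that keeps the ``\r\n`` line separator and also raises on defects. The variable name is invented for the example.

```python
from email import policy

# Settings explicitly set on the right-hand operand are applied on top of
# the left-hand policy; the result is a new policy object.
strict_smtp = policy.SMTP + policy.strict

print(repr(strict_smtp.linesep), strict_smtp.raise_on_defect)
```

The original ``policy.SMTP`` instance is left untouched, since policies are immutable and addition returns a new object.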
+With all of these :class:`EmailPolicies <.EmailPolicy>`, the effective API of +the email package is changed from the Python 3.2 API in the following ways:
- Setting a header on a :class:`~email.message.Message` results in that
header being parsed and a custom header object created.
- Fetching a header value from a :class:`~email.message.Message` results
in that header being parsed and a custom header object created and
returned.
- Any custom header object, or any header that is refolded due to the
policy settings, is folded using an algorithm that fully implements the
RFC folding algorithms, including knowing where encoded words are required
and allowed.
+From the application view, this means that any header obtained through the +:class:`~email.message.Message` is a custom header object with custom +attributes, whose string value is the fully decoded unicode value of the +header. Likewise, a header may be assigned a new value, or a new header +created, using a unicode string, and the policy will take care of converting +the unicode string into the correct RFC encoded form.
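A minimal sketch of that application view, assuming a Python release where ``email.policy.default`` is the ``EmailPolicy`` instance described here. The subject text is made up.

```python
from email import policy
from email.message import Message

msg = Message(policy=policy.default)

# Assign a unicode string; the policy turns it into a custom header object.
msg["Subject"] = "Grüße aus Köln"

subject = msg["Subject"]
decoded = str(subject)       # the fully decoded unicode value
field_name = subject.name    # header objects carry their field name

# Folding with a policy yields the RFC 2047 encoded, ASCII-only wire form.
wire = subject.fold(policy=policy.default)
print(wire)
```

The application only ever handles the decoded text; the encoded-word form appears when the header is serialized.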
+The custom header objects and their attributes are described below. All custom +header objects are string subclasses, and their string value is the fully +decoded value of the header field (the part of the field after the ``:``).
+.. class:: BaseHeader
- This is the base class for all custom header objects. It provides the
- following attributes:
- .. attribute:: name
The header field name (the portion of the field before the ':').
- .. attribute:: defects
A possibly empty list of :class:`~email.errors.MessageDefect` objects
that record any RFC violations found while parsing the header field.
- .. method:: fold(*, policy)
Return a string containing :attr:`~email.policy.Policy.linesep`
characters as required to correctly fold the header according
to *policy*. A :attr:`~email.policy.Policy.cte_type` of
``8bit`` will be treated as if it were ``7bit``, since strings
may not contain binary data.
+.. class:: UnstructuredHeader
- The class used for any header that does not have a more specific
- type. (The :mailheader:`Subject` header is an example of an
- unstructured header.) It does not have any additional attributes.
+.. class:: DateHeader
- The value of this type of header is a single date and time value. The
- primary example of this type of header is the :mailheader:`Date` header.
- .. attribute:: datetime
A :class:`~datetime.datetime` encoding the date and time from the
header value.
The ``datetime`` will be a naive ``datetime`` if the value either does
not have a specified timezone (which would be a violation of the RFC) or
if the timezone is specified as ``-0000``. This timezone value indicates
that the date and time is to be considered to be in UTC, but with no
indication of the local timezone in which it was generated. (This
contrasts to ``+0000``, which indicates a date and time that really is in
the UTC ``0000`` timezone.)
If the header value contains a valid timezone that is not ``-0000``, the
``datetime`` will be an aware ``datetime`` having a
:class:`~datetime.tzinfo` set to the :class:`~datetime.timezone`
indicated by the header value.
- A ``datetime`` may also be assigned to a :mailheader:`Date` type header.
- The resulting string value will use a timezone of ``-0000`` if the
- ``datetime`` is naive, and the appropriate UTC offset if the ``datetime`` is
- aware.
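The naive versus aware distinction can be tried out as follows. This is a sketch with invented message objects, assuming ``email.policy.default`` behaves as documented above.

```python
import datetime
from email import policy
from email.message import Message

# A real UTC offset gives an aware datetime.
aware_msg = Message(policy=policy.default)
aware_msg["Date"] = "Fri, 25 May 2012 18:42:14 -0400"
aware = aware_msg["Date"].datetime

# "-0000" means UTC with no known local zone, so the datetime is naive.
naive_msg = Message(policy=policy.default)
naive_msg["Date"] = "Fri, 25 May 2012 22:42:14 -0000"
naive = naive_msg["Date"].datetime

# Assigning a naive datetime produces a string value ending in "-0000".
out_msg = Message(policy=policy.default)
out_msg["Date"] = datetime.datetime(2012, 5, 25, 22, 42, 14)
rendered = str(out_msg["Date"])
print(rendered)
```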
+.. class:: AddressHeader
- This class is used for all headers that can contain addresses, whether they
- are supposed to be singleton addresses or a list.
- .. attribute:: addresses
A list of :class:`.Address` objects listing all of the addresses that
could be parsed out of the field value.
- .. attribute:: groups
A list of :class:`.Group` objects. Every address in :attr:`.addresses`
appears in one of the group objects in the list. Addresses that are not
syntactically part of a group are represented by ``Group`` objects whose
``name`` is ``None``.
- In addition to addresses in string form, any combination of
- :class:`.Address` and :class:`.Group` objects, singly or in a list, may be
- assigned to an address header.
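A short sketch of reading ``addresses`` and ``groups`` back from an address header. The names and ``example.*`` addresses are hypothetical, and the code assumes ``email.policy.default`` is an ``EmailPolicy`` instance.

```python
from email import policy
from email.message import Message

msg = Message(policy=policy.default)
# One single address followed by a group containing two addresses.
msg["To"] = "Ann <ann@example.com>, Devs: a@example.org, b@example.org;"

to = msg["To"]

# Every address, whether or not it is syntactically inside a group.
specs = [address.addr_spec for address in to.addresses]

# The single address is wrapped in a Group whose display_name is None.
group_names = [group.display_name for group in to.groups]
print(specs, group_names)
```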
+.. class:: Address(display_name='', username='', domain='', addr_spec=None)
- The class used to represent an email address. The general form of an
- address is::
[display_name] <username@domain>
- or::
username@domain
- where each part must conform to specific syntax rules spelled out in
- :rfc:`5322`.
- As a convenience *addr_spec* can be specified instead of *username* and
- *domain*, in which case *username* and *domain* will be parsed from the
- *addr_spec*. An *addr_spec* must be a properly RFC quoted string; if it is
- not ``Address`` will raise an error. Unicode characters are allowed and
- will be properly encoded when serialized. However, per the RFCs, unicode is
- *not* allowed in the username portion of the address.
- .. attribute:: display_name
The display name portion of the address, if any, with all quoting
removed. If the address does not have a display name, this attribute
will be an empty string.
- .. attribute:: username
The ``username`` portion of the address, with all quoting removed.
- .. attribute:: domain
The ``domain`` portion of the address.
- .. attribute:: addr_spec
The ``username@domain`` portion of the address, correctly quoted
for use as a bare address (the second form shown above). This
attribute is not mutable.
- .. method:: __str__()
The ``str`` value of the object is the address quoted according to
:rfc:`5322` rules, but with no Content Transfer Encoding of any non-ASCII
characters.
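For illustration, an ``Address`` can be built either from its parts or from an *addr_spec*. The sketch assumes the class is importable from ``email.headerregistry`` (the commit itself adds it as ``email._headerregistry``), and the person and address are invented.

```python
from email.headerregistry import Address

# Build an address from its parts. The "." in the display name is an
# RFC 5322 special character, so str() quotes the display name.
by_parts = Address(display_name="Jane Q. Public",
                   username="jane", domain="example.com")
text = str(by_parts)

# Or let Address parse username and domain out of an addr_spec.
by_spec = Address(addr_spec="jane@example.com")
print(text, by_spec.username, by_spec.domain)
```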
+.. class:: Group(display_name=None, addresses=None)
- The class used to represent an address group. The general form of an
- address group is::
display_name: [address-list];
- As a convenience for processing lists of addresses that consist of a mixture
- of groups and single addresses, a ``Group`` may also be used to represent
- single addresses that are not part of a group by setting *display_name* to
- ``None`` and providing a list of the single address as *addresses*.
- .. attribute:: display_name
The ``display_name`` of the group. If it is ``None`` and there is
exactly one ``Address`` in ``addresses``, then the ``Group`` represents a
single address that is not in a group.
- .. attribute:: addresses
A possibly empty tuple of :class:`.Address` objects representing the
addresses in the group.
- .. method:: __str__()
The ``str`` value of a ``Group`` is formatted according to :rfc:`5322`,
but with no Content Transfer Encoding of any non-ASCII characters. If
``display_name`` is ``None`` and there is a single ``Address`` in the
``addresses`` list, the ``str`` value will be the same as the ``str`` of
that single ``Address``.
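The two ``str`` forms of a ``Group`` can be sketched like this, again assuming the classes are importable from ``email.headerregistry`` and using made-up addresses.

```python
from email.headerregistry import Address, Group

# A real group: rendered as "display_name: address-list;".
team = Group("Devs", [Address(username="a", domain="example.org"),
                      Address(username="b", domain="example.org")])
team_text = str(team)

# display_name=None with exactly one address: the Group stands for a
# single ungrouped address and renders exactly like that Address.
ann = Address(display_name="Ann", username="ann", domain="example.org")
solo = Group(None, [ann])
solo_text = str(solo)
print(team_text, "|", solo_text)
```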
There's a lot of new stuff here: should it have a versionadded? (Or do we need new markup for "provisional" stuff?)
Georg

On Sat, 26 May 2012 09:14:07 +0200, Georg Brandl g.brandl@gmx.net wrote:
On 26.05.2012 00:44, r.david.murray wrote:
http://hg.python.org/cpython/rev/0189b9d2d6bc changeset: 77148:0189b9d2d6bc user: R David Murray rdmurray@bitdance.com date: Fri May 25 18:42:14 2012 -0400 summary: #12586: add provisional email policy with new header parsing and folding.
[...]
diff --git a/Doc/library/email.policy.rst b/Doc/library/email.policy.rst --- a/Doc/library/email.policy.rst +++ b/Doc/library/email.policy.rst @@ -306,3 +306,326 @@ ``7bit``, non-ascii binary data is CTE encoded using the ``unknown-8bit`` charset. Otherwise the original source header is used, with its existing line breaks and any (RFC invalid) binary data it may contain.
[...]
There's a lot of new stuff here: should it have a versionadded? (Or do we need new markup for "provisional" stuff?)
The entire policy module is new in 3.3 and has a versionadded at the top.
New markup for provisional would be cool, though.
I think eventually some of these docs will get factored out of policy, but that probably won't happen until it is no longer provisional. At that point I'll be doing a massive doc reorganization to deprecate many of the old APIs.
Another option here is to consider 'policy' itself as the provisional package...except that to use it requires hooks in the other packages (the policy= keyword arguments). And I'm pretty satisfied with the API of the policy module itself, so I don't think it needs to be considered provisional.
--David
participants (2): Georg Brandl, R. David Murray