[Python-Dev] cpython: #12586: add provisional email policy with new header parsing and folding.

Georg Brandl g.brandl at gmx.net
Sat May 26 09:14:07 CEST 2012


On 26.05.2012 00:44, r.david.murray wrote:
> http://hg.python.org/cpython/rev/0189b9d2d6bc
> changeset:   77148:0189b9d2d6bc
> user:        R David Murray <rdmurray at bitdance.com>
> date:        Fri May 25 18:42:14 2012 -0400
> summary:
>   #12586: add provisional email policy with new header parsing and folding.
> 
> When the new policies are used (and only when the new policies are explicitly
> used) headers turn into objects that have attributes based on their parsed
> values, and can be set using objects that encapsulate the values, as well as
> set directly from unicode strings.  The folding algorithm then takes care of
> encoding unicode where needed, and folding according to the highest level
> syntactic objects.
> 
> With this patch only date and address headers are parsed as anything other than
> unstructured, but that is all the helper methods in the existing API handle.
> I do plan to add more parsers, and complete the set specified in the RFC
> before the package becomes stable.
> 
> files:
>   Doc/library/email.policy.rst                     |   323 +
>   Lib/email/_encoded_words.py                      |   211 +
>   Lib/email/_header_value_parser.py                |  2145 ++++++++
>   Lib/email/_headerregistry.py                     |   456 +
>   Lib/email/_policybase.py                         |    12 +-
>   Lib/email/errors.py                              |    43 +-
>   Lib/email/generator.py                           |    11 +-
>   Lib/email/policy.py                              |   173 +-
>   Lib/email/utils.py                               |     7 +
>   Lib/test/test_email/__init__.py                  |     6 +
>   Lib/test/test_email/test__encoded_words.py       |   187 +
>   Lib/test/test_email/test__header_value_parser.py |  2466 ++++++++++
>   Lib/test/test_email/test__headerregistry.py      |   717 ++
>   Lib/test/test_email/test_generator.py            |   170 +-
>   Lib/test/test_email/test_pickleable.py           |    57 +
>   Lib/test/test_email/test_policy.py               |   126 +-
>   16 files changed, 6994 insertions(+), 116 deletions(-)
> 
> 
> diff --git a/Doc/library/email.policy.rst b/Doc/library/email.policy.rst
> --- a/Doc/library/email.policy.rst
> +++ b/Doc/library/email.policy.rst
> @@ -306,3 +306,326 @@
>        ``7bit``, non-ascii binary data is CTE encoded using the ``unknown-8bit``
>        charset.  Otherwise the original source header is used, with its existing
>        line breaks and any (RFC invalid) binary data it may contain.
> +
> +
> +.. note::
> +
> +   The remainder of the classes documented below are included in the standard
> +   library on a :term:`provisional basis <provisional package>`.  Backwards
> +   incompatible changes (up to and including removal of the feature) may occur
> +   if deemed necessary by the core developers.
> +
> +
> +.. class:: EmailPolicy(**kw)
> +
> +   This concrete :class:`Policy` provides behavior that is intended to be fully
> +   compliant with the current email RFCs.  These include (but are not limited
> +   to) :rfc:`5322`, :rfc:`2047`, and the current MIME RFCs.
> +
> +   This policy adds new header parsing and folding algorithms.  Instead of
> +   simple strings, headers are custom objects with custom attributes depending
> +   on the type of the field.  The parsing and folding algorithms fully implement
> +   :rfc:`2047` and :rfc:`5322`.
> +
> +   In addition to the settable attributes listed above that apply to all
> +   policies, this policy adds the following additional attributes:
> +
> +   .. attribute:: refold_source
> +
> +      If the value for a header in the ``Message`` object originated from a
> +      :mod:`~email.parser` (as opposed to being set by a program), this
> +      attribute indicates whether or not a generator should refold that value
> +      when transforming the message back into stream form.  The possible values
> +      are:
> +
> +      ========  ===============================================================
> +      ``none``  all source values use original folding
> +
> +      ``long``  source values that have any line that is longer than
> +                ``max_line_length`` will be refolded
> +
> +      ``all``   all values are refolded.
> +      ========  ===============================================================
> +
> +      The default is ``long``.
> +
> +   .. attribute:: header_factory
> +
> +      A callable that takes two arguments, ``name`` and ``value``, where
> +      ``name`` is a header field name and ``value`` is an unfolded header field
> +      value, and returns a string-like object that represents that header.  A
> +      default ``header_factory`` is provided that understands some of the
> +      :rfc:`5322` header field types.  (Currently address fields and date
> +      fields have special treatment, while all other fields are treated as
> +      unstructured.  This list will be completed before the extension is marked
> +      stable.)
> +
> +   The class provides the following concrete implementations of the abstract
> +   methods of :class:`Policy`:
> +
> +   .. method:: header_source_parse(sourcelines)
> +
> +      The implementation of this method is the same as that for the
> +      :class:`Compat32` policy.
> +
> +   .. method:: header_store_parse(name, value)
> +
> +      The name is returned unchanged.  If the input value has a ``name``
> +      attribute and it matches *name* ignoring case, the value is returned
> +      unchanged.  Otherwise the *name* and *value* are passed to
> +      ``header_factory``, and the resulting custom header object is returned as
> +      the value.  In this case a ``ValueError`` is raised if the input value
> +      contains CR or LF characters.
> +
> +   .. method:: header_fetch_parse(name, value)
> +
> +      If the value has a ``name`` attribute, it is returned unmodified.
> +      Otherwise the *name*, and the *value* with any CR or LF characters
> +      removed, are passed to the ``header_factory``, and the resulting custom
> +      header object is returned.  Any surrogateescaped bytes get turned into
> +      the unicode unknown-character glyph.
> +
> +   .. method:: fold(name, value)
> +
> +      Header folding is controlled by the :attr:`refold_source` policy setting.
> +      A value is considered to be a 'source value' if and only if it does not
> +      have a ``name`` attribute (having a ``name`` attribute means it is a
> +      header object of some sort).  If a source value needs to be refolded
> +      according to the policy, it is converted into a custom header object by
> +      passing the *name* and the *value* with any CR and LF characters removed
> +      to the ``header_factory``.  Folding of a custom header object is done by
> +      calling its ``fold`` method with the current policy.
> +
> +      Source values are split into lines using :meth:`~str.splitlines`.  If
> +      the value is not to be refolded, the lines are rejoined using the
> +      ``linesep`` from the policy and returned.  The exception is lines
> +      containing non-ascii binary data.  In that case the value is refolded
> +      regardless of the ``refold_source`` setting, which causes the binary data
> +      to be CTE encoded using the ``unknown-8bit`` charset.
> +
> +   .. method:: fold_binary(name, value)
> +
> +      The same as :meth:`fold` if :attr:`cte_type` is ``7bit``, except that
> +      the returned value is bytes.
> +
> +      If :attr:`cte_type` is ``8bit``, non-ASCII binary data is converted back
> +      into bytes.  Headers with binary data are not refolded, regardless of the
> +      ``refold_source`` setting, since there is no way to know whether the
> +      binary data consists of single byte characters or multibyte characters.
> +
> +The following instances of :class:`EmailPolicy` provide defaults suitable for
> +specific application domains.  Note that in the future the behavior of these
> +instances (in particular the ``HTTP`` instance) may be adjusted to conform even
> +more closely to the RFCs relevant to their domains.
> +
> +.. data:: default
> +
> +   An instance of ``EmailPolicy`` with all defaults unchanged.  This policy
> +   uses the standard Python ``\n`` line endings rather than the RFC-correct
> +   ``\r\n``.
> +
> +.. data:: SMTP
> +
> +   Suitable for serializing messages in conformance with the email RFCs.
> +   Like ``default``, but with ``linesep`` set to ``\r\n``, which is RFC
> +   compliant.
> +
> +.. data:: HTTP
> +
> +   Suitable for serializing headers for use in HTTP traffic.  Like
> +   ``SMTP`` except that ``max_line_length`` is set to ``None`` (unlimited).
> +
> +.. data:: strict
> +
> +   Convenience instance.  The same as ``default`` except that
> +   ``raise_on_defect`` is set to ``True``.  This allows any policy to be made
> +   strict by writing::
> +
> +        somepolicy + policy.strict
> +
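
To make the policy addition above concrete, here is a minimal sketch. It
assumes the instances are importable from ``email.policy`` under the names the
patch documents (``default``, ``SMTP``, ``HTTP``, ``strict``); the variable
name ``strict_smtp`` is only illustrative.

```python
from email import policy

# Adding two policies yields a new policy object: the settings that the
# right-hand operand changed are layered on top of the left-hand one.
strict_smtp = policy.SMTP + policy.strict

print(repr(strict_smtp.linesep))    # line separator inherited from SMTP
print(strict_smtp.raise_on_defect)  # defect handling taken from strict
print(policy.HTTP.max_line_length)  # HTTP places no limit on line length
```

Neither operand is modified, so ``policy.SMTP`` itself stays non-strict.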
> +With all of these :class:`EmailPolicies <.EmailPolicy>`, the effective API of
> +the email package is changed from the Python 3.2 API in the following ways:
> +
> +   * Setting a header on a :class:`~email.message.Message` results in that
> +     header being parsed and a custom header object created.
> +
> +   * Fetching a header value from a :class:`~email.message.Message` results
> +     in that header being parsed and a custom header object created and
> +     returned.
> +
> +   * Any custom header object, or any header that is refolded due to the
> +     policy settings, is folded using an algorithm that fully implements the
> +     RFC folding algorithms, including knowing where encoded words are required
> +     and allowed.
> +
> +From the application view, this means that any header obtained through the
> +:class:`~email.message.Message` is a custom header object with custom
> +attributes, whose string value is the fully decoded unicode value of the
> +header.  Likewise, a header may be assigned a new value, or a new header
> +created, using a unicode string, and the policy will take care of converting
> +the unicode string into the correct RFC encoded form.
> +
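
A small sketch of what that means from the application side. It assumes the
``EmailMessage`` class from released Python (it is not part of this
changeset); the subject text and the ``X-Note`` header are made up for
illustration.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage(policy=policy.default)

# Assign a plain unicode string; the policy turns it into a header object.
msg["Subject"] = "Grüße aus Köln"
subject = msg["Subject"]

# Serializing lets the folding algorithm add RFC 2047 encoded words.
wire = msg.as_string()

# header_store_parse rejects values that contain CR or LF characters.
try:
    msg["X-Note"] = "first line\nsecond line"
except ValueError:
    rejected = True
else:
    rejected = False

print(subject)
print(wire)
print(rejected)
```

The string value of the fetched header is the decoded text, while the
serialized form contains only ASCII.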
> +The custom header objects and their attributes are described below.  All custom
> +header objects are string subclasses, and their string value is the fully
> +decoded value of the header field (the part of the field after the ``:``).
> +
> +
> +.. class:: BaseHeader
> +
> +   This is the base class for all custom header objects.  It provides the
> +   following attributes:
> +
> +   .. attribute:: name
> +
> +      The header field name (the portion of the field before the ':').
> +
> +   .. attribute:: defects
> +
> +      A possibly empty list of :class:`~email.errors.MessageDefect` objects
> +      that record any RFC violations found while parsing the header field.
> +
> +   .. method:: fold(*, policy)
> +
> +      Return a string containing :attr:`~email.policy.Policy.linesep`
> +      characters as required to correctly fold the header according
> +      to *policy*.  A :attr:`~email.policy.Policy.cte_type` of
> +      ``8bit`` will be treated as if it were ``7bit``, since strings
> +      may not contain binary data.
> +
> +
> +.. class:: UnstructuredHeader
> +
> +   The class used for any header that does not have a more specific
> +   type.  (The :mailheader:`Subject` header is an example of an
> +   unstructured header.)  It does not have any additional attributes.
> +
> +
> +.. class:: DateHeader
> +
> +   The value of this type of header is a single date and time value.  The
> +   primary example of this type of header is the :mailheader:`Date` header.
> +
> +   .. attribute:: datetime
> +
> +      A :class:`~datetime.datetime` encoding the date and time from the
> +      header value.
> +
> +      The ``datetime`` will be a naive ``datetime`` if the value either does
> +      not have a specified timezone (which would be a violation of the RFC) or
> +      if the timezone is specified as ``-0000``.  This timezone value indicates
> +      that the date and time is to be considered to be in UTC, but with no
> +      indication of the local timezone in which it was generated.  (This
> +      contrasts to ``+0000``, which indicates a date and time that really is in
> +      the UTC ``0000`` timezone.)
> +
> +      If the header value contains a valid timezone that is not ``-0000``, the
> +      ``datetime`` will be an aware ``datetime`` having a
> +      :class:`~datetime.tzinfo` set to the :class:`~datetime.timezone`
> +      indicated by the header value.
> +
> +   A ``datetime`` may also be assigned to a :mailheader:`Date` type header.
> +   The resulting string value will use a timezone of ``-0000`` if the
> +   ``datetime`` is naive, and the appropriate UTC offset if the ``datetime`` is
> +   aware.
> +
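
The naive/aware distinction can be checked with a short sketch like the
following. It assumes ``email.message_from_string`` accepts the ``policy``
keyword and that ``EmailMessage`` is available, as in released Python; the
sample dates are arbitrary.

```python
import datetime
from email import message_from_string, policy
from email.message import EmailMessage

# A real UTC offset produces an aware datetime.
aware = message_from_string(
    "Date: Fri, 25 May 2012 18:42:14 -0400\n\n", policy=policy.default
)["Date"].datetime

# The special -0000 offset produces a naive datetime.
naive = message_from_string(
    "Date: Fri, 25 May 2012 22:42:14 -0000\n\n", policy=policy.default
)["Date"].datetime

# Assigning a naive datetime serializes it with the -0000 offset.
msg = EmailMessage(policy=policy.default)
msg["Date"] = datetime.datetime(2012, 5, 25, 18, 42, 14)

print(aware.utcoffset())
print(naive.tzinfo)
print(msg["Date"])
```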
> +
> +.. class:: AddressHeader
> +
> +   This class is used for all headers that can contain addresses, whether they
> +   are supposed to be singleton addresses or a list.
> +
> +   .. attribute:: addresses
> +
> +      A list of :class:`.Address` objects listing all of the addresses that
> +      could be parsed out of the field value.
> +
> +   .. attribute:: groups
> +
> +      A list of :class:`.Group` objects.  Every address in :attr:`.addresses`
> +      appears in one of the group objects in the list.  Addresses that are not
> +      syntactically part of a group are represented by ``Group`` objects whose
> +      ``name`` is ``None``.
> +
> +   In addition to addresses in string form, any combination of
> +   :class:`.Address` and :class:`.Group` objects, singly or in a list, may be
> +   assigned to an address header.
> +
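
As a sketch of how ``addresses`` and ``groups`` relate, the following parses a
``To`` header that mixes a single address with a group. The addresses are
invented, and the code assumes the ``policy`` keyword of
``email.message_from_string`` as found in released Python.

```python
from email import message_from_string, policy

raw = (
    "To: Ann Example <ann@example.org>, "
    "team: bob@example.org, carol@example.org;\n"
    "\n"
)
to = message_from_string(raw, policy=policy.default)["To"]

# Every mailbox, whether or not it sits inside a group.
addr_specs = [a.addr_spec for a in to.addresses]

# One Group per top-level item; a bare address gets a display_name of None.
group_names = [g.display_name for g in to.groups]

print(addr_specs)
print(group_names)
```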
> +
> +.. class:: Address(display_name='', username='', domain='', addr_spec=None)
> +
> +   The class used to represent an email address.  The general form of an
> +   address is::
> +
> +      [display_name] <username@domain>
> +
> +   or::
> +
> +      username@domain
> +
> +   where each part must conform to specific syntax rules spelled out in
> +   :rfc:`5322`.
> +
> +   As a convenience *addr_spec* can be specified instead of *username* and
> +   *domain*, in which case *username* and *domain* will be parsed from the
> +   *addr_spec*.  An *addr_spec* must be a properly RFC quoted string; if it is
> +   not, ``Address`` will raise an error.  Unicode characters are allowed and
> +   will be properly encoded when serialized.  However, per the RFCs, unicode is
> +   *not* allowed in the username portion of the address.
> +
> +   .. attribute:: display_name
> +
> +      The display name portion of the address, if any, with all quoting
> +      removed.  If the address does not have a display name, this attribute
> +      will be an empty string.
> +
> +   .. attribute:: username
> +
> +      The ``username`` portion of the address, with all quoting removed.
> +
> +   .. attribute:: domain
> +
> +      The ``domain`` portion of the address.
> +
> +   .. attribute:: addr_spec
> +
> +      The ``username@domain`` portion of the address, correctly quoted
> +      for use as a bare address (the second form shown above).  This
> +      attribute is not mutable.
> +
> +   .. method:: __str__()
> +
> +      The ``str`` value of the object is the address quoted according to
> +      :rfc:`5322` rules, but with no Content Transfer Encoding of any non-ASCII
> +      characters.
> +
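
A minimal sketch of the two ways to construct an ``Address``. It imports from
``email.headerregistry``, the public module name in released Python; in this
changeset the module is still the private ``_headerregistry``. The names and
addresses are placeholders.

```python
from email.headerregistry import Address

# Built from separate parts.
full = Address(display_name="Ann Example", username="ann", domain="example.org")

# Built from an addr_spec, which is split into username and domain.
bare = Address(addr_spec="bob@example.org")

print(str(full))
print(full.addr_spec)
print(bare.username, bare.domain)
```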
> +
> +.. class:: Group(display_name=None, addresses=None)
> +
> +   The class used to represent an address group.  The general form of an
> +   address group is::
> +
> +     display_name: [address-list];
> +
> +   As a convenience for processing lists of addresses that consist of a mixture
> +   of groups and single addresses, a ``Group`` may also be used to represent
> +   single addresses that are not part of a group by setting *display_name* to
> +   ``None`` and providing a list of the single address as *addresses*.
> +
> +   .. attribute:: display_name
> +
> +      The ``display_name`` of the group.  If it is ``None`` and there is
> +      exactly one ``Address`` in ``addresses``, then the ``Group`` represents a
> +      single address that is not in a group.
> +
> +   .. attribute:: addresses
> +
> +      A possibly empty tuple of :class:`.Address` objects representing the
> +      addresses in the group.
> +
> +   .. method:: __str__()
> +
> +      The ``str`` value of a ``Group`` is formatted according to :rfc:`5322`,
> +      but with no Content Transfer Encoding of any non-ASCII characters.  If
> +      ``display_name`` is ``None`` and there is a single ``Address`` in the
> +      ``addresses`` list, the ``str`` value will be the same as the ``str`` of
> +      that single ``Address``.
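
The ``Group`` behaviour described above, as a short sketch. As with
``Address``, the import assumes the public ``email.headerregistry`` module of
released Python rather than the private module in this patch, and the group
name and addresses are made up.

```python
from email.headerregistry import Address, Group

ann = Address("Ann Example", "ann", "example.org")
bob = Address(addr_spec="bob@example.org")

# A named group containing two addresses.
team = Group("team", [ann, bob])

# A Group with no display name that wraps a single address.
solo = Group(None, [ann])

print(str(team))
print(str(solo))
print(type(team.addresses).__name__)
```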

There's a lot of new stuff here: should it have a versionadded?  (Or do we need new
markup for "provisional" stuff?)

Georg
