[Email-SIG] Proposal: feedparser.py stream large attachments

PyTonic pytonic at i2pmail.org
Mon May 25 21:04:34 CEST 2015


I'd like to propose a backwards compatible change for feedparser.py
which optionally allows streaming of Message() payloads via subclassed
message.Message objects.

Currently, storing of message payload is implemented by creating a
local list in FeedParser(), appending incoming lines to that list and
finally joining that list to set it as payload inside the Message()
object [1].

This may work (and may actually be desirable) for smaller payloads.
Once one starts dealing with payloads bigger than, let's say, 20 MB, it
becomes less practical, to say nothing of what happens with 500 MB
payloads.

A 20 MB Base64 encoded payload (where encoding adds about 33 %) costs:
1) about (20 * 1.33) MB memory inside the local list at FeedParser()
2) another (20 * 1.33) MB RAM once it is set via self._cur.set_payload()
The first copy will be garbage collected at some point, but until then
both are kept in memory at the same time.

Once the user requests this payload with get_payload(decode=True), it is
again held in memory twice:
3) in its encoded form (20 * 1.33) MB via self._cur._payload
4) in its decoded binary form 20 MB returned from various wrappers
    around binascii.a2b_base64()
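
As a rough check of these figures, the following sketch measures the
base64 expansion on a small stand-in payload and adds up the copies
listed in 1) to 4). The 3 MB test size and all variable names are
assumptions for illustration only. Real MIME payloads also carry line
breaks, so the actual overhead is slightly above the 4/3 measured here.

```python
import base64

MB = 1024 * 1024

# Measure the base64 expansion on a small stand-in payload (3 MB keeps the
# check quick); the ratio does not depend on the payload size.
raw = b"\0" * (3 * MB)
encoded = base64.b64encode(raw)
ratio = len(encoded) / float(len(raw))

payload_mb = 20.0
encoded_mb = payload_mb * ratio

# 1) + 2): the list of lines plus the joined string while parsing
peak_while_parsing_mb = 2 * encoded_mb
# 3) + 4): the stored encoded payload plus the decoded result
peak_while_decoding_mb = encoded_mb + payload_mb

print("expansion ratio:     %.3f" % ratio)
print("encoded size:        %.1f MB" % encoded_mb)
print("peak while parsing:  %.1f MB" % peak_while_parsing_mb)
print("peak while decoding: %.1f MB" % peak_while_decoding_mb)
```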

Thus, it would be useful to have an (optional) way to stream
(and decode/store) payloads so they are never held in memory all at
once. As the FeedParser supports a _factory keyword to use other kinds
of Message objects, 3) and 4) could be solved by overriding the
set_payload() and get_payload() callables in a subclassed Message class.

Sadly, this won't help much, as 1) and 2) are buried deep inside the
FeedParser itself. Another drawback of overriding set_payload and
get_payload in a subclass is that the overriding code may get out of
sync with the installed email.message.Message class.


The following is a possible solution I propose to overcome these issues.
It consists of two parts which should be compatible with an unchanged
Message class and thus a default FeedParser() instance.

1) Allow optional keywords to be passed to the FeedParser() constructor.
    They will be saved and then passed to new Message() objects.

2) A new streaming interface for Message objects.
    It consists of 3 additional callables:
      2.1) start_payload_chunks()
      2.2) append_payload_chunk(line)
      2.3) finalize_payload()

The patch [2] first checks for the availability of the new streaming
interface and falls back to the old code. This should allow the
FeedParser to work with existing subclasses of message.Message as well
as with new subclasses which implement the streaming interface.
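
The detection and fallback can be sketched outside the parser as
follows. This is only an illustration of the dispatch logic, not the
patch itself: store_payload() and ChunkedMessage are hypothetical names,
and the NeedMoreData handling of the real generator is left out.

```python
from email.message import Message


class ChunkedMessage(Message):
    """Hypothetical subclass that only records the calls it receives."""

    def start_payload_chunks(self):
        self.chunks = []

    def append_payload_chunk(self, line):
        self.chunks.append(line)

    def finalize_payload(self):
        self.finalized = True


def store_payload(msg, lines):
    """Use the streaming interface if msg offers it, else buffer in RAM."""
    start = getattr(msg, "start_payload_chunks", None)
    if callable(start):
        start()
        for line in lines:
            msg.append_payload_chunk(line)
        msg.finalize_payload()
        return "streamed"
    # Legacy path: join everything and hand it over in one piece.
    msg.set_payload("".join(lines))
    return "buffered"


plain = Message()
chunked = ChunkedMessage()
plain_result = store_payload(plain, ["line 1\n", "line 2\n"])
chunked_result = store_payload(chunked, ["line 1\n", "line 2\n"])
print(plain_result, chunked_result)
```

An unmodified Message takes the buffered path, so existing subclasses
keep working without changes.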

Please note that this diff is based on Python 2.7 as shipped with
Debian Wheezy. The current implementation uses the same problematic
code for 2), though (see [1]).

I will post another message containing a simple use case for the new
interface which only streams, decodes and stores base64 encoded
payloads on the fly and uses the old method for everything else. It
additionally uses two more callables inside its Message subclass:
get_payload_file() and is_streamed().
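
A minimal sketch of what such a subclass might look like is given
below. It is an illustration only, not the code from the follow-up
message: the class name StreamingMessage, the spool_dir keyword and the
exact behaviour of get_payload_file() and is_streamed() are assumptions,
and malformed input is silently ignored. The example drives the
interface by hand, so it runs without the patched FeedParser.

```python
import base64
import tempfile
from email.message import Message


class StreamingMessage(Message):
    """Hypothetical Message subclass with the three proposed callables."""

    def __init__(self, spool_dir=None):
        Message.__init__(self)
        self._spool_dir = spool_dir   # assumed keyword, forwarded via part 1)
        self._spool = None            # temp file holding decoded bytes
        self._pending = ""            # base64 characters not yet decodable
        self._lines = None            # in-memory fallback for other encodings

    def start_payload_chunks(self):
        cte = str(self.get("Content-Transfer-Encoding", "")).strip().lower()
        if cte == "base64":
            self._spool = tempfile.TemporaryFile(dir=self._spool_dir)
        else:
            self._lines = []

    def append_payload_chunk(self, line):
        if self._spool is None:
            self._lines.append(line)
            return
        # Drop whitespace and decode only complete 4-character groups;
        # the remainder waits for the next line.
        data = self._pending + "".join(line.split())
        usable = len(data) - len(data) % 4
        self._spool.write(base64.b64decode(data[:usable]))
        self._pending = data[usable:]

    def finalize_payload(self):
        if self._spool is None:
            self.set_payload("".join(self._lines))
            self._lines = None
        else:
            # Leftover characters would mean truncated input. How to report
            # that is an open question, so they are simply dropped here.
            self._pending = ""
            self._spool.seek(0)

    def is_streamed(self):
        return self._spool is not None

    def get_payload_file(self):
        return self._spool


original = b"hello streaming world" * 10
msg = StreamingMessage()
msg["Content-Transfer-Encoding"] = "base64"
msg.start_payload_chunks()
for line in base64.encodebytes(original).decode("ascii").splitlines(True):
    msg.append_payload_chunk(line)
msg.finalize_payload()
decoded = msg.get_payload_file().read()
print(msg.is_streamed(), decoded == original)
```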

It also contains some comments about unresolved issues, such as how
decoding errors should be properly dealt with and who is responsible
for catching exceptions raised by the new interfaces so they can't
break the FeedParser itself.

This patch is mainly meant to present my idea for working around the
current "All-in-RAM" situation (in Python 2 and Python 3).

Comments, criticism and suggestions on how to proceed to have a feature
like this merged (and in the best case also backported to Python 2)
are more than welcome.


[1] https://hg.python.org/cpython/file/78986c99dd6c/Lib/email/feedparser.py#l462
[2] feedparser.diff:

--- /usr/lib/python2.7/email/feedparser.py	2014-03-13 10:54:56.000000000 +0000
+++ feedparser_stream.py	2015-05-25 03:02:09.000000000 +0000
@@ -137,9 +137,10 @@
  class FeedParser:
      """A feed-style parser of email."""

-    def __init__(self, _factory=message.Message):
+    def __init__(self, _factory=message.Message, **kwargs):
          """_factory is called with no arguments to create a new message obj"""
          self._factory = _factory
+        self._factory_kwargs = kwargs
          self._input = BufferedSubFile()
          self._msgstack = []
          self._parse = self._parsegen().next
@@ -175,7 +176,7 @@
          return root

      def _new_message(self):
-        msg = self._factory()
+        msg = self._factory(**self._factory_kwargs)
          if self._cur and self._cur.get_content_type() == 'multipart/digest':
              msg.set_default_type('message/rfc822')
          if self._msgstack:
@@ -420,6 +421,22 @@
              return
          # Otherwise, it's some non-multipart type, so the entire rest of the
          # file contents becomes the payload.
+
+        # Test for message streaming interface
+        if hasattr(self._cur, 'start_payload_chunks') \
+        and callable(self._cur.start_payload_chunks):
+            _cur = self._cur
+            _cur.start_payload_chunks()
+            for line in self._input:
+                if line is NeedMoreData:
+                    yield NeedMoreData
+                    continue
+                _cur.append_payload_chunk(line)
+            _cur.finalize_payload()
+            return
+
+        # Streaming interface not available
+        # Fall back to legacy all in RAM solution.
          lines = []
          for line in self._input:
              if line is NeedMoreData:


