[Python-Dev] Mailbox module - timings and functionality changes

Tue Jun 29 23:02:14 CEST 2010

Guido van Rossum wrote:
> It should probably be opened in binary mode. Binary files do have a
> .readline() method (returning a bytes object), and bytes objects have
> a .startswith() method. The tell positions computed this way are even
> compatible with those used by the text file. So you could do it this
> way:
> 
> - open binary stream
> - compute TOC by reading through it using .readline() and .tell()
> - rewind (don't close)

Because closing is inefficient, or because it breaks the algorithm?

> - wrap the binary stream in a text stream

"wrap" how? The ultimate destiny of the text is twofold:

1) To be stored as some kind of LOB in a database, and
2) Therefrom to be reconstituted and parsed into email.Message objects.

Is the wrapping a one-off operation or a software layer? Sorry, being a
bit dense here, I know.

regards
 Steve

> - use that for the rest of the code
> 
> --Guido
> 
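A minimal sketch of the steps Guido lists above, not the actual mailbox.py code. The names build_toc and MBOX are made up for illustration, and latin-1 is only an assumed encoding. If I read the io module correctly, the wrapping is a software layer rather than a one-off conversion: io.TextIOWrapper decodes on demand on top of the binary stream, which stays open underneath.

```python
import io

# Hypothetical two-message mbox held in memory; a real one would come
# from open(path, "rb"). Note the 8-bit byte (0xE9) in the first body.
MBOX = (
    b"From a@example.com Tue Jun 29 2010\n"
    b"Subject: one\n"
    b"\n"
    b"caf\xe9\n"
    b"\n"
    b"From b@example.com Tue Jun 29 2010\n"
    b"Subject: two\n"
    b"\n"
    b"body\n"
)


def build_toc(stream):
    """Map message index -> (start, stop) byte offsets in a binary stream."""
    starts, stops = [], []
    stream.seek(0)
    while True:
        line_pos = stream.tell()      # offset of the line about to be read
        line = stream.readline()      # a bytes object, line ending included
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                stops.append(line_pos)    # previous message ends here
            starts.append(line_pos)
        elif not line:                    # end of file
            stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))


raw = io.BytesIO(MBOX)                # step 1: open binary stream
toc = build_toc(raw)                  # step 2: TOC via readline() and tell()
raw.seek(0)                           # step 3: rewind, don't close

# Step 4: wrap the same stream in a text layer. newline="" turns off
# newline translation so character and byte counts agree for latin-1.
text = io.TextIOWrapper(raw, encoding="latin-1", newline="")

# Step 5: use the text stream from here on. Seeking to a byte offset that
# tell() on the text stream did not produce relies on the compatibility
# Guido describes; the io docs do not promise it in general.
start, stop = toc[1]
text.seek(start)
second_message = text.read(stop - start)
```

Closing and reopening would also work for building the TOC; keeping the stream open just lets the text layer reuse it.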
> On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <steve at holdenweb.com> wrote:
>> A.M. Kuchling wrote:
>>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>>> I will leave the profiler output to speak for itself, since I can find
>>>> nothing much to say about it except that there's a hell of a lot of
>>>> decoding going on inside mailbox.iterkeys().
>>> The problem is actually in _generate_toc(), which is reading through
>>> the entire file to figure out where all the 'From' lines that start
>>> messages are located.  TextIOWrapper()'s tell() method seems to be
>>> very slow, so one improvement is to call tell() only when necessary; patch:
>>>
>>> -> svn diff Lib/
>>> Index: Lib/mailbox.py
>>> ===================================================================
>>> --- Lib/mailbox.py    (revision 82346)
>>> +++ Lib/mailbox.py    (working copy)
>>> @@ -775,13 +775,14 @@
>>>          starts, stops = [], []
>>>          self._file.seek(0)
>>>          while True:
>>> -            line_pos = self._file.tell()
>>>              line = self._file.readline()
>>>              if line.startswith('From '):
>>> +                line_pos = self._file.tell()
>>>                  if len(stops) < len(starts):
>>>                      stops.append(line_pos - len(os.linesep))
>>>                  starts.append(line_pos)
>>>              elif not line:
>>> +                line_pos = self._file.tell()
>>>                  stops.append(line_pos)
>>>                  break
>>>          self._toc = dict(enumerate(zip(starts, stops)))
>>>
>>> But should mailboxes really be opened in a UTF-8 encoding, or should
>>> they be treated as 7-bit text?  I'll have to think about this.
>> Neither! You can't open them as 7-bit text, because real-world email
>> does contain bytes whose ordinal value exceeds 127. You can't open them
>> using a text encoding because theoretically there might be ASCII headers
>> that indicate that parts of the content are in specific character sets
>> or encodings.
>>
>> If only we had a data structure that easily allowed us to manipulate
>> 8-bit characters ...
>>
>> regards
>>  Steve
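The data structure hinted at above is bytes. A small sketch of the round trip under that assumption: keep each message as raw bytes (which is also what could go into a database LOB unchanged) and let the email package pick the character set from the headers. It assumes a Python version that has email.message_from_bytes (3.2 or later), and the sample message is invented.

```python
import email

# Hypothetical single message as sliced out of an mbox, envelope line
# included. The body holds one 8-bit byte (0xE9, "e acute" in iso-8859-1).
raw_message = (
    b"From a@example.com Tue Jun 29 2010\n"
    b"Subject: one\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)

# Parse straight from bytes; no text encoding is chosen up front.
msg = email.message_from_bytes(raw_message)

payload = msg.get_payload(decode=True)   # body as bytes, 8-bit data intact
charset = msg.get_content_charset()      # taken from the Content-Type header
body_text = payload.decode(charset)      # decode only once the charset is known
```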
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119