UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and...

Alf P. Steinbach alfps at start.no
Mon Nov 23 22:06:29 CET 2009

This is the tragic story of this evening:
1. Aspirins to lessen the pain somewhat.
2. Over in [comp.programming] someone mentions paper on Quicksort.
3. I recall that X once sent me link to paper about how to foil
    Quicksort, written by, was it Doug McIlroy? Anyway, some Bell Labs guy.
    Want to post that link in response to [comp.programming] article.
4. Checking in Thunderbird, no mails from X or about QS there.
5. But his mail address in address list so something funny going on!
6. Googling, yes, it seems Thunderbird has a habit of "forgetting" mails. But
    they're really there after all. It's just the index that's screwed up.
7. OK, opening Thunderbird mailbox file (it's just text) in nearest editor.
8. Machine hangs, Windows says it must increase virtual memory, blah blah.
9. Making little Python script to extract individual mails from file.
10. It says UnicodeDecodeError on mail nr. something something.
11. I switch mode to binary. Didn't know if that would work with std input.
12. It's now apparently ten times faster but *still* UnicodeDecodeError!
13. I ask here!

Of course I could have googled that paper, but at each step above it seemed just
half a minute more to find the link in the mails, and by now I've decided it must be found.

And I'm hesitant to just delete the index file, hoping that it'll rebuild.

Thunderbird does funny things, so it would be best if the Python script worked.

import os
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
for line in fileinput.input( mode = "rb" ):
    if line.startswith( "From - " ):
        msg_id += 1
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )
        f.write( line )

<last few lines of output>
Traceback (most recent call last):
   File "C:\test\tbfix\splitmails.py", line 11, in <module>
     for line in fileinput.input( mode = "rb" ):
   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
     line = self.readline()
   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
     self._buffer = self._file.readlines(self._bufsize)
   File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in 
     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>
</last few lines of output>
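
For comparison, here is a minimal sketch of a variant that never decodes anything. It is not a confirmed fix, only one way to avoid the codec. The traceback shows the lines still passing through the cp1252 decoder despite mode = "rb", presumably because the input arrives on standard input, which is a text stream. The sketch assumes the mailbox is instead passed as a file path and opened directly with open() in binary mode. The function name split_mailbox and its parameters are made up for illustration. It also assumes the intent is to write every line of a message to that message's file, not only the "From - " line.

```python
import os

def split_mailbox(mbox_path, out_dir="."):
    """Split a mailbox file into one file per message (hypothetical helper).

    All data stays as bytes from read to write, so no codec is involved.
    Returns the number of messages written.
    """
    msg_id = 0
    out = None
    try:
        with open(mbox_path, "rb") as mbox:
            for line in mbox:
                # Binary-mode lines are bytes, so the prefix must be bytes too.
                if line.startswith(b"From - "):
                    if out is not None:
                        out.close()
                    msg_id += 1
                    name = "msg_{0:0>6}.txt".format(msg_id)
                    out = open(os.path.join(out_dir, name), "wb")
                # Lines before the first "From - " separator are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return msg_id
```

Called as split_mailbox( "Inbox", "out" ), with both names standing in for the real mailbox file and an existing output directory, it copies bytes such as 0x8f through unchanged, so the question of which encoding each mail uses can be put off until a single message is actually read.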


- Alf
