help with unicode email parse
neoedmund at
Fri Sep 8 01:24:49 EDT 2006
John, you can take a look at my code:
it downloads email and saves it to the local filesystem (both the filename and
the email contain non-English characters).
It works now.
But I still think Python's unicode handling is not as straightforward as it
should be: a string SHOULD always be unicode. Otherwise I have trouble dealing
with strings when they are in different encodings, because before using one I
must find out what encoding it uses, unicode or 8-bit. And why does the system
use ASCII to decode? You have explained some of it, but I cannot follow you.
However, I never have encoding problems using strings in Java.
#-*- coding:utf8 -*-
import poplib
from poplib import POP3
import Utils
#from datetime import datetime
import time
import email
from email.Header import decode_header
conf = Utils.getValues()
def getMail():
    a = POP3( conf.get( "mailhost" ) )
    print a.getwelcome()
    a.user( conf.get( "mailuser" ) )
    a.pass_( conf.get( "mailpass" ) )
    ( numMsgs, totalSize ) = a.stat()
    print "==begin==total %s mail,%s bytes" % ( numMsgs, totalSize )
    for i in range( 1, numMsgs + 1 ):
        ( header, msg, octets ) = a.retr( i )
        print "Message %d:" % i
        text= list2txt( msg )
        save( text )
        #print octets
        #print header
    print "==finish==total %s mail,%s bytes" % ( numMsgs, totalSize )
def list2txt( l ):
    return reduce( lambda x, y:x+"\r\n"+y, l )
def save( text ):
    stamp = getStamp()
    store=conf.get( "mailstore" )
    msg = email.message_from_string( text )
    path = getPath( text, msg )
    title = decode_header( msg["Subject"] )
    title= title[0][0]
    title= title.decode( "utf8" )
    print repr( title )
    title = encodeFilename( title )
    print repr( title )
    fn = "%s/%s/%s-%s.mail"%( store.encode( "utf8" ),
        path.encode( "utf8" ),
        stamp.encode( "utf8" ),
        title )
    print repr( fn )
    # fn = fn.decode( "utf8" )
    import os
    path =os.path.dirname( fn )
    if not os.path.exists( path ) :
        os.makedirs( path )
    print repr( fn )
    f = file( fn, "wb" )
    f.write( text )
def encodeFilename( s ):
    # replace characters that are not allowed in filenames with "_"
    slist = []
    for ch in s:
        #print "CH", repr( ch )
        if "\":?*/\\<>|".find( ch ) >= 0:
            #print "here"
            slist.append( "_" )
        else:
            #print "there"
            slist.append( ch )
    #print repr( slist )
    return "".join( slist )
#encodeFilename( "abc:dd" )
def getPath( text, msg ):
    import Classify
    return text, msg )
def getStamp():
    s = repr( int( time.clock() * 1000000000000000L ) )
    #print s
    return unicode( s )
#print repr( getStamp() )
def test():
    subject = decode_header( "=?UTF-8?B?5rWL6K+V?=" )
    print "s1=", repr( subject )
    t1 = subject[0][0]
    print "t1=", repr( t1 )
    fn = "%s/%s-%s.mail"%( "d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
    print "fn=", repr( fn )
    fn = fn.decode( "utf8" )
    print "fn=", repr( fn )
    f = file( fn, "wb" )
    f.write( "test" )
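The save() function above decodes the subject as UTF-8 no matter what the header declares. A minimal sketch of a charset-aware alternative follows, written for Python 3 (the code in this post is Python 2). The helper name decode_subject and the latin-1 fallback are assumptions made for illustration, not part of the original code.

```python
from email.header import decode_header

def decode_subject(raw_subject, fallback="latin-1"):
    """Return a header value as text, using each part's declared charset.

    decode_subject and the fallback codec are hypothetical choices
    made for this sketch.
    """
    pieces = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            # charset is None when the part carried no declared encoding
            try:
                pieces.append(value.decode(charset or fallback))
            except (LookupError, UnicodeDecodeError):
                # unknown or wrong charset: fall back rather than crash
                pieces.append(value.decode(fallback, "replace"))
        else:
            # a header without encoded words comes back as str already
            pieces.append(value)
    return "".join(pieces)

print(repr(decode_subject("=?UTF-8?B?5rWL6K+V?=")))
print(repr(decode_subject("=?iso-8859-1?q?caf=E9?=")))
```

The second call uses a Latin-1 subject containing byte 0xe9, the same byte reported in the first traceback quoted below; a fixed UTF-8 decode would fail on it.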
John Machin wrote:
> neoedmund wrote:
> [top-posting corrected]
> > John Machin wrote:
> > > neoedmund wrote:
> > > > I want to get the subject from an email and construct a filename
> > > > from the subject.
> > > > But whatever I tried, I always got an error like this:
> > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 4:
> > > > ordinal not in range(128)
> > > >
> > > >
> > > > msg = email.message_from_string( text )
> > > > title = decode_header( msg["Subject"] )
> > > > title= title[0][0]
> > > > #title=title.encode("utf8")
> > >
> > > Why is that commented out?
> > >
> > > > print title
> > > > fn = ""+path+"/"+stamp+"-"+title+".mail"
> > > >
> > > >
> > > > the variable "text" comes from something like this:
> > > > ( header, msg, octets ) = a.retr( i )
> > > > text= list2txt( msg )
> > > > def list2txt( l ):
> > > >     return reduce( lambda x, y:x+"\r\n"+y, l )
> > > >
> > > > Can anyone help me out? Thanks.
> > >
> > > Not without a functional crystal ball.
> > >
> > > You could help yourself considerably by (1) working out which line of
> > > code the problem occurs in [the traceback will tell you that] (2)
> > > working out which string is being decoded into Unicode, and has '\xe9'
> > > as its 5th byte. Either that string needs to be decoded using something
> > > like 'latin1' [should be specified in the message headers] rather than
> > > the default 'ascii', or the code has a deeper problem ...
> > >
> > > If you can't work it out for yourself, show us the exact code that ran,
> > > together with the traceback. If (for example) title is the problem,
> > > insert code like:
> > > print 'title=', repr(title)
> > > and include that in your next post as well.
> > >
> > > HTH,
> > > John
> > Thank you, John and Diez.
> > I found that
> > fn = "%s/%s-%s.mail"%("d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> > is OK, but
> > fn = "%s/%s-%s.mail"%(u"d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> > results in:
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0:
> > ordinal not in range(128)
> > So "str"%(param) does not accept unicode, and only accepts a byte array?
> No, quite the contrary. And that's no "byte array", it's a string.
> The first substitution is in unicode, so the "%" operation ups the
> ante from 8-bit string, and tries to decode the remaining
> substitutions, using the default ascii codec, which barfs on the 3rd
> substitution, which isn't ascii.
> If you want fn to be in some 8-bit encoding, then don't put the u in
> front of the first substitution.
> If you want fn to be in unicode, then you'll have to determine what
> encoding you're dealing with, and specify that explicitly.
> By the way, what has this "fn" stuff to do with your original problem?
> Cheers,
> John
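John's advice to determine the encoding and specify it explicitly can be sketched as follows. This is an illustration for Python 3, where 8-bit strings are bytes and are never decoded implicitly; the explicit ASCII decode here only reproduces the error that Python 2's "%" operator raised on its own. The variable names are assumptions made for the example.

```python
# The UTF-8 bytes used in the thread's example (an 8-bit string in Python 2).
raw_title = b"\xe6\xb5\x8b\xe8\xaf\x95"

# Decoding with ASCII fails just as the implicit decode did in the thread.
try:
    raw_title.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decode once with the known encoding, then format text with text only.
title = raw_title.decode("utf-8")
fn = "%s/%s-%s.mail" % ("d:/mail", "12345", title)

print(ascii_error)
print(repr(fn))
```

Decoding at the point where the bytes enter the program, and encoding again only when writing out, keeps every "%" operation free of mixed types.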
