How to manage accented characters in mail header?

Sat Jan 4 14:07:57 EST 2025

Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> Chris Green <cl at isbd.net> wrote or quoted:
> >From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon at amvs.fr>
> 
>   In Python, when you roll with decode_header from the email.header
>   module, it spits out a list of parts, where each part is a tuple
>   of (decoded bytes or string, charset). To smash these decoded
>   sections into one string, you’ll want to loop through the list,
>   decode each piece (if it needs it), and then throw them together.
>   Here’s a straightforward example of how to pull this off:
> 
> from email.header import decode_header
> 
> # Example header
> header_example = \
> 'From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon at amvs.fr>'
> 
> # Decode the header
> decoded_parts = decode_header(header_example)
> 
> # Kick off an empty list for the decoded strings
> decoded_strings = []
> 
> for part, charset in decoded_parts:
>     if isinstance(part, bytes):
>         # Decode the bytes to a string using the charset
>         decoded_string = part.decode(charset or 'utf-8')
>     else:
>         # If it’s already a string, just roll with it
>         decoded_string = part
>     decoded_strings.append(decoded_string)
> 
> # Join the parts into a single string
> final_string = ''.join(decoded_strings)
> 
> print(final_string)  # From: Sébastien Crignon <sebastien.crignon at amvs.fr>
> 
>   Breakdown
> 
>   decode_header(header_example): This line takes your email header
>   and breaks it down into a list of tuples.
> 
>   Looping through decoded_parts: You check if each part is in
>   bytes. If it is, you decode it using whatever charset it’s
>   got (defaulting to 'utf-8' if it’s a little vague).
> 
>   Appending Decoded Strings: You toss each decoded part into a list.
> 
>   Joining Strings: Finally, you use ''.join(decoded_strings) to glue
>   all the decoded strings into a single, coherent piece.
> 
>   Just a Heads Up
> 
>   Keep an eye out for cases where the charset might be None. In those
>   moments, it’s smart to fall back to 'utf-8' or something safe.
> 
Thanks, I think! :-)

Is there a simple[r] way to extract just the 'real' address between
the <>, that's all I actually need.  I think it has to be the last
chunk of the From: doesn't it?
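
One possible approach, offered as a minimal sketch rather than a
definitive answer: the standard library's email.utils.parseaddr splits
a From: value into a (display name, address) pair, so there is no need
to rely on the address being the last chunk. The example.org address
and the variable names below are assumptions made for illustration.

```python
from email.utils import parseaddr

# Value of a From: header (the address is a made-up example,
# not the real one from the message above)
from_value = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@example.org>'

# parseaddr returns a (display_name, address) tuple; the display name
# is left in its encoded form, the address is the part between <>
display_name, address = parseaddr(from_value)

print(address)
```

If the header cannot be parsed, parseaddr returns a pair of empty
strings, so it is probably worth checking for an empty address before
using it.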

-- 
Chris Green