[New-bugs-announce] [issue34460] email.charset: common IANA labels missing

era report at bugs.python.org
Wed Aug 22 05:17:06 EDT 2018

New submission from era <era+python at iki.fi>:

The email.charset module should contain common informal character-set identifiers even if they are not formally specified in an IANA RFC.

From a quick grep of a pile of recent email, I find the following:

   46 "cp-850"
    6 "windows-874"

For scale, the same collection contained around 10,000 messages with "utf-8" and 2,000 with "iso-8859-1".  Still, the fact that there are multiple occurrences in a spool of recent messages indicates that they are fairly common.

Currently, the email module raises an exception (and prints a traceback) if you attempt to parse a message whose character set is not known to Python. This cannot be prevented in the general case, but making the module more robust for encodings which are reasonably prevalent in the wild would definitely be desirable.
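
Until such labels are recognized, a caller can guard the decoding step. The following is only a minimal sketch, not the email package's own behavior: FALLBACK_ALIASES and decode_body are hypothetical names, the table holds just the two labels from this report, and falling back to ASCII with replacement characters is an assumption about what a caller would want.

```python
import codecs
from email import message_from_bytes

# Hypothetical fallback table: informal labels seen in the wild, mapped to
# codec names that Python ships.
FALLBACK_ALIASES = {
    "cp-850": "cp850",
    "windows-874": "cp874",
}


def decode_body(raw_message):
    """Decode the body of a non-multipart message without raising on an
    unrecognized charset label."""
    msg = message_from_bytes(raw_message)
    # Bytes after undoing the Content-Transfer-Encoding.
    payload = msg.get_payload(decode=True)
    # get_content_charset() returns the label in lower case, or None.
    label = msg.get_content_charset() or "ascii"
    label = FALLBACK_ALIASES.get(label, label)
    try:
        codecs.lookup(label)
    except LookupError:
        # Assumption: a lossy ASCII decode is preferable to a traceback.
        label = "ascii"
    return payload.decode(label, errors="replace")
```

Called on the raw bytes of a message declaring charset="cp-850", this decodes the body with the cp850 codec; a label that is in neither Python's codec registry nor the table is decoded as ASCII with replacement characters.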

For what it's worth, "cp-850" is apparently an alias for IBM code page 850, which is defined with the name "cp850" in RFC 1345.  "windows-874" is an official designation, detailed in https://www.iana.org/assignments/charset-reg/windows-874, and is apparently equivalent to the Python codec "cp874".
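
As a stopgap, the two labels can be registered at runtime with the existing email.charset.add_alias function. This is a sketch under the pairings described above; canonical_name is a hypothetical helper added only to show the effect.

```python
from email import charset

# Register the informal labels as aliases of the canonical names.
# Note: this mutates module-level tables in email.charset for the whole process.
charset.add_alias("cp-850", "cp850")
charset.add_alias("windows-874", "cp874")


def canonical_name(label):
    """Return the charset name email.charset resolves a label to."""
    return charset.Charset(label).input_charset
```

With the aliases registered, canonical_name("cp-850") returns "cp850". The aliases only affect email.charset.Charset objects; they do not change what codecs.lookup or bytes.decode accept.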

components: email
messages: 323870
nosy: barry, era, r.david.murray
priority: normal
severity: normal
status: open
title: email.charset: common IANA labels missing
versions: Python 3.6

Python tracker <report at bugs.python.org>
