[New-bugs-announce] [issue5902] Stricter codec names

Ezio Melotti report at bugs.python.org
Sat May 2 10:00:20 CEST 2009

New submission from Ezio Melotti <ezio.melotti at gmail.com>:

I noticed that codec names[1]:
1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed;

A few examples of valid codec names (done with Python 3):
>>> s = 'xxx'
>>> s.encode('utf')
>>> s.encode('utf-')
>>> s.encode('}Utf~->8<-~siG{ ;)')

'utf' is an alias for UTF-8, but it doesn't quite make sense to me that
'utf' alone refers to UTF-8.
'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like
it to raise an error instead.
The third example is probably not something that can be found in the
real world (I hope), but it shows how permissive the parsing of the names is.

Apparently whitespace is removed, the punctuation is used to
split the name into several parts, and then the check is performed.
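
To make the behaviour described above concrete, here is a rough sketch of
that kind of permissive matching. The function name loose_normalize is
hypothetical, and this is only an approximation of what I observed, not the
actual lookup code in the interpreter:

```python
import re

def loose_normalize(name):
    """Hypothetical approximation of the permissive name matching."""
    # Drop all whitespace first.
    compact = ''.join(name.split())
    # Split on runs of anything that is not an ASCII letter or digit,
    # discard the empty pieces, and rejoin the rest with underscores.
    parts = [p for p in re.split(r'[^0-9A-Za-z]+', compact) if p]
    return '_'.join(parts).lower()

print(loose_normalize('}Utf~->8<-~siG{ ;)'))  # utf_8_sig
print(loose_normalize('utf-'))                # utf
```

Under these assumptions the third example collapses to 'utf_8_sig' and
'utf-' collapses to 'utf', which is why neither is rejected.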

About the aliases: in the documentation the "official" name for the
UTF-8 codec is 'utf_8' and there are 3 more aliases: U8, UTF, utf8. For
ISO-8859-1, the "official" name is 'latin_1' and there are 7 more
aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.
The Zen says "There should be one—and preferably only one—obvious way to
do it.", so I suggest that we:
1) disallow random punctuation and spaces within the name (only allow
leading and trailing spaces);
2) change the default names to, for example: 'utf-8', 'iso-8859-1'
instead of 'utf_8' and 'latin_1'. The names are case-insensitive.
3) remove the unnecessary aliases, for example: 'UTF', 'U8' for UTF-8
and 'iso8859-1', '8859', 'latin', 'L1' for ISO-8859-1.

This last point could break some code and may need a
DeprecationWarning. If there are good reasons to keep these aliases around,
only the other two issues need to be addressed.
If the name of the codec has to be a valid variable name (that is,
without '-'), only the documentation could be changed to have 'utf-8',
'iso-8859-1', etc. as preferred name.
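
As an illustration of the three suggestions taken together, a stricter
lookup might look roughly like the sketch below. Everything in it is
hypothetical: the function strict_codec_name, the three tables, and the
choice of which aliases to keep are assumptions for the sake of the example
and cover only the two codecs discussed here.

```python
import warnings

# Hypothetical tables for illustration only; the real alias list is longer.
PREFERRED = {'utf-8', 'iso-8859-1'}
KEPT_ALIASES = {'utf8': 'utf-8', 'latin1': 'iso-8859-1', 'cp819': 'iso-8859-1'}
DEPRECATED_ALIASES = {
    'utf': 'utf-8', 'u8': 'utf-8',
    'iso8859-1': 'iso-8859-1', '8859': 'iso-8859-1',
    'latin': 'iso-8859-1', 'l1': 'iso-8859-1',
}

def strict_codec_name(name):
    """Return the preferred codec name, or raise LookupError."""
    # Only leading/trailing whitespace is tolerated; names are case-insensitive.
    key = name.strip().lower()
    if key in PREFERRED:
        return key
    if key in KEPT_ALIASES:
        return KEPT_ALIASES[key]
    if key in DEPRECATED_ALIASES:
        preferred = DEPRECATED_ALIASES[key]
        warnings.warn('codec alias %r is deprecated; use %r' % (name, preferred),
                      DeprecationWarning, stacklevel=2)
        return preferred
    # No punctuation stripping: 'utf-' and friends are simply unknown.
    raise LookupError('unknown encoding: %r' % name)
```

With this sketch, ' UTF-8 ' resolves to 'utf-8', 'U8' still resolves but
emits a DeprecationWarning, and both 'utf-' and '}Utf~->8<-~siG{ ;)' raise
LookupError.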

[1]: http://docs.python.org/library/codecs.html#standard-encodings

assignee: georg.brandl
components: Documentation, Library (Lib)
messages: 86933
nosy: ezio.melotti, georg.brandl
severity: normal
status: open
title: Stricter codec names
type: behavior
versions: Python 2.6, Python 2.7, Python 3.0, Python 3.1

Python tracker <report at bugs.python.org>
