Query available codecs and error handlers

Unless I am badly misinformed, there is no way to programmatically check what codecs and error handlers are available. I propose two functions in the codecs module:

    get_codecs()
    get_error_handlers()

which each return a set of the available codec, or error handler, names.

Use-cases
---------

(1) At the interactive interpreter, it is often useful to experiment with different codecs. Being able to get a list of available codecs avoids the frustration of having to look them up elsewhere, or guess.

(2) Applications which offer to read or write text files may wish to allow the user to select an encoding. With no obvious way of finding out what encodings are available, the application is limited to hard-coding some, and hoping that they don't get renamed or removed.

Why sets, rather than lists?
----------------------------

Sets emphasise that the names are in no particular order. Since the names should always be strings and therefore hashable, and the order is irrelevant, there is no particular need for a list.

Should they be frozensets rather than sets?
-------------------------------------------

I don't care :-) Using sets may make it easier for the application to filter the results in some way, but I really don't think it matters.

Thoughts or comments?

-- Steven
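For illustration only, here is a minimal sketch of what an application can do today for error handlers: probe a hand-maintained list of names with codecs.lookup_error(). The function name probe_error_handlers and the candidate list are assumptions made for this example, not part of the proposal; the need to hard-code the candidates is exactly the limitation described above.

```python
import codecs

# Candidate names copied by hand from the codecs documentation (an assumption
# of this sketch; there is no API to discover them).
CANDIDATE_HANDLERS = (
    "strict", "ignore", "replace", "xmlcharrefreplace",
    "backslashreplace", "surrogateescape", "surrogatepass", "namereplace",
)

def probe_error_handlers(candidates=CANDIDATE_HANDLERS):
    """Return the subset of candidate names that are registered handlers."""
    found = set()
    for name in candidates:
        try:
            codecs.lookup_error(name)  # raises LookupError for unknown names
        except LookupError:
            continue
        found.add(name)
    return found

print(sorted(probe_error_handlers()))
```

This can only confirm names it already knows about; a handler added with codecs.register_error() under an unexpected name stays invisible.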

On Fri, Aug 29, 2014 at 3:45 PM, Steven D'Aprano <steve@pearwood.info> wrote:
A quick look at the codecs module suggests that this may not be as simple as returning a list/set; when _codecs.lookup() is called, it does a search. So this is actually like asking how to list importable modules. It may not be possible, but if it is, I would suggest calling it "list_codecs" or something, as it's basically going to be tracing the search path and enumerating every codec it finds. But it sounds like a quite useful facility. +1. ChrisA
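A rough sketch of the "trace the search path" approach, restricted to the stdlib encodings package: walk the package with pkgutil and keep every module name that codecs.lookup() accepts. The function name list_stdlib_codec_modules is hypothetical, and the result misses anything provided by other registered search functions.

```python
import codecs
import encodings
import pkgutil

def list_stdlib_codec_modules():
    """Best-effort set of codec module names found in the encodings package."""
    names = set()
    for _finder, name, _ispkg in pkgutil.iter_modules(encodings.__path__):
        if name == "aliases":
            continue  # helper module holding the alias table, not a codec
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # e.g. platform-specific codecs unavailable here
        names.add(name)
    return names

print(len(list_stdlib_codec_modules()), "codec modules found")
```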

On Fri, Aug 29, 2014 at 05:27:19PM +1000, Chris Angelico wrote:
Sure, but codecs have to be registered before they can be used. The register function can cache the names. Perhaps any builtin codecs might need to be explicitly added to the cache in order to support this, I don't know the details of _codecs. If the codec registry were a dict, we might even return a view of the dict.keys(). I'm sure there is some solution which is not quite as difficult as enumerating all importable modules. There's a lot fewer codecs, they don't have platform-dependent suffixes, and unlike modules in the PYTHONPATH, once Python starts up the available codecs won't change unless register() is called. -- Steven

On 29.08.2014 07:45, Steven D'Aprano wrote:
Question is: how would you implement these? The codec registry uses lookup functions to find codecs, so we'd have to extend those lookup functions to also support reporting known installed codecs.

For the stdlib encodings package we could simply put a list into the package, e.g. encodings.available_codecs returning a dictionary of mappings from codec name to CodecInfo tuples, and then extend the CodecInfo with some extra information such as supported error handlers, alternative names and information about the supported input/output types.

At the moment, the available codecs are documented here: https://docs.python.org/3.5/library/codecs.html?highlight=codecs#standard-en...

It's probably a good idea to add information about supported error handlers to that list.

-- Marc-Andre Lemburg
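Part of the "alternative names" information already exists in the stdlib alias table. As a sketch (the helper name alternative_names is made up for this example), the table in encodings.aliases can be inverted to group aliases by the codec module they point to:

```python
from collections import defaultdict
from encodings.aliases import aliases

def alternative_names():
    """Map each codec module name to the set of aliases that resolve to it."""
    grouped = defaultdict(set)
    for alias, target in aliases.items():
        grouped[target].add(alias)
    return dict(grouped)

print(sorted(alternative_names()["latin_1"]))
```

This covers only the aliases known to the encodings package, not names handled by other search functions.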

On 29 August 2014 17:33, M.-A. Lemburg <mal@egenix.com> wrote:
I'd actually be a fan of a PEP to add such an introspection API that also made it easier to register new codecs just by adding them to a suitable namespace package. I believe MvL's original idea was to use the existing encodings package for that, but that doesn't seem feasible due to the non-empty __init__
Those tables are already pretty busy though - I'm not sure how we could add supported error handler details without making them hard to read. Agreed it would be good to make the info more readily available, though (I had actually hoped to get some proposed revisions together for the codecs module docs before 3.4 went out the door, but alas, it was not to be). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 29.08.2014 09:49, Nick Coghlan wrote:
It's fairly easy to have the lookup function in the encodings package also look in, say, a "siteencodings" package for codecs it cannot find in the stdlib encodings package. This new siteencodings package could then be set up as a namespace package to make installation easier. That doesn't answer the original question, though, since introspection of available codecs would still not be possible.
Here's one idea: don't use a table, but instead have one subsection per codec. There are already a few subsections for specific codecs on the page.
Since it's not a feature, the doc change could potentially be backported.

-- Marc-Andre Lemburg
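The "siteencodings" fallback mentioned above could be approximated with an extra search function rather than a change to the encodings package itself. This is a sketch under assumptions: no package named siteencodings exists today, and the getregentry() convention is borrowed from how the stdlib codec modules are written.

```python
import codecs
import importlib

def site_search_function(encoding, package="siteencodings"):
    """Look for a codec module inside a (hypothetical) site-wide package."""
    modname = encoding.replace("-", "_").replace(" ", "_")
    if not modname.isidentifier():
        return None  # refuse dotted or otherwise odd names
    try:
        module = importlib.import_module(package + "." + modname)
    except ImportError:
        return None  # package or module missing: let other lookups try
    getregentry = getattr(module, "getregentry", None)
    if getregentry is None:
        return None
    return getregentry()  # expected to return a codecs.CodecInfo

# Search functions are tried in registration order, so the stdlib
# encodings package is still consulted first.
codecs.register(site_search_function)
```

As the message notes, this makes installation easier but still gives no way to enumerate what is available.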

On 29 August 2014 18:44, M.-A. Lemburg <mal@egenix.com> wrote:
Right, I think that's probably worth doing. The tricky part is figuring out *how* to change them. At the moment, there isn't an especially clear distinction between the text encoding specific parts (e.g. several of the error handlers, most of the encodings) and the underlying general purpose codecs machinery. However, beyond a vague notion of "that distinction should be made clearer in the docs now that it affects which codecs can be used with str.encode, bytes.decode and bytearray.decode, and which can *only* be used with codecs.encode and codecs.decode", I don't have any specific suggestions to make. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
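A short illustration of the distinction described above, using the stdlib rot13 codec, which is a str-to-str transform rather than a text encoding. It works through codecs.encode() but is rejected by str.encode() (behaviour as of Python 3.4):

```python
import codecs

# General-purpose codec machinery: accepts any registered codec.
via_codecs = codecs.encode("hello", "rot13")

# str.encode() only accepts text encodings, so rot13 is refused.
try:
    "hello".encode("rot13")
except LookupError as exc:
    str_method_error = exc
else:
    str_method_error = None

print(via_codecs)        # uryyb
print(str_method_error)  # message explaining that rot13 is not a text encoding
```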

On Fri, Aug 29, 2014 at 09:33:47AM +0200, M.-A. Lemburg wrote:
Arrgggh, you're right -- I've been working on a wrong assumption. I had forgotten that the codecs.register function takes a function, not a name. I always forgot that, because it's such a strange and unhelpful API compared to, say, a mapping of name:codec. It's probably water under the bridge now, but is there any documentation for why this API was used in the first place? -- Steven
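To make the point concrete: because codecs.register() takes a search function, a single registration can answer for an open-ended family of names, so there is nothing to cache at registration time. The function demo_search and the "demo_" prefix below are invented for this sketch.

```python
import codecs

def demo_search(name):
    """Answer any name starting with 'demo_' using the UTF-8 codec."""
    if name.startswith("demo_"):
        return codecs.lookup("utf-8")
    return None  # not ours: let other search functions try

codecs.register(demo_search)

# Both names work, yet neither was ever listed anywhere.
print("abc".encode("demo_one"))
print("abc".encode("demo_anything_else"))
```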

On 29.08.2014 10:16, Steven D'Aprano wrote:
Because at the time I designed the API in 1999/2000 it wasn't clear how people would start writing codecs.

Note that codecs do not use a simple name to codec mapping to figure out the implementation module name. The name typically goes through a few layers of normalization and then an alias dictionary to find the name of the implementation. The lookup functions were meant to implement these more complex n-1 mappings.

I also thought that codec implementations might want to tap into system registries of codecs, use file based tables as basis for encodings, or even implement load on demand.

Today, it's rather obvious that apparently no one has considered doing any of this, so it would have been better to design a system where you explicitly register individual codecs (together with a set of attributes).

It should be possible to phase out the lookup API and expose the encodings package lookup mechanism directly in the codecs module. I can help guide people, if they are willing to do the work, but don't have time to work on this myself.

-- Marc-Andre Lemburg
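The n-1 mapping described above is easy to observe from the outside: several spellings go through normalization and the alias table and end up at the same codec. A small sketch (the spellings are chosen for this example):

```python
import codecs

spellings = ["latin-1", "Latin1", "ISO-8859-1", "L1", "iso8859_1"]

# Each spelling is normalized and resolved through the alias table;
# CodecInfo.name reports the canonical name of the codec that was found.
resolved = {spelling: codecs.lookup(spelling).name for spelling in spellings}

for spelling, canonical in resolved.items():
    print(spelling, "->", canonical)
```

All five spellings report the same canonical name, which is why a plain list of "available names" would have to decide whether aliases count as separate entries.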

participants (4)
- Chris Angelico
- M.-A. Lemburg
- Nick Coghlan
- Steven D'Aprano