Guido van Rossum wrote:
The problem is that the encoding names are not Python identifiers, e.g. iso-8859-1 is not allowed as an identifier.
This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.)
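The translation described above can be sketched in a few lines. This is only an illustration: the helper name module_name_for is hypothetical, and lowercasing the result is an added assumption that the thread does not discuss.

```python
import re

# Matches each run of characters that cannot appear in a Python identifier.
_NON_IDENTIFIER = re.compile(r"[^A-Za-z0-9_]+")

def module_name_for(encoding):
    """Map an encoding name to a candidate module name (hypothetical helper).

    Each string of consecutive non-identifier characters becomes a single
    underscore. Lowercasing is an assumption made for this sketch.
    """
    return _NON_IDENTIFIER.sub("_", encoding.lower())

print(module_name_for("iso-8859-1"))
print(module_name_for("Shift_JIS"))
```

A name that starts with a digit would still not be a valid identifier after this step, so a real scheme would need a rule for that case as well.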
Right. That's one way of doing it.
This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary.
But it could be enough to register a package where to look for encodings (in addition to the system package).
Or there could be a registry for encoding search functions. (See the import discussion.)
Like a path of search functions ? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be looked up, but even a few hundred function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

    def lookup(encoding):

        codecs = _internal_dict.get(encoding,None)
        if codecs:
            return codecs
        for query in sys.encoders:
            codecs = query(encoding)
            if codecs:
                break
        else:
            raise UnicodeError('unknown encoding: %s' % encoding)
        _internal_dict[encoding] = codecs
        return codecs

For simplicity, codecs should be a tuple (encoder, decoder, stream_writer, stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-)

Here are my current versions:

-----------------------------------------------------------------------
    class Codec:

        """ Defines the interface for stateless encoders/decoders.
        """

        def __init__(self,errors='strict'):

            """ Creates a Codec instance.

                The Codec may implement different error handling
                schemes by providing the errors argument. These
                parameters are defined:

                'strict' - raise a UnicodeError (or a subclass)
                'ignore' - ignore the character and continue with the next
                (a single character)
                         - replace erroneous characters with the given
                           character (may also be a Unicode character)

            """
            self.errors = errors

        def encode(self,u,slice=None):

            """ Return the Unicode object u encoded as Python string.

                If slice is given (as slice object), only the sliced part
                of the Unicode object is encoded.

                The method may not store state in the Codec instance. Use
                StreamCodec for codecs which have to keep state in order to
                make encoding/decoding efficient.

            """
            ...

        def decode(self,s,offset=0):

            """ Decodes data from the Python string s and returns a tuple
                (Unicode object, bytes consumed).

                If offset is given, the decoding process starts at
                s[offset]. It defaults to 0.

                The method may not store state in the Codec instance. Use
                StreamCodec for codecs which have to keep state in order to
                make encoding/decoding efficient.

            """
            ...

StreamWriter and StreamReader define the interface for stateful encoders/decoders:

    class StreamWriter(Codec):

        def __init__(self,stream,errors='strict'):

            """ Creates a StreamWriter instance.

                stream must be a file-like object open for writing
                (binary) data.

                The StreamWriter may implement different error handling
                schemes by providing the errors argument. These
                parameters are defined:

                'strict' - raise a UnicodeError (or a subclass)
                'ignore' - ignore the character and continue with the next
                (a single character)
                         - replace erroneous characters with the given
                           character (may also be a Unicode character)

            """
            self.stream = stream

        def write(self,u,slice=None):

            """ Writes the Unicode object's contents encoded to self.stream
                and returns the number of bytes written.

                If slice is given (as slice object), only the sliced part
                of the Unicode object is written.

            """
            # ... the base class should provide a default implementation
            #     of this method using self.encode ...

        def flush(self):

            """ Flushes the codec buffers used for keeping state.

                Return values are not defined. Implementations are free to
                return None, raise an exception (in case there is pending
                data in the buffers which could not be encoded) or
                return any remaining data from the state buffers used.

            """
            pass

    class StreamReader(Codec):

        def __init__(self,stream,errors='strict'):

            """ Creates a StreamReader instance.

                stream must be a file-like object open for reading
                (binary) data.

                The StreamReader may implement different error handling
                schemes by providing the errors argument. These
                parameters are defined:

                'strict' - raise a UnicodeError (or a subclass)
                'ignore' - ignore the character and continue with the next
                (a single character)
                         - replace erroneous characters with the given
                           character (may also be a Unicode character)

            """
            self.stream = stream

        def read(self,chunksize=0):

            """ Decodes data from the stream self.stream and returns a tuple
                (Unicode object, bytes consumed).

                chunksize indicates the approximate maximum number of
                bytes to read from the stream for decoding purposes. The
                decoder can modify this setting as appropriate. The default
                value 0 indicates to read and decode as much as possible.
                The chunksize is intended to prevent having to decode huge
                files in one step.

            """
            # ... the base class should provide a default implementation
            #     of this method using self.decode ...

        def flush(self):

            """ Flushes the codec buffers used for keeping state.

                Return values are not defined. Implementations are free to
                return None, raise an exception (in case there is pending
                data in the buffers which could not be decoded) or
                return any remaining data from the state buffers used.

            """

In addition to the above methods, the StreamWriter and StreamReader instances should also provide access to all other methods defined for the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader interfaces into one class.
-----------------------------------------------------------------------
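To make the proposed interface more concrete, here is a minimal sketch of one codec that follows the encode/decode/write signatures above. It is an illustration under several assumptions, not the proposed implementation: the class names Latin1Codec and Latin1StreamWriter are hypothetical, modern Python str and bytes stand in for "Unicode object" and "Python string", and the errors argument is simply passed to Python's built-in error handlers ('strict', 'ignore') rather than implementing the single-character replacement scheme.

```python
import io

class Latin1Codec:
    """Stateless codec following the proposed encode/decode signatures."""

    def __init__(self, errors="strict"):
        self.errors = errors

    def encode(self, u, slice=None):
        # Encode only the sliced part when a slice object is given.
        if slice is not None:
            u = u[slice]
        return u.encode("latin-1", self.errors)

    def decode(self, s, offset=0):
        # Return (text, bytes consumed), starting at s[offset].
        data = s[offset:]
        return data.decode("latin-1", self.errors), len(data)


class Latin1StreamWriter(Latin1Codec):
    """Stream writer whose write() is built on self.encode."""

    def __init__(self, stream, errors="strict"):
        Latin1Codec.__init__(self, errors)
        self.stream = stream

    def write(self, u, slice=None):
        data = self.encode(u, slice)
        self.stream.write(data)
        return len(data)          # number of bytes written

    def __getattr__(self, name):
        # Give access to all other methods of the wrapped stream.
        return getattr(self.stream, name)


buf = io.BytesIO()
writer = Latin1StreamWriter(buf)
print(writer.write("hello", slice(1, 3)), buf.getvalue())
print(Latin1Codec().decode(b"xxabc", 2))
```

The __getattr__ delegation is one possible way to "provide access to all other methods defined for the stream object"; subclassing the stream type would be another.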
I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package...
I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do.
Right. The list of search functions should provide this kind of laziness. It also provides ways to implement other strategies to look for codecs, e.g. PIL could provide such a search function for its codecs, mxCrypto for the included ciphers, etc.
BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good.
Yes.
PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation.
This is a step towards Java's architecture of stackable streams.
But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem.
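The idea of stackable streams can be illustrated with two small writer layers. This is a toy sketch under stated assumptions: XorWriter and UTF8Writer are hypothetical classes, the XOR step is a placeholder for a cipher and offers no real security, and the extra constructor parameter is passed as a keyword argument.

```python
import io

class XorWriter:
    """Toy 'cipher' layer: XORs each byte with a repeating key. Not real crypto."""

    def __init__(self, stream, key=b"x"):
        self.stream = stream
        self.key = key
        self.pos = 0        # position in the key stream, kept across writes

    def write(self, data):
        out = bytes(b ^ self.key[(self.pos + i) % len(self.key)]
                    for i, b in enumerate(data))
        self.pos += len(data)
        return self.stream.write(out)

    def close(self):
        self.stream.close()


class UTF8Writer:
    """Text layer: encodes str to UTF-8 and hands the bytes to the next layer."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, text):
        return self.stream.write(text.encode("utf-8"))

    def close(self):
        self.stream.close()


raw = io.BytesIO()
stacked = UTF8Writer(XorWriter(raw, key=b"\x01\x02"))
stacked.write("abc")
print(raw.getvalue())
```

Each layer only needs a write() method on the object beneath it, which is what makes the layers freely combinable.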
Well, I just wanted to mention the possibility... might be something to look into next year. I find it rather thrilling to be able to create encrypted streams by just hooking together a few stream codecs...

    f = open('myfile.txt','wb')

    # index 2 is the stream_writer factory in the
    # (encoder, decoder, stream_writer, stream_reader) tuple
    CipherWriter = sys.codec('rc5-cipher')[2]
    sf = CipherWriter(f,key='xxxxxxxx')

    UTF8Writer = sys.codec('utf-8')[2]
    sfx = UTF8Writer(sf)

    sfx.write('asdfasdfasdfasdf')
    sfx.close()

Hmm, we should probably define the additional constructor arguments to be keyword arguments... writers/readers other than Unicode ones will probably need different kinds of parameters (such as the key in the above example).

Ahem, ...I'm getting distracted here :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/