
Here is a new proposal for the codec interface:

class Codec:

    def encode(self, u, slice=None):

        """ Return the Unicode object u encoded as Python string.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is encoded.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order
            to make encoding/decoding efficient.
        """
        ...

    def decode(self, s, slice=None):

        """ Return an equivalent Unicode object for the encoded Python
            string s.

            If slice is given (as slice object), only the sliced part
            of the Python string is decoded and returned as Unicode
            object. Note that this can cause the decoding algorithm to
            fail due to truncations in the encoding.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order
            to make encoding/decoding efficient.
        """
        ...

class StreamCodec(Codec):

    def __init__(self, stream=None, errors='strict'):

        """ Creates a StreamCodec instance.

            stream must be a file-like object open for reading and/or
            writing binary data, depending on the intended codec
            action, or None.

            The StreamCodec may implement different error handling
            schemes by providing the errors argument. These parameters
            are known (they need not all be supported by StreamCodec
            subclasses):

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                        the given character (may also be a Unicode
                        character)
        """
        self.stream = stream

    def write(self, u, slice=None):

        """ Writes the Unicode object's contents encoded to
            self.stream.

            stream must be a file-like object open for writing binary
            data.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.
        """
        ... the base class should provide a default implementation
            of this method using self.encode ...
    def read(self, length=None):

        """ Reads an encoded string from the stream and returns an
            equivalent Unicode object.

            If length is given, only length Unicode characters are
            returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
            available data is read and decoded.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

It is not required by the unicodec.register() API to provide a subclass
of these base classes, only the given methods must be present; this
allows writing Codecs as extension types. All Codecs must provide the
.encode()/.decode() methods. Codecs having the .read() and/or .write()
methods are considered to be StreamCodecs.

The Unicode implementation will by itself only use the stateless
.encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating the
appropriate [Stream]Codec.

--
Feel free to beat on this one ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
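To make the proposed interface concrete, here is a minimal sketch of a stateless codec implementing it, written in present-day Python for illustration; the Latin1Codec name is an assumption, not part of the proposal.

```python
# Minimal sketch of the proposed Codec interface (stateless case).
# Latin1Codec is a hypothetical example; bytes/str semantics follow
# modern Python rather than the 1999 Unicode type being discussed.
class Codec:
    def encode(self, u, slice=None):
        raise NotImplementedError

    def decode(self, s, slice=None):
        raise NotImplementedError


class Latin1Codec(Codec):
    def encode(self, u, slice=None):
        if slice is not None:
            u = u[slice]          # encode only the sliced part
        return u.encode('latin-1')

    def decode(self, s, slice=None):
        if slice is not None:
            s = s[slice]          # decode only the sliced part
        return s.decode('latin-1')


codec = Latin1Codec()
```

Because Latin-1 is stateless, the same instance can be reused for any number of independent encode/decode calls, which is exactly what the "may not store state" clause requires.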

It is not required by the unicodec.register() API to provide a subclass
of these base classes, only the given methods must be present; this
allows writing Codecs as extension types. All Codecs must provide the
.encode()/.decode() methods. Codecs having the .read() and/or .write()
methods are considered to be StreamCodecs.
The Unicode implementation will by itself only use the stateless .encode() and .decode() methods.
All other conversions have to be done by explicitly instantiating the
appropriate [Stream]Codec.
Looks okay, although I'd like someone to implement a simple
shift-state-based stream codec to check this out further.

I have some questions about the constructor. You seem to imply that
instantiating the class without arguments creates a codec without
state. That's fine. When given a stream argument, shouldn't the
direction of the stream be given as an additional argument, so the
proper state for encoding or decoding can be set up? I can see that
for an implementation it might be more convenient to have separate
classes for encoders and decoders -- certainly the state being kept is
very different.

Also, I don't want to ignore the alternative interface that was
suggested by /F. It uses feed() similar to htmllib c.s. This has some
advantages (although we might want to define some compatibility so it
can also feed directly into a file).

Perhaps someone should go ahead and implement prototype codecs using
either paradigm and then write some simple apps, so we can make a
better decision.

In any case I think the specs for the codec registry API aren't on the
critical path; integration of /F's basic unicode object is the first
thing we need.

--Guido van Rossum (home page: http://www.python.org/~guido/)
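To illustrate why Guido singles out shift-state encodings, here is a toy probe of the problem: a stream writer that keeps a shift flag across write() calls, emitting SO/SI (0x0E/0x0F) when the mode toggles. The encoding scheme is invented for illustration only, vaguely in the spirit of ISO 2022, and is not a real codec.

```python
import io

# Toy shift-state stream writer. The shift flag must survive between
# write() calls -- this is the state a stateless encode() cannot hold.
SO, SI = 0x0E, 0x0F

class ToyShiftWriter:
    def __init__(self, stream):
        self.stream = stream
        self.shifted = False            # encoder state, kept between writes

    def write(self, u):
        out = bytearray()
        for ch in u:
            wide = ord(ch) > 0x7F
            if wide != self.shifted:    # mode change: emit shift byte
                out.append(SO if wide else SI)
                self.shifted = wide
            out.append(ord(ch) & 0xFF)  # toy: low byte only
        self.stream.write(bytes(out))


buf = io.BytesIO()
w = ToyShiftWriter(buf)
w.write("ab")
w.write("\u00e9")    # mode change: shift-out byte emitted here
w.write("\u00f1")    # still shifted: no second shift byte
```

Note how the second and third write() calls depend on state left behind by the earlier ones; whether that state should be set up by the constructor (given a direction) is exactly the question raised above.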

Guido van Rossum writes:
Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. This has some advantages (although we might want to define some compatibility so it can also feed directly into a file).
I think one or the other can be used, and then a wrapper that converts
to the other interface. Perhaps the encoders should provide feed(),
and a file-like wrapper can convert write() to feed(). It could also
be done the other way; I'm not sure if it matters which is "normal."

(Or perhaps feed() was badly named and should be write()? The general
intent was a little different, I think, but an output file is very
much a stream consumer.)

-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
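Fred's wrapper idea can be sketched in a few lines: present a feed()-style consumer through a file-like write() interface. Both class names here are made up for the example.

```python
# Adapter in the direction Fred suggests: write() forwards to feed(),
# so a feed()-based codec can be used wherever a file is expected.
class FeedToFileAdapter:
    def __init__(self, consumer):
        self.consumer = consumer

    def write(self, data):
        self.consumer.feed(data)    # write() is feed() by another name

    def close(self):
        close = getattr(self.consumer, 'close', None)
        if close is not None:
            close()


class ChunkSink:
    """Dummy feed() consumer that just collects its chunks."""
    def __init__(self):
        self.chunks = []

    def feed(self, data):
        self.chunks.append(data)


sink = ChunkSink()
f = FeedToFileAdapter(sink)
f.write("hello ")
f.write("world")
```

The other direction (a feed() method wrapping a write()-based object) is just as short, which supports Fred's point that it may not matter which interface is "normal."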

Guido van Rossum wrote:
It is not required by the unicodec.register() API to provide a subclass
of these base classes, only the given methods must be present; this
allows writing Codecs as extension types. All Codecs must provide the
.encode()/.decode() methods. Codecs having the .read() and/or .write()
methods are considered to be StreamCodecs.
The Unicode implementation will by itself only use the stateless .encode() and .decode() methods.
All other conversions have to be done by explicitly instantiating the
appropriate [Stream]Codec.
Looks okay, although I'd like someone to implement a simple shift-state-based stream codec to check this out further.
I have some questions about the constructor. You seem to imply that instantiating the class without arguments creates a codec without state. That's fine. When given a stream argument, shouldn't the direction of the stream be given as an additional argument, so the proper state for encoding or decoding can be set up? I can see that for an implementation it might be more convenient to have separate classes for encoders and decoders -- certainly the state being kept is very different.
Wouldn't it be possible to have the read/write methods set up the state when called for the first time ? Note that I wrote ".read() and/or .write() methods" in the proposal on purpose: you can of course implement Codecs which only implement one of them, i.e. Readers and Writers. The registry doesn't care about them anyway :-) Then, if you use a Reader for writing, it will result in an AttributeError...
Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. This has some advantages (although we might want to define some compatibility so it can also feed directly into a file).
AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some final
stage rather than for each feed. This is often more efficient. With
respect to codecs this would mean that you buffer the output in
memory, first doing only preliminary operations on the feeds and then
applying some final logic to the buffer at the time .finalize() is
called. We could define a StreamCodec subclass for this kind of
operation.
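The buffer-then-finalize pattern MAL describes can be sketched like this; the class name and the choice of UTF-8 are illustrative assumptions.

```python
# Sketch of a feed()/finalize() decoder: chunks are buffered cheaply
# and the real decoding happens once, at the end.
class BufferedUTF8Decoder:
    def __init__(self):
        self.buffer = []

    def feed(self, data):
        self.buffer.append(data)    # cheap preliminary operation

    def finalize(self):
        # final logic applied to the whole buffer in one go
        return b"".join(self.buffer).decode("utf-8")


d = BufferedUTF8Decoder()
d.feed(b"ab\xc3")   # chunk boundary splits a multibyte sequence...
d.feed(b"\xa9")     # ...which is harmless, since decoding is deferred
```

A nice side effect of deferring all work to finalize() is that chunk boundaries falling inside a multibyte sequence need no special handling at all.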
Perhaps someone should go ahead and implement prototype codecs using either paradigm and then write some simple apps, so we can make a better decision.
In any case I think the specs for the codec registry API aren't on the
critical path; integration of /F's basic unicode object is the first
thing we need.
-- 
Marc-Andre Lemburg

I have some questions about the constructor. You seem to imply that instantiating the class without arguments creates a codec without state. That's fine. When given a stream argument, shouldn't the direction of the stream be given as an additional argument, so the proper state for encoding or decoding can be set up? I can see that for an implementation it might be more convenient to have separate classes for encoders and decoders -- certainly the state being kept is very different.
Wouldn't it be possible to have the read/write methods set up the state when called for the first time ?
Hm, I'd rather be explicit. We don't do this for files either.
Note that I wrote ".read() and/or .write() methods" in the proposal on purpose: you can of course implement Codecs which only implement one of them, i.e. Readers and Writers. The registry doesn't care about them anyway :-)
Then, if you use a Reader for writing, it will result in an AttributeError...
Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. This has some advantages (although we might want to define some compatibility so it can also feed directly into a file).
AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some final
stage rather than for each feed. This is often more efficient.
With respect to codecs this would mean that you buffer the output in
memory, first doing only preliminary operations on the feeds and then
applying some final logic to the buffer at the time .finalize() is
called.
This is part of the purpose, yes.
We could define a StreamCodec subclass for this kind of operation.
The difference is that to decode from a file, your proposed interface
is to call read() on the codec, which will in turn call read() on the
stream. In /F's version, I call read() on the stream (getting
multibyte encoded data), feed() that to the codec, which in turn calls
feed() on some other back end -- perhaps another codec which in turn
feed()s its converted data to another file, perhaps an XML parser.

--Guido van Rossum (home page: http://www.python.org/~guido/)
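The chaining Guido describes can be sketched as a small pipeline: bytes read from a stream are feed()-ed to a codec filter, which feed()s its converted output downstream. All class names here are illustrative, and Latin-1 is chosen because it is stateless.

```python
import io

# A feed()-based pipeline: stream -> decode filter -> collector.
class Collector:
    def __init__(self):
        self.parts = []

    def feed(self, data):
        self.parts.append(data)


class Latin1DecodeFilter:
    """Decodes bytes fed to it and feeds the result downstream."""
    def __init__(self, target):
        self.target = target

    def feed(self, data):
        self.target.feed(data.decode('latin-1'))


stream = io.BytesIO(b'ab\xe9')
sink = Collector()
codec = Latin1DecodeFilter(sink)
# the caller reads the stream and feeds the codec, which feeds the sink
codec.feed(stream.read())
```

The back end could just as well be another codec or an XML parser, as Guido notes; any object with a feed() method slots in.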

M.-A. Lemburg writes:
Wouldn't it be possible to have the read/write methods set up the state when called for the first time ?
That slows them down; the constructor should handle initialization.

Perhaps what gets registered should be: encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class). These can be encapsulated either
before or after hitting the registry, and can be None. The registry
can provide default implementations from what is provided (stream
handlers from the functions, or functions from the stream handlers) as
required.

Ideally, I should be able to write a module with four well-known entry
points and then provide the module object itself as the registration
entry. Or I could construct a new object that has the right interface
and register that if it made more sense for the encoding.
AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some final
stage rather than for each feed. This is often more efficient.
Many of the classes that provide feed() do as much work as possible as
data is fed into them (see htmllib.HTMLParser); this structure is
commonly used to support asynchronous operation.
With respect to codecs this would mean that you buffer the output in
memory, first doing only preliminary operations on the feeds and then
applying some final logic to the buffer at the time .finalize() is
called.
That depends on the encoding. I'd expect it to feed encoded data to a
sink as quickly as it could and let the target decide what needs to
happen. If buffering is needed, the target could be a StringIO or
whatever.

-Fred
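Fred's eager alternative can also be sketched: push decoded output to the target as soon as possible, holding back only a possibly incomplete UTF-8 sequence at the end of the data. The class names are made up, and the tail-scanning logic is a simplified illustration, not production UTF-8 handling.

```python
# Eager feed()-style decoder: output goes downstream immediately;
# only an incomplete trailing multibyte sequence is buffered.
class EagerUTF8Decoder:
    def __init__(self, target):
        self.target = target
        self.pending = b""

    def feed(self, data):
        data = self.pending + data
        cut = len(data)
        # scan back over at most 3 bytes for an incomplete sequence
        for i in range(1, min(4, len(data)) + 1):
            b = data[-i]
            if b >= 0xC0:                       # lead byte found
                need = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
                if need > i:                    # sequence is incomplete
                    cut = len(data) - i
                break
            if b < 0x80:                        # ASCII: nothing pending
                break
        self.pending, chunk = data[cut:], data[:cut]
        if chunk:
            self.target.feed(chunk.decode("utf-8"))


class Sink:
    def __init__(self):
        self.parts = []

    def feed(self, text):
        self.parts.append(text)


sink = Sink()
dec = EagerUTF8Decoder(sink)
dec.feed(b"ab\xc3")    # "ab" goes out immediately; 0xC3 is held back
dec.feed(b"\xa9cd")    # completes the sequence: the rest goes out
```

Compared with the buffer-everything variant, only a few bytes of state survive between feeds, so memory use stays constant no matter how much data flows through.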

On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote:
M.-A. Lemburg writes:
Wouldn't it be possible to have the read/write methods set up the state when called for the first time ?
That slows them down; the constructor should handle initialization.
Perhaps what gets registered should be: encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class). These can be encapsulated either
before or after hitting the registry, and can be None. The registry
I'm with Fred here; he beat me to the punch (and his email is better
than what I'd write anyhow :-).

I'd like to see the API be *functions* rather than a particular class
specification. If the spec is going to say "do not alter/store state",
then a function makes much more sense than a method on an object.

Of course, bound method objects could be registered. This might occur
if you have a general JIS encoder/decoder but need to instantiate it a
little differently for each JIS variant. (Andy also mentioned
something about "options" in JIS encoding/decoding)
can provide default implementations from what is provided (stream
handlers from the functions, or functions from the stream handlers) as
required.
Excellent idea... "I'll provide the encode/decode functions, but I don't have a spiffy algorithm for streaming -- please provide a stream wrapper for my functions."
Ideally, I should be able to write a module with four well-known entry points and then provide the module object itself as the registration entry. Or I could construct a new object that has the right interface and register that if it made more sense for the encoding.
Mark's idea about throwing these things into a package for on-demand registrations is much better than a "register-beforehand" model. When the module is loaded from the package, it calls a registration function to insert its 4-tuple of registration data. Cheers, -g -- Greg Stein, http://www.lyra.org/
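The on-demand model Greg endorses can be sketched as a registry dict plus a lookup that imports `encodings.<name>` on first use, letting the module register its 4-tuple as a side effect of import. The `register`/`lookup` names and the dict stand-in are assumptions about a design still under discussion.

```python
# Sketch of on-demand registration: importing encodings.<name> is
# expected to call register() as an import side effect.
_registry = {}

def register(encoding, encoder, decoder,
             stream_writer=None, stream_reader=None):
    _registry[encoding] = (encoder, decoder, stream_writer, stream_reader)

def lookup(encoding):
    if encoding not in _registry:
        # first use: importing the module makes it register itself
        __import__('encodings.%s' % encoding)
    return _registry[encoding]


# a codec module's registration call would look like this:
register('toy',
         lambda u: u.encode('ascii'),
         lambda s: s.decode('ascii'))
enc, dec, _, _ = lookup('toy')
```

Nothing happens at startup; the import cost is paid once per encoding, the first time it is looked up.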

"Fred L. Drake, Jr." wrote:
M.-A. Lemburg writes:
Wouldn't it be possible to have the read/write methods set up the state when called for the first time ?
That slows them down; the constructor should handle initialization.
Perhaps what gets registered should be: encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class).
Guido proposed the factory approach too, though not separated into
these 4 APIs (note that your proposal looks very much like what I had
in the early version of my proposal).

Anyway, I think that factory functions are the way to go, because they
offer more flexibility w/r to reusing already instantiated codecs,
importing modules on-the-fly as was suggested in another thread
(thereby making codec module import lazy) or mapping encoder and
decoder requests all to one class.

So here's a new registry approach:

    unicodec.register(encoding, factory_function, action)

with
    encoding - name of the supported encoding, e.g. Shift_JIS
    factory_function - a function that returns an object or function
          ready to be used for action
    action - a string stating the supported action:
          'encode'
          'decode'
          'stream write'
          'stream read'

The factory_function API depends on the implementation of the codec.
The returned object's interface depends on the value of action:

Codecs:
-------

    obj = factory_function_for_<action>(errors='strict')

    'encode': obj(u, slice=None) -> Python string
    'decode': obj(s, offset=0, chunksize=0) -> (Unicode object,
              bytes consumed)

    factory_functions are free to return simple function objects for
    stateless encodings.

StreamCodecs:
-------------

    obj = factory_function_for_<action>(stream, errors='strict')

    obj should provide access to all methods defined for the stream
    object, overriding these:

    'stream write': obj.write(u, slice=None) -> bytes written to stream
                    obj.flush() -> ???
    'stream read':  obj.read(chunksize=0) -> (Unicode object,
                    bytes read)
                    obj.flush() -> ???

errors is defined like in my Codec spec. The codecs are expected to
use this argument to handle error conditions.

I'm not sure what Fredrik intended with the .flush() methods, so the
definition is still open. I would expect it to do some finalization of
state.

Perhaps we need another set of actions for the .feed()/.close()
approach...
As in earlier versions of the proposal: the registry should provide
default implementations for missing action factory_functions using the
other registered functions, e.g. 'stream write' can be emulated using
'encode' and 'stream read' using 'decode'. The same probably holds for
the feed approach.

-- 
Marc-Andre Lemburg
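Here is a sketch of how the proposed registry might be exercised, with a stateless ASCII encoder as the running example. The unicodec module does not exist; a plain dict stands in for it, and the factory name is invented.

```python
# Stand-in for the proposed unicodec.register(encoding, factory, action).
_codecs = {}

def register(encoding, factory_function, action):
    _codecs[(encoding, action)] = factory_function

def ascii_encoder_factory(errors='strict'):
    def encode(u, slice=None):
        if slice is not None:
            u = u[slice]
        return u.encode('ascii', errors)
    return encode           # stateless, so a plain function suffices

register('us-ascii', ascii_encoder_factory, 'encode')

# a consumer looks up the factory and binds the error scheme
encode = _codecs[('us-ascii', 'encode')](errors='replace')
```

Note how the errors argument is consumed by the factory, so the returned callable needs no per-call error parameter; this is the flexibility the factory buys.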

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
... Anyway, I think that factory functions are the way to go, because they offer more flexibility w/r to reusing already instantiated codecs, importing modules on-the-fly as was suggested in another thread (thereby making codec module import lazy) or mapping encoder and decoder requests all to one class.
Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO).
So here's a new registry approach:
unicodec.register(encoding,factory_function,action)
with encoding - name of the supported encoding, e.g. Shift_JIS factory_function - a function that returns an object or function ready to be used for action action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read'
This action thing is subject to error. *If* you're wanting to go this
route, then have:

    unicodec.register_encode(...)
    unicodec.register_decode(...)
    unicodec.register_stream_write(...)
    unicodec.register_stream_read(...)

They are equivalent. Guido has also told me in the past that he
dislikes parameters that alter semantics -- preferring different
functions instead. (this is why there are a good number of
PyBufferObject interfaces; I had fewer to start with)

This suggested approach is also quite a bit more wordy/annoying than
Fred's alternative:

    unicodec.register('iso-8859-1', encoder, decoder, None, None)

And don't say "future compatibility allows us to add new actions."
Well, those same future changes can add new registration functions or
additional parameters to the single register() function.

Not that I'm advocating it, but register() could also take a single
parameter: if a class, then instantiate it and call methods for each
action; if an instance, then just call methods for each action.

[ and the third/original variety: a function object as the first param
is the actual hook, and params 2 thru 4 (each are optional, or just
the stream funcs?) are the other hook functions ]
The factory_function API depends on the implementation of the codec. The returned object's interface on the value of action:
Codecs: -------
obj = factory_function_for_<action>(errors='strict')
Where does this "errors" value come from? How does a user alter that
value? Without an ability to change this, I see no reason for a
factory. [ and no: don't tell me it is a thread-state value :-) ]

On the other hand: presuming the "errors" thing is valid, *then* I see
a need for a factory.

Truly... I dislike factories. IMO, they just add code/complexity in
many cases where the functionality isn't needed. But that's just me :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Greg Stein <gstein@lyra.org> wrote:
Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO).
so where do you put the state? how do you reset the state between strings? how do you handle incremental decoding/encoding? etc. (I suggest taking another look at PIL's codec design. it solves all these problems with a minimum of code, and it works -- people have been hammering on PIL for years...) </F>

On Wed, 17 Nov 1999, Fredrik Lundh wrote:
Greg Stein <gstein@lyra.org> wrote:
Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO).
so where do you put the state?
encode() is not supposed to retain state. It is supposed to do a complete translation. It is not a stream thingy, which may have received partial characters.
how do you reset the state between strings?
There is none :-)
how do you handle incremental decoding/encoding?
Streams. -g -- Greg Stein, http://www.lyra.org/

Greg Stein <gstein@lyra.org> wrote:
so where do you put the state?
encode() is not supposed to retain state. It is supposed to do a complete translation. It is not a stream thingy, which may have received partial characters.
how do you handle incremental decoding/encoding?
Streams.
hmm. why have two different mechanisms when you can do the same thing with one? </F>

Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO).
Unless there are certain cases where factories are useful. But let's read on...
action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read'
This action thing is subject to error. *if* you're wanting to go this route, then have:
unicodec.register_encode(...) unicodec.register_decode(...) unicodec.register_stream_write(...) unicodec.register_stream_read(...)
They are equivalent. Guido has also told me in the past that he dislikes parameters that alter semantics -- preferring different functions instead.
Yes, indeed! (But weren't we going to do away with the whole registry idea in favor of an encodings package?)
Not that I'm advocating it, but register() could also take a single parameter: if a class, then instantiate it and call methods for each action; if an instance, then just call methods for each action.
Nah, that's bad -- a class is just a factory, and once you are allowing classes it's really good to also allowing factory functions.
[ and the third/original variety: a function object as the first param is the actual hook, and params 2 thru 4 (each are optional, or just the stream funcs?) are the other hook functions ]
Fine too. They should all be optional.
obj = factory_function_for_<action>(errors='strict')
Where does this "errors" value come from? How does a user alter that value? Without an ability to change this, I see no reason for a factory. [ and no: don't tell me it is a thread-state value :-) ]
On the other hand: presuming the "errors" thing is valid, *then* I see a need for a factory.
The idea is that various places that take an encoding name can also take a codec instance. So the user can call the factory function / class constructor.
Truly... I dislike factories. IMO, they just add code/complexity in many cases where the functionality isn't needed. But that's just me :-)
Get over it... In a sense, every Python class is a factory for its own instances! I think you must be confusing Python with Java or C++. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO).
Unless there are certain cases where factories are useful. But let's read on...
action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read'
This action thing is subject to error. *if* you're wanting to go this route, then have:
unicodec.register_encode(...) unicodec.register_decode(...) unicodec.register_stream_write(...) unicodec.register_stream_read(...)
They are equivalent. Guido has also told me in the past that he dislikes parameters that alter semantics -- preferring different functions instead.
Yes, indeed!
Ok.
(But weren't we going to do away with the whole registry idea in favor of an encodings package?)
One way or another, the Unicode implementation will have to access a
dictionary containing references to the codecs for a particular
encoding. You won't get around registering these at some point... be
it in a lazy way, on-the-fly or by some other means.

What we could do is implement the lookup like this:

1. call encodings.lookup_<action>(encoding) and use the return value
   for the conversion
2. if all fails, cop out with an error

Step 1 would do all the import magic and then register the found
codecs in some dictionary for faster access (perhaps this could be
done in a way that is directly available to the Unicode
implementation, e.g. in a global internal dictionary -- the one I
originally had in mind for the unicodec registry).
Not that I'm advocating it, but register() could also take a single parameter: if a class, then instantiate it and call methods for each action; if an instance, then just call methods for each action.
Nah, that's bad -- a class is just a factory, and once you are allowing classes it's really good to also allowing factory functions.
[ and the third/original variety: a function object as the first param is the actual hook, and params 2 thru 4 (each are optional, or just the stream funcs?) are the other hook functions ]
Fine too. They should all be optional.
Ok.
obj = factory_function_for_<action>(errors='strict')
Where does this "errors" value come from? How does a user alter that value? Without an ability to change this, I see no reason for a factory. [ and no: don't tell me it is a thread-state value :-) ]
On the other hand: presuming the "errors" thing is valid, *then* I see a need for a factory.
The idea is that various places that take an encoding name can also take a codec instance. So the user can call the factory function / class constructor.
Right. The argument is reachable via:

    Codec = encodings.lookup_encode('utf-8')
    codec = Codec(errors='?')
    s = codec(u"abcäöäü")

s would then equal 'abc??'.

--

Should I go ahead then and change the registry business to the new
strategy (via the encodings package in the above sense)?

-- 
Marc-Andre Lemburg

[Guido]
(But weren't we going to do away with the whole registry idea in favor of an encodings package?)
[MAL]
One way or another, the Unicode implementation will have to access a dictionary containing references to the codecs for a particular encoding. You won't get around registering these at some point... be it in a lazy way, on-the-fly or by some other means.
What is wrong with my idea of using well-known names from the encoding
module? The dict then is "encodings.<encoding-name>.__dict__". All
encodings "just work" because they leverage the Python module system.

Unless I'm missing something, there is no need for any extra registry
at all. I guess it would actually resolve to 2 dict lookups, but
that's OK surely?

Mark.

Mark Hammond wrote:
[Guido]
(But weren't we going to do away with the whole registry idea in favor of an encodings package?)
[MAL]
One way or another, the Unicode implementation will have to access a dictionary containing references to the codecs for a particular encoding. You won't get around registering these at some point... be it in a lazy way, on-the-fly or by some other means.
What is wrong with my idea of using well-known names from the encoding
module? The dict then is "encodings.<encoding-name>.__dict__". All
encodings "just work" because they leverage the Python module system.
Unless I'm missing something, there is no need for any extra registry
at all. I guess it would actually resolve to 2 dict lookups, but
that's OK surely?
The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier.

This and the fact that applications may want to ship their own codecs
(which do not get installed under the system-wide encodings package)
make the registry necessary.

I don't see a problem with the registry though -- the encodings
package can take care of the registration process without any user
interaction. There would only have to be an API for looking up an
encoding published by the encodings package for the Unicode
implementation to use. The magic behind that API is left to the
encodings package...

BTW, nothing's wrong with your idea :-) In fact, I like it a lot
because it keeps the encoding modules out of the top-level scope,
which is good.

PS: we could probably even take the whole codec idea one step further
and also allow other input/output formats to be registered, e.g.
stream ciphers or pickle mechanisms. The step in that direction is not
a big one: we'd only have to drop the specification of the Unicode
object in the spec and replace it with an arbitrary object. Of course,
this will still have to be a Unicode object for use by the Unicode
implementation.

-- 
Marc-Andre Lemburg

The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier.
This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.)
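The translation Guido describes can be sketched in one line: collapse each run of non-identifier characters to a single underscore to obtain the module name. The function name is an assumption for illustration.

```python
import re

# Map an encoding name onto an importable module name, e.g.
# 'iso-8859-1' -> 'iso_8859_1'.
def encoding_to_module(name):
    return re.sub(r'[^A-Za-z0-9]+', '_', name).lower()
```

Lower-casing also smooths over inconsistencies such as Shift_JIS using an underscore where most other encoding names use hyphens.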
This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary.
But it could be enough to register a package where to look for encodings (in addition to the system package). Or there could be a registry for encoding search functions. (See the import discussion.)
I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package...
I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do.
BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good.
Yes.
PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation.
This is a step towards Java's architecture of stackable streams. But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier.
This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.)
Right. That's one way of doing it.
This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary.
But it could be enough to register a package where to look for encodings (in addition to the system package).
Or there could be a registry for encoding search functions. (See the import discussion.)
Like a path of search functions? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be used, but even a few hundred function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

def lookup(encoding):
    codecs = _internal_dict.get(encoding, None)
    if codecs:
        return codecs
    for query in sys.encoders:
        codecs = query(encoding)
        if codecs:
            break
    else:
        raise UnicodeError, 'unknown encoding: %s' % encoding
    _internal_dict[encoding] = codecs
    return codecs

For simplicity, codecs should be a tuple (encoder, decoder, stream_writer, stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-) Here are my current versions:

-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders.
    """

    def __init__(self, errors='strict'):

        """ Creates a Codec instance.

            The Codec may implement different error handling schemes by
            providing the errors argument. These parameters are defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be a
                                    Unicode character)
        """
        self.errors = errors

    def encode(self, u, slice=None):

        """ Return the Unicode object u encoded as Python string.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is encoded.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.
        """
        ...

    def decode(self, s, offset=0):

        """ Decodes data from the Python string s and returns a tuple
            (Unicode object, bytes consumed).

            If offset is given, the decoding process starts at
            s[offset]. It defaults to 0.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.
        """
        ...

StreamWriter and StreamReader define the interface for stateful
encoders/decoders:

class StreamWriter(Codec):

    def __init__(self, stream, errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing (binary)
            data.

            The StreamWriter may implement different error handling
            schemes by providing the errors argument. These parameters
            are defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be a
                                    Unicode character)
        """
        self.stream = stream

    def write(self, u, slice=None):

        """ Writes the Unicode object's contents encoded to self.stream
            and returns the number of bytes written.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.
        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def flush(self):

        """ Flushes the codec buffers used for keeping state.

            Return values are not defined. Implementations are free to
            return None, raise an exception (in case there is pending
            data in the buffers which could not be decoded) or return
            any remaining data from the state buffers used.
        """
        pass

class StreamReader(Codec):

    def __init__(self, stream, errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading (binary)
            data.

            The StreamReader may implement different error handling
            schemes by providing the errors argument. These parameters
            are defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be a
                                    Unicode character)
        """
        self.stream = stream

    def read(self, chunksize=0):

        """ Decodes data from the stream self.stream and returns a
            tuple (Unicode object, bytes consumed).

            chunksize indicates the approximate maximum number of bytes
            to read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value 0
            indicates to read and decode as much as possible. The
            chunksize is intended to prevent having to decode huge
            files in one step.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

    def flush(self):

        """ Flushes the codec buffers used for keeping state.

            Return values are not defined. Implementations are free to
            return None, raise an exception (in case there is pending
            data in the buffers which could not be decoded) or return
            any remaining data from the state buffers used.
        """

In addition to the above methods, the StreamWriter and StreamReader
instances should also provide access to all other methods defined for
the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader
interfaces into one class.
-----------------------------------------------------------------------
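For a modern reader, the lookup scheme sketched above (a cache dict in front of a list of search functions) can be made runnable as follows; the names `_cache`, `register`, and the `identity` codec are invented for the example:

```python
_cache = {}
search_functions = []

def register(search_function):
    search_functions.append(search_function)

def lookup(encoding):
    codecs = _cache.get(encoding)
    if codecs is not None:
        return codecs
    for query in search_functions:      # the "path" of search functions
        codecs = query(encoding)
        if codecs is not None:
            break
    else:
        raise LookupError("unknown encoding: %s" % encoding)
    _cache[encoding] = codecs           # later lookups hit the cache
    return codecs

# a search function returning the proposed 4-tuple of factories
def demo_search(encoding):
    if encoding == "identity":
        return (str, str, None, None)   # (encoder, decoder, writer, reader)
    return None

register(demo_search)
enc, dec, _, _ = lookup("identity")
print(dec(enc("hello")))                         # -> hello
print(lookup("identity") is lookup("identity"))  # -> True (cached)
```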
I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package...
I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do.
Right. The list of search functions should provide this kind of laziness. It also provides ways to implement other strategies to look for codecs, e.g. PIL could provide such a search function for its codecs, mxCrypto for the included ciphers, etc.
BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good.
Yes.
PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation.
This is a step towards Java's architecture of stackable streams.
But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem.
Well, I just wanted to mention the possibility... might be something to look into next year. I find it rather thrilling to be able to create encrypted streams by just hooking together a few stream codecs...

f = open('myfile.txt', 'w')

CipherWriter = sys.codec('rc5-cipher')[3]
sf = CipherWriter(f, key='xxxxxxxx')

UTF8Writer = sys.codec('utf-8')[3]
sfx = UTF8Writer(sf)

sfx.write('asdfasdfasdfasdf')
sfx.close()

Hmm, we should probably define the additional constructor arguments to be keyword arguments... writers/readers other than Unicode ones will probably need different kinds of parameters (such as the key in the above example).

Ahem, ...I'm getting distracted here :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Like a path of search functions? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be used, but even a few hundred function calls will slow down the Unicode implementation quite a bit.
Of course. (It's like sys.modules caching the results of an import). [...]
def flush(self):
""" Flushes the codec buffers used for keeping state.
Return values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used.
"""
I don't know where this came from, but a flush() should work like flush() on a file. It doesn't return a value, it just sends any remaining data to the underlying stream (for output). For input it shouldn't be supported at all.

The idea is that flush() should do the same to the encoder state that close() followed by a reopen() would do. Well, more or less. But if the process were to be killed right after a flush(), the data written to disk should be a complete encoding, and not have a lingering shift state.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Like a path of search functions? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be used, but even a few hundred function calls will slow down the Unicode implementation quite a bit.
Of course. (It's like sys.modules caching the results of an import).
I've fixed the "path of search functions" approach in the latest version of the spec.
[...]
def flush(self):
""" Flushes the codec buffers used for keeping state.
Return values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used.
"""
I don't know where this came from, but a flush() should work like flush() on a file.
It came from Fredrik's proposal.
It doesn't return a value, it just sends any remaining data to the underlying stream (for output). For input it shouldn't be supported at all.
The idea is that flush() should do the same to the encoder state that close() followed by a reopen() would do. Well, more or less. But if the process were to be killed right after a flush(), the data written to disk should be a complete encoding, and not have a lingering shift state.
Ok. I've modified the API as follows:

StreamWriter:

    def flush(self):

        """ Flushes and resets the codec buffers used for keeping
            state.

            Calling this method should ensure that the data on the
            output is put into a clean state, that allows appending of
            new fresh data without having to rescan the whole stream
            to recover state.
        """
        pass

StreamReader:

    def read(self, chunksize=0):

        """ Decodes data from the stream self.stream and returns a
            tuple (Unicode object, bytes consumed).

            chunksize indicates the approximate maximum number of bytes
            to read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value 0
            indicates to read and decode as much as possible. The
            chunksize is intended to prevent having to decode huge
            files in one step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given chunksize, e.g. if
            optional encoding endings or state markers are available on
            the stream, these should be read too.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place. This
            method is primarily intended to recover from decoding
            errors.
        """
        pass

The .reset() method replaces the .flush() method on StreamReaders.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 42 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
The problem is that the encoding names are not Python identifiers, e.g. iso-8859-1 is not allowed as an identifier. This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary.
This isn't a substantial problem. Try this on for size (probably not too different from what everyone is already thinking, but let's make it clear). This could be in encodings/__init__.py; I've tried to be really clear on the names. (No testing, only partially complete.)

------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode,
                 make_stream_encoder, make_stream_decoder)

def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]
    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError("unknown encoding " + `encoding`)
    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]
    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode,
                  make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------
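For a modern reader, the same lazy-import-plus-registry pattern can be condensed into a runnable sketch; here a stand-in module table replaces the real `encodings` package so the example is self-contained, and `get_codec` is an invented name:

```python
import codecs
from types import SimpleNamespace

# stand-in for importable encodings.* modules exposing the
# "well-known names" encode/decode
_fake_modules = {
    "encodings.rot13": SimpleNamespace(
        encode=lambda u: codecs.encode(u, "rot13"),
        decode=lambda s: codecs.decode(s, "rot13"),
    ),
}

_registry = {}

def get_codec(encoding):
    encoding = encoding.lower()
    if encoding in _registry:                  # registry hit
        return _registry[encoding]
    modname = "encodings." + encoding.replace("-", "_")
    module = _fake_modules.get(modname)        # "import" on demand
    if module is None:
        raise LookupError("unknown encoding %r" % encoding)
    codec = SimpleNamespace(
        name=encoding,
        encode=getattr(module, "encode", None),
        decode=getattr(module, "decode", None),
    )
    _registry[encoding] = codec                # cache for next time
    return codec

c = get_codec("rot13")
print(c.encode("hello"))   # -> uryyb
print(c.decode("uryyb"))   # -> hello
```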
I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction.
No problem at all; we just need to make sure the right magic is there for the "normal" case.
PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered,
File formats are different from text encodings, so let's keep them separate. Yes, a registry can be a good approach whenever the various things being registered are sufficiently similar semantically, but the behavior of the registry/lookup can be very different for each type of thing. Let's not over-generalize. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Er, I should note that the sample code I just sent makes use of string methods. ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. This has some advantages (although we might want to define some compatibility so it can also feed directly into a file).
seeing this made me switch on my brain for a moment, and recall how things are done in PIL (which is, as I've bragged about before, another library with an internal format, and many possible external encodings). among other things, PIL lets you read and write images to both ordinary files and arbitrary file objects, but it also lets you incrementally decode images by feeding it chunks of data (through ImageFile.Parser). and it's fast -- it has to be, since images tend to contain lots of pixels...

anyway, here's what I came up with (code will follow, if someone's interested).

--------------------------------------------------------------------
A PIL-like Unicode Codec Proposal
--------------------------------------------------------------------

In the PIL model, the codecs are called with a piece of data, and return the result to the caller. The codecs maintain internal state when needed.

class decoder:

    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got so far (this might be an empty
        # string).

    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.

class encoder:

    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.
    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.

Note that a codec instance can be used for a single string; the codec registry should hold codec factories, not codec instances. In addition, you may use a single type or class to implement both interfaces at once.

--------------------------------------------------------------------
Use Cases
--------------------------------------------------------------------

A null decoder:

class decoder:
    def decode(self, s, offset=0):
        return s[offset:]
    def flush(self):
        pass

A null encoder:

class encoder:
    def encode(self, s, offset=0, buffersize=0):
        if buffersize:
            s = s[offset:offset+buffersize]
        else:
            s = s[offset:]
        return s, len(s)
    def flush(self):
        pass

Decoding a string:

def decode(s, encoding):
    c = registry.getdecoder(encoding)
    u = c.decode(s)
    t = c.flush()
    if not t:
        return u
    return u + t  # not very common

Encoding a string:

def encode(u, encoding):
    c = registry.getencoder(encoding)
    p = []
    o = 0
    while o < len(u):
        s, n = c.encode(u, o)
        p.append(s)
        o = o + n
    if len(p) == 1:
        return p[0]
    return string.join(p, "")  # not very common

Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example).

--- end of proposal
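The proposed decoder interface can be rendered runnable in modern Python by leaning on the standard library's incremental UTF-8 machinery for the hard buffering work; this is a sketch of the interface above, not /F's actual implementation:

```python
import codecs

class decoder:
    """UTF-8 decoder following the proposed decode()/flush() interface."""

    def __init__(self):
        self._dec = codecs.getincrementaldecoder("utf-8")()

    def decode(self, s, offset=0):
        # decode as much as possible; an incomplete trailing byte
        # sequence is buffered internally until more data arrives
        return self._dec.decode(s[offset:])

    def flush(self):
        # end of input: raises UnicodeDecodeError (a UnicodeError
        # subclass) if the last character was not finished
        tail = self._dec.decode(b"", final=True)
        return tail or None

d = decoder()
s = "å i åa ä e ö".encode("utf-8")
print(d.decode(s[:-1]))    # -> 'å i åa ä e ' (final 'ö' is incomplete)
try:
    d.flush()
except UnicodeError:
    print("last character not complete")
```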

Fredrik Lundh wrote:
-------------------------------------------------------------------- A PIL-like Unicode Codec Proposal --------------------------------------------------------------------
In the PIL model, the codecs are called with a piece of data, and return the result to the caller. The codecs maintain internal state when needed.
class decoder:
    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got so far (this might be an empty
        # string).
    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.
Could you explain the reason for having a .flush() method and what it should return? Note that the .decode method is not so much different from my Codec.decode method except that it uses a single offset where my version uses a slice (the offset is probably the better variant, because it avoids data truncation).
class encoder:
    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.
Ditto.
    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.
Note that a codec instance can be used for a single string; the codec registry should hold codec factories, not codec instances. In addition, you may use a single type or class to implement both interfaces at once.
Perhaps I'm missing something, but how would you define stream codecs using this interface ?
Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example).
...? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg <mal@lemburg.com> wrote:
    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.
Could you explain the reason for having a .flush() method and what it should return?
in most cases, it should either return None, or raise a UnicodeError exception:

>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> # yes, that's a valid Swedish sentence ;-)
>>> s = u.encode("utf-8")
>>> d = decoder("utf-8")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
UnicodeError: last character not complete

on the other hand, there are situations where it might actually return a string. consider a "HTML entity decoder" which uses the following pattern to match a character entity: "&\w+;?" (note that the trailing semicolon is optional).

>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> s = u.encode("html-entities")
>>> d = decoder("html-entities")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
"ö"
Perhaps I'm missing something, but how would you define stream codecs using this interface ?
input: read chunks of data, decode, and keep extra data in a local buffer.

output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...)
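The input side described here can be sketched in modern Python as a stream reader that feeds raw chunks to an incremental decoder; the class name `StreamReader` and the tiny chunk size are illustrative:

```python
import codecs
import io

class StreamReader:
    """Reads raw chunks and decodes incrementally; bytes that do not
    yet form a complete character stay buffered in the decoder."""

    def __init__(self, stream, encoding="utf-8", chunksize=4):
        self.stream = stream
        self.chunksize = chunksize
        self._dec = codecs.getincrementaldecoder(encoding)()

    def read(self):
        parts = []
        while True:
            raw = self.stream.read(self.chunksize)
            if not raw:
                # end of stream: flush any pending decoder state
                parts.append(self._dec.decode(b"", final=True))
                return "".join(parts)
            parts.append(self._dec.decode(raw))

# chunksize=1 deliberately splits every multibyte character across reads
f = io.BytesIO("åäö x åäö".encode("utf-8"))
print(StreamReader(f, chunksize=1).read())   # -> åäö x åäö
```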
Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example).
everybody should have a copy of the eff-bot guide ;-) (but alright, I plan to post a complete utf-8 implementation in a not too distant future). </F>

Fredrik Lundh wrote:
M.-A. Lemburg <mal@lemburg.com> wrote:
    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.
Could you explain the reason for having a .flush() method and what it should return?
in most cases, it should either return None, or raise a UnicodeError exception:
>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> # yes, that's a valid Swedish sentence ;-)
>>> s = u.encode("utf-8")
>>> d = decoder("utf-8")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
UnicodeError: last character not complete
on the other hand, there are situations where it might actually return a string. consider a "HTML entity decoder" which uses the following pattern to match a character entity: "&\w+;?" (note that the trailing semicolon is optional).
>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> s = u.encode("html-entities")
>>> d = decoder("html-entities")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
"ö"
Ah, ok. So the .flush() method checks for proper string endings and then either returns the remaining input or raises an error.
Perhaps I'm missing something, but how would you define stream codecs using this interface ?
input: read chunks of data, decode, and keep extra data in a local buffer.
output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...)
So the stream codecs would be wrappers around the string codecs. Have you read my latest version of the Codec interface ? Wouldn't that be a reasonable approach ? Note that I have integrated your ideas into the new API -- it's basically only missing the .flush() methods, which I can add now that I know what you meant.
Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example).
everybody should have a copy of the eff-bot guide ;-)
Sure, but the format, the format... make it printed and add a CD and you would probably have a good selling book there ;-)
(but alright, I plan to post a complete utf-8 implementation in a not too distant future).
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[Responding to some lingering mails] [/F]
>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> s = u.encode("html-entities")
>>> d = decoder("html-entities")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
"ö"
[MAL]
Ah, ok. So the .flush() method checks for proper string endings and then either returns the remaining input or raises an error.
No, please. See my previous post on flush().
input: read chunks of data, decode, and keep extra data in a local buffer.
output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...)
So the stream codecs would be wrappers around the string codecs.
No -- the other way around. Think of the stream encoder as a little FSM engine that you feed with unicode characters and which sends bytes to the backend stream. When a unicode character comes in that requires a particular shift state, and the FSM isn't in that shift state, it emits the escape sequence to enter that shift state first. It should use standard buffered writes to the output stream; i.e. one call to feed the encoder could cause several calls to write() on the output stream, or vice versa (if you fed the encoder a single character it might keep it in its own buffer). That's all up to the codec implementation. The flush() forces the FSM into the "neutral" shift state, possibly writing an escape sequence to leave the current shift state, and empties the internal buffer.

The string codec CONCEPTUALLY uses the stream codec to a cStringIO object, using flush() to force the final output. However the implementation may take a shortcut. For stateless encodings the stream codec may call on the string codec, but that's all an implementation issue.

For input, things are slightly different: you don't know how much encoded data you must read to give you N Unicode characters, so you may have to make a guess and hold on to some data that you read unnecessarily -- either in encoded form or in Unicode form, at the discretion of the implementation. Using seek() on the input stream is forbidden (it could be a pipe or socket).

--Guido van Rossum (home page: http://www.python.org/~guido/)
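This picture of the FSM stream encoder, and of the string codec as a stream codec run over an in-memory stream, can be made concrete with a toy example (modern Python; the shift-state scheme with \x0e/\x0f around digits and all the names are invented for illustration):

```python
import io

SHIFT_IN, SHIFT_OUT = b"\x0e", b"\x0f"   # invented escape sequences

class FSMEncoder:
    """Toy stream encoder: digits require a shift state; flush()
    forces the FSM back into the neutral state."""

    def __init__(self, stream):
        self.stream = stream
        self.shifted = False             # False = neutral state

    def feed(self, text):
        for ch in text:
            want = ch.isdigit()
            if want != self.shifted:     # emit escape on state change
                self.stream.write(SHIFT_IN if want else SHIFT_OUT)
                self.shifted = want
            self.stream.write(ch.encode("ascii"))

    def flush(self):
        if self.shifted:                 # leave no lingering shift state
            self.stream.write(SHIFT_OUT)
            self.shifted = False

def encode_string(u):
    # the string codec, conceptually: stream codec + in-memory
    # stream + flush() to force the final output
    sio = io.BytesIO()
    enc = FSMEncoder(sio)
    enc.feed(u)
    enc.flush()
    return sio.getvalue()

print(encode_string("ab12"))   # -> b'ab\x0e12\x0f'
```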
participants (7)
-
Barry A. Warsaw
-
Fred L. Drake, Jr.
-
Fredrik Lundh
-
Greg Stein
-
Guido van Rossum
-
M.-A. Lemburg
-
Mark Hammond