
In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. First, a little nit.
On to the important stuff:
unicodec.register(<encname>,<encoder>,<decoder> [,<stream_encoder>, <stream_decoder>])
I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this.

(1) There are quite a lot of things you might want to do with an encoding object, and we could easily extend the interface in future. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another.

(2) Especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object.

(3) For some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, and to convert double-byte English to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge.

(4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. One bit of code might be able to share the work of several encodings by setting a few flags. This is certainly true of Japanese.

(5) Encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the encodings. If this is done well, we might have two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven.
(7) Easy to grow - encodings can be prototyped and proven in Python, and ported to C if needed or when ready.

In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way, I'll have a go at an interface for that, and try to show how it would have simplified specific problems I have faced.

We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right.

Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built.

Regards,

Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.
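A rough sketch, in Python, of the kind of Encoding interface described above. All class and method names here are illustrative assumptions, not a settled design; the subset test and validation default are just one way the points above could be realized:

```python
class Encoding:
    """Sketch of a unified Encoding interface (names illustrative)."""

    name = None

    def encode(self, u):
        """Return the Unicode string u as an encoded byte string."""
        raise NotImplementedError

    def decode(self, s):
        """Return the byte string s decoded to a Unicode string."""
        raise NotImplementedError

    def validate(self, s):
        """Return true if s is a legal byte sequence in this encoding."""
        try:
            self.decode(s)
            return True
        except ValueError:
            return False

    def characters(self):
        """Return the set of Unicode characters this encoding covers."""
        raise NotImplementedError

    def is_subset_of(self, other):
        """True if every character we cover is also covered by other."""
        return self.characters() <= other.characters()


class Latin1Encoding(Encoding):
    """Light single-byte example built on the built-in codec machinery."""

    name = 'latin-1'

    def encode(self, u):
        return u.encode('latin-1')

    def decode(self, s):
        return s.decode('latin-1')

    def characters(self):
        # Latin-1 maps bytes 0-255 directly to the first 256 code points.
        return set(map(chr, range(256)))
```

Extra per-language helpers (e.g. the katakana-width conversions mentioned for Japanese) would simply be additional methods on the relevant subclass, which is the encapsulation argument in point (3).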

Andy Robinson wrote:
u = unicode('...I am UTF8...','utf-8') will do just that. I've moved to Tim's proposal with the \uXXXX encoding for u'', BTW.
Ok, you have a point there. Here's a proposal (note that this only defines an interface, not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self,u):
        """ Return the Unicode object u encoded as Python string. """
        ...

    def decode(self,s):
        """ Return an equivalent Unicode object for the encoded
            Python string s.
        """
        ...

    def dump(self,u,stream,slice=None):
        """ Writes the Unicode object's contents encoded to the stream.

            stream must be a file-like object open for writing
            binary data.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.
        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def load(self,stream,length=None):
        """ Reads an encoded string (up to <length> bytes) from the
            stream and returns an equivalent Unicode object.

            stream must be a file-like object open for reading
            binary data.

            If length is given, only length bytes are read. Note that
            this can cause the decoding algorithm to fail due to
            truncations in the encoding.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

Codecs should raise a UnicodeError in case the conversion is not possible.

It is not required by the unicodec.register() API to provide a subclass of this base class; only the 4 given methods must be present. This allows writing Codecs as extension types.

XXX Still to be discussed:

· support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ )

· support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one; capital letters starting a word can be different than ones occurring in the middle; there are locale dependent deviations from the standard mappings.

· support for numbers, digits, whitespace, etc.

· support (or no support) for private code point areas
Mapping tables should be incorporated into the codec modules preferably as static C data. That way multiple processes can share the same data.
I will have to rely on your cooperation for the test data. Roundtrip testing is easy to implement, but I will also have to verify the output against prechecked data which is probably only creatable using visual tools to which I don't have access (e.g. a Japanese Windows installation).
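The two kinds of test mentioned - roundtripping and comparison against prechecked data - are each only a few lines. This is a sketch, with UTF-16-LE chosen for the golden-data example simply because its expected bytes are easy to state by hand:

```python
def roundtrip_ok(encoding, text):
    """Encode then decode and check we get the original text back."""
    return text.encode(encoding).decode(encoding) == text


def matches_golden(encoding, text, expected_bytes):
    """Verify encoder output against independently prechecked byte data."""
    return text.encode(encoding) == expected_bytes
```

Roundtripping alone cannot catch an encoder and decoder that are consistently wrong in the same way, which is why the prechecked golden data is still needed.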
Ok. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
def encode(self,u):
""" Return the Unicode object u encoded as Python string.
This should accept an optional slice parameter, and use it in the same way as .dump().
def dump(self,u,stream,slice=None):
...
def load(self,stream,length=None):
Why not have something like .wrapFile(f) that returns a file-like object with all the file methods implemented, and doing the "right thing" regarding encoding/decoding? That way, the new file-like object can be used directly with code that works with files and doesn't care whether it uses 8-bit or unicode strings.
Codecs should raise an UnicodeError in case the conversion is not possible.
I think that should be ValueError, or UnicodeError should be a subclass of ValueError. (Can the -X interpreter option be removed yet?) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives
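The point of the subclassing is backward compatibility: existing handlers that catch ValueError keep working when the more specific exception is raised. Modern Python does in fact define UnicodeError as a ValueError subclass; the local stand-in class below is purely illustrative:

```python
class MyUnicodeError(ValueError):
    """Illustrative stand-in for the proposed UnicodeError."""
    pass


def convert(raise_it):
    """Simulate a conversion that may fail with the new exception."""
    try:
        if raise_it:
            raise MyUnicodeError("conversion not possible")
        return "ok"
    except ValueError as exc:   # pre-existing handler still catches it
        return type(exc).__name__
```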

"Fred L. Drake, Jr." wrote:
Ok.
See File Output of the latest version:

File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the 's#' argument parsing marker, the buffer interface implementation determines the encoding to use (see Buffer Interface).

For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object):

import unicodec
file = open('mytext.txt','rb')
ufile = unicodec.stream(file,'utf-16')
u = ufile.read()
...
ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.
Ok.
(Can the -X interpreter option be removed yet?)
Doesn't Python convert class exceptions to strings when -X is used? I would guess that many scripts already rely on the class-based mechanism (much of my stuff does for sure), so by the time 1.6 is out, I think -X should be considered an option to run pre-1.5 code rather than using it for performance reasons.

-- Marc-Andre Lemburg

M.-A. Lemburg writes:
Sounds good to me! I guess I just missed it; there's been so much going on lately.
Actually, I'd call it unicodec.open(). I asked:
(Can the -X interpreter option be removed yet?)
You commented:
Gosh, I never thought of it as a performance issue! What I'd like to do is avoid code like this:

try:
    class UnicodeError(ValueError):
        # well, something would probably go here...
        pass
except TypeError:
    class UnicodeError:
        # something slightly different for this one...
        pass

Trying to use class exceptions can be really tedious, and often I'd like to pick up the stuff from Exception.

-Fred

"M" == M <mal@lemburg.com> writes:
M> Doesn't Python convert class exceptions to strings when -X is
M> used ? I would guess that many scripts already rely on the
M> class based mechanism (much of my stuff does for sure), so by
M> the time 1.6 is out, I think -X should be considered an option
M> to run pre 1.5 code rather than using it for performance
M> reasons.

This is a little off-topic so I'll be brief. When using -X Python never even creates the class exceptions, so it isn't really a conversion. It just uses string exceptions and tries to craft tuples for what would be the superclasses in the class-based exception hierarchy. Yes, class-based exceptions are a bit of a performance hit when you are catching exceptions in Python (because they need to be instantiated), but they're just so darn *useful*. I wouldn't mind seeing the -X option go away for 1.6.

-Barry

[MAL]
Codecs should raise an UnicodeError in case the conversion is not possible.
[Fred L. Drake, Jr.]
[MAL]
-X is a red herring. That is, do what seems best without regard for -X. I already added one subclass exception to the CVS tree (UnboundLocalError as a subclass of NameError), and in doing that had to figure out how to make it do the right thing under -X too. It's a bit clumsy to arrange, but not a problem.
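The UnboundLocalError case mentioned behaves the same way under the subclassing scheme: code that catches NameError also catches the new, more specific exception. A small demonstration (current Python retains exactly this hierarchy):

```python
def read_before_assign():
    """Trigger UnboundLocalError: x is local (assigned below) but unbound."""
    try:
        x = x + 1              # raises UnboundLocalError at the read of x
    except NameError as exc:   # a NameError handler catches the subclass too
        return type(exc).__name__
```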

participants (5)
-
Andy Robinson
-
Barry A. Warsaw
-
Fred L. Drake, Jr.
-
M.-A. Lemburg
-
Tim Peters