Some thoughts on the codecs...

Andy Robinson wrote:

> 1. Stream interface
>
> At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me, as the source (or target) is still a Unicode string in memory.
>
> This is a real problem: a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.
>
> What's a good API for real stream conversion? Just Codec.encodeStream(infile, outfile)? Or is it more useful to feed the codec with data a chunk at a time?

A user-defined chunking factor (suitably defaulted) would be useful for processing large files (see the chunked-conversion sketch after this message).

> 2. Data driven codecs
>
> I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?
>
> First of all, the ISO 8859 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.
>
> Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.
>
> Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0xFE32". In these cases, a script can build a family of related codecs in an auditable manner.

The problem here is that we need to decide whether we are Unicode-centric, or whether Unicode is just another supported encoding. If we are Unicode-centric, then all code-page translations will require static mapping tables between the appropriate Unicode character and the relevant code points in the other encoding. This would involve (worst case) 64k static tables for each supported encoding. Unfortunately this also precludes the use of algorithmic conversions and/or sparse conversion tables, because most of these transformations are relative to a source and target non-Unicode encoding, e.g. JIS <----> EUCJIS.

If we are taking the IBM approach (see CDRA), then we can mix and match approaches, and treat Unicode strings as just Unicode, and normal strings as being any arbitrary MBCS encoding.

To guarantee the utmost interoperability and Unicode 3.0 (and beyond) compliance, we should probably assume that all core encodings are relative to Unicode as the pivot encoding. This should hopefully avoid any gotchas with round trips between any two arbitrary native encodings.
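A minimal sketch of the pivot model, using present-day Python codec machinery purely for illustration (the function name and the example encodings are not part of the proposal):

    import codecs

    def recode(data, src, dst, errors="strict"):
        """Convert between two byte encodings, pivoting through Unicode."""
        text = codecs.decode(data, src, errors)   # native encoding -> Unicode
        return codecs.encode(text, dst, errors)   # Unicode -> native encoding

    # e.g. recode(jis_bytes, "iso2022_jp", "euc_jp")  # JIS -> Unicode -> EUC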
The downside is that this will probably be slower than an optimised algorithmic transformation.

> 3. What encodings to distribute?
>
> The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?

Ship the Unicode encodings in the core; the rest should be an add-on package. If we are truly Unicode-centric, this gives us the most value in terms of accessing a Unicode character properties database, which will provide language-neutral case folding, Hankaku <----> Zenkaku folding (Japan specific), and composition / normalisation between composed characters and their component non-spacing characters.

Regards,
Mike da Silva
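On the stream-conversion question in point 1, a hedged sketch of what a chunk-at-a-time recoder could look like, again using present-day Python's incremental codec objects for illustration only. The names are not part of the proposal; chunk_size plays the role of the user-defined chunking factor.

    import codecs

    def recode_stream(infile, outfile, src, dst, chunk_size=64 * 1024,
                      errors="strict"):
        """Recode a binary stream from `src` to `dst` a chunk at a time,
        pivoting through Unicode.  The incremental decoder carries shift
        state and partial multi-byte sequences across chunk boundaries,
        so modal (shift-in/shift-out) encodings are not broken by
        arbitrary slicing of the input."""
        decoder = codecs.getincrementaldecoder(src)(errors)
        encoder = codecs.getincrementalencoder(dst)(errors)
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            outfile.write(encoder.encode(decoder.decode(chunk)))
        # flush any pending state (e.g. a final shift-back sequence)
        outfile.write(encoder.encode(decoder.decode(b"", True), True))

    # Hypothetical usage:
    # with open("in.jis", "rb") as f, open("out.euc", "wb") as g:
    #     recode_stream(f, g, "iso2022_jp", "euc_jp")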

"Da Silva, Mike" wrote:
Optimizations should go into separate packages for direct EncodingA -> EncodingB conversions. I don't think we need them in the core.
From the proposal:
""" Unicode Character Properties: ----------------------------- A separate module "unicodedata" should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file. Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc. Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt which takes up around 200kB. For this reason, the data should be stored in static C data. This enables compilation as shared module which the underlying OS can shared between processes (unlike normal Python code modules). XXX Define the interface... """ Special CJK packages can then access this data for the purposes you mentioned above. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

"Da Silva, Mike" wrote:
Optimizations should go into separate packages for direct EncodingA -> EncodingB conversions. I don't think we need them in the core.
From the proposal:
""" Unicode Character Properties: ----------------------------- A separate module "unicodedata" should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file. Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc. Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt which takes up around 200kB. For this reason, the data should be stored in static C data. This enables compilation as shared module which the underlying OS can shared between processes (unlike normal Python code modules). XXX Define the interface... """ Special CJK packages can then access this data for the purposes you mentioned above. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/