[I18n-sig] Codec API questions

M.-A. Lemburg mal@lemburg.com
Tue, 11 Apr 2000 00:34:31 +0200

Andy Robinson wrote:
> 1. Set Default Encoding at site level
> ----------------------------------------------------
> The default encoding is defined as UTF-8, which will at least annoy all
> nations equally :-).
> It looks like you can hack this any way you want by creating your own
> wrappers around stdin/stdout/stderr.  However, I wonder if Python should
> make this customizable on a site basis - for example, site.py checks for
> some option somewhere to say "I want to see Latin-1" or Shift-JIS or
> whatever.  I often used to write scripts to parse files of names and
> addresses, and use an interactive prompt to inspect the lists and tuples
> directly; the convenience of typing 'print mydata' and seeing it properly is
> nice.  What do people think?
> (Or is this feature there already and I've missed it?)

The design leaves this to user-land. I'd suggest using stdin/stdout
wrappers as needed, possibly only enabled in interactive sessions.
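
A minimal sketch of such a wrapper, assuming a current Python 3
interpreter rather than the interpreter under discussion in this thread.
The helper name wrap_output is hypothetical; codecs.getwriter() is the
standard library call. An in-memory buffer stands in for the real
stdout so the effect can be inspected:

```python
import codecs
import io


def wrap_output(raw, encoding, errors="replace"):
    """Return a writer that encodes text to `encoding` before writing to `raw`.

    `raw` must be a binary stream. `wrap_output` is a hypothetical helper
    name; only codecs.getwriter() comes from the standard library.
    """
    return codecs.getwriter(encoding)(raw, errors=errors)


# A BytesIO stands in for the real byte stream behind stdout.
buf = io.BytesIO()
out = wrap_output(buf, "latin-1")
out.write("M\u00fcnchen\n")
encoded = buf.getvalue()

# In an interactive session one could instead write (Python 3 only):
#   sys.stdout = wrap_output(sys.stdout.buffer, "latin-1")
```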

> 2. lookup returns Codec object rather than tuple?
> ---------------------------------------------------------------------
> I should have thought of this when we were in the draft stage months back,
> but couldn't really get my mind around it until I had something concrete to
> play with.
> Right now, codecs.lookup() returns a tuple of
>     (encode_func,
>     decode_func,
>     stream_encoder_factory,
>     stream_decoder_factory)
> But there is no easy way to look up the codec object itself - indeed, no
> requirement that there be one.  I'd like to see lookup always return a Codec
> object, which is guaranteed to have four methods as above, but might
> have more.  (Note that a Codec object would have the ability to create
> StreamEncoders and StreamDecoders, but would not be one by itself).
> A fifth method which is potentially very useful is validate(); a sixth might
> be repair().  And for each language, there could be specific ones such as
> expanding half-width to full-width katakana.
> Furthermore, if we can get hold of the Codec objects, we can start to reason
> about codecs - for example, ask whether encodings are compatible with each
> other.

Why do you want to query an object? The factory functions
will provide you with an object you can use as a codec
when called with the proper arguments. Note that there
can't be just one object alive, since these objects can
carry state.
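
To illustrate the four-entry lookup result and the factory functions,
here is a small sketch. It assumes a current CPython 3, where
codecs.lookup() returns a CodecInfo object that still unpacks as the
4-tuple described above; the variable names are illustrative only:

```python
import codecs
import io

# The lookup result unpacks into two stateless functions and two
# stream factories.
encode, decode, reader_factory, writer_factory = codecs.lookup("utf-8")

# The stateless functions return (output, length consumed).
data, consumed = encode("caf\u00e9")
text, used = decode(data)

# Each call to a factory yields a fresh object with its own stream and
# its own internal buffers, so there is no single shared codec object.
r1 = reader_factory(io.BytesIO(data))
r2 = reader_factory(io.BytesIO(data))
first = r1.read()
second = r2.read()
```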

BTW, the Codec API is designed to work for all kinds of
codecs. If you have a need for special new methods there's
no problem adding them to your Codec subclass -- the standard
codec mechanism won't rely on them, but you can still provide
and use them.
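
As a sketch of what such a subclass might look like, assuming Python 3:
the class name and the validate() method are hypothetical (they follow
the suggestion in the quoted text and are not part of the codecs
module), while codecs.Codec, codecs.latin_1_encode and
codecs.latin_1_decode are standard library names:

```python
import codecs


class ValidatingLatin1Codec(codecs.Codec):
    """Latin-1 codec with one extra, non-standard method."""

    def encode(self, input, errors="strict"):
        # Returns (bytes, number of characters consumed).
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns (str, number of bytes consumed).
        return codecs.latin_1_decode(input, errors)

    def validate(self, text):
        """Hypothetical extra method: can `text` be encoded losslessly?"""
        try:
            self.encode(text, "strict")
        except UnicodeEncodeError:
            return False
        return True


codec = ValidatingLatin1Codec()
ok = codec.validate("caf\u00e9")
bad = codec.validate("\u65e5\u672c")
```

The standard codec machinery only ever calls encode() and decode(), so
code that wants validate() has to hold a reference to the subclass
itself.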

> 3. direct conversion lookups and short-circuiting Unicode
> ----------------------------------------------------------------------------
> This is an extension rather than a change.  I know what I want to do, but
> have only the vaguest ideas how to implement it.
> As noted here before, you can get from Shift-JIS to EUC and vice versa
> without going through Unicode.  Because these algorithmic conversions work
> on the full 94x94 'kuten space' and not just the 6879 code points in the
> standard, they tend to work for any vendor-specific extensions and for
> user-defined characters.  Most other Asian native encodings have used a
> similar scheme.
> I'd like to see an 'extended API' to go from one native character set to
> another.  As before, this comes in two flavours, string and stream:
>     convert(string, from_enc, to_enc)   returns a string.
> We also need ways to get hold of StreamReader and StreamWriter versions.
> Now one can trivially build these using Unicode in the middle.
> codecs.lookup('from_enc', 'to_enc') would return a codec object able to
> convert from one encoding to another.  By default, this would weld together
> two Unicode codecs.  But if someone writes a codec to do the job directly,
> there should be a way to register that.

Looks like we need a set of recode codec classes here.
There is already one in codecs.py: StreamRecoder. We'd
probably need similar subclasses for the basic Codec
class though.
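
A rough sketch of the "Unicode in the middle" default, assuming
Python 3 and its bundled shift_jis and euc_jp codecs. The convert()
function is the hypothetical helper proposed in the quoted text, not an
existing API; codecs.EncodedFile() is the standard library call that
builds a StreamRecoder:

```python
import codecs
import io


def convert(data, from_enc, to_enc, errors="strict"):
    """Recode bytes by decoding to Unicode and encoding again.

    Hypothetical helper: this is the default path only. It does not
    perform the direct algorithmic conversion discussed above.
    """
    return data.decode(from_enc, errors).encode(to_enc, errors)


sjis = "\u65e5\u672c\u8a9e".encode("shift_jis")
euc = convert(sjis, "shift_jis", "euc_jp")

# Stream flavour: the underlying file holds Shift-JIS bytes, and
# reading through the recoder yields EUC-JP bytes.
recoder = codecs.EncodedFile(
    io.BytesIO(sjis), data_encoding="euc_jp", file_encoding="shift_jis"
)
streamed = recoder.read()
```

Because this path goes through Unicode, it only handles characters both
codecs can map; a registered direct converter would be needed to carry
vendor extensions or user-defined characters across.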

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/