[I18n-sig] Codec API questions

Brian Hooper brian_takashi@hotmail.com
Mon, 10 Apr 2000 21:09:58 GMT

Hi Andy,

I've been busy recently working with the Unicode API myself and
have been thinking some of the same things... (BTW, for a current project
I am working with Basistech's Rosette libraries, and have actually
plugged them into a Python codec, so I might be able to help with any
Q's about how/what Basistech does).

>I'm beginning to wonder about some issues with the unicode implementation.
>Bear in mind we have seven weeks left - if anyone else has issues or
>opinions, we should raise them now.
>1. Set Default Encoding at site level
>The default encoding is defined as UTF8, which will at least annoy all
>nations equally :-).
>It looks like you can hack this any way you want by creating your own
>wrappers around stdin/stdout/stderr.  However, I wonder if Python should
>make this customizable on a site basis - for example, site.py checks for
>some option somewhere to say "I want to see Latin-1" or Shift-JIS or
>whatever.  I often used to write scripts to parse files of names and
>addresses, and use an interactive prompt to inspect the lists and tuples
>directly; the convenience of typing 'print mydata' and see it properly is
>nice.  What do people think?
Is there any reason that this should be set on a per-site basis?  I
definitely agree that it should be possible to change the interpreter's
default encoding, but wouldn't it be nicer if it could be changed on a
per-interpreter basis instead, either via environment variables or maybe
command-line flags?  Would it be too much of a performance hit to look up
the default on every conversion that doesn't explicitly specify an
encoding?  That would give the most flexibility of all.  (It doesn't seem
to me that this would be too slow, but I don't have very deep knowledge
of this.)
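
To make the idea concrete, here is a rough sketch of looking up the
default at conversion time. It is only an illustration: the variable name
PY_DEFAULT_ENCODING and the helpers to_bytes() and to_text() are made up
for this example and are not anything the interpreter actually reads or
provides.

```python
import codecs
import os

# Hypothetical environment variable name, invented for this sketch.
ENV_VAR = "PY_DEFAULT_ENCODING"
FALLBACK = "utf-8"


def default_encoding(environ=os.environ):
    """Resolve the default on every call, so a change takes effect at once."""
    name = environ.get(ENV_VAR, FALLBACK)
    # codecs.lookup() raises LookupError for an unknown encoding name and
    # gives back the canonical spelling for a known one.
    return codecs.lookup(name).name


def to_bytes(text, encoding=None, environ=os.environ):
    """Encode text, consulting the default only when no encoding is given."""
    return text.encode(encoding or default_encoding(environ))


def to_text(data, encoding=None, environ=os.environ):
    """Decode bytes, consulting the default only when no encoding is given."""
    return data.decode(encoding or default_encoding(environ))
```

The per-conversion cost here is one dictionary lookup plus one codec
registry lookup, which is the overhead I was wondering about above.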

>(Or is this feature there already and I've missed it?)
No, UTF-8 is the hardcoded default.

>2. lookup returns Codec object rather than tuple?


I really like this idea, and the optional addition of validate()
and repair() is a good idea too.

>3. direct conversion lookups and short-circuiting Unicode


This also seems like a good idea to me, and something that would
be really good for Japanese support.

As for registering, rather than changing how that's done, what about
changing search functions so that they are required to take a second
argument, which defaults to Unicode (UTF-16) but could also be some other
encoding?  The lookup procedure would always call the search function
with a to and a from encoding, and the search function could handle the
arguments by returning either a direct converter or a 'welded' converter
codec, as appropriate.
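
Roughly what I have in mind, as a sketch only. The registry, register(),
lookup() and both search functions below are hypothetical names invented
for illustration; they are separate from the real codecs.register() and
codecs.lookup(), and here a converter is just a plain function rather than
a full codec.

```python
import codecs

UNICODE = None           # stands for "plain Unicode text" as the source
_search_functions = []   # hypothetical registry, not the one in codecs


def register(search):
    _search_functions.append(search)


def lookup(target, source=UNICODE):
    """Always pass both a to and a from encoding to each search function."""
    for search in _search_functions:
        converter = search(target, source)
        if converter is not None:
            return converter
    raise LookupError("no converter from %r to %r" % (source, target))


def latin1_to_utf8(data):
    """Direct byte-to-byte conversion that never builds a Unicode string."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)
        else:
            # Code points U+0080..U+00FF become two UTF-8 bytes.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)


def direct_search(target, source):
    """Return a direct converter for the pairs it knows, otherwise None."""
    if source is UNICODE:
        return None
    try:
        pair = (codecs.lookup(source).name, codecs.lookup(target).name)
    except LookupError:
        return None
    return latin1_to_utf8 if pair == ("iso8859-1", "utf-8") else None


def welded_search(target, source):
    """Fallback: weld a decoder and an encoder together via Unicode."""
    try:
        encoder = codecs.lookup(target)
        decoder = None if source is UNICODE else codecs.lookup(source)
    except LookupError:
        return None
    if decoder is None:
        return lambda text: encoder.encode(text)[0]
    return lambda data: encoder.encode(decoder.decode(data)[0])[0]


register(direct_search)   # consulted first, so direct converters win
register(welded_search)
```

With this arrangement lookup("utf-8", "latin-1") hands back the direct
converter, while a pair such as lookup("shift_jis", "euc_jp") falls
through to the welded one.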
