[Python-Dev] codecs question
M.-A. Lemburg
mal@lemburg.com
Sat, 30 Sep 2000 12:21:43 +0200
Martin von Loewis wrote:
>
> > the "unicodenames" patch (which replaces ucnhash) includes this
> > functionality -- but with a little distance, I think it's better to add
> > it to the unicodedata module.
> >
> > (it's included in the step 4 patch, soon to be posted to a patch
> > manager near you...)
>
> Sounds good. Is there any chance to use this in codecs, then?
If you need speed, you'd have to write a C codec for this.
And yes: the ucnhash module exports a C API through a
PyCObject, which you can use to access the static C data
table.
I don't know whether Fredrik's version will also support this.
I think a C function as the access method would be more generic
than the current direct access to the C table.
> I'm thinking of
>
> >>> print u"\N{COPYRIGHT SIGN}".encode("ascii-ucn")
> \N{COPYRIGHT SIGN}
> >>> print u"\N{COPYRIGHT SIGN}".encode("latin-1-ucn")
> ©
>
> Regards,
> Martin
>
> P.S. Some people will recognize this as the disguised question 'how
> can I convert non-convertible characters using the XML entity
> notation?'
If you just need a single encoding, e.g. Latin-1, simply clone
the codec (it's coded in unicodeobject.c) and add the XML entity
processing.
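
To illustrate what the cloned codec would do, here is a minimal
pure-Python sketch. It is not the C code from unicodeobject.c; the
function name encode_latin1_xml is made up for this example, and it
assumes numeric character references (&#N;) are an acceptable form
of "XML entity notation":

```python
def encode_latin1_xml(text):
    """Encode text as Latin-1; write anything outside Latin-1 as &#N;."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code < 256:
            # Latin-1 maps code points 0..255 directly to single bytes.
            out.append(code)
        else:
            # Not representable: emit a numeric character reference.
            out.extend(("&#%d;" % code).encode("ascii"))
    return bytes(out)
```

Calling encode_latin1_xml on a string that mixes Latin-1 characters
with others leaves the former as single bytes and spells out the
latter as references. A C version would follow the same per-character
logic, just without the Python loop overhead.
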
Unfortunately, reusing the existing codecs is not very
efficient, because there is no error handling scheme
that would let you say "encode as far as you can
and then return the encoded data plus a position marker
in the input stream/data".
Perhaps we should add a new standard error handling
scheme "break" which simply stops encoding/decoding
whenever an error occurs?
This should then allow reusing existing codecs by
processing the input in slices.
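
A rough sketch of what slice-wise processing could look like. No
"break" scheme exists in the codecs discussed here, so this sketch
emulates one with the start/end attributes that UnicodeEncodeError
carries in later Python versions (an assumption on my part about the
environment). The names encode_in_slices, xml_ref and ucn_ref are
hypothetical helpers, not part of any module:

```python
import unicodedata


def encode_in_slices(text, encoding, replace):
    """Encode with an existing codec, calling replace(ch) for failures."""
    out = []
    pos = 0
    while pos < len(text):
        try:
            # Let the existing codec encode as far as it can.
            out.append(text[pos:].encode(encoding))
            break
        except UnicodeEncodeError as exc:
            # exc.start/exc.end are offsets into the slice just tried.
            start = pos + exc.start
            end = pos + exc.end
            # Keep the part before the error, encoded by the same codec.
            out.append(text[pos:start].encode(encoding))
            # Substitute each offending character, then resume after it.
            for ch in text[start:end]:
                out.append(replace(ch).encode(encoding))
            pos = end
    return b"".join(out)


def xml_ref(ch):
    # Numeric XML character reference, e.g. "&#8364;".
    return "&#%d;" % ord(ch)


def ucn_ref(ch):
    # \N{...} escape; unicodedata.name raises ValueError for unnamed chars.
    return "\\N{%s}" % unicodedata.name(ch)
```

With this, encode_in_slices(text, "ascii", ucn_ref) behaves roughly
like the "ascii-ucn" idea, and passing "latin-1" with xml_ref gives
the XML variant. Re-slicing the input on every error is wasteful; a
real "break" handler would return the position directly instead.
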
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/