[issue2857] add codec for java modified utf-8

paul rubin report at bugs.python.org
Thu May 15 16:39:45 CEST 2008


paul rubin <phr at users.sourceforge.net> added the comment:

I'm not sure what you mean by "ditto for Lucene indexes".  I wasn't
planning to use C code.  I was hoping to write Python code to parse
those indexes, then found they use this weird encoding, and Python's
codec set is fairly inclusive already, so this codec sounded like a
reasonably useful addition.  It probably shows up in other places as well.
 It might even be a reasonable internal representation for Python, which
as I understand it currently can't represent codepoints outside the BMP.
 Also, it is used in Java serialization, which I think of as a somewhat
weird and wacky thing, but it's conceivable that somebody someday might
want to write a Python program that speaks the Java serialization
protocol (I don't have a good sense of whether that's feasible).

Writing an application-specific codec with the C API is doable in
principle, but it seems like an awful lot of effort for just one quickie
program.  These indexes are very large, so writing the codec in
Python would probably be painfully slow.
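
For a sense of what the pure-Python route involves, here is a minimal
sketch of the two differences between Java's modified UTF-8 and standard
UTF-8: U+0000 is written as the two bytes C0 80, and characters outside
the BMP are written as a surrogate pair, each half encoded as three
bytes.  The function names are hypothetical, and the sketch assumes a
modern Python 3 (it relies on the "surrogatepass" error handler), not the
interpreter current when this comment was written.  It makes no claim
about speed on large indexes.

```python
def decode_modified_utf8(data: bytes) -> str:
    """Decode Java modified UTF-8 (hypothetical helper, not a stdlib codec)."""
    # C0 80 is never valid in standard UTF-8, so in well-formed modified
    # UTF-8 it can only be the encoding of U+0000.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" accepts the three-byte encoded surrogate halves.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one
    # code point; an unpaired surrogate raises UnicodeDecodeError here.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def encode_modified_utf8(text: str) -> bytes:
    """Encode a str as Java modified UTF-8 (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # two-byte form of U+0000
        elif cp < 0x10000:
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as its own three-byte sequence.
            cp -= 0x10000
            high = chr(0xD800 + (cp >> 10))
            low = chr(0xDC00 + (cp & 0x3FF))
            out += high.encode("utf-8", "surrogatepass")
            out += low.encode("utf-8", "surrogatepass")
    return bytes(out)
```

The decoder leans on the built-in UTF-8 and UTF-16 codecs, so the
per-byte work stays in C; the encoder loops per character in Python and
is the part most likely to be slow on large inputs.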

__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue2857>
__________________________________

