[Python-Dev] Decoding incomplete unicode
M.-A. Lemburg
mal at egenix.com
Wed Aug 18 10:36:06 CEST 2004
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>> I've thought about this some more. Perhaps I'm still missing
>> something, but wouldn't it be possible to add a feeding
>> mode to the existing stream codecs by creating a new queue
>> data type (much like the queue you have in the test cases of
>> your patch) and using the stream codecs on these?
>
> Here is the problem. In UTF-8, how does the actual algorithm
> tell (the application) that the bytes it got on decoding provide
> for three fully decodable characters, and that 2 bytes are left
> undecoded, and that those bytes are not inherently ill-formed,
> but lack a third byte to complete the multi-byte sequence?
This state can be stored in the stream codec instance,
e.g. by using a special state object that is kept in
the instance and passed to the encode/decode APIs of the
codec, or by implementing the stream codec itself in C.

We do need to extend the API between the stream codec
and the encode/decode functions, no doubt about that.
However, this extension is well hidden from the user of
the codec and won't break existing code.
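
To make the idea concrete, here is a minimal sketch of a decoder that keeps its undecoded bytes in the instance. The class name FeedingUTF8Decoder and its feed()/close() methods are hypothetical and only for illustration. The sketch assumes the `final` argument of codecs.utf_8_decode as found in current CPython, which is not necessarily the extended API being discussed here.

```python
import codecs


class FeedingUTF8Decoder:
    """Hypothetical feeding decoder; the state lives in the instance."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self._pending = b""  # bytes of an incomplete trailing sequence

    def feed(self, data):
        # Prepend whatever was left over from the previous call.
        data = self._pending + data
        # final=False: an incomplete trailing sequence is left unconsumed
        # instead of being reported as an error.
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        self._pending = data[consumed:]
        return text

    def close(self):
        # final=True: leftover bytes are now treated as truncated input.
        text, _ = codecs.utf_8_decode(self._pending, self.errors, True)
        self._pending = b""
        return text


decoder = FeedingUTF8Decoder()
raw = "a\u20acb".encode("utf-8")  # 61 E2 82 AC 62
parts = [decoder.feed(raw[:2]), decoder.feed(raw[2:3]), decoder.feed(raw[3:])]
print(parts)
```

The caller only ever sees feed() and close(); the bookkeeping about partial sequences stays inside the codec object, which is the sense in which the extension is hidden from the user.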
> On top of that, you can implement whatever queuing or streaming
> APIs you want, but you *need* an efficient way to communicate
> incompleteness.
Agreed.
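
One possible shape for communicating incompleteness is a decode call that returns the number of bytes it consumed alongside the decoded text. The sketch below uses codecs.utf_8_decode with its `final` argument as it exists in current CPython. This is an assumption for illustration, not necessarily the interface proposed in this thread. It covers the case described above: three decodable characters followed by two bytes of an unfinished three-byte sequence.

```python
import codecs

# Three complete characters plus the first two bytes of a three-byte
# sequence (the euro sign is E2 82 AC in UTF-8).
data = "abc\u20ac".encode("utf-8")[:-1]

# final=False: stop before an incomplete trailing sequence.
text, consumed = codecs.utf_8_decode(data, "strict", False)
leftover = data[consumed:]
print(repr(text), consumed, leftover)

# Inherently ill-formed input is still an error, even with final=False,
# so the caller can tell "incomplete" apart from "broken".
try:
    codecs.utf_8_decode(b"abc\xff\xfe", "strict", False)
    ill_formed = False
except UnicodeDecodeError:
    ill_formed = True
print(ill_formed)
```

The difference between len(data) and the consumed count tells the caller how many bytes are still waiting for more input.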
--
Marc-Andre Lemburg
eGenix.com