[Patches] [ python-Patches-998993 ] Decoding incomplete unicode
SourceForge.net
noreply at sourceforge.net
Tue Sep 7 22:30:34 CEST 2004
Patches item #998993, was opened at 2004-07-27 22:35
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=998993&group_id=5470
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Nobody/Anonymous (nobody)
Summary: Decoding incomplete unicode
Initial Comment:
Python's unicode machinery currently has problems when
decoding incomplete input.
When codecs.StreamReader.read() encounters a
decoding error it reads more bytes from the input stream
and retries decoding. This is broken for two reasons:
1) The error might be due to a malformed byte sequence
in the input, a problem that can't be fixed by reading
more bytes.
2) There may be no more bytes available at this time.
Once more data is available decoding can't continue
because bytes from the input stream have already been
read and thrown away.
(sio.DecodingInputFilter has the same problems)
To fix this, three changes are required:
a) We need stateful versions of the decoding functions
that don't raise "truncated data" exceptions, but decode
as much as possible and return the position where
decoding stopped.
b) The StreamReader classes need to use those stateful
versions of the decoding functions.
c) codecs.StreamReader needs to keep an internal
buffer with the bytes read from the input stream that
haven't been decoded into unicode yet.
For a) the Python API already exists: All decoding
functions in the codecs module return a tuple containing
the decoded unicode object and the number of bytes
consumed. But this functionality isn't implemented in the
decoders:
    codecs.utf_8_decode(u"aä".encode("utf-8")[:-1])
raises an exception instead of returning (u"a", 1).
This can be fixed by extending the UTF-8 and UTF-16
decoding functions like this:
    PyObject *PyUnicode_DecodeUTF8Stateful(
        const char *s, int size,
        const char *errors, int *consumed)
If consumed == NULL PyUnicode_DecodeUTF8Stateful()
behaves like PyUnicode_DecodeUTF8() (i.e. it raises
a "truncated data" exception). If consumed != NULL it
decodes as much as possible (raising exceptions for
invalid byte sequences) and puts the number of bytes
consumed into *consumed.
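The two behaviours described above can be tried from Python. The following is a minimal sketch in Python 3 syntax, assuming an interpreter in which codecs.utf_8_decode() accepts a third positional "final" flag; the flag is used here as a stand-in for the consumed == NULL / consumed != NULL distinction at the C level and is not part of the proposal text above.

```python
import codecs

# "aä" encoded as UTF-8 with the last byte cut off: b"a\xc3"
data = "aä".encode("utf-8")[:-1]

# final=False: decode as much as possible and report the bytes consumed
# (assumed to correspond to calling the C function with consumed != NULL).
text, consumed = codecs.utf_8_decode(data, "strict", False)

# final=True: truncated input is treated as an error
# (assumed to correspond to consumed == NULL).
try:
    codecs.utf_8_decode(data, "strict", True)
    truncated_raises = False
except UnicodeDecodeError:
    truncated_raises = True

# A malformed sequence is an error in both modes; reading more bytes
# cannot repair it.
try:
    codecs.utf_8_decode(b"a\xffb", "strict", False)
    malformed_raises = False
except UnicodeDecodeError:
    malformed_raises = True
```

The leftover bytes are data[consumed:], which is what a stream reader has to keep for the next call.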
Additionally for UTF-7 we need to pass the decoder
state around.
An implementation of c) looks like this:
    def read(self, size=-1):
        if size < 0:
            data = self.bytebuffer+self.stream.read()
        else:
            data = self.bytebuffer+self.stream.read(size)
        (object, decodedbytes) = self.decode(data, self.errors)
        self.bytebuffer = data[decodedbytes:]
        return object
Unfortunately this changes the semantics. read() might
return an empty string even if there would be more data
available. But this can be fixed if we continue reading
until at least one character is available.
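The "continue reading until at least one character is available" idea might look roughly like the sketch below (Python 3 syntax). The class name BufferedUTF8Reader is hypothetical, the decoder is hard-wired to UTF-8 for brevity, and this is not the code from the patch.

```python
import codecs
import io


class BufferedUTF8Reader:
    """Hypothetical minimal reader that keeps undecoded bytes between calls."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""

    def decode(self, data, errors):
        # final=False: stop in front of a truncated sequence instead of raising
        return codecs.utf_8_decode(data, errors, False)

    def read(self, size=-1):
        while True:
            newdata = self.stream.read() if size < 0 else self.stream.read(size)
            data = self.bytebuffer + newdata
            text, decodedbytes = self.decode(data, self.errors)
            # keep the bytes that could not be decoded yet
            self.bytebuffer = data[decodedbytes:]
            # return once something was decoded, or the stream had nothing new
            if text or not newdata:
                return text


# Reading one byte at a time still yields whole characters.
reader = BufferedUTF8Reader(io.BytesIO("ä€".encode("utf-8")))
chars = [reader.read(1), reader.read(1), reader.read(1)]
```

Note that the loop stops when the stream returns no new data, so on a stream that is merely empty for now read() can still return an empty string; the bytes read so far stay in bytebuffer.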
The patch implements a few additional features:
read() has an additional argument chars that can be
used to specify the number of characters that should be
returned.
readline() is supported on all readers derived from
codecs.StreamReader().
readline() and readlines() have an additional option for
dropping the u"\n".
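As an illustration of these three features, here is a short sketch against the codecs.StreamReader interface as it exists in current Python 3 releases, where the arguments are named chars and keepends. Whether the signatures in the patch itself match exactly is an assumption.

```python
import codecs
import io

payload = "häl\nlo\n".encode("utf-8")


def make_reader():
    # codecs.getreader() returns the StreamReader class for the encoding
    return codecs.getreader("utf-8")(io.BytesIO(payload))


# read() with chars: ask for a number of characters, not bytes
two_chars = make_reader().read(chars=2)

# readline() with and without the trailing line end
reader = make_reader()
stripped = reader.readline(keepends=False)
kept = reader.readline()

# readlines() with line ends dropped
lines = make_reader().readlines(keepends=False)
```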
The patch is still missing changes to the escape codecs
("unicode_escape" and "raw_unicode_escape"), but it
has test cases that check the new functionality for all
affected codecs (UTF-7, UTF-8, UTF-16, UTF-16-LE,
UTF-16-BE).
----------------------------------------------------------------------
>Comment By: Walter Dörwald (doerwalter)
Date: 2004-09-07 22:30
Message:
Logged In: YES
user_id=89016
Checked in as:
Doc/api/concrete.tex 1.56
Doc/lib/libcodecs.tex 1.33
Include/unicodeobject.h 2.46
Lib/codecs.py 1.34
Lib/encodings/utf_16.py 1.5
Lib/encodings/utf_16_be.py 1.4
Lib/encodings/utf_16_le.py 1.4
Lib/encodings/utf_8.py 1.3
Lib/test/test_codecs.py 1.13
Misc/NEWS 1.1129
Modules/_codecsmodule.c 2.20
Objects/unicodeobject.c 2.224
I've added documentation for the chars and keepends
arguments.
I've removed the #defines for the UTF7 codec, although I
think they should be added back in: The C functions *do*
exist, it's just the UCS2/UCS4 name mangling that's missing.
> diff4.txt looks OK (even though I don't like the final
> argument in the _codecs module decode APIs).
I think the other alternatives are worse: 1) Implement two
versions of the decoding functions that use a common
PyUnicode_Decode???() (like the first patch does). 2)
Implement two versions of the decoding functions, each one
using a separate version of PyUnicode_Decode???().
I'll open the new report once 2.4 is out the door and we can
start discussing the final argument and the feed API.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-31 16:19
Message:
Logged In: YES
user_id=38388
diff4.txt looks OK (even though I don't like the final
argument in the _codecs module decode APIs).
Please remove the UTF-7 #defines and then check it in.
Thanks, Walter.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-08-27 21:00
Message:
Logged In: YES
user_id=89016
diff4.txt includes patches to the documentation
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-08-24 22:02
Message:
Logged In: YES
user_id=89016
Here is a third version of the patch with the requested
changes.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-24 12:15
Message:
Logged In: YES
user_id=38388
Walter, please update the first version of the patch as
outlined in my python-dev posting:
* move the UTF-7 change to a separate patch (this won't get
  checked in for Python 2.4)
* remove the extra APIs from the _codecs patches (these are
  not needed; instead the existing APIs should be updated to
  use the ...Stateful() C APIs and pass along the possibly
  changed consumed setting)
Thanks.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-08-10 21:22
Message:
Logged In: YES
user_id=89016
Here is a second version of the patch: It implements a final
argument for read/write/decode/encode, which specifies
whether this is the last call to the method. It also adds a
chunk reader/writer API to StreamReader/Writer and unifies
the stateless/stateful decoding functions in the codecs
module again.
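To illustrate what a final argument means for a decoder, here is a small sketch in Python 3 syntax. It uses codecs.getincrementaldecoder(), whose decode(input, final=False) method is part of current Python; it is not the API of this patch version and is only assumed to behave analogously.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
encoded = "aä€".encode("utf-8")

# Feed the input one byte at a time. With final=False an incomplete
# sequence is buffered inside the decoder instead of raising.
pieces = []
for i in range(len(encoded)):
    pieces.append(decoder.decode(encoded[i:i + 1], final=False))

# The last call passes final=True, so leftover bytes would be an error.
pieces.append(decoder.decode(b"", final=True))
text = "".join(pieces)

# With final=True a truncated sequence is reported immediately.
strict_decoder = codecs.getincrementaldecoder("utf-8")()
try:
    strict_decoder.decode(b"\xc3", final=True)
    truncation_detected = False
except UnicodeDecodeError:
    truncation_detected = True
```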
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2004-07-27 23:11
Message:
Logged In: YES
user_id=21627
Marc-Andre, can you please specifically point to the places
in the patch where it violates the principles you have
stated? E.g. where does it maintain state outside the
StreamReader/Writer?
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-27 22:56
Message:
Logged In: YES
user_id=38388
Walter, I think you should split this into multiple feature
requests.
First of all, I agree that the current situation with
StreamReader on malformed input is not really ideal.
However, I don't think we need to fix anything in terms of
adding new interfaces. Also, introducing state at the
encode/decode level breaks the design of the codecs
functions -- only StreamReader/Writer may maintain state.
Now, the situation is not that bad though: the case of a
codec continuing as far as possible and then returning the
decoded data as well as the number of bytes consumed is
basically just another error handling scheme. Let's call it
"break". If errors is set to "break", the codec will stop
decoding/encoding and return the coded data as well as the
number of input characters consumed.
You could then use this scheme in the StreamWriter/Reader to
implement the "read as far as possible" scheme.
----------------------------------------------------------------------