[New-bugs-announce] [issue26730] SpooledTemporaryFile doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written
James Hennessy
report at bugs.python.org
Sun Apr 10 15:11:43 EDT 2016
New submission from James Hennessy:
The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written. The attached program demonstrates the failure. It creates a SpooledTemporaryFile object, writes 20 string characters to it, and then tries to read them back. If the SpooledTemporaryFile has rolled over to disk, as it does in the demonstration program, then the data is not read back correctly. Instead, an exception is recognized due to the data in the SpooledTemporaryFile being corrupted.
The problem is this statement in tempfile.py, in the rollover() method:
newfile.seek(file.tell(), 0)
The "file" variable references a StringIO object, whose tell() and seek() methods count in characters, not bytes, yet this value is applied to a TemporaryFile object, whose tell() and seek() methods deal in bytes, not characters. The demonstration program writes 10 characters to the SpooledTemporaryFile. Since 10 exceeds the rollover size of 5, the implementation writes the 10 characters to the TemporaryFile and then seeks to position 10 in the TemporaryFile, which it thinks is the end of the stream. But those 10 characters got encoded to 30 bytes, and seek position 10 is in the middle of the UTF-8 sequence for the fourth character. The next write to the SpooledTemporaryFile starts overlaying bytes from there. The attempt to read back the data fails because the byte stream no longer represents a valid UTF-8 stream of data.
The related problem is the inconsistency of the behavior of tell() and seek() for text (non-binary) SpooledTemporaryFile objects. If the data hasn't yet rolled over to a TemporaryFile, they count in string characters. If the data has rolled over, they count in bytes.
A quick fix for this is to remove the seek() in the rollover() method. I presume it's there to preserve the stream position if an explicit call to rollover() is made, since for an implicit call, the position would be at the end of the stream already. This quick fix, therefore, would introduce an external incompatibility in the behavior of rollover().
Another possibility is to never use a StringIO object, but to always buffer data in a BytesIO object, as is done for binary SpooledTemporaryFile objects. This has the advantage of "fixing" the tell() and seek() inconsistency, making them count bytes all the time. The downside, of course, is that data that doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a round trip that could otherwise be avoided.
This problem can be circumvented by a user of SpooledTemporaryFile by explicitly seeking to the end of the stream after every write to the SpooledTemporaryFile object: spool.seek(0, io.SEEK_END)
----------
components: Library (Lib)
files: showbug.py
messages: 263147
nosy: James Hennessy
priority: normal
severity: normal
status: open
title: SpooledTemporaryFile doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file42423/showbug.py
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26730>
_______________________________________
More information about the New-bugs-announce
mailing list