[ python-Bugs-1701389 ] utf-16 codec problems with multiple file append

SourceForge.net noreply at sourceforge.net
Thu May 3 17:03:57 CEST 2007


Bugs item #1701389, was opened at 2007-04-16 12:05
Message generated for change (Comment added) made by doerwalter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

Initial Comment:
This bug is similar to, but not exactly the same as, bug #215974
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

In my test, even multiple write() calls within a single open()~close()
lifespan do not cause the multiple-BOM phenomenon mentioned in bug #215974.
Perhaps bug #215974 was somehow fixed during the past seven years, even
though Lemburg classified it as WontFix.

However, if a file is appended to more than once via
codecs.open('file.txt', 'a', 'utf16'), multiple BOMs do appear.

At the same time, the claim in bug #215974 that "(extra unnecessary) BOM
marks are removed from the input stream by the Python UTF-16 codec" is still
not true today, on Python 2.4.4 and Python 2.5.1c1 on Windows XP.

Iceberg
------------------

PS: I did not find the "File Upload" checkbox mentioned on this web page, so I think I'd better paste the code right here...

import codecs, os

filename = "test.utf-16"
if os.path.exists(filename): os.unlink(filename)  # reset

def myOpen():
  return codecs.open(filename, "a", 'UTF-16')
def readThemBack():
  return list( codecs.open(filename, "r", 'UTF-16') )
def clumsyPatch(raw): # a workaround; you can enable it below after your first run of this program
  for line in raw:
    if line[0] in (u'\ufffe', u'\ufeff'): # get rid of the BOMs
      yield line[1:]
    else:
      yield line

fout = myOpen()
fout.write(u"ab\n") # to simplify the problem, I only use ASCII chars here
fout.write(u"cd\n")
fout.close()
print readThemBack()
assert readThemBack() == [ u'ab\n', u'cd\n' ]
assert os.stat(filename).st_size == 14  # Only one BOM in the file

fout = myOpen()
fout.write(u"ef\n")
fout.write(u"gh\n")
fout.close()
print readThemBack()
#print list( clumsyPatch( readThemBack() ) )  # later you can enable this fix
assert readThemBack() == [ u'ab\n', u'cd\n', u'ef\n', u'gh\n' ] # fails here
assert os.stat(filename).st_size == 26  # would also fail: the extra BOM adds 2 bytes


----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 17:03

Message:
Logged In: YES 
user_id=89016
Originator: NO

> BTW, even the official Python 2.4 documentation, chapter "7.3.2.1 Built-in
> Codecs", mentions that
>   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
> *errors, int *byteorder)
> "switches according to all byte order marks (BOM) it finds in the
> input data. BOMs are not copied into the resulting Unicode string".  I
> don't know whether this is the BOM-less decoder we have talked about for a
> long time.

This seems to be wrong. Looking at the source code
(Objects/unicodeobject.c) reveals that only the first BOM is skipped.
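
A minimal way to observe this decoder behaviour (a sketch, not part of the
original message; it assumes Python 2, as elsewhere in this thread):

# Concatenating two independently encoded chunks puts two BOMs into the
# byte stream, just like two separate append sessions do.
data = u"ab\n".encode("utf-16") + u"cd\n".encode("utf-16")
decoded = data.decode("utf-16")
print repr(decoded)          # u'ab\n\ufeffcd\n' -- the second BOM survives
assert u"\ufeff" in decoded  # only the first BOM was stripped by the decoder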


----------------------------------------------------------------------

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-05-03 16:08

Message:
Logged In: YES 
user_id=1770538
Originator: YES

The long-disputed ZWNBSP usage is deprecated nowadays
(http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests "U+2060 WORD
JOINER" instead of ZWNBSP). However, I can understand that backwards
compatibility is always a valid concern, and that is why StreamReader seems
reluctant to change.

In practice, a ZWNBSP inside a file is rarely intentional (please also refer
to the topic "Q: What should I do with U+FEFF in the middle of a file?" at
the same URL above). IMHO, it is far more likely caused by a multi-append
file operation or the like. At the very least, the asymmetric "what you
write is NOT what you get/read" behaviour between
codecs.open(filename, 'a', 'UTF-16') and codecs.open(filename, 'r', 'UTF-16')
is not elegant.

To address the asymmetry, I finally came up with a wrapper function around
codecs.open() which solves (or, you may say, "bypasses") the problem well
in my case. I'll post the code as an attachment.
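
(The attached _codecs.py itself is not reproduced in this archive message.
The following is only a rough sketch of the kind of wrapper described above,
with hypothetical names, assuming Python 2 as used elsewhere in this thread.)

import codecs

def bom_tolerant_open(filename, mode='r', encoding='UTF-16'):
    # Open via codecs.open(); in read mode, drop a stray BOM at the start of
    # a line, on the assumption that it is a leftover from an earlier append
    # session rather than real content.
    stream = codecs.open(filename, mode, encoding)
    if 'r' not in mode:
        return stream
    def cleaned_lines():
        for line in stream:
            if line[:1] in (u'\ufeff', u'\ufffe'):
                line = line[1:]
            yield line
    return cleaned_lines()

# Usage: list(bom_tolerant_open(filename)) yields the lines with any
# stray per-append BOMs removed.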

BTW, even the official Python 2.4 documentation, chapter "7.3.2.1 Built-in
Codecs", mentions that
   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
*errors, int *byteorder)
"switches according to all byte order marks (BOM) it finds in the
input data. BOMs are not copied into the resulting Unicode string".  I
don't know whether this is the BOM-less decoder we have talked about for a
long time.
//shrug

I hope the information above can serve as a recipe for those who encounter
the same problem.  That's it. Thanks for your patience.

Best regards,
                            Iceberg
File Added: _codecs.py

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-23 12:56

Message:
Logged In: YES 
user_id=89016
Originator: NO

But BOMs *may* appear in normal content: in that case their meaning is that
of ZERO WIDTH NO-BREAK SPACE (see
http://docs.python.org/lib/encodings-overview.html for more info).
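
For example (a sketch, assuming Python 2): an embedded U+FEFF is legitimate
content and survives a round trip, which is why the codec cannot simply
strip every BOM-looking code point it sees:

s = u"ab\ufeffcd"                  # ZWNBSP used as real content
data = s.encode("utf-16")          # one leading BOM, plus the embedded U+FEFF
assert data.decode("utf-16") == s  # round-trips; the embedded character is kept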


----------------------------------------------------------------------

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-04-20 05:39

Message:
Logged In: YES 
user_id=1770538
Originator: YES

If such a bug were to be fixed, either StreamWriter or StreamReader would
have to do something.

I can understand doerwalter's point that it is somewhat awkward for a
StreamWriter to detect whether there is already a BOM at the start of the
file, especially when operating in append mode. But, IMHO, the StreamReader
should be able to detect multiple BOMs during its lifespan and automatically
ignore all but the first one, provided that a BOM is never supposed to occur
in normal content. Not to mention that such a reader has apparently existed
for a while, according to the claim that "(extra unnecessary) BOM marks are
removed from the input stream by the Python UTF-16 codec" in bug #215974
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

Therefore I don't think WontFix is the proper final resolution for this case.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-19 13:30

Message:
Logged In: YES 
user_id=89016
Originator: NO

Closing as "won't fix"

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-04-19 12:35

Message:
Logged In: YES 
user_id=38388
Originator: NO

I suggest you close this as won't fix.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-19 12:30

Message:
Logged In: YES 
user_id=89016
Originator: NO

Append mode is simply not supported for codecs. How would the codec find
out the codec state that was active after the last characters were written
to the file?
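
One possible workaround (a sketch, not something proposed in this tracker
item, and it assumes an explicit little-endian byte order is acceptable):
fix the endianness and manage the BOM by hand, so that appending never
introduces a second one.

import codecs, os

def open_utf16_for_append(filename):
    # 'utf-16-le' never writes a BOM of its own, so we add exactly one
    # ourselves when the file is created (or is still empty).
    is_new = not os.path.exists(filename) or os.path.getsize(filename) == 0
    f = codecs.open(filename, 'a', 'utf-16-le')
    if is_new:
        f.write(u'\ufeff')
    return f

# Files written this way read back cleanly with
# codecs.open(filename, 'r', 'UTF-16').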

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

