Python3.1: gzip encoding with UTF-8 fails
Mark Tolonen
metolone+gmane at gmail.com
Sun Dec 20 13:01:49 EST 2009
> "Diez B. Roggisch" <deets at nospam.web.de> wrote in message
> news:7p7328F3r1r25U1 at mid.uni-berlin.de...
> Johannes Bauer schrieb:
> > Hello group,
> >
> > with this following program:
> >
> > #!/usr/bin/python3
> > import gzip
> > x = gzip.open("testdatei", "wb")
> > x.write("ä")
> > x.close()
> >
> > I get a broken .gzip file when decompressing:
> >
> > $ cat testdatei |gunzip
> > ä
> > gzip: stdin: invalid compressed data--length error
> >
> > As it only happens with UTF-8 characters, I suppose the gzip module
>
> UTF-8 is not unicode. Even if the source-encoding above is UTF-8, I'm not
> sure what is used to encode the unicode-string when it's written.
>
> > writes a length of 1 in the gzip file header (one character "ä"), but
> > then actually writes 2 characters (0xc3 0xa4).
> >
> > Is there a solution?
>
> What about writing a bytestring by explicitly encoding the string to utf-8
> first?
>
> x.write("ä".encode("utf-8"))
While that works, it still seems like a bug in gzip. If gzip.open is
replaced with a simple open:
# coding: utf-8
import gzip
x = open("testdatei", "wb")
x.write("ä")
x.close()
The result is:
Traceback (most recent call last):
  File "C:\dev\python3\Lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec(codeObj, __main__.__dict__)
  File "<auto import>", line 1, in <module>
  File "y.py", line 4, in <module>
    x.write("ä")
TypeError: must be bytes or buffer, not str
Opening a file in binary mode should require a bytes or buffer object, and
gzip.open in "wb" mode should reject a str the same way.
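In the meantime, here is a minimal sketch of the explicit-encoding workaround as a round trip. The helper names write_gzip_text and read_gzip_text and the temporary path are made up for illustration; only gzip.open, str.encode and bytes.decode come from the standard library.

```python
import gzip
import os
import tempfile


def write_gzip_text(path, text, encoding="utf-8"):
    """Hypothetical helper: encode text to bytes, then write it gzip-compressed."""
    data = text.encode(encoding)  # str -> bytes, done explicitly by the caller
    f = gzip.open(path, "wb")
    try:
        f.write(data)
    finally:
        f.close()
    return len(data)  # number of bytes handed to gzip, before compression


def read_gzip_text(path, encoding="utf-8"):
    """Hypothetical helper: decompress the file and decode the bytes back to str."""
    f = gzip.open(path, "rb")
    try:
        return f.read().decode(encoding)
    finally:
        f.close()


# Write into a fresh temporary directory so nothing existing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "testdatei")
written = write_gzip_text(path, "ä")
result = read_gzip_text(path)
```

Here written is the byte count after encoding (two bytes, 0xc3 0xa4, for "ä" in UTF-8), so the data passed to gzip and the length it records refer to the same bytes.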
-Mark