Python3.1: gzip encoding with UTF-8 fails
Diez B. Roggisch
deets at nospam.web.de
Sun Dec 20 11:52:24 EST 2009
Johannes Bauer wrote:
> Hello group,
>
> with this following program:
>
> #!/usr/bin/python3
> import gzip
> x = gzip.open("testdatei", "wb")
> x.write("ä")
> x.close()
>
> I get a broken .gzip file when decompressing:
>
> $ cat testdatei |gunzip
> ä
> gzip: stdin: invalid compressed data--length error
>
> As it only happens with UTF-8 characters, I suppose the gzip module
> writes a length of 1 in the gzip file header (one character "ä"), but
> then actually writes 2 characters (0xc3 0xa4).

UTF-8 is not unicode. Even if the source-encoding above is UTF-8, I'm
not sure what is used to encode the unicode-string when it's written.
>
> Is there a solution?
What about writing a bytestring by explicitly encoding the string to
UTF-8 first?
x.write("ä".encode("utf-8"))
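A minimal round-trip sketch of that suggestion, assuming Python 3, where
str is Unicode text and a gzip file opened in binary mode expects bytes.
The helper names write_utf8_gzip and read_utf8_gzip and the temporary
path are illustrative only, not part of the original program.

```python
import gzip
import os
import tempfile


def write_utf8_gzip(path, text):
    """Encode text to UTF-8 bytes explicitly, then write them compressed."""
    with gzip.open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_utf8_gzip(path):
    """Read the decompressed bytes back and decode them as UTF-8."""
    with gzip.open(path, "rb") as f:
        return f.read().decode("utf-8")


# Hypothetical location: a fresh temporary directory, file name from the post.
path = os.path.join(tempfile.mkdtemp(), "testdatei")
write_utf8_gzip(path, "ä")
roundtrip = read_utf8_gzip(path)
print(repr(roundtrip))
```

Because the module is handed bytes, the length it records is the byte
count (2 for "ä"), so it matches what is actually written. On Python 3.3
and later, gzip.open(path, "wt", encoding="utf-8") should do the
encoding step for you.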
Diez