How to read gzipped utf8 file in Python?

"Martin v. Löwis" martin at v.loewis.de
Thu Nov 22 15:25:20 EST 2007


>   I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip.  I'd like to read it with the "gzip" module
> and "utf8" decoding.

You didn't specify the processing you want to perform. For example,
this should work just fine:

import gzip

fd = gzip.open(fname, 'rb')
for line in fd:  # iterating the file object yields one line (as bytes) at a time
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
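To illustrate why the encoding does not matter for line-oriented processing: in UTF-8 the newline byte 0x0A never occurs inside a multi-byte sequence, so splitting the raw bytes on it is safe. Below is a minimal sketch, assuming a small throwaway sample file; the helper name count_lines and the sample text are made up for illustration.

```python
import gzip
import os
import tempfile


def count_lines(fname):
    """Count the lines of a gzipped file without decoding its contents."""
    count = 0
    fd = gzip.open(fname, 'rb')
    try:
        for line in fd:  # each line is a byte string ending in b'\n'
            count += 1
    finally:
        fd.close()
    return count


# Build a tiny gzipped UTF-8 sample (hypothetical content, not real data).
text = u"L\u00f6wis\ncaf\u00e9\n"
sample = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
out = gzip.open(sample, 'wb')
out.write(text.encode("utf-8"))
out.close()

n = count_lines(sample)
```

Because the file is read one line at a time, memory use stays small even for a file of several gigabytes.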

> The obvious approach is
> 
>     fd = gzip.open(fname, 'rb',encoding='utf8')
> 
> But "gzip.open" doesn't support an "encoding" parameter.  (It
> probably should, for consistency.)

I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.

> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

If that's the processing you want to do, sure:

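import codecs
import gzip

fd0 = gzip.open(fname, 'rb')         # decompressed byte stream
fd = codecs.getreader("utf-8")(fd0)  # wrap it in a UTF-8 decoding reader
data = fd.readline()                 # one decoded (unicode) line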

You can combine that into:

fd = codecs.getreader("utf-8")(gzip.open(fname))
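For a large file you would probably iterate over the reader rather than call readline() once. Here is a rough sketch of that, wrapped in a generator so the underlying gzip file gets closed; the name iter_decoded_lines and the sample file are assumptions for illustration only.

```python
import codecs
import gzip
import os
import tempfile


def iter_decoded_lines(fname, encoding="utf-8"):
    """Yield decoded (unicode) lines from a gzipped file, one at a time."""
    fd0 = gzip.open(fname, 'rb')
    try:
        # The stream reader decodes incrementally as it pulls bytes from fd0.
        fd = codecs.getreader(encoding)(fd0)
        for line in fd:
            yield line
    finally:
        fd0.close()


# Build a tiny gzipped UTF-8 sample (hypothetical content, not real data).
text = u"L\u00f6wis\ncaf\u00e9\n"
sample = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
out = gzip.open(sample, 'wb')
out.write(text.encode("utf-8"))
out.close()

lines = list(iter_decoded_lines(sample))
```

One thing to keep in mind: the codecs reader splits on every line boundary that unicode splitlines() recognizes, not only on "\n", so its idea of a line can differ from the byte-level loop if the data contains such characters.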

HTH,
Martin


