How to read gzipped utf8 file in Python?
"Martin v. Löwis"
martin at v.loewis.de
Thu Nov 22 15:25:20 EST 2007
> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.
You didn't specify the processing you want to perform. For example,
this should work just fine:
import gzip
fd = gzip.open(fname, 'rb')
for line in fd:  # iterate over the file object itself, one line (as bytes) at a time
    pass
For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
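To illustrate why the encoding does not matter for line splitting: in UTF-8 the newline byte never occurs inside a multi-byte sequence, so splitting the raw bytes at newlines always yields complete characters. The following is a minimal sketch in Python 3 syntax; the sample text and the in-memory buffer are assumptions made for illustration and stand in for the real file.

```python
import gzip
import io

# Hypothetical sample text containing non-ASCII characters.
text = "Löwis\nnaïve café\n日本語\n"

# Compress the UTF-8 bytes into an in-memory buffer instead of a real file.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as out:
    out.write(text.encode("utf-8"))
buf.seek(0)

# Iterate over the raw byte lines; no decoding is needed to find line breaks.
with gzip.GzipFile(fileobj=buf, mode="rb") as fd:
    byte_lines = list(fd)

# Each byte line is valid UTF-8 on its own and can be decoded separately.
decoded_lines = [line.decode("utf-8") for line in byte_lines]
```

Decoding line by line like this is only needed if the processing actually looks at the characters; counting or filtering lines by ASCII content can stay at the byte level.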
> The obvious approach is
>
> fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)
I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?
If that's the processing you want to do, sure:
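import codecs, gzip
fd0 = gzip.open(fname, 'rb')         # byte stream of decompressed data
fd = codecs.getreader("utf-8")(fd0)  # wrap it in a UTF-8 decoding reader
data = fd.readline()                 # one decoded line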
You can combine that into
fd = codecs.getreader("utf-8")(gzip.open(fname))
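A self-contained round trip of the same idea might look like the sketch below. It uses Python 3 syntax; the sample text, the temporary directory and the file name sample.txt.gz are assumptions made for illustration only.

```python
import codecs
import gzip
import os
import tempfile

# Hypothetical sample text with non-ASCII characters.
sample = "Martin v. Löwis\nzweite Zeile: äöü\n"

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name inside a throwaway directory.
    fname = os.path.join(tmpdir, "sample.txt.gz")

    # Write the text as gzip-compressed UTF-8 bytes.
    with gzip.open(fname, "wb") as out:
        out.write(sample.encode("utf-8"))

    # "Unzip, then decode": gzip yields bytes, the codecs reader decodes them.
    fd = codecs.getreader("utf-8")(gzip.open(fname, "rb"))
    try:
        lines = [line for line in fd]  # decoded str lines, read incrementally
    finally:
        fd.close()  # closes the underlying gzip file as well
```

Because the reader pulls data from the gzip stream as it goes, the file is never decompressed into memory as a whole, which matters for a file of several gigabytes. In Python 3.3 and later, gzip.open also accepts a text mode with an encoding argument, e.g. gzip.open(fname, 'rt', encoding='utf-8').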
HTH,
Martin