[CentralOH] Pipeness _Within_ Python

Sat Aug 18 19:28:46 CEST 2012

On Thu, 16 Aug 2012 22:40:40 -0400, Neil Ludban <nludban at columbus.rr.com> wrote:

> infile = ShaInputFile(sys.stdin)
> xfile = gzip.GzipFile(fileobj=infile)

That chaining (reading xfile reads from infile through 
gzip.GzipFile, which in turn reads from sys.stdin through 
ShaInputFile) is pipeness. That's the idea I needed. It shows 
again the goodness of file-like objects, and of Python in general. 
Thank you very much! 

The original post had mogrify reading the gunzipped output for 
unspecified processing, so I rewrote ShaOutputFile (as StatifyFile, 
below) to have a read method that in turn reads from an upstream 
file-like thing. 

#!/usr/bin/env python

import gzip
import hashlib
import sys

class StatifyFile:
    def __init__(self, fin=None):
        self._fin = fin
        self._sha = hashlib.sha1()
        # Expose tell/seek only when the upstream object has them.
        if hasattr(self._fin, 'tell'):
            self.tell = self._fin.tell
        if hasattr(self._fin, 'seek'):
            self.seek = self._fin.seek
        self._nbytes = 0
        self._nlines = 0

    def __iter__(self):
        return self

    def next(self):
        # StopIteration from upstream propagates to the caller as is.
        buf = self._fin.next()
        self._sha.update(buf)
        self._nbytes += len(buf)
        self._nlines += buf.count('\n')
        return buf

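    def read(self, *nbytes):
        # Assumes the upstream object has tell(); see __init__.
        tell_before = self.tell()
        buf = self._fin.read(*nbytes)
        # Count a read only if it starts where the counted data ends,
        # so bytes re-read after a seek back are not hashed twice.
        if buf and tell_before == self._nbytes:
            self._sha.update(buf)
            self._nbytes += len(buf)
            self._nlines += buf.count('\n')
        return buf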

    def get_stats(self):
        return (self._nbytes, self._nlines, self._sha.hexdigest())

gzipped_file = StatifyFile(sys.stdin)
# gzipped_file = StatifyFile(open('moogrify.py.gz'))
# gunzipped_file = gzip.GzipFile(fileobj=gzipped_file)
# my_stdin = StatifyFile(gunzipped_file)
my_stdin = StatifyFile(gzip.GzipFile(fileobj=gzipped_file))

for i, line in enumerate(my_stdin):
    if i % 10 == 0:  # show every tenth line
        print i, line,

print 'compressed', repr(gzipped_file.get_stats())
print 'uncompressed', repr(my_stdin.get_stats())

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

If I didn't care about getting info from the filters, 
I could have written: 

for i, line in enumerate(
        StatifyFile(gzip.GzipFile(fileobj=StatifyFile(sys.stdin)))):

> Note the Input has wrong byte count and sha1sum, probably because the
> gzip library is pre-reading the file header to determine the format.
> Correctly supporting tell() and seek() on class ShaInputFile is left
> as an exercise for the reader...

[jep200404@test nl]$ ll moog*
-rwxrwxr-x. 1 jep jep 1116 Aug 17 23:29 moogrify.py
-rw-rw-r--. 1 jep jep  435 Aug 17 23:30 moogrify.py.gz
[jep200404@test nl]$ sha1sum moog*
f8abef74e7d7de1592ea9ba888886e21befa762f  moogrify.py
52ac0f5b2fbafceeb0a926470eff9525189797a8  moogrify.py.gz
[jep200404@test nl]$ ./two.py <moogrify.py.gz
0 #!/usr/local/bin/python2.7
10         self._fin = fin
20 
30         self._nlines = 0
40 
compressed (435, 2, '52ac0f5b2fbafceeb0a926470eff9525189797a8')
uncompressed (1116, 50, 'f8abef74e7d7de1592ea9ba888886e21befa762f')
[jep200404@test nl]$ 

Seems to work now, although my code isn't very clean. 
I need to study file-like objects and iterators much more 
to see what edge cases I'm missing. I also need to pay 
attention to what is treated as binary bytes and what 
is treated as text. 
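
One such edge case, as a rough sketch rather than a fix to the 
script above: a read that starts before the already-counted data 
but runs past its end. StatifyFile.read() skips such a read 
entirely, so the new bytes are never hashed. The class below is 
hypothetical (HighWaterHasher and its _high attribute are names 
I made up). It assumes a seekable binary upstream and hashes only 
the part of each read that lies beyond a high-water mark. It 
sticks to bytes (b'...') throughout and should run on Python 2.7 
or 3. 

```python
import hashlib
import io


class HighWaterHasher(object):
    """Hypothetical wrapper: hash each byte of a seekable file once."""

    def __init__(self, fin):
        self._fin = fin
        self._sha = hashlib.sha1()
        self._high = 0  # offset just past the last byte already hashed

    def read(self, *nbytes):
        start = self._fin.tell()
        buf = self._fin.read(*nbytes)
        end = start + len(buf)
        # Hash only the tail of buf that lies past the high-water mark.
        # Bytes skipped by seeking forward past the mark are never
        # hashed; this sketch does not handle that case.
        if start <= self._high < end:
            self._sha.update(buf[self._high - start:])
            self._high = end
        return buf

    def seek(self, *args):
        return self._fin.seek(*args)

    def tell(self):
        return self._fin.tell()

    def hexdigest(self):
        return self._sha.hexdigest()


data = b'line one\nline two\nline three\n'
f = HighWaterHasher(io.BytesIO(data))
f.read(10)   # hashes bytes 0..9
f.seek(4)    # back up, as a consumer peeking at a header might
f.read()     # overlaps bytes 4..9; only bytes 10 onward are hashed
print(f.hexdigest() == hashlib.sha1(data).hexdigest())
```

The same idea would apply to the byte and line counts. Whether 
gzip actually produces overlapping reads like this depends on 
the Python version, which I have not checked. 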

- - - - - 

One could write the program to be mostly nested file-like 
objects that write. One could also generalize what you did. 
I.e., one could have a nested chain of objects that read, 
a nested chain of objects that write, 
with shutil.copyfileobj() in between. 
But multiple shutil.copyfileobj() calls would likely be difficult. 
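
A minimal sketch of that shape, with one read chain, one write 
chain, and a single shutil.copyfileobj() between them. 
CountingReader and HashingWriter are hypothetical names, not 
anything from the earlier posts, and the input is gzipped in 
memory instead of coming from sys.stdin. It should run on 
Python 2.7 or 3. 

```python
import gzip
import hashlib
import io
import shutil


class CountingReader(object):
    """Hypothetical read-side filter: counts bytes pulled through it."""

    def __init__(self, fin):
        self._fin = fin
        self.nbytes = 0

    def read(self, *nbytes):
        buf = self._fin.read(*nbytes)
        self.nbytes += len(buf)
        return buf


class HashingWriter(object):
    """Hypothetical write-side filter: hashes bytes pushed through it."""

    def __init__(self, fout):
        self._fout = fout
        self.sha = hashlib.sha1()

    def write(self, buf):
        self.sha.update(buf)
        return self._fout.write(buf)


text = b'hello\n' * 1000

# Build some gzipped input in memory.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode='wb') as gz:
    gz.write(text)
raw.seek(0)

# Read chain: BytesIO -> GzipFile -> CountingReader
src = CountingReader(gzip.GzipFile(fileobj=raw, mode='rb'))
# Write chain: HashingWriter -> BytesIO
sink = io.BytesIO()
dst = HashingWriter(sink)

# copyfileobj() only needs src.read() and dst.write().
shutil.copyfileobj(src, dst)

print(src.nbytes == len(text))
print(dst.sha.hexdigest() == hashlib.sha1(text).hexdigest())
```

The copy loop pulls from one end and pushes to the other, so each 
filter only needs read() or write(), not both. 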

> $ sha1 moogrify.py moogrify.py.gz 
> SHA1 (moogrify.py) = f8abef74e7d7de1592ea9ba888886e21befa762f
> SHA1 (moogrify.py.gz) = b6e747168a6725c55b9e5abbe58d5fcb845df9e6

By the way, which OS were you using?