[CentralOH] Pipeness _Within_ Python
jep200404 at columbus.rr.com
Sat Aug 18 19:28:46 CEST 2012
On Thu, 16 Aug 2012 22:40:40 -0400, Neil Ludban <nludban at columbus.rr.com> wrote:
> infile = ShaInputFile(sys.stdin)
> xfile = gzip.GzipFile(fileobj=infile)
That chaining (reading xfile reads from infile through
gzip.GzipFile, which in turn reads from sys.stdin through
ShaInputFile) is pipeness. That's the idea I needed. The goodness of
file-like objects, and of Python in general, is shown again.
Thank you very much!
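To make the chaining concrete, here is a minimal sketch in Python 3 (the code in this thread is Python 2). CountingReader is a hypothetical stand-in for ShaInputFile, and an in-memory io.BytesIO stands in for sys.stdin; both are assumptions made for illustration.

```python
import gzip
import io


class CountingReader:
    """Hypothetical pass-through stage: counts the bytes read through it."""

    def __init__(self, fin):
        self._fin = fin
        self.nbytes = 0

    def read(self, size=-1):
        buf = self._fin.read(size)   # pull from the upstream stage
        self.nbytes += len(buf)
        return buf


payload = b"hello\nworld\n"
compressed = gzip.compress(payload)

# Similar in spirit to a shell pipeline: source | count-bytes | gunzip
infile = CountingReader(io.BytesIO(compressed))
xfile = gzip.GzipFile(fileobj=infile)

# A single read at the end of the chain drives every stage upstream.
data = xfile.read()
print(data == payload, infile.nbytes == len(compressed))
```

Nothing is read from the source until the consumer at the end of the chain asks for data, which is what makes it feel like a pipe.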
The original post had mogrify reading the gunzipped output for
unspecified processing, so I rewrote ShaOutputFile (as StatifyFile,
below) to have a read method that in turn reads from an upstream
file-like thing.
#!/usr/bin/env python
import gzip
import hashlib
import sys
class StatifyFile:
    # Read-only file-like wrapper that tracks the byte count, line count
    # and SHA-1 of everything read through it from the upstream object.

    def __init__(self, fin=None):
        self._fin = fin
        self._sha = hashlib.sha1()
        # Pass tell/seek straight through when the upstream has them.
        # Note that read() below relies on self.tell.
        if 'tell' in dir(self._fin):
            self.tell = self._fin.tell
        if 'seek' in dir(self._fin):
            self.seek = self._fin.seek
        self._nbytes = 0
        self._nlines = 0

    def __iter__(self):
        return self

    def next(self):
        try:
            buf = self._fin.next()
        except StopIteration:
            raise StopIteration
        self._sha.update(buf)
        self._nbytes += len(buf)
        self._nlines += buf.count('\n')
        return buf

    def read(self, *nbytes):
        tell_before = self.tell()
        buf = self._fin.read(*nbytes)
        # Only count data that starts where the counting left off, so
        # bytes re-read after a seek back are not counted twice.
        if buf and tell_before == self._nbytes:
            self._sha.update(buf)
            self._nbytes += len(buf)
            self._nlines += buf.count('\n')
        return buf

    def get_stats(self):
        return (self._nbytes, self._nlines, self._sha.hexdigest())

gzipped_file = StatifyFile(sys.stdin)
# gzipped_file = StatifyFile(open('moogrify.py.gz'))
# gunzipped_file = gzip.GzipFile(fileobj=gzipped_file)
# my_stdin = StatifyFile(gunzipped_file)
my_stdin = StatifyFile(gzip.GzipFile(fileobj=gzipped_file))

for i, line in enumerate(my_stdin):
    if i % 10 == 0:
        print i, line,
    pass

print 'compressed', repr(gzipped_file.get_stats())
print 'uncompressed', repr(my_stdin.get_stats())
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If I didn't care about getting info from the filters,
I could have written:
for i, line in enumerate(
        StatifyFile(gzip.GzipFile(fileobj=StatifyFile(sys.stdin)))):
> Note the Input has wrong byte count and sha1sum, probably because the
> gzip library is pre-reading the file header to determine the format.
> Correctly supporting tell() and seek() on class ShaInputFile is left
> as an exercise for the reader...
[jep200404 at test nl]$ ll moog*
-rwxrwxr-x. 1 jep jep 1116 Aug 17 23:29 moogrify.py
-rw-rw-r--. 1 jep jep 435 Aug 17 23:30 moogrify.py.gz
[jep200404 at test nl]$ sha1sum moog*
f8abef74e7d7de1592ea9ba888886e21befa762f moogrify.py
52ac0f5b2fbafceeb0a926470eff9525189797a8 moogrify.py.gz
[jep200404 at test nl]$ ./two.py <moogrify.py.gz
0 #!/usr/local/bin/python2.7
10 self._fin = fin
20
30 self._nlines = 0
40
compressed (435, 2, '52ac0f5b2fbafceeb0a926470eff9525189797a8')
uncompressed (1116, 50, 'f8abef74e7d7de1592ea9ba888886e21befa762f')
[jep200404 at test nl]$
Seems to work now, although my code isn't very clean.
I need to study file-like objects and iterators much more
to see what edge cases I'm missing. I also need to pay
attention to what is treated as binary bytes and what
is treated as text.
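
On the bytes-versus-text point: in Python 3 the distinction is explicit, since hashlib only accepts bytes and newlines have to be counted as b"\n". The following is a rough Python 3 sketch, not a port of the script above. HashingReader is a hypothetical name, and instead of comparing tell() with the running count it keeps a high-water mark so that data re-read after a rewind is hashed only once. It assumes a seekable binary upstream and only supports absolute seeks back into data that has already been hashed.

```python
import hashlib
import io


class HashingReader:
    """Hypothetical binary reader that hashes each upstream byte once."""

    def __init__(self, fin):
        self._fin = fin
        self._sha = hashlib.sha1()
        self._pos = 0        # current read position
        self.nbytes = 0      # high-water mark: bytes hashed so far
        self.nlines = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Assumption: only rewinds into already-hashed data are allowed.
        if whence != io.SEEK_SET or offset > self.nbytes:
            raise io.UnsupportedOperation("only rewinds are supported")
        self._pos = self._fin.seek(offset)
        return self._pos

    def read(self, size=-1):
        buf = self._fin.read(size)
        start = self._pos
        self._pos += len(buf)
        if self._pos > self.nbytes:
            # Skip the part of buf that was already hashed earlier.
            fresh = buf[self.nbytes - start:]
            self._sha.update(fresh)
            self.nlines += fresh.count(b"\n")
            self.nbytes = self._pos
        return buf

    def hexdigest(self):
        return self._sha.hexdigest()


data = b"line one\nline two\nline three\n"
reader = HashingReader(io.BytesIO(data))
reader.read(10)      # a consumer peeks at the first 10 bytes...
reader.seek(0)       # ...rewinds...
reader.read(4)       # ...re-reads a little (nothing new to hash)...
reader.read()        # ...then reads the rest, overlapping old data
print(reader.nbytes, reader.nlines, reader.hexdigest())
```

The digest should match hashlib.sha1(data).hexdigest() even though some bytes passed through read() twice.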
- - - - -
One could write the program to be mostly nested file-like
objects that write. One could also generalize what you did.
I.e., one could have a nested chain of objects that read,
a nested chain of objects that write,
with shutil.copyfileobj() in between.
But multiple shutil.copyfileobj() calls would likely be difficult.
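
A minimal Python 3 sketch of that layout, with one chain of readers, one chain of writers, and a single shutil.copyfileobj() call joining them. ShaReader and ShaWriter are hypothetical names, and io.BytesIO objects stand in for the real input and output files.

```python
import gzip
import hashlib
import io
import shutil


class ShaReader:
    """Hypothetical read stage: hashes whatever is read through it."""

    def __init__(self, fin):
        self._fin = fin
        self.sha = hashlib.sha1()

    def read(self, size=-1):
        buf = self._fin.read(size)
        self.sha.update(buf)
        return buf


class ShaWriter:
    """Hypothetical write stage: hashes whatever is written through it."""

    def __init__(self, fout):
        self._fout = fout
        self.sha = hashlib.sha1()

    def write(self, buf):
        self.sha.update(buf)
        return self._fout.write(buf)


payload = b"alpha\nbeta\ngamma\n"
source = io.BytesIO(gzip.compress(payload))   # stands in for the input file
sink = io.BytesIO()                           # stands in for the output file

# Read chain: source -> ShaReader -> gunzip -> ShaReader
compressed_in = ShaReader(source)
plain_in = ShaReader(gzip.GzipFile(fileobj=compressed_in))

# Write chain: ShaWriter -> gzip -> sink
gz_out = gzip.GzipFile(fileobj=sink, mode="wb")
plain_out = ShaWriter(gz_out)

# The single pump that moves data from the read chain to the write chain.
shutil.copyfileobj(plain_in, plain_out)
gz_out.close()   # writes the gzip trailer; sink itself stays open

print(plain_in.sha.hexdigest() == plain_out.sha.hexdigest())
```

With only one copyfileobj() call there is a single loop driving both chains. A second pump would need its own loop, or a thread, which is probably where the difficulty comes in.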
> $ sha1 moogrify.py moogrify.py.gz
> SHA1 (moogrify.py) = f8abef74e7d7de1592ea9ba888886e21befa762f
> SHA1 (moogrify.py.gz) = b6e747168a6725c55b9e5abbe58d5fcb845df9e6
By the way, which OS were you using?