Book Draft Available: Text Processing in Python

Jim Dennis jimd at
Tue Mar 19 14:46:13 CET 2002

In article <mailman.1015993462.32701.python-list at>, 
 David Mertz, Ph.D. wrote:

> Pythonistas:

> As some folks know, I am working on a book called _Text Processing in
> Python_.  With the gracious permission of my publisher, Addison Wesley,
> I am hereby making in-progress drafts of the book available to the
> Python community.  It's about half done right now, but that will
> increase over time.

> Take a look at the book URL:

> I welcome any comments or feedback the Python community has about my
> book.

> Yours, David...

 I was glancing through it and stopped when I read your word
 counter (with no support for the command line options).  I just
 had to do one to emulate the GNU wc utility as closely as I could
 in one quick session.

 Below is a somewhat more faithful rendering of the GNU wc command.
 Although it's about 120 lines long, almost forty of those are 
 blank lines, docstrings, or comments.  In most cases it gives output
 that is identical to GNU wc (including the character spacing).  
 The only discrepancies I've seen are in the -L (--max-line-length)
 calculation (particularly on binary files).

 Its pedagogical value is more in the use of the getopt module
 and possibly in file iteration (for line in file: ...).  The text 
 processing being done here is trivial.  There's also a little bit 
 of exception handling, and a minimal amount of error avoidance
 (since Python will allow me to open a directory but will complain
 if I try to read lines therefrom).
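
 For readers on a current interpreter, here is a minimal sketch of the
 getopt handling described above, written for Python 3 rather than the
 Python 2.2 the script targets.  The helper name parse_flags and the
 flag dictionary are illustrative assumptions, not part of the script.

```python
import getopt

def parse_flags(argv):
    """Return (flags, filenames) for a wc-like command line.

    Hypothetical helper: it mirrors the option names used by the script,
    but the dictionary layout is an assumption made for this sketch.
    """
    short = "clLw"
    long_names = ["bytes", "chars", "lines", "max-line-length", "words"]
    # getopt raises getopt.GetoptError on an unknown option.
    opts, filenames = getopt.getopt(argv, short, long_names)
    flags = {"lines": False, "words": False, "chars": False, "maxlen": False}
    for opt, _value in opts:
        if opt in ("-l", "--lines"):
            flags["lines"] = True
        elif opt in ("-w", "--words"):
            flags["words"] = True
        elif opt in ("-c", "--bytes", "--chars"):
            flags["chars"] = True
        elif opt in ("-L", "--max-line-length"):
            flags["maxlen"] = True
    if not any(flags.values()):
        # Nothing selected: default to lines, words, and chars.
        flags["lines"] = flags["words"] = flags["chars"] = True
    return flags, filenames

flags, names = parse_flags(["-l", "--words", "notes.txt"])
print(flags, names)
```

 Returning the flags separately from the filenames keeps the counting
 code free of any knowledge about the command line.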

 It is mildly interesting that this Python implementation of wc
 is only about a third the length of the GNU version from the 
 textutils package (wc.c is 371 lines).  Measured in words and
 characters rather than lines, the Python version is only about half
 the length.
 (Glancing at the sources I see that I missed support for the 
 POSIXLY_CORRECT environment variable -- which modifies, or uglifies
 if you prefer, the output format; I could add that in a few lines).
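
 A rough Python 3 sketch of what that POSIXLY_CORRECT support might
 look like follows.  It ASSUMES that setting the variable simply drops
 the column padding and separates the numbers with single spaces; that
 behaviour should be checked against wc.c before relying on it.  The
 helper name format_counts is hypothetical.

```python
import os

def format_counts(values, environ):
    """Join counts the way a wc-like tool might print them.

    Hypothetical helper.  ASSUMPTION: when POSIXLY_CORRECT is set, the
    numbers are printed unpadded; otherwise each is right-aligned in a
    7-character column.  This has not been verified against GNU wc.
    """
    if "POSIXLY_CORRECT" in environ:
        return " ".join(str(v) for v in values)
    return " ".join("%7d" % v for v in values)

# Pass the environment explicitly so the function is easy to test.
print(format_counts((12, 34, 567), os.environ))
```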

 David, you're welcome to use this script as an example.  Perhaps
 you could list this as an example of how 14 lines of simple, focused
 code grow to about 120 lines by the time we add option handling, help
 and error messages, exception handling and error avoidance, and all
 that other stuff.  (If you really want to scare people you could 
 include the wc.c from the GNU textutils package by way of comparison).

#!/usr/bin/env python2.2
""" wc: Emulate GNU wc (word count) """
import sys, os

help = '''Usage: wc [OPTION]... [FILE]...
Print line, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  With no FILE, or when FILE is -,
read standard input.
  -c, --bytes, --chars   print the byte counts
  -l, --lines            print the newline counts
  -L, --max-line-length  print the length of the longest line
  -w, --words            print the word counts
      --help             display this help and exit
      --version          output version information and exit
Report bugs to <jimd+python at>'''

version = """Python word count: wc(1) emulation by James T. Dennis 
 version 0.1"""

def options():
	"""Process command line options"""
	import getopt
	short = "clLw"
	long  = ('bytes', 'chars', 'lines', 'max-line-length', 
			 'words', 'help', 'version')          
	try:
		opts, args = getopt.getopt(sys.argv[1:], short, long)
	except getopt.GetoptError,err:
		msg = "wc: invalid option \nTry `wc --help' for more information." 
		print >> sys.stderr, sys.argv[0], err
		print >> sys.stderr, msg
		sys.exit(1)		# opts and args are undefined here, so bail out
	return opts, args

def count(f=None):
	"""Count and return words, chars, lines, and maxlength"""
	# We count them all, since that's much easier than a lot of 
	# conditional logic to decide what to count.  
	# We return it all, and main() can decide what to return.
	lines = words = chars = maxline = 0
	if f == None: file = sys.stdin
	else:
		if os.path.isdir(f):
			print >> sys.stderr, "wc: %s: Is a directory" % f
			return lines, words, chars, maxline 
		try:
			file = open(f,'r')
		except IOError:
			print >> sys.stderr, "Error opening:", f
			return lines, words, chars, maxline 
		# If we get this far, we can count stuff
	for line in file:
		length = len(line)
		lines += 1
		chars += length
		words += len(line.split())
		if length - 1 > maxline: maxline = length - 1 
		# GNU wc doesn't count line terminator in maxlength?
		# +++ binary files have much different line length semantics!
	return lines, words, chars, maxline 

def printcount(flags, totals, filename=None):
	"""Print counts for each file and for the grand totals
	   takes two 4-tuples, the flags for which items to print, and 
	   the total lines, words, characters, and max-line-length 
	   and an optional filename"""
	if filename == None: filename = ""
	dolines, dochars, dowords, domaxln = flags
	l, w, c, m = totals
	print "",		# GNU wc prints one leading space?
	if dolines: 	print "%6d" % l,
	if dowords: 	print "%7d" % w,
	if dochars:		print "%7d" % c,
	if domaxln:		print "%7d" % m,
	print filename

if __name__ == "__main__":
	opts, args = options()
	dolines = dochars = dowords = domaxln = 0
	for opt,arg in opts:
		if opt == '--help':
			print help
			sys.exit(0)		# the help text promises "display this help and exit"
		elif opt == '--version':
			print version
			sys.exit(0)
		elif opt in ('-l', '--lines'): 				dolines = 1
		elif opt in ('-c', '--chars', '--bytes'): 	dochars = 1
		elif opt in ('-w', '--words'): 				dowords = 1
		elif opt in ('-L', '--max-line-length'): 	domaxln = 1
	if dolines + dochars + dowords + domaxln == 0:
		# None specified so default is to do lines, chars, and words
		dolines = dochars = dowords = 1
		# Else we do only the ones that are specified
		# GNU wc always prints the stats in the same order, regardless
		# of the order of the options/switches.
	printflags = (dolines, dochars, dowords, domaxln)

	if not args: 	
		# No files named: so just do stdin 
		# No grand totals, and no filename
		l, w, c, m = count()
		printcount (printflags, (l,w,c,m))
	else: 			# Else we do each file and keep track of grand totals
		all_lines = all_words = all_chars = longest_line = 0 
		files_processed = 0	
		for i in args: 
			if i == '-': 	l, w, c, m = count()
			else: 			l, w, c, m = count(i)
			all_lines += l
			all_words += w
			all_chars += c
			if m > longest_line: longest_line = m
			printcount (printflags, (l,w,c,m), i)
			files_processed += 1
		if files_processed > 1: 			# Print totals
			totals = (all_lines, all_words, all_chars, longest_line)
			printcount (printflags, totals, "total")
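
The script above targets Python 2.2 (print statements, "except X, err" syntax).  For anyone trying the idea on a current interpreter, the counting loop might be sketched in Python 3 roughly as follows.  The function name count_stream is an assumption for illustration, and unlike the script it strips the trailing newline explicitly, so a final line without a terminator is not measured one character short.

```python
import io

def count_stream(stream):
    """Return (lines, words, chars, maxline) for an iterable of text lines.

    Hypothetical Python 3 rendering of the script's count() loop.  In text
    mode, chars counts decoded characters rather than bytes.
    """
    lines = words = chars = maxline = 0
    for line in stream:
        lines += 1
        chars += len(line)
        words += len(line.split())
        # Measure the line without its terminator, if it has one.
        maxline = max(maxline, len(line.rstrip("\n")))
    return lines, words, chars, maxline

print(count_stream(io.StringIO("one two\nthree\n")))  # (2, 3, 14, 7)
```

Like the script, this counts iterations rather than newline characters, so a file whose last line lacks a newline reports one more line than GNU wc would.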
