crimes in Python

Wed Mar 8 08:42:22 EST 2000

Thanks very much for your response!

In article <slrn8ccfat.29m.scarblac-spamtrap at flits104-37.flits.rug.nl>,
Remco Gerlich <scarblac-rt at pino.selwerd.nl> wrote:
>Kragen Sitaker wrote in comp.lang.python:
>> The Perl version is 79 lines; the Python one is 121 lines.
>> 
>> This is probably not a fair way to evaluate Python, given that this is
>> Perl's natural habitat.
>
>Even if it weren't, you're only checking how useful it is to write Perl in
>Python. You would get bad results as well if you wrote it in Python first
>and then translated to Perl, I suppose.

That's almost certainly true.

Still, I'm impressed with Python's facility at supporting Perl-like
programming.  I'm sure I'll be even happier when I learn to think in
Python.  :)

>> - What's the equivalent of the Perl idiom while (<FH>) { }?  I tried 
>>   while line = sys.stdin.readline():, but Python complained that the
>>   syntax was invalid.
>
>If it's ok to read the whole file into memory at once:
>for line in sys.stdin.readlines():
>
>Otherwise, I suppose:
>while 1:
>   line = sys.stdin.readline()
>   if not line:
>      break

I thought about that.  I decided it was just as icky, especially given
that I couldn't outdent the "if not line" test the way I would in other
languages.  (So my loop control would be hard to find.)
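
For what it's worth, here is a sketch of the same loop for later Python
versions, where a file object can be iterated over directly.  That is an
assumption; the interpreter discussed above may not support it.
io.StringIO stands in for sys.stdin, and count_nonblank is a made-up name.

```python
import io

def count_nonblank(stream):
    """Count lines that contain something other than whitespace."""
    count = 0
    # Iterating over a file object yields one line at a time, so the
    # whole file is never held in memory and no explicit break is needed.
    for line in stream:
        if line.strip():
            count += 1
    return count

fake_stdin = io.StringIO("CRIME header\r\n\r\n1,robbery\r\n")
print(count_nonblank(fake_stdin))
```

The loop control sits on one line at the top, which was the point of the
Perl idiom in the first place.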

>> - what kind of an idiot am I to list all the attributes of a victim
>>   line in an __init__ argument list, in an __init__ body, in the place
>>   that calls __init__ (implicitly), and in the output statement?  It was
>>   dumb both times I did it, but it was more obvious it was dumb in Python.
>
>Dicts, you mean? :)

I used a dict in the Perl version; I still ended up with the list in
two places (which admittedly is better by far than four, and perhaps
excusable, because in one case it was describing the input format,
while in the other, it was describing the output format).  I thought
using a class instance would simplify things.  I was wrong.
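
One way to get the list down to a single place, sketched under
assumptions: the field names are taken from the header in my output code,
and I'm pretending the input columns arrive in the same order, which the
real file may not do.  FIELDS, parse_victim and format_victim are names
made up for the example.

```python
# Single source of truth for the column names (borrowed from the
# output header; the input order is an assumption).
FIELDS = ['crimeno', 'crime', 'type', 'age', 'sex', 'race']

def parse_victim(values):
    """Pair each column name with the corresponding input value."""
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

def format_victim(victim):
    """Emit the values tab-separated, in FIELDS order."""
    return '\t'.join([victim[name] for name in FIELDS])

victim = parse_victim(['17', 'robbery', 'armed', '34', 'F', 'U'])
print(format_victim(victim))
```

Adding or renaming a column then means touching FIELDS and nothing else.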

>> infile = None
>> while len(sys.argv) > 1:
>> 	if not infile:
>> 		infile = sys.argv.pop(1)
>> 	else:
>> 		sys.exit("Usage: " + sys.argv[0] + " infile")
>> 
>> if infile:
>> 	sys.stdin = open (infile, 'r')
>
>Using a while here is misleading, imo. You only ever use one argument.

The theory was that I was going to add command-line switches at some
point, which admittedly is a lousy reason to code such an obscure
structure.

>if len(sys.argv) > 2:
>   sys.exit("Usage: %s infile" % sys.argv[0])
>elif len(sys.argv) == 2:
>   infile = sys.argv[1]
>   sys.stdin = open(infile, 'r')
>else:
>   infile = None
>   
>And you need some error checking on the open.

I thought "IOError: [Errno 2] No such file or directory: 'foo'" was a
sufficiently explanatory error message, given the context of the
program's use.  You'll note the Perl version does error-check the open
and prints essentially the same error message if it fails.
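
For comparison, a sketch of the same argument handling with the open
wrapped in a try, so a failure becomes a one-line message instead of a
traceback.  open_input is a made-up name, and returning sys.stdin when no
file is named is my assumption about the intended behavior.

```python
import sys

def open_input(argv):
    """Return the input stream named on the command line, or stdin."""
    if len(argv) > 2:
        sys.exit("Usage: %s infile" % argv[0])
    if len(argv) < 2:
        return sys.stdin
    try:
        return open(argv[1], 'r')
    except IOError as err:
        # err already carries the errno text and the file name
        sys.exit("%s: %s" % (argv[0], err))
```

A caller would write something like infile = open_input(sys.argv) and read
from infile, rather than reassigning sys.stdin.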

The string % operator seems to be a source of great sweetness and
light.  I should use it more.

>> '''
>> sub splitcsv {
>> 	my ($line) = @_;
>> 	return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
>> }
>> '''
>
>I don't know what this means. The =~ splits the line according to that
>regexp, and returns a list or so?

The =~ tells the regexp what string it should work on.  The regexp,
being a /g regexp in a list context, applies itself to the string
repeatedly and returns a list of the parenthesized matches, a lot like
re.findall.
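
A small illustration of the re.findall side of that comparison, using the
pattern from my script written as a single-quoted raw string.  The sample
line is made up.

```python
import re

csv_re = re.compile(r'((?:[^,"]|"[^"]*")*),?')

def splitcsv(line):
    # With one group in the pattern, findall returns a list of the
    # text captured by that group for each successive match.
    return csv_re.findall(line)

print(splitcsv('1,"Smith, J.",robbery'))
```

The result ends with an extra empty string: the pattern is able to match
nothing at all, so findall reports one more (empty) match at the end of
the line.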

>> # Yow, how do I Python that?
>> # The 're' module doesn't have \G.
>> csv_re = re.compile("((?:[^,\"]|\"[^\"]*\")*),?")
>
>Use single quotes so you don't need to escape the double quotes.

Good point.

>> def splitcsv(line):
>> 	return csv_re.findall(line)
>
>My intuition is that there should be a clean way without regexpen. But I
>can't think of any right now. If this works, keep it :-)

Well, you really do need a state machine with several states; regexps
are a compact and (theoretically) understandable way to write those.

They do lack something in the error-handling department; in this case,
I'd like to make sure that I match the whole string, with no gaps or
trailing garbage.
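
One way to get that check, as a sketch rather than something tried on the
real data: pattern.match(string, pos) only matches starting exactly at
pos, which does the job of Perl's \G, so a loop can walk the line field
by field and complain if it lands on anything that isn't a comma.
field_re and splitcsv_strict are names invented for this example.

```python
import re

# One field: a run of unquoted characters and/or "quoted" stretches.
field_re = re.compile(r'((?:[^,"]|"[^"]*")*)')

def splitcsv_strict(line):
    """Split a CSV line; raise ValueError on text the pattern can't consume."""
    fields = []
    pos = 0
    while True:
        # The pattern can match the empty string, so m is never None.
        m = field_re.match(line, pos)
        fields.append(m.group(1))
        pos = m.end()
        if pos == len(line):
            return fields
        if line[pos] != ',':
            raise ValueError("bad CSV at column %d: %r" % (pos, line[pos:]))
        pos += 1  # step over the comma

print(splitcsv_strict('1,"Smith, J.",robbery'))
```

An unterminated quote, for instance, stops the walk in the middle of the
line and raises instead of silently yielding odd fields.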

>> victims = []
>> victim = None
>> 
>> headerpat = re.compile("CRIME")
>> carriage_control = re.compile("[\r\n]")
>> blank_line = re.compile("\\s*$")
>> comma = re.compile(",")
>
>Using regexps for this sort of thing is just plain silly. You have a big
>Perl stuck in your brain! :-)

Guilty!  Perl in the first degree.  :)

In Perl, of course, searching for regexps is easier than searching for
strings, which is why Perl folk usually search for strings by
pretending they're regexps.

>> line = sys.stdin.readline()
>> while line:
>> 	line = carriage_control.sub("", line)
>
>You know that there is always a \n on the end of the line. It's always \n,
>not \r\n or \r, since you don't read the file in as binary.
>So that line is just:
>    line = line[:-1]

Actually, it is \r\n --- the file was written by Microsoft Excel, and
Python is running on Linux.  But the last line may have no CR or LF,
and if someone wants to try the program on the Usenet post, they may
not bother to put the CRs on.

>If you want to delete trailing spaces as well, use string.strip. Makes
>the check for empty line easier as well (line == "").

Yow, perfect!  In Perl, we have to roll our own string.strip, so I did.

(Turns out \r is "whitespace" to string.strip, even on Unix.)
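
A quick check of that, written with the string-method spelling
(line.strip()) rather than string.strip(line); the method form is an
assumption about the Python version at hand.  clean_line is a made-up
helper name.

```python
def clean_line(line):
    """Drop leading and trailing whitespace, including CR and LF."""
    return line.strip()

for raw in ['1,robbery\r\n', '1,robbery\n', '1,robbery', '   \r\n']:
    print(repr(clean_line(raw)))

# A blank line comes back as the empty string, so the test is simply:
print(clean_line(' \t\r\n') == '')
```

This covers the CRLF file, a file with bare LFs, and a last line with no
terminator at all.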

>> 	if headerpat.match(line):
>> 		pass
>
>What happens when the word "CRIME" is part of any real data line?
>I would check something like:
>    if line[:5] == "CRIME":

Doesn't match() look only at the beginning of the string?

Nevertheless, your solution is better.
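
To answer my own question with a sketch: match() does anchor at the start
of the string, while search() scans the whole line, so the regexp and the
slice comparison agree here.  The sample lines are invented.

```python
import re

headerpat = re.compile("CRIME")

header = "CRIME,TYPE,AGE"
data = "17,WAR CRIME,34"

# match() only succeeds at position 0 ...
print(headerpat.match(header) is not None)
print(headerpat.match(data) is not None)
# ... whereas search() finds the word anywhere in the line.
print(headerpat.search(data) is not None)
# The slice comparison needs no regexp at all.
print(header[:5] == "CRIME", data[:5] == "CRIME")
```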

>> 	elif blank_line.match(line):
>
>    elif string.strip(line) == '':
>	
>I'm sure there are more similar improvements down below, but I have no more
>time.

Thanks for the time you did spend!  It was extremely helpful.

>> sys.stderr.write(join(
>> 	split('crimeno crime type age sex race', ' '), "\t"))
>> sys.stderr.write(join(
>> 	map((lambda x: 'age' + `x` + "\tsex" 
>> 	     + `x` + "\trace" + `x`),
>> 	range(1, max_suspects+1)))
>> 	)
>
>This looks like it could be improved a lot.

I thought so too.

By the way, it's broken, now that I look at it.  It puts unwanted
spaces in the output, because the second join falls back on its default
separator, a single space.

lambda x: '\tage%d\tsex%d\trace%d' % (x, x, x) would be one
improvement.  Or perhaps simply a loop:

for i in range(1, max_suspects+1):
	sys.stderr.write('\tage%d\tsex%d\trace%d' % (i, i, i))
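
Putting the pieces together, a sketch that builds the whole header as one
string so that every separator is an explicit tab.  header_line is a
made-up name, and ending the line with a newline is my assumption, since
the code above never writes one.

```python
def header_line(max_suspects):
    """Return the tab-separated column header, newline included."""
    columns = 'crimeno crime type age sex race'.split()
    for i in range(1, max_suspects + 1):
        columns.extend(['age%d' % i, 'sex%d' % i, 'race%d' % i])
    return '\t'.join(columns) + '\n'

print(repr(header_line(2)))
```

The caller can then hand the result to a single write() call.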

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)