crimes in Python

Wed Mar 8 00:31:06 EST 2000

(The application involved is analyzing crime data, or preparing it for
someone else to analyze.  No crimes are being committed in Python.)

I just started learning Python; I figured the best way to do it would
be to write actual programs in it.

So, tonight, I needed to write a program to do some data massaging.
I wrote the first version in Perl, which took me about 90 minutes,
including time to correct several misconceptions about the data format
(and start afresh twice).

I translated the Perl code more or less word-for-word into Python,
changing things around locally when it looked like there was a more
Pythonish way.  It took me about 70 minutes with the various Python
documentation pages open in Mozilla alongside my editor.

The Perl version is 79 lines; the Python one is 121 lines.

This is probably not a fair way to evaluate Python, given that this is
Perl's natural habitat.

It left me with several questions:
- on a 2800-record file, the Python program took 8 seconds, while the
  Perl program took 1.5.  Why?  (I tried precompiling all the REs I'm
  using in the loop; it took me down to 7.9.)
- is there a way to print things out with "print" without tacking trailing
  spaces or newlines on?  Or is using sys.stdout.write() the only way?
- What's the equivalent of the Perl idiom while (<FH>) { }?  I tried 
  while line = sys.stdin.readline():, but Python complained that the
  syntax was invalid.  I guess an assignment is a statement, not an
  expression, so I can't use it in the while condition.  I resorted to
  cutting and pasting the readline() call outside the top of the loop
  and inside the bottom of the loop.
- what kind of an idiot am I to list all the attributes of a victim
  line in an __init__ argument list, in an __init__ body, in the place
  that calls __init__ (implicitly), and in the output statement?  It was
  dumb both times I did it, but it was more obvious it was dumb in Python.
- how do I write long expressions broken over lines prettily in Python?
- despite my high-and-mighty words yesterday about interval
  representations, I wrote [0:2] when I meant [0:3].
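
On the while (<FH>) and trailing-newline questions, here is a minimal
sketch rather than a definitive answer.  It assumes a much newer Python
than the one used for this post: iterating directly over a file object
arrived in Python 2.2, and the usual idiom before that was "while 1:"
with a break when readline() returns an empty string.  The names
strip_line_endings and write_row are made up for illustration.

```python
import io
import sys


def strip_line_endings(stream):
    """Yield each line of stream without its trailing CR/LF characters."""
    # Iterating over a file object yields one line at a time, much like
    # Perl's while (<FH>) { }, with no readline() call to repeat.
    for line in stream:
        yield line.rstrip("\r\n")


def write_row(fields, out=sys.stdout):
    """Write fields joined by tabs, followed by exactly one newline."""
    # write() adds no spaces or newlines of its own.  Inside parentheses
    # a long expression can continue across lines without backslashes.
    out.write(
        "\t".join(fields)
        + "\n"
    )


if __name__ == "__main__":
    sample = io.StringIO("2903.13,ASSAULT\r\n,,SUSPECT\r\n")
    for line in strip_line_endings(sample):
        write_row(line.split(","))
```

The assignment-in-condition problem goes away because the for loop does
the fetching; nothing has to be pasted at both the top and the bottom of
the loop.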

Comments, criticisms, suggestions, and flames are welcome.

Here's some sample input, which covers most of the cases --- except
that the carriage-return characters before the newlines will likely
be lost:

CRIME,TYPE,ROLE,AGE,SEX,RACE,CRIMENO

2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1
,,SUSPECT,23,M,B,

2903.13,ASSAULT,VICTIM,57,F,W,2
,,SUSPECT,60,M,W,
,,SUSPECT,48,M,W,

2903.13,ASSAULT,VICTIM,14,F,B,2
,,SUSPECT,60,M,W,
,,SUSPECT,48,M,W,

2903.13,ASSAULT,VICTIM,46,M,W,2
,,SUSPECT,60,M,W,

2903.13,ASSAULT,VICTIM,7,M,W,3
,,SUSPECT,23,F,W,
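
The quoted comma in the first record is why a plain split on commas
won't do.  As a point of comparison, here is a minimal sketch that
assumes the standard csv module, which was only added in Python 2.3 and
so was not available for this post.  read_rows and SAMPLE are
hypothetical names; SAMPLE just repeats the first record above with the
carriage returns put back.

```python
import csv
import io

SAMPLE = (
    'CRIME,TYPE,ROLE,AGE,SEX,RACE,CRIMENO\r\n'
    '\r\n'
    '2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1\r\n'
    ',,SUSPECT,23,M,B,\r\n'
)


def read_rows(stream):
    """Return the non-blank CSV rows of stream as lists of strings."""
    # csv.reader copes with the quoted comma and the CR LF line endings.
    # Completely empty lines come back as empty lists and are dropped
    # here, so this sketch does not preserve the blank-line separators.
    return [row for row in csv.reader(stream) if row]


# newline="" leaves the line endings alone so the csv module sees them.
rows = read_rows(io.StringIO(SAMPLE, newline=""))
```

A real replacement for the program below would still need the blank
lines to tell one victim from the next, so it would have to keep the
empty rows rather than filter them out.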

Here's the Python version; the Perl version follows it.

#!/usr/bin/python
# read crime data

import sys
import re
from string import join, split

infile = None
while len(sys.argv) > 1:
	if not infile:
		infile = sys.argv.pop(1)
	else:
		sys.exit("Usage: " + sys.argv[0] + " infile")

if infile:
	sys.stdin = open (infile, 'r')

'''
sub splitcsv {
	my ($line) = @_;
	return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
}
'''

# Yow, how do I Python that?
# The 're' module doesn't have \G.
csv_re = re.compile("((?:[^,\"]|\"[^\"]*\")*),?")
def splitcsv(line):
	return csv_re.findall(line)

# in Perl:
# $victims[$x] = {'crime' => 105.69,
#                 'type' => 'RIDICULOUS ASSAULT',
#                 'crimeno' => $crimeno, 
#                 'suspects' => \@suspects,
#                 'age' => 21,
#                 'sex' => 'M',
#                 'race' => 'W'}
# each suspect is an array: [39, 'M', 'W']

class Victim:
	def __init__(self, crime, type, crimeno, age, sex, race):
		self.crime = crime
		self.type = type
		self.crimeno = crimeno
		self.age = age
		self.sex = sex
		self.race = race
		self.suspects = []

victims = []
victim = None

headerpat = re.compile("CRIME")
carriage_control = re.compile("[\r\n]")
blank_line = re.compile("\\s*$")
comma = re.compile(",")

line = sys.stdin.readline()
while line:
	line = carriage_control.sub("", line)
	if headerpat.match(line):
		pass
	elif blank_line.match(line):
		if victim:
			victims.append(victim)
		victim = None
	elif comma.match(line): # a suspect
		fields = splitcsv (line)
		# columns are ROLE,AGE,SEX,RACE, in that order
		[role, age, sex, race] = fields[2:6]
		if role != 'SUSPECT':
			sys.exit("not a suspect: " + role +
				" under " + victim.crimeno)
		victim.suspects.append([age, sex, race])
	else: # a victim and crime
		if victim:
			sys.exit("two victims, no blank line")
		fields = splitcsv (line)
		victim = Victim(
			crime = fields[0], 
			type = fields[1], 
			age = fields[3],
			sex = fields[4],
			race = fields[5],
			crimeno = fields[6])

		if fields[2] != 'VICTIM':
			sys.exit("not a victim: " + fields[2] +
				" at " + victim.crimeno)
	line = sys.stdin.readline()

if victim: victims.append(victim)

max_suspects = 0
for victim in victims:
	sys.stdout.write(join (
		(victim.crimeno,
		victim.crime,
		victim.type,
		victim.age,
		victim.sex,
		victim.race),
		"\t"))
	if not victim.suspects:
		print
		continue
	if len(victim.suspects) > max_suspects:
		max_suspects = len(victim.suspects)
	for suspect in victim.suspects:
		for value in suspect[0:3]:
			sys.stdout.write("\t" + value)
	print

# join every column name with tabs, as the Perl version does
sys.stderr.write(join(
	split('crimeno crime type age sex race', ' ') +
	map((lambda x: 'age' + `x` + "\tsex" 
	     + `x` + "\trace" + `x`),
	range(1, max_suspects+1)),
	"\t"))
sys.stderr.write("\n")
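
On the question of listing every victim attribute four times: one
possible way out, sketched here for a current Python 3 and under the
assumption that a plain dictionary is an acceptable stand-in for the
Victim class, is to name the columns once and zip them with the split
fields, much as the Perl version's hash slice does.  FIELDS, OUTPUT,
make_victim and output_fields are hypothetical names, not part of the
program above.

```python
# Column names in input order, and the order wanted on output.
FIELDS = ("crime", "type", "role", "age", "sex", "race", "crimeno")
OUTPUT = ("crimeno", "crime", "type", "age", "sex", "race")


def make_victim(values):
    """Build a victim record (a dict) from one already-split input line."""
    victim = dict(zip(FIELDS, values))
    # The role column is only a sanity check; it is not kept.
    role = victim.pop("role")
    if role != "VICTIM":
        raise ValueError("not a victim: %s at %s" % (role, victim["crimeno"]))
    victim["suspects"] = []
    return victim


def output_fields(victim):
    """Return the victim's values in output column order."""
    return [victim[name] for name in OUTPUT]
```

For example, make_victim(["2903.13", "ASSAULT", "VICTIM", "57", "F",
"W", "2"]) followed by "\t".join(output_fields(...)) gives the same
columns the loop above writes, with each name spelled out only in the
two tuples.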

@@@@@@@@@@@@ And here's the Perl version:

#!/usr/bin/perl -w
use strict;
# read crime data

my $infile = undef;
while (@ARGV) {
	if (not $infile) {
		$infile = shift @ARGV;
	} else {
		die "Usage: $0 infile\n";
	}
}

if ($infile) {
	open STDIN, "< $infile" or die "Can't open $infile: $!";
}

# $victims[$x] = {'crime' => 105.69,
#                 'type' => 'RIDICULOUS ASSAULT',
#                 'crimeno' => $crimeno, 
#                 'suspects' => \@suspects,
#                 'age' => 21,
#                 'sex' => 'M',
#                 'race' => 'W'}
# each suspect is an array: [39, 'M', 'W']

sub splitcsv {
	my ($line) = @_;
	return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
}

my @victims;
my %victim;
while (<STDIN>) {
	s/\r//g;
	chomp;
	next if /^CRIME/; # first line

	if (/^\s*$/) { # blank line
		push @victims, {%victim} if keys %victim;
		%victim = ();
	} elsif (/^,/) { # a suspect
		my $role;
		my @suspect;
		(undef, undef, $role, @suspect) = splitcsv $_;
		die "not a suspect: $role under $victim{crimeno}" 
			if $role ne 'SUSPECT';
		push @{$victim{'suspects'}}, \@suspect;
	} else { # a victim and crime
		my $role;
		die "two victims, no blank line, stopped" if defined 
			$victim{'crime'};
		@victim{qw(crime type role age sex race crimeno)} = splitcsv $_;
		$role = $victim{'role'};
		delete $victim{'role'};
		die "not a victim: $role at $victim{crimeno}" 
			if $role ne 'VICTIM';
	}
}

push @victims, {%victim} if keys %victim;

my $max_suspects = 0;
for my $victim (@victims) {
	print join "\t", @{$victim}{qw(crimeno crime type  age sex race)};
	(print "\n"), next if not $victim->{suspects};
	if (@{$victim->{suspects}} > $max_suspects) {
		$max_suspects = @{$victim->{suspects}};
	}
	for my $suspect (@{$victim->{suspects}}) {
		for my $value (@{$suspect}[0..2]) {
			print "\t", $value;
		}
	}
	print "\n";
}

print STDERR join "\t", qw(crimeno crime type age sex race), map { "age$_\tsex$_\trace$_" } (1..$max_suspects);
print STDERR "\n";
-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)