crimes in Python
Kragen Sitaker
kragen at dnaco.net
Wed Mar 8 00:31:06 EST 2000
(The application involved is analyzing crime data, or preparing it for
someone else to analyze.  No crimes are being committed in Python.)

I just started learning Python; I figured the best way to do it would
be to write actual programs in it.

So, tonight, I needed to write a program to do some data massaging.

I wrote the first version in Perl, which took me about 90 minutes,
including time to correct several misconceptions about the data format
(and start afresh twice).

I translated the Perl code more or less word-for-word into Python,
changing things around locally when it looked like there was a more
Pythonish way.  It took me about 70 minutes with the various Python
documentation pages open in Mozilla alongside my editor.

The Perl version is 79 lines; the Python one is 121 lines.

This is probably not a fair way to evaluate Python, given that this is
Perl's natural habitat.

It left me with several questions:
- on a 2800-record file, the Python program took 8 seconds, while the
  Perl program took 1.5.  Why?  (I tried precompiling all the REs I'm
  using in the loop; that only got it down to 7.9 seconds.)
- is there a way to print things out with "print" without tacking
  trailing spaces or newlines on?  Or is using sys.stdout.write() the
  only way?
- What's the equivalent of the Perl idiom while (<FH>) { }?  I tried
  while line = sys.stdin.readline():, but Python complained that the
  syntax was invalid.  I guess an assignment is a statement, not an
  expression, so I can't use it in the while condition.  I resorted to
  cutting and pasting the readline() call outside the top of the loop
  and inside the bottom of the loop.
- what kind of an idiot am I to list all the attributes of a victim
  line in an __init__ argument list, in an __init__ body, in the place
  that calls __init__ (implicitly), and in the output statement?  It
  was dumb both times I did it, but it was more obvious it was dumb in
  Python.
- how do I write long expressions broken over lines prettily in Python?
- despite my high-and-mighty words yesterday about interval
  representations, I wrote [0:2] when I meant [0:3].
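
On the while (<FH>) question above: one loop shape that keeps a single
readline() call is an endless loop with a break at end of file.  This is
only a sketch in present-day Python 3, not something run against the
real data; copy_lines is a made-up name for illustration.  It also uses
write() for output, which adds no spaces or newlines of its own.

```python
import io
import sys

def copy_lines(infile, outfile):
    """Copy lines from infile to outfile, dropping CR/LF from each line
    and ending each with a single newline.  Returns the line count."""
    count = 0
    while 1:
        # The only readline() call; an empty string means end of file.
        # A blank input line still arrives as "\n", so it does not end the loop.
        line = infile.readline()
        if not line:
            break
        # write() emits exactly the string it is given.
        outfile.write(line.rstrip("\r\n"))
        outfile.write("\n")
        count = count + 1
    return count

if __name__ == "__main__":
    # Hypothetical two-line input with carriage returns, for demonstration.
    copy_lines(io.StringIO("first\r\nsecond\r\n"), sys.stdout)
```

The cost is that the exit test moves from the while line into the loop
body, which some people find harder to read than the Perl form.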
Comments, criticisms, suggestions, and flames are welcome.
Here's some sample input, which covers most of the cases --- except
that you will probably lose the carriage-return characters before
the newlines:
CRIME,TYPE,ROLE,AGE,SEX,RACE,CRIMENO
2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1
,,SUSPECT,23,M,B,

2903.13,ASSAULT,VICTIM,57,F,W,2
,,SUSPECT,60,M,W,
,,SUSPECT,48,M,W,

2903.13,ASSAULT,VICTIM,14,F,B,2
,,SUSPECT,60,M,W,
,,SUSPECT,48,M,W,

2903.13,ASSAULT,VICTIM,46,M,W,2
,,SUSPECT,60,M,W,

2903.13,ASSAULT,VICTIM,7,M,W,3
,,SUSPECT,23,F,W,
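
The quoted comma in "ROBBERY - FORCE, THR" is what rules out a plain
split on commas.  As a sketch of how the regular expression used in the
programs below divides these lines (written for present-day Python 3;
split_fields and its nfields parameter are names invented here, not part
of either program):

```python
import re

# Same pattern as splitcsv in the programs, without Perl's \G anchor:
# a field is a run of non-comma, non-quote characters and/or quoted
# strings, optionally followed by a comma.
CSV_FIELD = re.compile(r'((?:[^,"]|"[^"]*")*),?')

def split_fields(line, nfields=7):
    """Return the first nfields comma-separated fields of line.

    On a line that does not end in a comma, findall() reports one extra
    empty match at the end of the string, so the result is trimmed to
    the seven columns named in the header.  Quoted fields keep their
    surrounding double quotes.
    """
    return CSV_FIELD.findall(line)[:nfields]

victim = split_fields('2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1')
suspect = split_fields(',,SUSPECT,23,M,B,')
print(victim[1])      # the comma inside the quotes stays in the field
print(suspect[2:6])   # role, age, sex, race
```

Note that this does not strip the quotes or handle doubled quotes inside
a quoted field; the sample data doesn't seem to need either.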
Here's the Python version; the Perl version follows it.
#!/usr/bin/python
# read crime data

import sys
import re
from string import join, split

infile = None
while len(sys.argv) > 1:
    if not infile:
        infile = sys.argv.pop(1)
    else:
        sys.exit("Usage: " + sys.argv[0] + " infile")

if infile:
    sys.stdin = open(infile, 'r')

'''
sub splitcsv {
    my ($line) = @_;
    return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
}
'''
# Yow, how do I Python that?
# The 're' module doesn't have \G.
csv_re = re.compile("((?:[^,\"]|\"[^\"]*\")*),?")

def splitcsv(line):
    return csv_re.findall(line)

# in Perl:
# $victims[$x] = {'crime' => 105.69,
#                 'type' => 'RIDICULOUS ASSAULT',
#                 'crimeno' => $crimeno,
#                 'suspects' => \@suspects,
#                 'age' => 21,
#                 'sex' => 'M',
#                 'race' => 'W'}
# each suspect is an array: [39, 'M', 'W']

class Victim:
    def __init__(self, crime, type, crimeno, age, sex, race):
        self.crime = crime
        self.type = type
        self.crimeno = crimeno
        self.age = age
        self.sex = sex
        self.race = race
        self.suspects = []

victims = []
victim = None
headerpat = re.compile("CRIME")
carriage_control = re.compile("[\r\n]")
blank_line = re.compile("\\s*$")
comma = re.compile(",")

line = sys.stdin.readline()
while line:
    line = carriage_control.sub("", line)
    if headerpat.match(line):
        pass
    elif blank_line.match(line):
        if victim:
            victims.append(victim)
        victim = None
    elif comma.match(line):  # a suspect
        fields = splitcsv(line)
        # input columns 2..5 are ROLE, AGE, SEX, RACE
        [role, age, sex, race] = fields[2:6]
        if role != 'SUSPECT':
            sys.exit("not a suspect: " + role +
                     " under " + victim.crimeno)
        victim.suspects.append([age, sex, race])
    else:  # a victim and crime
        if victim:
            sys.exit("two victims, no blank line")
        fields = splitcsv(line)
        victim = Victim(
            crime = fields[0],
            type = fields[1],
            age = fields[3],
            sex = fields[4],
            race = fields[5],
            crimeno = fields[6])
        if fields[2] != 'VICTIM':
            sys.exit("not a victim: " + fields[2] +
                     " at " + victim.crimeno)
    line = sys.stdin.readline()

if victim: victims.append(victim)

max_suspects = 0
for victim in victims:
    sys.stdout.write(join(
        (victim.crimeno,
         victim.crime,
         victim.type,
         victim.age,
         victim.sex,
         victim.race),
        "\t"))
    if not victim.suspects:
        print
        continue
    if len(victim.suspects) > max_suspects:
        max_suspects = len(victim.suspects)
    for suspect in victim.suspects:
        for value in suspect[0:3]:
            sys.stdout.write("\t" + value)
    print

# Column headings go to stderr, all joined with tabs in a single call
# so the suspect columns are tab-separated like the rest (as in the
# Perl version).
sys.stderr.write(join(
    split('crimeno crime type age sex race', ' ') +
    map((lambda x: 'age' + `x` + "\tsex"
                   + `x` + "\trace" + `x`),
        range(1, max_suspects+1)),
    "\t"))
sys.stderr.write("\n")
@@@@@@@@@@@@ And here's the Perl version:
#!/usr/bin/perl -w
use strict;
# read crime data

my $infile = undef;
while (@ARGV) {
    if (not $infile) {
        $infile = shift @ARGV;
    } else {
        die "Usage: $0 infile\n";
    }
}

if ($infile) {
    open STDIN, "< $infile" or die "Can't open $infile: $!";
}

# $victims[$x] = {'crime' => 105.69,
# 'type' => 'RIDICULOUS ASSAULT',
# 'crimeno' => $crimeno,
# 'suspects' => \@suspects,
# 'age' => 21,
# 'sex' => 'M',
# 'race' => 'W'}
# each suspect is an array: [39, 'M', 'W']

sub splitcsv {
    my ($line) = @_;
    return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
}

my @victims;
my %victim;
while (<STDIN>) {
    s/\r//g;
    chomp;
    next if /^CRIME/;  # first line
    if (/^\s*$/) {     # blank line
        push @victims, {%victim} if keys %victim;
        %victim = ();
    } elsif (/^,/) {   # a suspect
        my $role;
        my @suspect;
        (undef, undef, $role, @suspect) = splitcsv $_;
        die "not a suspect: $role under $victim{crimeno}"
            if $role ne 'SUSPECT';
        push @{$victim{'suspects'}}, \@suspect;
    } else {           # a victim and crime
        my $role;
        die "two victims, no blank line, stopped"
            if defined $victim{'crime'};
        @victim{qw(crime type role age sex race crimeno)} = splitcsv $_;
        $role = $victim{'role'};
        delete $victim{'role'};
        die "not a victim: $role at $victim{crimeno}"
            if $role ne 'VICTIM';
    }
}
push @victims, {%victim} if keys %victim;

my $max_suspects = 0;
for my $victim (@victims) {
    print join "\t", @{$victim}{qw(crimeno crime type age sex race)};
    (print "\n"), next if not $victim->{suspects};
    if (@{$victim->{suspects}} > $max_suspects) {
        $max_suspects = @{$victim->{suspects}};
    }
    for my $suspect (@{$victim->{suspects}}) {
        for my $value (@{$suspect}[0..2]) {
            print "\t", $value;
        }
    }
    print "\n";
}
print STDERR join "\t", qw(crimeno crime type age sex race),
    map { "age$_\tsex$_\trace$_" } (1..$max_suspects);
print STDERR "\n";
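
The hash-slice assignment in the Perl version names the victim fields
once, where the Python class names them in several places.  A rough
Python counterpart might keep the names in one tuple and build a plain
dictionary from it.  This is only a sketch in present-day Python 3;
FIELDS, OUTPUT_FIELDS and make_victim are names invented for
illustration and do not appear in either program.

```python
# Field names in the order they appear in the input header line.
FIELDS = ("crime", "type", "role", "age", "sex", "race", "crimeno")
# Column order used for the output lines in both programs.
OUTPUT_FIELDS = ("crimeno", "crime", "type", "age", "sex", "race")

def make_victim(values):
    """Build a victim record (a plain dict) from a list of field values.

    The role column is checked and then dropped, as the Perl version
    does with delete $victim{'role'}.
    """
    record = dict(zip(FIELDS, values))
    role = record.pop("role")
    if role != "VICTIM":
        raise ValueError("not a victim: " + role + " at " + record["crimeno"])
    record["suspects"] = []
    return record

victim = make_victim(["2903.13", "ASSAULT", "VICTIM", "57", "F", "W", "2"])
print("\t".join([victim[name] for name in OUTPUT_FIELDS]))
```

Whether a dictionary is better than a class here is a matter of taste;
the point is only that the field names get written down once for input
and once for output.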
--
<kragen at pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either. :)