crimes in Python
Remco Gerlich
scarblac-spamtrap at pino.selwerd.nl
Wed Mar 8 05:46:17 EST 2000
Kragen Sitaker wrote in comp.lang.python:
> I just started learning Python; I figured the best way to do it would
> be to write actual programs in it.
I agree.
> The Perl version is 79 lines; the Python one is 121 lines.
>
> This is probably not a fair way to evaluate Python, given that this is
> Perl's natural habitat.
Even if it weren't, you're only checking how useful it is to write Perl in
Python. I suppose you would get equally bad results if you wrote it in Python
first and then translated it to Perl.
> It left me with several questions:
> - on a 2800-record file, the Python program took 8 seconds, while the
> Perl program took 1.5. Why? (I tried precompiling all the REs I'm
> using in the loop; it took me down to 7.9.)
Perl's regexen are still somewhat faster. Perl's file access is optimized
in hideous platform-dependent ways. These are the two main things that Perl
is optimized for. Still...
> - is there a way to print things out with "print" without tacking trailing
> spaces or newlines on? Or is using sys.stdout.write() the only way?
The latter.
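A minimal sketch of the sys.stdout.write() route. The helper name write_fields is made up for illustration, and it takes the output stream as a parameter only so that it can be tried on something other than stdout.

```python
import sys

def write_fields(fields, out=sys.stdout):
    # Hypothetical helper: write the fields separated by tabs,
    # with no trailing space or newline added.
    out.write("\t".join(fields))

write_fields(["1", "2911.02", "ROBBERY"])
sys.stdout.write("\n")  # the newline only appears when asked for
```

Unlike print, write() emits exactly the string it is given, so separators and line endings are entirely up to the caller.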
> - What's the equivalent of the Perl idiom while (<FH>) { }? I tried
> while line = sys.stdin.readline():, but Python complained that the
> syntax was invalid.
If it's ok to read the whole file into memory at once:
    for line in sys.stdin.readlines():
Otherwise, I suppose:
    while 1:
        line = sys.stdin.readline()
        if not line:
            break
> I guess an assignment is a statement, not an
> expression, so I can't use it in the while condition.
Correct.
There's another way to do it using a Reader class in the snippets archive,
http://tor.dhs.org/~zephyrfalcon/snippets/source/28.html
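To give an idea of what such a wrapper can look like, here is a rough sketch. It is not the class from the snippets archive; the names LineReader and collect_lines are invented for this example. The object remembers the last line read, so the read itself can sit in the while condition.

```python
import io

class LineReader:
    # Hypothetical wrapper: keeps the most recent line in self.line
    # so that readline() can be used directly as a loop condition.
    def __init__(self, stream):
        self.stream = stream
        self.line = None

    def readline(self):
        self.line = self.stream.readline()
        # readline() returns an empty string only at end of file;
        # a blank line still contains its newline.
        return self.line != ""

def collect_lines(stream):
    reader = LineReader(stream)
    lines = []
    while reader.readline():
        lines.append(reader.line)
    return lines

print(collect_lines(io.StringIO("first\nsecond\n")))
```

The loop reads much like the Perl while (<FH>) form, at the cost of one small class.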
> - what kind of an idiot am I to list all the attributes of a victim
> line in an __init__ argument list, in an __init__ body, in the place
> that calls __init__ (implicitly), and in the output statement? It was
> dumb both times I did it, but it was more obvious it was dumb in Python.
Dicts, you mean? :)
> - how do I write long expressions broken over lines prettily in Python?
See the coding style guide somewhere on python.org. For one thing, (ab)use
the fact that you don't need a trailing \ if you have a ( ) pair open in an
expression.
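For instance, a long concatenation can be wrapped in one pair of parentheses and broken after each operator. This is only a sketch; the function name describe and its parameters are invented for the example.

```python
def describe(crimeno, crime, age, sex, race):
    # The open parenthesis lets the expression run over several
    # lines without a trailing backslash on any of them.
    return (str(crimeno) + "\t" +
            crime + "\t" +
            str(age) + "\t" +
            sex + "\t" +
            race)

print(describe(1, "2911.02", 4, "M", "W"))
```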
> Here's some sample input, which covers most of the cases --- except
> that it's likely you may lose the carriage-return characters before
> the newlines:
>
> CRIME,TYPE,ROLE,AGE,SEX,RACE,CRIMENO
>
> 2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1
> ,,SUSPECT,23,M,B,
<snip>
> Here's the Python version; the Perl version follows it.
>
> #!/usr/bin/python
> # read crime data
>
> import sys
> import re
> from string import join, split
>
> infile = None
> while len(sys.argv) > 1:
>     if not infile:
>         infile = sys.argv.pop(1)
>     else:
>         sys.exit("Usage: " + sys.argv[0] + "infile")
>
> if infile:
>     sys.stdin = open (infile, 'r')
Using a while here is misleading, imo. You only ever use one argument.
    if len(sys.argv) > 2:
        sys.exit("Usage: %s infile" % sys.argv[0])
    elif len(sys.argv) == 2:
        infile = sys.argv[1]
        sys.stdin = open(infile, 'r')
    else:
        infile = None
And you need some error checking on the open.
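One way that check could look, as a sketch only: the helper name open_input and the wording of the error message are my own, and it returns the stream instead of reassigning sys.stdin.

```python
import sys

def open_input(argv):
    # Hypothetical helper: return the stream to read from, or exit
    # with a message if the arguments or the file are no good.
    if len(argv) > 2:
        sys.exit("Usage: %s infile" % argv[0])
    if len(argv) == 2:
        try:
            return open(argv[1], "r")
        except IOError as err:
            sys.exit("%s: cannot open %s: %s" % (argv[0], argv[1], err))
    # No file name given: fall back to standard input.
    return sys.stdin
```

Calling sys.exit() with a string prints it to stderr and exits with a nonzero status, so a missing file ends the program with a readable message rather than a traceback.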
> '''
> sub splitcsv {
>     my ($line) = @_;
>     return $line =~ /\G((?:[^,"]|"[^"]*")*),?/g
> }
> '''
I don't know what this means. The =~ splits the line according to that
regexp, and returns a list or so?
> # Yow, how do I Python that?
> # The 're' module doesn't have \G.
> csv_re = re.compile("((?:[^,\"]|\"[^\"]*\")*),?")
Use single quotes so you don't need to escape the double quotes.
> def splitcsv(line):
>     return csv_re.findall(line)
My intuition is that there should be a clean way without regexpen. But I
can't think of any right now. If this works, keep it :-)
Does any resident bot have a nice idiom for reading CSV files? :-)
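One option that avoids the regexp altogether is the csv module that ships in the standard library of later Python releases. That is a substitute for the findall approach, not something the program above uses, and the sketch below assumes such a release is available.

```python
import csv

def splitcsv(line):
    # csv.reader accepts any iterable of lines; hand it a one-item
    # list and take the single row it produces.
    for row in csv.reader([line]):
        return row
    return []

print(splitcsv('2911.02,"ROBBERY - FORCE, THR",VICTIM,4,M,W,1'))
print(splitcsv(',,SUSPECT,23,M,B,'))
```

The quoted field keeps its embedded comma, and empty fields come back as empty strings.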
> class Victim:
>     def __init__(self, crime, type, crimeno, age, sex, race):
>         self.crime = crime
>         self.type = type
>         self.crimeno = crimeno
>         self.age = age
>         self.sex = sex
>         self.race = race
>         self.suspects = []
I trust you've found out about dictionaries by now :-).
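A rough sketch of the dictionary version, so that the field names are listed only once. The names FIELDS and make_victim are invented for this example, and the sample values come from the first data line quoted above.

```python
FIELDS = ["crime", "type", "crimeno", "age", "sex", "race"]

def make_victim(values):
    # Pair each field name with its value; the names live in one place.
    victim = dict(zip(FIELDS, values))
    # Each victim gets its own, initially empty, list of suspects.
    victim["suspects"] = []
    return victim

victim = make_victim(["2911.02", "ROBBERY - FORCE, THR", "1", "4", "M", "W"])
print(victim["crime"], victim["age"])
```

Adding a field then means changing FIELDS and nothing else.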
>
> victims = []
> victim = None
>
> headerpat = re.compile("CRIME")
> carriage_control = re.compile("[\r\n]")
> blank_line = re.compile("\\s*$")
> comma = re.compile(",")
Using regexps for this sort of thing is just plain silly. You have a big
Perl stuck in your brain! :-)
> line = sys.stdin.readline()
> while line:
>     line = carriage_control.sub("", line)
Since you don't read the file in as binary, a file in your platform's own
format comes back with a plain \n at the end of each line, not \r\n or \r.
If that holds for your data, that line is just:
line = line[:-1]
If you want to delete trailing spaces as well, use string.strip. Makes
the check for empty line easier as well (line == "").
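If the input may still carry carriage returns, as the note on the sample data suggests, or if the last line has no newline at all, cutting exactly one character is not quite right. A sketch that removes whatever line ending is present, assuming a Python whose strings have an rstrip method; the name chomp is borrowed from Perl and is not a Python built-in.

```python
def chomp(line):
    # Remove any trailing carriage returns and newlines, and nothing else.
    return line.rstrip("\r\n")

print(repr(chomp("2911.02,VICTIM\r\n")))
```

Other trailing whitespace is left alone, so use strip() instead if that should go as well.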
>     if headerpat.match(line):
>         pass
What happens when the word "CRIME" is part of any real data line?
I would check something like:
if line[:5] == "CRIME":
>     elif blank_line.match(line):
elif string.strip(line) == '':
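Put together, the two tests need no regexps at all. A sketch, with the helper name classify and its three labels made up for the example, and written with string methods on the assumption that your Python has them:

```python
def classify(line):
    # Hypothetical helper: label a line as 'blank', 'header' or 'data'
    # using plain string operations instead of regular expressions.
    if line.strip() == "":
        return "blank"
    if line[:5] == "CRIME":
        return "header"
    return "data"

print(classify("CRIME,TYPE,ROLE,AGE,SEX,RACE,CRIMENO\n"))
print(classify(",,SUSPECT,23,M,B,\n"))
```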
I'm sure there are more similar improvements down below, but I have no more
time.
> sys.stdout.write(join (
>     (victim.crimeno,
>      victim.crime,
>      victim.type,
>      victim.age,
>      victim.sex,
>      victim.race),
>     "\t"))
Gotta love dicts :)
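With the victim stored as a dictionary, the output line can be driven by a list of field names. A sketch only; OUTPUT_FIELDS and format_victim are names invented here, and the sample record is typed in by hand.

```python
import sys

OUTPUT_FIELDS = ["crimeno", "crime", "type", "age", "sex", "race"]

def format_victim(victim):
    # Look each field up by name; the column order is set in one place.
    return "\t".join([victim[name] for name in OUTPUT_FIELDS])

victim = {"crimeno": "1", "crime": "2911.02", "type": "ROBBERY - FORCE, THR",
          "age": "4", "sex": "M", "race": "W"}
sys.stdout.write(format_victim(victim) + "\n")
```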
> sys.stderr.write(join(
>     split('crimeno crime type age sex race', ' '), "\t"))
> sys.stderr.write(join(
>     map((lambda x: 'age' + `x` + "\tsex"
>                    + `x` + "\trace" + `x`),
>         range(1, max_suspects+1)))
>     )
This looks like it could be improved a lot.
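One possible direction, as a sketch: build the whole list of column names first and join it once. The function name header_line is invented, and joining every column with a tab is my assumption about the intended output.

```python
def header_line(max_suspects):
    # Fixed victim columns first, then age/sex/race for each suspect.
    columns = ["crimeno", "crime", "type", "age", "sex", "race"]
    for n in range(1, max_suspects + 1):
        for name in ("age", "sex", "race"):
            columns.append("%s%d" % (name, n))
    return "\t".join(columns)

print(header_line(2))
```

This drops the lambda and the backquotes, and a single join keeps the separator consistent across all columns.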
> @@@@@@@@@@@@ And here's the Perl version:
(snip)
--
Remco Gerlich, scarblac at pino.selwerd.nl
Murphy's Rules, "Sonar silliness":
In Torpedo Fire, the convoy player has about a 20% chance of a false
sighting per minute. That averages out to almost 300 false sightings
each day.