[Tutor] awk like functionality in python

Alan Gauld alan.gauld at yahoo.co.uk
Wed Sep 14 06:02:28 EDT 2016


On 13/09/16 18:27, Ron Williams wrote:
> Hello, this is my first post. I'm glad a place like this exists. I'm coming
> from primarily a shell scripting background and it was suggested I should
> learn python because the two are closely related.

It's often been said, but it's not strictly true.
Python is a general purpose programming language with
some scripting facilities, but they are a small part
of the whole Python arsenal.

If you are doing a lot of scripting-type projects there are a few
modules that you are going to need to study closely.
- The os module gives you access to lots of tools like the basic OS
commands (ls, chmod, pwd etc)
- The os.path module lets you manipulate file paths
- glob lets you access lists of files using wildcards
- shutil gives you tools to manipulate files (cp, rm etc)
- subprocess lets you run other programs and manipulate
the input/output
- fileinput lets you process the lines of a whole list of files
as a single stream
- string methods - not a module but the builtin string class
- re gives you access to regex (as in grep, sed)
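
To give a feel for how a few of these fit together, here is a rough
sketch that uses os.path, shutil and os to do the equivalent of cp, rm
and ls. The helper name backup_file, the file name and its contents are
all made up for illustration, and it works in a temporary directory so
it won't touch your real files.

```python
import os
import shutil
import tempfile


def backup_file(path):
    """Copy path to path + '.bak' and return the new path (like cp)."""
    backup = path + ".bak"
    shutil.copy(path, backup)
    return backup


with tempfile.TemporaryDirectory() as workdir:      # scratch directory
    original = os.path.join(workdir, "names.txt")   # os.path builds the path
    with open(original, "w") as out:
        out.write("Mary,F,7065\n")                  # made-up sample line

    copy = backup_file(original)                    # shell: cp names.txt names.txt.bak
    print(os.path.basename(copy), os.path.exists(copy))
    os.remove(original)                             # shell: rm names.txt
    print(sorted(os.listdir(workdir)))              # shell: ls
```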

> FILE_DIR=/home/ronw/my_python_journey/names
> 
> for i in `ls $FILE_DIR`
> do

Look at the for loop in Python, and
check out os.listdir() and glob.glob()
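
As a rough sketch of the difference between the two: os.listdir()
returns bare names, so you have to join them to the directory yourself,
while glob.glob() expands a wildcard and hands back paths that already
include the directory. The helper list_files and the file names below
are invented for the example, and it runs against a temporary directory
rather than your real one.

```python
import glob
import os
import tempfile


def list_files(directory, pattern="*"):
    """Return sorted full paths of regular files in directory matching pattern."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(p for p in paths if os.path.isfile(p))


# Throwaway directory so the sketch runs anywhere; the names are made up.
with tempfile.TemporaryDirectory() as file_dir:
    for name in ("a.txt", "b.txt", "notes.log"):
        open(os.path.join(file_dir, name), "w").close()

    # os.listdir() gives names only, much like plain `ls`.
    for name in sorted(os.listdir(file_dir)):
        print(os.path.join(file_dir, name))

    # glob.glob() does the wildcard matching, like `ls $FILE_DIR/*.txt`.
    for path in list_files(file_dir, "*.txt"):
        print(path)
```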

> OUTPUT=`awk -F, 'BEGIN{OFS=" ";} {sum+=$3} END {print FILENAME, sum}'
> $FILE_DIR/$i | awk -F'[^0-9]*' '{print $2, $3}'`

There is nothing that directly matches awk in Python, so you have to do
a bit more work here. You usually need an if/elif/else chain, although
in this case you match every line so it's easier:

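sep = ","    # the awk command uses -F, so the input fields are comma separated
total = 0    # start the running total (and avoid shadowing the builtin sum)
with open(filename) as inp:
    for line in inp:
        total += int(line.split(sep)[2])   # third field, awk's $3
    print(filename, total)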
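
For the more general awk case, where different patterns trigger
different actions, the if/elif/else chain might look something like the
sketch below. The patterns, the function name summarise and the sample
lines are all invented for illustration; they are not taken from your
files.

```python
import re


def summarise(lines):
    """Mimic an awk script made of several `pattern { action }` rules."""
    total = 0                               # awk: BEGIN { total = 0 }
    comments = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:                        # awk: /^$/ { next }
            continue
        elif line.startswith("#"):          # awk: /^#/ { comments++ }
            comments += 1
        elif re.match(r"^[^,]+,[^,]+,\d+$", line):
            total += int(line.split(",")[2])    # awk: { total += $3 }
        else:                               # awk: { print "skipped:", $0 }
            print("skipped:", line)
    return total, comments                  # awk: END { print total, comments }


# Made-up sample data: a comment, two good records, a blank and a bad line.
sample = ["# name,sex,births", "Mary,F,7065", "", "Anna,F,2604", "bad line"]
print(summarise(sample))
```

Each test in the chain plays the part of an awk pattern, and the code
before and after the loop stands in for the BEGIN and END blocks.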


But using fileinput makes it slightly simpler,
the whole thing then reduces to:

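import fileinput as fin
import os

FILE_DIR = "/home/ronw/my_python_journey/names"
# os.listdir() returns bare names, so build full paths for fileinput
paths = [os.path.join(FILE_DIR, name) for name in os.listdir(FILE_DIR)]
data = {}
for line in fin.input(paths):
    births = int(line.split(",")[2])   # comma separated, as in awk -F,
    # year = line.split(",")[???] # is there a year field?
    # add births to the running total for this file (or use year as key?)
    data[fin.filename()] = data.get(fin.filename(), 0) + births

print(data)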

This basically processes every file in the directory
line by line. It extracts the birth figure and converts
it to an integer. It then adds it to the running total in a
dictionary (data) keyed on the filename. At the end
it prints the dictionary - you may need some extra
processing to map filenames to years? Or maybe extract
the year from the line?
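
If the year only appears in the file name, the re module can stand in
for your second awk command, which splits on non-digits to pull out the
numbers. This is only a sketch: I am assuming the file names contain
the year as their first run of digits, and the names, totals and helper
functions below are hypothetical.

```python
import os
import re


def year_from_filename(path):
    """Return the first run of digits in the file's base name as an int, or None."""
    match = re.search(r"\d+", os.path.basename(path))
    return int(match.group()) if match else None


def totals_by_year(totals_by_file):
    """Re-key a {filename: total} dictionary by the year found in each name."""
    by_year = {}
    for path, total in totals_by_file.items():
        year = year_from_filename(path)
        if year is not None:                # skip names with no digits in them
            by_year[year] = by_year.get(year, 0) + total
    return by_year


# Hypothetical file names and totals, purely for illustration.
data = {"/tmp/names/yob1880.txt": 100, "/tmp/names/yob1881.txt": 250}
for year, total in sorted(totals_by_year(data).items()):
    print(year, total)
```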

I didn't look at your files to check the data format,
but it should be close...

hth
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



