[Tutor] sql-like join on two lists or dictionaries

Peter Otten __peter__ at web.de
Sat Feb 14 10:55:04 CET 2015


street.sweeper at mailworks.org wrote:

> Hello all,
> 
> Basically what I have here is header and line data for sales or purchase
> orders, and I'm trying to do a sql-like join to bring them together
> (which ultimately I did because I couldn't figure this out :)).  I've
> managed to get the files into python using string slicing, that's not a
> problem.
> 
> headers - h.dat
> 
> B134542    Bob      ZQ775235
> B875432    Joe      ZQ987656
> B567943    Steve    ZQ256222
> 
> lines - l.dat
> 
> B134542       112342   0012
> B134542       176542   0001
> B875732       76542    0003
> B567943       654565   0001
> B567943       900011   0001
> 
> desired result - hl.dat
> 
> B134542       112342   0012    Bob    ZQ775235
> B134542       176542   0001    Bob    ZQ775235
> B875732       76542    0003    Joe    ZQ987656
> B567943       654565   0001    Steve  ZQ256222
> B567943       900011   0001    Steve  ZQ256222
> 
> 
> 
> in python3 on linux:
> 
> #!/usr/bin/env python3
> 
> import os
> 
> basepath=os.path.join(os.path.expanduser('~'),'temp',)
> linefile=os.path.join(basepath,'l.dat')
> headerfile=os.path.join(basepath,'h.dat')
> 
> with open(headerfile) as h, open(linefile) as l:
>   lines = l.readlines()
>   headers = h.readlines()
> 
> llist = [[linedata[0:7],
>           linedata[14:23],
>           linedata[23:27]] for linedata in lines]
> 
> hlist = [[headerdata[0:7],
>           headerdata[11:19],
>           headerdata[19:28]] for headerdata in headers]
> 
> ldict = [{linedata[0:7]:
>               [linedata[14:23],
>                linedata[23:27]]} for linedata in lines]
> 
> hdict = [{headerdata[0:7]:
>               [headerdata[11:19],
>                headerdata[19:28]]} for headerdata in headers]
> 
> # :)
> 
> quit()
> 
> 
> 
> Details on the data are that it's a one or many lines to one header
> relationship, at least one of each will exist in each file, and
> performance probably isn't an issue as it will only be a few tens to
> about 100 lines maximum in the lines file.  The match string will be the
> 0:7 slice.
> 
> You can probably guess my questions: should I be making lists or
> dictionaries out of this data, and then of course, what should I do with
> them to arrive at the combined file?  I saw some examples of joining two
> two-item lists, or dictionaries with a single string as the value, but I
> couldn't seem to adapt them to what I'm doing here. I also ran across
> the dict.extend method, but looking at the result, I didn't think that
> was going to go anywhere, particularly with the one to many
> headers:lines relationship.
> 
> After a while I pulled this into a sqlite file in memory and did the
> join.  Using writelines I think I'll be able to get it out to a file,
> but it seems to me that there's probably a way to do this without
> resorting to sql.  Or is there?

You only need one dictionary that maps the first column in headerfile to the 
complete line:

with open(headerfile) as f:
    lookup_header = {
        headerdata[:6]: headerdata.rstrip("\n") for headerdata in f}

Then you can iterate over the lines in linefile, extract the key from that 
and look it up in the dict:

with open(linefile) as lines, open("hl.dat", "w") as joined:
    for line in lines:
        try:
            header = lookup_header[line[:6]]
        except KeyError:
            header = ""
        print(line.rstrip("\n"), header, sep="", file=joined)

This approach works with linefiles of arbitrary size as you only ever have 
one line of that file in memory. 

As written the output file includes lines with no matching header; replace

header = ""

with a

continue

statement to get an inner join.

A general remark: whenever possible you should avoid the readlines() method 
which creates a list of lines and instead iterate over the file directly.



More information about the Tutor mailing list