[Tutor] sql-like join on two lists or dictionaries
Peter Otten
__peter__ at web.de
Sat Feb 14 10:55:04 CET 2015
street.sweeper at mailworks.org wrote:
> Hello all,
>
> Basically what I have here is header and line data for sales or purchase
> orders, and I'm trying to do a sql-like join to bring them together
> (which ultimately I did because I couldn't figure this out :)). I've
> managed to get the files into python using string slicing, that's not a
> problem.
>
> headers - h.dat
>
> B134542 Bob ZQ775235
> B875432 Joe ZQ987656
> B567943 Steve ZQ256222
>
> lines - l.dat
>
> B134542 112342 0012
> B134542 176542 0001
> B875732 76542 0003
> B567943 654565 0001
> B567943 900011 0001
>
> desired result - hl.dat
>
> B134542 112342 0012 Bob ZQ775235
> B134542 176542 0001 Bob ZQ775235
> B875732 76542 0003 Joe ZQ987656
> B567943 654565 0001 Steve ZQ256222
> B567943 900011 0001 Steve ZQ256222
>
>
>
> in python3 on linux:
>
> #!/usr/bin/env python3
>
> import os
>
> basepath=os.path.join(os.path.expanduser('~'),'temp',)
> linefile=os.path.join(basepath,'l.dat')
> headerfile=os.path.join(basepath,'h.dat')
>
> with open(headerfile) as h, open(linefile) as l:
> lines = l.readlines()
> headers = h.readlines()
>
> llist = [[linedata[0:7],
> linedata[14:23],
> linedata[23:27]] for linedata in lines]
>
> hlist = [[headerdata[0:7],
> headerdata[11:19],
> headerdata[19:28]] for headerdata in headers]
>
> ldict = [{linedata[0:7]:
> [linedata[14:23],
> linedata[23:27]]} for linedata in lines]
>
> hdict = [{headerdata[0:7]:
> [headerdata[11:19],
> headerdata[19:28]]} for headerdata in headers]
>
> # :)
>
> quit()
>
>
>
> Details on the data are that it's a one or many lines to one header
> relationship, at least one of each will exist in each file, and
> performance probably isn't an issue as it will only be a few tens to
> about 100 lines maximum in the lines file. The match string will be the
> 0:7 slice.
>
> You can probably guess my questions: should I be making lists or
> dictionaries out of this data, and then of course, what should I do with
> them to arrive at the combined file? I saw some examples of joining two
> two-item lists, or dictionaries with a single string as the value, but I
> couldn't seem to adapt them to what I'm doing here. I also ran across
> the dict.extend method, but looking at the result, I didn't think that
> was going to go anywhere, particularly with the one to many
> headers:lines relationship.
>
> After a while I pulled this into a sqlite file in memory and did the
> join. Using writelines I think I'll be able to get it out to a file,
> but it seems to me that there's probably a way to do this without
> resorting to sql. Or is there?
You only need one dictionary that maps the first column in headerfile to the
complete line:
with open(headerfile) as f:
lookup_header = {
headerdata[:6]: headerdata.rstrip("\n") for headerdata in f}
Then you can iterate over the lines in linefile, extract the key from that
and look it up in the dict:
with open(linefile) as lines, open("hl.dat", "w") as joined:
for line in lines:
try:
header = lookup_header[line[:6]]
except KeyError:
header = ""
print(line.rstrip("\n"), header, sep="", file=joined)
This approach works with linefiles of arbitrary size as you only ever have
one line of that file in memory.
As written the output file includes lines with no matching header; replace
header = ""
with a
continue
statement to get an inner join.
A general remark: whenever possible you should avoid the readlines() method
which creates a list of lines and instead iterate over the file directly.
More information about the Tutor
mailing list