[Tutor] 1 to N searches in files

Mon Dec 3 16:46:19 CET 2012

From: Dave Angel <d at davea.name>
To: Spectral None <spectralnone at yahoo.com.sg> 
Cc: "tutor at python.org" <tutor at python.org> 
Sent: Sunday, 2 December 2012, 20:05
Subject: Re: [Tutor] 1 to N searches in files

On 12/02/2012 03:53 AM, Spectral None wrote:
> Hi all
>
> I have two files (File A and File B) with strings of data in them (each string on a separate line). Basically, each string in File B will be compared with all the strings in File A and the resulting output is to show a list of matched/unmatched lines and optionally to write to a third File C
>
> File A: Unique strings
> File B: Can have duplicate strings (that is, "string1" may appear more than once)
>
> My code currently looks like this:
>
> -----------------
> import difflib
>
> FirstFile = open(r'C:\FileA.txt', 'r')
> SecondFile = open(r'C:\FileB.txt', 'r')
> ThirdFile = open(r'C:\FileC.txt', 'w')
>
> a = FirstFile.readlines()
> b = SecondFile.readlines()
>
> mydiff = difflib.Differ()
> results = mydiff.compare(a, b)
> print("\n".join(results))
>
> #ThirdFile.writelines(results)
>
> FirstFile.close()
> SecondFile.close()
> ThirdFile.close()
> ---------------------
>
> However, it seems that the results do not correctly reflect the matched/unmatched lines. As an example, if FileA contains "string1" and FileB contains multiple occurrences of "string1", it seems that the first occurrence matches correctly but subsequent "string1"s are treated as unmatched strings.
>
> I am thinking perhaps I don't understand Differ() that well and that it is not doing what I hoped to do? Is Differ() comparing first line to first line and second line to second line etc in contrast to what I wanted to do?
>
> Regards
>
>
> Let me guess your goal, and then, on that assumption, discuss your code.

> I think your File A is supposed to be a dictionary of valid words
> (strings).  You want to process File B, checking each line against that
> dictionary, and make a list of which lines are "valid" (in the
> dictionary), and another of which lines are not (missing from the
> dictionary).  That's one list for matched lines, and one for unmatched.

> That isn't even close to what difflib does.  This can be solved with
> minimal code, but not by starting with difflib.
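
To see why difflib reports the repeated lines as unmatched, here is a minimal sketch; the two sample lists standing in for File A and File B are made up for illustration. Differ aligns the two sequences in order, so a line that occurs once in the first list can pair with only one of its occurrences in the second, and the extra copies come out as additions ("+ ").

```python
import difflib

# Hypothetical stand-ins for the contents of File A and File B.
a = ["string1\n", "string2\n"]
b = ["string1\n", "string1\n", "string2\n"]

# Each yielded line starts with a two-character code:
# "  " common to both, "- " only in a, "+ " only in b
# ("? " lines, when present, are intraline hints).
diff_lines = list(difflib.Differ().compare(a, b))

# Collect the lines of b that Differ reported as additions.
added = [line for line in diff_lines if line.startswith("+ ")]

print("".join(diff_lines), end="")
print(len(added), "line(s) of b reported as unmatched")
```

One of the two "string1" lines in b is flagged even though the same string exists in a, which matches the behaviour described in the question.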

> What you should do is to loop through File A, adding all the lines to a
> set called valid_dictionary.  Calling set(FirstFile) can do that in one
> line, without even calling readlines().
> Then a simple loop can build the desired lists.  The matched_lines is
> simply all lines which are in the dictionary, while unmatched_lines are
> those which are not.

> The heart of the comparison could simply look like:

>     if line in valid_dictionary:
>         matched_lines.append(line)
>     else:
>         unmatched_lines.append(line)
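
Put together, the approach described above might look like the sketch below. The function name split_matches and the sample data are assumptions made for illustration; the arguments can be any iterables of lines, including open file objects.

```python
def split_matches(valid_lines, candidate_lines):
    """Split candidate_lines into (matched, unmatched) lists.

    Both arguments may be any iterable of strings, e.g. open files.
    """
    # Set membership tests take constant time on average, so each
    # candidate line costs one lookup rather than a scan of File A.
    valid_dictionary = set(valid_lines)
    matched_lines = []
    unmatched_lines = []
    for line in candidate_lines:
        if line in valid_dictionary:
            matched_lines.append(line)
        else:
            unmatched_lines.append(line)
    return matched_lines, unmatched_lines


# Hypothetical sample data in place of File A and File B.
file_a = ["string1\n", "string2\n"]
file_b = ["string1\n", "string3\n", "string1\n"]
matched, unmatched = split_matches(file_a, file_b)
print(matched)
print(unmatched)
```

With real files, open both in a with block and pass the file objects directly. Lines are compared including their trailing newline, so a final line that lacks one will not match unless both sides are stripped first.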

> -- 

> DaveA

---------------------

Hi Dave

Your solution seems to work:

setA = set(FileA)
setB = set(FileB)

for line in setB:
    if line in setA:
        matched_lines.write(line)
    else:
        non_matched_lines.write(line)
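
One caveat: looping over set(FileB) visits each distinct line once, in arbitrary order, so the output no longer shows how often a line occurred in File B. If those counts matter, a possible variation uses collections.Counter. This is only a sketch; count_matches and the sample data are hypothetical names chosen for illustration.

```python
from collections import Counter


def count_matches(valid_lines, candidate_lines):
    """Return two Counters: occurrences of matched and unmatched lines."""
    # Strip trailing newlines on both sides so keys compare consistently.
    valid = {line.rstrip("\n") for line in valid_lines}
    matched = Counter()
    unmatched = Counter()
    for line in candidate_lines:
        key = line.rstrip("\n")
        if key in valid:
            matched[key] += 1
        else:
            unmatched[key] += 1
    return matched, unmatched


# Hypothetical sample data in place of File A and File B.
file_a = ["string1\n", "string2\n"]
file_b = ["string1\n", "string3\n", "string1\n"]
matched_counts, unmatched_counts = count_matches(file_a, file_b)
print(matched_counts)
print(unmatched_counts)
```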

There are no duplicates in the results either. Thanks for helping out.

Regards