[Tutor] Tab delimited question

Mon Dec 13 21:43:52 CET 2010

Greetings Ben,

 : I'm searching line by line for certain tags and then printing the 
 : tag followed by the word immediately following the tag.

What you are describing is an awful lot like 'grep'.  But, of 
course, many different sorts of file searching resemble grep.

 : So for example, suppose I had the following line of text in a file:
 : "this	is	a	key	test123	noise	 noise	noise 	noise 	noise"
 : 
 : In this example, I would want to print "key test123" to a new 
 : file. The rest of the words I would not want.
 : 
 : Here is my code so far:
 : 
 : def test(infile, outfile):
 :   for line in infile:
 :             tagIndex = line.find("key")
 :             start = tagIndex + 4
 :             stop = line[start:].find("\t") -1
 :             if tagIndex != -1:
 :                 print("start is: ", start)
 :                 print("stop is: ", stop)
 :                 print("spliced word is ", line[start: stop])

Your problem is that you are calculating the value for 'stop' from a 
substring of 'line' (and then subtracting 1), when you should 
instead be adding the value of 'start'.  Replace the line that 
assigns to the 'stop' variable with the following:

  stop = line[start:].find("\t") + start
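
To see the difference, here is a small sketch that runs both 
calculations on the sample line from your message.  The variable 
names 'relative', 'old_stop' and 'new_stop' are mine, purely for 
illustration.

```python
# The sample line from the question, with tabs written as \t.
line = "this\tis\ta\tkey\ttest123\tnoise\t noise\tnoise \tnoise \tnoise"

tag_index = line.find("key")          # 10
start = tag_index + 4                 # 14, first character after "key\t"

# find() on the slice counts from the beginning of the slice, not of 'line'.
relative = line[start:].find("\t")    # 7

old_stop = relative - 1               # 6, which is less than start
new_stop = relative + start           # 21, the tab after "test123"

print(repr(line[start:old_stop]))     # ''
print(repr(line[start:new_stop]))     # 'test123'
```

Since the old value of 'stop' is smaller than 'start', the slice is 
empty, which is why nothing was printed.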

 : My question is the following: What is wrong w/ the variable 
 : 'stop'? The index it gives me when I print out 'stop' is not even 
 : close to the right number.  Furthermore, when I try to print out 
 : just the word following the tag w/ the form: line[start: stop], 
 : it prints nothing (it seems b/c my stop variable is incorrect).

Now, think about why this is happening....

You are calculating 'stop' based on a substring of 'line'.  You 
use the 'start' offset to create a substring, in which you then 
search for a tab.  The index that find() returns is relative to that 
substring, not to 'line'.  Then, you subtract 1 and try to use that 
to mean something in the original string 'line'.  Finally, you are 
slicing incorrectly (well, that's just the issue of subtracting 1 
when you shouldn't be), a not uncommon slicing problem (see this 
post for more detail [0]).
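
Another way to sidestep the offset arithmetic is the optional second 
argument to str.find(), which starts the search at a given index and 
returns a position in the original string.  The following is only a 
sketch: the helper name 'word_after_tag' is hypothetical, and it 
assumes the tag is always followed by a tab.  It is still a plain 
substring search, so it would also match a field that merely ends 
with the tag.

```python
def word_after_tag(line, tag="key"):
    """Return the field right after 'tag', or None if the tag is absent."""
    tag_index = line.find(tag + "\t")
    if tag_index == -1:
        return None
    start = tag_index + len(tag) + 1      # skip the tag and its tab
    stop = line.find("\t", start)         # index is relative to 'line'
    if stop == -1:                        # no later tab: word runs to the end
        stop = len(line)
    return line[start:stop].rstrip("\n")


print(word_after_tag("this\tis\ta\tkey\ttest123\tnoise\t noise"))
```

Checking for -1 before slicing matters, because find() signals "not 
found" with -1, and -1 is also a valid slice index.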

Finally, I have to wonder why you are doing so much of the work 
yourself, when ....

 : I would greatly appreciate any help you have.  This is a much 
 : simplified example from the script I'm actually writing, but I 
 : need to figure out a way to eliminate the noise after the key and 
 : the word immediately following it are found.

I realize that this is not quite what you asked, but from your 
example, it seems that you don't know about the 'csv' module.  It's 
convenient, simple, easy to use and quite robust.  This should help 
you.  I don't know much about your data format, nor why you are 
searching, but let's assume that you are searching where you wish to 
match 'key' as the contents of an entire field.  If that's the case, 
then:

  import sys
  import csv

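  def test(infile, outfile, sought):
      # Split each input line into fields on tab characters.
      tsv = csv.reader(infile, delimiter='\t')
      for row in tsv:
          # Match only when 'sought' is the entire contents of a field.
          if sought in row:
              outfile.write('\t'.join(row) + '\n')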

Now, how would you call this function?

  if __name__ == '__main__':
      test(sys.stdin, sys.stdout, sys.argv[1])

And, suppose you were at a command line, how would you call that?

  python tabbed-reader.py  < "$MYFILE" 'key'

OK, so the above function called 'test' is probably not quite what 
you had wanted, but you should be able to adapt it pretty readily.
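
For instance, one possible adaptation writes only the tag and the 
field that follows it, which is closer to the "key test123" output 
you described.  This is a sketch under my own assumptions: the name 
'write_tag_pairs' is hypothetical, the tag must match a whole field, 
and I have guessed that a single space should separate the two words 
in the output.

```python
import csv
import io


def write_tag_pairs(infile, outfile, sought):
    """Write 'sought' and the field after it for every match in infile."""
    tsv = csv.reader(infile, delimiter='\t')
    for row in tsv:
        # Stop one short of the end so that row[i + 1] always exists.
        for i, field in enumerate(row[:-1]):
            if field == sought:
                outfile.write(sought + ' ' + row[i + 1] + '\n')


# Small in-memory demonstration; real use would pass open file objects.
src = io.StringIO("this\tis\ta\tkey\ttest123\tnoise\t noise\nno\tmatch\there\n")
out = io.StringIO()
write_tag_pairs(src, out, 'key')
print(out.getvalue(), end='')
```

A tag that appears as the last field on a line has nothing after it, 
so this version silently skips it.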

Good luck,

-Martin

 [0] http://mail.python.org/pipermail/tutor/2010-December/080592.html

-- 
Martin A. Brown
http://linux-ip.net/