[Tutor] a quick Q: how to use for loop to read a series of files with .doc end

Wed Oct 5 15:38:34 CEST 2011

On 10/05/2011 08:46 AM, lina wrote:
> On Wed, Oct 5, 2011 at 8:21 PM, Dave Angel<d at davea.name>  wrote:
>
>>
>>>> #these two are capitalized because they're intended to be constant
>>>> TOKENS = "BE"
>>>> LINESTOSKIP = 43
>>>> INFILEEXT = ".xpm"
>>>> OUTFILEEXT = ".txt"
>>>>
>>>> def dofiles(topdirectory):
>>>>     for filename in os.listdr(topdirectory):
>>>>
>>> Here your typo is listdir not listdr,
>>         processfile(filename)
>>>> def processfile(infilename):
>>>>     base, ext =os.path.splitext(fileName)
>>>>
>>> Here I changed the fileName to infilename
>>     if ext == INFILEEXT:
>>>>         text = fetchonefiledata(infilename)
>>>>         numcolumns = len(text[0])
>>>>         results = {}
>>>>         for ch in TOKENS:
>>>>
>>>>             results[ch] = [0] * numcolumns
>>>>         for line in text:
>>>>             line = line.strip()
>>>>
>>>>             for col, ch in enumerate(line):
>>>>                 if ch in tokens:
>>>>
>>> Here I changed the tokens to TOKENS
>>                     results[ch][col] += 1
>>>>         writeonefiledata(base+OUTFILEEXT, results)
>>>>
>>>>
>>>> def fetchonefiledata(inname):
>>>>     infile = open(inname)
>>>>     text = infile.readlines()
>>>>     return text[LINESTOSKIP:]
>>>>
>>>> def writeonefiledata(outname, results):
>>>>     outfile = open(outname, "w")
>>>>     ...process the results as appropriate...
>>>>     ....(since you didn't tell us how multiple tokens were to be
>>>> displayed)
>>>>
>>>> if __name__ == "__main__":
>>>>     dofiles(".")     #or get the top directory from the sys.argv variable,
>>>> which is set from command line.
>>>>
>>>>
>>>> You dissect the former one you suggested before into 4 functions.
>>>>
>>> a little question, why choose .ext? why the splitext is also ext here?
>>>
>>>
>>>
>>>   Try the following, perhaps in the interpreter:
>> mytuple = ("one thing", "Another thing")
>> base, extension = mytuple
>>
>> Now look and see what base and extension have for values.
>>
>> Previously we just needed the second element of the splitext return value.
>>   This time we'll need both, so might as well put them in variables that have
>>   useful names.
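
The same unpacking applied to splitext() itself, as a minimal sketch.  The
filename "try.xpm" is only an illustrative name, not one taken from your
directory.

```python
import os.path

# os.path.splitext() returns a 2-tuple: (name without extension, extension)
parts = os.path.splitext("try.xpm")   # "try.xpm" is a made-up example name
base, ext = parts                     # unpack the tuple into two variables

print(parts)           # ('try', '.xpm')
print(base + ".txt")   # try.txt
```

Note that the extension comes back with its leading dot, which is why
INFILEEXT is ".xpm" and not "xpm".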
> Yes, thanks for reminding, I understand now.
>
>>
>>
>>> import os.path
>>>
>>>
>>> TOKENS="E"
>>> LINESTOSKIP=0
>>> INFILEEXT=".xpm"
>>> OUTFILEEXT=".txt"
>>>
>>> def dofiles(topdirectory):
>>>      for filename in os.listdir(topdirectory):
>>>          processfile(filename)
>>>
>>> def processfile(infilename):
>>>      base, ext =os.path.splitext(infilename)
>>>      if ext == INFILEEXT:
>>>          text = fetchonefiledata(infilename)
>>>          numcolumns=len(text[0])
>>>          results={}
>>>          for ch in TOKENS:
>>>
>>>              results[ch] = [0]*numcolumns
>>>          for line in text:
>>>              line = line.strip()
>>>
>>>              for col, ch in enumerate(line):
>>>                  if ch in TOKENS:
>>>                      results[ch][col]+=1
>>>          writeonefiledata(base+OUTFILEEXT,results)
>>>
>>> def fetchonefiledata(inname):
>>>      infile = open(inname)
>>>      text = infile.readlines()
>>>      return text[LINESTOSKIP:]
>>>
>>> def writeonefiledata(outname,results):
>>>      outfile = open(outname,"w")
>>>      for item in results:
>>>          return outfile.write(item)
>>>
>>>
>>> if __name__=="__main__":
>>>      dofiles(".")
>>>
>>> just the results is a bit unexpected.
>>>
>>>   $ more try.txt
>>> E
>>>
>>> I might make a mistake in the writeonefiledata your left part.
>>>
>>>   I'd be amazed if there weren't at least a couple of typos in my message.
>>   But this is where you sprinkle a couple of prints.  What did results look
>> like when you print it out?
>>
> Yes, you did keep some typos there.
> The result is kind of weird? only E there.
>
I ask again:  what did results look like when you printed it out?  I'm
referring to the argument to writeonefiledata().
> def writeonefiledata(outname,results):
Put these lines here:
    print("results is:", results)
    print("repr is:", repr(results))

>      outfile = open(outname,"w")
>      for item in results:
>          return outfile.write(item)
>
> This final part I made some mistakes?
>
Yes, you're iterating over the keys of a dictionary.  Since it only has
the key "E", that's what you get.  Try printing dir(results) to see what
methods might return something other than the keys.  Make the language
work for you.
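
One possible shape for that function, as a sketch only.  The "key: counts"
line layout is my assumption, since the thread never settled how the output
should look.  Note also that a return inside the for loop leaves the function
after the first item, so even with several keys only one would be written.

```python
def writeonefiledata(outname, results):
    # Write one line per token: the key as a heading, then its column counts.
    # The "key: n n n" layout is an assumption; change it to suit your needs.
    with open(outname, "w") as outfile:
        for key, counts in sorted(results.items()):
            outfile.write("%s: %s\n" % (key, " ".join(str(n) for n in counts)))
```

Using items() gives you each key together with its value, and the with
statement closes the file for you when the block ends.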
>> I hope you'll find that results is a dictionary, you might not want to just
>> write() its keys.  You probably want to write() its values instead, perhaps
>> with a heading showing what key you're printing.
> Later I wish to get the value of B+E, the two tokens. so the final results
> of each columns is enough. I will use this data to proceed further in
> future.
>
The code to get multiple keys is already there.  The only reason you're
getting only E is that you specified only one token.  Try changing it to

TOKENS = "EA"
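
To see the effect without a real file, here is a small sketch of the same
counting loop run on made-up data.  countcolumns() is a hypothetical helper
name, not something from the code above, and I strip the first line before
measuring it so the trailing newline isn't counted as a column.

```python
TOKENS = "EA"

def countcolumns(text, tokens=TOKENS):
    # Same counting loop as in processfile(), pulled out so it can run alone.
    numcolumns = len(text[0].strip())
    results = {}
    for ch in tokens:
        results[ch] = [0] * numcolumns
    for line in text:
        for col, ch in enumerate(line.strip()):
            if ch in tokens:
                results[ch][col] += 1
    return results

sample = ["EEDC\n", "AAAC\n", "F145\n", "CCCA\n"]   # made-up data
results = countcolumns(sample)
print(results["E"])   # [1, 1, 0, 0]
print(results["A"])   # [1, 1, 1, 1]

# Per-column total over all tokens, for when you want the sum of two tokens.
totals = [sum(column) for column in zip(*results.values())]
print(totals)         # [2, 2, 1, 1]
```

The last step is one way to get a combined count per column; with
TOKENS = "BE" it would give the B+E figure.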
>>
>>   But it gives you a simple refactoring that splits the logic so each can be
>>>> visualized (and tested) independently.  i'd also split up processfile(),
>>>> once I realized how big it was.
>>>>
>>>> There are many shortcuts that can be applied. Some of them probably use
>>>> language features you're not comfortable with, like perhaps generators.
>>>>   And
>>>> if  efficiency is important, there are optimizations to do, like using
>>>> islice directly on the infile object.  That one would eliminate having to
>>>> have the whole file stored in memory at one time.
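
A rough sketch of what the islice idea could look like.  iterdata() is a name
made up for illustration, and the caller would have to change too, since
processfile() indexes text[0] and that does not work on a generator.

```python
from itertools import islice

LINESTOSKIP = 43

def iterdata(inname, linestoskip=LINESTOSKIP):
    # Yield the lines after the header one at a time, instead of
    # building the whole list in memory with readlines().
    with open(inname) as infile:
        for line in islice(infile, linestoskip, None):
            yield line
```

A file object is already an iterator over its lines, so islice() can skip the
first linestoskip of them without ever holding the full file.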
>>>>
>>>> Likewise there are further things that could be done to decouple the
>>>> functions even more.
>>>>
>>>> But there's nothing in the above code which uses very advanced topics, so
>>>> you should be able to understand it and fix whatever typos I've
>>>> undoubtedly
>>>> got.
>>>>
>>>> What are you using for debugging aids?  Besides this group, I mean.
>>>>   print
>>>> statements?  An IDE ?  which one?
>>>>
>>>>   debugging aids?
>>> I just run python3 script.py
>>> it will pop up some hints,
>>> in the middle, probably try print.
>>>
>>>   Once the code is refactored into small enough independent functions, you
>> can do things like write multiple versions of a given function, for
>> debugging purposes.  For example, you could have another function called
>>   fetchonefiledata(), and have it return a list of strings.  For example, it
>> might be
>>
>> def fetchonefiledata(dummy):
>>     buf = """EEDC
>> AAAC
>> F145
>> CCCA
>> """
>>     return buf.split()
>>
>> and then you wouldn't be dependent on an actual file being available.
>>
>> Naturally, at that point, your top-level code would call processfile()
>> instead of dofiles().
>>
>> And remember the repr() and type() functions when trying to see just what
>> type of thing something is.
>>
> I have not figured it out how to use the repr() and type() yet.
So try them.  repr() shows you a lot more information about an object 
than str() does, and the latter is what you're getting when you print 
something directly.

And type() shows you the type of something.

And dir() shows you the attributes of something.  Usually what you're
interested in is the list of methods.  Anyway, once you find an
interesting one you can do help() on it.  For example, try
help({}.items)  (on Python 2 that would be {}.iteritems).
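
Something along these lines, typed into the interpreter or dropped into the
script, shows the difference.  The values here are made up, shaped like the
ones in your program.

```python
results = {"E": [1, 1, 0, 0]}   # made-up value, shaped like the results dictionary
line = "EEDC\n"

print(line)           # str(): the newline shows up only as a line break
print(repr(line))     # 'EEDC\n'  (quotes and the escaped newline are visible)
print(type(results))  # <class 'dict'>

# dir() minus the underscore names leaves mostly the ordinary methods
methods = [name for name in dir(results) if not name.startswith("_")]
print(methods)
```

repr() is the one to reach for when a string looks right on screen but
compares unequal, since it exposes trailing whitespace and newlines.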

> another question, you know in linux, when use TAB, can automatically input
> something,
> so in python3, are there some way they can intelligent give some hints or
> fill the left.
>
Sure, that's the job of the IDE.  If you just want auto-indentation,
emacs can do that with the python macros.  But if you want full method
expansion and such, look into one of a dozen IDEs.  I happen to use
Komodo, but there are others, free and non-free.  And there's IPython.
And I think there's something included in CPython, but I never looked
into it.

-- 

DaveA