[Tutor] Simple text file processing using fileinput module. "Grabbing successive lines" failure

Tue Jul 3 11:35:22 CEST 2012

Flynn, Stephen (L & P - IT) wrote:

> Tutors,
> 
> Whilst having a play around with reading in text files and reformatting
> them I tried to write a Python 3.2 script to read a CSV file, looking for
> any records which were short (indicating that the data may well contain an
> embedded CR/LF). I've attached a small sample file with a "split record" at
> line 3, and my code.
> 
> Call the code with
> 
> Python pipesmoker.py MyFile.txt ,
> 
> (first parameter is the file being read, second parameter is the field
> separator... a comma in this case)
> 
> I can read the file in, I can determine that I'm looking for records which
> have 13 fields and I can find a record which is too short (line 3).
> 
> What I can't do is read the line that follows a short line in order to
> append it onto the end of the short line before writing the entire amended
> line out. I'm still thinking about how to persuade the fileinput module to
> leap over the successor line so it doesn't get processed again.
> 
> When I run the code as it stands, I get a traceback as I'm obviously not
> using fileinput.FileInput.readline() correctly.
> 
> value of file is C:\myfile.txt
> value of the delimiter is ,
> I'm looking for  13 , in each currentLine...
> ï»¿"1","0000000688      ","ABCD","930020854","34","0","1"," ","930020854
> ","          ","0","0","0","0"
> 
> "2","0000000688      ","ABCD","930020854","99","0","1"," ","930020854 "," 
>         ","0","0","0","0"
> 
> short line found at line 3
> Traceback (most recent call last):
>   File "C:\Documents and
>   Settings\flynns\workspace\PipeSmoker\src\pipesmoker\pipesmoker.py", line
>   35, in <module>
>     nextLine = fileinput.FileInput.readline(args.file)
>   File "C:\Python32\lib\fileinput.py", line 301, in readline
>     line = self._buffer[self._bufindex]
> AttributeError: 'str' object has no attribute '_buffer'
> 
> 
> Can someone explain to me how I am supposed to make use of readline() to
> grab the next line of a text file please? It may be that I should be using
> some other module, but chose fileinput as I was hoping to make the little
> routine as generic as possible; able to spot short lines in tab separated,
> comma separated, pipe separated, ^~~^ separated and anything else which my
> clients feel like sending me.

As you already learned, the csv module is the best tool to address your
problem.
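
For reference, here is a minimal sketch of how csv copes with a line break
inside a quoted field. The sample data is made up for illustration and much
shorter than your real records; only csv.reader() and io.StringIO are
standard library:

```python
import csv
import io

# Hypothetical sample: the second field of record 1 contains a line break.
sample = '"1","930020854\n","0"\n"2","930020855","0"\n'

# newline="" leaves line breaks inside quoted fields untouched
# (pass the same argument to open() when reading a real file).
rows = list(csv.reader(io.StringIO(sample, newline=""), delimiter=","))

for row in rows:
    print(len(row), row)
```

Both records come back with three fields; the embedded line break stays
inside the second field of the first record instead of splitting it.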

However, I'd like to show a generic way to get an extra item in a for-loop.

Instead of iterating over the "iterable" (a list or a FileInput object or 
whatever) you first convert it into an iterator explicitly with the iter() 
built-in function and keep the reference around:

iterable = ...
it = iter(iterable)

Then inside the for-loop you get an extra item with the next() function:

for item in it:
    if some_condition():
        extra = next(it)

next() also accepts a default value as its second argument; without one it
raises a StopIteration exception when the iterator is already exhausted.
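
A small sketch of the difference, using a one-item list as a stand-in for
your file (the variable names are only for illustration):

```python
it = iter(["only item"])

first = next(it)          # consumes the single item
missing = next(it, None)  # iterator is exhausted, so the default is returned

try:
    next(it)              # no default: an exhausted iterator raises
    raised = False
except StopIteration:
    raised = True

print(first, missing, raised)
```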

Here's a self-contained example:

>>> items = "alpha- beta gamma- delta- epsilon zeta".split()
>>> it = iter(items)
>>> for item in it:
...     while item.endswith("-"):
...         item += next(it)
...     print(item)
... 
alpha-beta
gamma-delta-epsilon
zeta
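
Applied to your short-line problem, the same idea might look like the sketch
below. join_short_records() and its parameters are hypothetical names, not
part of fileinput, and the field count is a naive delimiter count that
ignores delimiters inside quoted fields (csv handles those properly):

```python
def join_short_records(lines, delimiter=",", expected_fields=13):
    """Yield lines, gluing a short line to the line(s) that follow it."""
    it = iter(lines)
    for line in it:
        # Naive check: number of fields = number of delimiters + 1.
        while line.count(delimiter) + 1 < expected_fields:
            extra = next(it, None)  # default avoids StopIteration at EOF
            if extra is None:
                break               # input ended inside a short record
            # Drop the stray line break, then append the continuation.
            line = line.rstrip("\r\n") + extra
        yield line


# Hypothetical four-field data; the second record is split in two.
sample = ["a,b,c,d\n", "e,f,g\n", ",h\n", "i,j,k,l\n"]
fixed = list(join_short_records(sample, expected_fields=4))
for line in fixed:
    print(line, end="")
```

Because the for-loop and next() share one iterator, the continuation line is
consumed and never processed a second time. A FileInput object is iterable
too, so the result of fileinput.input() could be passed in place of the list.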