More Help with python .find function

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jan 8 00:35:45 EST 2011


On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

> My previous question asked how to read a file into a structure a line at
> a time.  Figured it out.  Now I'm trying to use .find to separate out
> the PDF objects.  (See code)  PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj. Any help available?  Any
> insights?
> 
> #!/usr/bin/python
> 
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time

That's incorrect. readlines() reads the entire file in one go, and splits 
it into individual lines.
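
To make the difference concrete, here is a small sketch. It uses io.BytesIO 
with made-up data as a stand-in for your real file (that substitution is my 
assumption, purely so the example is self-contained). readlines() hands back 
the complete list in one go, while looping over the file object yields one 
line at a time.

```python
import io

# Stand-in for a file opened in binary mode (illustrative data only).
data = b"first line\nsecond line\nthird line\n"

# readlines() consumes the whole file and returns a list of all its lines.
all_lines = io.BytesIO(data).readlines()

# Iterating over the file object yields the lines lazily, one at a time.
lazy_lines = []
for line in io.BytesIO(data):
    lazy_lines.append(line)
```

Both approaches see the same lines; the difference is that the loop never 
needs the whole file in memory at once.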


> linestart = []    # Starting address for each line
> lineend = []      # Ending address for each line
> linetype = []

*raises eyebrow*

How is an empty list a starting or ending address?

The only thing worse than no comments where you need them is misleading 
comments. A variable called "linestart" implies that it should be a 
position, e.g. linestart = 0. Or possibly a flag.


> print len(lines)                                # print number of lines
> 
> i = 0                                           # define an iterator, i

Again, 0 is not an iterator. 0 is a number.
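
If the distinction seems like nitpicking, a quick sketch may help. The names 
letters and it below are just illustrative: an iterator is an object you can 
call next() on, and an int is nothing of the sort.

```python
i = 0                       # an int: a plain number, usable as an index
letters = ['a', 'b', 'c']
it = iter(letters)          # an actual iterator over the list

first = next(it)            # 'a'
second = next(it)           # 'b'

# Ints are not even iterable, so iter(0) raises TypeError.
try:
    iter(i)
    int_is_iterable = True
except TypeError:
    int_is_iterable = False
```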


> addr = 0                                        # and address pointer
>
> while i < len(lines):                           # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1

Complicated and confusing and not the way to do it in Python. Something 
like this is much simpler:


linetypes = []  # note plural
inputfile =  open('sample.pdf','rb')  # Don't use file, use open.

for line_number, line in enumerate(inputfile):
    # Process one line at a time. No need for that nonsense with manually
    # tracked line numbers, enumerate() does that for us.
    # No need to pre-fill linetypes either; we append to it as we go.
    status = 'normal'
    i = line.find(' obj')
    if i >= 0:
        print "Object found at offset %d in line %d" % (i, line_number)
        status = 'object'
    i = line.find('endobj')
    if i >= 0:
        print "endobj found at offset %d in line %d" % (i, line_number)
        if status == 'normal': status = 'endobj'
        else: status = 'object & endobj'  # both found on the one line
    linetypes.append(status)
    # What if obj or endobj exist more than once in a line?
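
As for that last question: find() only ever reports the first match, which 
may well be why you are missing instances of endobj. One possible way to 
collect every match is to call find() repeatedly with its optional start 
argument. This is only a sketch; find_all is a helper name I made up, not 
something from the standard library, and the sample string is invented.

```python
def find_all(line, target):
    """Return a list of every offset at which target occurs in line."""
    if not target:
        raise ValueError("target must not be empty")
    offsets = []
    start = 0
    while True:
        pos = line.find(target, start)
        if pos == -1:               # no further matches
            break
        offsets.append(pos)
        start = pos + len(target)   # resume just past this match
    return offsets

sample = '1 0 obj << >> endobj 2 0 obj << >> endobj'
endobj_offsets = find_all(sample, 'endobj')   # [14, 35]
```

Inside the loop above you could then replace each single find() call with 
find_all() and report every offset rather than just the first.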



One last thing... if PDF files are a binary format, what makes you think 
that they can be processed line-by-line? They may not have lines, except 
by accident.
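
If lines turn out to be unreliable, one alternative is to read the whole file 
as a single byte string and search that instead. A rough sketch follows, 
again with io.BytesIO and invented bytes standing in for the real file (the 
data is not a valid PDF). Bear in mind that the bytes "endobj" could 
presumably also turn up inside compressed stream data by coincidence, so 
treat this as a starting point rather than a proper PDF parser.

```python
import io

# Illustrative bytes only, not a valid PDF. Note there are no newlines
# between the two objects at all.
data = b"1 0 obj<</A 1>>endobj 2 0 obj<</B 2>>endobj"

stream = io.BytesIO(data)    # stand-in for open('sample.pdf', 'rb')
contents = stream.read()     # the whole file as one byte string

# Collect the offset of every occurrence of the marker in the file.
offsets = []
pos = contents.find(b'endobj')
while pos != -1:
    offsets.append(pos)
    pos = contents.find(b'endobj', pos + 1)
```

Here offsets ends up as [15, 37], even though a line-by-line approach would 
have seen the whole thing as one "line".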


-- 
Steven


