More Help with python .find function
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Jan 8 00:35:45 EST 2011
On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:
> My previous question asked how to read a file into a structure a line at
> a time. Figured it out. Now I'm trying to use .find to separate out
> the PDF objects. (See code) PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj. Any help available? Any
> insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')  # This is PDF with which we will work
> lines = inputfile.readlines()        # read file one line at a time
That's incorrect. readlines() reads the entire file in one go, and splits
it into individual lines.
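To make the difference concrete, here is a minimal sketch. It uses an in-memory io.BytesIO object as a stand-in for the opened file; that stand-in and the sample bytes are assumptions for illustration, not your PDF. It is written to run under Python 3.

```python
import io

# Hypothetical stand-in for the file's contents; the bytes are made up.
data = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"

# readlines() consumes the whole stream and returns a list of lines.
all_lines = io.BytesIO(data).readlines()

# Iterating over the file object yields one line at a time instead.
lazy_lines = []
for line in io.BytesIO(data):
    lazy_lines.append(line)

print(len(all_lines))
print(all_lines == lazy_lines)
```

Both approaches end up with the same lines; the difference is that readlines() holds all of them in memory at once.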
> linestart = []  # Starting address for each line
> lineend = []    # Ending address for each line
> linetype = []
*raises eyebrow*
How is an empty list a starting or ending address?
The only thing worse than no comments where you need them is misleading
comments. A variable called "linestart" implies that it should be a
position, e.g. linestart = 0. Or possibly a flag.
> print len(lines) # print number of lines
>
> i = 0 # define an iterator, i
Again, 0 is not an iterator. 0 is a number.
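For what it's worth, here is a small sketch of the distinction. The sample list is made up, and the variable names are only for illustration.

```python
lines = ["first\n", "second\n"]  # made-up sample data

# iter() on a list returns an iterator; next() pulls items from it.
it = iter(lines)
first = next(it)

# A plain integer is not iterable, so iter(0) raises TypeError.
try:
    iter(0)
    zero_is_iterable = True
except TypeError:
    zero_is_iterable = False

print(first)
print(zero_is_iterable)
```

What you have in i is better described as an index or a counter.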
> addr = 0 # and address pointer
>
> while i < len(lines):  # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1
Complicated and confusing and not the way to do it in Python. Something
like this is much simpler:
linetypes = []  # note plural
inputfile = open('sample.pdf','rb')  # Don't use file, use open.
for line_number, line in enumerate(inputfile):
    # Process one line at a time. No need for that nonsense with manually
    # tracked line numbers, enumerate() does that for us.
    # No need to initialise linetypes.
    status = 'normal'
    i = line.find(' obj')
    if i >= 0:
        print "Object found at offset %d in line %d" % (i, line_number)
        status = 'object'
    i = line.find('endobj')
    if i >= 0:
        print "endobj found at offset %d in line %d" % (i, line_number)
        if status == 'normal': status = 'endobj'
        else: status = 'object & endobj'  # both found on the one line
    linetypes.append(status)
    # What if obj or endobj exist more than once in a line?
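That last question is probably why you are missing instances: find() only reports the first match. One way to handle repeats is to call find() again with a start position just past the previous match. The helper name find_all below is hypothetical (it is not a built-in), the sample line is made up, and the sketch is written to run under Python 3.

```python
def find_all(line, token):
    """Return every offset at which token occurs in line (non-overlapping)."""
    offsets = []
    start = line.find(token)
    while start >= 0:
        offsets.append(start)
        # Resume the search just past the match we have just recorded.
        start = line.find(token, start + len(token))
    return offsets

sample = b"1 0 obj endobj 2 0 obj endobj"  # made-up line
print(find_all(sample, b"endobj"))
print(find_all(sample, b" obj"))
```

An empty list means the token does not occur in the line at all, which replaces the "i >= 0" test.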
One last thing... if PDF files are a binary format, what makes you think
that they can be processed line-by-line? They may not have lines, except
by accident.
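As a rough sketch of the alternative, you can search the whole file contents as one buffer, so line breaks stop mattering. The bytes below are made up and use bare carriage returns as separators, which binary-mode line reading does not split on; a real file would be read with something like open(name, 'rb').read() instead. Written to run under Python 3.

```python
import io
import re

# Made-up bytes standing in for a file's contents; note the bare '\r'
# separators rather than '\n'.
data = b"1 0 obj\r<< >>\rendobj\r2 0 obj\r<< >>\rendobj\r"

# Line-by-line reading in binary mode sees no '\n', so it finds one "line".
line_count = len(io.BytesIO(data).readlines())

# Searching the whole buffer does not depend on line breaks at all.
endobj_offsets = [m.start() for m in re.finditer(b"endobj", data)]

print(line_count)
print(endobj_offsets)
```

The offsets here are positions in the file itself, so there is no need to add up line lengths to recover them.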
--
Steven