Using PIL to find separator pages

Fri Jun 1 20:23:34 EDT 2007

Steve Holden wrote:
> Larry Bates wrote:
>> Steve Holden wrote:
>>> Larry Bates wrote:
>>>> I have a project that I wanted to solicit some advice
>>>> on from this group.  I have millions of pages of scanned
>>>> documents with each page in an individual .JPG file.
>>>> When the documents were scanned the people that did
>>>> the scanning put a colored (hot pink) separator page
>>>> between the individual documents.  I was wondering if
>>>> there was any way to utilize PIL to scan through the
>>>> individual files, look at some small section on the
>>>> page, and determine if it is a separator page by
>>>> somehow comparing the color to the separator page
>>>> color?  I realize that this would be some sort of
>>>> percentage match where 100% would be a perfect match
>>>> and any number lower would indicate that it was less
>>>> likely that it was a separator page.
>>>>
>>>> Thanks in advance for any thoughts or advice.
>>>>
>>> I suspect the easiest way would be to select a few small patches of each
>>> image and average the color values of the pixels, then normalize to hue
>>> rather than RGB.
>>>
>>> Close enough to the hue you want (and you could include saturation and
>>> intensity too, if you felt like it) across several areas of the page
>>> would be a hit for a separator.
>>>
>>> regards
>>>  Steve
>>
>> Steve,
>>
>> I'm completely lost on how to proceed.  I don't know how to average color
>> values, normalize to hue...  Any guidance you could give would be greatly
>> appreciated.
>>
>> Thanks in advance,
>> Larry
> 
> I'd like to help but I don't have any sample code to hand. Maybe someone
> who does could give you more of a clue. Let's hope so, anyway ...
> 
> regards
>  Steve

I think I've come up with something that will work.  I use PIL
Image.getcolors() to get colors and take the top 10 colors of my
background page.  I then calculate the average of the R, G, B
components.  That becomes my reference.  Then I read a page and
make the same calculation.  I then take the absolute value of
the difference between the two averages for each of R, G, and B.
Summing those gives a rough measure of the difference between
the two average colors (at least that is what I think it does).
This seems to give me small numbers when the pages are the same
and large numbers when they are different.  It isn't super fast
but it is working.
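
The calculation described above might look roughly like the sketch below. It is not the actual script: the function names, the unweighted mean over the top 10 colors, and the maxcolors setting are all assumptions made for illustration.

```python
from PIL import Image


def average_top_colors(img, top_n=10):
    """Return the mean (R, G, B) of the top_n most frequent colors."""
    rgb = img.convert("RGB")
    # getcolors() returns None when the image has more than maxcolors
    # distinct colors, so allow for one color per pixel.
    colors = rgb.getcolors(maxcolors=rgb.width * rgb.height)
    # Each entry is (count, (r, g, b)); put the most frequent first.
    colors.sort(key=lambda item: item[0], reverse=True)
    top = [color for _count, color in colors[:top_n]]
    # Assumption: a plain, unweighted mean of each channel.
    return tuple(sum(c[i] for c in top) / len(top) for i in range(3))


def color_distance(a, b):
    """Sum of absolute per-channel differences between two colors."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A page would then be compared with color_distance(reference, average_top_colors(Image.open(path))), and a small result would suggest a separator page. The cutoff has to be found by trial on real scans. If speed matters, shrinking each page with Image.thumbnail() before counting colors may help, though that has not been measured here.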

Thanks for pushing me in the right direction.

-Larry
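
For comparison, Steve's earlier suggestion (average a few small patches, then compare hue rather than RGB) could be sketched as follows. The patch positions, the tolerances, and the function names are hypothetical and would need tuning against real scans. The saturation check is there because white or gray pages have no meaningful hue.

```python
import colorsys

from PIL import Image, ImageStat

# Hypothetical patch positions, given as (left, top, right, bottom)
# fractions of the page size.
PATCHES = [
    (0.10, 0.10, 0.20, 0.20),
    (0.45, 0.45, 0.55, 0.55),
    (0.80, 0.80, 0.90, 0.90),
]


def patch_hsv(rgb_img, frac_box):
    """Average one patch of an RGB image and return it as (h, s, v) in 0..1."""
    width, height = rgb_img.size
    left, top, right, bottom = frac_box
    box = (int(left * width), int(top * height),
           int(right * width), int(bottom * height))
    red, green, blue = ImageStat.Stat(rgb_img.crop(box)).mean
    return colorsys.rgb_to_hsv(red / 255.0, green / 255.0, blue / 255.0)


def is_separator(img, ref_hue, hue_tol=0.05, min_sat=0.3):
    """True if every patch is saturated and close to ref_hue (0..1)."""
    rgb_img = img.convert("RGB")
    for frac_box in PATCHES:
        hue, sat, _val = patch_hsv(rgb_img, frac_box)
        diff = abs(hue - ref_hue)
        diff = min(diff, 1.0 - diff)  # hue wraps around the color wheel
        if sat < min_sat or diff > hue_tol:
            return False
    return True
```

The reference hue would come from running patch_hsv() on a known separator page. Requiring every patch to match is one possible rule; accepting a majority of patches would be more forgiving of stains or handwriting on the separator sheet.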