table (ascii text) layout recognition
James Stroud
jstroud at mbi.ucla.edu
Wed Sep 13 05:05:08 EDT 2006
vbfoobar at gmail.com wrote:
> Hello,
>
> I am looking for python code useful to process
> tables that are in ASCII text. The code must
> determine where are the columns (fields).
> Concerned tables for my application are various,
> but their columns are not very complicated
> to locate for a human, because even
> when ignoring the semantic of words,
> our eyes see vertical alignments
>
> Here is a sample table (must be viewed
> with fixed-width font to see alignments):
> =================================
>
> 44544 ipod apple black 102
> GFGFHHF-12 unknown thing bizar brick mortar tbc
> 45fjk do not know + is less biac
> disk seagate 250GB 130
> 5G_gff tbd tbd
> gjgh88hgg media record a and b 12
> hjj foo bar hop zip
> hg uy oi hj uuu ii a qqq ccc v ZZZ Ughj
> qdsd zert nope nope
>
> =================================
>
> I want the python code that builds a representation
> of this table (for exemple a list of lists, where each list
> represents a table line, each element of the list
> being a field value).
>
> Any hints?
> thanks
>
As promised. I call this the "cast a shadow" algorithm for table
discovery. This is about as obfuscated as I could make it. It will be up
to you to explain it to your teacher ;-)
Assuming the lines are all of equal width (padded on the right with spaces), e.g.:

def rpadd(lines):
    """
    Pass in the lines as a list of lines.
    """
    lines = [line.rstrip() for line in lines]
    maxlen = max([len(line) for line in lines])
    return [line + ' ' * (maxlen - len(line)) for line in lines]

In which case, you can:

binary = [[((s==' ' and 2) or 1) for s in line] for line in lines]
shadow = [1 in c for c in zip(*binary)]
isit = False
indices = []
for i,v in enumerate(shadow):
    if v is not isit:
        indices.append(i)
        isit = not isit
indices.append(i+1)
indices = [t for t in zip(indices[::2],indices[1::2])]
columns = [[line[t[0]:t[1]].strip() for line in lines] for t in indices]

In case you want rows:

rows = list(zip(*columns))

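For anyone who would rather read than decode it, here is a minimal sketch of the same "shadow" idea, spelled out. The names find_columns and split_table and the sample table are made up for illustration (the alignment of the original sample did not survive quoting), so treat this as one possible reading under those assumptions rather than the definitive version. A character position counts as part of a column if any line has a non-space there; runs of such positions become the column spans.

```python
def find_columns(lines):
    """Return (padded_lines, spans), where spans are (start, end) column slices.

    Hypothetical helper: a readable restatement of the shadow approach.
    """
    stripped = [line.rstrip() for line in lines]
    width = max(len(line) for line in stripped)
    padded = [line.ljust(width) for line in stripped]

    # A position is "in shadow" if at least one line has a non-space there.
    shadow = [any(ch != ' ' for ch in col) for col in zip(*padded)]

    spans = []
    start = None
    for i, dark in enumerate(shadow):
        if dark and start is None:
            start = i                      # a column begins
        elif not dark and start is not None:
            spans.append((start, i))       # a column ends
            start = None
    if start is not None:
        spans.append((start, width))       # last column runs to the edge
    return padded, spans


def split_table(lines):
    """Return the table as a list of rows, each a list of field strings."""
    padded, spans = find_columns(lines)
    return [[line[a:b].strip() for a, b in spans] for line in padded]


# Made-up sample table; every column gap is blank in all rows.
sample = [
    "44544      ipod     apple   102",
    "disk       seagate  250GB   130",
    "5G_gff              tbd     tbd",
    "hjj        foo      bar",
]

for row in split_table(sample):
    print(row)
```

Note the limitation this shares with the code above: two columns are only told apart if at least one character position between them is blank in every line, and a multi-word field splits in two if its internal space happens to line up in all rows.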
James