table (ascii text) layout recognition

James Stroud jstroud at mbi.ucla.edu
Wed Sep 13 11:05:08 CEST 2006


vbfoobar at gmail.com wrote:
> Hello,
> 
> I am looking for Python code to process
> tables that are in ASCII text. The code must
> determine where the columns (fields) are.
> The tables in my application vary,
> but their columns are not very hard
> for a human to locate, because even
> when ignoring the meaning of the words,
> our eyes see the vertical alignments.
> 
> Here is a sample table (must be viewed
> with fixed-width font to see alignments):
> =================================
> 
> 44544      ipod          apple     black         102
> GFGFHHF-12 unknown thing bizar     brick mortar  tbc
> 45fjk      do not know   + is less               biac
>            disk          seagate   250GB         130
> 5G_gff                   tbd       tbd
> gjgh88hgg  media record  a and b                 12
> hjj        foo           bar       hop           zip
> hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
> qdsd       zert                    nope          nope
> 
> =================================
> 
> I want Python code that builds a representation
> of this table (for example a list of lists, where each list
> represents a table line, each element of the list
> being a field value).
> 
> Any hints?
> thanks
> 

As promised. I call this the "cast a shadow" algorithm for table 
discovery. This is about as obfuscated as I could make it. It will be up 
to you to explain it to your teacher ;-)

This assumes the lines are all of equal width (right-padded with spaces), 
which you can arrange with, e.g.:

def rpadd(lines):
    """
    Pass in the lines as a list of lines.
    Returns them right-padded with spaces to the longest line.
    """
    lines = [line.rstrip() for line in lines]
    maxlen = max(len(line) for line in lines)
    return [line + ' ' * (maxlen - len(line)) for line in lines]


In which case, you can:


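lines = rpadd(lines)

# Mark each character: 2 for a space, 1 for anything else.
binary = [[((s == ' ' and 2) or 1) for s in line] for line in lines]
# A character column casts a shadow if any line has a non-space there.
shadow = [1 in c for c in zip(*binary)]

# Record every index where the shadow flips between off and on.
isit = False
indices = []
for i, v in enumerate(shadow):
    if v is not isit:
        indices.append(i)
        isit = not isit

# Close the last column, which runs to the end of the padded lines.
indices.append(i + 1)

# Pair the flips up into (start, end) slices, one per table column.
indices = list(zip(indices[::2], indices[1::2]))

columns = [[line[t[0]:t[1]].strip() for line in lines] for t in indices]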


In case you want rows:

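# list() is needed on Python 3, where zip() returns an iterator.
# Each row comes back as a tuple of field values.
rows = list(zip(*columns))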


James


