Parse each line by character location

Tue Nov 4 12:43:47 EST 2008

On Nov 4, 11:45 am, Tyler <hayes.ty... at gmail.com> wrote:

> Hello All:
>
> I hope this is the right place to ask, but I am trying to come up with
> a way to parse each line of a file. Unfortunately, the file is neither
> comma, nor tab, nor space delimited. Rather, the character locations
> imply what field it is.
>
> For example:
>
> The first ten characters would be the record number, the next
> character is the client type, the next ten characters are a volume,
> and the next three are order type, and the last character would be an
> optional type depending on the order type.
>
> The lines are somewhat more complicated, but they work like that, and
> not all fields have to be populated, in that they may contain spaces.
> For example, the order number may be 2345, and it is space padded at
> the beginning of the line, and others might be zero padded in the
> front. Imagine I have a line:
>
> ______2345H0000300000_NC_
>
> where the underscores indicate a space. I then want to map this to:
>
> 2345,H,0000300000,NC,
>
> In other words, I want to preserve ALL of the fields, but map to
> something that awk could easily cut up afterwards, or open in a CSV
> editor. I am unsure how to place the commas based on character
> location.
>
> Any ideas?

Here's a general solution for fixed-size records:

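>>> def slicer(*sizes):
...     # Precompute one slice object per field from the field widths.
...     slices = len(sizes) * [None]
...     start = 0
...     for i, size in enumerate(sizes):
...         stop = start + size
...         slices[i] = slice(start, stop)
...         start = stop
...     # The returned function cuts a line into fields and strips the padding.
...     return lambda string: [string[s].strip() for s in slices]
...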
>>> order_slicer = slicer(10,1,10,3,1)
>>> order_slicer('______2345H0000300000_NC_'.replace('_',' '))
['2345', 'H', '0000300000', 'NC', '']

HTH,
George