Browsing text ; Python the right tool?

Mon Jan 31 18:32:59 EST 2005

Sorry for the late reply, guys - I cannot access news from work, and I could 
not reply to a message through Google Groups, so I had to do it from home. 
Let me address a few of the remarks and questions you raised:

First of all, the example I gave was just that - an example. Yes, I know 
Python indexing starts at 0, and I know that you cannot fit a 4-digit number 
in 2 positions; this was just to give the idea. To clarify, at THIS moment I 
need to browse text files of 1-80 MB. At this moment, I have 16 different 
record definitions, named A, B, C1-C8, D-H. Each record definition has 
20-60 different attributes.

Not only that, but these formats change regularly; and I want to create or 
use something I can use on *other* applications or sites as well. As I said, 
I have encountered the type of problem I've described in numerous places 
already.

> John wrote:
> I have a Python script that takes layout info and an input file and can
> produce an output file in one of two formats:

Yes John, I was thinking along these lines myself. The problem is that I 
have to parse several of these large files each day (debugging), and browsing 
converted output seems just too tedious and inefficient. I would REALLY like 
a GUI, and preferably something portable I can re-use later on.

> This should be pretty easy.  If each record is CRLF terminated, then you 
> can get one record at a time simply by iterating over the file ("for line 
> in open('myfile.dat'): ...").

Jeff, this was indeed the way I was thinking. But instead of iterating I 
need the ability to browse forward and backward.
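
A minimal sketch of one way to get forward and backward browsing without 
loading the whole file: scan it once, remember the byte offset of every 
record, and seek to whichever record is wanted. It assumes each record ends 
in a newline (CRLF or LF) and that the data is plain ASCII; the names 
build_index and read_record are made up for illustration.

```python
def build_index(path):
    """Return a list holding the byte offset at which each record starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:              # binary mode: lines keep their terminator
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_record(path, offsets, n):
    """Return record n (0-based) as text, without its line terminator."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        raw = f.readline()
    return raw.rstrip(b"\r\n").decode("ascii", errors="replace")
```

A browser would keep a current record number and call read_record with 
n + 1 or n - 1 to step forward or backward. For an 80 MB file the index is 
one integer per record, which should fit in memory comfortably.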

> You can have a dictionary of classes or factory functions, one for each 
> record type, keyed off of the 2-character identifier.  Each class/factory 
> would know the layout of that record type, and return a(n) 
> instance/dictionary with fields separated out into attributes/items.

This is of course a clean approach, but it would mean re-coding every time a 
record is changed - which happens frequently! I really would like to edit 
only a data definition file.

> The trickiest part would be in displaying the data; you could potentially 
> use COM to insert it into a Word or Excel document, or code your own GUI 
> in Python.  The former would be pretty easy if you're happy with fairly 
> simple formatting; the latter would require a bit more effort, but if you 
> used one of Python's RAD tools (Boa Constructor, or maybe PythonCard, as 
> examples) you'd be able to get very nice results.

I will at least look into Boa and PythonCard. Thanks for the hint.

> This is plausible only under the condition that Santa Claus is paying
> you $X per class/factory or per line of code, or you are so speed-crazy
> that you are machine-generating C code for the factories.

Unfortunately, neither is the case :)

> I'd suggest "data driven"

Yeah!

> Then you need a function to load this layout file into dictionaries,
> and build cross-references field_name -> field_number (0,1,2,...) and
> vice versa.

> As your record name is not in a fixed position in the record, you will
> also need to supply a function (file_type, record_string) ->
> record_name.

I thought about supplying a flat ASCII definition such as:

[record type] <TAB> [fieldname] <TAB> [start] <TAB> [end]
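
A rough sketch of how such a definition could be loaded, assuming one field 
per line, start and end given as 1-based inclusive column numbers, and lines 
starting with "#" treated as comments. The function name load_layout and the 
sample field names are invented for illustration only.

```python
import csv
import io


def load_layout(lines):
    """Build {record_type: [(field_name, start, end), ...]} from tab-separated rows."""
    layout = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                      # skip blank lines and comments
        record_type, field_name, start, end = row
        layout.setdefault(record_type, []).append(
            (field_name, int(start), int(end)))
    return layout


# Hypothetical definition text; a real one would be read with open(...).
sample = io.StringIO(
    "A\tinvoice_no\t3\t10\n"
    "A\tcustomer\t11\t40\n"
    "C1\tamount\t3\t12\n"
)
layout = load_layout(sample)
```

Changing a record format then means editing the definition file only; the 
loading code stays the same.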

> Then you have *ONE* function that takes a file_type, a record_name, and
> a record_string, and gives you a list of the values. That is all you
> need for a generic browser application.

I like this.
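
A minimal sketch of that single function, under some assumptions: the layout 
is a dictionary mapping a record name to a list of (field_name, start, end) 
tuples with 1-based inclusive columns, and it is passed in directly instead 
of being looked up by file_type. The name split_record and the sample data 
are hypothetical.

```python
def split_record(layout, record_name, record_string):
    """Return the field values of one record, in layout order."""
    return [record_string[start - 1:end].strip()
            for _name, start, end in layout[record_name]]


# Hypothetical layout and record, for illustration only.
layout = {"A": [("invoice_no", 3, 10), ("customer", 11, 40)]}
line = "A " + "00012345" + "Example Customer".ljust(30)
values = split_record(layout, "A", line)
```

Pairing the values with the field names from the same layout entry would 
give the name/value rows a generic browser needs to display.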

> You *don't* have to hand-craft a class for each record type. And you
> wouldn't want to, if you were dealing with files whose spec keeps on
> having fields added and fields obsoleted.

Exactly.

> I think that's overly pessimistic.  I *was* presuming a case where the 
> number of record types was fairly small, and the definitions of those 
> records reasonably constant.  For ~10 or fewer types whose spec doesn't 
> change, hand-coding the conversion would probably be quicker and/or more 
> straightforward than writing a spec-parser as you suggest.

Unfortunately, all wrong :)

Lots of records, lots of changes, lots of different record types - 
hardcoding doesn't seem the right way.

> "Parse"? No parsing, and not much code at all: The routine to "load"
> (not "parse") the layout from the layout.csv file into dicts of dicts
> is only 35 lines of Python code. The routine to take an input line and
> serve up an object instance is about the same. It does more than the
> OP's browsing requirement already. The routine to take an object and
> serve up a correctly formatted output line is only 50 lines of which
> 1/4 is comment or blank.

John, do you have suggestions where I can find examples of these functions? 
I can program, but I am not proficient in Python, so any help or examples I 
can adapt would be nice.

> Also, files used to "create printed pages by
> an external company" (especially by a company that had "leaseplan" in
> its e-mail address) would indicate "many" and "complicated" to me.

How right you are. Think about production runs of 150,000 invoices, each 
invoice consisting of 2-10 records, and you are on the right track.

> I suspect
> that we're both assuming a case similar to our own personal
> experiences, which are different enough to lead to different
> preferred solutions. ;)

Seconded.

> My personal experiences and attitudes: (1) extreme aversion to having
> to type (correctly) lots of numbers (column positions and lengths), and
> to having to mentally translate start = 663, len = 13 to [662:675] or
> having ugliness like [663-1:663+13-1] (2) cases like 17 record types
> and 112 fields in one file, 8 record types and 86 fields in a second --
> this being a new relatively clean simple exercise in exchanging files
> with a government department (3) Past history of this govt dept is that
> there are at least another 7 file types in regular use and they change
> the _major_ version number of each file type about once a year on
> average (4) These things tend to start out deceptively small and simple
> and turn into monsters.

Our experiences are remarkably similar...
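
On the start/len translation mentioned in the quote: a tiny helper can hide 
the off-by-one arithmetic, so the definition file can keep the 1-based 
numbers from the spec. This is only a sketch; field_slice is a made-up name.

```python
def field_slice(start, length):
    """Convert a 1-based start column and a length into a Python slice."""
    return slice(start - 1, start - 1 + length)


# start = 663, len = 13 corresponds to [662:675]
s = field_slice(663, 13)
```

A record string can then be indexed directly, as in record[field_slice(663, 13)].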

Cheers,
Paul