[Tutor] fixed or variable length fields?

Paul Tremblay phthenry@earthlink.net
Sat Mar 1 13:17:01 2003


On Sat, Mar 01, 2003 at 01:57:36PM +0100, Michael Janssen wrote:
> 
> On Sat, 1 Mar 2003, Paul Tremblay wrote:
> 
> > ob<nu<nu<nu<0001<{
> > cw<nu<nu<nu<rtf>true<rtf
> > cw<nu<nu<nu<macintosh>true<macintosh
> > cw<nu<nu<nu<font-table>true<font-table
> 
> I took a glimpse into an RTF document, and it looks different. The
> font table, for example, appears in a line like this:
> {\rtf1\ansi\ansicpg1252\deff0{\fonttbl  [continues spaceless]
> {\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\fnil\fcharset2 Symbol;}}
> 
> Are your "lines of tokens" data from an intermediate step (or is RTF
> really that loosely standardized)? Do they represent an atomic rewrite
> of the information in hairy lines like the above?

Right. The tokens are the result of two steps of processing. The first
token is "{". I translate this to 'ob<nu<nu<nu<0001<{', which means 'open
bracket, null, null, null, this is bracket number 1, original token is "{"'.

I put some valuable info in each field. For example, for the color table
I put 'cw<cl<nu<nu<blue>255<\blue255;' When I process the data, I can
somewhat easily find a color item:

if token[3:5] == 'cl': 
	# do something
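
The fragment above can be fleshed out into something runnable; the token is
the color-table example from this message, and the payload split at the end
is my own illustration of what "do something" might start with:

```python
# A minimal sketch of the field test above, using the color-table token
# from this message (the backslash is doubled for Python source).
token = 'cw<cl<nu<nu<blue>255<\\blue255;'

# Field two sits at a fixed offset, so a slice test avoids splitting
# most lines at all.
if token[3:5] == 'cl':
    # Only the lines that match pay for the more expensive split.
    name, value = token.split('<')[4].split('>')
    print(name, value)  # blue 255
```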

> 
> now my question is (not affiliated with the subject of this thread, by
> the way :-):
> 
> If it is intermediate data, why is it of type string? If you have
> processed the information earlier (possibly rewriting it into a
> "normalized" format), you might want to save those results to disk - but
> you needn't restart from disk, splitting each line back into
> computer-understandable data structures. Just carry on with the data
> structures from your former step.
> 
> Or is it necessary to save memory? Or did I miss anything else?

So, for example, I could save each token as an item in a list? I rejected
this idea right off, for better or worse, though many tokenizers use this
method. It does have a real advantage: you often need to know the tokens
that came *before* in order to process the current token, and with a list
of tokens in memory you can look back at them.

However, RTF files can be over 20 MB, and holding every token in memory
just seemed very inefficient. I believe that things run better if you read
only one line into memory at a time, especially with big files.
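
In Python the natural way to do that is to iterate over the file object
itself, which hands you one line at a time. A small sketch (the sample file
and its contents just echo the token examples above):

```python
import os
import tempfile

# Write a tiny sample token file; the contents echo the examples above.
path = os.path.join(tempfile.mkdtemp(), 'tokens.txt')
with open(path, 'w') as f:
    f.write('ob<nu<nu<nu<0001<{\n')
    f.write('cw<nu<nu<nu<rtf>true<rtf\n')

# Iterating over the file object reads one line at a time, so memory
# use stays flat no matter how large the file is.
matches = 0
with open(path) as f:
    for line in f:
        if line[0:2] == 'cw':
            matches += 1
print(matches)  # 1
```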

Overall, writing to a file, closing the file, and then opening it up
again to process takes some time. But to me it seemed that the small
waste of time was worth it because (1) I could process any file, whether
1 KB or 100 gigabytes, and (2) it kept the process very simple and
manageable by breaking it down into steps that I could look at.
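
A hedged sketch of that two-pass, write-then-reopen arrangement (the
function names and the toy lowercasing transform are my own; the real
passes do far more work per line):

```python
import os
import tempfile

def pass_one(src, dst):
    # First pass: write a normalized copy of each line to an
    # intermediate file, then close it.
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            fout.write(line.lower())

def pass_two(src):
    # Second pass: reopen the intermediate file and, say, count
    # control-word lines.
    with open(src) as fin:
        return sum(1 for line in fin if line.startswith('cw'))

work = tempfile.mkdtemp()
raw = os.path.join(work, 'raw.txt')
mid = os.path.join(work, 'mid.txt')
with open(raw, 'w') as f:
    f.write('CW<NU<NU<NU<RTF>TRUE<RTF\n')
    f.write('ob<nu<nu<nu<0001<{\n')

pass_one(raw, mid)
print(pass_two(mid))  # 1
```

Each pass only ever holds one line in memory, and each intermediate file is
something you can open and inspect by hand when a step goes wrong.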

Paul



> 
> Michael
> 
> >
> > (Fields are delimited with "<" and ">" because all literal "<" and ">"
> > have been converted to "&lt;" and "&gt;".)
> >
> > I will make several passes through this file to convert the data.
> >
> > Each time I read a line, I will use string slicing, and sometimes the
> > split method:
> >
> > if line[12:22] == 'font-table':
> > 	info = line[12:]
> > 	fields = info.split(">")
> > 	if fields[1].startswith('true'):
> > 		# do something
> >
> > If I use fixed-length fields, then I won't have to do any splitting. I
> > also know that in Perl there is a way to use 'pack' and 'unpack' to
> > quickly access fixed fields. I have never used this, and don't know if
> > the pack in Python is similar.
> >
> > If fixed fields did give me a speed increase, I would certainly lose
> > readability. For example, the above four lines of tokens might look
> > like:
> >
> > opbr:null:null:null:0001
> > ctrw:null:null:true:rtfx
> > ctrw:null:null:true:mact
> > ctrw:null:null:true:fntb
> >
> > Instead of 'macintosh', I have 'mact'; instead of 'font-table', I have
> > 'fntb'.
> >
> > Thanks
> >
> > Paul
> >
> > --
> >
> > ************************
> > *Paul Tremblay         *
> > *phthenry@earthlink.net*
> > ************************
> >
> > _______________________________________________
> > Tutor maillist  -  Tutor@python.org
> > http://mail.python.org/mailman/listinfo/tutor
> >

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************