[Tutor] Tokenizing Help

Wed Apr 22 23:23:11 CEST 2009

Hello William!

On Wednesday 22 April 2009, William Witteman wrote:
> The file format I am looking at (it is a bibliographic reference
> file) looks like this:
>
> <1>                   # the references are enumerated
> AU  - some text
> perhaps across lines
> AB  - some other text
> AB  - there may be multiples of some fields
> UN  - any 2-letter combination may exist, other than by exhaustion,
> I cannot anticipate what will be found

How do you decide that a word is a keyword (AU, AB, UN) and not a part 
of the text? There could be a file like this:

<567>
AU  - Bibliographical Theory and Practice - Volume 1 - The AU  - Tag 
and its applications  
AB  - Texts in Library Science
<568>
AU  - Bibliographical Theory and Practice - Volume 2 - The 
AB  - Tag and its applications  
AB  - Texts in Library Science
<569>
AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - 
AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU 
AB  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - 
AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU
ZZ  - Somewhat nonsensical case

To me it seems that a parsing library is unnecessary. Just look at the 
first few characters of each line and decide whether it's the start of 
a record, a tag, or normal text. You might need some additional logic 
for corner cases, such as a continuation line that happens to begin 
like a tag (see record <568> above).
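
Here is a minimal sketch of that line-by-line idea in Python. It is 
only an illustration under assumptions taken from your sample: a record 
starts with "<number>", and a tag is two capital letters followed by 
two spaces and a hyphen. The names RECORD_RE, TAG_RE and parse_records 
are made up for this example, and the sketch makes no attempt to 
resolve ambiguous lines such as the second line of record <568>; it 
simply treats them as new tags.

```python
import re

# Assumed line shapes, inferred only from the sample in this thread.
RECORD_RE = re.compile(r"^<(\d+)>")             # e.g. "<567>"
TAG_RE = re.compile(r"^([A-Z]{2})  - ?(.*)$")   # e.g. "AU  - some text"


def parse_records(lines):
    """Classify each line as record start, tag, or continuation text."""
    records = []
    current = None   # the record currently being filled
    last = None      # the most recent [tag, text] pair
    for raw in lines:
        line = raw.rstrip("\n")
        m = RECORD_RE.match(line)
        if m:
            # Start of a new record.
            current = {"id": int(m.group(1)), "fields": []}
            records.append(current)
            last = None
            continue
        m = TAG_RE.match(line)
        if m and current is not None:
            # A new field; repeated tags are kept as separate entries.
            last = [m.group(1), m.group(2).strip()]
            current["fields"].append(last)
            continue
        if last is not None and line.strip():
            # Anything else is text continuing the previous field.
            last[1] = (last[1] + " " + line.strip()).strip()
    return records
```

Calling parse_records(open(filename)) would give a list of 
dictionaries, one per record, each with its fields in file order. 
Keeping the fields as a list rather than a dictionary preserves 
repeated tags such as the two AB lines.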

Kind regards, 
Eike.