[Tutor] Tokenizing Help

Thu Apr 23 03:41:49 CEST 2009

On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:

>How do you decide that a word is a keyword (AU, AB, UN) and not a part 
>of the text? There could be a file like this:
>
><567>
>AU  - Bibliographical Theory and Practice - Volume 1 - The AU  - Tag 
>and its applications  
>AB  - Texts in Library Science
><568>
>AU  - Bibliographical Theory and Practice - Volume 2 - The 
>AB  - Tag and its applications  
>AB  - Texts in Library Science
><569>
>AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - 
>AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU 
>AB  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - 
>AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU  - AU
>ZZ  - Somewhat nonsensical case

This is a good case, and luckily the files are validated on the other
end to prevent this kind of collision.

>To me it seems that a parsing library is unnecessary. Just look at the 
>first few characters of each line and decide if it's the start of a 
>record, a tag or normal text. You might need some additional 
>algorithm for corner cases.

If this were the only type of file I needed to parse, I'd agree with you,
but this is one of at least 4 formats I'll need to process, and so a
robust methodology will serve me better than a regex-based one-off.
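
For comparison, the line-by-line approach described above might look
roughly like the Python sketch below. It is only an illustration: the tag
pattern (two capital letters, two spaces, a hyphen) and the <NNN> record
marker are inferred from the sample in this thread, not from any format
specification, and the names parse_records, RECORD_RE and TAG_RE are made
up for the example.

```python
import re

# Assumed from the sample above: a record starts with "<digits>" alone on a line.
RECORD_RE = re.compile(r"^<(\d+)>\s*$")
# Assumed from the sample above: a tag line is two capitals, two spaces, a hyphen.
TAG_RE = re.compile(r"^([A-Z]{2})  -\s*(.*)$")


def parse_records(lines):
    """Group lines into records; each record holds a list of [tag, text] fields."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = RECORD_RE.match(line)
        if match:
            # Start of a new record.
            current = {"id": match.group(1), "fields": []}
            records.append(current)
            continue
        if current is None:
            # Ignore anything that appears before the first record marker.
            continue
        match = TAG_RE.match(line)
        if match:
            # A tag at the start of a line opens a new field.
            current["fields"].append([match.group(1), match.group(2).strip()])
        elif current["fields"] and line.strip():
            # Anything else is treated as a continuation of the previous field.
            current["fields"][-1][1] += " " + line.strip()
    return records
```

Only a tag at the very start of a line opens a new field, so an "AU  - "
in the middle of a line stays part of the text. A continuation line that
happens to begin with a tag is still misread, which is the collision the
validation on the other end is meant to rule out.
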
-- 

yours,

William