[Tutor] Tokenizing Help
William Witteman
yam at nerd.cx
Thu Apr 23 03:41:49 CEST 2009
On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:
>How do you decide that a word is a keyword (AU, AB, UN) and not a part
>of the text? There could be a file like this:
>
><567>
>AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
>and its applications
>AB - Texts in Library Science
><568>
>AU - Bibliographical Theory and Practice - Volume 2 - The
>AB - Tag and its applications
>AB - Texts in Library Science
><569>
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>ZZ - Somewhat nonsensical case
This is a good case, and luckily the files are validated on the other
end to prevent this kind of collision.
>To me it seems that a parsing library is unnecessary. Just look at the
>first few characters of each line and decide if it's the start of a
>record, a tag or normal text. You might need some additional
>algorithm for corner cases.
If this were the only type of file I needed to parse, I'd agree with you,
but it is one of at least 4 formats I'll need to process, so a
robust methodology will serve me better than a regex-based one-off.
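
For reference, the line-prefix approach suggested above might look
roughly like the sketch below. It is only an illustration, not code from
this thread: the names (RECORD_RE, TAG_RE, parse_records) are made up,
and the line formats are assumptions inferred from the sample records
quoted earlier (a record marker like <567>, and a tag line made of two
capital letters followed by " - "). It also relies on the files being
validated so that a continuation line never starts with something that
looks like a tag.

```python
import re

# Assumed formats, inferred from the quoted sample only:
#   record marker: "<567>"
#   tag line:      two capital letters, whitespace, "-", then the text
RECORD_RE = re.compile(r"^<(\d+)>\s*$")
TAG_RE = re.compile(r"^([A-Z]{2})\s+-\s?(.*)$")


def parse_records(lines):
    """Group lines into records; each record holds [tag, text] pairs."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")

        # Start of a new record, e.g. "<567>".
        match = RECORD_RE.match(line)
        if match:
            current = {"id": int(match.group(1)), "fields": []}
            records.append(current)
            continue

        # Ignore anything that appears before the first record marker.
        if current is None:
            continue

        # A tag line starts a new field.
        match = TAG_RE.match(line)
        if match:
            current["fields"].append([match.group(1), match.group(2).strip()])
        elif line.strip() and current["fields"]:
            # Anything else is treated as a continuation of the last field.
            current["fields"][-1][1] += " " + line.strip()
    return records


if __name__ == "__main__":
    sample = [
        "<567>",
        "AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag",
        "and its applications",
        "AB - Texts in Library Science",
    ]
    for record in parse_records(sample):
        print(record["id"], record["fields"])
```

Note that record <568> in the quoted example would be split at its
second line, since that line starts with "AB - ". That is exactly the
collision described above, and a sketch like this cannot detect it.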
--
yours,
William