[Tutor] Tokenizing Help
Eike Welk
eike.welk at gmx.net
Wed Apr 22 23:23:11 CEST 2009
Hello William!
On Wednesday 22 April 2009, William Witteman wrote:
> The file format I am looking at (it is a bibliographic reference
> file) looks like this:
>
> <1> # the references are enumerated
> AU - some text
> perhaps across lines
> AB - some other text
> AB - there may be multiples of some fields
> UN - any 2-letter combination may exist, other than by exhaustion,
> I cannot anticipate what will be found
How do you decide that a word is a keyword (AU, AB, UN) and not a part
of the text? There could be a file like this:
<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
<569>
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
ZZ - Somewhat nonsensical case
To me it seems that a parsing library is unnecessary. Just look at the
first few characters of each line and decide whether it is the start of
a record, a tag, or normal text. You might need some additional logic
for corner cases like the ones above.
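Here is a minimal sketch of that idea in Python. It rests on assumptions I cannot confirm from your description: that a record starts with a line like "<567>", that a tag is exactly two capital letters followed by " - " at the very start of a line, and that every other line continues the previous field. The names parse_references, RECORD_RE and TAG_RE are made up for this example.

```python
import re

# Assumed line shapes; adjust them to what the real files contain.
RECORD_RE = re.compile(r"^<(\d+)>")           # e.g. "<567>"
TAG_RE = re.compile(r"^([A-Z]{2}) - ?(.*)$")  # e.g. "AU - some text"


def parse_references(lines):
    """Return a list of (record_number, [(tag, text), ...]) tuples."""
    records = []
    fields = None  # field list of the record currently being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = RECORD_RE.match(line)
        if match:
            # Start of a new record.
            fields = []
            records.append((int(match.group(1)), fields))
            continue
        if fields is None:
            continue  # ignore anything before the first record
        match = TAG_RE.match(line)
        if match:
            # A new tagged field; repeated tags are kept as separate entries.
            fields.append((match.group(1), match.group(2).strip()))
        elif fields:
            # Normal text: append it to the most recent field.
            tag, text = fields[-1]
            fields[-1] = (tag, (text + " " + line).strip())
    return records


sample = """<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
"""
for number, fields in parse_references(sample.splitlines()):
    print(number, fields)
```

Because the tag pattern is anchored at the start of the line, the "AU - Tag" in the middle of a line in record 567 is treated as text. Record 568 is different: the continuation line "AB - Tag and its applications" looks exactly like a tag, so this sketch reads it as a separate AB field. Only additional knowledge about the format (indentation of continuation lines, for example) could resolve that.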
Kind regards,
Eike.