[Tutor] Building a data structure from input
Terry Carroll
carroll@tjc.com
Sat Dec 28 23:05:02 2002
How would you approach this? I'm going to have an input file with many
lines in the following form:
key attribute-name1 value1
key attribute-name2 value2
...
key attribute-namen valuen
For any given key, there will be a line with an attribute name and its
value. Not all keys will have all attributes, though.
What I want is to create a list of objects, one object per key, containing
the key and each of attributes 2, 4, 6 and 7, if they're defined -- but only
if either attribute 4 or 6 (or both) is defined for that particular key.
I think my explanation is poor, so let me make it more real. The input is
the Unihan.txt file from unicode.org at
<ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt>. This is a 26-meg file,
so don't browse it casually.
The attributes I want to store are the key plus kDefinition, kMandarin,
kBigFive and kGB0, and I only want an entry at all if either kBigFive or
kGB0 is defined.
Here's one entry from this file:
U+570B kAlternateKangXi 0219.016
U+570B kAlternateMorohashi 04798
U+570B kBigFive B0EA
U+570B kCCCII 21376F
U+570B kCNS1986 1-594F
U+570B kCNS1992 1-594F
U+570B kCangjie WIRM
U+570B kCantonese GWOK3
U+570B kDaeJaweon 0447.090
U+570B kDefinition nation, country, nation-state
U+570B kEACC 21376F
U+570B kFrequency 1
U+570B kGB1 2590
U+570B kHanYu 10720.090
U+570B kIRGDaeJaweon 0447.090
U+570B kIRGDaiKanwaZiten 04798
U+570B kIRGHanyuDaZidian 10720.090
U+570B kIRGKangXi 0219.160
U+570B kIRG_GSource 1-397A
U+570B kIRG_JSource 0-5422
U+570B kIRG_KPSource KP0-D1B8
U+570B kIRG_KSource 0-4F50
U+570B kIRG_TSource 1-594F
U+570B kIRG_VSource 1-5046
U+570B kJapaneseKun KUNI
U+570B kJapaneseOn KOKU
U+570B kJis0 5202
U+570B kKPS0 D1B8
U+570B kKSC0 4748
U+570B kKangXi 0219.160
U+570B kKarlgren 118
U+570B kKorean KWUK
U+570B kMandarin GUO2
U+570B kMatthews 3738
U+570B kMorohashi 04798
U+570B kNelson 1042
U+570B kPhonetic 748
U+570B kRSKangXi 31.8
U+570B kRSUnicode 31.8
U+570B kSBGY 530.39
U+570B kSimplifiedVariant U+56FD
U+570B kTaiwanTelegraph 0948
U+570B kTotalStrokes 11
U+570B kXerox 241:056
U+570B kZVariant U+5700
I'm going to ignore nearly all of these lines, but want to have an object
that looks like this:
unicode = "570B"
definition = "nation, country, nation-state"
# from kDefinition
mandarin = "GUO2"
# from kMandarin
Big5 = "B0EA"
# from kBigFive
GB = <undefined>
# from kGB0, not present
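A minimal sketch of such an object in Python, assuming a dataclass is acceptable. The class name UnihanEntry, the field names and the has_encoding helper are made up for illustration, and None stands in for <undefined>:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnihanEntry:
    """One record per Unicode key (hypothetical name and fields)."""
    unicode: str
    definition: Optional[str] = None   # from kDefinition
    mandarin: Optional[str] = None     # from kMandarin
    big5: Optional[str] = None         # from kBigFive
    gb: Optional[str] = None           # from kGB0

    def has_encoding(self) -> bool:
        # Keep the entry only if Big5 or GB (or both) is defined.
        return self.big5 is not None or self.gb is not None


entry = UnihanEntry(
    unicode="570B",
    definition="nation, country, nation-state",
    mandarin="GUO2",
    big5="B0EA",
)
```

Here entry.gb stays None because the sample entry has no kGB0 line, and entry.has_encoding() is what a loader could check before appending to the list.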
How would you approach this? I can count on the unicode keys all being
together, and sorted, so that when I see a change in key, I know I have
the last attribute for the entry. But then, I've already started reading
the first line for the next entry, so there's a minor complication of
having to worry about whether that line needs to be processed. (Times like
this, I think a programming language needs an "unread" statement or method
-- something that, when a program reads a line, it could say, oh, now that
I see what's in that line, put it back, and let me read that next time
through the loop -- sort of like cheating at cards.)
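One way to get the effect of an "unread" without writing one is itertools.groupby, which batches consecutive lines that share a key and deals with the read-ahead internally. The sketch below is only that, a sketch: it assumes the three fields are whitespace-separated, that lines beginning with "#" are comments, and that keys are contiguous as described above. The function names and the second sample key are made up for illustration.

```python
from itertools import groupby

WANTED = {"kDefinition", "kMandarin", "kBigFive", "kGB0"}


def parse(lines):
    """Yield (key, attribute, value) tuples, skipping blanks and comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)   # the value itself may contain spaces
        if len(parts) == 3:
            yield tuple(parts)


def entries(lines):
    """Yield one dict per key, but only if kBigFive or kGB0 is present."""
    for key, group in groupby(parse(lines), key=lambda rec: rec[0]):
        attrs = {name: value for _, name, value in group if name in WANTED}
        if "kBigFive" in attrs or "kGB0" in attrs:
            attrs["unicode"] = key[2:] if key.startswith("U+") else key
            yield attrs


sample = [
    "U+570B kBigFive B0EA",
    "U+570B kCCCII 21376F",
    "U+570B kDefinition nation, country, nation-state",
    "U+570B kGB1 2590",
    "U+570B kMandarin GUO2",
    # Placeholder key, not real Unihan data: no Big5 or GB, so it is dropped.
    "U+XXXX kDefinition placeholder entry",
]
result = list(entries(sample))
```

Because entries() is a generator, the same function could be handed an open file object and would read it in a single pass, holding only one key's lines at a time.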
I've thought about just creating an object for each line I see, and adding
attributes to the extent that I want to capture them, and then going back
through the list after I'm done to weed out any that don't have either the
Big5 or GB defined, with the "del" statement. But I hate to make two
passes like this given the size of the data, and that just seems messy.
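For comparison, the two-pass idea can be sketched with a dictionary keyed on the Unicode value, building a filtered copy instead of using "del". This is an illustration under assumptions, not a recommendation: the function name is hypothetical, fields are assumed to be whitespace-separated, and the second sample key is a placeholder rather than real data. Only lines carrying one of the four wanted attributes are stored, so the second pass runs over far less than the whole file.

```python
WANTED = {"kDefinition", "kMandarin", "kBigFive", "kGB0"}


def collect_then_filter(lines):
    """Pass 1: gather wanted attributes per key. Pass 2: drop keys lacking Big5/GB."""
    by_key = {}
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) != 3 or parts[0].startswith("#"):
            continue  # blank, malformed, or comment line
        key, name, value = parts
        if name in WANTED:
            by_key.setdefault(key, {})[name] = value.strip()
    # Second pass over the (much smaller) dictionary, not over the file.
    return {
        key: attrs
        for key, attrs in by_key.items()
        if "kBigFive" in attrs or "kGB0" in attrs
    }


sample = [
    "U+570B kBigFive B0EA",
    "U+570B kDefinition nation, country, nation-state",
    "U+570B kMandarin GUO2",
    # Placeholder key, not real Unihan data.
    "U+XXXX kMandarin GUO2",
]
kept = collect_then_filter(sample)
```

Unlike a grouping approach, this one does not depend on the keys being sorted or contiguous, at the cost of keeping every candidate key in memory until the end.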
Ideas?
--
Terry Carroll |
Santa Clara, CA | "The parties are advised to chill."
carroll@tjc.com | - Mattel, Inc. v. MCA Records, Inc.,
Modell delendus est | no. 98-56577 (9th Cir. July 24, 2002)