[Tutor] Building a data structure from input

Terry Carroll carroll@tjc.com
Sat Dec 28 23:05:02 2002


How would you approach this?  I'm going to have an input file with many
lines in the following form:

key   attribute-name1   value1
key   attribute-name2   value2
 ...
key   attribute-namen   valuen

For any given key, there will be a line with an attribute name and its
value.   Not all keys will have all attributes, though.

What I want is to create a list of objects; one object per key, containing
the key, each of attributes 2, 4, 6 and 7, if they're defined -- but only
if either attribute 4 or 6 (or both) are defined for that particular key.

I think my explanation is poor, so let me make it more real.  The input is
the Unihan.txt file from unicode.org at
<ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt>.  This is a 26-meg file,
so don't browse it casually.

The attributes I want to store are the key plus kDefinition, kMandarin,
kBigFive and kGB0, but only for keys where kBigFive or kGB0 (or both) is
defined.

Here's one entry from this file:

U+570B	kAlternateKangXi	0219.016
U+570B	kAlternateMorohashi	04798
U+570B	kBigFive	B0EA
U+570B	kCCCII	21376F
U+570B	kCNS1986	1-594F
U+570B	kCNS1992	1-594F
U+570B	kCangjie	WIRM
U+570B	kCantonese	GWOK3
U+570B	kDaeJaweon	0447.090
U+570B	kDefinition	nation, country, nation-state
U+570B	kEACC	21376F
U+570B	kFrequency	1
U+570B	kGB1	2590
U+570B	kHanYu	10720.090
U+570B	kIRGDaeJaweon	0447.090
U+570B	kIRGDaiKanwaZiten	04798
U+570B	kIRGHanyuDaZidian	10720.090
U+570B	kIRGKangXi	0219.160
U+570B	kIRG_GSource	1-397A
U+570B	kIRG_JSource	0-5422
U+570B	kIRG_KPSource	KP0-D1B8
U+570B	kIRG_KSource	0-4F50
U+570B	kIRG_TSource	1-594F
U+570B	kIRG_VSource	1-5046
U+570B	kJapaneseKun	KUNI
U+570B	kJapaneseOn	KOKU
U+570B	kJis0	5202
U+570B	kKPS0	D1B8
U+570B	kKSC0	4748
U+570B	kKangXi	0219.160
U+570B	kKarlgren	118
U+570B	kKorean	KWUK
U+570B	kMandarin	GUO2
U+570B	kMatthews	3738
U+570B	kMorohashi	04798
U+570B	kNelson	1042
U+570B	kPhonetic	748
U+570B	kRSKangXi	31.8
U+570B	kRSUnicode	31.8
U+570B	kSBGY	530.39
U+570B	kSimplifiedVariant	U+56FD
U+570B	kTaiwanTelegraph	0948
U+570B	kTotalStrokes	11
U+570B	kXerox	241:056
U+570B	kZVariant	U+5700

I'm going to ignore nearly all of these lines, but want to have an object
that looks like this:

  unicode = "570B"
  definition = "nation, country, nation-state"
  # from kDefinition
  mandarin = "GUO2"
  # from kMandarin
  Big5 = "B0EA"
  # from kBigFive
  GB = <undefined>
  # from kGB0, not present
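
In Python terms, the layout above might be sketched roughly as follows.
This is only an illustration: the class name UnihanEntry, the lowercase
field names, and the wanted() helper are placeholders invented here, and
None stands in for <undefined>.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnihanEntry:
    """One record per Unicode key; None means the attribute was not present."""
    unicode: str                      # e.g. "570B"
    definition: Optional[str] = None  # from kDefinition
    mandarin: Optional[str] = None    # from kMandarin
    big5: Optional[str] = None        # from kBigFive
    gb: Optional[str] = None          # from kGB0

    def wanted(self) -> bool:
        """Keep the entry only if kBigFive or kGB0 (or both) was defined."""
        return self.big5 is not None or self.gb is not None


# The U+570B entry shown above: kGB0 is absent, so gb stays None.
entry = UnihanEntry(
    unicode="570B",
    definition="nation, country, nation-state",
    mandarin="GUO2",
    big5="B0EA",
)
```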

How would you approach this?  I can count on the unicode keys all being
together, and sorted, so that when I see a change in key, I know I have
the last attribute for the entry.  But then, I've already started reading
the first line for the next entry, so there's a minor complication of
having to worry about whether that line needs to be processed.  (Times like
this, I think a programming language needs an "unread" statement or method
-- something that, when a program reads a line, it could say, oh, now that
I see what's in that line, put it back, and let me read that next time
through the loop -- sort of like cheating at cards.)
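
The "unread" idea can be approximated with a small wrapper around the
file iterator. The sketch below is one possible shape, not a built-in
feature: PushbackReader, unread() and read_group() are names made up
for this example, and it assumes the fields are tab-separated as in the
sample entry. The third sample row is invented purely to show the key
change.

```python
class PushbackReader:
    """Wrap any iterable of lines and let the caller push a line back."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._pushed = []          # lines waiting to be handed out again

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        return next(self._lines)   # raises StopIteration at end of input

    def unread(self, line):
        self._pushed.append(line)


def read_group(reader):
    """Return (key, {attribute: value}) for the next run of lines that
    share a key, or None when the input is exhausted."""
    key, attrs = None, {}
    for line in reader:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) != 3:
            continue               # skip blank or malformed lines
        k, name, value = fields
        if key is None:
            key = k
        elif k != key:
            reader.unread(line)    # belongs to the next entry: put it back
            break
        attrs[name] = value
    return (key, attrs) if key is not None else None


sample = [
    "U+570B\tkBigFive\tB0EA\n",
    "U+570B\tkMandarin\tGUO2\n",
    "U+570C\tkDefinition\tmade-up row for illustration\n",
]
reader = PushbackReader(sample)
first = read_group(reader)
second = read_group(reader)
third = read_group(reader)
```

With a real file, the reader would wrap the open file object instead of
the sample list, and each group could be kept or dropped as soon as it
is complete.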

I've thought about just creating an object for each line I see, and adding
attributes to the extent that I want to capture them, and then going back
through the list after I'm done to weed out any that don't have either the
Big5 or GB defined, with the "del" statement.  But I hate to make two
passes like this given the size of the data, and that just seems messy.
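
For comparison, the collect-then-weed approach might look something
like this. It is a rough sketch under the same tab-separated
assumption; the function name two_pass and the use of a plain dict per
key are choices made for the example, and the last sample row is
invented. Note that the second pass here walks the collected entries in
memory, not the file again.

```python
WANTED = {"kDefinition", "kMandarin", "kBigFive", "kGB0"}


def two_pass(lines):
    """Collect wanted attributes per key, then delete keys lacking
    both kBigFive and kGB0."""
    entries = {}
    # First pass: read every line, keeping only the attributes of interest.
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) != 3:
            continue               # skip blank or malformed lines
        key, name, value = fields
        if name in WANTED:
            entries.setdefault(key, {})[name] = value
    # Second pass: weed out entries with neither encoding defined.
    # list() takes a snapshot of the keys so deleting while looping is safe.
    for key in list(entries):
        attrs = entries[key]
        if "kBigFive" not in attrs and "kGB0" not in attrs:
            del entries[key]
    return entries


sample = [
    "U+570B\tkBigFive\tB0EA\n",
    "U+570B\tkCangjie\tWIRM\n",
    "U+570B\tkMandarin\tGUO2\n",
    "U+FFFF\tkDefinition\tmade-up row with no kBigFive or kGB0\n",
]
result = two_pass(sample)
```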

Ideas?

-- 
Terry Carroll        |
Santa Clara, CA      |   "The parties are advised to chill."
carroll@tjc.com      |       - Mattel, Inc. v. MCA Records, Inc.,
Modell delendus est  |         no. 98-56577 (9th Cir. July 24, 2002)