[Tutor] Parsing non-uniform strings
Kirby Urner
urnerk@qwest.net
Mon, 19 Nov 2001 21:34:56 -0800
At 06:50 PM 11/19/2001 -0600, Timothy Wilson wrote:
>Hi everyone,
>
>I'm working up an assignment for my programming students and I'd like to
>get some feedback on strategies that could be used to solve this problem.
Getting the data is step 1; parsing it is step 2. Some
websites deliver the report already formatted, which does the
parsing for you but forces any screen-scraping program to wade
through a lot of HTML looking for values.
I prefer to fetch just the raw METAR string, as in your examples.
>Does anyone have some general advice about parsing data like this?
>This may be biting off more than my students are able to chew at
>this point.
>
>-Tim
A key question is whether students in your class have any experience
with regular expressions. These would make it easier to pick out
strings of the form dd/dd for temperature/dewpoint, each with an
optional M in front (e.g. [M]dd/[M]dd). Supposedly that pattern
occurs only once in a METAR string, so you could find it like this:
    temp = re.compile(r' (M?\d{2})/(M?\d{2}) ')
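To see what the pattern captures, here is a minimal sketch run against the two sample strings quoted further down in this message. The helper name find_temps is made up for illustration and is not part of metar.py, and the code assumes Python 3.

```python
import re

# Temperature/dew point field: two digits each, optional leading M,
# with a space on either side (equivalent to the pattern above).
TEMP = re.compile(r" (M?\d{2})/(M?\d{2}) ")

def find_temps(weather):
    """Return (temperature, dew point) as raw strings, or None if absent."""
    match = TEMP.search(weather)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Sample METAR strings copied from the sessions in this message.
kpdx = ("KPDX 200355Z 13010KT 10SM SCT070 BKN120 OVC200 12/09 A2988 RMK "
        "AO2 SLP119 T01170089")
ksgs = "KSGS 200456Z AUTO 00000KT 10SM CLR 01/M06 A3023 RMK AO2 "

print(find_temps(kpdx))
print(find_temps(ksgs))
```

The fields come back as strings, so a below-zero reading such as M06 still carries its M prefix at this stage.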
If a match is found, then match.group(1) will contain the
temperature, and match.group(2) the dew point temperature
(both in Celsius). Likewise, the sky conditions may be
extracted by building a dictionary, e.g.
    skydict = {"BKN":"Broken","FEW":"Few",
               "OVC":"Overcast","CLR":"Clear",
               "C":"Clear","SCT":"Scattered",
               "VV":"Vertical visibility"}
and then searching for each of skydict.keys() in turn, which
is the easiest approach:
    for k in self.skydict.keys():
        patt = " ("+k+")"+"([0-9]{3})"
        sky = re.compile(patt)
        match = re.search(sky,self.weather)
        if match:
            self.report.append("Sky: %s @ %s ft" %
                               (self.skydict[match.group(1)],
                                match.group(2)))
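The loop above lives inside a class, so it cannot be run on its own. A minimal standalone sketch of the same idea follows, assuming Python 3. The function name sky_conditions is hypothetical, and it returns (description, digits) pairs rather than formatted strings so that the caller decides how to present the height digits.

```python
import re

# Same codes as the skydict shown above.
SKYDICT = {"BKN": "Broken", "FEW": "Few",
           "OVC": "Overcast", "CLR": "Clear",
           "C": "Clear", "SCT": "Scattered",
           "VV": "Vertical visibility"}

def sky_conditions(weather, skydict=SKYDICT):
    """Search for each sky code in turn, as in the loop above.

    Returns a list of (description, three-digit string) pairs.
    Like the original loop, it only finds codes followed by
    exactly three digits, and only the first hit per code.
    """
    found = []
    for code, description in skydict.items():
        # re.escape is a precaution in case a code ever contains
        # a character that is special in regular expressions.
        pattern = re.compile(" (" + re.escape(code) + ")([0-9]{3})")
        match = pattern.search(weather)
        if match:
            found.append((description, match.group(2)))
    return found

kpdx = ("KPDX 200355Z 13010KT 10SM SCT070 BKN120 OVC200 12/09 A2988 RMK "
        "AO2 SLP119 T01170089")
for description, digits in sky_conditions(kpdx):
    print("Sky: %s @ %s" % (description, digits))
```

Because the search goes code by code, the results come out in dictionary order rather than in the order the layers appear in the string.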
Just for fun, I did a passable metar downloader and parser,
as if I were one of your students. I can send you the full
source by email if you like.
Usage:
>>> kpdx = metar.Wreport("KPDX") # new report object defined
>>> kpdx.getdata() # download data from website using urllib2
>>> kpdx.weather # gives the string parsed below
'KPDX 200355Z 13010KT 10SM SCT070 BKN120 OVC200 12/09 A2988 RMK
AO2 SLP119 T01170089'
>>> kpdx.report # list of data items, could be formatted
['Date/Time: 11.20.2001 03:55 GMT', 'Temp: 12 C ', 'Dew: 09 C ',
'Sky: Overcast @ 200 ft', 'Sky: Scattered @ 070 ft',
'Sky: Broken @ 120 ft']
Another example:
>>> ksgs = metar.Wreport("KSGS")
>>> ksgs.getdata()
>>> for i in ksgs.report: print i
Date/Time: 11.20.2001 04:56 GMT
Temp: 01 C
Dew: M06 C
>>> ksgs.weather
'KSGS 200456Z AUTO 00000KT 10SM CLR 01/M06 A3023 RMK AO2 '
I see a couple of bugs. CLR isn't being picked up, because my
pattern requires three digits after the code and CLR has none.
And I should probably use a minus sign in place of the M for
temperatures. I'll fix those in my metar.py after posting
this.
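One possible way to address both bugs is sketched below. This is an illustration only, not the actual fix that went into metar.py. The names to_celsius and sky_codes are hypothetical, the code assumes Python 3, and the list of codes is copied from the skydict keys earlier in this message.

```python
import re

def to_celsius(field):
    """Turn a METAR temperature field such as '12' or 'M06' into an int."""
    if field.startswith("M"):
        return -int(field[1:])   # M means minus
    return int(field)

# The three digits are now optional, so a bare CLR also matches.
# The lookahead requires a space or end of string after the code,
# which keeps a short code from matching the start of a longer word.
SKY = re.compile(r" (BKN|FEW|OVC|CLR|C|SCT|VV)(\d{3})?(?= |$)")

def sky_codes(weather):
    """Return (code, digits) pairs in the order they appear.

    digits is None when the code has no height digits, as with CLR.
    """
    return [(m.group(1), m.group(2)) for m in SKY.finditer(weather)]

ksgs = "KSGS 200456Z AUTO 00000KT 10SM CLR 01/M06 A3023 RMK AO2 "
print(to_celsius("M06"))
print(sky_codes(ksgs))
```

Using a single alternation with finditer is a different technique from the per-key loop shown earlier. It reports the layers in the order they occur in the string and picks up repeated codes as well.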
Thanks for sharing a fun, educational project. I learned
a lot.
Kirby