Elementary string-parsing
Steve Holden
steve at holdenweb.com
Tue Feb 5 08:07:46 EST 2008
Dennis Lee Bieber wrote:
> On Tue, 05 Feb 2008 04:03:04 GMT, Odysseus
> <odysseus1479-at at yahoo-dot.ca> declaimed the following in
> comp.lang.python:
>
>> Sorry, translation problem: I am acquainted with Python's "for" -- if
>> far from fluent with it, so to speak -- but the PS operator that's most
>> similar (traversing a compound object, element by element, without any
>> explicit indexing or counting) is called "forall". PS's "for" loop is
>> similar to BASIC's (and ISTR Fortran's):
>>
>> start_value increment end_value {procedure} for
>>
>> I don't know the proper generic term -- "indexed loop"? -- but at any
>> rate it provides a counter, unlike Python's command of the same name.
>>
> The convention in Python is to use range() (or xrange()) to
> generate a sequence of "index" values for the for statement to loop
> over:
>
>     for i in range([start,] end[, step]):
>
> with the caveat that "end" will not be one of the values, start defaults
> to 0, so if you supply range(4) the values become 0, 1, 2, 3 [ie, 4
> values starting at 0].
>
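A short sketch of those range() rules, assuming Python 3 (where range()
is lazy, much as xrange() was in Python 2, so list() is needed to show
the values). The names first_four, stepped and words are made up for
illustration only:

```python
# range(end): start defaults to 0 and "end" itself is never produced.
first_four = list(range(4))
print(first_four)            # [0, 1, 2, 3]

# range(start, end, step): count up from start in increments of step.
stepped = list(range(2, 11, 3))
print(stepped)               # [2, 5, 8]

# An index-driven loop over a sequence, the nearest thing to a counted "for".
words = ['this', 'is', 'a', 'list']
for i in range(len(words)):
    print(i, words[i])
```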
If you have a sequence of values s and you want to associate each with
its index as you loop over the sequence, the easiest way to do this is
the enumerate built-in function:
>>> for x in enumerate(['this', 'is', 'a', 'list']):
...     print x
...
(0, 'this')
(1, 'is')
(2, 'a')
(3, 'list')
It's usually (though not always) much more convenient to bind the index
and the value to separate names, as in
>>> for i, v in enumerate(['this', 'is', 'a', 'list']):
...     print i, v
...
0 this
1 is
2 a
3 list
[...]
> The whole idea behind the SGML parser is that YOU add methods to
> handle each tag type you need... Also, FYI, there IS an HTML parser (in
> module htmllib) that is already derived from sgmllib.
>
> class PageParser(SGMLParser):
>     def __init__(self):
>         #need to call the parent __init__, and then
>         #initialize any needed attributes -- like someplace to collect
>         #the parsed out cell data
>         SGMLParser.__init__(self)
>         #state flags must exist before the start_*() methods test them
>         self.inTable = False
>         self.inRow = False
>         self.inCell = False
>         self.cellCount = 0
>         self.row = {}
>         self.all_data = []
>
>     def start_table(self, attrs):
>         self.inTable = True
>         .....
>
>     def end_table(self):
>         self.inTable = False
>         .....
>
>     def start_tr(self, attrs):
>         if self.inRow:
>             #unclosed row!
>             self.end_tr()
>         self.inRow = True
>         self.cellCount = 0
>         ...
>
>     def end_tr(self):
>         self.inRow = False
>         # add/append collected row data to master stuff
>         self.all_data.append(self.row)
>         ...
>
>     def start_td(self, attrs):
>         if self.inCell:
>             self.end_td()
>         self.inCell = True
>         ...
>
>     def end_td(self):
>         #cell is finished, so later text is no longer cell data
>         self.inCell = False
>         self.cellCount = self.cellCount + 1
>         ...
>
>     def handle_data(self, text):
>         if self.inTable and self.inRow and self.inCell:
>             if self.cellCount == 0:
>                 #first column stuff
>                 self.row["Epoch1"] = convert_if_needed(text)
>             elif self.cellCount == 1:
>                 #second column stuff
>                 ...
>
>
>
>
> Hope you don't have nested tables -- it could get ugly as this style
> of parser requires the start_tag()/end_tag() methods to set instance
> attributes for the purpose of tracking state needed in later methods
> (notice the complexity of the handle_data() method just to ensure that
> the text is from a table cell, and not some random text).
>
There is, of course, nothing to stop you building a recursive data
structure, so that encountering a new opening tag such as <table> adds
another level to some stack-like object, and the corresponding closing
tag pops it off again, but this *does* add to the complexity somewhat.
It seems natural that more complex input possibilities lead to more
complex parsers.
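A minimal sketch of that stack idea, with some loud assumptions: it uses
html.parser.HTMLParser from the Python 3 standard library rather than
sgmllib (which Python 3 no longer ships), and the class name
NestedTableParser and its stack and tables attributes are invented for
this example. Each <table> pushes a new list of rows, and each </table>
pops the finished table off again:

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Collect cell text from possibly nested tables using a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tables currently open, innermost last
        self.tables = []   # completed tables, in the order they closed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([])            # push a new table (list of rows)
        elif tag == "tr" and self.stack:
            self.stack[-1].append([])        # new row in the innermost table
        elif tag == "td" and self.stack and self.stack[-1]:
            self.stack[-1][-1].append("")    # new empty cell in the current row

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.tables.append(self.stack.pop())   # pop the finished table

    def handle_data(self, text):
        # Text is credited to the latest cell of the innermost open table.
        if self.stack and self.stack[-1] and self.stack[-1][-1]:
            self.stack[-1][-1][-1] += text.strip()


parser = NestedTableParser()
parser.feed(
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td><td>last</td></tr></table>"
)
parser.close()
print(parser.tables)   # [[['inner']], [['outer', 'last']]]
```

The inner table is reported first because it closes first. This sketch
does not track where a cell ends, so stray text between cells would be
credited to the preceding cell; handling that is part of the extra
complexity mentioned above.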
> And somewhere before you close the parser, get a handle on the
> collected data...
>
>
> parsed_data = parser.all_data
> parser.close()
> return parsed_data
>
>
>> Why wouldn't one use a dictionary for that?
>>
> The overhead may not be needed... Tuples can also be used as the
> keys /in/ a dictionary.
>
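To illustrate the point about tuples as keys, here is a small sketch
(Python 3 syntax; the cells data and the name by_position are made up
for the example). Because tuples are immutable and hashable, a
(row, column) pair works directly as a dictionary key:

```python
# Hypothetical records: each one is a plain tuple (row, column, value).
cells = [(0, 0, 'a'), (0, 1, 'b'), (1, 0, 'c')]

# Index the values by their (row, column) position.
by_position = {}
for row, col, value in cells:
    by_position[(row, col)] = value

print(by_position[(1, 0)])      # c
print((0, 1) in by_position)    # True
print((5, 5) in by_position)    # False
```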
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/