[Tutor] little something in the way of file parsing

Sean 'Shaleh' Perry shalehperry@attbi.com
Fri, 19 Jul 2002 13:36:29 -0700 (PDT)


On 19-Jul-2002 Erik Price wrote:
> On Friday, July 19, 2002, at 02:49  PM, Sean 'Shaleh' Perry wrote:
> 
>> One stanza after another.  As preparation for a tool that would allow 
>> me to do
>> better things than grep on it I wrote the following bit of python.
>>
>> I am posting this because it shows some of the powers python has for 
>> rapid
>> coding and parsing.  Thought some of the lurkers might enjoy it.  Hope 
>> someone
>> learns something from it.
> 
> 
> See, now that's exactly what helps me learn, right there.  I really like 
> to read source code written by others, because you can sort of imagine 
> what's going through their head as you follow the path of execution 
> through the program.  I had never even thought of making an empty class 
> definition (and I still don't feel right about it), but I learned about 
> the setdefault() method of dictionaries, and a few other things.
> 

The empty class thing is something I learned recently, and this was the first
time I really felt I had found a use for it.  setdefault() was added in Python 2.0.
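
For anyone who has not met either idiom, here is a minimal sketch of the two
together in current Python.  The records and attribute names are made up for
illustration and are not taken from the real script.

```python
class Package:
    """Empty on purpose: attributes get attached at run time."""
    pass

# Hypothetical sample data standing in for parsed stanzas.
records = [
    ("bash", "shells"),
    ("zsh", "shells"),
    ("vim", "editors"),
]

by_section = {}
for name, section in records:
    pkg = Package()
    pkg.name = name
    pkg.section = section
    # setdefault() returns the list already stored under the key,
    # or stores and returns the given default if the key is missing.
    by_section.setdefault(section, []).append(pkg)

shell_names = [p.name for p in by_section["shells"]]
print(shell_names)
```

Without setdefault() the loop would need an explicit "is the key there yet?"
test before every append.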

> The problem with reading source code is usually that it's just way too 
> big to be able to just sit down and digest.  Really, I think it's the 
> best way to learn (short of actually writing code), but usually I 
> download some cool looking program like Zope or Xerces or something like 
> that and I just get overwhelmed because the code extends over so many 
> files and you don't know where half the references are coming from, et 
> cetera.  This is the kind of thing that people can learn from.
> 

I released this version because it was simple.  The real version will move the
parsing into a function so the user can say "parse the data again, I think it
changed".
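
A rough sketch of what that could look like, in current Python.  The name
parse_packages(), the sample text, and the rule of lower-casing tags and
replacing '-' with '_' are all assumptions for this example, not the real code.

```python
import io

class Package:
    pass

def parse_packages(fd):
    """Read 'Tag: value' stanzas from fd and return a new list of Packages."""
    packages = []
    current = None
    for line in fd:
        line = line.rstrip("\n")
        if not line.strip():
            current = None          # a blank line closes the stanza
            continue
        if line[0] in " \t" or ":" not in line:
            continue                # continuation lines are ignored in this sketch
        tag, value = line.split(":", 1)
        if current is None:
            current = Package()
            packages.append(current)
        # Assumed convention: lower-case the tag and swap '-' for '_'.
        setattr(current, tag.strip().lower().replace("-", "_"), value.strip())
    return packages

# Made-up sample; io.StringIO stands in for the real file.
SAMPLE = "Package: bash\nSection: shells\n\nPackage: vim\nSection: editors\n"

packages = parse_packages(io.StringIO(SAMPLE))
# Calling it again later simply rebuilds the list from the current data.
packages = parse_packages(io.StringIO(SAMPLE))
print([p.package for p in packages])
```

Because the function builds and returns a fresh list on every call, "parse the
data again" is just another call with the file reopened.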

> I also found it interesting that you didn't use regular expressions 
> throughout the whole thing.  Normally when I think "parse", my mind goes 
> "regex" and I immediately think "Perl" (yeah I like that language too, 
> even though that might make me unpopular on this list ;).  But in this 
> case, you didn't need them -- the file's structure was well-organized 
> and you were able to use splices of line-strings and split() to grab the 
> important parts of each line and place it into a meaningful attribute of 
> the "package" object instance.  In fact, in my head I was wondering if 
> this isn't a perfect application for an XML file, although there seems 
> to be a bit more work involved in defining the structure of an XML 
> file....
> 

I think one of the interesting bits of Python is that you can do so much
without using 'import re'.  Each language has its idioms, and for Perl regexes
are a big part.
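
As a small illustration of how far plain string methods go, here is a sketch
that pulls the package names out of one field.  The line itself is a made-up
sample in the style of a control file.

```python
# Sample field in the style of a Debian control file (made up for this example).
line = "Depends: libc6 (>= 2.2.4-4), libncurses5, debconf (>= 0.5)"

tag, value = line.split(":", 1)      # split on the first colon only
tag = tag.strip()
entries = [entry.strip() for entry in value.split(",")]
# The package name is the first whitespace-separated word of each entry.
names = [entry.split()[0] for entry in entries]

print(tag)
print(names)
```

The second argument to split() matters: it keeps any later colons inside the
value instead of breaking the line into more than two pieces.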

This file format is *OLD*.  Debian has been using it since dpkg was written,
about 6 years ago.  It is dirt simple to parse and carries no serious
overhead.  XML, despite the hype (and we are not going down this road in this
thread), is not easier for humans to parse, type or otherwise interact with.
XML just makes the parser's job "easier" compared to SGML or HTML.  For the most
part the XML "ease" comes from the fact that the rules are enforced and there
are no special cases.

>> # note I use the readline() idiom because there is currently 10 
>> thousand plus
>> # entries in the file which equates to some 90,000 lines.
>>
>> while 1:
>>     line = fd.readline()
> 
> I assume that this comment means that you use readline() as opposed to 
> readlines() ?  In other words, what are you actually saying in this 
> comment (or rather why did you feel the need to clarify why you chose 
> readline() ).
> 

I added this comment for the tutor list.  In many intro examples you see:

for line in file.readlines():
  handle(line)

which is great for quick code.  However readlines() returns a list of strings.
In my case it returns a list of roughly 90,000 strings -- memory wasted.
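
A self-contained sketch of the difference, in current Python.  io.StringIO
stands in for the real file and the sample text is made up.

```python
import io

# io.StringIO stands in for the real file so the example is self-contained.
fd = io.StringIO("Package: bash\nSection: shells\n\nPackage: vim\n")

count = 0
while True:
    line = fd.readline()   # one line at a time
    if not line:           # empty string means end of file
        break
    count += 1

fd.seek(0)
all_lines = fd.readlines()  # builds the full list in memory at once
print(count, len(all_lines))
```

Note that a blank line in the file comes back as "\n", not as an empty string,
so the end-of-file test does not stop early at a stanza break.  In newer Python
versions, looping over the file object directly (for line in fd:) also reads
one line at a time.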

>>
>>     setattr(package, tag, value)
>>
> 
> It's strange to me to see an empty class definition and then this 
> function used to modify the class.  For you, it was convenient to use 
> the Package class to wrap up your data into "packages" (that's a pun 
> actually), but aren't class definitions usually used specifically for 
> their behaviors?
> 
> In other words, I can see perfectly well that this script works 
> perfectly well for your needs, so nothing else needs to be said.  But 
> for the sake of my understanding, if a professor of OO programming were 
> to come along, wouldn't he suggest that you define some methods of the 
> Package class to do some of the work for you, rather than the 
> setattr() ?  I'm curious because this is a big difference between Python 
> and a language like Java.  You get more flexibility with Python, but it 
> seems almost like it's too much of a shortcut.
> 

In C++ or Java I would have had to define all of the attributes and methods for
this class.  Probably would have been a good 30 or so lines of code at a
minimum.  Here in Python I was able to dynamically handle this.  For rapid
coding this is a handy feature.  If I actually needed some kind of data
integrity checks the class would be fleshed out.  Part of the fun of this code
was how fast and simple it was to write.

In Perl you would have defined a hash instead of making a class and the lists
would hold references to the hash.  I like the class because when I add real
code later to use the data the class names will make this clean and easy.

For instance I can have:

for package in package_list:
  if package.maintainer == "Sean 'Shaleh' Perry <shaleh@debian.org>":
    print package.package

or:

total_size = 0
for package in package_list:
  total_size += package.size

print 'If you could install all of it, Debian is %d bytes total' % total_size

which feels more natural to me than dictionary lookups, which is how you would
do this without the class.  Basically I used a class to wrap a dictionary for
me.
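
To make the "class wraps a dictionary" point concrete, here is a minimal
sketch.  The maintainer string and size are placeholder values.

```python
class Package:
    pass

package = Package()
# setattr() with a computed name, as a parser would do for each tag it reads.
setattr(package, "maintainer", "A. Maintainer <maint@example.org>")
setattr(package, "size", 1024)

# The same data kept in a plain dictionary.
as_dict = {"maintainer": "A. Maintainer <maint@example.org>", "size": 1024}

# An ordinary instance keeps its attributes in a dictionary of its own.
print(vars(package) == as_dict)
print(package.size == as_dict["size"])
```

So package.size and as_dict["size"] reach the same kind of storage; the class
only changes how the lookup is spelled.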

>> At this point I have a list of package classes and several dictionaries 
>> holding
>> lists of these packages.  There is only one instance of the actual 
>> package in
>> memory though, the rest are references handled by python's garbage 
>> collector.
>> Most handy.
> 
> Is that a specific behavior of list.append() or is that the way that 
> references are passed in all Python data structures (by reference, not 
> by value)?
> 

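That is how Python works everywhere, not just in list.append(): names and
containers hold references to objects, and assignment never copies.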

a_list = ....
b_list = a_list
b_list[0] = something_else
print a_list[0] # you get something_else

In this case it means that my memory usage is not as large because there is
only one instance of the actual data and the lists just hold references to it.
Sure some of those lists are 5,000 elements long but it is still not too bad.
Strictly speaking, b_list = a_list above copies nothing: it binds a second name
to the same list.  A 'shallow copy' (a_list[:] or copy.copy()) builds a new list
whose elements are still the same objects.  If you want the elements duplicated
as well you want a 'deep copy', which copy.deepcopy() in the standard copy
module provides.
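
A small sketch of the three cases side by side, using the standard copy
module: plain assignment, a shallow copy, and a deep copy.  The list contents
are made up.

```python
import copy

a_list = [["bash"], ["vim"]]

alias = a_list                  # no copy: a second name for the same list
shallow = copy.copy(a_list)     # new outer list, same inner lists
deep = copy.deepcopy(a_list)    # new outer list and new inner lists

alias[0] = ["zsh"]              # visible through a_list, not through the copies
a_list[1].append("emacs")       # inner list shared with shallow, not with deep

print(a_list)
print(shallow)
print(deep)
```

Only the deep copy is untouched by both changes; the shallow copy keeps its own
first slot but still sees the append to the shared inner list.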

Glad you enjoyed it.