[Tutor] I am a friendly inoffensive non-commercial subject line

Scot W. Stevenson scot@possum.in-berlin.de
Tue, 13 Aug 2002 11:11:34 +0200


[This did have the subject line "High level file handling in the rain" but 
the SMTP mail dragon at mail.python.org didn't like that]

Hi there, 

While wading thru the rain that is slowly turning most of Central Europe 
into one big lake, my mind went from water ditches to water pipes to FIFO 
pipes to file handeling, and I finally realized why I have always felt 
that there is something fishy about the way Python handles files: It seems 
so low-level to me, so downright gutteresque compared with everything 
else. Let me explain while I'm waiting for my hair to dry (not to mention 
the cat). 

The current high-level file commands force you to deal with a class of 
questions that the Python Elves usually keep submerged: You have to /open/ 
the things, and you even have to tell the Elves is it is a file you only 
want to read from, or maybe write to, and if you think it is binary, which 
is sort of like having to decide beforehand if the variable 'lifejacket' 
is going to be an integer or a (groan) float. Even worse, this information 
is transfered by magic characters like 'r+', 'br', 'a+', which is not the 
way Python usually does stuff. Even worse, you seem to need different 
magic depending on which operating system you use, because Windows and 
Macs don't do files the same way that Unix does. 

Then you have a bunch of methods that only really make sense if you know 
how hard disks physically operate: 

/seek/ - A synonym for 'search'. So does "filename.seek(0)" search for the 
occurence of the number 0 in the file and return the position? (the user 
shouldn't be forced to know what a 'head seek' is to read stuff)

/flush/ - But I always do! No, wait - the docs say this is for "buffers", 
so it probably is used by blond female teenagers to pound wooden stakes 
into vampires. (the user should not be forced to deal with buffers on the 
highest level)

/tell/ - The docs say that this gives me "the file's position", which is 
strange, because it should be on the harddisk in '/home/scot/python/' 
(more 'head seek' hardware stuff)

And so on. After you've done all of that, you are supposed to 'close' a 
file, which is sort of like asking me to call a destructor or run the 
garbage collector by hand. Look, let's face it: If we wanted to be forced 
to clean up after ourselves, we would a) still be living with our parents 
and b) would use C or some other low(er) level language, not Python. 

Note that 'open', 'seek', 'tell', 'flush', and 'close' are terribly general 
verbs and don't give you any hint that you are working with a file (though 
'open' has been replaced by 'file' in Python 2.2). This is a whole 
different nomenclature compared with the usual Python objects: If a file 
is a sequence of characters or bytes or whatnot, why can't I splice files 
with [n:m] to get a bunch of lines or use the normal [n] to index a 
certain byte like I can with everything else?

Of course it's good to have the choice of low-level, byte-by-byte control, 
but that is what the os module is for. The top level, inbuilt commands 
should protect the casual user from all of this - at least it does with 
everything else, including division. I realize that when Python was being 
invented, throwing a thin wrapper around the C functions that everybody 
involved knew was a good idea to just be able to move on to more important 
stuff, but it still is just a thin wrapper that exposes all of the 
squishy, wet, slimy C bits I don't really want to know about. 

(And to think a week ago I was worried about the grass not getting enough 
water.)

Just for the entertainment value, and because the cat is still throwing her 
wet body against my leg, what would be wrong with the following file 
handling system:

We recognize two different types of flies: 

1) "Normal" files which consist of lines, in other words, of strings. This 
gives us strings that are sequences of characters and files that are 
sequences of strings, a nice pyramid. We'll use the 'file'-keyword for 
this type.

2) "Binary" files which consist of bytes. Note that every 'normal' file can 
be accessed as a binary file, but the reverse is not true. We'll use the 
'binfile'-keyword for this type.

Now when we want to access a file, we don't "open" it - that's the Elves' 
job - we just get right down to it:

filehandle = file('/home/scot/python/wetcat.text')  or
binhandle = binfile('flooddata.dat')

With this type of file, we can iterate, splice, or index the content 
without having to explicitly tell the Elves that we want to read or write 
or whatnot:

>>>filehandle[2] 
'drip drip drip'
>>>filehandle[0:2]
['drip', 'drip drip']

>>>binhandle[2]
4
>>>binhandle[0:2]
(1, 2)

(Or maybe both should return lists, I'm not sure what would be better). 

Since we have direct (random) access with splices and indices, we don't 
need to 'seek' and 'tell' anymore, and we progress string by string or 
byte by byte with iteration. Stuff like 'flush' and 'close' is Elves' 
work. Still, you'd want to keep the following methods:

one_string = file('damp.txt').read()
all_strings_as_a_list = file('damp.txt').readall()    

one_byte = binfile('humidity.dat').read()
all_bytes_as_a_list_or_tuple = binfile('humidity.dat').readall()

You also probably want to keep '.readline()' and '.readlines()' around as 
synonyms. Reading is easy; writing gets you into trouble, because you 
don't want to go around casually overwriting files the way you do 
variables. There are four cases:

1) File exists: Overwrite anyway
2) File exists: Raise error
3) File doesn't exist: Make new file, write
4) File doesn't exist: Raise error

For example, we could define 

filename.forcewrite(data): 1 + 3
filename.writenew(data): 2 + 3
filename.overwrite(data): 2 + 4

or whatever combination seems to make sense. Other useful methods we can 
just steal from the 'list' type: append, insert, replace, index, remove... 

The idea behind this is that newbies don't have to take a crash course in 
hard disk mechanics, don't have to memorize magic characters that look 
like ex commands, don't have to decide beforehand if they want to read and 
write, don't have to close stuff afterwards, and can just access files in 
the same way they would a list or any other sequence. 

So instead of 

outputfile = open('poldern.txt', 'w')
inputfile = open('floodgates.txt', 'r+'):
for line in inputfile:
    outputfile.write(line)
outputfile.close()
inputfile.close()

you'd have

for line in file('floodgates.txt'):
   file('poldern.txt').append(line)

Now doesn't that look a lot cleaner? /Watered down/, so to speak...

Okay, this is probably more than enough gushing from me today, and the cat 
has gone asleep. I'd be interested to hear what the people with more 
experience in design concepts and low-level file handling think of this - 
or if it is just me who thinks that the current way of doing things is not 
quite as high level as the rest of the language is.

Dry feet to all, 
Y, Scot

-- 
   Scot W. Stevenson wrote this on Tuesday, 13. Aug 2002 in Berlin, Germany
       on his happy little Linux system that has been up for 1342 hours
        and has a CPU that is falling asleep at a system load of 0.76.