[Tutor] Design Question: File Object used everywhere

Dave Angel davea at ieee.org
Fri May 14 12:44:27 CEST 2010

Jan Jansen wrote:
> Hi there,
> I'm working on a code to read and write large amounts of binary data 
> according to a given specification. In the specification there are a 
> lot of "segments" defined. The segments in turn have definitions of 
> datatypes and what they represent, how many of some of the data values 
> are present in the file and sometimes the offset from the beginning of 
> the file.
> Now I wonder, what would be a good way to model the code.
> Currently I have one class, that is the "FileReader". This class holds 
> the file object, information about the endianness and also a method to 
> read data (using the struct module). Then, I have more classes 
> representing the segments. In those classes I define data-formats, 
> call the read-method of the FileReader object and hold the data. 
> Currently I'm passing the FileReader object as argument.
> Here are some examples, first the "FileReader" class:
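> import struct
>
> class JTFile():
>     def __init__(self, file_obj):
>         self.file_stream = file_obj
>         self.version_string = ""
>         self.endian_format_prefix = ""
>     def read_data(self, fmt, pos = None):
>         # Size the read from the full format, byte-order prefix included,
>         # because the prefix also changes alignment and therefore size.
>         full_fmt = self.endian_format_prefix + fmt
>         format_size = struct.calcsize(full_fmt)
>         if pos is not None:
>             self.file_stream.seek(pos)
>         raw = self.file_stream.read(format_size)
>         return struct.unpack_from(full_fmt, raw)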
> and here an example for a segment class that uses a FileReader 
> instance (file_stream):
> class LSGSegement():
>     def __init__(self, file_stream):
>         self.file_stream = file_stream
>         self.lsg_root_element = None
>         self._read_lsg_root()
>     def _read_lsg_root(self):
>         fmt = "80Bi"
>         raw_data = self.file_stream.read_data(fmt)
>         # "80Bi" yields 81 values: 80 bytes (indices 0-79), then the int.
>         self.lsg_root_element = LSGRootElement(raw_data[:80],
>                                                raw_data[80])
> So, now I wonder, what would be a good pythonic way to model the 
> FileReader class. Maybe use global functions to avoid passing the 
> FileReader object around? Or something like a "Singleton", which I've 
> heard about but never used? Or keep it like that?
> Cheers,
> Jan
I agree with Luke's advice, but would add some comments.

As soon as you have a global (or a singleton) representing a file, 
you're making the explicit assumption that you'll never have two such 
files open.  So what happens if you need to merge two such files?  Start 
over?  You need to continue to pass something representing the file 
(JTFile object) into each constructor.
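
As a minimal sketch of why passing the object pays off, here are two 
readers open at once, something a global or singleton would rule out. 
The trimmed-down JTFile, the Header segment, and the one-integer layout 
are all hypothetical stand-ins, with io.BytesIO playing the part of real 
files:

```python
import io
import struct


class JTFile:
    """Hypothetical trimmed-down reader: wraps exactly one binary stream."""

    def __init__(self, file_obj, endian_format_prefix="<"):
        self.file_stream = file_obj
        self.endian_format_prefix = endian_format_prefix

    def read_data(self, fmt, pos=None):
        full_fmt = self.endian_format_prefix + fmt
        if pos is not None:
            self.file_stream.seek(pos)
        raw = self.file_stream.read(struct.calcsize(full_fmt))
        return struct.unpack(full_fmt, raw)


class Header:
    """Hypothetical segment: one unsigned 32-bit record count at offset 0."""

    def __init__(self, jt_file):
        (self.count,) = jt_file.read_data("I", pos=0)


# Two independent in-memory "files", each with its own reader object.
first = JTFile(io.BytesIO(struct.pack("<I", 3)))
second = JTFile(io.BytesIO(struct.pack("<I", 5)))

# Each segment is told which file it belongs to, so a merge is trivial.
merged_count = Header(first).count + Header(second).count
print(merged_count)
```

Because every constructor receives its reader explicitly, nothing in the 
segment classes has to change when a second file shows up.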

The real question is one of state, which isn't clear from your example.  
The file_stream attribute of an object of class JTFile has a file 
position, which you are implicitly using.  But you said some segments are 
at fixed positions in the file, and presumably some are serially related 
to other segments. Or perhaps some segments are really a section of the 
file containing smaller segments of different type(s).
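
To make the hidden state visible, the sketch below reports the offset at 
which each read actually started. The byte layout and the read_at helper 
are assumptions invented for illustration, not part of your format:

```python
import io
import struct

# Hypothetical layout: a 4-byte count at offset 0, a 2-byte value that
# simply follows it, padding, and a 4-byte value pinned at offset 16.
data = struct.pack("<IH", 7, 2) + b"\x00" * 10 + struct.pack("<I", 99)
stream = io.BytesIO(data)


def read_at(stream, fmt, pos=None):
    """Unpack fmt; pos=None means 'wherever the stream happens to be'."""
    if pos is not None:
        stream.seek(pos)
    start = stream.tell()
    values = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    return start, values


count_at, (count,) = read_at(stream, "<I", pos=0)    # absolute position
value_at, (value,) = read_at(stream, "<H")           # depends on prior read
fixed_at, (fixed,) = read_at(stream, "<I", pos=16)   # absolute position
print(count_at, value_at, fixed_at)
```

The middle read lands at offset 4 only because the count was read just 
before it; reorder the calls and it would silently read something else.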

Similarly, each object, after being created, probably has a relationship 
to other objects.  Without knowing that, you can't design those objects.

Finally, you need to decide early on what to do about data validation.  
If the file happens to be busted, how are you going to notify the user?  
If you read it in an ad-hoc, random order, you'll have a very hard time 
telling the user anything useful about what's wrong with it, never 
mind recovering from it.
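
One possible way to keep error reports useful is to have every read 
record which segment it was working on and at what offset. This is only 
a sketch: FormatError and read_checked are hypothetical names, and a 
short read is the only failure it checks for:

```python
import io
import struct


class FormatError(Exception):
    """Hypothetical error type that remembers where the file went wrong."""

    def __init__(self, segment, offset, message):
        super().__init__(f"{segment} at offset {offset}: {message}")
        self.segment = segment
        self.offset = offset


def read_checked(stream, fmt, segment):
    """Unpack fmt from stream, reporting truncation with its location."""
    offset = stream.tell()
    size = struct.calcsize(fmt)
    raw = stream.read(size)
    if len(raw) != size:
        raise FormatError(segment, offset,
                          f"expected {size} bytes, got {len(raw)}")
    return struct.unpack(fmt, raw)


# Five bytes only: the first field is complete, the second is cut short.
truncated = io.BytesIO(b"\x01\x00\x00\x00\x02")
read_checked(truncated, "<I", "header")
error = None
try:
    read_checked(truncated, "<I", "LSG root")
except FormatError as err:
    error = err
    print(err)
```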

It's really a problem in serialization, where you read a file by 
deserializing.  Consider whether the file is going to be always small 
enough to support simply interpreting the entire stream into a tree of 
objects, and then dealing with them.  Conceivably you can do that 
lazily, only deserializing objects as they are referenced.  But the 
possibility of doing that depends highly on whether there is what 
amounts to a "directory" in the file, or whether each object's position 
is determined by the length of the previous one.
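
Here is a rough sketch of the lazy variant, assuming the file does carry 
a directory. The layout (a count, then offset/length pairs, then the 
payloads) and the LazyFile class are invented for the example; your 
specification will differ:

```python
import io
import struct


def build_sample():
    """Build a hypothetical file: count, (offset, length) pairs, payloads."""
    payloads = [b"alpha", b"be", b"gamma!"]
    offset = 4 + 8 * len(payloads)        # payloads start after the directory
    directory = b""
    for payload in payloads:
        directory += struct.pack("<II", offset, len(payload))
        offset += len(payload)
    return struct.pack("<I", len(payloads)) + directory + b"".join(payloads)


class LazyFile:
    """Reads only the directory up front; payloads are loaded on demand."""

    def __init__(self, stream):
        self.stream = stream
        (count,) = struct.unpack("<I", stream.read(4))
        self.directory = [struct.unpack("<II", stream.read(8))
                          for _ in range(count)]
        self._cache = {}

    def segment(self, index):
        if index not in self._cache:
            offset, length = self.directory[index]
            self.stream.seek(offset)
            self._cache[index] = self.stream.read(length)
        return self._cache[index]


lazy = LazyFile(io.BytesIO(build_sample()))
print(lazy.segment(2))    # only this one payload has been read so far
```

Without the directory, reaching the third payload would mean decoding 
the first two just to learn where it starts.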

In addition to deserializing in one pass, or lazily deserializing, 
consider deserializing with callbacks. In this approach you do not 
necessarily keep the intermediate objects; you just call a 
user-specified routine, which can keep the objects if the caller cares 
about them, or process them or ignore them as needed.
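
A callback-style reader might look like the sketch below. It assumes a 
made-up format of length-prefixed records read strictly in order, does 
no validation, and parse_records and keep_long are hypothetical names:

```python
import io
import struct


def parse_records(stream, on_record):
    """Walk length-prefixed records, handing each to on_record."""
    index = 0
    while True:
        prefix = stream.read(2)
        if not prefix:
            break                         # clean end of stream
        (length,) = struct.unpack("<H", prefix)
        on_record(index, stream.read(length))
        index += 1
    return index


# Three records, each stored as a 2-byte length followed by the payload.
sample = b"".join(struct.pack("<H", len(p)) + p
                  for p in [b"one", b"three", b"xy"])

kept = []


def keep_long(index, payload):
    """User routine: keep only the payloads longer than two bytes."""
    if len(payload) > 2:
        kept.append((index, payload))


total = parse_records(io.BytesIO(sample), keep_long)
print(total, kept)
```

The parser itself holds on to nothing, so memory use is decided entirely 
by what the callback chooses to keep.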

I've had to choose each of these approaches for different projects, and 
the choice depended in large part on the definition of the data file, 
and whether it could be randomly accessed.

