[Python-Dev] standard library mimetypes module pathologically broken?

Fri Jul 31 23:16:02 CEST 2009

Hi all,

In an attempt to figure out some twisted.web code, I was reading
through the Python Standard Library’s mimetypes module today, and
was shocked at the poor quality of the code. I wonder how the
mimetypes code made it into the standard library, and whether anyone
has ever bothered to read it or update it: it is an embarrassment. 
Much of the code is redundant, portions fail to execute, control 
flow is routed through a horribly confusing mess of spaghetti, and 
most of the complexity has no clear benefit as far as I can tell. I 
probably should drop the subject and get back to work, but as a good 
citizen, it’s hard to just ignore this sort of thing.

mimetypes.py stores its types in a pair of dictionaries, one for
"strict" use, and the other for "non-standard types". It creates the
strict dictionary by default out of apache's mime.types file, and
then overrides the entries it finds with a set of exceptions. Then
it creates the non-standard dictionary, which is set to match if the
strict parameter is set to False when guessing types. Just in this
basic design, and in the list of types in the file, there are
several problems:

  * Various apache mime types files are read, if found, but the 
    ordering of the files is such that older versions of apache are
    sometimes read after newer ones, overriding updated mime types
    with out-of-date versions if multiple versions of apache are
    installed on the system.

  * The vast majority of types declared in mimetypes.py are 
    duplicates of types already declared by Apache. In a few cases
    this is to change the apache default (make an exception, that
    is), but in most cases the mime type and extension are
    completely identical. This huge number of redundant types makes
    the file substantially harder to follow. No comments are
    provided to explain why various sets of exceptions are made to
    Apache's default mime types, and in several cases mimetypes.py
    seems to just be out of date as compared to recent versions of
    Apache, for instance not knowing about the 'text/troff' type
    which was registered in January 2006 in RFC 4263.

  * The 'non-standard' type dictionary is nearly useless, because 
    all of the types it declares are already in apache's mime.types
    file, meaning that types are, as far as I can tell trying to
    follow ugly program flow, *never* drawn from the non-strict
    dictionary, except in the improbable situation where the 
    mimetypes module is initialized with a custom set of 
    apache-mime.types–like files, which does not include those 
    'non-standard' types. I personally cannot see a use case for 
    initializing the module with a custom set of mime types, but 
    then leaving the very few types included as non-strict to the 
    defaults: this seems like a fragile and pathological use case.
    Given this, I don’t see any benefit to dragging the 'strict'
    parameter along all the way through the code, and would advise 
    getting rid of it altogether. Does anyone know of any code that 
    uses the mimetypes module with strict set to False, where the 
    non-strict code path ever *actually* is executed?

But though these problems, which affect actual use of the code and
are therefore probably most important, are significant, they really
pale in comparison to the awful quality of implementation. I'll try
to briefly outline my understanding of how code flows in
mimetypes.py, and what the problems are. I haven't stepped through 
the code in a debugger, this is just from reading it, so I apologize 
in advance if I get something wrong. This is, however, some of the 
worst code I’ve seen in the standard library or anywhere else.

  * It defines __all__: I didn’t even realize __all__ could be used 
    for single-file modules (w/o submodules), but it definitely 
    shouldn’t be here. This specific __all__ oddly does not include 
    all of the documented variables and functions in the mimetypes 
    class. It’s not clear why someone calling import * here wouldn’t 
    want the bits not included.

  * It creates a _default_mime_types() function which declares a 
    bunch of global variables, and then immediately calls 
    _default_mime_types() below the definition. There is literally 
    no difference in result between this and just putting those 
    variables at the top level of the file, so I have no idea why 
    this function exists, except to make the code more confusing.

  * It allows command line usage: I don’t think this is necessary
    for a part of the standard library like this. There are better 
    tools for finding mime types from the command line which ship 
    with most operating systems.

  * Its API is pretty poorly designed. It offers 6 functions when 
    about 3 are needed, and it takes a couple reads-through of the 
    code to figure out exactly what any of them are supposed to do.

  * The operation is crazy: It defines a MimeTypes class which 
    actually stores the type mappings, but this class is designed to 
    be a singleton. The way that such a design is enforced is 
    through the use of the module-global 'init' function, which 
    makes an instance of the class, and then maps all of the 
    functions in the module global namespace to instance methods. 
    But confusingly, all such functions are also defined 
    independently of the init function, with definitions such as:

        def guess_type(url, strict=True):
            if not inited:
                init()
            return guess_type(url, strict)

    I’d be amazed if anyone could guess what that code was trying to 
    do. I did a double-take when I saw it.

    Of course, that return call is only ever reached the first time 
    this function is called, if init() has not happened yet. This 
    was all presumably done for lazy initialization, so that the 
    type information would only be loaded when needed. Needless to 
    say, there are more pythonic ways to accomplish such a goal.

    Oh, also, the other good one here is that it means that someone 
    who writes `from mimetypes import guess_types` gets something 
    different than someone who writes:
    `import mimetypes; guess_types = mimetypes.guess_types`. In the 
    former case, this wrapper function is saved as guess_type, which 
    each time just calls the (changed after init()) 
    mimetypes.guess_types function. This caused a performance 
    nightmare before March of this year, when there was no check for 
    `if not inited` before running init() (amazing!?).

  * Because the type datastore is set up to be a singleton, any time 
    init() is called in one section of code, it resets any types 
    which have been added manually: this means that if init() is 
    called by different pieces of code in the same python program, 
    they will interfere with each-others’ type databases, and break 
    each-other. This is extremely fragile and, in my opinion, crazy. 
    It is hard for me to imagine any use case that would benefit 
    from this ability to clobber custom type mappings, and I very
    much doubt that any code calling the mimetypes module realizes 
    that the contract of the API is so flimsy by definition. In 
    practice, I would not advise consumers of this API to ever call 
    init() manually, or to ever add custom mime type mappings, 
    because they are setting themselves up for hard-to-track bugs 
    down the line.

  * The 'inited' flag is a documented part of the interface, in the
    standard library documentation. I cannot imagine any reason to 
    set this flag manually: setting it to false when it was true 
    will have no effect, because the top-level functions have 
    already been replaced by instance methods of the 'db' MimeTypes 
    instance. Setting it to true when it was false will make the 
    code just break outright.

  * In python 3, this has been changed a bit. There’s still an 
    inited flag, and it still in the docs, but now awful code from 
    above has been changed slightly, to:

        def guess_type(url, strict=True):
            if _db is None:
                init()
                return _db.guess_type(url, strict)

    Which is still embarrassingly confusing. On the upside, the 
    inited flag now does literally nothing, but remains defined, and 
    in the docs.

  * The 'types_map' and 'common_types' (for 'strict' and 
    'common' types, respectively) dictionaries are also a documented 
    part of the interface. When init() is called, a new MimeTypes 
    instance makes a (different) types_map which is a tuple of two 
    dictionaries, for 'strict' and 'common' types. Then this 
    instance reads the apache mime.types files and adds the types to 
    its pair of self.types_map dictionaries, and then after that 
    looks at the global types_map and common_types dictionaries and 
    adds *those* types to its self.types_map. Then at the end it 
    replaces the global types_map with self.types_map[True] and 
    replaces common_types with self.types_map[False]. Unfortunately, 
    while changing these dictionaries will have an effect on the 
    operation of the library, it will not update the types_map_inv 
    mapping, so inverse lookups will not behave as the changer 
    expects. If these dictionaries are going to remain documented, 
    the documentation should be clear to describe them as read only 
    to avoid very confusing bugs.

  * Speaking of these dictionaries, .copy() is called on those two 
    and a few other inside MimeTypes.__init__(), which happens every 
    time the global init() function is called, but then init() puts 
    the copies back in the global namespace, meaning that the 
    original is discarded. Basically the only reason for the .copy() 
    is to make sure that the correct updates are applied to the 
    apache mimetype defaults, but the code will gladly re-read all 
    of the apache files even after its mapped types are already in 
    these dictionaries, essentially making re-initializing a (very 
    expensive) no-op. All we’re doing is a lot of unnecessary extra 
    disk reads and memory allocations and deallocations. The only 
    time this has any effect is when a non-singleton MimeTypes 
    instance is created, as in the read_mime_types function.

  * And that read_mime_types function is a doozy. It tries to open a 
    filename, spits back None if there’s an IOError (instead of 
    raising the exception as it should), and then creates a new 
    MimeTypes instance (remember, this is identical to the singleton 
    MimeTypes instance because it starts itself from that one’s 
    mappings), adds any new types it finds in the file with that 
    name, and then returns the 'strict' types_map from it. I’m not 
    sure whether any sane user of this API would expect it to return 
    the existing type mappings *plus* the extra ones in the provided 
    filename, but I really can’t imagine this function ever being 
    particularly useful: it requires you are reading mime types in 
    apache format, but not the apache mime type files you already 
    looked at, and then the only way to find out what new mappings 
    were defined is to take the difference of the default mappings 
    with the result of the function.

  * The code itself, on a line-by-line basis, is unpythonic and 
    unnecessarily verbose, confusing, and slow. The code should be 
    rewritten to use python 2.3–2.6 features: even leaving its 
    functionality identical it could be cut to about half the number 
    of lines, and made clearer.

In case the above doesn’t make this clear: this code is extremely
confusing. Trying to read it has caused all the people around me to
look up as I shout "what the fuck??!" at the screen every few
minutes, as each new revelation gives another surprise. I’m not
convinced that I completely understand what the code does, because
it has been quite effectively obfuscated, but I understand enough to
want to throw the whole thing out, and start essentially from 
scratch.

So the question is, what should be done about this? I’d like to hear
how people use the mimetypes module, and how they expect it to work,
to figure out the sanest possible mostly-backwards-compatible
replacement which could be dropped in (ideally this would just allow
the use of default mimetypes and rip out the ability to alter the
default datastore: or is there some easy way to change this away
from a singleton without breaking code which calls these methods?),
and then extend that replacement to support a somewhat saner model
for anyone who actually wants to extend the set of mappings. My 
guess is that replacement code could actually fix subtle bugs in 
existing uses of this module, by people who had a sane expectation 
of how it was supposed to work.

At the very least, the parts about figuring out exactly which
exceptions to Apache’s set of default types are useful would be a
good idea, and I’d maybe even recommend including an up-to-date copy
of Apache’s mime.types file in the Python Standard Library, and then
only overriding its definitions for future versions of Apache (and
then overriding the combination of both of those with further
exceptions deemed useful for python, with comments explaining why
each exception), so that we’re not bothering to look up horribly
out-of-date types in multiple locations from Apache 1, 1.2, 1.3,
etc. I’d also recommend making the API for overriding definitions be
the same as the code used to declare the default overrides, because
as it is there are three ways do define types: a) in a mime.types
formatted file, b) in a python dictionary that gets initialized with
a confusing bit of code, and c) through the add_type function.

Does anyone else have thoughts about this, or maybe some good (it
had better be *really* good) explanations why this code is the way
it is? I'd be happy to try to rewrite it, but I think I’d need a bit
of help figuring out how to make the rewrite backwards-compatible.

Note: someone else has had fun with this module:
<http://lucumr.pocoo.org/2009/3/1/the-1000-speedup-or-the-stdlib-sucks>
<http://lucumr.pocoo.org/2009/7/24/singletons-and-their-problems-in-python>

Cheers,
Jacob Rus