[Python-Dev] standard library mimetypes module pathologically broken?

Sat Aug 1 00:17:53 CEST 2009

On Fri, Jul 31, 2009 at 14:16, Jacob Rus <jacobolus at gmail.com> wrote:

> Hi all,
>
> In an attempt to figure out some twisted.web code, I was reading
> through the Python Standard Library’s mimetypes module today, and
> was shocked at the poor quality of the code. I wonder how the
> mimetypes code made it into the standard library, and whether anyone
> has ever bothered to read it or update it: it is an embarrassment.
> Much of the code is redundant, portions fail to execute, control
> flow is routed through a horribly confusing mess of spaghetti, and
> most of the complexity has no clear benefit as far as I can tell. I
> probably should drop the subject and get back to work, but as a good
> citizen, it’s hard to just ignore this sort of thing.
>

I have not looked at the code nor ever used it (that I can remember) so I
can't directly address the quality. But I can say the code was added in 1997
which puts it as an addition in Python 1.4. That's way before Python took
off in the mainstream and began to tighten up the quality control on the standard
library.

I also would like to say that I am not embarrassed by anything in Python.
It's unfortunate if the mimetypes module's code is a mess, but I think
calling it embarrassing is taking it a little far and borderline insulting
(which I don't think you meant to do).

>
> mimetypes.py stores its types in a pair of dictionaries, one for
> "strict" use, and the other for "non-standard types". It creates the
> strict dictionary by default out of apache's mime.types file, and
> then overrides the entries it finds with a set of exceptions. Then
> it creates the non-standard dictionary, which is set to match if the
> strict parameter is set to False when guessing types. Just in this
> basic design, and in the list of types in the file, there are
> several problems:
>
>  * Various apache mime types files are read, if found, but the
>    ordering of the files is such that older versions of apache are
>    sometimes read after newer ones, overriding updated mime types
>    with out-of-date versions if multiple versions of apache are
>    installed on the system.
>
>  * The vast majority of types declared in mimetypes.py are
>    duplicates of types already declared by Apache. In a few cases
>    this is to change the apache default (make an exception, that
>    is), but in most cases the mime type and extension are
>    completely identical. This huge number of redundant types makes
>    the file substantially harder to follow. No comments are
>    provided to explain why various sets of exceptions are made to
>    Apache's default mime types, and in several cases mimetypes.py
>    seems to just be out of date as compared to recent versions of
>    Apache, for instance not knowing about the 'text/troff' type
>    which was registered in January 2006 in RFC 4263.
>
>  * The 'non-standard' type dictionary is nearly useless, because
>    all of the types it declares are already in apache's mime.types
>    file, meaning that types are, as far as I can tell trying to
>    follow ugly program flow, *never* drawn from the non-strict
>    dictionary, except in the improbable situation where the
>    mimetypes module is initialized with a custom set of
>    apache-mime.types–like files, which does not include those
>    'non-standard' types. I personally cannot see a use case for
>    initializing the module with a custom set of mime types, but
>    then leaving the very few types included as non-strict to the
>    defaults: this seems like a fragile and pathological use case.
>    Given this, I don’t see any benefit to dragging the 'strict'
>    parameter along all the way through the code, and would advise
>    getting rid of it altogether. Does anyone know of any code that
>    uses the mimetypes module with strict set to False, where the
>    non-strict code path ever *actually* is executed?
>
> But though these problems, which affect actual use of the code and
> are therefore probably most important, are significant, they really
> pale in comparison to the awful quality of implementation. I'll try
> to briefly outline my understanding of how code flows in
> mimetypes.py, and what the problems are. I haven't stepped through
> the code in a debugger, this is just from reading it, so I apologize
> in advance if I get something wrong. This is, however, some of the
> worst code I’ve seen in the standard library or anywhere else.
>
>  * It defines __all__: I didn’t even realize __all__ could be used
>    for single-file modules (w/o submodules), but it definitely
>    shouldn’t be here.

__all__ is used to control what a module exports when used in an import *,
nothing more. Thus its use in a module compared to a package is completely
legitimate.
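
As a minimal sketch of that point (the module name demo_mod and its two
functions are made up for illustration), a single-file module can define
__all__, and only the listed names come through a star import; everything
else stays reachable by attribute access:

```python
import sys
import types

# Build a throwaway single-file module in memory (hypothetical name "demo_mod").
source = '''
__all__ = ["public_func"]

def public_func():
    return "exported"

def helper():
    return "not exported by import *"
'''
demo_mod = types.ModuleType("demo_mod")
exec(source, demo_mod.__dict__)
sys.modules["demo_mod"] = demo_mod

# Run the star import into a scratch namespace so we can inspect what arrived.
namespace = {}
exec("from demo_mod import *", namespace)

# Ignore dunder entries such as __builtins__ that exec() adds on its own.
star_names = sorted(name for name in namespace if not name.startswith("__"))
print(star_names)

# Names left out of __all__ are still available explicitly.
print(demo_mod.helper())
```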

> This specific __all__ oddly does not include
>    all of the documented variables and functions in the mimetypes
>    module. It’s not clear why someone calling import * here wouldn’t
>    want the bits not included.

If something is documented but not listed in __all__ that is a bug.

>
>
>  * It creates a _default_mime_types() function which declares a
>    bunch of global variables, and then immediately calls
>    _default_mime_types() below the definition. There is literally
>    no difference in result between this and just putting those
>    variables at the top level of the file, so I have no idea why
>    this function exists, except to make the code more confusing.
>

It could potentially be used for testing, but that's a guess.

>
>  * It allows command line usage: I don’t think this is necessary
>    for a part of the standard library like this. There are better
>    tools for finding mime types from the command line which ship
>    with most operating systems.

Yeah, various modules have command-line versions which are not truly
necessary. This can probably stand to go.

>
>
>  * Its API is pretty poorly designed. It offers 6 functions when
>    about 3 are needed, and it takes a couple reads-through of the
>    code to figure out exactly what any of them are supposed to do.
>
>  * The operation is crazy: It defines a MimeTypes class which
>    actually stores the type mappings, but this class is designed to
>    be a singleton. The way that such a design is enforced is
>    through the use of the module-global 'init' function, which
>    makes an instance of the class, and then maps all of the
>    functions in the module global namespace to instance methods.
>    But confusingly, all such functions are also defined
>    independently of the init function, with definitions such as:
>
>        def guess_type(url, strict=True):
>            if not inited:
>                init()
>            return guess_type(url, strict)
>
>    I’d be amazed if anyone could guess what that code was trying to
>    do. I did a double-take when I saw it.
>

Probably came from someone who is very OO happy. Not everyone comes to
Python ready to embrace its procedural or slightly functional facets.
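
For readers trying to follow the quoted snippet, here is a rough sketch of
the rebinding trick it relies on. This is not the real mimetypes source: the
module name fake_mimetypes, the _Db class, and the wrapper_calls counter are
all hypothetical, and the lookup is a stub. The call that looks recursive
only terminates because init() rebinds the module-level name first, and a
reference captured before init() keeps pointing at the wrapper:

```python
import types

# Hypothetical stand-in module; only the control flow mirrors the quoted code.
fake = types.ModuleType("fake_mimetypes")
exec('''
inited = False
wrapper_calls = 0

class _Db:
    def guess_type(self, url, strict=True):
        # Stub lookup, just enough to return something.
        return "text/plain" if url.endswith(".txt") else None

def guess_type(url, strict=True):
    global wrapper_calls
    wrapper_calls += 1
    if not inited:
        init()
    # Not recursion: by now the global name refers to the bound method.
    return guess_type(url, strict)

def init():
    global inited, guess_type
    inited = True
    guess_type = _Db().guess_type
''', fake.__dict__)

# What "from fake_mimetypes import guess_type" would have captured.
early_ref = fake.guess_type
early_ref("a.txt")
early_ref("b.txt")

# Attribute lookup after init() finds the bound method instead.
late_ref = fake.guess_type
print(fake.wrapper_calls, early_ref is late_ref)
```

Every call through early_ref pays for the extra wrapper hop, while callers
that look the name up on the module after init() skip it.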

>
>    Of course, that return call is only ever reached the first time
>    this function is called, if init() has not happened yet. This
>    was all presumably done for lazy initialization, so that the
>    type information would only be loaded when needed. Needless to
>    say, there are more pythonic ways to accomplish such a goal.
>
>    Oh, also, the other good one here is that it means that someone
>    who writes `from mimetypes import guess_type` gets something
>    different than someone who writes:
>    `import mimetypes; guess_type = mimetypes.guess_type`. In the
>    former case, this wrapper function is saved as guess_type, which
>    each time just calls the (changed after init())
>    mimetypes.guess_type function. This caused a performance
>    nightmare before March of this year, when there was no check for
>    `if not inited` before running init() (amazing!?).
>
>  * Because the type datastore is set up to be a singleton, any time
>    init() is called in one section of code, it resets any types
>    which have been added manually: this means that if init() is
>    called by different pieces of code in the same python program,
>    they will interfere with each-others’ type databases, and break
>    each-other. This is extremely fragile and, in my opinion, crazy.
>    It is hard for me to imagine any use case that would benefit
>    from this ability to clobber custom type mappings, and I very
>    much doubt that any code calling the mimetypes module realizes
>    that the contract of the API is so flimsy by definition. In
>    practice, I would not advise consumers of this API to ever call
>    init() manually, or to ever add custom mime type mappings,
>    because they are setting themselves up for hard-to-track bugs
>    down the line.
>
>  * The 'inited' flag is a documented part of the interface, in the
>    standard library documentation. I cannot imagine any reason to
>    set this flag manually: setting it to false when it was true
>    will have no effect, because the top-level functions have
>    already been replaced by instance methods of the 'db' MimeTypes
>    instance. Setting it to true when it was false will make the
>    code just break outright.
>
>  * In python 3, this has been changed a bit. There’s still an
>    inited flag, and it is still in the docs, but now the awful code from
>    above has been changed slightly, to:
>
>        def guess_type(url, strict=True):
>            if _db is None:
>                init()
>            return _db.guess_type(url, strict)
>
>    Which is still embarrassingly confusing. On the upside, the
>    inited flag now does literally nothing, but remains defined, and
>    in the docs.
>
>  * The 'types_map' and 'common_types' (for 'strict' and
>    'common' types, respectively) dictionaries are also a documented
>    part of the interface. When init() is called, a new MimeTypes
>    instance makes a (different) types_map which is a tuple of two
>    dictionaries, for 'strict' and 'common' types. Then this
>    instance reads the apache mime.types files and adds the types to
>    its pair of self.types_map dictionaries, and then after that
>    looks at the global types_map and common_types dictionaries and
>    adds *those* types to its self.types_map. Then at the end it
>    replaces the global types_map with self.types_map[True] and
>    replaces common_types with self.types_map[False]. Unfortunately,
>    while changing these dictionaries will have an effect on the
>    operation of the library, it will not update the types_map_inv
>    mapping, so inverse lookups will not behave as the changer
>    expects. If these dictionaries are going to remain documented,
>    the documentation should be clear to describe them as read only
>    to avoid very confusing bugs.
>
>  * Speaking of these dictionaries, .copy() is called on those two
>    and a few others inside MimeTypes.__init__(), which happens every
>    time the global init() function is called, but then init() puts
>    the copies back in the global namespace, meaning that the
>    original is discarded. Basically the only reason for the .copy()
>    is to make sure that the correct updates are applied to the
>    apache mimetype defaults, but the code will gladly re-read all
>    of the apache files even after its mapped types are already in
>    these dictionaries, essentially making re-initializing a (very
>    expensive) no-op. All we’re doing is a lot of unnecessary extra
>    disk reads and memory allocations and deallocations. The only
>    time this has any effect is when a non-singleton MimeTypes
>    instance is created, as in the read_mime_types function.
>
>  * And that read_mime_types function is a doozy. It tries to open a
>    filename, spits back None if there’s an IOError (instead of
>    raising the exception as it should), and then creates a new
>    MimeTypes instance (remember, this is identical to the singleton
>    MimeTypes instance because it starts itself from that one’s
>    mappings), adds any new types it finds in the file with that
>    name, and then returns the 'strict' types_map from it. I’m not
>    sure whether any sane user of this API would expect it to return
>    the existing type mappings *plus* the extra ones in the provided
>    filename, but I really can’t imagine this function ever being
>    particularly useful: it requires you are reading mime types in
>    apache format, but not the apache mime type files you already
>    looked at, and then the only way to find out what new mappings
>    were defined is to take the difference of the default mappings
>    with the result of the function.
>
>  * The code itself, on a line-by-line basis, is unpythonic and
>    unnecessarily verbose, confusing, and slow. The code should be
>    rewritten to use python 2.3–2.6 features: even leaving its
>    functionality identical it could be cut to about half the number
>    of lines, and made clearer.
>
> In case the above doesn’t make this clear: this code is extremely
> confusing.

Yeah, kind of picked up on that. =)
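
The stale inverse mapping described in the quoted list can be reproduced on
a private MimeTypes instance, so the shared module state stays untouched. A
small sketch, assuming the extension .zzdemo and the type
application/x-zzdemo (both invented here) are not already registered on the
system:

```python
import mimetypes

db = mimetypes.MimeTypes()

# Write straight into the strict forward dictionary, bypassing add_type().
db.types_map[True][".zzdemo"] = "application/x-zzdemo"

# The forward lookup sees the edit.
forward = db.guess_type("report.zzdemo")[0]

# The inverse map was never told about it.
inverse_before = db.guess_extension("application/x-zzdemo")

# add_type() records the mapping in both directions.
db.add_type("application/x-zzdemo", ".zzdemo")
inverse_after = db.guess_extension("application/x-zzdemo")

print(forward, inverse_before, inverse_after)
```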

> Trying to read it has caused all the people around me to
> look up as I shout "what the fuck??!" at the screen every few
> minutes, as each new revelation gives another surprise. I’m not
> convinced that I completely understand what the code does, because
> it has been quite effectively obfuscated, but I understand enough to
> want to throw the whole thing out, and start essentially from
> scratch.
>
> So the question is, what should be done about this? I’d like to hear
> how people use the mimetypes module, and how they expect it to work,
> to figure out the sanest possible mostly-backwards-compatible
> replacement which could be dropped in (ideally this would just allow
> the use of default mimetypes and rip out the ability to alter the
> default datastore: or is there some easy way to change this away
> from a singleton without breaking code which calls these methods?),
> and then extend that replacement to support a somewhat saner model
> for anyone who actually wants to extend the set of mappings. My
> guess is that replacement code could actually fix subtle bugs in
> existing uses of this module, by people who had a sane expectation
> of how it was supposed to work.
>
> At the very least, the parts about figuring out exactly which
> exceptions to Apache’s set of default types are useful would be a
> good idea, and I’d maybe even recommend including an up-to-date copy
> of Apache’s mime.types file in the Python Standard Library, and then
> only overriding its definitions for future versions of Apache (and
> then overriding the combination of both of those with further
> exceptions deemed useful for python, with comments explaining why
> each exception), so that we’re not bothering to look up horribly
> out-of-date types in multiple locations from Apache 1, 1.2, 1.3,
> etc. I’d also recommend making the API for overriding definitions be
> the same as the code used to declare the default overrides, because
> as it is there are three ways to define types: a) in a mime.types
> formatted file, b) in a python dictionary that gets initialized with
> a confusing bit of code, and c) through the add_type function.
>
> Does anyone else have thoughts about this, or maybe some good (it
> had better be *really* good) explanations why this code is the way
> it is? I'd be happy to try to rewrite it, but I think I’d need a bit
> of help figuring out how to make the rewrite backwards-compatible.

So the problem of changing fundamentally how the code works, even for a
cleanup, is that it will break someone's code out there because they
depended on the module's crazy way of doing things. Now if they are cheating
and looking at things that are meant to be hidden you might be able to clean
things up, but if the semantics are exposed to the user, then there is not
much we can do w/o breaking someone's code.
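
One option that needs no change to the module, sketched under the assumption
that callers are willing to hold their own MimeTypes object instead of using
the module-level functions: each instance keeps its own dictionaries, so
custom types added to one are not visible to another. The variable names and
the .zzplugin extension below are illustrative only:

```python
import mimetypes

# Two independent databases, each seeded from the defaults.
plugin_db = mimetypes.MimeTypes()
other_db = mimetypes.MimeTypes()

# Register a custom type on one instance only.
plugin_db.add_type("application/x-zzplugin", ".zzplugin")

plugin_guess = plugin_db.guess_type("data.zzplugin")[0]
other_guess = other_db.guess_type("data.zzplugin")[0]

print(plugin_guess, other_guess)
```

This does nothing for existing code that already depends on the shared
module-level state, which is the compatibility problem described above.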

Honestly, if the code is as bad as it seems -- including its API --, the
best bet would be to come up with a new module for handling MIME types from
scratch, put it up on the Cheeseshop/PyPI, and get the community behind it.
If the community picks it up as the de-facto replacement for mimetypes and
the code has settled we can then talk about adding it to the standard
library and begin deprecating mimetypes.

And thanks for being willing to volunteer to fix this.

-Brett