upload file via web form

Fri Mar 7 18:35:55 EST 2003

[John Hunter wrote]
> 
> I want to automatically upload some data to a password protected web
> server where some form variables must be filled out and a file must be
> uploaded.
> 
> I know how to do it directly with httplib.  Is there a better way?

I don't think so. Normally urllib.urlopen() (or urllib2.urlopen()) would
be what you should use here; however, they do not handle HTTP POST
requests properly when a file upload is involved. Uploading form data
that includes a file means that the HTTP POST body must be encoded as
multipart/form-data. .urlopen() is hardcoded to use a Content-Type
header of application/x-www-form-urlencoded (which is what you want if
_no_ file is involved).
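
To make the difference between the two encodings concrete, here is a
minimal sketch that encodes form data both ways. It assumes the Python 3
standard library (urllib.parse) rather than the Python 2 modules used in
this thread. The helper name encode_multipart, the fixed boundary, and
the sample values are hypothetical and exist only for illustration; this
is not the attached implementation.

```python
from urllib.parse import urlencode


def encode_multipart(fields, files, boundary):
    """Build a multipart/form-data body by hand (illustrative sketch only).

    "fields" maps names to string values; "files" maps names to
    (filename, text_content) pairs. "boundary" must not occur in any
    value. Returns (content_type, body_bytes).
    """
    lines = []
    for name, value in fields.items():
        lines += ["--" + boundary,
                  'Content-Disposition: form-data; name="%s"' % name,
                  "",
                  value]
    for name, (filename, content) in files.items():
        lines += ["--" + boundary,
                  'Content-Disposition: form-data; name="%s"; filename="%s"'
                  % (name, filename),
                  "Content-Type: text/plain",
                  "",
                  content]
    # The closing delimiter carries two extra trailing dashes.
    lines += ["--" + boundary + "--", ""]
    body = "\r\n".join(lines).encode("ascii")
    return "multipart/form-data; boundary=" + boundary, body


# No file involved: the urlencoded form is enough.
simple_body = urlencode({"name": "Gisle Aas", "born": "1964"})

# A file is involved: each field becomes its own MIME part.
content_type, multipart_body = encode_multipart(
    {"name": "Gisle Aas"},
    {"init": (".profile", "export PATH")},
    boundary="6G+f",  # hypothetical fixed boundary; real code should randomize
)
```

The urlencoded body is a single line of name=value pairs, while the
multipart body repeats the boundary before every part and once more,
with two trailing dashes, at the end. The server needs the boundary from
the Content-Type header to split the parts, which is why that header
cannot simply be left at the urlencoded default.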

Here is a method that I have been working on to do this. I suppose a bug
should be logged on urlopen for this as well.

Trent

-- 
Trent Mick
TrentM at ActiveState.com
-------------- next part --------------
"""Some httplib helper methods."""

def httprequest(url, postdata={}, headers={}):
    """A urllib.urlopen() replacement for http://... that gets the
    content-type right for multipart POST requests.

    "url" is the http URL to open.
    "postdata" is a dictionary describing data to post. If the dict is
        empty (the default) a GET request is made, otherwise a POST
        request is made. Each postdata item maps a string name to
        either:
        - a string value; or
        - a file part specification of the form:
            {"filename": <filename>,    # file to load content from
             "content": <content>,      # (optional) file content
             "headers": <headers>}      # (optional) headers
          <filename> is used to load the content (can be overridden by
          <content>) and as the filename to report in the request.
          <headers> is a dictionary of headers to use for the part.
          Note: currently the file part content must be US-ASCII text.
    "headers" is an optional dictionary of headers to send with the
        request. Note that the "Content-Type" and "Content-Length"
        headers are automatically determined.

    The current urllib.urlopen() *always* uses:
        Content-Type: application/x-www-form-urlencoded
    for POST requests. This is incorrect if the postdata includes a file
    to upload. If a file is to be posted, the request must use:
        Content-Type: multipart/form-data

    This returns the response content if the request was successful
    (HTTP code 200). Otherwise an IOError is raised.

    For example, this invocation:
        url = 'http://www.perl.org/survey.cgi'
        postdata = {
            "name": "Gisle Aas",
            "email": "gisle at aas.no",
            "gender": "M",
            "born": "1964",
            "init": {"filename": "~/.profile"},
        }
    Would generate a request similar to this (your boundary and
    ~/.profile content will likely be different):
        POST http://www.perl.org/survey.cgi
        Content-Length: 388
        Content-Type: multipart/form-data; boundary="6G+f"

        --6G+f
        Content-Disposition: form-data; name="name"

        Gisle Aas
        --6G+f
        Content-Disposition: form-data; name="email"

        gisle at aas.no
        --6G+f
        Content-Disposition: form-data; name="gender"

        M
        --6G+f
        Content-Disposition: form-data; name="born"

        1964
        --6G+f
        Content-Disposition: form-data; name="init"; filename=".profile"
        Content-Type: text/plain

        PATH=/local/perl/bin:$PATH
        export PATH
        --6G+f--

    Limitations:
        - I don't think binary files are handled properly. And I don't
          think Unicode files will be handled properly. We will have to
          get smart on allowing the mimetype and (if text) charset to be
          specified. By default we try to guess: text/plain or
          application/octet-stream. If text/* then try to guess the
          charset. See Lib/email/Charset.py for inspiration here. There
          are also a couple of Python Cookbook recipes for encoding
          guessing.
        - This doesn't handle some HTTP status codes the way
          urllib.urlopen() does for 301, 302 and 401.
        - I don't know if the return semantics are good. For instance
          the response headers are not accessible.

    Inspiration: Perl's HTTP::Request module.
    http://aspn.activestate.com/ASPN/Reference/Products/ActivePerl/site/lib/HTTP/Request/Common.html
    """
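    import httplib, urllib, urlparse
    from email.MIMEText import MIMEText
    from email.MIMEMultipart import MIMEMultipart

    # Work on a copy so that neither the caller's dict nor the shared
    # default argument is modified when headers are added below.
    headers = dict(headers)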

    if not url.startswith("http://"):
        raise ValueError("Invalid URL, only http:// URLs are allowed: "
                         "url='%s'" % url)

    if not postdata:
        method = "GET"
        body = None
    else:
        method = "POST"

        # Determine whether a multipart content-type is required:
        # 'contentType'.
        for part in postdata.values():
            if isinstance(part, dict):
                contentType = "multipart/form-data"
                break
        else:
            contentType = "application/x-www-form-urlencoded"
        headers["Content-Type"] = contentType

        # Encode the post data: 'body'.
        if contentType == "application/x-www-form-urlencoded":
            body = urllib.urlencode(postdata)
        elif contentType == "multipart/form-data":
            message = MIMEMultipart(_subtype="form-data")
            for name, value in postdata.items():
                if isinstance(value, dict):
                    # Get content.
                    if "content" in value:
                        content = value["content"]
                    else:
                        fp = open(value["filename"], "rb")
                        content = fp.read()
                        fp.close()

                    # Create text part. Do not use ctor to set payload
                    # to avoid adding a trailing newline.
                    part = MIMEText(None)
                    part.set_payload(content, "us-ascii")

                    # Add content-disposition header.
                    dispHeaders = value.get("headers", {})
                    if "Content-Disposition" not in dispHeaders:
                        #XXX Should be a case-INsensitive check.
                        part.add_header("Content-Disposition", "form-data",
                                        name=name, filename=value["filename"])
                    # Iterate over (name, value) pairs, not just the keys.
                    for dhName, dhValue in dispHeaders.items():
                        part.add_header(dhName, dhValue)
                else:
                    # Do not use ctor to set payload to avoid adding a
                    # trailing newline.
                    part = MIMEText(None)
                    part.set_payload(value, "us-ascii")
                    part.add_header("Content-Disposition", "form-data",
                                    name=name)
                message.attach(part)
            message.epilogue = "" # Make sure body ends with a newline.
            # Split off the headers block from the .as_string() to get
            # just the message content. Also add the multipart Message's
            # headers (mainly to get the Content-Type header _with_ the
            # boundary attribute).
            headerBlock, body = message.as_string().split("\n\n",1)
            for hName, hValue in message.items():
                headers[hName] = hValue
            #print "XXX ~~~~~~~~~~~~ multi-part body ~~~~~~~~~~~~~~~~~~~"
            #import sys
            #sys.stdout.write(body)
            #print "XXX ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        else:
            raise ValueError("Invalid content-type: '%s'" % contentType)

    # Make the HTTP request and get the response.
    # Precondition: 'url', 'method', 'headers', 'body' are all set up properly.
    scheme, netloc, path, parameters, query, fragment = urlparse.urlparse(url)
    if parameters or query or fragment:
        raise ValueError("Unexpected URL form: parameters, query or "
                         "fragment parts are not allowed: parameters=%r, "
                         "query=%r, fragment=%r"
                         % (parameters, query, fragment))
    conn = httplib.HTTPConnection(netloc)
    try:
        conn.request(method, path, body, headers)
        response = conn.getresponse()

        # Process the response. Here is a summary of HTTP responses:
        #   http://www.btinternet.com/~wildfire/reference/httpstatus/index.htm
        if response.status == 200:
            return response.read()
        else:
            #print "XXX http error:"
            #print "    status:", response.status
            #print "    reason:", response.reason
            #print "    msg:", response.msg
            raise IOError, ('http error', response)
    finally:
        conn.close()