[Python-bugs-list] [ python-Bugs-231249 ] cgi.py opens too many (temporary) files

noreply@sourceforge.net noreply@sourceforge.net
Thu, 12 Apr 2001 21:04:48 -0700


Bugs item #231249, was updated on 2001-02-06 04:59
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=231249&group_id=5470

Category: Python Library
Group: None
Status: Open
Priority: 5
Submitted By: Richard van de Stadt (stadt)
Assigned to: Guido van Rossum (gvanrossum)
Summary: cgi.py opens too many (temporary) files

Initial Comment:
cgi.FieldStorage() is used to get the contents of a webform. It turns out that
for each <input> field, a new temporary file is opened. This causes the script
that uses cgi.FieldStorage() to hit the webserver's limit on the number of
open files, as reported by 'ulimit -n'. The standard value on Solaris systems
seems to be 64, so webforms with that many <input> fields cannot be dealt
with.

A solution would seem to be to reuse the same temporary filename, since at most
one file is (temporarily) in use at any given time. I ran "ls | wc -l"
while the script was running, and it showed only zeroes and ones.

(I'm using Python for CyberChair, an online paper submission and reviewing system.
The webform under discussion has one input field for each reviewer, stating the
papers he or she is supposed to be reviewing. One conference that is using CyberChair
has almost 140 reviewers. Their system's open file limit is 64. A system with an
open file limit of 260 _is_ able to deal with the same data.)
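For reference, the per-process limit in question can be inspected from Python itself; a minimal Unix-only sketch (not from the original report) using the stdlib resource module:

```python
import resource

# The per-process open-file limit that 'ulimit -n' reports.
# soft is the enforced limit; hard is the ceiling the process
# may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)
```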

----------------------------------------------------------------------

Comment By: Kurt B. Kaiser (kbk)
Date: 2001-04-12 21:04

Message:
Logged In: YES 
user_id=149084

The patch posted 11 Apr is a neat and compact solution! 

The only problem I can imagine would be a form with a large number of (small) fields 
that set the content-length attribute. I don't have an example of such a form, though. Text fields, 
perhaps? If that turned out to be a realistic problem, a solution might be for make_file() to maintain 
a pool of temporary files; if the field (binary or not) turned out to be small, a StringIO could be 
created and the temporary file returned to the pool.

There are a couple of things I've been thinking about in cgi.py; the patch doesn't seem to change the 
situation one way or the other:

There doesn't seem to be any RFC requirement that a file upload be accompanied by a content-length 
attribute, regardless of whether it is binary or ASCII; in fact, some of the RFC examples I've seen omit it. If 
content-length is not specified, the upload is processed by file.readline(). Can this cause problems for 
arbitrary binary files?
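For what it's worth, readline() splits binary data on b'\n' only, so a field with no newline comes back as one arbitrarily long "line"; a small illustration (not from the thread, shown in modern Python):

```python
import io

# Binary payload with one embedded newline and no trailing one.
blob = b"\x00\x01binary\nmore\x02bytes without trailing newline"
lines = io.BytesIO(blob).readlines()

# readline()/readlines() split on b'\n' only, so binary content is
# fragmented at arbitrary points, and a chunk with no newline at all
# is returned as a single, possibly very large, line.
print(lines)
```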

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-04-12 11:59

Message:
Logged In: YES 
user_id=6380

Uploading a new, more complicated patch. I don't like it as
much, but it works even if the caller uses
item.file.fileno().
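The fileno() constraint is easy to see in modern Python: an in-memory buffer exposes no OS-level descriptor, so any caller that does item.file.fileno() needs a real file (illustration only, not the actual patch):

```python
import io
import tempfile

# A real temporary file exposes an OS-level descriptor...
tmp = tempfile.TemporaryFile()
fd_ok = isinstance(tmp.fileno(), int)
tmp.close()

# ...but an in-memory buffer does not, so silently swapping one in
# breaks callers that use item.file.fileno().
try:
    io.BytesIO(b"field data").fileno()
    buffer_has_fd = True
except io.UnsupportedOperation:
    buffer_has_fd = False

print(fd_ok, buffer_has_fd)
```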

----------------------------------------------------------------------

Comment By: Kurt B. Kaiser (kbk)
Date: 2001-04-12 10:05

Message:
Logged In: YES 
user_id=149084

I have a thought on this, but it will be about 10 hours before I can submit it.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-04-11 13:20

Message:
Logged In: YES 
user_id=6380

Here's a proposed patch.

Can anyone think of a reason why this should not be checked
in as part of 2.1?

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-04-10 11:54

Message:
Logged In: YES 
user_id=6380

I wish I'd heard about this sooner.

It does seem a problem and it does make sense to use StringIO unless there's a lot of data.

But we can't fix this in time for 2.1...


----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2001-04-10 10:54

Message:
Logged In: YES 
user_id=11375

Unassigning so someone else can take a look at it.

----------------------------------------------------------------------

Comment By: Kurt B. Kaiser (kbk)
Date: 2001-02-18 23:32

Message:
In the particular HTML form referenced, it appears that a workaround might be to eliminate the enctype attribute in the <form> tag and take the application/x-www-form-urlencoded default, since no files are being uploaded. 
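The urlencoded default is parsed entirely in memory, with no temporary files at all; in modern Python the equivalent parsing looks like this (illustration with made-up field names):

```python
from urllib.parse import parse_qs

# An application/x-www-form-urlencoded body is one flat string;
# parsing it allocates only dictionaries and strings, never files.
body = "reviewer1=12+17+23&reviewer2=4+9"
fields = parse_qs(body)
print(fields)
```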

When make_file creates the temporary files they are immediately unlinked. There is probably a brief period before the unlink is finalized during which the ls process might see a file; that would account for the output of ls | wc. It appears that the current cgi.py implementation leaves all the (hundreds of) files open until the cgi process releases the FieldStorage object or exits. 
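That unlink-then-keep-using behaviour is standard POSIX semantics and can be demonstrated directly (Unix-only sketch, not from the thread):

```python
import os
import tempfile

# Open a temp file, then remove its directory entry immediately,
# much as cgi.py's make_file() does.
fd, path = tempfile.mkstemp()
f = os.fdopen(fd, "w+b")
os.unlink(path)

# The name is gone (so 'ls' sees nothing), yet the descriptor still
# works until the file object is closed.
name_visible = os.path.exists(path)
f.write(b"still writable")
f.seek(0)
data = f.read()
f.close()
print(name_visible, data)
```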

My first thought was, if the filename recovered from the header is None have make_file create a StringIO object instead of a temp file. That way a temp file is only created when a file is uploaded. This is not inconsistent with the cgi.py docs.

Unfortunately, RFC2388 4.4 states that a filename is not required to be sent, so it looks like your solution based on the size of the data received is the correct one. Below 1K you could copy the temp file contents to a StringIO and assign it to self.file, then explicitly close the temp file via its descriptor. 

If only I understood the module better :-( and had a way of tunnel testing it, I might have had the temerity to offer a patch. (I'm away for a couple of weeks starting tomorrow.)

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2001-02-18 14:08

Message:
Ah, I see; the traceback makes this much clearer.  When you're uploading a file, 
everything in the form is sent as a MIME document in the body; every field is 
accompanied by a boundary separator and a Content-Disposition header.
In multipart mode, cgi.py copies each field into a temporary file.  

The first idea I had was to only use tempfiles for the actual upload field; unfortunately, that doesn't help because the upload field isn't special, and cgi.py has no way to know which it is ahead of time.

Possible second approach: measure the size of the resulting file; if it's less than some threshold (1K? 10K?), read its contents into memory and close the tempfile.  This means only the largest fields will require that a file descriptor be kept open.  I'll explore this more after beta1.
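A rough sketch of that idea in modern terms (hypothetical helper name; BytesIO standing in for StringIO): start in memory and only pay for a file descriptor when the field exceeds the threshold. Python's later tempfile.SpooledTemporaryFile implements essentially this behaviour.

```python
import tempfile
from io import BytesIO

THRESHOLD = 1024  # the 1K cut-off floated above

def make_field_file(data, threshold=THRESHOLD):
    """Return a file-like object for a parsed field: an in-memory
    buffer for small fields, a real temporary file for large ones."""
    if len(data) <= threshold:
        return BytesIO(data)          # costs no file descriptor
    f = tempfile.TemporaryFile()      # only large fields keep an fd open
    f.write(data)
    f.seek(0)
    return f

small = make_field_file(b"short field value")
large = make_field_file(b"x" * 4096)
print(type(small).__name__, type(large).__name__)
```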




----------------------------------------------------------------------

Comment By: Richard van de Stadt (stadt)
Date: 2001-02-17 18:37

Message:
I do *not* mean file upload fields. I stumbled upon this with a webform that
contains 141 'simple' input fields like the form you can see here (which 'only'
contains 31 of those input fields):

http://www.cyberchair.org/cgi-cyb/genAssignPageReviewerPapers.py

(use chair/chair to login)

When the maximum number of file descriptors used per process was increased
to 160 (by the sysadmins), the problem did not occur anymore, and the webform
could be processed.

This was the error message I got: 

Traceback (most recent call last):
  File "/usr/local/etc/httpd/DocumentRoot/ICML2001/cgi-bin/submitAssignRP.py", line 144, in main
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 504, in __init__
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 593, in read_multi
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 506, in __init__
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 603, in read_single
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 623, in read_lines
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/cgi.py", line 713, in make_file
  File "/opt/python/2.0/sparc-sunos5.6/lib/python2.0/tempfile.py", line 144, in TemporaryFile
OSError: [Errno 24] Too many open files: '/home/yara/brodley/icml2001/tmp/@26048.61'

I understand why you assumed that it would concern *file* uploads, but this is
not the case. As I reported before, it turns out that for each 'simple' <input> field
a temporary file is used to transfer the contents to the script that uses the
cgi.FieldStorage() method, even if no files are being uploaded. The problem is not
that too many files are open at the same time (at most one is). It is the *number*
of files that causes the trouble. If the same temporary file were reused, this
problem would probably not have happened. 

My colleague Fred Gansevles wrote a possible solution, but mentioned that it
might need protection against a 'symlink attack' (whatever that may be). This
solution(?) concentrates on the open-file-descriptor problem; Fred suggests
that a redesign of FieldStorage() would probably be better.

import os, tempfile
AANTAL = 50

class TemporaryFile:
    """File-like object that keeps a descriptor open only for the
    duration of each read() or write() call."""
    def __init__(self):
        self.name = tempfile.mktemp("")
        open(self.name, 'w').close()    # create the (empty) file
        self.offset = 0

    def seek(self, offset):
        self.offset = offset

    def read(self):
        fd = open(self.name, 'rb')      # 'w+b' here would truncate the file
        fd.seek(self.offset)
        data = fd.read()
        self.offset = fd.tell()
        fd.close()
        return data

    def write(self, data):
        fd = open(self.name, 'r+b')     # update in place without truncating
        fd.seek(self.offset)
        fd.write(data)
        self.offset = fd.tell()
        fd.close()

    def __del__(self):
        os.unlink(self.name)

def add_fd(l, n):
    # Each entry keeps a real file descriptor open.
    for i in range(n):
        l.append(open('/dev/null'))

def add_tmp(l, n):
    # Each entry is a 'smart' temporary file; no descriptor stays open.
    for i in range(n):
        l.append(TemporaryFile())

def main ():
    import getopt, sys
    try:
        import resource
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except ImportError:
        soft, hard = 64, 1024
    opts, args = getopt.getopt(sys.argv[1:], 'n:t')
    aantal = AANTAL
    tmp = add_fd
    for o, a in opts:
        if o == '-n':
            aantal = int(a)
        elif o == '-t':
            tmp = add_tmp
    print "aantal te gebruiken fd's:", aantal   # Dutch; English: 'number of fds to be used'
    print 'tmp:', tmp.func_name
    tmp_files = []
    files=[]
    tmp(tmp_files, aantal)
    try:
        add_fd(files,hard)
    except IOError:
        pass
    print "aantal vrije gebruiken fd's:", len(files)  # Dutch; English: 'number of free fds'

main()

Running the above code:

    python ulimit.py [-n number] [-t]
    By default, number = 50 and 'real' file descriptors are used for the temporary files.
    With the '-t' flag, the 'smart' temporary files are used instead.

    Output:

        $ python ulimit.py
        aantal te gebruiken fd's: 50
        tmp: add_fd
        aantal vrije gebruiken fd's: 970

        $ python ulimit.py -t
        aantal te gebruiken fd's: 50
        tmp: add_tmp
        aantal vrije gebruiken fd's: 1020

        $ python ulimit.py -n 1000
        aantal te gebruiken fd's: 1000
        tmp: add_fd
        aantal vrije gebruiken fd's: 20

        $ python ulimit.py -n 1000 -t
        aantal te gebruiken fd's: 1000
        tmp: add_tmp
        aantal vrije gebruiken fd's: 1020

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2001-02-16 21:41

Message:
I assume you mean 64 file upload fields, right?  
Can you provide a small test program that triggers the problem?

----------------------------------------------------------------------
