Creating huge data in very little time.

Irmen de Jong irmen.NOSPAM at xs4all.nl
Tue Mar 31 14:05:55 EDT 2009


venutaurus539 at gmail.com wrote:
> On Mar 31, 1:15 pm, Steven D'Aprano
> <ste... at REMOVE.THIS.cybersource.com.au> wrote:
>> On Mon, 30 Mar 2009 22:44:41 -0700, venutaurus... at gmail.com wrote:
>>> Hello all,
>>>             I've a requirement where I need to create around 1000
>>> files under a given folder with each file size of around 1GB. The
>>> constraints here are each file should have random data and no two files
>>> should be unique even if I run the same script multiple times.
>> I don't understand what you mean. "No two files should be unique" means
>> literally that only *one* file is unique, the others are copies of each
>> other.
>>
>> Do you mean that no two files should be the same?
>>
>>> Moreover
>>> the filenames should also be unique every time I run the script. One
>>> possibility is that we can use Unix time format for the file names
>>> with some extensions.
>> That's easy. Start a counter at 0, and every time you create a new file,
>> name the file by that counter, then increase the counter by one.
>>
>>> Can this be done within a few minutes? Is it
>>> possible only using threads, or can it be done in any other way? This has to
>>> be done on Windows.
>> Is it possible? Sure. In a couple of minutes? I doubt it. 1000 files of
>> 1GB each means you are writing 1TB of data to a HDD. The fastest HDDs can
>> reach about 125 MB per second under ideal circumstances, so that will
>> take at least 8 seconds per 1GB file or 8000 seconds in total. If you try
>> to write them all in parallel, you'll probably just make the HDD waste
>> time seeking backwards and forwards from one place to another.
>>
>> --
>> Steven
> 
> That time is reasonable. The randomness should be such that no two
> files have the same MD5 checksum. The main reason for having such a
> huge data set is stress testing of our product.


Does it really need to be *files* on the *hard disk*?

What nobody has suggested yet is that you can *simulate* the files by making a large set
of custom file-like objects and feeding those to your application. (If possible!)
Each object could return a 1 GB byte stream consisting of a GUID followed by random bytes
(or just millions of A's, since you write that the only requirement is that each file
has a different MD5 checksum).
That way you have no need for a 1 terabyte hard drive, nor the long wait to create
the actual files...
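
A rough, untested sketch of such a file-like object (the 1 GB size and the
uuid4 header are just my own illustrative choices, not anything required):

    import uuid

    class FakeBigFile:
        """Read-only file-like object: a unique GUID header followed by
        filler bytes, so every instance has a different MD5 checksum."""

        def __init__(self, size=1024 ** 3):   # assume 1 GB per "file"
            self.size = size
            self.pos = 0
            # 32 hex chars from uuid4 make the stream unique per instance
            self.header = uuid.uuid4().hex.encode("ascii")

        def read(self, n=-1):
            # clamp the request to what is left of the simulated file
            if n < 0 or n > self.size - self.pos:
                n = self.size - self.pos
            start, end = self.pos, self.pos + n
            # bytes that fall inside the header come from it, the rest are A's
            chunk = self.header[start:end]
            chunk += b"A" * (n - len(chunk))
            self.pos = end
            return chunk

Your application would then call read() on a thousand of these instances as if
they were real 1 GB files, and nothing ever touches the disk.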

--irmen


