importing large (1GB) mbox file, hitting a wall here..

First post. Thanks very much for your archives; I've been getting myself up to speed today, and I hope to be a member of the community who gives back as well.
I took over a large mailing list with a 12-year archive, which is about a 1 GB .mbox file (about 35,000 messages). I need to upload it to a new Mailman install on a new server. I have a background in Visual Basic and command-line SPSS, and have managed Mailman lists before, but I'm a little new to this part. Here's where I'm at; let me know where I'm off.
For reference, the name of the list on the machine is dbt-l_pdbti.org.
Per the FAQ (https://wiki.list.org/DOC/How%20do%20I%20import%20an%20archive%20into%20a%20...), I uploaded the old .mbox into the correct folder (in this case archives/private/dbt-l_pdbti.org.mbox/). This is a brand new list install, with no posts. I then ran bin/arch --wipe dbt-l_pdbti.org. When I checked the archives, only about 11,000 messages were imported. I saw in the arch help text that there can be memory issues, and that it suggests running things in chunks. So, I did this:

bin/arch --wipe -q -s 0 -e 10000 dbt-l_pdbti.org
bin/arch -q -s 10001 -e 20000 dbt-l_pdbti.org
bin/arch -q -s 20001 -e 30000 dbt-l_pdbti.org
bin/arch -q -s 30001 -e 40000 dbt-l_pdbti.org

So when I do this, each piece works, but each piece overwrites the previous- in other words, rather than each chunk adding into the archives, only the most recent command seems to affect the archives. At the end of these commands, only messages 30,000 to 35,000 are showing up in the archives.
I'm sure there is something I'm doing wrong here, but I'm feeling pretty stuck- is there something I'm leaving out?
Appreciate the help-
........................................................................ Andrew White, PhD Associate Director DBT-Linehan Board of Certification, Certified DBT Clinician* Licensed Clinical Psychologist Portland DBT Institute (503) 290.3281 (phone) (503) 231.8153 (fax)
Please be aware that e-mail communication can be intercepted in transmission or misdirected. This e-mail message and any documents attached to it are confidential and may contain information that is protected from disclosure by various federal and state laws, including the HIPAA privacy rule (45 C.F.R., Part 164). This information is intended to be used solely by the entity or individual to whom this message is addressed. If you are not the intended recipient, be advised that any use, dissemination, forwarding, printing, or copying of this message without the sender's written permission is strictly prohibited and may be unlawful. Accordingly, if you have received this message in error, please notify the sender immediately with a copy to hipaa(at)pdbti.org and destroy this message. Please do not include personal identifying information such as your birth date, or personal medical information in any emails you send to us. No one can diagnose your condition from email or other written communications and is not a reliable mechanism for emergency communication.

On 12/27/2017 08:08 PM, Andrew White, PhD wrote:
> I then ran bin/arch --wipe dbt-l_pdbti.org. When I checked the archives, only about 11,000 messages were imported. I saw in the arch help file there can be memory issues, and so to run things in chunks. So, I did this:
>
> bin/arch --wipe -q -s 0 -e 10000 dbt-l_pdbti.org
> bin/arch -q -s 10001 -e 20000 dbt-l_pdbti.org
> bin/arch -q -s 20001 -e 30000 dbt-l_pdbti.org
> bin/arch -q -s 30001 -e 40000 dbt-l_pdbti.org
>
> So when I do this, each piece works, but each piece overwrites the previous- in other words, rather than each chunk adding into the archives, only the most recent command seems to affect the archives. At the end of these commands, only messages 30,000 to 35,000 are showing up in the archives.

Are you sure you are not including the --wipe option on the subsequent commands? The behavior you describe should not occur unless --wipe is specified on the subsequent commands.
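
To make the point about --wipe concrete, here is a minimal sketch (Python 3, not part of Mailman) that only builds the chunked bin/arch command lines, with --wipe on the first chunk and nowhere else. The helper name build_arch_commands, the chunk size, and the 35,000 message count are illustrative assumptions. The chunk boundaries simply mirror the ranges in the original post, so check bin/arch --help for how -s and -e treat their endpoints before relying on them.

```python
def build_arch_commands(list_name, total_messages, chunk_size=10000, arch="bin/arch"):
    """Return one argv list per chunk; only the first chunk gets --wipe."""
    commands = []
    start, end = 0, chunk_size
    while start < total_messages:
        cmd = [arch]
        if start == 0:
            # Wipe the existing archive once, before the first chunk only.
            cmd.append("--wipe")
        cmd += ["-q", "-s", str(start), "-e", str(end), list_name]
        commands.append(cmd)
        # Ranges follow the original post: 0-10000, 10001-20000, ...
        start, end = end + 1, end + chunk_size
    return commands


if __name__ == "__main__":
    for cmd in build_arch_commands("dbt-l_pdbti.org", 35000):
        print(" ".join(cmd))
```

Each returned list could be handed to subprocess.run() from the Mailman install directory; printing them first makes it easy to confirm that --wipe appears exactly once.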

I checked for that, and it looks like my problem solving was incomplete. I found an error message when running arch where it was sticking on a bad record. I kept getting "got an unexpected keyword argument 'flags'" (even after using cleanarch on the mbox file), and I think that was the actual problem, not running out of memory. I ran it last night with that record removed, and it worked without batching as long as I left out that batch of records, which is only about 0.3% of the file.
------------------------------------------------------
Mailman-Users mailing list Mailman-Users@python.org
https://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: https://mail.python.org/mailman/options/mailman-users/awhite%40pdbti.org

On 12/28/2017 11:14 AM, Andrew White, PhD wrote:
> I checked for that- it looks like my problem solving was incomplete. I found an error message when running arch where it was sticking on a bad record - I kept getting "got an unexpected keyword argument 'flags'" (even after using cleanarch on the mbox file), and I think that was the actual problem, not running out of memory.

It looks like we have a bug somewhere. There may be a defective message in the .mbox, but even so, it should result in a more graceful error report.
Did you get a traceback with the "unexpected keyword argument" exception? I would like to see a traceback and, if possible, the offending message.
Note that there is a place in the _set_date() function in Mailman/Archiver/pipermail.py where we are trying to determine the message's date and if there is no Date: header with a valid date and no X-List-Received-Date: header with a valid date, we look at a Received: header and try to extract a date with
    date = floatdate(re.sub(r'^.*;\s*', '',
                            message.get('received', ''), flags=re.S))
but flags=re.S is a valid argument to re.sub. However, you might look in your mbox for a message without a Date: header.
Also note that cleanarch won't do anything about defective messages. All it does is look for lines that begin with 'From ' that don't appear to be mbox message separator 'From ' lines.
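
As a rough way to look for messages without a usable Date: header, the following sketch scans an mbox with the standard library. It is a standalone Python 3 script, not Mailman code, and its validity check (email.utils.parsedate_tz) only approximates whatever pipermail itself accepts; has_valid_date and find_undated are hypothetical helper names.

```python
import mailbox
from email.utils import parsedate_tz


def has_valid_date(value):
    """True if a header value parses as an RFC 2822 style date."""
    if value is None:
        return False
    try:
        return parsedate_tz(str(value)) is not None
    except (TypeError, ValueError, IndexError):
        return False


def find_undated(mbox_path):
    """Return (index, subject) for messages with no parseable Date: or
    X-List-Received-Date: header; index is the zero-based position in the file."""
    undated = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, msg in enumerate(box):
            if has_valid_date(msg.get("Date")):
                continue
            if has_valid_date(msg.get("X-List-Received-Date")):
                continue
            undated.append((index, str(msg.get("Subject", ""))))
    finally:
        box.close()
    return undated
```

Calling find_undated() on a copy of the list's .mbox file gives a short list of candidates to inspect by hand before running bin/arch again.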

On 12/29/2017 09:51 AM, Mark Sapiro wrote:
> It looks like we have a bug somewhere. There may be a defective message in the .mbox, but even so, it should result in a more graceful error report.
>
> Did you get a traceback with the "unexpected keyword argument" exception? I would like to see a traceback and if possible, the offending message.

Never mind. The use of the flags= argument in re.sub was introduced in Mailman 2.1.22 and requires Python 2.7.
The exception will occur with Python older than 2.7.x when attempting to archive a message with no valid Date: header and no valid X-List-Received-Date: header.
I have updated the FAQs at https://wiki.list.org/x/4030629 and https://wiki.list.org/x/876 to note this requirement.
I have also attached a patch to Mailman/Archiver/pipermail.py that will allow it to work with older Python.
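
For anyone stuck on an older Python, one way to get the same substitution without the flags= keyword is to compile the pattern with re.S first and call .sub() on the compiled object. This is only a sketch of the general technique, not necessarily what the attached patch does; strip_received_prefix is a hypothetical helper name.

```python
import re

# Compiling with re.S lets "." match newlines, so the pattern consumes
# everything up to the last ";" of a folded Received: header.
_RECEIVED_PREFIX = re.compile(r'^.*;\s*', re.S)


def strip_received_prefix(received):
    """Return the text after the last ';' in a Received: header value."""
    return _RECEIVED_PREFIX.sub('', received)
```

The result is the date portion of the header, which is what gets passed on for date parsing in the snippet quoted earlier in the thread.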

I think that's totally the issue. I can see some really old messages (1996) with very malformed date fields. I am checking my software versions now. I'm a little limited on this end since I'm using a virtual box on bluehost.com, and while I have root access, there are some aspects I can't upgrade or change. Am looking now.
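
A quick way to check the Python side is a small script like the one below. It is a sketch under the assumption that it is run with the same interpreter Mailman uses; re_sub_accepts_flags is a hypothetical helper name. It reports the interpreter version and whether re.sub() accepts the flags= keyword that triggers the error above.

```python
import re
import sys


def re_sub_accepts_flags():
    """Return True if re.sub() accepts the flags= keyword on this interpreter."""
    try:
        re.sub(r"a", "b", "a", flags=re.S)
    except TypeError:
        # Older interpreters reject the keyword with a TypeError.
        return False
    return True


if __name__ == "__main__":
    print("Python %d.%d.%d" % sys.version_info[:3])
    print("re.sub accepts flags=: %s" % re_sub_accepts_flags())
```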
participants (2)
- Andrew White, PhD
- Mark Sapiro