Mailman 3 Migrating from YahooGroups to Mailman - Mailman-Users

Migrating from YahooGroups to Mailman


Sarah K. Miller

July 31, 2001

8:08 p.m.

We're migrating some lists from YahooGroups to Mailman. Does anyone know of a way to automatically "slurp" all the messages off Yahoo and plop them into Mailman? The only way I've found to do it and retain the original information is to cut and paste each one individually. That's a little overwhelming when you're looking at 1500+ messages. Yahoo was no help at all. If anybody here knows of a utility of some sort that would do it, please share!

-- Sarah Plus ça change, plus c'est la même chose


Bradford Shaw

July 2001

8:18 p.m.

Hello Sarah,

I would also be interested in this. We just moved a large list off of yahoogroups to its own server and Yahoo refuses to cooperate in letting us have the archived messages (over 2 years' worth at approx 1500 per month). We are at the point of just dumping the group but we'll hold off for a couple of days now to see if there is a way to do this. Please include me in any solutions.

Bradford

At 01:08 PM 7/31/2001 -0700, you wrote:

...

Greg Ward

8:46 p.m.

On 31 July 2001, Bradford Shaw said:

...

I've just poked around groups.yahoo.com a bit, and it looks like this is doable (but painful). From any message, you can hit the "View source" link. This takes you to a page like

http://groups.yahoo.com/group/NucNews/message/3859?source=1

(message 3859 of the "NucNews" list).

The good news:

the URL is dead easy to generate, eg. here's Python code to suck the entire archives for a list, one HTML file per message:

import urllib

url_template = "http://groups.yahoo.com/group/%s/message/%d?source=1"

group_name = "NucNews"
num_messages = 3887         # every message page has this
# message numbers start at 1, so count from 1 through num_messages
for msg_num in xrange(1, num_messages + 1):
    url = url_template % (group_name, msg_num)
    msg_filename = "msg-%04d.html" % msg_num
    urllib.urlretrieve(url, msg_filename)

(UNTESTED -- YMMV)

The bad news:

the HTML you download will need serious massaging before it's genuinely plain text (ie. valid RFC 822 messages). Eg. here's a sample from message 1 of NucNews:

"""
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<pre>From <a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1@xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999
X-Digest-Num: 0
Message-ID: <<a href="/group/NucNews/post?protectID=204183107153178091074082017036098126254083020093090065230045073141210143030150043098201196026">62814.0.1.959296473@e...</a>>
Date: Tue, 12 Jan 1999 17:08:50 -0500
[...]
"""

Note how anything that looks like an email address (including the return-path and message-id headers) is turned into a hyperlink. You'll need to strip out these <a href> tags (see Python's htmllib, although you can probably kludge it with a regex) and just preserve the content of the tag, eg. "prop1@xxxxx.xxxx".
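A minimal sketch of the regex kludge, written in present-day Python 3 rather than the Python of the original post. The name strip_anchors is hypothetical, and the pattern assumes every anchor looks like the ones in the sample above (a single href attribute, no nested tags):

```python
import re

# Matches <a href="...">text</a> and captures only the link text.
ANCHOR_RE = re.compile(r'<a\s+href="[^"]*">(.*?)</a>', re.IGNORECASE | re.DOTALL)

def strip_anchors(text):
    """Replace each <a href=...>...</a> element with just its content."""
    return ANCHOR_RE.sub(r"\1", text)

sample = ('From <a href="/group/NucNews/post?protectID='
          '197212253115067135062046036199121208067038025008">'
          'prop1@xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999')
print(strip_anchors(sample))
```

Because the pattern only starts matching at "<a", a literal angle bracket in front of the anchor (as in the Message-ID header) is left in place.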

Even after doing this, you still don't have a valid RFC 822 message, eg. here's another excerpt from msg 1 of NucNews:

From: Peace Through Reason <<a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1@xxxxx.xxxx</a>

De-HTML-ified, that becomes:

From: Peace Through Reason <prop1@xxxxx.xxxx

which, even if you ignore the mangled email address, is still missing a trailing angle-bracket. Someone goofed in converting this message to HTML, and now you lose! You'll probably have to add some "Fix Yahoo! bogosity" heuristics to your script. Bummer.
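One possible "Fix Yahoo! bogosity" heuristic for the missing angle bracket, as a Python 3 sketch. The function name and the list of headers are assumptions, and it should only be applied to header lines, not to the message body:

```python
# Headers that normally carry addresses; an assumed list, extend as needed.
ADDRESS_HEADERS = ("From", "To", "Cc", "Reply-To", "Sender")

def fix_unclosed_angle(line):
    """Append '>' to an address header whose last '<' is never closed."""
    name, sep, value = line.partition(":")
    if not sep or name not in ADDRESS_HEADERS:
        return line
    stripped = value.rstrip()
    # rfind returns -1 when the character is absent, so a header with
    # no brackets at all is left untouched.
    if stripped.rfind("<") > stripped.rfind(">"):
        return name + sep + stripped + ">"
    return line

print(fix_unclosed_angle("From: Peace Through Reason <prop1@xxxxx.xxxx"))
```

This repairs only the one defect shown in the excerpt; other kinds of damage would need their own checks.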

Executive summary: your situation isn't completely hopeless, but it sure does suck. Bummer.

    Greg

-- Greg Ward - software developer gward@mems-exchange.org MEMS Exchange http://www.mems-exchange.org

Greg Ward

8:27 p.m.

On 31 July 2001, Sarah K. Miller said:

...

I had a similar problem getting a list off of ListBot recently. It only had 19 messages in the archive (I was just saving them for posterity -- the list wasn't exactly a big hit), so it wasn't too bad. Out of principle, though, I automated the procedure a little bit. The only reason it was possible is that ListBot had a link to get the full headers as plain text (wrapped in <PRE> in the web page, of course) for each message. After I figured out the pattern for that URL, I did something like this:

for i in 1 2 3 ... 19 ; do    # yes, you have to type them all out
    GET http://www.listbot.com/(hairy url with $i in it somewhere) > msg$i.txt
    fix_msg msg$i.txt
done

GET is the alias for lwp-request installed by lwp (libwww-perl). It just does an HTTP request from the shell. Handy, but it would be easy to whip up something similar in Python (which I've been meaning to do for a while now...).

fix_msg was a little Perl script I wrote to undo HTML encodings in the not-quite-plain-text file downloaded from listbot.com. I don't seem to have it anymore; it went something like this:

#!/usr/bin/perl -p

s/&amp;/&/g;
s/&quot;/"/g;
# ...etc...

Again, you could do this pretty easily in Python, but why bother? Perl is perfect for this sort of hackery. It all depends on the text Yahoo presents you with; mine was similarly dependent on ListBot.
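For comparison, a Python 3 sketch of the same idea. It reuses the name fix_msg from above but is not the lost Perl script; it leans on the standard library's html.unescape, which decodes all named and numeric character references in one pass:

```python
import html

def fix_msg(text):
    """Decode HTML character references such as &amp;, &quot; and &lt;."""
    return html.unescape(text)

print(fix_msg('Subject: Q&amp;A &quot;live&quot; &lt;now&gt;'))
```

Since the decoding happens in a single pass, text that was escaped twice (for example "&amp;quot;") comes out as "&quot;" rather than being decoded a second time.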

To glue the messages together into a legitimate mbox (which you can build into an archive with some Mailman tool... ummm... bin/arch maybe?), you need to make sure each msg*.txt file starts with a "From " line and ends with a blank line. formail, a tool supplied with procmail, will take care of the former. Or you could DIY in that mythical fix_msg script (which would also be a good place to ensure a trailing blank line, although I think you don't want one on the last message...)
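A DIY version of that step might look like the Python 3 sketch below. The function name and the placeholder envelope line are assumptions; formail would write a proper "From " line, whereas this one uses a fixed dummy sender and date. For simplicity it puts a blank line after every message, including the last:

```python
def as_mbox_entry(message,
                  envelope="From MAILER-DAEMON Thu Jan  1 00:00:00 1970"):
    """Return the message with a leading "From " line and one trailing blank line."""
    if not message.startswith("From "):
        # No envelope line yet: prepend the placeholder one.
        message = envelope + "\n" + message
    # Normalize the ending to exactly one blank line.
    return message.rstrip("\n") + "\n\n"

print(as_mbox_entry("Subject: hello\n\nbody text"), end="")
```

Body lines that themselves begin with "From " are not quoted here, so a message containing such a line would still be split in two by an mbox reader.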

You should also systematically rename the files from eg. msg1.txt to msg0001.txt so the next command works. Again, child's play if you know your way around Unix and Perl...
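The renaming matters because the shell expands msg*.txt in lexical order, which puts msg10.txt ahead of msg2.txt. A small Python 3 sketch, assuming the msgN.txt naming used above (both function names are hypothetical):

```python
import os
import re

def padded_name(filename, width=4):
    """Map e.g. 'msg1.txt' to 'msg0001.txt'; leave other names unchanged."""
    match = re.fullmatch(r"msg(\d+)\.txt", filename)
    if match is None:
        return filename
    return "msg" + match.group(1).zfill(width) + ".txt"

def rename_all(directory="."):
    """Rename every msgN.txt file in the directory to its zero-padded form."""
    for filename in os.listdir(directory):
        new_name = padded_name(filename)
        if new_name != filename:
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, new_name))

print(sorted(padded_name(n) for n in ["msg10.txt", "msg2.txt", "msg1.txt"]))
```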

Finally, "cat msg*.txt > mylist.mbox" (or whatever) and run Mailman's archive tool on it.

Make sure you've created a workable mbox file by running your favourite mail client on it, eg. "mutt -f mylist.mbox".
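If you would rather check the file from a script, Python 3's standard mailbox module can parse it too. A rough sketch (summarize_mbox is a made-up name) that reports how many messages were found and their subjects:

```python
import mailbox

def summarize_mbox(path):
    """Return (message count, list of Subject headers) for an mbox file."""
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        subjects = [msg["Subject"] for msg in box]
        return len(subjects), subjects
    finally:
        box.close()
```

Calling summarize_mbox("mylist.mbox") and comparing the count with the number of msg*.txt files is a quick way to spot messages that got glued together or split apart.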

Anyways, if you're familiar with Unix and the tools available to you, this should be doable... as long as Yahoo makes the full original text of the messages available to you! If not, all is lost, give up, doom, failure, etc. If I was in your shoes (1500 messages to process), I'd probably take a few hours and write a nice Python script to do it right. For the 19 messages I had to do, crude shell-and-Perl hackery was just fine.

    Greg

-- Greg Ward - software developer gward@mems-exchange.org MEMS Exchange http://www.mems-exchange.org

alex wetmore

10:18 p.m.

On Tue, 31 Jul 2001, Greg Ward wrote:

...

It only took a few minutes browsing groups.yahoo.com to figure out their archives.

Each message (starting with 1) up to the number of messages in the group is available with this url: http://groups.yahoo.com/group/<group>/message/<msgnum>?source=1

You just need a simple script which:

collects the messages
strips everything that isn't in <pre> </pre>
converts the HTML back to plain text
adds it to an mbox

This should be fairly simple for anyone with moderate perl or python knowledge to write. It will take a little while to download all of the archives, but I don't think that is a big issue.
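The last three steps could be sketched roughly as follows in Python 3, leaving the download aside. The function names are hypothetical. It assumes each page holds exactly one <pre>...</pre> block, that the only markup inside it is <a> tags, and that the block already begins with a "From " line, as in the NucNews sample earlier in the thread:

```python
import html
import re

PRE_RE = re.compile(r"<pre>(.*?)</pre>", re.IGNORECASE | re.DOTALL)
# Only <a ...> and </a> tags are removed, so literal "<" and ">" around
# addresses and message-ids are left alone.
ANCHOR_TAG_RE = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def page_to_message(page):
    """Extract one message from a "?source=1" page, or None if no <pre> is found."""
    match = PRE_RE.search(page)
    if match is None:
        return None
    text = ANCHOR_TAG_RE.sub("", match.group(1))   # drop the hyperlink markup
    text = html.unescape(text)                     # &amp; -> &, &quot; -> ", ...
    return text.strip("\n") + "\n\n"               # end with one blank line

def pages_to_mbox(pages):
    """Concatenate the messages from an iterable of page strings into mbox text."""
    messages = (page_to_message(page) for page in pages)
    return "".join(m for m in messages if m is not None)
```

Writing the result of pages_to_mbox to a file gives something an mbox reader can be pointed at. Pages without a <pre> block are skipped silently, so it is worth comparing the message count against the number of pages downloaded.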

I've done similar things with mining list archives off of other mailing list hosts, but I haven't had to do this with yahoogroups yet. My scripts (or whatever is left of them, I just have one that I hack as necessary) aren't going to be useful to the mailman crowd because I don't use the mbox format for my archives or pipermail.

alex

