internal mail archiver/htdig integration
In commissioning Mailman I needed to provide a per-list search facility that fully honored private list access constraints (i.e. if the user cannot read a message, then search results must not reveal that it even exists). In implementing this I have made some changes to the mailman-2.0beta5 source. What I have done is hardly rocket science. That said, it would be good for me if the work were integrated with the standard releases of Mailman, so that I do not have to keep applying my own patches to each subsequent release. I don't know the protocol for doing this, nor whether other developers think my work is worth incorporating into the mainline of Mailman's code. I've outlined the changes below and would appreciate some advice. BTW: if there's a better way to achieve the same objectives, then I'm more than happy to throw my changes away.
I've integrated Mailman's internal archiver with htdig. Use of htdig is conditional on three things: a new Mailman configuration variable being set, the internal archiver being in use, and the list being subject to archiving. If these conditions are met then the relevant list-specific htdig conf files and such are created and maintained automatically. If you don't want the integration, turning off a single Mailman config variable leaves Mailman behaving as if the integration were not there.
For qualifying lists the htsearch form is embedded in the list's archive index page when that page is generated by HyperArch.py, and hence is downstream of user authentication for accessing a private archive. Access to the URLs returned by htsearch is mediated by an additional wrapped cgi script called htdig.py, which is similar to private.py in the sense that, for private lists, it requires the presence of a list authentication cookie or it rejects the access. For public lists it just responds with the html page requested. In the normal course of events the user's browser acquires the authentication cookie in the normal fashion while reaching the archive index page to do a search. This approach means that changing a list from public to private or vice versa does not invalidate the htdig search files, as access to archive html files in either type of list uses a common root in their URLs.
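The access rule described above can be sketched as follows. This is only an illustration, not the code in htdig.py: the `MailingList` class, the `authorize_search_hit` function, and the `cookie_lists` set (standing in for whatever check Mailman performs on the list authentication cookie) are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MailingList:
    """Hypothetical stand-in for the list attributes the check needs."""
    name: str
    private: bool


def authorize_search_hit(mlist, cookie_lists):
    """Return True if a page reached through a search result may be served.

    cookie_lists is the set of list names for which the browser presented
    a valid list authentication cookie (an assumption made for this sketch).
    """
    if not mlist.private:
        # Public lists: just respond with the requested page.
        return True
    # Private lists: require the authentication cookie or reject the access.
    return mlist.name in cookie_lists
```

Because the decision depends only on the list's current privacy setting and not on the URL, the same URLs stay valid when a list is switched between public and private.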
An additional directory called htdig is created under each .../mailman/archives/private/<listname>, in which the htdig search files and configuration file for that list are stored. An additional directory called .../mailman/archives/htdig holds symlinks to the list-specific htdig configuration files. .../mailman/archives/htdig is itself the target of a symlink in the directory in which htdig is configured to find its configuration files. [These symlinks allow htsearch to locate the list-specific htdig conf files.] This directory arrangement means that deleting a list in the normal fashion also disposes of the related search files without any further effort or thought.
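A minimal sketch of that directory arrangement, under stated assumptions: the function name `create_htdig_layout`, the `<listname>.conf` file name, and the placeholder file contents are hypothetical, and the final symlink from htdig's own configuration directory is left out.

```python
import os


def create_htdig_layout(archives_root, listname):
    """Create the per-list htdig directory and link its conf file centrally.

    Returns (conf_path, link_path). The conf file name is an assumption.
    """
    list_htdig_dir = os.path.join(archives_root, "private", listname, "htdig")
    links_dir = os.path.join(archives_root, "htdig")
    os.makedirs(list_htdig_dir, exist_ok=True)
    os.makedirs(links_dir, exist_ok=True)

    # Per-list configuration file lives beside the list's search files.
    conf_path = os.path.join(list_htdig_dir, listname + ".conf")
    if not os.path.exists(conf_path):
        with open(conf_path, "w") as fp:
            fp.write("# placeholder for the list-specific htdig configuration\n")

    # The shared directory only holds symlinks to the per-list conf files.
    link_path = os.path.join(links_dir, listname + ".conf")
    if not os.path.islink(link_path):
        os.symlink(conf_path, link_path)
    return conf_path, link_path
```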
The whole is rounded off with a cron-initiated script which uses rundig to update the list-specific search files on a regular basis, but only for lists whose archive has had data added since the script last ran.
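The "only if the archive changed" test could look something like the sketch below. It is not the nightly_htdig script: the timestamp-file approach and the names `archive_changed_since` and `rundig_command` are assumptions, and the command is only built, not executed. The `-c` option is assumed to be how rundig is pointed at a specific configuration file.

```python
import os


def archive_changed_since(archive_dir, stamp_file):
    """Return True if any file under archive_dir is newer than stamp_file.

    A missing stamp file means the script has never run, so reindex.
    """
    if not os.path.exists(stamp_file):
        return True
    last_run = os.path.getmtime(stamp_file)
    for dirpath, _dirnames, filenames in os.walk(archive_dir):
        for name in filenames:
            if os.path.getmtime(os.path.join(dirpath, name)) > last_run:
                return True
    return False


def rundig_command(conf_path):
    """Build (but do not run) the rundig invocation for one list."""
    return ["rundig", "-c", conf_path]
```

A cron job would call `archive_changed_since()` per list, run the command when it returns True, and then touch the stamp file.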
Changes have been made to the following mailman-2.0beta5 source files:
HyperArch.py
- extra function to set up list-specific htdig: it creates the list's htdig directory and generates the list's htdig conf file in that directory. Logic added so that this function is called when archiving for the list is commenced.
- added meta tags and <!--htdig_noindex> tags in the html templates to improve the quality of search results and the efficiency of htdig'ing
- added htsearch form html template and logic to selectively include it when generating list index pages
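As an illustration of the noindex tagging mentioned in the list above, the sketch below wraps a fragment of page furniture so that an indexer skips it. The marker strings are assumed to match htdig's default noindex start and end markers, and `wrap_noindex` is a hypothetical helper, not a function from the patch.

```python
# Assumed to be htdig's default markers for excluding a region from indexing.
NOINDEX_START = "<!--htdig_noindex-->"
NOINDEX_END = "<!--/htdig_noindex-->"


def wrap_noindex(html_fragment):
    """Surround navigation or other boilerplate HTML with noindex markers."""
    return NOINDEX_START + html_fragment + NOINDEX_END


# Example: keep the per-page navigation links out of the search index.
nav = '<a href="index.html">Back to index</a>'
print(wrap_noindex(nav))
```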
src/Makefile
Added htdig to the list of cgi scripts so that a wrapper is generated for the new htdig.py
Defaults.py
Added directives to provide control of htdig integration. Includes USE_HTDIG which enables/disables all the other changes if you want to use/not use htdig in this way.
New files are:
cgi script htdig.py mediates access to all htsearch results
cron activated script nightly_htdig to selectively regenerate htdig search file
Some tidy up of installer may yet be needed to finalise this work plus some minor installation documentation changes.
Richard Barrett, PostPoint 27, e-mail:r.barrett@ftel.co.uk Fujitsu Telecommunications Europe Ltd, tel: (44) 121 717 6337 Solihull Parkway, Birmingham Business Park, B37 7YU, England "Democracy is two wolves and a lamb voting on what to have for lunch. Liberty is a well armed lamb contesting the vote." Benjamin Franklin, 1759
"RB" == Richard Barrett <R.Barrett@ftel.co.uk> writes:
RB> In commissioning Mailman I needed to provide a per mail list
RB> search facility which fully honored private list access
RB> constraints (i.e. if the user cannot read it then do not tell
RB> him it even exists in search results). In implementing this I
RB> have made some changes to the mailman-2.0beta5 source. What I
RB> have done is hardly rocket-science. That said it would be
RB> good for me if the work was integrated with the standard
RB> releases of Mailman so I do not have to keep applying my own
RB> unique patches to any subsequent releases. I don't know the
RB> protocol for doing this nor whether other developers think my
RB> work is worth incorporating into the mainline of Mailman's
RB> code. I've outlined the changes below and would appreciate
RB> some advice. btw: If there's a better way to achieve the same
RB> objectives then I'm more than happy to throw my changes away.
Richard, on the face of it, your changes look like a reasonable approach. Note that some of the existing archiver stuff was updated for 2.0b6 to add support for non-ASCII character sets.
Normal procedure for any patch of this type is to upload it to the SourceForge patch manager for the Mailman project. That's the only way to guarantee your patches won't get buried in my inbox. ;}
Please try to port your changes to 2.0b6 first. I doubt any of the Pipermail stuff will change between now and 2.0 final.
Thanks, -Barry
bwarsaw@beopen.com said:
Richard, on the face of it, your changes look like a reasonable approach. Note that some of the existing archiver stuff was updated for 2.0b6 to add support for non-ASCII character sets.
I've had some discussion with Richard on these - they have some shared coverage with some pipermail patches I put in and which were included in 2.0beta6.
As Richard is less comfortable with some of the issues around building and submitting patches I'll try and "sponsor" these and take them through to the Sourceforge patch manager - I also suggested he split these into indexer neutral and htdig specific patches so I'll try to do that too.
Nigel.
-- [ - Opinions expressed are personal and may not be shared by VData - ] [ Nigel Metheringham Nigel.Metheringham@VData.co.uk ] [ Phone: +44 1423 850000 Fax +44 1423 858866 ]
"NM" == Nigel Metheringham <Nigel.Metheringham@vdata.co.uk> writes:
NM> As Richard is less comfortable with some of the issues around
NM> building and submitting patches I'll try and "sponsor" these
NM> and take them through to the Sourceforge patch manager - I
NM> also suggested he split these into indexer neutral and htdig
NM> specific patches so I'll try to do that too.
I'm actually really glad to see this kind of sponsorship activity, Nigel. Such mentoring by experienced programmers and open sourcerers is a fantastic way to teach the next generation how to contribute, and hopefully ensures that this great open source tradition is carried on.
"RB" == Richard Barrett <R.Barrett@ftel.co.uk> writes:
RB> Following your input, I've posted two patches covering the
RB> subject material:
RB> 1. Enhanced tagging of the archive html, with tags defineable
RB> from mm_cfg to make the change potentially usable with
RB> different search engine indexers:
RB> 2. Integration of htdig with Mailman to give per list search
RB> etc etc:
And we see the fruits already! Great job Richard, thanks.
-Barry
bwarsaw@beopen.com said:
I'm actually really glad to see this kind of sponsorship activity
Not much credit is needed... Richard uploaded the whole lot himself :-)
I'll be applying this set to my configuration in the near future and see how it runs there.
Nigel.
-- [ - Opinions expressed are personal and may not be shared by VData - ] [ Nigel Metheringham Nigel.Metheringham@VData.co.uk ] [ Phone: +44 1423 850000 Fax +44 1423 858866 ]
ack! it looks like my server is broken!
Things are dying attempting to write the mbox files in the archiver:
Sep 26 10:35:03 2000 (32215) Archive file access failure: /home/mailman/archives/private/sharks.mbox/sharks.mbox [Errno 75] Value too large for defined data type
Sep 26 10:35:03 2000 (32215) Delivery exception: [Errno 75] Value too large for defined data type
Sep 26 10:35:03 2000 (32215) Traceback (innermost last):
  File "/home/mailman/Mailman/Handlers/HandlerAPI.py", line 82, in do_pipeline
    func(mlist, msg, msgdata)
  File "/home/mailman/Mailman/Handlers/ToArchive.py", line 47, in process
    mlist.ArchiveMail(msg, msgdata)
  File "/home/mailman/Mailman/Archiver/Archiver.py", line 189, in ArchiveMail
    self.__archive_to_mbox(msg)
  File "/home/mailman/Mailman/Archiver/Archiver.py", line 160, in __archive_to_mbox
    mbox.AppendMessage(post)
  File "/home/mailman/Mailman/Mailbox.py", line 41, in AppendMessage
    self.fp.seek(-1, 2)
IOError: [Errno 75] Value too large for defined data type
Last night, I added a cron job that took my mbox files and moved them out of the mailman tree into a public archive. It looks like this code now fails if the file doesn't exist or is zero length.
Barry, has this code been tested against a nonexistent mbox file? It seems to be failing.
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
And they sit at the bar and put bread in my jar and say 'Man, what are you doing here?'"
At 10:38 AM -0700 9/26/00, Chuq Von Rospach wrote:
ack! it looks like my server is broken!
Things are dying attempting to write the mbox files in the archiver:
    mbox.AppendMessage(post)
  File "/home/mailman/Mailman/Mailbox.py", line 41, in AppendMessage
    self.fp.seek(-1, 2)
IOError: [Errno 75] Value too large for defined data type
We're definitely failing trying to seek beyond the front of a zero-length file. I added a blank line to the mbox file, and it started working.
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
And they sit at the bar and put bread in my jar and say 'Man, what are you doing here?'"
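The failure described in this message can be reproduced with a few lines. This is a sketch in present-day Python 3 syntax (an assumption; the thread predates it, and `IOError` is now an alias of `OSError`). The helper name `seek_before_start_error` is hypothetical, and the errno reported will depend on the platform.

```python
import os
import tempfile


def seek_before_start_error():
    """Seek to offset -1 from the end of an empty file; return the error raised."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # the file now exists and is zero length
    try:
        with open(path, "rb") as fp:
            try:
                fp.seek(-1, os.SEEK_END)
            except OSError as exc:
                return exc
        return None
    finally:
        os.remove(path)


err = seek_before_start_error()
# Which errno is reported varies between platforms.
print(type(err).__name__, getattr(err, "errno", None))
```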
On 2000.09.26, in <p04330105b5f6905d3d8b@[17.216.27.250]>, "Chuq Von Rospach" <chuqui@plaidworks.com> wrote:
We're definitely failing trying to seek beyond the front of a zero-length file. I added a blank line to the mbox file, and it started working.
For what this might be worth, I think this is an older problem. (That is: older than 2.0b6.) I meant to report it, but....
-- -D. dgc@uchicago.edu NSIT University of Chicago
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> ack! it looks like my server is broken!
CVR> Things are dying attempting to write the mbox files in the
CVR> archiver:
CVR> Sep 26 10:35:03 2000 (32215) Archive file access failure: /home/mailman/archives/private/sharks.mbox/sharks.mbox [Errno 75] Value too large for defined data type
CVR> Sep 26 10:35:03 2000 (32215) Delivery exception: [Errno 75] Value too large for defined data type
CVR> Sep 26 10:35:03 2000 (32215) Traceback (innermost last):
CVR>   File "/home/mailman/Mailman/Handlers/HandlerAPI.py", line 82, in do_pipeline
CVR>     func(mlist, msg, msgdata)
CVR>   File "/home/mailman/Mailman/Handlers/ToArchive.py", line 47, in process
CVR>     mlist.ArchiveMail(msg, msgdata)
CVR>   File "/home/mailman/Mailman/Archiver/Archiver.py", line 189, in ArchiveMail
CVR>     self.__archive_to_mbox(msg)
CVR>   File "/home/mailman/Mailman/Archiver/Archiver.py", line 160, in __archive_to_mbox
CVR>     mbox.AppendMessage(post)
CVR>   File "/home/mailman/Mailman/Mailbox.py", line 41, in AppendMessage
CVR>     self.fp.seek(-1, 2)
CVR> IOError: [Errno 75] Value too large for defined data type
CVR> Barry? has this code been tested against a non-existant mbox
CVR> file? It seems to be failing.
Yep, but this could be a cross-platform issue.
What platform are you running on? For me on Linux RedHat 6.1, when I try to seek past the end of a nonexistent or zero length file, I get an EINVAL (errcode 22), which Mailbox.AppendMessage() should catch and ignore. If your error numbers are the same as mine, you're getting an EOVERFLOW, but why? What does "Value too large for defined data type" mean? Maybe that's just your platform's way of saying "Hey ya big dummy, you can't seek to before the end of a non-existing file!".
If that's the case, changing line 43 to
if e.errno not in (errno.EINVAL, errno.EOVERFLOW): raise
should do the trick. But be sure errno 75 == EOVERFLOW by doing:
% python -c "import errno; print errno.errorcode[75]"
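For readers on a current interpreter, the same lookup can be written in Python 3 syntax (an assumption; the command above uses the Python of the time). `errno.errorcode` is the standard mapping from numeric codes to symbolic names on the running platform; the helper name `errno_name` is hypothetical.

```python
import errno


def errno_name(number):
    """Return this platform's symbolic name for an errno value, or None."""
    return errno.errorcode.get(number)


# The name printed for 75 depends on the platform the check is run on.
print(errno_name(75))
```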
I just tested this on a FreeBSD system I have available and the resulting error isn't EINVAL /or/ EOVERFLOW, it's an error code 0, which makes no sense!
Maybe Mailbox.AppendMessage() should simply discard any IOError it gets?
...
try:
    self.fp.seek(-1, 2)
except IOError, e:
    pass
    # the file must be empty
...
? -Barry
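A self-contained sketch of the "discard any IOError" idea, in Python 3 syntax. It is not the Mailbox.AppendMessage() code: the function name `append_message` and its arguments are hypothetical, and the newline handling is an assumption about what the seek is for (checking whether the file already ends with a newline before appending).

```python
import os


def append_message(path, message_bytes):
    """Append a message to an mbox-style file, tolerating an empty file."""
    # "a+b" creates the file if needed and always writes at the end.
    with open(path, "a+b") as fp:
        try:
            fp.seek(-1, os.SEEK_END)
        except OSError:
            # Whatever errno the platform reports, treat it as
            # "the file must be empty" and write nothing extra.
            pass
        else:
            # Non-empty file: make sure it ends with a newline.
            if fp.read(1) != b"\n":
                fp.write(b"\n")
        fp.write(message_bytes)
```

Swallowing every error keeps the code portable across the platforms discussed here, at the cost of also hiding genuine I/O failures on the seek; the later write would normally surface those anyway.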
At 2:32 PM -0400 9/26/00, Barry A. Warsaw wrote:
Yep, but this could be a cross-platform issue.
What platform are you running on? For me on Linux RedHat 6.1, when I
yellowdog linux, which is RedHat ported to the PowerPC chip.
try to seek past the end of a nonexistent or zero length file, I get an EINVAL (errcode 22), which Mailbox.AppendMessage() should catch and ignore. If your error numbers are the same as mine, you're getting an EOVERFLOW, but why? What does "Value too large for defined data type" mean?
it seems to be returning a value that won't fit in the variable.
If that's the case, changing line 43 to
if e.errno not in (errno.EINVAL, errno.EOVERFLOW): raise
okay, I've tweaked. I'll see how it works.
I just tested this on a FreeBSD system I have available and the resulting error isn't EINVAL /or/ EOVERFLOW, it's an error code 0, which makes no sense!
obviously this is implementation dependent (welcome to unix!)
Maybe Mailbox.AppendMessage() should simply discard any IOError it gets?
...
try:
    self.fp.seek(-1, 2)
except IOError, e:
    pass
    # the file must be empty
...
if we have three platforms and three errors, the answer is simple: yes...
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
And they sit at the bar and put bread in my jar and say 'Man, what are you doing here?'"
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
>> What platform are you running on? For me on Linux RedHat 6.1,
>> when I
CVR> yellowdog linux, which is RedHat ported to the PowerPC chip.
Okay.
>> try to seek past the end of a nonexistent or zero length file,
>> I get an EINVAL (errcode 22), which Mailbox.AppendMessage()
>> should catch and ignore. If your error numbers are the same as
>> mine, you're getting an EOVERFLOW, but why? What does "Value
>> too large for defined data type" mean?
CVR> it seems to be returning a value that won't fit in the
CVR> variable.
Yes, but which variable? Maybe it's the -1 as the first argument that overflows if the file is zero length?
>> If that's the case, changing line 43 to if e.errno not in
>> (errno.EINVAL, errno.EOVERFLOW): raise
CVR> okay, I've tweaked. I'll see how it works.
Cool.
>> I just tested this on a FreeBSD system I have available and the
>> resulting error isn't EINVAL /or/ EOVERFLOW, it's an error code
>> 0, which makes no sense!
CVR> obviously this is implementation dependent (welcome to unix!)
Heh...
>> Maybe Mailbox.AppendMessage() should simply discard any IOError
>> it gets?
>> ...
>> try:
>>     self.fp.seek(-1, 2)
>> except IOError, e:
>>     pass
>>     # the file must be empty
>> ...
CVR> if we have three platforms and three errors, the answer is
CVR> simple: yes...
Okay, if the change works for you I'll commit it.
-Barry
What platform are you running on? For me on Linux RedHat 6.1, when I try to seek past the end of a nonexistent or zero length file, I get an EINVAL (errcode 22), which Mailbox.AppendMessage() should catch and ignore. If your error numbers are the same as mine, you're getting an EOVERFLOW, but why? What does "Value too large for defined data type" mean?
Standard says... (in re: lseek): EOVERFLOW: The resulting file offset would be a value which cannot be represented correctly in an object of type off_t. The error message seems a reasonable representation of that (off_t is required to be a signed integral type). Like Chuq says, this species (UNIX and relatives) is unfortunately prone to not quite agreeing with each other in boundary conditions, standards or no... and by the way, you can't count on the error /numbers/ being the same across systems, that's not part of UNIX standards.
Mats
"MW" == Mats Wichmann <mats@laplaza.org> writes:
>> What platform are you running on? For me on Linux RedHat 6.1,
>> when I try to seek past the end of a nonexistent or zero length
>> file, I get an EINVAL (errcode 22), which
>> Mailbox.AppendMessage() should catch and ignore. If your error
>> numbers are the same as mine, you're getting an EOVERFLOW, but
>> why? What does "Value too large for defined data type" mean?
MW> Standard says... (in re: lseek): EOVERFLOW: The resulting file
MW> offset would be a value which cannot be represented correctly
MW> in an object of type off_t. The error message seems a
MW> reasonable representation of that (off_t is required to be a
MW> signed integral type).
Agreed.
MW> Like Chuq says, this species (UNIX and relatives) is
MW> unfortunately prone to not quite agreeing with each other in
MW> boundary conditions, standards or no...
Or even the same distro on different platforms agreeing with each other (we're both essentially on RH Linux, but different h/w).
MW> and by the way, you can't count on the error /numbers/ being
MW> the same across systems, that's not part of UNIX standards.
Right, but Python's errno module exports the symbolic names too, so we always use those. It's the /output/ that likes to use the numeric values, and you can never be sure which symbolic error those map to!
Thanks, it looks like the patch I posted should do the trick. -Barry
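The point about symbolic names can be sketched as below: the check compares against `errno.EINVAL` and `errno.EOVERFLOW` rather than the numbers 22 and 75, so it does not care which numeric values a given platform assigns. This is Python 3 syntax and an illustration only; `seek_to_last_byte` and `EMPTY_FILE_ERRNOS` are hypothetical names, and a platform reporting some other code (such as the FreeBSD case mentioned earlier in the thread) would still raise.

```python
import errno
import os

# Errors taken to mean "the file is empty"; matched by name, never by number.
EMPTY_FILE_ERRNOS = (errno.EINVAL, errno.EOVERFLOW)


def seek_to_last_byte(fp):
    """Position fp on its last byte; return False if the file appears empty."""
    try:
        fp.seek(-1, os.SEEK_END)
    except OSError as exc:
        if exc.errno not in EMPTY_FILE_ERRNOS:
            raise  # anything else is a real failure
        return False
    return True
```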
participants (6)
- bwarsaw@beopen.com
- Chuq Von Rospach
- David Champion
- Mats Wichmann
- Nigel Metheringham
- Richard Barrett