At 1:37 AM -0400 2003/09/28, Barry Warsaw wrote:
I really really want to use something like message-ids to generate message file names.
IIRC, Earl talks about this in the FAQ. In short, for security reasons, you can't trust any of the information you are given anywhere in the message unless you can scrub that information and guarantee that it is now safe. Otherwise, you could get a message-id like "<../.htaccess>" or some other equally nasty thing that could cause other files to be overwritten inappropriately.
Moreover, a lot of people out there have home networks using RFC 1918 private addressing, and that information is being used to help generate otherwise properly formatted message-ids, so the probability of message-id collision increases significantly. This issue was recently brought to my attention by my own RFC 1918 private networking here at home and the information my MUA uses to generate message-ids.
Therefore, I think we might want to be a bit more careful in how we generate the file names.
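To make the risk concrete, here is a minimal sketch of how a message-id used directly as a file name can escape the archive directory. The directory name and both helper functions are hypothetical, made up purely for illustration; this is not how any existing archiver builds its paths.

```python
import posixpath

# Hypothetical archive location, chosen only for this illustration.
ARCHIVE_DIR = "/var/archive/mylist"

def naive_path(message_id):
    """Build a path straight from the message-id (unsafe on purpose)."""
    name = message_id.strip("<>")
    return posixpath.normpath(posixpath.join(ARCHIVE_DIR, name))

def escapes_archive(path):
    """Return True if the path ends up outside ARCHIVE_DIR."""
    return posixpath.commonpath([ARCHIVE_DIR, path]) != ARCHIVE_DIR

# The hostile id from the text resolves to a file above the archive.
print(naive_path("<../.htaccess>"))
print(escapes_archive(naive_path("<../.htaccess>")))
```

A check like escapes_archive() only catches this one trick; the broader point stands that nothing taken from the message should reach the file system unscrubbed.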
I want to be able to generate links to archived messages in the footers, but I think the best way to do that is to agree on a reproducible, independent algorithm for calculating them.
One thing that MHonArc does for messages that are not assigned a message-id (to help detect and eliminate duplicates) is to calculate an MD5 hash of the message headers and use that as a substitute. We could do the same, or perhaps even use the MD5 hash instead of the message-id, and then store hash/message-id mappings in a database.
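As a rough sketch of that idea, assuming we simply hash the parsed headers in order and keep the hash-to-message-id mapping in a plain dict standing in for the database: the function names are hypothetical, and this is not meant to reproduce MHonArc's exact hashing.

```python
import hashlib
from email import message_from_string

def header_digest(msg):
    """MD5 over the headers only, in the order they appear."""
    h = hashlib.md5()
    for name, value in msg.items():
        line = "%s: %s\n" % (name, value)
        h.update(line.encode("utf-8", "surrogateescape"))
    return h.hexdigest()

def archive_key(raw_message, id_map):
    """Return the digest and record digest -> message-id when one exists."""
    msg = message_from_string(raw_message)
    digest = header_digest(msg)
    message_id = msg.get("Message-ID")
    if message_id is not None:
        id_map[digest] = message_id  # dict stands in for a real database
    return digest

id_map = {}
raw = "Message-ID: <abc@example.com>\nSubject: test\n\nbody text\n"
key = archive_key(raw, id_map)
print(key, id_map[key])
```

Because the digest is always 32 hex characters, it is safe to use as a file name no matter what the sender put in the message-id.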
Another approach would be to put even the public archives behind a CGI and have that implement a mapping between message-id-derived links and the sequential file names (although that won't fix the regen problem).
One problem that most OSes have is with too many files in a single directory -- go much over 1000 files in a directory and accessing anything in that directory starts taking significantly longer than it used to. If you use a sequential message numbering system, it's hard to break those up into smaller chunks of messages in a hashed directory scheme. With MD5 hashes, it would be a lot easier to convert the hash into a path name, just by adding slashes every so often in the hash value.
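Adding slashes every so often could look something like the sketch below. The function name and the choice of two levels of two hex characters each (256 subdirectories per level) are assumptions for illustration, not a proposal for the final layout.

```python
import hashlib

def hashed_path(digest, levels=2, width=2):
    """Split a hex digest into nested directory names plus a file name."""
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    parts.append(digest[levels * width:])
    return "/".join(parts)

digest = hashlib.md5(b"").hexdigest()
print(hashed_path(digest))  # d4/1d/8cd98f00b204e9800998ecf8427e
```

With MD5 output spread evenly, each directory should hold only a small fraction of the messages, which keeps any single directory well under the size where lookups start to slow down.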
-- Brad Knowles, <brad.knowles@skynet.be>